All Featurizers

Roberta-Zinc480M-102M

This is a RoBERTa-style masked language model trained on ~480M SMILES strings from the ZINC database. The model has ~102M parameters and was trained for 150,000 iterations with a batch size of 4096 to a validation loss of ~0.122.

Updated on May 14, 2023

GPT2-Zinc480M-87M

This is a GPT-2-style autoregressive language model trained on ~480M SMILES strings from the ZINC database. The model has ~87M parameters and was trained for 175,000 iterations with a batch size of 3072 to a validation loss of ~0.615.

Updated on May 14, 2023

ChemGPT-1.2B

ChemGPT (1.2B params) is a transformer model for generative molecular modeling, which was pretrained on the PubChem10M dataset.

Updated on May 4, 2023

ChemGPT-19M

ChemGPT (19M params) is a transformer model for generative molecular modeling, which was pretrained on the PubChem10M dataset.

Updated on May 4, 2023

ChemGPT-4.7M

ChemGPT (4.7M params) is a transformer model for generative molecular modeling, which was pretrained on the PubChem10M dataset.

Updated on May 4, 2023

MolT5

MolT5 is a self-supervised learning framework that pretrains transformer-based models on large amounts of unlabeled natural language text and molecule strings, enabling high-quality outputs for molecule captioning and text-based molecule generation.

Updated on May 3, 2023

desc3D

3D molecular descriptors are numerical representations of chemical and physical properties of molecules that are based on 3D structures of molecules.

Updated on May 3, 2023

desc2D

2D molecular descriptors are numerical representations of chemical and physical properties of molecules that are based on 2D structures of molecules. We augment the RDKit 2D descriptors with additional optional properties.

Updated on May 3, 2023

mordred

Mordred calculates over 1800 molecular descriptors, including constitutional, topological, electronic, and geometrical descriptors, among others. Both 2D and 3D descriptors are supported and optional.

Updated on May 3, 2023

scaffoldkeys

Scaffold Keys represent scaffolds using substructure features. They were proposed by Peter Ertl in "Identification of Bioisosteric Scaffolds using Scaffold Keys".

Updated on May 3, 2023

electroshape

Compute Electroshape descriptors as described by Armstrong et al. in ElectroShape: fast molecular similarity calculations incorporating shape, chirality and electrostatics.

Updated on May 3, 2023

usrcat

USRCAT is a real-time ultrafast shape recognition method with pharmacophoric constraints. It adds atom type information to the traditional USR descriptor to improve the performance of shape-based virtual screening.

Updated on May 3, 2023

usr

Ultrafast Shape Recognition (USR) is a ligand-based virtual screening method that condenses 3-dimensional information about molecular shape, as well as other properties, into a small set of numeric descriptors.

Updated on May 3, 2023
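
As a rough illustration of how USR can condense shape into a few numbers, the sketch below follows one common formulation: distances from every atom to four reference points (the centroid, the atom closest to it, the atom farthest from it, and the atom farthest from that one), each summarized by three moments. The function names, the exact choice of moments, and the toy coordinates are assumptions for illustration, not this featurizer's implementation.

```python
import math


def _moments(dists):
    """Mean, standard deviation, and signed cube root of the third central moment."""
    n = len(dists)
    mean = sum(dists) / n
    var = sum((d - mean) ** 2 for d in dists) / n
    third = sum((d - mean) ** 3 for d in dists) / n
    # The signed cube root keeps the skew term in distance units.
    skew = math.copysign(abs(third) ** (1.0 / 3.0), third)
    return [mean, math.sqrt(var), skew]


def usr_descriptor(coords):
    """Return 12 shape descriptors for a list of (x, y, z) atom coordinates."""
    n = len(coords)
    centroid = tuple(sum(c[i] for c in coords) / n for i in range(3))
    d_centroid = [math.dist(centroid, c) for c in coords]
    closest = coords[d_centroid.index(min(d_centroid))]
    farthest = coords[d_centroid.index(max(d_centroid))]
    d_farthest = [math.dist(farthest, c) for c in coords]
    far_from_far = coords[d_farthest.index(max(d_farthest))]

    descriptor = []
    for ref in (centroid, closest, farthest, far_from_far):
        descriptor.extend(_moments([math.dist(ref, c) for c in coords]))
    return descriptor


# Hypothetical 4-atom conformer, coordinates in arbitrary units.
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 2.0, 0.0), (0.3, 0.4, 3.0)]
print(len(usr_descriptor(coords)))
```

Because only interatomic distances enter the calculation, the descriptor does not change when the molecule is translated or rotated, which is what lets two conformers be compared without aligning them first.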

cats3d

3D version of the 6 Potential Pharmacophore Points CATS (Chemically Advanced Template Search) pharmacophore. This version differs from `pharm3D-cats` in the process used to make the descriptors fuzzy, which is closer to the original paper implementation. It uses the 3D distance matrix between pharmacophoric points.

Updated on May 3, 2023

cats2d

2D version of the 6 Potential Pharmacophore Points CATS (Chemically Advanced Template Search) pharmacophore. This version differs from `pharm2D-cats` in the process used to make the descriptors fuzzy, which is closer to the original paper implementation. The implementation is based on work by Rajarshi Guha (08/26/07) and Chris Arthur (1/11/2015).

Updated on May 3, 2023

pharm3D-cats

3D version of the CATS pharmacophores computed with the Pharm2D module in RDKit.

Updated on May 3, 2023

pharm2D-cats

2D topological pharmacophores computed by the Pharm2D module in RDKit using the CATS (Chemically Advanced Template Search) feature definition.

Updated on May 3, 2023

pharm2D-default

3D version of the pharmacophores computed with the default RDKit feature definition: https://github.com/rdkit/rdkit/blob/master/Data/BaseFeatures.fdef

Updated on May 3, 2023

pharm3D-gobbi

3D version of the 2D pharmacophores defined in the Gobbi and Poppinger (1998) paper. Eight pharmacophore feature types are listed: hydrogen bond acceptor, hydrogen bond donor, basic group, acidic group, hydrophobic group, halogen, attachment point to an aliphatic ring, and attachment point to an aromatic ring.

Updated on May 3, 2023

pharm2D-gobbi

2D pharmacophores computed by the Pharm2D module in RDKit. Gobbi pharmacophores were designed for selecting compounds from large combinatorial libraries, as defined in the Gobbi and Poppinger (1998) paper. Eight pharmacophore feature types are listed: hydrogen bond acceptor, hydrogen bond donor, basic group, acidic group, hydrophobic group, halogen, attachment point to an aliphatic ring, and attachment point to an aromatic ring.

Updated on May 3, 2023

pharm3D-pmapper

Pmapper is a Python module to generate 3D pharmacophore signatures and fingerprints. Signatures uniquely encode 3D pharmacophores with hashes suitable for fast identification of identical pharmacophores. See https://github.com/DrrDom/pmapper

Updated on May 3, 2023

pharm2D-pmapper

2D pharmacophores computed by the Pharm2D module in RDKit using the Pmapper feature definition. Pmapper is a Python module to generate pharmacophore signatures and fingerprints. See https://github.com/DrrDom/pmapper

Updated on May 3, 2023

ChemBERTa-77M-MTR

ChemBERTa is a pre-trained language model for molecules based on (Ro)BERT(a) trained on PubChem compounds. The MTR version was pretrained using a multitask regression objective, while the MLM version was pretrained using a masked language modeling objective.

Updated on Mar 20, 2023

ChemBERTa-77M-MLM

ChemBERTa is a pre-trained language model for molecules based on (Ro)BERT(a) trained on PubChem compounds. The MTR version was pretrained using a multitask regression objective, while the MLM version was pretrained using a masked language modeling objective.

Updated on Mar 20, 2023

atompair-count

The Atompair-Count fingerprint is essentially the same as the atompair fingerprint. However, instead of being hashed into a binary vector, the features are returned as a count vector with no hashing step.

Updated on Feb 16, 2023

topological-count

The Topological-Count fingerprint is essentially the same as the Topological fingerprint. However, instead of being hashed into a binary vector, the features are returned as a count vector with no hashing step.

Updated on Feb 16, 2023

fcfp-count

The FCFP-Count (Functional Class Fingerprints-Count) is essentially the same as the FCFP. However, instead of being hashed into a binary vector, the features are returned as a count vector with no hashing step.

Updated on Feb 16, 2023

ecfp-count

The ECFP-Count (Extended Connectivity Fingerprints-Count) is essentially the same as the ECFP. However, instead of being hashed into a binary vector, the features are returned as a count vector with no hashing step.

Updated on Feb 16, 2023
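
The difference between the binary fingerprints and the count variants above can be sketched with toy data. The integer feature IDs, function names, and 16-bit length below are hypothetical; how the real featurizers index their features is not specified here.

```python
from collections import Counter


def binary_fingerprint(feature_ids, n_bits=16):
    """Fold feature IDs into a fixed-length bit vector (presence only)."""
    bits = [0] * n_bits
    for fid in feature_ids:
        bits[fid % n_bits] = 1
    return bits


def count_fingerprint(feature_ids):
    """Keep one counter per distinct feature ID, with no folding."""
    return Counter(feature_ids)


# Hypothetical feature IDs: feature 3 occurs twice, and 19 collides with 3
# once folded into 16 bits (19 % 16 == 3).
features = [3, 3, 19, 42]
print(binary_fingerprint(features))
print(count_fingerprint(features))
```

In this toy case the binary vector cannot distinguish one occurrence of a feature from several, and it merges features 3 and 19 into the same bit, while the count form keeps both pieces of information.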

estate

Electrotopological state (Estate) indices are numerical values computed for each atom in a molecule, and which encode information about both the topological environment of that atom and the electronic interactions due to all other atoms in the molecule.

Updated on Feb 16, 2023
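
To make the idea concrete, here is a minimal sketch assuming the commonly cited Kier-Hall definitions: each atom gets an intrinsic state I = ((2/N)^2 * delta_v + 1) / delta, and its E-state is I plus a perturbation (I_i - I_j) / r_ij^2 from every other atom, where r_ij is the number of atoms on the shortest path between them. The function names and the hand-assigned atom parameters are assumptions for illustration; this is not the featurizer's own code.

```python
from collections import deque


def intrinsic_state(n, delta_v, delta):
    """n: principal quantum number; delta_v: valence electrons minus attached H;
    delta: number of bonded non-hydrogen neighbours."""
    return ((2.0 / n) ** 2 * delta_v + 1.0) / delta


def estate_indices(intrinsic, bonds):
    """E-state per atom from intrinsic states and a heavy-atom bond list."""
    n_atoms = len(intrinsic)
    neighbours = [[] for _ in range(n_atoms)]
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    result = []
    for i in range(n_atoms):
        # Breadth-first search gives the bond-count distance to every atom.
        dist = {i: 0}
        queue = deque([i])
        while queue:
            u = queue.popleft()
            for v in neighbours[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        state = intrinsic[i]
        for j, d in dist.items():
            if j != i:
                # r_ij counts atoms on the path, i.e. bonds + 1.
                state += (intrinsic[i] - intrinsic[j]) / (d + 1) ** 2
        result.append(state)
    return result


# Ethanol heavy atoms: CH3 (0), CH2 (1), OH (2).
intrinsic = [
    intrinsic_state(2, 1, 1),  # CH3
    intrinsic_state(2, 2, 2),  # CH2
    intrinsic_state(2, 5, 1),  # OH
]
print(estate_indices(intrinsic, [(0, 1), (1, 2)]))
```

The pairwise perturbations are antisymmetric, so they cancel when summed over the molecule: the E-state values add up to the sum of the intrinsic states, with electron-rich atoms such as the hydroxyl oxygen gaining at the expense of their neighbours.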

erg

Extended Reduced Graph approach (ErG) describes a molecular structure by defining its pharmacophoric points and the topological distance between them. It uses a pairwise combination of pharmacophores and their distance to set a corresponding bit in a vector. The ErG fingerprint implements fuzzy incrementation, which favours retrieval of actives with different core structures (scaffold hopping).

Updated on Feb 16, 2023
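
The pairing and fuzzy incrementation described above can be sketched as follows. This is a simplified illustration that skips the graph reduction step and works directly on labelled atoms; the function names, the `fuzz=0.3` increment, and the distance limits are assumptions, not values taken from this featurizer.

```python
from collections import defaultdict, deque
from itertools import combinations


def bond_distances(n_atoms, bonds, start):
    """Shortest bond-count distance from `start` to every reachable atom."""
    neighbours = [[] for _ in range(n_atoms)]
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in neighbours[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def erg_like(points, n_atoms, bonds, min_path=1, max_path=5, fuzz=0.3):
    """points: list of (atom_index, pharmacophore_type) tuples."""
    vec = defaultdict(float)
    for (i, type_i), (j, type_j) in combinations(points, 2):
        d = bond_distances(n_atoms, bonds, i).get(j)
        if d is None or not (min_path <= d <= max_path):
            continue
        pair = tuple(sorted((type_i, type_j)))
        vec[pair + (d,)] += 1.0
        # Fuzzy incrementation: neighbouring distance bins get a partial count.
        if d - 1 >= min_path:
            vec[pair + (d - 1,)] += fuzz
        if d + 1 <= max_path:
            vec[pair + (d + 1,)] += fuzz
    return dict(vec)


# Hypothetical 4-atom chain with a donor at one end and an acceptor at the other.
print(erg_like([(0, "donor"), (3, "acceptor")], 4, [(0, 1), (1, 2), (2, 3)]))
```

Spreading part of each count into the adjacent distance bins means two molecules whose pharmacophoric points sit one bond closer or farther apart still overlap in the vector, which is the property that helps scaffold hopping.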

secfp

The SMILES extended connectivity fingerprint (SECFP) is a variant of the MinHash fingerprint (MHFP). It uses the MHFP SMILES-based circular substructure hashing scheme, but folds the hashes with the same modulo n operation that is used by ECFP.

Updated on Feb 16, 2023

map4

MinHashed atom-pair fingerprint up to a diameter of four bonds (MAP4) is suitable for both small and large molecules by combining substructure and atom-pair concepts. In this fingerprint the circular substructures with radii of r = 1 and r = 2 bonds around each atom in an atom-pair are written as two pairs of SMILES, each pair being combined with the topological distance separating the two central atoms. These so-called atom-pair molecular shingles are hashed, and the resulting set of hashes is MinHashed to form the MAP4 fingerprint.

Updated on Feb 16, 2023
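
The shingling and MinHash steps can be sketched in a few lines. The shingle layout, the seeded SHA-1 hash family, and the hard-coded substructure SMILES below are assumptions for illustration; a real implementation would extract the circular substructures from the molecule and may use a different hash scheme.

```python
import hashlib


def shingle(env_a, env_b, distance):
    """Combine two substructure SMILES with their topological distance.
    Sorting makes the shingle independent of atom-pair order."""
    first, second = sorted((env_a, env_b))
    return f"{first}|{distance}|{second}"


def minhash(shingles, n_perm=16):
    """For each seeded hash function, keep the smallest hash over all shingles."""
    signature = []
    for seed in range(n_perm):
        signature.append(
            min(
                int.from_bytes(
                    hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big"
                )
                for s in shingles
            )
        )
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Hypothetical shingles for two small molecules.
mol_a = {shingle("CO", "CC", 2), shingle("CC", "CC", 1)}
mol_b = {shingle("CC", "CO", 2), shingle("CN", "CC", 1)}
print(estimated_jaccard(minhash(mol_a), minhash(mol_b)))
```

The signature has a fixed length regardless of how many shingles the molecule produces, which is why the same scheme can be applied to both small and large molecules.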

pattern

Pattern fingerprints were designed to be used in substructure screening. The algorithm identifies features in the molecule by doing substructure searches using a small number of very generic SMARTS patterns and then hashing each occurrence of a pattern based on the atom and bond types involved. The fact that a particular pattern matched the molecule at all is also stored by hashing the pattern ID and size.

Updated on Feb 16, 2023

rdkit

This is an RDKit-specific fingerprint that is inspired by (though it differs significantly from) public descriptions of the Daylight fingerprint. The fingerprinting algorithm identifies all subgraphs in the molecule within a particular range of sizes and hashes each subgraph to generate a raw bit ID, which is then folded into the requested fingerprint size as a binary vector. Options are available to generate count-based forms of the fingerprint or “non-folded” forms (using a sparse representation).

Updated on Feb 16, 2023

topological

Topological torsion fingerprints are a type of molecular fingerprint that represents the topological features of a molecule based on its graph representation. They are generated by computing the frequencies of all possible molecular torsions in a molecule and then encoding them as a binary vector.

Updated on Feb 16, 2023
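
As a simplified picture of what "counting torsions" means, the sketch below enumerates every linear path of four bonded atoms in a small graph and tallies them by element symbols. Typing atoms by symbol only, and the function name `torsion_counts`, are assumptions made for brevity; real implementations use richer atom types and then encode the result as a vector.

```python
from collections import Counter


def torsion_counts(atoms, bonds):
    """Count each linear path of four bonded atoms once, keyed by atom symbols."""
    neighbours = [[] for _ in atoms]
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    counts = Counter()
    for a in range(len(atoms)):
        for b in neighbours[a]:
            for c in neighbours[b]:
                if c == a:
                    continue
                for d in neighbours[c]:
                    if d in (a, b):
                        continue
                    # Each path is found in both directions; keep one of them.
                    if a < d:
                        key = tuple(atoms[i] for i in (a, b, c, d))
                        counts[min(key, key[::-1])] += 1
    return counts


# 1-butanol heavy atoms as a chain: C-C-C-C-O
atoms = ["C", "C", "C", "C", "O"]
bonds = [(0, 1), (1, 2), (2, 3), (3, 4)]
print(torsion_counts(atoms, bonds))
```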

fcfp

Functional-class fingerprints (FCFPs) are an extension of ECFPs which incorporate information about the functional classes of atoms in a molecule. FCFPs are intended to capture more abstract property-based substructural features and leverage atomic characteristics that relate more to pharmacophoric features (e.g. hydrogen donor/acceptor, polarity, aromaticity, etc.).

Updated on Feb 16, 2023

ecfp

Extended-connectivity fingerprints (ECFPs) are a family of circular fingerprints that are commonly used to measure molecular similarity. They are based on the connectivity of atoms in molecular graphs.

Updated on Feb 16, 2023
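
A toy version of the circular idea is sketched below: each atom starts with an identifier derived from its element, and at every iteration the identifier is re-hashed together with its neighbours' identifiers, so it describes a progressively larger neighbourhood. This is a simplified illustration, not the ECFP algorithm itself (it ignores bond orders, richer atom invariants, and duplicate-substructure removal); the function names and the CRC32 hash are assumptions.

```python
import zlib


def circular_features(atoms, bonds, radius=2):
    """Return the set of neighbourhood identifiers up to `radius` bonds."""
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    # Radius 0: one identifier per atom, derived from the element symbol only.
    ids = [zlib.crc32(symbol.encode()) for symbol in atoms]
    features = set(ids)
    for _ in range(radius):
        # Combine each atom's identifier with its sorted neighbour identifiers.
        ids = [
            zlib.crc32(
                repr((ids[i], sorted(ids[j] for j in neighbours[i]))).encode()
            )
            for i in range(len(atoms))
        ]
        features.update(ids)
    return features


def tanimoto(features_a, features_b):
    """Shared features divided by total distinct features."""
    return len(features_a & features_b) / len(features_a | features_b)


ethanol = circular_features(["C", "C", "O"], [(0, 1), (1, 2)])
propanol = circular_features(["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])
print(tanimoto(ethanol, propanol))
```

Because neighbour identifiers are sorted before hashing, the feature set does not depend on how the atoms are numbered, and comparing two feature sets with the Tanimoto coefficient gives a similarity between 0 and 1.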

avalon

Similar to Daylight fingerprints, Avalon uses a fingerprint generator that enumerates certain paths and feature classes of the molecular graph. The fingerprint bit positions are hashed from the description of the feature; however, the hash codes for all the path-style features are computed implicitly while they are enumerated.

Updated on Feb 16, 2023

All your molecular featurizers at your fingertips