All your molecular featurizers at your fingertips
Think molfeat is missing a featurizer?
Contribute to molfeat by following our tutorials.
Roberta-Zinc480M-102M
This is a RoBERTa-style masked language model trained on ~480M SMILES strings from the ZINC database. The model has ~102M parameters and was trained for 150,000 iterations with a batch size of 4,096 to a validation loss of ~0.122.
GPT2-Zinc480M-87M
This is a GPT-2-style autoregressive language model trained on ~480M SMILES strings from the ZINC database. The model has ~87M parameters and was trained for 175,000 iterations with a batch size of 3,072 to a validation loss of ~0.615.
ChemGPT-1.2B
ChemGPT (1.2B params) is a transformer model for generative molecular modeling, which was pretrained on the PubChem10M dataset.
ChemGPT-19M
ChemGPT (19M params) is a transformer model for generative molecular modeling, which was pretrained on the PubChem10M dataset.
ChemGPT-4.7M
ChemGPT (4.7M params) is a transformer model for generative molecular modeling, which was pretrained on the PubChem10M dataset.
MolT5
MolT5 is a self-supervised learning framework that pretrains transformer-based models on vast amounts of unlabeled natural language text and molecule strings, enabling high-quality outputs for molecule captioning and text-based molecule generation.
desc3D
3D molecular descriptors are numerical representations of chemical and physical properties of molecules that are based on 3D structures of molecules.
desc2D
2D molecular descriptors are numerical representations of chemical and physical properties of molecules that are based on 2D structures of molecules. We augment the RDKit 2D descriptors with additional optional properties.
mordred
Mordred calculates over 1800 molecular descriptors, including constitutional, topological, electronic, and geometrical descriptors, among others. Both 2D and 3D descriptors are supported and optional.
scaffoldkeys
Scaffold Keys are a method for representing scaffolds using substructure features. They were proposed by Peter Ertl in "Identification of Bioisosteric Scaffolds using Scaffold Keys".
electroshape
Compute Electroshape descriptors as described by Armstrong et al. in ElectroShape: fast molecular similarity calculations incorporating shape, chirality and electrostatics.
usrcat
USRCAT is a real-time ultrafast shape recognition method with pharmacophoric constraints. It integrates atom types into the traditional USR descriptor to improve the performance of shape-based virtual screening.
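To make the idea of adding atom types to a shape descriptor concrete, here is a minimal pure-Python sketch. It assumes the commonly described USR scheme (four reference points, three distance moments each) and repeats the moments for atom subsets; the function names, the subset labels, and the exact moment definitions are illustrative assumptions and not the molfeat or RDKit implementation.

```python
import math


def _dist(a, b):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def _moments(points, ref):
    """Mean, standard deviation and signed cube root of the third central
    moment of the distances from `points` to `ref` (zeros if no points)."""
    if not points:
        return [0.0, 0.0, 0.0]
    d = [_dist(p, ref) for p in points]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / n
    third = sum((x - mean) ** 3 for x in d) / n
    return [mean, math.sqrt(var), math.copysign(abs(third) ** (1.0 / 3.0), third)]


def usrcat_like(coords, classes):
    """coords: list of (x, y, z); classes: dict mapping a (hypothetical)
    atom-type name to a list of atom indices.
    Returns 12 values for all atoms plus 12 values per atom class."""
    n = len(coords)
    # Reference points are always derived from the whole molecule.
    ctd = tuple(sum(c[i] for c in coords) / n for i in range(3))  # centroid
    cst = min(coords, key=lambda c: _dist(c, ctd))  # closest to centroid
    fct = max(coords, key=lambda c: _dist(c, ctd))  # farthest from centroid
    ftf = max(coords, key=lambda c: _dist(c, fct))  # farthest from fct
    refs = (ctd, cst, fct, ftf)

    subsets = [list(coords)] + [[coords[i] for i in idx] for idx in classes.values()]
    return [m for pts in subsets for ref in refs for m in _moments(pts, ref)]


square = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, -1.0, 0.0)]
desc = usrcat_like(square, {"hydrophobic": [0, 1], "donor": []})
print(len(desc))
```

With four atom classes instead of the two used here, the same layout would yield a vector of 60 values; empty classes contribute zeros so that the vector length stays fixed.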
cats3d
3D version of the 6 Potential Pharmacophore Points CATS (Chemically Advanced Template Search) pharmacophore. This version differs from `pharm3D-cats` in the process used to make the descriptors fuzzy, which is closer to the original paper implementation. This version uses the 3D distance matrix between pharmacophoric points.
cats2d
2D version of the 6 Potential Pharmacophore Points CATS (Chemically Advanced Template Search) pharmacophore. This version differs from `pharm2D-cats` in the process used to make the descriptors fuzzy, which is closer to the original paper implementation. The implementation is based on work by Rajarshi Guha (08/26/07) and Chris Arthur (1/11/2015).
pharm3D-cats
3D version of the CATS pharmacophores computed with the Pharm2D module in RDKit.
pharm2D-cats
2D topological pharmacophores computed by the Pharm2D module in RDKit using the CATS (Chemically Advanced Template Search) feature definition.
pharm2D-default
3D version of the pharmacophores computed with the default RDKit feature definition: https://github.com/rdkit/rdkit/blob/master/Data/BaseFeatures.fdef
pharm3D-gobbi
3D version of the 2D pharmacophores defined in the Gobbi and Poppinger (1998) paper. Eight pharmacophore feature types are listed: hydrogen bond acceptor, hydrogen bond donor, basic group, acidic group, hydrophobic group, halogen, attachment point to an aliphatic ring, and attachment point to an aromatic ring.
pharm2D-gobbi
2D pharmacophores computed by the Pharm2D module in RDKit. Gobbi pharmacophores were designed for selecting compounds from large combinatorial libraries, as defined in the Gobbi and Poppinger (1998) paper. Eight pharmacophore feature types are listed: hydrogen bond acceptor, hydrogen bond donor, basic group, acidic group, hydrophobic group, halogen, attachment point to an aliphatic ring, and attachment point to an aromatic ring.
pharm3D-pmapper
Pmapper is a Python module to generate 3D pharmacophore signatures and fingerprints. Signatures uniquely encode 3D pharmacophores with hashes suitable for fast identification of identical pharmacophores. See https://github.com/DrrDom/pmapper
pharm2D-pmapper
2D pharmacophores computed by the Pharm2D module in RDKit using Pmapper feature definition. Pmapper is a Python module to generate pharmacophore signatures and fingerprints. See https://github.com/DrrDom/pmapper
ChemBERTa-77M-MTR
ChemBERTa is a pre-trained language model for molecules based on (Ro)BERT(a) and trained on PubChem compounds. The MTR version was pretrained using a multitask regression objective, while the MLM version was pretrained using a masked language modeling objective.
ChemBERTa-77M-MLM
ChemBERTa is a pre-trained language model for molecules based on (Ro)BERT(a) and trained on PubChem compounds. The MTR version was pretrained using a multitask regression objective, while the MLM version was pretrained using a masked language modeling objective.
atompair-count
The Atompair-Count fingerprint is essentially the same as the atompair fingerprint. However, instead of being hashed into a binary vector, the features are not hashed and a count vector is returned.
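The difference between a hashed binary vector and an unhashed count vector can be sketched in a few lines of plain Python. This is a toy illustration only: the feature strings such as `"C-1-C"` are hypothetical stand-ins for atom-pair features, and the hashing scheme is an assumption rather than what RDKit or molfeat uses internally.

```python
import zlib
from collections import Counter


def binary_fingerprint(features, n_bits=16):
    """Hash each feature to a bit position and set that bit.
    Repeated features and hash collisions are indistinguishable."""
    bits = [0] * n_bits
    for feature in features:
        # crc32 is used here only because it is stable across runs.
        bits[zlib.crc32(feature.encode("utf-8")) % n_bits] = 1
    return bits


def count_fingerprint(features):
    """Keep every distinct feature with the number of times it occurs."""
    return Counter(features)


# Hypothetical atom-pair features: "<atom type>-<distance>-<atom type>".
features = ["C-1-C", "C-1-C", "C-2-O"]

print(binary_fingerprint(features))
print(count_fingerprint(features))
```

The binary form loses how often a feature occurs, while the count form keeps that information at the cost of a sparse, potentially very large feature space.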
topological-count
The Topological-Count fingerprint is essentially the same as the Topological fingerprint. However, instead of being hashed into a binary vector, the features are not hashed and a count vector is returned.
fcfp-count
The FCFP-Count (Functional Class Fingerprints-Count) is essentially the same as the FCFP. However, instead of being hashed into a binary vector, the features are not hashed and a count vector is returned.
estate
Electrotopological state (Estate) indices are numerical values computed for each atom in a molecule. They encode information about both the topological environment of that atom and the electronic interactions due to all other atoms in the molecule.
erg
Extended Reduced Graph approach (ErG) describes a molecular structure by defining its pharmacophoric points and the topological distance between them. It uses a pairwise combination of pharmacophores and their distance to set a corresponding bit in a vector. The ErG fingerprint implements fuzzy incrementation, which favours retrieval of actives with different core structures (scaffold hopping).
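Fuzzy incrementation can be illustrated with a small pure-Python sketch. It assumes integer pharmacophore type IDs, a short distance range, and a fuzz increment of 0.3; all names, the vector layout, and these parameter values are illustrative assumptions, not the RDKit ErG implementation.

```python
def erg_like_vector(pairs, n_types=2, min_dist=1, max_dist=5, fuzz=0.3):
    """pairs: iterable of (type_a, type_b, topological_distance).

    Each unordered type pair owns one slot per distance. A pair seen at
    distance d adds 1.0 to slot d and `fuzz` to the neighbouring slots
    d - 1 and d + 1, so molecules whose pharmacophores sit one bond
    closer or farther apart still overlap partially."""
    n_dist = max_dist - min_dist + 1
    # Enumerate unordered type pairs (a <= b) in a fixed order.
    pair_index = {}
    for a in range(n_types):
        for b in range(a, n_types):
            pair_index[(a, b)] = len(pair_index)

    vec = [0.0] * (len(pair_index) * n_dist)
    for a, b, d in pairs:
        if not min_dist <= d <= max_dist:
            continue  # distances outside the range are ignored
        base = pair_index[(min(a, b), max(a, b))] * n_dist
        vec[base + d - min_dist] += 1.0
        for neighbour in (d - 1, d + 1):
            if min_dist <= neighbour <= max_dist:
                vec[base + neighbour - min_dist] += fuzz
    return vec


# One pair of types 0 and 1 separated by three bonds.
print(erg_like_vector([(0, 1, 3)]))
```

Because neighbouring distance slots receive a partial increment, two scaffolds that place the same pharmacophores at slightly different topological distances still produce similar vectors, which is what supports scaffold hopping.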
map4
MinHashed atom-pair fingerprint up to a diameter of four bonds (MAP4) is suitable for both small and large molecules by combining substructure and atom-pair concepts. In this fingerprint the circular substructures with radii of r = 1 and r = 2 bonds around each atom in an atom-pair are written as two pairs of SMILES, each pair being combined with the topological distance separating the two central atoms. These so-called atom-pair molecular shingles are hashed, and the resulting set of hashes is MinHashed to form the MAP4 fingerprint.
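The final MinHash step can be sketched with the standard library alone. The shingle strings below are hypothetical placeholders for atom-pair molecular shingles, and the seeded SHA-1 hashing is an assumption chosen for reproducibility; this is not the MAP4 reference implementation.

```python
import hashlib


def _hash(shingle, seed):
    """Deterministic 64-bit hash of a shingle under a given seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash(shingles, n_perm=64):
    """For each of n_perm seeded hash functions, keep the smallest hash
    value over the whole shingle set."""
    return [min(_hash(s, seed) for s in shingles) for seed in range(n_perm)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard
    similarity of the underlying shingle sets."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


# Hypothetical shingles: "<substructure>|<distance>|<substructure>".
mol_a = {"CC|1|CO", "CC|2|C=O", "CO|1|C=O"}
mol_b = {"CC|1|CO", "CC|2|C=O", "CN|1|C=O"}

sig_a, sig_b = minhash(mol_a), minhash(mol_b)
print(estimated_jaccard(sig_a, sig_b))
```

The signature has a fixed length regardless of how many shingles a molecule produces, which is one reason the approach suits both small and large molecules.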
pattern
Pattern fingerprints were designed to be used in substructure screening. The algorithm identifies features in the molecule by doing substructure searches using a small number of very generic SMARTS patterns and then hashing each occurrence of a pattern based on the atom and bond types involved. The fact that a particular pattern matched the molecule at all is also stored by hashing the pattern ID and size.
rdkit
This is an RDKit-specific fingerprint that is inspired by (though it differs significantly from) public descriptions of the Daylight fingerprint. The fingerprinting algorithm identifies all subgraphs in the molecule within a particular range of sizes, hashes each subgraph to generate a raw bit ID, that is then folded into the requested fingerprint size as binary vectors. Options are available to generate count-based forms of the fingerprint or “non-folded” forms (using a sparse representation).
topological
Topological torsion fingerprints are a type of molecular fingerprint that represents the topological features of a molecule based on its graph representation. They are generated by computing the frequencies of all possible molecular torsions in a molecule and then encoding them as a binary vector.
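As a rough sketch of the torsion counting step, the snippet below enumerates every path of four consecutively bonded atoms in a small molecular graph and counts them by element symbols. Real topological torsion atom types carry more information than the element alone; the simplified typing, the function name, and the input format are assumptions made for illustration.

```python
from collections import Counter


def torsion_counts(atoms, bonds):
    """atoms: list of element symbols; bonds: list of (i, j) index pairs.
    Returns a Counter mapping 4-atom label tuples to their frequency."""
    neighbours = {i: set() for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].add(j)
        neighbours[j].add(i)

    counts = Counter()
    for b in neighbours:
        for c in neighbours[b]:
            if b >= c:
                continue  # visit each central bond b-c only once
            for a in neighbours[b] - {c}:
                for d in neighbours[c] - {b, a}:
                    labels = tuple(atoms[k] for k in (a, b, c, d))
                    # A torsion read backwards is the same torsion.
                    counts[min(labels, labels[::-1])] += 1
    return counts


# 1-butanol heavy atoms: C-C-C-C-O
atoms = ["C", "C", "C", "C", "O"]
bonds = [(0, 1), (1, 2), (2, 3), (3, 4)]
print(torsion_counts(atoms, bonds))
```

The resulting counts can be kept as they are for a count-based variant, or hashed into bit positions to obtain a binary vector.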
fcfp
Functional-class fingerprints (FCFPs) are an extension of ECFPs which incorporate information about the functional classes of atoms in a molecule. FCFPs are intended to capture more abstract property-based substructural features and leverage atomic characteristics that relate more to pharmacophoric features (e.g. hydrogen donor/acceptor, polarity, aromaticity, etc.).
avalon
Similar to Daylight fingerprints, Avalon uses a fingerprint generator that enumerates certain paths and feature classes of the molecular graph. The fingerprint bit positions are hashed from the description of the feature; however, the hash codes for all the path-style features are computed implicitly while they are enumerated.
gin_supervised_masking
GIN neural network model pre-trained with masked modelling on molecules from ChEMBL.
gin_supervised_infomax
GIN neural network model pre-trained with mutual information maximisation on molecules from ChEMBL.
gin_supervised_edgepred
GIN neural network model pre-trained with supervised learning and edge prediction on molecules from ChEMBL.
jtvae_zinc_no_kl
A JTVAE pre-trained on ZINC for molecule generation, without KL regularization.
pcqm4mv2_graphormer_base
Graph Transformer pretrained on PCQM4Mv2 for HOMO-LUMO energy gap prediction using 2D molecular graphs.
gin_supervised_contextpred
GIN neural network model pre-trained with supervised learning and context prediction on molecules from ChEMBL.