Data description
#Outline
It was reported that an estimated 4,292,000 new cancer cases and 2,814,000 cancer deaths would occur in China in 2015. [Chen, W., et al. (2016), Cancer statistics in China, 2015.][1]
Small molecules play a non-trivial role in cancer chemotherapy. Here I focus on inhibitors of 8 protein kinases (name: abbr):
- Cyclin-dependent kinase 2: cdk2
- Epidermal growth factor receptor erbB1: egfr_erbB1
- Glycogen synthase kinase-3 beta: gsk3b
- Hepatocyte growth factor receptor: hgfr
- MAP kinase p38 alpha: map_k_p38a
- Tyrosine-protein kinase LCK: tpk_lck
- Tyrosine-protein kinase SRC: tpk_src
- Vascular endothelial growth factor receptor 2: vegfr2
For each protein kinase, several thousand molecules were collected from the ChEMBL database. Molecules with an IC50 lower than 10 uM are usually considered inhibitors; all others are considered non-inhibitors.
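The labeling rule above can be written as a minimal sketch. The helper `label_by_ic50`, the constant `IC50_THRESHOLD_UM`, the 1/0 encoding, and the sample values are all assumptions made for illustration; the description does not say how the labels are encoded or how a value exactly at the threshold is handled.

```python
# Hypothetical helper: only the 10 uM cutoff comes from the description;
# the names, the 1/0 encoding, and the sample values are made up.
IC50_THRESHOLD_UM = 10.0


def label_by_ic50(ic50_um, threshold_um=IC50_THRESHOLD_UM):
    """Return 1 (inhibitor) if IC50 is below the threshold, else 0 (non-inhibitor)."""
    return 1 if ic50_um < threshold_um else 0


example_ic50_um = [0.05, 3.2, 10.0, 250.0]  # made-up IC50 values in uM
example_labels = [label_by_ic50(v) for v in example_ic50_um]
print(example_labels)
```

Here a molecule with an IC50 of exactly 10 uM is treated as a non-inhibitor, which follows the "lower than 10 uM" wording literally.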
#Challenge
Based on those labeled molecules, build your model and try to make the right predictions.
Additionally, more than 70,000 small molecules were drawn from the PubChem database. You can screen these molecules to find potential inhibitors. P.S. The majority of these molecules are non-inhibitors.
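One simple way to think about screening is to rank candidate molecules by how similar their fingerprints are to known inhibitors. The sketch below is a stand-in baseline using Tanimoto similarity on made-up binary fingerprints. It is not the approach in the sample code (pk_random_forest.py); the function name `max_tanimoto` and the toy arrays are hypothetical, and it assumes fingerprints can be treated as on/off bits.

```python
import numpy as np


def max_tanimoto(candidates, references):
    """For each candidate row, return its highest Tanimoto similarity to any reference row."""
    cand = (np.asarray(candidates) != 0).astype(np.int64)
    ref = (np.asarray(references) != 0).astype(np.int64)
    common = cand @ ref.T                                  # shared on-bits per pair
    union = cand.sum(axis=1)[:, None] + ref.sum(axis=1)[None, :] - common
    sim = np.divide(common, union,
                    out=np.zeros(common.shape, dtype=float),
                    where=union > 0)                       # 0.0 where both rows are empty
    return sim.max(axis=1)


# Made-up toy fingerprints (6 bits each), for illustration only.
known_inhibitors = np.array([[1, 1, 0, 0, 1, 0],
                             [1, 0, 0, 0, 1, 1]])
screening_pool = np.array([[1, 1, 0, 0, 1, 0],
                           [0, 0, 1, 1, 0, 0],
                           [1, 0, 0, 0, 1, 0]])

scores = max_tanimoto(screening_pool, known_inhibitors)
ranking = np.argsort(-scores)  # candidate indices, most similar first
print(scores, ranking)
```

Since most of the PubChem molecules are non-inhibitors, a ranked list like this is usually more useful than a hard yes/no cut; a trained classifier's predicted probabilities could be ranked the same way.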
#DataSets(hdf5 version)
There are 8 protein kinase files and 1 file of PubChem negative samples. Taking "cdk2.h5" as an example:
```python
import h5py
from scipy import sparse

hf = h5py.File("../input/cdk2.h5", "r")

# the name of each molecule ([()] reads the whole dataset into memory)
ids = hf["chembl_id"][()]

# each fingerprint set is stored as the three arrays of a CSR matrix
ap = sparse.csr_matrix(
    (hf["ap"]["data"][:], hf["ap"]["indices"][:], hf["ap"]["indptr"][:]),
    shape=(len(hf["ap"]["indptr"]) - 1, 2039))
mg = sparse.csr_matrix(
    (hf["mg"]["data"][:], hf["mg"]["indices"][:], hf["mg"]["indptr"][:]),
    shape=(len(hf["mg"]["indptr"]) - 1, 2039))
tt = sparse.csr_matrix(
    (hf["tt"]["data"][:], hf["tt"]["indices"][:], hf["tt"]["indptr"][:]),
    shape=(len(hf["tt"]["indptr"]) - 1, 2039))

# the samples' features: each row is a sample, and each sample has 3*2039 features
features = sparse.hstack([ap, mg, tt]).toarray()

# the label of each molecule
labels = hf["label"][()]

hf.close()
```
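Each fingerprint group in the h5 file is read above as three arrays (`data`, `indices`, `indptr`), which is the CSR layout that `scipy.sparse.csr_matrix` accepts. To show what those arrays mean, the sketch below decodes such a triplet by hand with NumPy. The helper `csr_to_dense` and the toy values (4 columns instead of 2039) are hypothetical, and it assumes no column index is repeated within a row.

```python
import numpy as np


def csr_to_dense(data, indices, indptr, n_cols):
    """Expand CSR arrays into a dense matrix, one row per molecule."""
    n_rows = len(indptr) - 1
    dense = np.zeros((n_rows, n_cols), dtype=np.asarray(data).dtype)
    for row in range(n_rows):
        start, end = indptr[row], indptr[row + 1]
        # indices[start:end] are the non-zero columns of this row,
        # data[start:end] are the values stored in those columns
        dense[row, indices[start:end]] = data[start:end]
    return dense


# Made-up toy triplet: 3 molecules, 4 fingerprint columns.
toy_data = np.array([1, 2, 1])
toy_indices = np.array([0, 3, 2])
toy_indptr = np.array([0, 2, 2, 3])  # row 1 is empty because indptr[1] == indptr[2]

toy_dense = csr_to_dense(toy_data, toy_indices, toy_indptr, n_cols=4)
print(toy_dense)
```

The number of rows is `len(indptr) - 1`, which is why the loading code computes the matrix shape from `indptr`.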
#DataSets(csv version)
The first column is the label; the remaining 8192 columns are features (fingerprints).
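A minimal sketch of splitting such a file into labels and features, assuming the csv is comma-separated, purely numeric, and has no header row. The in-memory `toy_csv` is made up and uses 4 feature columns instead of 8192 so that it stays readable.

```python
import io

import numpy as np

# Made-up stand-in for a csv file: label first, then the feature columns.
toy_csv = io.StringIO("1,0,1,0,0\n0,1,0,0,1\n1,0,0,1,1\n")

table = np.loadtxt(toy_csv, delimiter=",", ndmin=2)
csv_labels = table[:, 0].astype(int)  # first column: the label
csv_features = table[:, 1:]           # remaining columns: the fingerprints

print(csv_labels, csv_features.shape)
```

For a real file, the path would be passed to `np.loadtxt` in place of `toy_csv`, and `csv_features` would then have 8192 columns.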
#Q&A
Q. What does the "chembl_id" encode in the PubChem negative samples?
A. The PubChem CID.
Q. What is the difference between the h5 and csv data?
A. The csv and h5 files are different versions of the data. You can use either of them.
The h5 version: In my sample code (pk_input.py and pk_random_forest.py), I use a combination of three sets of fingerprints (ap, mg, and tt for short). Each set contains 2039 fingerprints, so the total number of fingerprints (columns) is 6117.
The csv version: The first column is the label; the remaining columns are features. Only the set of Morgan (mg for short) fingerprints is used as features, and there are 8192 of them.
Q. How are the features (fingerprints) generated? How are they selected?
A. (Taking the csv version as an example.) The features are a subset of fingerprints. Each fingerprint's ID is just an integer (e.g. 10552354, 10552386, 10552674) with no meaning of its own, so it is not shown. That is why there are no column names in the csv files.
The csv file contains the label and the features. The first column is the label, and the remaining columns are the feature data. The feature data is a matrix with shape=(N, 8192), where N is the number of molecules and 8192 is the number of fingerprints.
A fingerprint is selected if it is 'frequent' enough. 'Rare' fingerprints, those that appear in <5% (the ratio could be adjusted) of all the molecules in the dataset, are left out.
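The frequency filter described here can be sketched as follows. This is an illustration of the idea, not the code used to build the dataset: the helper `frequent_columns`, its `min_fraction` parameter, and the toy matrix are hypothetical, and a fingerprint is assumed to "appear" in a molecule when its value is non-zero.

```python
import numpy as np


def frequent_columns(fp_matrix, min_fraction=0.05):
    """Return a boolean mask of fingerprint columns present in at least min_fraction of molecules."""
    present = np.asarray(fp_matrix) != 0
    frequency = present.mean(axis=0)  # fraction of molecules containing each fingerprint
    return frequency >= min_fraction


# Made-up toy matrix: 40 molecules, 3 candidate fingerprints.
toy_fps = np.zeros((40, 3), dtype=int)
toy_fps[:20, 0] = 1  # present in 50% of molecules: kept
toy_fps[:1, 1] = 1   # present in 2.5% of molecules: rare
toy_fps[:4, 2] = 1   # present in 10% of molecules: kept

keep_mask = frequent_columns(toy_fps, min_fraction=0.05)
filtered_fps = toy_fps[:, keep_mask]
print(keep_mask, filtered_fps.shape)
```

Raising `min_fraction` keeps fewer, more common fingerprints; lowering it keeps more columns at the cost of a wider, sparser feature matrix.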
Q. What do 'ap', 'mg', and 'tt' mean in the h5 files?
A. 'ap', 'mg', and 'tt' are short for 'Atom Pairs', 'Morgan Fingerprints (Circular Fingerprints)', and 'Topological Torsions'. They are three different sets of molecular fingerprints calculated by RDKit. For more information: http://www.rdkit.org/docs/GettingStartedInPython.html#topological-fingerprints
Q. What are the IDs of the molecules in the csv version?
A. Unfortunately, they are missing and cannot be recovered. If you need the molecules, you can take the IDs from the h5 files and retrieve their structures in SMILES format by chembl_id from the ChEMBL database (or in SDF format by PubChem CID from the PubChem database), and finally use RDKit to compute any fingerprints needed.
#Reference
