
BELKA: supplementary calcs & data for ML

biotechnology, exploratory data analysis, data cleaning, classification, feature engineering

148.48MB

Data ID: D17171496092095081

Published: 2024/05/31


Data description

All stored data and calculations are for the Leash Bio - Predict New Medicines with BELKA competition and are based on its corresponding datasets.

"Train subsets" folder (the subsets will not be changed or re-uploaded): The train parquet file contains SMILES (Simplified Molecular-Input Line-Entry System) strings for ~295M molecules, with binary binding labels for each of three protein targets. This data was split into three subsets, one per protein, so that each subset contains all molecules that bind to that protein plus an equal number of randomly chosen molecules that do not bind to it.

The subsets contain:
HSA protein: 816,820 molecules
sEH protein: 1,449,064 molecules
BRD4 protein: 913,928 molecules
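
The balancing rule described above (every binder plus an equal number of randomly drawn non-binders) can be sketched in base R. This is only an illustration on toy data, not the code used to build the stored subsets; the helper name `balance_subset`, the column names `molecule_smiles` and `binds`, and the seed are all assumptions.

```r
# Hypothetical helper: keep every binder and sample an equal number of non-binders.
# Assumes a data frame with a 0/1 label column (the name `binds` is an assumption).
balance_subset <- function(df, label_col = "binds", seed = 42) {
  set.seed(seed)
  pos_idx <- which(df[[label_col]] == 1)
  neg_idx <- which(df[[label_col]] == 0)
  stopifnot(length(neg_idx) >= length(pos_idx))
  # sample.int() on the index count avoids the sample() pitfall for length-1 vectors
  neg_keep <- neg_idx[sample.int(length(neg_idx), length(pos_idx))]
  df[sort(c(pos_idx, neg_keep)), , drop = FALSE]
}

# Toy data: 3 binders among 10 molecules (SMILES are placeholders)
toy <- data.frame(
  molecule_smiles = paste0("C", strrep("C", 0:9)),
  binds = c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0),
  stringsAsFactors = FALSE
)

balanced <- balance_subset(toy)
table(balanced$binds)
```

A subset built this way always has an even number of rows, split equally between the two classes, which is consistent with the molecule counts listed above.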

The subsets are stored in the qs format, which is fast and can be read in R as follows:

library(qs)
dt <- qread("/kaggle/input/belka-supplementary-calcs-and-data-for-ml/train subsets/BRD4_all_bind1_rand_bind0.qs")

"Smiles" folder: Contains all unique SMILES from all three building blocks of the test and train sets in CSV (all_bb_smiles_by_bb.csv) and SDF (all_bb_smiles.sdf) formats. The conversion can be done easily with the ChemmineOB package and the OpenBabel software, but unfortunately they are not available on Kaggle.

all_bb_sdfset <- smiles2sdf(named_vector_of_smiles)
write.SDF(all_bb_sdfset, file = "all_bb_smiles_sdfset.sdf")
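
The argument name `named_vector_of_smiles` in the call above suggests a character vector of SMILES whose names are molecule identifiers. A minimal base-R sketch of building such a vector from a table of building blocks follows. The toy data frame and its column names `id` and `smiles` are assumptions and are not taken from all_bb_smiles_by_bb.csv.

```r
# Toy stand-in for a building-block table; column names are assumptions.
bb <- data.frame(
  id = c("bb_1", "bb_2", "bb_2", "bb_3"),
  smiles = c("CCO", "c1ccccc1", "c1ccccc1", "CC(=O)O"),
  stringsAsFactors = FALSE
)

# Keep each SMILES string once, then attach the identifiers as names.
bb_unique <- bb[!duplicated(bb$smiles), ]
named_vector_of_smiles <- setNames(bb_unique$smiles, bb_unique$id)

named_vector_of_smiles
```

The names are what let each converted structure be traced back to its building block, so it is worth checking that they are unique before converting.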

"Features" folder: Contains different features in CSV format, generated in this notebook.

"for sim analysis" folder: Contains some precalculated data for this notebook, which investigates the similarity between train and test molecules and a possible way to track the generalizability of an ML model.

