OP2: supplementary calcs & data for ML

老下头

genetics, biology, education, feature engineering

￥24

Sold: 0

702.43MB

Data ID: D17175053170585283

Published: 2024/06/04

The data are for Open Problems – Single-Cell Perturbations

Format: Most supplementary files are in the .qs format, which is fast to read in R; features are stored as CSV files.

Features generated in the Feature engineering notebook:

TE_features_pca (152) - target encoding followed by PCA
pca_TE_features (1228) - PCA of targets followed by target encoding
dummy_features (152) - dummy encoded features
molecula_descriptors (167) - molecular descriptors from the rcdk R package
morgan_firgerptints_features (1024) - Morgan fingerprints from the SMILES
ChemBERTa embeddings - by Aleksey Trepetsky (dataset; notebook; article)
adata features (18211 and 85) - generated on a local PC (3090); the code is shown in this notebook
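
To make two of the encodings listed above concrete, here is a minimal Python sketch of target encoding and dummy encoding. It is not the author's code (the notebook is not reproduced here, and the original work appears to be done in R); the toy cell types and target values are hypothetical, and the PCA step applied to the real features is left out.

```python
from collections import defaultdict

def target_encode(categories, targets):
    """Replace each category with the mean target value seen for that category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    return [means[cat] for cat in categories], means

def dummy_encode(categories):
    """Dummy (one-hot) encode: one 0/1 column per distinct category."""
    levels = sorted(set(categories))
    rows = [[1 if cat == level else 0 for level in levels] for cat in categories]
    return rows, levels

# Hypothetical toy data: one categorical column and one numeric target.
cell_types = ["B", "NK", "B", "T", "NK"]
target = [1.0, 0.5, 3.0, -2.0, 1.5]

encoded, means = target_encode(cell_types, target)
dummies, levels = dummy_encode(cell_types)
```

In practice the category means would be computed on training folds only, so that the encoded feature does not leak the target of the row being encoded.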

additions_to_train

genes_with_same_value - genes with the same expression changes in the entire de_train - an artifact created by the LIMMA model. These genes have zero expression in adata_train.
id_map_test_sets - id_map with marked private and public subsets
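
One way such genes could be detected is sketched below in Python. It assumes that "same value" means the genes' columns in de_train are exactly identical to one another; that reading, the function name, and the toy table are assumptions for illustration, not the author's procedure.

```python
from collections import defaultdict

def groups_of_identical_columns(table):
    """Group column names whose value vectors are exactly equal.

    `table` maps a column name to its list of values (one per row).
    Only groups containing more than one column are returned.
    """
    by_values = defaultdict(list)
    for name, values in table.items():
        by_values[tuple(values)].append(name)
    return [sorted(names) for names in by_values.values() if len(names) > 1]

# Hypothetical miniature stand-in for de_train: gene -> values across rows.
de_toy = {
    "GENE_A": [0.12, -0.40, 0.07],
    "GENE_B": [1.30, 0.25, -0.90],
    "GENE_C": [0.12, -0.40, 0.07],
}
same_value_groups = groups_of_identical_columns(de_toy)
```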

The DrugBank data were exported on 30 September 2023 (DrugBank Release Version 5.1.10, released on 2023-01-04) and analyzed in this notebook.

MLP_experiments

MLP_metrics_aug_noise_filt - using the MLP NN, I ran several experiments with filtering the training data, adding noise to the features, and data augmentation (tricks that either improved or did not affect LB scores), and compared my validation scheme (on test drugs) with several others in terms of how well they reflect changes in LB score (see the results).
CV_LB_metrics_for_MLP - metrics for 51 model versions, with public and private LB scores. The analysis is here.
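
As an illustration of the noise trick mentioned above, the following Python sketch appends noisy copies of each feature row to the training set. The Gaussian noise model, the `sigma` and `copies` parameters, and the function name are assumptions; the source does not state how the noise was actually generated.

```python
import random

def augment_with_noise(rows, sigma=0.01, copies=1, seed=0):
    """Return the original rows followed by `copies` noisy duplicates of each.

    Each duplicate adds independent Gaussian noise (mean 0, std `sigma`)
    to every feature value. A fixed seed keeps the result reproducible.
    """
    rng = random.Random(seed)
    augmented = [list(row) for row in rows]
    for _ in range(copies):
        for row in rows:
            augmented.append([x + rng.gauss(0.0, sigma) for x in row])
    return augmented

# Hypothetical toy feature matrix: two samples, two features.
features = [[0.0, 1.0], [2.0, 3.0]]
noisy = augment_with_noise(features, sigma=0.01, copies=2, seed=42)
```

The targets for the noisy copies would simply repeat the targets of the rows they were derived from.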

GOenrich - visualizations of the GO enrichment analysis for this notebook.

adata: Raw counts in sparse format (genes in rows, cells in columns) from adata_train.parquet (competition data)
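
The Python sketch below shows one way a table of per-cell gene counts can be turned into a sparse genes-by-cells layout. The column layout of adata_train.parquet is not described here, so the input is assumed to be a list of hypothetical (gene, cell, count) records, and the dictionary-based storage is only a stand-in for a real sparse matrix class.

```python
def to_sparse_counts(records):
    """Build a sparse genes-by-cells count matrix from (gene, cell, count) records.

    Returns (genes, cells, entries), where `entries` maps
    (gene_index, cell_index) to a nonzero count. Zero counts are not stored.
    """
    genes = sorted({gene for gene, _, _ in records})
    cells = sorted({cell for _, cell, _ in records})
    gene_idx = {gene: i for i, gene in enumerate(genes)}
    cell_idx = {cell: j for j, cell in enumerate(cells)}
    entries = {}
    for gene, cell, count in records:
        if count != 0:
            key = (gene_idx[gene], cell_idx[cell])
            entries[key] = entries.get(key, 0) + count
    return genes, cells, entries

# Hypothetical long-format records: (gene, cell, raw count).
records = [("G1", "c1", 3), ("G2", "c1", 0), ("G2", "c2", 5), ("G1", "c3", 1)]
genes, cells, entries = to_sparse_counts(records)
```

Because most gene counts in single-cell data are zero, storing only the nonzero entries keeps the matrix far smaller than a dense table.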
