The following is the data validation report that the seller has chosen to provide:
Data description
These data accompany the Open Problems – Single-Cell Perturbations competition.
Format: most supplementary files are stored in the .qs format, which is fast and can be read with R; features are stored as CSV files.
Features generated in the Feature engineering notebook:
- TE_features_pca (152) - target encoding followed by PCA
- pca_TE_features (1228) - PCA of targets followed by target encoding
- dummy_features (152) - dummy encoded features
- molecula_descriptors (167) - molecular descriptors from the rcdk R package
- morgan_firgerptints_features (1024) - Morgan fingerprints from the SMILES
- ChemBERTa embeddings - by Aleksey Trepetsky (dataset, notebook, article)
- adata features (18211 and 85) - generated on a local PC (3090); the code is shown in this notebook
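To make the two encodings named in the list above concrete, here is a minimal, dependency-free sketch of dummy (one-hot) encoding and plain mean target encoding. It is not the code from the Feature engineering notebook: the function names, the toy compound names, and the toy values are all hypothetical, and the PCA step that follows target encoding in TE_features_pca is not shown.

```python
from collections import defaultdict


def dummy_encode(values):
    """One-hot encode a list of categorical values.

    Returns (categories, rows), where rows[i][j] is 1 if
    values[i] == categories[j] and 0 otherwise.
    """
    categories = sorted(set(values))
    index = {c: j for j, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows


def target_encode(values, targets):
    """Replace each category with the mean target observed for that category.

    Returns (means, encoded), where means maps category -> mean target.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return means, [means[v] for v in values]


# Hypothetical toy data: a categorical column and one numeric target column.
compounds = ["drug_a", "drug_b", "drug_a", "drug_c"]
de_values = [1.0, -2.0, 3.0, 0.5]

categories, onehot = dummy_encode(compounds)
means, encoded = target_encode(compounds, de_values)
```

With many target columns, target encoding yields one encoded column per target, which is one reason a dimensionality reduction step such as PCA is often applied afterwards. In a real pipeline the category means are usually computed on training folds only, so that the encoding does not leak validation targets.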
additions_to_train
- genes_with_same_value - genes with the same expression changes across the entire de_train; this is an artifact created by the LIMMA model. These genes have zero expression in adata_train.
- id_map_test_sets - id_map with marked private and public subsets
The DrugBank data were exported on 30 September 2023 (DrugBank Release Version 5.1.10, released on 2023-01-04) and analyzed in this notebook.
MLP_experiments
- MLP_metrics_aug_noise_filt - using the MLP neural network, I ran several experiments with filtering the training data, adding noise to the features, and data augmentation; these tricks either improved the LB scores or left them unchanged. I also compared my validation scheme (on test drugs) with several others in terms of how well each reflects changes in the LB score (see the results).
- CV_LB_metrics_for_MLP - metrics for 51 versions of the model, together with public and private LB scores. The analysis is here.
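One simple way to quantify how well a validation scheme reflects changes in the LB score is a rank correlation between the two across model versions. The sketch below implements Spearman's rank correlation using only the standard library. It is an illustration of the idea, not the analysis linked above; the function names and the score values are hypothetical.

```python
def ranks(xs):
    """Return 1-based average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    result = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with xs[order[i]].
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(a, b):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


# Hypothetical validation and leaderboard scores for four model versions.
cv_scores = [0.60, 0.58, 0.59, 0.57]
lb_scores = [0.580, 0.572, 0.575, 0.570]

rho = spearman(cv_scores, lb_scores)
```

A value of rho close to 1 means that the validation scheme orders the model versions in the same way as the leaderboard, so an improvement in validation is likely to carry over; a value near 0 means the validation scheme says little about the LB ranking.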
GOenrich: visualizations of the GO enrichment analysis for this notebook.
adata: raw counts in sparse format (genes in rows, cells in columns) from adata_train.parquet (competition data).
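To illustrate the layout described for adata (genes in rows, cells in columns, only non-zero counts stored), here is a dependency-free sketch that builds a compressed sparse row (CSR) representation from long-format records. It assumes, hypothetically, that the counts are available as (gene, cell, count) triplets with each gene-cell pair appearing at most once; the function name, gene names, and cell names are made up, and this is not the code used to produce the files.

```python
def to_csr(records):
    """Build a genes-by-cells CSR matrix from (gene, cell, count) triplets.

    Returns (genes, cells, data, indices, indptr). The non-zero entries of
    row r are data[indptr[r]:indptr[r + 1]], located at the column positions
    indices[indptr[r]:indptr[r + 1]].
    """
    genes = sorted({g for g, _, _ in records})
    cells = sorted({c for _, c, _ in records})
    gene_idx = {g: i for i, g in enumerate(genes)}
    cell_idx = {c: j for j, c in enumerate(cells)}

    # Collect the non-zero entries of each gene (row).
    rows = [[] for _ in genes]
    for g, c, n in records:
        if n != 0:
            rows[gene_idx[g]].append((cell_idx[c], n))

    # Flatten the rows into the three CSR arrays, columns sorted within a row.
    data, indices, indptr = [], [], [0]
    for row in rows:
        for j, n in sorted(row):
            indices.append(j)
            data.append(n)
        indptr.append(len(data))
    return genes, cells, data, indices, indptr


# Hypothetical long-format records: (gene, cell, raw count).
records = [
    ("GENE_B", "cell_2", 3),
    ("GENE_A", "cell_1", 5),
    ("GENE_A", "cell_3", 1),
    ("GENE_C", "cell_2", 2),
]

genes, cells, data, indices, indptr = to_csr(records)
```

In practice a sparse-matrix library would do this conversion; the hand-written version only shows what "sparse format" stores: the non-zero counts, their column positions, and one offset per row.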

OP2: supplementary calcs & data for ML
702.43MB