老下头

OP2: supplementary calcs & data for ML

genetics, biology, education, feature engineering

24

Sold: 0
702.43MB

Data ID: D17175053170585283

Published: 2024/06/04

The following is the data verification report the seller chose to provide:

Data description

The data are for the Open Problems – Single-Cell Perturbations competition.

Format: Most supplementary files are stored in the .qs format, which is fast to read and write from R; features are stored as CSV files.

Features generated in the Feature engineering notebook:

  • TE_features_pca (152) - target encoding followed by PCA
  • pca_TE_features (1228) - PCA of targets followed by target encoding
  • dummy_features (152) - dummy encoded features
  • molecula_descriptors (167) - molecular descriptors from the rcdk R package
  • morgan_firgerptints_features (1024) - Morgan fingerprints from the SMILES
  • ChemBERTa embeddings - author: Aleksey Trepetsky (dataset, notebook, article)
  • adata features (18211 and 85) - generated on a local PC (3090); the code is shown in this notebook
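
The two encodings named in the list above can be illustrated with a minimal sketch. It is not the code from the Feature engineering notebook: the function names, the toy categories, and the toy target values are all hypothetical, and the PCA step that follows target encoding in TE_features_pca is left out to keep the example dependency-free.

```python
from collections import defaultdict


def target_encode(categories, targets):
    """Replace each category with the mean target value observed for it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    return [means[cat] for cat in categories], means


def dummy_encode(categories):
    """One-hot encode: one 0/1 column per distinct category, in sorted order."""
    levels = sorted(set(categories))
    index = {cat: i for i, cat in enumerate(levels)}
    rows = []
    for cat in categories:
        row = [0] * len(levels)
        row[index[cat]] = 1
        rows.append(row)
    return rows, levels


# Toy data (hypothetical values, not taken from the dataset).
cell_type = ["NK cells", "B cells", "NK cells", "B cells"]
de_value = [1.0, 3.0, 2.0, 5.0]

encoded, means = target_encode(cell_type, de_value)
dummies, levels = dummy_encode(cell_type)
print(encoded)  # per-row mean target of the row's category
print(levels, dummies)
```

When target encoding is used for validation, the category means are normally computed on the training folds only, so that the held-out targets do not leak into the features.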

additions_to_train

  • genes_with_same_value - genes with the same expression changes in the entire de_train - an artifact created by the LIMMA model. These genes have zero expression in adata_train.
  • id_map_test_sets - id_map with marked private and public subsets
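
As a rough illustration of how a list like genes_with_same_value could be derived, the sketch below flags columns whose value is identical in every row. It assumes that "the same expression changes in the entire de_train" means a constant column; the gene names and numbers are made up, and the helper `constant_columns` is hypothetical rather than the author's code.

```python
def constant_columns(rows, names):
    """Return the names of columns whose value is identical in every row."""
    if not rows:
        return []
    first = rows[0]
    return [
        name
        for j, name in enumerate(names)
        if all(row[j] == first[j] for row in rows)
    ]


# Toy stand-in for de_train: rows are samples, columns are genes (hypothetical).
gene_names = ["GENE_A", "GENE_B", "GENE_C"]
de_rows = [
    [0.12, -1.0, 0.5],
    [0.12, 2.0, 0.5],
    [0.12, 0.3, 0.7],
]

same_value_genes = constant_columns(de_rows, gene_names)
print(same_value_genes)
```

The comparison uses exact equality; on real floating-point data a small tolerance may be more appropriate.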

The DrugBank data were exported on 30 September 2023 (DrugBank Release Version 5.1.10, released on 2023-01-04) and analyzed in this notebook.

MLP_experiments

  • MLP_metrics_aug_noise_filt - results of several MLP experiments with filtering of the train data, noise added to the features, and data augmentation (tricks that either improved or did not affect LB scores). The file also compares my validation scheme (on test drugs) with several others in terms of how well each reflects changes in the LB score (see the results).
  • CV_LB_metrics_for_MLP - metrics for 51 model versions, with public and private LB scores. The analysis is here.
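
One common way to judge how well a validation scheme tracks the leaderboard is a rank correlation between the validation metric and the LB score across model versions. The sketch below shows that idea with Spearman correlation; it is an assumption about a reasonable analysis, not a description of what the linked analysis actually does, and all scores are invented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical scores for four model versions (not from the dataset).
cv_scheme_a = [0.60, 0.58, 0.57, 0.55]
cv_scheme_b = [0.60, 0.61, 0.56, 0.59]
public_lb = [0.580, 0.575, 0.570, 0.560]

rho_a = spearman(cv_scheme_a, public_lb)
rho_b = spearman(cv_scheme_b, public_lb)
print(rho_a, rho_b)
```

In this toy case scheme A orders the four versions exactly as the leaderboard does, while scheme B only partly agrees, so A would be the more trustworthy guide.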

GOenrich: Visualizations of the GO enrichment analysis for this notebook.

adata: Raw counts in sparse format (genes in rows, cells in columns) from adata_train.parquet (competition data).
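
Because the counts are stored with genes in rows and cells in columns, tools that expect one row per cell need the matrix transposed first. The sketch below shows the idea on coordinate triplets; the triplet representation, the helper names, and the toy counts are assumptions for illustration, since the listing does not state how the sparse matrix is serialized.

```python
def transpose_coo(entries):
    """Swap row and column indices of (row, col, value) triplets."""
    return [(col, row, value) for row, col, value in entries]


def to_dense(entries, n_rows, n_cols):
    """Expand triplets into a dense list of lists, filling gaps with 0."""
    dense = [[0] * n_cols for _ in range(n_rows)]
    for r, c, v in entries:
        dense[r][c] = v
    return dense


# Toy matrix: 3 genes (rows) x 2 cells (columns), hypothetical counts.
genes_by_cells = [(0, 0, 5), (2, 0, 1), (1, 1, 3)]

cells_by_genes = transpose_coo(genes_by_cells)
dense = to_dense(cells_by_genes, n_rows=2, n_cols=3)
print(dense)
```

Only the nonzero entries are stored and moved, so the transpose costs time proportional to the number of nonzero counts rather than to the full matrix size.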
