The following is the data validation report that the seller has chosen to provide:
Data description
Dataset GSE50161 on brain cancer gene expression from CuMiDa
- 5 classes
- 54676 genes
- 130 samples
About
Here we present the Curated Microarray Database (CuMiDa), a repository containing 78 handpicked cancer microarray datasets, extensively curated from 30,000 studies from the Gene Expression Omnibus (GEO), solely for machine learning. The aim of CuMiDa is to offer homogeneous and state-of-the-art biological preprocessing of these datasets, together with numerous 3-fold cross-validation benchmark results to propel machine learning studies focused on cancer research. The database makes various download options available for use by other programs, as well as PCA and t-SNE results. CuMiDa differs from existing databases by offering newer datasets that are manually and carefully curated, from sample quality and unwanted probes to background correction and normalization, to create a more reliable source of data for computational research.
http://sbcb.inf.ufrgs.br/cumida
References
Feltes, B. C., Chandelier, E. B., Grisci, B. I., & Dorn, M. (2019). CuMiDa: An extensively curated microarray database for benchmarking and testing of machine learning approaches in cancer research. Journal of Computational Biology, 26(4), 376-386. [https://doi.org/10.1089/cmb.2018.0238]
Grisci, B. I., Feltes, B. C., & Dorn, M. (2019). Neuroevolution as a tool for microarray gene expression pattern identification in cancer research. Journal of Biomedical Informatics, 89, 122-133. [https://doi.org/10.1016/j.jbi.2018.11.013]
Inspiration
- How to deal with class imbalance for classification?
- How to identify the most important genes for the classification of each cancer subtype?
- Is it possible to discover subtypes?
- How to beat the classification and clustering benchmarks for this dataset listed on the CuMiDa website?
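As a starting point for the class imbalance question, the sketch below shows one common approach: weight each class inversely to its frequency, and split the samples into 3 stratified folds so that every fold keeps roughly the same class proportions. It is a minimal illustration, not the CuMiDa benchmark procedure. The class names and per-class counts are invented placeholders (only the totals of 5 classes and 130 samples come from the description above), and the helper names `balanced_class_weights` and `stratified_folds` are hypothetical.

```python
from collections import Counter, defaultdict
import random


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def stratified_folds(labels, n_folds=3, seed=0):
    """Assign each sample index to a fold, keeping class proportions similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, cls in enumerate(labels):
        by_class[cls].append(idx)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(by_class):
        indices = by_class[cls]
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a near-equal share of the class.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


# Placeholder labels: 5 classes and 130 samples in total.
# The class names and per-class counts are invented for illustration only.
placeholder_counts = {
    "class_a": 50,
    "class_b": 30,
    "class_c": 25,
    "class_d": 15,
    "class_e": 10,
}
labels = [cls for cls, n in placeholder_counts.items() for _ in range(n)]

weights = balanced_class_weights(labels)
folds = stratified_folds(labels, n_folds=3)

print("class weights:", weights)
for fold_id, fold in enumerate(folds):
    # Each fold serves once as the test set; the other two form the training set.
    per_class = dict(Counter(labels[i] for i in fold))
    print("fold", fold_id, "size", len(fold), per_class)
```

With real data, the weights could be passed to a classifier that supports per-class weighting, and each fold would be held out once for evaluation while the other two are used for training. Rare classes receive larger weights, so misclassifying them costs more during training.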
