The following is the data validation report the seller chose to provide:
Data description
Microbiome
> The human gut contains trillions of microbial inhabitants, making it one of the most densely populated environments on the planet. The symbiosis between these organisms and the human host is extremely complex, and we are only beginning to understand the impact of the gut microbiota on human biology. Knowledge of the chemical reactions performed and compounds produced by gut microbes will provide new insights into their roles in influencing human health. By studying the gene content of the human gut microbiome and the enzymes encoded by these genes, we hope to better understand the chemical capabilities of this microbial community. However, the activities of the vast majority of enzymes found in microbiomes are unknown.
> Shotgun metagenomic sequencing is a relatively new sequencing approach that allows insight to be gained into community biodiversity and function. The function of shotgun metagenomic sequencing is to sequence the genomes of untargeted cells in a community in order to elucidate community composition and function. Research using the method, taps into several fields due to the broad existence of large microbial communities. For example, the study of soil microbiota has led to advances in understanding and treating plant pathogens. In human gut microbiota, the use of shotgun metagenomics discovered how common antibiotic genes are in our gut bacteria. By Sara Ryding.
Dataset
This dataset was created by the team of Edoardo Pasolli, Duy Tin Truong, Faizan Malik, Levi Waldron, and Nicola Segata. They published a research article in July 2016 and created MetAML (Metagenomic prediction Analysis based on Machine Learning).
The authors used 8 publicly available metagenomic datasets and applied MetaPhlAn2 to generate species abundance features. Their goal was to classify diseases from these abundance features and to determine the best ML models for the task. Through their experiments they settled on RandomForest as the best classifier for most diseases, with SVM doing better for some diseases.
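To make the model comparison concrete, here is a minimal sketch of scoring a random forest against an SVM with cross-validation in scikit-learn. It is not the authors' MetAML pipeline: the synthetic abundance matrix, the labels, the AUC metric, the fold count, and all hyperparameters are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a samples-by-species abundance matrix (assumption:
# real data would be loaded from the CSV files instead).
rng = np.random.default_rng(0)
X = rng.random((60, 20))                    # 60 samples, 20 species features
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # hypothetical disease label

# Two candidate classifiers; hyperparameters are illustrative defaults.
models = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "svm": make_pipeline(StandardScaler(), SVC()),
}

# Mean ROC AUC over 3 stratified folds for each model.
scores = {
    name: cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()
    for name, model in models.items()
}

for name, auc in scores.items():
    print(f"{name}: mean AUC = {auc:.3f}")
```

On the real data, `X` and `y` would come from one disease dataset at a time, since the best model may differ per disease.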
I transposed the abundance data to the 'traditional' view (rows as cases, columns as features), as opposed to the layout MetaPhlAn2 produced, and saved the result as CSV files for simpler ingestion.
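The transposition described above can be sketched with pandas as follows. The tiny tab-separated table, the taxon names, the `sample_id` index label, and the helper `to_samples_by_features` are all hypothetical; the actual files and column names in this dataset may differ.

```python
import io
import pandas as pd

# Hypothetical MetaPhlAn2-style table: one row per taxon, one column per sample.
raw = io.StringIO(
    "ID\tsample_1\tsample_2\tsample_3\n"
    "k__Bacteria|p__Firmicutes\t60.5\t40.0\t55.25\n"
    "k__Bacteria|p__Bacteroidetes\t39.5\t60.0\t44.75\n"
)

def to_samples_by_features(source):
    """Read a taxa-by-samples table and return it as samples-by-taxa."""
    taxa_by_sample = pd.read_csv(source, sep="\t", index_col=0)
    samples_by_taxa = taxa_by_sample.T          # rows become cases
    samples_by_taxa.index.name = "sample_id"    # illustrative index label
    return samples_by_taxa

abundance = to_samples_by_features(raw)

# Serialize to CSV text; pass a file path to to_csv() to write a file instead.
csv_text = abundance.to_csv()
print(abundance)
```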
New approaches
Can we get better predictions? Different models? Ensembling? Can we determine which sets of species drive better predictions and are therefore related to specific diseases?
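One possible starting point for the species question is to rank features by a random forest's impurity-based importances. This is only a sketch under stated assumptions: the data are synthetic, the feature names `species_0` to `species_4` are made up, and importance scores show association with the label in this model, not a causal link to disease.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic abundances in which only species_0 determines the label (assumption).
rng = np.random.default_rng(0)
feature_names = [f"species_{i}" for i in range(5)]   # hypothetical names
X = pd.DataFrame(rng.random((200, 5)), columns=feature_names)
y = (X["species_0"] > 0.5).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances, sorted from most to least informative.
ranking = (
    pd.Series(forest.feature_importances_, index=feature_names)
    .sort_values(ascending=False)
)
top_species = ranking.index[0]
print(ranking)
```

Impurity-based importances can be biased when features are correlated, so permutation importance on held-out data is worth comparing before drawing biological conclusions.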
Acknowledgements
Pasolli E, Truong DT, Malik F, Waldron L, Segata N (2016) Machine Learning Meta-analysis of Large Metagenomic Datasets: Tools and Biological Insights. PLoS Comput Biol 12(7): e1004977.
>Banner image by Sara López Gilabert/SAPIENS
