The following is the data validation report that the seller chose to provide:
Data Description
Context
Complete Genome of a Family of five - Two Parents, Three Siblings (Genome Phenotype SNPs Raw Data)
Genomics is a branch of molecular biology that studies the structure, function, variation, evolution, and mapping of genomes. Several companies offer next-generation sequencing of human genomes, ranging from the complete 3 billion base pairs to a few thousand phenotype SNPs. I used 23andMe (which uses the Illumina HumanOmniExpress-24) to obtain this family's phenotype SNPs. I am sharing the entire raw dataset of the family of five (father, mother, and three brothers) with the international research community for the following reasons:
I am a firm believer in open datasets, transparency, and the right to learn, research, explore, and educate. I do not want to restrict the flow of knowledge over mere privacy concerns. Hence, I am offering this entire family's raw DNA data for the world to use in research without worrying about privacy.
Most of the available test datasets for research come from the Western world, and we don't see much from developing countries. I am sharing this data to help bridge the gap, and I hope others will follow.
I would be the happiest man on earth if a life can be saved, knowledge can be gained, an idea can be explored, or a fact can be found using this DNA dataset. Please use it in any way you like.
Content
Family Origin: Pakistani
Country of Grandparents/Ancestors: India (Kerana, Uttar Pradesh - UP)
Files: Father, Mother, Child 1, Child 2, Child 3 (All CSVs)
Size: 75 MB
Sources: 23andMe Personalized Genome Reports
The research community is still making progress in this domain, and professionals agree that genomics is still in its infancy. This dataset gives you the chance to explore this young field and become one of the early adopters of genomics.
The dataset is a complete genome extracted from www.23andme.com, represented as a sequence of SNPs using the following symbols: A (adenine), C (cytosine), G (guanine), T (thymine), D (base deletion), I (base insertion), and '_' or '-' if the SNP at a particular location is not available. It covers chromosomes 1-22, X, Y, and mitochondrial DNA.
A complete list of the exact SNPs (base pairs) available and their index in the dataset can be found at https://api.23andme.com/res/txt/snps.b4e00fe1db50.data
For more information about how the dataset was extracted, see https://api.23andme.com/docs/reference/#genomes
For a more detailed understanding of the dataset content, please read the description at https://api.23andme.com/docs/reference/#genotypes
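As a starting point, the symbol scheme described above can be tallied with a few lines of Python. This is only a minimal sketch: the column names (rsid, chromosome, position, genotype), the sample rows, and the rsIDs are hypothetical and are assumed here for illustration, so check the headers of the actual CSV files before reusing it.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in an ASSUMED layout (rsid, chromosome, position, genotype).
# The real files may use different headers; the rsIDs below are made up.
SAMPLE_CSV = """rsid,chromosome,position,genotype
rs0000001,1,1000,AA
rs0000002,1,2000,CT
rs0000003,X,3000,G
rs0000004,MT,4000,--
rs0000005,2,5000,DI
"""

VALID_SYMBOLS = set("ACGTDI")   # bases plus deletion (D) and insertion (I)
MISSING_SYMBOLS = set("-_")     # markers for a SNP that is not available


def summarize_genotypes(handle):
    """Count allele symbols and no-call rows in a genotype CSV."""
    symbol_counts = Counter()
    no_calls = 0
    for row in csv.DictReader(handle):
        genotype = row["genotype"].strip()
        # A row counts as a no-call if it is empty or made only of '-' / '_'.
        if not genotype or set(genotype) <= MISSING_SYMBOLS:
            no_calls += 1
            continue
        symbol_counts.update(ch for ch in genotype if ch in VALID_SYMBOLS)
    return symbol_counts, no_calls


counts, missing = summarize_genotypes(io.StringIO(SAMPLE_CSV))
print(dict(counts), "no-calls:", missing)
```

To run it on one of the family files, pass an open file handle instead of the in-memory sample, for example `open(path, newline="")`, where `path` points to whichever CSV you downloaded.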
Acknowledgements
Users are allowed to use, copy, and distribute the dataset, and should cite it as follows: “Zeeshan-ul-hassan Usmani, Family of Five Genomic Dataset by 23andMe, Kaggle Dataset Repository, March 7, 2021.”
Useful Links
You may use the following human genome database sites for help:
GenBank - https://www.ncbi.nlm.nih.gov/genbank/
The Human Genome Project - https://www.genome.gov/hgp/
Genomes OnLine Database (GOLD) - https://gold.jgi.doe.gov
Complete Genomics - http://www.completegenomics.com/public-data/
Inspiration
Some ideas worth exploring:
Are any individuals in the dataset more susceptible to cancer?
Does he/she tend to gain weight?
Where is his/her place of origin?
Which gene determines a given biological feature (cancer susceptibility, fat generation rate, hair color, etc.)?
How do these phenotype SNPs compare with other similar datasets from the Western world?
How do the family members differ in genomic makeup? Which traits are recessive, and which are dominant?
What would be the likely cause of death for any given person?
What are the most likely diseases or illnesses this family is going to face in their lifetime?
What is unique about this dataset?
Can you compare the genomes within this family and see which diseases will have less or more impact on a given family member?
Can you delineate recombination sites precisely, identify sequence errors or find rare SNPs?
What else can you extract from this dataset when it comes to personal traits, intelligence level, ancestry, and body makeup?
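One simple way to start on the within-family comparisons and the search for possible genotyping errors is a Mendelian consistency check: at an autosomal SNP, a child's two alleles should be explainable as one allele from each parent. The sketch below is an illustration only. The function name and the sample genotypes are hypothetical, and it assumes two-character genotype strings such as "AG", skipping anything else (single-character calls on X, Y, or mitochondrial DNA, and no-calls).

```python
MISSING_SYMBOLS = set("-_")  # markers for a SNP that is not available


def is_mendelian_consistent(father, mother, child):
    """Check one autosomal SNP for a parent-parent-child trio.

    Returns True if the child's genotype can be formed by taking one
    allele from each parent, False if it cannot, and None if any of
    the three calls is missing or is not a two-allele genotype.
    """
    calls = (father, mother, child)
    if any(len(call) != 2 or set(call) & MISSING_SYMBOLS for call in calls):
        return None
    first, second = child
    # Try both assignments of the child's alleles to the two parents.
    return (first in father and second in mother) or (
        second in father and first in mother
    )


# Hypothetical genotypes at three SNPs, in (father, mother, child) order.
trios = [("AG", "GG", "AG"), ("AA", "AA", "AG"), ("AA", "--", "AA")]
for father, mother, child in trios:
    print(father, mother, child, is_mendelian_consistent(father, mother, child))
```

A False result does not prove an error on its own, since it can also come from a genuine new mutation or from a call that lies in a deleted region. It is best treated as a flag for SNPs that deserve a closer look.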
