Data Description
Acknowledgement
This dataset is an update of the Protein Secondary Structure Dataset. I am indebted to alfrandom for the original work and the wonderful GitHub repository that was used to create this update, as well as for permission to create it. I hope that this update is helpful and extends the original in useful and meaningful ways.
Background
Proteins are the operational units of life and perform an extensive number of fundamental functions, from enzymatic catalysis and immune function to movement and structure. The most basic description of a protein is its primary amino acid sequence - the ordered chain of individual subunits that make up the protein. The sequence of amino acids folds into a few fundamental shapes termed secondary structures. From there, the secondary structures fold further into the 3-dimensional shape of the protein - the tertiary structure. Further, multiple individual proteins may group together into a final functional unit - the quaternary structure.
While AlphaFold has made huge strides toward solving the problem of predicting 3D protein structure just from the primary sequence, it has benefited from decades of work in crystallography and structural biology that created a rich collection of X-ray, cryo-electron microscopy, and NMR structures, as well as fundamental research into how proteins evolve.
Dataset
This dataset provides a collection of protein sequences and their secondary structures observed in 3D crystal structures.
The original dataset was created in 2018 and consisted of 9078 sequences with lengths ranging from 20 to 1632 amino acids. In this update, I have used the latest data from the RCSB-PDB (as of 6 August 2022) and relaxed some of the criteria used for data culling. Specifically, the original dataset had a cutoff of 25% identity for any pair of sequences and a 2.0 Angstrom resolution of the crystal structure. In this update, the cutoffs listed in the table below were used, and every sequence provided is at least 40 amino acids long.
UPDATE (2022-08-06 dataset): I found that this dataset is not current as of 2022-08-06, but rather as of sometime in July 2020. The ss.txt file downloadable from PDB is dated July 2020 and is no longer updated with new information. I've left the file names as they are because they correspond to the culled file list, but the actual sequence and structure content is 2 years older.
Percent Identity Cutoff | Resolution Cutoff (Å) | Number of Sequences |
---|---|---|
25% | 2.0 | 7320 |
25% | 2.5 | 9646 |
30% | 2.5 | 13406 |
UPDATE: I developed new code to download all PDB files in the culled lists (15500+ structures, missing about 150 that could not be downloaded) using BioPython. I then generated all the SST3 and SST8 structural information using BioPython and DSSP. This added roughly 1,000 to 1,700 sequences to each file relative to the table above. All code will be updated on the pdb-secondary-structure-2022 GitHub repository. The PDB structures will not be included due to space limitations (10 GB uncompressed).
Percent Identity Cutoff | Resolution Cutoff (Å) | Number of Sequences |
---|---|---|
25% | 2.0 | 8313 |
25% | 2.5 | 10931 |
30% | 2.5 | 15080 |
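The download-and-annotate step described in the update above could look roughly like the sketch below. It is not the code from the pdb-secondary-structure-2022 repository. It assumes Biopython is installed, a DSSP executable (`mkdssp`) is on the PATH, and network access to the PDB is available. The function names `annotate_structure` and `normalize_dssp_codes` are hypothetical, and writing Biopython's `-` (no assignment) as `C` is an assumption made to match the category table in this description.

```python
from pathlib import Path


def normalize_dssp_codes(codes):
    """Join per-residue DSSP codes into one SST8 string.

    Assumption: residues without an assignment ('-' or blank) are
    written as 'C' to match the category table in this description.
    """
    return "".join("C" if code in ("-", " ") else code for code in codes)


def annotate_structure(pdb_id, out_dir="pdb_files"):
    """Download one PDB entry and return {chain_id: (sequence, sst8)}.

    Needs Biopython, network access, and the mkdssp executable.
    """
    # Imported here so the helper above can be used without Biopython.
    from Bio.PDB import PDBList, PDBParser
    from Bio.PDB.DSSP import DSSP

    Path(out_dir).mkdir(exist_ok=True)
    path = PDBList().retrieve_pdb_file(pdb_id, pdir=out_dir, file_format="pdb")
    structure = PDBParser(QUIET=True).get_structure(pdb_id, path)
    model = structure[0]  # use the first model only
    dssp = DSSP(model, path)  # runs the external DSSP program

    per_chain = {}
    for key in dssp.keys():  # key is (chain_id, residue_id)
        chain_id = key[0]
        _, amino_acid, ss_code = dssp[key][:3]
        sequence, codes = per_chain.setdefault(chain_id, ([], []))
        sequence.append(amino_acid)
        codes.append(ss_code)

    return {
        chain_id: ("".join(sequence), normalize_dssp_codes(codes))
        for chain_id, (sequence, codes) in per_chain.items()
    }
```

Calling `annotate_structure` once per entry in a culled list, and skipping entries that fail to download, would mirror the workflow described above.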
Files and Column Descriptions
2022-08-06-pdbintersect-pisces_pc25_r2.0.csv
2022-08-06-pdbintersect-pisces_pc25_r2.5.csv
2022-08-06-pdbintersect-pisces_pc30_r2.5.csv
2022-12-17-pdbintersect-pisces_pc25_r2.0.csv
2022-12-17-pdbintersect-pisces_pc25_r2.5.csv
2022-12-17-pdbintersect-pisces_pc30_r2.5.csv
These files contain the subset of sequences and secondary structures culled based upon specific percent identity and structure resolution cutoffs. PISCES lists were used to create the datasets provided here.
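The file names appear to encode the culling criteria: a list date, `pc` followed by the percent identity cutoff, and `r` followed by the resolution cutoff. That reading is inferred from the tables above, not stated by the author. Under that assumption, a minimal sketch for recovering the criteria from a file name follows; `CullingCriteria` and `parse_file_name` are hypothetical names.

```python
import re
from typing import NamedTuple


class CullingCriteria(NamedTuple):
    list_date: str             # date prefix of the file name
    max_percent_identity: int  # assumed meaning of the "pc" field
    max_resolution: float      # assumed meaning of the "r" field, in Angstrom


_NAME_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})-pdbintersect-pisces_"
    r"pc(?P<pc>\d+)_r(?P<res>\d+(?:\.\d+)?)\.csv$"
)


def parse_file_name(name: str) -> CullingCriteria:
    """Extract the date and cutoffs from one of the CSV file names."""
    match = _NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized file name: {name!r}")
    return CullingCriteria(match["date"], int(match["pc"]), float(match["res"]))


criteria = parse_file_name("2022-12-17-pdbintersect-pisces_pc30_r2.5.csv")
print(criteria)
```

This makes it easy to pick the file for a given cutoff pair without hard-coding the names.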
More on Secondary Structure
SST-8 and SST-3 classifications are provided for each protein sequence. SST-8 consists of 8 categories of secondary structure based upon geometric rules of classification. SST-3 gathers similar SST-8 categories into a simpler and more general set of structures.
SST-8 Category | Description | SST-3 Category |
---|---|---|
E | β-strand | E |
B | β-bridge | E |
H | α-helix | H |
G | 3₁₀-helix | H |
I | π-helix | H |
C | Loops and irregular elements | C |
T | Turn | C |
S | Bend | C |
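The table above translates directly into a per-residue lookup. The sketch below collapses an SST-8 string into its SST-3 equivalent; the names `SST8_TO_SST3` and `sst8_to_sst3` are illustrative and are not taken from the repository.

```python
# Mapping copied from the SST-8 / SST-3 table above.
SST8_TO_SST3 = {
    "E": "E", "B": "E",            # strand-like
    "H": "H", "G": "H", "I": "H",  # helix-like
    "C": "C", "T": "C", "S": "C",  # coil-like
}


def sst8_to_sst3(sst8: str) -> str:
    """Collapse an SST-8 string into SST-3, one character per residue."""
    try:
        return "".join(SST8_TO_SST3[code] for code in sst8)
    except KeyError as err:
        raise ValueError(f"unknown SST-8 code: {err.args[0]!r}") from None


print(sst8_to_sst3("CEEBHHGGITSC"))  # prints CEEEHHHHHCCC
```

Raising on unknown codes, instead of silently mapping them to `C`, makes it obvious when a file contains a symbol outside the eight categories listed here.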
Code
A fork (pdb-secondary-structure-2022) of the original GitHub repository was made. Updates are noted in the code and in README.md.
