PSSNet—An Accurate Super-Secondary Structure for Protein Segmentation
Abstract
1. Introduction
- Models based on the sequence-to-sequence architecture, in which the protein structure is treated as a sequence of amino acids annotated with the main characteristics of their spatial arrangement (i.e., a contact map). The resulting feature sequences are processed by a stack of recurrent layers [24].
- Models based on 3D point-cloud segmentation, in which each atom of a molecule is represented as a point in 3D space. PointNet, PointNet++, and dynamic graph CNN (DGCNN) architectures [25] are used to segment and classify structures.
- Models based on the representation of a protein molecule as a 3D volumetric object (protein voxelization) with subsequent processing by 3D convolutional networks [26].
- Models based on the representation of a protein molecule as a graph with subsequent processing by graph neural networks (GNNs) [27].
2. Results and Discussion
2.1. SSS Segmentation Using the PSSNet Model
2.2. Practical Evaluation of the Model: Key Issues
- (a) Capture of excess structure sections;
- (b) Breakages in structure element links;
- (c) Incorrect definition of corners (typical for the αα-corner motif (70°–90°)).
- Capturing extra sections is characteristic of SSS with α-helix elements and usually manifests when the distance between the last Cα atom of the first structure and the first Cα atom of the second structure is 9.7 Å. In such cases, the feature map generated at the output of the GRU layer captures both structures (Figure 1e). This issue can be addressed either by reducing the number of neighbors in the kNN graph during feature generation or by assembling a sufficiently large sample of such structures and retraining the model. The identification and extraction of such elements from the PSSKB database are currently ongoing.
- Breakages in structures generally occurred in low-resolution (>4.0 Å) PDB files, and the proportion of such files was small relative to the total size of the consolidated databank.
- The network identified curved helices with a large angle of inflection as two elements, which led to incorrectly defined angles for αα-corner structures. Rigorous analysis revealed that the issue can be effectively resolved only by retraining the model on a substantially larger representative sample that covers all such elements; the sample collection and retraining are currently in progress.
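The effect of the neighbor count on excess capture can be illustrated with a minimal sketch. It assumes a k-nearest-neighbor graph built over Cα coordinates with Euclidean distances; the function names (`knn_edges`, `count_cross_edges`) and the toy coordinates (two straight segments 9.7 Å apart with 3.8 Å spacing) are hypothetical and are not taken from the PSSNet code.

```python
import numpy as np


def knn_edges(coords, k):
    """Directed kNN edges over residue coordinates.

    Returns a (2, N*k) integer array [sources, targets]: each target residue
    receives edges from its k nearest neighbours (Euclidean distance).
    """
    coords = np.asarray(coords, dtype=float)
    n = coords.shape[0]
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of residues")
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a residue is never its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]  # shape (N, k)
    targets = np.repeat(np.arange(n), k)
    sources = neighbours.reshape(-1)
    return np.stack([sources, targets])


def count_cross_edges(edges, segment_id):
    """Number of edges whose endpoints belong to different segments."""
    src, dst = edges
    return int(np.sum(segment_id[src] != segment_id[dst]))


# Toy layout (assumption): two parallel straight segments, 9.7 Å apart.
x = np.arange(6) * 3.8
segment_a = np.stack([x, np.zeros(6), np.zeros(6)], axis=1)
segment_b = np.stack([x, np.full(6, 9.7), np.zeros(6)], axis=1)
coords = np.concatenate([segment_a, segment_b])
segment_id = np.repeat([0, 1], 6)

for k in (2, 8):
    edges = knn_edges(coords, k)
    print(f"k={k}: {count_cross_edges(edges, segment_id)} cross-segment edges")
```

In this toy layout, a small k keeps every neighborhood inside one segment, whereas a larger k links the two segments, which is one plausible route by which features of adjacent elements become mixed.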
3. Materials and Methods
3.1. Data Preparation
3.2. Feature Extraction and Input Encoding
3.2.1. Node-Level Features
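- {sin, cos} ◦ {φ, ψ, ω}, where φ, ψ, and ω are the torsion angles calculated for $C_{i-1}$, $N_i$, $C\alpha_i$, $C_i$, and $N_{i+1}$;
- Unit vectors of the directions to the neighboring Cα atoms in the main chain (the normalized differences $C\alpha_{i+1} - C\alpha_i$ and $C\alpha_{i-1} - C\alpha_i$);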
- Unit direction vectors to the C atoms in the main chain (two vectors, each a normalized coordinate difference);
- Cosines of the angles between the direction vectors;
- The distance between the C atoms in the chain;
- A unit vector that determines the conditional direction of the side chain (direction of the Cβ atom, $C\beta_i - C\alpha_i$). This vector is calculated from the tetrahedral representation of the geometry of the N, Cα, and C atoms.
- The amino acid sequence is encoded as a sequence of numbers (0–21).
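Two of the node-level features listed above can be sketched as follows: the {sin, cos} encoding of the backbone torsion angles and the imputed Cβ direction. This is a minimal illustration under stated assumptions, not the authors' implementation. The torsions use the standard atom quadruples (ω additionally requires $C\alpha_{i+1}$), undefined terminal angles are set to 0, and the Cβ direction assumes ideal tetrahedral geometry for an L-amino acid, in the spirit of the geometric vector perceptron approach [41]. All function names are hypothetical.

```python
import numpy as np


def _unit(v):
    """Normalize vectors along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (radians) defined by four points."""
    b0 = p0 - p1
    b1 = _unit(p2 - p1)
    b2 = p3 - p2
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    return np.arctan2(np.dot(np.cross(b1, v), w), np.dot(v, w))


def torsion_features(n, ca, c):
    """(L, 6) array [sin φ, sin ψ, sin ω, cos φ, cos ψ, cos ω].

    n, ca, c are (L, 3) arrays of backbone N, Cα and C coordinates.
    Angles that are undefined at the chain termini are set to 0 (assumption).
    """
    length = len(ca)
    angles = np.zeros((length, 3))
    for i in range(length):
        if i > 0:
            angles[i, 0] = dihedral(c[i - 1], n[i], ca[i], c[i])       # φ
        if i < length - 1:
            angles[i, 1] = dihedral(n[i], ca[i], c[i], n[i + 1])       # ψ
            angles[i, 2] = dihedral(ca[i], c[i], n[i + 1], ca[i + 1])  # ω
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)


def imputed_cb_direction(n, ca, c):
    """Unit vector from Cα towards an idealized Cβ (tetrahedral geometry)."""
    n_dir = _unit(n - ca)
    c_dir = _unit(c - ca)
    bisector = _unit(n_dir + c_dir)
    perp = _unit(np.cross(n_dir, c_dir))
    # Cβ lies opposite the N/C bisector and out of the N-Cα-C plane.
    return np.sqrt(2.0 / 3.0) * perp - np.sqrt(1.0 / 3.0) * bisector
```

The sin/cos encoding avoids the discontinuity at ±180° that raw angles would introduce, and the Cβ direction depends only on backbone atoms, so it is available even when side-chain coordinates are missing.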
3.2.2. Edge-Level Features
- a unit vector defining the direction between neighboring vertices;
- the distance between the vertices of the graph, encoded using Gaussian radial basis functions.
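The two edge-level features can be sketched as below. The number of basis functions (16), the distance range (0 to 20 Å), and the width rule are illustrative assumptions in the style of geometric vector perceptron implementations [41]; the values used in PSSNet are not stated in this text. The function names are hypothetical.

```python
import numpy as np


def rbf_encode(dist, d_min=0.0, d_max=20.0, n_centers=16):
    """Encode distances with Gaussian radial basis functions.

    Centers are evenly spaced on [d_min, d_max]; the width equals the range
    divided by the number of centers (assumed values, see lead-in).
    Returns an array of shape dist.shape + (n_centers,).
    """
    centers = np.linspace(d_min, d_max, n_centers)
    sigma = (d_max - d_min) / n_centers
    d = np.asarray(dist, dtype=float)[..., None]
    return np.exp(-(((d - centers) / sigma) ** 2))


def edge_features(coords, edges, **rbf_kwargs):
    """Unit direction vectors and RBF-encoded lengths for directed edges.

    coords: (N, 3) vertex coordinates; edges: (2, E) array [sources, targets].
    """
    coords = np.asarray(coords, dtype=float)
    src, dst = edges
    delta = coords[src] - coords[dst]
    dist = np.linalg.norm(delta, axis=-1)
    unit = delta / np.maximum(dist, 1e-8)[:, None]  # guard against zero length
    return unit, rbf_encode(dist, **rbf_kwargs)


# Usage on a single edge of length 5 Å.
coords = np.array([[0.0, 0.0, 0.0], [3.0, 4.0, 0.0]])
edges = np.array([[1], [0]])
unit, rbf = edge_features(coords, edges)
print(unit, rbf.shape)
```

Expanding a scalar distance into several smooth, overlapping channels gives the network a representation in which nearby distances produce similar inputs, which is usually easier to learn from than a single raw value.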
3.3. Network Architecture
3.4. Training and Performance Evaluation
4. Conclusions
- small datasets suffice for rapid, efficient learning and retraining;
- ability to generalize features from a relatively small training set;
- good recognition accuracy (mean IoU > 0.83);
- large volumes of data (such as those in the PDB and AlphaFold) can be processed within a reasonable timeframe.
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Wetlaufer, D.B. Nucleation, Rapid Folding, and Globular Intrachain Regions in Proteins. Proc. Natl. Acad. Sci. USA 1973, 70, 697–701.
- Karplus, M.; Weaver, D.L. Protein-Folding Dynamics. Nature 1976, 260, 404–406.
- Anfinsen, C.B. Principles That Govern the Folding of Protein Chains. Science 1973, 181, 223–230.
- Hartl, F.U. Molecular Chaperones in Cellular Protein Folding. Nature 1996, 381, 571–580.
- Dobson, C.M. Protein Folding and Misfolding. Nature 2003, 426, 884–890.
- Abkevich, V.I.; Gutin, A.M.; Shakhnovich, E.I. Specific Nucleus as the Transition State for Protein Folding: Evidence from the Lattice Model. Biochemistry 1994, 33, 10026–10036.
- Fersht, A.R. Nucleation Mechanisms in Protein Folding. Curr. Opin. Struct. Biol. 1997, 7, 3–9.
- MacCarthy, E.; Perry, D.; KC, D.B. Advances in Protein Super-Secondary Structure Prediction and Application to Protein Structure Prediction. In Protein Supersecondary Structures: Methods and Protocols; Kister, A.E., Ed.; Methods in Molecular Biology; Springer: New York, NY, USA, 2019; pp. 15–45.
- Rudnev, V.R.; Kulikova, L.I.; Nikolsky, K.S.; Malsagova, K.A.; Kopylov, A.T.; Kaysheva, A.L. Current Approaches in Supersecondary Structures Investigation. Int. J. Mol. Sci. 2021, 22, 11879.
- Robinson, J.A. The Design, Synthesis and Conformation of Some New β-Hairpin Mimetics: Novel Reagents for Drug and Vaccine Discovery. Synlett 2000, 2000, 429–441.
- Robinson, J.A. β-Hairpin Peptidomimetics: Design, Structures and Biological Activities. Acc. Chem. Res. 2008, 41, 1278–1288.
- Tikhonov, D.; Kulikova, L.; Kopylov, A.T.; Rudnev, V.; Stepanov, A.; Malsagova, K.; Izotov, A.; Kulikov, D.; Zulkarnaev, A.; Enikeev, D.; et al. Proteomic and Molecular Dynamic Investigations of PTM-Induced Structural Fluctuations in Breast and Ovarian Cancer. Sci. Rep. 2021, 11, 19318.
- Brownlee, J. A Gentle Introduction to Probability Density Estimation. Machine Learning Mastery. Available online: https://machinelearningmastery.com/probability-density-estimation/ (accessed on 14 February 2022).
- Niranjan Pramanik, N.P. Kernel Density Estimation—Kernel Construction and Bandwidth Optimization using Maximum Likelihood Cross Validation. Analytics Vidhya. Available online: https://medium.com/analytics-vidhya/kernel-density-estimation-kernel-construction-and-bandwidth-optimization-using-maximum-b1dfce127073 (accessed on 14 February 2022).
- Schmidler, S.C.; Liu, J.S.; Brutlag, D.L. Bayesian Segmentation of Protein Secondary Structure. J. Comput. Biol. 2000, 7, 233–248.
- Sun, L.; Hu, X.; Li, S.; Jiang, Z.; Li, K. Prediction of Complex Super-Secondary Structure βαβ Motifs Based on Combined Features. Saudi J. Biol. Sci. 2016, 23, 66–71.
- Kumar, M.; Bhasin, M.; Natt, N.K.; Raghava, G.P.S. BhairPred: Prediction of Beta-Hairpins in a Protein from Multiple Alignment Information Using ANN and SVM Techniques. Nucleic Acids Res. 2005, 33, W154–W159.
- Xia, X.; Longo, L.M.; Sutherland, M.A.; Blaber, M. Evolution of a Protein Folding Nucleus. Protein Sci. 2016, 25, 1227–1240.
- AlQuraishi, M. Machine Learning in Protein Structure Prediction. Curr. Opin. Chem. Biol. 2021, 65, 1–8.
- Melvin, I.; Ie, E.; Kuang, R.; Weston, J.; Noble, W.S.; Leslie, C. SVM-Fold: A Tool for Discriminative Multi-Class Protein Fold and Superfamily Recognition. BMC Bioinform. 2007, 8, S2.
- Flot, M.; Mishra, A.; Kuchi, A.S.; Hoque, M.T. StackSSSPred: A Stacking-Based Prediction of Supersecondary Structure from Sequence. In Protein Supersecondary Structures: Methods and Protocols; Kister, A.E., Ed.; Methods in Molecular Biology; Springer: New York, NY, USA, 2019; pp. 101–122.
- Kuhn, M.; Meiler, J.; Baker, D. Strand-Loop-Strand Motifs: Prediction of Hairpins and Diverging Turns in Proteins. Proteins 2004, 54, 282–288.
- Cruz, L.; Rao, J.S.; Teplow, D.B.; Urbanc, B. Dynamics of Metastable β-Hairpin Structures in the Folding Nucleus of Amyloid β-Protein. J. Phys. Chem. B 2012, 116, 6311–6325.
- Li, Z.; Yu, Y. Protein Secondary Structure Prediction Using Cascaded Convolutional and Recurrent Neural Networks. arXiv 2016, arXiv:1604.07176[cs, q-bio].
- Kalimeris, A.G.; Emiris, I. Deep Learning on Point Clouds for 3D Protein Classification Based on Secondary Structure. Available online: https://pergamos.lib.uoa.gr/uoa/dl/object/2880834/file.pdf (accessed on 14 February 2022).
- Stepniewska-Dziubinska, M.; Zielenkiewicz, P.; Siedlecki, P. Detection of Protein-Ligand Binding Sites with 3D Segmentation. 2019. Available online: https://www.researchgate.net/publication/332438981_Detection_of_protein-ligand_binding_sites_with_3D_segmentation (accessed on 14 February 2022).
- Gligorijević, V.; Renfrew, P.D.; Kosciolek, T.; Leman, J.K.; Berenberg, D.; Vatanen, T.; Chandler, C.; Taylor, B.C.; Fisk, I.M.; Vlamakis, H.; et al. Structure-Based Protein Function Prediction Using Graph Convolutional Networks. Nat. Commun. 2021, 12, 3168.
- Xiang, T.; Zhang, C.; Song, Y.; Yu, J.; Cai, W. Walk in the Cloud: Learning Curves for Point Clouds Shape Analysis. arXiv 2021, arXiv:2105.01288[cs].
- Papers with Code—ModelNet40 Benchmark (3D Point Cloud Classification). Available online: https://paperswithcode.com/sota/3d-point-cloud-classification-on-modelnet40 (accessed on 14 February 2022).
- Sborgi, L.; Verma, A.; Sadqi, M.; de Alba, E.; Muñoz, V. Protein Folding at Atomic Resolution: Analysis of Autonomously Folding Supersecondary Structure Motifs by Nuclear Magnetic Resonance. In Protein Supersecondary Structures; Kister, A.E., Ed.; Methods in Molecular Biology; Humana Press: Totowa, NJ, USA, 2013; pp. 205–218.
- Kubelka, J.; Hofrichter, J.; Eaton, W.A. The Protein Folding “Speed Limit”. Curr. Opin. Struct. Biol. 2004, 14, 76–88.
- Muñoz, V. Conformational Dynamics and Ensembles in Protein Folding. Annu. Rev. Biophys. Biomol. Struct. 2007, 36, 395–412.
- Shafi, S.; Singh, A.; Gupta, P.; Chawla, P.A.; Fayaz, F.; Sharma, A.; Pottoo, F.H. Deciphering the Role of Aberrant Protein Post-Translational Modification in the Pathology of Neurodegeneration. CNS Neurol. Disord. Drug Targets 2021, 20, 54–67.
- Venables, J.P. Aberrant and Alternative Splicing in Cancer. Cancer Res. 2004, 64, 7647–7654.
- Indeykina, M.I.; Popov, I.A.; Kozin, S.A.; Kononikhin, A.S.; Kharybin, O.N.; Tsvetkov, P.O.; Makarov, A.A.; Nikolaev, E.N. Capabilities of MS for Analytical Quantitative Determination of the Ratio of α- and βAsp7 Isoforms of the Amyloid-β Peptide in Binary Mixtures. Anal. Chem. 2011, 83, 3205–3210.
- Tilli, T.M.; Mello, K.D.; Ferreira, L.B.; Matos, A.R.; Accioly, M.T.S.; Faria, P.A.S.; Bellahcène, A.; Castronovo, V.; Gimba, E.R. Both Osteopontin-c and Osteopontin-b Splicing Isoforms Exert pro-tumorigenic Roles in Prostate Cancer Cells. Prostate 2012, 72, 1688–1699.
- Su, Z.-D.; Sun, L.; Yu, D.-X.; Li, R.-X.; Li, H.-X.; Yu, Z.-J.; Sheng, Q.-H.; Lin, X.; Zeng, R.; Wu, J.-R. Quantitative Detection of Single Amino Acid Polymorphisms by Targeted Proteomics. J. Mol. Cell Biol. 2011, 3, 309–315.
- Petrovskiy, D. Supersecondary_Structures_Dataset.zip. figshare. Dataset. 2022. Available online: https://figshare.com/articles/dataset/supersecondary_structures_dataset_zip/21529812/1 (accessed on 18 November 2022).
- Dufter, P.; Schmitt, M.; Schütze, H. Position Information in Transformers: An Overview. arXiv 2021, arXiv:2102.11090[cs].
- Mayachita, I. Understanding Graph Convolutional Networks for Node Classification. Towards Data Science. Available online: https://towardsdatascience.com/understanding-graph-convolutional-networks-for-node-classification-a2bfdb7aba7b (accessed on 14 February 2022).
- Jing, B.; Eismann, S.; Suriana, P.; Townshend, R.J.L.; Dror, R. Learning from Protein Structure with Geometric Vector Perceptrons. 2020. Available online: https://arxiv.org/abs/2009.01411 (accessed on 18 November 2022).
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762[cs].
- Zhang, H.; Li, M.; Wang, M.; Zhang, Z. Understand Graph Attention Network—DGL 0.6.1 Documentation. Available online: https://docs.dgl.ai/en/0.6.x/tutorials/models/1_gnn/9_gat.html (accessed on 14 February 2022).
- Dauparas, J.; Anishchenko, I.; Bennett, N.; Bai, H.; Ragotte, R.J.; Milles, L.F.; Wicky, B.I.M.; Courbet, A.; de Haas, R.J.; Bethel, N.; et al. Robust deep learning-based protein sequence design using ProteinMPNN. Science 2022, 378, 49–56.
SSS | PDB (185,469 Structures) | AlphaFold (2021) (360,000 Structures) |
---|---|---|
βαβ-unit | 461,336 | 233,882 |
α-hairpin | 390,965 | 563,946 |
β-hairpin | 360,845 | 280,181 |
αα-corner | 5977 | 8153 |
SSS | PSSNet (Train) | PSSNet (Val) | CurveNet (Train) | CurveNet (Val) | DGCNN (Train) | DGCNN (Val) |
---|---|---|---|---|---|---|
βαβ-unit | 0.928 | 0.894 | 0.742 | 0.697 | 0.691 | 0.656 |
α-hairpin | 0.964 | 0.957 | 0.814 | 0.795 | 0.731 | 0.688 |
β-hairpin | 0.998 | 0.983 | 0.845 | 0.833 | 0.749 | 0.711 |
αα-corner | 0.933 | 0.991 | 0.781 | 0.732 | 0.621 | 0.571 |
Block | Layer | Description |
---|---|---|
Encoder | Embedding | Represents the residues ("words") of the amino acid sequence as dense vectors. |
Encoder | GVP | Module for learning vector- and scalar-valued functions over geometric vectors and scalars. |
Encoder | Norm | Layer normalization for vector features (L2-normalization). |
Encoder | GVPConv (5 layers) | Implements GVP transforms and uses message passing from neighboring nodes and edges to aggregate a function of hidden states and update the node embeddings at each graph propagation step. |
Encoder | Bi-GRU (2-layer module) | Gated recurrent unit with update and reset gates. The Bi-GRU processes the sequence in both directions (left to right and right to left). The input is the sequence of hidden states of the graph node features. |
Encoder | Self-attention | This mechanism discovers connections between elements of the input sequence and selects those required for producing the output [42]. The input is the sequence of hidden states of the graph node features. |
Decoder | GVPConv + Bi-GRU (5 layers) | Decoder block that reconstructs the graph structure from the encoder’s hidden state. |
Decoder | GVP | Last GVP module. |
Decoder | Multi-head graph attention | This module has a single scalar sigmoid output to predict node labels [43]. |
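The self-attention step in the table can be illustrated with a minimal NumPy sketch of single-head scaled dot-product attention [42] applied to a sequence of per-residue hidden states. The dimensions, random projection matrices, and function names are assumptions for illustration only; this is not the PSSNet layer itself.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(h, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    h: (L, d_model) hidden states, one row per residue (graph node).
    w_q, w_k, w_v: (d_model, d_head) projection matrices.
    Returns the attended features (L, d_head) and the weights (L, L).
    """
    q, k, v = h @ w_q, h @ w_k, h @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)  # row i: how residue i attends to all residues
    return weights @ v, weights


# Usage with random stand-in values (assumed sizes: 7 residues, 8 features).
rng = np.random.default_rng(0)
h = rng.normal(size=(7, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))
out, weights = self_attention(h, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Each row of `weights` sums to 1, so every residue's output is a weighted average over all residues in the sequence, which lets distant positions influence one another directly.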
SSS | Mean IoU (Training) | Mean IoU (Validation) | F1 |
---|---|---|---|
βαβ-unit | 0.928 | 0.894 | 0.978 |
α-hairpin | 0.964 | 0.957 | 0.984 |
β-hairpin | 0.998 | 0.983 | 0.991 |
αα-corner | 0.933 | 0.991 | 0.994 |
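For reference, the two metrics in the table can be computed from per-residue binary masks as sketched below. The sketch assumes that each structure is labeled residue by residue (1 = residue belongs to the SSS element) and that the mean IoU is a simple average over structures; the averaging scheme actually used for the reported values is not specified in this text, and the function names are hypothetical.

```python
import numpy as np


def iou_and_f1(pred, target):
    """Intersection over union and F1 score for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = int(np.sum(pred & target))   # residues correctly labelled as SSS
    fp = int(np.sum(pred & ~target))  # residues captured in excess
    fn = int(np.sum(~pred & target))  # residues missed
    union = tp + fp + fn
    # Two empty masks agree perfectly by convention (assumption).
    iou = tp / union if union else 1.0
    f1 = 2 * tp / (2 * tp + fp + fn) if union else 1.0
    return iou, f1


def mean_iou(preds, targets):
    """Average IoU over a collection of structures."""
    return float(np.mean([iou_and_f1(p, t)[0] for p, t in zip(preds, targets)]))


# Usage: the prediction is shifted by one residue relative to the annotation.
target = [0, 1, 1, 1, 1, 0, 0, 0]
pred = [0, 0, 1, 1, 1, 1, 0, 0]
print(iou_and_f1(pred, target))  # (0.6, 0.75)
```

In the usage example, three residues overlap, one is captured in excess, and one is missed, giving IoU = 3/5 and F1 = 6/8.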
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Petrovsky, D.V.; Rudnev, V.R.; Nikolsky, K.S.; Kulikova, L.I.; Malsagova, K.M.; Kopylov, A.T.; Kaysheva, A.L. PSSNet—An Accurate Super-Secondary Structure for Protein Segmentation. Int. J. Mol. Sci. 2022, 23, 14813. https://doi.org/10.3390/ijms232314813