Scientists Propose OntoProtein for Pre-training Protein Models
Self-supervised protein language models have proven effective at learning protein representations. With growing computing capacity, protein language models pre-trained on millions of sequences can scale from millions to billions of parameters and deliver substantial improvements. However, these prevailing approaches seldom incorporate knowledge graphs, which can supply rich, structured facts for better protein representations. Knowledge graphs that encode useful biological information may help represent proteins better when combined with data from other sources.
OntoProtein was proposed by researchers from Zhejiang University in China, led by Ningyu Zhang and Zhen Bi. It is the first general framework to use the structure of Gene Ontology (GO) in protein pre-training models. The team built a new large-scale knowledge graph from Gene Ontology and its linked proteins, in which every node is described by either gene annotation text or a protein sequence. They proposed a novel contrastive learning method with knowledge-aware negative sampling that jointly optimizes the knowledge graph and protein embeddings during pre-training. OntoProtein can outperform current methods based on pre-trained protein language models in predicting protein-protein interactions and protein function.
OntoProtein is the first general framework to incorporate external knowledge graphs into protein pre-training. It is a protein pre-training model with gene ontology embedding. The researchers proposed a hybrid encoder that represents both English text and protein sequences, together with contrastive learning using knowledge-aware negative sampling to jointly optimize the knowledge graph and protein sequence embeddings during pre-training. For knowledge embedding, they encoded the node descriptions (GO annotations) as the corresponding entity embeddings. Their use of Gene Ontology covers molecular function, cellular component, and biological process, and they designed a knowledge-aware negative sampling strategy for the knowledge embedding objective. Through the masked language modeling objective, OntoProtein inherits the strong protein understanding of protein language models. Through the knowledge embedding objective, it also integrates biological knowledge into its protein representations under supervision from the knowledge graph. This objective is independent of the downstream protein task: the model structure can be changed and new training objectives added to adapt it to different types of tasks.
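The idea of knowledge-aware negative sampling can be illustrated with a minimal sketch. It assumes one plausible reading of the strategy: when corrupting a triple, the replacement GO term is drawn from the same aspect (molecular function, cellular component, or biological process) as the true one, so that negatives are harder than uniformly random ones. The GO identifiers, the relation names, and the function `sample_negative_tails` are hypothetical and serve only as illustration; this is not the authors' implementation.

```python
import random

# Hypothetical toy ontology: GO term id -> aspect
# (MF = molecular function, CC = cellular component, BP = biological process).
GO_ASPECT = {
    "GO:A": "MF", "GO:B": "MF", "GO:C": "MF",
    "GO:D": "CC", "GO:E": "CC",
    "GO:F": "BP", "GO:G": "BP",
}

def sample_negative_tails(triple, known_triples, k, rng):
    """Return up to k corrupted tail entities for a (head, relation, tail) triple.

    Assumption: negatives come from the same GO aspect as the true tail,
    and any candidate that would form a known true triple is excluded.
    """
    head, relation, tail = triple
    aspect = GO_ASPECT[tail]
    candidates = [
        go for go, go_aspect in GO_ASPECT.items()
        if go_aspect == aspect
        and go != tail
        and (head, relation, go) not in known_triples
    ]
    return rng.sample(candidates, min(k, len(candidates)))
```

Calling `sample_negative_tails(("P1", "has_function", "GO:A"), known, 2, random.Random(0))` returns only molecular-function terms that are not already annotated to the hypothetical protein `P1`.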
To build OntoProtein, the researchers combined a masked protein modeling objective with a knowledge embedding objective. They generated a new knowledge graph dataset by combining Gene Ontology with publicly annotated proteins, used it to pre-train the model, and then evaluated the model on many downstream tasks. The Tasks Assessing Protein Embeddings (TAPE) benchmark was used to assess protein representation learning. TAPE covers three kinds of tasks: structure, evolutionary, and protein engineering. To evaluate OntoProtein, the researchers chose six representative datasets, including secondary structure (SS) and contact prediction. Protein-protein interactions (PPI) are high-specificity physical contacts formed between two or more protein molecules. PPI prediction is treated as a sequence classification task and is evaluated on three datasets of different sizes.
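How the two training objectives might fit together can be sketched as follows. The article does not give the exact loss, so this sketch makes several assumptions: the knowledge embedding term is a TransE-style distance with a margin ranking loss (a common stand-in, not confirmed as the authors' formulation), and the two terms are combined with a hypothetical weight `alpha`. The function names, the `<mask>` token, and the masking rate are likewise illustrative.

```python
import math
import random

MASK = "<mask>"  # hypothetical mask token

def mask_sequence(seq, mask_prob, rng):
    """Masked protein modeling input: hide residues at rate mask_prob.

    Returns (inputs, labels); labels hold the hidden residue at masked
    positions and None elsewhere (positions the loss would ignore).
    """
    inputs, labels = [], []
    for residue in seq:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(residue)
        else:
            inputs.append(residue)
            labels.append(None)
    return inputs, labels

def transe_distance(h, r, t):
    """TransE-style distance ||h + r - t||_2; smaller means more plausible."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

def ke_margin_loss(h, r, t, negative_tails, margin=1.0):
    """Margin ranking loss: the true tail should be closer than each
    corrupted tail by at least `margin` (assumed form of the KE term)."""
    d_pos = transe_distance(h, r, t)
    losses = [
        max(0.0, margin + d_pos - transe_distance(h, r, t_neg))
        for t_neg in negative_tails
    ]
    return sum(losses) / len(losses)

def joint_loss(mlm_loss, ke_loss, alpha=1.0):
    """Weighted sum of the two objectives; alpha is a hypothetical weight."""
    return mlm_loss + alpha * ke_loss
```

In a real model the embeddings `h`, `r`, and `t` would come from the protein and text encoders and the masked modeling loss from a cross-entropy over the hidden residues; plain Python lists are used here only to keep the sketch self-contained.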
OntoProtein performs well against other models on several tasks, though not on all of them. It surpasses TAPE Transformer and ProtBert in secondary structure and contact prediction, suggesting that it benefits from the biological knowledge graph used in pre-training. OntoProtein also demonstrated its ability to predict fluorescence. On the other hand, it does not do well in protein engineering, homology, and stability prediction, all of which are described as regression tasks. This is most likely due to the lack of sequence-level objectives in pre-training. The proposed method can be viewed as joint pre-training on human language and protein sequences (the language of life). The research aims to decipher the code of the language of life by enriching protein representations with gene ontology knowledge.
With OntoProtein (protein pretraining with gene ontology embedding), the researchers were the first to incorporate external factual knowledge from Gene Ontology into protein models through a general knowledge graph pre-training framework. Experimental results on widely used protein tasks show that effective knowledge injection helps in understanding and uncovering the language of life. Furthermore, OntoProtein is compatible with the parameters of many pre-trained protein language models, so users can reuse existing pre-trained parameters in OntoProtein without changing the architecture. The promising results point to future work on improving OntoProtein by injecting more informative knowledge through gene ontology selection and on extending the approach to sequence generation tasks for protein design.