Scientists Propose OntoProtein For Protein Pre-Training

Self-supervised protein language models have been successful at learning protein representations. With growing computing capacity, existing protein language models pre-trained on millions of sequences can scale from millions to billions of parameters and deliver striking improvements. However, these prevalent techniques rarely explore incorporating knowledge graphs, which can provide rich, structured facts for better protein representations. Knowledge graphs that encode useful biological knowledge may help models represent proteins better when combined with other data sources.

OntoProtein was proposed by researchers from Zhejiang University in China, led by Ningyu Zhang and Zhen Bi. It is the first general framework to bring structured knowledge from Gene Ontology into protein pre-training models. The team built a new large-scale knowledge graph from Gene Ontology and its linked proteins, in which every node is described by gene annotation text or a protein sequence. They also proposed a novel contrastive learning method with knowledge-aware negative sampling that jointly optimizes the knowledge graph embedding and the protein embedding during pre-training. OntoProtein outperforms existing methods built on pre-trained protein language models on tasks such as predicting how proteins interact with each other and how they function.
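The contrastive objective can be pictured as scoring (protein, relation, GO term) triples and pushing true triples above sampled negatives. Below is a minimal PyTorch sketch of that idea, assuming a TransE-style distance score and an InfoNCE-style loss; the function names, tensor shapes, and scoring choice are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def triple_score(head, relation, tail):
    # TransE-style plausibility: a smaller ||h + r - t|| means a more
    # plausible triple, so negate the distance to get a score.
    return -torch.norm(head + relation - tail, p=2, dim=-1)

def contrastive_ke_loss(protein_emb, relation_emb, go_emb, neg_go_emb):
    """protein_emb: (B, d)    protein encoder output (triple head)
    relation_emb:   (B, d)    embedding of the GO relation
    go_emb:         (B, d)    embedding of the true GO term (tail)
    neg_go_emb:     (B, K, d) knowledge-aware negatives, e.g. GO terms
                    sampled from the same ontology branch as the positive."""
    pos = triple_score(protein_emb, relation_emb, go_emb)        # (B,)
    neg = triple_score(protein_emb.unsqueeze(1),                 # (B, 1, d)
                       relation_emb.unsqueeze(1),                # broadcast over K
                       neg_go_emb)                               # (B, K)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)           # (B, 1+K)
    # The true GO term sits at index 0 of each row of logits.
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)

# Toy usage with random embeddings (batch of 4, dim 128, 8 negatives).
B, K, d = 4, 8, 128
loss = contrastive_ke_loss(torch.randn(B, d), torch.randn(B, d),
                           torch.randn(B, d), torch.randn(B, K, d))
```

Sampling negatives from the same ontology branch (rather than uniformly) makes them harder to distinguish from the positive, which is what "knowledge-aware" refers to here.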

OntoProtein

OntoProtein is the first broad framework to incorporate external knowledge graphs into protein pre-training: a protein pre-training model with Gene Ontology embedding. The researchers proposed a hybrid encoder that represents both English text and protein sequences, together with contrastive learning using knowledge-aware negative sampling to jointly optimize the knowledge graph embedding and the protein sequence embedding during pre-training. For knowledge embedding, they encoded the node descriptions (GO annotations) as the corresponding entity embeddings. Their use of Gene Ontology covers molecular function, cellular component, and biological process, and they developed a knowledge-aware negative sampling strategy for the knowledge embedding objective. Through masked language modeling, OntoProtein inherits the strong protein-understanding capacity of protein language models; through supervision from the knowledge graph via the knowledge embedding objective, it can also integrate biological knowledge into its protein representations. This objective is agnostic to the downstream protein task: the model architecture can be modified and new training objectives added to adapt it to different kinds of tasks.
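To make the hybrid-encoder idea concrete, here is a small sketch that embeds protein sequences with a protein language model and GO annotation text with a text encoder. The checkpoint names (the public ProtBert and BERT models) and the pooling choices are assumptions for illustration, not the paper's exact setup.

```python
import torch
from transformers import AutoTokenizer, AutoModel

protein_tok = AutoTokenizer.from_pretrained("Rostlab/prot_bert")
protein_enc = AutoModel.from_pretrained("Rostlab/prot_bert")
text_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
text_enc = AutoModel.from_pretrained("bert-base-uncased")

def embed_protein(sequence: str) -> torch.Tensor:
    # ProtBert expects amino acids separated by spaces.
    inputs = protein_tok(" ".join(sequence), return_tensors="pt")
    with torch.no_grad():
        out = protein_enc(**inputs).last_hidden_state
    return out.mean(dim=1).squeeze(0)   # mean-pool residues to one vector

def embed_annotation(go_text: str) -> torch.Tensor:
    inputs = text_tok(go_text, return_tensors="pt")
    with torch.no_grad():
        out = text_enc(**inputs).last_hidden_state
    return out[:, 0].squeeze(0)         # [CLS] token as the sentence vector

protein_vec = embed_protein("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
go_vec = embed_annotation("ATP binding: interacting selectively with ATP")
```

Both vectors can then feed the knowledge embedding objective above, so protein nodes and GO-term nodes live in a shared space.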

Illustration of a protein molecule

Masked Protein Modeling And Knowledge Embedding

To build OntoProtein, the researchers combined a masked protein modeling objective with a knowledge embedding objective. They generated a novel knowledge graph dataset by combining Gene Ontology with publicly annotated proteins, used it to train the model, and then evaluated the model on many downstream tasks. The Tasks Assessing Protein Embeddings (TAPE) benchmark was employed to assess protein representation learning; TAPE covers three types of tasks: structure prediction, evolutionary understanding, and protein engineering. To analyze OntoProtein, the researchers chose six representative datasets, including secondary structure (SS) and contact prediction. Protein-protein interactions (PPI) are high-specificity physical contacts formed between two or more protein molecules; PPI prediction is treated as a sequence classification task and is evaluated on three datasets of different sizes. A sketch of the masked protein modeling step follows below.
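For the masked protein modeling side, the idea is BERT-style masking over amino-acid tokens. The sketch below illustrates one such step, assuming the public ProtBert checkpoint and a 15% masking rate; both are assumptions for illustration rather than the paper's exact configuration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# ProtBert stands in for the protein language model here.
tok = AutoTokenizer.from_pretrained("Rostlab/prot_bert")
mlm = AutoModelForMaskedLM.from_pretrained("Rostlab/prot_bert")

# ProtBert expects residues separated by spaces.
enc = tok("M K T A Y I A K Q R Q I S F V K S H F S R Q", return_tensors="pt")
labels = enc.input_ids.clone()

# Mask ~15% of residue positions (standard BERT-style rate), skipping
# special tokens such as [CLS] and [SEP]. With a toy sequence this
# random draw can occasionally mask nothing; real batches are large.
special = torch.tensor(tok.get_special_tokens_mask(
    enc.input_ids[0].tolist(), already_has_special_tokens=True)).bool()
mask = (torch.rand(labels.shape) < 0.15) & ~special.unsqueeze(0)
enc.input_ids[mask] = tok.mask_token_id
labels[~mask] = -100  # compute the loss only on masked positions

loss = mlm(**enc, labels=labels).loss  # cross-entropy over masked residues
```

During pre-training, this masked-residue loss is combined with the knowledge embedding loss, so one model learns from sequences and from the Gene Ontology graph at the same time.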

OntoProtein outperforms strong baselines on most tasks. It surpasses the TAPE Transformer and ProtBert on secondary structure and contact prediction, demonstrating that it benefits from the biological knowledge graph during pre-training, and it also shows strong performance on fluorescence prediction. On the other hand, OntoProtein does not do as well on protein engineering, homology, and stability prediction, which are regression tasks; this is most likely due to the absence of sequence-level goals in the pre-training objective. The proposed method can be viewed as joint pre-training over human language and protein sequences (the language of life), and the research aims to help decode that language by enriching protein representations with gene-level knowledge.

Conclusion

The researchers are the first to incorporate external factual knowledge from Gene Ontology into protein models. OntoProtein (protein pre-training with gene ontology embedding) is the first broad framework to integrate external knowledge graphs into protein pre-training. Experimental findings on widely used protein tasks show that effective knowledge injection helps to understand and decode the language of life. Furthermore, OntoProtein is compatible with the model parameters of many pre-trained protein language models, meaning users can load existing pre-trained parameters into OntoProtein without changing the architecture. These promising findings point to future work on improving OntoProtein by injecting more useful knowledge through Gene Ontology selection and by extending the technique to sequence generation challenges such as protein design.
