Challenge by Retro Biosciences
Extract knowledge from all publicly available sources regarding protein sequence-to-function relationships to empower future protein and gene reengineering efforts against aging.
Given a human protein X, how do we extract knowledge from all publicly available sources about its sequence-to-function relationship to empower future protein and gene reengineering efforts?
Protein reengineering efforts are often bottlenecked by a lack of sequence-to-function data to inform the first rounds of design. This challenge aims to create a comprehensive knowledge base of known protein modifications linked to functional outcomes in experiments.
The mission is to speed up research on protein engineering, especially in the context of aging. The aggregated data will help researchers identify the most promising approaches to modifying wild-type protein sequences.
Essentially, the agent is expected to reproduce a GenAge-style database, but by writing full articles about the protein/gene sequence-to-function relationships related to longevity.
For starters, you can use WikiCrow by FutureHouse as a reference format (Wikipedia-style articles about genes, e.g. APOE).
This is the key requirement! The system must establish clear relationships between protein/gene sequences and their functional outcomes related to longevity.
Write articles about protein/gene sequence-to-function relationships related to longevity. Include information about:
• Small molecule binding data — integrate binding information for additional context
• Tunable coarse-graining — from individual nucleotides/amino acids to larger domains or even families of domains
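One way to read "tunable coarse-graining" is that residue-level annotations can be rolled up to domain level on demand. The sketch below illustrates that idea only; the function name `coarse_grain`, the `(name, start, end)` domain tuples, and the 1-based inclusive coordinates are assumptions made for illustration, and the example values are placeholders rather than real annotations.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical domain representation: (name, start, end), 1-based and inclusive.
Domain = Tuple[str, int, int]


def coarse_grain(
    residue_notes: Dict[int, List[str]],
    domains: List[Domain],
) -> Dict[str, List[str]]:
    """Group per-residue notes by the first domain that contains each position.

    Positions outside every listed domain are collected under "unassigned".
    """
    grouped: Dict[str, List[str]] = defaultdict(list)
    for position in sorted(residue_notes):
        name = next(
            (n for n, start, end in domains if start <= position <= end),
            "unassigned",
        )
        grouped[name].extend(residue_notes[position])
    return dict(grouped)


# Placeholder data, not taken from any real protein.
example_notes = {3: ["note a"], 5: ["note b"], 12: ["note c"], 40: ["note d"]}
example_domains = [("domain_1", 1, 10), ("domain_2", 11, 20)]
example_grouped = coarse_grain(example_notes, example_domains)
```

The same grouping could be applied one level up, with families of domains in place of single domains, by passing wider intervals.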
Can your approach be applied to any human gene?
Can your approach recover at least 5 distinct sources of modifications for each gene?
Is your source of protein sequence modification data relevant to aging? Is there an association with lifespan?
Bonus points if the agent extracts original figures with key data from the source studies and cites them in the article.
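The "at least 5 sources per gene" criterion above can be checked mechanically once an agent emits its findings. A minimal sketch, assuming each finding is reduced to a `(gene, source)` pair; the function name and the placeholder gene and source labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def genes_below_source_threshold(
    findings: Iterable[Tuple[str, str]],
    minimum: int = 5,
) -> Dict[str, int]:
    """Return genes with fewer than `minimum` distinct sources, with their counts."""
    sources: Dict[str, Set[str]] = defaultdict(set)
    for gene, source in findings:
        sources[gene].add(source)  # a set ignores repeated citations of one source
    return {
        gene: len(found)
        for gene, found in sources.items()
        if len(found) < minimum
    }


# Placeholder findings: GENE_A has 5 distinct sources, GENE_B has 2 (one repeated).
example_findings = [("GENE_A", f"source_{i}") for i in range(5)] + [
    ("GENE_B", "source_0"),
    ("GENE_B", "source_0"),
    ("GENE_B", "source_1"),
]
example_shortfall = genes_below_source_threshold(example_findings)
```

Genes with no findings at all never appear in the input, so a complete check would also compare the result against the list of genes the agent was asked to cover.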
Gene/Protein Name/ID <> Protein/DNA Sequence <> Interval in Sequence <> Function (text format)
Use the standard protein name and/or UniProt ID linked to a protein sequence
Specify intervals in the protein sequence, the modifications introduced, and the change in function those modifications induced
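The record format above (name/ID, sequence, interval, function) could be represented as a small typed structure. This is a sketch under stated assumptions, not a required schema: the class and field names, the 1-based inclusive interval convention, and the `A3G`-style substitution notation are all choices made for illustration, and the sequence used is a toy placeholder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SequenceFunctionRecord:
    """One entry: ID <> sequence <> interval <> function (hypothetical field names)."""

    protein_id: str        # standard protein name and/or UniProt ID
    sequence: str          # wild-type sequence the interval refers to
    start: int             # 1-based, inclusive (assumed convention)
    end: int               # 1-based, inclusive
    modification: str      # e.g. "A3G" for a substitution, free text otherwise
    function_change: str   # text description of the observed change in function
    source: Optional[str] = None  # identifier of the source study, if known

    def __post_init__(self) -> None:
        # Reject intervals that fall outside the sequence or are reversed.
        if not (1 <= self.start <= self.end <= len(self.sequence)):
            raise ValueError(f"interval {self.start}-{self.end} is outside the sequence")

    def interval_residues(self) -> str:
        """Return the residues covered by the interval."""
        return self.sequence[self.start - 1 : self.end]


def apply_substitution(sequence: str, mutation: str) -> str:
    """Apply one substitution written as <wild type><position><mutant>, e.g. 'A3G'."""
    wild_type, position, mutant = mutation[0], int(mutation[1:-1]), mutation[-1]
    if not 1 <= position <= len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    if sequence[position - 1] != wild_type:
        raise ValueError(f"expected {wild_type} at position {position}")
    return sequence[: position - 1] + mutant + sequence[position:]


# Toy placeholder sequence, not a real protein.
example_record = SequenceFunctionRecord(
    protein_id="EXAMPLE_PROTEIN",
    sequence="MKALVT",
    start=2,
    end=4,
    modification="A3G",
    function_change="placeholder description of the functional change",
)
example_mutant = apply_substitution(example_record.sequence, example_record.modification)
```

Checking the wild-type residue before applying a substitution catches off-by-one errors and mismatched isoforms, which are common when positions are pulled from papers that number residues differently.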
Test your agent with these specific proteins to validate its capability to extract comprehensive sequence-to-function relationships:
Your agent should be able to find:
Should be able to recover the results of SuperSOX:
Should recover all major APOE variants and their longevity associations:
Should recover papers converting OCT6 into a reprogramming factor:
Gene/Protein Name/ID
↓
Protein/DNA Sequence
↓
Interval in Sequence
↓
Function (Text Format)
↓
Modification Effects
↓
Longevity Association
The desired structure should enable researchers to quickly identify sequence intervals of interest, understand their functional roles, and see how modifications in those regions affect longevity-related outcomes.
Scientific literature access
Wikipedia-style gene articles (example: APOE)
Comprehensive protein sequence and annotation database
Protein structure predictions
Protein families, domains and functional sites
Database of aging-related genes
Longevity-associated genes database
Eukaryotic Linear Motif resource
A clear knowledge base of known protein modifications linked to functional outcomes in experiments will speed up research on protein engineering, especially in the context of aging.
Ready to accelerate protein engineering for longevity? Build the knowledge base that will transform aging research.