The landscape of tolerated genetic variation in humans and primates

Publication: Contribution to journal, Journal article, Research, Peer-reviewed

Documents

  • Preprint

    Submitted manuscript, 9.15 MB, PDF document

Authors

  • Hong Gao
  • Tobias Hamp
  • Jeffrey Ede
  • Joshua G. Schraiber
  • Jeremy McRae
  • Moriel Singer-Berk
  • Yanshen Yang
  • Anastasia S.D. Dietrich
  • Petko P. Fiziev
  • Lukas F.K. Kuderna
  • Laksshman Sundaram
  • Yibing Wu
  • Aashish Adhikari
  • Yair Field
  • Chen Chen
  • Serafim Batzoglou
  • Francois Aguet
  • Gabrielle Lemire
  • Rebecca Reimers
  • Daniel Balick
  • Mareike C. Janiak
  • Martin Kuhlwilm
  • Joseph D. Orkin
  • Shivakumara Manu
  • Alejandro Valenzuela
  • Juraj Bergman
  • Marjolaine Rousselle
  • Felipe Ennes Silva
  • Lidia Agueda
  • Julie Blanc
  • Marta Gut
  • Dorien de Vries
  • Ian Goodhead
  • R. Alan Harris
  • Muthuswamy Raveendran
  • Axel Jensen
  • Idriss S. Chuma
  • Julie E. Horvath
  • Christina Hvilsom
  • David Juan
  • Peter Frandsen
  • Fabiano R. de Melo
  • Fabrício Bertuol
  • Hazel Byrne
  • Iracilda Sampaio
  • Izeni Farias
  • João Valsecchi do Amaral
  • Mariluce Messias
  • Maria N.F. da Silva
  • Mihir Trivedi
  • Rogerio Rossi
  • Tomas Hrbek
  • Nicole Andriaholinirina
  • Clément J. Rabarivola
  • Alphonse Zaramody
  • Clifford J. Jolly
  • Jane Phillips-Conroy
  • Gregory Wilkerson
  • Christian Abee
  • Joe H. Simmons
  • Eduardo Fernandez-Duque
  • Sree Kanthaswamy
  • Fekadu Shiferaw
  • Dongdong Wu
  • Long Zhou
  • Yong Shao
  • Julius D. Keyyu
  • Sascha Knauf
  • Minh D. Le
  • Esther Lizano
  • Stefan Merker
  • Arcadi Navarro
  • Thomas Bataillon
  • Tilo Nadler
  • Chiea Chuen Khor
  • Jessica Lee
  • Patrick Tan
  • Weng Khong Lim
  • Andrew C. Kitchener
  • Dietmar Zinner
  • Ivo Gut
  • Amanda Melin
  • Katerina Guschanski
  • Mikkel Heide Schierup
  • Robin M.D. Beck
  • Govindhaswamy Umapathy
  • Christian Roos
  • Jean P. Boubli
  • Monkol Lek
  • Shamil Sunyaev
  • Anne O'Donnell-Luria
  • Heidi L. Rehm
  • Jinbo Xu
  • Jeffrey Rogers
  • Tomas Marques-Bonet
  • Kyle Kai How Farh
INTRODUCTION
Millions of people have received genome and exome sequencing to date, a collective effort that has illuminated for the first time the vast catalog of small genetic differences that distinguish us as individuals within our species. However, the effects of most of these genetic variants remain unknown, limiting their clinical utility and actionability. New approaches that can accurately discern disease-causing from benign mutations and interpret genetic variants on a genome-wide scale would constitute a meaningful initial step towards realizing the potential of personalized genomic medicine.
RATIONALE
As a result of the short evolutionary distance between humans and nonhuman primates, our proteins share near-perfect amino acid sequence identity. Hence, the effects of a protein-altering mutation found in one species are likely to be concordant in the other species. By systematically cataloging common variants of nonhuman primates, we aimed to annotate these variants as being unlikely to cause human disease as they are tolerated by natural selection in a closely related species. Once collected, the resulting resource may be applied to infer the effects of unobserved variants across the genome using machine learning.
RESULTS
Following the strategy outlined above, we obtained whole-genome sequencing data for 809 individuals from 233 primate species and cataloged 4.3 million common missense variants. We confirmed that human missense variants seen in at least one nonhuman primate species were annotated as benign in the ClinVar clinical variant database in 99% of cases. By contrast, common variants from mammals and vertebrates outside the primate lineage were substantially less likely to be benign in the ClinVar database (71 to 87% benign), restricting this strategy to nonhuman primates. Overall, we reclassified more than 4 million human missense variants of previously unknown consequence as likely benign, resulting in a greater than 50-fold increase in the number of annotated missense variants compared to existing clinical databases.
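The ClinVar comparison described above amounts to asking what fraction of clinically annotated variants that also appear in nonhuman primates carry a benign label. The following is a minimal sketch of that calculation under stated assumptions: the function name `benign_fraction`, the two-label scheme, and the toy variant identifiers are all hypothetical, and the data are invented for illustration rather than taken from the study.

```python
from collections import Counter


def benign_fraction(annotated_variants, observed_in_primates):
    """Fraction of annotated variants seen in primates that are labeled benign.

    annotated_variants: iterable of (variant_id, label) pairs, where label is
        "benign" or "pathogenic" (a simplification of real clinical labels).
    observed_in_primates: set of variant_ids seen in at least one primate.
    """
    counts = Counter(
        label
        for variant_id, label in annotated_variants
        if variant_id in observed_in_primates
    )
    total = counts["benign"] + counts["pathogenic"]
    if total == 0:
        raise ValueError("no annotated variants overlap the primate set")
    return counts["benign"] / total


# Toy, made-up data: five annotated variants, four of them seen in primates.
toy_annotations = [
    ("v1", "benign"),
    ("v2", "benign"),
    ("v3", "pathogenic"),
    ("v4", "benign"),
    ("v5", "pathogenic"),
]
toy_primate_set = {"v1", "v2", "v4", "v5"}

fraction = benign_fraction(toy_annotations, toy_primate_set)
print(f"benign fraction among primate-observed variants: {fraction:.2f}")
```

Running the same calculation separately for variant sets drawn from primates and from more distant species is one way the reported contrast (99% versus 71 to 87%) could be tabulated.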
To infer the pathogenicity of the remaining missense variants in the human genome, we constructed PrimateAI-3D, a semisupervised 3D-convolutional neural network that operates on voxelized protein structures. We trained PrimateAI-3D to separate common primate variants from matched control variants in 3D space as a semisupervised learning task. We evaluated the trained PrimateAI-3D model alongside 15 other published machine learning methods on their ability to distinguish between benign and pathogenic variants in six different clinical benchmarks and demonstrated that PrimateAI-3D outperformed all other classifiers in each of the tasks.
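To make the idea of a voxelized protein structure concrete, the sketch below bins atom coordinates into a small cubic grid centered on a residue of interest. It is an illustration only: the abstract does not describe the PrimateAI-3D featurization, so the function name `voxelize`, the grid size, the voxel width, and the use of simple atom counts per voxel are all assumptions made here.

```python
import math


def voxelize(atom_coords, center, grid_size=7, voxel_width=2.0):
    """Count atoms per voxel in a cubic grid centered on `center`.

    atom_coords: iterable of (x, y, z) coordinates, e.g. in angstroms.
    center: (x, y, z) point at the middle of the grid.
    grid_size, voxel_width: hypothetical defaults chosen for illustration.
    Returns a nested list indexed as grid[i][j][k].
    """
    half_extent = grid_size * voxel_width / 2.0
    grid = [
        [[0] * grid_size for _ in range(grid_size)]
        for _ in range(grid_size)
    ]
    for atom in atom_coords:
        # Shift so the grid spans [0, grid_size * voxel_width) on each axis.
        index = [
            math.floor((value - origin + half_extent) / voxel_width)
            for value, origin in zip(atom, center)
        ]
        # Atoms outside the cube are ignored.
        if all(0 <= i < grid_size for i in index):
            i, j, k = index
            grid[i][j][k] += 1
    return grid


# Toy coordinates: two atoms near the center, one far outside the grid.
toy_atoms = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5), (100.0, 0.0, 0.0)]
toy_grid = voxelize(toy_atoms, center=(0.0, 0.0, 0.0), grid_size=3)
atoms_in_grid = sum(sum(sum(row) for row in plane) for plane in toy_grid)
print(atoms_in_grid, toy_grid[1][1][1])
```

A grid of this kind is the sort of fixed-shape 3D input that a 3D-convolutional network can consume; a real featurization would likely carry richer per-voxel channels than a raw atom count.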
CONCLUSION
Our study addresses one of the key challenges in the variant interpretation field, namely, the lack of sufficient labeled data to effectively train large machine learning models. By generating the most comprehensive primate sequencing dataset to date and pairing this resource with a deep learning architecture that leverages 3D protein structures, we were able to achieve meaningful improvements in variant effect prediction across multiple clinical benchmarks.
Original language: English
Journal: Science (New York, N.Y.)
Volume: 380
Issue number: 6648
Number of pages: 13
ISSN: 0036-8075
DOI
Status: Published - 2023

ID: 356423037