Autism Research at the Intersection of Genomic Discovery and Big Data
By Autism Speaks Chief Science Officer Robert Ring
This week, we announced a collaboration with Google to advance a historic research project that will fundamentally change our scientific understanding of autism. In doing so, this project holds the potential to reveal important clues about what causes autism and unlock the knowledge essential for improving the precision of medical care that’s needed by many of those living on the autism spectrum.
As most readers of this column know, autism is a spectrum of neurodevelopmental disorders that begin during early childhood and affect an estimated 1 in 68 kids and uncounted adults in the United States. Other than being four times more common in boys, autism doesn’t discriminate – not by race, nation or socioeconomic level.
We have much to learn about autism and its causes. However, study after study implicates a strong genetic basis. By genetics, I mean more than the narrow view of inherited genetic traits. Genetics also includes the diverse ways that the function and structure of one’s genome can be changed by a vast array of non-inherited factors such as exposure to our environment.
I have frequently said, “when it comes to understanding autism risk, all roads lead to genome.” If we’re going to advance our understanding of autism, we must start by detailing what’s happening at the level of the genome. With that detail in hand, the game fundamentally changes.
To accomplish this, Autism Speaks launched an unprecedented research program to develop the largest database of complete genomic information on individuals with autism and their closest family members. This is our Autism Ten Thousand Genomes Program, or Aut10K for short. Like our Autism Genetic Resource Exchange (AGRE), the Aut10K database will serve as an open-access resource for the entire autism research community. Moreover, it’s likely to become a vital resource to the larger field of human genomic discovery.
Genome-guided medical care
Each individual genome sequenced and stored in the Aut10K database will be associated with an array of detailed clinical information about the donor, which has been collected in a standardized way. This clinical information includes diagnoses and a rich diversity of related medical and research information. This data enables researchers to ask the big questions about how genetic mutations lead not only to the development of autism but also to its many associated (comorbid) medical conditions such as anxiety, GI symptoms, sleep disturbances and seizures.
The Aut10K database will offer a platform for researchers to ask simple-to-complex questions about genetic cause and effect. Ultimately the answers to these questions will radically change our approaches to medical care and our development of new treatments. It will move our field closer to real opportunities for personalized medicine.
I am frequently asked about the anonymity of donor information in Aut10K. I want to emphasize that all the data stored in Aut10K has been “de-identified,” or anonymized, according to standard research practices. This protects the privacy of our donors and is common to resources of this kind.
What is a genomic sequence?
Genetic information is stored as molecules of DNA in every cell of our bodies. This DNA is made up of long combinations of the four “letters” A, T, C, and G – the so-called genetic sequence. Your genome contains more than 3 billion letters of this code.
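For readers who like to see this concretely, the short Python sketch below treats a stretch of DNA as a string of the four letters and counts how often each one appears. It then estimates the smallest space 3 billion letters could occupy if each letter were packed into 2 bits. The toy sequence and the function names are invented for illustration, and the 2-bits-per-letter figure is a theoretical minimum for a four-letter alphabet, not a description of any real file format. Files produced by sequencing machines are far larger, because they hold many overlapping reads along with quality information.

```python
from collections import Counter

LETTERS = "ATCG"                 # the four DNA "letters"
BITS_PER_LETTER = 2              # four possibilities fit in 2 bits (theoretical minimum)
GENOME_LETTERS = 3_000_000_000   # "more than 3 billion", rounded down for the estimate


def letter_counts(sequence: str) -> Counter:
    """Count each letter in a DNA string, rejecting anything that is not A, T, C or G."""
    seq = sequence.upper()
    unknown = set(seq) - set(LETTERS)
    if unknown:
        raise ValueError(f"unexpected letters: {sorted(unknown)}")
    return Counter(seq)


def min_storage_mb(n_letters: int) -> float:
    """Smallest possible size in megabytes at 2 bits per letter (decimal units)."""
    return n_letters * BITS_PER_LETTER / 8 / 1_000_000


if __name__ == "__main__":
    toy = "ATCGGATTACA"  # made-up example sequence, not real data
    print(dict(letter_counts(toy)))
    print(f"One genome, tightly packed: about {min_storage_mb(GENOME_LETTERS):,.0f} MB")
```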
It took the original Human Genome Project more than $3 billion and over 13 years to complete the sequencing of the first human genome. Today, with technological innovations, we can sequence the entire genome of any individual for around $2,500 in two weeks. These developments have put within our reach something that would have been unimaginable a few years ago.
At Autism Speaks, we saw this as an opportunity to completely sequence the large repositories of DNA samples (associated with clinical information) that we worked for years to develop. AGRE, which took almost 15 years to develop, represents the largest private collection of autism families in the world. It will be among the first targets of sequencing for Aut10K.
Managing “Big Data”
Although we can now generate genomic data affordably, managing this unprecedented resource brings other substantial costs and new technical challenges.
As we began laying out the roadmap for Aut10K’s path to the complete genomic sequencing of thousands of individuals, we quickly understood that the scale of data generated by the program would rapidly exceed the capacity and capabilities of our usual partners in the academic world. For example, a data file containing just one raw, unprocessed genomic sequence runs about 100 gigabytes on average. That’s the equivalent of 50 high-definition movies downloaded onto your computer.
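To put that figure in perspective, here is a back-of-the-envelope sketch in Python of how raw storage grows with the number of genomes. It uses only the 100-gigabyte figure above. The 10,000-genome target is an assumption taken from the program's name, the function name is made up for illustration, and decimal units (1,000 gigabytes per terabyte) are assumed. It ignores processed files, backups and compression, so treat it as a rough estimate rather than a description of how Aut10K data is actually stored.

```python
GB_PER_GENOME = 100       # average raw file size quoted in the article
TARGET_GENOMES = 10_000   # assumption: taken from the "Aut10K" name
GB_PER_TB = 1_000         # assumption: decimal units for simplicity


def raw_storage_tb(genomes: int, gb_per_genome: int = GB_PER_GENOME) -> float:
    """Estimate raw storage in terabytes for a given number of genomes."""
    return genomes * gb_per_genome / GB_PER_TB


if __name__ == "__main__":
    for n in (100, 1_000, TARGET_GENOMES):
        print(f"{n:>6,} genomes: about {raw_storage_tb(n):,.0f} TB of raw data")
```

Under these assumptions, 1,000 genomes come to about 100 terabytes and 10,000 genomes to about 1,000 terabytes, or roughly one petabyte.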
If one takes a step back, the new era of genomic discovery has become as much an engineering challenge as it once was a biological hurdle. To manage this, we had to reach beyond life sciences. We had to forge a new collaboration with experts in storing, analyzing and providing access to what is frequently referred to as “big data.”
Entering the Google Cloud
Enter Google, arguably the best in the business of big datasets. Over the past year, we have worked with Google engineers and scientists who understand the challenges that genetic researchers are facing with managing their big data. They were already exploring ways to leverage the tremendous capacity and capability of the Google Cloud to deliver solutions.
This week, Autism Speaks announces a new collaboration with Google to host our Aut10K database in the Google Cloud. We have established an agreement that will ensure that qualified users of our database will also be able to access the computational resources required to analyze data at this scale.
The relationship is truly a game-changer for the Aut10K program and brings together two organizations that share a common desire to use technology to “do good things for the world.” The Aut10K is a prime example of this goal at work for individuals living with autism.
The Aut10K is underway, led by its newly appointed director Steve Scherer, of the University of Toronto’s Hospital for Sick Children (SickKids) and its Centre for Applied Genomics. Already, 1,000 completed genomes are entering the Google Cloud. We have several thousand additional cases in the midst of sequencing.
Dr. Scherer and his colleagues published their analysis of the first 100 genomes, a great accomplishment, in the American Journal of Human Genetics last summer. These findings have already advanced understanding of autism and, in some cases, provided information useful in guiding diagnosis and treatment.
This historic and exciting work would not be possible without the passion and support of Autism Speaks’ tremendous community of families, donors and volunteers. Thank you, and please let us hear from you.