Our knowledge of the human genome has increased exponentially since British chemist Frederick Sanger won the 1980 Nobel Prize for his eponymous approach to DNA sequencing. While “Sanger sequencing” offered great accuracy, it could only sequence a single DNA fragment at a time. As the Human Genome Project demonstrated, Sanger sequencing offered low throughput at very high cost: the project took more than a decade and cost more than $3 billion.
The mid-2000s brought next-generation sequencing (NGS), a technique that offers much greater speed at a much lower cost; by 2011, an entire human genome could be sequenced in a few weeks for about $10,000. NGS has proven revolutionary in the study of genetic diseases because it allows genome-wide detection of variants, including novel ones. NGS can also be applied to pharmacogenetic problems, allowing researchers not only to evaluate why certain drugs are ineffective for specific patients, but also to predict the success of a drug based on genetic factors.
According to the National Human Genome Research Institute, the next phase of genetic research is characterized by cooperative, large-scale initiatives like the Encyclopedia of DNA Elements (ENCODE) Consortium. These initiatives require a much more team-oriented approach to research because they bring together researchers and institutions from all over the world.
The collaborative nature of human genome research makes it an ideal candidate for cloud computing. NGS also presents a significant challenge in terms of data storage and processing, which cloud technologies can help address. The following case studies, included in UberCloud's Compendium of Life Sciences, explore applications and benefits of cloud technologies for human genome research.
NGS is a high-throughput sequencing technology that is revolutionizing medicine; its accuracy and breadth of coverage make it useful for diagnosing and treating hereditary diseases and cancer, as well as infectious diseases. However, the advantages of NGS lead to corollary challenges: storing, sharing, and analyzing the data strain disk space, network bandwidth, and processor resources.
Cloud computing can alleviate some of these problems by flexibly providing virtualized compute and storage resources on demand. Cluster computing running on top of virtualized infrastructure can help coordinate the compute and storage resources for high-throughput applications, such as processing NGS data. Bioinformatics workflow services such as those provided by Nimbus Informatics are built on top of infrastructure providers such as Amazon Web Services (AWS) and promise to provide an end-to-end platform that automates storing, organizing, and processing large amounts of data in the cloud.
Workflow run status is displayed in a table. The team ran approximately 30 workflows simultaneously on Nimbus' AWS-based system.
In this case study, the research team used SeqWare AMI, a comprehensive, open-source toolbox billed as a “genome center in a box.” SeqWare AMI is pre-set to run workflows directly on the EC2 instance, without an external cluster, making it a great option for development and testing. To test the scalability of the system by processing the samples in parallel, the team used Nimbus Informatics, a provider of hosted computational workflows that enable the analysis of complex genomics data for customers lacking a local HPC infrastructure.
The team identified three benefits of this solution:
Figure: the solvated PI3Kα in a water box.
Cancer is a leading cause of death worldwide. The World Health Organization estimates that by 2030, 21 million people will die from the disease each year. Phosphoinositide-3-kinase alpha (PI3Kα) serves one of the signaling pathways that, when deregulated, becomes centrally involved in several types of cancer, including colon, mammary, and endometrial tumorigenesis.
The importance of PI3Kα in cancer is highlighted by the discovery that PIK3CA, the gene encoding the catalytic p110α subunit of PI3Kα, is frequently mutated in human malignancies.
In this case study, researchers sought insights into the oncogenic mechanism of two commonly expressed PI3Kα mutants by studying their conformational changes with molecular dynamics (MD) simulations, in comparison with the PI3Kα wild-type (normal, noncancerous) protein. The research team was particularly interested in evaluating the efficacy of cloud computing for performing MD simulations using GROMACS software. GRNET was used as the cloud computing service provider, providing a virtual machine with eight cores and GROMACS installation. Linear scaling was observed up to the eight available cores.
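"Linear scaling" here means that doubling the number of cores roughly halves the run time. As a minimal sketch of how that is usually checked, the snippet below computes speedup and parallel efficiency from wall-clock timings. The function name `scaling_table` and the timing values are hypothetical illustrations, not figures from the case study.

```python
def scaling_table(timings):
    """Return (cores, speedup, efficiency) rows from a {cores: wall_seconds} dict.

    Speedup is measured relative to the smallest core count in the dict;
    efficiency is speedup divided by the relative increase in cores.
    """
    base_cores = min(timings)
    base_time = timings[base_cores]
    rows = []
    for cores in sorted(timings):
        speedup = base_time / timings[cores]
        efficiency = speedup / (cores / base_cores)
        rows.append((cores, speedup, efficiency))
    return rows


# Hypothetical wall-clock times in seconds for the same simulation;
# these are made-up values chosen to show ideal linear scaling.
example_timings = {1: 800.0, 2: 400.0, 4: 200.0, 8: 100.0}

for cores, speedup, efficiency in scaling_table(example_timings):
    print(f"{cores} cores: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

An efficiency close to 100% at every core count is what a linear-scaling result looks like; a value that drops as cores are added would suggest communication or I/O overhead.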
The team identified multiple benefits from the project:
Rapid progress in NGS is paving the way to individualized cancer treatment based on the sequencing of the cancer genome. This process is still evolving, and validated open source pipelines to process the enormous amounts of data involved are not available to everyone interested in this field.
Software programs used to align NGS data to the reference genome are governed by user-defined parameters and map the short sequenced reads to the reference genome with varying accuracy. Most alignment tools need a long time to produce accurate mappings of the short reads to the reference genome.
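To illustrate how a user-defined parameter can change a mapping result, here is a toy sketch of read mapping by brute-force comparison. It is not how the aligners in the case study work (production tools rely on indexed data structures and handle gaps and quality scores); the function names, the `max_mismatches` parameter, and the sequences are all assumptions made for illustration.

```python
def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference, max_mismatches=0):
    """Return (position, mismatches) for the best placement of a read.

    Every offset in the reference is tried; the leftmost placement with
    the fewest mismatches wins. Returns None if no placement stays
    within max_mismatches.
    """
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = hamming(read, window)
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return best


# Made-up sequences: the read differs from the reference by a single base.
reference = "ACGTTGCATGCCGATAGCTA"
read = "TGCATGAC"

print(map_read(read, reference, max_mismatches=0))  # strict setting
print(map_read(read, reference, max_mismatches=1))  # tolerant setting
```

With the strict setting the read is left unmapped, while allowing one mismatch places it at offset 4. The sketch also hints at the cost problem: it scans every offset for every read, which is one reason real aligners rely on more sophisticated indexing.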
The authors of this case study set out to begin preparing guidelines on the optimal choice of open source alignment tools for NGS data-processing pipelines by matching these tools to the available computational resources. They had two primary objectives:
To make the computation environment (including the aligners, scripts, and configuration settings) portable, the team used UberCloud Linux container images. The same image was used in the AWS, Bare Metal #1, and Bare Metal #2 tests. The Operations Engineer ported the Linux container images between environments easily.
The team gained two particularly useful insights from the project:
All three of these case studies are available in the UberCloud Compendium of Life Sciences Case Studies. The compendium includes nine case studies that explore innovative applications of cloud simulations in medicine and healthcare.