Genomic workflow using Microsoft Genomics Cloud Services
Genome sequencing is challenging because it produces vast amounts of unstructured data, which first has to be processed to find variants; those variants then have to be analysed before anything meaningful can be done with them.
In humans, the sequencing machine has to process about 3 billion units of DNA across roughly 23,000 genes. Current estimates indicate that only 0.1% of our DNA may vary, which means any two unrelated individuals may differ at fewer than about 3 million DNA positions (0.1% of 3 billion).
All humans have almost the same sequence of 3 billion DNA bases (A, C, G, or T) distributed among their 23 pairs of chromosomes, but at certain locations there are differences, and these variations are called polymorphisms.
One type of polymorphism is the single nucleotide polymorphism (SNP, pronounced “snip”), which involves variation of a single base pair. The study of these variants has proven valuable for human health: it helps us understand complex diseases such as heart disease, asthma, diabetes, and cancer, predict an individual’s response to certain drugs, and even score the risk of developing particular diseases. The human genome does change over time, based on age, environment, diet, and other factors, and the degree of change is similar among family members.
Genomic analysis is broken down into three stages with respect to data interpretation. The sequencing machine reads the order of nucleic acids and records it as text strings (primary analysis: preparation and collection); those strings are then mapped and aggregated against a reference genome (secondary analysis: discovery); and finally the variant data is ingested into genomic analysis tools for insights (tertiary analysis: machine learning and reporting).
In this article, our primary focus is to process sample genome data using Microsoft Genomics secondary analysis, and to walk through the setup options that facilitate tertiary analysis using Azure Databricks. We will leave the subject matter and functional aspects of this broad topic to bioinformatics experts.
Microsoft Genomics provides a cloud-hosted solution that processes the outputs of the primary analysis and helps create a variant call format file, as elaborated in the matrix above. It uses the Burrows-Wheeler Aligner (BWA) and the Genome Analysis Toolkit (GATK), which are part of the Broad Institute’s Best Practices analysis pipeline, producing results faster and with less overhead.
Microsoft Genomics is a ready-to-go solution that can be adopted in a business with minimal effort. Handling genomics data brings many legal and ethical topics to the table, which call for data privacy and security. Microsoft meets international standards for security, privacy and quality; the service is ISO-certified and covered by the Microsoft HIPAA BAA. More information about the Health Insurance Portability and Accountability Act (HIPAA) can be found at this link: https://docs.microsoft.com/en-us/compliance/regulatory/offering-hipaa-hitech
Once an Azure environment is set up, for example by signing up for an “Azure Free Trial Subscription”, one could create a “Genomics account” from the available Azure services.
Two more steps are needed to complete the environment setup. First, one needs to install Python and the Microsoft Genomics Python client (MsGen) on the local machine, in order to trigger genomics workflows in the cloud. The Microsoft Genomics Python client is compatible with Python 2.7.14 and is available to download from the link below.
Using Python pip, one could install the Microsoft Genomics client (MsGen) locally.
pip install msgen
The final step is to download the “Config file” from the Genomics account, as shown below.
One could verify that the Genomics client works, using the command shown below.
msgen list -f "D:\Workspace\*********\Azure\Genomics\config.txt"
Primary sequence analysis produces FASTQ text files, which contain millions of entries of raw sequence data and run to gigabytes in size. These files become the input for the secondary analysis workflow.
The secondary analysis service accepts raw samples as two paired-end read FASTQ files, and produces outputs such as .BAM, .BAI and .VCF files, along with the associated log files.
In bioinformatics, FASTQ is a text-based format that stores a biological sequence (usually a nucleotide sequence) together with a quality score for each base, with both the base and its score encoded as single characters. A single human genome takes up to 100 gigabytes of storage space.
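To make the format concrete, here is a minimal Python sketch that parses a single FASTQ record, which spans four lines: an identifier, the sequence, a separator, and a quality string. It assumes the common Phred+33 quality encoding and a well-formed record; the read shown is made up, and `parse_fastq_record` is a hypothetical helper, not part of any Microsoft tool.

```python
def parse_fastq_record(lines):
    """Parse one 4-line FASTQ record into (read_id, sequence, quality_scores)."""
    header, sequence, separator, quality = [line.strip() for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a well-formed FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    # Assumption: Phred+33 encoding, so score = ASCII code of the character - 33.
    scores = [ord(ch) - 33 for ch in quality]
    return header[1:], sequence, scores


# A made-up record, for illustration only.
record = ["@read_1", "GATTACA", "+", "IIIIHH#"]
read_id, sequence, scores = parse_fastq_record(record)
print(read_id, sequence, scores)
```

Real FASTQ files are usually gzip-compressed and hold millions of such records, so in practice one would stream them four lines at a time rather than load them whole.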
One needs to upload the FASTQ files into Azure Blob storage to perform secondary analysis. The Microsoft Genomics service expects its inputs to be stored as block blobs, and it also writes the processed output files as block blobs. Azure storage is optimized to work efficiently with cloud services, with fast upload and download throughput. In the Azure storage account, we will create 2 containers to hold the sequence input and output data, as shown below.
We will be using publicly available sample data to see the Microsoft Genomics service at work. One could either upload the data files using the Azure storage GUI, or automate the upload routine (from on-premises to Azure Blob) using the command line utility called “AzCopy”, as shown below.
azcopy cp "D:/Workspace/*********/Azure/Genomics/Data/Input/" "https://genomics.blob.core.windows.net/input/?sv=2019-12-12&ss=b&srt=sco&sp=rwdlacx&se=2021-01-28T20:18:11Z&st=2021-01-28T12:18:****************************************" --recursive=true
Once the raw sequence files are uploaded into Blob storage, one should update the configuration file locally, with input and output containers, as shown below.
With this, we are ready to start the secondary analysis, using Microsoft Genomics workflow.
Using the Genomics client, one could trigger a workflow. A Genomics workflow processes raw sequence data using a reference genome.
Using the config file, one could set the workflow configuration such as input and output data containers or choose a reference genome, as shown below.
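Before submitting, it can help to check that the config file is complete. The sketch below parses simple `key: value` lines and reports settings that are empty or missing. The key names are an assumption based on the config file format described in the Microsoft Genomics quickstart; they should be checked against the file actually downloaded from the Genomics account. All values shown are placeholders.

```python
# Assumed key names; verify against the config file downloaded from the portal.
REQUIRED_KEYS = [
    "api_url_base",
    "access_key",
    "input_storage_account_name",
    "input_storage_account_key",
    "input_storage_account_container",
    "output_storage_account_name",
    "output_storage_account_key",
    "output_storage_account_container",
]


def parse_config(text):
    """Parse 'key: value' lines into a dict, skipping blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so URLs in the value stay intact.
        key, _, value = line.partition(":")
        settings[key.strip()] = value.strip()
    return settings


def missing_keys(settings, required=REQUIRED_KEYS):
    """Return the required keys that are absent or have an empty value."""
    return [key for key in required if not settings.get(key)]


sample_config = """
api_url_base: https://example.invalid
access_key: <genomics-account-key>
input_storage_account_name: <storage-account>
input_storage_account_key: <storage-key>
input_storage_account_container: input
output_storage_account_name: <storage-account>
output_storage_account_key: <storage-key>
output_storage_account_container:
"""
print(missing_keys(parse_config(sample_config)))
```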
The genomic pipeline aligns the sequence against the reference genome using the Burrows-Wheeler Aligner (BWA) and creates a mapped sequence with quality scores (a SAM file, or its binary compressed form, a BAM file). It then identifies where the mapped data and the reference genome differ, using a Bayesian algorithm (an analytical technique), and finally writes the result as a Variant Call Format (VCF) file. In this way, hundreds of gigabytes of data are reduced to just a few megabytes containing the variants of interest for an individual.
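As a deliberately naive illustration of the idea of finding where aligned data differs from a reference, the Python sketch below compares one already-aligned read with a reference string and lists the mismatching positions. This is not the Bayesian method used by the pipeline: real variant callers weigh evidence from many overlapping reads and their base qualities. The sequences and the `find_mismatches` helper are made up for illustration.

```python
def find_mismatches(reference, aligned_read, start=1):
    """Return (position, ref_base, read_base) for every differing base.

    Positions are 1-based by default, matching the convention used in VCF.
    """
    if len(reference) != len(aligned_read):
        raise ValueError("this toy example expects equal-length sequences")
    return [
        (start + offset, ref_base, read_base)
        for offset, (ref_base, read_base) in enumerate(zip(reference, aligned_read))
        if ref_base != read_base
    ]


reference = "ACGTACGTAC"
read = "ACGTATGTAC"
print(find_mismatches(reference, read))
```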
Variants are markers for specific traits, ranging from physical attributes to disease susceptibility, and these variants make an individual unique. Possible genetic variants are transitions, transversions, insertions/deletions (indels) and structural variations.
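The first three variant types can be told apart from the reference and alternate alleles alone. A transition swaps a purine for a purine (A and G) or a pyrimidine for a pyrimidine (C and T); a transversion swaps a purine for a pyrimidine or vice versa. The sketch below is a simplified classifier under that definition; `classify_variant` is a hypothetical helper, and structural variations are out of its scope.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def classify_variant(ref, alt):
    """Classify a variant from its reference and alternate alleles."""
    ref, alt = ref.upper(), alt.upper()
    if not ref or not alt:
        raise ValueError("alleles must not be empty")
    if len(ref) != len(alt):
        return "indel"  # bases were inserted or deleted
    if len(ref) != 1:
        return "multi-nucleotide"  # same length, more than one base
    if ref == alt or ref not in "ACGT" or alt not in "ACGT":
        raise ValueError("expected two different bases out of A, C, G, T")
    pair = {ref, alt}
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_class else "transversion"


for ref, alt in [("A", "G"), ("C", "T"), ("A", "C"), ("A", "AT")]:
    print(ref, alt, classify_variant(ref, alt))
```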
One could run a workflow using the msgen Python client, as shown below.
msgen submit -f "D:\Workspace\*********\Azure\Genomics\config.txt" -b1 "chr21_1.fq.gz" -b2 "chr21_2.fq.gz"
Once the workflow is submitted, it takes somewhere around 30 minutes to 2 hours to finish depending upon the workload. One could request the workflow status, as shown below.
msgen list -f "D:\Workspace\*********\Azure\Genomics\config.txt"
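The submit and list commands above can also be driven from a script, for example to process several samples in a row. The sketch below only builds the same command lines with the flags used in this article (`-f`, `-b1`, `-b2`) and hands them to `subprocess`. It assumes the `msgen` executable is on the PATH and that the driving script runs on Python 3.7 or later; `config.txt` is a placeholder path.

```python
import subprocess


def submit_command(config_path, read1_blob, read2_blob):
    """Build the argument list for 'msgen submit' with two paired-end reads."""
    return ["msgen", "submit", "-f", config_path, "-b1", read1_blob, "-b2", read2_blob]


def list_command(config_path):
    """Build the argument list for 'msgen list', which reports workflow status."""
    return ["msgen", "list", "-f", config_path]


def run(command):
    """Run a command and return its standard output.

    check=True raises CalledProcessError if the command exits with an error.
    """
    completed = subprocess.run(command, check=True, capture_output=True, text=True)
    return completed.stdout


# Placeholder config path; the blob names are the ones used in this article.
command = submit_command("config.txt", "chr21_1.fq.gz", "chr21_2.fq.gz")
print(" ".join(command))
# To actually submit, call: run(command)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with Windows paths that contain spaces or backslashes.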
The Genomics secondary analysis service outputs a VCF file, optimized for downstream analysis, along with BAM and log files, as shown below.
With secondary analysis accomplished, one could focus on the next stage of analysis, using the VCF data.
Secondary analysis deals with different types of sequencing. DNA sequencing gives a complete genetic profile of an organism, whereas RNA sequencing reflects only the sequences that are actively expressed in the cells. Exome sequencing targets the exome, which makes up only about 1.5% of the whole human genome and consists of all protein-coding regions. The majority of disease-causing mutations are found in the exome. One must choose the reference genome according to the sequencing type.
Once we have the gene sequence variations in the form of a Variant Call Format (VCF), it’s pretty straightforward from here onwards, to use the machine learning and analytical tools, to perform tertiary analysis.
One could perform the tertiary analysis using visual tools such as the Integrative Genomics Viewer (IGV) or the UCSC Genome Browser. In Azure Databricks, one could use the “Glow” package to perform bioinformatics analyses on Apache Spark. Glow includes functions to handle VCF data, such as reading and writing it with Spark, or even creating a Delta file out of it.
To run genomics-based workload in Azure Databricks, one needs to enable “Databricks Runtime for Genomics” via the Admin Console.
This would allow one to create and configure a cluster using “Genomics runtime”, as shown below.
One could start querying VCF files right from Blob storage, without any data movement, using Spark as shown below.
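In a notebook, reading a VCF file from Blob storage could look roughly like the sketch below. It assumes a cluster where Glow's `vcf` data source is available (for example the genomics runtime mentioned above), that the cluster has already been granted access to the storage account, and that `spark` is the session object provided by the notebook. The account, container and file names are placeholders.

```python
def vcf_uri(storage_account, container, blob_path):
    """Build a wasbs:// URI, the Hadoop scheme for Azure Blob storage over TLS."""
    return f"wasbs://{container}@{storage_account}.blob.core.windows.net/{blob_path}"


def read_vcf(spark, storage_account, container, blob_path):
    """Load a VCF file from Blob storage into a Spark DataFrame.

    The "vcf" format is provided by Glow; it is not part of plain Apache Spark.
    """
    return spark.read.format("vcf").load(vcf_uri(storage_account, container, blob_path))


# In a Databricks notebook (placeholder names):
# df = read_vcf(spark, "<storage-account>", "output", "<sample>.vcf")
# df.printSchema()
```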
A VCF file contains 8 mandatory columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO) and many optional columns, which record information about the samples, as shown below.
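To show how those columns are laid out, the following plain-Python sketch splits one tab-separated VCF data line into the eight fixed columns defined by the VCF specification (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and keeps whatever follows as the optional per-sample part. The data line is made up, and `parse_vcf_line` is a hypothetical helper for illustration; for real workloads a dedicated reader such as Glow is the better choice.

```python
VCF_FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Split one VCF data line into its fixed columns plus optional extras."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(VCF_FIXED_COLUMNS):
        raise ValueError("a VCF data line needs at least 8 tab-separated columns")
    record = dict(zip(VCF_FIXED_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    # INFO is a semicolon-separated list of KEY=VALUE pairs or bare flags;
    # a single "." means the field is empty.
    info = {}
    if record["INFO"] != ".":
        for item in record["INFO"].split(";"):
            key, sep, value = item.partition("=")
            info[key] = value if sep else True
    record["INFO"] = info
    # Everything after the fixed columns: FORMAT, then one column per sample.
    record["EXTRA"] = fields[8:]
    return record


# A made-up data line, for illustration only.
line = "21\t9411239\t.\tG\tA\t50.0\tPASS\tDP=14;AF=0.5\tGT\t0/1"
print(parse_vcf_line(line))
```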
One could also flatten the columns, to make them readable and focus on relevant columns, as shown below.
One could store VCF datasets as a Delta file or table in Delta Lake on Azure Databricks. This is extremely handy when one needs to maintain multiple versions of a dataset and perform analysis on them.
This ability to reference previous versions allows one to roll back to a given checkpoint (datetime), and reproduce the data as necessary for the analysis.
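A rough sketch of that versioning workflow is shown below: one function writes a DataFrame as a Delta table, the other reads the table as it was at an earlier point in time using Delta Lake's `timestampAsOf` read option. It assumes a Databricks notebook where `spark` and a variants DataFrame already exist; the path and timestamp are placeholders.

```python
def save_variants_as_delta(df, path):
    """Write a DataFrame to a Delta location; each overwrite adds a new table version."""
    df.write.format("delta").mode("overwrite").save(path)


def load_variants_as_of(spark, path, timestamp):
    """Read the Delta table as it was at the given timestamp (time travel)."""
    return (
        spark.read.format("delta")
        .option("timestampAsOf", timestamp)
        .load(path)
    )


# In a Databricks notebook (placeholder path and timestamp):
# save_variants_as_delta(df, "/mnt/genomics/delta/variants")
# earlier = load_variants_as_of(spark, "/mnt/genomics/delta/variants", "2021-01-28 12:00:00")
```

Delta Lake also accepts a `versionAsOf` option when one would rather address a snapshot by its version number than by a datetime.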
Glow also includes summary statistics functions, such as Hardy-Weinberg equilibrium or call summary stats, which could be applied directly to variant data and added as new columns to the data frame.
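For readers unfamiliar with the Hardy-Weinberg statistic, the plain-Python sketch below shows the underlying expectation: given observed genotype counts at one variant, it derives the allele frequencies p and q and the genotype counts expected under equilibrium (p², 2pq and q² times the sample size). This is not Glow's implementation, only the textbook calculation; the counts are made up.

```python
def hardy_weinberg_expected(n_hom_ref, n_het, n_hom_alt):
    """Return expected (hom_ref, het, hom_alt) counts under Hardy-Weinberg equilibrium."""
    total = n_hom_ref + n_het + n_hom_alt
    if total == 0:
        raise ValueError("at least one genotype is required")
    # Each individual carries two alleles, hence the factor of 2.
    p = (2 * n_hom_ref + n_het) / (2 * total)  # reference allele frequency
    q = 1 - p                                  # alternate allele frequency
    return p * p * total, 2 * p * q * total, q * q * total


observed = (50, 40, 10)  # made-up genotype counts
expected = hardy_weinberg_expected(*observed)
print([round(count, 2) for count in expected])
```

A large gap between observed and expected counts at a variant is often treated as a hint of genotyping error or population structure, which is why such a statistic is commonly computed as a quality check.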
One of the plots commonly used in genome-wide association studies (GWAS) to display variants (SNPs) is a scatter plot called the “Manhattan plot”. It plots each variant's P-value (a measure of probability, usually shown as its negative base-10 logarithm) against its position along the chromosomes (CHR number). Each point represents a genetic variant. A sample Manhattan plot is shown below.
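The values behind such a plot are simple to derive. The sketch below turns (chromosome, position, P-value) triples into plot-ready points by taking the negative base-10 logarithm of each P-value, so that smaller P-values appear higher up. The threshold of 5e-8 is the conventional genome-wide significance level and is an assumption here, as are the made-up variants.

```python
import math

# Conventional GWAS significance threshold (an assumption, not from this article).
GENOME_WIDE_SIGNIFICANCE = 5e-8


def manhattan_points(variants):
    """Convert (chromosome, position, p_value) triples to sorted plot points.

    Returns (chromosome, position, -log10(p_value)), ordered along the genome.
    """
    points = [(chrom, pos, -math.log10(p_value)) for chrom, pos, p_value in variants]
    return sorted(points, key=lambda point: (point[0], point[1]))


def is_significant(p_value, threshold=GENOME_WIDE_SIGNIFICANCE):
    """True if the P-value falls below the significance threshold."""
    return p_value < threshold


# Made-up variants: (chromosome, position, p_value).
variants = [(21, 9411239, 0.03), (2, 150000, 1e-8), (2, 90000, 0.5)]
for chrom, pos, height in manhattan_points(variants):
    print(chrom, pos, round(height, 2))
```

The resulting points can be passed to any plotting library, with position along the genome on the x-axis and the transformed P-value on the y-axis.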
One should pay attention to the costs incurred by these services. Cluster usage time should be strictly monitored, and one should “Terminate” the cluster once the data has been processed in Databricks. One may also trigger the genomics workload from Azure Data Factory on a cluster that is only active for the lifetime of the job.
Also, Azure Blob storage comes with access tiers such as Hot, Cool, and Archive. Keeping genomics data in Hot storage is expensive, so one should change the storage tier once data processing is done. The Hot tier is the most expensive to store data in and gives the fastest access, whereas Cool and Archive are cheaper to store, with higher access costs or no immediate access to the data, respectively.
Technical advances in handling genomic datasets have enabled a speedy response to viruses such as Covid-19, through next-generation mRNA (messenger RNA) vaccines. The mRNA technology has revolutionised vaccine development; it differs from the traditional method of injecting a weakened or dead virus into the body to activate the immune system. mRNA vaccines use the protein instructions of the virus itself, for example those of the Covid-19 spike protein: instead of creating these proteins in the lab and shipping them, they bring the whole protein factory into the human body, as illustrated below.
Microsoft Genomics makes it robust and effortless to perform secondary analysis using genomics workflows. Having Microsoft manage the underlying infrastructure benefits researchers around the world, who can easily share and collaborate on genomic data, saving time and operational costs.
As seen in this article, Microsoft Genomics is a great option both for prototyping and as a long-term solution in health and research organizations. One could upload anonymized genomes of healthy and cancerous cells, and build public data repositories.
With a simple, consistent and reliable service, one is able to drive breakthroughs in understanding and treating complex diseases such as heart disease, asthma, diabetes, and cancer with precision medicine. mRNA technology has an incredible future, and supporting clinical trials with genomic evidence could considerably shorten the time to market for new drugs. The ease of data processing opens up personalized medical care in terms of medicine, prevention and treatment, thereby reducing costs tremendously in the health sector. Finally, leaving you with this genomics quote …
“What more powerful form of study of mankind could there be than to read our own instruction book?” -Francis Collins
Originally published at https://www.linkedin.com.