Genomics DataModel, v 1.0 using Apache Spark and Elastic Search

Barani Dakshinamoorthy
Aug 6, 2020

At MedDataXtract.com, we have been working on “Genomics DataModel v 1.0” to explore the capabilities of Apache Spark in combination with Elasticsearch. I must say I am quite impressed by the development experience and ease of use of these popular tools.

Genomics poses a huge challenge because of its large, unstructured datasets. It means analyzing enormous amounts of DNA-sequence data to find variations that affect health, disease or drug response. In humans, that means searching through about 3 billion units of DNA across 23,000 genes. All humans have almost the same sequence of 3 billion DNA bases (A, C, G or T) distributed across their 23 pairs of chromosomes. But at certain locations there are differences, and these variations are called polymorphisms.
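To make the idea of a polymorphism concrete, here is a minimal sketch that compares a sample sequence against a reference and reports the positions where the bases differ. It assumes the two sequences are already aligned and of equal length, which a real pipeline would have to establish first. The function name `find_variations` and the short sequences are hypothetical and purely illustrative.

```python
from typing import List, Tuple


def find_variations(reference: str, sample: str) -> List[Tuple[int, str, str]]:
    """Return (position, reference_base, sample_base) for every mismatch.

    Assumption: both sequences are pre-aligned and have the same length.
    Positions are 0-based offsets into the strings.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base
    ]


# Toy data, not real genomic sequences.
reference = "ACGTTAGCCA"
sample = "ACGTCAGCTA"
variations = find_variations(reference, sample)
print(variations)  # [(4, 'T', 'C'), (8, 'C', 'T')]
```

Each tuple is a single-base difference, which is the kind of variation the rest of this article refers to as an SNP.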

Current estimates indicate that up to 0.1% of our DNA may vary, meaning any two unrelated individuals may differ at fewer than about 3 million DNA positions (0.1% of 3 billion bases). We have been mapping and collecting these variations (Single Nucleotide Polymorphisms, or SNPs) in our GeneSequence DataLake. We have also been mapping them to specific symptoms, which we categorize into four mapping attributes: Capabilities, Diseases, Variations and SNPs with No Significant Effect (NSE).
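The arithmetic and the four mapping attributes can be sketched as follows. This is only an illustration of how such a record might be shaped. The class names `MappingAttribute` and `SnpRecord`, the field names, and the example values are all assumptions, not the actual schema of the GeneSequence DataLake.

```python
from dataclasses import dataclass
from enum import Enum

GENOME_BASES = 3_000_000_000    # approximate figure quoted above
MAX_VARIABLE_FRACTION = 0.001   # "up to 0.1%"

# Upper bound on differing positions between two unrelated individuals.
max_differences = round(GENOME_BASES * MAX_VARIABLE_FRACTION)
print(max_differences)  # 3000000


class MappingAttribute(Enum):
    """The four mapping attributes named in the text."""
    CAPABILITY = "Capabilities"
    DISEASE = "Diseases"
    VARIATION = "Variations"
    NSE = "SNPs with No Significant Effect"


@dataclass(frozen=True)
class SnpRecord:
    """Hypothetical shape of one mapped SNP; every field name is an assumption."""
    snp_id: str
    chromosome: int
    position: int
    reference_base: str
    variant_base: str
    attribute: MappingAttribute


# Made-up example values, for illustration only.
record = SnpRecord("snp-0001", 1, 12345, "A", "G", MappingAttribute.NSE)
print(record.attribute.value)  # SNPs with No Significant Effect
```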

Our initial phase was to recognize and collect all known PROMOTERs, BASE PAIRs, SPECIFIC PATTERNs and VARIATIONs (SNPs). This is still a work in progress. We have been using Apache Spark cluster computing to read unstructured data and store only the recognizable patterns in a Microsoft SQL Server database.
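The "read raw data, keep only recognized patterns" step can be illustrated with a small single-machine sketch. It is not the Spark job itself: plain Python with the standard `re` module stands in for the distributed scan, and an in-memory SQLite database stands in for the SQL Server target. The TATA-box motif is a well-known promoter element used here only as an example. The table name `pattern_hits` and the helper names are hypothetical.

```python
import re
import sqlite3
from typing import Iterator, Tuple

# Illustrative pattern catalogue. The patterns the project actually tracks
# are not listed in the article, so this single motif is an assumption.
KNOWN_PATTERNS = {
    "TATA_BOX": re.compile(r"TATA[AT]A[AT]"),
}


def scan(sequence: str) -> Iterator[Tuple[str, int, str]]:
    """Yield (pattern_name, start_position, matched_text) for each hit.

    finditer reports non-overlapping matches, scanning left to right.
    """
    for name, pattern in KNOWN_PATTERNS.items():
        for match in pattern.finditer(sequence):
            yield name, match.start(), match.group()


def store_hits(conn: sqlite3.Connection, sequence: str) -> int:
    """Persist only the recognized patterns and return how many were stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pattern_hits "
        "(pattern TEXT, position INTEGER, matched TEXT)"
    )
    rows = list(scan(sequence))
    conn.executemany("INSERT INTO pattern_hits VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# SQLite in memory is a stand-in for the real database.
conn = sqlite3.connect(":memory:")
raw = "GGCTATAAAAGGCCGTATATATCC"  # toy input, not real genomic data
stored = store_hits(conn, raw)
print(stored)  # 2
print(conn.execute("SELECT * FROM pattern_hits ORDER BY position").fetchall())
```

The unmatched parts of the input are simply discarded, which mirrors the idea of storing only what is recognizable rather than the full raw sequence.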

Spark SQL has its own pattern-search capabilities. We have also explored this possibility using Elasticsearch. Elasticsearch's analytical engine and its visualization tool, Kibana, help us explore patterns seamlessly through dashboards. The project is still in progress and has its own challenges, which we expect to address through future research.
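As a rough illustration of what a pattern search looks like in each tool, the sketch below builds a Spark SQL statement using `RLIKE` and an Elasticsearch `regexp` query body for the same motif. It only constructs the query text and JSON, and does not connect to a cluster. The table `sequences`, the columns `sample_id` and `sequence`, and the motif are hypothetical, and the Elasticsearch part assumes `sequence` is indexed as a single keyword term.

```python
import json

MOTIF = "TATA[AT]A[AT]"  # illustrative motif, not taken from the project

# Spark SQL: RLIKE is true when the regular expression is found
# anywhere inside the column value.
spark_sql = f"SELECT sample_id FROM sequences WHERE sequence RLIKE '{MOTIF}'"

# Elasticsearch: a regexp query has to match the entire indexed term,
# so the motif is wrapped in .* on both sides.
es_query = {
    "query": {
        "regexp": {
            "sequence": {"value": f".*{MOTIF}.*"}
        }
    }
}

print(spark_sql)
print(json.dumps(es_query, indent=2))
```

The difference in anchoring is the main thing to watch for when moving a pattern from one engine to the other.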

We wanted to highlight the potential of these open-source tools, which come in handy for mining patterns in unstructured data. Our experienced consultants are ready to take a deep dive into your business, navigate your data architecture and help you pick the right big data tools for your organization, making your data-driven journey scalable and cost-effective.

All rights reserved, copyright, MedDataXtract.com

Published By MedDataXtract

Originally published at https://www.linkedin.com

https://www.linkedin.com/pulse/genomics-datamodel-v-10-using-apache-spark-elastic-search-ashwin-b/

Barani Dakshinamoorthy

Founder, Data Integration, Innovation, Technology Driven professional. A Microsoft Certified Solutions Associate (MCSA) in the field of SQL Server development.