Build Data Analytics platform using Azure Databricks

Barani Dakshinamoorthy

--

Inspired by my previous article on Apache Spark, a Big Data framework used to process, query and analyze batch or real-time data at very high speed, I continued my voyage across big data analytics, keeping my focus mainly on Apache Spark.

Apache Spark has come a long way and has gained increasing acceptance in the Big Data community. Its services are bundled into Databricks, a Unified Analytics Platform that sits on top of Apache Spark, unifying data science, integration (engineering) and business. Databricks offers fully managed Spark clusters in the cloud, with support, services and additional features.

Databricks entered into a partnership with Microsoft in 2017, and with the adoption of the Unified Analytics Platform in 2018 came the debut of Microsoft Azure Databricks, a fully managed Platform-as-a-Service (PaaS) cloud platform.

Azure Databricks was built in collaboration with Microsoft to simplify building big data and AI solutions for Microsoft customers, by combining the best of Databricks and Azure. It integrates well with Azure data-related services such as Azure Cosmos DB, Azure SQL Database, Azure Blob storage, Azure SQL Data Warehouse and many more, enabling massive processing of both structured and unstructured data and opening up endless possibilities, all under one roof.

In this article, I run through the steps to set up the Azure Databricks platform and walk through the data preparation lifecycle that feeds datasets into ML (machine learning) models. The scope of this article is restricted to dataset preparation, enabling a quick start on Azure Databricks. For in-depth architecture and setup parameters, please refer to the Microsoft Azure Databricks page at https://azure.microsoft.com/en-us/services/databricks/

Setup

Once your environment is set up by signing up for an “Azure Free Trial Subscription”, you must change the subscription to “pay-as-you-go” via the profile settings in order to run clusters (vCPUs) in Azure Databricks.

From your Azure portal, search for “Azure Databricks” and create your first Databricks Workspace.

Launching the workspace should bring you to a new page, where the analytics magic happens.

This section offers four options, Cluster, Data, Workspace and Jobs, which together complete the data preparation lifecycle and produce datasets ready to be consumed for training ML models.

Cluster

A Databricks cluster is a set of computation resources on which you run data analytics workloads. Using this option, one configures and spins up a tailor-made cluster based on the project's needs.
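
For repeatable setups, the same cluster definition can also be created with the Databricks CLI; a minimal sketch, where the cluster name, runtime version, VM size and worker counts are illustrative assumptions:

# Hypothetical example: create a small autoscaling cluster via the Databricks CLI
# (values such as the name, Spark version and VM size are illustrative only)
databricks clusters create --json '{
  "cluster_name": "demo-analytics-cluster",
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "autoscale": { "min_workers": 2, "max_workers": 4 },
  "autotermination_minutes": 60
}'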

Data

Once your cluster is up and running, one can start uploading data using the “Data” option.

Upload File:

Using “Create table”, one can upload data into a Databricks table.

This creates a persistent table in the default database of your cluster, stored on blob storage. If you choose to terminate your cluster, the data remains available and is retrieved from blob storage when the cluster is turned on again.

By clicking on the table, one gets a preview of the table schema and its sample data.
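
The same preview can also be produced from a notebook cell; a minimal sketch, assuming the uploaded table was registered as diamonds in the default database:

# Hypothetical table name; replace with the table created via "Create table"
df = spark.table("default.diamonds")
df.printSchema()        # preview the schema
display(df.limit(10))   # preview sample rows; display() is built into Databricks notebooks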

DBFS:

Databricks File System (DBFS) is a distributed file system mounted into an Azure Databricks workspace and available to its clusters and notebooks. It is a decoupled data layer on top of Azure object storage.

Tip: 
The default storage location of DBFS root is dbfs:/root
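
From a notebook, DBFS can also be explored programmatically with the built-in dbutils utilities; a small sketch:

# List the contents of the DBFS root from a notebook
for file_info in dbutils.fs.ls("dbfs:/"):
    print(file_info.path, file_info.size)

# The %fs magic offers a shorthand for the same listing: %fs ls dbfs:/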

One can enable the “DBFS File browser” via the admin console settings > Advanced tab to easily browse through DBFS files. By default, this option is disabled.

Using the Databricks CLI, one can also upload files from “anywhere” to Databricks. One needs to authenticate access to DBFS by creating an access token under the user settings.

Using the DBFS CLI, one can automate the daily upload of raw data into the DBFS file system, which can then be processed using Azure Databricks Jobs.
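
As an illustration, the one-time token setup and a recursive upload could look like this (the local folder and DBFS target path are assumptions for the example):

# One-time setup: point the CLI at the workspace URL and paste the personal access token
databricks configure --token

# Copy a local folder of raw files into DBFS (paths are illustrative)
dbfs cp -r ./raw_data dbfs:/raw/daily --overwrite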

Azure Storage:

Another way to bring data into Azure Databricks is Azure Blob storage. This is where one can leverage the full potential of Azure data-related services by storing massive raw files and accessing them from Azure Databricks notebooks.

To use this feature, one needs to create a storage account via Azure portal.

Within the storage account, create storage containers to hold unstructured data.

To gain access to the Azure storage, one must create access policies and keys on the storage containers to authenticate and allow other systems to place files in these containers.
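
With the account key (or a SAS token) in hand, the container can be mounted into DBFS so notebooks see it as a regular path; a minimal sketch, where the storage account, container, mount point and secret scope names are placeholders:

# Mount an Azure Blob storage container into DBFS (names below are placeholders)
storage_account = "mystorageaccount"
container = "rawdata"

dbutils.fs.mount(
    source=f"wasbs://{container}@{storage_account}.blob.core.windows.net",
    mount_point="/mnt/rawdata",
    extra_configs={
        # The account key is best kept in a Databricks secret scope rather than in code
        f"fs.azure.account.key.{storage_account}.blob.core.windows.net":
            dbutils.secrets.get(scope="demo-scope", key="storage-account-key")
    }
)

dbutils.fs.ls("/mnt/rawdata")  # verify the mount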

Tip: 
Using Microsoft Azure Storage Explorer, one can easily navigate and manage files, just like Windows Explorer.

Similarly, one can set up Azure Data Lake, which is massively scalable and secure, for high-performance analytics workloads.

Workspace

Using the Workspace, one can start creating notebooks in different languages to perform specific tasks.

Tip: 
One can write snippets of code with the language defined at the beginning of a cell, such as %sql for SQL, %r for R or %scala for Scala.
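
For instance, a Python notebook can mix languages cell by cell; a small sketch:

# Cell 1: the notebook's default language (Python here) needs no magic command
df = spark.range(5)
display(df)

# Cell 2: switch languages per cell with a magic command, e.g.
# %sql
# SELECT current_date()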

Data preparation:

Once the data sources are defined, one can start building data pipelines across the various data storage systems, streamlining the data inputs for model building.

Extraction:

Using a Spark DataFrame, one can read data from various data sources. Below are a few examples of reads from two different data sources.

Reading a table from the DBFS root (dbfs:/)
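
Such a read could look like the following sketch; the table and file names are hypothetical, carried over from the earlier upload example:

# Option A: read the managed table created earlier (hypothetical name)
df_dbfs = spark.table("default.diamonds")

# Option B: read the underlying CSV file directly from the DBFS root (illustrative path)
df_dbfs = (spark.read
           .option("header", "true")
           .option("inferSchema", "true")
           .csv("dbfs:/FileStore/tables/diamonds.csv"))

df_dbfs.show(5)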

Once the data is extracted and cleansed, one may choose to persist it as files or store it as tables, ready to be consumed by analytical models.
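
Either route is a one-line write on the DataFrame; a sketch using illustrative names:

# Persist the DataFrame read above as Parquet files in DBFS ...
df_dbfs.write.mode("overwrite").parquet("dbfs:/curated/diamonds_parquet")

# ... or register it as a managed table, queryable from any notebook on the cluster
df_dbfs.write.mode("overwrite").saveAsTable("default.diamonds_curated")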

Reading data from Blob storage
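
Instead of mounting, a container can also be read directly over the wasbs:// protocol; a sketch with placeholder account, container and file names:

# Configure the storage account key for this Spark session (placeholder names)
spark.conf.set(
    "fs.azure.account.key.mystorageaccount.blob.core.windows.net",
    dbutils.secrets.get(scope="demo-scope", key="storage-account-key"))

# Read a CSV file straight from the Blob container
df_blob = (spark.read
           .option("header", "true")
           .csv("wasbs://rawdata@mystorageaccount.blob.core.windows.net/2020/sales.csv"))

df_blob.show(5)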

Ingestion:

One may choose to ingest data into any of the Azure database services. Below is an example of ingesting data, extracted from blob storage, into a Databricks database.
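
A sketch of such an ingestion, with all names as placeholders: read the raw files from the (mounted) blob container, apply light cleansing, and write the result into a Databricks database.

from pyspark.sql import functions as F

# Extract raw files from the mounted Blob container (placeholder path)
raw = spark.read.option("header", "true").csv("/mnt/rawdata/sales/")

# Light cleansing before ingestion
clean = (raw.dropDuplicates()
            .withColumn("load_date", F.current_date()))

# Ingest into a Databricks database as a managed table
spark.sql("CREATE DATABASE IF NOT EXISTS analytics")
clean.write.mode("append").saveAsTable("analytics.sales_daily")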

Tip: 
For larger data workloads, one can use Azure Data Factory (ADF), which connects to Databricks notebooks via linked services and runs them as ADF data pipelines (a pipeline is a logical grouping of activities that together perform a task).
Azure Data Factory (ADF) is an ETL tool used for data transformation, integration and orchestration across several different Azure services in the cloud.

Jobs

Databricks notebooks can be scheduled as jobs to automate data analytics workloads. Below is a job that ingests data into a Databricks database.
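
The equivalent job can also be defined through the Databricks CLI rather than the UI; a sketch, where the notebook path, cluster ID and cron schedule are placeholders:

# Schedule the ingestion notebook as a daily job (all values are placeholders)
databricks jobs create --json '{
  "name": "daily-sales-ingest",
  "existing_cluster_id": "0101-123456-abcd123",
  "notebook_task": { "notebook_path": "/Users/me@example.com/ingest_sales" },
  "schedule": {
    "quartz_cron_expression": "0 0 6 * * ?",
    "timezone_id": "UTC"
  },
  "email_notifications": { "on_failure": ["me@example.com"] }
}'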

One can debug jobs using the Spark UI, logs and metrics, accessible via the job run history.

Tip: 
One can also set up an email alert via the job's Advanced options.

Finally, once the automated datasets are in place, the “centralized model store” can start consuming data to train MLflow models.

Tip: 
An MLflow run is a collection of parameters, metrics, tags and artifacts associated with a machine learning model training process.
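
A minimal sketch of recording such a run with the MLflow tracking API (the parameters, metric and tags are purely illustrative):

import mlflow

# Track one training attempt as an MLflow run (values are illustrative)
with mlflow.start_run(run_name="sales-forecast-baseline"):
    mlflow.log_param("algorithm", "linear_regression")
    mlflow.log_param("train_rows", 100000)
    mlflow.log_metric("rmse", 12.7)
    mlflow.set_tag("dataset", "analytics.sales_daily")
    # Artifacts such as plots or files can be attached with mlflow.log_artifact(...)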

Conclusion

Azure Databricks provides a Unified Analytics Platform on fully managed, scalable and secure cloud infrastructure that bridges the divide between big data and machine learning. It features interactive exploration together with a complete ML DevOps model life cycle, right from experimentation to production.

It reduces operational complexity and total cost of ownership, enabling organizations to succeed with their AI initiatives, thereby accelerating innovation and shortening time to value. Finally, leaving you with this wonderful quote …

“There is no passion to be found playing small — in settling for a life that is less than the one you are capable of living.” — Nelson Mandela

Published By

Barani Dakshinamoorthy

Originally published at https://www.linkedin.com

