Data Integration using Apache Spark

Barani Dakshinamoorthy
5 min read · Aug 7, 2020

Apache Spark is a Big Data framework used for processing, querying, and analyzing batch or real-time data at very high speed.

This is achieved through Spark's fundamental data structure, the RDD (Resilient Distributed Dataset): an immutable collection of data objects, divided into logical partitions and distributed across the nodes of the cluster. This parallelizes the data processing and, combined with in-memory processing, makes the whole orchestration even more efficient and faster. Spark applications can be written in Python, R, and Scala. Spark integrates with the Hadoop ecosystem and has several components, which I am not covering in this article.

Setup

In my example, I am using standalone Spark on localhost. The push and pull of data from the various endpoints is illustrated in the diagram above. Below are the downloads needed to set up Spark, including the Microsoft JDBC driver for SQL Server connectivity and the AWS JAR files to connect to an AWS S3 bucket.

Once you have set up Spark on your localhost, you are ready to launch a Spark session.

The Spark session is the entry point for programming Spark applications and for interacting with the Dataset and DataFrame APIs provided by Spark. I am using the default Spark context (sc) throughout this example.
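Below is a minimal sketch of how such a session can be created in PySpark on a standalone/local setup; the application name, master URL, and jar paths (for the MS SQL JDBC driver and the AWS/Hadoop S3 libraries) are placeholders to be adapted to your own downloads.

```python
from pyspark.sql import SparkSession

# Minimal sketch: a local standalone session with the connector jars attached.
# Jar file names below are placeholders for my local downloads.
spark = (
    SparkSession.builder
    .appName("data-integration-demo")
    .master("local[*]")  # standalone Spark on localhost, using all available cores
    .config(
        "spark.jars",
        "/opt/spark/jars/mssql-jdbc.jar,"
        "/opt/spark/jars/hadoop-aws.jar,"
        "/opt/spark/jars/aws-java-sdk.jar",
    )
    .getOrCreate()
)

sc = spark.sparkContext  # the default Spark context used throughout this example
```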

The default parallelism is set based on the number of cores of the local machine.
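For instance, the value can be inspected directly from the Spark context:

```python
# On local[*] this typically equals the number of cores on the machine.
print(sc.defaultParallelism)
```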

Data Orchestration — Extract data (READ)

SQL: Once you have set up the JDBC driver for MS SQL Server, you can connect to a SQL database and extract data.
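A hedged sketch of such a JDBC read is shown below; the server, database, table, and credentials are illustrative placeholders.

```python
# Read a SQL Server table into a DataFrame over JDBC.
# URL, table and credentials are placeholders.
jdbc_url = "jdbc:sqlserver://localhost:1433;databaseName=SalesDB"

df_sql = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.Customers")  # table (or subquery alias) to extract
    .option("user", "spark_user")
    .option("password", "***")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .load()
)
df_sql.show(5)
```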

Create a temp view if you intend to do simple data transformations.

Spark SQL allows you to use SQL-like queries to access, filter, and transform data through the table view, as shown below.
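For example (the view, column, and table names here are purely illustrative):

```python
# Register the extracted DataFrame as a temporary view ...
df_sql.createOrReplaceTempView("customers")

# ... and query it with plain SQL.
df_filtered = spark.sql("""
    SELECT Country, COUNT(*) AS customer_count
    FROM customers
    WHERE IsActive = 1
    GROUP BY Country
""")
df_filtered.show()
```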

Local Folder (CSV): Read data from a CSV file in a local folder.
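A simple sketch of such a local read; the path and options are illustrative.

```python
# Read a CSV file from the local file system into a DataFrame.
df_csv = (
    spark.read
    .option("header", "true")       # first line contains column names
    .option("inferSchema", "true")  # let Spark infer the column types
    .csv("file:///data/input/customers.csv")  # placeholder path
)
df_csv.printSchema()
```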

AWS S3 BUCKET: To access an AWS S3 bucket directly from Spark, you need to set up AWS credentials using the Hadoop Credential Provider.
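One common way to do this from PySpark is to set the S3A properties on the Hadoop configuration of the Spark context, as sketched below; the key values are placeholders, and in practice they can be resolved from a Hadoop credential provider (JCEKS store) rather than hard-coded.

```python
# Pass AWS credentials to the S3A connector via the Hadoop configuration.
# The values are placeholders; prefer a Hadoop credential provider over
# hard-coding secrets in the script.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>")
hadoop_conf.set("fs.s3a.secret.key", "<AWS_SECRET_ACCESS_KEY>")
```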

Hadoop's S3 connectors work with three different URI (Uniform Resource Identifier) schemes, namely s3n, s3a, and s3. The s3 scheme is a block-based overlay on top of Amazon S3, while s3n and s3a are object-based. s3n supports objects up to 5 GB in size, while s3a supports objects up to 5 TB and has higher performance. In this example, I am using the s3a interface.

Every request made to the Amazon S3 API must be signed to ensure its authenticity. Some regions do not support Signature Version 2 (SigV2), in which case you must use Signature Version 4 (SigV4). My AWS data region is Europe (Frankfurt), which only supports SigV4, so this property has to be enabled at the system (JVM) level.
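A sketch of what this looks like, assuming the bucket lives in eu-central-1 and the credentials above are already in place; the bucket name, path, and exact option names may vary with other Hadoop/AWS SDK versions.

```python
# Point the S3A connector at the regional endpoint for Frankfurt.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")

# The SigV4 switch itself is a JVM system property, typically passed at
# session/submit time, e.g. in spark-defaults.conf or on spark-submit:
#   spark.driver.extraJavaOptions   = -Dcom.amazonaws.services.s3.enableV4=true
#   spark.executor.extraJavaOptions = -Dcom.amazonaws.services.s3.enableV4=true

# With credentials and endpoint configured, read directly via the s3a:// scheme.
df_s3 = (
    spark.read
    .option("header", "true")
    .csv("s3a://my-demo-bucket/input/customers.csv")  # placeholder bucket/key
)
df_s3.show(5)
```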

Data Orchestration — Ingest data (WRITE)

SQL: Ingest data into a SQL Server database.
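A hedged sketch of a JDBC write back into SQL Server, reusing the connection details from the read example; the target table name is a placeholder.

```python
# Write the transformed DataFrame into a SQL Server table over JDBC.
(
    df_filtered.write.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.CustomerCountsByCountry")  # placeholder target table
    .option("user", "spark_user")
    .option("password", "***")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .mode("append")  # or "overwrite" to replace the table contents
    .save()
)
```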

Local Folder (CSV): Save data in the desired format, e.g. as a CSV file, into your local folder.
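For example (the output path is illustrative); note that Spark writes a folder of part files rather than a single CSV file.

```python
# Write the DataFrame to a local folder as CSV (a directory of part files).
# Use .coalesce(1) beforehand if a single output file is needed.
(
    df_filtered.write
    .option("header", "true")
    .mode("overwrite")
    .csv("file:///data/output/customer_counts")  # placeholder path
)
```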

AWS S3 BUCKET: Write data into an S3 bucket.
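With the S3A credentials and endpoint configured as above, the write is symmetrical to the read; the bucket and prefix are placeholders.

```python
# Write the DataFrame to the S3 bucket through the s3a:// scheme.
(
    df_filtered.write
    .option("header", "true")
    .mode("overwrite")
    .csv("s3a://my-demo-bucket/output/customer_counts")  # placeholder bucket/prefix
)
```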

One could bundle the lines of code and submit them as a batch job using the bin/spark-submit script, or hand them to a Spark Job Server, which offers an interface for managing Spark jobs. Using the Spark History Server, one could keep track of past jobs.

Apache Spark as a data integration tool, especially when you are dealing with data objects like structured and unstructured flat files, is definitely a hassle-free ride: you simply set up a cluster of multiple nodes. Because it is an in-memory data processing engine, it is claimed to be lightning fast. It is an open-source big data framework ideal for both batch and real-time stream processing. My examples do not deal with big data volumes, which I would definitely like to try in the near future.

Together with MS SQL Server, Apache Spark should definitely be part of any analytical infrastructure journey, to get the best of both worlds and start leveraging actionable data insights for better business returns. Finally, leaving you with this wonderful note …

“Celebrate what you’ve accomplished, but raise the bar a little higher each time you succeed.” — Mia Hamm

Published by Barani Dakshinamoorthy

Originally published at https://www.linkedin.com/pulse/data-integration-using-apache-spark-barani-dakshinamoorthy/



Barani Dakshinamoorthy

Founder; a data integration, innovation, and technology-driven professional. A Microsoft Certified Solutions Associate (MCSA) in SQL Server development.