Data Governance using Azure Purview
Azure Purview is a unified data governance service that lets you create a map of your data landscape. With it, one can quickly track and visualize data lineage across the organization.
A typical organization has data in many forms, such as files, databases (tables), ML models and reports, which reside in many places: on-prem, SaaS, cloud and so on. Azure Purview automatically discovers and classifies data sources without moving the data. The discovered data is then indexed and brought together into a “Unified Data Map”. The user interface provides a rich experience that lets data producers, consumers and stewards collaborate easily with each other.
In this article, we walk through an Azure Purview instance, register and scan a new data source, and explore the metadata published to the Purview Data map. The Data map is an intelligent graph describing all the data across the data estate.
Setup
Once your environment is set up by signing up for an “Azure Subscription”, one can use a Purview account to create an end-to-end view of the data with the help of the Purview Data map.
To get started, one just needs to follow the 5 steps described below.
1. Register
One can register data sources to connect to on-prem and multi-cloud systems such as Amazon S3, Azure SQL Server, SAP, Teradata etc. Purview allows you to register multiple data sources at once. It organizes them into collections and visualizes them as a tree view, so that the “scan and classification” settings can be set once at the root level of a collection and are then applied to all the sources underneath.
It also works with unstructured and semi-structured data.
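The idea of setting options once at the root of a collection and having them apply to everything underneath can be modelled in a few lines. This is a minimal sketch and not Purview's implementation: the `Collection` class, the setting names and the source names are all hypothetical, and the ability of a child collection to override an inherited value is an assumption of the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Collection:
    """A node in a hypothetical collection tree (not a Purview SDK type)."""
    name: str
    parent: Optional["Collection"] = None
    # Settings declared directly on this node; anything missing comes from the parent.
    settings: Dict[str, str] = field(default_factory=dict)
    sources: List[str] = field(default_factory=list)

    def effective_settings(self) -> Dict[str, str]:
        """Merge settings from the root downwards, so nearer nodes win."""
        inherited = self.parent.effective_settings() if self.parent else {}
        return {**inherited, **self.settings}


# Settings are declared once, on the root collection.
root = Collection("Contoso", settings={"scan_rule_set": "default", "classify": "on"})
finance = Collection("Finance", parent=root, sources=["AzureSqlFinance"])
# Assumption: a child may override a single inherited value.
archive = Collection("Archive", parent=finance, settings={"classify": "off"},
                     sources=["S3ColdStorage"])

print(finance.effective_settings())
print(archive.effective_settings())
```

Here `finance` declares nothing itself and simply picks up the root settings, which is the behaviour described above.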
Tip:
Purview is fully integrated with Azure Monitor, so one can set up alerts and monitor the health of services and scans.
2. Scan
Once you have registered your data source, Purview scans it to extract a wide range of metadata, such as technical and operational metadata and lineage, and applies classifications to it. All of this happens without moving the data itself. The scans run serverless, which means one pays only for what is used. All the metadata found during the scan is published to the Azure Purview Data map.
Using Azure Key Vault, one can extend the access policies to the Purview instance, which helps it connect to data sources such as Azure SQL Server.
Using credentials, Purview connects to external sources.
Once the connection is established, one can customize the asset scans; these customizations become part of the current and future scans.
The completed scan produces an overview of the assets found, which are then added to the Data map.
Tip:
The Data map is exposed through Apache Atlas open APIs, which make it possible to pull metadata programmatically from any data system. This helps expand the Data map to numerous data sources. Apache Atlas provides open metadata management and governance capabilities for organizations to build a catalog of their data assets, classify and govern these assets, and provide collaboration capabilities around these data assets for data scientists, analysts and the data governance team. https://atlas.apache.org
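To make the "pull metadata programmatically" part concrete, here is a small sketch that only builds request URLs for two Apache Atlas v2 REST endpoints (get an entity by GUID, and basic search). It makes no network call. The base URL is a placeholder, and the path prefix in front of `/api/atlas/v2` as well as the authentication scheme depend on the service hosting the API, so treat those as assumptions to check against your own environment.

```python
from urllib.parse import quote, urlencode

# Placeholder base URL; replace with the endpoint of your own Atlas-compatible service.
BASE_URL = "https://example-catalog.invalid"


def entity_by_guid_url(guid: str) -> str:
    """Build the URL for the Atlas v2 'get entity by GUID' endpoint."""
    return f"{BASE_URL}/api/atlas/v2/entity/guid/{quote(guid, safe='')}"


def basic_search_url(query: str, type_name: str, limit: int = 10) -> str:
    """Build the URL for the Atlas v2 basic search endpoint."""
    params = urlencode({"query": query, "typeName": type_name, "limit": limit})
    return f"{BASE_URL}/api/atlas/v2/search/basic?{params}"


print(entity_by_guid_url("abc-123"))
print(basic_search_url("customer", "hive_table"))
```

An HTTP client of your choice can then issue a GET against these URLs with whatever credentials the hosting service requires.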
3. Classification rules
One can use more than 100 pre-defined classification types, which are applied to the dataset fully automatically during the scan.
One can even define custom classification rules specific to the organization.
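A custom classification rule can be thought of as a pattern plus a threshold: if enough sampled values in a column match the pattern, the column gets the label. The sketch below illustrates that idea only. The employee ID format, the function name and the 60% default threshold are assumptions made for the example, not values taken from Purview.

```python
import re
from typing import Iterable

# Hypothetical rule: an employee ID is "EMP-" followed by six digits.
EMPLOYEE_ID = re.compile(r"EMP-\d{6}")


def classify_column(values: Iterable[str], pattern: re.Pattern,
                    min_match_pct: float = 60.0) -> bool:
    """Return True when enough non-empty sampled values fully match the pattern."""
    sample = [v for v in values if v]
    if not sample:
        return False
    hits = sum(1 for v in sample if pattern.fullmatch(v))
    return 100.0 * hits / len(sample) >= min_match_pct


ids = ["EMP-000123", "EMP-004567", "n/a", "EMP-998877"]
names = ["Ada", "Grace", "EMP-000001", "Linus"]
print(classify_column(ids, EMPLOYEE_ID))    # 3 of 4 values match (75%)
print(classify_column(names, EMPLOYEE_ID))  # 1 of 4 values match (25%)
```

Using a percentage rather than requiring every value to match keeps the rule tolerant of a few dirty values such as "n/a".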
4. Schedule scan or run once
Data changes continuously. To capture all the changes, the scan needs to run periodically so that one has an accurate and up-to-date understanding of the data.
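As a rough illustration of what a recurring schedule means in practice, the following sketch computes the next few run times for a scan that repeats at a fixed interval. The function name and the weekly example are assumptions for illustration; the actual scheduling is done by the service, not by code like this.

```python
from datetime import datetime, timedelta
from typing import List


def next_runs(start: datetime, every: timedelta, after: datetime,
              count: int = 3) -> List[datetime]:
    """Return the first `count` run times on the start + k * every grid
    that fall strictly later than `after`."""
    if every <= timedelta(0):
        raise ValueError("interval must be positive")
    if after < start:
        first = start
    else:
        # Number of whole intervals already elapsed, then step to the next one.
        first = start + ((after - start) // every + 1) * every
    return [first + i * every for i in range(count)]


# Hypothetical weekly scan, first run on Monday 4 January 2021 at 02:00.
weekly = next_runs(start=datetime(2021, 1, 4, 2, 0),
                   every=timedelta(days=7),
                   after=datetime(2021, 1, 20, 9, 30))
for run in weekly:
    print(run.isoformat())
```

A one-off scan is simply the degenerate case where only the first run is ever executed.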
5. Explore Metadata
Once the metadata is loaded into the Data map, everyone in the organization can easily search and browse the data using the Azure Purview Catalog experience.
Purview also provides a bird's-eye view that helps one quickly see where all the data resides in the organization. Using classifications, one can easily find where sensitive information resides across the organization. It also gives insights from SQL Server stored procedures by showing their lineage information.
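Lineage is essentially a directed graph, and "where did this data come from" is a walk over that graph. The sketch below shows the idea with a breadth-first traversal. The asset names, including the stored procedure `usp_build_summary`, are made up for the example and the graph is a plain dictionary, not data returned by Purview.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical lineage edges: each asset maps to the assets it is derived from.
UPSTREAM: Dict[str, List[str]] = {
    "sales_report": ["sales_summary"],
    "sales_summary": ["usp_build_summary"],
    "usp_build_summary": ["orders", "customers"],
    "orders": [],
    "customers": [],
}


def upstream_of(asset: str, edges: Dict[str, List[str]]) -> Set[str]:
    """Breadth-first walk collecting every direct and indirect upstream asset."""
    seen: Set[str] = set()
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for parent in edges.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(upstream_of("sales_report", UPSTREAM)))
```

Reversing the edges gives the opposite question, namely which downstream reports are affected when a source table changes.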
Conclusion
An end-to-end view makes it easier to work with data. One can quickly find out where the data came from when making business-critical decisions. It also helps build new data-driven applications once the data location and lineage information are known.
Purview changes the game for data insights and management. Understanding the data is the most important step in efficient data governance. Purview creates data lineage views, groups data sources together, applies classifications to protect sensitive information and creates data catalogs. Finally, I leave you with this quote …
“A problem well-defined is a problem half-solved.” -Charles Kettering
Published By
Barani Dakshinamoorthy
Originally published at https://www.linkedin.com