Implementing Data Quality with Amazon Deequ & Apache Spark

Prosenjit Chakraborty
9 min read · Nov 26, 2019

Data quality is an important aspect of any data ingestion. In a big data scenario this becomes especially challenging given the high volume, velocity & variety of the data. Incomplete or incorrect data can lead to more false predictions from a machine learning algorithm, we may lose opportunities to monetize our data because of data issues, and the business can lose confidence in the data.

Apache Spark has become the de facto technology for big data ingestion & transformation. It becomes even more robust with the managed service provided by Databricks and data pipelines built in Azure Data Factory v2 (other variations exist as well).

If I want to inject a data quality validation black box into my data ingestion pipeline, the pipeline will probably look like the diagram below. Once the data lands in my raw or staging zone, I may want to pass it through the black boxes before I consume or transform it.

A typical data ingestion pipeline with data quality & other functionalities (data lake layers have not been shown).

The data flow may differ based on our use cases; however, for this blog we’ll concentrate on designing only the Data Quality & Bad Record Identification black boxes.
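As a preview of what the Data Quality black box will do, here is a minimal sketch using Deequ’s VerificationSuite on a Spark DataFrame. The DataFrame and the column names (id, productName, totalAmount) are placeholders chosen for illustration, not from an actual dataset in this post.

```scala
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import org.apache.spark.sql.DataFrame

// Hypothetical DataFrame read from the raw/staging zone;
// the column names below are placeholders for illustration.
def runQualityChecks(rawData: DataFrame): Boolean = {

  val verificationResult = VerificationSuite()
    .onData(rawData)
    .addCheck(
      Check(CheckLevel.Error, "Basic data quality checks")
        .isComplete("id")              // no null ids
        .isUnique("id")                // no duplicate ids
        .isComplete("productName")     // no null product names
        .isNonNegative("totalAmount")) // amounts should not be negative
    .run()

  // Let the data flow further down the pipeline only if all checks pass
  verificationResult.status == CheckStatus.Success
}
```

The idea is simply that the black box returns a pass/fail signal the orchestrator (for example, Azure Data Factory v2) can branch on; the specific checks would of course be driven by your own data and rules.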
