Implementing Data Quality with Amazon Deequ & Apache Spark
Data quality is an important aspect whenever we ingest data. In a big data scenario this becomes very challenging given the high volume, velocity & variety of the data. Incomplete or wrong data can lead to more false predictions by a machine learning algorithm, we may lose opportunities to monetize our data because of data issues, and the business can lose confidence in the data.
Apache Spark has become the de facto technology for big data ingestion & transformation. This becomes even more robust with the managed service provided by Databricks and data pipelines built in Azure Data Factory v2 (other variations exist as well).
If I want to inject a data quality validation black box into my data ingestion pipeline, the pipeline will probably look like the diagram below. Once the data lands in my raw zone or staging zone, I may want to pass it through the black boxes before I consume or transform it.
The data flow may differ based on the use case; however, for this blog we’ll concentrate only on designing the Data Quality & Bad Record Identification black boxes.
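To make the idea of the Data Quality black box concrete, here is a minimal sketch of the kind of check it could run with Deequ on a Spark DataFrame. The DataFrame `rawDf` and the column names (`id`, `order_id`, `amount`) are illustrative assumptions, not part of any specific pipeline; the Deequ calls themselves (`VerificationSuite`, `Check`, `CheckLevel`, `CheckStatus`) come from the awslabs/deequ library.

```scala
// Sketch of a Data Quality black box step, assuming an illustrative raw-zone DataFrame.
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import org.apache.spark.sql.SparkSession

object DataQualityBlackBoxSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("data-quality-black-box")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Illustrative data; in a real pipeline this would be read from the raw/staging zone.
    val rawDf = Seq(
      (1, "order-1", 120.50),
      (2, "order-2", 75.00),
      (3, null.asInstanceOf[String], -10.0) // incomplete / bad record
    ).toDF("id", "order_id", "amount")

    // Declare the quality constraints and run them with Deequ's VerificationSuite.
    val result = VerificationSuite()
      .onData(rawDf)
      .addCheck(
        Check(CheckLevel.Error, "raw zone checks")
          .isComplete("order_id")   // no nulls allowed
          .isUnique("id")           // primary-key style uniqueness
          .isNonNegative("amount")  // amounts must be >= 0
      )
      .run()

    // The black box would let the batch move on only when all checks pass.
    if (result.status == CheckStatus.Success) {
      println("Data quality checks passed; batch can proceed to transformation.")
    } else {
      println("Data quality checks failed; batch should be held back for inspection.")
    }

    spark.stop()
  }
}
```

In this sketch the black box is just a pass/fail gate on the whole batch; the Bad Record Identification black box, covered later, is where individual offending rows would be separated out.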