Transforming Aggregate Data Models into Relational Models using Apache Spark

Though NoSQL databases and aggregate data models have become much more popular nowadays, an aggregate data model has a more complex structure than a relational model. If I have to run analytics, it is easier to keep the data in tabular structures, as in a relational model.

Apache Spark can convert a complex XML or JSON structure (a popular way to store aggregates) into a flat tabular structure very easily.

In this blog, we'll see how we can perform that transformation.

As an example, we have considered a sample order structure that contains buyer contact info, shipping address, shipping price, and line items with ordered products and packaging details (shown below).
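The sketch below illustrates such an order. All element and field names here (orderId, buyer, lineItems, packages and so on) are assumptions made for this walkthrough; a real feed would follow the supplied XSD.

```xml
<orders>
  <order>
    <orderId>1001</orderId>
    <buyer>
      <name>Jane Doe</name>
      <email>jane.doe@example.com</email>
    </buyer>
    <shippingAddress>
      <street>1 Main Street</street>
      <city>Springfield</city>
      <zip>12345</zip>
    </shippingAddress>
    <shippingPrice>
      <amount>9.99</amount>
      <currency>USD</currency>
    </shippingPrice>
    <lineItems>
      <lineItem>
        <productId>P-1</productId>
        <quantity>2</quantity>
      </lineItem>
      <lineItem>
        <productId>P-2</productId>
        <quantity>1</quantity>
      </lineItem>
    </lineItems>
    <packages>
      <package>
        <packageId>PKG-1</packageId>
        <packageLineItems>
          <packageLineItem>
            <productId>P-1</productId>
            <quantity>2</quantity>
          </packageLineItem>
          <packageLineItem>
            <productId>P-2</productId>
            <quantity>1</quantity>
          </packageLineItem>
        </packageLineItems>
      </package>
    </packages>
  </order>
</orders>
```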

We have represented the example XML structure graphically below, along with how the elements will be mapped into multiple tables:

We can even maintain enforced or logical relationships between the resulting tables.

Library used

For this example, we have used the spark-xml library (Maven coordinate: com.databricks:spark-xml_2.12:0.5.0) for parsing and querying XML data. Check https://github.com/databricks/spark-xml for the latest version.
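A minimal read sketch; the SparkSession setup and file path are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("orders-to-tables").getOrCreate()

// rowTag tells spark-xml which element maps to one DataFrame row;
// here every <order> becomes a row.
val ordersDf = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "order")
  .load("orders.xml")
```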

JSON files can be read in a similar way; refer to the Spark documentation for the appropriate method.
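For instance, a pretty-printed JSON file of the same aggregates could be loaded with the built-in reader (a sketch; the path is illustrative):

```scala
// multiLine is needed when each JSON record spans several lines,
// as pretty-printed aggregates usually do.
val ordersJsonDf = spark.read
  .option("multiLine", "true")
  .json("orders.json")
```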

Defining the Schema

When Spark reads records, especially from a semi-structured or unstructured source, we should be very careful about the schema. If we let Spark infer the schema, the generated schema may differ depending on the input file.

As an example, a <packages> element can contain multiple <package>s if the products can't be grouped into a single <package>: perhaps two products come from the same warehouse, whereas a third product is delivered directly by the individual seller.

So, if we load an 'orders' XML having only one <package> inside every <packages>, e.g.
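(abbreviated fragment, reusing the assumed element names)

```xml
<packages>
  <package>
    <packageId>PKG-1</packageId>
    <!-- ... -->
  </package>
</packages>
```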

and we let Spark infer the schema, it will be inferred as below:
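With the assumed sample, the abridged printSchema() output would look roughly like this; package comes out as a plain struct:

```
root
 |-- orderId: long (nullable = true)
 ...
 |-- packages: struct (nullable = true)
 |    |-- package: struct (nullable = true)
 |    |    |-- packageId: string (nullable = true)
 |    |    |-- packageLineItems: struct (nullable = true)
 ...
```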

Now, we load an 'orders' XML with multiple <package>s inside every <packages>, e.g.
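(again abbreviated)

```xml
<packages>
  <package>
    <packageId>PKG-1</packageId>
    <!-- ... -->
  </package>
  <package>
    <packageId>PKG-2</packageId>
    <!-- ... -->
  </package>
</packages>
```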

and again we let Spark infer the schema; this time Spark infers it differently. Note that inside packages: struct there is now a package: array:
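Roughly, for the assumed sample:

```
root
 |-- orderId: long (nullable = true)
 ...
 |-- packages: struct (nullable = true)
 |    |-- package: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- packageId: string (nullable = true)
 |    |    |    |-- packageLineItems: struct (nullable = true)
 ...
```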

Now, if we create a schema ourselves (which should be based on the supplied XSD), as sketched below,
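(A sketch for the assumed sample; the real field names and types would mirror the XSD.)

```scala
import org.apache.spark.sql.types._

// A product line, reused for both <lineItem> and <packageLineItem>.
val lineItemType = StructType(Seq(
  StructField("productId", StringType),
  StructField("quantity", LongType)))

// <package> is always an array element here, even if a given file
// contains only one, so the loaded schema never varies.
val packageType = StructType(Seq(
  StructField("packageId", StringType),
  StructField("packageLineItems", StructType(Seq(
    StructField("packageLineItem", ArrayType(lineItemType)))))))

val orderSchema = StructType(Seq(
  StructField("orderId", LongType),
  StructField("buyer", StructType(Seq(
    StructField("name", StringType),
    StructField("email", StringType)))),
  StructField("shippingAddress", StructType(Seq(
    StructField("street", StringType),
    StructField("city", StringType),
    StructField("zip", StringType)))),
  StructField("shippingPrice", StructType(Seq(
    StructField("amount", DoubleType),
    StructField("currency", StringType)))),
  StructField("lineItems", StructType(Seq(
    StructField("lineItem", ArrayType(lineItemType))))),
  StructField("packages", StructType(Seq(
    StructField("package", ArrayType(packageType)))))))
```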

and ask Spark to use this schema when loading either kind of XML (with a single or with multiple <package> elements inside every <packages>), the resulting DataFrame schema will always be the same.
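Applying it is one extra call on the reader (continuing the sketch above):

```scala
val ordersDf = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "order")
  .schema(orderSchema)   // no inference: one fixed schema
  .load("orders.xml")

ordersDf.printSchema()   // identical for both input variants
```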

Defining the schema can be far more complex for nested XML/JSON structures than for CSV/tabular ones!

The explode Function

explode is a very useful function that creates a new row for each element in the given array (or map) column.
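A tiny self-contained illustration:

```scala
import org.apache.spark.sql.functions.{col, explode}

// Two rows with arrays of 2 and 1 elements become 3 rows in total;
// the id column is repeated for every generated row.
val demo = spark.createDataFrame(Seq(
  (1, Seq("a", "b")),
  (2, Seq("c"))
)).toDF("id", "letters")

demo.select(col("id"), explode(col("letters")).alias("letter")).show()
// +---+------+
// | id|letter|
// +---+------+
// |  1|     a|
// |  1|     b|
// |  2|     c|
// +---+------+
```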

Now, as we’re ready with some knowledge let start transform the complex xml into multiple tables with relationships.

Extracting the Customer Information
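A sketch using the field names assumed in the sample above:

```scala
import org.apache.spark.sql.functions.col

// Flatten the nested buyer struct into a customer table,
// carrying orderId along as the (logical) foreign key.
val customerDf = ordersDf.select(
  col("orderId"),
  col("buyer.name").alias("customerName"),
  col("buyer.email").alias("customerEmail"))
```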

Extracting the Order Information
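Along the same lines, keeping only the scalar, order-level attributes (the nested collections become their own tables below):

```scala
val orderDf = ordersDf.select(
  col("orderId"),
  col("shippingAddress.street").alias("shipStreet"),
  col("shippingAddress.city").alias("shipCity"),
  col("shippingAddress.zip").alias("shipZip"))
```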

Extracting the Shipping Price Information
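Again assuming the sample's field names:

```scala
val shippingPriceDf = ordersDf.select(
  col("orderId"),
  col("shippingPrice.amount").alias("shippingAmount"),
  col("shippingPrice.currency").alias("currency"))
```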

Extracting the LineItem Information

This is not as straightforward as the previous ones. Note the lineItems struct and the lineItem array inside it.

Use the explode function as mentioned previously.
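A sketch: explode the lineItem array, then flatten each element, keeping orderId as the foreign key.

```scala
import org.apache.spark.sql.functions.{col, explode}

val lineItemDf = ordersDf
  .select(col("orderId"), explode(col("lineItems.lineItem")).alias("li"))
  .select(
    col("orderId"),
    col("li.productId").alias("productId"),
    col("li.quantity").alias("quantity"))
```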

Extracting the Package Information

As with lineItems, we'll use explode to extract each package from packages.
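A sketch with the assumed names:

```scala
val packageDf = ordersDf
  .select(col("orderId"), explode(col("packages.package")).alias("pkg"))
  .select(
    col("orderId"),
    col("pkg.packageId").alias("packageId"))
```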

Extracting the PackageLineItem Information
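Two levels of nesting need two explodes: one per package, then one per line item within each package. A sketch, still assuming the sample's names:

```scala
val packageLineItemDf = ordersDf
  .select(col("orderId"), explode(col("packages.package")).alias("pkg"))
  .select(
    col("orderId"),
    col("pkg.packageId").alias("packageId"),
    explode(col("pkg.packageLineItems.packageLineItem")).alias("pli"))
  .select(
    col("orderId"),
    col("packageId"),
    col("pli.productId").alias("productId"),
    col("pli.quantity").alias("quantity"))
```

Keeping both orderId and packageId in the final table preserves the logical relationships back to the order and package tables.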

Conclusion

So, we have seen how easily we can transform any complex aggregate model into a tabular structure, which can then be analyzed further using other Spark features or saved directly into a relational database.

Also, as mentioned, create the schema based on the aggregate model's own schema (e.g. the XSD) to avoid transformation failures or data type conflicts while saving.

Thanks for reading. In case you want to connect, please ping me via LinkedIn.
