Dealing with huge datasets can be complicated, especially when filtering the specific data needed for analytics. For instance, if you are building a predictive model for a particular airport or aircraft type, filtering ADS-B trajectories or FDM data out of your data lake can become very time consuming. Data management is as important as data analysis or machine learning model training: the way data is structured can slow down loading and processing, or increase the complexity of your cleaning code, among other issues.
In this post, we are going to discuss the emerging lakehouse architecture (first introduced by Databricks) and how it can improve not only data accessibility, but also processing speed. In particular, it takes advantage of Spark and saves costs in comparison with typical data architectures, such as warehouses or hybrid combinations of data lakes and warehouses. Data lakehouses, or simply lakehouses, emerged as a combination of data lakes and data warehouses. This new paradigm has become very popular recently, and tries to overcome the limitations of its predecessors. For years, data warehouses were the cornerstone of data management architectures, able to handle ever larger data volumes. However, they were tied to structured data, while companies started to collect other types of data (unstructured, semi-structured, etc.), more aligned with the 3 Vs of big data (volume, velocity, variety), which probably sounds familiar to you as it's a must-have slide in almost any big data conference.
About a decade ago, data lakes emerged to handle larger datasets coming from a variety of sources and in different formats (e.g. images, audio, text, video). However, these repositories of raw data lack some critical advantages of data warehouses, such as transactions, data quality enforcement and consistency. Companies require flexible, high-performance systems for diverse data applications, including SQL analytics, real-time monitoring, data science and machine learning. A common approach is to use multiple systems (a data lake, several data warehouses and other specialized systems), but this introduces complexity and delay, as data needs to be moved or copied between them. To address the limitations of data lakes, lakehouses arose, implementing data structures and data management features similar to those of data warehouses, but on the kind of low-cost storage used for data lakes.
As we already mentioned, Databricks pioneered this technology. They released Delta Lake as an open-source table storage layer over the objects stored in our cloud (e.g. S3 or Blob), enabling, among other features, transactions, versioning, and the ability to combine batch and streaming workloads in a data lake. The Delta layer solves most of the issues data lakes used to have, becoming an extremely powerful tool when implementing streaming data pipelines or serving machine learning models. Databricks' lakehouses follow a design pattern that delivers multiple layers of data quality and curation, using a three-tier table nomenclature:

- Bronze: raw data as ingested, with little or no processing and no enforced schema.
- Silver: filtered, cleaned and augmented data with a consistent schema, ready to be queried.
- Gold: business-level aggregates and curated tables, tailored to specific analyses or models.
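To make the Delta idea more concrete, here is a purely illustrative, stdlib-only toy model of how a Delta-style table works under the hood: data files plus an append-only transaction log of JSON commit files, which is what enables atomic writes and time travel. All names here are hypothetical; the real Delta Lake protocol is far richer (Parquet data files, checkpoints, optimistic concurrency, etc.).

```python
import json
import tempfile
from pathlib import Path

class TinyDeltaTable:
    """Toy sketch of a Delta-style table: data files plus an
    append-only _delta_log directory of JSON commit files."""

    def __init__(self, root):
        self.root = Path(root)
        self.log_dir = self.root / "_delta_log"
        self.log_dir.mkdir(parents=True, exist_ok=True)

    def _versions(self):
        return sorted(int(p.stem) for p in self.log_dir.glob("*.json"))

    def write(self, rows):
        """Commit a new batch of rows as one data file; the commit
        file in the log is what makes the write visible to readers."""
        versions = self._versions()
        version = versions[-1] + 1 if versions else 0
        data_file = self.root / f"part-{version:05d}.json"
        data_file.write_text(json.dumps(rows))
        commit = {"version": version, "add": data_file.name}
        (self.log_dir / f"{version:020d}.json").write_text(json.dumps(commit))
        return version

    def read(self, version=None):
        """Replay the log up to `version` (time travel) and load rows."""
        rows = []
        for v in self._versions():
            if version is not None and v > version:
                break
            commit = json.loads((self.log_dir / f"{v:020d}.json").read_text())
            rows.extend(json.loads((self.root / commit["add"]).read_text()))
        return rows

root = Path(tempfile.mkdtemp()) / "adsb"
table = TinyDeltaTable(root)
table.write([{"icao24": "406b45", "alt_ft": 35000}])
table.write([{"icao24": "a1b2c3", "alt_ft": 12000}])
print(len(table.read()))           # both commits visible -> 2
print(len(table.read(version=0)))  # time travel to the first commit -> 1
```

Because readers only see data referenced from committed log entries, a half-finished write is simply invisible, which is the essence of how Delta brings transactions to plain object storage.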
Let's process all this knowledge and try to apply some of these concepts to the aviation field. As we previously discussed in this blog, time series are probably the predominant type of dataset in the aviation industry. Although most sources can be consumed in tabular format, a consistent data architecture must work behind the scenes to select, filter, merge and serve subsets of this data for machine learning problems or dashboards. Let's go through some examples to better understand the potential of lakehousing in aviation.
Imagine that we are working on a project where surveillance data (e.g. ADS-B) is the main data source. We managed to collect three years of data (e.g. in JSON format) and store everything in the cloud, for instance using Amazon S3 or Azure Blob Storage. This would correspond to the Bronze layer: the schema isn't enforced, and data might have been ingested into our platform from an ADS-B stream, from batch queries against an external system, or as outright downloads. In this layer, we could also keep additional data sources (e.g. weather, flight plans).
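A minimal sketch of what Bronze-layer ingestion could look like under these assumptions: raw ADS-B messages are stored exactly as received, under date-stamped object keys that record provenance but enforce no schema. The key layout and field names below are hypothetical.

```python
import json
from datetime import datetime, timezone

def bronze_key(source, ingest_time, batch_id):
    """Build an S3/Blob-style object key for a raw batch.
    Nothing about the payload is validated at this layer."""
    d = ingest_time.date().isoformat()
    return f"bronze/{source}/ingest_date={d}/batch-{batch_id:06d}.json"

def to_bronze_payload(messages):
    """Serialize messages as-is, one JSON document per batch."""
    return json.dumps(messages)

raw = [
    {"icao24": "406b45", "lat": 51.47, "lon": -0.45, "ts": 1620000000},
    {"icao24": "406b45", "lat": 51.48, "lon": -0.44},  # fields may be missing
]
key = bronze_key("adsb", datetime(2021, 5, 3, tzinfo=timezone.utc), 17)
print(key)  # bronze/adsb/ingest_date=2021-05-03/batch-000017.json
```

Note that the second message is incomplete; that is fine in Bronze, where the goal is to lose nothing. Cleaning happens downstream.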
Our next step would be to structure and clean our data lake. Notice that, if we were following a legacy data architecture pattern, we would fall into the trap of creating an independent data warehouse to store the output of our data preparation pipeline: this would let our data scientists query the data using SQL, but would also increase system complexity and maintenance. Instead, we create a Silver layer in our data lake, partition the data by date (e.g. year, month and day), and save it in an efficient analytics format such as Parquet or Avro. At this stage, the data is ready to be consumed and queried. But rather than isolating it in a warehouse, in this scenario we would prefer lakehousing and benefit from Delta Lake technology. This enables warehousing capabilities in the Silver layer, implementing performant, mutable tables in our data lake and allowing queries to be run via SQL or directly through Spark for ETL development.
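The Silver step could be sketched as follows. In practice this would be a Spark job writing partitioned Delta/Parquet; here is a library-free sketch of the two essential ideas: enforce a schema (dropping malformed records) and derive the year/month/day partition columns that determine where each row lands. All field names are assumptions.

```python
from datetime import datetime, timezone
from typing import Optional

REQUIRED = ("icao24", "lat", "lon", "ts")

def to_silver(record: dict) -> Optional[dict]:
    """Enforce the Silver schema: required fields present, coordinates
    in range, partition columns derived. Returns None if malformed."""
    if any(k not in record for k in REQUIRED):
        return None
    if not (-90 <= record["lat"] <= 90 and -180 <= record["lon"] <= 180):
        return None
    t = datetime.fromtimestamp(record["ts"], tz=timezone.utc)
    return {**{k: record[k] for k in REQUIRED},
            "year": t.year, "month": t.month, "day": t.day}

def partition_path(row: dict) -> str:
    """Hive-style partition directory, as Spark's partitionBy lays out."""
    return (f"silver/adsb/year={row['year']}"
            f"/month={row['month']:02d}/day={row['day']:02d}")

bronze = [
    {"icao24": "406b45", "lat": 51.47, "lon": -0.45, "ts": 1620000000},
    {"icao24": "bad", "lat": 999.0, "lon": -0.44, "ts": 1620000060},  # dropped
]
silver = [r for r in (to_silver(m) for m in bronze) if r is not None]
print(len(silver), partition_path(silver[0]))
```

Partitioning by date is what later makes "give me May 2021 only" a cheap directory prune instead of a full scan of three years of data.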
Though we now have a huge amount of data in our system, we are still missing a crucial point: our specific data science interests. With the Silver layer as our single source of truth, we have complete freedom to carry out almost any data-driven analysis. For example, we could derive Gold tables containing only the trajectories for a given airport or a given aircraft type, ready to feed a predictive model or a dashboard.
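As a sketch of such a Gold derivation: filter the Silver rows for one airport and aggregate per flight. In a real pipeline this would be a Spark SQL GROUP BY over the Delta Silver table; the plain-Python version below only illustrates the shape of the transformation, and all field names are hypothetical.

```python
from collections import defaultdict

def gold_airport_table(silver_rows, airport):
    """Filter Silver rows for one destination airport and aggregate
    per flight: point count and maximum altitude. A stand-in for a
    Spark SQL GROUP BY over the Silver Delta table."""
    flights = defaultdict(lambda: {"points": 0, "max_alt_ft": 0})
    for row in silver_rows:
        if row["dest"] != airport:
            continue
        agg = flights[row["flight_id"]]
        agg["points"] += 1
        agg["max_alt_ft"] = max(agg["max_alt_ft"], row["alt_ft"])
    return dict(flights)

silver = [
    {"flight_id": "IB3170", "dest": "LEMD", "alt_ft": 12000},
    {"flight_id": "IB3170", "dest": "LEMD", "alt_ft": 18000},
    {"flight_id": "BA456", "dest": "EGLL", "alt_ft": 15000},
]
gold = gold_airport_table(silver, "LEMD")
print(gold)  # {'IB3170': {'points': 2, 'max_alt_ft': 18000}}
```

The resulting Gold table is small, model-ready and cheap to rebuild from Silver whenever the project's scope (airport, aircraft type, time window) changes.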
In conclusion, the emerging lakehousing paradigm, in combination with Delta Lake, offers a new world of possibilities in data architecture and management. Currently, both SMEs and big companies from other industries, such as Apple or Disney+, are adopting data lakehouses as their data architecture pattern. At DataBeacon we think that lakehouses can help aviation simplify existing data architecture designs or evolve legacy ones, save cloud costs, and enable data scientists in the aviation industry to better organize their data and consume larger datasets in a scalable, performant way.