What to do when your Data Lake grows out of control

Samuel Cristobal

2019-10-15 11:59:43
Reading Time: 2 minutes

In the past, most companies stored their data in relational databases distributed across departmental silos; e.g., marketing data and operations data were stored independently. For a while, that approach worked fine: departments ran separate projects and only accessed data within their own silos, sometimes keeping duplicates. However, as data analytics techniques have become more popular and accessible, analysts have started to request data across several departments, clashing with this restrictive model.

Today, with the rise of Machine Learning and Big Data (with variety being one of its key elements; read more here), it has become common to mix data from different departments to extract deeper insights. This can even include leveraging publicly accessible data or datasets that span multiple companies. Data storage, in turn, has become decentralized and distributed, commonly traversing whole organizations and beyond.

During this same period, companies started to value their data more, even before it had any use. Data is volatile: if it is not captured, it evaporates into thin air and disappears forever. With this concern in mind, most companies started capturing any kind of data, useful or not. This led to a switch from ETL (extract, transform and load) to ELT (extract, load and transform): since some data wasn't immediately useful, it was stored in its raw, original format without transformation, and only transformed later, when there was a clear, intended purpose for it.
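The difference is easiest to see in code. Below is a minimal Python sketch of the ELT pattern, where a local directory stands in for the lake and the names (the lake/raw path, transform_for_report) are purely illustrative:

```python
from pathlib import Path
import shutil

import pandas as pd  # assumed available for the illustrative transform step

RAW_ZONE = Path("lake/raw")  # placeholder path, not a real product layout


def extract_and_load(source_file: Path) -> Path:
    """E and L of ELT: copy the source into the lake untouched, no schema applied."""
    RAW_ZONE.mkdir(parents=True, exist_ok=True)
    target = RAW_ZONE / source_file.name
    shutil.copy(source_file, target)  # raw, original format; transformation deferred
    return target


def transform_for_report(raw_file: Path) -> pd.DataFrame:
    """T of ELT: transform only once a concrete purpose exists."""
    df = pd.read_csv(raw_file)
    # Hypothetical cleanup written for one specific analysis, when it is needed:
    return df.dropna().rename(columns=str.lower)
```

The key design choice is that extract_and_load never inspects the content: the cost of agreeing on a schema is paid only if and when someone actually needs the data.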

The combination of these two factors gave rise to Data Lakes. A Data Lake is simply the computer equivalent of a hodgepodge. The data is unstructured and formats need not be consistent, even within the same source; a lake may hold several versions of the same dataset alongside outright duplicates. Metadata is usually scarce, as the goal of a Data Lake is to store vast volumes and varieties of data in the hope that they might be put to use one day.
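To get a feel for how heterogeneous a lake can become, here is a small Python sketch (the root path and the duplicate heuristic are hypothetical) that counts formats and flags files that look like versions or duplicates of the same dataset:

```python
from collections import Counter
from pathlib import Path


def profile_lake(root: str) -> None:
    """Quick, illustrative look at how messy a lake really is."""
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    formats = Counter(p.suffix.lower() for p in files)
    names = Counter(p.stem for p in files)
    # Same base name, different extension or folder: likely versions or duplicates.
    dupes = {name: count for name, count in names.items() if count > 1}
    print("formats found:", dict(formats))
    print("possible duplicate datasets:", dupes)


profile_lake("lake")  # "lake" is a placeholder root directory
```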

As a result, fishing data out of a Data Lake can be a daunting, repetitive task. Data analysts in charge of Data Lakes face the same questions over and over: is there any relevant data available? Is it accessible? Does it have this or that field? Can it be recovered? Are there different versions of the same dataset? And so on.

Facing these recurring issues, analysts in turn ask: what if we used machine learning and crawlers to automatically explore, discover and classify the data inside our (messy) Data Lake? This line of thought led to Data Catalogs, and to the natural next step: making data available as a service, or DaaS (Data as a Service).
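As a rough illustration of the crawling idea, the Python sketch below walks a lake and records minimal metadata per file: the seed of a Data Catalog. The root path is a placeholder, schema sniffing is limited to CSV files, and a real catalog would add the ML-based classification, lineage and access control that this omits:

```python
import json
from pathlib import Path

import pandas as pd  # used only to sniff CSV schemas


def crawl(root: str) -> list:
    """Walk the lake and collect one metadata entry per file."""
    catalog = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        entry = {
            "path": str(path),
            "format": path.suffix.lstrip(".").lower(),
            "bytes": path.stat().st_size,
        }
        if entry["format"] == "csv":
            try:
                # Read a handful of rows just to discover the column names.
                entry["columns"] = list(pd.read_csv(path, nrows=5).columns)
            except Exception:
                entry["columns"] = None  # unreadable, but still worth cataloguing
        catalog.append(entry)
    return catalog


if __name__ == "__main__":
    print(json.dumps(crawl("lake"), indent=2))  # "lake" is a placeholder root
```

Once even this rudimentary catalog exists, the repetitive questions above become queries over metadata rather than expeditions into the lake itself.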

Let's face it: aviation is one of the fastest industries when it comes to travelling, but not when it comes to technology adoption. We are still far from general DaaS providers, but even at a slow pace, the industry is moving in the right direction. There is a premium on data platforms for aviation: Airbus and Boeing both have their own platforms as manufacturers, alongside operators like Lufthansa and authorities like EASA. Even among all these proprietary platforms, there is still space for independent data platforms like DataBeacon. We hope that one day a common Data Catalog for aviation will be available, and that all platforms will be able to communicate and share data with each other.

© datascience.aero