Can we feed predictive models with all the data collected?

Ernesto Gregori

2022-09-07 12:20:19
Reading Time: 3 minutes

About the amount of data generated

Over the last few years, data generation has increased dramatically, partly due to improvements in sensors and storage devices. In aviation, aircraft are equipped with devices that take all kinds of measurements. Virgin Atlantic's 787s, for example, can collect over a terabyte of data on a single flight.

This is good news, since Machine Learning models benefit from data availability. There are two main ways to improve a model: training it with more data and finding better hyperparameters. The latter option is computationally expensive and usually yields smaller gains than adding more data.

“The more information the better” may seem like a good rule of thumb, but following it blindly can cause problems. Although having more data is generally beneficial, some issues have to be taken care of. The main one is the curse of dimensionality, which we introduced in a previous entry of this blog.

A first example

Another common problem is data leakage: people tend to feed their models all the data available, hoping to obtain a better predictor, but the results can be misleading.

Let’s say that we want to estimate the wind energy production in a given region. To do so, we collect the region’s weather data for the past 10 years. We would probably find that wind-related features, such as direction and speed, allow us to make very accurate predictions. In fact, after training the model on the first 8 years of data, the metrics obtained on the test set look great.
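A chronological split like the one above can be sketched in a few lines. This is a minimal illustration; the dataset and its yearly aggregation are made up for the example:

```python
# Hypothetical dataset: one aggregated (year, avg_wind_speed, production)
# row per year, ordered chronologically. Real data would be hourly or finer.
rows = [
    (2013, 7.1, 310), (2014, 6.8, 295), (2015, 7.4, 330),
    (2016, 7.0, 305), (2017, 6.9, 300), (2018, 7.3, 325),
    (2019, 7.2, 318), (2020, 6.7, 290), (2021, 7.5, 335),
    (2022, 7.0, 308),
]

# Train on the first 8 years, test on the last 2 -- with temporally
# ordered data, the split must respect time, never be random.
train = [r for r in rows if r[0] < 2021]
test = [r for r in rows if r[0] >= 2021]

print(len(train), len(test))  # 8 2
```

Splitting randomly instead would scatter future years into the training set, which is itself a form of leakage.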

But this model is deceptive: the predictions were great because the dataset contained weather measurements that we only have in hindsight. Obviously, measured wind speed and direction are highly correlated with wind energy production. The model has suffered from data leakage: the information regarding wind speed and direction is available now, looking back, but would not be available at the moment the prediction actually has to be made.

In fact, if we tried to predict next week’s energy production with our model, we would run into a problem: we do not have precise weather data for the future. This can of course be solved by feeding the model weather forecast data instead, but performance would surely decrease, since forecasts are less accurate than measurements.
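The drop in performance can be illustrated with a small simulation. All numbers below are invented, and a simple linear relation between wind speed and production is assumed purely for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth: production is roughly linear in wind speed.
wind_speed = rng.uniform(3, 15, 1000)                    # measured wind (m/s)
production = 50 * wind_speed + rng.normal(0, 20, 1000)   # energy (kWh)

# Fit a linear model on the *measured* wind speed -- the leaky setup,
# since these measurements only exist after the fact.
a, b = np.polyfit(wind_speed, production, 1)

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

r2_measured = r2(production, a * wind_speed + b)

# In deployment we only have a *forecast* of the wind, which carries its
# own error, so the metric drops.
forecast = wind_speed + rng.normal(0, 2, 1000)
r2_forecast = r2(production, a * forecast + b)

print(f"R2 with measured wind: {r2_measured:.2f}")
print(f"R2 with forecast wind: {r2_forecast:.2f}")
```

The gap between the two scores is exactly the overestimation described below: the evaluation used data the deployed model will never have.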

Therefore, when building the model, we may have overestimated the algorithm’s capacity to predict the target value; that is, we think our model is better than it actually is.

A more dramatic example

In the previous case, we managed to reuse the trained model by finding another data source. The results were worse than estimated, but it worked with a minor change. However, there are situations where this problem is not so easy to solve.

When working in the aviation industry, the amount of data recorded is huge: as mentioned, aircraft take measurements constantly. Airlines can benefit from this: they can optimize their schedules, estimate the required fuel, and find potential conflicts. However, the data has to be treated carefully, since there is still a risk of leaking information from the test set into the training set, or from the present into the past, thereby inflating the models’ metrics.

Let’s say that we train a model with all the data recorded: pressure, altitude, fuel, and so on. All this information is typically recorded many times per second. Models trained with all the available data would probably give outstanding results for many tasks, for example predicting the arrival runway. But they cannot be generalized to future events, since they rely on measurements that have not happened yet. That is, such a model predicts the arrival runway only after it has seen all the aircraft’s parameters. So after training we end up with a predictor that has no utility in real life, because we cannot use it until all the parameters have been recorded; that is, after the aircraft has already landed.
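One way to avoid this kind of leakage is to fix the moment at which the prediction has to be made and only feed the model measurements recorded up to that point. A minimal sketch; the trace format and the numbers are simplified assumptions:

```python
# Hypothetical flight trace: (seconds_since_takeoff, parameter, value).
trace = [
    (0, "altitude", 0),
    (600, "altitude", 30000),
    (600, "fuel", 12000),
    (5400, "altitude", 2000),   # descending
    (5700, "altitude", 0),      # landed -- must NOT feed the predictor
]

def known_at(trace, prediction_time):
    """Keep only the measurements available when the prediction is made."""
    return [m for m in trace if m[0] <= prediction_time]

# Predicting the arrival runway 10 minutes after takeoff: the descent and
# landing measurements are excluded, as they would be in real operation.
print(len(known_at(trace, 600)))  # 3
```

Training and evaluating on these time-filtered snapshots gives metrics that reflect what the model can actually do before the aircraft lands.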

What to do in this situation

So we’ve realized that some of the data we have won’t be available when predicting new instances. Should we just remove that data and carry on with the rest? Well, it depends. In some cases, that is the only way to proceed. But discarding all that data is not always the optimal solution. In some cases, the data can be aggregated in a way that preserves part of its information while only using values known at prediction time. This part is mainly about getting creative and being familiar with the data, so as to extract as much useful information as possible.
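For example, a signal that is unavailable at prediction time can sometimes be replaced by an aggregate of its past values, which is always available. A hypothetical sketch using a trailing mean:

```python
def trailing_mean(values, window):
    """For each position i, the mean of the previous `window` values;
    None while there is not enough history. Only past values are used,
    so the resulting feature is available at prediction time."""
    out = []
    for i in range(len(values)):
        if i < window:
            out.append(None)
        else:
            out.append(sum(values[i - window:i]) / window)
    return out

# Wind speeds for 5 consecutive days (made-up numbers): the feature for
# day i uses days i-3 .. i-1, never day i itself.
print(trailing_mean([6.0, 7.0, 8.0, 6.5, 7.5], 3))
# [None, None, None, 7.0, 7.166666666666667]
```

The model loses the precise same-day measurement but keeps a leak-free summary of recent conditions, which is often enough to recover a useful share of the predictive power.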

Author: Ernesto Gregori