Tips to re-train Machine Learning models using post-COVID-19 data

Antonio Fernandez

2020-11-26 18:19:44


Data scientists and machine learning engineers across industries agree on the importance of ensuring data quality, and invest a lot of time in deciding which training dataset best defines the problem they are trying to solve. The trade-off between bias and variance is a well-known concern in almost any machine learning problem. Bias and variance are inherent properties that should be balanced when training a supervised machine learning algorithm, to avoid underfitting or overfitting. It is therefore crucial that the training dataset reflects the reality we aim to learn; otherwise, new observations won’t follow the learnt distribution and predictions on them will be unreliable.

The understanding of a given problem may evolve, or even change abruptly and in unprecedented ways, and that is exactly the situation we are living through today. Due to the COVID-19 pandemic, the world has changed dramatically. The way we interact with each other is completely different from a few months ago, and restrictions on mobility and social distancing have clearly reduced the predictability of activities based on human behaviour. These circumstances affect most industries and business areas, but particularly any data science problem that focuses on forecasting variables strongly correlated with human behaviour and lifestyle (e.g. demand, recommendations, profiling…). Of course, many common machine learning problems affected by this “new normality” belong to the transport and aviation industries.

Whether in development or production environments, almost any machine learning model that was trained on historical datasets from before the first COVID-19 outbreak, and whose predictions rely heavily on human behaviour, has experienced a noticeable decline in performance. In aviation in particular, we have seen a huge drop in air traffic volume (among other issues) as a consequence of the strict mobility restrictions between countries. But which machine learning case studies in aviation are affected by the “new normality”? Mostly predictive models that forecast demand, identify the precursors of delay, or predict runway occupancy time; basically, any scenario that tries to improve network congestion has been affected. However, models based on aircraft behaviour, such as fuel consumption or safety-of-operations use cases, should not have been affected, apart from the decrease in the number of observations collected.

There is a real dilemma around retraining the affected machine learning models, especially those already deployed in a production environment. Is it advisable to avoid using historical data collected since COVID-19 emerged? Is it preferable to forget about this period and wait to retrain these models, even if their performance has worsened? To be honest, although a slow recovery from this crisis is beginning, some things will probably never be the same, and datasets from before 2020 may no longer reflect reality for certain problems. Which techniques would be most suitable for adapting machine learning models to use “post-COVID” historical datasets in training pipelines?

Is it fine to drop all data for the year 2020?

Depending on the scenario, some companies and development teams agree that it is difficult for a model to learn the context created by government restrictions, which are a key variable in forecasts. Some decided to focus on Q4 2020 and 2021; others decided to remove the whole of 2020 from the training set. This strategy is based on the incorrect assumption that the world will return to exactly what it was before the pandemic. Even if the virus can be contained quickly, recovery will take time; removing all COVID-19-affected data from training is therefore a wrong decision built on an incorrect assumption.

Discover new underlying patterns in data

Unsupervised techniques or topological data analysis (TDA) could help to identify weaknesses or underlying patterns in data. Train your models to learn from anomalies and to better understand their impact, as COVID-19 is probably the biggest anomaly we have had in recent decades. Research through clustering or outlier detection lets the data take you on a journey of discovery, with the shape of the data revealing events, entities and trends that are unique, important and previously unknown. For instance, unsupervised methods might reveal new features that influence the recovery of air traffic demand, such as the virus recovery rate, or metrics from countries that are recovering faster than others (e.g. China).
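As a small illustration of the outlier detection idea above, the sketch below flags anomalous months in a traffic series using a robust z-score (median and median absolute deviation). It is only one simple option among many unsupervised methods, not the approach of any specific project; the function names, the 3.5 threshold and the monthly figures are assumptions made up for the example.

```python
from statistics import median


def robust_z_scores(values):
    """Score each value by its distance from the median, scaled by the MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All values (or most of them) are identical: nothing stands out.
        return [0.0 for _ in values]
    # 0.6745 makes the score comparable to a standard z-score for normal data.
    return [0.6745 * (v - med) / mad for v in values]


def flag_outliers(values, threshold=3.5):
    """Return the positions whose robust z-score exceeds the threshold."""
    scores = robust_z_scores(values)
    return [i for i, z in enumerate(scores) if abs(z) > threshold]


# Hypothetical monthly flight counts (illustrative numbers, not real data).
monthly_flights = [100, 98, 103, 101, 99, 102, 12, 15, 40, 97]
anomalies = flag_outliers(monthly_flights)
print(anomalies)
```

Using the median instead of the mean keeps the baseline stable even when several consecutive months are abnormal, which is the kind of situation a long disruption produces.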

You need to decompose the COVID-19 anomaly as much as possible to be able to build models that lead your business in the right direction during the recovery. Your model must be able to identify the decline in passenger demand and learn from both short-term and long-term historical data. In addition, it’s crucial to track the recovery process all over the world and identify which features are acting as catalysts of this process, so that you can apply them to your particular scenarios.

Improve the flexibility of your AI/ML solutions

Take time to monitor and improve your current models, especially if they are already in production. Always keep a clear view of how your data affects bias and variance and how your models perform as a result, and prepare a contingency plan in case performance degrades because the data changes. It’s the perfect time to include MLOps good practices in your data science workflows, so that you can make agile decisions and rapidly adjust your training dataset if data patterns suddenly change. This would make your machine learning models more sensitive to recent data, prioritizing more dynamic learning at least until the situation has fully stabilised.
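One simple way to make a model more sensitive to recent data, as suggested above, is to weight each training observation by its age. The sketch below uses an exponential decay with a configurable half-life; the function name, the 90-day half-life and the example dates are assumptions chosen for illustration. Many training APIs accept per-sample weights (for instance, the `sample_weight` argument of the `fit` method in many scikit-learn estimators), so weights like these could be passed there.

```python
from datetime import date


def recency_weights(observation_dates, reference_date, half_life_days=90.0):
    """Give each observation a weight that halves every `half_life_days`."""
    weights = []
    for d in observation_dates:
        # Observations dated after the reference date are treated as age 0.
        age_days = max((reference_date - d).days, 0)
        weights.append(0.5 ** (age_days / half_life_days))
    return weights


# Hypothetical observation dates: one pre-COVID, two from late 2020.
dates = [date(2019, 11, 26), date(2020, 8, 28), date(2020, 11, 26)]
weights = recency_weights(dates, reference_date=date(2020, 11, 26))

for d, w in zip(dates, weights):
    print(d.isoformat(), round(w, 4))
```

A shorter half-life makes the model adapt faster but discards more history, so it is worth treating it as a hyperparameter and revisiting it as the situation stabilises.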

When it comes to prioritizing labels to condition how we want a model to learn a supervised problem, Active Learning has a lot to offer. Several prioritisation scores exist to assign a priority to each observation, for instance based on least confidence, the margin between labels, or entropy. These techniques could enhance the interpretability of the labels for models trained on pre-COVID data, which are unable to model this “new normality” when given a post-COVID observation.
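The three prioritisation scores mentioned above can be sketched in a few lines. This is a minimal illustration, assuming the model outputs a list of class probabilities per observation; the function names and the example probabilities are hypothetical. In each case a higher score means the model is less certain, so the observation is a better candidate for labelling.

```python
import math


def least_confidence(probs):
    """1 minus the top class probability."""
    return 1.0 - max(probs)


def margin_score(probs):
    """1 minus the gap between the two most likely classes."""
    top1, top2 = sorted(probs, reverse=True)[:2]
    return 1.0 - (top1 - top2)


def entropy_score(probs):
    """Shannon entropy of the predicted distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def rank_for_labelling(prob_rows, score=entropy_score):
    """Return observation indices ordered from most to least uncertain."""
    indices = range(len(prob_rows))
    return sorted(indices, key=lambda i: score(prob_rows[i]), reverse=True)


# Hypothetical predicted class probabilities for three observations.
predictions = [
    [0.90, 0.05, 0.05],  # confident prediction
    [0.40, 0.35, 0.25],  # very uncertain
    [0.55, 0.44, 0.01],  # two classes nearly tied
]
order = rank_for_labelling(predictions)
print(order)
```

The scores do not always agree on the ranking: margin focuses only on the two leading classes, while entropy takes the whole distribution into account, so the choice depends on how many classes matter for the problem.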

Conclusion

Without any doubt, data captured during the COVID-19 recovery period must be considered when retraining Machine Learning models, as things will probably not be the same again, at least for the next 2-3 years. Supervised machine learning models need to be supported with alternative techniques (e.g. unsupervised learning or TDA) to discover new underlying patterns that help models better understand the huge anomaly we are facing, “COVID-19”, since it is now part of our “new normality”. Data scientists, and development teams in general, must be able to monitor their data pipelines so that they can take control of advanced analytics strategies and adapt more dynamically to the recovery. The inclusion of MLOps and Active Learning is highly recommended to achieve this goal. In aviation, forecasts for the recovery of air traffic volume are not very promising, as stated by EUROCONTROL, since many factors may speed up or delay the process, such as when the vaccine will be released, how effective it is, and how it is distributed globally.