Today’s Machine Learning challenge: automating automation

Dario Martinez

2018-07-04 10:00:06

Applying traditional Data Science pipelines to real-world business problems is extremely time-consuming, resource-intensive and challenging. It requires a number of multidisciplinary experts, including Data Scientists and Data Engineers, two of the most in-demand positions in the job market right now. Furthermore, when a model is deployed in a changing production environment, the most suitable Machine Learning (ML) algorithm may vary over time, and the trained model becomes outdated as the data inputs change. In this scenario, comprehensive automation of algorithm selection and hyper-parameter tuning is crucial for developing, optimising and deploying an ML-based application that can serve as an enterprise solution.

Data Scientists are usually swamped with tedious, repetitive tasks that must be completed (or programmed) “manually”. Anyone involved in Data Analysis projects would agree that a typical data pipeline contains repetitive parts that could be automated. This does not mean that automation is going to replace jobs, but it can certainly make current data-related positions more efficient. By automating the most mechanical tasks, Data Scientists can focus on the more complex or creative aspects of data analysis. Automation is also very relevant to “ever-expanding” Big Data collections, for which the cost of developing, scoring, validating and deploying the variety of possible models is simply too high for any small to medium-sized data team.

Automated Machine Learning (AutoML) is a new, ambitious and growing technology that aims to automate every step in a Data Mining pipeline. It is the new kid on the block of Data Science research, and it aims to accelerate the process of developing, evaluating and refining ML models by offering developers a toolkit. These tools use various approaches, including, but not limited to, specialised ML models themselves. In some respects, this can be interpreted as ML algorithms that teach systems how to do ML using the best approach, which is a very innovative methodology.

Let’s take a look at which steps in a Data Science pipeline could be automated and how:

  1. Exploratory Data Analysis: Describing and plotting data before starting a modelling exercise is always a requirement. Scripting tasks such as plotting every descriptive variable against the target variable or computing summary statistics can save time. This should be an automation priority, and it is relatively easy to implement because ML developers already rely on state-of-the-art visualisations and metrics. Note, however, that Data Scientists still need to analyse and explain the visualisations. They are also responsible for the decisions taken, which depend on the case study, their expertise in the field and the particularities of the dataset.
  2. Data preprocessing: Every dataset has its own idiosyncrasies in how one may need to encode categorical variables, correct or transform certain values, clean or substitute missing values, or categorise continuous variables. Many of these transformations already follow a set of “if rules”, and some methodologies for encoding and handling certain data types are well defined, so most tasks in this process could be automated. Ensuring that data has already been automatically cleansed and prepared can save a lot of time. This time could be invested in more complex data preparation tasks such as aggregating dispersed data sources, merging complex datasets or de-identifying confidential data.
  3. Feature engineering: For a given training dataset, it is usually necessary to generate alternative feature representations that describe the variables in a form the ML model can interpret better. Even the most experienced ML modelers spend a lot of time trying out different representations. Although this task is not as straightforward to automate, there are known representations that could generate sets of the most suitable features for a given dataset. This kind of output is also relevant at any level of expertise. On the one hand, it could help less experienced modelers understand that their features can be represented in better forms. On the other hand, it can aid experienced modelers by offering different candidate representations that could help them refine the feature set.
  4. Algorithm selection: For a given feature set, any experienced modeler would easily identify the ML approach best suited to learning the problem. However, ML is slowly moving towards multi-model ensembles and ANN-based algorithms, so that the combinations of possible state-of-the-art ML solutions become too many to test by hand. This variety will force modelers to use some kind of external tool that automatically trains and tests all the suitable methods on the data. Otherwise, modelers will spend too much time manually testing all the models and choosing the best-performing algorithm. There is also a growing need to understand whether, for a given feature set, it is better to use established ML algorithms (e.g. random forests, SVMs, k-means) or to jump into the research and development of more advanced solutions such as Deep Learning.
  5. Hyper-parameter tuning: For a given statistical model, there are sets of parameters to tune in order to boost model accuracy within a reasonable running time. The variety and range of hyper-parameters that can be tuned can be overwhelming and, in most cases, beyond the knowledge and expertise of the developers. These kinds of tasks are a good target for automation.
  6. Model diagnostics and validation: To test the performance of a model or to compare different candidate models, some go-to validation metrics and methodologies are applied. Automatically generating learning curves, partial dependence plots, feature importances, ROC curves and other metrics that illustrate comparative performance in accuracy, efficiency or any other trade-off would be extremely useful.
  7. Model deployment and resource provisioning: Once a model is fully optimised and tested, it can be moved to a production environment and connected to both the data input and the visualisation of results/predictions. To do this, most of the time a customised REST API is generated or a Docker image is deployed to run on a public/private cloud. Furthermore, the model would require automatic scaling up or down based on its CPU, memory, storage or other resource requirements. Automating these kinds of tasks is nothing new for Back-End Developers.
  8. (Bonus) Model governance: For a given deployed model, it is important to keep track of which version is running. It is also necessary to ensure that the production model remains sufficiently predictive, and automating the gathering of fresh data on which the model runs could help trigger updates. Model governance is particularly relevant as ML pipelines grow in complexity and require more monitoring, logging and auditing.
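
To make steps 4 to 6 more concrete, here is a minimal sketch of what automated algorithm selection and hyper-parameter tuning can look like. It is a toy illustration under stated assumptions, not how any particular AutoML product works: the two candidate models (a majority-class baseline and a k-nearest-neighbours classifier), the synthetic dataset and every name in the code (`SEARCH_SPACE`, `auto_select`, `make_toy_data` and so on) are hypothetical and chosen only for this example. The idea is simply to try every candidate model with every hyper-parameter combination, score each one with cross-validation and rank the results.

```python
import random
import statistics
from collections import Counter
from itertools import product


def majority_fit(X, y):
    """Baseline model: always predict the most frequent training label."""
    label = Counter(y).most_common(1)[0][0]
    return lambda row: label


def knn_fit(X, y, k=3):
    """k-nearest neighbours: majority vote among the k closest training rows."""
    def predict(row):
        by_distance = sorted(
            range(len(X)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], row)),
        )
        return Counter(y[i] for i in by_distance[:k]).most_common(1)[0][0]
    return predict


def cross_val_accuracy(fit, params, X, y, folds=5):
    """Mean accuracy over `folds` interleaved train/test splits."""
    scores = []
    for f in range(folds):
        test_idx = set(range(f, len(X), folds))
        X_train = [x for i, x in enumerate(X) if i not in test_idx]
        y_train = [t for i, t in enumerate(y) if i not in test_idx]
        predict = fit(X_train, y_train, **params)
        hits = sum(predict(X[i]) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return statistics.mean(scores)


# Hypothetical search space: model name -> (fit function, hyper-parameter grid).
SEARCH_SPACE = {
    "majority": (majority_fit, {}),
    "knn": (knn_fit, {"k": [1, 3, 5, 7]}),
}


def auto_select(X, y, search_space=SEARCH_SPACE, folds=5):
    """Score every model/hyper-parameter combination and rank them, best first."""
    results = []
    for name, (fit, grid) in search_space.items():
        keys = list(grid)
        for values in product(*(grid[key] for key in keys)):
            params = dict(zip(keys, values))
            score = cross_val_accuracy(fit, params, X, y, folds)
            results.append((score, name, params))
    results.sort(key=lambda r: r[0], reverse=True)
    return results


def make_toy_data(n=60, seed=0):
    """Synthetic two-class data: two Gaussian clusters in two dimensions."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        centre = 0.0 if label == 0 else 3.0
        X.append([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)])
        y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    for score, name, params in auto_select(X, y):
        print(f"{name:<10} {params!s:<12} accuracy={score:.3f}")
```

An exhaustive grid like this one only scales to small search spaces. It is used here because it is the easiest strategy to read, and the same loop structure would still apply if the grid were replaced with a smarter search strategy or with a larger catalogue of models.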

AutoML may not be as popular as other, more attractive lines of research in Data Science such as Deep Learning, but solid AutoML solutions are essential for deploying robust ML-based apps in operational environments, especially given that we live in an evolving, data-driven world.
