Machine Learning is more than loss minimisation

When you start learning and working on machine learning problems, you usually think that the most important skill to master is training the best-performing model. You learn and apply all these feature engineering tricks, feature selection techniques and metrics to select the best model. A huge amount of literature, educational material, practical exercises and excellent libraries provide insight into the high-level implementation of every machine learning model.

In fact, finding and training the best machine learning model is the easiest part of the process. In supervised problems, you even get instant feedback on unseen test data, which makes model performance straightforward to evaluate. Machine learning as a discipline has reached a very high level of maturity. Except for cutting-edge innovations in ANNs and deep learning, most capable computer scientists have easy access to machine learning tools that can readily solve simple problems.

Where does the complexity of machine learning lie?

In reality, the main focus of a data science engineering team is more than pandas, scikit-learn and fancy Spark pipelines. The real complexity comes from translating real-world problems into prediction tasks, building trust in the prediction model and studying how well a dataset represents real-world behaviour. Additionally, teams must think about how wrong predictions might affect the user and how the user is going to behave in the presence of given predictions. Lastly, they must know how to debug the model if something goes wrong. Ask yourself: is your static model going to perform adequately in a dynamic, real-world environment?
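
One way to make that last question concrete is to compare the data the model was trained on with the data it sees once deployed. The sketch below is only an illustration under stated assumptions: the helper name `mean_shift_in_std_units`, the threshold of 0.5 and the sample values are all hypothetical, and a real monitoring setup would likely rely on a more robust drift test.

```python
from statistics import mean, stdev


def mean_shift_in_std_units(training_values, live_values):
    """Return how far the live mean has moved from the training mean,
    measured in training standard deviations (hypothetical helper)."""
    spread = stdev(training_values)
    if spread == 0:
        raise ValueError("training values have no spread; shift is undefined")
    return abs(mean(live_values) - mean(training_values)) / spread


# Made-up values of a single feature, for illustration only.
training_feature = [10.0, 12.0, 11.0, 13.0, 9.0, 11.0]
live_feature = [15.0, 16.0, 14.0, 17.0, 15.0, 16.0]

shift = mean_shift_in_std_units(training_feature, live_feature)
DRIFT_THRESHOLD = 0.5  # assumed cut-off; tune it for your own problem
needs_review = shift > DRIFT_THRESHOLD
print(f"shift = {shift:.2f} training std devs, needs review: {needs_review}")
```

A flag like `needs_review` does not say the model is wrong, only that the environment has moved away from the training data and that someone should look at it.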

Domain knowledge, interpretability, social impact and understanding the role of data are the main issues. However, machine learning education and research are very focused on “finding the best model”, overlooking the role of the data, human-machine interaction and the complex interaction of predictions with the real world. Nevertheless, the machine learning community is slowly shifting from a narrow perspective focused on model performance and scientific research to a broader understanding of real-world applicability.

If machine learning is not only about fitting a model, then what is it all about?

When working on your machine learning model, even if it is not the best model available, you will succeed if you consider these aspects:

  • Problem understanding: Frame your dataset as a prediction or pattern recognition problem, taking into account the machine learning model to be used and the input it requires.
  • Dataset generation: Explore the data. Understand its limitations and whether the given dataset is enough to solve the problem. Consider whether you need additional datasets.
  • Application context: The application field matters, not only in terms of required accuracy but also in terms of how your model will interact with people and change their behaviour. Talk to experts and test how the human components behave when they interact with the predictions. Your model might require periodic retraining.
  • Model readability: Accuracy, MSE or F-Score over cross-validated datasets are not enough. You need to understand the reasons behind your false positive rate and validate the precursors with field experts. Black boxes are not good. Try to interpret the results.
  • Model production deployment: Most data scientists don’t even think about this crucial phase of deploying the model either as a product (fully integrated interface) or as a service (e.g. using a REST API). In any case, you need to think about the requirements and limitations of the operational environment and plan the model specifications accordingly. Maybe you will need to retrain your model within a given time constraint, and the cutting-edge deep learning model you googled won’t fit!
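
As a small illustration of the model readability point, the sketch below shows why a single accuracy figure can hide what matters. It is a minimal example under assumptions: the labels are made up, and `confusion_counts` is a hypothetical helper written for this post rather than part of any library.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


# Made-up labels: ten cases, of which only two are real positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
false_positive_rate = fp / (fp + tn)
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.2f}")
print(f"false positive rate = {false_positive_rate:.2f}")
print(f"recall = {recall:.2f}")
```

With these made-up labels the model is right 70% of the time, yet it raises a false alarm on a quarter of the negative cases and misses half of the real positives. Those are the numbers worth discussing with field experts.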

About Author

Darío Martínez

Darío is a Data Scientist, who is passionate about programming, statistics and business intelligence. He is quite the scientist; any piece of code can be manipulated to predict the future. The larger and messier the dataset, the better.
