Do agile methodologies fit in data science environments?

Dario Martinez

2019-01-23 12:05:17

Extracting knowledge from data involves ingesting raw data and preparing it, then training a Machine Learning model, validating it and deploying it. Apart from requiring a foundation in fields such as programming and statistics, it also demands considerable creativity. Because of this, the life cycle of data-based applications is non-linear and research intensive, which brings a higher degree of uncertainty. In this context, research means finding new knowledge and predicting the sources of that knowledge.

Agile methodologies have been very successful in traditional software development environments. Many data science teams apply them blindly, both because they are trending in the software engineering world and because of their reputation as good practice. But do agile methodologies fit in research-intensive environments? The question may seem rhetorical to many skilled engineers, and undoubtedly the general answer is “yes”. But research-intensive environments, such as machine learning algorithm development, are special and require some extra considerations. Let’s walk through some crucial, often overlooked points:

1. Machine Learning is not deterministic

This is the first barrier to adopting agile methodologies. In traditional software development, you normally choose a limited set of tools and technologies. In ML development, you use a great variety of tools, some of which require specific configuration and careful tuning. All of these steps also have to be designed for a scale that is not known in advance and that, depending on the data volume, calls for different tools (e.g. Python/Pandas for “medium” data or Spark for Big Data). Finally, and no less important, production environments need model governance and version control, because they depend on past decisions. This variance makes traditional agile methodologies hard to apply, but it also makes them necessary for keeping track of the methodology selected and the reasons behind each decision.
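As a rough illustration of choosing a tool by data volume and keeping a record of why, here is a minimal Python sketch. The size cut-off, the Decision record and the choose_engine helper are all hypothetical names and values invented for this example; they are not part of any library or of the author's workflow.

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical cut-off: data sets below this size are treated as "medium".
MEDIUM_DATA_LIMIT_BYTES = 10 * 1024**3  # 10 GiB, an arbitrary assumption


@dataclass
class Decision:
    """One entry in a simple, append-only decision log."""
    topic: str
    choice: str
    reason: str
    decided_on: date = field(default_factory=date.today)


def choose_engine(data_size_bytes: int) -> Decision:
    """Pick a processing tool from the data volume and record the reason."""
    if data_size_bytes < MEDIUM_DATA_LIMIT_BYTES:
        return Decision(
            topic="processing engine",
            choice="pandas",
            reason=f"{data_size_bytes} bytes is below the assumed medium-data limit",
        )
    return Decision(
        topic="processing engine",
        choice="spark",
        reason=f"{data_size_bytes} bytes reaches the assumed medium-data limit",
    )


# Two made-up data sets: roughly 2 GiB and 500 GiB.
decision_log = [choose_engine(2 * 1024**3), choose_engine(500 * 1024**3)]
for entry in decision_log:
    print(f"{entry.decided_on} {entry.topic}: {entry.choice} ({entry.reason})")
```

Keeping the reason next to the choice is the point of the sketch: when a later decision depends on an earlier one, the log shows what was decided and on what grounds.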

Another important dimension is that, in research projects, failing to prove feasibility is a possible outcome. This means that you will often end up with undelivered functionality. The only way to combat this is to set very specific research targets that can be tackled easily. Otherwise, the connection between research questions and customer needs could form too slowly to produce relevant outcomes.

2. Model quality needs to be synched with user engagement

When validating the performance of Machine Learning algorithms, several levels of correctness are possible. For example, suppose the optimal predictive model for a problem is known to reach 95% accuracy. In reality, getting close to that 95% takes a great deal of effort, whereas a skillful data scientist would probably arrive quickly at a model that yields 85% accuracy. Even assuming that the model is implemented with sufficient engineering quality, the doubt about near-perfect accuracy will always remain: was the solution fully optimised or not?

Agile methodologies prefer quick solutions over perfect ones. Having a solution that provides sufficient business value is often better than spending more time researching and optimizing an imperfect machine learning model. This shift in thinking is particularly hard for researchers trained in academia, where user engagement is not considered, and it often entails a cultural change.
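The "good enough" idea can be made concrete with a small Python sketch: stop iterating as soon as a model reaches an agreed accuracy target, or when the iteration budget runs out. The function name, the 0.85 target, the budget of five iterations and the accuracy figures are all assumptions made up for illustration.

```python
from typing import Iterable, Optional, Tuple


def first_good_enough(
    accuracies: Iterable[float],
    target: float = 0.85,
    max_iterations: int = 5,
) -> Tuple[Optional[int], float]:
    """Return (iteration, best accuracy) once the target is first reached.

    Stops after max_iterations even if the target was never reached; in that
    case the iteration is None and the best accuracy seen so far is returned.
    """
    best = 0.0
    for iteration, accuracy in enumerate(accuracies, start=1):
        if iteration > max_iterations:
            break
        best = max(best, accuracy)
        if best >= target:
            return iteration, best
    return None, best


# Made-up accuracies from successive experiments, for illustration only.
sprint_results = [0.71, 0.80, 0.86, 0.90, 0.93]
print(first_good_enough(sprint_results))
```

With these numbers the rule stops at the third experiment, even though later ones would score higher. Whether that trade-off is acceptable is a business decision and not something the code can settle.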

3. Data Science teams are too diverse

Heterogeneity is a very good thing in data science teams. You find highly applied engineers working on infrastructure and data preparation, scientific researchers working on the math behind the ML models and operational experts working on how to deliver the model as a product (predictive analytics, visualizations, etc.). This mixture of skills and expectations is too diverse for traditional agile practices (e.g. scrum epics, sprints). Furthermore, there is often a disconnect among researchers, product owners and customers, which leads to unrealistic expectations for products. This is the main reason why many machine learning products aren’t used in production.

These problems need to be addressed with good communication across teams. Sharing cross-role knowledge is crucial. The curious and open-minded views of researchers need to be combined with business and development knowledge to align expectations. Hybrid profiles also help, such as a research data scientist with experience in business intelligence.

4. Retrospectives are underestimated

The “retrospective” is the most important step of the agile development cycle. It is crucial for discussing how the work was done and how the methodology can be adapted to fulfill all goals. This matters all the more given the highly heterogeneous teams discussed in the previous point. Aligning on terminology, methodology, mode of work, expectations and even mentality is of vast importance when adapting agile methodologies to data science projects. The more diverse the team is, the more important it becomes to hold periodic retrospectives. Through team feedback, retrospectives offer the opportunity to inspect and adapt the working methodology. In fact, healthy iterations on methodology are considered by many to be a key to success in Data Science projects.

In conclusion, machine learning projects require agile methodologies more than any other kind of software development project. (Yes, machine learning is also software.) Agile methodologies are crucial for helping heterogeneous data science teams communicate, provide feedback and align research questions with business goals. After all, the final objective of Machine Learning solutions is to make everyone’s life easier.

© datascience.aero