The scientific method of machine learning

Samuel Cristobal

2019-03-29 13:13:53

The scientific method and reproducibility

The scientific method has been around since at least the 17th century, if not earlier. It emerged as an empirical approach to acquiring knowledge and has driven the development of what we now consider modern science. However, the scientific method is not immutable. In 1990, Crawford and Stucki [1] gave a linearized guideline for the scientific method that became widely accepted:

  1. Define a question
  2. Gather information and resources (observe)
  3. Form an explanatory hypothesis
  4. Test the hypothesis by performing an experiment and collecting data in a reproducible manner
  5. Analyze the data
  6. Interpret the data and draw conclusions that serve as a starting point for new hypotheses
  7. Publish results
  8. Retest (frequently done by other scientists)

The method is iterative, specifically between steps 3 and 6, until the hypothesis is accepted or discarded. Note, however, that the reproducible data collection required in step 4 is what allows others to retest the experiment and arrive at the same conclusions. Without that possibility, science would not be reproducible and would therefore be closer to an act of faith.

Yet although Machine Learning and Deep Learning are branches of the mathematical sciences, many journals nowadays publish studies that use private, non-reproducible data. Without that data, it is impossible for the reader to reproduce the experiment. In these scenarios, the authors are simply asking us to trust them.
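To make the idea of a retestable experiment concrete, here is a minimal Python sketch of one way to record what a reader would need in order to rerun a study: a fingerprint of the exact data used and the random seed. Everything in it is an assumption made for illustration. The function names, the toy "experiment" and the flight-delay rows are invented, and it is not a description of any particular journal's or platform's requirements.

```python
import hashlib
import json
import random


def dataset_fingerprint(rows):
    """Return a SHA-256 hex digest of the rows in a canonical JSON encoding.

    Row order matters: the same records in a different order give a
    different fingerprint.
    """
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def mean_delay(rows):
    """Average of the hypothetical 'delay_min' field."""
    return sum(r["delay_min"] for r in rows) / len(rows)


def run_experiment(rows, seed):
    """Toy experiment: seeded shuffle, 75/25 split, report both means."""
    rng = random.Random(seed)  # local generator, no hidden global state
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = len(shuffled) * 3 // 4
    train, test = shuffled[:cut], shuffled[cut:]
    return {
        "data_sha256": dataset_fingerprint(rows),
        "seed": seed,
        "train_mean": mean_delay(train),
        "test_mean": mean_delay(test),
    }


# Hypothetical data, invented for illustration only.
ROWS = [{"flight": f"XX{i:03d}", "delay_min": (i * 7) % 45} for i in range(20)]

first = run_experiment(ROWS, seed=42)
second = run_experiment(ROWS, seed=42)

# Same data and same seed give the same report, so the run can be retested.
print(first == second)
print(json.dumps(first, indent=2))
```

Publishing the fingerprint and the seed alongside the results does not open the data by itself, but it lets anyone who does obtain the data check that they are testing the same experiment.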

The open data initiative

Much like the term “open source”, coined in the 1990s, the more recent label “open” data refers to free access to data sources. There are no longer any technical limitations to truly open data; the challenges we face are mainly cultural, social and political. Is it possible to encourage and reinforce this mindset? Is such an effort necessary? What are the implications of using private data to answer research questions?

For a long time, we have been told that knowledge is power. Now, we hear data being called the new “oil”.

Not long ago, only a few large corporations had the capability to extract knowledge from data, usually their own. However, as new information technologies have emerged, data processing and knowledge extraction have been democratized. There is no need for huge upfront investment in data centers, as computing capacity can be rented in the cloud. New algorithms, APIs and toolkits are enormously simplifying data scientists’ workflows. Online courses can turn anyone into a data science expert in weeks (or at least, that’s the claim of the courses). Standardized data sets are everywhere and are often overused.

The data science in aviation case

At the last Data Science in Aviation Workshop in Cologne, many participants emphasized the need for open data platforms and standardized data sets to further foster AI applications in aviation.

In aviation, there is a lack of standardized datasets, and researchers are urging stakeholders to create and license them. Without standardized datasets, reproducibility is not possible, and therefore most of the research done with private data simply does not follow the scientific approach. Any conclusion based on a study that uses private data cannot be corroborated and therefore has to be taken on trust. That is not science; that is faith.

Innaxis and the consortium have faced this problem from the beginning, and the solution proposed was the DataBeacon platform: an open, but private, secure data sharing and data processing platform for AI applications. I will be presenting DataBeacon at the next Strata conference in London. If you happen to be there, please do not hesitate to contact me.


[1] Crawford S, Stucki L (1990), “Peer review and the changing research record”, J Am Soc Info Science, vol. 41, pp. 223–28