Data Scientists vs. Software Engineers

Seddik Belkoura

2017-12-12 16:13:50

A problem can always be observed or described through different lenses. The points of view of stakeholders, end users, data scientists and software engineers all shape how a solution is built, so it is worth spelling out how their approaches differ in order to avoid conflicts and unproductive workflows.

Why understanding key differences between data science and software engineering matters

As data science becomes more widely used and more mature within a field (in this case, aviation), collaboration between software/infrastructure engineering teams and data science teams becomes common. While both share coding responsibilities, software development and data science are fundamentally different. Data science is analogous to research: data scientists work closely with stakeholders to answer business questions using their data. Software engineers, by contrast, seldom engage with stakeholders; instead, they collaborate with data science teams to improve an existing solution and adapt it to its computing constraints.

Both data scientists and software engineers conduct research, but each focuses on different questions. Aviation offers a good example of this paradigm. Aviation stakeholders want to advance their understanding and management of safety issues through an organized and understandable platform. Data science teams focus on answering specific questions, and the output may be knowledge or a model. Answering stakeholder questions is often an exploratory process: the research path is not easily predictable, and the computational requirements are not constant, which explains why data science teams typically need more flexibility and agility in the tools and infrastructure they use. It is not surprising to see data scientists run into memory and CPU limits while running computationally intensive experiments on a laptop or a temporary private cloud machine. To avoid excessive computation times, data science teams may resort to quick fixes such as simple parallelization or increased memory, although these fixes do not scale and are not viable in the long term. With that in mind, data scientists do turn to code most often (Python, Scala, etc.), but coding is only a small part of their work: theoretical inputs (statistics, business intelligence, domain knowledge, etc.) are significantly more important. That is to say, programming is just the language used to express or reflect the research process.

On the other hand, software engineering teams adapt the infrastructure architecture or the solution software proposed by the data science team. Typically, the proposed solution should be validated by the stakeholders before software engineering is applied to it. Engineers also perform research, but their investigation pursues different objectives, as the business problem has already been addressed by the data science team. Software engineers use tracking, monitoring and quality assurance techniques to understand the structure of the provided solution (through its code) and optimize it to build scalable, high-performance workflows. Their work builds on the data science output rather than on stakeholders’ input: they adapt the architecture to improve the computational performance of the overall delivered solution.

In practice, with some precautions, data science and software engineering teams can run in parallel

It is important to clearly understand the role of each team in any data science project. The Innaxis Foundation and Research Institute (INX), Linköping University (LIU), Delft University of Technology (TUD), Technical University of Munich (TUM) and Centro de Referencia de Investigación, Desarrollo e Innovación ATM (CRIDA) are responsible for the data science process. Their computational needs still vary at this stage, depending on the tests each wants to perform. In parallel, a team of engineers from Fraunhofer-Gesellschaft (FRA) is working in close coordination with the data scientists to prepare an infrastructure that will best suit their needs. First, a series of preliminary requirements (types of databases, parallelization needs, etc.) was provided to the engineers based on the data scientists’ knowledge, but the infrastructure is ultimately built to adapt to any new requirement that may arise in the future. This may be the case when the data science research advances and yields better-performing solutions.

