Coining Data Science

Hector Ureta

2017-03-29 17:36:54
Vast amounts of data are now available. Data is being captured everywhere around us: by our location-tracked smartphones, by the links we click (even to read this post), and by the desktop or laptop you may be using to access DataScience.aero. Companies in almost every industry are exploiting data for competitive advantage in the so-called “data era”.
In the past, firms could employ teams of statisticians, modelers, and analysts to explore datasets manually, but the volume and variety of data have far surpassed the capacity of manual analysis. Computers have since become far more powerful, and data science has been reborn at the boundary between the “old” statisticians and the “new” computer scientists. However, before we turn to present-day data science capabilities, let’s first explore the roots of the field: how has the concept been coined over the years? This post documents its recent history and its connections with other domains (statistics, computer science), and includes some useful publications and references. Enjoy!

1962 John W. Tukey writes in “The Future of Data Analysis”, “For a long time I thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and doubt… I have come to feel that my central interest is in data analysis… Data analysis, and the parts of statistics which adhere to it, must…take on the characteristics of science rather than those of mathematics… data analysis is intrinsically an empirical science… How vital and how important… is the rise of the stored-program electronic computer? In many instances the answer may surprise many by being ‘important but not vital,’ although in others there is no doubt but what the computer has been ‘vital.’” In 1977, Tukey published Exploratory Data Analysis, arguing that more emphasis is needed on using data to suggest hypotheses to test and that Exploratory Data Analysis and Confirmatory Data Analysis “can—and should—proceed side by side.”

1974 Peter Naur publishes Concise Survey of Computer Methods in Sweden and the United States. The book is a survey of contemporary data processing methods that are used in a wide range of applications. The preface to the book tells the reader that a course plan was presented at the IFIP Congress in 1968, titled “Datalogy, the science of data and of data processes and its place in education,” and later in the book Naur writes, “the term ‘data science’ has been used freely.” Naur offers the following definition of data science: “The science of dealing with data, once they have been established, while the relation of the data to what they represent is delegated to other fields and sciences.”

1977 The International Association for Statistical Computing (IASC) is established. The Association states, “It is the mission of the IASC to link traditional statistical methodology, modern computer technology, and the knowledge of domain experts in order to convert data into information and knowledge.”

1989 Gregory Piatetsky-Shapiro organizes and chairs the first Knowledge Discovery in Databases (KDD) workshop. In 1995, it became the annual ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD).

1996 Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth publish “From Data Mining to Knowledge Discovery in Databases.” They write: “Historically, the notion of finding useful patterns in data has been given a variety of names, including data mining, knowledge extraction, information discovery, information harvesting, data archeology, and data pattern processing… In our view, KDD [Knowledge Discovery in Databases] refers to the overall process of discovering useful knowledge from data, and data mining refers to a particular step in this process.  Data mining is the application of specific algorithms for extracting patterns from data… the additional steps in the KDD process, such as data preparation, data selection, data cleaning, incorporation of appropriate prior knowledge, and proper interpretation of the results of mining, are essential to ensure that useful knowledge is derived from the data. Blind application of data-mining methods (rightly criticized as data dredging in the statistical literature) can be a dangerous activity, easily leading to the discovery of meaningless and invalid patterns.”

1997 The journal Data Mining and Knowledge Discovery is launched; the reversal of the order of the two terms in its title reflects the ascendance of “data mining” as the more popular way to designate “extracting information from large databases.”

2001 William S. Cleveland publishes “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics.” It is a plan “to enlarge the major areas of technical work of the field of statistics. Because the plan is ambitious and implies substantial change, the altered field will be called ‘data science.’” Cleveland places the new discipline in the context of computer science and the contemporary work in data mining: “…the benefit to the data analyst has been limited, because the knowledge among computer scientists about how to think of and approach the analysis of data is limited, just as the knowledge of computing environments by statisticians is limited. A merger of knowledge bases would produce a powerful force for innovation. This suggests that statisticians should look to computing for knowledge today just as data science looked to mathematics in the past. … departments of data science should contain faculty members who devote their careers to advances in computing with data and who form partnership with computer scientists.”

January 2003 Launch of Journal of Data Science: “By ‘Data Science’ we mean almost everything that has something to do with data: collecting, analyzing, modeling…… yet the most important part is its applications; all sorts of applications. This journal is devoted to applications of statistical methods at large. The Journal of Data Science will provide a platform for all data workers to present their views and exchange ideas.”

May 2005 Thomas H. Davenport, Don Cohen, and Al Jacobson publish “Competing on Analytics,” a Babson College Working Knowledge Research Center report, describing “the emergence of a new form of competition based on the extensive use of analytics, data, and fact-based decision making… Instead of competing on traditional factors, companies are beginning to employ statistical and quantitative analysis and predictive modeling as primary elements of competition.” The research is later published by Davenport in the Harvard Business Review (January 2006) and is expanded (with Jeanne G. Harris) into the book Competing on Analytics: The New Science of Winning (March 2007).

June 2009 Troy Sadkowsky creates the data scientists group on LinkedIn as a companion to his website (datascientists.net).

February 2010 Kenneth Cukier writes in The Economist Special Report “Data, Data Everywhere”: “… a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data.”

[Figure. Source: Drew Conway]

May 2011 Pete Warden writes “Why the term ‘data science’ is flawed but useful”. Warden states, “There is no widely accepted boundary for what’s inside and outside of data science’s scope. Is it just a faddish rebranding of statistics? I don’t think so, but I also don’t have a full definition. I believe that the recent abundance of data has sparked something new in the world, and when I look around I see people with shared characteristics who don’t fit into traditional categories. These people tend to work beyond the narrow specialties that dominate the corporate and institutional world, handling everything from finding the data, processing it at scale, visualizing it and writing it up as a story. They also seem to start by looking at what the data can tell them, and then picking interesting threads to follow, rather than the traditional scientist’s approach of choosing the problem first and then finding data to shed light on it.”

May 2011 David Smith writes in “‘Data Science’: What’s in a name?”: “The terms ‘Data Science’ and ‘Data Scientist’ have only been in common usage for a little over a year, but they’ve really taken off since then: many companies are now hiring for ‘data scientists’, and entire conferences are run under the name of ‘data science’. But despite the widespread adoption, some have resisted the change from the more traditional terms like ‘statistician’ or ‘quant’ or ‘data analyst’…. I think ‘Data Science’ better describes what we actually do: a combination of computer hacking, data analysis, and problem solving.”

[Figure: Data Science Job Growth (by 2011)]

September 2012 Tom Davenport and D.J. Patil publish “Data Scientist: The Sexiest Job of the 21st Century” in the Harvard Business Review. They write, “The title has been around for only a few years…But thousands of data scientists are already working at both start-ups and well-established companies. Their sudden appearance on the business scene reflects the fact that companies are now wrestling with information that comes in varieties and volumes never encountered before.”

October 2013 The first Data Science in Aviation Workshop takes place in Madrid. The event builds on the “Complex Data Mining” research thread of SESAR-Complex world, active since 2010 (DSIAW). Owing to its success, the workshop has taken place every year since. The 2016 edition involved a wider range of stakeholders, including ANSPs, airlines, researchers, authorities and other industry professionals. Its agenda covered not only the challenges of data science but also new frontiers as they apply to aviation safety and other innovative fields.

© datascience.aero