Aviation Data Compendium: Which data sources exist in the aviation industry?

Antonio Fernandez

2020-02-20 18:06:13

When applying data-mining techniques in the aviation field, we generally ask ourselves the same question: what data sources, public or private, exist that can be accessed? Surprisingly, there are plenty of data silos (mostly private) containing information on almost every aspect of aviation. Data scientists and analysts could greatly benefit from knowing which data is available and from having an opportunity to brainstorm novel solutions to improve any aviation sector. For instance, research into ways to improve airspace operations, detect safety incidents in advance or decrease aircraft emissions by optimizing fuel consumption are just some of the powerful applications of all this data. Merging these data sources enables machine learning models to learn from past behaviors.

We have classified the extensive aviation data catalogue into three categories. First, human-related data, which normally involves passengers, flight crew and air traffic controllers (ATCOs). Second, data generated inside the aircraft, mainly composed of metrics captured by on-board sensors during flights and of maintenance reports. Finally, all data generated outside the aircraft, like flight plans, regulations or external factors that might influence flight performance such as weather, terrain or particular events.

Human-related data

Among all the roles involved in the execution of a flight, passengers, ATCOs and flight crew are the ones whose information is most in demand. Nevertheless, this data is quite sensitive and is therefore kept private by most owners. These datasets are usually captured by airports, airlines and air navigation service providers (ANSPs). Below, we will explore some of these datasets.


Passengers generate huge amounts of data. Trips generally flow the same way: passengers leave home, arrive at the airport, pass security controls, wait in gate areas, board the plane, travel to their destination, pick up their baggage and continue to their final destination. Airports capture a lot of information about how passengers behave within the airport. This data can be divided into passenger movement inside the airport and passenger information captured on board.

Regarding datasets from arrival and departure (A/D) airports, passenger flow refers to the movements and actions performed by passengers in the busiest areas of the airport, like check-in, security control, boarding pass scans, etc. It is related to the passenger connection dataset, where inbound and outbound passenger traffic is measured for each arrival and departure, which makes it possible to infer airport occupancy and analyze multi-hop journeys. Looking further into the passenger experience inside airports, queue waiting times definitely have the highest impact. There are several datasets that measure actual waiting times for airport queues. These queues could be security checks or immigration and emigration controls placed at different airport locations. In fact, some airports also focus on speeding up baggage collection, installing sensors to track the baggage journey. These sensors register when the baggage belts start and stop for each arrival and count the number of bags transported. Some datasets even monitor bag positioning along the journey.
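As a rough illustration of how baggage belt records like the ones described above might be used, the sketch below derives a delivery rate from a belt's start and end times and its bag count. The `BeltRecord` structure, its field names and the sample values are hypothetical and do not come from any real airport dataset.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class BeltRecord:
    """Hypothetical record: one baggage belt run for one arriving flight."""
    flight: str
    belt_start: datetime
    belt_end: datetime
    bags: int


def bags_per_minute(record: BeltRecord) -> float:
    """Average number of bags delivered per minute while the belt was running."""
    minutes = (record.belt_end - record.belt_start).total_seconds() / 60
    if minutes <= 0:
        raise ValueError("belt_end must be later than belt_start")
    return record.bags / minutes


# Illustrative values only
record = BeltRecord(
    flight="XX123",
    belt_start=datetime(2020, 2, 20, 10, 0),
    belt_end=datetime(2020, 2, 20, 10, 20),
    bags=120,
)
print(bags_per_minute(record))  # 6.0
```

The same pattern would apply to queue waiting times, given an entry and exit timestamp per passenger.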

On-board passenger information is more limited. It includes, for example, seat maps of the aircraft that record expected and actual occupancy, usually measured for first and second class seats. Passenger ticketing is also important for tracking passenger names, contact details and ticketing information, which in turn fuels the loyalty programs of a given airline.

ATCOs and Flight crew

In proportion, we are aware of fewer data sources regarding ATCO and flight crew operations. The majority of the available datasets are about fatigue, which can occur in all professions. However, in Air Traffic Control (ATC), where a 24-hour service is required and safety must be maintained at all times, it is essential that air traffic controllers and air traffic engineers understand the potential risks of fatigue and know how to manage them.

Another interesting dataset consists of the voice communications between pilots and ATCOs, in which any operational instruction given during the flight is recorded.

What about aircraft monitoring datasets?

Without a doubt, the most valuable data source for explaining how an aircraft behaves during a flight is Flight Data Monitoring (FDM). This data source is very rich and precise, recording eight measurements per second for hundreds of sensors (speed, flaps, angles, etc.) installed in the plane. FDM is mainly used by airline safety departments as a forensic tool, since the data has to be extracted from the aircraft, decoded and loaded onto their platform before potential safety occurrences can be uncovered. In fact, the resulting safety occurrence reports are themselves a very valuable dataset for preventing severe incidents.
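To get a feel for the volume involved, the following back-of-the-envelope sketch counts the readings produced in one flight at the eight-per-second rate mentioned above. The sensor count and flight duration are assumed values chosen only for illustration.

```python
def fdm_sample_count(sensors: int, flight_hours: float, rate_hz: int = 8) -> int:
    """Number of individual sensor readings recorded during one flight."""
    seconds = flight_hours * 3600
    return int(sensors * rate_hz * seconds)


# Assumed example: 300 sensors on a 2-hour flight
print(fdm_sample_count(sensors=300, flight_hours=2.0))  # 17280000
```

Even under these modest assumptions, a single flight yields over seventeen million readings, which is why FDM data is usually decoded and processed in batches after landing.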

Another popular data source is state vectors, obtained through Automatic Dependent Surveillance – Broadcast (ADS-B). ADS-B is a surveillance technology in which an aircraft determines its position via satellite navigation and periodically broadcasts it to antennae installed at ground stations. However, because the signal has to be captured, the quality of the data depends on how many antennae are monitoring the airspace. A lack of receivers might produce jumps in the trajectory when aircraft fly over isolated regions. The Air Traffic Control Radar Beacon System (ATCRBS, or radar track) is a system used by ATC to enhance surveillance radar monitoring and the separation of air traffic. When an aircraft is near the ground, the position accuracy of this radar drops. A-SMGCS (Advanced Surface Movement Guidance & Control System) is a system providing routing, guidance and surveillance for ATC within the aerodrome visibility operational level (AVOL), and it provides a more precise position measurement.
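The trajectory jumps caused by sparse ADS-B receiver coverage can be located by looking for long silences between consecutive position reports. The sketch below is one minimal way to do that; the `PositionReport` structure and the 30-second threshold are assumptions made for this example, not part of any ADS-B standard.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PositionReport:
    """Hypothetical, simplified ADS-B position report."""
    timestamp: float  # seconds since some reference time
    lat: float
    lon: float
    altitude_m: float


def find_coverage_gaps(
    track: List[PositionReport], max_gap_s: float = 30.0
) -> List[Tuple[float, float]]:
    """Return (start, end) timestamps where consecutive reports are
    further apart than max_gap_s, i.e. likely receiver coverage gaps."""
    ordered = sorted(track, key=lambda report: report.timestamp)
    gaps = []
    for prev, curr in zip(ordered, ordered[1:]):
        if curr.timestamp - prev.timestamp > max_gap_s:
            gaps.append((prev.timestamp, curr.timestamp))
    return gaps


# Illustrative track with a two-minute silence between t=10 and t=130
track = [
    PositionReport(0, 40.00, -3.00, 10000),
    PositionReport(5, 40.01, -3.01, 10000),
    PositionReport(10, 40.02, -3.02, 10000),
    PositionReport(130, 40.30, -3.30, 10000),
    PositionReport(135, 40.31, -3.31, 10000),
]
print(find_coverage_gaps(track))  # [(10, 130)]
```

Once gaps are identified, they can be flagged, interpolated or filled with another surveillance source such as radar tracks, depending on the analysis.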

Furthermore, another interesting dataset is noise profiles, which work similarly to ADS-B: antennae measure the noise produced by an aircraft in decibels (dB). Studies around this dataset aim to decrease the acoustic pollution generated by aircraft within a geographic area.

Last but not least, maintenance reports are another good dataset, describing any issue detected in the aircraft architecture. Moreover, part replacements are recorded, which allows research into predictive maintenance and helps forecast the aircraft life cycle.

What about flight contextual data generated by external sources?

This is definitely the biggest group. Multiple datasets contribute to the flight context. All these data sources come from outside the aircraft but are directly linked to flight performance. Flight plans are the most popular dataset. They describe static information about the flight such as departure and arrival airports, airline, aircraft type, and scheduled and actual times for departure and arrival, among others. These flight plans can be updated due to regulations, which may impact the route followed by the plane, departure times, etc., but we won’t go too deep into this dataset for simplicity’s sake.
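A flight plan record with the fields listed above could be represented roughly as follows. This is only a sketch: the `FlightPlan` class, its field names and the sample values are hypothetical and do not reflect any official flight plan format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FlightPlan:
    """Hypothetical, simplified flight plan record."""
    departure_airport: str
    arrival_airport: str
    airline: str
    aircraft_type: str
    scheduled_departure: datetime
    actual_departure: datetime
    scheduled_arrival: datetime
    actual_arrival: datetime

    def departure_delay(self) -> timedelta:
        """Positive when the flight left later than scheduled."""
        return self.actual_departure - self.scheduled_departure

    def arrival_delay(self) -> timedelta:
        """Positive when the flight arrived later than scheduled."""
        return self.actual_arrival - self.scheduled_arrival


# Illustrative values only
plan = FlightPlan(
    departure_airport="LEMD",
    arrival_airport="EGLL",
    airline="XXX",
    aircraft_type="A320",
    scheduled_departure=datetime(2020, 2, 20, 9, 0),
    actual_departure=datetime(2020, 2, 20, 9, 15),
    scheduled_arrival=datetime(2020, 2, 20, 11, 30),
    actual_arrival=datetime(2020, 2, 20, 11, 40),
)
print(plan.departure_delay())  # 0:15:00
print(plan.arrival_delay())    # 0:10:00
```

Delays computed this way are a common target variable when flight plans are merged with weather or surveillance data.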

Meteorological conditions around airports and along en-route paths contribute a lot to the flight context. METAR is probably the most popular weather data source, providing measurements of temperature, wind, etc. for each airport every 30 minutes. Regarding en-route weather, SIGMET is another public data source that categorizes weather events using polygons. The NOAA dataset provides a worldwide grid of temperature and wind measurements intended for global warming research. Other useful complementary datasets include SODAR or WMA for wind profiles and SNOWTAM for runway snow metrics.
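METAR reports are published as compact text strings, so some parsing is needed before they can be merged with other data. The sketch below extracts only wind and temperature from a report; it assumes the string starts with the station code, ignores every other group, and uses a made-up sample report rather than a real observation.

```python
import re

# Wind group, e.g. 24008KT or VRB03KT, with an optional gust such as 24015G25KT
WIND = re.compile(r"(?P<dir>\d{3}|VRB)(?P<speed>\d{2,3})(?:G(?P<gust>\d{2,3}))?KT")
# Temperature/dew point group, e.g. 12/03 or M02/M05 (M marks negative values)
TEMP = re.compile(r"(?P<temp>M?\d{2})/(?P<dew>M?\d{2})")


def _celsius(token: str) -> int:
    """Convert a METAR temperature token such as '12' or 'M02' to degrees Celsius."""
    return -int(token[1:]) if token.startswith("M") else int(token)


def parse_metar(report: str) -> dict:
    """Extract station, wind and temperature from a METAR string.

    Assumes the first token is the station code; unrecognized groups are skipped.
    """
    tokens = report.split()
    result = {"station": tokens[0]}
    for token in tokens[1:]:
        wind = WIND.fullmatch(token)
        temp = TEMP.fullmatch(token)
        if wind:
            result["wind_dir"] = wind["dir"]
            result["wind_speed_kt"] = int(wind["speed"])
        elif temp:
            result["temperature_c"] = _celsius(temp["temp"])
            result["dew_point_c"] = _celsius(temp["dew"])
    return result


# Made-up sample report for illustration
print(parse_metar("LEMD 201800Z 24008KT 9999 FEW040 12/03 Q1021"))
```

Matching whole tokens rather than searching the raw string avoids confusing the temperature group with other slash-separated groups. For production use, an established METAR parsing library would likely be a better choice than hand-written patterns.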

Terrain mapping and abnormal altitude spots are also important for avoiding ground proximity warnings. NOTAM is a log dataset that contains particular events happening around the TMA (e.g. buildings, military operations, etc.). Also, the GEO dataset provides terrain altitudes for a given pair of coordinates.


In this post, we have described many datasets in aviation. Though we probably haven’t covered them all, a data scientist can get a clear idea of which data is available in the industry by reading this compendium. Most of the data sources described are private and belong to airlines, airports or ANSPs. Applying data-mining techniques to these aviation datasets would surely improve passenger experience or air traffic performance, or perhaps even minimize environmental impact and the number of safety occurrences.

© datascience.aero