The ASA challenge site hosts two decades of flight data, totaling about 12 GB uncompressed. Each record contains the year, month, day, origin, carrier, destination, delay, etc. The goal is to determine which factors (day, carrier, destination, etc.) best predict the delay time.
From the ASA website:
The data consists of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. This is a large dataset: there are nearly 120 million records in total, and takes up 1.6 gigabytes of space compressed and 12 gigabytes when uncompressed.
The aim of the data expo is to provide a graphical summary of important features of the data set. This is intentionally vague in order to allow different entries to focus on different aspects of the data, but here are a few ideas to get you started:
- When is the best time of day/day of week/time of year to fly to minimise delays?
- Do older planes suffer more delays?
- How does the number of people flying between different locations change over time?
- How well does weather predict plane delays?
- Can you detect cascading failures as delays in one airport create delays in others? Are there critical links in the system?
You are also welcome to work with interesting subsets: you might want to compare flight patterns before and after 9/11, or between the pair of cities that you fly between most often, or all flights to and from a major airport like Chicago (ORD).
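As a rough illustration of the first question in the list above (which day of the week or carrier has the lowest delays), here is a minimal Python sketch that streams over flight records and averages the delay per group. It is only a baseline aggregation, not the GraphChi approach covered next. The column names (`DayOfWeek`, `UniqueCarrier`, `ArrDelay`), the `NA` marker for missing values, and the sample rows are assumptions made for illustration; check them against the actual file headers.

```python
import csv
import io
from collections import defaultdict


def mean_delay_by(rows, key, delay_field="ArrDelay"):
    """Stream over rows (dicts) and return {group value: mean delay in minutes}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        raw = row.get(delay_field, "NA")
        if raw in ("", "NA"):  # assumed marker for a missing delay value
            continue
        totals[row[key]] += float(raw)
        counts[row[key]] += 1
    return {k: totals[k] / counts[k] for k in totals}


# Tiny made-up sample; the column names are assumed, not taken from the real files.
SAMPLE = """Year,Month,DayOfWeek,UniqueCarrier,Origin,Dest,ArrDelay
2008,1,1,WN,ORD,LAX,10
2008,1,1,AA,ORD,JFK,30
2008,1,2,WN,LAX,ORD,-5
2008,1,2,AA,JFK,ORD,NA
"""

by_day = mean_delay_by(csv.DictReader(io.StringIO(SAMPLE)), "DayOfWeek")
by_carrier = mean_delay_by(csv.DictReader(io.StringIO(SAMPLE)), "UniqueCarrier")
print(by_day)
print(by_carrier)
```

Because the function reads one row at a time and keeps only per-group sums and counts, the same code should work on a full yearly file by passing `csv.DictReader(open(path))` for whatever path the downloaded data lives at, without loading all 12 GB into memory.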
Next: how to use GraphChi for computing predictions on this dataset.