Anomaly detection – Centre for Research on Energy and Clean Air

Ensuring the integrity of air quality data is paramount for enforcing regulations and protecting public health. However, poor calibration of monitoring equipment or intentional data tampering can lead to pollution levels being underestimated or overestimated, critical trends being missed, a lack of accountability, and a failure to take the actions needed to address air quality issues.

Purpose

The goal of this challenge is to identify monitoring stations that are systematically under-reporting, over-reporting or simply failing to report air pollution (PM2.5) levels. Ultimately, we aim to help official agencies and civil society prioritise inspections, and build legitimate trust in air quality data.

To do so, we are seeking algorithms that would detect stations with spurious records, using both temporal and spatial neighbour data. Other datasets (such as satellite or weather data) could also be included if deemed useful.
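As a starting point for the spatial side, here is a minimal sketch of how neighbouring stations could be selected by great-circle distance. The `stations` layout (station id mapped to latitude and longitude), the function names and the 50 km default radius are assumptions made for illustration; they are not part of the challenge dataset specification.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def neighbours_within(stations, target_id, radius_km=50.0):
    """Return ids of stations within radius_km of target_id, nearest first.

    `stations` maps a station id to a (latitude, longitude) tuple. Both this
    layout and the 50 km default are assumptions made for this sketch.
    """
    lat0, lon0 = stations[target_id]
    distances = [
        (haversine_km(lat0, lon0, lat, lon), sid)
        for sid, (lat, lon) in stations.items()
        if sid != target_id
    ]
    return [sid for dist, sid in sorted(distances) if dist <= radius_km]


# Made-up coordinates, for illustration only.
example_stations = {
    "A": (28.60, 77.20),
    "B": (28.65, 77.25),
    "C": (19.07, 72.88),
}
nearby = neighbours_within(example_stations, "A")
```

A fixed radius is only one possible choice; a k-nearest-neighbours rule or a radius that depends on station density may suit sparse networks better.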

Dataset

We have built a dataset of hourly PM2.5 and PM10 measurements from official air quality monitoring stations in India, China, Germany and the USA for the years 2021 to 2023. More countries and longer time periods can be made available upon request.

Join the Data Challenge

Whether you’re an academic researcher, a student, a data scientist or simply a data enthusiast, you can help us save thousands of lives by joining this data challenge.

You can work by yourself, with friends and colleagues, or with us, at your own pace. To get started and access the dataset, simply fill in the form below.

Frequently Asked Questions

It is hard to pinpoint exactly what a spurious record is. It is perhaps more relevant to talk of a spurious station, one that shows inconsistent or consistently biased behaviour. For instance, we might be interested in stations that tend to under-report PM2.5 values compared to neighbouring stations, and that do so only when PM2.5 values are relatively high. Another sign of “spuriousness” could be an unusually high frequency of records around certain specific numbers.
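The two signs described above can be sketched as simple indicators. This is only an illustration under stated assumptions: series are plain Python lists of hourly PM2.5 values aligned hour by hour, missing records are `None`, and the function names, the 60 µg/m³ threshold and the window width are hypothetical choices, not values prescribed by the challenge.

```python
from statistics import median


def neighbour_ratio_by_level(station, neighbours, high_threshold=60.0):
    """Median station / neighbour-median ratio for low and high pollution hours.

    `station` is a list of hourly PM2.5 values (None = missing); `neighbours`
    is a list of such lists, aligned hour by hour. The threshold (in µg/m³)
    is an arbitrary illustrative choice.
    """
    low, high = [], []
    for hour, value in enumerate(station):
        nearby = [n[hour] for n in neighbours if n[hour] is not None]
        if value is None or not nearby:
            continue  # nothing to compare for this hour
        reference = median(nearby)
        if reference <= 0:
            continue  # avoid dividing by zero or negative sensor noise
        bucket = high if reference >= high_threshold else low
        bucket.append(value / reference)
    return (
        median(low) if low else None,
        median(high) if high else None,
    )


def share_near(values, target, half_width=2.0):
    """Share of non-missing records within half_width of a target value."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    hits = sum(1 for v in present if abs(v - target) <= half_width)
    return hits / len(present)


# Made-up series, for illustration only.
station = [20, 30, 50, 55, None]
neighbours = [[20, 30, 100, 110, 40], [22, 28, 100, 110, None]]
low_ratio, high_ratio = neighbour_ratio_by_level(station, neighbours)
clustering = share_near([58, 59, 59.5, 80, None, 61], target=60)
```

A high-pollution ratio well below the low-pollution ratio would be consistent with under-reporting that appears only during pollution episodes, although genuine local differences between stations can produce the same pattern. Likewise, `share_near` is only informative when compared with the share observed at comparable stations.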

Absolutely not. There could be calibration issues or temporary hiccups that affect station records.

Ultimately, we are interested in building relevant indexes to rank stations and identify those that are the least trustworthy. Note that more than one index might be required to cover all potential misreporting patterns.
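One possible way to combine several indexes into a single ranking is to average each station's rank across them. The sketch below assumes every index is oriented so that a higher score means a more suspicious station; the function name, the index names and the scores are hypothetical.

```python
def rank_stations(indexes):
    """Order stations from least to most trustworthy by mean rank.

    `indexes` maps an index name to a {station_id: score} dict in which a
    higher score means more suspicious. Only stations present in every
    index are ranked. Ties are broken by station id.
    """
    if not indexes:
        return []
    stations = set.intersection(*(set(scores) for scores in indexes.values()))
    mean_rank = {sid: 0.0 for sid in stations}
    for scores in indexes.values():
        # Rank 1 goes to the most suspicious station for this index.
        ordered = sorted(stations, key=lambda sid: (-scores[sid], sid))
        for position, sid in enumerate(ordered, start=1):
            mean_rank[sid] += position / len(indexes)
    return sorted(stations, key=lambda sid: (mean_rank[sid], sid))


# Made-up scores, for illustration only.
example_indexes = {
    "neighbour_bias": {"A": 0.9, "B": 0.1, "C": 0.5},
    "missing_share": {"A": 0.9, "B": 0.2, "C": 0.8},
}
ranking = rank_stations(example_indexes)
```

Rank averaging avoids having to put indexes with different units on a common scale, at the cost of discarding how large the differences between stations are.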

Because of the associated health impacts, we are especially interested in patterns that tend to under-report concentration levels, but more holistic approaches are also warmly welcomed.

First of all, your results are yours and you are free to promote them in any way you like. As mentioned, we only ask you to release them under an open-source license (e.g. MIT or GNU GPL).

At this stage, we are testing the feasibility of such algorithms. If the results are promising, we hope to initiate a longer-term project, with more dedicated resources, to help relevant authorities and civil society ensure the integrity of air pollution data.