Data Mining project 2019

Overview

This is the code for the Data Mining project of the course in 2019 at Politecnico di Milano.

The goal is to predict traffic average speeds in specific times and points of Italian roads.

We classified first with a mean absolute error of 8.5 Km/h (see the evaluation section for more about the metric).

Description

Traffic conditions are measured by sensors installed in the roads. They are identified by the road in which they are and the km. They provide measurements at interval of 15 minutes, in particular about:

the average speed
the min speed
the max speed
the speed standard deviation
the number of vehicles

During the day, several events may happen (e.g. accidents, roadworks, ...). Data tell us useful info about them:

type
location
duration
details

In addition, we have information also about the weather, hourly measured by a set of weather stations. We have:

min temperature
max temperature
nearest sensors

The goal is to predict the average speeds of each target sensor for the 4 quarters of hour immediately after the beginning of an event involving that sensor.

Dataset

You can download the dataset here: dataset.zip.

Extract the archive in: resources/dataset/originals.

You will get the following csv files:

speeds.csv.gz: contains the speeds measurements
events.csv.gz: contains data about the events
weather.csv.gz: contains data about the weather
distances.csv.gz: contains the nearest weather stations near each sensor. The format of the file is:

road, km | weather_station_id_0, distance_0, weather_station_id_1, distance_1, ...

sensors.csv.gz: contains some info about the roads

The files for speeds, events and weather are splitted in 3 files:

{file}_train: the file is used as training set
{file}_test: the file is used as validation set
{file}_2019: the file is used as test set

See the assignment document for additional details.

Evaluation

The evaluation metric is the Mean Absolute Error (MAE) between the real average speeds and the predicted ones.

Our solution

The task is a multi-regression problem, since we have to predict 4 real values. We built an ensemble of two gradient-boosted trees models, Catboost and LightGBM, each of them is trained in the following way: the prediction of a model is passed to the next model as a new feature. This is called and implemented in sklearn as multioutput regressor-chain. The models we used can handle missing values, but the sklearn implementation of the regressor-chain does not allow the presence of them in the dataset. So, we had to modify the code of sklearn to allow nan values.

Check the project presentation for further details on the models and results.

Name		Name	Last commit message	Last commit date
Latest commit History 330 Commits
Documenti		Documenti
notebooks		notebooks
resources/dataset/preprocessed		resources/dataset/preprocessed
src		src
.gitignore		.gitignore
README.md		README.md
assignment.pdf		assignment.pdf
cat_feature_index.8e78157d-11aa6026-f5068b87-d1000959.tmp		cat_feature_index.8e78157d-11aa6026-f5068b87-d1000959.tmp
preprocess.py		preprocess.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Data Mining project 2019

Overview

Description

Dataset

Evaluation

Our solution

About

Releases

Packages

Contributors 5

Languages

gianpy15/DMProject

Folders and files

Latest commit

History

Repository files navigation

Data Mining project 2019

Overview

Description

Dataset

Evaluation

Our solution

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 5

Languages

Packages