This repository contains R code for the Pump it Up: Data Mining the Water Table competition on Driven Data.
The data is provided by Taarifa and the Tanzanian Ministry of Water. The goal is to predict whether a water pump is functional, functional but needs repairs or non functional.
I use H2O's random forest to get a score 0.821. I have uploaded my best (current) submission but not the data. Sign up at Driven Data to download the following files:
SubmissionFormat.csv
Test set values.csv
Training set labels.csv
Training set values.csv
The first step is to read the data and set some values to missing (NA
in R): See read-data.md
.
The next step is to clean up the features (transform some, remove others) and possibly engineer some new features: See transform-data.md
.
Use a random forest to predict the functionality status of pumps in the test set: See predict-data.md
.
Added a Makefile, which spin
s the R
scripts to produces the md
files. See how to Build a report based on an R script.