1. Install ClickHouse
2. Install R
From CRAN
R is required to fix some of the historical Parquet files that are missing column types
From the project root directory, not the clickhouse/
directory
./download_raw_data.sh
Some of the older Parquet files provided by the TLC have a few columns with null
types, which causes errors when trying to import into ClickHouse. The following script iterates through all of the downloaded files and set types if necessary
./clickhouse/fix_parquet_files.sh
./clickhouse/initialize_clickhouse_database.sh
./clickhouse/load_fhv_trips.sh
./clickhouse/load_taxi_trips.sh
The yellow taxi Parquet files from 2009 and 2010 have columns for lat/lon coordinates instead of location IDs, which makes them incompatible with the ClickHouse taxi_trips
table schema. As a workaround, there is a Parquet file available from a Requester Pays AWS S3 bucket here:
It contains all yellow taxi trips from 2009 and 2010, including location IDs instead of lat/lon coordinates. See here for info on how to download files from Requester Pays S3 buckets. Once you've downloaded the file to the data/
directory, run:
./clickhouse/backfill_yellow_taxi_2009_2010_trips.sh
If you want to reproduce the backfill file on your own instead of downloading from S3, you can:
- Run the Postgres-based scripts in this repo to load 2009/2010 yellow taxi files, which will map coordinates to location IDs
- Generate a Parquet file of 2009/2010 trips that conforms to the same schema as the 2011- files
- Import the Parquet file into ClickHouse using the
backfill_yellow_taxi_2009_2010_trips.sh
script
fhv_trips
table contains all for-hire vehicle trip records, including ride-hailing apps Uber, Lyft, Via, and Junotaxi_trips
table contains all yellow and green taxi tripstaxi_zones
table maps pickup and dropoff location IDs to neighborhood and borough names