Skip to content

Data set and queries that I use in my Hive and Impala presentations. Slides are usually posted at slideshare.net/markgrover

Notifications You must be signed in to change notification settings

markgrover/cloudcon-hive

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 

Repository files navigation

cloudcon-hive

Hive presentation at CloudCon 2013 in San Francisco

Files in the repository

Datasets

There are 2 datasets in the repo.

a) The first dataset contains on-time flight performance data from 2008, originally released by Research and Innovative Technology Administration (RITA). The source of this dataset is http://stat-computing.org/dataexpo/2009/the-data.html. The dataset

b) The second dataset contains listing of various airport codes in continental US, Puerto Rico and US Virgin Islands. The source of this dataset is http://www.world-airport-codes.com/ The data was scraped from this website and then cleansed to be in its present CSV form.

Hive commands

  • On hive shell: Create hive table, flight_data:

CREATE TABLE flight_data(
   year INT,
   month INT,
   day INT,
   day_of_week INT,
   dep_time INT,
   crs_dep_time INT,
   arr_time INT,
   crs_arr_time INT,
   unique_carrier STRING,
   flight_num INT,
   tail_num STRING,
   actual_elapsed_time INT,
   crs_elapsed_time INT,
   air_time INT,
   arr_delay INT,
   dep_delay INT,
   origin STRING,
   dest STRING,
   distance INT,
   taxi_in INT,
   taxi_out INT,
   cancelled INT,
   cancellation_code STRING,
   diverted INT,
   carrier_delay STRING,
   weather_delay STRING,
   nas_delay STRING,
   security_delay STRING,
   late_aircraft_delay STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/hive/warehouse/flight_data';

  • Load the data into the table:

LOAD DATA LOCAL INPATH '2008.csv' OVERWRITE INTO TABLE flight_data;

  • Ensure the table got created and loaded fine:

SHOW TABLES;
SELECT
   *
FROM
   flight_data
LIMIT 10; 

  • Query the table. Find average arrival delay for all flights departing SFO in January:

SELECT
   avg(arr_delay)
FROM
   flight_data
WHERE
   month=1
   AND origin='SFO';

About

Data set and queries that I use in my Hive and Impala presentations. Slides are usually posted at slideshare.net/markgrover

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published