bigdata_test

Transform Data using PySpark

#Files

PySpark (Python code)

Crealytics.csv (data file)

#Simple Transformation of Data

date;type;value

2016-07-01;impressions;1000

2016-07-01;impressions;100

2016-07-01;clicks;20

2016-07-01;conversions;1

2016-07-02;impressions;20

2016-07-02;clicks;2

2016-07-04;impressions;10

#Data transformed to:

date;impressions;clicks;conversions

2016-07-01;1100;20;1

2016-07-02;20;2;null

2016-07-03;null;null;null

2016-07-04;10;null;null
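The rules implied by the sample above appear to be: values are summed per date and type, a missing type becomes null, and dates missing from the input (here 2016-07-03) are filled in between the first and last date. As a rough reference, here is a minimal plain-Python sketch of those rules. It is not the repository's code; the names `RAW`, `METRICS` and `pivot` are made up for illustration.

```python
import csv
import io
from collections import defaultdict
from datetime import date, timedelta

# Sample input copied from the README.
RAW = """date;type;value
2016-07-01;impressions;1000
2016-07-01;impressions;100
2016-07-01;clicks;20
2016-07-01;conversions;1
2016-07-02;impressions;20
2016-07-02;clicks;2
2016-07-04;impressions;10
"""

# Output column order, taken from the transformed header.
METRICS = ["impressions", "clicks", "conversions"]


def pivot(raw):
    """Sum values per (date, type) and emit one row per calendar day."""
    totals = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(raw), delimiter=";"):
        day = date.fromisoformat(row["date"])
        totals[day][row["type"]] = totals[day].get(row["type"], 0) + int(row["value"])

    rows = []
    day, last = min(totals), max(totals)
    while day <= last:
        per_day = totals.get(day, {})  # empty for dates absent from the input
        rows.append([day.isoformat()] + [per_day.get(m) for m in METRICS])
        day += timedelta(days=1)
    return rows


if __name__ == "__main__":
    print(";".join(["date"] + METRICS))
    for r in pivot(RAW):
        # None is rendered as "null" to match the README's output format.
        print(";".join("null" if v is None else str(v) for v in r))
```

This is only useful as a small-scale check of the expected result; the Spark job is what does the actual work.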

#Approach

Decided to use Spark to show my Big Data skill set. Load the file from HDFS into the Spark context (sc) and then convert it to a DataFrame. Manipulate the data using Spark's sqlContext. Write the aggregated, transformed data to a file. (It can also be run on Databricks here: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1124054440194824/3765384094733164/3328767132102979/latest.html)