A diary of my learning journey into the world of Apache Spark (pyspark) from a Data Engineering developer's perspective
To follow my journey, you will need:
- Azure Account
- Azure Databricks
- Azure Data Lake Storage Gen2
- Python packages (ref. Pipfile; a quick smoke test follows this list):
  - findspark
  - jupyter
  - numpy
  - pandas
  - pypandoc
  - pyspark 2.4.5
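
Before starting Day 1, it can be worth verifying that the local Python environment can talk to Spark at all. Here is a minimal smoke test, assuming the packages above are installed and `SPARK_HOME` points at a local Spark 2.4.5 installation (the app name is an arbitrary choice):

```python
# Smoke test: findspark locates the Spark installation so that pyspark
# can be imported from a plain Python process (e.g. a Jupyter kernel).
import findspark

findspark.init()  # assumes SPARK_HOME is set or Spark is in a default location

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("smoke-test")   # arbitrary app name
    .master("local[*]")      # run locally on all available cores
    .getOrCreate()
)

# Build a tiny DataFrame to confirm the session works end to end.
df = spark.createDataFrame([(1, "spark"), (2, "pyspark")], ["id", "name"])
df.show()

spark.stop()
```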
My learning path:
- Day 1: Installing a local Spark environment
- Day 2: My first Spark application and some basic concepts
- Day 3: Taking a deeper insight into DataFrames
- Day 4: Getting an overview of the pyspark.sql module
- Day 5: Doing some math and aggregations
- Day 6: Tackling the date and time challenge
- Day 7: Handling of NULL values
- Day 8: JSON and complex data types to analyse semi-/unstructured data
- Day 9: Joins
- Day 10: Connectors and I/O performance
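
To give a flavour of where this path leads, here is a small, entirely hypothetical sketch touching two of the topics above: handling NULL values (Day 7) and aggregations (Day 5). The column names and sample data are invented for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("diary-teaser")  # arbitrary app name
    .master("local[*]")
    .getOrCreate()
)

# Invented sample data: one record has a missing (NULL) amount.
sales = spark.createDataFrame(
    [("books", 10.0), ("books", None), ("games", 25.0)],
    ["category", "amount"],
)

result = (
    sales
    .fillna({"amount": 0.0})                     # Day 7: handling NULL values
    .groupBy("category")                         # Day 5: aggregations
    .agg(F.sum("amount").alias("total_amount"))
)
result.show()

spark.stop()
```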