Big Data with PySpark

Progress

Introduction to PySpark

Getting to know PySpark

Link to Notebooks

  • What is Spark, anyway?
  • Using Spark in Python
  • Examining The SparkContext
  • Using DataFrames
  • Creating a SparkSession
  • Viewing tables
  • Are you query-ious?
  • Pandafy a Spark DataFrame
  • Put some Spark in your data
  • Dropping the middle man
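
Taken together, these exercises amount to a first Spark session. Here is a minimal sketch of that flow, assuming a local Spark installation; the file `flights.csv` and its columns are illustrative assumptions, not the course's exact data:

```python
from pyspark.sql import SparkSession

# Create (or retrieve) a SparkSession -- the entry point to DataFrames
spark = SparkSession.builder.appName("getting-to-know-pyspark").getOrCreate()

# Load a file and register it as a temporary view ("flights.csv" is hypothetical)
flights = spark.read.csv("flights.csv", header=True, inferSchema=True)
flights.createOrReplaceTempView("flights")

# Viewing tables: list everything registered in the catalog
print(spark.catalog.listTables())

# Run a SQL query against the view, then "pandafy" the result
query_df = spark.sql("SELECT origin, dest, air_time FROM flights LIMIT 10")
pandas_df = query_df.toPandas()
print(pandas_df.head())

# "Put some Spark in your data": go the other way, pandas -> Spark
spark_df = spark.createDataFrame(pandas_df)
```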

Manipulating data

Link to Notebooks

  • Creating columns
  • SQL in a nutshell
  • SQL in a nutshell (2)
  • Filtering Data
  • Selecting
  • Selecting II
  • Aggregating
  • Aggregating II
  • Grouping and Aggregating I
  • Grouping and Aggregating II
  • Joining
  • Joining II
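
The core DataFrame verbs listed above can be sketched roughly as follows. Column and file names (`flights.csv`, `airports.csv`, `air_time`, `distance`, `faa`) are assumptions made for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
flights = spark.read.csv("flights.csv", header=True, inferSchema=True)  # hypothetical file

# Creating columns: derive duration in hours from a minutes column
flights = flights.withColumn("duration_hrs", col("air_time") / 60)

# Filtering and selecting
long_flights = flights.filter(col("distance") > 1000)
selected = long_flights.select("origin", "dest", "duration_hrs")

# Grouping and aggregating
by_origin = flights.groupBy("origin").agg({"air_time": "avg"})

# Joining against a second (hypothetical) table of airport metadata
airports = spark.read.csv("airports.csv", header=True, inferSchema=True)
joined = flights.join(airports, flights.dest == airports.faa, how="leftouter")
```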

Getting started with machine learning pipelines

Link to Notebooks

  • Machine Learning Pipelines
  • Join the DataFrames
  • Data types
  • String to integer
  • Create a new column
  • Making a Boolean
  • Strings and factors
  • Carrier
  • Destination
  • Assemble a vector
  • Create the pipeline
  • Test vs Train
  • Transform the data
  • Split the data
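
A minimal sketch of such a pipeline, under the assumption of a flights-style dataset with `carrier`, `month`, and `air_time` columns (all names hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

spark = SparkSession.builder.getOrCreate()
model_data = spark.read.csv("flights.csv", header=True, inferSchema=True)  # hypothetical

# String to integer: index the carrier column, then one-hot encode it
carr_indexer = StringIndexer(inputCol="carrier", outputCol="carrier_index")
carr_encoder = OneHotEncoder(inputCols=["carrier_index"], outputCols=["carrier_fact"])

# Assemble a vector of features for the model
vec_assembler = VectorAssembler(
    inputCols=["month", "air_time", "carrier_fact"], outputCol="features"
)

# Create the pipeline, transform the data, then split into train vs. test;
# splitting after the transform keeps the string indexing consistent across sets
flights_pipe = Pipeline(stages=[carr_indexer, carr_encoder, vec_assembler])
piped_data = flights_pipe.fit(model_data).transform(model_data)
training, test = piped_data.randomSplit([0.6, 0.4])
```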

Model tuning and selection

Link to Notebooks

  • What is logistic regression?
  • Create the modeler
  • Cross validation
  • Create the evaluator
  • Make a grid
  • Make the validator
  • Fit the model(s)
  • Evaluating binary classifiers
  • Evaluate the model
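
A hedged sketch of that tuning workflow; `training` and `test` stand for DataFrames with `features` and `label` columns, such as the split produced in the previous section:

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Create the modeler
lr = LogisticRegression(featuresCol="features", labelCol="label")

# Create the evaluator: area under the ROC curve for a binary classifier
evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")

# Make a grid of hyperparameters to search over (values are illustrative)
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.0, 0.01, 0.1])
        .addGrid(lr.elasticNetParam, [0.0, 1.0])
        .build())

# Make the validator and fit the model(s) with k-fold cross validation
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
best_lr = cv.fit(training).bestModel

# Evaluate the best model on held-out data
print(evaluator.evaluate(best_lr.transform(test)))
```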

Big Data Fundamentals with PySpark

Introduction to Big Data analysis with Spark

Link to Notebooks

  • What is Big Data?
  • The 3 V's of Big Data
  • PySpark: Spark with Python
  • Understanding SparkContext
  • Interactive Use of PySpark
  • Loading data in PySpark shell
  • Review of functional programming in Python
  • Use of lambda() with map()
  • Use of lambda() with filter()
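
The chapter boils down to the SparkContext as the entry point plus Python's functional built-ins. A minimal sketch (the file path is a hypothetical example):

```python
from pyspark import SparkContext

# In the PySpark shell an entry point `sc` already exists; standalone, build one
sc = SparkContext.getOrCreate()
print(sc.version)  # understanding the SparkContext

# Loading data in the PySpark shell ("README.md" is a hypothetical path)
lines = sc.textFile("README.md")

# Review of functional programming: lambda with map() and filter()
squares = map(lambda x: x ** 2, [1, 2, 3, 4])
evens = filter(lambda x: x % 2 == 0, [1, 2, 3, 4])
print(list(squares), list(evens))
```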

Programming in PySpark RDDs

Link to Notebooks

  • Abstracting Data with RDDs
  • RDDs from Parallelized collections
  • RDDs from External Datasets
  • Partitions in your data
  • Basic RDD Transformations and Actions
  • Map and Collect
  • Filter and Count
  • Pair RDDs in PySpark
  • ReduceByKey and Collect
  • SortByKey and Collect
  • Advanced RDD Actions
  • CountingByKeys
  • Create a base RDD and transform it
  • Remove stop words and reduce the dataset
  • Print word frequencies
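
These exercises map onto the RDD API roughly as follows; the text file path and stop-word list are assumptions for illustration:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# RDDs from parallelized collections vs. external datasets
nums = sc.parallelize([1, 2, 3, 4], numSlices=2)  # 2 partitions
text = sc.textFile("some_text_file.txt")          # hypothetical path
print(nums.getNumPartitions())

# Map and collect, filter and count
print(nums.map(lambda x: x * x).collect())
print(nums.filter(lambda x: x % 2 == 0).count())

# Pair RDDs: reduceByKey, sortByKey, countByKey
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(pairs.reduceByKey(lambda x, y: x + y).collect())
print(pairs.sortByKey(ascending=False).collect())
print(dict(pairs.countByKey()))

# Word count: transform a base RDD, remove stop words, print frequencies
stop_words = {"the", "a", "an"}  # illustrative list
counts = (text.flatMap(lambda line: line.lower().split())
              .filter(lambda w: w not in stop_words)
              .map(lambda w: (w, 1))
              .reduceByKey(lambda x, y: x + y))
for word, count in counts.take(10):
    print(word, count)
```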

PySpark SQL & DataFrames

Link to Notebooks

  • Abstracting Data with DataFrames
  • RDD to DataFrame
  • Loading CSV into DataFrame
  • Operating on DataFrames in PySpark
  • Inspecting data in PySpark DataFrame
  • PySpark DataFrame subsetting and cleaning
  • Filtering your DataFrame
  • Interacting with DataFrames using PySpark SQL
  • Running SQL Queries Programmatically
  • SQL queries for filtering Table
  • Data Visualization in PySpark using DataFrames
  • PySpark DataFrame visualization
  • Part 1: Create a DataFrame from CSV file
  • Part 2: SQL Queries on DataFrame
  • Part 3: Data visualization
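
A compact sketch of the DataFrame and SQL workflow above; `people.csv` and its columns are hypothetical, and the plotting step assumes pandas and matplotlib are installed:

```python
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# RDD to DataFrame
rdd = sc.parallelize([Row(name="Alice", age=34), Row(name="Bob", age=45)])
people_df = spark.createDataFrame(rdd)

# Loading a CSV into a DataFrame (hypothetical path)
df = spark.read.csv("people.csv", header=True, inferSchema=True)

# Inspecting, subsetting, and cleaning
df.printSchema()
clean_df = df.select("name", "age").dropDuplicates().filter(df.age > 21)

# Running SQL queries programmatically
clean_df.createOrReplaceTempView("people")
adults = spark.sql("SELECT name FROM people WHERE age > 30")

# Visualization typically routes through pandas, e.g. a histogram of ages
clean_df.toPandas()["age"].plot(kind="hist")
```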

Machine Learning with PySpark MLlib

Link to Notebooks

  • Overview of PySpark MLlib
  • PySpark ML libraries
  • PySpark MLlib algorithms
  • Collaborative filtering
  • Loading the MovieLens dataset into RDDs
  • Model training and predictions
  • Model evaluation using MSE
  • Classification
  • Loading spam and non-spam data
  • Feature hashing and LabeledPoint
  • Logistic Regression model training
  • Clustering
  • Loading and parsing the 5000 points data
  • K-means training
  • Visualizing clusters
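
The three MLlib families covered here (recommendation, classification, clustering) use the RDD-based `pyspark.mllib` API. A minimal sketch, with tiny inline datasets standing in for the course's files:

```python
from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.recommendation import ALS, Rating
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext.getOrCreate()

# Collaborative filtering: ALS on (user, product, rating) tuples
ratings = sc.parallelize([Rating(1, 1, 5.0), Rating(1, 2, 1.0), Rating(2, 1, 4.0)])
als_model = ALS.train(ratings, rank=10, iterations=10)

# Classification: hash text into features, label it, train logistic regression
tf = HashingTF(numFeatures=200)
spam = sc.parallelize(["cheap pills now"]).map(lambda s: tf.transform(s.split()))
ham = sc.parallelize(["meeting at noon"]).map(lambda s: tf.transform(s.split()))
samples = spam.map(lambda f: LabeledPoint(1, f)).union(
    ham.map(lambda f: LabeledPoint(0, f)))
lr_model = LogisticRegressionWithLBFGS.train(samples)

# Clustering: K-means on 2-D points
points = sc.parallelize([[1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])
kmeans_model = KMeans.train(points, k=2, maxIterations=10)
print(kmeans_model.clusterCenters)
```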

Cleaning Data with PySpark

DataFrame details

Link to Notebooks

  • A review of DataFrame fundamentals and the importance of data cleaning.
  • Intro to data cleaning with Apache Spark
  • Data cleaning review
  • Immutability and lazy processing
  • Immutability review
  • Using lazy processing
  • Understanding Parquet
  • Saving a DataFrame in Parquet format
  • SQL and Parquet
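
Lazy processing and Parquet in one short sketch; the file names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Transformations are lazy: nothing is read or computed until an action runs
df = spark.read.csv("departures.csv", header=True)       # hypothetical file
df = df.withColumnRenamed("Flight Number", "flight_id")  # still lazy
df.show(5)                                               # action triggers work

# Saving a DataFrame in Parquet format, then querying it with SQL
df.write.parquet("departures.parquet", mode="overwrite")
parquet_df = spark.read.parquet("departures.parquet")
parquet_df.createOrReplaceTempView("flights")
spark.sql("SELECT COUNT(*) FROM flights").show()
```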

Manipulating DataFrames in the real world

Link to Notebooks

  • DataFrame column operations
  • Filtering column content with Python
  • Filtering Question #1
  • Filtering Question #2
  • Modifying DataFrame columns
  • Conditional DataFrame column operations
  • when() example
  • When / Otherwise
  • User defined functions
  • Understanding user defined functions
  • Using user defined functions in Spark
  • Partitioning and lazy processing
  • Adding an ID Field
  • IDs with different partitions
  • More ID tricks
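
A sketch of these column operations on an assumed voter dataset (`voters.csv`, `name`, and `title` are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
voter_df = spark.read.csv("voters.csv", header=True)  # hypothetical file

# Filtering and modifying column content
voter_df = voter_df.filter(~F.col("name").contains("_"))
voter_df = voter_df.withColumn("splits", F.split(F.col("name"), r"\s+"))

# Conditional column operations with when() / otherwise()
voter_df = voter_df.withColumn(
    "random_val",
    F.when(F.col("title") == "Councilmember", F.rand()).otherwise(0),
)

# User defined functions: wrap plain Python and apply it per row
def first_and_middle(names):
    return " ".join(names[:-1])

udf_first_middle = F.udf(first_and_middle, StringType())
voter_df = voter_df.withColumn("first_and_middle", udf_first_middle("splits"))

# Adding an ID field: unique but not sequential, and partition-dependent
voter_df = voter_df.withColumn("row_id", F.monotonically_increasing_id())
```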

Improving Performance

Link to Notebooks

  • Caching
  • Caching a DataFrame
  • Removing a DataFrame from cache
  • Improve import performance
  • File size optimization
  • File import performance
  • Cluster configurations
  • Reading Spark configurations
  • Writing Spark configurations
  • Performance improvements
  • Normal joins
  • Using broadcasting on Spark joins
  • Comparing broadcast vs normal joins
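
The performance levers above in one sketch; the two CSV files are hypothetical and are assumed to share a `dest` column for the join:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Reading and writing Spark configurations
print(spark.conf.get("spark.sql.shuffle.partitions"))
spark.conf.set("spark.sql.shuffle.partitions", "100")

# Caching a DataFrame, then removing it from cache when done
flights = spark.read.csv("flights.csv", header=True)  # hypothetical file
flights.cache()
print(flights.count(), flights.is_cached)
flights.unpersist()

# Broadcast join: ship the small table to every executor instead of shuffling
airports = spark.read.csv("airports.csv", header=True)  # hypothetical file
normal_join = flights.join(airports, on="dest")
broadcast_join = flights.join(broadcast(airports), on="dest")
broadcast_join.explain()  # compare the physical plans of the two joins
```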

Complex processing and data pipelines

Link to Notebooks

  • Introduction to data pipelines
  • Quick pipeline
  • Pipeline data issue
  • Data handling techniques
  • Removing commented lines
  • Removing invalid rows
  • Splitting into columns
  • Further parsing
  • Data validation
  • Validate rows via join
  • Examining invalid rows
  • Final analysis and delivery
  • Dog parsing
  • Per image count
  • Percentage dog pixels
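
A hedged sketch of the cleanup-and-validate pattern these exercises follow: read raw lines while skipping comments, split into columns, and keep only rows that join against a known-good reference. All file names, separators, and column positions here are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Quick pipeline: read raw lines into one column, dropping commented lines
# (reading with a separator that never occurs keeps each line in _c0)
raw = spark.read.csv("annotations.csv.gz", sep="\t", comment="#")  # hypothetical

# Removing invalid rows: require a minimum number of comma-separated fields
df = raw.withColumn("colcount", F.size(F.split(F.col("_c0"), ",")))
valid = df.filter(F.col("colcount") >= 5)

# Splitting into columns for further parsing
split_cols = valid.withColumn("fields", F.split(F.col("_c0"), ","))

# Data validation: keep only rows whose first field appears in a reference list
valid_folders = spark.read.csv("valid_folders.txt")  # hypothetical file
joined = split_cols.join(
    valid_folders, split_cols.fields.getItem(0) == valid_folders._c0)
invalid_count = split_cols.count() - joined.count()
print("rows dropped by validation:", invalid_count)
```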

About

Notebooks and materials for the Big Data with PySpark skill track from DataCamp (primarily), plus books and cheat sheets.
