- What is Spark, anyway?
- Using Spark in Python
- Examining the SparkContext
- Using DataFrames
- Creating a SparkSession
- Viewing tables
- Are you query-ious?
- Pandafy a Spark DataFrame
- Put some Spark in your data
- Dropping the middle man
- Creating columns
- SQL in a nutshell
- SQL in a nutshell (2)
- Filtering Data
- Selecting
- Selecting II
- Aggregating
- Aggregating II
- Grouping and Aggregating I
- Grouping and Aggregating II
- Joining
- Joining II
- Machine Learning Pipelines
- Join the DataFrames
- Data types
- String to integer
- Create a new column
- Making a Boolean
- Strings and factors
- Carrier
- Destination
- Assemble a vector
- Create the pipeline
- Test vs Train
- Transform the data
- Split the data
- What is logistic regression?
- Create the modeler
- Cross validation
- Create the evaluator
- Make a grid
- Make the validator
- Fit the model(s)
- Evaluating binary classifiers
- Evaluate the model
- What is Big Data?
- The 3 V's of Big Data
- PySpark: Spark with Python
- Understanding SparkContext
- Interactive Use of PySpark
- Loading data in PySpark shell
- Review of functional programming in Python
- Use of lambda() with map()
- Use of lambda() with filter()
- Abstracting Data with RDDs
- RDDs from Parallelized collections
- RDDs from External Datasets
- Partitions in your data
- Basic RDD Transformations and Actions
- Map and Collect
- Filter and Count
- Pair RDDs in PySpark
- ReduceByKey and Collect
- SortByKey and Collect
- Advanced RDD Actions
- CountingByKeys
- Create a base RDD and transform it
- Remove stop words and reduce the dataset
- Print word frequencies
- Abstracting Data with DataFrames
- RDD to DataFrame
- Loading CSV into DataFrame
- Operating on DataFrames in PySpark
- Inspecting data in PySpark DataFrame
- PySpark DataFrame subsetting and cleaning
- Filtering your DataFrame
- Interacting with DataFrames using PySpark SQL
- Running SQL Queries Programmatically
- SQL queries for filtering Table
- Data Visualization in PySpark using DataFrames
- PySpark DataFrame visualization
- Part 1: Create a DataFrame from a CSV file
- Part 2: SQL Queries on DataFrame
- Part 3: Data visualization
- Overview of PySpark MLlib
- PySpark ML libraries
- PySpark MLlib algorithms
- Collaborative filtering
- Loading the MovieLens dataset into RDDs
- Model training and predictions
- Model evaluation using MSE
- Classification
- Loading spam and non-spam data
- Feature hashing and LabeledPoint
- Logistic Regression model training
- Clustering
- Loading and parsing the 5000 points data
- K-means training
- Visualizing clusters
- A review of DataFrame fundamentals and the importance of data cleaning
- Intro to data cleaning with Apache Spark
- Data cleaning review
- Immutability and lazy processing
- Immutability review
- Using lazy processing
- Understanding Parquet
- Saving a DataFrame in Parquet format
- SQL and Parquet
- DataFrame column operations
- Filtering column content with Python
- Filtering Question #1
- Filtering Question #2
- Modifying DataFrame columns
- Conditional DataFrame column operations
- when() example
- When / Otherwise
- User defined functions
- Understanding user defined functions
- Using user defined functions in Spark
- Partitioning and lazy processing
- Adding an ID Field
- IDs with different partitions
- More ID tricks
- Caching
- Caching a DataFrame
- Removing a DataFrame from cache
- Improve import performance
- File size optimization
- File import performance
- Cluster configurations
- Reading Spark configurations
- Writing Spark configurations
- Performance improvements
- Normal joins
- Using broadcasting on Spark joins
- Comparing broadcast vs normal joins
- Introduction to data pipelines
- Quick pipeline
- Pipeline data issue
- Data handling techniques
- Removing commented lines
- Removing invalid rows
- Splitting into columns
- Further parsing
- Data validation
- Validate rows via join
- Examining invalid rows
- Final analysis and delivery
- Dog parsing
- Per image count
- Percentage dog pixels
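
A few hedged PySpark sketches for selected topics in the list above follow; none of them reproduce the courses' exact code or data.

For the SparkSession, table-viewing, SQL, and pandas-interchange topics near the top of the list, a minimal sketch; the `flights` table and its columns are assumptions.

```python
# Minimal sketch, assuming a table named "flights" is already registered.
from pyspark.sql import SparkSession
import pandas as pd
import numpy as np

spark = SparkSession.builder.getOrCreate()

# Viewing tables registered in the catalog
print(spark.catalog.listTables())

# Running a SQL query and converting a small result to pandas
flights10 = spark.sql("SELECT origin, dest, air_time FROM flights LIMIT 10")
flights10.show()
print(flights10.toPandas().head())      # collects to the driver: small results only

# Going the other way: register a pandas DataFrame as a temporary view
pd_temp = pd.DataFrame(np.random.random(10), columns=["x"])
spark.createDataFrame(pd_temp).createOrReplaceTempView("temp")
```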
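For the column-creation, filtering, selecting, aggregating, grouping, and joining topics, a sketch that assumes `flights` and `airports` tables exist with the columns shown.

```python
# Sketch only: "flights"/"airports" and their columns are assumed, not given.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
flights = spark.table("flights")
airports = spark.table("airports")

# Creating a column
flights = flights.withColumn("duration_hrs", flights.air_time / 60)

# Filtering: the SQL-string and Column forms are equivalent
long1 = flights.filter("distance > 1000")
long2 = flights.filter(flights.distance > 1000)

# Selecting with a computed, aliased column
speed = flights.select(
    "origin", "dest", "tailnum",
    (flights.distance / (flights.air_time / 60)).alias("avg_speed"),
)

# Grouping and aggregating
flights.groupBy("origin").agg(F.avg("air_time"), F.stddev("dep_delay")).show()

# Joining after renaming the key column
airports = airports.withColumnRenamed("faa", "dest")
flights.join(airports, on="dest", how="leftouter").show(5)
```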
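For the machine learning pipeline topics (string indexing, one-hot encoding, vector assembly, the test/train split, logistic regression, cross-validation, and AUC evaluation), a hedged sketch; `model_data`, its column names, and the grid values are illustrative assumptions.

```python
# Sketch of a pyspark.ml pipeline; "model_data" and its columns are assumptions.
import numpy as np
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

spark = SparkSession.builder.getOrCreate()
model_data = spark.table("model_data")   # assumed: numeric features + a 0/1 "label" column

# Encode a string column, then assemble everything into a single feature vector
indexer = StringIndexer(inputCol="carrier", outputCol="carrier_index")
encoder = OneHotEncoder(inputCol="carrier_index", outputCol="carrier_fact")
assembler = VectorAssembler(
    inputCols=["month", "air_time", "carrier_fact", "plane_age"],
    outputCol="features",
)
pipeline = Pipeline(stages=[indexer, encoder, assembler])

piped = pipeline.fit(model_data).transform(model_data)
training, test = piped.randomSplit([0.6, 0.4], seed=42)

# Tune regularization with cross-validation, scored by area under the ROC curve
lr = LogisticRegression()
evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, list(np.arange(0, 0.1, 0.01)))
        .addGrid(lr.elasticNetParam, [0.0, 1.0])
        .build())
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)

best_lr = cv.fit(training).bestModel
print(evaluator.evaluate(best_lr.transform(test)))
```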
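For the RDD topics (parallelized collections, lambdas with map and filter, pair RDDs, reduceByKey/sortByKey/countByKey, and the word-count exercises), a compact sketch with made-up data and a stand-in stop-word list.

```python
# Sketch with inline data; the stop-word list and sentences are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)
print(nums.getNumPartitions())                       # partitions in your data
print(nums.map(lambda x: x * x).collect())           # map and collect
print(nums.filter(lambda x: x % 2 == 0).count())     # filter and count

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
totals = pairs.reduceByKey(lambda x, y: x + y)       # sum values per key
print(totals.sortByKey(ascending=False).collect())
print(pairs.countByKey())                            # action: dict of counts per key

# Word count with stop-word removal
stop_words = {"the", "a", "and", "of"}
lines = sc.parallelize(["the quick brown fox", "the lazy dog and the fox"])
counts = (lines.flatMap(lambda line: line.lower().split())
               .filter(lambda w: w not in stop_words)
               .map(lambda w: (w, 1))
               .reduceByKey(lambda x, y: x + y))
for word, count in counts.collect():
    print(word, count)
```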
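For the DataFrame abstraction topics (RDD to DataFrame, loading a CSV, inspecting, subsetting and cleaning, and querying with SQL), a sketch in which `people.csv` and its columns are placeholders.

```python
# Sketch: the CSV path and column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# RDD to DataFrame
rdd = sc.parallelize([("Mona", 20), ("Jennifer", 34), ("John", 20)])
names_df = spark.createDataFrame(rdd, schema=["Name", "Age"])
names_df.printSchema()

# CSV to DataFrame, then inspect and subset
people_df = spark.read.csv("people.csv", header=True, inferSchema=True)
people_df.show(5)
print(people_df.count(), len(people_df.columns))
people_sub = (people_df.select("name", "sex")
                       .dropDuplicates()
                       .filter(people_df.sex == "female"))

# SQL on a temporary view
people_df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE sex = 'male'").show()
```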
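For the collaborative-filtering topics, a sketch of MLlib ALS on a tiny inline ratings RDD (standing in for the MovieLens loading step), with MSE computed against the observed ratings.

```python
# Sketch: the ratings below are made up; a real run would parse the MovieLens files.
from pyspark.sql import SparkSession
from pyspark.mllib.recommendation import ALS, Rating

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

ratings = sc.parallelize([
    Rating(1, 101, 5.0), Rating(1, 102, 3.0),
    Rating(2, 101, 4.0), Rating(2, 103, 1.0),
])

model = ALS.train(ratings, rank=10, iterations=10)

# Predict ratings for (user, product) pairs with the rating column dropped
testdata = ratings.map(lambda r: (r.user, r.product))
predictions = model.predictAll(testdata).map(lambda r: ((r.user, r.product), r.rating))

# Model evaluation using MSE
rates_and_preds = ratings.map(lambda r: ((r.user, r.product), r.rating)).join(predictions)
mse = rates_and_preds.map(lambda x: (x[1][0] - x[1][1]) ** 2).mean()
print("MSE:", mse)
```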
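For the classification topics (loading spam and non-spam text, feature hashing into LabeledPoints, and training a logistic regression with MLlib), a sketch with two tiny inline text RDDs standing in for the data files.

```python
# Sketch: the example messages are invented stand-ins for the spam/non-spam files.
from pyspark.sql import SparkSession
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

spam = sc.parallelize(["win money now", "free prize claim now"])
non_spam = sc.parallelize(["meeting at noon", "project status update"])

# Hash each message's words into a fixed-size feature vector
tf = HashingTF(numFeatures=200)
spam_features = tf.transform(spam.map(lambda line: line.split()))
non_spam_features = tf.transform(non_spam.map(lambda line: line.split()))

samples = (spam_features.map(lambda f: LabeledPoint(1, f))
           .union(non_spam_features.map(lambda f: LabeledPoint(0, f))))

model = LogisticRegressionWithLBFGS.train(samples, iterations=10)

# Training accuracy; a real run would evaluate on a held-out split
labels_and_preds = samples.map(lambda p: (p.label, model.predict(p.features)))
accuracy = labels_and_preds.filter(lambda lp: lp[0] == lp[1]).count() / samples.count()
print("Accuracy:", accuracy)
```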
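For the clustering topics, a sketch of MLlib K-means on a few inline points standing in for the 5000-points file; the cluster centers it prints are what a visualization step would plot.

```python
# Sketch: inline points replace loading and parsing the "5000 points" file.
from pyspark.sql import SparkSession
from pyspark.mllib.clustering import KMeans

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

points = sc.parallelize([(1.0, 1.0), (1.5, 2.0), (9.0, 8.0), (8.5, 9.5)])
rdd = points.map(lambda p: [float(p[0]), float(p[1])])

model = KMeans.train(rdd, k=2, maxIterations=10)
print(model.clusterCenters)          # one center per cluster
print(model.computeCost(rdd))        # within-set sum of squared errors
```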
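For the data-cleaning topics (Parquet, conditional columns with when/otherwise, user defined functions, ID fields, and caching), a sketch; the voter-style data, paths, and column names are made up.

```python
# Sketch with invented data; the Parquet path and column names are placeholders.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("Alice Smith", "Councilmember"), ("Bob Jones", "Mayor")],
    ["name", "title"],
)

# Saving to and reading from Parquet, then querying it via SQL
df.write.mode("overwrite").parquet("voters.parquet")
voters = spark.read.parquet("voters.parquet")
voters.createOrReplaceTempView("voters")
spark.sql("SELECT COUNT(*) FROM voters").show()

# Conditional column with when / otherwise
voters = voters.withColumn(
    "random_val",
    F.when(voters.title == "Mayor", 2)
     .when(voters.title == "Councilmember", F.rand())
     .otherwise(0),
)

# A user defined function that pulls out the first name
first_name_udf = F.udf(lambda full_name: full_name.split()[0], StringType())
voters = voters.withColumn("first_name", first_name_udf(voters.name))

# Monotonically increasing (not consecutive) IDs, plus caching and uncaching
voters = voters.withColumn("ROW_ID", F.monotonically_increasing_id())
voters.cache()
print(voters.is_cached)
voters.show()
voters.unpersist()
```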
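For the performance and pipeline topics (reading and writing Spark configurations, broadcast joins, removing commented lines, and splitting raw lines into columns), a sketch with small inline DataFrames; the raw lines imitate, but are not, the course's annotation file.

```python
# Sketch: the tables and raw lines below are invented for illustration.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Reading and writing configuration values
print(spark.conf.get("spark.sql.shuffle.partitions"))
spark.conf.set("spark.sql.shuffle.partitions", 200)

flights = spark.createDataFrame([("JFK", 1), ("SFO", 2)], ["dest", "flight_id"])
airports = spark.createDataFrame(
    [("JFK", "New York"), ("SFO", "San Francisco")], ["dest", "city"]
)

# Broadcasting the small table avoids shuffling the large one
joined = flights.join(broadcast(airports), on="dest")
joined.explain()                     # look for BroadcastHashJoin in the plan

# Pipeline-style cleanup: drop commented lines, split the rest into columns
raw = spark.createDataFrame(
    [("# a comment line",),
     ("folder1/img001.jpg 640 480",),
     ("folder1/img002.jpg 800 600",)],
    ["_c0"],
)
raw = raw.filter(~F.col("_c0").startswith("#"))
cols = F.split(F.col("_c0"), " ")
parsed = (raw.withColumn("filename", cols.getItem(0))
             .withColumn("width", cols.getItem(1).cast("int"))
             .withColumn("height", cols.getItem(2).cast("int")))
parsed.show()
```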