This is the source code directory for the PySpark Algorithms book by Mahmoud Parsian.
Each chapter contains:
- Source code for PySpark programs
- Sample shell scripts (how to invoke PySpark programs)
- Sample data
- Sample log files to show I/O
PySpark is the Python binding for the Spark platform and API; it is not much different from the Java or Scala versions. The Python programming language is dynamically typed (there is no need to compile Python programs), so RDDs can hold objects of multiple types. PySpark supports the following Spark data abstractions:
- RDDs
- DataFrames
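As a minimal sketch (the file name and data values below are hypothetical, not taken from the book), both abstractions can be created like this:

```python
# abstractions_demo.py -- a minimal sketch (hypothetical file name and data,
# not from the book) showing an RDD and a DataFrame in PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("abstractions-demo").getOrCreate()
sc = spark.sparkContext

# RDD: since Python is dynamically typed, one RDD may hold
# elements of different types (here: an int, a str, and a tuple).
rdd = sc.parallelize([1, "two", (3, "three")])
print(rdd.collect())

# DataFrame: a distributed collection of rows with named columns.
df = spark.createDataFrame([("alice", 30), ("bob", 25)], ["name", "age"])
df.show()

spark.stop()
```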
Spark runs on Java 8+, Python 2.7+/3.4+, and R 3.1+.
The PySpark programs presented here were tested with the following versions:
- Spark 2.4.3
- Java 8 (version "1.8.0_144")
- Python 3.7.2
Which Python should we use, Python 2 or Python 3? According to Spark's documentation, we may use either one. But since Python 2 retires in 2020, I have decided to use Python 3. You may select your own Python by setting the `PYSPARK_PYTHON` environment variable.
Let's assume that Python 2 is installed as `/usr/bin/python` and that Python 3 is installed as `/usr/local/bin/python3`.
To use Python 3, execute the following before invoking the PySpark shell or the `spark-submit` command:
# verify the version of Python
/usr/local/bin/python3 --version
Python 3.7.2
export PYSPARK_PYTHON="/usr/local/bin/python3"
To use Python 2, execute the following before invoking the PySpark shell or the `spark-submit` command:
# verify the version of Python
/usr/bin/python --version
Python 2.7.1
export PYSPARK_PYTHON="/usr/bin/python"
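To confirm which interpreter the PySpark driver and workers actually use, a small sketch such as the following (the file name is hypothetical) can be run with `spark-submit` after exporting `PYSPARK_PYTHON`:

```python
# check_python_version.py -- a small sketch (hypothetical file name) that
# prints which Python interpreter the driver and the workers are using.
import sys

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("python-version-check").getOrCreate()
sc = spark.sparkContext

# Interpreter used by the driver:
print("driver python:", sys.executable, sys.version)

# Interpreter used by the workers (should match PYSPARK_PYTHON):
print("worker python:", sc.parallelize([0]).map(lambda _: sys.version).first())

spark.stop()
```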
There are 12 chapters, and each chapter has a designated source code directory:
- Chapter 1: Introduction to PySpark
- Chapter 2: Hello World in PySpark
- Chapter 3: Data Abstractions
- Chapter 4: Getting Started
- Chapter 5: Transformations in Spark
- Chapter 6: Reductions in Spark
- Chapter 7: DataFrames
- Chapter 8: Spark Data Sources
- Chapter 9: Logistic Regression
- Chapter 10: Movie Recommendations
- Chapter 11: Graph Algorithms
- Chapter 12: Design Patterns
- Appendix-A: Spark Installation
- Appendix-B: Lambda Expressions and Functions
- Appendix-C: Questions and Answers
Most programs have 3 associated files:
- PySpark program (`.py`): the actual program, which will run
- Shell script (`.sh`): how to run the program
- Sample log file (`.log`): what to expect from running the program
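As an illustration, a minimal PySpark program in the spirit of the `.py` files in this repository might look like the sketch below (the file name, input-path handling, and word-count logic are hypothetical and not taken from the book):

```python
# hello_world.py -- a hypothetical sketch (not from the book): read an input
# path from the command line, perform a simple word count, and print the
# result (the kind of output that would appear in an associated .log file).
import sys

from pyspark.sql import SparkSession

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: hello_world.py <input-path>", file=sys.stderr)
        sys.exit(1)

    input_path = sys.argv[1]
    spark = SparkSession.builder.appName("hello-world").getOrCreate()

    counts = (
        spark.sparkContext.textFile(input_path)
        .flatMap(lambda line: line.split())
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)
    )
    for word, count in counts.collect():
        print(word, count)

    spark.stop()
```

A matching shell script would typically call `spark-submit` on the `.py` file, and the captured output would correspond to the `.log` file.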
Per chapter, I have provided only some sample data. When the size of the data is big, I have provided a pointer to that data (when possible).
Sample data files (as `.txt`, `.csv`, `.json`, ...) are given for each chapter. These files are located along with the associated programs (the `.py` and `.sh` files).
The author assumes that you have installed Spark at `/pyspark_book/spark-2.4.3/` (for Spark 2.4.3). However, you may install it in any other location and then update the shell scripts (`.sh` files) accordingly. The `SPARK_HOME` environment variable refers to Spark's installation directory.
After decompressing the Spark download, you will see the following directory structure:
- bin
- sbin
- conf
- examples
- licenses
- ...
The Spark installation comes with a set of PySpark examples; they are located at `$SPARK_HOME/examples/src/main/python/`.
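For instance, assuming `SPARK_HOME` is set, those bundled examples can be listed with a few lines of Python (a sketch, not part of the book's code):

```python
# list_spark_examples.py -- a small sketch (not from the book) that lists the
# PySpark examples bundled with the Spark installation; it assumes that the
# SPARK_HOME environment variable is set.
import os
from pathlib import Path

spark_home = os.environ["SPARK_HOME"]
examples_dir = Path(spark_home) / "examples" / "src" / "main" / "python"

for program in sorted(examples_dir.glob("*.py")):
    print(program.name)
```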
If you have any comments or suggestions, please report them to the author at [email protected]
Thank you!