PySparkLab

Download Spark: https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz (reference: https://spark.apache.org/downloads.html)
Download winutils: https://github.com/steveloughran/winutils/blob/master/hadoop-3.0.0/bin/winutils.exe (reference: https://github.com/steveloughran/winutils)
Download the JDK: https://download.oracle.com/java/21/latest/jdk-21_windows-x64_bin.exe (reference: https://www.oracle.com/java/technologies/downloads/)

Extract the downloaded Spark archive (.tgz) to a folder
Install the downloaded JDK

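Create the folder c:/Hadoop/bin and place winutils.exe inside it
Add the following new system variables under Environment Variables (set SPARK_HOME to the folder where you extracted Spark and JAVA_HOME to your own JDK installation folder; the JAVA_HOME value below is an example)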

HADOOP_HOME=c:/Hadoop
SPARK_HOME=
JAVA_HOME=C:\Program Files\Java\jdk-11.0.17

Update the Path variable to include the bin directories

%JAVA_HOME%\bin
%SPARK_HOME%\bin
%HADOOP_HOME%\bin
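
The variables above are easy to mistype, so a quick check can help before moving on. The following is a minimal sketch, not part of the original project: the helper name check_spark_env is hypothetical, and it assumes each of the three variables should point to a folder that contains a bin subfolder which is also listed on Path.

```python
import os
from pathlib import Path

# The three variables this guide asks you to define.
REQUIRED_VARS = ("HADOOP_HOME", "SPARK_HOME", "JAVA_HOME")


def check_spark_env(env):
    """Return a list of problems found in an environment mapping (hypothetical helper)."""
    # Normalize PATH entries so that case and slash differences do not matter.
    path_entries = {
        os.path.normcase(os.path.normpath(entry))
        for entry in env.get("PATH", "").split(os.pathsep)
        if entry
    }
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name, "").strip()
        if not value:
            problems.append(f"{name} is not set")
            continue
        bin_dir = Path(value) / "bin"
        if not bin_dir.is_dir():
            problems.append(f"{name} is set, but {bin_dir} does not exist")
        elif os.path.normcase(os.path.normpath(bin_dir)) not in path_entries:
            problems.append(f"{bin_dir} is not on PATH")
    return problems


if __name__ == "__main__":
    issues = check_spark_env(os.environ)
    print("\n".join(issues) if issues else "No problems found.")
```

Run it from a new command prompt, since prompts opened before the variables were changed still see the old values.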

Go to spark-3.5.0-bin-hadoop3\python\lib and locate py4j-0.10.9.7-src.zip

Update the Path system environment variable with the absolute path of py4j-0.10.9.7-src.zip, i.e. \spark-3.5.0-bin-hadoop3\python\lib\py4j-0.10.9.7-src.zip

Test the Spark installation from a "cmd" command prompt:

spark-submit --version
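
The same check can be scripted. This is a rough sketch under stated assumptions: the function names are hypothetical, and both output streams are captured because the version banner may not arrive on standard output.

```python
import shutil
import subprocess


def find_spark_submit(search_path=None):
    """Return the full path of spark-submit, or None if it is not found.

    search_path defaults to the PATH of the current process.
    """
    return shutil.which("spark-submit", path=search_path)


def spark_version_output(executable):
    """Run `spark-submit --version` and return everything it printed."""
    result = subprocess.run(
        [executable, "--version"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Combine both streams so the banner is kept wherever it was written.
    return result.stdout + result.stderr


if __name__ == "__main__":
    executable = find_spark_submit()
    if executable is None:
        print("spark-submit was not found on PATH")
    else:
        print(spark_version_output(executable))
```

If the script reports that spark-submit was not found, the %SPARK_HOME%\bin entry in the Path variable is the first thing to re-check.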
Add a content root in Project Structure under the PyCharm settings

Point it to both libs

py4j-0.10.9.7-src.zip
pyspark.zip
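
Outside PyCharm, one possible alternative is to put the same two archives on sys.path at the top of a script. The sketch below is an assumption-laden illustration rather than the project's method: the helper names are hypothetical, and it assumes the archives live under python\lib inside the Spark folder, as described above.

```python
import glob
import os
import sys


def spark_python_libs(spark_home):
    """Return the py4j source zip(s) and pyspark.zip found under <spark_home>/python/lib."""
    lib_dir = os.path.join(spark_home, "python", "lib")
    # The py4j version is part of the file name, so match it with a wildcard.
    found = sorted(glob.glob(os.path.join(lib_dir, "py4j-*-src.zip")))
    pyspark_zip = os.path.join(lib_dir, "pyspark.zip")
    if os.path.isfile(pyspark_zip):
        found.append(pyspark_zip)
    return found


def add_spark_libs_to_sys_path(spark_home):
    """Prepend the archives to sys.path so `import pyspark` can resolve them."""
    for archive in spark_python_libs(spark_home):
        if archive not in sys.path:
            sys.path.insert(0, archive)


if __name__ == "__main__":
    home = os.environ.get("SPARK_HOME", "")
    print(spark_python_libs(home) if home else "SPARK_HOME is not set")
```

Calling add_spark_libs_to_sys_path(os.environ["SPARK_HOME"]) before the first pyspark import has a similar effect to the content-root setting, but only for that one script.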

Sample PySpark code

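from pyspark.sql import SparkSession

# Start a local Spark session that uses a single worker thread
spark = SparkSession.builder.master("local[1]").appName("pyspark-sample").getOrCreate()

# Read the CSV with a header row and let Spark infer the column types
df = spark.read.csv("data/coursera_course_dataset_v2_no_null.csv", header=True, inferSchema=True)

# Print the inferred column names and types
df.printSchema()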

Dataset Description

The dataset was scraped directly from the Coursera website.

Content

This dataset contains 6 main columns and data for 977 courses.

The detailed description:

Title: The course title.
Organization: The organization conducting the course.
Skills: The list of skills that can be obtained from the course.
Ratings: The rating associated with each course.
Review count: The number of reviews for each course.
Miscellaneous info: Course type, difficulty, and expected training length.
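
The column and row counts described above can be checked without starting Spark. This is a small sketch using only the standard library: summarize_csv is a hypothetical helper, UTF-8 encoding is an assumption, and the file path is the one used in the sample code.

```python
import csv


def summarize_csv(path):
    """Return (header, data_row_count) for a CSV file with a header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])          # first record holds the column names
        row_count = sum(1 for _ in reader)  # remaining records are data rows
    return header, row_count


# Example usage (commented out so the snippet runs without the data file):
# header, rows = summarize_csv("data/coursera_course_dataset_v2_no_null.csv")
# print(len(header), "columns:", header)
# print(rows, "courses")
```

If the description holds for the file in the data folder, the header should list 6 columns and the row count should be 977.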
