Download Spark: https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz (reference: https://spark.apache.org/downloads.html)
Download winutils: https://github.com/steveloughran/winutils/blob/master/hadoop-3.0.0/bin/winutils.exe (reference: https://github.com/steveloughran/winutils)
Download the JDK: https://download.oracle.com/java/21/latest/jdk-21_windows-x64_bin.exe (reference: https://www.oracle.com/java/technologies/downloads/)
- Extract the downloaded Spark archive (spark-3.5.0-bin-hadoop3.tgz)
- Run the JDK installer
Create the folder c:/Hadoop/bin and place winutils.exe inside it
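Add new system variables under Environment Variables
- HADOOP_HOME=c:/Hadoop
- JAVA_HOME=C:\Program Files\Java\jdk-11.0.17 (use the directory of the JDK you actually installed; the Spark 3.5 documentation lists Java 8, 11, and 17 as supported)
- SPARK_HOME=the folder where spark-3.5.0-bin-hadoop3 was extracted
Update the Path variable to include the bin directories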
- %JAVA_HOME%\bin
- %SPARK_HOME%\bin
- Go to spark-3.5.0-bin-hadoop3\python\lib and locate the py4j- archive
- Update the Path system environment variable with the absolute path of the py4j- archive, i.e. \spark-3.5.0-bin-hadoop3\python\lib\py4j-
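The variables and folders described above can be sanity-checked with a short script. This is a minimal sketch rather than part of the original setup: the function name `check_spark_env` is hypothetical, and it assumes `SPARK_HOME` points at the extracted spark-3.5.0-bin-hadoop3 folder.

```python
import os
from pathlib import Path

# Variables this guide relies on (SPARK_HOME is used by %SPARK_HOME%\bin).
REQUIRED_VARS = ("JAVA_HOME", "HADOOP_HOME", "SPARK_HOME")


def check_spark_env(env=None):
    """Return a list of problems found; an empty list means all checks passed."""
    env = os.environ if env is None else env
    problems = []

    # Each variable must be set and must point at an existing directory.
    for name in REQUIRED_VARS:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not Path(value).is_dir():
            problems.append(f"{name} points to a missing directory: {value}")

    # winutils.exe is expected under HADOOP_HOME\bin.
    hadoop_home = env.get("HADOOP_HOME")
    if hadoop_home and not (Path(hadoop_home) / "bin" / "winutils.exe").is_file():
        problems.append("winutils.exe not found under HADOOP_HOME\\bin")

    return problems


if __name__ == "__main__":
    for problem in check_spark_env():
        print(problem)
```

Run it from a newly opened Command Prompt, since environment variable changes are not picked up by windows that were already open.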
Test the Spark installation from a Command Prompt (cmd): spark-submit --version
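The same check can be run from Python, which is handy when diagnosing Path problems. The sketch below is an illustration only: the helper name `spark_submit_version` is hypothetical, and it simply reports whether the command is found on the Path before running it.

```python
import shutil
import subprocess


def spark_submit_version(command="spark-submit"):
    """Return the text printed by `<command> --version`, or None if it is not on the Path."""
    executable = shutil.which(command)
    if executable is None:
        return None

    result = subprocess.run(
        [executable, "--version"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Combine both streams so the banner is captured whichever one is used.
    return (result.stdout + result.stderr).strip()


if __name__ == "__main__":
    banner = spark_submit_version()
    print(banner if banner is not None else "spark-submit was not found on the Path")
```

A result of `None` usually means `%SPARK_HOME%\bin` is missing from the Path or the prompt was opened before the variable was changed.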
In PyCharm, open Settings > Project Structure and add a content root
for each of the two libraries in spark-3.5.0-bin-hadoop3\python\lib:
- py4j-
- pyspark.zip
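When a script has to run outside PyCharm, the same two archives can be added to `sys.path` at runtime instead. This is a minimal sketch under assumptions: the helper names are hypothetical, and the py4j archive is matched with a wildcard because its exact file name depends on the Spark release.

```python
import sys
from pathlib import Path


def pyspark_lib_paths(spark_home):
    """Return the py4j and pyspark archives found under <spark_home>/python/lib."""
    lib_dir = Path(spark_home) / "python" / "lib"
    # The py4j file name carries a version, so match it with a wildcard.
    archives = sorted(lib_dir.glob("py4j-*.zip")) + sorted(lib_dir.glob("pyspark.zip"))
    return [str(path) for path in archives]


def add_to_sys_path(paths):
    """Prepend each path to sys.path unless it is already present."""
    for path in paths:
        if path not in sys.path:
            sys.path.insert(0, path)
```

For example, calling `add_to_sys_path(pyspark_lib_paths(os.environ["SPARK_HOME"]))` (after `import os`) before `from pyspark.sql import SparkSession` should have the same effect as the content roots, assuming `SPARK_HOME` is set.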
Sample PySpark code
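from pyspark.sql import SparkSession

# Start a local Spark session that uses a single worker thread
spark = SparkSession.builder.master("local[1]").appName("pyspark-sample").getOrCreate()
# Read the CSV, treating the first row as a header and letting Spark infer column types
df = spark.read.csv("data/coursera_course_dataset_v2_no_null.csv", header=True, inferSchema=True)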
The dataset was scraped directly from the Coursera website.
It contains 6 main columns and data for 977 courses.
Title: The course title.
Organization: The organization offering the course.
Skills: The list of skills that can be gained from the course.
Ratings: The rating associated with each course.
Review count: The number of reviews for each course.
Miscellaneous info: Course type, difficulty, and expected training length.
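After loading the file, it can help to confirm that the columns described above are actually present. The sketch below is only an illustration: the exact header names in the CSV are assumptions taken from the description, the miscellaneous column is left out because its header is not stated, and `missing_columns` is a hypothetical helper.

```python
# Assumed header names, based on the column descriptions above.
EXPECTED_COLUMNS = ["Title", "Organization", "Skills", "Ratings", "Review count"]


def missing_columns(actual, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent, ignoring case and outer spaces."""
    present = {name.strip().lower() for name in actual}
    return [name for name in expected if name.strip().lower() not in present]
```

With the sample code, `missing_columns(df.columns)` returns an empty list when every assumed column is found, and `df.printSchema()` shows the inferred type of each one.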