ahmedredahussien/PySparkLab

PySparkLab

  1. Download Spark: https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz (reference: https://spark.apache.org/downloads.html)

  2. Download winutils: https://github.com/steveloughran/winutils/blob/master/hadoop-3.0.0/bin/winutils.exe (reference: https://github.com/steveloughran/winutils)

  3. Download the JDK: https://download.oracle.com/java/21/latest/jdk-21_windows-x64_bin.exe (reference: https://www.oracle.com/java/technologies/downloads/)

  • Extract the downloaded Spark archive (.tgz)
  • Install everything downloaded above

  1. Create the folder c:/Hadoop/bin and place winutils.exe inside it

  2. Add new system variables under Environment Variables

  • HADOOP_HOME=c:/Hadoop
  • SPARK_HOME=
  • JAVA_HOME=C:\Program Files\Java\jdk-11.0.17 (adjust to the folder of the JDK you actually installed)

Update the Path variable to include the bin directories

  • %JAVA_HOME%\bin
  • %SPARK_HOME%\bin
  • %HADOOP_HOME%\bin
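The variables above can be checked from Python before launching Spark. The following is a minimal sketch and is not part of the original repository: the helper name `check_spark_env` is hypothetical, and it only verifies that each variable is set and that its bin folder exists.

```python
import os
from pathlib import Path

# Variables the setup steps above ask for.
REQUIRED_VARS = ("HADOOP_HOME", "SPARK_HOME", "JAVA_HOME")


def check_spark_env(env=None):
    """Return a list of problems; an empty list means every check passed.

    Hypothetical helper: `env` defaults to os.environ and can be replaced
    by a plain dict when testing.
    """
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name, "").strip()
        if not value:
            problems.append(f"{name} is not set")
            continue
        # Each variable should point at a folder that contains a bin directory.
        bin_dir = Path(value) / "bin"
        if not bin_dir.is_dir():
            problems.append(f"{name} is set, but {bin_dir} does not exist")
    return problems
```

Calling `check_spark_env()` with no arguments inspects the current process environment. Open a new command prompt first, since environment changes are not picked up by prompts that were already open.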
  1. Go to spark-3.5.0-bin-hadoop3\python\lib and locate py4j-0.10.9.7-src.zip

Update the Path system environment variable with the absolute path of py4j-0.10.9.7-src.zip, i.e. \spark-3.5.0-bin-hadoop3\python\lib\py4j-0.10.9.7-src.zip

  1. Test Spark from the "cmd" command prompt: spark-submit --version

  2. Add content roots in Project Structure under PyCharm settings

Point the content roots to both libraries:

  • py4j-0.10.9.7-src.zip
  • pyspark.zip
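Outside PyCharm, the same two archives can be put on the import path at runtime. This is a minimal sketch, assuming SPARK_HOME points at the extracted Spark folder. Both function names are hypothetical, and the py4j archive is found by pattern rather than by a hard-coded version.

```python
import os
import sys
from pathlib import Path


def spark_python_archives(spark_home):
    """Return the py4j and pyspark zip archives found under <spark_home>/python/lib."""
    lib_dir = Path(spark_home) / "python" / "lib"
    archives = sorted(lib_dir.glob("py4j-*-src.zip"))
    pyspark_zip = lib_dir / "pyspark.zip"
    if pyspark_zip.is_file():
        archives.append(pyspark_zip)
    return [str(p) for p in archives]


def add_spark_to_sys_path(spark_home=None):
    """Prepend the archives to sys.path and return the entries that were added."""
    spark_home = spark_home or os.environ.get("SPARK_HOME", "")
    if not spark_home:
        raise RuntimeError("SPARK_HOME is not set")
    added = []
    for archive in spark_python_archives(spark_home):
        if archive not in sys.path:
            sys.path.insert(0, archive)
            added.append(archive)
    return added
```

Calling `add_spark_to_sys_path()` before `from pyspark.sql import SparkSession` has a similar effect to the content roots, but only for the current Python process.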

Sample PySpark code

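from pyspark.sql import SparkSession

# Start a local Spark session with a single worker thread.
spark = SparkSession.builder.master("local[1]").appName("pyspark-sample").getOrCreate()

# Read the CSV, treating the first row as a header and letting Spark infer column types.
df = spark.read.csv("data/coursera_course_dataset_v2_no_null.csv", header=True, inferSchema=True)

# Print the inferred column names and types.
df.printSchema()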

Dataset Description

The dataset was scraped directly from the Coursera website.

Content

The dataset has 6 main columns and 977 course records.

The detailed description:

  • Title: the course title.

  • Organization: the organization that offers the course.

  • Skills: the list of skills that can be gained from the course.

  • Ratings: the rating associated with each course.

  • Review count: the number of reviews for each course.

  • Miscellaneous info: course type, difficulty, and expected training length.
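The standard-library csv module offers a quick way to confirm the row count and column names before loading the file into Spark. This is a sketch rather than code from the repository, and `summarize_csv` is a hypothetical helper. It assumes the file is UTF-8 encoded and that its first row is a header.

```python
import csv


def summarize_csv(path, encoding="utf-8"):
    """Return (column_names, row_count) for a CSV file whose first row is a header."""
    # newline="" lets the csv module handle line breaks inside quoted fields.
    with open(path, newline="", encoding=encoding) as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        row_count = sum(1 for _ in reader)
    return header, row_count
```

For example, `summarize_csv("data/coursera_course_dataset_v2_no_null.csv")` returns the header names and the number of data rows, which can be compared with the figures given above.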
