
Learning Hadoop and Spark

Contents

This is the companion repo to my LinkedIn Learning Courses on Hadoop and Spark.

  1. Learning Hadoop - link - uses mostly GCP Dataproc to run Hadoop and associated library (e.g. Hive, Pig, Spark) workloads
  2. Cloud Hadoop: Scaling Apache Spark - link - uses GCP Dataproc, AWS EMR or Databricks on AWS
  3. Azure Databricks Spark Essential Training - link - uses Azure with Databricks to scale Apache Spark workloads

DevEnv Setup Information

  • Set up a Hadoop/Spark cloud cluster on GCP Dataproc or AWS EMR
    • see the setup-hadoop folder in this repo for instructions and scripts
  • Set up a Hadoop/Spark dev environment
    • can use Eclipse Che (online IDE) or a local IDE
    • select your language (e.g. Python, Scala)
  • Create a GCS bucket for input/output job data
    • see the example_datasets folder in this repo for sample data files
  • Use Databricks Community Edition (managed, hosted Apache Spark) - example shown below
    • uses Databricks (Jupyter-style) notebooks to connect to a small, managed Spark cluster
    • AWS or Azure editions - easier to try out on AWS
    • Sign up for free trial - link

Databricks Notebook


Example Jobs or Scripts

EXAMPLES from org.apache.hadoop_or_spark.examples - link for Spark examples

  • Run a Hadoop WordCount Job with Java (jar file)
  • Run a Hadoop and/or Spark CalculatePi (digits) Script with PySpark or other libraries
  • Run using Cloudera shared demo env
    • at https://demo.gethue.com/
    • login is user:demo, pwd:demo
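
As a rough, cluster-free illustration of what the WordCount and CalculatePi examples above compute, here is a minimal sketch in plain Python. It is not the Hadoop or Spark example code: the function names `word_count` and `estimate_pi` are hypothetical, the Pi estimate assumes a simple Monte Carlo approach, and everything runs in a single process. On a cluster, the map and reduce steps would run in parallel across partitions of the input.

```python
import random
from collections import Counter
from functools import reduce


def word_count(lines):
    """Count words MapReduce-style: map lines to partial counts, then merge them."""
    # Map phase: each line becomes a Counter of its lowercased,
    # whitespace-separated words.
    mapped = (Counter(line.lower().split()) for line in lines)
    # Reduce phase: merge the partial counts by summing the values per word.
    return reduce(lambda left, right: left + right, mapped, Counter())


def estimate_pi(num_samples, seed=None):
    """Estimate Pi by sampling random points in the unit square (Monte Carlo)."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        # A point falls inside the quarter circle when x^2 + y^2 <= 1.
        if x * x + y * y <= 1.0:
            inside += 1
    # The quarter circle covers Pi/4 of the unit square.
    return 4.0 * inside / num_samples


if __name__ == "__main__":
    print(word_count(["to be or not to be", "to see or not to see"]))
    print(estimate_pi(100_000, seed=42))
```

The Pi estimate only converges slowly as the sample count grows, which is one reason this kind of job is a common smoke test for a cluster: the sampling splits cleanly across workers and only the counts need to be combined.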

Other LinkedIn Learning Courses on Hadoop or Spark

There are about 10 courses on Hadoop/Spark topics on LinkedIn Learning. See the graphic below.
Learning Paths

  • Hadoop for Data Science Tips and Tricks - link
    • Set up Cloudera Environment
    • Working with Files in HDFS
    • Connecting to Hadoop Hive
    • Complex Data Structures in Hive
  • Spark courses - link
    • Various Topics - see screenshot below

LinkedInLearningSpark
