mohamedsharshar/Bigdata-spark-project

Bigdata-spark-project

Big Data Analysis and Visualization using Spark and HDFS

Big Data Project

Project Overview

This project analyzes and visualizes large datasets using Spark and HDFS. It covers extract, transform, load (ETL) processing, data cleaning, exploratory data analysis (EDA), visualization, parallel processing with Spark, and job monitoring in Databricks.

Project Structure

  • data/: Contains the dataset(s).
  • scripts/: Contains ETL scripts, data cleaning scripts, and Spark job scripts.
  • docs/: Contains project documentation, report, and presentation.
  • visualizations/: Contains visualization outputs.
  • notebooks/: Contains Jupyter notebooks for EDA.
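
As a quick sanity check after cloning, the layout above can be verified with a few lines of Python. This is only a sketch: the helper name `missing_dirs` is hypothetical and not part of the repository, and it assumes the five directories sit at the top level of the project.

```python
from pathlib import Path

# Top-level directories listed in the project structure above.
EXPECTED_DIRS = ("data", "scripts", "docs", "visualizations", "notebooks")


def missing_dirs(root: str = ".") -> list:
    """Return the expected directories that do not exist under root."""
    base = Path(root)
    return [name for name in EXPECTED_DIRS if not (base / name).is_dir()]
```

Calling `missing_dirs(".")` from the project root returns an empty list when every directory is present.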

Setup Instructions

  1. Install the required dependencies.
  2. Run the ETL scripts to prepare the data.
  3. Perform EDA using the provided notebooks.
  4. Execute Spark jobs for parallel processing.
  5. Monitor job performance using Databricks.
  6. Store data in HDFS and track job metrics.
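
The ETL, cleaning, and HDFS storage steps above might look roughly like the following PySpark sketch. It is not the code in scripts/. The function names, the application name, the CSV input format, the Parquet output format, and the NameNode host and port are all assumptions made for illustration, so adjust them to your cluster and data.

```python
def hdfs_uri(path: str, host: str = "namenode", port: int = 8020) -> str:
    """Build an hdfs:// URI. The host and port defaults are placeholders."""
    return f"hdfs://{host}:{port}/{path.lstrip('/')}"


def run_etl(input_path: str, output_path: str) -> None:
    """Read a CSV file, drop incomplete and duplicate rows, and write Parquet."""
    # Imported here so hdfs_uri can be used on machines without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bigdata-spark-etl").getOrCreate()
    try:
        # Extract: load the raw CSV and let Spark infer column types.
        raw = spark.read.csv(input_path, header=True, inferSchema=True)
        # Transform: remove rows with nulls, then exact duplicate rows.
        cleaned = raw.dropna().dropDuplicates()
        # Load: write the cleaned data, replacing any previous output.
        cleaned.write.mode("overwrite").parquet(output_path)
    finally:
        spark.stop()
```

A call such as `run_etl("data/input.csv", hdfs_uri("/project/cleaned"))` would read a local file (the file name here is hypothetical) and store the cleaned result in HDFS. Spark distributes the read, transform, and write stages across the available executors, which is what provides the parallel processing.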

How to Run

  1. Clone the repository: git clone https://github.com/mohamedsharshar/Bigdata-spark-project.git
  2. Navigate to the project directory: cd Bigdata-spark-project
  3. Follow the instructions in each script/notebook.

Contributing

Contributions are welcome! Please fork the repository and create a pull request with your changes.
