How Apache Spark™ 3.0 and Delta Lake Enhance Data Lake Reliability

2020-08-27 | Watch the video | This folder contains the notebooks used in this tutorial.

Apache Spark has become the de facto open source standard for big data processing, thanks to its ease of use and performance. The open source Delta Lake project improves data reliability on Spark with new capabilities such as ACID transactions, schema enforcement, and time travel.

These capabilities help ensure that data lakes and data pipelines deliver high-quality, reliable data to downstream data teams for successful data analytics and machine learning projects.
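For the curious, here is a minimal PySpark sketch of those reliability features, assuming Delta Lake 0.7.x on Spark 3.0; the `/tmp/delta/events` path and column names are hypothetical:

```python
from pyspark.sql import SparkSession

# Assumes the delta-core package is on the classpath, e.g.
#   pyspark --packages io.delta:delta-core_2.12:0.7.0
spark = (SparkSession.builder
         .appName("delta-reliability-sketch")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

path = "/tmp/delta/events"  # hypothetical table location

# ACID write: each save is an atomic, versioned transaction.
spark.range(100).withColumnRenamed("id", "event_id") \
     .write.format("delta").mode("overwrite").save(path)

# Schema enforcement: appending a mismatched schema raises an exception
# instead of silently corrupting the table.
try:
    spark.range(10).withColumnRenamed("id", "wrong_column") \
         .write.format("delta").mode("append").save(path)
except Exception as e:
    print("Rejected by schema enforcement:", type(e).__name__)

# Time travel: read the table as of an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
print(v0.count())
```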

Join us in this webinar to learn how Apache Spark 3.0 and Delta Lake enhance data lake reliability.

Topics covered include:

  • Apache Spark’s usage for big data processing
  • The evolution and technical challenges around data lake architectures
  • Delta Lake’s capabilities for ensuring reliable data in Spark processing (sketched above)
  • Simplifying architectures with unified batch and streaming (see the streaming sketch after this list)
  • The new Adaptive Query Execution (AQE) framework in Spark 3.0 can yield significant query performance gains: on a 3TB TPC-DS benchmark, two queries sped up by more than 1.5x and another 37 queries by more than 1.1x (see the configuration sketch after this list).
  • Dynamic Partition Pruning (DPP) can significantly speed up queries by pruning partitions based on the joins between fact and dimension tables that are common in star schema designs (covered in the same configuration sketch).
  • Accelerator-aware scheduling helps Spark take advantage of GPUs and other hardware accelerators for certain workloads (e.g., deep learning); this release enhances the scheduler and makes the cluster manager accelerator-aware (see the resource configuration sketch below).
  • Spark 3.0 also introduces new Pandas UDF types and new Pandas function APIs for improved performance and usability (see the example at the end of this list).
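To illustrate the unified batch and streaming point, here is a minimal sketch reusing the `spark` session from the Delta sketch above: a streaming job appends to a Delta table while a batch query reads the very same table with the same DataFrame API. The paths are hypothetical.

```python
import time

# Streaming write: the built-in "rate" source generates timestamped test rows.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

query = (stream.writeStream
         .format("delta")
         .outputMode("append")
         .option("checkpointLocation", "/tmp/checkpoints/rate_events")  # hypothetical
         .start("/tmp/delta/rate_events"))                              # hypothetical

time.sleep(10)   # let a few micro-batches commit
query.stop()

# Batch read of the very same table, no separate batch pipeline needed.
print(spark.read.format("delta").load("/tmp/delta/rate_events").count())
```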
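AQE and DPP are both controlled through SQL configuration flags. In this sketch the flag names are the actual Spark 3.0 configs, but the `sales` and `dim_date` tables and their columns are hypothetical:

```python
# Adaptive Query Execution: re-optimizes the plan at runtime using shuffle statistics.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge small partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed partitions

# Dynamic Partition Pruning (on by default in 3.0): prunes fact-table partitions
# using the join keys produced by the dimension side.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

# Hypothetical star-schema join: only the fact-table partitions matching the
# filtered dimension rows are scanned.
sales = spark.table("sales")       # fact table, partitioned by sale_date (hypothetical)
dates = spark.table("dim_date")    # dimension table (hypothetical)
(sales.join(dates, "sale_date")
      .where(dates["year"] == 2020)
      .groupBy("sale_date").count()
      .explain())  # look for "dynamicpruning" in the partition filters
```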
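Accelerator-aware scheduling is configured through the new `spark.executor.resource.*` and `spark.task.resource.*` properties. A minimal sketch, assuming a GPU-equipped cluster and a site-provided discovery script (the script path is hypothetical):

```python
from pyspark import TaskContext
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("gpu-scheduling-sketch")
         # One GPU per executor, located by a site-specific discovery script.
         .config("spark.executor.resource.gpu.amount", "1")
         .config("spark.executor.resource.gpu.discoveryScript",
                 "/opt/spark/scripts/getGpus.sh")  # hypothetical path
         # Each task claims one whole GPU.
         .config("spark.task.resource.gpu.amount", "1")
         .getOrCreate())

def which_gpu(_):
    # TaskContext.resources() exposes the accelerator addresses assigned to this task.
    return TaskContext.get().resources()["gpu"].addresses

print(spark.sparkContext.parallelize(range(2), 2).map(which_gpu).collect())
```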
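The redesigned Pandas UDFs in Spark 3.0 infer the UDF type from Python type hints rather than an explicit type argument, and the new Pandas function APIs include `mapInPandas`. A minimal sketch, again reusing the `spark` session from the first code example:

```python
from typing import Iterator

import pandas as pd
from pyspark.sql.functions import pandas_udf

# New-style Series-to-Series Pandas UDF: the type hints replace the old
# PandasUDFType argument.
@pandas_udf("long")
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1

df = spark.range(5)
df.select(plus_one(df["id"]).alias("id_plus_one")).show()

# New Pandas function API: mapInPandas streams the DataFrame through a
# function that consumes and yields pandas DataFrames.
def keep_even(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for pdf in batches:
        yield pdf[pdf["id"] % 2 == 0]

df.mapInPandas(keep_even, schema="id long").show()
```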