How Apache Spark™ 3.0 and Delta Lake Enhance Data Lake Reliability

2020-08-27 | Watch the video | This folder contains the notebooks used in this tutorial.

Apache Spark has become the de facto open source standard for big data processing, thanks to its ease of use and performance. The open source Delta Lake project improves data reliability on Spark with new capabilities such as ACID transactions, schema enforcement, and time travel.

These capabilities help ensure that data lakes and data pipelines deliver high-quality, reliable data to downstream data teams for successful data analytics and machine learning projects.
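For the curious, here is a minimal PySpark sketch of those reliability features, assuming Delta Lake 0.7.x on Spark 3.0; the `/tmp/delta/events` path and column names are hypothetical:

```python
from pyspark.sql import SparkSession

# Assumes the delta-core package is on the classpath, e.g.
#   pyspark --packages io.delta:delta-core_2.12:0.7.0
spark = (SparkSession.builder
         .appName("delta-reliability-sketch")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

path = "/tmp/delta/events"  # hypothetical table location

# ACID write: each save is an atomic, versioned transaction.
spark.range(100).withColumnRenamed("id", "event_id") \
     .write.format("delta").mode("overwrite").save(path)

# Schema enforcement: appending a mismatched schema raises an exception
# instead of silently corrupting the table.
try:
    spark.range(10).withColumnRenamed("id", "wrong_column") \
         .write.format("delta").mode("append").save(path)
except Exception as e:
    print("Rejected by schema enforcement:", type(e).__name__)

# Time travel: read the table as of an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
print(v0.count())
```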

Join us in this webinar to learn how Apache Spark 3.0 and Delta Lake enhance data lake reliability.

Topics covered include:

  • Apache Spark’s usage for big data processing
  • The evolution and technical challenges around data lake architectures
  • Delta Lake’s capabilities for ensuring reliable data in Spark processing (sketched above)
  • Simplifying architectures with unified batch and streaming (see the streaming sketch after this list)
  • The new Adaptive Query Execution (AQE) framework in Spark 3.0 can yield significant query performance gains: on a 3TB TPC-DS benchmark, two queries sped up by more than 1.5x and another 37 queries by more than 1.1x (see the configuration sketch after this list).
  • Dynamic Partition Pruning (DPP) can significantly speed up queries by pruning partitions based on the joins between fact and dimension tables that are common in star schema designs (covered in the same configuration sketch).
  • Accelerator-aware scheduling helps Spark take advantage of GPUs and other hardware accelerators for certain workloads (e.g., deep learning); this release enhances the scheduler and makes the cluster manager accelerator-aware (see the resource configuration sketch below).
  • Spark 3.0 also introduces new Pandas UDF types and new Pandas function APIs for improved performance and usability (see the example at the end of this list).
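To illustrate the unified batch and streaming point, here is a minimal sketch reusing the `spark` session from the Delta sketch above: a streaming job appends to a Delta table while a batch query reads the very same table with the same DataFrame API. The paths are hypothetical.

```python
import time

# Streaming write: the built-in "rate" source generates timestamped test rows.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

query = (stream.writeStream
         .format("delta")
         .outputMode("append")
         .option("checkpointLocation", "/tmp/checkpoints/rate_events")  # hypothetical
         .start("/tmp/delta/rate_events"))                              # hypothetical

time.sleep(10)   # let a few micro-batches commit
query.stop()

# Batch read of the very same table, no separate batch pipeline needed.
print(spark.read.format("delta").load("/tmp/delta/rate_events").count())
```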
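AQE and DPP are both controlled through SQL configuration flags. In this sketch the flag names are the actual Spark 3.0 configs, but the `sales` and `dim_date` tables and their columns are hypothetical:

```python
# Adaptive Query Execution: re-optimizes the plan at runtime using shuffle statistics.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge small partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed partitions

# Dynamic Partition Pruning (on by default in 3.0): prunes fact-table partitions
# using the join keys produced by the dimension side.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

# Hypothetical star-schema join: only the fact-table partitions matching the
# filtered dimension rows are scanned.
sales = spark.table("sales")       # fact table, partitioned by sale_date (hypothetical)
dates = spark.table("dim_date")    # dimension table (hypothetical)
(sales.join(dates, "sale_date")
      .where(dates["year"] == 2020)
      .groupBy("sale_date").count()
      .explain())  # look for "dynamicpruning" in the partition filters
```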
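Accelerator-aware scheduling is configured through the new `spark.executor.resource.*` and `spark.task.resource.*` properties. A minimal sketch, assuming a GPU-equipped cluster and a site-provided discovery script (the script path is hypothetical):

```python
from pyspark import TaskContext
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("gpu-scheduling-sketch")
         # One GPU per executor, located by a site-specific discovery script.
         .config("spark.executor.resource.gpu.amount", "1")
         .config("spark.executor.resource.gpu.discoveryScript",
                 "/opt/spark/scripts/getGpus.sh")  # hypothetical path
         # Each task claims one whole GPU.
         .config("spark.task.resource.gpu.amount", "1")
         .getOrCreate())

def which_gpu(_):
    # TaskContext.resources() exposes the accelerator addresses assigned to this task.
    return TaskContext.get().resources()["gpu"].addresses

print(spark.sparkContext.parallelize(range(2), 2).map(which_gpu).collect())
```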
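The redesigned Pandas UDFs in Spark 3.0 infer the UDF type from Python type hints rather than an explicit type argument, and the new Pandas function APIs include `mapInPandas`. A minimal sketch, again reusing the `spark` session from the first code example:

```python
from typing import Iterator

import pandas as pd
from pyspark.sql.functions import pandas_udf

# New-style Series-to-Series Pandas UDF: the type hints replace the old
# PandasUDFType argument.
@pandas_udf("long")
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1

df = spark.range(5)
df.select(plus_one(df["id"]).alias("id_plus_one")).show()

# New Pandas function API: mapInPandas streams the DataFrame through a
# function that consumes and yields pandas DataFrames.
def keep_even(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for pdf in batches:
        yield pdf[pdf["id"] % 2 == 0]

df.mapInPandas(keep_even, schema="id long").show()
```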