Duration: 4 days
Labs: Minimum 50% hands-on labs
Prerequisites: Reasonable programming experience. An overview of Scala is provided for those who don't know it.
Supported Platforms: Spark 2.1+
Knowledge and Skills Gained:
- Understand the need for Spark in data processing
- Understand the Spark architecture and how it distributes computations to cluster nodes
- Be familiar with basic installation / setup / layout of Spark
- Use the Spark shell for interactive and ad-hoc operations
- Understand RDDs (Resilient Distributed Datasets), and data partitioning, pipelining, and computations
- Understand and use RDD operations such as map() and filter()
- Understand and use Spark SQL and the DataFrame/Dataset API
- Understand DataFrame/Dataset capabilities, including the Catalyst query optimizer and Tungsten memory/CPU optimizations
- Be familiar with common performance issues, and use the DataFrame/Dataset API and Spark SQL for efficient computations
- Understand Spark's data caching and use it to avoid recomputing and re-reading data
- Write/run standalone Spark programs with the Spark API
- Use Spark Streaming / Structured Streaming to process streaming (real-time) data
- Ingest streaming data from Kafka, and process via Spark Structured Streaming
- Understand performance implications and optimizations when using Spark
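The RDD operations named above, map() and filter(), share the semantics of Scala's standard collection transformations, which the optional Scala ramp-up reviews. A minimal pure-Scala sketch of those semantics (no Spark cluster needed; on a real RDD the same calls run lazily and are distributed across partitions):

```scala
// Pure-Scala sketch of the transformation semantics that Spark's
// RDD API mirrors: map() applies a function to every element,
// filter() keeps only the elements matching a predicate.
object RddSemanticsSketch {
  def main(args: Array[String]): Unit = {
    val data = Seq(1, 2, 3, 4, 5)

    // Equivalent in shape to rdd.map(_ * 2).filter(_ > 4) on a Spark RDD.
    val result = data.map(_ * 2).filter(_ > 4)

    println(result.mkString(","))  // prints 6,8,10
  }
}
```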
Session 1 (Optional): Scala Ramp Up
Session 2: Introduction to Spark
Session 3: RDDs and Spark Architecture
Session 4: Spark SQL, DataFrames, and DataSets
Session 5: Shuffling Transformations and Performance
Session 6: Performance Tuning
Session 7: Creating Standalone Applications
Session 8: Spark Streaming
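The Kafka ingestion objective covered in the Spark Streaming session can be sketched as below. The broker address (localhost:9092) and topic name ("events") are hypothetical, and the snippet assumes a Spark 2.1+ installation with the spark-sql-kafka-0-10 package on the classpath and a running Kafka broker, so treat it as a sketch of the API shape rather than a standalone runnable example:

```scala
import org.apache.spark.sql.SparkSession

object KafkaStreamSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("KafkaStreamSketch")
      .getOrCreate()
    import spark.implicits._

    // Read the topic as an unbounded DataFrame; Kafka delivers key/value
    // as binary, so cast value to STRING before working with it.
    // "localhost:9092" and "events" are placeholder settings.
    val lines = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(value AS STRING) AS value")
      .as[String]

    // A simple streaming word count over the incoming messages.
    val counts = lines
      .flatMap(_.split("\\s+"))
      .groupBy("value")
      .count()

    // Write running counts to the console sink; "complete" mode emits
    // the full aggregation result on each trigger.
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```

The console sink used here is a debugging convenience; production jobs would typically write to a durable sink with checkpointing enabled.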