The Data Engineering Bootcamp: Zero to Mastery
Section 00 - Introduction
The Data Engineering Bootcamp: Zero to Mastery (1:34)
Exercise: Meet Your Classmates and Instructor
Course Resources
ZTM Plugin + Understanding Your Video Player
Set Your Learning Streak Goal
Section 01 - Introduction to Data Engineering
Introduction to Data Engineering (4:16)
Who Are Data Engineers? (4:42)
Prerequisites (3:18)
Source Code for This Bootcamp (1:18)
Plan for This Bootcamp (4:37)
[Optional] What Is a Virtualenv? (6:36)
[Optional] What Is Docker? (11:02)
Section 02 - Big Data Processing with Apache Spark: Process & Analyze Real-World Airbnb Data
Introduction (4:08)
Apache Spark (3:43)
How Spark Works (4:23)
Spark Application (7:40)
DataFrames (6:42)
Installing Spark (5:50)
Inside Airbnb Data (7:01)
Writing Your First Spark Job (7:04)
Lazy Processing (2:16)
[Exercise] Basic Functions (1:28)
[Exercise] Basic Functions - Solution (6:41)
Aggregating Data (3:59)
Joining Data (4:39)
Aggregations and Joins with Spark (6:09)
Complex Data Types (5:08)
[Exercise] Aggregate Functions (0:49)
[Exercise] Aggregate Functions - Solution (5:53)
User-Defined Functions (3:25)
Data Shuffle (6:13)
Data Accumulators (3:41)
Optimizing Spark Jobs (7:38)
Submitting Spark Jobs (4:28)
Other Spark APIs (5:15)
Spark SQL (4:32)
[Exercise] Advanced Spark (2:10)
[Exercise] Advanced Spark - Solution (5:25)
Summary (3:07)
Let's Have Some Fun (+ More Resources)
Section 03 - Creating a Data Lake with AWS
Introduction (4:25)
What Is a Data Lake? (9:08)
Amazon Web Services (AWS) (7:46)
Simple Storage Service (S3) (5:44)
Setting Up an AWS Account (9:29)
Data Partitioning (3:23)
Using S3 (7:48)
EMR Serverless (2:58)
IAM Roles (2:51)
Running a Spark Job (8:48)
Parquet Data Format (7:41)
Implementing a Data Catalog (5:31)
Data Catalog Demo (6:41)
Querying a Data Lake (3:59)
Summary (3:38)
Unlimited Updates
Section 04 - Implementing Data Pipelines with Apache Airflow
Introduction (5:52)
What Is Apache Airflow? (5:18)
Airflow’s Architecture (3:14)
Installing Airflow (6:32)
Defining an Airflow DAG (8:02)
Error Handling (3:37)
Idempotent Tasks (4:53)
Creating a DAG - Part 1 (4:58)
Creating a DAG - Part 2 (4:41)
Handling Failed Tasks (4:08)
[Exercise] Data Validation (4:30)
[Exercise] Data Validation - Solution (3:26)
Spark with Airflow (3:01)
Using Spark with Airflow - Part 1 (7:38)
Using Spark with Airflow - Part 2 (5:51)
Sensors in Airflow (4:45)
Using File Sensors (4:07)
Data Ingestion (5:49)
Reading Data from Postgres - Part 1 (6:02)
Reading Data from Postgres - Part 2 (5:39)
[Exercise] Average Customer Review (3:52)
[Exercise] Average Customer Review - Solution (4:32)
Advanced DAGs (4:25)
Summary (2:26)
Course Check-In
Section 05 - Machine Learning with Spark ML: Create a Data Pipeline, Train a Model + more
Introduction (5:27)
What Is Machine Learning? (6:05)
Regression Algorithms (5:37)
Building a Regression Model (5:03)
Training a Model (9:45)
Model Evaluation (7:25)
Testing a Regression Model (3:56)
Model Lifecycle (2:11)
Feature Engineering (8:43)
Improving a Regression Model (7:33)
Machine Learning Pipelines (3:55)
Creating a Pipeline (2:40)
[Exercise] House Price Estimation (1:58)
[Exercise] House Price Estimation - Solution (3:12)
[Exercise] Imposter Syndrome (2:55)
Classification (7:36)
Classifier Evaluation (4:26)
Training a Classifier (8:30)
Hyperparameters (8:05)
Optimizing a Model (3:01)
[Exercise] Loan Approval (2:33)
[Exercise] Loan Approval - Solution (2:32)
Deep Learning (6:55)
Summary (3:23)
Implement a New Life System
Section 06 - Using AI with Data Engineering: LLMs, HuggingFace + more
Introduction (5:06)
Natural Language Processing (NLP) before LLMs (6:10)
Transformers (6:20)
Types of LLMs (7:39)
Hugging Face (2:18)
Databricks Setup (10:37)
Using an LLM (7:35)
Structured Output (3:41)
Producing JSON Output (5:09)
LLMs with Apache Spark (5:19)
Summary (2:47)
Section 07 - Real-Time Data Processing ("Stream Processing") with Apache Kafka
Introduction (6:05)
What Is Apache Kafka? (6:59)
Partitioning Data (8:55)
Kafka API (7:41)
Kafka Architecture (3:14)
Set Up Kafka (5:52)
Writing to Kafka (6:06)
Reading from Kafka (7:36)
Data Durability (6:38)
Kafka vs Queues (2:10)
[Exercise] Processing Records (3:43)
[Exercise] Processing Records - Solution (2:58)
Delivery Semantics (5:52)
Kafka Transactions (4:33)
Log Compaction (3:22)
Kafka Connect (6:58)
Using Kafka Connect (9:44)
Outbox Pattern (4:30)
Schema Registry (8:00)
Using Schema Registry (8:09)
Tiered Storage (3:27)
[Exercise] Track Order Status Changes (4:26)
[Exercise] Track Order Status Changes - Solution (5:05)
Summary (4:40)
Section 08 - Stream Processing with Apache Flink
Introduction (5:40)
What Is Apache Flink? (5:23)
Kafka Application (8:10)
Multiple Streams (3:10)
Installing Apache Flink (5:45)
Processing Individual Records (7:21)
[Exercise] Stream Processing (4:01)
[Exercise] Stream Processing - Solution (2:39)
Time Windows (6:48)
Keyed Windows (2:39)
Using Time Windows (5:17)
Watermarks (10:05)
Advanced Window Operations (6:16)
Stateful Stream Processing (7:49)
Using Local State (4:41)
[Exercise] Anomaly Detection (4:34)
[Exercise] Anomaly Detection - Solution (3:33)
Joining Streams (5:49)
Summary (3:09)
Where To Go From Here?
Thank You! (1:17)
Review This Course!
Become An Alumni
Learning Guideline
ZTM Events Every Month
LinkedIn Endorsements