Batch Processing with Apache Spark


Week 6 of the Data Engineering Zoomcamp by @DataTalksClub is complete!

Just finished Module 6 - Batch Processing with Spark. Learned how to:

✅ Set up PySpark and create Spark sessions

✅ Read and process Parquet files at scale

✅ Repartition data for optimal performance

✅ Analyze millions of taxi trips with DataFrames

✅ Use Spark UI for monitoring jobs

Processing 4M+ taxi trips with Spark - distributed computing is powerful

Here's my homework solution: https://github.com/Derrick-Ryan-Giggs/pyspark-homework

Following along with this amazing free course - who else is learning data engineering?

You can sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/
