
Module 7: Big Data & Tools

Scale your data science skills to handle massive datasets with modern big data tools

🌐 What is Big Data?

Imagine trying to analyze every tweet ever posted, or Netflix's viewing history for 200 million users! Traditional tools can't handle this scale. Big Data is about processing datasets so large that they don't fit on a single computer - we're talking terabytes to petabytes!

The 3 Vs of Big Data

Volume (Size)

Massive amounts of data - terabytes, petabytes, or more. Think Netflix's 200+ million users watching billions of hours of content!

Velocity (Speed)

Data arrives fast and needs processing in real-time. Twitter processes 500 million tweets per day - that's nearly 6,000 tweets per second!

Variety (Types)

Different data formats - structured (databases), semi-structured (JSON, XML), unstructured (text, images, videos).
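The three varieties are easy to see side by side with Python's standard library. The sample records below are made up for illustration:

```python
import csv
import io
import json

# Structured: rows with a fixed schema, like a database table
structured = io.StringIO("id,name,age\n1,Ada,36\n2,Grace,41\n")
rows = list(csv.DictReader(structured))

# Semi-structured: self-describing and nested; fields can vary per record
semi = json.loads('{"user": "ada", "tags": ["ml", "spark"], "profile": {"city": "London"}}')

# Unstructured: free text with no schema at all - you must impose structure on it
unstructured = "Loved the new dashboard! Checkout was slow though."

print(rows[0]["name"])            # structured: address fields by column
print(semi["profile"]["city"])    # semi-structured: address fields by (nested) key
print(len(unstructured.split()))  # unstructured: even "word count" is a choice you make
```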

Why Big Data Tools?

Distributed Processing

Split work across hundreds of computers

Fault Tolerance

Automatically handle hardware failures

Scalability

Add more machines to handle more data

Cost Effective

Use commodity hardware instead of expensive servers
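The core idea behind distributed processing - split the data into chunks, process each chunk independently, then merge the partial results - can be sketched on one machine. In this toy word count, threads stand in for cluster nodes:

```python
from concurrent.futures import ThreadPoolExecutor

# A word-count "job" in the map/reduce style that Hadoop and Spark popularized.
# Each chunk is counted independently (as it would be on a separate machine),
# then the partial counts are merged into a final result.

def count_words(chunk):
    counts = {}
    for word in chunk.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def merge(partials):
    total = {}
    for part in partials:
        for word, n in part.items():
            total[word] = total.get(word, 0) + n
    return total

chunks = [
    "spark spark hadoop",   # imagine this chunk lives on node 1
    "hadoop spark kafka",   # ...this one on node 2
    "kafka kafka spark",    # ...this one on node 3
]

with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(count_words, chunks))  # "map" phase, in parallel

totals = merge(partials)  # "reduce" phase
print(totals)  # {'spark': 4, 'hadoop': 2, 'kafka': 3}
```

Real frameworks add the hard parts on top of this picture: shipping chunks to machines, retrying failed workers, and shuffling data between phases.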


⚡ Apache Spark & PySpark

Apache Spark is the Ferrari of big data processing - fast, powerful, and elegant! By keeping data in memory, it can run some workloads up to 100x faster than Hadoop MapReduce. Used by Netflix, Uber, Airbnb, and thousands of other companies.

PySpark Basics

# Install PySpark:
# pip install pyspark

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("MyApp") \
    .getOrCreate()

# Read data
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.show(5)
df.printSchema()

# DataFrame operations
df.select("name", "age").show()
df.filter(df["age"] > 30).show()
df.groupBy("department").count().show()

# SQL queries
df.createOrReplaceTempView("employees")
result = spark.sql("SELECT * FROM employees WHERE age > 30")
result.show()

🚀 Spark Performance Tips:

  • Use the Parquet format for storage (columnar, compressed)
  • Cache DataFrames you'll reuse: df.cache()
  • Avoid collect() on large datasets - it pulls everything onto the driver
  • Use partitioning for large datasets

🏢 Data Warehousing Concepts

A data warehouse is like a massive, organized library for your company's data! It stores historical data from multiple sources, optimized for analysis and reporting. Think of it as the single source of truth for business intelligence.

Fact Tables vs Dimension Tables

Fact Tables

Store measurable events (transactions, sales, clicks). Contains metrics and foreign keys to dimensions.

Example: Sales Fact

  • order_id (PK)
  • customer_id (FK)
  • product_id (FK)
  • date_id (FK)
  • quantity (measure)
  • revenue (measure)

Dimension Tables

Store descriptive attributes (who, what, where, when). Provide context for facts.

Example: Customer Dimension

  • customer_id (PK)
  • name
  • email
  • city
  • country
  • segment

Star Schema

The most common data warehouse design! One central fact table surrounded by dimension tables, forming a star shape. Simple, fast queries, easy to understand.

Example Structure:

• Center: Sales_Fact (order_id, customer_id, product_id, date_id, amount)

• Points: Customer_Dim, Product_Dim, Date_Dim, Store_Dim
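The star schema above can be built and queried in miniature with SQLite from Python's standard library. Table and column names follow the examples above; the data is made up:

```python
import sqlite3

# A tiny star schema: one fact table surrounded by two dimension tables
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer_dim (customer_id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    CREATE TABLE product_dim  (product_id  INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE sales_fact (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer_dim(customer_id),
        product_id  INTEGER REFERENCES product_dim(product_id),
        quantity    INTEGER,
        revenue     REAL
    );
    INSERT INTO customer_dim VALUES (1, 'Ada', 'UK'), (2, 'Grace', 'US');
    INSERT INTO product_dim  VALUES (10, 'Laptop', 'Electronics'), (11, 'Desk', 'Furniture');
    INSERT INTO sales_fact   VALUES (100, 1, 10, 1, 1200.0),
                                    (101, 2, 10, 2, 2400.0),
                                    (102, 1, 11, 1, 300.0);
""")

# The classic star-schema query: join the fact to its dimensions, then
# aggregate the measures by descriptive attributes
rows = con.execute("""
    SELECT c.country, p.category, SUM(f.revenue) AS revenue
    FROM sales_fact f
    JOIN customer_dim c ON c.customer_id = f.customer_id
    JOIN product_dim  p ON p.product_id  = f.product_id
    GROUP BY c.country, p.category
    ORDER BY revenue DESC
""").fetchall()
print(rows)  # [('US', 'Electronics', 2400.0), ('UK', 'Electronics', 1200.0), ('UK', 'Furniture', 300.0)]
```

Notice the query shape: every analytical question becomes "join the fact to some dimensions, group by their attributes, sum the measures" - which is why star schemas are so easy to work with.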

🔄 ETL Pipelines

ETL (Extract, Transform, Load) is the process of moving data from source systems to a data warehouse. Think of it as a data assembly line - raw materials (source data) go in, finished products (clean, structured data) come out!

The ETL Process

1. Extract

Pull data from various sources (databases, APIs, files, logs). Handle different formats and connection types.

2. Transform

Clean, validate, and reshape data. Remove duplicates, handle nulls, apply business rules, aggregate, join tables.

3. Load

Write transformed data to target system (data warehouse, data lake). Can be full load or incremental.
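The three steps can be sketched as plain Python functions. Here an in-memory CSV stands in for the source system and a list stands in for the warehouse - both are illustrative stand-ins:

```python
import csv
import io

# Raw source data: note the duplicate order 2 and its missing amount
RAW = "order_id,amount\n1,100\n2,\n2,\n3,250\n"

def extract(source):
    """Extract: pull raw records out of the source system."""
    return list(csv.DictReader(io.StringIO(source)))

def transform(records):
    """Transform: deduplicate, drop invalid rows, apply types."""
    seen, clean = set(), []
    for r in records:
        if r["order_id"] in seen or not r["amount"]:
            continue  # skip duplicates and rows missing the measure
        seen.add(r["order_id"])
        clean.append({"order_id": int(r["order_id"]), "amount": float(r["amount"])})
    return clean

def load(records, warehouse):
    """Load: write the cleaned records to the target system."""
    warehouse.extend(records)
    return len(records)

warehouse = []
loaded = load(transform(extract(RAW)), warehouse)
print(loaded, warehouse)
```

Real pipelines swap the stand-ins for database connections, API clients, and warehouse writers, but the extract → transform → load shape stays the same.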

Apache Airflow for Orchestration

# Simple Airflow DAG (Directed Acyclic Graph)
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

# Define task functions
def extract_data():
    print("Extracting data from source...")

def transform_data():
    print("Transforming data...")

def load_data():
    print("Loading data to warehouse...")

# Create the DAG
dag = DAG(
    'etl_pipeline',
    start_date=datetime(2024, 1, 1),
    schedule_interval='@daily'
)

# Create tasks
extract = PythonOperator(task_id='extract', python_callable=extract_data, dag=dag)
transform = PythonOperator(task_id='transform', python_callable=transform_data, dag=dag)
load = PythonOperator(task_id='load', python_callable=load_data, dag=dag)

# Define task dependencies (execution order)
extract >> transform >> load

💡 ETL Best Practices:

  • Make pipelines idempotent (can run multiple times safely)
  • Log everything for debugging
  • Handle failures gracefully with retries
  • Monitor pipeline health and data quality
  • Use incremental loads when possible (not full refreshes)
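The first and last tips combine nicely: upsert rows by key so re-running a batch changes nothing, and track a watermark so each run only ships new rows. A minimal sketch, with all names illustrative:

```python
# An idempotent, incremental load: rows are upserted by key, so re-running the
# same batch leaves the warehouse unchanged, and a watermark ensures each run
# only processes rows newer than what was already loaded.

warehouse = {}   # key -> row, standing in for the target table
watermark = 0    # highest updated_at value already loaded

def incremental_load(batch):
    global watermark
    new_rows = [r for r in batch if r["updated_at"] > watermark]
    for r in new_rows:
        warehouse[r["id"]] = r  # upsert: insert or overwrite by key
    if new_rows:
        watermark = max(r["updated_at"] for r in new_rows)
    return len(new_rows)

batch = [
    {"id": 1, "updated_at": 10, "status": "paid"},
    {"id": 2, "updated_at": 12, "status": "shipped"},
]

first = incremental_load(batch)   # loads 2 rows, watermark moves to 12
retry = incremental_load(batch)   # "retry after a failure": loads 0 rows
print(first, retry, len(warehouse))  # 2 0 2
```

Because the retry is a no-op, a scheduler like Airflow can safely re-run a failed task without creating duplicate rows downstream.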

☁️ Cloud Data Platforms

The cloud has revolutionized big data! Instead of buying expensive hardware, you rent computing power and storage on-demand. Scale up for Black Friday, scale down on slow days. Pay only for what you use!

AWS (Amazon Web Services)

The market leader with the most comprehensive set of services. If you can imagine it, AWS probably has it!

• S3 (Simple Storage Service)

Object storage for data lakes. Store petabytes of data cheaply. Think of it as infinite hard drive space!

• Redshift

Data warehouse for analytics. Fast SQL queries on petabyte-scale data. Columnar storage, massively parallel.

• EMR (Elastic MapReduce)

Managed Hadoop and Spark clusters. Spin up 100 nodes, run your job, shut down. Pay by the hour!

• Glue

Serverless ETL service. Automatically discovers schema, generates ETL code, runs jobs.

GCP (Google Cloud Platform)

• BigQuery

Serverless data warehouse. Query terabytes in seconds! No infrastructure management. Pay per query.

• Dataflow

Managed Apache Beam for stream and batch processing. Auto-scaling, no ops required.

• Cloud Storage

Object storage like S3. Multiple storage classes for different access patterns.

Azure (Microsoft)

• Synapse Analytics

Unified analytics platform. Combines data warehousing, big data, and data integration.

• Data Factory

Cloud ETL service. Visual interface for building data pipelines. 90+ connectors.

• Blob Storage

Object storage for unstructured data. Hot, cool, and archive tiers for cost optimization.

🎯 Which Cloud to Learn?

Start with AWS (most popular) or GCP (best for data/ML). The concepts transfer! Once you know one cloud, learning others is easy. Focus on understanding distributed computing concepts, not memorizing service names.

⚡ Real-Time Data Processing

Batch processing is like doing laundry once a week. Stream processing is like having a washing machine that cleans clothes as soon as they're dirty! Real-time processing analyzes data as it arrives - crucial for fraud detection, recommendations, monitoring.

Batch vs Stream Processing

Batch Processing

• Process large volumes at once

• Run on schedule (hourly, daily)

• Higher latency, higher throughput

• Example: Daily sales reports

Stream Processing

• Process events as they arrive

• Continuous, real-time

• Low latency, lower throughput

• Example: Fraud detection
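The contrast shows up clearly in code. Below, the same made-up transaction amounts are checked against an illustrative fraud rule (amount > 1000) both ways:

```python
# Batch vs stream processing of the same events, in plain Python

events = [120, 80, 1500, 60, 2200, 45]  # transaction amounts, made up

# Batch: wait until all the data has arrived, then process it in one pass.
# Simple and high-throughput, but an alert can lag hours behind the event.
def batch_fraud_check(all_events):
    return [e for e in all_events if e > 1000]

# Stream: inspect each event the moment it arrives. The generator yields an
# alert immediately, without waiting for the rest of the data.
def stream_fraud_check(event_source):
    for e in event_source:
        if e > 1000:
            yield e

print(batch_fraud_check(events))               # [1500, 2200] - after the fact
print(list(stream_fraud_check(iter(events))))  # [1500, 2200] - one at a time
```

Both find the same suspicious transactions; the difference is *when* - which is exactly why fraud detection runs as a stream and daily reports run as a batch.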

🔧 Popular Streaming Tools:

  • Apache Kafka: Distributed event streaming platform (industry standard)
  • Apache Flink: Stream processing framework with exactly-once semantics
  • Spark Streaming: Micro-batch processing on top of Spark
  • AWS Kinesis: Managed streaming service on AWS

🎯 Complete Project: Big Data Pipeline

Let's build a complete big data pipeline! We'll process large-scale e-commerce data using PySpark, store it in a data warehouse structure, and create analytics-ready datasets.

# Step 1: Initialize Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, avg, count, row_number
from pyspark.sql.window import Window

spark = SparkSession.builder \
    .appName("EcommercePipeline") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()

# Step 2: Extract - read from multiple sources
orders_df = spark.read.parquet("s3://data-lake/orders/")
customers_df = spark.read.json("s3://data-lake/customers/")
products_df = spark.read.csv("s3://data-lake/products/", header=True, inferSchema=True)

# Step 3: Transform - clean and enrich the data
# Remove duplicates
orders_clean = orders_df.dropDuplicates(["order_id"])

# Handle nulls
orders_clean = orders_clean.fillna({"discount": 0, "notes": ""})

# Add calculated columns
orders_enriched = orders_clean.withColumn(
    "total_with_tax",
    col("total_amount") * 1.1
)

# Step 4: Join tables (create the fact table)
fact_sales = orders_enriched \
    .join(customers_df, "customer_id", "left") \
    .join(products_df, "product_id", "left") \
    .select(
        "order_id",
        "customer_id",
        "product_id",
        "order_date",
        "quantity",
        "total_amount",
        "total_with_tax"
    )

# Step 5: Aggregate analytics
customer_metrics = fact_sales.groupBy("customer_id") \
    .agg(
        count("order_id").alias("order_count"),
        sum("total_amount").alias("lifetime_value"),
        avg("total_amount").alias("avg_order_value")
    )

# Step 6: Window functions for rankings (top 10 products per category by revenue)
window_spec = Window.partitionBy("category").orderBy(col("revenue").desc())
product_rankings = products_df \
    .withColumn("rank", row_number().over(window_spec)) \
    .filter(col("rank") <= 10)

# Step 7: Load - write to the data warehouse
# Partition by date for efficient queries
fact_sales.write \
    .mode("overwrite") \
    .partitionBy("order_date") \
    .parquet("s3://warehouse/fact_sales/")

customer_metrics.write \
    .mode("overwrite") \
    .parquet("s3://warehouse/customer_metrics/")

# Step 8: Data quality checks
print(f"Total orders processed: {fact_sales.count()}")
print(f"Orders with null customer_id: {fact_sales.filter(col('customer_id').isNull()).count()}")

# Stop the Spark session
spark.stop()

🎓 What This Project Demonstrates:

  • Reading from multiple data sources (Parquet, JSON, CSV)
  • Data cleaning (deduplication, null handling)
  • Complex transformations and joins
  • Aggregations and window functions
  • Partitioned writes for performance
  • Data quality validation
  • Production-ready big data pipeline


🎯 Congratulations!

You've completed the Data Science learning path! You now have the skills to work with data at any scale - from Pandas on your laptop to Spark on cloud clusters processing petabytes. You've learned Python, statistics, machine learning, SQL, and big data tools. You're ready to tackle real-world data science challenges!

What You've Mastered:

Module 1: Python for Data Science

NumPy, Pandas, data cleaning

Module 2: Statistics & Probability

Hypothesis testing, distributions, A/B testing

Module 3: Data Visualization

Matplotlib, Seaborn, storytelling with data

Module 4: Machine Learning

Supervised/unsupervised learning, model evaluation

Module 5: Advanced ML

XGBoost, feature engineering, hyperparameter tuning

Module 6: SQL & Databases

JOINs, aggregations, window functions

Module 7: Big Data & Tools

Spark, data warehousing, ETL, cloud platforms