Back to Data Science

Module 6: SQL & Databases

Master SQL to query, analyze, and extract insights from databases like a pro

🗄️ What is SQL?

Imagine you have a massive library with millions of books. SQL (Structured Query Language) is like having a super-smart librarian who can instantly find exactly what you need! SQL is the universal language for talking to databases - used by data scientists, analysts, and engineers worldwide.

Simple Definition

SQL is a programming language designed for managing and querying data stored in relational databases. It lets you ask questions like "Show me all customers who spent > $1000 last month" and get instant answers from millions of records!

Example: Find top customers

SELECT customer_name, SUM(amount) as total

FROM orders

WHERE order_date >= '2024-01-01'

GROUP BY customer_name

ORDER BY total DESC

LIMIT 10;

Why SQL for Data Science?

Universal Standard

Works with MySQL, PostgreSQL, SQL Server, Oracle, and more

Handles Big Data

Query billions of rows efficiently

Essential Skill

Required for 90%+ of data science jobs

Easy to Learn

Reads like English, powerful yet simple

📚 Learn More:

📝 SQL Fundamentals

Let's start with the basics! These four commands (SELECT, WHERE, ORDER BY, LIMIT) will handle 80% of your daily SQL needs. Master these first!

SELECT - Retrieve Data

SELECT is like saying "Show me..." - it retrieves data from tables.

-- Select all columns

SELECT * FROM customers;

-- Select specific columns

SELECT first_name, last_name, email

FROM customers;

-- Select with calculations

SELECT

product_name,

price,

price * 0.9 AS discounted_price

FROM products;

-- Select distinct values (no duplicates)

SELECT DISTINCT city

FROM customers;

WHERE - Filter Data

-- Filter by condition

SELECT * FROM orders

WHERE total_amount > 1000;

-- Multiple conditions (AND, OR)

SELECT * FROM products

WHERE category = 'Electronics'

AND price < 500

AND in_stock = TRUE;

-- Pattern matching with LIKE

SELECT * FROM customers

WHERE email LIKE '%@gmail.com';

-- % means "any characters"

-- Check for NULL values

SELECT * FROM customers

WHERE phone_number IS NULL;

-- IN operator (multiple values)

SELECT * FROM orders

WHERE status IN ('pending', 'processing', 'shipped');

-- BETWEEN operator (range)

SELECT * FROM orders

WHERE order_date BETWEEN '2024-01-01' AND '2024-12-31';

ORDER BY - Sort Results

-- Sort ascending (default)

SELECT * FROM products

ORDER BY price;

-- Sort descending

SELECT * FROM products

ORDER BY price DESC;

-- Sort by multiple columns

SELECT * FROM customers

ORDER BY city, last_name;

LIMIT - Restrict Results

-- Get top 10 results

SELECT * FROM products

ORDER BY sales DESC

LIMIT 10;

-- Skip first 20, then get 10 (pagination)

SELECT * FROM customers

LIMIT 10 OFFSET 20;

💡 SQL Writing Tips:

  • • SQL keywords are case-insensitive (SELECT = select), but UPPERCASE is convention
  • • Always end statements with semicolon (;)
  • • Use comments: -- for single line, /* */ for multiple lines
  • • Format for readability: one clause per line

🔗 JOINs - Combining Tables

Real data is spread across multiple tables! JOINs let you combine them. Think of it like matching puzzle pieces - you connect tables using common columns (like customer_id).

INNER JOIN

INNER JOIN returns only rows that have matches in BOTH tables. Like finding people who are in both your Facebook AND Instagram friends list.

-- Join customers with their orders

SELECT

c.customer_name,

c.email,

o.order_id,

o.total_amount

FROM customers c

INNER JOIN orders o

ON c.customer_id = o.customer_id;

LEFT JOIN

-- Get ALL customers, even those without orders

SELECT

c.customer_name,

COUNT(o.order_id) AS order_count

FROM customers c

LEFT JOIN orders o

ON c.customer_id = o.customer_id

GROUP BY c.customer_name;

-- Customers with 0 orders will show count = 0

RIGHT JOIN & FULL OUTER JOIN

-- RIGHT JOIN: All rows from right table

SELECT * FROM orders o

RIGHT JOIN customers c

ON o.customer_id = c.customer_id;

-- FULL OUTER JOIN: All rows from both tables

SELECT * FROM customers c

FULL OUTER JOIN orders o

ON c.customer_id = o.customer_id;

🎯 JOIN Cheat Sheet:

  • INNER JOIN: Only matching rows from both tables
  • LEFT JOIN: All from left table + matches from right
  • RIGHT JOIN: All from right table + matches from left
  • FULL OUTER JOIN: All rows from both tables

📊 Aggregations & GROUP BY

Aggregations let you summarize data - calculate totals, averages, counts, etc. Combined with GROUP BY, you can analyze data by categories!

Aggregate Functions

-- COUNT: Count rows

SELECT COUNT(*) AS total_customers

FROM customers;

-- SUM: Add up values

SELECT SUM(total_amount) AS total_revenue

FROM orders;

-- AVG: Calculate average

SELECT AVG(price) AS avg_price

FROM products;

-- MIN and MAX

SELECT

MIN(price) AS cheapest,

MAX(price) AS most_expensive

FROM products;

GROUP BY

-- Sales by category

SELECT

category,

COUNT(*) AS product_count,

AVG(price) AS avg_price

FROM products

GROUP BY category;

-- Monthly revenue

SELECT

DATE_TRUNC('month', order_date) AS month,

SUM(total_amount) AS revenue

FROM orders

GROUP BY month

ORDER BY month;

-- Group by multiple columns

SELECT

city,

category,

COUNT(*) AS sales_count

FROM sales

GROUP BY city, category;

HAVING - Filter Groups

-- Find categories with > 10 products

SELECT

category,

COUNT(*) AS product_count

FROM products

GROUP BY category

HAVING COUNT(*) > 10;

-- HAVING filters groups, WHERE filters rows

-- Customers who spent > $5000

SELECT

customer_id,

SUM(total_amount) AS total_spent

FROM orders

GROUP BY customer_id

HAVING SUM(total_amount) > 5000

ORDER BY total_spent DESC;

💡 WHERE vs HAVING:

WHERE filters individual rows BEFORE grouping. HAVING filters groups AFTER aggregation. Use WHERE for row-level filters, HAVING for aggregate filters!

🪟 Window Functions

Window functions are like having X-ray vision for your data! They let you perform calculations across rows while keeping all rows (unlike GROUP BY which collapses rows). Perfect for rankings, running totals, and comparisons.

ROW_NUMBER, RANK, DENSE_RANK

-- Rank products by price

SELECT

product_name,

price,

ROW_NUMBER() OVER (ORDER BY price DESC) AS row_num,

RANK() OVER (ORDER BY price DESC) AS rank,

DENSE_RANK() OVER (ORDER BY price DESC) AS dense_rank

FROM products;

-- Top 3 products per category

WITH ranked AS (

SELECT

category,

product_name,

sales,

ROW_NUMBER() OVER (

PARTITION BY category

ORDER BY sales DESC

) AS rank

FROM products

)

SELECT * FROM ranked

WHERE rank <= 3;

LAG and LEAD

-- Compare with previous month

SELECT

month,

revenue,

LAG(revenue) OVER (ORDER BY month) AS prev_month,

revenue - LAG(revenue) OVER (ORDER BY month) AS growth

FROM monthly_sales;

Running Totals

-- Calculate cumulative revenue

SELECT

order_date,

total_amount,

SUM(total_amount) OVER (

ORDER BY order_date

ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

) AS running_total

FROM orders;

🚀 Window Function Power:

Window functions are incredibly powerful for analytics! Use them for rankings, moving averages, year-over-year comparisons, and more - all without losing row-level detail.

🎯 Subqueries & CTEs

Sometimes you need to query the results of another query! Subqueries and CTEs (Common Table Expressions) let you break complex problems into manageable steps.

Subqueries

-- Find products above average price

SELECT product_name, price

FROM products

WHERE price > (

SELECT AVG(price) FROM products

);

-- Customers who made orders in 2024

SELECT *

FROM customers

WHERE customer_id IN (

SELECT DISTINCT customer_id

FROM orders

WHERE EXTRACT(YEAR FROM order_date) = 2024

);

CTEs (WITH clause)

-- CTEs make complex queries readable

WITH high_value_customers AS (

SELECT

customer_id,

SUM(total_amount) AS total_spent

FROM orders

GROUP BY customer_id

HAVING SUM(total_amount) > 10000

)

SELECT

c.customer_name,

c.email,

h.total_spent

FROM customers c

JOIN high_value_customers h

ON c.customer_id = h.customer_id;

-- Multiple CTEs

WITH

sales_2023 AS (

SELECT SUM(amount) AS total FROM orders WHERE EXTRACT(YEAR FROM order_date) = 2023

),

sales_2024 AS (

SELECT SUM(amount) AS total FROM orders WHERE EXTRACT(YEAR FROM order_date) = 2024

)

SELECT

s23.total AS sales_2023,

s24.total AS sales_2024,

s24.total - s23.total AS growth

FROM sales_2023 s23, sales_2024 s24;

💡 CTEs vs Subqueries:

CTEs are more readable and can be referenced multiple times. Use CTEs for complex queries, subqueries for simple one-off filters. Modern SQL developers prefer CTEs!

🏗️ Database Design & Normalization

Good database design is like organizing a library - everything has its place, no duplication, easy to find! Normalization is the process of organizing data to reduce redundancy.

Normal Forms (Simplified)

1st Normal Form (1NF)

• Each column contains atomic (indivisible) values

• No repeating groups or arrays

2nd Normal Form (2NF)

• Must be in 1NF

• All non-key columns depend on the entire primary key

3rd Normal Form (3NF)

• Must be in 2NF

• No transitive dependencies (non-key columns don't depend on other non-key columns)

🎯 Design Principles:

  • • Use meaningful table and column names
  • • Every table should have a primary key
  • • Use foreign keys to maintain referential integrity
  • • Normalize to reduce redundancy, but denormalize for performance when needed
  • • Index columns used in WHERE, JOIN, and ORDER BY clauses

🔄 PostgreSQL vs MySQL

Both are excellent open-source databases! PostgreSQL is more feature-rich and standards-compliant, while MySQL is simpler and faster for basic operations.

PostgreSQL

The "most advanced open-source database". Better for complex queries and data integrity.

Best for:

  • • Complex queries and analytics
  • • Data integrity is critical
  • • Advanced features (JSON, arrays, window functions)
  • • Data science and analytics

MySQL

Simple, fast, and widely used. Powers WordPress, Facebook, and millions of websites.

Best for:

  • • Web applications
  • • Simple read-heavy workloads
  • • When you need maximum speed
  • • Smaller projects and startups

💡 Which to Learn?

Learn PostgreSQL first - it's more standards-compliant, so your SQL will work everywhere. Once you know PostgreSQL, MySQL is easy to pick up. The core SQL syntax is 95% the same!

🎯 Complete Project: E-Commerce Sales Analysis

Let's analyze an e-commerce database! We'll use JOINs, aggregations, window functions, and CTEs to extract business insights.

-- 1. Top 10 customers by revenue

SELECT

c.customer_id,

c.customer_name,

c.email,

COUNT(o.order_id) AS order_count,

SUM(o.total_amount) AS total_revenue

FROM customers c

JOIN orders o ON c.customer_id = o.customer_id

GROUP BY c.customer_id, c.customer_name, c.email

ORDER BY total_revenue DESC

LIMIT 10;

-- 2. Monthly revenue trend with growth rate

WITH monthly_revenue AS (

SELECT

DATE_TRUNC('month', order_date) AS month,

SUM(total_amount) AS revenue

FROM orders

GROUP BY month

)

SELECT

month,

revenue,

LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue,

ROUND(

(revenue - LAG(revenue) OVER (ORDER BY month)) /

LAG(revenue) OVER (ORDER BY month) * 100, 2

) AS growth_rate_pct

FROM monthly_revenue

ORDER BY month;

-- 3. Product performance by category

SELECT

p.category,

p.product_name,

COUNT(oi.order_id) AS times_ordered,

SUM(oi.quantity) AS total_quantity,

SUM(oi.quantity * oi.price) AS total_revenue,

RANK() OVER (

PARTITION BY p.category

ORDER BY SUM(oi.quantity * oi.price) DESC

) AS rank_in_category

FROM products p

JOIN order_items oi ON p.product_id = oi.product_id

GROUP BY p.category, p.product_name

ORDER BY p.category, rank_in_category;

-- 4. Customer segmentation (RFM Analysis)

WITH customer_metrics AS (

SELECT

customer_id,

MAX(order_date) AS last_order_date,

COUNT(order_id) AS frequency,

SUM(total_amount) AS monetary

FROM orders

GROUP BY customer_id

)

SELECT

c.customer_name,

CURRENT_DATE - cm.last_order_date AS recency_days,

cm.frequency,

cm.monetary,

CASE

WHEN CURRENT_DATE - cm.last_order_date < 30 AND cm.frequency > 5 THEN 'VIP'

WHEN CURRENT_DATE - cm.last_order_date < 90 THEN 'Active'

WHEN CURRENT_DATE - cm.last_order_date < 180 THEN 'At Risk'

ELSE 'Churned'

END AS customer_segment

FROM customers c

JOIN customer_metrics cm ON c.customer_id = cm.customer_id

ORDER BY cm.monetary DESC;

-- 5. Cohort analysis - retention by signup month

WITH cohorts AS (

SELECT

customer_id,

DATE_TRUNC('month', signup_date) AS cohort_month

FROM customers

)

SELECT

c.cohort_month,

DATE_TRUNC('month', o.order_date) AS order_month,

COUNT(DISTINCT o.customer_id) AS active_customers

FROM cohorts c

JOIN orders o ON c.customer_id = o.customer_id

GROUP BY c.cohort_month, order_month

ORDER BY c.cohort_month, order_month;

🎓 What This Project Demonstrates:

  • • Complex JOINs across multiple tables
  • • Aggregations with GROUP BY and HAVING
  • • Window functions for rankings and comparisons
  • • CTEs for readable, modular queries
  • • Date functions and calculations
  • • CASE statements for conditional logic
  • • Real-world business analytics (RFM, cohort analysis)

📚 Learning Resources

Official Documentation

Practice Platforms

🎯 What's Next?

You've mastered SQL! Next, we'll explore Big Data & Tools - learn about Apache Spark, data warehousing, ETL pipelines, and cloud platforms like AWS, GCP, and Azure. Get ready to work with massive datasets!