Master SQL to query, analyze, and extract insights from databases like a pro
Imagine you have a massive library with millions of books. SQL (Structured Query Language) is like having a super-smart librarian who can instantly find exactly what you need! SQL is the universal language for talking to databases - used by data scientists, analysts, and engineers worldwide.
SQL is a programming language designed for managing and querying data stored in relational databases. It lets you ask questions like "Show me all customers who spent > $1000 last month" and get instant answers from millions of records!
Example: Find top customers
SELECT customer_name, SUM(amount) as total
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_name
ORDER BY total DESC
LIMIT 10;
Universal Standard
Works with MySQL, PostgreSQL, SQL Server, Oracle, and more
Handles Big Data
Query billions of rows efficiently
Essential Skill
Required for 90%+ of data science jobs
Easy to Learn
Reads like English, powerful yet simple
Let's start with the basics! These four commands (SELECT, WHERE, ORDER BY, LIMIT) will handle 80% of your daily SQL needs. Master these first!
SELECT is like saying "Show me..." - it retrieves data from tables.
-- Select all columns
SELECT * FROM customers;
-- Select specific columns
SELECT first_name, last_name, email
FROM customers;
-- Select with calculations
SELECT
product_name,
price,
price * 0.9 AS discounted_price
FROM products;
-- Select distinct values (no duplicates)
SELECT DISTINCT city
FROM customers;
-- Filter by condition
SELECT * FROM orders
WHERE total_amount > 1000;
-- Multiple conditions (AND, OR)
SELECT * FROM products
WHERE category = 'Electronics'
AND price < 500
AND in_stock = TRUE;
-- Pattern matching with LIKE
SELECT * FROM customers
WHERE email LIKE '%@gmail.com';
-- % means "any characters"
-- Check for NULL values
SELECT * FROM customers
WHERE phone_number IS NULL;
-- IN operator (multiple values)
SELECT * FROM orders
WHERE status IN ('pending', 'processing', 'shipped');
-- BETWEEN operator (range)
SELECT * FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-12-31';
-- Sort ascending (default)
SELECT * FROM products
ORDER BY price;
-- Sort descending
SELECT * FROM products
ORDER BY price DESC;
-- Sort by multiple columns
SELECT * FROM customers
ORDER BY city, last_name;
-- Get top 10 results
SELECT * FROM products
ORDER BY sales DESC
LIMIT 10;
-- Skip first 20, then get 10 (pagination)
SELECT * FROM customers
LIMIT 10 OFFSET 20;
Real data is spread across multiple tables! JOINs let you combine them. Think of it like matching puzzle pieces - you connect tables using common columns (like customer_id).
INNER JOIN returns only rows that have matches in BOTH tables. Like finding people who are in both your Facebook AND Instagram friends list.
-- Join customers with their orders
SELECT
c.customer_name,
c.email,
o.order_id,
o.total_amount
FROM customers c
INNER JOIN orders o
ON c.customer_id = o.customer_id;
-- Get ALL customers, even those without orders
SELECT
c.customer_name,
COUNT(o.order_id) AS order_count
FROM customers c
LEFT JOIN orders o
ON c.customer_id = o.customer_id
GROUP BY c.customer_name;
-- Customers with 0 orders will show count = 0
-- RIGHT JOIN: All rows from right table
SELECT * FROM orders o
RIGHT JOIN customers c
ON o.customer_id = c.customer_id;
-- FULL OUTER JOIN: All rows from both tables
SELECT * FROM customers c
FULL OUTER JOIN orders o
ON c.customer_id = o.customer_id;
Aggregations let you summarize data - calculate totals, averages, counts, etc. Combined with GROUP BY, you can analyze data by categories!
-- COUNT: Count rows
SELECT COUNT(*) AS total_customers
FROM customers;
-- SUM: Add up values
SELECT SUM(total_amount) AS total_revenue
FROM orders;
-- AVG: Calculate average
SELECT AVG(price) AS avg_price
FROM products;
-- MIN and MAX
SELECT
MIN(price) AS cheapest,
MAX(price) AS most_expensive
FROM products;
-- Sales by category
SELECT
category,
COUNT(*) AS product_count,
AVG(price) AS avg_price
FROM products
GROUP BY category;
-- Monthly revenue
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(total_amount) AS revenue
FROM orders
GROUP BY month
ORDER BY month;
-- Group by multiple columns
SELECT
city,
category,
COUNT(*) AS sales_count
FROM sales
GROUP BY city, category;
-- Find categories with > 10 products
SELECT
category,
COUNT(*) AS product_count
FROM products
GROUP BY category
HAVING COUNT(*) > 10;
-- HAVING filters groups, WHERE filters rows
-- Customers who spent > $5000
SELECT
customer_id,
SUM(total_amount) AS total_spent
FROM orders
GROUP BY customer_id
HAVING SUM(total_amount) > 5000
ORDER BY total_spent DESC;
WHERE filters individual rows BEFORE grouping. HAVING filters groups AFTER aggregation. Use WHERE for row-level filters, HAVING for aggregate filters!
Window functions are like having X-ray vision for your data! They let you perform calculations across rows while keeping all rows (unlike GROUP BY which collapses rows). Perfect for rankings, running totals, and comparisons.
-- Rank products by price
SELECT
product_name,
price,
ROW_NUMBER() OVER (ORDER BY price DESC) AS row_num,
RANK() OVER (ORDER BY price DESC) AS rank,
DENSE_RANK() OVER (ORDER BY price DESC) AS dense_rank
FROM products;
-- Top 3 products per category
WITH ranked AS (
SELECT
category,
product_name,
sales,
ROW_NUMBER() OVER (
PARTITION BY category
ORDER BY sales DESC
) AS rank
FROM products
)
SELECT * FROM ranked
WHERE rank <= 3;
-- Compare with previous month
SELECT
month,
revenue,
LAG(revenue) OVER (ORDER BY month) AS prev_month,
revenue - LAG(revenue) OVER (ORDER BY month) AS growth
FROM monthly_sales;
-- Calculate cumulative revenue
SELECT
order_date,
total_amount,
SUM(total_amount) OVER (
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total
FROM orders;
Window functions are incredibly powerful for analytics! Use them for rankings, moving averages, year-over-year comparisons, and more - all without losing row-level detail.
Sometimes you need to query the results of another query! Subqueries and CTEs (Common Table Expressions) let you break complex problems into manageable steps.
-- Find products above average price
SELECT product_name, price
FROM products
WHERE price > (
SELECT AVG(price) FROM products
);
-- Customers who made orders in 2024
SELECT *
FROM customers
WHERE customer_id IN (
SELECT DISTINCT customer_id
FROM orders
WHERE EXTRACT(YEAR FROM order_date) = 2024
);
-- CTEs make complex queries readable
WITH high_value_customers AS (
SELECT
customer_id,
SUM(total_amount) AS total_spent
FROM orders
GROUP BY customer_id
HAVING SUM(total_amount) > 10000
)
SELECT
c.customer_name,
c.email,
h.total_spent
FROM customers c
JOIN high_value_customers h
ON c.customer_id = h.customer_id;
-- Multiple CTEs
WITH
sales_2023 AS (
SELECT SUM(amount) AS total FROM orders WHERE EXTRACT(YEAR FROM order_date) = 2023
),
sales_2024 AS (
SELECT SUM(amount) AS total FROM orders WHERE EXTRACT(YEAR FROM order_date) = 2024
)
SELECT
s23.total AS sales_2023,
s24.total AS sales_2024,
s24.total - s23.total AS growth
FROM sales_2023 s23, sales_2024 s24;
CTEs are more readable and can be referenced multiple times. Use CTEs for complex queries, subqueries for simple one-off filters. Modern SQL developers prefer CTEs!
Good database design is like organizing a library - everything has its place, no duplication, easy to find! Normalization is the process of organizing data to reduce redundancy.
1st Normal Form (1NF)
• Each column contains atomic (indivisible) values
• No repeating groups or arrays
2nd Normal Form (2NF)
• Must be in 1NF
• All non-key columns depend on the entire primary key
3rd Normal Form (3NF)
• Must be in 2NF
• No transitive dependencies (non-key columns don't depend on other non-key columns)
Both are excellent open-source databases! PostgreSQL is more feature-rich and standards-compliant, while MySQL is simpler and faster for basic operations.
The "most advanced open-source database". Better for complex queries and data integrity.
Best for:
Simple, fast, and widely used. Powers WordPress, Facebook, and millions of websites.
Best for:
Learn PostgreSQL first - it's more standards-compliant, so your SQL will work everywhere. Once you know PostgreSQL, MySQL is easy to pick up. The core SQL syntax is 95% the same!
Let's analyze an e-commerce database! We'll use JOINs, aggregations, window functions, and CTEs to extract business insights.
-- 1. Top 10 customers by revenue
SELECT
c.customer_id,
c.customer_name,
c.email,
COUNT(o.order_id) AS order_count,
SUM(o.total_amount) AS total_revenue
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name, c.email
ORDER BY total_revenue DESC
LIMIT 10;
-- 2. Monthly revenue trend with growth rate
WITH monthly_revenue AS (
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(total_amount) AS revenue
FROM orders
GROUP BY month
)
SELECT
month,
revenue,
LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue,
ROUND(
(revenue - LAG(revenue) OVER (ORDER BY month)) /
LAG(revenue) OVER (ORDER BY month) * 100, 2
) AS growth_rate_pct
FROM monthly_revenue
ORDER BY month;
-- 3. Product performance by category
SELECT
p.category,
p.product_name,
COUNT(oi.order_id) AS times_ordered,
SUM(oi.quantity) AS total_quantity,
SUM(oi.quantity * oi.price) AS total_revenue,
RANK() OVER (
PARTITION BY p.category
ORDER BY SUM(oi.quantity * oi.price) DESC
) AS rank_in_category
FROM products p
JOIN order_items oi ON p.product_id = oi.product_id
GROUP BY p.category, p.product_name
ORDER BY p.category, rank_in_category;
-- 4. Customer segmentation (RFM Analysis)
WITH customer_metrics AS (
SELECT
customer_id,
MAX(order_date) AS last_order_date,
COUNT(order_id) AS frequency,
SUM(total_amount) AS monetary
FROM orders
GROUP BY customer_id
)
SELECT
c.customer_name,
CURRENT_DATE - cm.last_order_date AS recency_days,
cm.frequency,
cm.monetary,
CASE
WHEN CURRENT_DATE - cm.last_order_date < 30 AND cm.frequency > 5 THEN 'VIP'
WHEN CURRENT_DATE - cm.last_order_date < 90 THEN 'Active'
WHEN CURRENT_DATE - cm.last_order_date < 180 THEN 'At Risk'
ELSE 'Churned'
END AS customer_segment
FROM customers c
JOIN customer_metrics cm ON c.customer_id = cm.customer_id
ORDER BY cm.monetary DESC;
-- 5. Cohort analysis - retention by signup month
WITH cohorts AS (
SELECT
customer_id,
DATE_TRUNC('month', signup_date) AS cohort_month
FROM customers
)
SELECT
c.cohort_month,
DATE_TRUNC('month', o.order_date) AS order_month,
COUNT(DISTINCT o.customer_id) AS active_customers
FROM cohorts c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.cohort_month, order_month
ORDER BY c.cohort_month, order_month;
You've mastered SQL! Next, we'll explore Big Data & Tools - learn about Apache Spark, data warehousing, ETL pipelines, and cloud platforms like AWS, GCP, and Azure. Get ready to work with massive datasets!