Module 2: Statistics & Probability

Master the mathematical foundation for data analysis and making data-driven decisions

📊 What is Statistics?

Imagine you're a detective trying to understand patterns in data. Statistics gives you the tools to summarize data, find patterns, and make predictions. It's the language of data science - without it, you're just guessing!

Simple Definition

Statistics is the science of collecting, analyzing, and interpreting data to make informed decisions. Probability helps us understand uncertainty and make predictions about future events.

Example: Coin Flip

Probability of heads = 0.5 (50%)

But if you flip 100 times, you might get 48 heads - statistics helps explain this!
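You can see this gap between theory and outcome with a quick simulation (a sketch using NumPy; the seed below is arbitrary, so a different seed gives a different count):

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for reproducibility
flips = rng.integers(0, 2, size=100)  # 100 fair flips: 0 = tails, 1 = heads
heads = int(flips.sum())
print(f"Heads in 100 flips: {heads}")  # usually near 50, rarely exactly 50
```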

Why Statistics for Data Science?

Understand Data

Summarize millions of data points into meaningful insights

Make Decisions

Test hypotheses and validate assumptions with confidence

Predict Future

Use probability to forecast outcomes and trends

Avoid Mistakes

Distinguish real patterns from random noise

📈 Descriptive Statistics

Descriptive statistics summarize and describe data. Instead of looking at thousands of numbers, you get a few key metrics that tell the story. Think of it as creating a "profile" of your data.

Measures of Central Tendency

These tell you where the "center" or "typical" value of your data is.

# Import libraries
import numpy as np
import pandas as pd
from scipy import stats

# Sample data: test scores
scores = [85, 90, 78, 92, 88, 85, 95, 82, 88, 90]

# Mean (average) - sum divided by count
mean = np.mean(scores)
print(f"Mean: {mean}") # 87.3

# Median (middle value when sorted)
median = np.median(scores)
print(f"Median: {median}") # 88.0

# Mode (most frequent value)
mode = stats.mode(scores)
print(f"Mode: {mode.mode}") # 85 (85, 88, and 90 all tie with two occurrences; scipy reports the smallest)

💡 When to Use Which?

  • Mean: Use for normally distributed data without outliers (e.g., heights, test scores)
  • Median: Use when you have outliers (e.g., house prices, salaries)
  • Mode: Use for categorical data (e.g., most popular product, favorite color)
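For instance, a single outlier drags the mean far from "typical" while barely moving the median (hypothetical salary numbers):

```python
import numpy as np

salaries = [40_000, 45_000, 50_000, 55_000, 60_000, 1_000_000]  # one CEO outlier
print(f"Mean:   {np.mean(salaries):,.0f}")    # 208,333 - pulled way up by the outlier
print(f"Median: {np.median(salaries):,.0f}")  # 52,500 - still reflects a typical salary
```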

Measures of Spread

# Range (max - min)
data_range = np.max(scores) - np.min(scores)
print(f"Range: {data_range}") # 17 (95 - 78)

# Variance (average squared deviation from mean)
variance = np.var(scores)
print(f"Variance: {variance:.2f}") # 22.21 (np.var uses the population formula by default)

# Standard Deviation (square root of variance)
std_dev = np.std(scores)
print(f"Std Dev: {std_dev:.2f}") # 4.71

# Quartiles (divide data into 4 parts)
Q1 = np.percentile(scores, 25) # 25th percentile
Q2 = np.percentile(scores, 50) # 50th percentile (median)
Q3 = np.percentile(scores, 75) # 75th percentile
IQR = Q3 - Q1 # Interquartile Range

🎯 Real-World Example:

Two classes have the same average score (85), but Class A has std dev of 2 (everyone scored 83-87) while Class B has std dev of 15 (scores from 70-100). Same average, very different stories! Standard deviation reveals this difference.
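A minimal sketch of that scenario with made-up scores shows how the spread differs while the means match:

```python
import numpy as np

# Hypothetical scores: both classes average 85, but the spreads differ
class_a = [83, 84, 85, 85, 86, 87]   # tightly clustered
class_b = [70, 75, 85, 85, 95, 100]  # widely spread
print(f"Means:    {np.mean(class_a):.1f} vs {np.mean(class_b):.1f}")  # 85.0 vs 85.0
print(f"Std devs: {np.std(class_a):.1f} vs {np.std(class_b):.1f}")    # 1.3 vs 10.4
```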

🎲 Probability Distributions

A probability distribution shows all possible values a variable can take and how likely each value is. Think of it as a map showing where your data lives!

Normal Distribution (Bell Curve)

The normal distribution is the most important distribution in statistics. Many natural phenomena follow this pattern: heights, test scores, measurement errors. It's symmetric and bell-shaped!

# Generate normal distribution data
mean = 100 # Average IQ
std_dev = 15 # Standard deviation
data = np.random.normal(mean, std_dev, 1000)

# 68-95-99.7 Rule (Empirical Rule)
# 68% of data within 1 std dev of mean
# 95% within 2 std devs
# 99.7% within 3 std devs

# Calculate probability
from scipy.stats import norm
prob_above_115 = 1 - norm.cdf(115, mean, std_dev)
print(f"Probability IQ > 115: {prob_above_115:.2%}") # 15.87%
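The 68-95-99.7 rule quoted above can be verified directly from the standard normal CDF:

```python
from scipy.stats import norm

# Probability mass within k standard deviations of the mean
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"Within {k} std dev(s): {p:.1%}")
# Within 1: 68.3%, within 2: 95.4%, within 3: 99.7%
```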

Other Important Distributions

Binomial Distribution

For yes/no outcomes (coin flips, pass/fail). Example: Probability of getting exactly 6 heads in 10 coin flips.

from scipy.stats import binom
prob = binom.pmf(6, 10, 0.5) # P(exactly 6 heads in 10 fair flips) ≈ 0.205

Poisson Distribution

For counting events in a fixed time period. Example: Number of customers arriving per hour, emails received per day.

from scipy.stats import poisson
prob = poisson.pmf(5, 3) # P(exactly 5 events when the average is 3) ≈ 0.101

🔬 Hypothesis Testing

Hypothesis testing helps you make decisions based on data. Is a new drug effective? Does a website change increase sales? Statistics gives you a framework to answer these questions with confidence!

The Hypothesis Testing Process

1. State Hypotheses

Null Hypothesis (H₀): No effect/difference (status quo)

Alternative Hypothesis (H₁): There is an effect/difference

2. Choose Significance Level (α)

• Usually 0.05 (5%) - willing to accept 5% chance of false positive

3. Calculate Test Statistic

• Use appropriate test (t-test, chi-square, etc.)

4. Calculate P-value

• Probability of seeing results this extreme if null hypothesis is true

5. Make Decision

• If p-value < α: Reject null hypothesis (result is significant!)

• If p-value >= α: Fail to reject null hypothesis

T-Test Example

# Question: Does new teaching method improve scores?

from scipy.stats import ttest_ind

# Old method scores

old_method = [78, 82, 75, 80, 79, 81, 77, 83]

# New method scores

new_method = [85, 88, 84, 90, 87, 89, 86, 91]

# Perform t-test

t_statistic, p_value = ttest_ind(old_method, new_method)

print(f"T-statistic: {t_statistic:.3f}")

print(f"P-value: {p_value:.4f}")

# Interpret results

alpha = 0.05

if p_value < alpha:

print("Reject null hypothesis - new method is better!")

else:

print("Cannot reject null hypothesis - no significant difference")

⚠️ Common Mistakes:

  • P-hacking: Testing multiple hypotheses until you find significance
  • Confusing correlation with causation: Just because A and B correlate doesn't mean A causes B
  • Ignoring sample size: Small samples can give misleading results
  • Misinterpreting p-values: P-value is NOT the probability the hypothesis is true!
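The p-hacking risk is easy to demonstrate: test enough truly null comparisons and roughly 5% come back "significant" by chance alone (a simulation sketch; the seed and sample sizes are arbitrary):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
false_positives = 0
n_tests = 1000
for _ in range(n_tests):
    # Two samples from the SAME distribution: the null hypothesis is true
    a = rng.normal(0, 1, 30)
    b = rng.normal(0, 1, 30)
    _, p = ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1
print(f"False positives: {false_positives} / {n_tests}")  # roughly 5%
```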

📏 Confidence Intervals

A confidence interval gives you a range where the true value likely falls. Instead of saying "average height is 170cm", you say "average height is between 168-172cm with 95% confidence."

Understanding Confidence Intervals

A 95% confidence interval means: If we repeated this study 100 times, about 95 of those intervals would contain the true population parameter.

# Calculate confidence interval for mean
from scipy import stats

data = [23, 25, 27, 24, 26, 28, 25, 24, 26, 27]
confidence = 0.95 # 95% confidence

mean = np.mean(data)
sem = stats.sem(data) # Standard error of the mean
interval = stats.t.interval(confidence, len(data)-1, mean, sem)
print(f"95% CI: {interval}")
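The repeated-study interpretation can be checked by simulation: draw many samples from a known population and count how many 95% intervals capture the true mean (a sketch; the seed and population parameters below are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mean, covered, n_trials = 50, 0, 1000
for _ in range(n_trials):
    sample = rng.normal(true_mean, 10, 25)  # sample of 25 from a known population
    lo, hi = stats.t.interval(0.95, len(sample) - 1,
                              loc=np.mean(sample), scale=stats.sem(sample))
    if lo <= true_mean <= hi:
        covered += 1
print(f"Intervals containing the true mean: {covered} / {n_trials}")  # ~950
```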

🔗 Correlation and Causation

Correlation measures how two variables move together. But remember: correlation does NOT imply causation! Ice cream sales and drowning deaths correlate, but ice cream doesn't cause drowning - both increase in summer!

# Calculate correlation coefficient
x = [1, 2, 3, 4, 5] # Study hours
y = [50, 60, 70, 80, 90] # Test scores

correlation = np.corrcoef(x, y)[0, 1]
print(f"Correlation: {correlation:.3f}") # 1.000 = perfect positive

# Correlation ranges from -1 to +1
# +1: Perfect positive (both increase together)
#  0: No correlation
# -1: Perfect negative (one increases, other decreases)
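The ice-cream example can be simulated: let a hidden confounder (temperature) drive two otherwise unrelated variables, and a strong correlation appears anyway (all coefficients and noise levels here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical confounder: daily temperature drives both variables
temperature = rng.uniform(10, 35, 200)                       # 200 summer days, °C
ice_cream_sales = 20 * temperature + rng.normal(0, 40, 200)  # units sold
drownings = 0.5 * temperature + rng.normal(0, 2, 200)        # incidents

r = np.corrcoef(ice_cream_sales, drownings)[0, 1]
print(f"Correlation (ice cream vs drownings): {r:.2f}")  # strongly positive
```

Neither variable causes the other; conditioning on temperature would make the association largely disappear.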

🧪 A/B Testing

A/B testing is how companies make data-driven decisions. Show version A to half your users, version B to the other half, then use statistics to determine which performs better!

# A/B Test Example: Website button color
from scipy.stats import chi2_contingency

# Results: [clicked, didn't click]
button_a = [120, 880] # Blue button: 12% click rate
button_b = [150, 850] # Red button: 15% click rate

# Perform chi-square test
observed = np.array([button_a, button_b])
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print("Significant difference! Use red button.")
else:
    print("No significant difference - keep testing.")

🎯 A/B Testing Best Practices:

  • Test one change at a time
  • Ensure sufficient sample size
  • Run the test long enough (account for day-of-week effects)
  • Randomly assign users to groups
  • Define success metrics before starting
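For the sample-size point, the standard normal-approximation formula for comparing two proportions gives a rough answer; the 12% baseline, 15% target, 5% significance level, and 80% power below are illustrative assumptions, not universal defaults:

```python
from scipy.stats import norm

# Hypothetical inputs: detect a lift from a 12% to a 15% click rate
p1, p2 = 0.12, 0.15
alpha, power = 0.05, 0.80

z_alpha = norm.ppf(1 - alpha / 2)  # ≈ 1.96 for a two-sided test
z_beta = norm.ppf(power)           # ≈ 0.84
p_bar = (p1 + p2) / 2

# Sample size per group (normal approximation for two proportions)
n = ((z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
      + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
     / (p2 - p1) ** 2)
print(f"Users needed per group: {int(n) + 1}")  # roughly 2,000 per group
```

Smaller effects require dramatically more users, since the required n grows with the inverse square of the difference you want to detect.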

🎯 Complete Statistical Analysis Example

Let's put everything together! Here's a complete example analyzing customer satisfaction data using descriptive statistics, hypothesis testing, effect size, and confidence intervals.

# Import libraries
import numpy as np
from scipy import stats

# Sample data: Customer satisfaction scores (1-10)
before_training = [6.2, 5.8, 6.5, 5.9, 6.1, 6.3, 5.7, 6.0, 6.4, 5.8,
                   6.2, 5.9, 6.1, 6.0, 5.8, 6.3, 6.1, 5.9, 6.2, 6.0]
after_training = [7.5, 7.8, 7.2, 7.6, 7.9, 7.4, 7.7, 7.3, 7.8, 7.5,
                  7.6, 7.4, 7.7, 7.5, 7.8, 7.3, 7.6, 7.4, 7.7, 7.5]

# Step 1: Descriptive Statistics
print("=== BEFORE TRAINING ===")
print(f"Mean: {np.mean(before_training):.2f}")
print(f"Median: {np.median(before_training):.2f}")
print(f"Std Dev: {np.std(before_training, ddof=1):.2f}")
print(f"Range: {np.max(before_training) - np.min(before_training):.2f}")

print("\n=== AFTER TRAINING ===")
print(f"Mean: {np.mean(after_training):.2f}")
print(f"Median: {np.median(after_training):.2f}")
print(f"Std Dev: {np.std(after_training, ddof=1):.2f}")
print(f"Range: {np.max(after_training) - np.min(after_training):.2f}")

# Step 2: Hypothesis Test
# H0: Training has no effect (means are equal)
# H1: Satisfaction differs after training (two-sided test)
t_stat, p_value = stats.ttest_ind(after_training, before_training)
print("\n=== HYPOTHESIS TEST ===")
print(f"T-statistic: {t_stat:.3f}")
print(f"P-value: {p_value:.6f}")

if p_value < 0.05:
    print("✓ Significant improvement! Training works!")
else:
    print("✗ No significant difference")

# Step 3: Effect Size (Cohen's d), using sample standard deviations (ddof=1)
mean_diff = np.mean(after_training) - np.mean(before_training)
pooled_std = np.sqrt((np.std(before_training, ddof=1)**2 +
                      np.std(after_training, ddof=1)**2) / 2)
cohens_d = mean_diff / pooled_std
print(f"\nEffect Size (Cohen's d): {cohens_d:.2f}")

print("Interpretation: ", end="")
if cohens_d < 0.2:
    print("Small effect")
elif cohens_d < 0.8:
    print("Medium effect")
else:
    print("Large effect")

# Step 4: Confidence Intervals
ci_before = stats.t.interval(0.95, len(before_training)-1,
                             np.mean(before_training),
                             stats.sem(before_training))
ci_after = stats.t.interval(0.95, len(after_training)-1,
                            np.mean(after_training),
                            stats.sem(after_training))
print(f"\n95% CI Before: {ci_before}")
print(f"95% CI After: {ci_after}")

# Step 5: Summary Report
print("\n=== SUMMARY ===")
print(f"Average improvement: {mean_diff:.2f} points")
print(f"Percentage improvement: {(mean_diff/np.mean(before_training)*100):.1f}%")
print(f"Statistical significance: p = {p_value:.6f}")
print(f"Practical significance: Cohen's d = {cohens_d:.2f}")

🎓 What This Analysis Shows:

  • Descriptive statistics to understand the data
  • Hypothesis testing to determine statistical significance
  • Effect size to measure practical significance
  • Confidence intervals to show uncertainty
  • Clear summary for decision-makers

🎯 What's Next?

You now understand the statistical foundation of data science! In the next module, we'll learn Data Visualization - how to create compelling charts and dashboards to communicate your insights.