Master the mathematical foundations for analyzing data and making data-driven decisions
Imagine you're a detective trying to understand patterns in data. Statistics gives you the tools to summarize data, find patterns, and make predictions. It's the language of data science - without it, you're just guessing!
Statistics is the science of collecting, analyzing, and interpreting data to make informed decisions. Probability helps us understand uncertainty and make predictions about future events.
Example: Coin Flip
Probability of heads = 0.5 (50%)
But if you flip 100 times, you might get 48 heads - statistics helps explain this!
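A quick simulation makes this concrete (a sketch using NumPy's random generator; the seed is arbitrary, chosen only for reproducibility):

```python
import numpy as np

# Simulate 100 fair coin flips: 0 = tails, 1 = heads
rng = np.random.default_rng(42)
flips = rng.integers(0, 2, size=100)
print(f"Heads in 100 flips: {flips.sum()} out of 100")
print(f"Observed proportion: {flips.mean():.2f}")
```

Run it a few times with different seeds and you will rarely get exactly 50 heads, even though the probability of heads is exactly 0.5.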
• Understand Data: summarize millions of data points into meaningful insights
• Make Decisions: test hypotheses and validate assumptions with confidence
• Predict the Future: use probability to forecast outcomes and trends
• Avoid Mistakes: distinguish real patterns from random noise
Descriptive statistics summarize and describe data. Instead of looking at thousands of numbers, you get a few key metrics that tell the story. Think of it as creating a "profile" of your data.
These tell you where the "center" or "typical" value of your data is.
# Import libraries
import numpy as np
import pandas as pd
from scipy import stats
# Sample data: test scores
scores = [85, 90, 78, 92, 88, 85, 95, 82, 88, 90]
# Mean (average) - sum divided by count
mean = np.mean(scores)
print(f"Mean: {mean}") # 87.3
# Median (middle value when sorted)
median = np.median(scores)
print(f"Median: {median}") # 88.0
# Mode (most frequent value)
mode = stats.mode(scores)
print(f"Mode: {mode.mode}") # 85 (85, 88, and 90 each appear twice; SciPy returns the smallest)
# Range (max - min)
data_range = np.max(scores) - np.min(scores)
print(f"Range: {data_range}") # 17 (95 - 78)
# Variance (average squared deviation from mean)
variance = np.var(scores) # population variance (ddof=0)
print(f"Variance: {variance:.2f}") # 22.21
# Standard Deviation (square root of variance)
std_dev = np.std(scores)
print(f"Std Dev: {std_dev:.2f}") # 4.71
# Quartiles (divide data into 4 parts)
Q1 = np.percentile(scores, 25) # 25th percentile
Q2 = np.percentile(scores, 50) # 50th percentile (median)
Q3 = np.percentile(scores, 75) # 75th percentile
IQR = Q3 - Q1 # Interquartile Range
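Since pandas is already imported above, most of these summaries are also available in one call (a convenience sketch; note that pandas computes the sample standard deviation, ddof=1, while `np.std` defaults to the population formula, ddof=0):

```python
import pandas as pd

scores = pd.Series([85, 90, 78, 92, 88, 85, 95, 82, 88, 90])
print(scores.describe())  # count, mean, std, min, quartiles, max in one call
```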
Two classes have the same average score (85), but Class A has std dev of 2 (everyone scored 83-87) while Class B has std dev of 15 (scores from 70-100). Same average, very different stories! Standard deviation reveals this difference.
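To see this in code, here are two hypothetical classes constructed so that both average exactly 85 but spread very differently:

```python
import numpy as np

# Hypothetical scores: both classes average 85
class_a = [83, 83, 87, 87, 85, 85, 83, 87, 84, 86]  # tightly clustered
class_b = [70, 100, 75, 95, 80, 90, 72, 98, 85, 85]  # widely spread
print(f"Class A: mean={np.mean(class_a):.1f}, std={np.std(class_a):.1f}")
print(f"Class B: mean={np.mean(class_b):.1f}, std={np.std(class_b):.1f}")
```

The means are identical; only the standard deviations reveal how different the two classes really are.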
A probability distribution shows all possible values a variable can take and how likely each value is. Think of it as a map showing where your data lives!
The normal distribution is the most important distribution in statistics. Many natural phenomena follow this pattern: heights, test scores, measurement errors. It's symmetric and bell-shaped!
# Generate normal distribution data
mean = 100 # Average IQ
std_dev = 15 # Standard deviation
data = np.random.normal(mean, std_dev, 1000)
# 68-95-99.7 Rule (Empirical Rule)
# 68% of data within 1 std dev of mean
# 95% within 2 std devs
# 99.7% within 3 std devs
# Calculate probability
from scipy.stats import norm
prob_above_115 = 1 - norm.cdf(115, mean, std_dev)
print(f"Probability IQ > 115: {prob_above_115:.2%}")
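The 68-95-99.7 rule can be checked directly by simulation (a sketch; the exact percentages vary slightly from sample to sample):

```python
import numpy as np

# Draw a large sample from a normal distribution (mean 100, std dev 15)
rng = np.random.default_rng(0)
data = rng.normal(100, 15, 100_000)
for k in (1, 2, 3):
    # Fraction of values within k standard deviations of the mean
    within = np.mean(np.abs(data - 100) <= k * 15)
    print(f"Within {k} std dev(s): {within:.1%}")  # ~68%, ~95%, ~99.7%
```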
For yes/no outcomes (coin flips, pass/fail). Example: Probability of getting exactly 6 heads in 10 coin flips.
from scipy.stats import binom
prob = binom.pmf(6, 10, 0.5) # k=6 successes, n=10 trials, p=0.5
print(f"P(6 heads in 10 flips): {prob:.4f}") # 0.2051
For counting events in a fixed time period. Example: Number of customers arriving per hour, emails received per day.
from scipy.stats import poisson
prob = poisson.pmf(5, 3) # k=5 events, mean rate mu=3
print(f"P(5 events at rate 3): {prob:.4f}") # 0.1008
Hypothesis testing helps you make decisions based on data. Is a new drug effective? Does a website change increase sales? Statistics gives you a framework to answer these questions with confidence!
1. State Hypotheses
• Null Hypothesis (H₀): No effect/difference (status quo)
• Alternative Hypothesis (H₁): There is an effect/difference
2. Choose Significance Level (α)
• Usually 0.05 (5%) - willing to accept 5% chance of false positive
3. Calculate Test Statistic
• Use appropriate test (t-test, chi-square, etc.)
4. Calculate P-value
• Probability of seeing results this extreme if null hypothesis is true
5. Make Decision
• If p-value < α: Reject null hypothesis (result is significant!)
• If p-value >= α: Fail to reject null hypothesis
# Question: Does new teaching method improve scores?
from scipy.stats import ttest_ind
# Old method scores
old_method = [78, 82, 75, 80, 79, 81, 77, 83]
# New method scores
new_method = [85, 88, 84, 90, 87, 89, 86, 91]
# Perform t-test
t_statistic, p_value = ttest_ind(old_method, new_method)
print(f"T-statistic: {t_statistic:.3f}")
print(f"P-value: {p_value:.4f}")
# Interpret results
alpha = 0.05
if p_value < alpha:
    print("Reject null hypothesis - significant difference between methods!")
else:
    print("Cannot reject null hypothesis - no significant difference")
A confidence interval gives you a range where the true value likely falls. Instead of saying "average height is 170cm", you say "average height is between 168-172cm with 95% confidence."
A 95% confidence interval means: If we repeated this study 100 times, about 95 of those intervals would contain the true population parameter.
# Calculate confidence interval for mean
from scipy import stats
data = [23, 25, 27, 24, 26, 28, 25, 24, 26, 27]
confidence = 0.95 # 95% confidence
mean = np.mean(data)
sem = stats.sem(data) # Standard error of mean
interval = stats.t.interval(confidence, len(data)-1, mean, sem)
print(f"95% CI: {interval}")
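The repeated-sampling interpretation can itself be verified by simulation (a sketch; the true mean, noise level, and seed are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mean, trials, covered = 50, 1000, 0
for _ in range(trials):
    sample = rng.normal(true_mean, 10, 30)  # draw a fresh sample each time
    ci = stats.t.interval(0.95, len(sample) - 1,
                          np.mean(sample), stats.sem(sample))
    covered += ci[0] <= true_mean <= ci[1]  # did this interval catch the truth?
print(f"Intervals containing the true mean: {covered / trials:.1%}")
```

The printed coverage should land close to 95%, matching the definition above.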
Correlation measures how two variables move together. But remember: correlation does NOT imply causation! Ice cream sales and drowning deaths correlate, but ice cream doesn't cause drowning - both increase in summer!
# Calculate correlation coefficient
x = [1, 2, 3, 4, 5] # Study hours
y = [50, 60, 70, 80, 90] # Test scores
correlation = np.corrcoef(x, y)[0, 1]
print(f"Correlation: {correlation:.3f}") # 1.0 = perfect positive
# Correlation ranges from -1 to +1
# +1: Perfect positive (both increase together)
# 0: No correlation
# -1: Perfect negative (one increases, other decreases)
A/B testing is how companies make data-driven decisions. Show version A to half your users, version B to the other half, then use statistics to determine which performs better!
# A/B Test Example: Website button color
from scipy.stats import chi2_contingency
# Results: [clicked, didn't click]
button_a = [120, 880] # Blue button: 12% click rate
button_b = [150, 850] # Red button: 15% click rate
# Perform chi-square test
observed = np.array([button_a, button_b])
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
    print("Significant difference! Use the red button.")
else:
    print("No significant difference - keep testing.")
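For 2x2 tables, Fisher's exact test is a common cross-check, especially with small samples (a sketch on the same table; note that `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default, so with these particular counts its p-value lands just above 0.05):

```python
from scipy.stats import fisher_exact

# Same 2x2 table: rows = buttons, columns = [clicked, didn't click]
table = [[120, 880], [150, 850]]
odds_ratio, p = fisher_exact(table)
print(f"Odds ratio: {odds_ratio:.3f}")
print(f"Fisher exact p-value: {p:.4f}")
```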
Let's put everything together! Here's a complete example analyzing customer satisfaction data using descriptive statistics, hypothesis testing, and visualization.
# Import libraries
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
# Sample data: Customer satisfaction scores (1-10)
before_training = [6.2, 5.8, 6.5, 5.9, 6.1, 6.3, 5.7, 6.0, 6.4, 5.8,
6.2, 5.9, 6.1, 6.0, 5.8, 6.3, 6.1, 5.9, 6.2, 6.0]
after_training = [7.5, 7.8, 7.2, 7.6, 7.9, 7.4, 7.7, 7.3, 7.8, 7.5,
7.6, 7.4, 7.7, 7.5, 7.8, 7.3, 7.6, 7.4, 7.7, 7.5]
# Step 1: Descriptive Statistics
print("=== BEFORE TRAINING ===")
print(f"Mean: {np.mean(before_training):.2f}")
print(f"Median: {np.median(before_training):.2f}")
print(f"Std Dev: {np.std(before_training):.2f}")
print(f"Range: {np.max(before_training) - np.min(before_training):.2f}")
print("\n=== AFTER TRAINING ===")
print(f"Mean: {np.mean(after_training):.2f}")
print(f"Median: {np.median(after_training):.2f}")
print(f"Std Dev: {np.std(after_training):.2f}")
print(f"Range: {np.max(after_training) - np.min(after_training):.2f}")
# Step 2: Hypothesis Test
# H0: Training has no effect (means are equal)
# H1: Training improves satisfaction (after > before)
# One-sided test to match H1 (the alternative= keyword needs SciPy >= 1.6)
t_stat, p_value = stats.ttest_ind(after_training, before_training, alternative='greater')
print("\n=== HYPOTHESIS TEST ===")
print(f"T-statistic: {t_stat:.3f}")
print(f"P-value: {p_value:.6f}")
if p_value < 0.05:
    print("✓ Significant improvement! Training works!")
else:
    print("✗ No significant difference")
# Step 3: Effect Size (Cohen's d)
mean_diff = np.mean(after_training) - np.mean(before_training)
# Pooled standard deviation (ddof=1 for sample estimates)
pooled_std = np.sqrt((np.std(before_training, ddof=1)**2 + np.std(after_training, ddof=1)**2) / 2)
cohens_d = mean_diff / pooled_std
print(f"\nEffect Size (Cohen's d): {cohens_d:.2f}")
print("Interpretation: ", end="")
# Cohen's conventional thresholds: 0.2 small, 0.5 medium, 0.8 large
if cohens_d < 0.2:
    print("Negligible effect")
elif cohens_d < 0.5:
    print("Small effect")
elif cohens_d < 0.8:
    print("Medium effect")
else:
    print("Large effect")
# Step 4: Confidence Intervals
ci_before = stats.t.interval(0.95, len(before_training)-1,
np.mean(before_training),
stats.sem(before_training))
ci_after = stats.t.interval(0.95, len(after_training)-1,
np.mean(after_training),
stats.sem(after_training))
print(f"\n95% CI Before: {ci_before}")
print(f"95% CI After: {ci_after}")
# Step 5: Summary Report
print("\n=== SUMMARY ===")
print(f"Average improvement: {mean_diff:.2f} points")
print(f"Percentage improvement: {(mean_diff/np.mean(before_training)*100):.1f}%")
print(f"Statistical significance: p = {p_value:.6f}")
print(f"Practical significance: Cohen's d = {cohens_d:.2f}")
You now understand the statistical foundation of data science! In the next module, we'll learn Data Visualization - how to create compelling charts and dashboards to communicate your insights.