Module 3: Data Visualization

Transform data into compelling visual stories that drive decisions and insights

📊 What is Data Visualization?

Imagine trying to understand 10,000 rows of sales data in a spreadsheet versus seeing a beautiful chart that instantly shows trends, patterns, and outliers. That's the power of visualization - turning numbers into insights that anyone can understand!

Simple Definition

Data Visualization is the graphical representation of data. It uses visual elements like charts, graphs, and maps to help people understand patterns, trends, and insights in data that would be hard to see in raw numbers.

Example: Sales Data

Numbers: 100, 150, 120, 180, 200, 250...

Chart: 📈 Upward trend clearly visible!
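As a minimal sketch, the six values listed above can be plotted with Matplotlib to make the upward trend visible. The period numbering on the x-axis and the variable names are assumptions added for illustration; only the six numbers come from the example.

```python
import matplotlib.pyplot as plt

# The six sample values from the example above
sales = [100, 150, 120, 180, 200, 250]
periods = list(range(1, len(sales) + 1))  # assumed: one value per period

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(periods, sales, marker="o")
ax.set_title("Sales by Period")
ax.set_xlabel("Period")
ax.set_ylabel("Sales")
fig.tight_layout()
# Call plt.show() to display the figure, or fig.savefig(...) to write it to disk
```

Even with the dip in the third value, the overall rise is much easier to see in the line than in the list of numbers.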

Why Visualization Matters

Instant Understanding

A well-designed chart reveals a trend at a glance that could take minutes to find in a table of numbers

Find Patterns

Spot trends, outliers, and correlations easily

Tell Stories

Communicate insights to non-technical audiences

Drive Decisions

Make data-driven choices with confidence

📚 Learn More:

  • Matplotlib - Python plotting library
  • Seaborn - Statistical data visualization
  • Plotly - Interactive visualizations

📈 Matplotlib Fundamentals

Matplotlib is the foundation of Python visualization - like the Swiss Army knife of plotting! It gives you complete control over every element of your charts. Think of it as the "low-level" tool that other libraries build upon.

Line Plots

Line plots show trends over time. Perfect for stock prices, temperature changes, or any continuous data that changes over time.

# Import libraries

import matplotlib.pyplot as plt

import numpy as np

# Create data

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']

sales = [15000, 18000, 22000, 25000, 30000, 35000]

# Create line plot

plt.figure(figsize=(10, 6)) # Set size

plt.plot(months, sales, marker='o', linewidth=2, color='#8b5cf6')

plt.title('Monthly Sales Trend', fontsize=16, fontweight='bold')

plt.xlabel('Month', fontsize=12)

plt.ylabel('Sales ($)', fontsize=12)

plt.grid(True, alpha=0.3) # Add grid

plt.tight_layout() # Prevent label cutoff

plt.show()

Scatter Plots

# Scatter plot - show relationship between two variables

study_hours = [1, 2, 3, 4, 5, 6, 7, 8]

test_scores = [55, 60, 65, 70, 75, 85, 90, 95]

plt.figure(figsize=(8, 6))

plt.scatter(study_hours, test_scores, s=100, c='#8b5cf6', alpha=0.6)

plt.title('Study Hours vs Test Scores')

plt.xlabel('Hours Studied')

plt.ylabel('Test Score')

plt.show()
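The relationship shown in the scatter plot can also be put into numbers. The sketch below reuses the same study-hours data, computes the Pearson correlation with np.corrcoef, and overlays a least-squares line from np.polyfit. Treat it as one possible way to annotate a scatter plot; the choice of a straight-line fit is an assumption that suits this small dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

study_hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
test_scores = np.array([55, 60, 65, 70, 75, 85, 90, 95])

# Pearson correlation coefficient between the two variables
r = np.corrcoef(study_hours, test_scores)[0, 1]

# Least-squares straight line: test_scores ~ slope * study_hours + intercept
slope, intercept = np.polyfit(study_hours, test_scores, deg=1)

fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(study_hours, test_scores, s=100, c="#8b5cf6", alpha=0.6)
ax.plot(study_hours, slope * study_hours + intercept,
        color="gray", linestyle="--", label=f"Linear fit (r = {r:.2f})")
ax.set_title("Study Hours vs Test Scores")
ax.set_xlabel("Hours Studied")
ax.set_ylabel("Test Score")
ax.legend()
# Call plt.show() to display the figure
```

A value of r close to 1 indicates a strong positive linear relationship, but it says nothing about cause and effect.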

Bar Charts

# Bar chart - compare categories

products = ['Laptop', 'Phone', 'Tablet', 'Watch']

revenue = [45000, 65000, 30000, 20000]

plt.figure(figsize=(10, 6))

plt.bar(products, revenue, color='#8b5cf6', alpha=0.7)

plt.title('Product Revenue Comparison')

plt.xlabel('Product')

plt.ylabel('Revenue ($)')

plt.xticks(rotation=45) # Rotate labels

plt.show()

Histograms

# Histogram - show distribution of data

ages = np.random.normal(35, 10, 1000) # 1000 ages, mean 35, std 10

plt.figure(figsize=(10, 6))

plt.hist(ages, bins=30, color='#8b5cf6', alpha=0.7, edgecolor='black')

plt.title('Age Distribution')

plt.xlabel('Age')

plt.ylabel('Frequency')

plt.axvline(ages.mean(), color='red', linestyle='--', label='Mean')

plt.legend()

plt.show()

💡 Pro Tip:

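Use plt.style.use('seaborn-v0_8') at the start of your script to give Matplotlib plots a more modern, polished look. The plain 'seaborn' style name was deprecated in Matplotlib 3.6, so it only works on older versions.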
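Because style names differ between Matplotlib versions, it can help to check which ones are installed before choosing. The following sketch lists the available styles and applies one temporarily with plt.style.context; the fallback to "default" and the variable names are illustrative choices.

```python
import matplotlib.pyplot as plt

# Style names differ between Matplotlib versions, so look them up first
available = plt.style.available
print(available)

# Pick a seaborn-like style if one is installed, else fall back to the default
seaborn_styles = [name for name in available if name.startswith("seaborn")]
chosen = seaborn_styles[0] if seaborn_styles else "default"

# plt.style.context applies the style only inside the with-block,
# whereas plt.style.use changes it for the rest of the script
with plt.style.context(chosen):
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [2, 4, 3])
    ax.set_title(f"Style: {chosen}")
```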

🎨 Seaborn for Statistical Plots

Seaborn is built on top of Matplotlib but makes beautiful statistical plots with less code! It's like Matplotlib with a designer's touch - perfect for exploring relationships in data.

Heatmaps

Heatmaps show data as colors in a matrix. Perfect for correlation matrices, confusion matrices, or any grid-based data.

# Import Seaborn

import seaborn as sns

import pandas as pd

# Create sample data

data = pd.DataFrame({
    'Math': [85, 90, 78, 92, 88],
    'Science': [88, 85, 80, 95, 90],
    'English': [75, 80, 85, 88, 82]
})

# Create correlation heatmap

plt.figure(figsize=(8, 6))

corr = data.corr() # Calculate correlations

sns.heatmap(corr, annot=True, cmap='PuOr', vmin=-1, vmax=1, center=0) # Diverging palette: correlations range from -1 to 1 around 0

plt.title('Subject Correlation Heatmap')

plt.show()

Pair Plots

# Pair plot - visualize all pairwise relationships

iris = sns.load_dataset('iris') # Load sample dataset

sns.pairplot(iris, hue='species', palette='Set2')

plt.suptitle('Iris Dataset Pair Plot', y=1.02)

plt.show()

Violin and Box Plots

# Violin plot - shows distribution shape

tips = sns.load_dataset('tips')

plt.figure(figsize=(10, 6))

sns.violinplot(x='day', y='total_bill', data=tips, palette='Purples')

plt.title('Total Bill Distribution by Day')

plt.show()

# Box plot - shows quartiles and outliers

plt.figure(figsize=(10, 6))

sns.boxplot(x='day', y='total_bill', data=tips, palette='Set2')

plt.title('Total Bill Box Plot by Day')

plt.show()

🎯 When to Use Seaborn:

  • Statistical visualizations (distributions, relationships)
  • Quick exploratory data analysis
  • When you want beautiful plots with minimal code
  • Working with Pandas DataFrames

⚡ Plotly for Interactive Visualizations

Plotly creates interactive charts that users can zoom, pan, and hover over for details! That makes it a good fit for dashboards and web applications.

Interactive Line Chart

# Import Plotly

import plotly.graph_objects as go

import plotly.express as px

# Create interactive line chart

dates = pd.date_range('2024-01-01', periods=30)

values = np.cumsum(np.random.randn(30)) + 100

fig = go.Figure()

fig.add_trace(go.Scatter(
    x=dates, y=values,
    mode='lines+markers',
    name='Stock Price',
    line=dict(color='#8b5cf6', width=2)
))

fig.update_layout(
    title='Interactive Stock Price Chart',
    xaxis_title='Date',
    yaxis_title='Price ($)',
    hovermode='x unified'
)

fig.show()

3D Scatter Plot

# 3D scatter plot

df = px.data.iris()

fig = px.scatter_3d(df, x='sepal_length', y='sepal_width', z='petal_width',
                    color='species', size='petal_length',
                    title='3D Iris Dataset')

fig.show()

💡 Plotly Express vs Graph Objects:

Plotly Express (px): High-level, quick plots with one line of code
Graph Objects (go): Low-level, complete control over every detail

🚀 Dashboard Creation with Streamlit

Streamlit turns Python scripts into interactive web apps in minutes! No HTML, CSS, or JavaScript needed. Perfect for creating data dashboards and sharing your analysis with others.

Simple Streamlit Dashboard

# Install: pip install streamlit

# Run: streamlit run app.py

# app.py

import streamlit as st

import pandas as pd

import plotly.express as px

# Title and description

st.title('📊 Sales Dashboard')

st.write('Interactive sales analysis dashboard')

# Sidebar filters

year = st.sidebar.selectbox('Select Year', [2022, 2023, 2024])

region = st.sidebar.multiselect('Select Region', ['East', 'West', 'North', 'South'])

# Load and display data

df = pd.read_csv('sales.csv')

st.dataframe(df.head())

# Metrics

col1, col2, col3 = st.columns(3)

col1.metric('Total Revenue', '$1.2M', '+12%')

col2.metric('Orders', '1,234', '+5%')

col3.metric('Customers', '567', '+8%')

# Interactive chart

fig = px.line(df, x='date', y='revenue', title='Revenue Trend')

st.plotly_chart(fig, use_container_width=True)

🎯 Streamlit Features:

  • st.slider(): Interactive sliders for filtering
  • st.selectbox(): Dropdown menus
  • st.file_uploader(): Upload CSV/Excel files
  • st.download_button(): Download processed data
  • st.cache_data: Cache data for faster loading
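The features listed above could be combined roughly as follows. This is a sketch under stated assumptions, not a reference implementation: the uploaded CSV is assumed to have 'region' and 'revenue' columns, and filter_sales, load_csv, and main are hypothetical helper names. The filtering logic lives in a plain pandas function so it can be checked without launching Streamlit.

```python
import pandas as pd


def filter_sales(df: pd.DataFrame, regions: list, min_revenue: float) -> pd.DataFrame:
    """Keep rows in the chosen regions with revenue at or above the threshold."""
    mask = df["revenue"] >= min_revenue
    if regions:  # an empty selection means "no region filter"
        mask &= df["region"].isin(regions)
    return df[mask]


def main() -> None:
    # Imported here so filter_sales can be reused and tested without Streamlit
    import streamlit as st

    @st.cache_data  # cached result is reused until the input changes
    def load_csv(file) -> pd.DataFrame:
        return pd.read_csv(file)

    uploaded = st.file_uploader("Upload sales CSV", type="csv")
    if uploaded is None:
        st.info("Upload a CSV with 'region' and 'revenue' columns to begin.")
        return

    df = load_csv(uploaded)
    regions = st.sidebar.multiselect("Select Region", sorted(df["region"].unique()))
    min_revenue = st.sidebar.slider(
        "Minimum revenue", 0.0, float(df["revenue"].max()), 0.0
    )

    filtered = filter_sales(df, regions, min_revenue)
    st.dataframe(filtered)
    st.download_button(
        "Download filtered data",
        data=filtered.to_csv(index=False),
        file_name="filtered_sales.csv",
        mime="text/csv",
    )

# To launch: add a call to main() at the end of app.py, then run
#   streamlit run app.py
```

Keeping the widget values as plain arguments to filter_sales means the sidebar selections actually change what is displayed, and the same function can be reused in a notebook.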

📋 Chart Types and When to Use Them

Choosing the right chart is crucial! The wrong chart can confuse your audience, while the right one makes insights crystal clear. Here's your decision guide:

📈 Line Chart

Use for: Trends over time

Examples: Stock prices, temperature, website traffic

Best when: Showing continuous data with time on x-axis

📊 Bar Chart

Use for: Comparing categories

Examples: Sales by product, survey responses

Best when: Comparing 3-10 categories

🥧 Pie Chart

Use for: Parts of a whole

Examples: Market share, budget breakdown

Best when: Showing 2-5 categories that sum to 100%

🔵 Scatter Plot

Use for: Relationships between variables

Examples: Height vs weight, price vs demand

Best when: Looking for correlations

📦 Box Plot

Use for: Distribution and outliers

Examples: Salary ranges, test scores

Best when: Comparing distributions across groups

🌡️ Heatmap

Use for: Matrix data with color intensity

Examples: Correlations, confusion matrix

Best when: Showing patterns in 2D data

⚠️ Common Mistakes to Avoid:

  • 3D charts: Usually add confusion, not clarity
  • Too many colors: Stick to 3-5 colors max
  • Pie charts with many slices: Use bar chart instead
  • Dual y-axes: Can be misleading, use with caution
  • Missing labels: Always label axes and add titles!
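The "pie charts with many slices" mistake is easiest to judge side by side. The sketch below draws the same data as a pie and as a sorted horizontal bar chart; the market-share figures are invented for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical market-share figures with many small categories
shares = {"A": 28, "B": 22, "C": 15, "D": 11, "E": 9, "F": 7, "G": 5, "H": 3}

# Sort ascending so the largest bar ends up at the top of the horizontal chart
items = sorted(shares.items(), key=lambda kv: kv[1])
labels = [name for name, _ in items]
values = [value for _, value in items]

fig, (ax_pie, ax_bar) = plt.subplots(1, 2, figsize=(12, 5))

ax_pie.pie(values, labels=labels)
ax_pie.set_title("Pie: 8 slices are hard to compare")

ax_bar.barh(labels, values, color="#8b5cf6")
ax_bar.set_title("Bar: same data, sorted")
ax_bar.set_xlabel("Market share (%)")

fig.tight_layout()
# Call plt.show() to display both panels
```

In the bar version, small differences such as 7% versus 5% can be read off a common axis instead of being estimated from angles.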

🎨 Color Theory and Accessibility

Colors aren't just decoration - they communicate meaning! But roughly 8% of men and 0.5% of women have some form of color vision deficiency, most commonly difficulty telling red from green. Your beautiful red-green chart might be unreadable to them!

Color Palette Guidelines

Sequential (for ordered data)

Light to dark of same color: Perfect for heatmaps, choropleth maps

sns.color_palette("Blues")

sns.color_palette("Purples")

Diverging (for data with meaningful midpoint)

Two colors meeting at middle: Perfect for showing positive/negative

sns.color_palette("RdBu") # Red-Blue

sns.color_palette("PiYG") # Pink-Green

Qualitative (for categories)

Distinct colors: Perfect for categorical data

sns.color_palette("Set2")

sns.color_palette("tab10")
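A short sketch of how these palette calls can be used in practice: sns.color_palette returns a list of RGB colors, and the n_colors and as_cmap arguments control how many discrete colors you get or whether you get a continuous colormap instead. The variable names and the color counts are illustrative.

```python
import seaborn as sns

# Sequential: 5 steps from light to dark blue
sequential = sns.color_palette("Blues", n_colors=5)

# Diverging: 7 steps, with a pale neutral color in the middle
diverging = sns.color_palette("RdBu", n_colors=7)

# Qualitative: distinct hues with no implied order
qualitative = sns.color_palette("Set2", n_colors=4)

# as_hex() converts the RGB tuples to hex strings usable in Matplotlib or Plotly
print(sequential.as_hex())

# as_cmap=True returns a continuous Matplotlib colormap instead of a discrete list
blues_cmap = sns.color_palette("Blues", as_cmap=True)
```

The discrete lists suit categorical plots (for example as a palette argument), while the continuous colormap suits heatmaps.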

Accessibility Best Practices

  • Avoid red-green combinations: Use blue-orange or purple-yellow instead
  • Use patterns + colors: Add hatching or markers, not just color
  • High contrast: Ensure text is readable on backgrounds
  • Test your charts: Use colorblind simulators online
  • Add labels: Don't rely solely on color to convey information
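Several of these practices can be combined in one chart. The following sketch reuses the product revenue figures from the bar chart example, draws the bars with seaborn's "colorblind" palette, adds a different hatch pattern to each bar, and labels values directly. The specific hatch patterns are an arbitrary choice.

```python
import matplotlib.pyplot as plt
import seaborn as sns

products = ["Laptop", "Phone", "Tablet", "Watch"]
revenue = [45000, 65000, 30000, 20000]

colors = sns.color_palette("colorblind", n_colors=len(products))
hatches = ["//", "..", "xx", "--"]  # one distinct pattern per bar

fig, ax = plt.subplots(figsize=(8, 5))
bars = ax.bar(products, revenue, color=colors, edgecolor="black")

for bar, hatch, value in zip(bars, hatches, revenue):
    # The pattern distinguishes bars without relying on color alone
    bar.set_hatch(hatch)
    # Direct value label centered above each bar
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height(),
            f"${value:,}", ha="center", va="bottom")

ax.set_title("Product Revenue Comparison")
ax.set_xlabel("Product")
ax.set_ylabel("Revenue ($)")
fig.tight_layout()
# Call plt.show() to display the figure
```

Because each bar carries a pattern and a printed value, the chart remains readable in grayscale or for readers who cannot distinguish the hues.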

🎯 Complete Visualization Project

Let's put everything together! We'll analyze a sales dataset and create a comprehensive visualization report using Matplotlib, Seaborn, and Plotly.

# Complete Visualization Project: Sales Analysis

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

import plotly.express as px

plt.style.use('seaborn-v0_8')

# Step 1: Create sample sales data

np.random.seed(42)

dates = pd.date_range('2024-01-01', periods=365)

products = ['Laptop', 'Phone', 'Tablet', 'Watch', 'Headphones']

regions = ['East', 'West', 'North', 'South']

data = []

for date in dates:
    for _ in range(np.random.randint(5, 15)):
        data.append({
            'date': date,
            'product': np.random.choice(products),
            'region': np.random.choice(regions),
            'quantity': np.random.randint(1, 5),
            'price': np.random.uniform(50, 1500)
        })

df = pd.DataFrame(data)

df['revenue'] = df['quantity'] * df['price']

df['month'] = df['date'].dt.month

df['day_of_week'] = df['date'].dt.day_name()

# Step 2: Overview Statistics

print("=== SALES OVERVIEW ===")

print(f"Total Revenue: ${df['revenue'].sum():,.2f}")

print(f"Total Orders: {len(df):,}")

print(f"Average Order Value: ${df['revenue'].mean():,.2f}")

print(f"Date Range: {df['date'].min()} to {df['date'].max()}")

# Step 3: Revenue Trend Over Time (Matplotlib)

daily_revenue = df.groupby('date')['revenue'].sum()

monthly_revenue = df.groupby('month')['revenue'].sum()

plt.figure(figsize=(14, 6))

plt.plot(daily_revenue.index, daily_revenue.values, color='#8b5cf6', linewidth=1.5)

plt.title('Daily Revenue Trend 2024', fontsize=16, fontweight='bold')

plt.xlabel('Date', fontsize=12)

plt.ylabel('Revenue ($)', fontsize=12)

plt.grid(True, alpha=0.3)

plt.tight_layout()

plt.savefig('revenue_trend.png', dpi=300, bbox_inches='tight')

plt.show()

# Step 4: Product Performance (Seaborn)

product_stats = df.groupby('product').agg({
    'revenue': 'sum',
    'quantity': 'sum'
}).sort_values('revenue', ascending=False)

plt.figure(figsize=(10, 6))

sns.barplot(x=product_stats.index, y=product_stats['revenue'], palette='Purples_r')

plt.title('Revenue by Product', fontsize=16, fontweight='bold')

plt.xlabel('Product', fontsize=12)

plt.ylabel('Total Revenue ($)', fontsize=12)

plt.xticks(rotation=45)

plt.tight_layout()

plt.show()

# Step 5: Regional Analysis (Heatmap)

pivot_data = df.pivot_table(
    values='revenue',
    index='product',
    columns='region',
    aggfunc='sum'
)

plt.figure(figsize=(10, 6))

sns.heatmap(pivot_data, annot=True, fmt='.0f', cmap='Purples', cbar_kws={'label': 'Revenue ($)'})

plt.title('Revenue Heatmap: Product vs Region', fontsize=16, fontweight='bold')

plt.tight_layout()

plt.show()

# Step 6: Interactive Dashboard (Plotly)

fig = px.sunburst(
    df,
    path=['region', 'product'],
    values='revenue',
    title='Revenue Distribution: Region > Product',
    color='revenue',
    color_continuous_scale='Purples'
)

fig.update_layout(height=600)

fig.show()

# Step 7: Key Insights Summary

print("\n=== KEY INSIGHTS ===")

best_product = product_stats.index[0]

best_region = df.groupby('region')['revenue'].sum().idxmax()

best_day = df.groupby('day_of_week')['revenue'].sum().idxmax()

print(f"1. Top Product: {best_product} (${product_stats.loc[best_product, 'revenue']:,.2f})")

print(f"2. Best Region: {best_region}")

print(f"3. Best Day: {best_day}")

print(f"4. Growth Trend: {(monthly_revenue.iloc[-1] / monthly_revenue.iloc[0] - 1) * 100:.1f}% from Jan to Dec")

🎓 What This Project Demonstrates:

  • Data preparation and aggregation with Pandas
  • Time series visualization with Matplotlib
  • Statistical plots with Seaborn
  • Interactive visualizations with Plotly
  • Heatmaps for multi-dimensional data
  • Extracting and presenting key insights
  • Saving high-quality charts for reports
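Daily totals like the ones in Step 3 tend to look noisy, so a possible refinement is to overlay a rolling average. The sketch below is self-contained: it builds a hypothetical stand-in for the project's daily_revenue Series from random numbers and smooths it with a 7-day window. The window length is an assumption you would tune for your own data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the project's daily_revenue Series
rng = np.random.default_rng(42)
dates = pd.date_range("2024-01-01", periods=365)
daily_revenue = pd.Series(rng.uniform(5000, 25000, size=len(dates)), index=dates)

# 7-day rolling mean; the first 6 days are NaN because the window is incomplete
rolling_7d = daily_revenue.rolling(window=7).mean()

fig, ax = plt.subplots(figsize=(14, 6))
ax.plot(daily_revenue.index, daily_revenue.values,
        color="#8b5cf6", alpha=0.3, label="Daily")
ax.plot(rolling_7d.index, rolling_7d.values,
        color="#8b5cf6", linewidth=2, label="7-day average")
ax.set_title("Daily Revenue with 7-Day Rolling Average")
ax.set_xlabel("Date")
ax.set_ylabel("Revenue ($)")
ax.legend()
fig.tight_layout()
# Call plt.show() to display the figure
```

Keeping the raw series visible but faded lets readers see both the day-to-day variation and the smoothed trend.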

🎯 What's Next?

You can now create stunning visualizations that tell compelling data stories! In the next module, we'll dive into Machine Learning Fundamentals - building models that learn from data and make predictions.