Master Python fundamentals and essential libraries for data analysis and manipulation
Imagine you have a spreadsheet with millions of rows of customer data, sales figures, or scientific measurements. Excel tops out at about a million rows per sheet, but Python handles data that size with ease! Python is the most popular language for data science because it's powerful, easy to learn, and has incredible libraries for data work.
Python for Data Science is using Python programming to collect, clean, analyze, and visualize data to extract insights and make data-driven decisions. It's like having a super-powered calculator that can handle millions of numbers at once!
Example: Calculate average of 1 million numbers
import numpy as np
data = np.random.rand(1000000)
average = data.mean() # Done in milliseconds!
Easy to Learn: Readable syntax that looks like English.
Powerful Libraries: NumPy, Pandas, Matplotlib - tools for everything.
Industry Standard: Used by Google, Netflix, NASA, and more.
Huge Community: Millions of data scientists sharing knowledge.
Before diving into data analysis, let's master the basics! Think of these as the building blocks you'll use to construct powerful data pipelines.
A variable is like a labeled container that stores information. Python automatically figures out what type of data you're storing!
# Numbers
age = 25 # Integer
price = 19.99 # Float (decimal)
revenue = 1_000_000 # Underscores for readability
# Text (strings)
name = "Alice"
city = 'New York' # Single or double quotes work
message = f"Hello, {name}!" # f-string for formatting
# Boolean (True/False)
is_active = True
has_discount = False
# Lists (ordered collections)
scores = [95, 87, 92, 88, 90]
names = ["Alice", "Bob", "Charlie"]
print(scores[0]) # Access first item: 95
# Dictionaries (key-value pairs)
person = {
    "name": "Alice",
    "age": 25,
    "city": "NYC"
}
print(person["name"]) # Output: Alice
# If-else statements (make decisions)
score = 85
if score >= 90:
    print("Grade: A")
elif score >= 80:
    print("Grade: B")
else:
    print("Grade: C")
# For loops (repeat actions)
numbers = [1, 2, 3, 4, 5]
for num in numbers:
    print(num * 2) # Prints: 2, 4, 6, 8, 10
# List comprehension (elegant way to create lists)
squares = [x**2 for x in range(5)]
print(squares) # [0, 1, 4, 9, 16]
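Comprehensions can also filter while they build a list - a pattern you'll use constantly when cleaning data (the scores below are made up for illustration):

```python
# Comprehension with a condition: keep only passing scores
scores = [95, 42, 87, 61, 92]
passing = [s for s in scores if s >= 70]
print(passing)  # [95, 87, 92]

# Transform and filter at once: double only the failing scores
doubled_fails = [s * 2 for s in scores if s < 70]
print(doubled_fails)  # [84, 122]
```

Read it left to right: "give me `s * 2` for each `s` in `scores`, but only if `s < 70`."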
Functions are reusable blocks of code. Write once, use many times! They help keep your code organized and avoid repetition.
# Define a function
def calculate_average(numbers):
    """Calculate the average of a list of numbers."""
    total = sum(numbers)
    count = len(numbers)
    return total / count
# Use the function
scores = [95, 87, 92, 88, 90]
avg = calculate_average(scores)
print(f"Average: {avg}") # Average: 90.4
# Function with default parameters
def greet(name, greeting="Hello"):
    return f"{greeting}, {name}!"
print(greet("Alice")) # Hello, Alice!
print(greet("Bob", "Hi")) # Hi, Bob!
Think of functions like recipes. Once you write down a recipe for chocolate chip cookies, you don't need to rewrite it every time you bake. Just follow the recipe (call the function) with your ingredients (parameters) and get cookies (return value)!
NumPy (Numerical Python) is the foundation of data science in Python. It provides powerful tools for working with arrays of numbers - think of it as a supercharged calculator that can handle millions of numbers at lightning speed!
An array is like a grid of numbers. A 1D array is a list, a 2D array is a table, and a 3D array is like a cube. For numerical operations, NumPy arrays are often 10-100x faster than Python lists!
# Import NumPy (standard alias is np)
import numpy as np
# Create arrays
arr1d = np.array([1, 2, 3, 4, 5]) # 1D array
arr2d = np.array([[1, 2, 3], [4, 5, 6]]) # 2D array
print(arr1d.shape) # (5,) - 5 elements
print(arr2d.shape) # (2, 3) - 2 rows, 3 columns
# Create special arrays
zeros = np.zeros(5) # [0. 0. 0. 0. 0.]
ones = np.ones((3, 3)) # 3x3 matrix of 1s
range_arr = np.arange(0, 10, 2) # [0 2 4 6 8]
linspace = np.linspace(0, 1, 5) # 5 evenly spaced numbers
random_arr = np.random.rand(3, 3) # 3x3 random numbers
# Element-wise operations (vectorization)
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])
print(a + b) # [11 22 33 44]
print(a * b) # [10 40 90 160]
print(a ** 2) # [1 4 9 16] - square each element
print(a > 2) # [False False True True] - boolean array
# Statistical operations
data = np.array([95, 87, 92, 88, 90])
print(data.mean()) # 90.4 - average
print(data.std()) # 2.87 - standard deviation
print(data.min()) # 87 - minimum value
print(data.max()) # 95 - maximum value
print(data.sum()) # 452 - sum of all elements
# Indexing and slicing
arr = np.array([10, 20, 30, 40, 50])
print(arr[0]) # 10 - first element
print(arr[-1]) # 50 - last element
print(arr[1:4]) # [20 30 40] - slice
print(arr[arr > 25]) # [30 40 50] - boolean indexing
# Reshaping arrays
arr = np.arange(12) # [0 1 2 ... 11]
reshaped = arr.reshape(3, 4) # 3 rows, 4 columns
flattened = reshaped.flatten() # Back to 1D
transposed = reshaped.T # Swap rows and columns
NumPy operations are implemented in C and use vectorization - applying an operation to an entire array at once instead of looping through elements in Python. That's why NumPy is often 10-100x faster than regular Python lists for numerical work!
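You can see the speedup on your own machine with a rough timing comparison (results vary by hardware, so treat the ratio as illustrative):

```python
import time
import numpy as np

n = 1_000_000
py_list = list(range(n))
np_arr = np.arange(n)

# Pure Python: loop over every element one at a time
start = time.perf_counter()
py_squares = [x ** 2 for x in py_list]
py_time = time.perf_counter() - start

# NumPy: one vectorized operation over the whole array
start = time.perf_counter()
np_squares = np_arr ** 2
np_time = time.perf_counter() - start

print(f"Python list: {py_time:.4f}s | NumPy: {np_time:.4f}s")
print(f"NumPy was roughly {py_time / np_time:.0f}x faster")
```

`time.perf_counter()` is the standard-library way to measure short durations; the exact ratio you see will differ from run to run.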
Pandas is like Excel on steroids! It lets you work with tables of data (called DataFrames), clean messy data, filter, group, and analyze with ease. If NumPy is for arrays, Pandas is for structured data with rows and columns.
A DataFrame is a 2D table with labeled rows and columns. Each column can have different data types (numbers, text, dates). It's the most important data structure in data science!
# Import Pandas (standard alias is pd)
import pandas as pd
# Create DataFrame from dictionary
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NYC', 'LA', 'Chicago', 'NYC'],
    'salary': [70000, 85000, 90000, 75000]
}
df = pd.DataFrame(data)
print(df)
# Read from CSV (most common format)
df = pd.read_csv('data.csv')
df = pd.read_csv('data.csv', sep=';') # Custom separator
# Read from Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# Read from JSON
df = pd.read_json('data.json')
# Write to files
df.to_csv('output.csv', index=False) # Don't save index
df.to_excel('output.xlsx', index=False)
df.to_json('output.json')
# Quick data exploration
df.head() # First 5 rows
df.tail(10) # Last 10 rows
df.info() # Data types and missing values
df.describe() # Statistical summary
df.shape # (rows, columns)
df.columns # Column names
# Select columns
names = df['name'] # Single column (Series)
subset = df[['name', 'age']] # Multiple columns (DataFrame)
# Filter rows by condition
adults = df[df['age'] >= 30] # Age 30 or older
nyc_people = df[df['city'] == 'NYC'] # People in NYC
high_earners = df[df['salary'] > 80000] # Salary above 80k
# Multiple conditions (use & for AND, | for OR)
result = df[(df['age'] > 25) & (df['city'] == 'NYC')]
# Select by position (iloc) or label (loc)
df.iloc[0] # First row
df.iloc[0:3] # First 3 rows
df.loc[0, 'name'] # Specific cell
# Add new columns
df['salary_k'] = df['salary'] / 1000 # Salary in thousands
df['is_senior'] = df['age'] >= 30 # Boolean column
# Apply functions to columns
df['name_upper'] = df['name'].str.upper() # Uppercase names
df['age_group'] = df['age'].apply(lambda x: 'Young' if x < 30 else 'Senior')
# Sort data
df.sort_values('age') # Sort by age (ascending)
df.sort_values('salary', ascending=False) # Descending
df.sort_values(['city', 'age']) # Sort by multiple columns
# Group and aggregate
df.groupby('city')['salary'].mean() # Average salary by city
df.groupby('city').agg({'salary': ['mean', 'max', 'min']})
# Remove duplicates
df.drop_duplicates() # Remove duplicate rows
df.drop_duplicates(subset=['name']) # Based on specific column
Excel is great for small datasets and manual work. Pandas shines with large datasets (millions of rows), automation, and reproducibility. What takes hours of clicking in Excel can be done in seconds with Pandas code!
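The reproducibility point is easy to demonstrate: wrap an analysis in a function, and rerunning it on next month's data is a single call (the DataFrame below is made up for illustration; a real workflow would load a CSV):

```python
import pandas as pd

def monthly_report(df):
    """Average salary and headcount per city - rerun anytime the data changes."""
    return df.groupby('city')['salary'].agg(['mean', 'count'])

df = pd.DataFrame({
    'city': ['NYC', 'LA', 'NYC'],
    'salary': [70000, 85000, 75000]
})
print(monthly_report(df))  # The same three lines work on 3 rows or 3 million
```

In Excel you would redo the pivot table by hand each month; here the recipe is written down once.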
Real-world data is messy! Missing values, duplicates, wrong formats, outliers - it's often said that data scientists spend up to 80% of their time cleaning data. Let's learn how to handle these common issues.
Missing data appears as NaN (Not a Number) or None. You can either remove it or fill it with reasonable values.
# Check for missing values
df.isnull().sum() # Count missing values per column
df.isnull().any() # Which columns have missing values?
# Remove missing values
df.dropna() # Remove rows with any missing values
df.dropna(subset=['age']) # Remove rows where age is missing
df.dropna(axis=1) # Remove columns with missing values
# Fill missing values (each call returns a new object - assign the result to keep it)
df.fillna(0) # Replace all NaN with 0
df['age'].fillna(df['age'].mean()) # Fill with average
df['age'].fillna(df['age'].median()) # Fill with median
df.ffill() # Forward fill (use previous value)
df.bfill() # Backward fill (use next value)
# Convert strings to datetime
df['date'] = pd.to_datetime(df['date'])
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
# Extract date components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['day_of_week'] = df['date'].dt.day_name() # Monday, Tuesday, etc.
# Date arithmetic
df['days_since'] = (pd.Timestamp.now() - df['date']).dt.days
df['next_week'] = df['date'] + pd.Timedelta(days=7)
Outliers are extreme values that don't fit the pattern. They can be errors or genuine unusual cases. Common methods: remove values beyond 3 standard deviations or use the IQR (Interquartile Range) method.
# Method 1: Standard deviation
mean = df['salary'].mean()
std = df['salary'].std()
df_clean = df[(df['salary'] > mean - 3*std) & (df['salary'] < mean + 3*std)]
# Method 2: IQR (Interquartile Range)
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
df_clean = df[(df['salary'] >= Q1 - 1.5*IQR) & (df['salary'] <= Q3 + 1.5*IQR)]
Don't automatically remove all outliers! Sometimes they're the most interesting data points. A billionaire in a salary dataset is an outlier but might be important for your analysis. Always investigate before removing!
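A safer pattern is to flag outliers rather than drop them, so you can inspect first (a sketch reusing the IQR rule from above, with invented salary data):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol', 'Dan', 'Eve'],
    'salary': [70000, 85000, 90000, 75000, 5_000_000]
})

Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1

# Flag instead of drop: every row survives for inspection
df['is_outlier'] = (df['salary'] < Q1 - 1.5 * IQR) | (df['salary'] > Q3 + 1.5 * IQR)
print(df[df['is_outlier']])  # Review these rows before deciding to remove them
```

Here Eve's salary gets flagged - whether to keep her depends on the question you're answering, not on the statistics alone.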
Let's put everything together! Here's a complete example analyzing sales data - loading, cleaning, analyzing, and extracting insights.
# Import libraries
import pandas as pd
import numpy as np
# Create sample sales data
data = {
    'date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05'],
    'product': ['Laptop', 'Phone', 'Laptop', 'Tablet', 'Phone'],
    'quantity': [2, 5, 1, 3, 4],
    'price': [1200, 800, 1200, 500, 800],
    'region': ['East', 'West', 'East', 'West', 'East']
}
df = pd.DataFrame(data)
# Step 1: Data Cleaning
df['date'] = pd.to_datetime(df['date']) # Convert to datetime
df['total_sales'] = df['quantity'] * df['price'] # Calculate total
# Step 2: Exploratory Analysis
print("Dataset Overview:")
print(df.head())
print(f"\nTotal Revenue: ${df['total_sales'].sum():,.2f}")
print(f"Average Order Value: ${df['total_sales'].mean():,.2f}")
# Step 3: Group Analysis
product_sales = df.groupby('product')['total_sales'].sum().sort_values(ascending=False)
print("\nSales by Product:")
print(product_sales)
region_sales = df.groupby('region').agg({
    'total_sales': ['sum', 'mean', 'count']
})
print("\nSales by Region:")
print(region_sales)
# Step 4: Find Insights
best_product = product_sales.idxmax()
best_region = df.groupby('region')['total_sales'].sum().idxmax()
print("\nKey Insights:")
print(f"• Best selling product: {best_product}")
print(f"• Top performing region: {best_region}")
print(f"• Total units sold: {df['quantity'].sum()}")
# Step 5: Export results
df.to_csv('sales_analysis.csv', index=False)
product_sales.to_excel('product_summary.xlsx')
You now have the Python foundation for data science! In the next module, we'll dive into Statistics & Probability - the mathematical foundation for understanding and analyzing data.