Master Python fundamentals and essential libraries for data analysis and manipulation
Imagine you have a spreadsheet with millions of rows of customer data, sales figures, or scientific measurements. Excel tops out at about a million rows per sheet, but Python handles data that size with ease! Python is the most popular language for data science because it's powerful, easy to learn, and has incredible libraries for data work.
Python for Data Science is using Python programming to collect, clean, analyze, and visualize data to extract insights and make data-driven decisions. It's like having a super-powered calculator that can handle millions of numbers at once!
Example: Calculate average of 1 million numbers
import numpy as np
data = np.random.rand(1000000)
average = data.mean() # Done in milliseconds!
Easy to Learn: Readable syntax that looks like English.
Powerful Libraries: NumPy, Pandas, Matplotlib - tools for everything.
Industry Standard: Used by Google, Netflix, NASA, and more.
Huge Community: Millions of data scientists sharing knowledge.
Before diving into data analysis, let's master the basics! Think of these as the building blocks you'll use to construct powerful data pipelines.
A variable is like a labeled container that stores information. Python automatically figures out what type of data you're storing!
# Numbers
age = 25 # Integer
price = 19.99 # Float (decimal)
revenue = 1_000_000 # Underscores for readability
# Text (strings)
name = "Alice"
city = 'New York' # Single or double quotes work
message = f"Hello, {name}!" # f-string for formatting
# Boolean (True/False)
is_active = True
has_discount = False
# Lists (ordered collections)
scores = [95, 87, 92, 88, 90]
names = ["Alice", "Bob", "Charlie"]
print(scores[0]) # Access first item: 95
# Dictionaries (key-value pairs)
person = {
    "name": "Alice",
    "age": 25,
    "city": "NYC"
}
print(person["name"]) # Output: Alice
# If-else statements (make decisions)
score = 85
if score >= 90:
    print("Grade: A")
elif score >= 80:
    print("Grade: B")
else:
    print("Grade: C")
# For loops (repeat actions)
numbers = [1, 2, 3, 4, 5]
for num in numbers:
    print(num * 2) # Prints: 2, 4, 6, 8, 10
# List comprehension (elegant way to create lists)
squares = [x**2 for x in range(5)]
print(squares) # [0, 1, 4, 9, 16]
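Comprehensions can also filter while they build a list - a pattern you'll use constantly when cleaning data (the scores below are made up for illustration):

```python
# Comprehension with a condition: keep only passing scores
scores = [95, 42, 87, 61, 92]
passing = [s for s in scores if s >= 70]
print(passing)  # [95, 87, 92]

# Transform and filter at once: double only the failing scores
doubled_fails = [s * 2 for s in scores if s < 70]
print(doubled_fails)  # [84, 122]
```

Read it left to right: "give me `s * 2` for each `s` in `scores`, but only if `s < 70`."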
Functions are reusable blocks of code. Write once, use many times! They help keep your code organized and avoid repetition.
# Define a function
def calculate_average(numbers):
    """Calculate the average of a list of numbers."""
    total = sum(numbers)
    count = len(numbers)
    return total / count
# Use the function
scores = [95, 87, 92, 88, 90]
avg = calculate_average(scores)
print(f"Average: {avg}") # Average: 90.4
# Function with default parameters
def greet(name, greeting="Hello"):
    return f"{greeting}, {name}!"
print(greet("Alice")) # Hello, Alice!
print(greet("Bob", "Hi")) # Hi, Bob!
Think of functions like recipes. Once you write down a recipe for chocolate chip cookies, you don't need to rewrite it every time you bake. Just follow the recipe (call the function) with your ingredients (parameters) and get cookies (return value)!
NumPy (Numerical Python) is the foundation of data science in Python. It provides powerful tools for working with arrays of numbers - think of it as a supercharged calculator that can handle millions of numbers at lightning speed!
An array is like a grid of numbers. A 1D array is a list, a 2D array is a table, and a 3D array is like a cube. For numerical operations, NumPy arrays are often 10-100x faster than Python lists!
# Import NumPy (standard alias is np)
import numpy as np
# Create arrays
arr1d = np.array([1, 2, 3, 4, 5]) # 1D array
arr2d = np.array([[1, 2, 3], [4, 5, 6]]) # 2D array
print(arr1d.shape) # (5,) - 5 elements
print(arr2d.shape) # (2, 3) - 2 rows, 3 columns
# Create special arrays
zeros = np.zeros(5) # [0. 0. 0. 0. 0.]
ones = np.ones((3, 3)) # 3x3 matrix of 1s
range_arr = np.arange(0, 10, 2) # [0 2 4 6 8]
linspace = np.linspace(0, 1, 5) # 5 evenly spaced numbers
random_arr = np.random.rand(3, 3) # 3x3 random numbers
# Element-wise operations (vectorization)
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])
print(a + b) # [11 22 33 44]
print(a * b) # [10 40 90 160]
print(a ** 2) # [1 4 9 16] - square each element
print(a > 2) # [False False True True] - boolean array
# Statistical operations
data = np.array([95, 87, 92, 88, 90])
print(data.mean()) # 90.4 - average
print(data.std()) # 2.87 - standard deviation
print(data.min()) # 87 - minimum value
print(data.max()) # 95 - maximum value
print(data.sum()) # 452 - sum of all elements
# Indexing and slicing
arr = np.array([10, 20, 30, 40, 50])
print(arr[0]) # 10 - first element
print(arr[-1]) # 50 - last element
print(arr[1:4]) # [20 30 40] - slice
print(arr[arr > 25]) # [30 40 50] - boolean indexing
# Reshaping arrays
arr = np.arange(12) # [0 1 2 ... 11]
reshaped = arr.reshape(3, 4) # 3 rows, 4 columns
flattened = reshaped.flatten() # Back to 1D
transposed = reshaped.T # Swap rows and columns
NumPy operations are implemented in C and use vectorization - applying an operation to an entire array at once instead of looping through elements in Python. That's why NumPy is often 10-100x faster than regular Python lists for numerical work!
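You can see the speedup on your own machine with a rough timing comparison (results vary by hardware, so treat the ratio as illustrative):

```python
import time
import numpy as np

n = 1_000_000
py_list = list(range(n))
np_arr = np.arange(n)

# Pure Python: loop over every element one at a time
start = time.perf_counter()
py_squares = [x ** 2 for x in py_list]
py_time = time.perf_counter() - start

# NumPy: one vectorized operation over the whole array
start = time.perf_counter()
np_squares = np_arr ** 2
np_time = time.perf_counter() - start

print(f"Python list: {py_time:.4f}s | NumPy: {np_time:.4f}s")
print(f"NumPy was roughly {py_time / np_time:.0f}x faster")
```

`time.perf_counter()` is the standard-library way to measure short durations; the exact ratio you see will differ from run to run.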
Pandas is like Excel on steroids! It lets you work with tables of data (called DataFrames), clean messy data, filter, group, and analyze with ease. If NumPy is for arrays, Pandas is for structured data with rows and columns.
A DataFrame is a 2D table with labeled rows and columns. Each column can have different data types (numbers, text, dates). It's the most important data structure in data science!
# Import Pandas (standard alias is pd)
import pandas as pd
# Create DataFrame from dictionary
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NYC', 'LA', 'Chicago', 'NYC'],
    'salary': [70000, 85000, 90000, 75000]
}
df = pd.DataFrame(data)
print(df)
# Read from CSV (most common format)
df = pd.read_csv('data.csv')
df = pd.read_csv('data.csv', sep=';') # Custom separator
# Read from Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# Read from JSON
df = pd.read_json('data.json')
# Write to files
df.to_csv('output.csv', index=False) # Don't save index
df.to_excel('output.xlsx', index=False)
df.to_json('output.json')
# Quick data exploration
df.head() # First 5 rows
df.tail(10) # Last 10 rows
df.info() # Data types and missing values
df.describe() # Statistical summary
df.shape # (rows, columns)
df.columns # Column names
# Select columns
names = df['name'] # Single column (Series)
subset = df[['name', 'age']] # Multiple columns (DataFrame)
# Filter rows by condition
adults = df[df['age'] >= 30] # Age 30 or older
nyc_people = df[df['city'] == 'NYC'] # People in NYC
high_earners = df[df['salary'] > 80000] # Salary above 80k
# Multiple conditions (use & for AND, | for OR)
result = df[(df['age'] > 25) & (df['city'] == 'NYC')]
# Select by position (iloc) or label (loc)
df.iloc[0] # First row
df.iloc[0:3] # First 3 rows
df.loc[0, 'name'] # Specific cell
# Add new columns
df['salary_k'] = df['salary'] / 1000 # Salary in thousands
df['is_senior'] = df['age'] >= 30 # Boolean column
# Apply functions to columns
df['name_upper'] = df['name'].str.upper() # Uppercase names
df['age_group'] = df['age'].apply(lambda x: 'Young' if x < 30 else 'Senior')
# Sort data
df.sort_values('age') # Sort by age (ascending)
df.sort_values('salary', ascending=False) # Descending
df.sort_values(['city', 'age']) # Sort by multiple columns
# Group and aggregate
df.groupby('city')['salary'].mean() # Average salary by city
df.groupby('city').agg({'salary': ['mean', 'max', 'min']})
# Remove duplicates
df.drop_duplicates() # Remove duplicate rows
df.drop_duplicates(subset=['name']) # Based on specific column
Excel is great for small datasets and manual work. Pandas shines with large datasets (millions of rows), automation, and reproducibility. What takes hours of clicking in Excel can be done in seconds with Pandas code!
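The reproducibility point is easy to demonstrate: wrap an analysis in a function, and rerunning it on next month's data is a single call (the DataFrame below is made up for illustration; a real workflow would load a CSV):

```python
import pandas as pd

def monthly_report(df):
    """Average salary and headcount per city - rerun anytime the data changes."""
    return df.groupby('city')['salary'].agg(['mean', 'count'])

df = pd.DataFrame({
    'city': ['NYC', 'LA', 'NYC'],
    'salary': [70000, 85000, 75000]
})
print(monthly_report(df))  # The same three lines work on 3 rows or 3 million
```

In Excel you would redo the pivot table by hand each month; here the recipe is written down once.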
Real-world data is messy! Missing values, duplicates, wrong formats, outliers - it's often said that data scientists spend up to 80% of their time cleaning data. Let's learn how to handle these common issues.
Missing data appears as NaN (Not a Number) or None. You can either remove it or fill it with reasonable values.
# Check for missing values
df.isnull().sum() # Count missing values per column
df.isnull().any() # Which columns have missing values?
# Remove missing values
df.dropna() # Remove rows with any missing values
df.dropna(subset=['age']) # Remove rows where age is missing
df.dropna(axis=1) # Remove columns with missing values
# Fill missing values (each call returns a new object - assign the result to keep it)
df.fillna(0) # Replace all NaN with 0
df['age'].fillna(df['age'].mean()) # Fill with average
df['age'].fillna(df['age'].median()) # Fill with median
df.ffill() # Forward fill (use previous value)
df.bfill() # Backward fill (use next value)
# Convert strings to datetime
df['date'] = pd.to_datetime(df['date'])
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
# Extract date components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['day_of_week'] = df['date'].dt.day_name() # Monday, Tuesday, etc.
# Date arithmetic
df['days_since'] = (pd.Timestamp.now() - df['date']).dt.days
df['next_week'] = df['date'] + pd.Timedelta(days=7)
Outliers are extreme values that don't fit the pattern. They can be errors or genuine unusual cases. Common methods: remove values beyond 3 standard deviations or use the IQR (Interquartile Range) method.
# Method 1: Standard deviation
mean = df['salary'].mean()
std = df['salary'].std()
df_clean = df[(df['salary'] > mean - 3*std) & (df['salary'] < mean + 3*std)]
# Method 2: IQR (Interquartile Range)
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
df_clean = df[(df['salary'] >= Q1 - 1.5*IQR) & (df['salary'] <= Q3 + 1.5*IQR)]
Don't automatically remove all outliers! Sometimes they're the most interesting data points. A billionaire in a salary dataset is an outlier but might be important for your analysis. Always investigate before removing!
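A safer pattern is to flag outliers rather than drop them, so you can inspect first (a sketch reusing the IQR rule from above, with invented salary data):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol', 'Dan', 'Eve'],
    'salary': [70000, 85000, 90000, 75000, 5_000_000]
})

Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1

# Flag instead of drop: every row survives for inspection
df['is_outlier'] = (df['salary'] < Q1 - 1.5 * IQR) | (df['salary'] > Q3 + 1.5 * IQR)
print(df[df['is_outlier']])  # Review these rows before deciding to remove them
```

Here Eve's salary gets flagged - whether to keep her depends on the question you're answering, not on the statistics alone.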
Let's put everything together! Here's a complete example analyzing sales data - loading, cleaning, analyzing, and extracting insights.
# Import libraries
import pandas as pd
import numpy as np
# Create sample sales data
data = {
    'date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05'],
    'product': ['Laptop', 'Phone', 'Laptop', 'Tablet', 'Phone'],
    'quantity': [2, 5, 1, 3, 4],
    'price': [1200, 800, 1200, 500, 800],
    'region': ['East', 'West', 'East', 'West', 'East']
}
df = pd.DataFrame(data)
# Step 1: Data Cleaning
df['date'] = pd.to_datetime(df['date']) # Convert to datetime
df['total_sales'] = df['quantity'] * df['price'] # Calculate total
# Step 2: Exploratory Analysis
print("Dataset Overview:")
print(df.head())
print(f"\nTotal Revenue: ${df['total_sales'].sum():,.2f}")
print(f"Average Order Value: ${df['total_sales'].mean():,.2f}")
# Step 3: Group Analysis
product_sales = df.groupby('product')['total_sales'].sum().sort_values(ascending=False)
print("\nSales by Product:")
print(product_sales)
region_sales = df.groupby('region').agg({
    'total_sales': ['sum', 'mean', 'count']
})
print("\nSales by Region:")
print(region_sales)
# Step 4: Find Insights
best_product = product_sales.idxmax()
best_region = df.groupby('region')['total_sales'].sum().idxmax()
print("\nKey Insights:")
print(f"• Best selling product: {best_product}")
print(f"• Top performing region: {best_region}")
print(f"• Total units sold: {df['quantity'].sum()}")
# Step 5: Export results
df.to_csv('sales_analysis.csv', index=False)
product_sales.to_excel('product_summary.xlsx')
You now have the Python foundation for data science! In the next module, we'll dive into Statistics & Probability - the mathematical foundation for understanding and analyzing data.