Build your first ML models: regression, classification, and ensemble methods
Imagine teaching a child to recognize dogs. You don't give them rules like "if it has 4 legs and fur, it's a dog." Instead, you show them many pictures of dogs and not-dogs, and they learn the pattern. That's Machine Learning - teaching computers to learn from examples instead of following explicit rules!
Machine Learning is a way to teach computers to make predictions or decisions by learning from data, without being explicitly programmed with rules. The computer finds patterns in examples and uses them to make predictions on new data.
Traditional Programming:
Rules + Data → Answers
Machine Learning:
Data + Answers → Rules (Model)
The model learns the rules from examples!
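The contrast above can be sketched in a few lines of Python. This is a toy illustration (the threshold rule and the data are made up, not a real spam filter): in the traditional version we write the rule ourselves; in the "learned" version, a simple rule is derived from labeled examples.

```python
# Traditional programming: we hand-code the rule
def is_spam_rule(num_exclamations):
    return num_exclamations > 3  # threshold chosen by a human

# Machine learning: the "rule" (here, a threshold) is learned from examples
def learn_threshold(examples):
    # examples: list of (num_exclamations, is_spam) pairs
    spam_counts = [x for x, label in examples if label]
    ham_counts = [x for x, label in examples if not label]
    # Pick the midpoint between the two classes as the learned boundary
    return (min(spam_counts) + max(ham_counts)) / 2

data = [(5, True), (8, True), (10, True), (0, False), (1, False), (2, False)]
threshold = learn_threshold(data)
print(threshold)  # 3.5 - the boundary was learned, not hard-coded
```

Real algorithms learn far richer rules than a single threshold, but the shape is the same: data plus answers in, a decision rule out.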
There are three main types of machine learning, each suited for different problems. Let's understand them with simple analogies.
Learning with a teacher. You have labeled data (inputs with correct answers), and the model learns to predict the answer for new inputs.
Analogy:
Like studying for an exam with answer keys. You see questions and their correct answers, learn the pattern, then answer new questions on the test.
Examples: spam detection (email → spam or not), house price prediction (features → price), image recognition (photo → label).
Two Types: Regression (predict a continuous number, like a price) and Classification (predict a category, like spam vs. not spam).
Learning without a teacher. You have data without labels, and the model finds hidden patterns or groups in the data.
Analogy:
Like organizing your closet. Nobody tells you how to group clothes, but you naturally group similar items together - shirts with shirts, pants with pants.
Examples: customer segmentation (grouping similar shoppers), anomaly detection (flagging unusual transactions).
Common Methods: clustering (e.g., k-means) and dimensionality reduction (e.g., PCA).
Learning by trial and error. An agent learns to make decisions by receiving rewards for good actions and penalties for bad ones.
Analogy:
Like training a dog. You give treats (rewards) when it does something right and say "no" (penalty) when it does something wrong. Over time, it learns which actions lead to treats.
Examples: game-playing agents (chess, Go), robotics, recommendation systems that adapt to user feedback.
Key Concepts: the agent (the learner), the environment (the world it acts in), actions, rewards, and the policy (the strategy it learns).
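The trial-and-error loop can be sketched with a tiny "dog training" simulation. This is a toy example with invented reward probabilities, not a real RL library: the agent tries two actions, usually picks whichever currently looks best, and keeps a running average of the rewards each one earns.

```python
import random

random.seed(0)

# Two actions; one usually earns a treat (reward 1). Hidden from the agent.
true_reward = {"sit": 0.9, "bark": 0.1}
estimates = {"sit": 0.0, "bark": 0.0}  # agent's learned value of each action
counts = {"sit": 0, "bark": 0}

for step in range(1000):
    # Explore 10% of the time; otherwise exploit the best-looking action
    if random.random() < 0.1:
        action = random.choice(["sit", "bark"])
    else:
        action = max(estimates, key=estimates.get)
    reward = 1 if random.random() < true_reward[action] else 0
    # Update the running average reward for the chosen action
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]

print(max(estimates, key=estimates.get))  # the agent learns "sit" pays off
```

The occasional random exploration is what keeps the agent from locking onto a bad early guess - the same explore/exploit trade-off appears in every reinforcement learning method.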
Linear Regression is the simplest ML algorithm. It finds the best straight line (or plane) that fits your data, then uses that line to make predictions. Perfect for predicting continuous numbers!
Imagine plotting house sizes (x-axis) vs prices (y-axis). Linear regression draws the best-fit line through these points. To predict a new house's price, you find its size on the line!
The Formula:
y = mx + b
y = prediction, m = slope, x = input, b = intercept
Example: Price = 200 × Size + 50000
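The formula can be checked by hand. Using the example coefficients above (slope 200, intercept 50000), a prediction is just plugging a size into the line equation:

```python
# Linear regression prediction is just the line equation y = m*x + b
m = 200       # slope: dollars per extra square foot (example value from above)
b = 50000     # intercept: base price
size = 1500   # input: house size in sq ft

price = m * size + b
print(price)  # 200 * 1500 + 50000 = 350000
```

Training a linear regression model is simply the process of finding the values of m and b that make this line fit the data best.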
# Import libraries
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
# Sample data: house sizes and prices
# (more than a handful of points, so the 20% test split has enough
#  samples for a meaningful R² score)
sizes = np.array([800, 1000, 1200, 1500, 1800, 2000,
                  2200, 2500, 2800, 3000]).reshape(-1, 1)
prices = np.array([180000, 200000, 250000, 300000, 370000, 400000,
                   440000, 500000, 560000, 600000])
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    sizes, prices, test_size=0.2, random_state=42
)
# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train) # Learn from training data
# Make predictions
predictions = model.predict(X_test)
print(f"Predicted prices: {predictions}")
# Evaluate the model
score = model.score(X_test, y_test)
print(f"R² Score: {score:.2f}") # 1.0 = perfect fit
# Predict price for a new house (2200 sq ft)
new_house = np.array([[2200]])
predicted_price = model.predict(new_house)
print(f"Predicted price: ${predicted_price[0]:,.0f}")
# Visualize
plt.scatter(sizes, prices, color='blue', label='Actual')
plt.plot(sizes, model.predict(sizes), color='red', label='Prediction Line')
plt.xlabel('Size (sq ft)')
plt.ylabel('Price ($)')
plt.legend()
plt.show()
Despite its name, Logistic Regression is for classification (predicting categories), not regression! It predicts the probability that something belongs to a category. Perfect for yes/no questions.
Instead of predicting a number, logistic regression predicts a probability between 0 and 1. If probability > 0.5, predict "Yes" (class 1). If ≤ 0.5, predict "No" (class 0).
Example: Will a customer buy?
Input: Age, Income, Previous Purchases
Output: Probability = 0.75 (75% chance)
→ Prediction: Yes, they will buy!
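Under the hood, logistic regression takes a weighted sum of the inputs and squashes it into a probability with the sigmoid function. A minimal sketch - the weights and bias here are invented for illustration, not learned from real data:

```python
import math

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1 / (1 + math.exp(-z))

# Hypothetical learned weights for [age, income, previous_purchases]
weights = [0.02, 0.00001, 0.5]
bias = -2.0
features = [35, 60000, 2]

# Weighted sum of inputs, then sigmoid -> probability
z = sum(w * x for w, x in zip(weights, features)) + bias
probability = sigmoid(z)
prediction = 1 if probability > 0.5 else 0  # threshold at 0.5

print(round(probability, 2))  # 0.57 -> predict "Yes" (class 1)
```

Training finds the weights and bias; prediction is just this weighted sum plus sigmoid.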
# Import libraries
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# Sample data: email features and labels
# Features: [num_exclamation, num_caps, num_links]
# (enough samples of each class so the test split contains both classes)
X = np.array([
    [5, 20, 3],   # Spam
    [0, 2, 1],    # Not spam
    [8, 35, 5],   # Spam
    [1, 5, 0],    # Not spam
    [10, 40, 7],  # Spam
    [0, 1, 0],    # Not spam
    [6, 25, 4],   # Spam
    [2, 4, 1],    # Not spam
    [9, 30, 6],   # Spam
    [1, 3, 0],    # Not spam
])
y = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])  # 1=spam, 0=not spam
# Split data (stratify keeps the spam/not-spam ratio in both sets)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Create and train model
model = LogisticRegression()
model.fit(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)
print(f"Predictions: {predictions}")
print(f"Probabilities: {probabilities}")
# Evaluate
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2f}")
print(classification_report(y_test, predictions))
# Predict for new email
new_email = np.array([[7, 30, 4]]) # Suspicious features
prediction = model.predict(new_email)
probability = model.predict_proba(new_email)[0][1]
print(f"Spam probability: {probability:.2%}")
A Decision Tree is like a flowchart of yes/no questions that leads to a decision. It's one of the most intuitive ML algorithms because you can literally see how it makes decisions!
Imagine deciding if you should play tennis today:
Is it sunny?
├─ Yes → Is humidity high?
│ ├─ Yes → Don't play
│ └─ No → Play!
└─ No → Is it windy?
├─ Yes → Don't play
└─ No → Play!
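The flowchart above maps directly onto nested if/else statements - which is exactly how a trained decision tree makes a prediction:

```python
def play_tennis(sunny, high_humidity, windy):
    # Each question in the flowchart becomes a branch in the tree
    if sunny:
        if high_humidity:
            return "Don't play"
        return "Play!"
    if windy:
        return "Don't play"
    return "Play!"

print(play_tennis(sunny=True, high_humidity=False, windy=False))  # Play!
print(play_tennis(sunny=False, high_humidity=False, windy=True))  # Don't play
```

The difference is that a learning algorithm chooses the questions (and their order) automatically, by finding the splits that best separate the training examples.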
# Import Decision Tree
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import matplotlib.pyplot as plt
# Create and train model
model = DecisionTreeClassifier(max_depth=3)
model.fit(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")
# Visualize the tree (feature names match the spam example above)
feature_names = ['num_exclamation', 'num_caps', 'num_links']
plt.figure(figsize=(15, 10))
tree.plot_tree(model, filled=True, feature_names=feature_names)
plt.show()
✅ Advantages: easy to interpret and visualize, no feature scaling required, handles non-linear patterns naturally.
❌ Disadvantages: prone to overfitting (a deep tree can memorize the training data), and small changes in the data can produce a very different tree.
If one decision tree is good, many trees are better! A Random Forest creates hundreds of decision trees, each slightly different, and combines their predictions. This is called "ensemble learning."
Imagine asking 100 people to guess the number of jellybeans in a jar. Most individual guesses will be wrong, but the average of all guesses is usually very close! Random Forests work the same way.
Tree 1 predicts: Class A (60% confident)
Tree 2 predicts: Class B (55% confident)
Tree 3 predicts: Class A (70% confident)
...
Tree 100 predicts: Class A (65% confident)
Final prediction: Class A (majority vote)
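The majority vote itself is simple to implement. A sketch with hypothetical per-tree predictions (the votes below are made up to mirror the example above):

```python
from collections import Counter

# Hypothetical predictions from individual trees in a forest
tree_predictions = ["A", "B", "A", "A", "B", "A", "A", "B", "A", "A"]

votes = Counter(tree_predictions)
final_prediction = votes.most_common(1)[0][0]  # class with the most votes
print(final_prediction)  # A (7 votes to 3)
```

For regression tasks, the forest averages the trees' numeric predictions instead of voting.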
# Import Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Create Random Forest with 100 trees
model = RandomForestClassifier(
n_estimators=100, # Number of trees
max_depth=10, # Max tree depth
random_state=42
)
# Train the forest
model.fit(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2%}")
# Feature importance (which features matter most?)
feature_names = ['num_exclamation', 'num_caps', 'num_links']
importances = model.feature_importances_
for name, importance in zip(feature_names, importances):
    print(f"{name}: {importance:.3f}")
How do you know if your model is good? You need metrics! Different metrics tell you different things about your model's performance. Let's understand the most important ones.
Accuracy
Percentage of correct predictions. Simple but can be misleading with imbalanced data.
Precision
Of all positive predictions, how many were actually positive? Important when false positives are costly.
Example: Of emails marked as spam, how many were actually spam?
Recall (Sensitivity)
Of all actual positives, how many did we find? Important when false negatives are costly.
Example: Of all actual spam emails, how many did we catch?
F1 Score
Harmonic mean of precision and recall. Good when you need balance between both.
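All four metrics come from the same four confusion-matrix counts, and they're easy to compute by hand. The counts below are invented to show the formulas - and to show why accuracy alone misleads on imbalanced data:

```python
# Imbalanced example: 90 real negatives, 10 real positives.
# A model that catches only 5 positives and raises 2 false alarms:
tp, fp, fn, tn = 5, 2, 5, 88

accuracy = (tp + tn) / (tp + fp + fn + tn)          # 93/100 = 0.93 - looks great!
precision = tp / (tp + fp)                          # 5/7  ≈ 0.71
recall = tp / (tp + fn)                             # 5/10 = 0.50 - half the positives missed
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean ≈ 0.59

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
```

Accuracy says 93%, but recall reveals the model misses half of what matters - which is why you should always look at more than one metric on imbalanced data.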
# Import metrics
from sklearn.metrics import (
accuracy_score, precision_score, recall_score,
f1_score, confusion_matrix, classification_report
)
# Calculate metrics
accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions)
recall = recall_score(y_test, predictions)
f1 = f1_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2%}")
print(f"Precision: {precision:.2%}")
print(f"Recall: {recall:.2%}")
print(f"F1 Score: {f1:.2%}")
# Confusion Matrix
cm = confusion_matrix(y_test, predictions)
print("\\nConfusion Matrix:")
print(cm)
# [[True Neg, False Pos]
# [False Neg, True Pos]]
# Detailed report
print("\\nClassification Report:")
print(classification_report(y_test, predictions))
Imagine studying for an exam using practice questions. If the actual exam has the exact same questions, you'll ace it - but did you really learn? This is why we split data into training and testing sets!
Training Set: Data the model learns from (usually 70-80%)
Test Set: Data we use to evaluate the model (usually 20-30%)
The model never sees the test set during training!
Total Data: 1000 samples
↓
Training: 800 samples (80%) → Train model
Testing: 200 samples (20%) → Evaluate model
# Split data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2, # 20% for testing
random_state=42, # For reproducibility
stratify=y # Keep class proportions
)
Scikit-learn is Python's most popular ML library. It has a consistent, easy-to-use interface for dozens of algorithms. Once you learn the pattern, you can use any algorithm!
Every scikit-learn model follows the same 3-step pattern:
# 1. Create the model
model = SomeAlgorithm(parameters)
# 2. Train the model
model.fit(X_train, y_train)
# 3. Make predictions
predictions = model.predict(X_test)
# 1. Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# 2. Load data
iris = load_iris()
X, y = iris.data, iris.target
# 3. Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# 4. Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# 5. Train model
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train_scaled, y_train)
# 6. Make predictions
predictions = model.predict(X_test_scaled)
# 7. Evaluate
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2%}")
You now understand ML fundamentals and can build classification and regression models! In the next module, we'll dive into Deep Learning with TensorFlow - building neural networks that can learn complex patterns.