Build your first ML models: regression, classification, and ensemble methods
Imagine teaching a child to recognize dogs. You don't give them rules like "if it has 4 legs and fur, it's a dog." Instead, you show them many pictures of dogs and not-dogs, and they learn the pattern. That's Machine Learning - teaching computers to learn from examples instead of following explicit rules!
Machine Learning is a way to teach computers to make predictions or decisions by learning from data, without being explicitly programmed with rules. The computer finds patterns in examples and uses them to make predictions on new data.
Traditional Programming:
Rules + Data → Answers
Machine Learning:
Data + Answers → Rules (Model)
The model learns the rules from examples!
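The contrast above can be sketched in a few lines of Python. This is a toy illustration (the threshold rule and the data are made up, not a real spam filter): in the traditional version we write the rule ourselves; in the "learned" version, a simple rule is derived from labeled examples.

```python
# Traditional programming: we hand-code the rule
def is_spam_rule(num_exclamations):
    return num_exclamations > 3  # threshold chosen by a human

# Machine learning: the "rule" (here, a threshold) is learned from examples
def learn_threshold(examples):
    # examples: list of (num_exclamations, is_spam) pairs
    spam_counts = [x for x, label in examples if label]
    ham_counts = [x for x, label in examples if not label]
    # Pick the midpoint between the two classes as the learned boundary
    return (min(spam_counts) + max(ham_counts)) / 2

data = [(5, True), (8, True), (10, True), (0, False), (1, False), (2, False)]
threshold = learn_threshold(data)
print(threshold)  # 3.5 - the boundary was learned, not hard-coded
```

Real algorithms learn far richer rules than a single threshold, but the shape is the same: data plus answers in, a decision rule out.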
There are three main types of machine learning, each suited for different problems. Let's understand them with simple analogies.
Learning with a teacher. You have labeled data (inputs with correct answers), and the model learns to predict the answer for new inputs.
Analogy:
Like studying for an exam with answer keys. You see questions and their correct answers, learn the pattern, then answer new questions on the test.
Examples: spam detection (email → spam or not), house price prediction (features → price), image recognition (photo → label).
Two Types: Regression (predict a continuous number, like a price) and Classification (predict a category, like spam vs. not spam).
Learning without a teacher. You have data without labels, and the model finds hidden patterns or groups in the data.
Analogy:
Like organizing your closet. Nobody tells you how to group clothes, but you naturally group similar items together - shirts with shirts, pants with pants.
Examples: customer segmentation (grouping similar shoppers), anomaly detection (flagging unusual transactions).
Common Methods: clustering (e.g., k-means) and dimensionality reduction (e.g., PCA).
Learning by trial and error. An agent learns to make decisions by receiving rewards for good actions and penalties for bad ones.
Analogy:
Like training a dog. You give treats (rewards) when it does something right and say "no" (penalty) when it does something wrong. Over time, it learns which actions lead to treats.
Examples: game-playing agents (chess, Go), robotics, recommendation systems that adapt to user feedback.
Key Concepts: the agent (the learner), the environment (the world it acts in), actions, rewards, and the policy (the strategy it learns).
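The trial-and-error loop can be sketched with a tiny "dog training" simulation. This is a toy example with invented reward probabilities, not a real RL library: the agent tries two actions, usually picks whichever currently looks best, and keeps a running average of the rewards each one earns.

```python
import random

random.seed(0)

# Two actions; one usually earns a treat (reward 1). Hidden from the agent.
true_reward = {"sit": 0.9, "bark": 0.1}
estimates = {"sit": 0.0, "bark": 0.0}  # agent's learned value of each action
counts = {"sit": 0, "bark": 0}

for step in range(1000):
    # Explore 10% of the time; otherwise exploit the best-looking action
    if random.random() < 0.1:
        action = random.choice(["sit", "bark"])
    else:
        action = max(estimates, key=estimates.get)
    reward = 1 if random.random() < true_reward[action] else 0
    # Update the running average reward for the chosen action
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]

print(max(estimates, key=estimates.get))  # the agent learns "sit" pays off
```

The occasional random exploration is what keeps the agent from locking onto a bad early guess - the same explore/exploit trade-off appears in every reinforcement learning method.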
Linear Regression is the simplest ML algorithm. It finds the best straight line (or plane) that fits your data, then uses that line to make predictions. Perfect for predicting continuous numbers!
Imagine plotting house sizes (x-axis) vs prices (y-axis). Linear regression draws the best-fit line through these points. To predict a new house's price, you find its size on the line!
The Formula:
y = mx + b
y = prediction, m = slope, x = input, b = intercept
Example: Price = 200 × Size + 50000
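The formula can be checked by hand. Using the example coefficients above (slope 200, intercept 50000), a prediction is just plugging a size into the line equation:

```python
# Linear regression prediction is just the line equation y = m*x + b
m = 200       # slope: dollars per extra square foot (example value from above)
b = 50000     # intercept: base price
size = 1500   # input: house size in sq ft

price = m * size + b
print(price)  # 200 * 1500 + 50000 = 350000
```

Training a linear regression model is simply the process of finding the values of m and b that make this line fit the data best.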
# Import libraries
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
# Sample data: house sizes and prices
# (more than a handful of points, so the 20% test split has enough
#  samples for a meaningful R² score)
sizes = np.array([800, 1000, 1200, 1500, 1800, 2000,
                  2200, 2500, 2800, 3000]).reshape(-1, 1)
prices = np.array([180000, 200000, 250000, 300000, 370000, 400000,
                   440000, 500000, 560000, 600000])
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    sizes, prices, test_size=0.2, random_state=42
)
# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train) # Learn from training data
# Make predictions
predictions = model.predict(X_test)
print(f"Predicted prices: {predictions}")
# Evaluate the model
score = model.score(X_test, y_test)
print(f"R² Score: {score:.2f}") # 1.0 = perfect fit
# Predict price for a new house (2200 sq ft)
new_house = np.array([[2200]])
predicted_price = model.predict(new_house)
print(f"Predicted price: ${predicted_price[0]:,.0f}")
# Visualize
plt.scatter(sizes, prices, color='blue', label='Actual')
plt.plot(sizes, model.predict(sizes), color='red', label='Prediction Line')
plt.xlabel('Size (sq ft)')
plt.ylabel('Price ($)')
plt.legend()
plt.show()
Despite its name, Logistic Regression is for classification (predicting categories), not regression! It predicts the probability that something belongs to a category. Perfect for yes/no questions.
Instead of predicting a number, logistic regression predicts a probability between 0 and 1. If probability > 0.5, predict "Yes" (class 1). If ≤ 0.5, predict "No" (class 0).
Example: Will a customer buy?
Input: Age, Income, Previous Purchases
Output: Probability = 0.75 (75% chance)
→ Prediction: Yes, they will buy!
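Under the hood, logistic regression takes a weighted sum of the inputs and squashes it into a probability with the sigmoid function. A minimal sketch - the weights and bias here are invented for illustration, not learned from real data:

```python
import math

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1 / (1 + math.exp(-z))

# Hypothetical learned weights for [age, income, previous_purchases]
weights = [0.02, 0.00001, 0.5]
bias = -2.0
features = [35, 60000, 2]

# Weighted sum of inputs, then sigmoid -> probability
z = sum(w * x for w, x in zip(weights, features)) + bias
probability = sigmoid(z)
prediction = 1 if probability > 0.5 else 0  # threshold at 0.5

print(round(probability, 2))  # 0.57 -> predict "Yes" (class 1)
```

Training finds the weights and bias; prediction is just this weighted sum plus sigmoid.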
# Import libraries
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# Sample data: email features and labels
# Features: [num_exclamation, num_caps, num_links]
# (enough samples of each class so the test split contains both classes)
X = np.array([
    [5, 20, 3],   # Spam
    [0, 2, 1],    # Not spam
    [8, 35, 5],   # Spam
    [1, 5, 0],    # Not spam
    [10, 40, 7],  # Spam
    [0, 1, 0],    # Not spam
    [6, 25, 4],   # Spam
    [2, 4, 1],    # Not spam
    [9, 30, 6],   # Spam
    [1, 3, 0],    # Not spam
])
y = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])  # 1=spam, 0=not spam
# Split data (stratify keeps the spam/not-spam ratio in both sets)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Create and train model
model = LogisticRegression()
model.fit(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)
print(f"Predictions: {predictions}")
print(f"Probabilities: {probabilities}")
# Evaluate
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2f}")
print(classification_report(y_test, predictions))
# Predict for new email
new_email = np.array([[7, 30, 4]]) # Suspicious features
prediction = model.predict(new_email)
probability = model.predict_proba(new_email)[0][1]
print(f"Spam probability: {probability:.2%}")
A Decision Tree is like a flowchart of yes/no questions that leads to a decision. It's one of the most intuitive ML algorithms because you can literally see how it makes decisions!
Imagine deciding if you should play tennis today:
Is it sunny?
├─ Yes → Is humidity high?
│ ├─ Yes → Don't play
│ └─ No → Play!
└─ No → Is it windy?
├─ Yes → Don't play
└─ No → Play!
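The flowchart above maps directly onto nested if/else statements - which is exactly how a trained decision tree makes a prediction:

```python
def play_tennis(sunny, high_humidity, windy):
    # Each question in the flowchart becomes a branch in the tree
    if sunny:
        if high_humidity:
            return "Don't play"
        return "Play!"
    if windy:
        return "Don't play"
    return "Play!"

print(play_tennis(sunny=True, high_humidity=False, windy=False))  # Play!
print(play_tennis(sunny=False, high_humidity=False, windy=True))  # Don't play
```

The difference is that a learning algorithm chooses the questions (and their order) automatically, by finding the splits that best separate the training examples.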
# Import Decision Tree
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import matplotlib.pyplot as plt
# Create and train model
model = DecisionTreeClassifier(max_depth=3)
model.fit(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")
# Visualize the tree (feature names match the spam example above)
feature_names = ['num_exclamation', 'num_caps', 'num_links']
plt.figure(figsize=(15, 10))
tree.plot_tree(model, filled=True, feature_names=feature_names)
plt.show()
✅ Advantages: easy to interpret and visualize, no feature scaling required, handles non-linear patterns naturally.
❌ Disadvantages: prone to overfitting (a deep tree can memorize the training data), and small changes in the data can produce a very different tree.
If one decision tree is good, many trees are better! A Random Forest creates hundreds of decision trees, each slightly different, and combines their predictions. This is called "ensemble learning."
Imagine asking 100 people to guess the number of jellybeans in a jar. Most individual guesses will be wrong, but the average of all guesses is usually very close! Random Forests work the same way.
Tree 1 predicts: Class A (60% confident)
Tree 2 predicts: Class B (55% confident)
Tree 3 predicts: Class A (70% confident)
...
Tree 100 predicts: Class A (65% confident)
Final prediction: Class A (majority vote)
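The majority vote itself is simple to implement. A sketch with hypothetical per-tree predictions (the votes below are made up to mirror the example above):

```python
from collections import Counter

# Hypothetical predictions from individual trees in a forest
tree_predictions = ["A", "B", "A", "A", "B", "A", "A", "B", "A", "A"]

votes = Counter(tree_predictions)
final_prediction = votes.most_common(1)[0][0]  # class with the most votes
print(final_prediction)  # A (7 votes to 3)
```

For regression tasks, the forest averages the trees' numeric predictions instead of voting.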
# Import Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Create Random Forest with 100 trees
model = RandomForestClassifier(
n_estimators=100, # Number of trees
max_depth=10, # Max tree depth
random_state=42
)
# Train the forest
model.fit(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2%}")
# Feature importance (which features matter most?)
feature_names = ['num_exclamation', 'num_caps', 'num_links']
importances = model.feature_importances_
for name, importance in zip(feature_names, importances):
    print(f"{name}: {importance:.3f}")
How do you know if your model is good? You need metrics! Different metrics tell you different things about your model's performance. Let's understand the most important ones.
Accuracy
Percentage of correct predictions. Simple but can be misleading with imbalanced data.
Precision
Of all positive predictions, how many were actually positive? Important when false positives are costly.
Example: Of emails marked as spam, how many were actually spam?
Recall (Sensitivity)
Of all actual positives, how many did we find? Important when false negatives are costly.
Example: Of all actual spam emails, how many did we catch?
F1 Score
Harmonic mean of precision and recall. Good when you need balance between both.
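All four metrics come from the same four confusion-matrix counts, and they're easy to compute by hand. The counts below are invented to show the formulas - and to show why accuracy alone misleads on imbalanced data:

```python
# Imbalanced example: 90 real negatives, 10 real positives.
# A model that catches only 5 positives and raises 2 false alarms:
tp, fp, fn, tn = 5, 2, 5, 88

accuracy = (tp + tn) / (tp + fp + fn + tn)          # 93/100 = 0.93 - looks great!
precision = tp / (tp + fp)                          # 5/7  ≈ 0.71
recall = tp / (tp + fn)                             # 5/10 = 0.50 - half the positives missed
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean ≈ 0.59

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
```

Accuracy says 93%, but recall reveals the model misses half of what matters - which is why you should always look at more than one metric on imbalanced data.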
# Import metrics
from sklearn.metrics import (
accuracy_score, precision_score, recall_score,
f1_score, confusion_matrix, classification_report
)
# Calculate metrics
accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions)
recall = recall_score(y_test, predictions)
f1 = f1_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2%}")
print(f"Precision: {precision:.2%}")
print(f"Recall: {recall:.2%}")
print(f"F1 Score: {f1:.2%}")
# Confusion Matrix
cm = confusion_matrix(y_test, predictions)
print("\\nConfusion Matrix:")
print(cm)
# [[True Neg, False Pos]
# [False Neg, True Pos]]
# Detailed report
print("\\nClassification Report:")
print(classification_report(y_test, predictions))
Imagine studying for an exam using practice questions. If the actual exam has the exact same questions, you'll ace it - but did you really learn? This is why we split data into training and testing sets!
Training Set: Data the model learns from (usually 70-80%)
Test Set: Data we use to evaluate the model (usually 20-30%)
The model never sees the test set during training!
Total Data: 1000 samples
↓
Training: 800 samples (80%) → Train model
Testing: 200 samples (20%) → Evaluate model
# Split data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2, # 20% for testing
random_state=42, # For reproducibility
stratify=y # Keep class proportions
)
Scikit-learn is Python's most popular ML library. It has a consistent, easy-to-use interface for dozens of algorithms. Once you learn the pattern, you can use any algorithm!
Every scikit-learn model follows the same 3-step pattern:
# 1. Create the model
model = SomeAlgorithm(parameters)
# 2. Train the model
model.fit(X_train, y_train)
# 3. Make predictions
predictions = model.predict(X_test)
# 1. Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# 2. Load data
iris = load_iris()
X, y = iris.data, iris.target
# 3. Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# 4. Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# 5. Train model
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train_scaled, y_train)
# 6. Make predictions
predictions = model.predict(X_test_scaled)
# 7. Evaluate
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2%}")
You now understand ML fundamentals and can build classification and regression models! In the next module, we'll dive into Deep Learning with TensorFlow - building neural networks that can learn complex patterns.