
Module 5: Advanced Machine Learning

Take your ML skills to the next level with advanced techniques and optimization strategies

🚀 What is Advanced Machine Learning?

You've learned the basics of machine learning - now it's time to level up! Advanced ML is about squeezing every bit of performance from your models, handling complex real-world scenarios, and using cutting-edge techniques that win competitions and power production systems.

Simple Definition

Advanced Machine Learning involves sophisticated techniques to improve model performance, handle complex data, and solve challenging real-world problems. It's like going from a beginner chess player to a grandmaster - same game, but much deeper strategy!

Why Advanced ML?

Better Performance

Squeeze extra performance from your models - advanced techniques often add several percentage points of accuracy

Handle Complexity

Work with high-dimensional data and complex patterns

Production Ready

Build robust models that work in real-world systems

Competitive Edge

Use techniques that win Kaggle competitions

📚 Learn More:

🌳 Gradient Boosting (XGBoost & LightGBM)

Gradient boosting is like assembling a team of experts where each new expert focuses on fixing the mistakes of previous ones! It's one of the most powerful ML techniques - winning countless Kaggle competitions and powering production systems at companies like Uber, Airbnb, and Netflix.

How Gradient Boosting Works

Build models sequentially, where each new model corrects errors made by previous models. It's like iterative learning - each iteration gets smarter by focusing on what went wrong before!

# Install libraries
# pip install xgboost lightgbm

# XGBoost Example
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load and split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train XGBoost model
model = xgb.XGBClassifier(
    n_estimators=100,   # Number of trees
    learning_rate=0.1,  # Step size
    max_depth=5,        # Tree depth
    random_state=42
)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2%}")

XGBoost vs LightGBM

XGBoost

The original champion! More mature, widely used, excellent documentation.

Best for:

  • Small to medium datasets
  • When you need stability
  • Extensive hyperparameter tuning

LightGBM

The speed demon! Faster training, lower memory usage, handles large datasets better.

Best for:

  • Large datasets (> 10k rows)
  • When speed matters
  • High-dimensional data

# LightGBM Example
import lightgbm as lgb

model = lgb.LGBMClassifier(
    n_estimators=100,
    learning_rate=0.1,
    num_leaves=31,  # Max leaves per tree
    random_state=42
)
model.fit(X_train, y_train)

# Feature importance
import matplotlib.pyplot as plt
lgb.plot_importance(model, max_num_features=10)
plt.show()

🚀 Pro Tip:

Start with XGBoost for learning and small projects. Switch to LightGBM when you have large datasets or need faster training. Both produce similar accuracy, so choose based on your needs!

🔧 Feature Engineering

Feature engineering is the art of creating better input features for your models. It's often said that "better data beats better algorithms" - a simple model with great features outperforms a complex model with poor features!

Feature Scaling

Scaling transforms features to similar ranges. Imagine comparing house prices ($100k-$1M) with number of bedrooms (1-5) - the prices dominate! Scaling fixes this.

# StandardScaler (mean=0, std=1)
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
# Each feature now has mean=0, std=1

# MinMaxScaler (scale to 0-1 range)
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_train)
# All features now between 0 and 1

# RobustScaler (handles outliers better)
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_scaled = scaler.fit_transform(X_train)
# Uses median and IQR, robust to outliers

Encoding Categorical Variables

# Label Encoding (for ordinal data)
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['size_encoded'] = le.fit_transform(df['size'])
# Note: LabelEncoder assigns integers alphabetically (Large=0, Medium=1, Small=2);
# for a true Small < Medium < Large order, map the categories explicitly

# One-Hot Encoding (for nominal data)
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)  # use sparse=False on sklearn < 1.2
encoded = encoder.fit_transform(df[['color']])
# Creates binary columns: color_red, color_blue, color_green

# Pandas get_dummies (easiest way)
import pandas as pd

df_encoded = pd.get_dummies(df, columns=['color', 'size'])
# Automatically creates dummy variables

Creating New Features

# Polynomial features (interactions)
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
# Creates x1, x2, x1², x2², x1*x2

# Domain-specific features
df['price_per_sqft'] = df['price'] / df['sqft']
df['age'] = 2024 - df['year_built']
df['is_weekend'] = df['date'].dt.dayofweek >= 5

# Binning continuous variables
df['age_group'] = pd.cut(df['age'],
                         bins=[0, 18, 35, 50, 100],
                         labels=['child', 'young', 'middle', 'senior'])

⚠️ Feature Engineering Tips:

  • Always fit scalers on training data only, then transform test data
  • Use domain knowledge to create meaningful features
  • More features isn't always better - can lead to overfitting
  • Test feature importance and remove low-value features

📉 Dimensionality Reduction

Imagine trying to visualize 100-dimensional data - impossible! Dimensionality reduction compresses high-dimensional data into 2-3 dimensions while preserving important patterns. It's like creating a map from a 3D globe - you lose some information but gain understandability!

PCA (Principal Component Analysis)

PCA finds the directions of maximum variance in your data. Think of it as finding the best camera angles to capture the most information about a 3D object in 2D photos.

# Apply PCA
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Reduce to 2 components for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Check explained variance
print(f"Variance explained: {pca.explained_variance_ratio_}")
# Shows how much information each component captures

# Visualize in 2D
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.show()

# Keep 95% of variance
pca = PCA(n_components=0.95)  # Automatically chooses components
X_pca = pca.fit_transform(X_scaled)
print(f"Reduced from {X.shape[1]} to {X_pca.shape[1]} features")

t-SNE (t-Distributed Stochastic Neighbor Embedding)

# t-SNE for visualization (not for ML models!)
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X_scaled)

# Visualize clusters
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis')
plt.title('t-SNE Visualization')
plt.show()

🎯 PCA vs t-SNE:

PCA: Fast, linear, preserves global structure. Use for preprocessing and feature reduction.

t-SNE: Slow, non-linear, preserves local structure. Use only for visualization, not as input to models!
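"Use PCA for preprocessing" can be shown concretely by chaining it into a classifier. A hedged sketch using the digits dataset (the 20-component count is an arbitrary choice for illustration):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)  # 64 pixel features per image
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale, compress 64 features down to 20 components, then classify
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=20)),
    ('clf', LogisticRegression(max_iter=1000))
])
pipe.fit(X_train, y_train)
print(f"Accuracy with 20 components: {pipe.score(X_test, y_test):.2f}")
```

A model trained on the compressed features is faster and often nearly as accurate as one trained on all 64 pixels.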

🎭 Ensemble Methods

Why rely on one model when you can combine many? Ensemble methods are like asking multiple experts and taking a vote - often more accurate than any single expert! This is why ensemble methods dominate ML competitions.

Bagging (Bootstrap Aggregating)

Bagging trains multiple models on different random subsets of data, then averages their predictions. Random Forest is the most famous bagging algorithm!

# Random Forest (bagging with decision trees)
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,  # Number of trees
    max_depth=10,
    random_state=42
)
rf.fit(X_train, y_train)

# Feature importance
importances = rf.feature_importances_
print("Top features:", importances)

Boosting

# AdaBoost (Adaptive Boosting)
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(
    n_estimators=50,
    learning_rate=1.0,
    random_state=42
)
ada.fit(X_train, y_train)

# Gradient Boosting (we covered XGBoost/LightGBM earlier!)
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb.fit(X_train, y_train)

Stacking

# Stacking: Train a meta-model on predictions of base models
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Define base models
estimators = [
    ('rf', RandomForestClassifier(n_estimators=10)),
    ('svm', SVC(probability=True))
]

# Create stacking ensemble
stacking = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression()
)
stacking.fit(X_train, y_train)

🎯 Ensemble Strategy:

  • Bagging: Reduces variance, good for unstable models (decision trees)
  • Boosting: Reduces bias, builds strong models from weak ones
  • Stacking: Combines different model types for best performance
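The "multiple experts taking a vote" idea can also be used directly: sklearn's VotingClassifier combines predictions from different model types without a meta-model. A minimal sketch on synthetic data (the three base models are arbitrary choices):

```python
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Soft voting averages the predicted probabilities of all three models
voter = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
        ('dt', DecisionTreeClassifier(random_state=42))
    ],
    voting='soft'
)
voter.fit(X_train, y_train)
print(f"Voting accuracy: {voter.score(X_test, y_test):.2f}")
```

With `voting='hard'` the ensemble instead takes a majority vote on the predicted labels; soft voting needs every base model to support `predict_proba`.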

🎛️ Hyperparameter Tuning

Hyperparameters are the knobs you turn to optimize your model - like adjusting the temperature and time when baking a cake. Finding the right settings can often add several percentage points of accuracy!

Grid Search

Grid Search tries every combination of parameters you specify. Thorough but slow!

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10]
}

# Create grid search
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1            # Use all CPU cores
)

# Fit and get best parameters
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")

# Use best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

Random Search

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
import xgboost as xgb

# Define parameter distributions
param_dist = {
    'n_estimators': randint(50, 500),
    'max_depth': randint(3, 20),
    'learning_rate': uniform(0.01, 0.3)  # samples from [0.01, 0.31)
}

# Random search (faster than grid search)
random_search = RandomizedSearchCV(
    xgb.XGBClassifier(),
    param_dist,
    n_iter=50,  # Try 50 random combinations
    cv=5,
    random_state=42
)
random_search.fit(X_train, y_train)

💡 Tuning Strategy:

Start with Random Search to explore the parameter space quickly, then use Grid Search to fine-tune around the best values found. This hybrid approach is faster and more effective!
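In outline, the hybrid strategy might look like this: a coarse random search over wide ranges, then a narrow grid centered on the best values it found. The ranges, iteration counts, and fold counts below are illustrative, and a Random Forest stands in for any model:

```python
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from scipy.stats import randint

X, y = make_classification(n_samples=400, n_features=10, random_state=42)

# Stage 1: coarse random search over a wide range
coarse = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    {'n_estimators': randint(50, 300), 'max_depth': randint(3, 20)},
    n_iter=10, cv=3, random_state=42
)
coarse.fit(X, y)
best = coarse.best_params_

# Stage 2: fine grid search in a small window around the best values
fine = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {'n_estimators': [max(10, best['n_estimators'] - 25),
                      best['n_estimators'],
                      best['n_estimators'] + 25],
     'max_depth': [max(2, best['max_depth'] - 1),
                   best['max_depth'],
                   best['max_depth'] + 1]},
    cv=3
)
fine.fit(X, y)
print(f"Refined params: {fine.best_params_}")
```

The random stage covers the space cheaply; the grid stage then spends its budget only where it is likely to pay off.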

⚖️ Handling Imbalanced Datasets

Real-world data is often imbalanced - 99% normal transactions, 1% fraud. If your model just predicts "normal" every time, it's 99% accurate but useless! Here's how to handle this challenge.
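You can see this accuracy trap directly with sklearn's DummyClassifier, which here always predicts the majority class (the 99:1 data is synthetic, purely for illustration):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 990 normal transactions, 10 frauds
y = np.array([0] * 990 + [1] * 10)
X = np.zeros((1000, 1))  # features don't matter for this baseline

# A "model" that always predicts the majority class
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X, y)
y_pred = dummy.predict(X)

print(f"Accuracy: {accuracy_score(y, y_pred):.1%}")     # 99.0% - looks great
print(f"Fraud recall: {recall_score(y, y_pred):.1%}")   # 0.0% - catches no fraud
```

Ninety-nine percent accuracy, zero frauds caught - which is why the techniques below, and better metrics, matter.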

Class Weights

Give more importance to the minority class during training. Most sklearn models support this!

from sklearn.ensemble import RandomForestClassifier

# Automatically balance class weights
model = RandomForestClassifier(class_weight='balanced')
model.fit(X_train, y_train)

# Or specify custom weights
model = RandomForestClassifier(class_weight={0: 1, 1: 10})
# Class 1 (minority) gets 10x more weight

SMOTE (Synthetic Minority Over-sampling)

# Install: pip install imbalanced-learn
from imblearn.over_sampling import SMOTE

# Create synthetic samples for minority class
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
print(f"Original: {len(y_train)} samples")
print(f"After SMOTE: {len(y_resampled)} samples")

# Train on balanced data
model.fit(X_resampled, y_resampled)

Undersampling

from imblearn.under_sampling import RandomUnderSampler

# Reduce majority class samples
rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)

⚠️ Important Metrics for Imbalanced Data:

  • Don't use accuracy! Use precision, recall, F1-score instead
  • ROC-AUC score is great for imbalanced classification
  • Confusion matrix shows true/false positives/negatives
  • Consider the business cost of false positives vs false negatives
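As a minimal sketch of these metrics, here is a toy example with hand-picked labels (not real model output) - 3 actual frauds, of which the model catches 2 while raising 1 false alarm:

```python
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

# Toy labels: 7 normal (0), 3 fraud (1)
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))
# [[6 1]   6 true negatives, 1 false positive
#  [1 2]]  1 false negative, 2 true positives

print(f"Precision: {precision_score(y_true, y_pred):.2f}")  # 2 / (2+1) = 0.67
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")     # 2 / (2+1) = 0.67
print(f"F1-score:  {f1_score(y_true, y_pred):.2f}")
```

Precision asks "of the frauds I flagged, how many were real?"; recall asks "of the real frauds, how many did I catch?" - the business cost of each error decides which one to prioritize.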

🎯 Complete Project: Fraud Detection System

Let's build a complete fraud detection system using all the advanced techniques we've learned! This project combines feature engineering, handling imbalanced data, ensemble methods, and hyperparameter tuning.

# Step 1: Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
import xgboost as xgb
from sklearn.metrics import classification_report, roc_auc_score

# Step 2: Load and explore data
df = pd.read_csv('transactions.csv')
print(df.head())
print(f"Fraud rate: {df['is_fraud'].mean():.2%}")

# Step 3: Feature engineering
df['hour'] = pd.to_datetime(df['timestamp']).dt.hour
df['is_night'] = (df['hour'] >= 22) | (df['hour'] <= 6)
df['amount_log'] = np.log1p(df['amount'])
df['velocity'] = df.groupby('user_id')['amount'].transform('count')

# Step 4: Prepare features
features = ['amount', 'amount_log', 'hour', 'is_night', 'velocity']
X = df[features]
y = df['is_fraud']

# Step 5: Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 6: Scale features (fit on training data only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 7: Handle imbalance with SMOTE
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(
    X_train_scaled, y_train
)
print(f"After SMOTE: {len(y_train_balanced)} samples")

# Step 8: Train XGBoost model
model = xgb.XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=6,
    scale_pos_weight=1,  # Already balanced with SMOTE
    random_state=42
)
model.fit(X_train_balanced, y_train_balanced)

# Step 9: Evaluate model
y_pred = model.predict(X_test_scaled)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
roc_score = roc_auc_score(y_test, y_pred_proba)
print(f"\nROC-AUC Score: {roc_score:.3f}")

# Step 10: Feature importance
import matplotlib.pyplot as plt
xgb.plot_importance(model, max_num_features=10)
plt.title('Top 10 Important Features')
plt.show()

# Step 11: Save model
import joblib
joblib.dump(model, 'fraud_detection_model.pkl')
joblib.dump(scaler, 'scaler.pkl')
print("Model saved successfully!")

🎓 What This Project Demonstrates:

  • Feature engineering (time-based, log transforms, aggregations)
  • Handling imbalanced data with SMOTE
  • Feature scaling for better model performance
  • XGBoost for gradient boosting
  • Proper evaluation metrics (ROC-AUC, classification report)
  • Feature importance analysis
  • Model persistence (saving for production)

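The model persistence step deserves a quick sanity check: a joblib round-trip preserves a fitted model exactly, which you can verify with a tiny stand-in model (synthetic data and the file name are hypothetical, just for the demonstration):

```python
import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Train a tiny stand-in model on synthetic data
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

# Persist, then reload - predictions survive the round-trip unchanged
joblib.dump(model, 'model.pkl')
reloaded = joblib.load('model.pkl')
assert (model.predict(X) == reloaded.predict(X)).all()
print("Round-trip predictions match")
```

In production you would reload both the model and the fitted scaler, and apply `scaler.transform` to incoming data before calling `predict_proba` - exactly as in Steps 6 and 9 above.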

🎯 What's Next?

You've mastered advanced ML techniques! Next, we'll dive into SQL & Databases - essential skills for working with real-world data stored in databases. You'll learn to query, join, and analyze data directly from databases.