Take your ML skills to the next level with advanced techniques and optimization strategies
You've learned the basics of machine learning - now it's time to level up! Advanced ML is about squeezing every bit of performance from your models, handling complex real-world scenarios, and using cutting-edge techniques that win competitions and power production systems.
Advanced Machine Learning involves sophisticated techniques to improve model performance, handle complex data, and solve challenging real-world problems. It's like going from a beginner chess player to a grandmaster - same game, but much deeper strategy!
Better Performance
Squeeze extra accuracy out of your models with advanced techniques
Handle Complexity
Work with high-dimensional data and complex patterns
Production Ready
Build robust models that work in real-world systems
Competitive Edge
Use techniques that win Kaggle competitions
Gradient boosting is like assembling a team of experts where each new expert focuses on fixing the mistakes of previous ones! It's one of the most powerful ML techniques - winning countless Kaggle competitions and powering production systems at companies like Uber, Airbnb, and Netflix.
Build models sequentially, where each new model corrects errors made by previous models. It's like iterative learning - each iteration gets smarter by focusing on what went wrong before!
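The "fix the previous mistakes" idea can be shown in a few lines before reaching for a library. This is a simplified sketch of boosting for regression - each new tree is fit to the residual errors of the ensemble so far (real libraries fit trees to loss gradients and add regularization):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression problem: noisy sine wave
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

learning_rate = 0.1
prediction = np.zeros_like(y)   # start from a constant (zero) prediction
trees = []

for _ in range(100):
    residuals = y - prediction              # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)                  # new tree focuses on those mistakes
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

mse_start = np.mean(y ** 2)                 # error of the initial zero model
mse_end = np.mean((y - prediction) ** 2)    # error after 100 boosting rounds
print(f"MSE before boosting: {mse_start:.3f}, after: {mse_end:.3f}")
```

Each round shrinks the residuals a little; the learning rate keeps any single tree from over-correcting.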
# Install libraries
# pip install xgboost lightgbm
# XGBoost Example
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load and split data (X, y = your feature matrix and labels)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train XGBoost model
model = xgb.XGBClassifier(
    n_estimators=100,   # Number of trees
    learning_rate=0.1,  # Step size
    max_depth=5,        # Tree depth
    random_state=42
)
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2%}")
The original champion! More mature, widely used, excellent documentation.
Best for: learning the fundamentals, small-to-medium datasets, and projects where mature documentation matters.
The speed demon! Faster training, lower memory usage, handles large datasets better.
Best for: large datasets, fast experimentation, and memory-constrained environments.
# LightGBM Example
import lightgbm as lgb
model = lgb.LGBMClassifier(
    n_estimators=100,
    learning_rate=0.1,
    num_leaves=31,  # Max leaves per tree
    random_state=42
)
model.fit(X_train, y_train)
# Feature importance
import matplotlib.pyplot as plt
lgb.plot_importance(model, max_num_features=10)
plt.show()
Start with XGBoost for learning and small projects. Switch to LightGBM when you have large datasets or need faster training. Both produce similar accuracy, so choose based on your needs!
Feature engineering is the art of creating better input features for your models. It's often said that "better data beats better algorithms" - a simple model with great features outperforms a complex model with poor features!
Scaling transforms features to similar ranges. Imagine comparing house prices ($100k-$1M) with number of bedrooms (1-5) - the prices dominate! Scaling fixes this.
# StandardScaler (mean=0, std=1)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
# Each feature now has mean=0, std=1
# MinMaxScaler (scale to 0-1 range)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_train)
# All features now between 0 and 1
# RobustScaler (handles outliers better)
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X_train)
# Uses median and IQR, robust to outliers
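A quick numeric illustration of why the robust option matters (toy values assumed): a single extreme value inflates StandardScaler's mean and standard deviation, squashing the normal points together, while RobustScaler's median and IQR ignore it.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

data = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # 1000 is an outlier

standard = StandardScaler().fit_transform(data)
robust = RobustScaler().fit_transform(data)

# Spread of the four "normal" points after each scaling
print("StandardScaler spread of normal points:", np.ptp(standard[:4]))
print("RobustScaler spread of normal points:", np.ptp(robust[:4]))
```

After StandardScaler the normal points are crammed into a tiny range because the outlier dominates the statistics; RobustScaler keeps them well separated.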
# Ordinal encoding (for ordered data) - map categories explicitly,
# since LabelEncoder would sort alphabetically (Large=0, Medium=1, Small=2)
size_order = {'Small': 0, 'Medium': 1, 'Large': 2}
df['size_encoded'] = df['size'].map(size_order)
# Small=0, Medium=1, Large=2
# One-Hot Encoding (for nominal data)
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)  # use sparse=False on sklearn < 1.2
encoded = encoder.fit_transform(df[['color']])
# Creates binary columns (alphabetical): color_blue, color_green, color_red
# Pandas get_dummies (easiest way)
import pandas as pd
df_encoded = pd.get_dummies(df, columns=['color', 'size'])
# Automatically creates dummy variables
# Polynomial features (interactions)
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
# Creates x1, x2, x1², x2², x1*x2
# Domain-specific features
df['price_per_sqft'] = df['price'] / df['sqft']
df['age'] = 2024 - df['year_built']
df['is_weekend'] = df['date'].dt.dayofweek >= 5
# Binning continuous variables
df['age_group'] = pd.cut(df['age'],
                         bins=[0, 18, 35, 50, 100],
                         labels=['child', 'young', 'middle', 'senior'])
Imagine trying to visualize 100-dimensional data - impossible! Dimensionality reduction compresses high-dimensional data into 2-3 dimensions while preserving important patterns. It's like flattening a 3D globe into a 2D map - you lose some information but gain understandability!
PCA finds the directions of maximum variance in your data. Think of it as finding the best camera angles to capture the most information about a 3D object in 2D photos.
# Apply PCA
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Reduce to 2 components for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# Check explained variance
print(f"Variance explained: {pca.explained_variance_ratio_}")
# Shows how much information each component captures
# Visualize in 2D
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.show()
# Keep 95% of variance
pca = PCA(n_components=0.95) # Automatically chooses components
X_pca = pca.fit_transform(X_scaled)
print(f"Reduced from {X.shape[1]} to {X_pca.shape[1]} features")
# t-SNE for visualization (not for ML models!)
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X_scaled)
# Visualize clusters
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis')
plt.title('t-SNE Visualization')
plt.show()
PCA: Fast, linear, preserves global structure. Use for preprocessing and feature reduction.
t-SNE: Slow, non-linear, preserves local structure. Use only for visualization, not as input to models!
Why rely on one model when you can combine many? Ensemble methods are like asking multiple experts and taking a vote - often more accurate than any single expert! This is why ensemble methods dominate ML competitions.
Bagging trains multiple models on different random subsets of data, then averages their predictions. Random Forest is the most famous bagging algorithm!
# Random Forest (bagging with decision trees)
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(
    n_estimators=100,  # Number of trees
    max_depth=10,
    random_state=42
)
rf.fit(X_train, y_train)
# Feature importance
importances = rf.feature_importances_
print("Feature importances:", importances)
# AdaBoost (Adaptive Boosting)
from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier(
    n_estimators=50,
    learning_rate=1.0,
    random_state=42
)
ada.fit(X_train, y_train)
# Gradient Boosting (we covered XGBoost/LightGBM earlier!)
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb.fit(X_train, y_train)
# Stacking: Train a meta-model on predictions of base models
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
# Define base models
estimators = [
    ('rf', RandomForestClassifier(n_estimators=10)),
    ('svm', SVC(probability=True))
]
# Create stacking ensemble
stacking = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression()
)
stacking.fit(X_train, y_train)
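The "asking multiple experts and taking a vote" idea can also be used directly, without a meta-model: a soft-voting ensemble averages the predicted probabilities of its base models. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# voting='soft' averages each model's predicted class probabilities
voting = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
    ],
    voting='soft',
)
voting.fit(X_train, y_train)
accuracy = voting.score(X_test, y_test)
print(f"Voting accuracy: {accuracy:.2%}")
```

Soft voting works best when base models are accurate but make different kinds of mistakes, so their errors average out.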
Hyperparameters are the knobs you turn to optimize your model - like adjusting the temperature and time when baking a cake. Finding the right settings can boost accuracy by 5-10%!
Grid Search tries every combination of parameters you specify. Thorough but slow!
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10]
}
# Create grid search
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1            # Use all CPU cores
)
# Fit and get best parameters
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")
# Use best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
# Define parameter distributions
param_dist = {
    'n_estimators': randint(50, 500),
    'max_depth': randint(3, 20),
    'learning_rate': uniform(0.01, 0.3)  # uniform(loc, scale): samples from [0.01, 0.31)
}
# Random search (faster than grid search)
random_search = RandomizedSearchCV(
    xgb.XGBClassifier(random_state=42),
    param_dist,
    n_iter=50,  # Try 50 random combinations
    cv=5,
    random_state=42
)
random_search.fit(X_train, y_train)
Start with Random Search to explore the parameter space quickly, then use Grid Search to fine-tune around the best values found. This hybrid approach is faster and more effective!
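The two-stage approach can be sketched like this, using a RandomForest on synthetic data (the parameter ranges and window sizes are illustrative, not tuned recommendations):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Stage 1: random search roughly locates a good region
coarse = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    {'n_estimators': randint(20, 200), 'max_depth': randint(2, 15)},
    n_iter=10, cv=3, random_state=42,
)
coarse.fit(X, y)
best_n = coarse.best_params_['n_estimators']
best_d = coarse.best_params_['max_depth']

# Stage 2: grid search fine-tunes in a narrow window around those values
fine = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {'n_estimators': [max(10, best_n - 20), best_n, best_n + 20],
     'max_depth': [max(1, best_d - 1), best_d, best_d + 1]},
    cv=3,
)
fine.fit(X, y)
print(f"Refined best params: {fine.best_params_}, score: {fine.best_score_:.3f}")
```

The fine grid includes the coarse winner itself, so the refined search can only match or improve on it.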
Real-world data is often imbalanced - 99% normal transactions, 1% fraud. If your model just predicts "normal" every time, it's 99% accurate but useless! Here's how to handle this challenge.
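This accuracy trap is easy to demonstrate with a synthetic 99:1 dataset: a baseline that always predicts "normal" scores 99% accuracy while catching zero fraud.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 990 normal transactions (class 0), 10 fraudulent (class 1)
y = np.array([0] * 990 + [1] * 10)
X = np.zeros((1000, 1))  # features are irrelevant for this demo

# Always predicts the majority class
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X, y)
y_pred = dummy.predict(X)

print(f"Accuracy: {accuracy_score(y, y_pred):.0%}")    # 99%
print(f"Fraud recall: {recall_score(y, y_pred):.0%}")  # 0% - misses every fraud
```

This is why imbalanced problems are evaluated with recall, precision, F1, or ROC-AUC rather than raw accuracy.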
Give more importance to the minority class during training. Most sklearn models support this!
from sklearn.ensemble import RandomForestClassifier
# Automatically balance class weights
model = RandomForestClassifier(class_weight='balanced')
model.fit(X_train, y_train)
# Or specify custom weights
model = RandomForestClassifier(class_weight={0: 1, 1: 10})
# Class 1 (minority) gets 10x more weight
# Install: pip install imbalanced-learn
from imblearn.over_sampling import SMOTE
# Create synthetic samples for minority class
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
print(f"Original: {len(y_train)} samples")
print(f"After SMOTE: {len(y_resampled)} samples")
# Train on balanced data
model.fit(X_resampled, y_resampled)
from imblearn.under_sampling import RandomUnderSampler
# Reduce majority class samples
rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)
Let's build a complete fraud detection system using all the advanced techniques we've learned! This project combines feature engineering, handling imbalanced data, ensemble methods, and hyperparameter tuning.
# Step 1: Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
import xgboost as xgb
from sklearn.metrics import classification_report, roc_auc_score
# Step 2: Load and explore data
df = pd.read_csv('transactions.csv')
print(df.head())
print(f"Fraud rate: {df['is_fraud'].mean():.2%}")
# Step 3: Feature Engineering
df['hour'] = pd.to_datetime(df['timestamp']).dt.hour
df['is_night'] = (df['hour'] >= 22) | (df['hour'] <= 6)
df['amount_log'] = np.log1p(df['amount'])
df['velocity'] = df.groupby('user_id')['amount'].transform('count')
# Step 4: Prepare features
features = ['amount', 'amount_log', 'hour', 'is_night', 'velocity']
X = df[features]
y = df['is_fraud']
# Step 5: Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Step 6: Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Step 7: Handle imbalance with SMOTE
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(
    X_train_scaled, y_train
)
print(f"After SMOTE: {len(y_train_balanced)} samples")
# Step 8: Train XGBoost model
model = xgb.XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=6,
    scale_pos_weight=1,  # Already balanced with SMOTE
    random_state=42
)
model.fit(X_train_balanced, y_train_balanced)
# Step 9: Evaluate model
y_pred = model.predict(X_test_scaled)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
roc_score = roc_auc_score(y_test, y_pred_proba)
print(f"\nROC-AUC Score: {roc_score:.3f}")
# Step 10: Feature importance
import matplotlib.pyplot as plt
xgb.plot_importance(model, max_num_features=10)
plt.title('Top 10 Important Features')
plt.show()
# Step 11: Save model
import joblib
joblib.dump(model, 'fraud_detection_model.pkl')
joblib.dump(scaler, 'scaler.pkl')
print("Model saved successfully!")
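The saved artifacts can be reloaded later for inference. A self-contained sketch of the round trip (using a small RandomForest stand-in so it runs anywhere; the same `joblib` calls apply to the XGBoost model and scaler saved above):

```python
import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Train a small model and scaler on synthetic data
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
scaler = StandardScaler().fit(X)
model = RandomForestClassifier(random_state=42).fit(scaler.transform(X), y)

# Save, then reload exactly as a serving script would
joblib.dump(model, 'model.pkl')
joblib.dump(scaler, 'scaler.pkl')
loaded_model = joblib.load('model.pkl')
loaded_scaler = joblib.load('scaler.pkl')

# Score one new example with the reloaded artifacts -
# the feature order must match what the model was trained on
new_example = np.zeros((1, 5))
probability = loaded_model.predict_proba(loaded_scaler.transform(new_example))[0, 1]
print(f"Positive-class probability: {probability:.1%}")
```

Saving the scaler alongside the model is the important habit: predictions are only valid if new data goes through the exact preprocessing used at training time.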
You've mastered advanced ML techniques! Next, we'll dive into SQL & Databases - essential skills for working with real-world data stored in databases. You'll learn to query, join, and analyze data directly from databases.