
Module 7: Computer Vision

Master image processing, CNNs, object detection, and real-world computer vision applications

👁️ What is Computer Vision?

Imagine teaching a computer to see and understand images the way humans do: recognizing faces, reading signs, detecting objects, understanding scenes. That's Computer Vision! It's how self-driving cars see the road and how your phone unlocks with your face.

Simple Definition

Computer Vision is a field of AI that trains computers to interpret and understand visual information from the world. It extracts meaningful information from images and videos.

Example: Looking at a photo

• Human sees: "A cat sitting on a couch"

• Computer sees: Millions of pixel values

• Computer Vision: Teaches computer to understand it's a cat!

🌟 Real-World Applications:

  • Self-Driving Cars: Detect pedestrians, signs, lanes
  • Face Recognition: Unlock phones, security systems
  • Medical Imaging: Detect diseases in X-rays, MRIs
  • Quality Control: Inspect products for defects
  • Augmented Reality: Snapchat filters, Pokemon GO
  • Retail: Cashier-less stores, inventory management

Common CV Tasks

Image Classification

What is in this image? (cat, dog, car)

Object Detection

Where are objects? (bounding boxes)

Image Segmentation

Pixel-level classification

Face Recognition

Identify specific people

Pose Estimation

Detect body keypoints

Image Generation

Create new images (GANs, Diffusion)

🖼️ Image Basics

Before we can process images, we need to understand what they are! To a computer, an image is just a grid of numbers representing colors.

Pixels and Channels

An image is made of pixels (tiny dots). Each pixel has color values in different channels (Red, Green, Blue).

Grayscale Image:

• 1 channel (intensity: 0-255)

• Shape: (height, width)

• Example: 28×28 = 784 pixels

Color Image (RGB):

• 3 channels (Red, Green, Blue: 0-255 each)

• Shape: (height, width, 3)

• Example: 224×224×3 = 150,528 values
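The pixel counts above are easy to verify yourself. A minimal sketch using synthetic NumPy arrays (no image files needed):

```python
import numpy as np

# A synthetic 28x28 grayscale image: one intensity value per pixel
gray = np.zeros((28, 28), dtype=np.uint8)
print(gray.shape, gray.size)   # prints: (28, 28) 784

# A synthetic 224x224 RGB image: three values (R, G, B) per pixel
rgb = np.zeros((224, 224, 3), dtype=np.uint8)
print(rgb.shape, rgb.size)     # prints: (224, 224, 3) 150528
```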

Working with Images in Python

```python
# Import libraries
import cv2                  # OpenCV
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt

# Read an image
img = cv2.imread('cat.jpg')
print(img.shape)            # (height, width, channels)
print(img.dtype)            # uint8 (0-255)

# Convert BGR to RGB (OpenCV uses BGR!)
img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

# Display image
plt.imshow(img_rgb)
plt.axis('off')
plt.show()

# Convert to grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
print(gray.shape)           # (height, width) - no channel dimension

# Access a pixel value
pixel = img[100, 200]       # row 100, col 200
print(pixel)                # [B, G, R] values
```

💡 Key Concepts:

  • Images are NumPy arrays of numbers
  • OpenCV uses BGR; most other libraries use RGB
  • Pixel values range from 0 (black) to 255 (white/full intensity)
  • Image shape is (height, width, channels)
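One way to see the BGR-vs-RGB point concretely: the conversion is just a reversal of the channel axis. A sketch with a synthetic array (NumPy only, so it runs without an image file):

```python
import numpy as np

# Synthetic 2x2 "BGR" image with distinct channel values
bgr = np.zeros((2, 2, 3), dtype=np.uint8)
bgr[..., 0] = 255   # blue channel
bgr[..., 2] = 10    # red channel

# Reversing the last axis swaps B and R - same effect as cv2.COLOR_BGR2RGB
rgb = bgr[..., ::-1]
print(rgb[0, 0])    # red channel comes first now
```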

🔧 Image Preprocessing

Before feeding images to ML models, we need to prepare them. This includes resizing, normalizing, and augmenting images to improve model performance.

Common Preprocessing Steps

1. Resizing

Neural networks expect fixed-size inputs. Resize all images to the same dimensions.

```python
# Resize to 224x224 (a common input size for CNNs)
resized = cv2.resize(img, (224, 224))
print(resized.shape)  # (224, 224, 3)
```

2. Normalization

Scale pixel values to 0-1 range or standardize. Helps neural networks train faster.

```python
# Scale to 0-1
normalized = img.astype('float32') / 255.0

# Standardize per channel (these are the standard ImageNet means and stds)
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])
standardized = (normalized - mean) / std
```

3. Data Augmentation

Create variations of images to increase training data and prevent overfitting.

```python
# Flip horizontally
flipped = cv2.flip(img, 1)

# Rotate 45 degrees around the image center
rows, cols = img.shape[:2]
M = cv2.getRotationMatrix2D((cols / 2, rows / 2), 45, 1)
rotated = cv2.warpAffine(img, M, (cols, rows))

# Adjust brightness/contrast
bright = cv2.convertScaleAbs(img, alpha=1.2, beta=30)

# Add Gaussian noise (clip to keep values in the valid 0-255 range)
noise = np.random.normal(0, 25, img.shape)
noisy = np.clip(img.astype('float32') + noise, 0, 255).astype('uint8')
```

💡 Why Augmentation?

If you only have 1,000 images, augmentation can create 10,000+ variations! This helps the model learn to recognize objects from different angles, lighting, and conditions.
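As a toy illustration of this multiplier effect, rotations and flips alone turn one image into eight distinct variants. A sketch with NumPy only (real pipelines use OpenCV or a framework's augmentation utilities):

```python
import numpy as np

img = np.arange(16, dtype=np.uint8).reshape(4, 4)  # stand-in for one image

variants = []
for k in range(4):                       # 4 rotations: 0, 90, 180, 270 degrees
    rotated = np.rot90(img, k)
    variants.append(rotated)
    variants.append(np.fliplr(rotated))  # plus a horizontal flip of each

print(len(variants))  # prints: 8
```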

🐱 Image Classification: Cats vs Dogs

Let's build a complete image classifier using a Convolutional Neural Network (CNN)!

Complete CNN Example

```python
# Import libraries
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Build CNN model
model = models.Sequential([
    # Convolutional layers
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 3)),
    layers.MaxPooling2D(2, 2),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D(2, 2),
    layers.Conv2D(128, (3, 3), activation='relu'),
    layers.MaxPooling2D(2, 2),

    # Flatten and dense layers
    layers.Flatten(),
    layers.Dense(512, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid')  # Binary classification
])

# Compile model
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# Data augmentation
train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True
)

# Load training data
train_generator = train_datagen.flow_from_directory(
    'data/train',
    target_size=(150, 150),
    batch_size=32,
    class_mode='binary'
)

# Train model
history = model.fit(
    train_generator,
    epochs=20,
    steps_per_epoch=100
)

# Make a prediction on a new image
from tensorflow.keras.preprocessing import image
img = image.load_img('test_cat.jpg', target_size=(150, 150))
img_array = image.img_to_array(img) / 255.0
img_array = np.expand_dims(img_array, axis=0)
prediction = model.predict(img_array)[0][0]
print("Dog" if prediction > 0.5 else "Cat")
```

🧠 How CNNs Work:

  • Conv2D: Detects features (edges, textures, patterns)
  • MaxPooling: Reduces size, keeps important features
  • Multiple layers: Learn increasingly complex features
  • Flatten: Convert 2D features to 1D for classification
  • Dense: Final classification layers
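You can trace the feature-map sizes of the model above by hand: each 3×3 "valid" convolution shrinks the spatial size by 2, and each 2×2 max pooling halves it. A small sketch (assuming Conv2D's default stride of 1 and no padding, as in the code above):

```python
def conv_out(size, kernel=3):   # 'valid' convolution shrinks by kernel-1
    return size - kernel + 1

def pool_out(size, pool=2):     # 2x2 max pooling halves (floor division)
    return size // pool

size = 150
for _ in range(3):              # three Conv2D + MaxPooling2D stages
    size = pool_out(conv_out(size))

print(size)  # prints: 17 -> Flatten sees 17*17*128 = 36992 values
```

So by the time the Dense layers take over, 150×150 pixels have been condensed into a 17×17 grid of 128 learned feature channels.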

📦 Object Detection with YOLO

Object detection doesn't just classify images; it finds WHERE objects are! YOLO (You Only Look Once) is one of the fastest and most popular object detection algorithms.

How YOLO Works

YOLO divides the image into a grid and predicts bounding boxes and class probabilities for each grid cell. It's incredibly fast - can process video in real-time!
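To make the bounding-box idea concrete: YOLO-style labels are commonly stored as normalized (center x, center y, width, height) values between 0 and 1. Converting one to pixel corner coordinates is simple arithmetic (a sketch of the box-format math with made-up values, not the ultralytics API):

```python
def yolo_to_xyxy(cx, cy, w, h, img_w, img_h):
    """Convert a normalized center-format box to pixel corner coordinates."""
    x1 = (cx - w / 2) * img_w
    y1 = (cy - h / 2) * img_h
    x2 = (cx + w / 2) * img_w
    y2 = (cy + h / 2) * img_h
    return x1, y1, x2, y2

# A box centered in a 640x480 image, covering half of each dimension
print(yolo_to_xyxy(0.5, 0.5, 0.5, 0.5, 640, 480))  # prints: (160.0, 120.0, 480.0, 360.0)
```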

Using YOLO with OpenCV

```python
# Install ultralytics (YOLOv8) first: pip install ultralytics
import cv2
from ultralytics import YOLO

# Load model
model = YOLO('yolov8n.pt')  # nano model (fastest)

# Detect objects in an image
results = model('street.jpg')

# Display results
for result in results:
    for box in result.boxes:
        class_id = int(box.cls[0])
        confidence = float(box.conf[0])
        bbox = box.xyxy[0].tolist()
        print(f"Detected: {model.names[class_id]} ({confidence:.2f})")

    # Save annotated image
    result.save('output.jpg')

# Real-time video detection
cap = cv2.VideoCapture(0)  # Webcam
while True:
    ret, frame = cap.read()
    if not ret:
        break
    results = model(frame)
    annotated = results[0].plot()
    cv2.imshow('YOLO', annotated)
    if cv2.waitKey(1) == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()
```

YOLO Versions:

  • YOLOv8n: Fastest, good for real-time
  • YOLOv8s: Balanced speed/accuracy
  • YOLOv8m: More accurate
  • YOLOv8l/x: Most accurate, slower

Can Detect:

  • 80 object classes (COCO dataset)
  • People, vehicles, animals
  • Everyday objects
  • Custom objects (with fine-tuning)

🎯 Pre-trained Models

Don't train from scratch! Use pre-trained models that have learned from millions of images.

Popular Models

| Model | Task | Accuracy | Speed |
| --- | --- | --- | --- |
| ResNet-50 | Classification | 76% Top-1 | Fast |
| EfficientNet | Classification | 84% Top-1 | Medium |
| YOLOv8 | Object Detection | 53 mAP | Very Fast |
| Mask R-CNN | Instance Segmentation | High | Slow |

Transfer Learning Example

```python
# Use pre-trained ResNet50
from tensorflow.keras.applications import ResNet50
from tensorflow.keras import layers, models

# Load pre-trained model (without the top classification layer)
base_model = ResNet50(
    weights='imagenet',
    include_top=False,
    input_shape=(224, 224, 3)
)

# Freeze base model layers
base_model.trainable = False

# Add custom layers for your task
model = models.Sequential([
    base_model,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax')  # 10 classes
])

# Train only the new layers
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(train_data, epochs=10)
```

🚀 Why Transfer Learning?

  • Train with less data (hundreds vs millions of images)
  • Faster training (hours vs days/weeks)
  • Better accuracy (leverages learned features)
  • Lower computational cost
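To see why this is cheaper, count the parameters actually being trained in the transfer-learning model above (a back-of-the-envelope sketch; 2048 is ResNet-50's feature dimension after global average pooling):

```python
def dense_params(n_in, n_out):
    """Weights plus biases for a fully connected layer."""
    return n_in * n_out + n_out

# Dense(256) on 2048 features, then Dense(10) on 256 features
trainable = dense_params(2048, 256) + dense_params(256, 10)
print(trainable)  # prints: 527114
# versus roughly 23.6 million frozen parameters inside the ResNet-50 base
```

Training about half a million parameters instead of tens of millions is what makes transfer learning feasible on a single GPU with a small dataset.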


🎯 What's Next?

You now understand computer vision, CNNs, and object detection! In the final module, we'll learn how to deploy AI models to production - making your models accessible to users!