Master image processing, CNNs, object detection, and real-world computer vision applications
Imagine teaching a computer to see and understand images like humans do - recognizing faces, reading signs, detecting objects, understanding scenes. That's Computer Vision! It's how self-driving cars see the road and how your phone unlocks with your face.
Computer Vision is a field of AI that trains computers to interpret and understand visual information from the world. It extracts meaningful information from images and videos.
Example: Looking at a photo
• Human sees: "A cat sitting on a couch"
• Computer sees: Millions of pixel values
• Computer Vision: Teaches computer to understand it's a cat!
Image Classification
What is in this image? (cat, dog, car)
Object Detection
Where are objects? (bounding boxes)
Image Segmentation
Pixel-level classification
Face Recognition
Identify specific people
Pose Estimation
Detect body keypoints
Image Generation
Create new images (GANs, Diffusion)
Before we can process images, we need to understand what they are! To a computer, an image is just a grid of numbers representing colors.
An image is made of pixels (tiny dots). Each pixel has color values in different channels (Red, Green, Blue).
Grayscale Image:
• 1 channel (intensity: 0-255)
• Shape: (height, width)
• Example: 28×28 = 784 pixels
Color Image (RGB):
• 3 channels (Red, Green, Blue: 0-255 each)
• Shape: (height, width, 3)
• Example: 224×224×3 = 150,528 values
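You can see this grid-of-numbers idea directly with NumPy. A minimal sketch using synthetic images (zero-filled arrays instead of real photo files):

```python
import numpy as np

# A synthetic 28x28 grayscale image: one intensity value per pixel
gray = np.zeros((28, 28), dtype=np.uint8)
gray[10, 5] = 255        # set one pixel to white
print(gray.shape)        # (28, 28)
print(gray.size)         # 784 pixels

# A synthetic 224x224 RGB image: three values (R, G, B) per pixel
rgb = np.zeros((224, 224, 3), dtype=np.uint8)
rgb[0, 0] = [255, 0, 0]  # top-left pixel is pure red
print(rgb.shape)         # (224, 224, 3)
print(rgb.size)          # 150528 values
```

The `.size` numbers match the examples above: a grayscale image stores one number per pixel, a color image stores three.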
# Import libraries
import cv2 # OpenCV
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
# Read an image
img = cv2.imread('cat.jpg')
print(img.shape) # (height, width, channels)
print(img.dtype) # uint8 (0-255)
# Convert BGR to RGB (OpenCV uses BGR!)
img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
# Display image
plt.imshow(img_rgb)
plt.axis('off')
plt.show()
# Convert to grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
print(gray.shape) # (height, width) - no channel dimension
# Access pixel value
pixel = img[100, 200] # row 100, col 200
print(pixel) # [B, G, R] values
Before feeding images to ML models, we need to prepare them. This includes resizing, normalizing, and augmenting images to improve model performance.
Neural networks expect fixed-size inputs. Resize all images to the same dimensions.
# Resize to 224x224 (common for CNNs)
# Note: cv2.resize takes (width, height), not (height, width)
resized = cv2.resize(img, (224, 224))
print(resized.shape) # (224, 224, 3)
Scale pixel values to 0-1 range or standardize. Helps neural networks train faster.
# Scale to 0-1 (use the RGB image so channel order matches the stats below)
normalized = img_rgb.astype('float32') / 255.0
# Standardize with ImageNet per-channel mean/std (R, G, B order)
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])
standardized = (normalized - mean) / std
Create variations of images to increase training data and prevent overfitting.
# Flip horizontally
flipped = cv2.flip(img, 1)
# Rotate
rows, cols = img.shape[:2]
M = cv2.getRotationMatrix2D((cols/2, rows/2), 45, 1)
rotated = cv2.warpAffine(img, M, (cols, rows))
# Adjust brightness
bright = cv2.convertScaleAbs(img, alpha=1.2, beta=30)
# Add Gaussian noise (work in a wider dtype and clip, so negative
# noise values don't wrap around in uint8)
noise = np.random.normal(0, 25, img.shape)
noisy = np.clip(img.astype('int16') + noise, 0, 255).astype('uint8')
If you only have 1,000 images, augmentation can create 10,000+ variations! This helps the model learn to recognize objects from different angles, lighting, and conditions.
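To make that multiplication concrete, here is a NumPy-only sketch of how a few simple augmentations (a horizontal flip and brightness shifts, like the cv2 versions above) fan one image out into several training examples:

```python
import numpy as np

def augment(img):
    """Yield simple variations of an image: flips and brightness shifts."""
    variants = []
    for flipped in (img, img[:, ::-1]):      # original + horizontal flip
        for shift in (-30, 0, 30):           # darker, unchanged, brighter
            v = np.clip(flipped.astype(np.int16) + shift, 0, 255).astype(np.uint8)
            variants.append(v)
    return variants

img = np.full((4, 4, 3), 128, dtype=np.uint8)  # dummy uniform gray image
variants = augment(img)
print(len(variants))  # 6 variations from a single image
```

With just 2 flips × 3 brightness levels you already get 6 images from 1; add rotations, crops, and zooms and the 10x figure above follows quickly.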
Let's build a complete image classifier using a Convolutional Neural Network (CNN)!
# Import libraries
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.image import ImageDataGenerator
# Build CNN model
model = models.Sequential([
    # Convolutional layers
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 3)),
    layers.MaxPooling2D(2, 2),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D(2, 2),
    layers.Conv2D(128, (3, 3), activation='relu'),
    layers.MaxPooling2D(2, 2),
    # Flatten and dense layers
    layers.Flatten(),
    layers.Dense(512, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid')  # Binary classification
])
# Compile model
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)
# Data augmentation
train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True
)
# Load training data
train_generator = train_datagen.flow_from_directory(
    'data/train',
    target_size=(150, 150),
    batch_size=32,
    class_mode='binary'
)
# Train model
history = model.fit(
    train_generator,
    epochs=20,
    steps_per_epoch=100
)
# Make prediction on new image
from tensorflow.keras.preprocessing import image
img = image.load_img('test_cat.jpg', target_size=(150, 150))
img_array = image.img_to_array(img) / 255.0
img_array = np.expand_dims(img_array, axis=0)
prediction = model.predict(img_array)[0][0]
print("Dog" if prediction > 0.5 else "Cat")  # class index follows alphabetical folder order
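As a sanity check on the architecture above, you can trace the tensor shapes by hand: each 3×3 convolution with 'valid' padding trims 2 pixels from each spatial dimension, and each 2×2 max-pool halves it (floor division). A small helper to do the arithmetic (shape bookkeeping only, no TensorFlow needed):

```python
def trace_shapes(size):
    """Follow one spatial dimension through the three Conv/Pool blocks above."""
    shapes = [size]
    for _ in range(3):      # three Conv2D(3x3) + MaxPooling2D(2x2) blocks
        size = size - 2     # 3x3 conv, 'valid' padding
        size = size // 2    # 2x2 max-pool
        shapes.append(size)
    return shapes

shapes = trace_shapes(150)
print(shapes)                        # [150, 74, 36, 17]
flat = shapes[-1] * shapes[-1] * 128
print(flat)                          # 36992 features into the Dense(512) layer
```

This kind of back-of-the-envelope check catches shape mismatches before you ever call `model.fit()`.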
Object detection doesn't just classify images - it finds WHERE objects are! YOLO (You Only Look Once) is one of the fastest and most popular object detection algorithms.
YOLO divides the image into a grid and predicts bounding boxes and class probabilities for each grid cell. It's incredibly fast - can process video in real-time!
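To make the grid idea concrete, here is a hedged sketch (not YOLO's actual code) of the core assignment rule: the grid cell containing an object's center point is the one responsible for predicting that object. A 7×7 grid is assumed for illustration:

```python
def responsible_cell(cx, cy, img_w, img_h, grid=7):
    """Return the (row, col) grid cell containing a box center given in pixels."""
    col = int(cx / img_w * grid)
    row = int(cy / img_h * grid)
    # clamp centers that sit exactly on the right/bottom image edge
    return min(row, grid - 1), min(col, grid - 1)

# A 640x480 image with an object centered at pixel (320, 100)
print(responsible_cell(320, 100, 640, 480))  # (1, 3)
```

Every cell predicts boxes and class scores in one forward pass, which is why YOLO is fast enough for real-time video.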
# Install ultralytics (YOLOv8) first, in a terminal:
#   pip install ultralytics
# Import and load model
from ultralytics import YOLO
model = YOLO('yolov8n.pt') # nano model (fastest)
# Detect objects in image
results = model('street.jpg')
# Display results
for result in results:
    for box in result.boxes:
        class_id = int(box.cls[0])
        confidence = float(box.conf[0])
        bbox = box.xyxy[0].tolist()
        print(f"Detected: {model.names[class_id]} ({confidence:.2f})")
    # Save annotated image
    result.save('output.jpg')
# Real-time video detection
cap = cv2.VideoCapture(0)  # Webcam
while True:
    ret, frame = cap.read()
    if not ret:
        break
    results = model(frame)
    annotated = results[0].plot()
    cv2.imshow('YOLO', annotated)
    if cv2.waitKey(1) == ord('q'):
        break
cap.release()  # release the camera and close the window when done
cv2.destroyAllWindows()
Don't train from scratch! Use pre-trained models that have learned from millions of images.
| Model | Task | Accuracy | Speed |
|---|---|---|---|
| ResNet-50 | Classification | 76% Top-1 | Fast |
| EfficientNet | Classification | 84% Top-1 | Medium |
| YOLOv8 | Object Detection | 53 mAP | Very Fast |
| Mask R-CNN | Instance Segmentation | High | Slow |
# Use pre-trained ResNet50
from tensorflow.keras.applications import ResNet50
from tensorflow.keras import layers, models
# Load pre-trained model (without top layer)
base_model = ResNet50(
    weights='imagenet',
    include_top=False,
    input_shape=(224, 224, 3)
)
# Freeze base model layers
base_model.trainable = False
# Add custom layers for your task
model = models.Sequential([
    base_model,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax')  # 10 classes
])
# Train only the new layers
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(train_data, epochs=10)  # train_data: your prepared dataset or generator
You now understand computer vision, CNNs, and object detection! In the final module, we'll learn how to deploy AI models to production - making your models accessible to users!