
Module 6: Natural Language Processing

Master text processing, sentiment analysis, NER, and modern NLP with transformers

📝 What is Natural Language Processing?

Imagine teaching a computer to read and understand human language - not just words, but meaning, emotion, and context. That's Natural Language Processing (NLP)! It's how computers make sense of text and speech, from simple spell-check to complex chatbots.

Simple Definition

Natural Language Processing is a branch of AI that helps computers understand, interpret, and generate human language. It bridges the gap between human communication and computer understanding.

Example: Understanding "I love this!"

• Words: "I", "love", "this"

• Meaning: Positive sentiment

• Context: Expression of enthusiasm

🌟 Real-World Applications:

  • Virtual Assistants: Siri, Alexa, Google Assistant
  • Translation: Google Translate, DeepL
  • Sentiment Analysis: Analyzing customer reviews
  • Spam Detection: Email filtering
  • Autocomplete: Search suggestions, text prediction
  • Chatbots: Customer service automation

NLP Tasks

Text Classification

Categorize text (spam/not spam, positive/negative)

Named Entity Recognition

Find names, places, organizations in text

Sentiment Analysis

Determine emotion or opinion in text

Machine Translation

Translate between languages

Question Answering

Answer questions based on context

Text Generation

Create human-like text

🧹 Text Preprocessing

Raw text is messy! Before feeding text to ML models, we need to clean and prepare it. Think of it like washing vegetables before cooking - essential for good results!

Common Preprocessing Steps

1. Tokenization

Split text into individual words or pieces (tokens). Like breaking a sentence into words.

# Using NLTK
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer data (needed once)

text = "Hello! How are you?"
tokens = word_tokenize(text)
print(tokens)
# ['Hello', '!', 'How', 'are', 'you', '?']

2. Lowercasing

Convert all text to lowercase so "Hello" and "hello" are treated the same.

text = "Hello World"
lower_text = text.lower()
print(lower_text)  # "hello world"

3. Remove Punctuation & Special Characters

Remove symbols that don't add meaning (usually).

import string

text = "Hello, world! How are you?"
clean_text = text.translate(str.maketrans('', '', string.punctuation))
print(clean_text)  # "Hello world How are you"

4. Remove Stop Words

Remove common words that don't add much meaning (the, is, at, which, on).

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # stop-word lists (needed once)
stop_words = set(stopwords.words('english'))

words = ["this", "is", "a", "great", "movie"]
filtered = [w for w in words if w not in stop_words]
print(filtered)  # ['great', 'movie']

5. Stemming

Reduce words to their root form by chopping off endings. Fast but crude.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "runs", "ran", "runner"]
stems = [stemmer.stem(w) for w in words]
print(stems)  # ['run', 'run', 'ran', 'runner']

6. Lemmatization

Reduce words to their dictionary form (lemma). More accurate than stemming.

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # lemma dictionary (needed once)
lemmatizer = WordNetLemmatizer()

words = ["running", "runs", "ran", "better"]
lemmas = [lemmatizer.lemmatize(w, pos='v') for w in words]
print(lemmas)  # ['run', 'run', 'run', 'better']

💡 When to Use What:

  • Stemming: Fast, good for search engines and simple tasks
  • Lemmatization: More accurate, better for understanding meaning
  • Stop words: Remove for classification, keep for sentiment analysis
  • Lowercasing: Almost always, unless case matters (names, acronyms)
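The steps above chain naturally into one helper. Here is a minimal sketch using only the standard library, with a tiny hand-picked stop-word list standing in for NLTK's full English list (which has roughly 180 entries):

```python
import string

# Tiny illustrative stop-word list; use NLTK's stopwords.words('english')
# for real work
STOP_WORDS = {"the", "is", "a", "an", "this", "was", "are", "of", "it"}

def preprocess(text):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stop words."""
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("This movie was AMAZING, wasn't it?"))
# ['movie', 'amazing', 'wasnt']
```

Note that stripping punctuation turns "wasn't" into "wasnt"; real pipelines often tokenize before removing punctuation to handle contractions more gracefully.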

🔤 Word Embeddings

Computers don't understand words - they need numbers! Word embeddings convert words into vectors (lists of numbers) that capture meaning. Similar words have similar vectors!

The Magic of Embeddings

Words with similar meanings are close together in vector space. You can even do math with words!

Famous example:

king - man + woman ≈ queen

This actually works with word vectors!
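You can see the idea with toy vectors. The 3-dimensional numbers below are invented for illustration (real embeddings have hundreds of learned dimensions), but the arithmetic is the same:

```python
import numpy as np

# Toy 3-d vectors invented for illustration: the axes loosely encode
# (royalty, male-ness, person-ness). Real embeddings are learned from data.
vectors = {
    "king":  np.array([0.9, 0.9, 0.8]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.9]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman ...
target = vectors["king"] - vectors["man"] + vectors["woman"]

# ... is closest to queen
best = max(vectors, key=lambda w: cosine(vectors[w], target))
print(best)  # queen
```

With real embeddings you would use a library helper instead, e.g. Gensim's most_similar(positive=['king', 'woman'], negative=['man']).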

Popular Embedding Methods

Word2Vec

Learns word vectors by predicting context. Words that appear in similar contexts get similar vectors.

# Using Gensim for Word2Vec
from gensim.models import Word2Vec

# Sample sentences (tokenized)
sentences = [
    ['cat', 'sits', 'on', 'mat'],
    ['dog', 'sits', 'on', 'floor'],
    ['cat', 'and', 'dog', 'are', 'friends']
]

# Train Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

# Get vector for a word
vector = model.wv['cat']
print(vector.shape)  # (100,) - 100-dimensional vector

# Find similar words
similar = model.wv.most_similar('cat', topn=3)
print(similar)  # e.g. [('dog', 0.95), ...] - scores vary per training run

GloVe (Global Vectors)

Pre-trained on massive text corpora. You can download and use immediately!

# Load pre-trained GloVe embeddings
import numpy as np

def load_glove(file_path):
    embeddings = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.array(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

# Download from: https://nlp.stanford.edu/projects/glove/
glove = load_glove('glove.6B.100d.txt')
print(glove['king'].shape)  # (100,)

🎯 Why Embeddings Matter:

  • Capture semantic meaning (synonyms are close)
  • Enable transfer learning (use pre-trained embeddings)
  • Work better than one-hot encoding
  • Foundation for modern NLP (BERT, GPT use embeddings)
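The one-hot comparison is easy to see concretely. In a one-hot scheme every word gets its own axis, so every pair of distinct words looks equally unrelated:

```python
import numpy as np

# One-hot encoding: each word is a vocabulary-sized vector with a single 1
vocab = ["cat", "dog", "car"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# The dot product (and cosine similarity) of any two DISTINCT one-hot
# vectors is 0, so one-hot cannot express that "cat" is closer in meaning
# to "dog" than to "car". Dense embeddings can.
print(one_hot["cat"] @ one_hot["dog"])  # 0.0
print(one_hot["cat"] @ one_hot["car"])  # 0.0
```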

😊 Sentiment Analysis Project

Let's build a complete sentiment analysis system that classifies movie reviews as positive or negative!

Complete Code Example

# Import libraries
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Sample data (a real project would use thousands of labeled reviews)
reviews = [
    "This movie was amazing! I loved it.",
    "Terrible film, waste of time.",
    "Best movie I've seen this year!",
    "Boring and predictable.",
    "Absolutely fantastic performance!",
    "Worst movie ever made."
]
labels = [1, 0, 1, 0, 1, 0]  # 1=positive, 0=negative

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.3, random_state=42
)

# Convert text to numbers using TF-IDF
vectorizer = TfidfVectorizer(max_features=100)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train classifier
model = MultinomialNB()
model.fit(X_train_vec, y_train)

# Make predictions
predictions = model.predict(X_test_vec)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2%}")

# Test on new review
new_review = ["This movie was incredible!"]
new_vec = vectorizer.transform(new_review)
prediction = model.predict(new_vec)[0]
sentiment = "Positive" if prediction == 1 else "Negative"
print(f"Sentiment: {sentiment}")

📊 What is TF-IDF?

TF-IDF (Term Frequency-Inverse Document Frequency) measures how important a word is to a document. Common words like "the" get low scores, unique words get high scores.
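Here is a from-scratch sketch of the basic formula (library implementations such as scikit-learn's TfidfVectorizer use smoothed variants, so exact numbers differ):

```python
import math

# A toy corpus of tokenized documents
docs = [
    ["the", "movie", "was", "great"],
    ["the", "movie", "was", "terrible"],
    ["the", "acting", "was", "great"],
]

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)           # term frequency within the document
    df = sum(1 for d in corpus if term in d)  # number of documents containing the term
    idf = math.log(len(corpus) / df)          # inverse document frequency
    return tf * idf

# "the" appears in every document, so idf = log(3/3) = 0 and its score is 0
print(tf_idf("the", docs[0], docs))  # 0.0

# "terrible" appears in only one document, so it scores high there
print(round(tf_idf("terrible", docs[1], docs), 3))  # 0.275
```

This is why TF-IDF works well as input to the classifier above: distinctive words dominate the feature vectors while filler words vanish.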

🏷️ Named Entity Recognition (NER)

NER finds and classifies named entities in text - people, organizations, locations, dates, etc. It's like highlighting important information automatically!

Example

Text: "Apple Inc. was founded by Steve Jobs in California."

Entities:

Apple Inc. → ORGANIZATION

Steve Jobs → PERSON

California → LOCATION

Using spaCy for NER

# Install spaCy and download the English model (run in a terminal):
#   pip install spacy
#   python -m spacy download en_core_web_sm

# Import and load model
import spacy

nlp = spacy.load("en_core_web_sm")

# Process text
text = "Elon Musk founded SpaceX in California in 2002."
doc = nlp(text)

# Extract entities
for ent in doc.ents:
    print(f"{ent.text} - {ent.label_}")

# Output:
# Elon Musk - PERSON
# SpaceX - ORG
# California - GPE (Geo-Political Entity)
# 2002 - DATE

# Visualize entities (in Jupyter)
from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)

Common Entity Types

PERSON

People, including fictional

ORG

Companies, agencies, institutions

GPE

Countries, cities, states

DATE

Absolute or relative dates

MONEY

Monetary values

PRODUCT

Objects, vehicles, foods, etc.

🎯 NER Use Cases:

  • Information Extraction: Extract key facts from documents
  • Content Recommendation: Understand article topics
  • Customer Support: Identify products, dates in queries
  • Resume Parsing: Extract names, skills, companies
  • News Analysis: Track mentions of people/organizations

🤗 Hugging Face Transformers

Hugging Face is like a library of pre-trained AI models. Instead of training from scratch, you can use state-of-the-art models with just a few lines of code!

What is Hugging Face?

Hugging Face provides thousands of pre-trained models for NLP tasks. It's the GitHub of AI models - you can download, use, and even upload your own models!

Quick Start Examples

Sentiment Analysis (No Training Required!)

# Install transformers (run in a terminal):
#   pip install transformers

# Use pre-trained sentiment model
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier("I love this product!")
print(result)
# [{'label': 'POSITIVE', 'score': 0.9998}]

Text Generation

generator = pipeline("text-generation", model="gpt2")

result = generator(

"Artificial intelligence is",

max_length=50,

num_return_sequences=1

)

print(result[0]['generated_text'])

Question Answering

qa_pipeline = pipeline("question-answering")

context = "Python is a programming language. It was created by Guido van Rossum."

question = "Who created Python?"

result = qa_pipeline(question=question, context=context)

print(result['answer']) # "Guido van Rossum"

Translation

translator = pipeline("translation_en_to_fr")

result = translator("Hello, how are you?")

print(result[0]['translation_text'])

# "Bonjour, comment allez-vous?"

Popular Models

| Model | Best For | Size |
|---|---|---|
| BERT | Understanding text, classification | 110M params |
| RoBERTa | Improved BERT, better performance | 125M params |
| DistilBERT | Faster, smaller BERT (retains ~97% of BERT's performance) | 66M params |
| GPT-2 | Text generation | 1.5B params |
| T5 | Text-to-text (translation, summary) | 220M params |

🚀 Why Hugging Face is Amazing:

  • Pre-trained Models: Use state-of-the-art models immediately
  • Easy API: Just a few lines of code
  • Model Hub: 100,000+ models to choose from
  • Fine-tuning: Adapt models to your specific task
  • Community: Active community and great documentation

📚 Learning Resources

Libraries & Tools

  • NLTK - Natural Language Toolkit
  • spaCy - Industrial-strength NLP
  • Hugging Face - Transformers library
  • Gensim - Topic modeling and embeddings


🎯 What's Next?

You now understand NLP fundamentals, text processing, and modern transformers! Next, we'll explore Computer Vision - teaching computers to see and understand images.