Master text processing, sentiment analysis, NER, and modern NLP with transformers
Imagine teaching a computer to read and understand human language - not just words, but meaning, emotion, and context. That's Natural Language Processing (NLP)! It's how computers make sense of text and speech, from simple spell-check to complex chatbots.
Natural Language Processing is a branch of AI that helps computers understand, interpret, and generate human language. It bridges the gap between human communication and computer understanding.
Example: Understanding "I love this!"
• Words: "I", "love", "this"
• Meaning: Positive sentiment
• Context: Expression of enthusiasm
• Text Classification: Categorize text (spam/not spam, positive/negative)
• Named Entity Recognition: Find names, places, organizations in text
• Sentiment Analysis: Determine emotion or opinion in text
• Machine Translation: Translate between languages
• Question Answering: Answer questions based on context
• Text Generation: Create human-like text
Raw text is messy! Before feeding text to ML models, we need to clean and prepare it. Think of it like washing vegetables before cooking - essential for good results!
Split text into individual words or pieces (tokens). Like breaking a sentence into words.
# Using NLTK
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')  # one-time download of the tokenizer data
text = "Hello! How are you?"
tokens = word_tokenize(text)
print(tokens)
# ['Hello', '!', 'How', 'are', 'you', '?']
Convert all text to lowercase so "Hello" and "hello" are treated the same.
text = "Hello World"
lower_text = text.lower()
print(lower_text) # "hello world"
Remove punctuation marks that usually don't add meaning (though they sometimes do - "!" can signal sentiment).
import string
text = "Hello, world! How are you?"
clean_text = text.translate(str.maketrans('', '', string.punctuation))
print(clean_text) # "Hello world How are you"
Remove common words that don't add much meaning (the, is, at, which, on).
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')  # one-time download of the stopword lists
stop_words = set(stopwords.words('english'))
words = ["this", "is", "a", "great", "movie"]
filtered = [w for w in words if w not in stop_words]
print(filtered) # ['great', 'movie']
Reduce words to their root form by chopping off endings. Fast but crude.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
words = ["running", "runs", "ran", "runner"]
stems = [stemmer.stem(w) for w in words]
print(stems) # ['run', 'run', 'ran', 'runner']
Reduce words to their dictionary form (lemma). More accurate than stemming.
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')  # one-time download of the WordNet data
lemmatizer = WordNetLemmatizer()
words = ["running", "runs", "ran", "better"]
lemmas = [lemmatizer.lemmatize(w, pos='v') for w in words]
print(lemmas) # ['run', 'run', 'run', 'better']
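Putting the cleaning steps together, here is a minimal end-to-end preprocessing function. It uses plain Python only, so a small hand-picked stopword set stands in for NLTK's list and `str.split` stands in for a real tokenizer - in practice you would swap in `stopwords.words('english')` and `word_tokenize`:

```python
import string

# Tiny stand-in stopword set; use NLTK's stopwords.words('english') in practice
STOP_WORDS = {"the", "is", "a", "an", "this", "was", "are", "and", "i"}

def preprocess(text):
    """Lowercase, strip punctuation, tokenize, and drop stopwords."""
    text = text.lower()                                               # 1. lowercase
    text = text.translate(str.maketrans('', '', string.punctuation))  # 2. remove punctuation
    tokens = text.split()                                             # 3. naive tokenization
    return [t for t in tokens if t not in STOP_WORDS]                 # 4. remove stopwords

print(preprocess("This movie was GREAT, I loved it!"))
# ['movie', 'great', 'loved', 'it']
```

The order matters: lowercasing must come before the stopword check (otherwise "This" slips past "this"), and punctuation removal before tokenization keeps "it!" from surviving as its own token.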
Computers don't understand words - they need numbers! Word embeddings convert words into vectors (lists of numbers) that capture meaning. Similar words have similar vectors!
Words with similar meanings are close together in vector space. You can even do math with words!
Famous example:
king - man + woman ≈ queen
This actually works with word vectors!
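To see why the arithmetic works, here is a toy illustration with hand-made 3-dimensional vectors. The numbers are invented for the demo (dimensions loosely meaning "royalty", "maleness", "femaleness"); real embeddings learn hundreds of opaque dimensions instead, but the mechanics are the same:

```python
import math

# Invented 3-d vectors: [royalty, maleness, femaleness]
vectors = {
    "king":   [0.9, 0.9, 0.1],
    "queen":  [0.9, 0.1, 0.9],
    "man":    [0.1, 0.9, 0.1],
    "woman":  [0.1, 0.1, 0.9],
    "castle": [0.8, 0.1, 0.1],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Compute king - man + woman, element by element
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

# Find the closest word to the result, excluding the three input words
best = max((w for w in vectors if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, vectors[w]))
print(best)  # queen
```

Subtracting "man" removes the maleness component, adding "woman" adds femaleness, and the royalty component carries through untouched - landing right on "queen".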
Learns word vectors by predicting context. Words that appear in similar contexts get similar vectors.
# Using Gensim for Word2Vec
from gensim.models import Word2Vec
# Sample sentences (tokenized)
sentences = [
['cat', 'sits', 'on', 'mat'],
['dog', 'sits', 'on', 'floor'],
['cat', 'and', 'dog', 'are', 'friends']
]
# Train Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
# Get vector for a word
vector = model.wv['cat']
print(vector.shape) # (100,) - 100-dimensional vector
# Find similar words (scores are noisy on such a tiny corpus)
similar = model.wv.most_similar('cat', topn=3)
print(similar) # e.g. [('dog', ...), ...]
Pre-trained on massive text corpora. You can download and use immediately!
# Load pre-trained GloVe embeddings
import numpy as np
def load_glove(file_path):
embeddings = {}
with open(file_path, 'r', encoding='utf-8') as f:
for line in f:
values = line.split()
word = values[0]
vector = np.array(values[1:], dtype='float32')
embeddings[word] = vector
return embeddings
# Download from: https://nlp.stanford.edu/projects/glove/
glove = load_glove('glove.6B.100d.txt')
print(glove['king'].shape) # (100,)
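A common next step with a loaded embedding dict like `glove` is to average a sentence's word vectors into one sentence vector for classification or search. A sketch, using a tiny invented embedding dict in place of the real GloVe file and skipping out-of-vocabulary words:

```python
# Tiny invented embeddings standing in for the loaded GloVe dict
embeddings = {
    "good":  [0.8, 0.1],
    "movie": [0.2, 0.9],
    "great": [0.7, 0.2],
}

def sentence_vector(tokens, embeddings):
    """Average the vectors of known words; skip out-of-vocabulary tokens."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None  # no known words at all
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

print(sentence_vector(["good", "movie", "unknownword"], embeddings))
# averages the "good" and "movie" vectors; the unknown word is ignored
```

Averaging loses word order, but it is a surprisingly strong baseline and works with any pre-trained embedding file.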
Let's build a complete sentiment analysis system that classifies movie reviews as positive or negative!
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
# Sample data
reviews = [
"This movie was amazing! I loved it.",
"Terrible film, waste of time.",
"Best movie I've seen this year!",
"Boring and predictable.",
"Absolutely fantastic performance!",
"Worst movie ever made."
]
labels = [1, 0, 1, 0, 1, 0] # 1=positive, 0=negative
# Split data
X_train, X_test, y_train, y_test = train_test_split(
reviews, labels, test_size=0.3, random_state=42
)
# Convert text to numbers using TF-IDF
vectorizer = TfidfVectorizer(max_features=100)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
# Train classifier
model = MultinomialNB()
model.fit(X_train_vec, y_train)
# Make predictions
predictions = model.predict(X_test_vec)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2%}")
# Test on new review
new_review = ["This movie was incredible!"]
new_vec = vectorizer.transform(new_review)
prediction = model.predict(new_vec)[0]
sentiment = "Positive" if prediction == 1 else "Negative"
print(f"Sentiment: {sentiment}")
TF-IDF (Term Frequency-Inverse Document Frequency) measures how important a word is to a document. Common words like "the" get low scores, unique words get high scores.
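The formula is simple enough to compute by hand: tf is the word's count divided by the document length, and idf is log(N / number of documents containing the word). A minimal sketch (note that scikit-learn's TfidfVectorizer uses a smoothed, normalized variant, so its exact numbers differ):

```python
import math

docs = [["the", "movie", "was", "great"],
        ["the", "movie", "was", "boring"]]

def tf_idf(word, doc, docs):
    tf = doc.count(word) / len(doc)                # how frequent in this document
    df = sum(1 for d in docs if word in d)         # how many documents contain it
    idf = math.log(len(docs) / df)                 # rare across docs -> high idf
    return tf * idf

print(tf_idf("the", docs[0], docs))    # 0.0 - appears in every doc, carries no signal
print(tf_idf("great", docs[0], docs))  # > 0 - unique to this doc, so it scores high
```

"the" appears in both documents, so its idf is log(2/2) = 0 and the whole score collapses to zero, which is exactly the behavior described above.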
NER finds and classifies named entities in text - people, organizations, locations, dates, etc. It's like highlighting important information automatically!
Text: "Apple Inc. was founded by Steve Jobs in California."
Entities:
• Apple Inc. → ORGANIZATION
• Steve Jobs → PERSON
• California → LOCATION
# Install spaCy and download model
pip install spacy
python -m spacy download en_core_web_sm
# Import and load model
import spacy
nlp = spacy.load("en_core_web_sm")
# Process text
text = "Elon Musk founded SpaceX in California in 2002."
doc = nlp(text)
# Extract entities
for ent in doc.ents:
print(f"{ent.text} - {ent.label_}")
# Output:
# Elon Musk - PERSON
# SpaceX - ORG
# California - GPE (Geo-Political Entity)
# 2002 - DATE
# Visualize entities (in Jupyter)
from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)
• PERSON: People, including fictional
• ORG: Companies, agencies, institutions
• GPE: Countries, cities, states
• DATE: Absolute or relative dates
• MONEY: Monetary values
• PRODUCT: Objects, vehicles, foods, etc.
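Once spaCy has tagged the entities, a common next step is grouping them by label. A sketch using plain (text, label) pairs, like those you would pull from `doc.ents` via `(ent.text, ent.label_)`:

```python
from collections import defaultdict

# (text, label) pairs as extracted from spaCy's doc.ents
entities = [("Elon Musk", "PERSON"), ("SpaceX", "ORG"),
            ("California", "GPE"), ("2002", "DATE")]

# Collect all entity texts under their label
by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

print(dict(by_label))
# {'PERSON': ['Elon Musk'], 'ORG': ['SpaceX'], 'GPE': ['California'], 'DATE': ['2002']}
```

This pattern makes it easy to answer questions like "which organizations does this article mention?" in one dictionary lookup.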
Hugging Face is like a library of pre-trained AI models. Instead of training from scratch, you can use state-of-the-art models with just a few lines of code!
The Hugging Face Hub hosts thousands of pre-trained models for NLP tasks - it's the GitHub of AI models, where you can download, use, and even upload your own.
# Install transformers
pip install transformers
# Use pre-trained sentiment model
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
result = classifier("I love this product!")
print(result)
# [{'label': 'POSITIVE', 'score': 0.9998}]
generator = pipeline("text-generation", model="gpt2")
result = generator(
"Artificial intelligence is",
max_length=50,
num_return_sequences=1
)
print(result[0]['generated_text'])
qa_pipeline = pipeline("question-answering")
context = "Python is a programming language. It was created by Guido van Rossum."
question = "Who created Python?"
result = qa_pipeline(question=question, context=context)
print(result['answer']) # "Guido van Rossum"
translator = pipeline("translation_en_to_fr")
result = translator("Hello, how are you?")
print(result[0]['translation_text'])
# "Bonjour, comment allez-vous?"
| Model | Best For | Size |
|---|---|---|
| BERT | Understanding text, classification | 110M params (base) |
| RoBERTa | Improved BERT training, better performance | 125M params (base) |
| DistilBERT | Faster, smaller BERT (keeps ~97% of its performance) | 66M params |
| GPT-2 | Text generation | 124M-1.5B params |
| T5 | Text-to-text (translation, summary) | 220M params (base) |
You now understand NLP fundamentals, text processing, and modern transformers! Next, we'll explore Computer Vision - teaching computers to see and understand images.