🔍 Use Case · NLP · Retail Intelligence

Extracting Product Reviews for Sentiment Analysis

Customer reviews contain the most honest signal your product team will ever get. This use case shows how to systematically extract, structure, and run sentiment analysis on scraped product reviews — turning thousands of unstructured opinions into actionable product intelligence.

18K+ · Reviews in dataset
94% · Classifier accuracy
6 · Review sources
3.2× · Faster than manual tagging
Sentiment Distribution · Live Sample

Positive · 62%
Negative · 21%
Neutral · 17%

Example classifications:

✓ Positive · conf 0.97
“Battery life is genuinely impressive — lasts 3 full days with heavy use.”

✕ Negative · conf 0.91
“Stopped working after 6 weeks. Customer support took 12 days to respond.”

– Neutral · conf 0.78
“Arrived on time. Does what it says. Nothing exceptional either way.”

The Problem

Why manually reading reviews doesn’t scale

A mid-sized e-commerce brand with 200 products and a presence on Amazon, Trustpilot, and its own site accumulates roughly 800–1,200 new reviews per week. A product manager reading even 20% of those would spend 6+ hours per week on raw qualitative data — with no structure, no trends, and no way to connect a spike in negative reviews to a specific manufacturing batch or UI change.

Sentiment analysis automates the first layer of this work: it reads every review, labels it positive / negative / neutral with a confidence score, extracts the key topic (packaging, delivery, product quality, customer service), and flags anomalies — all in milliseconds per review.
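To make that first layer concrete, here is a minimal sketch using the Hugging Face transformers pipeline. The checkpoint name is illustrative (any sentiment-tuned model works), and exact label names vary by checkpoint:

# Minimal first-pass sentiment labelling (sketch).
# Assumes: pip install transformers torch
# The checkpoint is illustrative; swap in one fine-tuned on your reviews.
from transformers import pipeline

clf = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)

reviews = [
    "Battery life is genuinely impressive - lasts 3 full days with heavy use.",
    "Stopped working after 6 weeks. Customer support took 12 days to respond.",
]

for review, result in zip(reviews, clf(reviews)):
    # result is e.g. {'label': 'positive', 'score': 0.97}
    print(f"{result['label']:<10} {result['score']:.2f}  {review[:48]}")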

💡

The key insight: Sentiment analysis doesn’t replace human judgment — it filters and prioritises so that humans spend their attention on the 5% of reviews that actually require a decision (escalation, refund, product recall signal), not the 95% that are routine feedback already captured by the star rating.

800+ · New reviews/week for a mid-size brand
6h · Manual reading time vs <2 min automated
31% · Of 1-star reviews contain actionable product bugs
14 days · Avg lag between product issue and team awareness (manual)

Extraction Pipeline

From raw review page to structured sentiment record

The pipeline has five stages. Each stage is independently maintainable — if Amazon changes its HTML structure, only the extractor needs updating, not the classifier or storage layer.

1. Source Discovery (sitemap + API). Identify all review surfaces: Amazon product pages, Trustpilot, brand site, Google Shopping, App Store, G2.
2. Scrape & Extract (Playwright / BS4). Playwright renders JS-heavy pages; CSS selectors extract reviewer name, date, star rating, title, and body text (see the extraction sketch after this list).
3. Clean & Normalise (pandas / langdetect). Strip HTML, de-duplicate cross-platform reviews by fingerprint, normalise dates to ISO 8601, detect language.
4. Classify Sentiment (RoBERTa / GPT-4o). Run fine-tuned RoBERTa or GPT-4o via API. Outputs: label, confidence score, aspect tags (quality / delivery / price).
5. Store & Alert (Postgres + Slack). Write to PostgreSQL. Trigger a Slack alert if a negative spike exceeds 15% above the 7-day rolling average for any product.
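A minimal sketch of stage 2, using Playwright's sync API plus BeautifulSoup. The URL and every CSS selector below are hypothetical placeholders; real selectors change often and belong in a per-source config:

# Stage 2 sketch: render a JS-heavy review page, extract structured fields.
# Assumes: pip install playwright beautifulsoup4 (then: playwright install chromium)
# The URL and all selectors are hypothetical placeholders.
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

REVIEW_URL = "https://example.com/product/B09XK2LM94/reviews"  # placeholder

def scrape_reviews(url: str) -> list[dict]:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # let client-side JS render
        html = page.content()
        browser.close()

    soup = BeautifulSoup(html, "html.parser")
    records = []
    for node in soup.select("div.review"):  # illustrative selector
        records.append({
            "reviewer": node.select_one(".reviewer-name").get_text(strip=True),
            "review_date": node.select_one("time")["datetime"],
            "star_rating": node.select_one(".star-rating").get_text(strip=True),
            "review_title": node.select_one(".review-title").get_text(strip=True),
            "review_body": node.select_one(".review-body").get_text(strip=True),
        })
    return records

print(scrape_reviews(REVIEW_URL))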

Here’s what a single classified review record looks like in the output database:

// Classified review record — sentiment_reviews table
{
  "review_id": "AMZ-GB-B09XK2LM-00841",
  "source": "amazon_uk",
  "product_asin": "B09XK2LM94",
  "product_name": "SoundPulse Pro Wireless Earbuds",
  "reviewer": "Verified Purchase — Sarah K.",
  "review_date": "2025-03-14T09:22:00Z",
  "star_rating": 2,
  "review_title": "Stopped working after 3 weeks",
  "review_body": "Left ear completely died. Charging case also stopped holding charge…",
  "sentiment_label": "NEGATIVE",
  "confidence": 0.94,
  "aspect_tags": ["product_quality", "hardware_failure", "battery"],
  "escalation_flag": true,  // hardware_failure tag triggers auto-flag
  "helpful_votes": 47,
  "verified_purchase": true,
  "language": "en"
}

Data Schema

Every field — and what it tells you about sentiment

The schema is split into two layers: extraction fields (scraped directly from the source) and enrichment fields (computed by the classifier pipeline). Understanding which is which matters enormously for debugging classifier errors.
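A sketch of that two-layer split as Python type hints. Field names mirror the record above; the exact types are assumptions:

# Two-layer schema sketch: extraction fields vs enrichment fields.
# Field names follow the sentiment_reviews record; types are assumptions.
from typing import TypedDict

class ExtractionFields(TypedDict):
    """Scraped directly from the source page."""
    review_id: str
    source: str
    product_asin: str
    product_name: str
    reviewer: str
    review_date: str        # normalised to ISO 8601 in stage 3
    star_rating: int
    review_title: str
    review_body: str
    helpful_votes: int
    verified_purchase: bool

class EnrichmentFields(TypedDict):
    """Computed by the classifier pipeline."""
    sentiment_label: str    # POSITIVE / NEGATIVE / NEUTRAL
    confidence: float       # model probability for the assigned label
    aspect_tags: list[str]  # e.g. ["product_quality", "battery"]
    escalation_flag: bool   # rule-based, e.g. hardware_failure tag
    language: str           # detected in stage 3 (langdetect)

The practical payoff of the split: if a label looks wrong, debug the enrichment layer (classifier, prompts, thresholds); if the body text is garbled, the bug is in the extractor.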


Live Dataset Sample

Classified review records

Below is a sample of 20 reviews from the dataset, across 3 products and 4 sources, one row per classified record, with escalation-flagged reviews marked.

[Table: Product · Source · Stars · Review · Sentiment · Confidence · Aspect · Date]

* Illustrative data. Confidence = model output probability for assigned label.


Sentiment Analysis

What the data tells you — by product category

Once reviews are classified, the real insight comes from aggregation — looking at sentiment per product, per time period, and per aspect. The sketch and summary figures below show the positive vs negative balance across the dataset.
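First, the aggregation step itself in pandas, assuming the records live in the sentiment_reviews table from stage 5. The connection string and the category column (joined from a product catalogue) are assumptions:

# Aggregation sketch: net sentiment score per product category.
# The DSN and the `category` column are assumptions for illustration.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@localhost/reviews")  # placeholder DSN
df = pd.read_sql("SELECT * FROM sentiment_reviews", engine)
df["review_date"] = pd.to_datetime(df["review_date"])

# Share of each label per category, then net score = %positive - %negative.
share = (
    df.groupby("category")["sentiment_label"]
      .value_counts(normalize=True)
      .unstack(fill_value=0)
)
share["net_score"] = 100 * (share["POSITIVE"] - share["NEGATIVE"])
print(share.sort_values("net_score"))  # lowest-scoring categories first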

Overall sentiment score · +41 net positive
Highest-risk category · Battery Life
Reviews analysed · 18,420

[Keyword clouds: most frequent positive / negative keywords]
⚠️

Watch signal: “Battery” appears in 68% of negative reviews for SoundPulse Pro, with a notable spike in week 11 (March 2025) — coinciding with batch #MP-2025-03 from the Shenzhen supplier. This cross-reference between sentiment timeline and production batches is the kind of insight that’s invisible without structured data.
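The week-11 spike is exactly what the stage-5 alert rule is built to catch. A sketch of that rule in pandas, reusing df from the aggregation sketch above; the 15% threshold comes from the pipeline description, and the Slack call is left as a placeholder:

# Alert sketch: flag products whose daily negative count exceeds
# their 7-day rolling average by more than 15%.
daily = (
    df.assign(is_neg=df["sentiment_label"].eq("NEGATIVE"))
      .groupby([pd.Grouper(key="review_date", freq="D"), "product_asin"])["is_neg"]
      .sum()
      .rename("neg_count")
      .reset_index()
      .sort_values("review_date")
)

# Baseline: per-product rolling mean over the previous 7 observed days.
daily["baseline"] = (
    daily.groupby("product_asin")["neg_count"]
         .transform(lambda s: s.rolling(7, min_periods=3).mean().shift(1))
)
spikes = daily[daily["neg_count"] > daily["baseline"] * 1.15]

for _, row in spikes.iterrows():
    # Placeholder: post this to your Slack webhook in production.
    print(f"ALERT {row['product_asin']}: {row['neg_count']:.0f} negative reviews "
          f"on {row['review_date']:%Y-%m-%d} (baseline {row['baseline']:.1f})")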


Classification Methods

Three approaches — and when to use each

There’s no single “best” sentiment classifier. The right choice depends on your data volume, accuracy requirements, latency budget, and whether you need aspect-level granularity (e.g. “negative about delivery” vs “negative about product”).

VADER / TextBlob · Accuracy ~78% · <1ms/review
Lexicon-based rule approach. No model training needed. Excellent for quick prototyping — understand the distribution of your dataset in an afternoon. Struggles with sarcasm and domain-specific language.

🧠 Fine-tuned RoBERTa · Accuracy ~92% · ~15ms/review
Pre-trained transformer fine-tuned on product review data (e.g. the amazon-polarity dataset). Handles nuance, negation, and domain vocabulary well. Requires a GPU for production speed. Best balance of accuracy and cost.

GPT-4o via API · Accuracy ~94% · ~800ms/review
Zero-shot classification with structured JSON output. Highest accuracy, with aspect extraction, multi-label output, and language-agnostic operation. High cost at volume — best for escalation triage rather than full-dataset classification.

🏗️ Hybrid Pipeline · Accuracy ~93% · Optimal cost
Use RoBERTa for bulk classification and route low-confidence reviews (confidence <0.75) to GPT-4o for re-scoring (see the routing sketch below). Keeps cost roughly 80% lower than full GPT-4o while maintaining near-GPT accuracy at the margin.
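A sketch of the hybrid routing logic, assuming the OpenAI Python SDK (openai >= 1.0) and the transformers pipeline from earlier; the prompt wording and output parsing are simplified illustrations:

# Hybrid sketch: RoBERTa for bulk, GPT-4o re-scores low-confidence labels.
# Assumes: pip install transformers torch openai, and OPENAI_API_KEY set.
import json
from openai import OpenAI
from transformers import pipeline

CONFIDENCE_THRESHOLD = 0.75  # routing cut-off from the description above

roberta = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",  # illustrative
)
client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify(review: str) -> dict:
    first = roberta(review)[0]  # e.g. {'label': 'negative', 'score': 0.62}
    if first["score"] >= CONFIDENCE_THRESHOLD:
        return {"label": first["label"].upper(), "confidence": first["score"]}

    # Low confidence: escalate to GPT-4o with structured JSON output.
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": (
                "Classify this product review as POSITIVE, NEGATIVE or NEUTRAL. "
                'Reply as JSON: {"label": ..., "confidence": ..., "aspect_tags": [...]}\n\n'
                + review
            ),
        }],
    )
    return json.loads(response.choices[0].message.content)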

Recommended starting point: Run VADER on your first 1,000 reviews to understand the rough distribution and spot obvious patterns. Then fine-tune RoBERTa on a hand-labelled sample of 500 reviews from your specific domain. This two-step process typically takes 2–3 days and produces a production-grade classifier that costs <£0.002 per review at scale.
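For the VADER first pass, a minimal sketch (the ±0.05 compound-score cut-offs are the library's conventional defaults):

# Quick distribution check with VADER on a raw sample (sketch).
# Assumes: pip install vaderSentiment
from collections import Counter
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def vader_label(text: str) -> str:
    compound = analyzer.polarity_scores(text)["compound"]  # -1.0 .. +1.0
    if compound >= 0.05:
        return "POSITIVE"
    if compound <= -0.05:
        return "NEGATIVE"
    return "NEUTRAL"

sample = [  # replace with your first 1,000 raw review bodies
    "Battery life is genuinely impressive - lasts 3 full days with heavy use.",
    "Arrived on time. Does what it says. Nothing exceptional either way.",
]
print(Counter(vader_label(text) for text in sample))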

Before vs After

How structured sentiment changes the product team’s workflow

Without Sentiment Analysis · Manual & Reactive
😓 PM reads 40–50 reviews/week — mostly ignores the rest
Product defect spotted 2–3 weeks after first reports appear
📊 No visibility into which aspect is driving negative ratings
🗣️ Competitor mentions buried in review text — never surfaced
🔁 Monthly manual export from review platforms — no automation

With Sentiment Analysis Pipeline · Automated & Proactive
Every review classified in <30ms. PM reviews dashboard once a week.
🚨 Slack alert within hours of a negative spike — same-day response possible
🎯 Aspect tags show exactly: delivery −34%, product quality +12% this week
🔍 Competitor mentions extracted and tagged automatically for the intel team
🔄 Weekly automated pipeline — new reviews processed every Sunday night

ROI Calculator

What does this pipeline cost — and what does it save?

The cost model for a review sentiment pipeline is unusually transparent: infrastructure is cheap, and the main variable is your classifier choice. The estimator compares manual cost per year against pipeline cost per year to give an annual saving and an ROI figure, so you can plug in your own numbers and see the economics for your brand.
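A sketch of that arithmetic; every input figure is an assumption to replace with your own (review volume from the problem section, per-review cost from the recommendation above, the rest illustrative):

# ROI sketch: annual manual cost vs pipeline cost.
# All inputs are illustrative assumptions; substitute your own figures.
WEEKLY_REVIEWS = 1_000        # new reviews/week (the text cites 800-1,200)
MANUAL_HOURS_PER_WEEK = 6     # PM time reading ~20% of reviews (from above)
HOURLY_RATE = 50.0            # fully loaded PM cost per hour (assumption)
COST_PER_REVIEW = 0.002       # classifier cost at scale (<£0.002, from above)
INFRA_PER_YEAR = 1_200.0      # hosting, DB, scheduler (assumption)

manual_cost = MANUAL_HOURS_PER_WEEK * HOURLY_RATE * 52
pipeline_cost = WEEKLY_REVIEWS * 52 * COST_PER_REVIEW + INFRA_PER_YEAR
saving = manual_cost - pipeline_cost

print(f"Manual cost / year:   £{manual_cost:,.0f}")
print(f"Pipeline cost / year: £{pipeline_cost:,.0f}")
print(f"Annual saving:        £{saving:,.0f}")
print(f"ROI:                  {saving / pipeline_cost:.1f}x")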

Limitations & Caveats

What sentiment analysis cannot tell you

Sentiment classifiers are powerful but imperfect. Understanding where they fail is as important as knowing where they excel.

🧩

Sarcasm and irony: “Oh great, broke after one day” is negative — but most classifiers see “great” and flag it positive at lower confidence. This is why confidence thresholds matter: route anything below 0.80 to human review rather than accepting the label blindly.

⚖️

Review scraping legality: Terms of service vary by platform. Amazon, Trustpilot, and Google Shopping all have explicit anti-scraping clauses. In the UK, the Computer Misuse Act and database rights under the Copyright and Rights in Databases Regulations 1997 are relevant. Use official APIs where available (Amazon Product Advertising API, Trustpilot API) and consult legal counsel before scraping at scale.

Star rating ≠ sentiment: A review that says “3 stars — I like it but delivery was a disaster” is mixed sentiment, not neutral. Star rating alone misses 23% of reviews that contain high-signal qualitative text that contradicts the numeric rating. Always analyse text separately from the rating.
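One way to surface those contradictions in bulk, sketched against the DataFrame from the aggregation sketch (the star cut-offs are illustrative):

# Mismatch sketch: reviews whose text sentiment contradicts the star rating.
# Reuses df from the aggregation sketch; star cut-offs are illustrative.
high_stars_negative_text = df[
    (df["star_rating"] >= 4) & (df["sentiment_label"] == "NEGATIVE")
]
low_stars_positive_text = df[
    (df["star_rating"] <= 2) & (df["sentiment_label"] == "POSITIVE")
]

# These mismatch buckets are the high-signal reviews worth a human read.
print(len(high_stars_negative_text), "reviews: high stars, negative text")
print(len(low_stars_positive_text), "reviews: low stars, positive text")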