Customer reviews contain the most honest signal your product team will ever get. This use case shows how to systematically extract, structure, and run sentiment analysis on scraped product reviews — turning thousands of unstructured opinions into actionable product intelligence.
A mid-sized e-commerce brand with 200 products and presence on Amazon, Trustpilot, and their own site accumulates roughly 800–1,200 new reviews per week. A product manager reading even 20% of those would spend 6+ hours per week on raw qualitative data — with no structure, no trends, and no way to connect a spike in negative reviews to a specific manufacturing batch or UI change.
Sentiment analysis automates the first layer of this work: it reads every review, labels it positive / negative / neutral with a confidence score, extracts the key topic (packaging, delivery, product quality, customer service), and flags anomalies — all in milliseconds per review.
The key insight: Sentiment analysis doesn’t replace human judgment — it filters and prioritises so that humans spend their attention on the 5% of reviews that actually require a decision (escalation, refund, product recall signal), not the 95% that are routine feedback already captured by the star rating.
The pipeline has five stages. Each stage is independently maintainable — if Amazon changes its HTML structure, only the extractor needs updating, not the classifier or storage layer.
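As a minimal sketch of those boundaries (the stage names and signatures below are assumptions, not a prescribed interface):

```python
# Illustrative five-stage decomposition; names and types are assumptions.
from dataclasses import dataclass

@dataclass
class Review:
    product: str
    source: str
    stars: int
    text: str
    date: str

def fetch(source: str) -> list[str]:
    """Stage 1: pull raw review pages or API payloads from one source."""
    raise NotImplementedError

def extract(raw: str, source: str) -> Review:
    """Stage 2: parse source-specific HTML/JSON into a normalised Review.
    The only stage that changes when Amazon changes its markup."""
    raise NotImplementedError

def classify(review: Review) -> dict:
    """Stage 3: sentiment label, confidence score, and key aspect."""
    raise NotImplementedError

def store(record: dict) -> None:
    """Stage 4: write the enriched record to the output database."""
    raise NotImplementedError

def aggregate() -> None:
    """Stage 5: per-product, per-week, per-aspect rollups for reporting."""
    raise NotImplementedError
```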
Here’s what a single classified review record might look like in the output database (a sketch; the exact field names will depend on your schema):
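```python
# Illustrative record only; field names are assumptions, not a fixed schema.
classified_review = {
    # Extraction fields -- scraped directly from the source
    "review_id": "amz-0412",
    "product": "SoundPulse Pro",
    "source": "amazon",
    "stars": 2,
    "text": "Battery died after a week. Sound was great until then.",
    "review_date": "2025-03-14",
    # Enrichment fields -- computed by the classifier pipeline
    "sentiment": "negative",
    "confidence": 0.93,
    "aspect": "battery",
    "escalate": False,   # True when the review needs a human decision
    "classified_at": "2025-03-14T09:12:44Z",
}
```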
The schema is split into two layers: extraction fields (scraped directly from the source) and enrichment fields (computed by the classifier pipeline). Understanding which is which matters enormously for debugging classifier errors.
A sample of 20 reviews from the dataset covers 3 products and 4 sources, one row per review with the columns Product, Source, Stars, Review, Sentiment, Confidence, Aspect, and Date (illustrative data; confidence is the model's output probability for the assigned label, and low-confidence reviews are flagged for escalation).
Once reviews are classified, the real insight comes from aggregation — looking at sentiment per product, per time period, and per aspect. The headline view is the positive vs negative balance per product category, which takes only a few lines to compute.
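A minimal pandas sketch of that aggregation, assuming the classified records sit in a CSV export with the illustrative column names used above:

```python
import pandas as pd

# Assumed export of classified records; adjust the path and column names.
df = pd.read_csv("classified_reviews.csv", parse_dates=["review_date"])

# Positive vs negative balance per product
balance = df.pivot_table(index="product", columns="sentiment",
                         values="review_id", aggfunc="count", fill_value=0)
print(balance)

# Weekly negative-review counts per aspect for one product: the view that
# surfaces signals like the week-11 battery spike described below
weekly = (
    df[(df["product"] == "SoundPulse Pro") & (df["sentiment"] == "negative")]
      .assign(week=lambda d: d["review_date"].dt.isocalendar().week)
      .groupby(["week", "aspect"]).size()
      .unstack(fill_value=0)
)
print(weekly)
```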
Watch signal: “Battery” appears in 68% of negative reviews for SoundPulse Pro, with a notable spike in week 11 (March 2025) — coinciding with batch #MP-2025-03 from the Shenzhen supplier. This cross-reference between sentiment timeline and production batches is the kind of insight that’s invisible without structured data.
There’s no single “best” sentiment classifier. The right choice depends on your data volume, accuracy requirements, latency budget, and whether you need aspect-level granularity (e.g. “negative about delivery” vs “negative about product”).
Recommended starting point: Run VADER on your first 1,000 reviews to understand the rough distribution and spot obvious patterns. Then fine-tune RoBERTa on a hand-labelled sample of 500 reviews from your specific domain. This two-step process typically takes 2–3 days and produces a production-grade classifier that costs <£0.002 per review at scale.
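A sketch of the first step, using the vaderSentiment package and its conventional ±0.05 compound-score thresholds:

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def vader_label(text: str) -> tuple[str, float]:
    """Rule-based first pass: label plus compound score in [-1, 1]."""
    compound = analyzer.polarity_scores(text)["compound"]
    if compound >= 0.05:
        return "positive", compound
    if compound <= -0.05:
        return "negative", compound
    return "neutral", compound

print(vader_label("Battery died after a week. Sound was great until then."))
```

Because VADER is lexicon-based, it needs no training data and runs on CPU, which is why it works as a triage pass before you commit to fine-tuning.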
The cost model for a review sentiment pipeline is unusually transparent. Infrastructure is cheap; the variable is your classifier choice. Plug your own volumes into the sketch below to see the economics for your brand.
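A back-of-envelope version of that calculation, using the volume and per-review cost figures quoted above (assumptions, not vendor pricing):

```python
def classification_cost(reviews_per_week: int, cost_per_review: float) -> tuple[float, float]:
    """Return (weekly, yearly) classification spend in GBP."""
    weekly = reviews_per_week * cost_per_review
    return weekly, weekly * 52

# At the example brand's volumes (800-1,200 reviews/week) and the
# ~£0.002/review figure for a fine-tuned RoBERTa classifier:
for volume in (800, 1_200):
    weekly, yearly = classification_cost(volume, 0.002)
    print(f"{volume} reviews/week -> £{weekly:.2f}/week, £{yearly:.2f}/year")
# 800  reviews/week -> £1.60/week, £83.20/year
# 1200 reviews/week -> £2.40/week, £124.80/year
```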
Sentiment classifiers are powerful but imperfect. Understanding where they fail is as important as knowing where they excel.
Sarcasm and irony: “Oh great, broke after one day” is negative — but most classifiers see “great” and flag it positive at lower confidence. This is why confidence thresholds matter: route anything below 0.80 to human review rather than accepting the label blindly.
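A sketch of that routing rule; the 0.80 threshold comes from the paragraph above, and the record shape follows the earlier example:

```python
CONFIDENCE_THRESHOLD = 0.80

def route(record: dict) -> str:
    """Accept confident labels automatically; queue the rest for a human."""
    if record["confidence"] < CONFIDENCE_THRESHOLD:
        return "human_review"   # catches likely sarcasm and mixed sentiment
    return "auto_accept"

# A sarcastic review typically gets a plausible-but-uncertain label:
print(route({"text": "Oh great, broke after one day",
             "sentiment": "positive", "confidence": 0.61}))  # human_review
```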
Review scraping legality: Terms of service vary by platform. Amazon, Trustpilot, and Google Shopping all have explicit anti-scraping clauses. In the UK, the Computer Misuse Act 1990 and database rights under the Copyright and Rights in Databases Regulations 1997 are relevant. Use official APIs where available (Amazon Product Advertising API, Trustpilot API) and consult legal counsel before scraping at scale.
Star rating ≠ sentiment: A review that says “3 stars — I like it but delivery was a disaster” is mixed sentiment, not neutral. Star rating alone misses the 23% of reviews whose high-signal qualitative text contradicts the numeric rating. Always analyse text separately from the rating.
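A minimal check for that contradiction, assuming illustrative star bands (4–5 positive, 3 neutral, 1–2 negative):

```python
def star_band(stars: int) -> str:
    """Map the numeric rating to the sentiment it implies on its own."""
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return "neutral"

def rating_text_mismatch(stars: int, text_sentiment: str) -> bool:
    """Flag reviews whose text sentiment disagrees with the star rating."""
    return text_sentiment != star_band(stars)

# "3 stars -- I like it but delivery was a disaster", classified as negative:
print(rating_text_mismatch(3, "negative"))  # True: worth a human look
```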