Problem
An NGO ran a large-scale survey across multiple regions and received tens of thousands of free-text responses to open-ended questions. The communications and policy teams needed to surface themes, sentiment, and regional variation. In previous waves they had coded responses by hand in spreadsheets, which missed patterns in the long tail.
Approach
An NLP pipeline combining topic modelling (BERTopic-style clustering) with LLM-driven categorisation against a hand-curated taxonomy. Sentiment is broken down per region and per demographic, with a confidence score attached to each classification. Output is structured for two audiences: a quick executive summary, and a deep-dive view for the policy team in which every finding cites the original responses behind it.
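To make the roll-up step concrete, here is a minimal sketch of one way the taxonomy labels and topic clusters could be combined and then aggregated per region. It assumes the classification, clustering, and sentiment scoring have already run upstream and written their results to a table. The column names (`taxonomy_label`, `confidence`, `topic_id`, `sentiment`), the 0.6 threshold, and the rule of sending low-confidence labels to their topic cluster are all illustrative assumptions, not the project's actual schema or logic.

```python
import pandas as pd

# Hypothetical threshold: taxonomy labels below this confidence are not trusted.
MIN_CONFIDENCE = 0.6


def route_responses(df: pd.DataFrame, min_confidence: float = MIN_CONFIDENCE) -> pd.DataFrame:
    """Keep confident taxonomy labels; fall back to the topic cluster otherwise."""
    out = df.copy()
    confident = out["confidence"] >= min_confidence
    # Low-confidence rows get a theme derived from their topic cluster id.
    out["theme"] = out["taxonomy_label"].where(
        confident, "topic_" + out["topic_id"].astype(str)
    )
    out["source"] = confident.map({True: "taxonomy", False: "topic_model"})
    return out


def regional_rollup(df: pd.DataFrame) -> pd.DataFrame:
    """Per-region, per-theme counts and mean scores, with cited response ids."""
    return (
        df.groupby(["region", "theme"], as_index=False)
        .agg(
            n_responses=("response_id", "count"),
            mean_sentiment=("sentiment", "mean"),
            mean_confidence=("confidence", "mean"),
            # Keep the ids so every aggregate can be traced back to responses.
            response_ids=("response_id", list),
        )
        .sort_values(["region", "n_responses"], ascending=[True, False])
        .reset_index(drop=True)
    )


# Made-up sample rows, for illustration only.
responses = pd.DataFrame(
    {
        "response_id": [1, 2, 3, 4, 5],
        "region": ["North", "North", "North", "South", "South"],
        "taxonomy_label": ["water_access", "water_access", "other", "health", "other"],
        "confidence": [0.9, 0.8, 0.3, 0.7, 0.4],
        "topic_id": [0, 0, 7, 2, 7],
        "sentiment": [-0.5, -0.3, 0.2, 0.4, -0.1],
    }
)

routed = route_responses(responses)
rollup = regional_rollup(routed)
print(rollup)
```

In this sketch, responses 3 and 5 fall below the threshold and are grouped under their topic cluster rather than under "other", so the tail stays visible in the regional table. The `response_ids` column is what would let a dashboard link each figure back to the underlying text.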
Stack
Python · spaCy · BERTopic · Claude for taxonomy classification · Pandas · region-level dashboards
Outcome
The NGO surfaced regional themes that prior waves had missed, particularly in the long-tail responses that spreadsheet-based coding had lumped into "other". Combining taxonomy classification with topic modelling gave the policy team both a structured roll-up and the unstructured tail to act on.