A big factor in producing a good analysis is the feedback modality -- chat transcripts are different from emails, which are different from web forms or operator notes.

We've had several "customer feedback / intent / support case analysis" projects in the past, some for large customers with millions of individual records (Autodesk), where there's the additional challenge of discovery: what should the categories be in the first place? What's in the data?

What we learned is that a model trained on one type of feedback will not necessarily perform well on others, because the relevant signals manifest differently across modalities: feedback length / writing style / typos, lexical richness / repetition / boilerplate, OCR noise / how long the long tail is… Your model may learn to pick up on cues that are orthogonal to the sentiment or categorization problem.

This is especially true for black-box models (deep learning) where introspection is limited: did the model learn to rely on syntax? Specific words or character n-grams? Exclamation marks? Something else? Does an Indian-looking name imply negative sentiment?

Slapping a generic ML technique (Stanford NLP, Naive Bayes, bi-LSTM, whatever) onto a bunch of tokens is a reasonable first step; that's the low-hanging fruit. The tricky part is defining the problem space and the QA process correctly, and managing the devil that comes with the details.
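To make the introspection point concrete, here's a minimal sketch (toy data invented for illustration, not any production pipeline): train a plain bag-of-words classifier and look at its strongest weights. Names or channel-specific boilerplate showing up near the top is a quick red flag that the model has latched onto cues that won't transfer to other modalities.

```python
# Minimal sketch: fit a linear bag-of-words sentiment classifier and inspect
# its highest-weight features to spot spurious cues. All data below is toy
# placeholder content, purely for illustration.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "Great support, thanks Raj!",                 # placeholder feedback
    "The app crashes every time I log in",
    "Love the new dashboard, very fast",
    "Still waiting for a reply after two weeks",
    "Works fine, no complaints",
    "Terrible onboarding experience",
]
labels = [1, 0, 1, 0, 1, 0]  # toy labels: 1 = positive, 0 = negative

vec = TfidfVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)

# If a name like "raj" or some channel boilerplate dominates these lists,
# the classifier has learned modality-specific signals, not sentiment.
features = np.array(vec.get_feature_names_out())
order = np.argsort(clf.coef_[0])
print("strongest negative cues:", features[order[:10]])
print("strongest positive cues:", features[order[-10:]])
```

This only works for linear models over interpretable features, of course; for the deep-learning case the same question has to be asked with probing or attribution tools instead.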
I always see all these articles, services and products offering NLP for English. I wonder how this works with other languages that have a different structure, e.g. Japanese, Arabic, etc. It would also be interesting to see how these algorithms behave when cultural aspects come into play: one word or expression may have a different meaning in different places. How would the system handle something like "Your service is the sh*t!"? Is that positive? Negative? There's probably info on this subject all over the internet already, haha. Very interesting though...
Have you considered using this for analyzing feedback for politicians? They have similar pain points in understanding which feedback from constituents reflects a general problem vs. an isolated concern. Maybe start with Twitter data (as a PoC) and then move on to actual emails from constituents.
Interesting article. How do you guys cope with badly written feedback, or feedback that just doesn't make sense? I guess this type of feedback could "pollute" your models if you constantly use unverified feedback as training data?
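One rough way to keep the worst of that out of a training set is a cheap sanity filter before anything is learned from; the thresholds and rules below are invented placeholders, just to illustrate the idea rather than describe what the article's authors actually do.

```python
# Sketch of a heuristic pre-filter for raw, unverified feedback.
# Thresholds are arbitrary placeholders.
import re

def looks_usable(text: str, min_words: int = 3, min_alpha_ratio: float = 0.6) -> bool:
    """Reject empty, keyboard-mash, or mostly non-text feedback."""
    words = text.split()
    if len(words) < min_words:
        return False
    alpha = sum(ch.isalpha() for ch in text)
    if alpha / max(len(text), 1) < min_alpha_ratio:
        return False
    # reject strings that are a single character repeated ("aaaaaa", "!!!!!")
    if re.fullmatch(r"(.)\1+", text.strip()):
        return False
    return True

raw_feedback = ["asdkjhaskdjh", "The export button is broken", "!!!", "ok"]
training_candidates = [t for t in raw_feedback if looks_usable(t)]
print(training_candidates)  # only the usable item survives
```

A filter like this obviously can't judge whether feedback "makes sense", but it does keep the most obvious garbage from ever reaching the training data.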
When you talk about choosing between algorithms from Google, Stanford, etc., what are the criteria for doing that? Do they change based on the domain? If you're just trying to classify feedback, how much does the domain affect the algorithm?
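For what it's worth, one mundane but common criterion is measured performance on your own domain's labeled data: cross-validate a few candidate models on the same folds and compare. A toy sketch below (invented data; any real selection process presumably weighs more than a single F1 number, e.g. latency, cost, and how the model fails).

```python
# Sketch: compare a few off-the-shelf classifiers on in-domain labeled
# feedback using the same cross-validation folds. Data is toy placeholder.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["billing page is confusing", "great onboarding flow",
         "cannot export my data", "support replied within an hour",
         "checkout keeps timing out", "docs are clear and helpful"] * 5
labels = [0, 1, 0, 1, 0, 1] * 5  # toy labels: 0 = complaint, 1 = praise

candidates = {
    "naive_bayes": MultinomialNB(),
    "logreg": LogisticRegression(max_iter=1000),
    "linear_svm": LinearSVC(),
}

for name, model in candidates.items():
    pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), model)
    scores = cross_val_score(pipe, texts, labels, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.2f}")
```

The point being: if the domain shifts, the winner of this comparison can shift with it, which is exactly why in-domain evaluation matters more than the name on the algorithm.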