The concept seems quite interesting. I can hear slight differences between the different versions, but I honestly can't hear those differences as conveying any particular emotion. If you mixed them up and asked me which one sounded most "happy" or most "scared" or most "sad", I'd be stumped.
OMG. A TV channel used it (or a very similar technique) here in Spain to manipulate some video footage.

http://iniciativadebate.org/2016/01/22/por-que-solo-en-video-antena-3-suena-temblorosa-la-voz-de-anna-gabriel/

In the first video (at around 0:33) you can hear the voice as if the "scared" filter had been applied too heavily, and the clip cuts out only the section where she says "yes, we met with X", removing the context so it sounds like a shameful confession rather than the firm statement of already-known information that it is (0:41 of the second video).
Amazing. Imagine combining this with facial emotion detection algorithms like those in the Microsoft Kinect and building an AI that can automatically empathize with users. When you're happy, the AI is happy. When you hate the world, it hates the world. Maybe you could even optimize the UI so that certain offers, like buying things or rating apps, are only shown when the user is in a good mood.