Why are people being so critical of this work? Sure, the blog post paints a simplified picture of what the system is actually capable of, but it's still helpful for a non-ML audience to get a better understanding of the high-level motivation behind the work. The OpenAI folks are trying to educate the broader public as well, not just ML/AI researchers.<p>Imagine if this discovery had been made by some undergraduate student who had little experience in the traditions of how ML benchmark experiments are done, or who was just starting out in her ML career. Would we be just as critical?<p>As a researcher, I like seeing shorter communications like these, as they illuminate the researcher's thinking process. Read ML papers for the ideas, not the results :)<p>I personally don't mind blog posts that have a bit of hyped-up publicity. It's thanks to groups like DeepMind and OpenAI capturing the public imagination on the subject that interest from prospective students in studying ML + AI + robotics has accelerated. If the hype is indeed unjustified, then it'll become irrelevant in the long term. One caveat is that researchers should be very careful not to mislead reporters who are looking for the next "killer robots" story. But that doesn't really apply here.
I don't know, but this seems a bit hyped in places.<p>They start with:<p>> Our L1-regularized model matches multichannel CNN performance with only 11 labeled examples, and state-of-the-art CT-LSTM Ensembles with 232 examples.<p>Hmm, that sounds pretty impressive. But then later you read:<p>> We first trained a multiplicative LSTM with 4,096 units on a corpus of 82 million Amazon reviews to predict the next character in a chunk of text. Training took one month across four NVIDIA Pascal GPUs<p>Wait, what? How did "232 examples" transform into "82 million"??<p>OK, I get it: they pretrained the network on the 82M reviews, and then trained the last layer to do the sentiment analysis. But you can't honestly claim that you did great with just 232 examples!
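For readers unfamiliar with the two-stage recipe being described, here is a minimal sketch of the idea (not OpenAI's actual code): a model pretrained on the huge unlabeled corpus provides feature vectors, and only the small L1-regularized linear classifier on top ever sees the handful of labeled examples. The `extract_features` helper below is a hypothetical stand-in for running the pretrained mLSTM.

```python
# Minimal sketch of the two-stage setup described above (not the actual OpenAI
# code). `extract_features` is a hypothetical stand-in for running the char-level
# mLSTM pretrained on the 82M unlabeled reviews and reading out its hidden state.
import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_features(texts):
    # Placeholder: really this would be the pretrained language model's
    # final hidden state for each review (4096-dimensional in the paper).
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 4096))

labeled_texts = ["Loved it, works perfectly.", "Broke after two days."]  # imagine ~232 of these
labels = [1, 0]

X = extract_features(labeled_texts)   # unsupervised pretraining did the heavy lifting
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X, labels)                    # only this step sees the small labeled set
```

Seen this way, the "232 examples" figure only counts the labels used in the final `fit` call, which is exactly the commenter's point.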
If you are interested in looking at the model in more detail, we (@harvardnlp) have uploaded the model features to LSTMVis [1]. We ran their code on Amazon reviews and are showing a subset of the learned features. We haven't had a chance to look further yet, but it is interesting to play with.<p>[1] <a href="http://lstm.seas.harvard.edu/client/pattern_finder.html?data_set=32sentiment&source=states::states&pos=110&brush=28,31&queried=true&ex_cells=" rel="nofollow">http://lstm.seas.harvard.edu/client/pattern_finder.html?data...</a>
The synthetic text they generated was surprisingly realistic, despite being generic.<p>If I were perusing a dozen reviews I probably wouldn't have spotted the AI-generated ones in the crowd.
So char-by-char models are the next Word2Vec then. Pretty impressive results.<p>It would be interesting to see how it performs on other NLP tasks. I'd be pretty interested to see how many neurons it uses to attempt something like stance detection.<p><i>Data-parallelism was used across 4 Pascal Titan X gpus to speed up training and increase effective memory size. Training took approximately one month.</i><p>Every time I look at something like this I find a line like that and go: "ok, that's nice... I'll wait for the trained model".
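For anyone wondering what "char-by-char" modelling looks like mechanically, here is a toy sketch: a tiny PyTorch LSTM trained to predict the next byte. Everything here (sizes, data) is illustrative only and nothing like the 4,096-unit mLSTM that took a month across four GPUs.

```python
# Toy character-level language model: predict the next byte at every position.
# Purely illustrative; sizes are nothing like the 4096-unit mLSTM in the paper.
import torch
import torch.nn as nn

class CharLM(nn.Module):
    def __init__(self, vocab_size=256, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, x):            # x: (batch, seq) of byte ids
        h, _ = self.lstm(self.embed(x))
        return self.head(h)          # next-byte logits at every position

model = CharLM()
chunk = torch.tensor([list(b"this product was grea")])
logits = model(chunk)
loss = nn.functional.cross_entropy(logits[:, :-1].reshape(-1, 256),
                                   chunk[:, 1:].reshape(-1))
```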
It's very difficult to understand what the contributions are here. From what I've read so far, this feels more like a proposal for future research or a press release than an advance in the state of the art.<p>* Using large models trained on lots of data as the foundation for sample-efficient smaller models is common.<p>* Transfer learning, fine-tuning, and character RNNs are common.<p>Were there any insights learned that give a deeper understanding of these phenomena?<p>Not knowing too much about the sentiment space, it's hard to tell how significant the resulting model is.
(Apologies for the slightly incoherent post below)<p>I've been noticing a lot of work that digs into ML model internals (as they've done here to find the sentiment neuron) to understand why they work, or to use them to do something. Let me recall some interesting instances of this:<p>1. Sander Dieleman's blog post about using CNNs at Spotify to do content-based recommendations for music. He didn't write about the system's performance but collected playlists that maximally activated each of the CNN filters (early-layer filters picked up on primitive audio features, later ones picked up on more abstract features). The filters were essentially learning the musical elements specific to various subgenres.<p>2. The ELI5 - Explain Like I'm Five - Python library. It explains the outputs of many linear classifiers. I've used it to explain why a text classifier made a certain prediction: it highlights features to show how much or how little they contribute to the prediction (dark red for a negative contribution, dark green for a positive one). (Rough sketch below.)<p>3. FairML: auditing black-box models. Inspecting the model to find which features are important. With privacy and security concerns too!<p>Since deep learning/machine learning is very empirical at this stage, I think improvements in instrumentation can lead to ML/DL being adopted for more kinds of problems. For example: chemical/biological data. I'd be highly curious about what new ways of inspecting such kinds of data would be insightful (we can play the audio input that maximally activates filters for a music-related network, we can visualize what filters are learning in an object detection network, etc.)
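As a concrete illustration of item 2 above, here is roughly what the ELI5 workflow looks like on a tiny text classifier. The function names are recalled from memory of the library's API, so treat the exact signatures as approximate.

```python
# Rough sketch of ELI5 explaining one prediction of a linear text classifier.
# API names recalled from memory; treat exact signatures as approximate.
import eli5
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["great battery life", "terrible screen, returned it",
         "love the keyboard", "broke after a week"]
labels = [1, 0, 1, 0]

vec = TfidfVectorizer(ngram_range=(1, 2))
clf = LogisticRegression().fit(vec.fit_transform(texts), labels)

# Which n-gram features pushed this document toward positive or negative?
explanation = eli5.explain_prediction(clf, "battery life is great", vec=vec)
print(eli5.format_as_text(explanation))
```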
"The selected model reaches 1.12 bits per byte." (<a href="https://arxiv.org/pdf/1704.01444.pdf" rel="nofollow">https://arxiv.org/pdf/1704.01444.pdf</a>)<p>For context, Claude Shannon found that humans could model English text with an entropy of 0.6 to 1.3 bits per character (<a href="http://languagelog.ldc.upenn.edu/myl/Shannon1950.pdf" rel="nofollow">http://languagelog.ldc.upenn.edu/myl/Shannon1950.pdf</a>)
I would imagine stuff like sarcasm is still out of reach, though. It seems hard even for humans to pick it up in text-based communication. Also, anything outside the standard sentiment patterns might throw it off: "This product is as good as <product x>" (where product x is known to perform badly). I am just trying to think of scenarios where a sentiment model would fail.<p>The sentiment neuron sounds fascinating too. I didn't realize individual neurons could be talked about or understood outside the context of the NN as a whole; I am thinking of the "black box" it's often referred to as in some articles.<p>Since one of the research goals for OpenAI is to train a language model on jokes [0], I wonder how this neuron would perform on a joke corpus.<p>----------------------------<p>[0] <a href="https://openai.com/requests-for-research/#funnybot" rel="nofollow">https://openai.com/requests-for-research/#funnybot</a>
I'm trying to understand this statement:<p>"The sentiment neuron within our model can classify reviews as negative or positive, even though the model is trained only to predict the next character in the text."<p>If you look closely at the colorized paragraph in their paper/website, you can see that the major sentiment jumps (e.g. from green to light-green and from light-orangish to red) occur with period characters. Perhaps the insight is that periods delineate the boundary of sentiment. For example:<p>I like this movie.
I liked this movie, but not that much.
I initially hated the movie, but ended up loving it.<p>The period tells the model that the thought has ended.<p>My question for the team: How well does the model perform if you remove periods?
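A sketch of how that experiment could be run, assuming access to some function that returns the sentiment unit's activation after each character. The `neuron_trace` below is a dummy placeholder (it just returns zeros), not OpenAI's API.

```python
# Hypothetical experiment: compare where the biggest activation jumps occur
# with and without periods. `neuron_trace` is a dummy stand-in for running the
# pretrained model and recording the sentiment unit's value after each character.
def neuron_trace(text):
    return [0.0 for _ in text]   # placeholder per-character activations

original = "I initially hated the movie, but ended up loving it."
stripped = original.replace(".", "")

for label, text in [("with periods", original), ("without periods", stripped)]:
    trace = neuron_trace(text)
    jumps = [abs(b - a) for a, b in zip(trace, trace[1:])]
    print(label, max(jumps, default=0.0))
```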
Can someone explain what is "unsupervised" about this? I'm guessing this is what confuses me most.<p>I think this work is interesting, although when you think about it, it's kind of normal that the model converges to a point where there is a neuron that indicates whether the review is positive or negative. There are probably a lot of other traits that can be found in the "features" layer as well.<p>There are probably neurons that can predict the geographical location of the author, based on the words they use.<p>There are probably neurons that can predict that the author favors short sentences over long explanations.<p>But what makes this "unsupervised"?
Machine learning has become more and more like archaeology: people say "empirically" more and more, yet only provide a single dataset or a limited set of datasets.
I think it's fair to criticize this blog post for being unclear on what exactly is novel here; pre-training is a straightforward and old idea, but the blog post does not even mention this. Having accessible write-ups for AI work is great, but surely they should not be confusing to domain experts or written in a way that exacerbates the rampant oversimplification and misreporting about AI in the popular press. Still, it is a cool, mostly experimental/empirical result, and it's good that these blog posts exist these days.<p>For what it's worth, the paper predictably does a better job of covering the previous work and stating the motivation: "The experimental and evaluation protocols may be underestimating the quality of unsupervised representation learning for sentences and documents due to certain seemingly insignificant design decisions. Hill et al. (2016) also raises concern about current evaluation tasks in their recent work which provides a thorough survey of architectures and objectives for learning unsupervised sentence representations - including the above mentioned skip-thoughts. In this work, we test whether this is the case. We focus in on the task of sentiment analysis and attempt to learn an unsupervised representation that accurately contains this concept. Mikolov et al. (2013) showed that word-level recurrent language modelling supports the learning of useful word vectors and we are interested in pushing this line of work. As an approach, we consider the popular research benchmark of byte (character) level language modelling due to its further simplicity and generality. We are also interested in evaluating this approach as it is not immediately clear whether such a low-level training objective supports the learning of high-level representations." So, they question some built-in assumptions from the past by training on lower-level data (characters), with a bigger dataset and more varied evaluation.<p>The interesting result they highlight is that a single model unit is able to perform so well with their representation: "It is an open question why our model recovers the concept of sentiment in such a precise, disentangled, interpretable, and manipulable way. It is possible that sentiment as a conditioning feature has strong predictive capability for language modelling. This is likely since sentiment is such an important component of a review", which I tend to agree with... train on a whole lot of reviews and it's only natural to end up training a regressor for review sentiment.
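A toy illustration of how a single "sentiment unit" can fall out of the linear probe: if one hidden coordinate carries most of the label information, an L1-penalized logistic regression tends to concentrate its weight there. The data below is synthetic and the unit index is arbitrary; this is not the paper's code.

```python
# Toy demonstration: when one hidden unit carries the label information, an
# L1-penalized logistic regression concentrates its weight on that unit.
# Synthetic data only; the unit index is arbitrary.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
hidden = rng.normal(size=(500, 4096))           # pretend LSTM hidden states
labels = (hidden[:, 2388] > 0).astype(int)      # pretend unit 2388 encodes sentiment

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.25).fit(hidden, labels)
top_unit = int(np.argmax(np.abs(clf.coef_[0])))
print(top_unit)                                 # should pick out 2388 in this toy setup
```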
I think one of the most amazing parts of this is how accessible the hardware is right now. You can get world-class AI results for less than the cost of most used cars. In addition, with so many resources freely available through open source, the barrier to getting started is very low.
> The model struggles the more the input text diverges from review data<p>This is where I fear the results will fail to scale. The ability to represent 'sentiment' as one neuron, and its ground truth as uni-dimensional, seems most true of corpora of online reviews, where the entire point is to communicate whether you're happy with the thing that came out of the box. Most other forms of writing communicate sentiment in a more multi-dimensional way, and the subject of the sentiment is more varied than a single item shipped in a box.<p>In other words, the unreasonable simplicity of modelling a complex feature like sentiment with this method is something of an artifact of this dataset.
This is a great name for a band :-). That said, I found the paper really interesting. I tend to think about LSTM systems as series expansions, and using that analogy I don't find it unusual that you can figure out the dominant (or first) coefficient of the expansion, or that it has a really strong impact on the output.
What they have done is semi-supervised learning (Char-RNN) + supervised training of sentiment.
Another way to do it is semi-supervised learning (Word2Vec) + supervised training of sentiment.
If the first approach works better, does it imply that character-level learning is more performant than word-level learning?
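For comparison, the word-level recipe mentioned above might look roughly like this: unsupervised Word2Vec features averaged per document, plus a small supervised classifier on top. The corpus is a toy stand-in for the unlabeled review text, and the gensim 4.x API is assumed.

```python
# Rough sketch of the word-level recipe: unsupervised Word2Vec features plus a
# small supervised classifier. Toy corpus; gensim >= 4.0 API assumed.
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

unlabeled = [["great", "battery", "life"], ["screen", "died", "after", "a", "week"]]
w2v = Word2Vec(sentences=unlabeled, vector_size=50, min_count=1, epochs=20)

def doc_vector(tokens):
    # Average the word vectors of the tokens we have embeddings for.
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

labeled = [(["great", "battery"], 1), (["screen", "died"], 0)]
X = np.stack([doc_vector(toks) for toks, _ in labeled])
y = [label for _, label in labeled]
clf = LogisticRegression().fit(X, y)   # only this step uses the labels
```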
As far as I understand, it means that there must be a relation between a character's sentiment and what the next character can (/should) be for the neural network to use this as a feature, am I right?<p>Does this mean we have unconsciously developed a language that exposes such relations?
It's impressive what abstraction NNs can achieve from just character prediction. Do the other systems they compare to also use the 82M Amazon reviews for training? It seems disingenuous to claim "state-of-the-art" and "less data" if they don't.
Training on a character-by-character basis is really incredible: it's quite the opposite of human intuition about language, but it seems a brilliant idea, and OpenAI tried it out. Great!
Why did they do this character by character? Would word by word make sense? Other than punctuation, I'm not seeing why specific characters are meaningful units.
Why is the linear combination used to train the sentiment classifier? Why does its result get taken into account?<p>Is this linear combination between 2 different strings?