OpenAI is Using Reddit to Teach An Artificial Intelligence How to Speak

260 points by niccolop, over 8 years ago

41 comments

Back in 2007, mobile phones used a system called T9 from Nuance Corp, which was trained on a word corpus taken from IRC and similar chats. This caused all kinds of issues: the phones would accept offensive words like "naziparking" but reject normal language like "world peace". Using Reddit may lead to ... surprises.

Source: http://spraktidningen.se/artiklar/2007/11/darfor-ar-din-mobil-fordomsfull

Translated by Google: https://translate.google.se/translate?sl=sv&tl=en&js=y&prev=_t&hl=sv&ie=UTF-8&u=http%3A%2F%2Fspraktidningen.se%2Fartiklar%2F2007%2F11%2Fdarfor-ar-din-mobil-fordomsfull&edit-text=&act=url

syllogism, over 8 years ago

The Reddit comment corpus is an awesome dataset. There's relatively little mark-up to scrub out, low duplication, good metadata, and a variety of topics.

We used it to train a syntax-enriched word2vec model. Write-up and demo: https://explosion.ai/blog/sense2vec-with-spacy

Btw, the above was run on CPU in a couple of days, because spaCy doesn't use GPUs yet. I've applied for a grant from NVIDIA so I can fix that. If anyone from NVIDIA is reading, email me? :)

minimaxir, over 8 years ago

Since it was not mentioned in the post, here's a direct link to the Reddit comment corpus likely being used: http://files.pushshift.io/reddit/comments/

The full table (up to the end of 2015) is available on BigQuery, with separate tables for each month thereafter: https://bigquery.cloud.google.com/table/fh-bigquery:reddit_posts.full_corpus_201512 (there is a similar table for comments)

And here's a year-old post I wrote on how to use that Reddit dataset with BigQuery: http://minimaxir.com/2015/10/reddit-bigquery/

TY, over 8 years ago

How would one use such technology? Let me rephrase: how would YOU use this technology if you had it?

Imagine you have a bot that convincingly passes the Turing test. What would you do with it?

Build a chatbot business? B2C or B2B?

Sell it to one of the big companies, and if yes, how much do you think it would go for?

Give it to OpenAI? Open source it? If you answer yes to any of these questions, then why?

Edit: let me qualify - this would not be AGI, just a much more advanced bot than whatever is currently on the market.

jonstokes, over 8 years ago

"Oh my God, they'll turn it on and it'll start spewing memes and jokes and ad hominem and false equivalences and propaganda and garbage!" was my first reaction to this headline.

My second reaction was, "at least they're not using 4chan."

ppod, over 8 years ago

Reddit gets a lot of stick, but it's a bastion of civility and intelligence compared to the comments on youtube videos or even mainstream newspaper comments. I don't think there is any forum of comparable size that has a higher quality discussion. Reddit's problems are just humans' problems.

samfisher83, over 8 years ago

Didn't MSFT do the same thing with Twitter and end up with a racist bot? I am not sure how this will turn out.

arctangent, over 8 years ago

One Reddit user has already implemented a bot which does something similar: https://www.reddit.com/r/SubredditSimulator/
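
For anyone wondering how a bot like that can work: as far as I know, SubredditSimulator generates its posts with Markov chains. Below is a minimal sketch of that general technique, not the bot's actual code. The names `build_chain` and `generate`, the `order` parameter, and the sample comments are all made up for illustration.

```python
import random
from collections import defaultdict


def build_chain(comments, order=2):
    """Map each run of `order` consecutive words to the words seen right after it."""
    chain = defaultdict(list)
    for comment in comments:
        words = comment.split()
        for i in range(len(words) - order):
            state = tuple(words[i:i + order])
            chain[state].append(words[i + order])
    return chain


def generate(chain, max_words=30, seed=None):
    """Random-walk the chain, starting from a randomly chosen state."""
    rng = random.Random(seed)
    state = rng.choice(sorted(chain))  # assumes the chain is non-empty
    output = list(state)
    while len(output) < max_words:
        followers = chain.get(state)
        if not followers:  # dead end: nothing ever followed this state
            break
        output.append(rng.choice(followers))
        state = tuple(output[-len(state):])
    return " ".join(output)


if __name__ == "__main__":
    # Hypothetical stand-ins for real Reddit comments.
    sample = ["the cat sat on the mat", "the cat ate the fish"]
    print(generate(build_chain(sample), max_words=10, seed=1))
```

Every word the sketch emits followed the same preceding pair of words somewhere in the training comments, which is why the output tends to look locally plausible and globally incoherent.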

nateberkopec, over 8 years ago

The DGX-1 is available for a cool $129k: http://www.nvidia.com/object/deep-learning-system.html

Correct me if I'm wrong, but I think it's basically a couple hundred NVIDIA 10-series cards strapped together with a full custom NVIDIA software stack.

bkanber, over 8 years ago

This will be interesting. I'm sure they are, but I hope they'll be training the system on tone and sentiment alongside syntax.

Reddit can get vitriolic and rude, and insightful at times too, but once the system learns the syntax, hopefully they'll be able to use sentiment analysis to weigh the polite conversation that occurs more strongly.

Also interested to see how many memes this AI picks up.

I also hope they are able to follow links through to sources when a comment cites another page -- not only can this bot learn syntax but also data extraction, by comparing what is said to the source material.

anexprogrammer, over 8 years ago

My first thought was it'll major on smart-arse, with a good line in sarcasm and insult.

If they're taking the whole of reddit it could start to identify enough context to know when to be smart, sarcastic or simply helpful.

With some of the subs there are long discussions that stay mainly civilised. Same for the support subs: it could learn the context and how of sympathy and empathy. Things that end up on the front page, filled with snap sarcasm, will be a tiny fraction.

I think it's going to be very interesting to see what comes out.

Cshelton, over 8 years ago

As a frequent Redditor, this AI is going to be very witty.

They should limit it to top comments only, and for training, you might as well assume 90% of top comments are sarcastic/tongue in cheek. Or let a user dial the sarcasm/wittiness/seriousness as they want it, kind of like TARS from 'Interstellar'.

corysama, over 8 years ago

So, he's building a literal Reddit Hivemind?

In seriousness, between all of the garbage there is a ton of knowledge and intelligent conversation uploaded to Reddit every day. And, it's all hierarchically organized and scored by domain semi-experts. It really would be wonderful if someone could mine that knowledge IBM Watson style. For example, I'd love to ask the /r/BuildAPC collective AI for PC building advice.

ajamesm, over 8 years ago

Heh. Reddit, huh?

----

"Siri, get me dinner date reservations."

. . . DID YOU MEAN 'false rape accusations'?

beambot, over 8 years ago

I hope they choose the subreddits wisely. The difference between an altruistic AI and a cynical smartass AI trained on Reddit data seems mighty razor thin.

qxf2, over 8 years ago

The Reddit data set on BigQuery is excellent. My side project is tangentially related to the fact that the Reddit data set has normal folk commenting. I have been using Reddit comments to help writers research and find what normal people say about any topic [1]. So far, I have had little luck in incorporating the comment scores and coming up with something more useful than the standard bag of words search techniques [2]. I am currently working on making more interesting/creative writing prompts ... again based on the Reddit data set.

One problem for data geeks to solve: Reddit data fits nicely into a graph structure and not so nicely in table form. It would be fantastic if someone put the Reddit data set into a graphdb and made it open.

[1] https://wisdomofreddit.com and https://github.com/qxf2/wisdomofreddit

[2] For now, my search engine just uses Whoosh's (out of the box) BM25F.
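
To give a rough idea of what the BM25 family of ranking functions mentioned in [2] does, here is a sketch of plain Okapi BM25. It is the single-field form, not the multi-field BM25F, and it is not Whoosh's implementation. The function name `bm25_scores`, the whitespace tokenization, and the parameter values `k1=1.2` and `b=0.75` are assumptions picked for the example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with plain Okapi BM25.

    Assumes `docs` is a non-empty list of strings; tokens are
    lower-cased, whitespace-separated words.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    query_terms = query.lower().split()
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Rare terms get a larger weight than common ones.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


if __name__ == "__main__":
    # Hypothetical stand-ins for Reddit comments.
    comments = [
        "reddit comments about pc building",
        "cooking pasta at home",
        "pc building advice from the build a pc subreddit",
    ]
    for comment, score in zip(comments, bm25_scores("pc building", comments)):
        print(f"{score:.3f}  {comment}")
```

Nothing here looks at comment scores. Folding upvotes into the ranking would be an extra design decision layered on top of the text match.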

Tepix, over 8 years ago

So, what's the reddit equivalent of X-No-Archive (https://en.wikipedia.org/wiki/X-No-Archive).. or X-No-Teach-AI-That-Will-Kill-My-Children? Asking for a friend.

bigato, over 8 years ago

A computer will learn how to speak from reddit, hahahaha. What could possibly go wrong?

jwtadvice, over 8 years ago

Let's not kid ourselves. The technology will be used by PR firms, advertising companies, political campaigns and governments to pretend, at scale, that there is public consensus on certain issues and to drown social media conversation in particular narratives.

Anyone have any good defensive technology ideas?

cooper12, over 8 years ago

How does the team plan to address the issues faced by Microsoft's Twitter chatbot Tay [0], which had racist inputs and in turn gave similar responses? While I don't know how recent the corpus is, the majority of reddit speaks like and holds the views of college-aged white males, and many of the things said on reddit have been deplorable. It'd be a shame if OpenAI pooled all that computing power into training on a bad data set, resulting in an AI that regurgitates memes and random references in response to anything.

[0]: http://www.theverge.com/2016/3/24/11297050/tay-microsoft-chatbot-racist

krashidov, over 8 years ago

One of the things I like to do is play out a business to its absurd maximum. What's the craziest possible future I can see for a company and its assets?

For Reddit, I like to imagine that it's basically the training data for all of the emotional and societal nuances that a human goes through.

Think about all of those stories that people post in Ask Reddit that explain western norms and no-nos: how to treat people with respect, when to call the police, how to communicate properly, etc.

Obviously we're far away from using the data to its full potential, but one day I could see Reddit data being used to make our AIs more relatable and human-like.

snehesht, over 8 years ago

I wonder if anyone else thinks reddit is a bad example in teaching an AI.

sjcsjc, over 8 years ago

"nearly two billion Reddit comments will be processed"

For interest, how many HN comments are there? Miles fewer, no doubt, but perhaps far more erudite and less likely to offend.

ge96, over 8 years ago

I am a human and I don't understand. I thought speaking would imply sound, not text.

Time to read the article.

Ignorant person speaking here: this still doesn't sound like AI, you're just making something follow patterns and regurgitate them. Is that AI? Maybe that's what I do, a tech parrot. Ahh well, time will tell.

Of course we imitate our parents/others to learn how to speak.

I was interested in parsing vocal sound bytes and learning how sound was created/formed letters/words.

Alright, ignorant person out.

Wei-1, over 8 years ago

And we all know what type of a person OpenAI will become.

random_upvoter, over 8 years ago

Instead of the Reddit corpus you may just as well use a picture library of human footprints. It would be no more optimistic.

Human speech is produced from the conscious experience of being a human being. If your dataset contains just the speech, without the experience, there's simply not enough there. Any machine trained on this data is doomed to talk hollow rubbish.

tvural, over 8 years ago

I'm a bit worried that OpenAI hasn't released anything substantive for the past four months. There are research ideas like this one, but most ideas don't pan out. With the number and quality of people they have, I would expect to have heard of some kind of progress.

shawn-butler, over 8 years ago

Great, just what I need.

A virtual assistant that has the personality of a smug know-it-all, know-nothing 20 year-old with little motivation to do anything but regurgitate surface knowledge and sarcasm in an attempt to look intelligent without expressing genuine interest in helping anyone.

plusepsilon, over 8 years ago

Reddit and Hacker News comments are surprisingly good data. They cover a wide array of topics and writing styles, generally written better than Facebook comments or Twitter, easier to process than Common Crawl or ukWac, and less rigid than newspaper writing.

AndrewKemendo, over 8 years ago

Is it just me, or does Greg Brockman speak startlingly like Sam Altman? Given that Sam helped start OpenAI, it wouldn't surprise me if there was some mirroring going on in the hiring process.

philjackson, over 8 years ago

I'm looking forward to a bot making a joke about banging my mum...

yahma, over 8 years ago

Anyone know what type of architecture they will be using? Nvidia is involved, so I suspect there will be some type of deep learning. Will it be LSTMs? Adversarial nets?

Dowwie, over 8 years ago

"Why does the AI keep calling everything Meta!?"

peter303, over 8 years ago

Will it understand what it is speaking about?

Humans have the opposite problem. We understand what we talk about, but have little idea how our brains create language.

cjdulberger, over 8 years ago

It'd be interesting to see an AI trained using HN, ingesting content of posted links and comments.

LeanderK, over 8 years ago

I think one of the major advantages over Microsoft's approach with Tay is that you can't mess with it on purpose, as long as they choose their subreddits wisely. It will probably learn its fair share of racial slurs and insults, but that's just what humanity is like.

Keyframe, over 8 years ago

Interesting to see what will happen with non-English comments.

sidcool, over 8 years ago

There's no saying what the AI will grow up to be.

rbanffy, over 8 years ago

Could be worse. Could be 4Chan...

bertomartin, over 8 years ago

Interesting corpus there ;)

benkaiser, over 8 years ago

me too thanks