What this article doesn't tell you is that the human who wrote it works for Narrative Science.<p>For those who don't know, if you see a story in your local paper, and it doesn't involve a car crash, crime, weather, or sports, it was probably placed there by a PR representative. Most of the things you read are not the result of random reporters deciding to cover X or Y, but a paid, concerted effort to place story X or Y in the paper by providing the paper with a fully pre-digested story to perhaps rewrite, or perhaps not.<p>The words "narrative science" appear 14 times in that story, including such clunkers as "To generate story “angles,” explains Mr. Hammond of Narrative Science...." when Mr. Hammond has already been introduced earlier in the story. It even includes pricing: hey readers, this is not only cool and will win the Pulitzer Prize, but it's cheap too! No mention of competitors... It reads like an ad because it is an ad.<p>This story was provided, probably almost word for word, by a PR person to the NYT reporter.<p>I'm not sure if computer-generated text will be better or worse than the media system we have now.
I love seeing these examples of product development: begin with a very specific niche at the edge (not tackling the mainstream head-on) and "target non-consumption" - that way, you have no competition; and it's not a zero-sum game where you beat someone; you're creating value that never existed before. This is possible not because it's good, but because it's <i>cheap</i> (and good enough):<p>> primarily a low-cost tool ... for local youth sports .... and financial results of local public companies ... <i>“Mostly, we’re doing things that are not being done otherwise,”</i><p>Then, once you have some customers - <i>any</i> customers! - you improve it, bit by bit. It doesn't need to be perfect in the first place; it doesn't need to be perfect in the end. It just needs to be good enough to be useful.<p>> [customer] worked with Narrative Science for months to fine-tune the software<p>As for the technology itself, we're not told anything of its details, just what it can do. This is a marketing article, not a tech report. It would be interesting to see the models they use for stories, and whether they use grammars for the overall structure. These are very narrow domains, which are the easiest to start with: you could enumerate all the standard cliches, understand when they apply, and tweak the model. That's where the journalistic domain expertise of the two founders would come in handy. BTW: "easiest" is only relative - it would still be very difficult (almost impossible), and kudos to these guys for actually doing it - and even better, making an actual business out of it.<p>It reads like a '50s Asimov story - the future is finally arriving.<p>But a Pulitzer in 5 years is absurd, either cynical puff or visionary bravado. Theoretically possible, I think, maybe in 50 years - the figure I've long given for strong AI. ;-)
Did they really write an entire two-page article while ignoring what is, in my opinion, the real leader in this space? <a href="http://statsheet.com/" rel="nofollow">http://statsheet.com/</a>
My worry here is computers will learn to write articles specific to every individual.
The computer will know what other articles we liked and what we didn't like and just try to write to what we want to read. This will make it even less likely we'll hear an opposing view to our own, if the computers are giving us what we want to read.
I'm skeptical of the claim that a program could win a Pulitzer. How does it decide what to write about, who to interview, and what questions to ask?<p>Reporting a day at the races or the markets is easy because we know which kinds of data are relevant and we have them available.
I wonder if these automatically generated articles will ever become good enough to be worth reading. Currently, they seem to be just good enough to fool Google and convince people to click the link. Do any sports fans bookmark and come back to these sites?<p>No matter how good the algorithms get, they are still limited by their input, the statistics. If, for example, a player scores a very unusual goal, say a bicycle kick in soccer, then a real writer who actually saw the match would surely mention it. An algorithm could not mention it if the match statistics have no field for an unusual goal.
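To make the input limitation concrete, here is a minimal Python sketch. Everything in it is hypothetical: the `describe_goal` function, the record layout, and the optional `style` field are assumptions for illustration, not anything a real stats feed or Narrative Science is known to use. The generator can only mention the bicycle kick when the data carries a field for it.

```python
def describe_goal(goal: dict) -> str:
    """Turn one goal record into a sentence using only the fields present."""
    sentence = f"{goal['scorer']} scored in minute {goal['minute']}"
    # "style" is a hypothetical optional field; if the feed omits it,
    # the generator has no way to know the goal was unusual.
    style = goal.get("style")
    if style:
        sentence += f" with a {style}"
    return sentence + "."


plain = {"scorer": "Smith", "minute": 78}
rich = {"scorer": "Smith", "minute": 78, "style": "bicycle kick"}

print(describe_goal(plain))  # Smith scored in minute 78.
print(describe_goal(rich))   # Smith scored in minute 78 with a bicycle kick.
```

Both records describe the same goal; the only difference in the output comes from what the input happened to record.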
Here's a description of my venture into this territory, in which I generated formulaic lottery result briefs:<p>"I wrote this article with one mouse click"<p><a href="http://coding.pressbin.com/60/I-wrote-this-article-with-one-mouse-click" rel="nofollow">http://coding.pressbin.com/60/I-wrote-this-article-with-one-...</a><p>I can't imagine the sort of code base that would be needed to make these stories not seem formulaic.
Single page:<p><a href="http://www.nytimes.com/2011/09/11/business/computer-generated-articles-are-gaining-traction.html?_r=1&pagewanted=all" rel="nofollow">http://www.nytimes.com/2011/09/11/business/computer-generate...</a>
ObXKCD: <a href="http://xkcd.com/904/" rel="nofollow">http://xkcd.com/904/</a><p>There are certain topical areas which lend themselves to automated content generation: sports, financial news, weather, astronomy (astrology isn't worth mentioning), earthquakes and other severe events, machine monitoring.<p>These are domains in which a quantified or measured outcome is tied to a specific point in time or event (final score, market close, daily forecast, etc.). The important data has already been highlighted; all you've got to do is sprinkle some syntactic sugar around it.<p>Oddly enough, these are areas in which you're already most likely to find existing "AI"-type content generators.<p>In areas in which you've got to do significant determination of what is salient, the approach isn't nearly as successful.
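A rough Python sketch of the "syntactic sugar" idea above, assuming the final score is the only input. The function names, the verb list, and the margin thresholds are all made up for illustration; this is not how any of the products mentioned in the thread is known to work.

```python
def verb_for_margin(margin: int) -> str:
    """Pick a stock verb from the size of the win (thresholds are arbitrary)."""
    if margin >= 10:
        return "routed"
    if margin >= 4:
        return "beat"
    return "edged"


def game_recap(home: str, away: str, home_score: int, away_score: int) -> str:
    """Wrap a final score in one templated sentence."""
    if home_score == away_score:
        return f"{home} and {away} tied {home_score}-{away_score}."
    if home_score > away_score:
        winner, loser, high, low = home, away, home_score, away_score
    else:
        winner, loser, high, low = away, home, away_score, home_score
    return f"{winner} {verb_for_margin(high - low)} {loser} {high}-{low}."


print(game_recap("Wildcats", "Tigers", 21, 7))  # Wildcats routed Tigers 21-7.
```

The salient fact (who won, and by how much) arrives already picked out in the data, so the code only has to choose wording. Nothing here decides what is worth reporting, which is the hard part in less structured domains.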
This is a recent email I got from the Facebook Support team regarding a vanity URL for my business. I could swear this guy is a robot or a script, and I wonder if Facebook is using the technology described in the article:<p>----------------<p>We’re sorry, but we’re unable to process your request because another entity has made a previous request concerning this username. If you are still interested in claiming the username, you may contact us in 60 days for an update about its availability.<p>---<p>You have reached the right channel for these requests. As mentioned earlier, we have no further information to share with you concerning the username "xxxx" (marked out). We will be unable to assist you further from this alias.<p>----------------<p>What human being talks like that?
I suppose this may do for articles that just deliver some facts. However, the kind of stuff I enjoy reading doesn't just barf up some facts in the form of sentences; it provides insight into what the implications of those facts may be and also draws from the past to better put things in context.<p>That's not to say their technology couldn't be improved to search the web and see what past events are relevant, but providing <i>good</i> insights about the implications of the facts will be a whole lot tougher. I don't think journalists need to be shaking in their boots unless they only deliver the quality and depth of results that this algorithm delivers.
These technological advances make me shudder about potential job losses in the future, even though previous technological advances created new jobs.<p>Sure, there's no way that my profession and the great majority of jobs on the internet would be possible if we relied on human switchboard operators rather than on automation. That doesn't mean the same will be true for the next advances in technology, does it?
This is pretty fascinating stuff, despite the limitations and obvious bias of this article. Are there any Open Source libraries or papers which cover toy implementations of this sort of thing? (Assuming, of course, that it is not simply a bunch of if/else constructs applied to templates, which would be far less interesting.)
This reminds me of what MarketBrief is doing for financial documents. Definitely less color / variance in the stories though.<p><a href="http://techcrunch.com/2011/08/15/yc-funded-marketbrief-makes-obtuse-sec-documents-human-friendly/" rel="nofollow">http://techcrunch.com/2011/08/15/yc-funded-marketbrief-makes...</a>
For those interested, the best source of research in this field is the "Special Interest Group on Natural Language Generation": <a href="http://www.aclweb.org/anthology/siggen.html" rel="nofollow">http://www.aclweb.org/anthology/siggen.html</a>