
17 Year Old Builds a Better Search Engine

165 points · by adebelov · almost 13 years ago · 18 comments

tlrobinson · almost 13 years ago
The biggest reason I immediately switched to Google when I first tried it was that Google only matched documents containing exactly the words I searched for, nothing more.

It's fine (and good) if search engines add more intelligence like this, but I'll always need a way to search for exact phrases. The default behavior of Google is much "fuzzier" than it was five years ago, so I'm surprised they don't already do something like this (or do they?)
[4 replies not loaded: #4269658, #4269739, #4269707, #4270000]
tg3 · almost 13 years ago
That interviewer was terrible:

> How long have you been interested in computer coding, and searching, and things like that?

If you're going to have someone do an interview about computer science, at least let the interviewer be someone with a cursory knowledge of the field.
[3 replies not loaded: #4269558, #4269436, #4269548]
cantbecool · almost 13 years ago
If I have to see another story with 'XYZ YEAR OLD BUILDS XYZ', I'm going to go thermonuclear war on HN. Adolescents have been exposed to high-speed internet and technology since before they can even remember. What do you expect is going to happen, honestly?
[2 replies not loaded: #4269577, #4269513]
enjo · almost 13 years ago
A much better explanation: http://www.youtube.com/watch?v=fmxNuVDJZEY
[1 reply not loaded: #4269424]
wickedchicken · almost 13 years ago
Armchair analysis of his algorithm after watching his TED talk: a version of LSA that uses PageRank instead of a straight SVD to calculate rankings.

LSA[1] has been around since the 80s and is used in many applications, from GRE testing to Apple's junk mail filtering[2]. It's used a lot since the patent expired; it's relatively good and can be computed quickly. Of course, a lot of text-retrieval research has happened in the past few decades, one of my favorites being LDA[3], which relies on a much sounder statistical basis than finding lower-dimensional representations of term-document vectors. Unfortunately, LDA's model is not directly computable, and answers must be determined via Monte Carlo methods.

As for 'independence,' his terminology gets a little confused here. At first I thought he was talking about the 'bag-of-words' assumption that most large-scale language models have. These effectively ignore grammar (other than stemming) in order to efficiently determine the 'gist' of a document without its intricacies. However, his videos imply he is talking about word-sense disambiguation[4], which is certainly known about and was the crux of LSA in the first place. If he *is* talking about lifting the bag-of-words assumption, there has been some interesting work going on, such as [5] (disclaimer: I am a coauthor on that paper).

If you're interested in this stuff, I highly recommend trying out the LSA demo server at [6] (it can get swamped sometimes, so don't kill it) and David Blei's LDA implementation at [7]. The LDA-C inputs and parameters are a little obtuse when you first look at them, and I don't have my notes on how to use it at the moment, but if you play around with it, it should make sense.

This kid is crazy smart, and I hope he gets exposed to a lot of really cool research, since he can obviously pull off a lot at a young age. Best of luck to him.

[1] http://en.wikipedia.org/wiki/Latent_semantic_analysis
[2] http://developer.apple.com/library/mac/#samplecode/LSMSmartCategorizer/Introduction/Intro.html
[3] http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
[4] http://en.wikipedia.org/wiki/Word-sense_disambiguation
[5] http://aclweb.org/anthology-new/D/D12/D12-1020.pdf
[6] http://lsa.colorado.edu/
[7] http://www.cs.princeton.edu/~blei/lda-c/
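For readers unfamiliar with LSA, here is a minimal sketch of the truncated-SVD step it rests on. The corpus, raw-count weighting, and choice of k are all illustrative, not taken from the talk or the linked papers (real systems would typically use tf-idf weighting and a much larger k):

```python
# Toy LSA: term-document matrix -> truncated SVD -> compare documents
# in the reduced "latent" space. All data here is illustrative.
import numpy as np

docs = [
    "cats and dogs are pets",
    "dogs are loyal pets",
    "stocks and bonds are investments",
]
vocab = sorted({w for d in docs for w in d.split()})

# Term-document matrix of raw counts (rows = terms, columns = documents).
A = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

# Truncated SVD: keep only the top-k latent dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # each row is a document in latent space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The two pet documents land closer together than pets vs. finance.
print(cosine(doc_vecs[0], doc_vecs[1]), cosine(doc_vecs[0], doc_vecs[2]))
```

The point of the truncation is that documents sharing related (not just identical) vocabulary end up near each other in the reduced space, which is the semantic smoothing the comment describes.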
[7 replies not loaded: #4270405, #4270111, #4270119, #4269737, #4270179, #4271868, #4270132]
econner · almost 13 years ago
Essentially: PageRank where nodes are words and edges are occurrence in the same document.

Really cool idea. Great work. I love to see this kind of stuff.
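That one-line description can be sketched directly. This is a rough illustration under an assumption the comment does not spell out, namely that edges are weighted by how many documents a word pair co-occurs in:

```python
# Sketch of "PageRank over a word co-occurrence graph": nodes are words,
# edge weights count documents in which a pair of words co-occurs.
# Weighting scheme and parameters are illustrative assumptions.
from collections import defaultdict
from itertools import combinations

def word_pagerank(documents, damping=0.85, iterations=50):
    weights = defaultdict(float)   # (a, b) with a < b -> co-occurrence count
    neighbors = defaultdict(set)
    for doc in documents:
        words = set(doc.lower().split())
        for a, b in combinations(sorted(words), 2):
            weights[(a, b)] += 1.0
            neighbors[a].add(b)
            neighbors[b].add(a)

    nodes = list(neighbors)
    rank = {w: 1.0 / len(nodes) for w in nodes}
    # Total outgoing edge weight per node, for normalizing transitions.
    out_weight = {w: sum(weights[tuple(sorted((w, n)))] for n in neighbors[w])
                  for w in nodes}

    for _ in range(iterations):
        new_rank = {}
        for w in nodes:
            incoming = sum(rank[n] * weights[tuple(sorted((w, n)))] / out_weight[n]
                           for n in neighbors[w])
            new_rank[w] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank

docs = ["the cat sat", "the cat ran", "a dog ran"]
ranks = word_pagerank(docs)
```

Words that co-occur with many well-connected words accumulate rank, which is the sense in which this "scores" terms rather than pages.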
naner · almost 13 years ago
"...for the social-media world"

His search is focused on short text blurbs from social media. Not replacing Google.

EDIT: Though the technique may work well for general search.
hokua · almost 13 years ago
Definitely Bill Gates's Mini-Me
saalweachter · almost 13 years ago
It's not really appropriate to this particular story -- high school student vs. VC-funded startup -- but what really bugs me about all "the next Google" reporting is that it ignores the biggest gotcha of search: corpus size matters.

The secret sauce of a successful search engine isn't the core algorithm, which handles 90% of the work 90% of the time; it's the millions of tweaks which keep the millions of pathological cases from making every query useless. On a small enough corpus, even grep does a great job. But once you are trying to search the world, it stops being an insight problem; at scale there are so many corner cases that you can't ignore them.
DigitalSea · almost 13 years ago
In this day and age, building "X is better than well-known Y product" means nothing. You could build a car with 5 times the gas mileage of any car around today, and I bet you'll be hard pressed to make a dime for the first few years while competing with the likes of Ford, Toyota or General Motors. People don't care about things being better when it comes to the web; people stick with what they know, and if you're a Google user like me who is also invested in Gmail, Google Docs, Google Analytics, Google Adwords, etc., then you're in too deep to switch to any other search engine.

I remember Cuil when it launched. They were onto something great, made some pretty bold claims, and in many ways were better than Google at search, and look what happened to them? Nobody cared, and they died in a huge internet tire fire. Sure, the issues with poor results didn't help, and people wanted them to succeed, but let's be honest here: Cuil would have ended up like Bing (a few users, but nothing to gloat about). People are too lazy to switch to anything new, especially when it comes to search; it takes time to woo a user from another product that still does the job perfectly.

Having said that, this kid is 17 and he's done a f*cking amazing job. How many people can say at the age of 17 they built anything remotely as cool as this? I'm sure some can, but not many. If he keeps on this path, he'll be achieving bigger things in his 20s and 30s than this, and it'll be well-deserved. The interviewer was pretty bad though; he had no clue whatsoever, which was an insult to the kid being interviewed, who deserves at least an interviewer with a remotely above-average IQ.
[5 replies not loaded: #4269559, #4269554, #4269643, #4269556, #4269562]
nschiefer · almost 13 years ago
Hi, this is Nicholas, a long-time lurker on HN and the person in the video. I saw this thread during my morning commute to work (and was very surprised, to say the least!) and wanted to register to mention a few important details that the news articles always omit. Hopefully this helps correct a few misconceptions!

To begin, I'd like to flatly deny that I "built a better search engine." I did my (very academic) work in information retrieval and developed a new algorithm that seems to give significantly better search results (when compared to other academic search techniques, more on this later) on short documents like Twitter tweets. Specifically, my algorithm uses random walks (modelled as Markov chains) on graphs of terms representing documents to perform a type of semantic smoothing known as document expansion, where a statistical model of a document's meaning (usually based on the words that appear in the document) is expanded to include related words. My system is in no way, shape, or form a "search engine" or even comparable with something like Google; rather, it is an algorithm that could help improve search results in a real, commercial search engine.

My work is not, by far, the first to attempt document expansion. A number of related techniques already exist, including pseudo-relevance feedback expansion, translation models, some forms of latent semantic indexing, and some of those mentioned by exg. However, to my knowledge, the knowledge of my science fair judges (some of whom are active IR researchers), and the knowledge of my research mentor (also more on this later), my work is a novel method (not a synthesis of existing methods) that seems to work quite well in comparison to other, similar algorithms on collections of small documents like tweets.

The last point is certainly important: it is simply impossible to compare my algorithm to something like Google, for several reasons. First, I'm not a software engineer or a large company; it is downright impossible for me to craft a combination of algorithms like that found in Google to get comparable results. No commercial search engine would be so foolish as to use only a single algorithm (essentially a single feature, from an ML perspective). Instead, they use hundreds or thousands. Second, it is essentially impossible to compare search engines with any level of scientific rigour. I evaluated my system using a standard corpus of data published by NIST as part of TREC (the Text REtrieval Conference), consisting not only of 16+ million tweets, but also of sample queries and the correct, human-determined results for these queries. However, to achieve statistically comparable results, many variables have to be controlled in a way that is impossible with a large, complex search engine. Instead, the academic approach compares individual algorithms one-on-one and postulates that these can be combined to give better search results in aggregate.

Specifically, my research showed that my system achieved above-median scores on the official evaluation metrics of the 2011 Microblog corpus when compared to research groups that published last November. Furthermore, my system did the best of all of the "single algorithm" systems, including those that used other document expansion techniques like I described above.

Most of my work was spent on the development of the algorithm, proofs of its convergence and asymptotic complexity, a theoretical framework, and a statistical analysis of my results. Notably absent from this list is engineering. My project is not, by any means, "a toy engineering project" as some commenters have suggested. Actually, the engineering in my project is quite poor, as that area is not one I've had much exposure to.

To briefly address my research mentor: my parents had nothing to do with my project other than providing emotional support when I was stressed. I had a research mentor at a university whom I found after I did very well at the 2011 Canada-Wide Science Fair. He provided me with important computational and data resources (such as the corpus I used), but did not develop my algorithm, proofs, or code, which were my own work.

Given the recent attention on my project (and Jack Andraka's project on cancer detection), I'd like to point out a general trend in news articles about science fair projects. In general, the media has a tendency to focus on the potential applications of a project and ignore the science in it, leading to (seemingly fair) criticism. Using me as an example, the talk about "toy" projects and "synthesis" is fair given how my work is portrayed in the media. Somehow, "novel IR algorithm based on Markov chain-based document expansion," even with careful (and thorough!) explanation, gets turned into "Teen builds a better search engine." Similarly, a great friend (and roommate) of mine, whose project was on drug combinations to treat cystic fibrosis, was completely shredded on Reddit when it got significant media attention last year. In his project, he never once claimed or tried to claim that he had done anything with immediate (or even near) medical applications. Instead, he discussed his work to identify molecules that bind to different sites on the damaged protein and can work synergistically as drugs. The media spin-machine quickly turned this into "Teen cures cystic fibrosis" and other such nonsense. Even Jack's project (I know both him and his project), which is unusually "real world," has been overspun by the media. It's just what happens. Heck, people even make fun of it in upper-level science fairs, but it still happens.

Finally, thank you for the encouraging words! To finish with a shameless plug: while fairs like ISEF tend to be very well-funded (because of the positive publicity), many regional and state (in the US) or national (outside of the US) youth science organizations struggle to find funding (and even volunteers) to run fairs that send people to ISEF. If you ever find yourself in a position where you can help (financially, with your time, whatever), I'd strongly encourage it. Given the impact that science fairs have had on my life, I know that I certainly will.
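For readers who want a concrete picture of the general flavour Nicholas describes (a random walk over a term graph that spreads a short document's weight onto related words), here is a toy illustration. To be clear, this is not his algorithm: the co-occurrence matrix, restart weight, and update rule below are invented for illustration only.

```python
# Toy Markov-chain document expansion over a term graph (illustrative only,
# NOT the algorithm from the project being discussed).
import numpy as np

terms = ["apple", "fruit", "pie", "iphone", "phone"]
# Hypothetical symmetric co-occurrence counts between terms.
C = np.array([
    [0, 4, 2, 3, 1],   # apple
    [4, 0, 1, 0, 0],   # fruit
    [2, 1, 0, 0, 0],   # pie
    [3, 0, 0, 0, 5],   # iphone
    [1, 0, 0, 5, 0],   # phone
], dtype=float)

# Row-normalise counts into Markov-chain transition probabilities.
P = C / C.sum(axis=1, keepdims=True)

def expand(doc_terms, steps=2, restart=0.5):
    """Spread a document's term weights along the chain; the restart term
    keeps probability mass anchored on the document's original words."""
    d0 = np.array([1.0 if t in doc_terms else 0.0 for t in terms])
    d0 /= d0.sum()
    d = d0.copy()
    for _ in range(steps):
        d = restart * d0 + (1 - restart) * (d @ P)
    return dict(zip(terms, d))

expanded = expand({"apple", "pie"})
# "fruit" now carries weight even though the document never mentions it,
# while "phone" (related to the other sense of "apple") stays small.
```

The expanded distribution can then be matched against queries, so a tweet about apple pie can be retrieved by a query mentioning "fruit" even though the word never appears in it.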
[3 replies not loaded: #4287952, #4288032, #4275499]
koide · almost 13 years ago
I was instantly reminded of David Evans (of UVA and Udacity fame: http://www.cs.virginia.edu/~evans/) listening to this kid, both physically and in his manner of speech.

I really expect to... eeeh... see more of him in the future.
PetroFeed · almost 13 years ago
I know very little about search, but I love how he looked at the graph element, exploring relationships between "entities".

The relationships between entities in our world hold so much information, and yet in most databases they're reduced to a join between tables. Mapping the relationships and capturing the "hidden" information, thereby making it available for use, unlocks amazing potential.
akrymski · almost 13 years ago
This kid would definitely make a great marketing person. There's absolutely nothing new in what he's done, but he's presenting it really well; thumbs up to his parents, who have probably done quite a bit of the work ;)

When I was 18 (10 years ago) I did loads of research into that stuff and knew just as much about information retrieval, vector space tf-idf models, latent semantic indexing, wordnet analysis, etc. At the time it was fairly cutting-edge research. This stuff isn't anything new now; I was actually forced to decipher some research papers instead of reading popular books on the subject. It was fairly obvious to me back then that none of these techniques worked well for general web search. I did end up building a system that clustered Google search results (in realtime) into DMOZ categories, letting you refine your search results by clicking a category (which was actually useful and worked quite well in case you were searching for something ambiguous like "jaguar").

None of these techniques are new to anyone working in information retrieval. Just looking at co-occurrence of words in tweets and expanding the query with some related terms (weighted appropriately) would probably achieve what he has done (a weekend project for an average dev).

I'd call this kid really smart if he'd actually figured out how to improve general web search, or could think of a useful application at least. Talking about existing research and making it look like your own isn't great form, in my opinion. Coming up with your own definition for a "word" just makes you look stuck up. You're much better off acknowledging the work of other researchers and quoting them, although that would never generate as much press, I guess.

Sorry if the rant is quite negative; I'm just getting a bit fed up with all the marketing surrounding "young geniuses" and teenage entrepreneurs these days. If I wanted to read that stuff I'd get the local paper.
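The "weekend project" version the commenter sketches — expand a query with the terms that most often co-occur with its words, weighted appropriately — might look something like this. Every name, corpus, and weight here is illustrative, not anyone's actual code:

```python
# Naive query expansion via word co-occurrence in short documents.
# Illustrative sketch only; the down-weighting factor is an assumption.
from collections import Counter
from itertools import combinations

def cooccurrence(docs):
    counts = Counter()
    for doc in docs:
        for a, b in combinations(sorted(set(doc.lower().split())), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts

def expand_query(query, docs, top_k=2):
    counts = cooccurrence(docs)
    expanded = {w: 1.0 for w in query.lower().split()}
    for word in list(expanded):
        related = Counter({b: c for (a, b), c in counts.items() if a == word})
        total = sum(related.values()) or 1
        for term, c in related.most_common(top_k):
            if term not in expanded:
                # Expansion terms get less weight than original query words.
                expanded[term] = 0.5 * c / total
    return expanded

tweets = ["new iphone case", "iphone case review", "apple pie recipe"]
print(expand_query("iphone", tweets))
```

Whether this simple version actually matches the project's results is, of course, exactly the kind of claim the evaluation corpora mentioned elsewhere in the thread exist to test.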
[1 reply not loaded: #4273924]
richardburton · almost 13 years ago
What I found amazing was the contrast between the interviewer's grasp of the English language and the interviewee's.
Miner49er · almost 13 years ago
He read the Harry Potter books when he was 6?!
[1 reply not loaded: #4270069]
lwat · almost 13 years ago
Has Google not been doing this for ages?
[3 replies not loaded: #4269542, #4269509, #4269359]
spaghetti · almost 13 years ago
If it's so great, why aren't I using it right now? After my first Google search in 2000 I was hooked.

Does the article actually link to anything related to his computer coding? Oh wait, here it is: http://www.cuil.com
[1 reply not loaded: #4269383]