科技回声

9 comments

dwwatk01, over 16 years ago

I ran into a similar problem teaching myself Perl a couple of years ago by doing a short tutorial and then foolishly jumping into a co-worker's code. "What the hell is '$.'? Hmm, well I'm sure Google can help me. What? No matching documents?!? What is this crazy s.o.b. doing here??"

robg, over 16 years ago

Way back in 2004, I ran a little experiment with Google -- over a period of a week, I searched for an entire dictionary of ~110k individual English words and recorded how many hits Google returned for each.

Of course, a word can appear on a page multiple times. That's why, I think, folks used to ignore the stopwords. They introduced noise when trying to access the content words. Now, with span constraints, you can incorporate them into the analysis. So "a matrix" and "the matrix" return very different results, even without quotes.

whacked_new, over 16 years ago

> "The" is one of the most common words in the English language

"the" is THE most common word in English.

gills, over 16 years ago

It makes sense that low-information terms would have a lower preference when searching without any context. If your index models the context around terms, you can get better results from a low-information search.

I think... I'm kind of shooting from the hip here, relating it to context modeling in lossless compression schemes like CABAC and PPM.

Could you overcome stop words with some sort of Bayesian phrase matching over some learned hidden states?
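One common way to put a number on "low-information" is inverse document frequency (IDF), which may or may not be what the commenter had in mind. The sketch below is only an illustration: the function name and the four toy documents are made up, and it shows that a word appearing in most documents, such as "the", gets a weight close to zero.

```python
import math


def inverse_document_frequency(term, documents):
    """Return log(N / df) for `term`, or 0.0 if no document contains it.

    N is the number of documents and df is the number of documents
    that contain the term at least once.
    """
    n_docs = len(documents)
    doc_freq = sum(1 for doc in documents if term in doc.lower().split())
    if doc_freq == 0:
        return 0.0
    return math.log(n_docs / doc_freq)


# Hypothetical toy corpus, invented for this illustration.
DOCS = [
    "the matrix is a film",
    "the band played all night",
    "a matrix of numbers",
    "the the is a band",
]

if __name__ == "__main__":
    for word in ("the", "matrix", "film"):
        print(word, round(inverse_document_frequency(word, DOCS), 3))
```

With this corpus, "the" scores lower than "matrix", which in turn scores lower than "film": the rarer the term, the more it narrows the search.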

jgrahamc, over 16 years ago

POPFile has stopwords because people in the community insisted on it. My commercial email filtering software does not, because in my tests the accuracy difference they made was so small as to be in the noise. And they were costly in terms of time to check and to maintain across different languages.

randomuser7, over 16 years ago

I guess the idea was to help allow English search queries (i.e. to exclude words people were using to describe their query but that shouldn't be searched for).

liuliu, over 16 years ago

It is about how to rank with stop words. The traditional tf-idf method didn't work well because it contains no information about each word's location relative to its context. A simple method is to index the word group "the the" instead of the single word "the". I guess that is what Google does now with "to be or not to be". However, word grouping is a common technique in CJK full-text search.
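A minimal sketch of the word-group idea, assuming whitespace tokenization and an in-memory index. The function names and sample documents are hypothetical, and nothing here is claimed to be how Google actually does it. Each adjacent word pair is indexed as a unit, so "the the" becomes a distinct key rather than two occurrences of a very common word.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberate simplification)."""
    return text.lower().split()


def build_bigram_index(documents):
    """Map each adjacent word pair to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        tokens = tokenize(text)
        for pair in zip(tokens, tokens[1:]):
            index[pair].add(doc_id)
    return index


def search_phrase(index, phrase):
    """Return ids of documents that contain every adjacent pair of the phrase.

    This is a candidate filter: a document could contain all the pairs
    without containing the phrase contiguously, so a real system would
    verify the matches afterwards.
    """
    tokens = tokenize(phrase)
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return set()  # a single word has no pair to look up
    result = set(index.get(pairs[0], set()))
    for pair in pairs[1:]:
        result &= index.get(pair, set())
    return result


# Hypothetical sample documents, invented for this illustration.
DOCS = {
    1: "the the released a new album",
    2: "the matrix is the film to see",
    3: "to be or not to be that is the question",
}
INDEX = build_bigram_index(DOCS)

if __name__ == "__main__":
    print(search_phrase(INDEX, "the the"))
    print(search_phrase(INDEX, "to be or not to be"))
```

All three documents contain "the", so a single-word index cannot separate them, while the pair ("the", "the") occurs only in document 1. The trade-off is a larger index, since there are many more distinct word pairs than distinct words.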

jcromartie, over 16 years ago

I like how the top Google words are all generic web marketing words, with the two exceptions of "hotels" and "women."

tumult, over 16 years ago

http://www.google.com/search?hl=en&safe=off&q=%22the+the%22+band&btnG=Search

As soon as the article asserted this wouldn't work, I tried googling, and it worked fine. I stopped reading after that.

Edit: for whatever reason, if you follow the link directly, the search results are wrong. You might have to submit the query again after the page loads to get the right results. Weird! Maybe he was onto something (nope).

Stop Me If You've Seen This Word Before
