Show HN: Open-source search engine with 2bn-page index

229 points by deusu over 8 years ago

31 comments

throwaway13337 over 8 years ago

Alternative general-purpose search engines are an exciting idea.

It seems a lot like we're back around the time when Yahoo was dominant and searching was sort of awful. When you searched, what ranked highest was market-driven sorts of stuff.

Right now, for topics normal people (not techies) search for, all you get are content farm sites with JS popups asking for your email address. Try searching for anything health-related, for example. We've regressed.

My own half-solution is to look only for sites that are discussions: Reddit, HN, etc. It could be better. A search engine that favored non-marketing content could really steal some thunder.

This doesn't look like that, but maybe it's a start?

CM30 over 8 years ago

Well, I admire the work behind it, and I think the idea is good (especially how having this open source means multiple sites can build on the same data set and get it more and more accurate over time).

But I have to be honest and say that it's just not working for me.

I type in Reddit, and it shows links to the NSFW subreddits instead of the main site or anything else on it.

Typing in Wikipedia gives me the Dutch version of Wikipedia.

Mario Wiki? The page on Mario Wiki about Mario Clash, then the Smash Bros Wiki and a bunch of SEO spam pages.

Pokemon Go gets me no relevant results at all. Certainly not anything official, that's for sure.

It's a decent start, and having 2 billion pages indexed is pretty impressive for a project like this, but it's just not really usable as a search engine yet.

laurent123456 over 8 years ago

They need to filter porn out of their search results (even for common queries like "hat", there's only porn) and perhaps be more resilient to SEO techniques, since it looks like there's a lot of spam in the top results. Queries with common words such as "cat" return almost only irrelevant results.

I'd really like to see that kind of project working as a good alternative to Google, but as it is it's not really usable.

Taek over 8 years ago

Is the two billion page index open source?

I've been thinking a lot about data recently. Seems to me like Pandora's box is open. Google knows where you live, where you eat, what your fetishes are, all of your sexual partners. Facebook knows most of those things too, via different methods. And if you run Windows, Microsoft probably has access to most of that as well. Apple will too, because if they don't they won't be able to compete. Tesla, Uber, Waze also have a huge amount of data on your life.

Everyone is pushing the envelope on how much data they are collecting, and the companies which collect more data will compete better. As tech gets better we will increasingly be unable to resist sharing our whole lives with the companies who are powering modern living.

Even worse, there's a huge monopolization effect to having data. Nobody else has anywhere near as much data as Google. That means nobody else can compete. Never mind the engineering: your algorithms can be 2x as good, but you won't have 0.1% of the data of a company with billions of daily users.

So Google and Facebook are left untouchable. Microsoft, Apple, and maybe Amazon can get in range. Is there anyone else?

We can fight back by giving up the privacy war and blowing the doors open instead. Take your data (as much as you dare) and make it public. Let every startup have access to it. Let every doctor have access to it. Give the small players a fighting chance.

That does mean a massive cultural shift. It means your neighbors will be able to look up your salary, your fetishes, your personal affairs. It's a big deal.

I don't see any other way out of this though. Surveillance technology is getting better faster than privacy technology, because surveillance tech has the entire tech industry behind it. Smarter phones, smarter TVs, smarter grocery stores, smarter credit cards, smarter shoes... smarter everything. Privacy is melting away and we aren't getting it back.

fnord123 over 8 years ago

It's written in Pascal. Neat.

However, it's not very good. If I search for "banana" I get information about a sex shop rather than about bananas.

mstolpm over 8 years ago

In addition to porn not being filtered out and the result ordering not prioritizing "quality" sources, some of the indexed site data is at least 4-6 months old, and the sites have changed heavily since the last crawl. I even got 404 errors. That makes it very hard to find a use for the project other than academic interest.

jbb555 over 8 years ago

I think projects like this are really important because they help reduce the impression that big server projects are only meant to be done by big companies. The internet is becoming a content consumption medium for many people.

I'm not sure I'll use this, but I'll try to... it all depends on how good it is. But I approve of the project so I sent a (very) small bitcoin donation to hopefully help fund it for a few more minutes :)

ccleve over 8 years ago

You get really good performance on not much hardware. Can you share some technical details?

- file formats, particularly the postings
- query evaluation strategy
- update strategy

I poked around in the source code a bit, but couldn't find these things.

pmontra over 8 years ago

Written in Delphi. I might be wrong, but I don't see many people downloading and working on it: there's a 30-day free trial and then you have to pay for the development environment. IMHO it's a non-starter for an open source project, but if it's the only language the author is comfortable with, well, that's OK.

NKCSS over 8 years ago

Fun, but overall quality seems a bit lacking.

When I search for myself, the top 10 results don't even have my last name ('Kusters') and just show pages that have the word 'Nick'. I suppose you don't use a form of LSA to score the search results? Maybe it's too specific, but afaik mainstream search engines seem to give somewhat consistent results here.

https://deusu.org/query?q=nick+kusters

Looking at the code (https://github.com/MichaelSchoebel/DeuSu/) I notice that you have ranking modifiers based on the TLD; why not store the reported content language and score based on that? Isn't that more relevant?
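
To make the content-language suggestion concrete, here is a minimal sketch in Python (not DeuSu's Pascal, and not taken from its source). It assumes each indexed page stores a reported language, for example from the `<html lang="...">` attribute or a `Content-Language` header. The `Page` fields, the `language_boost` function and the boost values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    base_score: float       # hypothetical relevance score from the existing ranker
    content_language: str   # reported language tag, e.g. "en-US"; "" if unknown


def language_boost(page: Page, query_language: str,
                   match: float = 1.2, unknown: float = 1.0,
                   mismatch: float = 0.8) -> float:
    """Return a multiplier based on reported content language, not the TLD.

    The multiplier values are illustrative assumptions, not tuned numbers.
    """
    # Compare only the primary subtag, so "en-US" and "en-GB" both count as "en".
    lang = page.content_language.lower().split("-")[0]
    if not lang:
        return unknown
    return match if lang == query_language.lower() else mismatch


def rank(pages: list[Page], query_language: str) -> list[Page]:
    """Sort pages by base score adjusted with the language multiplier."""
    return sorted(
        pages,
        key=lambda p: p.base_score * language_boost(p, query_language),
        reverse=True,
    )


pages = [
    Page("https://nl.wikipedia.org/", base_score=1.0, content_language="nl"),
    Page("https://en.wikipedia.org/", base_score=0.9, content_language="en-US"),
]
ranked = rank(pages, "en")
print([p.url for p in ranked])
```

With these made-up numbers the English page overtakes the Dutch one for an English query even though its base score is lower, which is the behaviour a TLD-only modifier cannot provide for multilingual domains such as .org.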

gkst over 8 years ago

Pascal is an interesting language choice. I think it is the first time I've seen an open source project written in Pascal that is actually used in production.

skykooler over 8 years ago

It shows snippets of the web pages under each result; however, generally not the particular snippets that contain the search term. I would think that would be useful.
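
As a rough illustration of a query-biased snippet, the sketch below picks a window of text around the first occurrence of any query term and falls back to the start of the page when nothing matches. It is written in Python rather than the project's Pascal, the `query_snippet` name and the 160-character width are assumptions, and real engines typically score several candidate windows instead of taking the first hit.

```python
import re


def query_snippet(text: str, query: str, width: int = 160) -> str:
    """Return about `width` characters of `text` centred on the first query-term hit."""
    terms = re.findall(r"\w+", query)
    hits = []
    for term in terms:
        # Case-insensitive search on the original text keeps offsets valid.
        match = re.search(re.escape(term), text, re.IGNORECASE)
        if match:
            hits.append(match.start())
    if not hits:
        # No term found: fall back to the start of the page.
        return text[:width].strip()
    start = max(0, min(hits) - width // 2)
    end = min(len(text), start + width)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end].strip() + suffix


page_text = "x " * 200 + "bananas are yellow " + "y " * 200
snippet = query_snippet(page_text, "banana")
print(snippet)
```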

yati over 8 years ago

Looking at the source code took me back to days when I used to do stuff in Delphi :)

Neat project -- loads of room for improvement, but a great initiative!

swiley over 8 years ago

The site's interface is just incredibly pleasant compared to Google.com. I really hope the author sticks with it. Unfortunately I'm not sure it's usable right now: searching "group theory Wikipedia" never brings up a Wikipedia page (although maybe I should just be directly searching Wikipedia if that's what I wanted).

rshm over 8 years ago

As of Aug 16, Common Crawl has 1.73bn pages. For the complementary set of URLs, if it's of any benefit, you can use their data dump as a seed.

If the metadata (such as last modified) of your index is small enough to upload to AWS, you can also reduce your re-crawl effort when they have a fresh release.

supersan over 8 years ago

Hi, I find the blog more interesting right now, since I hope to find write-ups about how you were able to manage such a herculean task on your own.

Crawling 2bn pages could take forever and could generate huge bandwidth bills, so any lessons you learnt, pitfalls you faced, etc. would be a great read.

ommunist over 8 years ago

DeuSu doesn't seem to index the Cyrillic part of the Internet, and cannot give you results for Greek; try https://deusu.org/query?q=ελιά . Is it a Latin ANSI-only index?

tychuz over 8 years ago

And all JavaScript-related questions still have w3schools as the first result, god dammit.

kowdermeister over 8 years ago

Strange, the Wikipedia article is not on the first page, and don't blame me for searching for something non-German :)

https://deusu.org/query?q=berlin

0xmohit over 8 years ago

Earlier discussion: https://news.ycombinator.com/item?id=9122397

ommunist over 8 years ago

DeuSu does not crawl social pages, it seems. No traces of LinkedIn profiles and no Facebook. From a certain point of view, this is a good thing.

billconan over 8 years ago

I searched "meta programming c++" and the top results are all about Java.

I'm curious, is it expensive to run a search site like this?

vain over 8 years ago

Google's secret ingredient to stay relevant and informational is Wikipedia.

Deusu on the other hand seems to weight words in URLs highly.

If you search for scientology only on Deusu, you might end up wearing a funky hat: https://deusu.org/query?q=scientology

amirouche over 8 years ago

Did you think about using database dumps of popular services like HN, SO or Wikipedia to speed up crawling and improve relevance?

outpan over 8 years ago

Awesome job!

For the life of me I can't figure out how you manage to crawl over a billion web pages (even in 2-3 months), index the data and run the server with €300 per month. Especially the crawler part...
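
For a sense of scale, the back-of-the-envelope arithmetic can be sketched as below. The one billion pages and the 2-3 month window come from the comment above; the 50 KB average page size is purely an assumption for illustration, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def crawl_rate(pages: float, days: float) -> float:
    """Average pages per second needed to fetch `pages` within `days`."""
    return pages / (days * SECONDS_PER_DAY)


def bandwidth_mbit(pages_per_second: float, avg_page_kb: float) -> float:
    """Sustained download bandwidth in Mbit/s for a given page rate and size."""
    return pages_per_second * avg_page_kb * 1000 * 8 / 1_000_000


ASSUMED_AVG_PAGE_KB = 50  # assumption, not a figure from the project

for days in (60, 90):
    rate = crawl_rate(1_000_000_000, days)
    mbit = bandwidth_mbit(rate, ASSUMED_AVG_PAGE_KB)
    print(f"{days} days: {rate:.0f} pages/s, about {mbit:.0f} Mbit/s")
```

That works out to roughly 130 to 190 pages per second sustained around the clock. The bandwidth figure scales linearly with whatever average page size you assume.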

rbjorklin over 8 years ago

What makes this better than https://duckduckgo.com ?

vcool07 over 8 years ago

Any specific reason you've used Pascal? I thought that language went extinct long ago.

malinens over 8 years ago

works really fast!

scandox over 8 years ago

Every time I see new search engine projects I remember this: https://en.wikipedia.org/wiki/Cuil

I note that Dr Anna Patterson is back with Google. She wrote this in 2004: http://queue.acm.org/detail.cfm?id=988407

micwo over 8 years ago

Deusu can't find deusu (or deusu.org):

https://deusu.org/query?q=deusu

ashitlerferad over 8 years ago

Another open source search engine: http://yacy.net/
