Alternative general-purpose search engines are an exciting idea.<p>It feels a lot like the era when Yahoo was dominant and searching was sort of awful. When you searched, what ranked highest was market-driven stuff.<p>Right now, for topics normal people (not techies) search for, all you get are content-farm sites with JS popups asking for your email address. Try searching for anything health-related, for example. We've regressed.<p>My own half-solution is to look only for sites that are discussions - reddit, hn, etc. It could be better. A search engine that favored non-marketing content could really steal some thunder.<p>This doesn't look like that, but maybe it's a start?
Well, I admire the work behind it, and I think the idea is good (especially how having this open source means multiple sites can build on the same data set and get it more and more accurate over time).<p>But I have to be honest and say that it's just not working for me.<p>I type in Reddit, and it shows links to the NSFW subreddits instead of the main site or anything else on it.<p>Typing in Wikipedia gives me the Dutch version of Wikipedia.<p>Mario Wiki? The page on Mario Wiki about Mario Clash, then the Smash Bros Wiki and a bunch of SEO spam pages.<p>Pokemon Go gets me no relevant results at all. Certainly not anything official, that's for sure.<p>It's a decent start, and having 2 billion pages indexed is pretty impressive for a project like this as it is, but it's just not really usable as a search engine just yet.
They need to filter porn out of their search results (even for common queries like "hat", there's only porn) and perhaps be more resilient to SEO techniques, since it looks like there's a lot of spam in the top results. Queries with common words such as "cat" return almost only irrelevant results.<p>I'd really like to see that kind of project working as a good alternative to Google, but as it is it's not really usable.
Is the two billion page index open source?<p>I've been thinking a lot about data recently. Seems to me like Pandora's box is open. Google knows where you live, where you eat, what your fetishes are, all of your sexual partners. Facebook knows most of those things too, via different methods. And if you run Windows, Microsoft probably has access to most of that as well. Apple will too, because if they don't they won't be able to compete. Tesla, Uber, Waze also have a huge amount of data on your life.<p>Everyone is pushing the envelope on how much data they are collecting, and the companies which collect more data will compete better. As tech gets better we will increasingly be unable to resist sharing our whole lives with the companies who are powering modern living.<p>Even worse, there's a huge monopolization effect to having data. Nobody else has anywhere near as much data as Google. That means nobody else can compete. Never mind the engineering: your algorithms can be 2x as good, but you won't have 0.1% of the data of a company with billions of daily users.<p>So Google and Facebook are left untouchable. Microsoft, Apple, and maybe Amazon can get in range. Is there anyone else?<p>We can fight back by giving up the privacy war and blowing the doors open instead. Take your data (as much as you dare) and make it public. Let every startup have access to it. Let every doctor have access to it. Give the small players a fighting chance.<p>That does mean a massive cultural shift. It means your neighbors will be able to look up your salary, your fetishes, your personal affairs. It's a big deal.<p>I don't see any other way out of this though. Surveillance technology is getting better faster than privacy technology, because surveillance tech has the entire tech industry behind it. Smarter phones, smarter TVs, smarter grocery stores, smarter credit cards, smarter shoes... smarter everything. Privacy is melting away and we aren't getting it back.
It's written in Pascal. Neat.<p>However, it's not very good. If I search for "banana" I get information about a sex shop rather than about bananas.
In addition to the lack of porn filtering and the result ordering not prioritizing "quality" sources, some of the indexed site data is at least 4-6 months old, and the sites have changed heavily since the last crawl. I even got 404 errors. That makes it very hard to really find use in the project other than for academic interest.
I think projects like this are really important because they help reduce the impression that big server projects are only meant to be done by big companies. The internet is becoming a content consumption medium for many people.<p>I'm not sure I'll use this, but I'll try to... it all depends on how good it is. But I approve of the project so I sent a (very) small bitcoin donation to hopefully help fund it for a few more minutes :)
You get really good performance on not much hardware. Can you share some technical details?<p>- file formats, particularly the postings<p>- query evaluation strategy<p>- update strategy<p>I poked around in the source code a bit, but couldn't find these things.
Written in Delphi. I might be wrong, but I don't see many people downloading and working on it. There's a 30-day free trial and then you have to pay for the development environment. IMHO it's a non-starter for an open source project, but if it's the only language the author is comfortable with, well, that's OK.
Fun, but overall quality seems a bit lacking.<p>When I search for myself, the top 10 results don't even have my last name ('Kusters') and just show pages that have the word 'Nick'. I suppose you don't use a form of LSA to score the search results? Maybe it's too specific, but afaik mainstream search engines seem to give somewhat consistent results here.<p><a href="https://deusu.org/query?q=nick+kusters" rel="nofollow">https://deusu.org/query?q=nick+kusters</a><p>Looking at the code (<a href="https://github.com/MichaelSchoebel/DeuSu/" rel="nofollow">https://github.com/MichaelSchoebel/DeuSu/</a>) I notice that you have ranking modifiers based on the .tld; why not store the reported content language and score based on that? Isn't that more relevant?
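To make the content-language idea concrete, here is a minimal Python sketch of how a stored language tag could adjust a base score. Everything in it is an assumption for illustration: the function names, the boost values, and the idea that each page carries a reported language tag are hypothetical and not taken from the DeuSu source.

```python
from typing import Optional

# Hypothetical boost values; not taken from the DeuSu source.
LANGUAGE_MATCH_BOOST = 1.5
LANGUAGE_UNKNOWN_BOOST = 1.0
LANGUAGE_MISMATCH_BOOST = 0.5


def primary_subtag(language_tag: Optional[str]) -> Optional[str]:
    """Reduce a tag such as 'de-AT' or 'EN_us' to its primary subtag ('de', 'en')."""
    if not language_tag:
        return None
    tag = language_tag.strip().lower().replace("_", "-")
    return tag.split("-")[0] or None


def language_adjusted_score(base_score: float,
                            page_language: Optional[str],
                            user_language: str) -> float:
    """Scale a base relevance score by how well the page language fits the user."""
    page = primary_subtag(page_language)
    user = primary_subtag(user_language)
    if page is None or user is None:
        # No reported language: leave the score unchanged rather than guess.
        return base_score * LANGUAGE_UNKNOWN_BOOST
    if page == user:
        return base_score * LANGUAGE_MATCH_BOOST
    return base_score * LANGUAGE_MISMATCH_BOOST
```

Unlike a .tld modifier, this would treat an English page on a .nl domain as English. The catch is that reported language tags are often missing or wrong, so a real implementation would probably want a fallback.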
Pascal is an interesting language choice. I think it is the first time I've seen an open source project written in Pascal that is actually used in production.
It shows snippets of the web pages under each result; however, generally not the particular snippets that contain the search term. I would think that would be useful.
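A rough Python sketch of what a query-dependent snippet could look like: find the first occurrence of any query term in the page text and cut a window around it. The function name and the `width` parameter are made up for illustration, and it uses plain substring matching, so "cat" would also match "category". This is not how DeuSu currently builds its snippets.

```python
import re


def query_biased_snippet(text: str, query: str, width: int = 160) -> str:
    """Return a window of roughly `width` characters around the first query term found."""
    terms = [re.escape(term) for term in query.split() if term]
    if not terms:
        return text[:width]
    # Case-insensitive substring search for any of the query terms.
    match = re.search("|".join(terms), text, flags=re.IGNORECASE)
    if match is None:
        # No term found: fall back to the start of the page.
        return text[:width]
    start = max(0, match.start() - width // 2)
    end = min(len(text), start + width)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix
```

For example, `query_biased_snippet(page_text, "banana", 80)` returns about 80 characters centered on the first "banana" in `page_text`, with "..." marking where the text was cut.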
Looking at the source code took me back to days when I used to do stuff in Delphi :)<p>Neat project -- Loads of room for improvement, but a great initiative!
The site's interface is just incredibly pleasant compared to Google.com. I really hope the author sticks with it. Unfortunately I'm not sure it's usable right now: searching "group theory Wikipedia" never brings up a Wikipedia page (although maybe I should just be directly searching Wikipedia if that's what I wanted).
As of Aug 16, Common Crawl has 1.73bn pages. For the complementary set of URLs, if it's of any benefit, you can use their data dump as a seed.<p>If the metadata of your index (such as last modified) is small enough to upload to AWS, you can also reduce your re-crawl effort when they have a fresh release.
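A small Python sketch of the comparison this would involve, assuming both sides can be reduced to a mapping from URL to last fetch time. The function names and the dict-based inputs are hypothetical; this does not use any real Common Crawl API or file format.

```python
from datetime import datetime
from typing import Dict, List


def seed_urls(own_index: Dict[str, datetime],
              external_dump: Dict[str, datetime]) -> List[str]:
    """URLs present in the external dump but missing from our own index."""
    return sorted(set(external_dump) - set(own_index))


def urls_still_needing_crawl(own_index: Dict[str, datetime],
                             external_dump: Dict[str, datetime]) -> List[str]:
    """URLs we must re-crawl ourselves because the dump has no newer copy."""
    stale = []
    for url, own_fetched in own_index.items():
        external_fetched = external_dump.get(url)
        # Re-crawl if the dump lacks the URL or its copy is not newer than ours.
        if external_fetched is None or external_fetched <= own_fetched:
            stale.append(url)
    return sorted(stale)
```

Any URL in the index that `urls_still_needing_crawl` does not return could be refreshed from the dump instead of being fetched again.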
Hi, I find the blog more interesting right now, since I hope to find write-ups about how you were able to manage such a herculean task on your own.<p>Crawling 2bn pages could take forever and could generate a huge bandwidth bill, so any lessons you learnt, pitfalls you faced, etc. would be a great read.
DeuSu does not seem to index the Cyrillic part of the Internet, and it can't give you results for Greek either; try <a href="https://deusu.org/query?q=ελιά" rel="nofollow">https://deusu.org/query?q=ελιά</a> . Is it a Latin/ANSI-only index?
Strange, the Wikipedia article is not on the first page, and don't blame me for searching for something non-German :)<p><a href="https://deusu.org/query?q=berlin" rel="nofollow">https://deusu.org/query?q=berlin</a>
I searched for "meta programming c++" and the top results are all about Java.<p>I'm curious, is it expensive to run a search site like this?
Google's secret ingredient to stay relevant and informational is Wikipedia.<p>Deusu on the other hand seems to weight words in urls highly.<p>If you search for scientology only on Deusu, you might end up wearing a funky hat <a href="https://deusu.org/query?q=scientology" rel="nofollow">https://deusu.org/query?q=scientology</a>
Awesome job!<p>For the life of me I can't figure out how you manage to crawl over a billion web pages (even in 2-3 months), index the data and run the server with €300 per month. Especially the crawler part...
Every time I see new search engine projects I remember this: <a href="https://en.wikipedia.org/wiki/Cuil" rel="nofollow">https://en.wikipedia.org/wiki/Cuil</a><p>I note that Dr Anna Patterson is back with Google. She wrote this in 2004: <a href="http://queue.acm.org/detail.cfm?id=988407" rel="nofollow">http://queue.acm.org/detail.cfm?id=988407</a>