It is a cool project. S3 can be cost-efficient, but only if you don't touch the data :)<p>Their price calculation doesn't mention the cost of S3 requests, which adds up very quickly and is often neglected.<p>It costs $1 for 2.5M GET requests to S3. They have 180 shards, and in the general case a query seems to fetch all of them. Presumably they don't download the full shard per request, but an index + some relevant ranges. Let's say that is 10 requests per shard. So that would be 1,800 S3 GET requests per query, i.e. ~1,400 search queries cost them $1.<p>Assuming their service is reasonably popular and serves 1 req/second on average, that would be roughly $1,870 per 30 days on top of the advertised $1,000 spent on EC2 and S3 storage.<p>Seems comparable to AWS Elasticsearch Service costs:<p>- 3 nodes of m5.2xlarge.elasticsearch = $1,200<p>- 20TB EBS storage = $1,638
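A quick sanity check of that math in Python (the per-shard request count and the 1 req/second traffic level are assumptions, not measured numbers):

    # S3 standard GET pricing: ~$0.40 per million requests ($1 per 2.5M)
    GET_PRICE_PER_MILLION = 0.40
    gets_per_query = 180 * 10                      # 180 shards, ~10 range requests each
    queries_per_month = 1 * 60 * 60 * 24 * 30      # 1 query/second for 30 days

    total_gets = gets_per_query * queries_per_month
    monthly_cost = total_gets / 1_000_000 * GET_PRICE_PER_MILLION
    print(f"~${monthly_cost:,.0f} per month in GET requests")   # ~$1,866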
If you're going for low cost, you could do better:<p><a href="https://www.hetzner.com/dedicated-rootserver/dell/dx181/configurator" rel="nofollow">https://www.hetzner.com/dedicated-rootserver/dell/dx181/conf...</a><p>Basic configuration in Finland: 224,91 €<p>4 × 1.92 TB SATA SSD Datacenter Edition: 95,20 €<p>Total: 320,11 € (about 385.90 US dollars)
Interesting! We've built similar support for decoupling compute from storage into Elasticsearch and, as coincidence would have it, just shared some performance numbers today:<p><a href="https://www.elastic.co/blog/querying-a-petabyte-of-cloud-storage-in-10-minutes" rel="nofollow">https://www.elastic.co/blog/querying-a-petabyte-of-cloud-sto...</a><p>It works just like any regular Elasticsearch index (with full Kibana support, etc.).<p>Because the data is indexed by Lucene, queries can access the index structures and return results orders of magnitude faster than a full table scan would.<p>It is complemented with various caching layers to make repeat queries fast.<p>We expect this new functionality to be used for less frequently queried data (e.g. operational or security investigations, legal discoveries, or historical performance comparisons on older data), trading query speed for cost.<p>It supports Google Cloud Storage, Azure Blob Storage, Amazon S3 (+ S3-compatible stores), HDFS, and shared file systems.
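A minimal sketch of what mounting an index backed by object storage looks like with the searchable snapshots API (the repository, snapshot, and index names are hypothetical, and the exact parameters depend on the Elasticsearch version):

    import requests

    # Mount a snapshot stored in S3 as a searchable index; "shared_cache"
    # keeps only a local disk cache while the data stays in object storage.
    resp = requests.post(
        "http://localhost:9200/_snapshot/my_s3_repo/nightly-snapshot/_mount",
        params={"storage": "shared_cache"},
        json={"index": "web-logs-2019"},
    )
    resp.raise_for_status()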
But are you solving the right problem? This sounds like someone has produced a very good and efficient version of AltaVista. Back in the 1990s, if you wanted to do classic keyword searches of the web, and find all pages that had terms A and B but not C, it would give them to you, in a big unsorted pile. The web was still small enough that this was sometimes useful, but until Google came along with tricks to rank pages that are obvious in retrospect, it just wasn't useful for common search terms.
This is super interesting. I've recently also been working on a similar concept: we have a reasonable amount of data (in the terabytes) that's fairly static and that I need to search fairly infrequently (but sometimes in bulk). A solution we came up with was a small, hot, in-memory index that points to the location of the data in a file on S3. Random access of a file on S3 is pretty fast, and running on an EC2 instance means latency to S3 is almost nil. Cheap, fast and effective.<p>We're using some custom Python code to build a Marisa Trie as our index. I was wondering if there were alternatives to this setup?
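In case it helps as a reference point, here's a minimal sketch of that kind of setup (the bucket name, key, and record layout are made up): marisa-trie's RecordTrie maps each term to a byte offset/length, and a single S3 Range request fetches just that slice:

    import boto3
    import marisa_trie

    # In-memory index: term -> (byte_offset, length) inside one big data file on S3.
    entries = [("apple", (0, 1024)), ("banana", (1024, 2048))]
    index = marisa_trie.RecordTrie("<QQ", entries)

    s3 = boto3.client("s3")

    def lookup(term):
        offset, length = index[term][0]
        byte_range = f"bytes={offset}-{offset + length - 1}"   # HTTP ranges are inclusive
        obj = s3.get_object(Bucket="my-data-bucket", Key="segments/data.bin", Range=byte_range)
        return obj["Body"].read()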
> a new breed of full-text search engine<p>The following is a stupid question, so bear with me.<p>I have been using search engines for about... 26 years. I have attempted to make really crappy databases and search engines. I have worked for companies that use search products for internal services and customer products. I'm not a <i>search engineer</i> but I have a decent understanding of them and their issues, I think. And I get why people <i>want</i> full-text search. But is it actually a good idea? Should anyone really be using full-text search?<p>I actually work on search products right now. We use Solr as the general full-text index. We have separate indexes and algorithms to make context and semantic inferences, and we prioritize results based on those, falling back to full text if we don't get anything. The full text sucks. The corpus of relationships between related concepts is what makes the whole thing useful.<p>Are we (all) only using full-text because some users are demanding that it be there? Or shouldn't we all stop this charade of thinking that full-text search of billions of items of data will ever be useful to a human being? Even when I show my coworkers that I can get something done 10x faster with a curated index of content, they <i>still</i> want a search engine that they know doesn't give them the results they want.<p>Is full-text search the junk food of information retrieval?
Curious to know if anyone can explain why online storage is so expensive? Most places want $100+/mo for 1TB of storage, while a 1TB drive only costs $50. I understand there are management costs, cooling, electricity, physical space, etc. But those would be per-drive costs, not per-TB, and certainly wouldn't add up to $100/mo.<p>Meanwhile, there are services like Google Drive, etc. which cost about $100/yr per TB. Still exorbitant, but not as much so. Third-party software can mount it as a drive, but only for a short period of time before the token expires. So they seem happy to sell you space at less than 1/10th the cost, as long as it's harder to use.<p>There just seems to be a lot of cash on the table for someone to offer much cheaper storage solutions, but no one is actually doing it.
Chaos Search seems to be doing this architecture already and according to the podcast episode [1], it uses a highly optimized storage layout.<p>Never used it, so would be interested if somebody could comment on it.<p>[1] <a href="https://www.dataengineeringpodcast.com/chaos-search-with-pete-cheslock-and-thomas-hazel-episode-47/" rel="nofollow">https://www.dataengineeringpodcast.com/chaos-search-with-pet...</a>
Francois, Adrien, that's a super nice demo.<p>Stateless search engine is something new, for sure.<p>I'd be super interested to see how it evolves over time. We're [1] indexing over 1,000,000 news articles per day. We're using ElasticSearch to index our data.<p>Would be interested to see if there's a way to make a cross-demo? Let me know.<p>[1] <a href="https://newscatcherapi.com/" rel="nofollow">https://newscatcherapi.com/</a>
This looks really interesting, I wonder how they will monetize it though.<p>As an aside, projects like these are what keep me wondering whether I should switch from cheaper but "dumb" object stores to AWS since on AWS you can use your object store together with things like Athena etc. and get pay-per-use search / grep and a lot of other things, without the egress fees since it's all within AWS.
Nice! Maybe at some point you can release a general web search engine for the Common Crawl corpus? It seems even simpler than this proof of concept, but potentially more useful for people looking for a true full-text web search.<p>There isn't an easy way today to explore or search what is contained in the Common Crawl index.
Cool demo. Searching for phrases like "there was a" and "and there is" takes a really long time. I presume that since the words are common, the lists of document IDs mapped to those individual tokens are very long, so intersections etc. take longer?
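For reference, a minimal sketch of the kind of posting-list intersection involved (a two-pointer merge over sorted doc-ID lists; with very common words both lists can have millions of entries, so even a linear merge gets slow, and phrase queries additionally have to check positions):

    def intersect(postings_a, postings_b):
        # Two-pointer merge of two sorted doc-ID lists.
        result, i, j = [], 0, 0
        while i < len(postings_a) and j < len(postings_b):
            if postings_a[i] == postings_b[j]:
                result.append(postings_a[i])
                i += 1
                j += 1
            elif postings_a[i] < postings_b[j]:
                i += 1
            else:
                j += 1
        return result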
> which is key as each instance issues a lot of parallel requests to Amazon S3 and tends to be bound by the network<p>I wonder if most of the cost comes from S3, EC2, or the "premium" bandwidth that Amazon charges so ridiculously much for. Since it seems to be making a lot of requests, it wouldn't surprise me if it's the network cost, and if so, I wonder why they would even use AWS at all.
Could this be adapted for IPFS? Anyone with a stateless client and a link to the index could search and become part of a swarm that speeds up trendy queries with redundancy.<p>Then update it with git-like diff versioning, and use IPNS to point to the HEAD of the latest chain of the index.
What does your on-S3 storage format look like? Are you storing relatively large blobs and doing HTTP Range requests against them or are you storing lots of tiny objects and fetching the whole object any time you need it?
Is this reliant on S3, or can it be used with something like MinIO, DigitalOcean Spaces, or Backblaze B2 too? Backblaze-to-Cloudflare data transfers are free, so that can reduce costs a lot, plus B2 is much cheaper than S3.
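For what it's worth, S3-compatible stores usually only need the client pointed at a different endpoint; a minimal sketch with boto3 (the endpoint URL and credentials are placeholders, and whether this project exposes such a setting is a separate question):

    import boto3

    # The same pattern works for MinIO, DigitalOcean Spaces, Backblaze B2, etc.,
    # since they all speak the S3 API.
    s3 = boto3.client(
        "s3",
        endpoint_url="https://s3.us-west-000.backblazeb2.com",
        aws_access_key_id="KEY_ID",
        aws_secret_access_key="APPLICATION_KEY",
    )
    s3.list_objects_v2(Bucket="my-index-bucket", MaxKeys=10)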
Is there a more recent Common Crawl data set? 2019 was a long time ago.<p>The reason I ask is that I'm trying to get all subdomains of a certain domain. So I want a reverse-host lookup of unique hostnames under that domain.
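Common Crawl publishes new crawls regularly, and its CDX index API can answer this kind of subdomain query directly; a rough sketch (the crawl ID is just an example and pagination is omitted):

    import json
    import requests

    # Newer crawl IDs are listed at https://index.commoncrawl.org/
    api = "https://index.commoncrawl.org/CC-MAIN-2021-04-index"
    resp = requests.get(api, params={"url": "*.example.com", "output": "json"})

    hosts = set()
    for line in resp.text.splitlines():
        record = json.loads(line)
        hosts.add(record["url"].split("/")[2])   # keep just the hostname
    print(sorted(hosts))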
How are you dealing with the fact that Common Crawl updates its data much less regularly than commercial search engines do? And that each update is only a partial refresh?<p>Edit: And I will say your site design is very nice.
Congrats on the project, and very cool demo!<p>One point that may help: I searched for the word "fast" with "adjective" selected and it didn't show any results.
Seems like you could build a workstation that runs these queries faster and cheaper than AWS ever could, on a RAIDed set of NVMe drives.<p><a href="https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-threadripper-pro-workstation/" rel="nofollow">https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-th...</a>
Searching the web is a fool's errand. Google doesn't even search the web anymore; they've just mind-controlled everyone into submitting nightly sitemaps to them. Google is more of an index than a search engine nowadays.
The article title is "Searching the web for < $1000 / month".<p>Despite the article mentioning Rust only once, of course it had to be added to the title on HN as "Search 1B pages on AWS S3 for 1000$ / month, made in Rust and tantivy".