It is a cool project. S3 can be cost-efficient, but only if you don't touch the data :)<p>Their price calculation doesn't mention the cost of S3 requests, which adds up very quickly and is often neglected.<p>It costs $1 for 2.5M GET requests to S3. They have 180 shards, and in the general case a query seems to fetch all of them. Presumably they don't download the full shard per request, but an index + some relevant ranges. Let's say that is 10 requests per shard. So that would be 1,800 S3 GET requests per query, i.e. ~1,400 search queries cost them $1.<p>Assuming their service is reasonably popular and serves 1 req/second on average, that would be roughly $1,870 per 30 days on top of the advertised $1,000 spent on EC2 and S3 storage.<p>Seems comparable to AWS Elasticsearch Service costs:<p>- 3 nodes of m5.2xlarge.elasticsearch = $1,200<p>- 20TB EBS storage = $1,638
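A quick sanity check of that math in Python (the per-shard request count and the 1 req/second traffic level are assumptions, not measured numbers):

    # S3 standard GET pricing: ~$0.40 per million requests ($1 per 2.5M)
    GET_PRICE_PER_MILLION = 0.40
    gets_per_query = 180 * 10                      # 180 shards, ~10 range requests each
    queries_per_month = 1 * 60 * 60 * 24 * 30      # 1 query/second for 30 days

    total_gets = gets_per_query * queries_per_month
    monthly_cost = total_gets / 1_000_000 * GET_PRICE_PER_MILLION
    print(f"~${monthly_cost:,.0f} per month in GET requests")   # ~$1,866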
If you're going for low cost, you could do better:<p><a href="https://www.hetzner.com/dedicated-rootserver/dell/dx181/configurator" rel="nofollow">https://www.hetzner.com/dedicated-rootserver/dell/dx181/conf...</a><p>Basic configuration in Finland: 224,91 €<p>4 × 1.92 TB SATA SSD Datacenter Edition: 95,20 €<p>Total: 320,11 € (about 385.90 US dollars)
Interesting! We've built similar support for decoupling compute from storage into Elasticsearch and, as coincidence would have it, just shared some performance numbers today:<p><a href="https://www.elastic.co/blog/querying-a-petabyte-of-cloud-storage-in-10-minutes" rel="nofollow">https://www.elastic.co/blog/querying-a-petabyte-of-cloud-sto...</a><p>It works just like any regular Elasticsearch index (with full Kibana support, etc.).<p>Because the data is indexed by Lucene, queries can access the index structures and return results orders of magnitude faster than a full table scan would.<p>It is complemented with various caching layers to make repeat queries fast.<p>We expect this new functionality to be used for less frequently queried data (e.g. operational or security investigations, legal discoveries, or historical performance comparisons on older data), trading query speed for cost.<p>It supports Google Cloud Storage, Azure Blob Storage, Amazon S3 (+ S3-compatible stores), HDFS, and shared file systems.
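A minimal sketch of what mounting an index backed by object storage looks like with the searchable snapshots API (the repository, snapshot, and index names are hypothetical, and the exact parameters depend on the Elasticsearch version):

    import requests

    # Mount a snapshot stored in S3 as a searchable index; "shared_cache"
    # keeps only a local disk cache while the data stays in object storage.
    resp = requests.post(
        "http://localhost:9200/_snapshot/my_s3_repo/nightly-snapshot/_mount",
        params={"storage": "shared_cache"},
        json={"index": "web-logs-2019"},
    )
    resp.raise_for_status()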
But are you solving the right problem? This sounds like someone has produced a very good and efficient version of AltaVista. Back in the 1990s, if you wanted to do classic keyword searches of the web, and find all pages that had terms A and B but not C, it would give them to you, in a big unsorted pile. The web was still small enough that this was sometimes useful, but until Google came along with tricks to rank pages that are obvious in retrospect, it just wasn't useful for common search terms.
This is super interesting. I've recently also been working on a similar concept: we have a reasonable amount of data (in the terabytes) that's fairly static and that I need to search fairly infrequently (but sometimes in bulk). A solution we came up with was a small, hot, in-memory index that points to the location of the data in a file on S3. Random access of a file on S3 is pretty fast, and running on an EC2 instance means latency to S3 is almost nil. Cheap, fast and effective.<p>We're using some custom Python code to build a Marisa Trie as our index. I was wondering if there were alternatives to this setup?
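In case it helps as a reference point, here's a minimal sketch of that kind of setup (the bucket name, key, and record layout are made up): marisa-trie's RecordTrie maps each term to a byte offset/length, and a single S3 Range request fetches just that slice:

    import boto3
    import marisa_trie

    # In-memory index: term -> (byte_offset, length) inside one big data file on S3.
    entries = [("apple", (0, 1024)), ("banana", (1024, 2048))]
    index = marisa_trie.RecordTrie("<QQ", entries)

    s3 = boto3.client("s3")

    def lookup(term):
        offset, length = index[term][0]
        byte_range = f"bytes={offset}-{offset + length - 1}"   # HTTP ranges are inclusive
        obj = s3.get_object(Bucket="my-data-bucket", Key="segments/data.bin", Range=byte_range)
        return obj["Body"].read()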
> a new breed of full-text search engine<p>The following is a stupid question, so bear with me.<p>I have been using search engines for about... 26 years. I have attempted to make really crappy databases and search engines. I have worked for companies that use search products for internal services and customer products. I'm not a <i>search engineer</i> but I have a decent understanding of them and their issues, I think. And I get why people <i>want</i> full-text search. But is it actually a good idea? Should anyone really be using full-text search?<p>I actually work on search products right now. We use Solr as the general full-text index. We have separate indexes and algorithms to make context and semantic inferences, and we prioritize results based on those, falling back to full text if we don't get anything. The full text sucks. The corpus of relationships between related concepts is what makes the whole thing useful.<p>Are we (all) only using full-text because some users are demanding that it be there? Or shouldn't we all stop this charade of thinking that full-text search of billions of items of data will ever be useful to a human being? Even when I show my coworkers that I can get something done 10x faster with a curated index of content, they <i>still</i> want a search engine that they know doesn't give them the results they want.<p>Is full-text search the junk food of information retrieval?
Curious to know if anyone can explain why online storage is so expensive? Most places want $100+/mo for 1TB of storage, while a 1TB drive only costs $50. I understand there are management costs, cooling, electricity, physical space, etc. But those would be per-drive costs, not per-TB, and certainly wouldn't add up to $100/mo.<p>Meanwhile, there are services like Google Drive, etc. which cost about $100/yr per TB. Still exorbitant, but not as much so. Third-party software can mount it as a drive, but only for a short period of time before the token expires. So they seem happy to sell you space at less than 1/10th the cost, as long as it's harder to use.<p>There just seems to be a lot of cash on the table for someone to offer much cheaper storage solutions, but no one is actually doing it.
Chaos Search seems to be doing this architecture already and according to the podcast episode [1], it uses a highly optimized storage layout.<p>Never used it, so would be interested if somebody could comment on it.<p>[1] <a href="https://www.dataengineeringpodcast.com/chaos-search-with-pete-cheslock-and-thomas-hazel-episode-47/" rel="nofollow">https://www.dataengineeringpodcast.com/chaos-search-with-pet...</a>
Francois, Adrien, that's a super nice demo.<p>Stateless search engine is something new, for sure.<p>I'd be super interested to see how it evolves over time. We're [1] indexing over 1,000,000 news articles per day. We're using ElasticSearch to index our data.<p>Would be interested to see if there's a way to make a cross-demo? Let me know.<p>[1] <a href="https://newscatcherapi.com/" rel="nofollow">https://newscatcherapi.com/</a>
This looks really interesting, I wonder how they will monetize it though.<p>As an aside, projects like these are what keep me wondering whether I should switch from cheaper but "dumb" object stores to AWS since on AWS you can use your object store together with things like Athena etc. and get pay-per-use search / grep and a lot of other things, without the egress fees since it's all within AWS.
Nice! Maybe at some point you can release a general web search engine for the Common Crawl corpus? It seems even simpler than this proof of concept, but potentially more useful for people looking for a true full-text web search.<p>There isn't an easy way today to explore or search what is contained in the Common Crawl index.
Cool demo. Searching for phrases like "there was a" and "and there is" takes a really long time. I presume that since the words are common, the lists of document IDs mapped to those individual tokens are very long, so intersections etc. take longer?
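For reference, a minimal sketch of the kind of posting-list intersection involved (a two-pointer merge over sorted doc-ID lists; with very common words both lists can have millions of entries, so even a linear merge gets slow, and phrase queries additionally have to check positions):

    def intersect(postings_a, postings_b):
        # Two-pointer merge of two sorted doc-ID lists.
        result, i, j = [], 0, 0
        while i < len(postings_a) and j < len(postings_b):
            if postings_a[i] == postings_b[j]:
                result.append(postings_a[i])
                i += 1
                j += 1
            elif postings_a[i] < postings_b[j]:
                i += 1
            else:
                j += 1
        return result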
> which is key as each instance issues a lot of parallel requests to Amazon S3 and tends to be bound by the network<p>I wonder if most of the cost comes from S3, EC2, or the "premium" bandwidth that Amazon charges so ridiculously much for. Since it seems to be making a lot of requests, it wouldn't surprise me if it's the network cost, and if so, I wonder why they would even use AWS at all.
Could this be adapted for IPFS? Anyone with a stateless client and a link to the index could search and become part of a swarm that speeds up trendy queries with redundancy.<p>Then update it with git-like diff versioning, and use IPNS to point to the HEAD of the latest chain of the index.
What does your on-S3 storage format look like? Are you storing relatively large blobs and doing HTTP Range requests against them or are you storing lots of tiny objects and fetching the whole object any time you need it?
Is this reliant on S3, or can it be used with something like MinIO, DigitalOcean Spaces, or Backblaze B2 too? Backblaze-to-Cloudflare data transfers are free, so that can reduce costs a lot, plus B2 is much cheaper than S3.
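For what it's worth, S3-compatible stores usually only need the client pointed at a different endpoint; a minimal sketch with boto3 (the endpoint URL and credentials are placeholders, and whether this project exposes such a setting is a separate question):

    import boto3

    # The same pattern works for MinIO, DigitalOcean Spaces, Backblaze B2, etc.,
    # since they all speak the S3 API.
    s3 = boto3.client(
        "s3",
        endpoint_url="https://s3.us-west-000.backblazeb2.com",
        aws_access_key_id="KEY_ID",
        aws_secret_access_key="APPLICATION_KEY",
    )
    s3.list_objects_v2(Bucket="my-index-bucket", MaxKeys=10)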
Is there a more recent Common Crawl data set? 2019 was a long time ago.<p>The reason I ask is that I'm trying to get all subdomains of a certain domain. So I want a reverse-host lookup of unique hostnames under that domain.
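Common Crawl publishes new crawls regularly, and its CDX index API can answer this kind of subdomain query directly; a rough sketch (the crawl ID is just an example and pagination is omitted):

    import json
    import requests

    # Newer crawl IDs are listed at https://index.commoncrawl.org/
    api = "https://index.commoncrawl.org/CC-MAIN-2021-04-index"
    resp = requests.get(api, params={"url": "*.example.com", "output": "json"})

    hosts = set()
    for line in resp.text.splitlines():
        record = json.loads(line)
        hosts.add(record["url"].split("/")[2])   # keep just the hostname
    print(sorted(hosts))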
How are you dealing with the fact that Common Crawl updates its data much less regularly than commercial search engines do? And that each update is only a partial refresh?<p>Edit: And I will say your site design is very nice.
Congrats on the project, and very cool demo!<p>One point that may help: I searched for the word "fast" with "adjective" selected and it didn't show any results.
Seems like you could build a workstation that runs these queries faster and cheaper than AWS ever could, on a RAIDed set of NVMe drives.<p><a href="https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-threadripper-pro-workstation/" rel="nofollow">https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-th...</a>
Searching the web is a fool's errand. Google doesn't even search the web anymore; they've just mind-controlled everyone into submitting nightly sitemaps to them. Google is more of an index than a search engine nowadays.
The article title is "Searching the web for < $1000 / month".<p>Despite the article mentioning Rust only once, of course it had to be added to the title on HN as "Search 1B pages on AWS S3 for 1000$ / month, made in Rust and tantivy".