Searching 20 GB/sec: Systems Engineering Before Algorithms (2014)

105 points by misframer about 10 years ago

8 comments

snewman about 10 years ago
Hi! Great to see this pop back up on HN. I'm the author of the blog post (and Scalyr founder), happy to answer any questions.

Downthread, someone mentioned that they couldn't find the HN discussion from when this was originally posted; it's here:

https://news.ycombinator.com/item?id=7715025
hglaser about 10 years ago
Great post.

BTW, this product (Scalyr) is a lifesaver. We (Periscope) are able to operate about a dozen heterogeneous servers with no full-time DevOps, largely because of Scalyr.
twotwotwo about 10 years ago
Lots of attention goes to OLTP-type loads for good reasons, but when you do design to just stream fast, some fun things happen:

You can use lots of relatively cheap spindles in parallel, and think of each one as (at least) 100 MB/s of sequential read speed and a couple of TB of space. You have fast compression available that can increase your effective bandwidth and make the effective cost of space cheaper.

You can draw on well-understood ways to search, sort, do hash- or sort-based joining and grouping, and so on.

Streaming doesn't need a big in-memory cache to avoid disk seeks, so you can use those gobs of RAM for other things: aggregating results or holding data to join against, say. (Of course, if you don't need the RAM, disk cache might still be useful for some access patterns.)

Besides log search, you see a stream-fast approach in analytics-focused DBs: BigQuery, Redshift, Vertica, and open-source ones. Facebook put up a good post about the work that led to their Hive ORCFile design.

Some bioinformatics tools load a big hashtable into memory and, roughly, hash-join against a ton of raw data streamed from disk, then sometimes repeat the process with another hashtable.

These are not at all original observations, but I managed to hear about these sorts of analytics and bioinformatics tools for a while before really getting how or why they did things all that differently from a typical random-access-oriented database.
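The "load a hashtable, then hash-join against streamed data" pattern mentioned above can be sketched roughly as follows. This is a minimal illustration only, not code from any of the tools named in the comment; the function names `build_hash_table` and `stream_hash_join`, the column names, and the sample data are all hypothetical, and an in-memory `io.StringIO` stands in for a large file read sequentially from disk.

```python
import csv
import io
from collections import defaultdict


def build_hash_table(rows, key):
    """Load the small side of the join fully into memory, keyed by `key`."""
    table = defaultdict(list)
    for row in rows:
        table[row[key]].append(row)
    return table


def stream_hash_join(stream, table, key):
    """Probe the in-memory table once per streamed row and yield merged rows.

    The streamed side is never held in memory as a whole; rows without a
    match in the table are skipped (an inner join).
    """
    for row in stream:
        for match in table.get(row[key], ()):
            yield {**match, **row}


# Hypothetical data: a small lookup table and a larger "streamed" log.
hosts = [{"host": "a", "region": "us"}, {"host": "b", "region": "eu"}]
log = io.StringIO("host,status\na,200\nb,500\nc,200\na,404\n")

table = build_hash_table(hosts, "host")
joined = list(stream_hash_join(csv.DictReader(log), table, "host"))
```

Because the large side is read once, front to back, the disk only ever sees sequential reads, and RAM is spent on the lookup table rather than on a cache.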
imaginenore about 10 years ago
I have another idea for you guys. Instead of relying on expensive AWS SSD instances, why not switch to Hetzner, and keep everything in RAM?

128 GB RAM for $135/month:

https://www.hetzner.de/en/hosting/produkte_rootserver/px120

And you will have so much extra disk space, you can use it for backups. Or even resell it.

Your i2.4xlarge costs you $2,455/month.
imaginenore about 10 years ago
I wonder why they chose Java for substring search. Why not C (strstr) or grep?

http://www.arstdesign.com/articles/fastsearch.html
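For readers unfamiliar with the kind of search being discussed, here is a minimal sketch of a streaming substring scan. It is not the article's implementation: it uses Python's built-in `bytes.find` as a stand-in for `strstr`, and the function name `find_first` and the `chunk_size` parameter are assumptions made for illustration. The one subtle point it shows is carrying over the last `len(needle) - 1` bytes between chunks so that a match straddling a chunk boundary is not missed.

```python
import io


def find_first(stream, needle, chunk_size=1 << 20):
    """Return the absolute byte offset of the first `needle` in `stream`, or -1.

    Reads the stream sequentially in fixed-size chunks and keeps a short
    tail from the previous chunk so boundary-spanning matches are found.
    """
    if not needle:
        raise ValueError("needle must be non-empty")
    overlap = len(needle) - 1
    tail = b""
    offset = 0  # absolute stream offset of the first byte of `tail`
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return -1
        buf = tail + chunk
        idx = buf.find(needle)
        if idx != -1:
            return offset + idx
        # Keep only the bytes that could still begin a match.
        keep = buf[-overlap:] if overlap else b""
        offset += len(buf) - len(keep)
        tail = keep


data = b"alpha beta gamma delta"
hit = find_first(io.BytesIO(data), b"gamma", chunk_size=4)
miss = find_first(io.BytesIO(data), b"omega", chunk_size=4)
```

A small `chunk_size` is used here only to force matches across chunk boundaries; a real scan would use chunks of a megabyte or more.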
lostmsu about 10 years ago
Nice. But does not scale.
swatow about 10 years ago
Judging from the comments, this article was written around May 8 2014. Can we get a (2014) in the title?
kiallmacinnes about 10 years ago
The linked article has been posted before; I can't find the old HN thread, but it was certainly worth a re-read :)

I wonder whether Scalyr has reached their expected 100 GB/s yet?