Heroku's Ugly Secret: The story of how the cloud-king turned its back on Rails

1763 points by tomlemon, over 12 years ago

107 comments

teich, over 12 years ago
This is Oren Teich, I run Heroku.

I've read through the OP and all of the comments here. Our job at Heroku is to make you successful, and we want every single customer to feel that Heroku is transparent and responsive. Getting to the bottom of this situation and giving you a clear understanding of what we're going to do to make it right is our top priority. I am committing to the community to provide more information as soon as possible, including a blog post on http://blog.heroku.com.

toast76, over 12 years ago
Wow. This explains a lot.

We'd always been of the opinion that queues were happening on the router, not on the dyno.

We consistently see performance problems that, whilst we could tie them down to a particular user request (file uploads, for example, now moved to S3 direct), we could never figure out why they would result in queued requests given Heroku's advertised "intelligent routing". We mistakenly thought the occasional slow request couldn't create a queue... although evidence pointed to the contrary.

Now that it's apparent that requests are queuing on the dyno (although we have no way to tell, from what I can gather), it makes the occasional "slow requests" we have all the more fatal: e.g. data exports, reporting, and any other non-paged data request.

michaelrkn, over 12 years ago
We ran into this exact same problem at Impact Dialing. When we hit scale, we optimized the crap out of our app; our New Relic stats looked insanely fast, but Twilio logs told us that we were taking over 15 seconds to respond to many of their callbacks. After spending a few weeks working with Heroku support (and paying for a dedicated support engineer), we moved to raw AWS and our performance problems disappeared. I want to love Heroku, but it doesn't scale for Rails apps.

FireBeyond, over 12 years ago
This should be more prominent. I want to love Heroku, and am sure that I could.

But really, throwing in the towel on intelligent routing and replacing it with "random routing" is horrific, if true.

It's arguable that the routing mesh and scaling dynamics of Heroku are a large part, if not *the* defining reason, for someone to choose Heroku over AWS directly.

Is it a "hard" problem? I'm absolutely sure it is. That's one reason customers are throwing money at you to solve it, Heroku.

lkrubner, over 12 years ago
Good lord!

Percentage of the requests served within a certain time (ms):

     50%    844
     66%   2977
     75%   5032
     80%   7575
     90%  16052
     95%  20069
     98%  29282
     99%  30029
    100%  30029 (longest request)

Those numbers are amazingly awful. If I ever run ab and see 4 digits, I assume I need to optimize my software or server. But 5 digits?

Why in the world would a company spend $20,000 a month for service this awful?
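
For reference, a percentile table like that is ApacheBench's standard output; a run along these lines produces it (the URL and concurrency level here are placeholders, not values from the article):

    # 10,000 requests, 100 concurrent, against a hypothetical endpoint
    ab -n 10000 -c 100 https://yourapp.herokuapp.com/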

bignoggins, over 12 years ago
Rap Genius is employing a classic rap-mogul strategy: start a beef

mattj, over 12 years ago
So the issue here is two-fold:

- It's very hard to do "intelligent routing" at scale.
- Random routing plays poorly with request times that have a really bad tail (median is 50ms, 99th percentile is 3 seconds).

The solution here is to figure out why your 99th percentile is 3 seconds. Once you solve that, randomized routing won't hurt you anymore. You hit this exact same problem in a non-preemptive multi-tasking system (like gevent or golang).

nthj, over 12 years ago
I'm inclined to wait until Heroku weighs in to render judgement. Specifically, because their argument depends on this premise:

> But elsewhere in their current docs, they make the same old statement loud and clear:
> The heroku.com stack only supports single threaded requests. Even if your application were to fork and support handling multiple requests at once, the routing mesh will never serve more than a single request to a dyno at a time.

They pull this from Heroku's documentation on the Bamboo stack [1], but then extrapolate and say it also applies to Heroku's Cedar stack.

However, I don't believe this to be true. Recently, I wrote a brief tutorial on implementing Google Apps' OpenID into your Rails app.

The underlying problem with doing so on a free (single-dyno) Heroku app is that while your app makes an authentication request to Google, Google turns around and makes an "oh hey" request to your app. With a single-concurrency system, your app times out waiting for Google to get back to it, and Google won't get back to your app until your app gets back to Google: deadlock.

However, there is a work-around on the Cedar stack: configure the Unicorn server to supply 4 or so worker processes for your web server, and the Heroku routing mesh appropriately routes multiple concurrent requests to Unicorn/my app. This immediately fixed my deadlock problem. I have code and more details in a blog post I wrote recently. [2]

This seems to be confirmed by Heroku's documentation on dynos [3]:

> Multi-threaded or event-driven environments like Java, Unicorn, and Node.js can handle many concurrent requests. Load testing these applications is the only realistic way to determine request throughput.

I might be missing something really obvious here, but to summarize: their premise is that Heroku only supports single-threaded requests, which is true on the legacy Bamboo stack but I don't believe to be true on Cedar, which they consider their "canonical" stack and where I have been hosting Rails apps for quite a while.

[1] https://devcenter.heroku.com/articles/http-routing-bamboo

[2] http://www.thirdprestige.com/posts/your-website-and-email-accounts-should-be-friends-part-ii

[3] https://devcenter.heroku.com/articles/dynos#dynos-and-requests

[edit: formatting]
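
A minimal sketch of that work-around, assuming Unicorn (the worker count and timeout here are illustrative, not tuned values):

    # config/unicorn.rb -- run several Rails processes inside one dyno
    worker_processes 4   # concurrent requests this dyno can accept
    preload_app true     # load the app once in the master, then fork workers
    timeout 30           # match Heroku's 30-second router timeout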

habosa, over 12 years ago
Wow.

Normally when I read "X is screwing Y!!!" posts on Hacker News, I consider them an overreaction or can't relate. In this case, I think this was a reasonable reaction, and I am immediately convinced never to rely on Heroku again.

Does anyone have a reasonably easy-to-follow guide on moving from Heroku to AWS? Let's keep it simple and say I'm just looking to move an app with 2 web dynos and 1 worker. I realize this is not the type of app that will be hurt by Heroku's new routing scheme, but I might as well learn to get out before it's too late.

stevewilhelm, over 12 years ago
Heroku Support Request #76070

To whom it may concern,

We are long-time users of Heroku and are big fans of the service. Heroku allows us to focus on application development. We recently read an article on HN entitled "Heroku's Ugly Secret": http://s831.us/11IIoMF

We have noticed similar behavior, namely that increasing dynos does not provide the performance increases we would expect. We continue to see wildly different performance across different requests that New Relic metrics and internal instrumentation cannot explain.

We would like the following:

1. A response from Heroku regarding the analysis done in the article, and
2. Heroku-supplied persistent logs that include information on how long requests are queued for processing by the dynos.

Thanks in advance for any insight you can provide into this situation, and keep up the good work.

htsh, over 12 years ago
Why not hire a devops guy & rack your own hardware? Or get some massive computing units at Amazon (just as good, but more expensive)?

This reminds me of the excellent "five stages of hosting" story shared on here a while back:

http://blog.pinboard.in/2012/01/the_five_stages_of_hosting/

barmstrong, over 12 years ago
We were very surprised to discover Heroku no longer has a global request queue, and spent a good bit of time debugging performance issues to find this was the culprit.

Heroku is a great company, and I imagine there was some technical reason they did it (not an evil plot to make more money). But not having a global request queue (or "intelligent routing") definitely makes their platform less useful. Moving to Unicorn helped a bit in the short term, but is not a complete solution.

rapind, over 12 years ago
I'd been using Heroku since forever, but bailed on them for a high-traffic app last year (Olympics-related) due to poor performance once we hit a certain load (adding dynos made very little difference). We were paying for their (new at the time) critical app support, and I brought up that it appeared to be failing at the routing level continuously. And this was with a Sinatra app served by Unicorn (which, at the time at least, was considered unsupported).

We went with a bare-metal cluster setup and everything ran super smooth. I never did figure out what the problem was with Heroku, though, and this article has been a very illuminating read.

gojomo, over 12 years ago
They want to force the issue with a public spat. Fair enough.

But they might also be able to self-help quite a bit. RG makes no mention of using more than 1 Unicorn worker per dyno. That could help, making a smaller number of dynos behave more like a larger number. I think it was around when Heroku switched to random routing that they also became more officially supportive of dynos handling multiple requests at once.

There's still the risk of random pileups behind long-running requests, and as others have noted, it's that long tail of long-running requests that messes things up. Besides diving into the worst offender requests, perhaps simply *segregating those requests to a different Heroku app* would lead to a giant speedup for most users, who rarely do long-running requests.

Then the 90% of requests that never take more than a second would stay in one bank of dynos, never having pathological pile-ups, while the 10% that take 1-6 seconds would go to another bank (by a different entry URL hostname). There'd still be awful pile-ups there, but for less-frequent requests, perhaps only used by a subset of users/crawler-bots who don't mind waiting.

zenazn, over 12 years ago
Randomized routing isn't all bad. In fact, if Heroku were to switch from purely random routing to minimum-of-two random routing, they'd perform asymptotically better [1].

[1]: http://www.eecs.harvard.edu/~michaelm/postscripts/mythesis.pdf
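
A toy simulation of the difference, with made-up numbers (50 single-threaded dynos at 90% load, each busy dyno finishing one request per tick):

    # Compare worst-case queue depth: pure random vs. "best of two" routing.
    DYNOS    = 50
    ARRIVALS = 45        # arrivals per tick => 90% load
    TICKS    = 50_000

    def max_queue_depth(two_choice:)
      queues = Array.new(DYNOS, 0)
      worst  = 0
      TICKS.times do
        ARRIVALS.times do
          if two_choice
            a, b = rand(DYNOS), rand(DYNOS)              # peek at two random dynos
            queues[queues[a] <= queues[b] ? a : b] += 1  # join the shorter queue
          else
            queues[rand(DYNOS)] += 1                     # pure random routing
          end
        end
        queues.map! { |q| q > 0 ? q - 1 : 0 }            # each busy dyno finishes one
        worst = [worst, queues.max].max
      end
      worst
    end

    puts "random:     max depth #{max_queue_depth(two_choice: false)}"
    puts "two-choice: max depth #{max_queue_depth(two_choice: true)}"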

goronbjorn, over 12 years ago
Aside from the Heroku issue, this is an amazing use of RapGenius for something besides rap lyrics. I didn't have to google anything in the article because of the annotations.

zeeg, over 12 years ago
If this is such a problem for you, why are you still on Heroku? It's not a be-all, end-all solution.

I got started on Heroku for a project, and I also ran into limitations of the platform. I think it can work for some types of projects, but it's really not that expensive to host 15M uniques/month on your own hardware. You *can* do just about anything on Heroku, but as your organization and company grow, it makes sense to do what's right for the product, and not necessarily what's easy anymore.

FYI, I wrote up several posts about it, though my reasons were different (and my use case is quite a bit different from a traditional app):

* http://justcramer.com/2012/06/02/the-cloud-is-not-for-you/

* http://justcramer.com/2012/08/30/how-noops-works-for-sentry/

rdl, over 12 years ago
Wow. I suspect Rap Genius has the dollars now where it's totally feasible for them to go beyond Heroku, but it still might not be the best use of their time. But if they have to do it, they have to do it.

OTOH, having a customer hit a problem this serious AND still say "we love your product! We want to remain on your platform," just asking you to fix something, is a pretty ringing endorsement. If you had a marginal product with a problem this severe, people would just silently leave.

lquist, over 12 years ago
Heroku implemented this change in mid-2010, then sold to Salesforce six months later. Hmm... wondering how this impacted revenue numbers as customers had to scale up dynos following the change...

bifrost, over 12 years ago
I am only going to suggest a small edit: s/Postgres can't/Heroku's Postgres can't/

PG can scale up pretty well on a single box, but scaling PG on AWS can be problematic due to the disk I/O issue, so I suspect they just don't do it. I'd love to be corrected :)

bad_user, over 12 years ago
I noticed problems with Heroku's router too.

However, contrary to the author, I'm serving 25,000 real requests per second with only 8 dynos.

The app is written in Scala and runs on top of the JVM. And I was dissatisfied, because 8 dynos seems like too much for an app that can serve over 10K requests per second on my localhost.

nacho2sweet, over 12 years ago
Why is everyone against rapgenius.com for "forcing the issue with a public spat"? They are a customer not getting a service they are paying for. I would be fucking pissed too. Heroku isn't delivering the darling service they advertised. They tried to work on it with Heroku. This is useful information to most of you. Are most of you against Yelp?

zeeg, over 12 years ago
Here's a very simple gevent hello-world app, run from inside AWS on an m1.large:

https://gist.github.com/dcramer/4950101

For the 50-dyno test, this was the second run, on the assumption that the dynos had to warm up before they could effectively service requests.

You'll see that with 49 more dynos, we only managed to get around 400 more requests/second on an app that isn't even close to real-world.

(By no means is this test scientific, but I think it's telling.)

thehodge, over 12 years ago
Shame that this seems to have been flagged off the homepage before a reasonable discussion can ensue

abat, over 12 years ago
The cost of New Relic on Heroku looks really high because each dyno is billed like a full server, which makes it many times more expensive than if you were to manage your own large EC2 servers and just have multiple rails workers.<p>New Relic could be much more appealing if they had a pricing model that was based on usage instead of number of machines.

omfg, over 12 years ago
Someone from Heroku really needs to weigh in on this.

kmfrk, over 12 years ago
Looks like Cloud 66 couldn't have picked a better day to announce their service: http://news.ycombinator.com/item?id=5213862

tim_sw, over 12 years ago
Randomized routing is not necessarily bad if they look at 2 choices and pick the minimum. See http://en.wikipedia.org/wiki/2-choice_hashing and http://www.eecs.harvard.edu/~michaelm/postscripts/handbook2001.pdf

chc, over 12 years ago
OK, maybe I'm missing something here, but it seems to me that the OP's real problem is that he's artificially limiting himself to one request per dyno. Heroku now allows a dyno to serve more than one request at a time, and he's presenting that as a *bad thing*! It seems to me that the answer to Rap Genius' problems is not "rage at Heroku," but rather "gem 'unicorn'".

ajsharp, over 12 years ago
My hunch is that Heroku isn't doing this to bleed customers dry. I know more than a few really, really great people who work there, and I don't think they'd stand for that type of corporate bullshittery. If this were the case, I think we'd have heard about it by now.

My best guess is that they hit a scaling problem with doing smart load balancing. Smart load balancing, conceptually, requires persistent TCP connections to backend servers. There's some upper limit per LB instance or machine at which maintaining those connections causes serious performance degradations. Maybe that overhead became too great at a certain point, and the solution was to move to a simpler random round-robin load-balancing scheme.

I'd love to hear a Heroku employee weigh in on this.

squidsoup, over 12 years ago
Given that ElasticBeanstalk has support for rails now, does Heroku still have any advantage over AWS for a new startup?

jotto, over 12 years ago
http://stackoverflow.com/questions/6370479/heroku-cedar-slower-response-time-than-bamboo (from June 2011), the first discovery of the Cedar stack being slower than the Bamboo stack.

blatyo, over 12 years ago
I assumed people were running multiple Rails processes on their dynos.

http://michaelvanrooijen.com/articles/2011/06/01-more-concurrency-on-a-single-heroku-dyno-with-the-new-celadon-cedar-stack/

nsrivast, over 12 years ago
OP is a friend of mine, and when I first heard of his problem I wondered if there might be an analytical solution to quantify the difference between intelligent and naive routing. I took this problem as an opportunity to teach myself a bit of queueing theory [1], which is a fascinating topic! I'm still very much a beginner, so bear with me; I'd love to get any feedback or suggestions for further study.

For this example, let's assume our queueing environment is a grocery store checkout line: customers enter, line up in order, and are checked out by one or more registers. The basic way to think about these problems is to classify them across three parameters:

- Arrival time: do customers enter the line in a way that is Deterministic (events happen over fixed intervals), randoM (events are distributed exponentially and described by a Poisson process), or General (events fall from an arbitrary probability distribution)?

- Checkout time: same question for customers getting checked out. Is that process D, M, or G?

- N = the number of registers.

So the simplest example would be D/D/1, where, for example, every 3 seconds a customer enters the line and every 1.5 seconds a customer is checked out by a single register. Not very exciting. At a higher level of complexity, M/M/1, we have a single register where customers arrive at rate λ and are checked out at rate μ (in units of customers per time interval), where both λ and μ obey Poisson distributions. (You can also model this as an infinite Markov chain where your current node is the number of people in the queue; you transition to a higher node with rate λ and to a lower node with rate μ.) For this system, a customer's average total time spent waiting in the queue is 1/(μ - λ) - 1/μ.

The intelligent routing system routes each customer to the next available checkout counter; equivalently, each checkout counter grabs the first person in line as soon as it frees up. So we have a system of type M/G/R, where our checkout time is Generally distributed and we have R > 1 servers. Unfortunately, this type of problem is analytically intractable, as of now. There are approximations for waiting times, but they depend on all sorts of thorny higher moments of the general distribution of checkout times. But if instead we assume the checkout times are randomly distributed, we have an M/M/R system. In this system, the total time spent in queue per customer is C(R, λ/μ)/(Rμ - λ), where C(a, b) is an involved function called the Erlang C formula [2].

How can we use our framework to analyze the naive routing system? I think the naive system is equivalent to an M/M/1 case with arrival rate λ_dumb = λ/R. The insight here is that in a system where customers are instantaneously and randomly assigned to one of R registers, each register should have the same queue characteristics and wait times as the system as a whole. And each register has an arrival rate of 1/R times the global arrival rate. So our average queue time per customer in the dumb routing system is 1/(μ - λ/R) - 1/μ.

In OP's example, we have on average 9000 customers arriving per minute, or λ = 150 customers/second. Our mean checkout time is 306ms, so μ ≈ 3.
Evaluating for different R values gives the following queue times (in ms):

    Registers:      51      60     75    100    150    200    500   1000   2000   4000
    dumb routing:   16,667  1,667  667   333    167    111    37    18     9      4
    smart routing:  333     33     13    7      3      2      1     0      0      0

These are reasonably close to the simulated values. In fact, we would expect the dumb router to be comparatively even worse for the longer-tailed Weibull distribution they use to model request times, because bad outcomes (e.g. two consecutive requests at 99th-percentile request times routed to the same register) become even more costly. This observation seems to agree with some of the comments as well [3].

[1] http://en.wikipedia.org/wiki/Queueing_theory

[2] http://en.wikipedia.org/wiki/Erlang%27s_C_formula#Erlang_C_formula

[3] http://news.ycombinator.com/item?id=5216385
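
Both closed forms are easy to check numerically; a small script along these lines (using the λ, μ, and R defined above) reproduces the table:

    # Mean queue wait: smart routing (M/M/R via Erlang C) vs. random
    # routing (R independent M/M/1 queues, each at arrival rate l/r).
    def erlang_c(r, a)                 # a = offered load in Erlangs (lambda/mu)
      rho  = a / r
      term = 1.0                       # running a^k/k!, avoids huge factorials
      sum  = 1.0
      (1...r).each { |k| term *= a / k; sum += term }
      top = term * a / r / (1 - rho)   # (a^R / R!) / (1 - rho)
      top / (sum + top)
    end

    l, u = 150.0, 3.0                  # arrivals/sec, checkouts/sec per register
    [51, 60, 75, 100, 150, 200, 500, 1000].each do |r|
      smart = erlang_c(r, l / u) / (r * u - l)
      dumb  = 1 / (u - l / r) - 1 / u
      printf("R=%4d  dumb=%9.0f ms  smart=%6.1f ms\n", r, dumb * 1000, smart * 1000)
    end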

andrewcooke, over 12 years ago
I don't think some details of the argument hold. It alleges that you need more dynos to get the same throughput. But that's not true once you have sufficient demand to keep a queue of about sqrt(n) in size on each dyno (I think; someone who knows more theory than me can correct me), where you have n dynos. At that point all dynos will be running continuously, and the throughput will be the same with either routing.

The average latency will be higher, though (and the spread in latency larger).

mixedbit, over 12 years ago
I think this analysis and simulation do not account for one important thing: random routing is stateless and thus easy to distribute. Routing to the least-loaded dyno needs to be stateful. That is quite easy to implement when you have one centralized router, but for 75 dynos this router would likely become a bottleneck. With many routers, intelligent routing has its own performance cost: the routers need to somehow synchronize state, and the simulation ignores this cost.

dblock, over 12 years ago
I believe routing is not random, but round-robin. I'd like Heroku to confirm. It's still a problem. If you are looking to run Unicorn on Heroku, use the heroku-forward gem (https://github.com/dblock/heroku-forward). It works well, but application RAM quickly becomes its own issue; we failed to run it in production as our app takes ~300MB.

jasonwatkinspdx, over 12 years ago
The problem is the request arrival rate vs. the distribution of service times in your app.

New Relic may be giving you an average number you feel happy about, but the 99th-percentile numbers are extremely important. If you have a small fraction of requests that take much longer to process, you'll end up with queuing, even with a predictive least-loaded balancing policy.

This is a very common performance problem in Rails apps, because developers often use Active Record's associations without any sort of limit on row count, not considering that in the future individual users might have 10,000 posts/friends/whatever associated objects.

Fix this and you'll see your end-user latency come back in line.
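
A hypothetical example of the pattern being described, with the one-line fix:

    # Unbounded: renders every associated row, however many a user has.
    @posts = current_user.posts

    # Bounded: worst-case work per request is now fixed.
    @posts = current_user.posts.order(created_at: :desc).limit(50)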

jules, over 12 years ago
Why not get yourself ONE beefy server (or two)? That should be able to handle your 150 requests per second, simplify your architecture a lot, and buying it would be cheaper than 1 month on Heroku (at $20,000/month).

michaelfairley, over 12 years ago
There's another fun issue that falls out of this: any requests sitting in the dyno queue when the app restarts get dropped with a 5xx error. https://github.com/michaelfairley/unicorn-heroku/issues/1#issuecomment-8601906

codex_irl, over 12 years ago
Personally, I prefer Linode to Heroku. Sure, more of my time is consumed by sysadmin work, but I like having full control over my platform & setup, rather than having it virtually dictated to me. I'm always open to change, but this strategy has served me very well for almost 3 years now.

dangrahn, over 12 years ago
I was in contact with Heroku support a couple of weeks ago, since we experienced some timeouts on our production app. I got a detailed explanation of how the routing on Heroku works from a Heroku engineer, and thought I could share:

"I am a bit confused by what you mean by an 'available' dyno. Requests get queued at the application level, rather than at the router level. Basically, as soon as a request comes in, it gets fired off randomly to any one of your web dynos.

Say your request that takes 2 seconds to be handled by the dyno was dispatched to a dyno that was running a long-running request. Eventually, after 29 seconds, it completed serving the response, and started working on the new, faster 2-second request. Now, at this point it had already been waiting in the queue for 29 seconds, so after 1 second, it'll get dropped, and after another 1 second, the dyno will be done processing it, but the router is no longer waiting for the response, as it has already returned an H12.

That's how a fast request can be dropped. Now, the one long 29-second request could also be a series of not-that-long-but-still-long requests. Say you had 8 requests dispatched to that dyno at the same time, and they all took 4 seconds to process. The last one would have been waiting for 28 seconds, and so would be dropped before completion and result in an H12."

jhuckestein, over 12 years ago
Watch out: this affects small Rails applications with few dynos as well.

If you hit the wall with one dyno and add another one, you won't get twice the throughput, even though you pay twice the price.

I've always had suspicions about this on some smaller apps but never really looked into it. You can configure New Relic to measure round-trip response times on the client side. At peak loads those would be unreasonably high, much higher than network latency alone could explain.

simpletouch, over 12 years ago
This is something I have been struggling with for a long while. It is very troublesome when a dyno cycles itself (as they always will, at least every 24 hours), because the routing layer continues to send it requests, resulting in router-level "Request Timeout" errors if it takes too long to restart.

It is especially difficult to diagnose when the queue and wait time in your logs are 0. What is the point of these fields in the logs if nothing ever waits or queues?

tlrobinson, over 12 years ago
Question from a non-Ruby-expert: does Thin, which uses Event Machine, help with this at all, or do requests still block on other IO like database calls, etc?

zrail, over 12 years ago
For those of you looking to migrate to other, barer hosting solutions like AWS or another VPS provider, I've put together a Capistrano add-on that lets you use Heroku-style buildpacks to deploy, with Nginx doing front-end proxying. I use it for half a dozen apps on my VPSs and it works swimmingly well.

https://github.com/peterkeen/capistrano-buildpack

anon640, over 12 years ago
"For a Rails app, each dyno is capable of serving one request at a time."

Is this a deliberate design choice on Heroku's part, or is this just how Ruby and Rails work? It sounds bizarre that you would need multiple virtual OS instances just to serve multiple requests at the same time. What are the advantages of this over standard server fork()/threaded accept designs?

trotsky, over 12 years ago
Those charts of "simulated" load-balancing strategies don't look at all reasonable at first glance. You certainly don't see such spiky patterns with normal web loads. I think you'd have to crank the standard deviation of completion time way, way up in your simulation before you saw a bunch of servers stacked at 30 requests with others at 1.

It's not that there is no benefit to better balancing; it's just that I've never seen it have anything close to that impact. It seems like it's only being perceived as a problem here because somebody drank too much of the (old) Kool-Aid.

Some of the other numbers are hard to take at face value as well. A 6000ms average on a specific page? If requests are getting distributed randomly, shouldn't all your pages show a similar average time in queue? It sounds more like they're using a hash-balancing algorithm and the static page was hashing onto a hot spot.

Giszmo, over 12 years ago
I searched for "variance" and apparently nobody has mentioned this before. They report a mean request time of 306ms and a median request time of 46ms, which indicates very high variance, so don't take for granted that a 50x performance increase would result from intelligent routing. The problem is that the fast tasks suffer from being queued behind the slow tasks, so each fast task incurs extra latency. If the variance were lower, random routing would be favorable at some point, since the delay of getting a task from the router queue to the dyno is not zero either. In the case of no variance, "intelligent routing" would always add that delay as soon as all dynos are at their limit. Before that, the router would simply keep a list of idle dynos and send work there without delay.

Sure, if you never hit 100% load, intelligent routing is cheap and adds no delay. But imagine 40ms jobs keeping all dynos at 100% load. Now each dyno would sit idle for the duration of the ping it takes to report being idle; let that be 4ms. That is 10% less throughput than with items queuing up on the dyno.

The router being the bottleneck would therefore justify making it stateless and giving the dynos a chance to use that last 10% of processing power as well, ultimately increasing throughput by 10%. That said, a serious project would not run its servers at 120% load hoping to eventually get back to 100% in time, so I would always favor intelligent routing to keep servers responsive, add dynos at rush hour, and only opt for dyno-queuing for work that can tolerate a delay (scientific number crunching, ...).

pdog, over 12 years ago
What's the advantage of randomized routing over intelligent routing? Why would this change be made?

pointful, over 12 years ago
Just adding a top-level post to point out something buried in one of the threads here that is an important point about what is happening:

The "queue at the dyno level" is coming from the Rails stack; it's not something that Heroku is doing to/for the dynos.

Thin and Unicorn (and others, I imagine) will queue requests as socket connections on their listener. Both default to a backlog of 1024 requests. If you lower that number, Heroku will (according to the implications in the documentation on H21 errors) try multiple other dynos first before giving up.

See https://devcenter.heroku.com/articles/error-codes#h21-backend-connection-refused

For a single-threaded process to be willing to backlog a thousand requests is problematic when combined with random load balancing. Dropping this number down significantly will lead to more sane load-balancing behavior by the overall stack, as long as there are other dynos available to take up the slack.

Also, the time the request spends on the dyno, including the time in the dyno's own backlog, is available in the Heroku router log. It's the "service" time that you'll see as something like "... wait=0ms connect=1ms service=383ms ...". I definitely wish New Relic was graphing that somewhere...
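
In Unicorn's case, the knob is the listen socket's backlog option; a sketch (the value 16 is arbitrary, not a recommendation):

    # config/unicorn.rb -- refuse connections instead of queueing 1024 deep,
    # so the router can retry another dyno per the H21 behavior cited above
    listen ENV.fetch("PORT", "3000").to_i, backlog: 16
    worker_processes 3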

juanbyrge, over 12 years ago
LOL, Heroku is not designed for real apps. It's designed for side projects and consulting projects that don't go anywhere.

Anytime you get traffic, move off ASAP!

mattbillenstein, over 12 years ago
Single request per server? What year is this? 2003?

jmount, over 12 years ago
This turns out to be a great queueing problem. Please check out my analysis of a much-simplified version of the random routing algorithm, which fails with near certainty: http://www.win-vector.com/blog/2013/02/randomized-algorithms-can-have-bad-deterministic-consequences/

kawsper, over 12 years ago
This explains why some of my benchmarking tools gave very different and sometimes weird results when I was figuring out how many dynos our application needed.

As this blog post also states, Heroku really needs to keep their documentation up to date. I sometimes stumble across something old referring to an old stack, or something contradictory.

mleach, over 12 years ago
The balance of a subjective, sensationalist headline with objective statistical simulation was impressive.

I'm a huge Heroku fan using Cedar/Java, but I can't help but wonder how many optimization options remain for Rails developers, assuming nothing else changes on Heroku:

* Serving static HTML from a CDN
* Unicorn
* Redis caching with multi-get requests

frankc, over 12 years ago
I know nothing about Heroku's architecture beyond what I just read in this post, but couldn't you alleviate this problem greatly by having the dynos implement work stealing? Obviously they would then have to know about each other, but perhaps that is easier to do than global intelligent routing.

vineet, over 12 years ago
It seems that the delay comes from the variance in the length of the different jobs. Having slow jobs is generally not a good idea, and I can imagine that they are happening for uncommon tasks.

When you are running 100+ servers, a simple answer would be to treat these uncommon tasks differently. Options would be prioritizing them differently, showing different UI indicators, and running them on a separate set of machines.

Doing these would mean that an intelligent routing mechanism would not have as much use. Am I wrong here?

I do believe Heroku should document such problems more clearly, so that we know what challenges we face as we develop applications, but in this particular case it seems they have the right plumbing, and it just needs to be used differently.

eignerchris_, over 12 years ago
Thanks for calling this out. As you said, random routing is about as naive as it gets. They need to make upgrades to the routing mesh - expose some internal stats about dyno performance and route accordingly. Even if the stats were rudimentary, anything would be an improvement over random.

fatbird, over 12 years ago
Maybe this is a dumb question, but wouldn't straightforward Round Robin routing by Heroku restore their "one dyno = one more concurrent request" promise without incurring the scaling liabilities of tracking load across an arbitrarily large number of dynos?

clouddevops, over 12 years ago
Perhaps easy deployments are not worth the performance and black-box trade-offs. An alternative approach is a cloud infrastructure provider with bare-metal and virtual servers on an L2 broadcast domain, one that provides a good API and orchestration framework so that you can easily automate your deployments. Here are some things we at NephoScale suggest you consider when choosing an infrastructure provider: http://www.slideshare.net/nephoscale/choosing-the-right-infrastructure-provider

grandalf, over 12 years ago
I'd think that most of the requests served by rapgenius.com would be highly cacheable (99% are likely just people viewing content that rarely changes).

It seems weird that the site would have such a massive load of non-cacheable traffic. Heroku used to offer free and automatic Varnish caching, but the Cedar stack removed it. Some architectures make it easy to use CloudFront to cache most of the data being served. My guess is that refactoring the app to lean on CloudFront would be easier, more cost-effective, and faster than manually managing custom scaling infrastructure on EC2.
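
In Rails, opting a page into downstream caching is a one-liner; a sketch (the controller, model, and TTL are all hypothetical):

    class SongsController < ApplicationController
      def show
        @song = Song.find(params[:id])
        # Sets Cache-Control: public, max-age=3600, so a CDN or Varnish
        # in front can serve repeat views without touching a dyno.
        expires_in 1.hour, public: true
      end
    end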

AliEzer, over 12 years ago
Interesting article, but every time I read something on RapGenius and move my eyes from the screen, I keep seeing white lines. Very annoying. White font on a black background is bad. Off-topic, I know, but still.

habosa, over 12 years ago
Somewhat unrelated: does anyone else think that RapGenius makes a great blogging platform? I'd love a plugin that enabled similar annotations on any blog, even if they're just by the original author and not crowdsourced.

benjamincburns, over 12 years ago
This kind of validates an idea I've been flirting with: a Heroku-like service which routes requests via AMQP or a similar message broker and actually exposes the routing dynamics to the client apps.

From a naive, inexperienced view, the idea of having web nodes "pull" requests from a central queue, rather than the queue taking uneducated guesses, seems a no-brainer. I can see this making long-running requests (keep-alive, streaming, etc.) a bit more difficult, but not impossible.

What am I missing? This seems so glaringly obvious that it must have been done before...
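
A minimal sketch of that pull model with the bunny gem (the queue name, prefetch value, and handler are assumptions): each worker takes exactly one unacknowledged message at a time, so a slow request never strands others behind it.

    require "bunny"

    conn = Bunny.new.tap(&:start)
    ch   = conn.create_channel
    ch.prefetch(1)                        # pull one request at a time

    ch.queue("web_requests", durable: true)
      .subscribe(manual_ack: true, block: true) do |delivery, _props, payload|
        handle_request(payload)           # hypothetical application handler
        ch.ack(delivery.delivery_tag)     # ack => broker hands over the next one
      end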

lquist, over 12 years ago
How does this compare to EngineYard/AppFog/any other Heroku competitors?

googletron, over 12 years ago
Does anyone know if Python applications are affected by this? I know they can handle multiple requests per dyno, but I would be interested to know if random routing affects Python apps too.

joshwa, over 12 years ago
Fundamentally we're talking about a load balancer. Even the most basic load balancers can use a least-connections algorithm. Even a round-robin algorithm would be better, since that would give each dyno (number-of-dynos * msec-per-request) to finish a long-running request. Random routing is a viable option where the number of concurrent requests a node can handle is large or unknown, but when the limit is known and in the *single digits*, random routing is a recipe for disaster.

justinhj, over 12 years ago
Seems like the message here is that if you use an off-the-shelf solution, you need to work around its limitations. In this case random load balancing may sound dumb, but it's actually quite a reasonable way to spread load. The customer's real problem is the single-threaded server bottleneck, compounded by the sporadic slow requests. It seems they have outgrown Heroku and a more custom solution is required. Either that, or rebuild the server, in whole or in part, with a more concurrent one.

jonnycat, over 12 years ago
This might be the case "out of the box", but it's very simple to go multithreaded on the Cedar stack and avoid this issue (provided that your app is thread-safe, of course).

You can do this pretty easily with a Procfile and thin:

    bundle exec thin -p $PORT -e $RACK_ENV --threaded start

and then config.threadsafe! in the app.

Regarding Rails app thread safety, there are some gotchas around class-level configuration and certain gems, but by and large these issues are easily manageable if you watch out for them during app development.

haddr, over 12 years ago
I would like to see why Heroku actually fell back to random routing. It doesn't really make sense. Of course all this routing stuff is really tricky, but on the other hand a lot of work has been done here (look at TCP algorithms). When I was studying ZeroMQ-based routing for one project, I came across the "credit-based flow control" pattern, which could make perfect sense in this kind of situation (publisher-subscriber scenario). Why not implement such a thing?

krutulis, over 12 years ago
I can't help but wonder if this kind of surreptitious change to the platform might in any way be connected to Byron Sebastian's sudden resignation from Salesforce last September. Is that nutty of me?

http://gigaom.com/2012/09/05/heroku-loses-a-star-as-ceo-and-salesforce-evp-sebastian-resigns/

Uchikoma, over 12 years ago
~$20,000 a month sounds like a lot of money; they would need at least $100M a year in revenue to justify that number. This will be a major challenge if they want to grow profitable after the $15M in VC money runs out. I'd assume they'd get the same for $5,000 in rented servers, which would free up enough money (outside of the valley) to hire a DevOps person and another developer.

gtirloni, over 12 years ago
VMs running full frameworks that are single-threaded. Why does that feel like wasting resources or bloating the architecture?

izietto, over 12 years ago
From the Heroku docs:

> [...] Request distribution
> The routing mesh uses a random selection algorithm for HTTP request load balancing across web processes. [...]

If the algorithm is random, load balancing simply doesn't happen, am I wrong?

https://devcenter.heroku.com/articles/http-routing#request-distribution

JuDue, over 12 years ago
I'd love to see some better tutorials on how to use AWS Elastic Beanstalk to scale Rails apps.

There is this one, but it doesn't give me a sense of the scalability or management: http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/create_deploy_Ruby_rails.html

Any recommendations?

jedahan, over 12 years ago
Take a look at deliver if you like Heroku-style pushes but want them on a machine you control at a bit lower level: https://github.com/gerhard/deliver. I got it working on EC2/Ubuntu real easy, and even added some basic support for SmartOS/illumos for the Joyent cloud.

filvdg, over 12 years ago
You can model queues and calculate the service level when you do intelligent routing.

The Erlang C formula expresses the probability that an arriving customer will need to queue (as opposed to being served immediately).

http://owenduffy.net/traffic/erlangc.htm

hpguy, over 12 years ago
Can anyone explain why this random routing is supposedly fine for Node.js and Java? I mean, the net effect is that busy dynos might be handed more requests while idle ones remain idle, and that is certainly not good for Node.js or anything else. What am I missing?

cwalcott, over 12 years ago
Managing load across thin workers isn't very hard... haproxy (http://haproxy.1wt.eu/) makes it pretty easy to set up rather complex load distributions (certainly more complex than random!).
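
For comparison, least-connections routing with a hard concurrency cap is a few lines of haproxy.cfg (addresses and names here are hypothetical):

    # Send each request to the thin with the fewest open connections,
    # and never more than one in flight per single-threaded worker.
    backend thins
        balance leastconn
        server thin1 127.0.0.1:3001 maxconn 1
        server thin2 127.0.0.1:3002 maxconn 1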

leoh, over 12 years ago
Ugh. These guys are so cocky.

evan2, over 12 years ago
Great article. Have you thought about the alternative of building your own auto-scaling architecture with 99.9% uptime? I'd be interested to hear if you plan to move off Heroku and, if so, what your plans are.

craigkerstiens, over 12 years ago
The initial response from the GM of Heroku: https://blog.heroku.com/archives/2013/2/15/bamboo_routing_performance/

wastedbrains, over 12 years ago
Heroku for static content is always terrible. I am always surprised at how many people host static sites on Heroku; it is really easy to host off S3 buckets, and much faster for static pages.

lil_tee, over 12 years ago
We have updated our post to incorporate some popular suggestions and reactions into our simulations: http://rapgenius.com/1504221

ratherbefuddled, over 12 years ago
If it's true, I can't see how random routing can be anything but a cynical cash grab.

Even a very simple algorithm like round-robin would give you a significantly better latency characteristic, wouldn't it?

aneth4, over 12 years ago
Would love to see some more perspectives on this. We also spend a lot of resources on Heroku.

I'm not sure if this change by Heroku is worse than the intermediary popups on all the links on this blog.

joeblau, over 12 years ago
Thanks for the write-up. I've been looking for more reviews of Heroku's platform, and this in-depth review definitely illuminates some challenges with it.

oellegaard, over 12 years ago
This really sucks. I like all their other offerings, though. I'm considering running the Cloud Foundry "dyno" part alone and using the Heroku services with it.

pm90, over 12 years ago
Did anyone else find the headline a bit confusing? I got the feeling that they had abandoned RoR for another framework, and almost skipped the article itself.

EGreg, over 12 years ago
And this kind of thing is why I prefer to have our own VPS. Linode is great, but we are slowly switching over to AWS and automating all the scaling up/down.

adminonymous, over 12 years ago
I do hope that someone brings this up during Heroku's "Waza" developer conference. It's the perfect opportunity to air it out.

cachvico, over 12 years ago
Can't they make the choice of intelligent or random scheduling a per-platform setting?

Java and Node.js use random; Django and Rails use intelligent.

MediaSquirrel, over 12 years ago
Heroku: The Rap Genius "Success" Story

http://success.heroku.com/rapgenius

philipDS, over 12 years ago
Off-topic: RapGenius should really open source their "Explain tooltips" with the inline explanation window. Awesome :)

knodi, over 12 years ago
This kind of routing wouldn't be a problem if they didn't charge $35 a dyno. It's such a high cost for a dyno.

bryanwbh, over 12 years ago
Thanks for the write-up on this, as I am currently reviewing Heroku as an option for a proper PaaS for my app.

damian2000, over 12 years ago
A bit of context: http://success.heroku.com/rapgenius

Sami_Lehtinen, over 12 years ago
Interesting. No, I didn't read it, because the page reliably crashes my mobile browser every time.

rubyrescue, over 12 years ago
This is a bit hyperbolic. "Heroku Swindle Factor" just seems rude.

aren55555, over 12 years ago
This was a great read.

benihana, over 12 years ago
I really like Rap Genius, but I wish they would tone down the blackness of the background. Reading #CCC text on #000 background makes my eyes bug out.

seivan, over 12 years ago
    rails server -p $PORT
    rake jobs:work

WEBrick and DJ? No Procfile? No Unicorn or Puma? No worker processes or threads defined?
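
For reference, the kind of Procfile being asked about might look like this (the server choice and config path are illustrative):

    web: bundle exec unicorn -p $PORT -c ./config/unicorn.rb
    worker: bundle exec rake jobs:work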

dschiptsov, over 12 years ago
Why on Earth would any sane engineer think that adding layers of "virtualized" crap in front of your application will be of any benefit?)

The only advantage of virtualization is at the development stage, and it is the ability to quickly add more slow and crappy resources you don't own.)

Production is an entirely different realm, and the fewer layers of crap between your TCP request and DB storage, the better. As for load balancing: it is a Cisco-level problem.)

Last question: why must each web site be represented as a hierarchy of objects, instead of being thought of in terms of what it actually is: a list of static files and some cached content generation on demand?)

alekseyk, over 12 years ago
How do they know what algorithm Heroku uses for randomization, to simulate the results?

The differences between the simulations are astonishing; I would not think Heroku's engineers were fine with this approach.

"Let's push this random balancing out... a 1000% increase in resources? Oh well, just update the documentation!"

felipelalli, over 12 years ago
Wow!