> <i>This works for future customers, since once Heroku makes these documentation changes, everyone who signs up will understand exactly how routing works. But it does nothing to address the time and money that existing customers have spent over the past few years. What does Heroku owe them?</i><p>Lawyers -- well, judges really -- are good at coming up with answers for this exact sort of question.<p>I am not being facetious. There are legal rules for assessing losses in even very complex, very entangled situations. If you feel Heroku has dudded you, find a torts lawyer.<p>Heck, Salesforce.com have deep pockets. Round up a few other $20k/month customers and start a class action.<p>Web companies need to realise that boring old-fashioned rules like "your claims should not be misleading" apply to them too.<p>(IANAL, TINLA)
If you use Heroku and New Relic, make sure you install the gem we wrote to make New Relic report correct queue times: <a href="https://github.com/RapGenius/heroku-true-relic" rel="nofollow">https://github.com/RapGenius/heroku-true-relic</a>
My guess as an armchair observer (and tiny-scale Heroku user) would be that Heroku will offer some affected customers refunds, especially if those customers "threw dynos" at latency problems that were aggravated by the drift in Bamboo routing behavior and hidden by the misleading NewRelic monitoring.<p>I don't think Adam@Heroku's response on the 11th is that bad. He accepts the feedback and also wants Heroku to help RapGenius 'modernize their stack'. That's not a full and proper solution, nor a remedy for the lost cost/effort so far, but it would have offered a lot of performance and cost relief.<p>In fact, I think that's why this problem festered: many customers managed to soften the pain by going to Cedar, multiple-workers, app-optimizations, and more dynos... so deeper investigations kept getting backburnered, both inside and outside Heroku, until now.<p>RapGenius has done us a mitzvah by finally digging deeper, but I'm still eager to see what Heroku thinks the right remedies are, beyond RapGenius's 'must do' ultimatums.
There are still some important points missing from the discussion:<p>1. Operating at scale with parallel routing.
2. Handling faults while operating at scale with parallel routing.
3. Providing correct statistical models for the situation. The one we have right now is a crude approximation.
4. Measuring the real system for problems.<p>The optimal routing gives each dyno 0 or 1 jobs at a time and keeps all incoming requests in a global queue. But that creates a latency problem, since it takes time for a dyno to report that it is "ready". The net result is very bad performance, and the global queue is a single point of failure. The solution is to queue at the dynos, because this removes that latency, but at the price you see RG paying when a dyno can only serve one request at a time.<p>If a dyno does not report "ready" to the routing mesh, then you can't route optimally:<p>Queue length doesn't work, since a queue of length 1 may hold a single 7000ms request. Another queue of length 5 consisting of five 70ms requests is better to route to.<p>The time the last request spent in queue is not useful either, because the very next request may be a 7000ms one.<p>So to solve this problem, you must do something else. You cannot use "intelligent routing" unless you can describe how it will work when distributed across, say, 8 routing machines while avoiding latency. And while you are at it, you had better measure your solution in a real-world scenario.
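To make the queue-length point above concrete, here is a small sketch using the two queues from that comment (one 7000ms request versus five 70ms requests). The function names are made up for illustration, and it assumes each dyno serves one request at a time. It also assumes the request durations are known, which a real router cannot know in advance; that is exactly why neither heuristic is available for free.

```python
# Pending request durations in milliseconds, taken from the comment above.
queue_a = [7000]      # length 1: a single slow request
queue_b = [70] * 5    # length 5: five fast requests
queues = [queue_a, queue_b]

def pick_by_length(queues):
    """Hypothetical router that picks the shortest queue."""
    return min(range(len(queues)), key=lambda i: len(queues[i]))

def pick_by_pending_work(queues):
    """Hypothetical oracle that picks the queue with the least total work."""
    return min(range(len(queues)), key=lambda i: sum(queues[i]))

by_length = pick_by_length(queues)
by_work = pick_by_pending_work(queues)

# How long a new request waits before it starts, assuming one request
# at a time per dyno and that nothing in the queue has started yet.
wait_by_length = sum(queues[by_length])
wait_by_work = sum(queues[by_work])

print(f"shortest queue -> dyno {by_length}, wait {wait_by_length} ms")
print(f"least work     -> dyno {by_work}, wait {wait_by_work} ms")
```

Routing by length sends the new request behind the slow job, while the longer queue would have drained much sooner.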
This incident has done wonders for RapGenius's technical brand. I don't know how many people would've identified them as a 'tech company' before, but that number has surely gone up.
Guys, you've made a lot more money than me, so you don't need my advice. But if you want money back, you should probably be communicating in private through your lawyers. Posts like this look like you're trying to get (more) attention.
Heroku's suggestion: "modernize and optimize your web stack."<p>I don't have any experience with Ruby web stacks, so I'm curious whether this is actually an option for you guys. What would it take to do that? Would the performance increase on Heroku be worth it?<p>It also seems like if you wanted to self-host you would probably need to make those same improvements, right?<p>Please don't take my comment the wrong way, I'm not trying to say Heroku is somehow excused from their mistakes here. I'm just trying to understand that suggestion from Heroku.
I've lost a lot of faith in Heroku this last week. Going to be doing a lot of investigating Cloud66/Elastic Beanstalk + EC2 for my Rails app. Good excuse to up my sysadmin abilities a bit.
Why does Adam Wiggins repeatedly use the word 'evolve' as a transitive verb in an awkward fashion? Is this some sort of start-up usage that I managed to avoid thus far?<p>"We're working on evolving away from the global backlog concept in order to provide better support for different concurrency models, and the docs are no longer accurate."<p>"Getting user perspective is very helpful and I'll apply your feedback as we continue to evolve our product."<p>"You're correct that we've made some product decisions over the past few years that have evolved our HTTP routing layer away from the 'intelligent routing' approach that we used in 2009."<p>Evolve to me connotes natural selection -- which is rather more haphazard than I would hope for from an engineering process.
Maybe this is offtopic, but I really don't like the way Rap Genius does links. It makes it so I essentially have to click on each link twice to get to what it actually goes to...
I'm sorry, but I don't understand any of this hating on Rap Genius.<p>There's a reason they are the fastest growing YC company ever, and got a16z in for 15M -- because they are straight killers. They have quietly created an internet empire until this point, and are building something that people love and use every day.<p>A lot of folks wouldn't have the chutzpah to call out Heroku like that, or are just too small to attract this kind of attention. To me it seems as though they are helping Ruby devs save money and time. 8 dynos vs 4 dynos is a hell of a big difference when you're starting out. Also, seems like something that would be pretty fun to do if you worked there.
Thank you so much for forcing Heroku to confront this issue!<p>We've been seeing strange delays and optimizing based on New Relic for a long time... and whenever we reported this to Heroku, they would not admit to an issue.<p>We ended up using threads (on cedar stack) to get more concurrency per dyno.
"Explain Now, as Rap Genius is widely known for its expertise in queuing theory"
Is this true, or are they being sarcastic that if they could do it Heroku really should've?
Ironically, it's possible to get a huge gain over purely random load balancing by examining just two queues at random -- essentially, you should always be doing this since the cost is O(1) and the improvement is large.[0] This doesn't require any distributed locking and at least would qualify as "intelligent" routing -- probably the bare minimum needed to justify that marketing label.<p>Oh, and it also scales incredibly well. Like I said, there's no reason not to use it over purely random load balancing.<p>[0] <a href="http://www.eecs.harvard.edu/~michaelm/postscripts/mythesis.pdf" rel="nofollow">http://www.eecs.harvard.edu/~michaelm/postscripts/mythesis.p...</a>
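A minimal sketch of the two-choice idea from the comment above, written as a balls-into-bins simulation. The function name `max_load` and the parameters are hypothetical and not taken from the linked thesis or from Heroku's router. It also simplifies heavily: requests never finish, so `loads` only grows, and each "dyno" is just a counter.

```python
import random

def max_load(n_requests, n_dynos, choices, rng):
    """Send each request to the least-loaded of `choices` randomly sampled dynos.

    choices=1 is purely random routing; choices=2 is the two-choice scheme.
    Returns the highest load seen on any dyno at the end.
    """
    loads = [0] * n_dynos
    for _ in range(n_requests):
        # Sample with replacement: the same dyno may be drawn twice.
        candidates = [rng.randrange(n_dynos) for _ in range(choices)]
        target = min(candidates, key=lambda d: loads[d])
        loads[target] += 1
    return max(loads)

rng = random.Random(42)  # fixed seed so runs are repeatable
random_max = max_load(10_000, 10_000, choices=1, rng=rng)
two_choice_max = max_load(10_000, 10_000, choices=2, rng=rng)

print("max load, purely random:", random_max)
print("max load, two choices:  ", two_choice_max)
```

Each routing decision touches only two counters, so the per-request cost stays constant regardless of how many dynos exist. Keeping those counters accurate across several routers is a separate problem that this sketch ignores.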
Heroku should have done something about the issue earlier, but it seems like the problem was just poor prioritization/time management on their end. Yes, these posts got them to finally get moving, but I wonder if perhaps RapGenius could have had the same effect by continuing to bug them privately in the same unyielding manner, instead of going public with it so quickly. That would have allowed Heroku to focus their energy on fixing the problem, rather than on worrying about PR and class action lawsuits.<p>Also, on the topic of lawsuits, how many small startups will go out of business if they get hit with a class action lawsuit every time their documentation accidentally diverges from reality? In this case, RapGenius is small and Salesforce is big, but the legal system will apply the same standard when the plaintiff is big and the defendant is poor. If this becomes precedent, then soon we will have lawyers trying to treat any public post by company employees as 'documentation', forcing startups to have a policy of not allowing their employees to freely help others with their product in public forums. Also, any small startup with a large competitor will have the large competitor paying people to sign up for the product with the sole intent of finding a bug in the documentation so that the small startup can be sued out of business.
I agree that Heroku's response is pretty unbelievable and their engineering choices very suspect. Reading the email chain between Tom & Adam really drives home how badly this has been handled by Heroku.<p>Heroku is massively crippling its own product with random routing. Other cloud providers have been able to get this right, and Heroku very obviously knows what kind of applications are running on its servers (e.g. deploy a Rails application, Heroku says "Rails" in the console). It would not be difficult to apply different routing schemes for each type of application.<p>Given that this has been going on for years now, Heroku is acting with either pronounced malice or incompetence. Any competent engineer would not be satisfied with switching the routers over to random and calling it a day. How could that have possibly been approved, then remained for years? They must not have realized what a grave mistake it is.<p>The #1 thing they should be doing <i>right now</i> (aside from damage control) is to move the routers over to round-robin routing. Random is the most naive scheme possible and is laughably inappropriate for this situation.<p>See for yourself using this simulator: <a href="http://ukautz.github.com/pages/routing-simulator.html" rel="nofollow">http://ukautz.github.com/pages/routing-simulator.html</a>
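For a rough feel of the random versus round-robin comparison in the comment above, here is a toy simulation. It is not the linked simulator, and every number in it is an assumption: one router, 10 dynos, a request every 10ms, and every request taking exactly 80ms. Uniform service times are the best case for round-robin; with a mix of fast and very slow requests, round-robin would also queue requests behind slow ones.

```python
import random

def mean_wait(assign, n_requests=1000, n_dynos=10, interval_ms=10, service_ms=80):
    """Mean time a request waits before starting, one request at a time per dyno.

    `assign(i, n_dynos)` returns the dyno index for the i-th request.
    """
    free_at = [0] * n_dynos  # time at which each dyno has cleared its backlog
    total_wait = 0
    for i in range(n_requests):
        arrival = i * interval_ms
        dyno = assign(i, n_dynos)
        start = max(arrival, free_at[dyno])
        total_wait += start - arrival
        free_at[dyno] = start + service_ms
    return total_wait / n_requests

rng = random.Random(0)  # fixed seed so runs are repeatable
round_robin_wait = mean_wait(lambda i, n: i % n)
random_wait = mean_wait(lambda i, n: rng.randrange(n))

print("mean wait, round-robin:", round_robin_wait, "ms")
print("mean wait, random:     ", random_wait, "ms")
```

Under these assumptions each dyno in the round-robin case gets a new request every 100ms and finishes it in 80ms, so nothing ever waits. Random assignment sometimes stacks requests on a busy dyno while others sit idle.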
To what extent would using something like Amazon's ELB mitigate this sort of issue in a bring-your-own-cloud approach? Completely?<p>I've been looking at using something like Cloud66 and an ELB to move off of Heroku.
This is big stuff...<p>Sorry to see Rap Genius investing all that money in New Relic, I can't really imagine being in their shoes.<p>I would be so pissed.<p>PS: Heroku user here