I am developing an app that pulls data from a third-party API into a database. The data is only updated once a month by the third party. The database acts as a local cache so that users can log into the app and run different types of reports against the data. Each report processes around 15k rows of data.<p>That part of the app runs like clockwork. The backend is what I am questioning. Each day a script pings the API to see if the data has been updated. If it has, a background job is queued that pulls data for the 40 topics in the database, about 220k rows in total.<p>Is Heroku suited for a task like this? Would I be better off running it on a dedicated EC2 instance that gets spun up when necessary? I would love to hear some experiences from others.<p>Notes:<p>Not using Heroku's DB, instead connected to a small Amazon RDS instance<p>2 dynos<p>Resque instead of Delayed Job
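For readers following along, the daily check described above might look roughly like the sketch below. It is only an illustration under stated assumptions: the topic names, the method names `refresh_needed?` and `schedule_refresh`, and the idea of comparing timestamps are all hypothetical, since the post does not say how the API reports an update. The `enqueue` argument is any callable; in a Resque setup it could wrap `Resque.enqueue`.

```ruby
# Hypothetical topic list; the post mentions 40 topics but not their names.
TOPICS = (1..40).map { |i| "topic_#{i}" }

# Decide whether the upstream data changed since the last successful sync.
# Both arguments are Time objects; last_synced_at is nil on the first run.
# Assumption: the third-party API exposes some "last updated" timestamp.
def refresh_needed?(remote_updated_at, last_synced_at)
  return true if last_synced_at.nil?
  remote_updated_at > last_synced_at
end

# Enqueue one job per topic so a failure only needs to retry that topic.
# `enqueue` is any callable taking a topic name.
# Returns the number of jobs queued (0 when nothing changed).
def schedule_refresh(remote_updated_at, last_synced_at, topics, enqueue)
  return 0 unless refresh_needed?(remote_updated_at, last_synced_at)
  topics.each { |topic| enqueue.call(topic) }
  topics.size
end
```

Splitting the work into one job per topic, rather than one job for all 220k rows, is a design choice and not something the post states; it keeps each job short and makes it easy to spread the work across several workers.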
Fanvibe's backend is hosted on Heroku. We have 10 processes, each pulling in real-time sports stats every 1.25 seconds. Our product has also been adopted by the NBA (yes, the National Basketball Association), and we power all of their apps across mobile and web. In a nutshell, we're serving tons of users and have huge peaks in the evenings when there are tons of games, and everything works great.<p>We have stats, an API, web / mobile destinations, and a boatload of users.
Heroku is definitely up to the job. I've stored hundreds of thousands of tweets a day using Heroku workers and a larger Amazon RDS instance.<p>Have a look at scaling your Resque workers up for your daily task, then back down once you're done. I haven't used the following technique, but it looks sound:<p><a href="http://blog.darkhax.com/2010/07/30/auto-scale-your-resque-workers-on-heroku" rel="nofollow">http://blog.darkhax.com/2010/07/30/auto-scale-your-resque-wo...</a>
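The scale-up, work, scale-down idea can be sketched as follows. This is not the code from the linked post, just a minimal illustration: `workers_for`, `with_scaled_workers`, the `jobs_per_worker` ratio, and the default limits are all made-up names and numbers, and `set_workers` is a placeholder callable for whatever actually changes the worker count on your platform.

```ruby
# Pick a worker count for the current backlog, clamped to [min, max].
# jobs_per_worker is an assumed tuning knob, not a Resque or Heroku setting.
def workers_for(pending_jobs, jobs_per_worker: 10, min: 0, max: 5)
  needed = (pending_jobs.to_f / jobs_per_worker).ceil
  needed.clamp(min, max)
end

# Scale up for the backlog, run the block, then always scale back down,
# even if the block raises, so idle workers are not left running.
# `set_workers` is a hypothetical callable that takes the desired count.
def with_scaled_workers(pending_jobs, set_workers, idle: 0, **opts)
  set_workers.call(workers_for(pending_jobs, **opts))
  yield
ensure
  set_workers.call(idle)
end
```

With the defaults above, a backlog of 40 topic jobs would ask for 4 workers during the run and drop back to 0 afterwards. The `ensure` clause is the important part: a failed import should not leave paid workers running.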
Heroku will probably work fine for this, but if you have the technical chops to run your own EC2 instances, I'd recommend that - you'll pay less and you'll have more control.<p>Since this is effectively a cronjob sort of operation, though, is there a reason you wouldn't run it on one of your application servers? Unless you're in a position where CPU + memory are effectively all spoken for, I'm not sure I understand why you'd need a dedicated instance for this sort of thing - it'll run for a bit, but I suspect most of that time will be network comms, so it's not even going to thrash your CPU that hard.