
Weekend deployments are for chumps

93 points by rockhymas about 14 years ago

14 comments

viraptor about 14 years ago
I can't believe how misguided this post is...

1. They provide a service to people around the world, yet they don't ensure that someone is available as an emergency contact on Sunday evening, when the first post-deployment usage happens.

2. They don't have a universal list of "this breaks, contact that guy".

3. They don't have a known instant rollback procedure for a release.

4. They don't have cross-component integration tests, and they don't do them manually either.

5. They decide that since they can't do a release that doesn't break stuff, and can't organise themselves to resolve it quickly during the weekend when it affects only a small number of people, they'll do releases in the middle of the day now, so that they hear customers complaining right away.

Is that for real? Is he serious? Here's what I would get out of that issue (even if it's basically reiterating the "wrong" things above):

They need to do more integration testing before a release. They need to know whom to contact and make sure that person is on call and ready for action. The person handling the issue needs a simple, quick way to reverse the release without manual intervention (tweaking the code). Again, this specific issue should get regression tests right away. And the most important thing: NEVER treat your customers as a test suite.

Of course I'm aware not everyone can afford to operate like that. But at least this could be their goal. "Let's make breakage affect more people, so we know about it earlier and when we're at work" is a really silly conclusion.
swombat about 14 years ago
Wait, what?

You have large numbers of paying customers to whom you're delivering a mission-critical system (source control isn't exactly optional), and your releases involve neither automated production monitoring/continuous deployment nor formal release procedures?

I think your problem is more than just weekend deployments!

My full comments here: http://swombat.com/2011/3/8/fog-creek-dont-do-cowboy-deployments
TamDenholm about 14 years ago
I work for web dev agencies, and it surprises me just how often they launch on a Friday afternoon despite every single developer pleading with them that it's an absolutely awful idea.

Golden rule: never launch on a Friday.

Personally, I've found it easy to persuade clients of this once you say it'll cost an extra ten grand just for the privilege of a Friday launch.
patio11 about 14 years ago
I feel for you. On the plus side, process improvements to prevent it from happening next time are *exactly* how you should respond to things like this.

One which has saved my bacon numerous times is investing a few hours into tweaking monitoring and alert systems. I hear PagerDuty exists to help with this. I use a bunch of scripts and bubblegum, and even that caught 10 of the last 12 big problems. Queuing systems dying has hosed me many times over the years, for example, and a borked deploy which causes that would have my phone ringing before I got my laptop closed.
ww520 about 14 years ago
Deploying on the weekend or at night is a terrible idea in the guise of a good one. What we used to do:

- No deployment on the weekend

- No deployment on Friday

- No deployment after 4pm, Monday to Thursday

- Deployment is rolled out in stages: one server, then 5%, 10%, 50%, 100% of servers

- Rollback steps must accompany deployment steps

- Verification steps must be specified in the deployment ticket; verification is done by QA or Ops, not Dev

- Common deployment and rollback steps are automated

- Emergency deployments are an exception to the above, but must take extra precautions to babysit the deployment process

Stress levels have gone down a lot, and problems are resolved much faster, since we put the above in place.
agentultra about 14 years ago
Amen.

Over the years I've tried convincing many companies I've worked for that weekend deployments are a bad idea.

Even with continuous integration tests, rolling deployments, and all the precautions in the world, things can still happen.

You need live people available to handle a deployment.

Personally, I don't like working on weekends. I've worked for companies that refused to believe this was a bad idea. I learned pretty fast that life is too short to work on a weekend.

If something does go wrong, it's better to have people on hand to correct the error and get back on track. It's much easier to schedule those people during the work week. It's not rocket science.
andrewvc about 14 years ago
I agree that weekend deploys are a shitty idea, but isn't the real issue here not being able to roll back?
adamzochowski about 14 years ago
I was taught that Thursdays are best for deployment, because you have Friday to fix the stupid things and then the weekend to fix the terrible things. By Monday everything is working anyway.

Best of all, on Friday people are generally happy (it's the last day of the week), whereas on Monday you can expect grumpy users.
zeruch about 14 years ago
Users seem more comfortable with predictable maintenance than arbitrary outages. Weekend deploys are just bad all around.

When I began in my current role (managing QA/DBAs and app deploys), one of the first things I killed was the late Friday/weekend deploys. They are spirit-crushing, and if they go south, they usually go south in a terminal-velocity nose dive.

We set up early Fridays for maintenance, to give us enough time in case something goes south. Aggressive Change Control Requests mean the people impacted get a heads-up (including Account Managers, who in turn inform clients) *if* there are any user-facing impacts, and we avoid trying to pack too much in at once.

Having QA, Engineering, and the SOC team on hand is... helpful. Maybe it's paranoid, but it's been very solid so far. When things have gone south, I think the events, since everyone is "on deck", have actually helped build some camaraderie in the teams themselves.
badmash69 about 14 years ago
In my experience, there are two kinds of deployment: ones without DB changes and ones accompanied by DB changes.

The deployments that do not require DB changes are easy: mirror the prod box (non-DB) onto a smaller box, then deploy upgrades/updates to the prod box. If things go wrong, put the mirror box online with a DNS/proxy configuration while apologizing to the customers who complain about slower performance.

When DB changes are involved, you need to have your DBAs do a dry run of backing out the changes; after all, practice makes perfect. Communicate the scheduled outage to customers and back up the DB. Mirror your production box. Roll out the update; if things go wrong, restore the DB and bring the mirror box online.

I have always focused more on the DB aspect: loss of data integrity can cause customers to look for your replacement.

But I am not sure that weekly upgrades of a production environment with paying customers are advisable.
mgrouchy about 14 years ago
I'm lucky to run a system that is small enough that an entire deploy consists of around 2 seconds of downtime for the server to restart and start the new instance of the application.

We deploy new versions side by side, and then the webserver points at the new application on restart.

The only time it takes any longer is when there are sweeping database changes (schedule the downtime, inspect snapshots in case of issues, etc.).
powdahound about 14 years ago
We use PagerDuty (http://pagerduty.com) at HipChat, and while I absolutely loathe being woken up by it, it's helped us identify issues during off-peak hours much more quickly.

But no matter what systems you have in place or how many hundreds of deploys you've done, there's always a new way for things to break.
Hominem about 14 years ago
Oh god. I release at 5pm Pacific every week because users "can't have a single second" of downtime. We manually test an ever-growing checklist of functionality. There is always, always an issue. The angry emails start to roll in around 5:15 Pacific.
krobertson about 14 years ago
Their problem isn't their deployment process, it's their monitoring.

Blindly ignoring errors is a recipe for failure. You should always look at a situation like that asking "how can we monitor this weak point?" Logging plus a service like Splunk works great.

You should always have a solid on-call rotation. We have two rotations: an ops one, which is the first line, and a dev one, in case deeper code changes or more eyes on the problem are needed.