Type in the exact number of machines to proceed

554 points by vii over 4 years ago

51 comments

I've seen this called "pointing and calling" [1]. Japan's train drivers use the technique to force themselves to perform actions and take notice of the current environment. I personally took it to heart; it's a good system for forcing a cache miss in the brain - make sure you're on "database production" or "database localhost" etc. [1] https://en.wikipedia.org/wiki/Pointing_and_calling

xamuel over 4 years ago

I wish it were possible for similar prompts to appear before all sorts of policy-makers and bureaucrats. "It appears you are about to institute a policy which will require 400 million patients to sign an additional waiver every time they visit a clinic, this will waste a total of 354,921 human hours within the next year alone. Please type 354,921 to proceed."

harikb over 4 years ago

I have a habit of making CLI tools that can do dangerous things default to dry-run mode. For example, instead of the typical `--dry-run` or `-n` option, my scripts have a cheesy `--do-it` flag to turn dry-run off. It is annoying as hell to my colleagues, but it has saved the day many times.
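
A minimal sketch of that dry-run-by-default idea in Python, using the standard `argparse` module. The tool, its `hosts` argument, and the `run` helper are hypothetical names made up for illustration; only the `--do-it` flag comes from the comment above.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical tool: "removes" the hosts named on the command line.
    parser = argparse.ArgumentParser(description="Remove hosts (illustrative only).")
    parser.add_argument("hosts", nargs="+", help="hosts to act on")
    # There is no --dry-run flag: dry-run is the default, and the
    # destructive path has to be requested explicitly.
    parser.add_argument("--do-it", action="store_true",
                        help="actually perform the changes")
    return parser


def run(argv: List[str]) -> List[str]:
    args = build_parser().parse_args(argv)
    messages = []
    for host in args.hosts:
        if args.do_it:
            messages.append(f"removing {host}")  # real work would go here
        else:
            messages.append(f"[dry-run] would remove {host}")
    return messages
```

Calling `run(["web1", "web2"])` only reports what would happen; the same call with `"--do-it"` added takes the destructive branch. Forgetting a flag then fails safe rather than dangerous.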

roydivision over 4 years ago

Reminds me of the proposal to keep the nuclear launch codes inside the body of an innocent volunteer, so the President would have to kill the person to get the codes. https://boingboing.net/2015/12/11/proposal-keep-the-nuclear-lau.html

dgritsko over 4 years ago

Similar idea as GitHub's "type the exact name of this repository if you want to delete it" confirmation dialog. Maybe that's really what you want to do, but in case that's not actually what you meant to do, having a few extra hoops to jump through seems like a good idea.

luhn over 4 years ago

One of the largest AWS outages to date was caused by a scenario like this. [1] A mistyped command removed too many servers from an S3 subsystem, overloading the remaining servers and crashing the subsystem. The failure snowballed until the entire S3 region was down, which then caused issues with dependent services like EBS, ALB, and Lambda. They couldn't even update the status page because that also depended on S3. [1] https://aws.amazon.com/message/41926/

jasonpeacock over 4 years ago

Raskin talks about the futility of this in his book The Humane Interface. Basically, what happens is the brain switches operating context from "I want to do something" to "resolve this interruption (confirmation box)" and you don't relate the one to the other - you're so focused on getting rid of the interruption that the original task is forgotten until after the interruption is gone. Then you switch back to the original task that had been interrupted by the confirmation box and then you realize you made a mistake. It's much better to engineer "undo" ability into systems - like delaying commands (GMail's "Undo Send" does this), or caching previous state, etc.

bronco21016 over 4 years ago

It amazes me that something like this can be done by a single person. In aviation any time input is given to the machine, it's entered by one human (typically pilot flying) and then verified by the other human (typically pilot monitoring) before being committed to or executed. For example... when a new altitude is assigned by ATC, say FL300, the pilot flying will spin it in the selector window and keep his hand or finger there until the second pilot agrees with and confirms the selection by reading FL300 out of the selector window. I know there are meat bags in these giant tubes so that changes attitudes towards safety etc. However, it seems to me that when organizations start putting the power to halt nearly the entire business in the hands of one person, there should be some slightly different attitudes. A breaking change in a million servers could easily cost hundreds of thousands or maybe even millions in lost revenue or employee productivity. I'm just an outsider though. Perhaps this level of attention is practiced at some shops. It's just interesting to me how in some fields we settle on pretty uniform standard practices whereas others are seen as non-human-life threatening so it's just shoot first, ask questions later.

illumin8 over 4 years ago

This is a great idea, and I'd like to point out that having such a system in place would have prevented one of the largest Internet outages in recent memory - the Amazon S3 outage in 2017: https://aws.amazon.com/message/41926/
> At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.

educationcto over 4 years ago

Terraform prints out the number of resources changed and at least requires a "yes" to proceed. Not quite as onerous as described, but it at least prevents some types of fat-fingering. Basically all changes with Terraform are risky as they usually involve bringing infrastructure up and down.
<pre><code>Terraform will perform the following actions:

  # google_compute_instance.vm_instance will be created
  + resource "google_compute_instance" "vm_instance" {
      + ...
<more>

Plan: 2 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes</code></pre>

remram over 4 years ago

A similar system is molly-guard [1], which replaces the reboot/halt/poweroff/... commands with scripts that make you type in the name of the machine before proceeding. Avoids shutting down the wrong machine because you forgot where you SSH'd. [1]: https://manpages.debian.org/buster/molly-guard/molly-guard.8.en.html
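
The hostname check can be sketched in a few lines of Python. This is not molly-guard's actual implementation, just an illustration of the idea; the `confirm_hostname` function and its `actual` parameter are assumptions made for the example.

```python
import socket


def confirm_hostname(typed, actual=None):
    """Return True only if `typed` matches this machine's short hostname.

    `actual` exists so the check can be exercised without depending on
    the real hostname; by default the local hostname is looked up.
    """
    if actual is None:
        actual = socket.gethostname()
    short = actual.split(".")[0]  # compare against the short name only
    return typed.strip().lower() == short.lower()
```

A wrapper around a shutdown command would prompt for the name and refuse to continue unless `confirm_hostname(answer)` returns True. That is what catches the "wrong SSH session" mistake.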

Darkphibre over 4 years ago

Reminds me of when the Fortune 50 company (150k employees) I worked for rolled out new firewall restrictions that blocked the DNS port. To all machines. Employee and servers alike. Yes. Including the DNS servers. Took them a day or two to work out how to roll that one back.

tialaramex over 4 years ago

So, related obviously correct designs:
1. Git's force-with-lease. Git push's "force" is too powerful, you will likely regret this much power, but it's tempting. So force-with-lease is the same power but conditional on you telling git what exactly the state was that you're overriding. This has two benefits. One is like Rachel's: it is an opportunity for a human to stop for a moment and consider, wait, why are we overriding this state? To find out what it is we might as well read... oh the state says it's an "emergency fix. Call Jerry". Maybe, just maybe, I ought to call Jerry before I force overwrite it? But the other is about race conditions, which Rachel doesn't specifically address. If you are very careful to check that the state you want to overwrite with force is indeed a state that should be overridden, nothing prevents it meanwhile changing and then you overwrote state you didn't even know existed. But force-with-lease fixes that because your lease won't match. I believe force-with-lease is a pattern that ought to be far more widespread. I've used several configuration management tools that let somebody say "Temporarily don't mess with config on these machines" and some of them let you write a reason like "James is rebuilding the RAID arrays" but none of them have that force-with-lease pattern that would let me say "I know James is rebuilding the RAID arrays, this change must happen anyway but if anything else is blocking the change then reject it and let me know".
2. Prefer Undo to Confirmation. If the computer can undo the action, even if that's a bunch of work and you'd rather not bother, put that work in and enable undo. Humans always know they "really" wanted to do the thing you're asking them to confirm so it's somewhat futile to ask, but they often realise they didn't want to afterwards and will undo it if you make that possible. Not everything can be undone. Undo factory reset isn't a thing. But for lots of things you can't undo, it was just laziness; try to do better in your own software. Your users (which might include you) will be grateful.
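
The force-with-lease pattern from point 1 is essentially a compare-and-swap on the blocking state. A rough Python sketch for the configuration-hold case follows. `Hold`, `StaleLeaseError`, and `force_with_lease` are hypothetical names invented for this example and do not come from Git or any real configuration management tool.

```python
import threading


class StaleLeaseError(Exception):
    """Raised when the hold changed since the operator last looked at it."""


class Hold:
    """A hypothetical 'do not touch' marker on a machine's config."""

    def __init__(self, reason=None):
        self.reason = reason  # None means no hold is in place
        self._lock = threading.Lock()

    def force_with_lease(self, expected_reason, apply_change):
        # The check and the change happen under one lock, so the hold
        # cannot be swapped for a different one in between.
        with self._lock:
            if self.reason != expected_reason:
                raise StaleLeaseError(
                    f"expected hold {expected_reason!r}, found {self.reason!r}"
                )
            return apply_change()
```

The operator passes the reason they believe they are overriding. If someone has since replaced it with, say, "emergency fix. Call Jerry", the call is rejected instead of silently overwriting state the operator never saw.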

vondur over 4 years ago

That may have helped when Emory University's IT dept. accidentally sent a wipe and reformat command using Microsoft's SCCM to all of the Windows computers and servers on campus back in 2014. https://it.slashdot.org/story/14/05/17/051214/emory-university-sccm-server-accidentally-reformats-all-computers-campus-wide

kbenson over 4 years ago

This is a topic near and dear to my heart, as I'm often that person arguing to make something slightly less automated because the small trade-off in time is insurance against some of the worst mistakes you can have. Automation to the point of removing humans leads to stupid problems that a human wouldn't make if they looked at what was going on. So we automate to the point where we minimize human contact, presenting a summary of actions that as humans we can apply our wonderful brains to and prevent those problems. Except some percentage of the time we don't actually pay attention, and depending on how the human interaction was introduced instead of complete automation, some percentage (or multiple!) of errors still sneak through. Automation to the point of minimal human contact where you assume the human will read the presented information and make an informed decision doesn't work. The point is that we want a human to understand what is being asked, so taking some step to ensure they do understand is warranted. It will never be perfect, but adding steps like the one she proposes is definitely a step in the right direction, IMO.

rossjudson over 4 years ago

This resonates with me. Years ago I took down a service in a cell accidentally (Googlers might empathize: never 'borg' when you meant to 'borgcfg'). If I had been asked to enter the exact number of tasks I was about to nuke, I might have thought twice ;)

gabeio over 4 years ago

I do like this idea; I assume this is why GitHub makes you type the repo name out in full. I wish AWS followed suit: when deleting any RDS (database) instance on AWS, all you have to type is "delete me"... very easy to copy and paste, or to just know what you need to type and be on autopilot. I have even poked support about it and their response was underwhelming.

jaclaz over 4 years ago

Side question. How many/which companies have more than one million Linux machines?

Ayesh over 4 years ago

I have an old laptop with a dead battery, and for a BIOS upgrade, it prevents me from updating without 50% battery. I have to type "danger" to bypass this restriction, and I thought it was pretty cool. Another good UI pattern is in Firefox, that it disables the Run button on downloads for a few seconds.

ineedasername over 4 years ago

Oh god this would have saved me so much stress once. It was early in my career, and part of my duties was to run a merge/purge process on dupe records. I'd select the dupes for merge using a checkbox, but the vendor's interface for this just had a "confirm" button. So, I confirmed. However I'd selected the "select all" box and.... confirmed. Merging every. single. record. into one (1) record. I was fortunate, the vendor was able to roll back the changes, and nothing was lost. I also had a very good mentor-like boss who avoided reaming me out before we knew if there was a solution or not, and when there was he simply told me "I'm sure you've learned your lesson, but don't do that again."

aqme28 over 4 years ago

Nitpicking
> "This might be as simple as printing the number with your locale's version of numerical separators, like "123,456" or "123.456" or "123 456" or whatever else you might use where you are. The trick is then to NOT accept that as input, but instead demand that they remove the separator and jam it in as just digits."
It's easier to just strip non-digit characters than to parse the input for them and respond accordingly. This is a confirmation step with basically a checksum, so you're not going to get many false positives.
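
Both variants fit in a few lines of Python. This is only a sketch; the `matches_count` function and its `strict` flag are names invented for the example. `strict=True` follows the quoted article (bare digits only), while the default follows the suggestion here to strip non-digit characters first.

```python
import re


def matches_count(typed, expected, strict=False):
    """Check a typed confirmation against the expected machine count."""
    if strict:
        # Article's variant: separators are rejected, only bare digits pass.
        return typed.strip() == str(expected)
    # Lenient variant: drop everything that is not a digit, then compare.
    digits = re.sub(r"\D", "", typed)
    return digits == str(expected)
```

With the lenient check, "123,456", "123.456" and "123 456" all confirm a count of 123456. The strict check accepts only "123456", which forces the operator to retype the number rather than paste the displayed form.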

nemo1618 over 4 years ago

Notably, Discord does something like this when you @everyone in a large channel: "You're about to push a notification to 12,000 people, are you sure you want to do that...?"

tigger0jk over 4 years ago

I've typically used pdsh https://github.com/chaos/pdsh for these types of commands, and I don't think they have any such safety options. The only protection is to be wracked with fear whenever you type pdsh. Obviously this fear wanes with use, and eventually you don't think about a command for long enough before you do it and hit enter on a regrettable one.

cle over 4 years ago

Even better than you confirming your own action, is someone else confirming it. If the stakes are high, require two people to turn the keys, instead of just one.

rcarmo over 4 years ago

This reminded me that a few years back I worked at a place where (notoriously) Puppet would occasionally go over some random box and remove access to people, just because. Or to all the machines, on one occasion. (It was actually some sort of race condition when we massively updated per-project access permissions and asked for SSH keys to be redeployed, but it was annoying as heck, and sure to happen whenever you really needed to access that particular machine.)

lqet over 4 years ago

GitHub has been doing this for quite a while now when you try to delete a repository - you have to type in the exact repository name to confirm.

temporallobe over 4 years ago

This is similar to a UI solution a colleague and I came up with. The action the user could kick off was unstoppable and irreversible (a large batch job), and it seemed like even a confirmation prompt was too easy to simply click through. So we had the UI present a modal dialog asking the user to type in a specific word in all caps to confirm the action. Worked like a charm.

TravHatesMe over 4 years ago

Reminds me of a study where a test was given with questions that weren't difficult but were likely to induce a silly error. Around 85% of participants got at least one question wrong, but when they repeated the same test with a difficult-to-read font, that number dropped to ~25% or so. That's another way to make your brain work: use a terrible font.

willvarfar over 4 years ago

I am so adding this to a query API I have, where it's all too easy to leave off constraints and end up asking for massive data sets by mistake. Thinking I can probably enhance it by forcing the user to type in the number as text rather than numeric, so they can't cut-n-paste. Kind of force them to type in "I am sure I want all data ever" or something.

mcintyre1994 over 4 years ago

AWS sometimes does something similar to this like “enter the name of the thing you’re trying to delete to confirm”. I think it makes sense because you can have such a huge difference between how much you care about certain s3 buckets or CloudFormation deploys etc. In true AWS fashion it’s inconsistent between services though.

heelix over 4 years ago

Back in the Spiderman 2 days, I worked for a content management company that was supporting a really, really big website. I believe they were playing host file games for Stage/Prod. Was in the room when they demoed something, did a restart of the system - and every pager in the room went off. Yah...

Cthulhu_ over 4 years ago

I for one can't fathom any organization managing a million devices / servers / VMs / whatnot. I'm having enough trouble with one, and my biggest employers had maybe a few dozen at best, and they already had a dedicated ops team that worked mainly with infrastructure-as-code.

woliveirajr over 4 years ago

Once I had to deal with some software-RAID in Linux (mdadm it is), around 2007. There was some -force option that would just print information explaining what it would do and, to perform the real action, you needed to type another flag (that should never be revealed). Edit: added name of software

andrewfromx over 4 years ago

I've done this before by displaying the Unix epoch time and asking the user to copy/paste that value WITHIN a 3 second window as an env var, i.e. if you up-arrow and run the same TIMESTAMP=1603827448 ./foo it won't work because 1603827448 is now way too old.
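
A sketch of how that freshness check might look in Python. Only the `TIMESTAMP` variable name and the 3-second window come from the comment. The `timestamp_is_fresh` function and its parameters are assumptions made for illustration.

```python
import time


def timestamp_is_fresh(value, now=None, max_age=3):
    """Return True if `value` is an epoch timestamp at most `max_age` seconds old.

    `value` is the raw string taken from the environment (or None if unset);
    `now` can be supplied to make the check deterministic.
    """
    if now is None:
        now = time.time()
    try:
        stamp = int(value)
    except (TypeError, ValueError):
        return False  # missing or non-numeric value never confirms
    # Reject stale values and values that claim to come from the future.
    return 0 <= now - stamp <= max_age
```

A script would call `timestamp_is_fresh(os.environ.get("TIMESTAMP"))` (after `import os`) and bail out when it returns False, so re-running an old command line from shell history does nothing.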

sidpatil over 4 years ago

Hmm, it's conceptually like a combination of a CAPTCHA and a launch code.

vsnf over 4 years ago

I do this with a git pre-push hook to the main branch of my repositories. It displays a prompt in red and forces me to type in the name of the branch. The result of one too many mindlessly accidental pushes.

regularfry over 4 years ago

I've seen this implemented as "Please type: My username is $USERNAME and I will not cry over spilt milk" but that was more to guard against support tickets.

diebeforei485 over 4 years ago

I'm thinking this could also be useful for cases where colleges mistakenly email all applicants saying they'd been accepted, when they in fact had not been.

gitgud over 4 years ago

> "I've worked at a few places that had a large number of Linux boxes. I'm talking about well over a million."
A few places!? What is an example of this?

ComodoHacker over 4 years ago

In role-playing games, it's a common practice to confirm deletion of your character by typing in some word, like 'delete' or character name.

bnastic over 4 years ago

Promise Pegasus (Thunderbolt storage) comes with a GUI that does the same thing - to shut it down you have to type “CONFIRM” before clicking the button.

Animats over 4 years ago

Yes. Github does that when you delete a repository. You have to confirm by typing in the name of the repository you are deleting.

larrik over 4 years ago

I've seen this sort of thing in a few places, and I really do think it's a great idea.

RobRivera over 4 years ago

Having babysat my fair share of critical clusters, I support this advice.

wotton over 4 years ago

Marketo, the marketing automation platform, does this when you try to do things to large data sets, very useful.

konjin over 4 years ago

Finally the Roman numeral converter I programmed in university will be useful.

eznzt over 4 years ago

Debian already does this, it asks you to type something like "yes do as I asked" if you want to remove a package that is considered to be part of the core.

jerf over 4 years ago

https://news.ycombinator.com/item?id=24907002 Looks like an https vs http link.

jancsika over 4 years ago

It would be neat to print out an esoteric error that gets a single result in Google, where the "forum" in the result has a rando answer about using a certain esoteric flag. Then you search the logs to see who is trying the command with the esoteric flag and "fix the glitch with payroll" for those employees.

JoeAltmaier over 4 years ago

Makes it harder to nest that command inside a script - you have to parse out the number and paste it back? Or do I misunderstand - should it still prompt the user in the middle of the process when that step arrives? That would be problematical if it were included in a web page or whatever.

outworlder over 4 years ago

> 1221425541 machines will be affected
"Do you care? (Y/N)"
Cattle, people. Not pets. Just make sure you don't hit all machines simultaneously and are rolling, instead. Since the post is talking about automation anyway, assume that any machine that can go down will go down. Ensure that any such disruption will be minimal. Oops, you just killed the production database? Whatever, who cares, it has just failed over anyway (or, for a distributed one, a new node was elected, data started replicating, etc). If one considers having to SSH to a machine to be an anti-pattern, it's amazing how much crap goes away. In the more generalized case, where it's not about machines, then it makes more sense. Maybe you are running a query that's going to perform updates across multiple clusters. It still should not be done by hand with direct production access - unless you are in the middle of a declared (and urgent!) incident and everything is on fire. In which case there's a bunch of people watching over your shoulder (or more likely, screen sharing in a conference call). The same job you have (hopefully) run in QA you should be able to re-target to production. Make the question just be a way to "unlock" your automation - for instance, by not copying credentials or environment information until the proper confirmation has been received. One should still have an escape hatch for when (not IF) things go wrong.
