I've seen this called "pointing and calling" [1], Japan's train drivers use the technique to force themselves to perform actions and take notice of the current environment.<p>I personally took it to heart, it's a good system for forcing a cache miss in the brain - make sure you're on "database production" or "database localhost" etc.<p>[1] <a href="https://en.wikipedia.org/wiki/Pointing_and_calling" rel="nofollow">https://en.wikipedia.org/wiki/Pointing_and_calling</a>
I wish it were possible for similar prompts to appear before all sorts of policy-makers and bureaucrats. "It appears you are about to institute a policy which will require 400 million patients to sign an additional waiver every time they visit a clinic, this will waste a total of 354,921 human hours within the next year alone. Please type 354,921 to proceed."
I have a habit of creating cli tools, which potentially do dangerous things, to default to dry-run mode. For example, instead of the typical `--dry-run` or `-n` option, my scripts instead had a cheesy `--do-it` to be non-dry-run. It is annoying as hell to my colleagues, but saved the day many times.
Reminds me of the proposal to keep the nuclear launch codes inside the body of an innocent volunteer, so the President would have to kill the person to get the codes.<p><a href="https://boingboing.net/2015/12/11/proposal-keep-the-nuclear-lau.html" rel="nofollow">https://boingboing.net/2015/12/11/proposal-keep-the-nuclear-...</a>
Similar idea as GitHub's "type the exact name of this repository if you want to delete it" confirmation dialog. Maybe that's really what you want to do, but in case that's not actually what you meant to do, having a few extra hoops to jump through seems like a good idea.
One of the largest AWS outages to date was caused by a scenario like this. [1] A mistyped commanded removed too many servers from an S3 subsystem, overloading the remaining servers and crashing the subsystem. The failure snowballed until the entire S3 region was down, which then caused issues with dependent services like EBS, ALB, and Lambda. They couldn't even update the status page because that also depended on S3.<p>[1] <a href="https://aws.amazon.com/message/41926/" rel="nofollow">https://aws.amazon.com/message/41926/</a>
Raskin talks about the futility of this in his book The Humane Interface.<p>Basically, what happens is the brain switches operating context from "I want to do something" to "resolve this interruption (confirmation box)" and you don't relate the one to the other - you're so focused on getting rid of the interruption that the original task is forgotten until after the interruption is gone.<p>Then you switch back to the original task that had been interrupted by the confirmation box and <i>then</i> you realize you made a mistake.<p>It's much better to engineer "undo" ability into systems - like delaying commands (GMail's "Undo Send" does this), or caching previous state, etc.
It amazes me that something like this can be done by a single person.<p>In aviation any time input is given to the machine, it's entered by one human (typically pilot flying) and then verified by the other human (typically pilot monitoring) before being committed to or executed. For example... when a new altitude is assigned by ATC, say FL300, the pilot flying will spin it in the selector window and keep his hand or finger there until the second pilot agrees with and confirms the selection by reading FL300 out of the selector window.<p>I know there are meat bags in these giant tubes so that changes attitudes towards safety etc. However, it seems to me that when organizations start putting the power to halt nearly the entire business in the hands of one person, there should be some slightly different attitudes. A breaking change in a million servers could easily cost hundreds of thousands or maybe even millions in lost revenue or employee productivity.<p>I'm just an outsider though. Perhaps this level of attention is practiced at some shops. It's just interesting to me how in some fields we settle on pretty uniform standard practices whereas others are seen as non-human-life threatening so it's just shoot first, ask questions later.
This is a great idea, and I'd like to point out that having such a system in place would have prevented one of the largest Internet outages in recent memory - the Amazon S3 outage in 2017: <a href="https://aws.amazon.com/message/41926/" rel="nofollow">https://aws.amazon.com/message/41926/</a><p>> At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.
Terraform prints out the number of resources changed and at least requires a "yes" to proceed. Not quite as onerous as described but at least prevents some type of fat-fingering. Basically all changes with Terraform are risky as they usually involved bringing up and down infrastructure.<p><pre><code> Terraform will perform the following actions:
# google_compute_instance.vm_instance will be created
+ resource "google_compute_instance" "vm_instance" {
+ ... <more>
Plan: 2 to add, 0 to change, 0 to destroy.
Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.
Enter a value: yes</code></pre>
A similar system is molly-guard [1], which replaces the reboot/halt/poweroff/... commands with scripts that make you type in the name of the machine before proceeding. Avoids shutting down the wrong machine because you forgot where you SSH'd.<p>[1]: <a href="https://manpages.debian.org/buster/molly-guard/molly-guard.8.en.html" rel="nofollow">https://manpages.debian.org/buster/molly-guard/molly-guard.8...</a>
Reminds me of when the Fortune 50 company (150k employees) I worked for rolled out new firewall restrictions that blocked the DNS port.<p>To <i>all</i> machines. Employee and servers alike.<p>Yes. Including the DNS servers.<p>Took them a day or two to work out how to roll that one back.
So, related obviously correct designs:<p>1. Git's Force-with-lease. Git push's "force" is too powerful, you will likely regret this much power, but it's tempting. So force-with-lease is the same power <i>but</i> conditional on you telling git what exactly the state was that you're overriding.<p>This has two benefits, one is like Rachel's, it is an opportunity for a human to stop for a moment and consider, wait, why are we overriding this state? To find out what it is we might as well read... oh the state says it's an "emergency fix. Call Jerry". Maybe, just maybe, I ought to call Jerry before I force overwrite it?<p>But the other is about race conditions which Rachel doesn't specifically address. If you are <i>very careful</i> to check that the state you want to overwrite with force is indeed a state that should be overridden, nothing prevents it meanwhile <i>changing</i> and then you overwrote state you didn't even know existed. But force-with-lease fixes that because your lease won't match.<p>I believe Force-with-lease is a pattern that ought to be far more widespread. I've used several configuration management tools that let somebody say "Temporarily don't mess with config on these machines" and <i>some</i> of them let you write a reason like "James is rebuilding the RAID arrays" but none of them have that force-with-lease pattern that would be let me say "I know James is rebuilding the RAID arrays, this change must happen anyway <i>but</i> if anything else is blocking the change then reject it and let me know".<p>2. Prefer Undo to Confirmation. If the computer <i>can</i> undo the action, even if that's a bunch of work and you'd rather not bother, put that work in and enable undo. Humans always know they "really" wanted to do the thing you're asking them to confirm so it's somewhat futile to ask, but they often realise they didn't want to afterwards and will undo it <i>if</i> you make that possible.<p>Not everything can be undone. Undo factory reset isn't a thing. But <i>lots</i> of things you can't undo it was just laziness, try to do better in your own software. Your users (which might include you) will be grateful.
That may have helped when Emory University's IT dept. accidentally sent a wipe and reformat command using Microsoft's SCCM to all of the Windows computers and servers on campus back in 2014.
<a href="https://it.slashdot.org/story/14/05/17/051214/emory-university-sccm-server-accidentally-reformats-all-computers-campus-wide" rel="nofollow">https://it.slashdot.org/story/14/05/17/051214/emory-universi...</a>
This is a topic near and dear to my heart, as I'm often that person arguing to make some <i>slightly</i> less automated because the small trade-off in time is insurance against some of the worst mistakes you can have. Automation to the point of removing humans leads to stupid problems that a human wouldn't make if they looked at what was going on. So we automate tot he point where we minimize human contact, presenting a summary of actions that as humans we can apply our wonderful brains to and prevent those problems. Except some percentage of the time we <i>don't</i> actually pay attention, and depending on how the human interaction was introduced instead of complete automation, some percentage (or multiple!) of errors still sneak through.<p>Automation to the point of minimal human contact where you <i>assume</i> the human will read the presented information and make an informed decision doesn't work. The point is that we want a human to <i>understand</i> what is being asked, so taking some step to ensure they <i>do</i> understand is warranted. It will never be perfect, but adding steps like she proposes are definitely a step in the right direction, IMO.
This resonates with me. Years ago I took down a service in a cell accidentally (Googlers might empathize: never 'borg' when you meant to 'borgcfg'). If I had been asked to enter the exact number of tasks I was about to nuke, I might have thought twice ;)
I do like this idea, this is I assume why github makes you type the repo name out in full. I wish AWS followed suit, when deleting any RDS (database) instance on AWS all you have to type is "delete me"... very easy to copy and paste as well as just know what you need to type and be on autopilot. I have even poked support about it and their response was underwhelming.
I have an old laptop with a dead battery, and for a BIOS upgrade, it prevents me from updating without 50% battery.<p>I have to type "danger" to bypass this restriction, and I thought it was pretty cool.<p>Another good UI pattern is in Firefox, that it disables the Run button on downloads for a few seconds.
Oh god this would have saved me so much stress once. It was early in my career, and part of my duties was to run a merge/purge process on dupe records.<p>I'd select the dupes for merge using a checkbox, but the vendor's interface for this just had a "confirm" button. So, I confirmed. However I'd selected the "select all" box and.... confirmed. Merging every. single. record. into one (1) record.<p>I was fortunate, the vendor was able to roll back the changes, and nothing was lost. I also had a very good mentor-like boss who avoided reaming me out before we knew if there was a solution or not, and when there was he simply told me "I'm sure you've learned your lesson, but don't do that again."
Nitpicking<p>> "This might be as simple as printing the number with your locale's version of numerical separators, like "123,456" or "123.456" or "123 456" or whatever else you might use where you are. The trick is then to NOT accept that as input, but instead demand that they remove the separator and jam it in as just digits. "<p>It's easier to just strip non-digit characters than to parse the input for them and respond accordingly. This is a confirmation step with basically a checksum, so you're not going to get many false positives.
Notably, Discord does something like this when you @everyone in a large channel: "You're about to push a notification to 12,000 people, are you sure you want to do that...?"
I've typically used pdsh <a href="https://github.com/chaos/pdsh" rel="nofollow">https://github.com/chaos/pdsh</a> for these types of commands, and I don't think they have any such safety options. The only protection is to be wracked with fear whenever you type pdsh. Obviously this fear wanes with use, and eventually you don't think about a command for long enough before you do it and hit enter on a regrettable one.
Even better than you confirming your own action, is someone else confirming it. If the stakes are high, require two people to turn the keys, instead of just one.
This reminded me that a few years back I worked at a place where (notoriously) Puppet would occasionally go over some random box and remove access to people, just because.<p>Or to all the machines, on one occasion.<p>(It was actually some sort of race condition when we massively updated per-project access permissions and asked for SSH keys to be redeployed, but it was annoying as heck, and sure to happen whenever you really needed to access that particular machine.)
This is similar to a UI solution a colleague and I came up with. The action the user could kick off was unstoppable and irreversible (a large batch job), and it seemed like even a confirmation prompt was too easy to simply click through. So we had the UI present a modal dialog asking the user to type in a specific word in all caps to confirm the action. Worked like a charm.
Reminds me of a study done where a test was given with questions that weren't difficult but likely to make a silly error. Around 85% of participants got at least one question wrong, but when they repeated the same test with a difficult-to-read font, that number dropped to ~25% or so. That's another way to make your brain work, use a terrible font.
I am so adding this to a query api I have, where its all too easy to leave off constraints and end up asking for massive data sets by mistake.<p>Thinking I can probably enhance it by forcing the user to type in the number as text rather than numeric, so they can't cut-n-paste. Kind of force them to type in "I am sure I want all data ever" or something.
AWS sometimes does something similar to this like “enter the name of the thing you’re trying to delete to confirm”. I think it makes sense because you can have such a huge difference between how much you care about certain s3 buckets or CloudFormation deploys etc. In true AWS fashion it’s inconsistent between services though.
Back in the Spiderman 2 days, I worked for a content management company that was supporting a really, really big website. I believe they were playing host file games for Stage/Prod. Was in the room on when they demo'ed something, did a restart of the system - and every pager in the room went off. Yah...
I for one can't fathom any organization managing a million devices / servers / VMs / whatnot. I'm having enough trouble with one, and my biggest employers had maybe a few dozen at best, and they already had a dedicated ops team that worked mainly with infrastructure-as-code.
Once I had to deal with some software-RAID in Linux (mdadm it is), around 2007. There was some -force option that would just print information explaining what it would do and, to perform the real action, you needed to type another flag (that should never be revealed).<p>Edit: added name of software
i've done this before by displaying unix epoc and asking the user to copy/paste that value WITHIN a 3 second window as an env var. i.e. if you up arrow and run same TIMESTAMP=1603827448 ./foo it won't work because 1603827448 is now way too old.
I do this with a git pre-push hook to the main branch of my repositories. It displays a prompt in red and forces me to type in the name of the branch.<p>The result of one too many mindlessly accidental pushes.
I've seen this implemented as "Please type: My username is $USERNAME and I will not cry over spilt milk" but that was more to guard against support tickets.
I'm thinking this could also be useful for cases where colleges mistakenly email all applicants saying they'd been accepted, when they in fact had not been.
> <i>"I've worked at a few places that had a large number of Linux boxes. I'm talking about well over a million."</i><p>A few places!? What is an example of this?
In role-playing games, it's a common practice to confirm deletion of your character by typing in some word, like 'delete' or character name.
Promise Pegasus (thunderbolt storage) comes with a GUI that does the same thing - to shut it down you have to type “CONFIRM” before clicking the button
Debian already does this, it asks you to type something like "yes do as I asked" if you want to remove a package that is considered to be part of the core.
<a href="https://news.ycombinator.com/item?id=24907002" rel="nofollow">https://news.ycombinator.com/item?id=24907002</a><p>Looks like https vs http link.
It would be neat to print out an esoteric error that gets a single result in Google, where the "forum" in the result has a rando answer about using a certain esoteric flag.<p>Then you search the logs to see who is trying the command with the esoteric flag and "fix the glitch with payroll" for those employees.
Makes it harder to nest that command inside a script - you have to parse out the number and paste it back? Or do I misunderstand - should it still prompt the user in the middle of the process when that step arrives? That would be problematical if it were included in a web page or whatever.
> 1221425541 machines will be affected<p>"Do you care? (Y/N)"<p>Cattle, people. Not pets. Just make sure you don't hit all machines simultaneously and are rolling, instead.<p>Since the post is talking about automation anyway, assume that any machine that can go down will go down. Ensure that any such disruption will be minimal. Oops, you just killed the production database? Whatever, who cares, it has just failed over anyway (or, for a distributed one, a new node was elected, data started replicating, etc).<p>If one considers having to SSH to a machine to be an anti-pattern, it's amazing how much crap goes away.<p>In the more generalized case, where it's not about machines, then it makes more sense. Maybe you are running a query that's going to perform updates across multiple clusters. It <i>still</i> should not be done by hand with direct production access - unless you are in the middle of a declared (and urgent!) incident and everything is on fire. In which case there's a bunch of people watching over your shoulder (or more likely, screen sharing in a conference call).<p>The same job you have (hopefully) run in QA you should be able to re-target to production. Make the question just be a way to "unlock" your automation - for instance, by not copying credentials or environment information until the proper confirmation has been received. One should still have an escape hatch for when (not IF) things go wrong.