Dead Air on the Incident Call

132 points, posted by nalgeon about 1 year ago

15 comments

markerz, about 1 year ago

> When there are more than 10 people, the verbal approach stops working. It becomes necessary to have a shared document of some sort, continuously updated by a “scribe.” It’s not sufficient for this document to be merely a timeline of events: it must highlight the current state of the joint diagnostic effort. I recommend clinical troubleshooting for this.

My previous company used Conditions, Actions, Needs (CAN) reports to maintain a consistent understanding. This differs from the recommended "clinical troubleshooting" (symptoms, hypothesis, actions) by having a "Needs" section. I think the Needs section is super helpful because many times the right people haven't joined the war room yet, so you can just specify the needs and, as people join, they can immediately jump into whatever their expertise is.

https://www.fireengineering.com/firefighter-training/drill-of-the-week-the-can-report-1/
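For illustration only, here is a minimal sketch of what a CAN report could look like as a small data structure that a scribe keeps updated. The class name, field layout, and sample entries are assumptions made up for this example; they are not taken from the linked drill or from any particular company's tooling.

```python
from dataclasses import dataclass, field


@dataclass
class CanReport:
    """Hypothetical snapshot of shared incident state: Conditions, Actions, Needs."""
    conditions: list[str] = field(default_factory=list)  # what is observed right now
    actions: list[str] = field(default_factory=list)     # what is being done, and by whom
    needs: list[str] = field(default_factory=list)       # expertise or resources still missing

    def render(self) -> str:
        """Format the report as plain text for a shared doc or chat channel."""
        sections = [
            ("CONDITIONS", self.conditions),
            ("ACTIONS", self.actions),
            ("NEEDS", self.needs),
        ]
        lines = []
        for title, items in sections:
            lines.append(f"{title}:")
            # An empty section is shown explicitly so readers know it was considered.
            lines.extend(f"  - {item}" for item in items or ["(none)"])
        return "\n".join(lines)


# Sample entries are invented for the example.
report = CanReport(
    conditions=["Stylesheets fail to load for some users"],
    actions=["Oscar: reviewing web server logs"],
    needs=["Someone who knows the CDN configuration"],
)
print(report.render())
```

The point of keeping Needs as its own section is that a late joiner can read it first and see right away whether their expertise is the one being asked for.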
eightnoteight, about 1 year ago

> “Oscar, do you mind sharing your screen so Deepak and Deanna can see the weird log messages too?”

It seems so obvious from an Incident Commander perspective, but so much goes into this workflow during an incident:

* What if the person is a fresher? You are asking him to share his screen, debug, and perform actions in front of 100 people on the incident call, with all the anxiety that comes with it.

* The IC handles fires continuously and has much more practice. For instance, if there is a fire every week in a 50-team organisation, any specific team would see an incident only about once a year.

* Self-consciousness/awareness instantly triggers a fight-or-flight response in even the most experienced folks.

I don't know how other industries handle such a thing. I'm pretty sure that even in non-tech there is a hierarchy for anomaly response, and sometimes leaf-level teams are called to answer questions at the top level of the incident response (a forest fire response might have a statewide response team that pulls in a local response team and makes them answer questions). They probably get much more time to prepare than in tech, where it's a matter of minutes.
Twirrim, about 1 year ago

> There is, however, a healthy kind of dead air

This is the thing that drives me nuts. I was really hoping the article would be about the value of dead air, or at least expound on it more; instead there is barely a paragraph.

What continues to frustrate the hell out of me is that Incident Commanders keep taking silence as inaction (or ineffective action), even when you tell them in advance that you need to dive into logs and think for a few minutes.

I've now switched to taking my headset off when I need to do it (after letting them know and giving them a chance to respond).

It is practically impossible to debug complex scenarios, especially when you need intuition and your subconscious mind involved, while being pestered with questions.
_carbyau_, about 1 year ago

Culture doesn't seem to be mentioned in TFA. Likely because, come an incident, it probably can't be influenced much at the time. But attitude can be. People as a team are working together to solve an issue. Humans vs issue. Not teams working to prove it isn't their fault, or is the fault of some other team.

I have been in places where a team can say "Mea culpa" and the worst thing that happens is that at the next incident people grin and give them friendly jibes. Of course reasonable actions (workplaces can be unreasonable too...) are taken to ensure it doesn't happen again, but that is simply part of the learning process.

I have also been in places where the vast majority think the issue points at one team. They are silent on comms despite being present. Then miraculously the issue is gone. The response to the question of what changed? "Nothing." And we all go to bed having suspicions but no concrete answer...

Attitude is also related to many comments here expressing concerns over "people watching my screen" or "over my shoulder".

In times of crisis, if I am running a line of investigation then having a second pair of eyes is reassuring! If I think "maybe this thing is related" and someone more experienced can simply glance at it and say "Nope" then great. My idea had its day in the sun and the group can move on.

And if you *really really* think it is still related then you can keep investigating without people looking, but as a second priority to the group.
w-ll, about 1 year ago

This reminds me of doing WoW raids with Ventrilo back in the day, and how much I miss that; something from back then is missing now.

It didn't have screens but it had multiple rooms, so full party/group leaders/tanks/healers/dps/etc. each had rooms, and you could still talk 1-on-1 with someone.

Sometimes I feel like a team/department would like to discuss something, or maybe someone wants to talk 1-on-1, and it seems all modern meeting software misses that today.

I hadn't actually thought about this in a while, but there are few things more stressful than the entire company/raid party watching over every breath and movement. Being able to talk to a coworker or a team can't really be done with today's meeting software because it's 1... ONE... shared room, whereas even in recent memory, at least in the office, teams were in their own spaces/buildings/etc., and they could mute the conference call and talk amongst themselves.
cwillu, about 1 year ago

“Oscar announces, “I’m seeing some log entries from the web server that look a little weird. I’m gonna look at those.” This is the beginning of a 5-minute silence. […] So it’s incumbent on you to interrupt this silence.”

This is “we need to do something, this is something, we need to do it” thinking. The role of the commander imo is to insulate the investigators from exactly this sort of meaningless interruption.

““I need 5 minutes” [...] There is, however, a healthy kind of dead air.”

If you need to be told this, you are being managed *by* your staff, not managing them.
geor9e, about 1 year ago

The elephant in the room is that these "What is Oscar up to? If only I could glance at their monitor… If only I could see their facial expression… If only I could spitball ideas within earshot of him." problems would also be solved with everyone in office. Don't shoot me tho, I'm just a messenger. I love remote work. But the friction is tough.
onetimeuse92304, about 1 year ago

Simple concept, the author is overthinking it.

I have been "problem manager" for many large outages. I use the term "problem manager" to remind people that an outage is something you manage just like any other kind of project, except on much shorter time scales.

Everything you learned about project management applies to dealing with outages.

> Sometimes an investigator needs to go silent for a while to chase down a hunch, or collect some data, or research some question. As long as such a silence is negotiated in advance, with a specific time to reconvene, it can serve a crucial purpose. I call this functional dead air.

Hey, if you are the kind of project manager that talks and does not listen to your team... that's a problem.

My ideal stance on those occasions is to present myself as somebody who "wants to be educated about the issue". I think it is more helpful and creates less stress. As I am asking questions, I try not to seem to be interrogating them but instead emphasise that I am a noob on the topic who needs to learn quickly.

My ideal is this scene from Margin Call: https://youtu.be/Hhy7JUinlu0?t=67

This usually is actually true, btw.

There is no single way to do it right, but as a manager it is your job to maintain good information flow between you and your reports, and on an outage your reports are essentially everybody involved.
techdmn, about 1 year ago

Maybe it's a personal problem, but I struggle to communicate and investigate at the same time. I'm fine task switching, but it's one or the other. I've been on numerous incidents where an anxious manager is asking for constant updates, ensuring no work is getting done. My favorite is when they ask engineers to stop investigating in order to send a status update to the wider organization. I don't know, how about maybe the person whose sole role on the call is to manage communication, maybe that person could send the update. But I digress. Communication is important, but it's not free. Seek balance.
verdverm, about 1 year ago

Anecdotally, we had a moment of silence once, after it became apparent that the 100x network bill was from a compromised VM due to two human errors combined.

I appreciate that Google Cloud refunded the $10k despite our faults in the situation.

The errors:

1. Spinning up a VM for some experimentation, with a public IP

2. Setting a weak password on a well-known username

The VM became involved in a DDoS network.
silisili, about 1 year ago

One thing I found supremely helpful in my varied experiences was having an engineer step up to be the single voice who starts running things and coordinating.

Some companies have a NOC or support person run calls, but they often feel nervous and just ask sheepishly for updates.

Having a principal or eng manager run the call gives it a different, more commanding feel. They better understand the system and start calling on people and teams by name. They also aren't talked down to or snapped at the way support people tend to be, sadly.
boopmaster, about 1 year ago
CAN I GET AN UPDATE? !?! !! every 60 seconds is the only way
Animats, about 1 year ago

If you have an operation so large that 100 people can be involved in an incident, why isn't there a way to shift to a backup system?
observationist, about 1 year ago

The manager and incident commander should be on their own call, with at most a liaison that checks in with the people actually doing the work every 30 minutes. They should be secure enough in their own people that they can effectively communicate "we are aware of the problem and are working to fix it" to affected parties.

The people doing the work should be left the fuck alone.

A manager should not be involved in troubleshooting, or in coordinating multiple nontechnical third parties on the same task, because 100% of the time spent doing anything other than fixing the underlying problem is wasted time. The people doing the work should be comfortable coordinating amongst each other as needed: having a two or three way conversation or video call, or conference call. The affected parties don't need 30 second blow by blow accounts of the things the troubleshooters are doing. They don't need to constantly stop and interrogate the troubleshooters and recap each step of troubleshooting.

Bring the troubleshooters in after the repair to explain the steps taken, the problems found, what could have gone better, what went well, and any recommendations for prevention, mitigation, or resources needed.

The notion that you're supposed to do highly complex real-time technical repairs while juggling personalities and ass-kissing is counterproductive at best, completely moronic at worst.

"I understand your concerns. I just wanted to let you know I have faith in my team and I know for a fact they're doing the best they can to get you back up and running as fast as humanly possible. We'll hear back from them soon, but I don't want to do anything at all to get in their way, or to take time away from this repair." This is what a good manager might say, being adept in handling customer concerns and having confidence and trust in their team.

Coddling and handholding superfluous non-technical stakeholders by hosting incident calls like this is goddamn stupid.

The notion that you need to get everyone together in a giant group, that you need to pressure the people doing the work by introducing personalities and social issues into the process, is a move by a manager deliberately intended to show that the manager is doing something. They coordinate these so they can claim credit for the work of the troubleshooters, and place blame on the troubleshooters if anything goes wrong by mischaracterizing the inevitable miscommunications during these boneheaded calls.

If it costs you $10,000 a minute for every minute you're down, then let's do the things that make sense. Giant ass conference calls with a whole bunch of people who aren't involved in fixing the technical problem are stupid. Blitheringly, moronically, stupid. The kind of stupid that picks up a brick and wonders what it would feel like to smash one's own stupid face with the stupid brick.

If you, as a manager, can't cope with this, you shouldn't be managing people. Quit, immediately. Your team will be far better off without your presence if you think this type of incident response is good for anything except politics and shitty games.

If you're a customer and you're treated to one of these giant group calls, know that it's a sign of incompetence, insecurity, toxic office politics, bad corporate culture, top heavy management, and probably high turnover rates.

Fire companies that treat their employees like this, or reward management for playing stupid games. Find companies that have competence and assurance in their products or services, and that don't feel the need to trot out their troubleshooters in the middle of a crisis to do talk therapy, customer service, tiktok dances, or anything else other than effectively troubleshooting whatever the technical problem is.

If you're a troubleshooter and you find yourself on these calls frequently, my heart goes out to you. Better jobs exist, you deserve one, and I hope you make it there without too much suffering.
hn_user82179, about 1 year ago

Interesting article. I don't think I agree with some of the points, or maybe I just don't follow them exactly.

For example:

> Oscar announces, “I’m seeing some log entries from the web server that look a little weird. I’m gonna look at those.” This is the beginning of a 5-minute silence.

> During the silence, Deanna, Deepak, and Sylvain are all waiting, hoping that these log entries that Oscar just noticed turn out to be the smoking gun. They’re putting their eggs in the basket of Oscar’s intuition. Hopefully he’s seen this issue before, and any minute now he’ll say “Okay, I’m pushing a fix.”

> An incident commander is responsible for keeping the whole problem-solving effort moving forward. So it’s incumbent on you to interrupt this silence.

> Try drawing more information out of Oscar:

> - “Oscar, do you mind sharing your screen so Deepak and Deanna can see the weird log messages too?”

> - “What’s the error message, Oscar? Can you send a link to a log search?”

> - "Do we know when these log events started? Does that line up with when we started receiving these support tickets, Sylvain?”

This is totally a problem that happens during incidents. The problem of the group fixating on the first "I think I see something weird, let me check" idea is a great point made by the author. But having that person share their screen/talk through their thoughts doesn't really solve that problem; it just focuses the group on that idea (leaving any other ideas to be dropped). _Perhaps_ if other investigators are also familiar with the area being investigated, it's helpful to have multiple people looking at Oscar's screen, but that doesn't seem to scale past having ~3 people on the call. It also immediately dedicates the call solely to investigating the problem. That's not bad, but if you're in a scenario where support is being involved, you're likely going to be coordinating broader updates, messaging to customers, figuring out who else to pull in, etc. The point of the incident commander (imo) is to do those things, or ensure that all of those things are happening.

> “Let’s see here…”

> In order to keep a problem-solving effort moving forward, an incident commander should ensure that every new participant gets up-to-date knowledge of what the group is doing and why. For example, you could say to Deepak when he joins the call, “Hi Deepak. Right now, Oscar and Deanna are investigating a web server error message that might be related to failed stylesheet loads. You can see the error message in the chat.”

I think this should be done over Slack, and for any incident response meeting with more than... 3 people. One thing my org does that I'm happy with is creating a thread for an initial issue (and a Slack channel once it's identified as a bigger issue) with a quick 2 sentence summary. People post comments as they discover new things, which provides a timeline of the investigation and does a good job of showing what's been checked (and what hasn't). Honestly, unless the person giving the verbal summary is technically familiar with the issue at hand, they frequently will gloss over important things or highlight irrelevant things when trying to give a summary of what's happened so far. Not their fault, it's objectively hard to figure out what's relevant/irrelevant in the spur of the moment.

That said, I'm probably a bit biased because I don't like being on incident response calls in general. When I'm actively investigating an issue, being in a large incident response room makes it much harder for me to think. It feels like there's more pressure when people are waiting on the call for you to solve the problem, or if they're talking about other things it's just a distraction. My org has a culture of people replying to their own comments in Slack as they investigate, which makes the brainstorming over Slack feel a lot more intuitive, and it's easier to share error logs & snippets, or have multiple parallel conversations at once. And once the incident is over, it's a lot easier to have a precise incident timeline when you can use timestamps of comments.
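As a rough sketch of that last point, timestamped chat comments can be turned into an incident timeline with a few lines of code. This assumes the comments have already been exported as plain (ISO timestamp, author, text) tuples; the function name `build_timeline` and the sample messages are made up for illustration, and no real Slack API call is shown.

```python
from datetime import datetime


def build_timeline(comments):
    """Sort (iso_timestamp, author, text) tuples and label each with minutes since the first."""
    parsed = sorted(
        (datetime.fromisoformat(ts), author, text) for ts, author, text in comments
    )
    if not parsed:
        return []
    start = parsed[0][0]  # the earliest comment anchors the timeline at T+00m
    return [
        f"T+{int((when - start).total_seconds() // 60):02d}m {author}: {text}"
        for when, author, text in parsed
    ]


# Hypothetical export of an incident thread, deliberately out of order.
comments = [
    ("2024-03-19T10:07:00", "deanna", "Errors line up with the first support tickets"),
    ("2024-03-19T10:02:00", "oscar", "Seeing odd log entries on the web server"),
]
for entry in build_timeline(comments):
    print(entry)
```

Because the ordering comes from the timestamps rather than from anyone's memory, the same export can feed the post-incident review without a scribe reconstructing the sequence by hand.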