I was responsible for some devops stuff at a state health department, and one of the more infuriating things about working there was that getting more storage allocated was like pulling teeth. Our backups would be running out of disk and they'd allocate me 50 or 100 GB at a time. I'm sure someone at Toyota had been yelling that this was going to happen for the past six months.
I worked at a phone factory once. One day an app stopped working. It was running on an old Linux server that nobody had access to, or so I thought. I noticed the pings were showing high failure rates. I asked to check the switch logs or the server's ifconfig output and got weird excuses about why they couldn't check anything.

A week goes by, people are living with the error and adjusting the factory to work around it. Many meetings, etc.

Then a network engineer pulls me aside and says: "You seem like a nice guy, so stop worrying about this issue. We won't fix it, because if we do, this server becomes ours and we're responsible for the app. So nobody will touch it. Yes, the cable to that server needs to be redone. Yeah, we won't do that."

I stayed there for 10 months and quit.
As we move to more abstract systems, I wonder how well we (as in companies) are keeping basic systems-management capabilities in place at a personnel level.

At $DAY_JOB we recently scuttled most development efforts for a week across our teams. Our nightly backup job that sanitizes PHI had ballooned over time to, say, 20 GB + 1 byte, and ran out of disk space. Because we're running Kubernetes on Fargate we don't need a full-time operations guy, right?

Commence the company (me) scrambling to learn how to use a Persistent Volume and Persistent Volume Claim, because a career programmer should be able to perform systems-administration tasks since DevOps has the word "Dev" in it, right? (A rough sketch of what that ended up looking like is below.)

So we lost a week of productivity to disk space, but in reality we lost a week to poor personnel planning and capabilities.
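For anyone who hasn't touched this before, a minimal sketch of the kind of claim I had to learn to write, via the official kubernetes Python client; the namespace, claim name, and size here are made up for illustration:

```python
# Request a dedicated volume for the backup job. All names and sizes
# below are placeholders, not our actual production values.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="backup-scratch"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        resources=client.V1ResourceRequirements(
            requests={"storage": "50Gi"}  # headroom well past the ~20 GB backup
        ),
    ),
)

client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="backups", body=pvc
)
```

(On Fargate you'd also need a suitable storage class behind the claim, but that's beyond a sketch.)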
Aaah, plant shutdowns. I worked in IT at an automotive assembly plant at one point. Once, I was out on the plant floor with a colleague. We were diagnosing a new network drop; they plugged a ping-testing device into the cable, hit the button, and within seconds the entire assembly line went down. Cue lots of radio chatter and people driving up to us in their carts trying to figure out what was going on.

Turns out the ping-testing device was last configured with the IP of a critical application server that talks to every device in the plant. When my colleague pressed the button, every plant device stopped contacting the app server and the plant went down. Unplugging the device eventually fixed it. The line was stopped for 7 minutes. I believe stoppages were usually quoted at tens of thousands of dollars per minute and billed to the responsible department.

I remember how stressed everyone was afterward, and the root-cause-analysis meetings in the following days. That always surprised me, considering how frequently the line went down for short stretches; those never seemed like as big a deal as that one time, though for the others we weren't the ones responsible.
If it's not a cover story for something more serious, they definitely need to rethink their infrastructure. A single server running out of disk space should not be able to take 14 factories offline.

They should have used the old trick: create a big empty file on the server (e.g. delete_me_in_case_of_need.txt, a few GB in size) and delete it in an emergency. That buys you some time to take the necessary actions ;-)
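The trick is just preallocating a ballast file you can delete under pressure; a rough sketch (path and size are placeholders, and posix_fallocate is Unix-only):

```python
import os

BALLAST = "/var/delete_me_in_case_of_need.txt"  # placeholder path
SIZE = 4 * 1024**3  # 4 GiB of reclaimable headroom

def create_ballast():
    # Preallocate real blocks; a sparse file would free nothing
    # when deleted.
    with open(BALLAST, "wb") as f:
        os.posix_fallocate(f.fileno(), 0, SIZE)

def emergency_free():
    # Run when the disk fills: frees SIZE bytes immediately,
    # buying time to fix the real problem.
    os.remove(BALLAST)

if __name__ == "__main__":
    create_ballast()
```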
Just-in-time will cost you far more than it saves you. The problem is that when it fails, you judge those failures as backwards-looking one-offs rather than as a cost that is part and parcel of being JIT.

Factory downtime is the single biggest cost a manufacturing shop can incur. No amount of working-capital savings from carrying fewer screws, widgets, and bolts can offset the cost of taking 14 plants down due to penny-wise, pound-foolish math.
If 14 geographically distinct factories can be shut down by a single server acting up, you’ve got problems that overnight shipping from Newegg can’t solve…
Had a production system crash because of a similar issue somewhat recently.

The cause? / filled up. It was configured at around 80 GB. There was a scratch volume on the VM as well, also 80 GB.

The machine had recently been upgraded to a newer version of Ubuntu that used a Snap for Chromium. The Snap's temp directory wasn't on the scratch drive like on the old version; it was in /tmp.

For some reason, ops never set a watch/alert on that disk. That would have prevented the issue.
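Even something as small as this sketch, run from cron, would have caught it; the paths and threshold are illustrative, not from the actual incident:

```python
import shutil

WATCHED = ["/", "/scratch"]  # illustrative mount points
ALERT_AT = 0.80              # alert once a disk is 80% used

def check_disks():
    for path in WATCHED:
        usage = shutil.disk_usage(path)
        used_frac = usage.used / usage.total
        if used_frac >= ALERT_AT:
            # In real life this would page someone; print stands in
            # for the alerting hook.
            print(f"ALERT: {path} is {used_frac:.0%} full "
                  f"({usage.free // 1024**3} GiB free)")

if __name__ == "__main__":
    check_disks()
```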
My company is going through a cost-cutting phase, and somehow cutting disk space ended up being one of the priority items.

One of my servers utilized 50% of its allocated space, and the "low disk space" alert was set to go off at 80%.

Management came up with an idea: why not reduce the allocation to used space + 20% of used space? That way we save a lot of space.

I had to explain the simple math to them: if the disk is 100 GB and I use 50 GB, the proposal was to shrink it to 60 GB. But 50/60 is already beyond the 80% alert. And if the target were a lower alert threshold of, say, 70%, we'd need at least ~22 GB of headroom, i.e. used space + ~44% of used space, to make it work. It was mind-boggling that people can't understand simple ratios.

The most maddening part was that no one could explain the savings if the space was reduced. The people making noise said we could "avoid costs," forgetting that used space is already a cost; you cannot "avoid" it.
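Spelled out, with the numbers from above:

```python
used = 50                # GB currently used
proposed = used * 1.2    # management's "used + 20%" sizing -> 60 GB

print(used / proposed)   # 0.833... -> already past the 80% alert

# To stay under a 70% alert threshold instead, the disk must be at least:
alert = 0.70
minimum = used / alert              # ~71.4 GB total
print(minimum, minimum - used)      # ~21.4 GB of headroom, i.e. used + ~43%
```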
Recent and related:

Toyota to restart Japan production on Wednesday after system failure - https://news.ycombinator.com/item?id=37303569 - Aug 2023 (66 comments)
A well-oiled car factory rolls a car off the belt every n minutes, and each new car carries y% profit. The moment n × y exceeds the price of a hard disk is where the yelling and the firing of administrators begins. Power trips and savings are nice, but factory stops have a price, and when the cause turns out to be fiefdoms, those fiefdom chieftains are goners. You can't run "the rules stated..." past shareholders who know the rules are made up for operations, not against it.
Yes, when you fully optimize something for one metric, like cost, other metrics, like reliability, get worse and worse.

Funny thing: if you pick up any book about Toyotism and Lean, this warning is among the first things you see. You only optimize until you start seeing problems; then you either stop, or improve things so the problems don't happen anymore and you can push further. Well, it looks like the management at the company that created this culture is forgetting its own lesson.
I don't see how a one-day factory stoppage aligns with the article's teaser: "'Just in time' production system minimises costs but technical glitch highlights risks."

It seems like Toyota identified the issue and got things running again quickly. This does not at all seem like a good case study for exploring the risks of JIT.

I read The Guardian regularly and will continue to, but they sometimes go out of their way to find an anti-business angle.
They didn't say which database software Toyota uses. But in general, if you're performing DML and you've configured backup retention to "normal", the transaction log will balloon in size until the database is backed up. Since this caused an outage, I assume there was no DBA assigned and management assumed it would run faithfully like it always had.
Wow, my company (mailpace.com) recently had an outage for basically the same reason:

https://blog.mailpace.com/blog/postgres-outage-post-mortem/

Luckily, we only went offline for about 2 hours, but glad to see behemoths like these suffer from similar issues...
Hmmm, after experiencing a similar "Insufficient Disk Space" situation once, there is now a folder with 10 GB of "data" I can delete to "remediate" the situation in short order.
Didn't they see the balloon?!? https://3.bp.blogspot.com/-Pv29dGQwIMI/UFcMD6hgcyI/AAAAAAAAFHI/g_z4acYUqT8/s1600/ttUntitled-2.jpg