I was responsible for some devops stuff at a state health department, and one of the more infuriating things about working there was that getting more storage allocated was like pulling teeth. Our backups would be running out of disk and they'd allocate me 50 or 100 GB at a time. I'm sure someone at Toyota had been yelling that this was going to happen for the past six months.
I worked at a phone factory once. One day an app stopped working. It was running on an old Linux server that nobody had access to, or so I thought. I noticed the pings were showing high failure rates. I asked to check the switch logs or the server's ifconfig output and got weird excuses about why they couldn't check anything.

A week goes by, people are living with the error and adjusting the factory to work around it. Many meetings, etc.

Then a network engineer pulls me aside and says: "You seem like a nice guy, so stop worrying about this issue. We won't fix it, because if we do, this server becomes ours and we're responsible for the app. So nobody will touch it. Yes, the cable to that server needs to be redone. Yeah, we won't do that."

I stayed there for 10 months and quit.
As we move to more abstract systems, I wonder how well we (as in companies) are keeping basic systems-management capabilities in place at a personnel level.

At $DAY_JOB we recently scuttled most development efforts for a week across our teams. Our nightly backup job that sanitizes PHI had ballooned over time to, say, 20 GB + 1 byte, and ran out of disk space. Because we're running Kubernetes on Fargate we don't need a full-time operations guy, right?

Commence the company (me) scrambling to learn how to use a Persistent Volume and Persistent Volume Claim, because a career programmer should be able to perform systems-administration tasks since DevOps has the word "Dev" in it, right? (A rough sketch of what that ended up looking like is below.)

So we lost a week of productivity to disk space, but in reality we lost a week to poor personnel planning and capabilities.
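For anyone who hasn't touched this before, a minimal sketch of the kind of claim I had to learn to write, via the official kubernetes Python client; the namespace, claim name, and size here are made up for illustration:

```python
# Request a dedicated volume for the backup job. All names and sizes
# below are placeholders, not our actual production values.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="backup-scratch"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        resources=client.V1ResourceRequirements(
            requests={"storage": "50Gi"}  # headroom well past the ~20 GB backup
        ),
    ),
)

client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="backups", body=pvc
)
```

(On Fargate you'd also need a suitable storage class behind the claim, but that's beyond a sketch.)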
Aaah, plant shutdowns. I worked in IT at an automotive assembly plant at one point. Once, I was out on the plant floor with a colleague. We were diagnosing a new network drop; they plugged a ping-testing device into the cable, hit the button, and within seconds the entire assembly line went down. Cue lots of radio chatter and people driving up to us in their carts trying to figure out what was going on.

Turns out the ping-testing device was last configured with the IP of a critical application server that talks to every device in the plant. When my colleague pressed the button, every plant device stopped contacting the app server and the plant went down. Unplugging the device eventually fixed it. The line was stopped for 7 minutes. I believe stoppages were usually quoted at tens of thousands of dollars per minute and billed to the responsible department.

I remember how stressed everyone was afterward, and the root-cause-analysis meetings in the following days. That always surprised me, considering how frequently the line went down for short stretches; those never seemed like as big a deal as that one time, though for the others we weren't the ones responsible.
If it's not a cover story for something more serious, they definitely need to rethink their infrastructure. A single server running out of disk space should not be able to take 14 factories offline.

They should have used the old trick: create a big empty file on the server (e.g. delete_me_in_case_of_need.txt, a few GB in size) and delete it in an emergency. That buys you some time to take the necessary actions ;-)
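The trick is just preallocating a ballast file you can delete under pressure; a rough sketch (path and size are placeholders, and posix_fallocate is Unix-only):

```python
import os

BALLAST = "/var/delete_me_in_case_of_need.txt"  # placeholder path
SIZE = 4 * 1024**3  # 4 GiB of reclaimable headroom

def create_ballast():
    # Preallocate real blocks; a sparse file would free nothing
    # when deleted.
    with open(BALLAST, "wb") as f:
        os.posix_fallocate(f.fileno(), 0, SIZE)

def emergency_free():
    # Run when the disk fills: frees SIZE bytes immediately,
    # buying time to fix the real problem.
    os.remove(BALLAST)

if __name__ == "__main__":
    create_ballast()
```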
Just-in-time will cost you far more than it saves you. The problem is that when it fails, you judge those failures as backwards-looking one-offs rather than as a cost that is part and parcel of being JIT.

Factory downtime is the single biggest cost a manufacturing shop can incur. No amount of working-capital savings from carrying fewer screws, widgets, and bolts can offset the cost of taking 14 plants down due to penny-wise, pound-foolish math.
If 14 geographically distinct factories can be shut down by a single server acting up, you’ve got problems that overnight shipping from Newegg can’t solve…
Had a production system crash because of a similar issue somewhat recently.

The cause? / filled up. It was configured at around 80 GB. There was a scratch volume on the VM as well, also 80 GB.

The machine had recently been upgraded to a newer version of Ubuntu that used a Snap for Chromium. The Snap's temp directory wasn't on the scratch drive like on the old version; it was in /tmp.

For some reason, ops never set a watch/alert on that disk. That would have prevented the issue.
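Even something as small as this sketch, run from cron, would have caught it; the paths and threshold are illustrative, not from the actual incident:

```python
import shutil

WATCHED = ["/", "/scratch"]  # illustrative mount points
ALERT_AT = 0.80              # alert once a disk is 80% used

def check_disks():
    for path in WATCHED:
        usage = shutil.disk_usage(path)
        used_frac = usage.used / usage.total
        if used_frac >= ALERT_AT:
            # In real life this would page someone; print stands in
            # for the alerting hook.
            print(f"ALERT: {path} is {used_frac:.0%} full "
                  f"({usage.free // 1024**3} GiB free)")

if __name__ == "__main__":
    check_disks()
```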
My company is going through a cost-cutting phase, and somehow cutting disk space ended up being one of the priority items.

One of my servers utilized 50% of its allocated space, and the "low disk space" alert was set to go off at 80%.

Management came up with an idea: why not reduce the allocation to used space + 20% of used space? That way we save a lot of space.

I had to explain the simple math to them: if the disk is 100 GB and I use 50 GB, the proposal was to shrink it to 60 GB. But 50/60 is already beyond the 80% alert. And if the target were a lower alert threshold of, say, 70%, we'd need at least ~22 GB of headroom, i.e. used space + ~44% of used space, to make it work. It was mind-boggling that people can't understand simple ratios.

The most maddening part was that no one could explain the savings if the space was reduced. The people making noise said we could "avoid costs," forgetting that used space is already a cost; you cannot "avoid" it.
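Spelled out, with the numbers from above:

```python
used = 50                # GB currently used
proposed = used * 1.2    # management's "used + 20%" sizing -> 60 GB

print(used / proposed)   # 0.833... -> already past the 80% alert

# To stay under a 70% alert threshold instead, the disk must be at least:
alert = 0.70
minimum = used / alert              # ~71.4 GB total
print(minimum, minimum - used)      # ~21.4 GB of headroom, i.e. used + ~43%
```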
Recent and related:

Toyota to restart Japan production on Wednesday after system failure - https://news.ycombinator.com/item?id=37303569 - Aug 2023 (66 comments)
A well-oiled car factory rolls a car off the belt every n minutes, and each new car carries y% profit. The moment n × y exceeds the price of a hard disk is where the yelling and the firing of administrators begins. Power trips and savings are nice, but factory stops have a price, and when the cause turns out to be fiefdoms, those fiefdom chieftains are goners. You can't run "the rules stated..." past shareholders who know the rules are made up for operations, not against it.
Yes, when you fully optimize something for one metric, like cost, other metrics, like reliability, get worse and worse.

Funny thing: if you pick up any book about Toyotism and Lean, this warning is among the first things you see. You only optimize until you start seeing problems; then you either stop, or improve things so the problems don't happen anymore and you can push further. Well, it looks like the management at the company that created this culture is forgetting its own lesson.
I don't see how a one-day factory stoppage aligns with the article's teaser: "'Just in time' production system minimises costs but technical glitch highlights risks."

It seems like Toyota identified the issue and got things running again quickly. This does not at all seem like a good case study for exploring the risks of JIT.

I read The Guardian regularly and will continue to, but they sometimes go out of their way to find an anti-business angle.
They didn't say which database software Toyota uses. But in general, if you're performing DML and you've configured backup retention to "normal", the transaction log will balloon in size until the database is backed up. Since this caused an outage, I assume there was no DBA assigned and management assumed it would run faithfully like it always had.
Wow, my company (mailpace.com) recently had an outage for basically the same reason:

https://blog.mailpace.com/blog/postgres-outage-post-mortem/

Luckily, we only went offline for about 2 hours, but glad to see behemoths like these suffer from similar issues...
Hmmm, after experiencing a similar "Insufficient Disk Space" situation once, there is now a folder with 10 GB of "data" I can delete to "remediate" the situation in short order.
Didn't they see the balloon?!? https://3.bp.blogspot.com/-Pv29dGQwIMI/UFcMD6hgcyI/AAAAAAAAFHI/g_z4acYUqT8/s1600/ttUntitled-2.jpg