
Logging can be tricky

138 points by hariharan_uno, over 10 years ago

14 comments

mml, over 10 years ago
Reminds me of the time I sped up the main business app of a large company by 85% just by removing "debug" logging. 2 TB/hr of "made it here" isn't really useful at the end of the day.

Not the first time I've seen that, by a long shot.

//shakes zimmerframe, shuffles off
cryowaffle, over 10 years ago
Grey text on a white background? Please help!
ibar, over 10 years ago
I had a similar problem with Java, except that the entire application would freeze for double-digit seconds. Another application would sometimes write a huge amount of data out very quickly to the fs cache. Thirty seconds later (or whatever the expiration is), all those dirty bytes would get synced to disk more or less at once.

It turned out to be the JVM-provided GC logging hanging on flush (not even fsync) calls. The flush call happened during GC, while the GC implementation held a stop-the-world lock. Digging through JVM source code is 'fun'.
jcampbell1, over 10 years ago
If all three machines had the same logging code and one machine was fsyncing slowly, isn't turning off fsync just a band-aid that hides the true problem?

When they discover the actual problem is some issue with the RAID controller, I promise not to say "I told you so".
tendeer, over 10 years ago
While I really like the article and its analysis, I think there is a more general point here: scale is tricky.

Most programming abstractions and tools (such as logging, UI design, and persistence) are all "easy" when it comes to one machine (or a few machines) in a closed, deterministic environment.

All of them become extremely tricky when you have to deal with many instances across many different, possibly non-deterministic environments. For UI, it quickly becomes: what's the viewport? What's the screen real estate? How does the user interact (touch? mouse? keyboard? numpad?)? For persistence, one has to worry about the number of simultaneous connections, character encoding and normalization, reads/writes per second, synchronisation, etc.

So, also for logging: are we logging client-side or over the network? What's the memory on the client? What else is running on the client? What permissions do we log with? Do we care about a system call on this client for any given user interaction? Etc.

Logging, UI, persistence, network protocols, heck, even just choosing an intermediate data structure are all tricky at scale, across devices and non-deterministic environments.
dclusin, over 10 years ago
Compressing the text logs prior to writing them to disk also helps with these kinds of issues. You can also offload your logging to a dedicated thread and use a lock-free queue to increase performance even further.
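A rough sketch of that idea in Go (the language the article concerns). A buffered channel stands in for the lock-free queue and a writer goroutine for the dedicated thread, with gzip providing the compression; the asyncLogger type and all the names here are invented for illustration:

    package main

    import (
        "compress/gzip"
        "fmt"
        "os"
    )

    // asyncLogger hands lines to a dedicated writer goroutine over a
    // buffered channel, so callers never block on disk I/O, and
    // gzip-compresses the stream before it reaches the file.
    type asyncLogger struct {
        lines chan string
        done  chan struct{}
    }

    func newAsyncLogger(path string, buffer int) (*asyncLogger, error) {
        f, err := os.Create(path)
        if err != nil {
            return nil, err
        }
        l := &asyncLogger{
            lines: make(chan string, buffer),
            done:  make(chan struct{}),
        }
        go func() {
            defer close(l.done)
            zw := gzip.NewWriter(f)
            for line := range l.lines {
                fmt.Fprintln(zw, line) // compressed, buffered write
            }
            zw.Close() // flush the remaining compressed data
            f.Close()
        }()
        return l, nil
    }

    // Log enqueues a line; it blocks only when the buffer is full.
    func (l *asyncLogger) Log(line string) { l.lines <- line }

    // Close drains the queue and waits for the writer goroutine to finish.
    func (l *asyncLogger) Close() {
        close(l.lines)
        <-l.done
    }

    func main() {
        logger, err := newAsyncLogger("app.log.gz", 1024)
        if err != nil {
            panic(err)
        }
        for i := 0; i < 10; i++ {
            logger.Log(fmt.Sprintf("request %d handled", i))
        }
        logger.Close()
    }

A Go channel is mutex-backed rather than strictly lock-free, but it gives the same decoupling the comment is after: the hot path only touches memory.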
erikb, over 10 years ago
Evan, I would be happy if you could explain more about what you see in the strace outputs. E.g.:

> Time spent in fsync accounts for almost exactly the delay we were seeing.

What delay? I see the whole thing taking 1.5 seconds, with 1.3 seconds spent in futex (0.4 more than on the normal host). I'm not sure why we are suddenly talking about fsync. I also don't know what either call (futex, fsync) could be doing.

These are not questions I want answers to (some of it I could google, of course). I just want to point out that it's a rather confusing read if you expect readers to understand the strace outputs as well as you do, when you seem to use that tool daily and some readers may never have used it at all. It would be great to be able to follow your insights better. Just small additions like the following would help a lot: "[The X seconds] spent in fsync [seen in diagram A] accounts for almost exactly the delay we were seeing [in diagram B]".
sargun, over 10 years ago
Why would you fsync logs for a high-level service? Are you afraid a power outage is going to cause you to lose service logs?
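For readers unfamiliar with the distinction the question turns on, a minimal Go sketch of the two durability levels (the file name is arbitrary): a plain write survives a process crash but not a power loss, while fsync forces the data onto stable storage.

    package main

    import "os"

    func main() {
        f, err := os.Create("service.log")
        if err != nil {
            panic(err)
        }
        defer f.Close()

        // Write hands the bytes to the kernel page cache: cheap, and the
        // data survives a process crash, but not a power loss.
        if _, err := f.Write([]byte("request handled\n")); err != nil {
            panic(err)
        }

        // Sync (fsync) forces the data onto stable storage. Paying this
        // cost on every log line is what the comment above is questioning.
        if err := f.Sync(); err != nil {
            panic(err)
        }
    }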
govindkabra31, over 10 years ago
Did anyone understand why the logging calls on only the problem machine showed different fsync behavior than those on the normal machines?
kordless, over 10 years ago
If you want tricky, try building a logging service and then enabling logging to itself.
nkozyra, over 10 years ago
Does anyone think about logging to shared memory / memcached and then committing snapshots to disk at regular intervals via another process or machine?

If you're not all that concerned about consistency, each web server can keep its logs in a segregated memory space, and another process can combine and commit them and send a flush command, leaving the primary machines relatively unencumbered.
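A minimal Go sketch of the in-memory variant of this idea (everything here, from the snapshotLogger name to the flush interval, is invented for illustration; a real deployment might use shared memory or memcached as the comment suggests):

    package main

    import (
        "bytes"
        "fmt"
        "os"
        "sync"
        "time"
    )

    // snapshotLogger keeps log lines in memory and appends them to disk
    // on a fixed interval, so the hot path never waits on the filesystem.
    type snapshotLogger struct {
        mu   sync.Mutex
        buf  bytes.Buffer
        stop chan struct{}
        done chan struct{}
    }

    func newSnapshotLogger(path string, interval time.Duration) (*snapshotLogger, error) {
        f, err := os.OpenFile(path, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0o644)
        if err != nil {
            return nil, err
        }
        l := &snapshotLogger{stop: make(chan struct{}), done: make(chan struct{})}
        go func() {
            defer close(l.done)
            ticker := time.NewTicker(interval)
            defer ticker.Stop()
            for {
                select {
                case <-ticker.C:
                    l.flush(f)
                case <-l.stop:
                    l.flush(f) // final snapshot before shutdown
                    f.Close()
                    return
                }
            }
        }()
        return l, nil
    }

    // Log appends to the in-memory buffer; no disk I/O happens here.
    func (l *snapshotLogger) Log(line string) {
        l.mu.Lock()
        defer l.mu.Unlock()
        fmt.Fprintln(&l.buf, line)
    }

    // flush writes the accumulated snapshot to disk and resets the buffer.
    func (l *snapshotLogger) flush(f *os.File) {
        l.mu.Lock()
        defer l.mu.Unlock()
        if l.buf.Len() == 0 {
            return
        }
        f.Write(l.buf.Bytes())
        l.buf.Reset()
    }

    // Close flushes once more and waits for the writer goroutine to exit.
    func (l *snapshotLogger) Close() {
        close(l.stop)
        <-l.done
    }

    func main() {
        logger, err := newSnapshotLogger("app.log", time.Second)
        if err != nil {
            panic(err)
        }
        logger.Log("request handled")
        logger.Close()
    }

The trade-off is explicit: a crash loses at most one interval's worth of lines, which is the concern the next comment raises.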
digikata, over 10 years ago
Does this increase the chance that you lose a bit of critical log data if a fault brings the system down?
canterburry, over 10 years ago
Wow... this post made me think I shouldn't even attempt to run my own server infrastructure for my startup. This kind of analysis is way deeper than anything I'm currently capable of.
mleonhard, over 10 years ago
I think this problem is mostly due to using Go, whose libraries are not mature.