
Logging can be tricky

138 points by hariharan_uno, over 10 years ago

14 comments

mml, over 10 years ago
Reminds me of the time I sped up the main business app of a large company by 85% just by removing "debug" logging. 2 TB/hr of "made it here" isn't really useful at the end of the day.

Not the first time I've seen that, by a long shot.

//shakes zimmerframe, shuffles off
cryowaffle, over 10 years ago
Grey text on a white background? Please help!
ibar, over 10 years ago
I had a similar problem with Java, except that the entire application would freeze for double-digit seconds. Another application would sometimes write a huge amount of data out very quickly to the fs cache. Thirty seconds later (or whatever the expiration is), all those dirty bytes would get synced to disk more or less at once.

It turned out to be the JVM-provided GC logging hanging on flush (not even fsync) calls. The flush call happened during GC, while the GC implementation held a stop-the-world lock. Digging through JVM source code is 'fun'.
jcampbell1, over 10 years ago
If all three machines had the same logging code and one machine was fsyncing slowly, isn't turning off fsync just a band-aid that hides the true problem?

When they discover the actual problem is some issue with the RAID controller, I promise not to say "I told you so".
tendeer, over 10 years ago
While I really like the article and its analysis, I think there is a more general point here: scale is tricky.

Most programming abstractions and tools (such as logging, UI design, and persistence) are all "easy" when it comes to one machine (or a few machines) in a closed, deterministic environment.

All of them become extremely tricky when you have to deal with many instances across many different, possibly non-deterministic environments. For UI, it quickly becomes: what's the viewport? What's the screen real estate? How does the user interact (touch? mouse? keyboard? numpad?)? For persistence, one has to worry about the number of simultaneous connections, character encoding and normalization, reads/writes per second, synchronisation, etc.

So, also for logging: are we logging client-side or over the network? What's the memory on the client? What else is running on the client? What permissions do we log with? Do we care about a system call on this client for any given user interaction? Etc.

Logging, UI, persistence, network protocols, heck, even just choosing an intermediate data structure are all tricky at scale, across devices and non-deterministic environments.
dclusin, over 10 years ago
Compressing the text logs prior to writing them to disk also helps with these kinds of issues. You can also offload your logging to a dedicated thread and use a lock-free queue to increase performance even further.
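A rough sketch of that idea in Go (the language the article concerns). A buffered channel stands in for the lock-free queue and a writer goroutine for the dedicated thread, with gzip providing the compression; the asyncLogger type and all the names here are invented for illustration:

    package main

    import (
        "compress/gzip"
        "fmt"
        "os"
    )

    // asyncLogger hands lines to a dedicated writer goroutine over a
    // buffered channel, so callers never block on disk I/O, and
    // gzip-compresses the stream before it reaches the file.
    type asyncLogger struct {
        lines chan string
        done  chan struct{}
    }

    func newAsyncLogger(path string, buffer int) (*asyncLogger, error) {
        f, err := os.Create(path)
        if err != nil {
            return nil, err
        }
        l := &asyncLogger{
            lines: make(chan string, buffer),
            done:  make(chan struct{}),
        }
        go func() {
            defer close(l.done)
            zw := gzip.NewWriter(f)
            for line := range l.lines {
                fmt.Fprintln(zw, line) // compressed, buffered write
            }
            zw.Close() // flush the remaining compressed data
            f.Close()
        }()
        return l, nil
    }

    // Log enqueues a line; it blocks only when the buffer is full.
    func (l *asyncLogger) Log(line string) { l.lines <- line }

    // Close drains the queue and waits for the writer goroutine to finish.
    func (l *asyncLogger) Close() {
        close(l.lines)
        <-l.done
    }

    func main() {
        logger, err := newAsyncLogger("app.log.gz", 1024)
        if err != nil {
            panic(err)
        }
        for i := 0; i < 10; i++ {
            logger.Log(fmt.Sprintf("request %d handled", i))
        }
        logger.Close()
    }

A Go channel is mutex-backed rather than strictly lock-free, but it gives the same decoupling the comment is after: the hot path only touches memory.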
erikb, over 10 years ago
Evan, I would be happy if you could explain more about what you see in the strace outputs. E.g.:

> Time spent in fsync accounts for almost exactly the delay we were seeing.

What delay? I see the whole thing taking 1.5 seconds, with 1.3 seconds spent in futex (0.4 more than on the normal host). I'm not sure why we are suddenly talking about fsync. I also don't know what either call (futex, fsync) could be doing.

These are not questions I want answers to (some of it I could google, of course). I just want to point out that it's a rather confusing read if you expect readers to understand the strace outputs as well as you do, when you seem to use that tool daily and some readers may never have used it at all. It would be great to be able to follow your insights better. Just small additions like the following would help a lot: "[The X seconds] spent in fsync [seen in diagram A] accounts for almost exactly the delay we were seeing [in diagram B]".
sargun, over 10 years ago
Why would you fsync logs for a high-level service? Are you afraid a power outage is going to cause you to lose service logs?
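For readers unfamiliar with the distinction the question turns on, a minimal Go sketch of the two durability levels (the file name is arbitrary): a plain write survives a process crash but not a power loss, while fsync forces the data onto stable storage.

    package main

    import "os"

    func main() {
        f, err := os.Create("service.log")
        if err != nil {
            panic(err)
        }
        defer f.Close()

        // Write hands the bytes to the kernel page cache: cheap, and the
        // data survives a process crash, but not a power loss.
        if _, err := f.Write([]byte("request handled\n")); err != nil {
            panic(err)
        }

        // Sync (fsync) forces the data onto stable storage. Paying this
        // cost on every log line is what the comment above is questioning.
        if err := f.Sync(); err != nil {
            panic(err)
        }
    }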
govindkabra31, over 10 years ago
Did anyone understand why the logging calls on only the problem machine showed different fsync behavior than those on the normal machines?
kordless, over 10 years ago
If you want tricky, try building a logging service and then enabling logging to itself.
nkozyra, over 10 years ago
Does anyone think about logging to shared memory / memcached and then committing snapshots to disk at regular intervals via another process or machine?

If you're not all that concerned about consistency, each web server can keep its logs in a segregated memory space, and another process can combine and commit them and send a flush command, leaving the primary machines relatively unencumbered.
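A minimal Go sketch of the in-memory variant of this idea (everything here, from the snapshotLogger name to the flush interval, is invented for illustration; a real deployment might use shared memory or memcached as the comment suggests):

    package main

    import (
        "bytes"
        "fmt"
        "os"
        "sync"
        "time"
    )

    // snapshotLogger keeps log lines in memory and appends them to disk
    // on a fixed interval, so the hot path never waits on the filesystem.
    type snapshotLogger struct {
        mu   sync.Mutex
        buf  bytes.Buffer
        stop chan struct{}
        done chan struct{}
    }

    func newSnapshotLogger(path string, interval time.Duration) (*snapshotLogger, error) {
        f, err := os.OpenFile(path, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0o644)
        if err != nil {
            return nil, err
        }
        l := &snapshotLogger{stop: make(chan struct{}), done: make(chan struct{})}
        go func() {
            defer close(l.done)
            ticker := time.NewTicker(interval)
            defer ticker.Stop()
            for {
                select {
                case <-ticker.C:
                    l.flush(f)
                case <-l.stop:
                    l.flush(f) // final snapshot before shutdown
                    f.Close()
                    return
                }
            }
        }()
        return l, nil
    }

    // Log appends to the in-memory buffer; no disk I/O happens here.
    func (l *snapshotLogger) Log(line string) {
        l.mu.Lock()
        defer l.mu.Unlock()
        fmt.Fprintln(&l.buf, line)
    }

    // flush writes the accumulated snapshot to disk and resets the buffer.
    func (l *snapshotLogger) flush(f *os.File) {
        l.mu.Lock()
        defer l.mu.Unlock()
        if l.buf.Len() == 0 {
            return
        }
        f.Write(l.buf.Bytes())
        l.buf.Reset()
    }

    // Close flushes once more and waits for the writer goroutine to exit.
    func (l *snapshotLogger) Close() {
        close(l.stop)
        <-l.done
    }

    func main() {
        logger, err := newSnapshotLogger("app.log", time.Second)
        if err != nil {
            panic(err)
        }
        logger.Log("request handled")
        logger.Close()
    }

The trade-off is explicit: a crash loses at most one interval's worth of lines, which is the concern the next comment raises.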
digikata, over 10 years ago
Does this increase the chance that you lose a bit of critical log data if a fault brings the system down?
canterburry, over 10 years ago
Wow... this post made me think I shouldn't even attempt to run my own server infrastructure for my startup. This kind of analysis is way deeper than anything I'm currently capable of.
mleonhard, over 10 years ago
I think this problem is mostly due to using Go, whose libraries are not mature.