
Edgar: Solving Mysteries Faster with Observability

80 points by talonx, over 4 years ago

6 comments

vii, over 4 years ago
There are many alternatives for distributed tracing, like Lightstep, Jaeger and so on, but the ambitious level of integration with log searching (like ELK) and payload tracking makes this feel like an integrated in-house Splunk. Great idea, and great to see the energy and enthusiasm put into making debugging tools better! One dream feature for a tool like this: code execution counts showing which version of the code and even which lines were executed. Aggregate counts are useful, but ideally you would have them for each trace.

Unfortunately, the tradeoff between the value gained from saved debugging time and the cost of infrastructure and development is hard to manage. The storage costs are very easy to measure, so it is tempting to go after them rather than the more intangible benefits, which rely on a counterfactual of how hard things would be to debug without the tool.
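To make vii's "dream feature" concrete, here is a minimal sketch of per-line execution counting in Python using the standard `sys.settrace` hook. The `CODE_VERSION` tag and `handle_request` function are hypothetical, and a real tracer would need sampling or interpreter support, since tracing every line this way is far too slow for production traffic.

```python
import sys
from collections import Counter

CODE_VERSION = "abc123"  # hypothetical build/commit identifier
line_counts = Counter()  # (filename, lineno) -> execution count

def _tracer(frame, event, arg):
    # Invoked for events in traced frames; we only count 'line' events.
    if event == "line":
        line_counts[(frame.f_code.co_filename, frame.f_lineno)] += 1
    return _tracer

def handle_request():
    # Stand-in for real request-handling code running under a trace.
    total = 0
    for i in range(3):
        total += i
    return total

sys.settrace(_tracer)
try:
    handle_request()
finally:
    sys.settrace(None)

# A real system would attach these counts to the trace span;
# here we just print them, tagged with the code version.
for (filename, lineno), count in sorted(line_counts.items()):
    print(f"{CODE_VERSION} {filename}:{lineno} ran {count}x")
```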
AndrewKemendo, over 4 years ago
Super-powerful platform tools like these are crazy hard to build, even more so if platform teams don't have iron-fisted control of their infrastructure. So that's the curious thing organizationally to me.

I'm curious how they manage their infrastructure in a way that enables these tools.

If it's truly distributed, where individual teams can provision resources self-service, then would they have to mandate (or template) that new services have service discovery and eventing/logging as a condition for SLA contracts?

Is infra completely abstracted away from product teams? How are resources provisioned and new services developed in a way that ensures these enterprise capabilities are pervasive?
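One way to read the "mandate (or template)" idea is a paved-road provisioning check that refuses to create a service unless its manifest declares the hooks the platform tools depend on. A minimal sketch, with entirely hypothetical capability names and manifest shape:

```python
# Capabilities a hypothetical platform might require of every new service.
REQUIRED_CAPABILITIES = {"service_discovery", "tracing", "structured_logging"}

def missing_capabilities(manifest):
    """Return the required platform capabilities this manifest lacks."""
    declared = set(manifest.get("capabilities", []))
    return sorted(REQUIRED_CAPABILITIES - declared)

manifest = {
    "name": "recommendations",
    "capabilities": ["service_discovery", "tracing"],  # no structured_logging
}

missing = missing_capabilities(manifest)
if missing:
    print(f"provisioning blocked; manifest missing: {missing}")
```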
aero142, over 4 years ago
I'm seeing this "3 pillars of observability" framing as the main description these days, and I think it is a nice breakdown of the options. However, there is a common problem I am curious how others are dealing with. None of the metrics systems I have seen handle high-cardinality data like IDs well. Most SaaS products base their pricing on unique metrics, because each unique aggregation has a cost. In the 3-pillars model, companies like Datadog are pushing those high-cardinality values into the log pillar, and this article seems to imply the same. However, logs are often unstructured. There are tools to search through logs, find values within them, and even do aggregations on them. But when you know ahead of time what you want to aggregate on, text logs are more brittle than simply defining an event in JSON or another structured format. This log tier quickly becomes an ad-hoc version of a data warehouse, and I feel like there is a missing tier here, where you would send structured data to a datastore used for aggregations for observability purposes only. I know Datadog supports parsing structured data in this way, but I'm curious what the common solution is.

Is sending structured data through the "log" tier common, or are there structured event and reporting systems that are part of another system that is seldom discussed in this topic?
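The brittleness aero142 describes is easy to illustrate: compare a free-text log line with the same information emitted as one JSON object per line, where every field, including a high-cardinality one like a customer ID, stays directly queryable. A minimal sketch, with made-up event and field names:

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("orders")

def emit_event(name, **fields):
    # One structured event per line: downstream aggregation can key on
    # any field without regex-parsing prose.
    log.info(json.dumps({"event": name, "ts": time.time(), **fields}))

# Unstructured: aggregating on duration later means fragile text parsing.
log.info("order 12345 for customer c-9defg took 137ms")

# Structured: the same information, but each field keeps its name and type.
emit_event("order_processed", order_id=12345,
           customer_id="c-9defg", duration_ms=137)
```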
yowlingcat, over 4 years ago
> Edgar captures 100% of interesting traces, as opposed to sampling a small fixed percentage of traffic.

Very interesting. This is probably my single biggest complaint about AWS X-Ray, which I am otherwise a huge fan of and find really useful. I would love to figure out how they ensure their "interesting" classifier works well, or how to work around it when it doesn't classify things properly.
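The quoted claim suggests tail-based sampling: the keep/drop decision is made after the request finishes, once you know whether it was interesting. A minimal sketch of such a classifier; the thresholds and rules here are guesses at the general idea, not Netflix's actual logic:

```python
import random

SLOW_MS = 500          # hypothetical latency threshold
BASELINE_RATE = 0.01   # small fixed sample of "boring" traffic

def keep_trace(span):
    """Tail-based keep decision, evaluated after the request completes."""
    if span["status_code"] >= 500:      # errors are always interesting
        return True
    if span["duration_ms"] > SLOW_MS:   # so are slow requests
        return True
    return random.random() < BASELINE_RATE  # everything else: fixed sample

print(keep_trace({"status_code": 200, "duration_ms": 42}))   # usually False
print(keep_trace({"status_code": 503, "duration_ms": 42}))   # True
print(keep_trace({"status_code": 200, "duration_ms": 900}))  # True
```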
tnolet, over 4 years ago
Every tech / SaaS product will eventually be reinvented internally at Netflix. Or vice versa.

Without being facetious: I would love to see some of their cost/benefit analyses.
tmd83, over 4 years ago
Here's what I want in an observability/diagnostics platform. Are there tools, open source or SaaS, that achieve something like this in a reasonable way? Am I ignorant, or is this actually too hard to do at scale?

1. Detailed response times for every single endpoint. I want a histogram, not just an average. Tools often give you only the top X endpoints (for me, mostly requests) by some global measure. At my work the number of unique requests is large, and a good response time varies a lot between them, easily from 50ms to 1s, so a global threshold is useless. There's also the fact that different users have different costs for the same endpoint, but for that I don't have a great solution either.

2. Some level of tracing for requests that cross a threshold. That means the agent has to keep a threshold for every endpoint, but is that so expensive? I think I saw the idea (in some commercial product) of collecting trace data and dropping it if the request ends up being fast enough. I think that's a very good approach. So I want specifics when something is slow, measured against a per-endpoint threshold: perhaps all requests slower than the 99th percentile, perhaps 20% of those above the 90th percentile, etc.

3. The per-endpoint statistics have to be kept for a significant period, months not days, otherwise how would you see change? I have had things slow down due to code changes, usage increases, and query plans going whacky, and performance differs at different levels of usage (concurrency). No one can afford per-second resolution for a year, sure, but if my 9am spikes are averaged out, how would I know whether this is an old problem or the spike actually worsened by 20% in the last two months? I think you can reduce resolution a lot if you keep a histogram and not just an average, but I don't think anyone optimizes for that. You also need to keep those important traces for quite a long time.

4. I work in Java with unstructured logs and have been trying to figure out how to parse them reliably for debugging, so that we don't spend hours grepping. I recently realized that the most common query fields are easily parsable for me: user, server/app-instance, and code line (Java loggers print class:line). That narrows things down so much that I can afford to just export those fields and grep for the rest if I need to. But most log tools seem to be either plain grep or super structured. Someone else mentioned cardinality: while the rest can be fine, user is definitely high cardinality, so everything might break down there.

I think for my problems, at my scale (millions of users, not millions of concurrent users), these are nicely solvable given the capabilities today's tools have (even if they don't do it exactly like this). But is it unscalable at large, or is this just not needed if you do 'something' that I/we are not doing?
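Points 1 and 2 of tmd83's wishlist compose naturally: keep a bucketed latency histogram per endpoint, derive each endpoint's own threshold (say, its p99) from that histogram, and retain full trace detail only for requests that exceed it. A minimal sketch, with made-up endpoints and bucket edges:

```python
import bisect
import random
from collections import defaultdict

# Per-endpoint latency histograms over fixed bucket edges (milliseconds).
BUCKETS = [10, 25, 50, 100, 250, 500, 1000, 2500, 5000]
hist = defaultdict(lambda: [0] * (len(BUCKETS) + 1))

def record(endpoint, duration_ms):
    hist[endpoint][bisect.bisect_left(BUCKETS, duration_ms)] += 1

def p99(endpoint):
    # Approximate p99 as the upper edge of the bucket holding the
    # 99th-percentile observation; coarse, but cheap to store for months.
    counts = hist[endpoint]
    total = sum(counts)
    if total == 0:
        return float("inf")
    running = 0
    for i, c in enumerate(counts):
        running += c
        if running >= 0.99 * total:
            return BUCKETS[i] if i < len(BUCKETS) else float("inf")
    return float("inf")

def keep_full_trace(endpoint, duration_ms):
    # Point 2: buffer trace detail, keep it only if the request ends up
    # slower than this endpoint's own p99 rather than a global cutoff.
    return duration_ms > p99(endpoint)

# Toy usage: two endpoints whose "normal" latencies differ by an order
# of magnitude, so a single global threshold would misfire on both.
for _ in range(1000):
    record("/search", random.gauss(60, 15))
    record("/report", random.gauss(800, 150))
print(p99("/search"), p99("/report"))   # e.g. 100 and 2500
print(keep_full_trace("/search", 200))  # True: slow for /search
print(keep_full_trace("/report", 200))  # False: fast for /report
```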