Hi all,<p>I'm trying to build real-time data infrastructure for logging. For the ingestion layer, I'm thinking about using Kafka or Logstash, so that afterwards I can store the data in any database and easily swap out the store later without changing the ingestion layer.<p>Any experience running Logstash or Kafka in production?<p>An additional question: I'm quite concerned about missing data when shipping to Logstash or Kafka with a lightweight shipper like Filebeat. Any experience handling missing data at scale?
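For context, here's a rough sketch of the shipper-to-Kafka hop I have in mind, just to illustrate the kind of delivery guarantee I'm worried about losing. It uses kafka-python; the broker address and topic name are placeholders, not anything we run today.

```python
# Sketch: produce a log line to Kafka and wait for acknowledgement,
# so a failed send surfaces as an exception rather than a silently
# dropped log entry. Broker and topic names are placeholders.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka-1:9092",   # placeholder broker address
    acks="all",                         # wait for all in-sync replicas
    retries=5,                          # retry transient broker errors
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

future = producer.send("app-logs", {"level": "INFO", "msg": "user logged in"})
metadata = future.get(timeout=10)       # block until acked or raise
print(metadata.topic, metadata.partition, metadata.offset)

producer.flush()
```

In practice the shipper (Filebeat or similar) would handle this for me, but this is the behaviour I'd want from it: don't advance past a log line until the broker has acknowledged it.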
Are you really ready to forgo hundreds of megabytes of RAM merely for log shipping? Fluentd could be a cheaper alternative to JVM-based routers.<p>Also, what exactly is your question?<p>> Any experience for handling missing data at scale?<p>For <i>logs</i>? Missing logs <i>don't matter</i> (unless they're required audit data). Your system should be prepared not to fall apart on missing hours or days of logs, just as it should tolerate missing metrics and other monitoring data.<p>And what volume is "at scale"?
We're using Kafka as a log delivery platform and are quite happy with it. Kafka is highly available by nature and can be scaled quite trivially with the log load by adding new cluster nodes.<p>We've decided to use journald for storing all of our application logs. We pump the entries from journald to Kafka using a tool that we open sourced: <a href="https://github.com/aiven/journalpump" rel="nofollow">https://github.com/aiven/journalpump</a>.<p>From Kafka, we export the logs to Elasticsearch for viewing and analysis. Some specific logs are also stored in S3 for long-term retention, e.g. for audit purposes.
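To give a feel for the Kafka-to-Elasticsearch export step, here is a minimal sketch of a consumer that indexes log entries and only commits its offset after a successful index, which gives at-least-once delivery. Topic, group id, index name, and host addresses are made up; our actual exporter is more involved and batches writes.

```python
# Sketch: consume log entries from Kafka and index them into Elasticsearch,
# committing offsets manually only after a successful index (at-least-once).
# All names and addresses below are placeholders.
import json
from kafka import KafkaConsumer
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

consumer = KafkaConsumer(
    "app-logs",                           # placeholder topic
    bootstrap_servers="kafka-1:9092",     # placeholder broker
    group_id="log-indexer",
    enable_auto_commit=False,             # commit only after indexing succeeds
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    # Index one log entry per document; in production you'd batch these
    # with the bulk helper instead of indexing one at a time.
    es.index(index="logs", document=message.value)
    consumer.commit()                     # advance the offset only after success
```

The important property is that a crash between indexing and committing re-delivers the entry on restart, so you may see duplicates but you don't lose logs.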