Very nice & all, but... MongoDB?<p>MongoDB is very popular, but all the (limited) criticisms of it seem to relate to insert performance once the dataset is too big to fit in RAM.<p>Normally the ease-of-development arguments make up for that, but log data is one of those areas that has a tendency to expand quickly beyond any expectations.<p>There is a reason why most companies are using HDFS and/or Cassandra for structured log file storage.
This is SUPER helpful! Just the other day I was wondering how someone like me could get involved in the hard scalability problems I read so much about here on the hackers news. But how to make my boring old highly cacheable read-only web traffic into a major scalability problem? Then I read this blog entry, and wow, now each log entry on my site turns into a random btree update in MongoDB made while holding a global write lock. Thanks again hackers news, and thanks again BIG DATA!
How does fluentd resume tailing the apache log if it crashes? Does it maintain the current file position on disk? What if logs are rotated between a fluentd crash and recovery?<p>I've had to solve this problem for Yahoo!'s performance team, and ended up setting a very small log rotation timeout and only parsing rotated logs. There's a 5-30 minute delay in getting data out of the logs (depending on how busy the server is), but since we're batch processing anyway, it doesn't matter.<p>The added advantage is that you just maintain a list of files you've already parsed, so if the parser/collector crashes, it just looks at the list and restarts where it left off. Smart key selection (i.e., something like IP or userid + millisecond timestamp) is enough to ensure that if you do end up reprocessing the same file (e.g., if a crash occurs mid-file), duplicate records aren't inserted (use the equivalent of a bulk INSERT IGNORE for your db).<p>This scales to billions of log entries a day.
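A rough sketch of that INSERT IGNORE-style idempotency with PyMongo (the database/collection names and the ip + millisecond-timestamp key are illustrative assumptions, not anything from the parent comment): a unique index on a deterministic key makes replaying a half-processed file harmless.

    # Illustrative only: idempotent bulk loading of parsed log records into MongoDB.
    # Database/collection names and the (ip, ts_ms) dedupe key are assumptions.
    from pymongo import MongoClient, ASCENDING
    from pymongo.errors import BulkWriteError

    client = MongoClient("localhost", 27017)
    entries = client.weblogs.entries

    # A unique compound index on IP + millisecond timestamp acts as the dedupe key.
    entries.create_index([("ip", ASCENDING), ("ts_ms", ASCENDING)], unique=True)

    def insert_batch(records):
        """Insert a batch of parsed records; replayed duplicates are silently skipped."""
        try:
            # ordered=False keeps inserting past duplicate-key errors
            # (the bulk INSERT IGNORE equivalent).
            entries.insert_many(records, ordered=False)
        except BulkWriteError:
            # Duplicate-key errors from a reprocessed file land here; anything
            # else would deserve real error handling in production.
            pass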
I have a syslog-ng -> MongoDB project that I've been working on at my university.<p>github.com/ngokevin/netshed<p>It is written in Python and currently parses out fields from several types of logs (such as dhcpd). It is initially set up to read from named pipes (it has a tail function as well). Each type of log is dumped to its own database, and each date has its own collection. I have it set up with a master/slave configuration to overcome the global write lock. It has functions to simulate capped collections by day. It is paired with a Django frontend for querying via PyMongo.<p>This version is several weeks old and I will push out a new one soon.
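Not pulled from netshed itself, but a minimal sketch of the per-type database / per-day collection layout described above (names and the retention window are assumptions):

    # Illustrative only: each log type gets its own database, each date its own
    # collection, and old collections are dropped to simulate a capped collection.
    from datetime import datetime, timedelta
    from pymongo import MongoClient

    client = MongoClient()

    def collection_for(log_type, day=None):
        """Return the collection for a given log type and day (YYYYMMDD)."""
        day = day or datetime.utcnow()
        return client[log_type][day.strftime("%Y%m%d")]

    def prune(log_type, keep_days=30):
        """Drop collections older than keep_days, capping growth by day."""
        cutoff = (datetime.utcnow() - timedelta(days=keep_days)).strftime("%Y%m%d")
        for name in client[log_type].list_collection_names():
            if name < cutoff:
                client[log_type].drop_collection(name)

A parser would then write with something like collection_for("dhcpd").insert_one(record), and a nightly prune("dhcpd") keeps storage bounded.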
I'd also suggest looking at both Logstash and Graylog2. They can both use MongoDB as the storage engine for logs, and can also do the field extractions.
Great! If we use Fluentd and MongoDB, we can collect realtime events without writing any code, only a configuration file. I am also thinking about a more flexible aggregation system built on them: "An Introduction to Fluent & MongoDB Plugins" <a href="http://www.slideshare.net/doryokujin/an-introduction-to-fluent-mongodb-plugins" rel="nofollow">http://www.slideshare.net/doryokujin/an-introduction-to-flue...</a> . Please tell me if there are more powerful use cases for Fluentd & Mongo!
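For reference, a minimal Fluentd configuration in that spirit might look something like this (the log path, tag, and database/collection names are placeholders, and the match section assumes the fluent-plugin-mongo output plugin is installed):

    # Tail the Apache access log and parse it with the built-in apache format.
    <source>
      type tail
      format apache
      path /var/log/apache2/access_log
      tag mongo.apache
    </source>

    # Buffer events and flush them into a MongoDB collection periodically.
    <match mongo.**>
      type mongo
      database apache
      collection access
      host localhost
      port 27017
      flush_interval 10s
    </match>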