I recently had a brief chat with a couple of developers working on different data-heavy websites, who were both using an interesting pattern for filling their databases.

Their data-gathering components (pulling from external sources like crawlers and APIs) would append new data to the bottom of a log file.

Another process sat doing something like a 'tail -f' on the same file, and parsed and added the updates to the database.

This seems like it might solve some problems for my case:

- Very easy to recreate the database if the schema changes or things blow up, just reread the log files

- Good history for debugging

What worries me is that it feels funky using files for IPC, and I can't find any examples of this being used elsewhere.

So, is anyone else using this pattern, or have any references to it that I'm missing?
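To be concrete about the pattern, here is a minimal sketch of the consumer side in PHP. It assumes the gatherers append one JSON record per line; the log path is made up and the actual database insert is left as a comment:

    <?php
    // Follow the log file roughly the way `tail -f` does, and hand each
    // appended record to the database. Path and record format are
    // assumptions for the sake of the example.
    $logPath = '/var/data/crawler-updates.log';
    $fh = fopen($logPath, 'r');
    fseek($fh, 0, SEEK_END);                // start at the current end of the file

    while (true) {
        $line = fgets($fh);
        if ($line === false) {
            sleep(1);                       // nothing new yet; wait and retry
            fseek($fh, ftell($fh));         // clear the EOF flag so new writes are seen
            continue;
        }
        $record = json_decode(trim($line), true);  // one JSON object per line
        if ($record === null) {
            continue;                       // skip partial or malformed lines
        }
        // Insert $record into the database here, e.g. with PDO.
    }

Note this sketch does nothing about log rotation; in practice the consumer would need to reopen the file when it gets rotated.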
It's certainly an interesting approach! Depending on the volume, it could have the benefits you describe, but it may get unwieldy as the volume increases.

I'd engineer the crawler to talk to a persistent message queue and load the database from there. That gives you a lot of flexibility to move loads around and instrument the queue, and you're not reinventing things, either.
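To make the queue suggestion concrete, a rough sketch using a Redis list as the queue via the phpredis extension. Redis is just one possible choice here (with its append-only persistence turned on if you want durability), and the key name and record shape are invented for the example:

    <?php
    // Producer (crawler) side: push each new record onto a list instead of
    // appending it to a file.
    $redis = new Redis();
    $redis->connect('127.0.0.1', 6379);
    $redis->lPush('crawl_updates', json_encode([
        'url'        => 'http://example.com/page',
        'fetched_at' => time(),
    ]));

    // Consumer (loader) side: a separate worker blocks on the same list and
    // writes whatever arrives into the database.
    while (true) {
        $item = $redis->brPop(['crawl_updates'], 30);  // [key, value], or [] on timeout
        if (empty($item)) {
            continue;
        }
        $record = json_decode($item[1], true);
        // Insert $record into the database here.
    }

Beanstalkd or an AMQP broker would slot into the same shape; the point is that the queue, rather than a raw file, carries the hand-off.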
"Very easy to recreate the database if the schema changes or things blow up, just reread the log files"

This would be nasty, though, if your log files wrapped, the disk they were on ran out of space, and so on.
As an update, I did some groundwork to see how this might work in PHP by creating a small example that tails the Apache error log:
http://petewarden.typepad.com/searchbrowser/2009/09/how-to-follow-your-apache-error-logs-in-a-browser.html

Still feels kinda sketchy...
An RDBMS already has logging mechanisms for this kind of database rebuild. For example, PostgreSQL has the WAL (write-ahead log), which can be used to rebuild the database or to drive asynchronous replication. Likewise, MySQL has binary logging.
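For reference, this is roughly what switching on WAL archiving looks like in postgresql.conf on a recent PostgreSQL; the archive path is a placeholder. The MySQL equivalent is enabling the binary log with log_bin in my.cnf:

    # postgresql.conf: keep WAL segments around so the database can be
    # rebuilt or replicated from them
    wal_level = replica
    archive_mode = on
    archive_command = 'cp %p /var/lib/postgresql/wal_archive/%f'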