I am in the process of setting up a new website and was hoping to get some advice on how others store web traffic log data.<p>At the moment I am thinking of just using Google Analytics and Mixpanel.<p>I have no intention of saving Apache access log files, so I'm looking for a web-based service.
I was considering rolling my own custom db table just to log hits, but it feels wrong.<p>What are you using? If you are storing data in a db, which db is it (e.g. Mongo, Postgres, Cassandra)?
How you store your logs depends on your server configuration. Analytics services like Google Analytics or Mixpanel will work for any type of config as they're initiated by the client. They both also have a nice UI, so you can see live users, plot them on maps, etc.<p>If you want lower-level detail, such as each user's IP address, you'll need something on the server side. I haven't used Mixpanel, but Google Analytics doesn't give you raw IP addresses. Also, if a user has it blocked (e.g. by Ghostery), you don't see them in Google Analytics at all. To get around this we also log all requests server side.<p>The two options I know of are to do it yourself (that's what we did, more below) or to use something like Piwik (<a href="http://piwik.org/" rel="nofollow">http://piwik.org/</a>). The latter is kind of like your own Google Analytics that you run on your own infrastructure.<p>For our public cloud app (<a href="https://cloud.jackdb.com/" rel="nofollow">https://cloud.jackdb.com/</a>) we run all the infrastructure, so we aggregate the server access logs from each nginx instance and push them to an S3 bucket. It's pretty straightforward and <i>really</i> cheap (S3 costs peanuts and log data gzips well). Aside from audit events (which do get logged to a database and can be queried), any funky research is done with good ol' awk/grep/sed.<p>Our public website (<a href="http://www.jackdb.com" rel="nofollow">http://www.jackdb.com</a>) is hosted on S3, so we don't even control the actual server. Instead we've got logging enabled on the S3 bucket, sent to another S3 bucket[1]. S3 creates files there, with a 1-3 hour lag, covering all requests with full details (IP, user agent, etc). The only pain is that S3 creates <i>a lot</i> of files, so we've got a cron job that runs regularly to combine them into daily files, gzip them, and put them in a different S3 bucket. Again, ad hoc research is done via unix commands on either the latest log files or the archived files (we keep a local copy in addition to the ones in S3).<p>Regardless of how you get your logs onto S3, if you want to make the storage costs 10x cheaper in the long run (again, this will only matter once you actually have a significant amount of data), push them from S3 to Glacier. Even better, you can set up S3 to auto-expire data to Glacier after X days[2]. Just remember that you can't access them directly from Glacier; it's just for "cold storage".<p>[1]: <a href="http://docs.aws.amazon.com/AmazonS3/latest/dev/ServerLogs.html" rel="nofollow">http://docs.aws.amazon.com/AmazonS3/latest/dev/ServerLogs.ht...</a><p>[2]: <a href="http://docs.aws.amazon.com/AmazonS3/latest/dev/object-lifecycle-mgmt.html#intro-lifecycle-rules" rel="nofollow">http://docs.aws.amazon.com/AmazonS3/latest/dev/object-lifecy...</a>
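As a concrete illustration of the "combine them into daily files, gzip them, and put them in a different S3 bucket" cron job, here is a minimal sketch using the Ruby aws-sdk-s3 gem. The bucket names, key layout, and region are placeholders rather than the poster's actual setup, and a real version would paginate the listing and handle errors.
<pre><code>
#!/usr/bin/env ruby
# Hypothetical daily cron job: combine one day's S3 access-log objects into
# a single gzipped file and push it to an archive bucket.
require 'aws-sdk-s3'
require 'date'
require 'zlib'
require 'stringio'

LOG_BUCKET     = 'example-raw-access-logs'   # bucket S3 writes access logs into
ARCHIVE_BUCKET = 'example-archived-logs'     # bucket for combined daily files
day = (Date.today - 1).strftime('%Y-%m-%d')  # process yesterday's logs

s3 = Aws::S3::Client.new(region: 'us-east-1')

# Assumes log objects are keyed by date (e.g. an empty target prefix);
# adjust the prefix to your layout. list_objects_v2 returns at most 1000
# keys per call, so paginate in production.
combined = StringIO.new
s3.list_objects_v2(bucket: LOG_BUCKET, prefix: day).contents.each do |obj|
  combined << s3.get_object(bucket: LOG_BUCKET, key: obj.key).body.read
end

# Gzip the combined text and upload it as a single daily object.
gzipped = StringIO.new
gz = Zlib::GzipWriter.new(gzipped)
gz.write(combined.string)
gz.close

s3.put_object(bucket: ARCHIVE_BUCKET, key: "daily/#{day}.log.gz", body: gzipped.string)
</code></pre>
The Glacier transition mentioned above is then just a lifecycle rule on the archive bucket, which can be set once in the S3 console or via the API.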
If you have the cash and analytics aren't a distinct business advantage, just go with Mixpanel.<p>If you decide to do it yourself, this is what I've done:
Create a small web service that you can call to log data from the UI. Start with one server and, if it consistently goes over 60-80% usage, add a second.<p>The server should log every call to the service to a large flat file (CSV is easiest). The file should be named by date and time, down to the minute. As you scale up servers, you just have a process pull down each file and aggregate them server side. Or just throw them into S3 and use Hive/EMR to report on the data.<p>It's a middle-class man's Mixpanel. I served tens of millions of logging events a day with this solution. At the time the cost was somewhere around $1,500 a month, I believe. I was running 6 servers on Ruby/Sinatra, though, and never tried to optimize much.
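To make the flat-file idea concrete, here is a minimal Ruby/Sinatra sketch of that kind of logging endpoint. The /log route, the fields written, and the file naming are illustrative guesses, not the original code:
<pre><code>
# Minimal logging endpoint: appends one CSV row per call to a flat file
# named down to the minute. Run one or more of these behind a load balancer
# and ship the finished minute files off the box (e.g. to S3) on a schedule.
require 'sinatra'
require 'csv'
require 'time'

post '/log' do
  now  = Time.now.utc
  file = "events-#{now.strftime('%Y%m%d-%H%M')}.csv"  # one file per minute
  CSV.open(file, 'a') do |csv|
    csv << [now.iso8601, request.ip, request.user_agent, params['event']]
  end
  status 204  # nothing to return; keep the response as small as possible
end
</code></pre>
Aggregation then happens offline: pull the minute files from each server, concatenate them, and point Hive/EMR (or just awk) at the result.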
If you are planning to run at any sort of scale, I advise staying away from logging directly to a database. Tying request throughput to database I/O like that could really hurt.
Just use Google Analytics. It's free and the reporting is extremely powerful. Trying to roll your own solution is a total waste of time.<p>The only thing you should be doing with your logs is archiving them in case of a security breach, so you can try to pinpoint how the attack happened.<p>Don't waste the space on your LAN for the log archives either. Get an S3 account, zip them up, and store them on S3.
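If you do go the "zip them up and store on S3" route, the archiving step is only a few lines of Ruby with the aws-sdk-s3 gem; the paths and bucket name below are placeholders:
<pre><code>
# Sketch: gzip a rotated access log and archive it to S3.
require 'aws-sdk-s3'
require 'zlib'

log_path = '/var/log/apache2/access.log.1'  # rotated log to archive (placeholder)
gz_path  = "#{log_path}.gz"

Zlib::GzipWriter.open(gz_path) do |gz|
  # Fine for modest files; stream in chunks if the logs get huge.
  gz.write(File.binread(log_path))
end

s3 = Aws::S3::Client.new(region: 'us-east-1')
s3.put_object(bucket: 'example-log-archive',
              key: "apache/#{Time.now.utc.strftime('%Y%m%d')}-#{File.basename(gz_path)}",
              body: File.open(gz_path, 'rb'))
</code></pre>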