A web application I am developing (in PHP) requires the ability to log each page request.

Just like a normal access_log, it will store details like the URL requested, source IP address, and date/time, but I also need it to store the user ID of the logged-in user (which is stored in a PHP session variable).

This data will then be queried to create site-wide or per-user analytics reports as required at a later date: things such as total visits/unique visits, page views in a certain time period, geo-locating the IP addresses and looking at locations, most active times of day, most active members, etc.

The obvious thing to do would be a MySQL insert statement on each page, but if the application is receiving thousands of requests per second, this is going to be a huge bottleneck on the database, so I am looking at alternative, scalable ways of doing this without big infrastructure requirements.

A few of the ideas I've had:

1) Work out a way for Nginx (it's hosted on Nginx with PHP-FPM) to log the user_id from the session/application in the normal web server access_log, which could be parsed and loaded into a database periodically (nightly). This feels like a bit of a hack and would need doing on each web server as the system scales out.

2) Log each page request into Redis, which has high write speeds. The problem with this is the lack of ability to query the data at a later date.

3) Log each page request into either Memcache/Redis acting as a cache (or a message queue), from where it would be regularly extracted, inserted into MySQL and removed (sketched below).

4) Would something like MongoDB, which has more query capability, be suitable?

I'm interested in how HN readers would approach this and whether anyone has experience of a similar application (or has come across anything online).

I'm also interested in thoughts on how the data could be suitably structured to be stored in Memcache/Redis.

Thanks
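To make option 3 concrete, here is a minimal sketch of what the per-request write and the periodic drain might look like. It assumes the phpredis extension; the key name `page_hits`, the field set, and the batch size of 500 are placeholders, not a recommendation.

```php
<?php
// Per-request write path: push each hit onto a Redis list (a cheap append)
// and let a periodic worker drain it into MySQL later.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$hit = array(
    'url'     => $_SERVER['REQUEST_URI'],
    'ip'      => $_SERVER['REMOTE_ADDR'],
    'user_id' => isset($_SESSION['user_id']) ? $_SESSION['user_id'] : null,
    'ts'      => time(),
);
$redis->rPush('page_hits', json_encode($hit));

// Separate cron/worker script: pull a batch, bulk-insert into MySQL, discard.
$batch = array();
while (count($batch) < 500 && ($raw = $redis->lPop('page_hits')) !== false) {
    $batch[] = json_decode($raw, true);
}
// ... build a single multi-row INSERT from $batch and run it against MySQL ...
```

Storing each hit as a JSON blob in a list keeps the write path dumb and fast; all the structure lives in the MySQL table the worker fills.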
My $0.03

The first thing you need to answer is whether you'll do data processing on select, on insert, or as a mix of both.

On select, you only insert raw data (like an access log). You can then pick one of two ways to generate your reports and transform your data: do it in real time, when a user requests the data, or do it as a batch process and do the whole OLTP-to-OLAP transformation offline.

On insert, you calculate everything as each hit comes in and fill in your reporting structure directly.

Both the select-with-real-time-report-generation approach and the insert approach give users access to real-time data. Select with an offline transformation script will provide cached data until the batch process is run again.

I'm going to guess that your users want data as close to real time as possible, with some data having to be more real-time than the rest. This is where you adopt some form of hybrid/mix.

As for actual implementation... First, can you afford to lose data? Now don't jump the gun and immediately say "no". A lot (but certainly not all) of analysis of this type of data works on averages and general trends. If you happen to lose a couple hundred rows out of a couple hundred thousand, it doesn't really change the outcome that much. Anyway, it's important to know this because some storage engines support non-durable writes (writes to memory) which are stupid fast, but might result in lost data (or you can possibly write your own buffering logic in the app/web code to spin off another thread and write every 500 hits in a bulk insert).

Technologies: MongoDB has a couple of things going for it. First, writes are fast, and they can either be done to memory (stupid fast) or to disk (or to X replicas). Even writes to disk are pretty damn fast (writes to X replicas will largely depend on network latency). The other thing MongoDB has going for it is MapReduce. Sadly it's single-threaded, but whether you are doing real-time reporting or an offline OLTP-to-OLAP transformation, MapReduce is significantly more powerful than OrderBy.

In chapter 6 of the (free) Little MongoDB Book (http://openmymind.net/mongodb.pdf) I actually outlined the initial OLTP-to-OLAP process we were using for a very modest reporting tool.

Since then, we've switched to Redis and do it all on insert to provide real-time analytics. However, we are only tracking 3 very basic incrementing statistics. You can see the code here: https://github.com/mogade/mogade-server/blob/master/app/models/stat.rb (the hit method).

Of course, why don't you just use Google Analytics?

EDIT:

If your data processing needs are truly huge, you'll likely end up with something like Hadoop (which I haven't had the opportunity/pleasure to need, so I can only mention it in passing).
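The linked hit method is Ruby; a rough PHP equivalent of the on-insert idea might look like the sketch below. Again this assumes the phpredis extension, and the key scheme (daily totals, per-user totals, a set of unique IPs) is invented for illustration rather than taken from that code.

```php
<?php
// Hypothetical on-insert stats: bump a handful of counters per hit so that
// reports become simple key lookups instead of scans over raw rows.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$day    = date('Y-m-d');
$userId = isset($_SESSION['user_id']) ? $_SESSION['user_id'] : 'anon';

$redis->incr("hits:$day");                              // total page views today
$redis->incr("hits:$day:user:$userId");                 // per-user page views today
$redis->sAdd("uniques:$day", $_SERVER['REMOTE_ADDR']);  // unique visitors (by IP)

// Reading the report back is cheap:
// $views   = $redis->get("hits:$day");
// $uniques = $redis->sCard("uniques:$day");
```

The trade-off is the one described above: you get real-time numbers, but only for the statistics you decided to count up front; the raw per-request detail is gone unless you also log it somewhere.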
How much traffic are you really expecting on this project? Don't pre-optimize; most of the time a regular ol' MySQL table will do just fine, and people over-worry about performance too early in the development process.

Stick it into MySQL, and if it causes problems later, worry about it then (or email me at the address in my profile).
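For what it's worth, the simplest-thing-that-works version is just one prepared insert per request into a plain log table. Connection details, table name, and columns below are only a guess at what the reports described in the question would need.

```php
<?php
// One insert per page view into a plain MySQL log table
// (url, ip, user_id, created_at). Names and credentials are illustrative.
$pdo = new PDO('mysql:host=127.0.0.1;dbname=app', 'user', 'pass');
$stmt = $pdo->prepare(
    'INSERT INTO page_requests (url, ip, user_id, created_at)
     VALUES (:url, :ip, :user_id, NOW())'
);
$stmt->execute(array(
    ':url'     => $_SERVER['REQUEST_URI'],
    ':ip'      => $_SERVER['REMOTE_ADDR'],
    ':user_id' => isset($_SESSION['user_id']) ? $_SESSION['user_id'] : null,
));
```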
Writing to MySQL on every page request is generally a "bad idea" unless you aren't getting much traffic (in which case, it's the easiest solution).

The scalable way to build something like this would be to log your raw data to a service like Scribe Server (open sourced by Facebook) or rsyslogd and then aggregate it in a background process. The aggregated data can be stored in MySQL or whatever datastore you prefer.

http://scribeserver.svn.sourceforge.net/
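To give a feel for the rsyslogd route from PHP: the built-in syslog functions can ship each hit off as a one-line message, which rsyslog then writes to a file (or forwards) for a background aggregator to process. The facility, tag, and JSON line format here are just one possible choice, not a requirement of rsyslog.

```php
<?php
// Hand each hit to syslog and let rsyslogd handle buffering/shipping.
// LOG_LOCAL0 and the JSON message format are arbitrary choices; a background
// job would parse the resulting log and aggregate it into MySQL.
openlog('myapp-access', LOG_PID, LOG_LOCAL0);
syslog(LOG_INFO, json_encode(array(
    'url'     => $_SERVER['REQUEST_URI'],
    'ip'      => $_SERVER['REMOTE_ADDR'],
    'user_id' => isset($_SESSION['user_id']) ? $_SESSION['user_id'] : null,
    'ts'      => time(),
)));
closelog();
```

This keeps the request path to a local, non-blocking log call; durability and fan-out to other machines become rsyslog configuration rather than application code.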
<a href="http://www.rsyslog.com/" rel="nofollow">http://www.rsyslog.com/</a><p>(Scribe is a bit more complicated to setup and hook into PHP, but is a much better choice for a highly distributed/scalable system)
<a href="http://www.google.com/analytics/" rel="nofollow">http://www.google.com/analytics/</a> probably does 90% of what you need; add a more user-focused analytics program (there are plenty) and you'll probably get more than you need, executed better than you could yourself.<p>ps: the inability of many developers to go with a "good enough" solution or to question "requirements" always baffles me. What you think are requirements are probably just rough guesses at what the client might want (I know, I've been writing requirements for a living for 10 years). This sounds like exactly a case of that, although I may be wrong of course.
If you do go the MySQL route, be sure to check out INSERT DELAYED: http://dev.mysql.com/doc/refman/5.5/en/insert-delayed.html
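INSERT DELAYED is essentially just a keyword change on a normal insert: the client returns immediately and the server queues the row. Two caveats worth knowing: it only works with a few storage engines (MyISAM, MEMORY, ARCHIVE), not InnoDB, and queued rows can be lost if the server crashes. A rough sketch, with table and column names made up for illustration:

```php
<?php
// Fire-and-forget insert: DELAYED makes the call return before the row is
// written. Only MyISAM/MEMORY/ARCHIVE tables support it, and queued rows are
// lost on a server crash, so treat this as best-effort logging.
$mysqli = new mysqli('127.0.0.1', 'user', 'pass', 'app');
$userId = isset($_SESSION['user_id']) ? (int) $_SESSION['user_id'] : 0;
$sql = sprintf(
    "INSERT DELAYED INTO page_requests (url, ip, user_id, created_at)
     VALUES ('%s', '%s', %d, NOW())",
    $mysqli->real_escape_string($_SERVER['REQUEST_URI']),
    $mysqli->real_escape_string($_SERVER['REMOTE_ADDR']),
    $userId
);
$mysqli->query($sql);
```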