...wut?<p>That's ridiculous. What evidence is there that there are groups of nefarious hackers out there spoofing analytics data on people's websites? I don't think there is a need for this solution because the problem doesn't exist. If I wanted to mess with someone's websites, there are much better ways than injecting some false data into their Google Analytics.
Our company has its own internal analytics system and, while their approach could technically work to prevent spoofing, there are other, simpler ways. The first is simple deduplication of received events; this carves out a large portion of invalid requests, particularly if you set time thresholds for how frequently a repeated event is considered valid. The second is to compute quartiles and discard outliers. This removes all but the most sophisticated spoofing, and it's good practice anyway for weeding out ill-behaved browsers and filtering out things like malware-detection tools that duplicate browser requests when they haven't seen the site before. There are many operations you can run to determine the validity of received data, though who knows how much of this analytics providers actually do. We've built our own internal analytics system (and expose it to customers) because existing solutions weren't robust enough for our needs. The biggest lesson has been that pushing past about 98% accuracy on delivered events actually lowered overall accuracy; doing the calculations on the backend was more reliable, but it requires specific knowledge of the types of events involved.
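As a rough sketch of the two filters described above, assuming a minimal event shape (session_id, name, ts) and a 60-second window that are purely illustrative rather than anything from the parent's system:

    <?php
    // Sketch: drop duplicate events seen within a time window, then discard
    // statistical outliers using the interquartile range (IQR). The event
    // fields and the 60-second window are assumptions for illustration.

    function deduplicate(array $events, int $windowSeconds = 60): array {
        $lastSeen = [];
        $kept = [];
        foreach ($events as $e) {
            $key = $e['session_id'] . '|' . $e['name'];
            if (isset($lastSeen[$key]) && ($e['ts'] - $lastSeen[$key]) < $windowSeconds) {
                continue; // same event from the same session too soon: treat as a duplicate
            }
            $lastSeen[$key] = $e['ts'];
            $kept[] = $e;
        }
        return $kept;
    }

    function removeOutliers(array $values): array {
        sort($values);
        $n = count($values);
        if ($n < 4) {
            return $values; // too few samples to estimate quartiles
        }
        // Approximate quartile positions; good enough for a filtering pass.
        $q1  = $values[(int) floor($n * 0.25)];
        $q3  = $values[(int) floor($n * 0.75)];
        $iqr = $q3 - $q1;
        $low  = $q1 - 1.5 * $iqr;
        $high = $q3 + 1.5 * $iqr;
        return array_values(array_filter($values, function ($v) use ($low, $high) {
            return $v >= $low && $v <= $high;
        }));
    }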
First, that's not a "digital signature", it's a MAC. It's the secret-suffix SHA1 MAC, to be precise.<p>Second, the secret-suffix SHA1 MAC isn't secure. Its insecurity is the reason we have HMAC.<p>This seems to me to be the kind of thing you'd want to get right if the whole value proposition of your solution was "verifying URLs with cryptography".
1. Considering most client-side analytics are keyed on IP address, you would require a large number of IPs.<p>2. It should not be terribly hard to filter out known open proxies or sessions with a specific nefarious pattern.<p>Overall, I think this post addresses a problem that doesn't quite exist yet; and if/when it does, it can be addressed in many ways.
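A minimal sketch of the kind of filtering the parent describes, assuming you maintain a blocklist of known open-proxy IPs in a text file (the filename and one-IP-per-line format are made up for illustration):

    <?php
    // Sketch only: reject beacons from known open proxies. The blocklist
    // file and its format are assumptions for illustration.
    $blocklist = array_flip(array_map('trim', file('open_proxies.txt')));
    $clientIp  = $_SERVER['REMOTE_ADDR'];

    if (isset($blocklist[$clientIp])) {
        http_response_code(204); // respond silently but don't record the hit
        exit;
    }
    // ...otherwise record the analytics event as usual.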
We noticed this problem at Yahoo! (I worked on web performance analytics). Approximately 2% of our beacons were "fake" (note, that's 2% of 200 million daily). There are two reasons for fake beacons.<p>1. (Most common) many small sites seem to really like the design of various Yahoo! pages, so they copy the code verbatim and change the content, but they leave the beaconing code in there, so you end up with fake beacons.<p>2. (Less common) individuals trying to break the system. We would see various patterns including XSS attempts in the beacon variables, and also in the user agent string. We'd see absurd values (e.g. a load time of 1 week, or 20ms, or -3s, or a bandwidth of 4Tbps).<p>It's completely possible to stop all fake requests, provided you have control over the web servers that serve pages as well as the servers that receive beacons. It's costly though: you need to not just sign part of the request, but also add a nonce to ensure that the request came from a server you control (to avoid replays). Also throw in rate limiting for added effect (hey, if you're random sampling, then randomly dropping beacons works in your favour ;)).<p>It doesn't stop there though; post-processing and statistical analysis of the data can take you further.<p>It gets harder when you're a service provider offering analytics to customers whose web servers you do not have access to or control over.<p>At my new startup (lognormal.com) we try to mitigate the effect of fake beacons as best we can.
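A rough sketch of the nonce-plus-signature idea, assuming the page server and beacon server share a secret; the parameter names, the APCu nonce store, and the five-minute TTL are all assumptions for illustration, not Yahoo!'s or lognormal's actual implementation:

    <?php
    // Sketch: the page server issues a nonce and signs it; the beacon server
    // verifies the signature and rejects replays. Secret, APCu store, and
    // TTLs are assumptions.
    const SECRET = 'shared-between-page-and-beacon-servers';

    // On the server that renders the page:
    function issueBeaconToken(): array {
        $nonce = bin2hex(random_bytes(16));
        $ts    = time();
        $sig   = hash_hmac('sha256', "$nonce|$ts", SECRET);
        return ['nonce' => $nonce, 'ts' => $ts, 'sig' => $sig];
    }

    // On the server that receives the beacon:
    function verifyBeacon(string $nonce, int $ts, string $sig): bool {
        if (abs(time() - $ts) > 300) {
            return false; // too old (or from the future): likely forged or replayed
        }
        $expected = hash_hmac('sha256', "$nonce|$ts", SECRET);
        if (!hash_equals($expected, $sig)) {
            return false;
        }
        // Reject replays: each nonce may only be used once within its lifetime.
        if (!apcu_add("beacon_nonce_$nonce", 1, 300)) {
            return false;
        }
        return true;
    }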
Well... I can see that being a problem for maybe 0.5% of businesses. I think he is overthinking this; most businesses do not need that kind of protection.<p>There are better ways to "hack" a company than spoofing their website analytics, lol. People who have that large a number of IPs have better (worse) things to do than that.<p>Also, how the f would you know they are A/B testing something?
Rather than signing requests for the (largish) javascript file (which would benefit most from being cached), it would make more sense for the signed-timestamp key to be passed as one parameter via the image grab. Or am I missing something?
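Roughly what that could look like, reusing the article's $ts/$r/$ds parameter names but with an illustrative HMAC in place of whatever digest the article actually computes; the pixel URL and secret are made up:

    <?php
    // Sketch of the parent's suggestion: leave analytics.js cacheable and
    // move the signed timestamp onto the beacon image request instead.
    $ts = time();
    $r  = mt_rand();
    $ds = hash_hmac('sha1', "$ts$r", 'server-side-secret'); // placeholder secret

    // Static, cacheable script tag: no per-request parameters here.
    echo '<script src="http://example.com/analytics.js"></script>';

    // The per-request signature travels on the image beacon instead.
    echo "<img src=\"http://example.com/pixel.gif?ts=$ts&r=$r&ds=$ds\" width=\"1\" height=\"1\">";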
Totally off topic, but there is a bug in the PHP code example:<p>echo "<script src=\"http://example.com/analytics.js?ts=$ts&r=$r&ds=$ts\"></script>";<p>should be:<p>echo "<script src=\"http://example.com/analytics.js?ts=$ts&r=$r&ds=$ds\"></script>";
In before someone calls this a solution waiting for a... oh, too late. It <i>is</i> a problem. However, signing resources means no HTTP caching of the most expensive resource we generate, which is not practical where I work. I guess the cache could be programmed to do the signing.<p>There are trade-offs, just like with every other CAPTCHA-class problem out there. Isn't that what you're after: an automated human detector?
This is a solution looking for a problem.<p>I know my Google Analytics numbers aren't 100% correct, but I don't think people are spoofing them. The differences come more from people who click through faster than GA can load (easily possible for those still on 56k), or who have "privacy blockers" in their ad blocker that remove GA altogether.