Google's Indexing JavaScript more than we thought

112 points by dohertyjf over 13 years ago

9 comments

bowyakka over 13 years ago
I have been saying this for years, but most people have refused to believe me.

Like most people, a long time ago I also held the belief that robots were just dumb scripts; however, I learnt that this is not the case when I had to trap said robots for a previous employer.

At the time I was working for one of the many online travel sites. Most people are probably not aware that there is quite a bit of money to be made in knowing airline costs. The thing is, getting this information is not actually cheap: most of the GDS (Global Distribution System) providers are big mainframe shops that require all sorts of cunning to emulate a green-screen session for the purposes of booking a flight.

The availability search (I forget the exact codename for this) is done first. This search gives you the potential flights (after working through the byzantine rules of travel) and a costing or fare quote for your trip. This information is reliable about ~95% of the time. Each search costs a small amount against a pre-determined budget, with a slightly higher rate over the limit (kinda like how commercial bandwidth is sold); if my memory serves it was 0.001 euro cents per search.

During the booking phase (known by the GDS code FXP) the price is actually settled. The booking is a weird form of two-phase commit where first you get a concrete fare quote. This quote "ring-fences" the fare - essentially ensuring that the seat cannot be booked by anyone else for roughly 15 minutes. In practice there are a load more technicalities around this part of the system, and as such it is possible for double bookings and over bookings to happen, but let's keep it simple for the sake of this story. These prebookings are roughly 99.5% accurate on price but cost something like 0.75 cents (there is a _lot_ that happens when you start booking a flight).

So with that in mind, if you are in the business of trying to resell flights it can be to your advantage to avoid the GDS costs and scrape one of the online travel companies. You also want the prebook version of the fare, as it's more likely to be accurate; the travel sites mind less about people scraping the lookup search.

Thus begins the saga of our bot elimination projects. First we banned all IPs that hammered the site thousands of times; this is easy and kills 45% of the bots dead. Next we set up a proper robots.txt and ways to discourage Googlebot and the more "honest" robots, which gets us up to dealing with 80% of the bots. Next we take out China, Russia, etc. by IP address; we find that these often have the most fraudulent bookings anyhow, so no big loss. That takes us up to 90% of the bots.

Killing the last 10% was never done. Every time we tried something new (CAPTCHAs, JS nonce values, weird redirect patterns, bot traps and pixels, user-agent sniffs, etc.) the bots seemed to immediately work around it. I remember watching the access logs where we had one IP that never, ever bought products, just looked for really expensive flights. I distinctly remember seeing it hit a bot trap, notice the page was bad, and then out of nowhere the same user session appears on a brand new IP address with a new user agent, one that essentially said "Netscape Navigator 4.0 on X11" (this was the Firefox 1-2 days, so seeing Unix Netscape Navigator was a rare sight). It was clear the bot went and executed the JavaScript nonce with a full browser, and then went back to fast scraping.

A few years later, at the same company but for very different reasons, I wrote a tool to replace a product known as Gomez with an in-house system. The idea of Gomez and similar products like Site Confidence is to run your website as the user sees it, from random IPs across the world, and then report on it. I wrote this tool with XULRunner, which is a stripped-down version of Firefox. Now, admittedly I had the insider knowledge of where the bot traps were, but I was amazed at how easy it was to sidestep all of our bot detection in only a few days. I also had unit tests for the system that ran it on sites like Amazon and Google, and even there it was shocking how easily I was able to sidestep bot traps (I am sure they have got better since, but it surprised me how easy it was).

I am not saying all the bots are smart, but my mantra since then has been that "if there is value for the bots to be smart, they can get very smart". I guess it's all about the cost payoff for those writing the bots: is it a good idea to run JS all the time as a spider? Probably not. Does it make sense to save yourself 0.75 cents of cost per search? Very much so!
Comment #3305970 not loaded
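As a concrete illustration of the "JS nonce" trick mentioned above, here is a minimal sketch of such a check. The meta tag name, cookie name, and hashing step are all hypothetical, and a real site would verify the proof server-side before serving fare results; a dumb scraper that never executes the script never gets the cookie, while a bot driving a full browser, as described above, sails straight through.

    // Hypothetical JS-nonce challenge: only clients that actually execute
    // JavaScript end up with the proof cookie the fare-search endpoint requires.
    (function () {
      var tag = document.querySelector('meta[name="challenge-nonce"]');
      if (!tag) return;                 // nonce embedded by the server per page view
      var nonce = tag.content;

      // Trivial "work" over the nonce so the answer can't be copied from the HTML.
      var answer = 0;
      for (var i = 0; i < nonce.length; i++) {
        answer = (answer * 31 + nonce.charCodeAt(i)) % 65521;
      }

      // The server checks this cookie (and that the nonce is fresh) on every search.
      document.cookie = 'js_proof=' + nonce + ':' + answer + '; path=/';
    })();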
giberson over 13 years ago
It occurs to me that if GoogleBot is executing client JavaScript, you could take advantage of Google's resources for computational tasks.

For instance, let me introduce you to SETI@GoogleBot. SETI@GoogleBot is much like SETI@home except it takes advantage of GoogleBot's recently discovered capabilities. Including the SETI@GoogleBot script in your web pages will cause the page (after the load event) to fetch a chunk of data from the SETI servers via an AJAX request and process that data in JavaScript. Once the data has been processed, it is posted back to the SETI servers (via another AJAX request) and the cycle repeats, enabling you, for the small cost of a page load, to have GoogleBot process your SETI data and enhance your SETI@home score.

Obviously, this isn't a new idea (using page loads to process data via JavaScript), but it is an interesting application to exploit GoogleBot's likely vast resources.
Comment #3305381 not loaded
Comment #3305964 not loaded
Comment #3307519 not loaded
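A rough sketch of what the comment above describes, written with modern fetch-based JavaScript (the original would have used XMLHttpRequest); the endpoints and the payload shape are invented for illustration. In practice a crawler would cap script execution time, so each page load could only contribute a small slice of work.

    // Toy version of the idea: after the page loads, repeatedly pull a work unit,
    // crunch it in JavaScript, and post the result back.
    window.addEventListener('load', async function () {
      while (true) {
        const res = await fetch('https://seti.example.org/work-unit');
        const unit = await res.json();   // assumed shape: { id, samples: [...] }
        // Stand-in "processing": sum of squares over the samples.
        const score = unit.samples.reduce((acc, x) => acc + x * x, 0);
        await fetch('https://seti.example.org/result', {
          method: 'POST',
          headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify({ id: unit.id, score: score })
        });
      }
    });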
heyitsnick over 13 years ago
Is there any example of a site having their dynamically generated* Disqus comments indexed by Google? Disqus is probably one of the most common forms of AJAX-generated content on the web, so if Googlebot really were actively indexing dynamic content like this, I would expect to see Disqus supported.

* Disqus has an API that lets you display comments server-side, so some Disqus implementations - I think Mashable is one - will have comments indexed without the aid of JavaScript.
Comment #3307205 not loaded
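The footnote's server-side approach looks roughly like this: a Node sketch that pulls a thread's comments over the Disqus HTTP API so they can be baked into the server-rendered page. The endpoint, parameters, and field names here are recalled from the Disqus API and should be treated as assumptions.

    // Fetch a thread's comments over HTTP and expose them for server-side rendering,
    // so crawlers see the comments without executing any JavaScript.
    const https = require('https');

    function fetchComments(threadId, apiKey) {
      const url = 'https://disqus.com/api/3.0/threads/listPosts.json' +
        '?api_key=' + encodeURIComponent(apiKey) +
        '&thread=' + encodeURIComponent(threadId);
      return new Promise((resolve, reject) => {
        https.get(url, (res) => {
          let body = '';
          res.on('data', (chunk) => { body += chunk; });
          res.on('end', () => resolve(JSON.parse(body).response || []));
        }).on('error', reject);
      });
    }

    // Usage: interpolate the comment text into the server-generated HTML.
    // fetchComments('123456', process.env.DISQUS_API_KEY)
    //   .then((posts) => posts.forEach((p) => console.log(p.raw_message)));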
eli over 13 years ago
I don't buy this argument. Wanting to have a more complete rendering engine for their crawler might have been a factor in designing Chrome, but I can't imagine it was in any way the driving force. The costs of developing a browser that runs well on millions of different computers and configurations are far beyond what it would take to make a really great headless version of WebKit for your crawler.
Comment #3306151 not loaded
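As an aside, "a really great headless version of WebKit for your crawler" is essentially what later tooling made cheap. A minimal sketch with Puppeteer (which long post-dates this thread) capturing the post-JavaScript DOM the way a script-executing crawler would:

    // Render a page in a headless browser and capture the DOM after scripts run,
    // i.e. what a JavaScript-executing crawler would actually index.
    const puppeteer = require('puppeteer');

    async function renderForIndexing(url) {
      const browser = await puppeteer.launch();
      try {
        const page = await browser.newPage();
        await page.goto(url, { waitUntil: 'networkidle0' }); // wait for AJAX to settle
        return await page.content();                         // serialized post-JS HTML
      } finally {
        await browser.close();
      }
    }

    // renderForIndexing('https://example.com').then((html) => console.log(html.length));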
wiradikusuma over 13 years ago
I wonder if it also means we don't need to implement _escaped_fragment_ anymore: http://code.google.com/web/ajaxcrawling/docs/getting-started.html
Comment #3305090 not loaded
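For context, the scheme behind that link maps a hash-bang URL such as http://example.com/#!/item/42 to a crawler request for http://example.com/?_escaped_fragment_=/item/42, which the server is expected to answer with a pre-rendered HTML snapshot. A minimal sketch, assuming Express and a stubbed renderSnapshot() helper (both are illustrative, not part of the scheme itself):

    const express = require('express');
    const app = express();

    // Stand-in for whatever produces a static HTML snapshot of the AJAX state.
    function renderSnapshot(fragment) {
      return '<html><body>Snapshot for state: ' + fragment + '</body></html>';
    }

    // Crawler requests carry the escaped fragment as a query parameter;
    // normal visitors fall through to the client-side JavaScript app.
    app.get('*', (req, res, next) => {
      const fragment = req.query._escaped_fragment_;
      if (fragment === undefined) return next();
      res.send(renderSnapshot(fragment));
    });

    app.listen(3000);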
almost over 13 years ago
Why the assumption that "GoogleBot" is a single thing? Of course we know that Google has a headless browser running - we see its output in the Instant Previews - but I'm sure they still do plenty of standard crawling (and probably some halfway-partial JS execution and/or heuristics too).
tripzilch over 13 years ago
> My personal favorite example of this is Google Translate, which is one of the most accurate machine translating tools on the planet. Google almost sacked it because it was not profitable, and _had it not been for public outcry_ we may have lost access to this technology altogether.

I kind of missed this "public outcry" - when did it happen? And if Google listens to public outcry, why did we lose Google Code Search?
Comment #3305353 not loaded
nyellin over 13 years ago
You might be able to check what Googlebot executes by adding JavaScript to your site and checking the thumbnail.

EDIT: Removed comment about the bot's user-agent. The article links to a Google FAQ which answers the question.
Comment #3307199 not loaded
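One way to run that experiment: serve a page whose static HTML never contains a marker string, inject the marker with JavaScript after load, and then check whether it shows up in the Instant Preview thumbnail (or in a search for the string). The element id and marker text below are arbitrary.

    // If this text appears in Google's rendered preview of the page, the crawler
    // executed the script; it is never present in the HTML the server sends.
    window.addEventListener('load', function () {
      var marker = document.createElement('div');
      marker.id = 'js-execution-marker';
      marker.textContent = 'zqxjk-rendered-by-javascript'; // unique string to search for
      document.body.appendChild(marker);
    });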
jwatte over 13 years ago
I think they have it backwards. What if Chrome is GoogleBot? You get quality measurement on pages based on user behavior on the page. Crowdsourcing beats crawling!