I'm convinced that Google has several Googlebots, deployed depending on how popular a site is.<p>That is, new and low-traffic sites are crawled by less intelligent bots, and as the site gets more visitors or better rankings, more complicated and resource-intensive bots are deployed.<p>How this might work with the most popular sites out there, the Amazons and Wikipedias of this world - I'm not so sure about that. If I were in charge, I'd be tempted to have customised bots and ranking weights for each of these exceptional sites.<p>Sadly the chances of getting a real answer on this in my lifetime are close to zero.
Another, faster way to see what JavaScript Google can crawl on your website is Google Search Console (previously known as Google Webmaster Tools). It has a Fetch as Google button that allows you to enter a URL on a site you own and see a visual rendering of how Google's crawlers see your page. It even gives you a side-by-side comparison of what the crawler sees vs. what a user sees.
My pet theory is that Google actually developed Chrome as a web crawler, and that the consumer release was to ensure that Google would always be able to crawl pages (since sites would always want to work properly with Chrome).<p>It also explains why they effectively killed Flash and Java applets. They were competing technologies that weren't owned by Google and weren't crawlable. If they had taken off, Google's position as top search engine could have been in danger.
Google executes JS, but maybe not on every website. If you have a JS error reporting tool on such a site, you can get reports from Google IP addresses. I first saw them maybe 4 or 5 years ago.<p>Executing JS everywhere would require a lot of CPU time, and I think Google prefers not to do that when possible. And indexing a JS app is a very complicated task anyway (it is difficult for a robot to even find navigation elements if they are implemented as divs with onclick handlers instead of links), so you'd better use sitemaps to make sure the bot can find content.<p>And I don't think it is necessary to index rich apps. It makes no sense to index a ticket search app (the data becomes outdated too fast) or an online spreadsheet editor. Just make indexable pages as server-rendered HTML pages and put their URLs into a sitemap.<p>Also, Google looks for strings in JS code that look like URLs (e.g. var url = '/some/page') and crawls them later.
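A crude sketch of what that URL discovery might look like; the regex and the function name are my own guesses at the kind of heuristic a crawler could use, not anything Google has published:

```python
import re

# Match quoted string literals that look like absolute paths or full URLs.
# This pattern is an illustrative guess, not Google's actual heuristic.
URL_LIKE = re.compile(r"""["'](https?://[^"']+|/[A-Za-z0-9_\-./]+)["']""")

def extract_candidate_urls(js_source):
    """Return URL-looking string literals found in JavaScript source."""
    return URL_LIKE.findall(js_source)
```

Running it on the comment's example, `extract_candidate_urls("var url = '/some/page';")` yields `['/some/page']`, which the crawler could then queue for a later fetch.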
Google's announcement when they started parsing javascript: <a href="https://webmasters.googleblog.com/2014/05/understanding-web-pages-better.html" rel="nofollow">https://webmasters.googleblog.com/2014/05/understanding-web-...</a>
This is a subject that really irks the engineering side of me. It's utterly ridiculous that engineering and efficiency decisions are so deeply affected by whether or not the largest search engine will properly index your content.<p>Why is it that Google doesn't get flak for not discovering content that's engineered to send the absolute minimum over the wire, cache intelligently in localStorage and IndexedDB, and scale well by distributing the appropriate amount of rendering work to the client agent? Why can't I expose a (JSON/)REST-API-to-deep-link mapping and have Google just crawl my JSON data and understand (perhaps verifying programmatically some percent of the time) that the links they show in search will deep-link appropriately to the structured JSON content they crawled?<p>It's such a waste of talent and resources to force server-side rendering. There's obviously the resource cost of transmitting more repetitive content over the wire, and of requiring servers to do work that the client could do. (Yes, even with compression this will still be a higher cost, because more repeated sequences reduce the value of variable-length encoding.) But more than that, what bothers me is that there's this false truth that server-side rendering is a requirement for modern architectures, which must result in hundreds of thousands of wasted engineering hours trying to enable the idea of server-side <i>and</i> client-side rendering with the same code.<p>This is not about time-to-first-byte either. Yes, the user-perceived latency matters, but the idea that server rendering even solves this problem is again utterly false. Sure, the time to the very first byte ever may be faster, but that's not a winning long-term strategy unless you never expect your client to request the same content twice (or come back to your site at all).
When properly cached and synchronized, the client-side-only app has many orders of magnitude faster TTFB, because the content is coming from disk or even memory and can be shown immediately. The only thing left to do is ask the server "what's new since my last timestamp?"<p>All of these benefits seem to be completely disregarded 99% of the time because the golden "SEO" handcuffs are already on. I really hope we can get away from this mindset as a community and instead let the better-engineered sites with the best and fastest UX <i>over time</i> start driving search engine technology, rather than the other way around.
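The "what's new since my last timestamp?" pattern can be sketched roughly like this; the delta shape and field names are made up for illustration (in a browser the store would be IndexedDB/localStorage rather than a dict):

```python
import time

class DeltaCache:
    """Client-side cache that only asks the server for changes
    since the last successful sync, instead of re-fetching everything."""

    def __init__(self):
        self.items = {}      # id -> record; stands in for IndexedDB/localStorage
        self.last_sync = 0   # timestamp of the last successful sync

    def apply_delta(self, delta, now=None):
        """Merge a server delta of the form {'updated': [...], 'deleted': [...]}."""
        for record in delta.get("updated", []):
            self.items[record["id"]] = record
        for item_id in delta.get("deleted", []):
            self.items.pop(item_id, None)
        self.last_sync = now if now is not None else time.time()

    def sync_request(self):
        """The (hypothetical) request body we'd send: only ask for changes."""
        return {"since": self.last_sync}
```

After the first full load, every subsequent request is just the tiny `{"since": ...}` query, which is why the repeat-visit TTFB can be so much better than re-rendering everything on the server.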
It has now also executed AJAX:<p><a href="https://www.google.nl/search?q=site%3Adoesgoogleexecutejavascript.com+yes" rel="nofollow">https://www.google.nl/search?q=site%3Adoesgoogleexecutejavas...</a>
Good, uncomplicated article.<p>If you can get Google servers to execute JavaScript, that sounds like a possible attack vector. It's likely that Google runs these in a proprietary, feature-sparse interpreter.<p>The lack of AJAX would make it difficult to leak information about the black-box interpreter.
For a more in-depth look at how Google treats JS, watch this talk <a href="https://youtu.be/JlP5rBynK3E" rel="nofollow">https://youtu.be/JlP5rBynK3E</a> by Google's John Müller at an Angular conference.
While Google is certainly the main search engine most people use, isn't it also important, to some extent, what other engines such as Bing, Yandex, Baidu, etc. do?<p>If you have a professional website, you want it to be found by these other engines too. Until they also support JavaScript, you may end up with a hybrid SEO architecture anyway, which means nothing was gained.
I've checked a site I know, which uses nothing but Angular/JS on the frontend, against PageSpeed Insights [1], and it fully failed that test - no results visible. Also, the whole site is not indexed, only the root URL itself. No page snippet preview, nothing.<p>[1] <a href="https://developers.google.com/speed/pagespeed/insights/" rel="nofollow">https://developers.google.com/speed/pagespeed/insights/</a>
Nicely done!
I have been writing crawlers for a while now, and executing JavaScript is very expensive and slow, even for Google. When I crawl the web, I usually run JavaScript in a headless browser only on top-priority sites.
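In a crawler, that decision can be as simple as a priority gate; the threshold and scoring here are arbitrary placeholders, just to show the shape of the trade-off:

```python
def fetch_page(url, priority, render_threshold=0.8):
    """Decide between a cheap HTTP fetch and an expensive headless render.

    priority is a score in [0, 1]; only top-priority pages get the
    headless browser, since JS execution costs far more CPU and
    wall-clock time per page than a plain download."""
    if priority >= render_threshold:
        return ("headless_render", url)  # e.g. drive headless Chromium
    return ("plain_http_get", url)       # just download the raw HTML
```

Everything below the threshold gets the raw HTML only, which is consistent with the observation elsewhere in this thread that smaller sites see their JS executed later, or not at all.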
Best post I've seen on Hacker News. I've always played it safe and never assumed Google would index content displayed dynamically with JavaScript, but now I know!
If I click the link from the article that leads to the webcache version, I get "yes, but embedded only".<p>If I click the link within <i>that</i> page that leads to the exact same webcache url, I get "yes, embedded and external but no ajax".<p>If I google the site, the preview text is the non-changing portion of the text only ("This is an experiment to...") - not even a "No".<p>I think Google is just trolling us.
Beyond the theory, the talks, and the articles: I have multiple 100% JS-rendered pages (blank page with no JavaScript).<p>Google is crawling and indexing them with zero issues.
I wonder what Google does to avoid indexing too many pages. There are a fair number of SPAs, and software like shopping carts, that have a large number of checkboxes, pulldowns, knobs, dials, etc. that change both the content and the current URL query params.
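One plausible answer is URL canonicalization: collapse every URL to a canonical form before adding it to the crawl frontier, keeping only the query params that actually change content. This is a speculative sketch, and the param allow-list is invented for illustration:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical allow-list: only params that actually change the content.
SIGNIFICANT_PARAMS = {"page", "category"}

def canonicalize(url):
    """Collapse a URL to a canonical form by dropping query params
    that only tweak presentation (sort order, view mode, UI knobs),
    so the crawl frontier doesn't explode combinatorially."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k in SIGNIFICANT_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(sorted(kept)), ""))
```

With this, `?sort=asc&page=2&view=grid` and `?page=2&view=list` both collapse to `?page=2`, so the crawler visits each distinct piece of content once. (Site owners can signal the same thing with rel=canonical.)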
I'm loading i18n before showing any content, and was afraid Google wouldn't index the content, but it didn't have any problem doing that.
This is a great way to test a hypothesis, and a good experiment.<p>I'll mention that there is a rule that was added a few years back, in Backbone.js days, that URLs with /#!route anchors will enable (read: force) AJAX requests and JavaScript from the spider. It still remains a helpful way to force caching/indexing of JavaScript-only pages in Google.
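For context, under Google's (now-deprecated) AJAX crawling scheme, a hashbang URL was translated into a `_escaped_fragment_` request that the server could answer with a pre-rendered snapshot. A minimal sketch of that mapping:

```python
from urllib.parse import quote

def escaped_fragment_url(url):
    """Map a #! (hashbang) URL to the _escaped_fragment_ form that a
    crawler following Google's deprecated AJAX crawling scheme would
    actually request from the server."""
    if "#!" not in url:
        return url
    base, fragment = url.split("#!", 1)
    sep = "&" if "?" in base else "?"
    return base + sep + "_escaped_fragment_=" + quote(fragment, safe="=&")
```

So `http://example.com/#!route` becomes `http://example.com/?_escaped_fragment_=route`, and the server is expected to return the fully rendered HTML for that route.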
One random thought... Google goes to SomeWebsite.com. The site has only enough HTML to load a big ol' JavaScript app, which Google slowly crawls. Well, that JS app makes a bunch of AJAX calls. There's no reason I can think of that would prevent Google from remembering which AJAX calls were made, and then just crawling the URLs for those calls on subsequent visits. Why load SomeWebsite.com's JavaScript every time you want to index the site, when you can just remember that the JS calls SomeWebsite.com/some-endpoint.json? Sucking the JSON out of an endpoint might even be faster than indexing regular HTML. Haven't written a lot of crawlers, so I'm mostly guessing here.
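As a purely speculative sketch of that idea: record the XHR URLs observed during one full (expensive) JS render, then plan later visits against the JSON endpoints directly:

```python
class EndpointMemory:
    """Remember which AJAX endpoints a page called during a full JS
    render, so revisits can fetch the JSON directly and skip the
    render. Speculative sketch, not a known crawler design."""

    def __init__(self):
        self.endpoints = {}  # page_url -> set of observed XHR URLs

    def record(self, page_url, xhr_url):
        """Called once per XHR observed while rendering page_url."""
        self.endpoints.setdefault(page_url, set()).add(xhr_url)

    def plan_revisit(self, page_url):
        """If we've seen this page's XHR calls before, crawl those
        instead of re-running the whole JavaScript app."""
        known = self.endpoints.get(page_url)
        if known:
            return ("fetch_json", sorted(known))
        return ("full_render", [page_url])
```

The catch, of course, is knowing how the JSON maps back to rendered content and deep links, which is the hard part the parent comment is hand-waving past.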
We've got a website - React + Babel + AJAX - and we monitor those AJAX requests because of bad scrapers. :)<p>And we constantly see Googlebot: at least 1k requests per day, with the Googlebot user agent, from the Google IP range.<p>So yes, Google does AJAX and does understand packed React as well.
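Since user-agent strings are trivially spoofed, a claimed Googlebot is worth verifying the way Google documents: reverse-DNS the IP, check the hostname's domain, then forward-resolve it back to the same IP. A sketch (the `resolver` parameter is my own addition so the logic can be tested without live DNS):

```python
import socket

def hostname_is_google(host):
    """Check that a reverse-DNS name belongs to Google's crawler domains."""
    return host.endswith((".googlebot.com", ".google.com"))

def is_real_googlebot(ip, resolver=socket):
    """Verify a claimed Googlebot: reverse-DNS the IP, check the
    domain, then forward-resolve the hostname and confirm it maps
    back to the same IP."""
    try:
        host = resolver.gethostbyaddr(ip)[0]
        return hostname_is_google(host) and ip in resolver.gethostbyname_ex(host)[2]
    except OSError:
        return False
```

A genuine hit resolves to something like `crawl-66-249-66-1.googlebot.com`; a scraper faking the user agent from its own IP fails the domain check.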
My experience is that content hidden behind JS will get indexed later, if at all, and it will be updated less often. Also, they will run JS for bigger sites first, and not so much for smaller sites.
I think one of the most frustrating things about indexing from Google is the complete lack of transparency. I understand that it helps Google slow down the arms race of search engines, but it also means that devs doing 100% banal work need to sift through mountains of rumors and spin up sites to test assumptions.<p>I have literally heard every combination of practices with regards to SEO and have no idea what is truly correct. Every source contradicts each other, Google employee statements contradict those, etc.
If you don't want to deal with the ambiguity of whether your AJAX will run or not, I'll shamelessly suggest <a href="https://www.prerender.cloud/" rel="nofollow">https://www.prerender.cloud/</a>, which is helping a few sites that couldn't get Google to execute their AJAX.
Here's an interesting experiment from a while ago: <a href="http://searchengineland.com/tested-googlebot-crawls-javascript-heres-learned-220157" rel="nofollow">http://searchengineland.com/tested-googlebot-crawls-javascri...</a><p>tl;dr: Google indexes JS-generated content.
My pet theory is that part of the anonymous usage data Chrome sends back is digested page contents that go into PageRank. And such browser-level digesting would be on rendered pages (after JavaScript execution).<p>I have no reason to believe it's true other than it's what I would do to distribute the job of crawling the web to my users if I were Google :-)
> Does Google execute JavaScript?<p>Yes.<p>There are sites that can't be loaded without JavaScript that are indexed fine by Google. The only explanation is that they run some JavaScript.