Wouldn't it be simpler to use VirtualHost so you only respond with content to requests for your domain?

Then set it up so requests without the domain name get a 301 redirect to the canonical URL.
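A minimal sketch of that setup for Apache, assuming example.com stands in for the real domain (the vhost names and paths are placeholders, not from the article):

    # Catch-all default vhost: handles any Host header that doesn't match
    # a configured name (EC2 public DNS names, the raw IP, etc.) and sends
    # a permanent redirect to the canonical hostname, preserving the path.
    <VirtualHost *:80>
        ServerName catchall.invalid
        Redirect 301 / http://www.example.com/
    </VirtualHost>

    # The real site only answers for its own names.
    <VirtualHost *:80>
        ServerName www.example.com
        ServerAlias example.com
        DocumentRoot /var/www/example
    </VirtualHost>

Apache treats the first vhost defined for an address as the default, so the catch-all goes first; only requests that actually name your domain reach the second vhost.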
I don't quite understand why this article doesn't recommend using <link rel="canonical" href="..."> as described at http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html (and, for the cross-domain case, http://googlewebmastercentral.blogspot.com/2009/12/handling-legitimate-cross-domain.html).

Such an easy solution to this problem.
Sorry, but the author is disqualified by this sentence:

"Now there were no external links to these AWS subdomains but, being a domain registrar, Google was notified of the new DNS entries and went ahead and indexed loads of pages."
"Now there were no external links to these AWS subdomains but, being a domain registrar, Google was notified of the new DNS entries and went ahead and indexed loads of pages"<p>Domain registrars wouldn't be notified of new RR's inside a second-level domain - that would be pointless.<p>I can't see any way they would ever index a URL that used a dns RR that was brand new - I'd hazard a guess that either the URL was used previously within the cloud and published somewhere, or it was set up as a CNAME in your own DNS, or your main webserver returned it as a response to a googlebot in some fashion at some point.
I think we would all be better off just using an Elastic IP address, and not using the dynamic address for public websites.

Also, the same problem applies to normal servers where the webserver is configured to serve the website for the bare IP address, something like:

http://174.132.225.106/

which Google has also picked up:

http://www.google.dk/search?q=site:174.132.225.106
Every website should have a similar redirect rule in there somewhere (I implement it in PHP). If someone hits yoursite.com, you probably want to redirect them to www.yoursite.com. I whitelist my domains such that if someone goes to anything that points to my server and isn't a valid subdomain, they get redirected to www.
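Roughly what such a check looks like in PHP (a sketch; the hostnames are placeholders, not the commenter's actual code):

    <?php
    // Hostnames allowed to serve content directly.
    $valid_hosts = array('www.yoursite.com', 'blog.yoursite.com');

    $host = isset($_SERVER['HTTP_HOST']) ? strtolower($_SERVER['HTTP_HOST']) : '';

    if (!in_array($host, $valid_hosts, true)) {
        // Anything else pointing at this server gets a permanent
        // redirect to the canonical www host, keeping the request path.
        header('Location: http://www.yoursite.com' . $_SERVER['REQUEST_URI'], true, 301);
        exit;
    }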
If accessing your web server via .amazonaws.com does not make sense for you, why not just block (whether 403 or 404) all HTTP requests with a Host: *.amazonaws.com header, rather than messing around with rewrites and robots.txt?
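On Apache 2.4, for instance, something along these lines would refuse such requests with a 403 (a sketch, assuming the <If> expression syntax is available):

    # Deny any request whose Host header ends in .amazonaws.com
    # (e.g. the EC2 public DNS name), so those URLs never serve content.
    <If "%{HTTP_HOST} =~ /\.amazonaws\.com$/">
        Require all denied
    </If>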