Hey HN folks,

A few months ago I received a manual action penalty from Google because they detected spam pages on our domain. The problem is that when people search on our site they are directed to a page like the following:

https://$domain/search?query=$QUERY

Some users (most likely bots) are generating huge numbers of spam searches on our search page, and somehow Google is indexing these even though there are no inbound links to these pages (at least I cannot find any).

To resolve this I did the following:

* On our search page I set the following header: X-Robots-Tag: noindex (based on the documentation here: https://developers.google.com/search/reference/robots_meta_tag).
* Submitted URLs to be dropped from the Google index via the Webmaster console
* Submitted 3 reconsideration requests to Google to avoid the penalties

In theory this should stop all search pages from being indexed (as they all carry the noindex header), and it has helped drop the number of indexed pages marked as spam by 99%. However, we still have a significant number of URLs marked as spam, so our site still has a penalty from Google.

Has anyone had this issue before? How can I stop these pages from being indexed when I have the noindex header set _and_, if you search for the spam URLs, there are no inbound links to them?

Any help appreciated folks!
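For readers who want to see the header approach concretely, here is a minimal sketch of one way to attach `X-Robots-Tag: noindex` to every response under the search path. It assumes a Python WSGI application, which the post does not state; `noindex_search`, `demo_app`, and `headers_for` are hypothetical names used only for illustration.

```python
from wsgiref.util import setup_testing_defaults


def noindex_search(app, prefix="/search"):
    """Wrap a WSGI app so responses under `prefix` carry X-Robots-Tag: noindex."""
    def wrapped(environ, start_response):
        # PATH_INFO excludes the query string, so /search?query=x matches too.
        path = environ.get("PATH_INFO", "")

        def patched_start_response(status, headers, exc_info=None):
            if path == prefix or path.startswith(prefix + "/"):
                # Drop any existing value so the header is not duplicated.
                headers = [(k, v) for k, v in headers
                           if k.lower() != "x-robots-tag"]
                headers.append(("X-Robots-Tag", "noindex"))
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)
    return wrapped


def demo_app(environ, start_response):
    """Stand-in for the real site: always returns a small HTML page."""
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [b"<html><body>results</body></html>"]


def headers_for(path):
    """Run the wrapped demo app once and return its response headers."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["headers"] = dict(headers)

    list(noindex_search(demo_app)(environ, start_response))
    return captured["headers"]
```

Calling `headers_for("/search")` shows the header on the search page, while `headers_for("/about")` leaves other pages untouched. Doing this in one wrapper rather than in each view is only a design convenience: it makes it harder to forget the header on a new search variant.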

13 comments

jacquesm, almost 7 years ago

Hilarious how Google thinks they are now in editorial control of your content to the point where you are on the hook for fixing their bugs. You're being treated as a wayward content provider, rather than that they should be happy to get the benefit of your content to index.

helij, almost 7 years ago

You need to add <meta name="robots" content="noindex, follow"> to the <head> section of all your search results pages. You want robots NOT to index the pages but still to follow the links on your search pages.

Create a clean sitemap.xml file and submit it to Search Console.

Another way is to just canonicalize all search results pages to your search page.

With Google and these things, time is involved. Once it's in the index it will take time to properly clean everything up. How was the traffic before this happened? Did the website rank for any decent keyword? Sometimes when this happens the smart thing to do is to just start from scratch with a new domain.

If you want more extensive help, email me.
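The two tag-based options in this comment can be sketched as small template helpers. This is only an illustration in Python, which the thread does not mention; `noindex_meta`, `canonical_link`, and `render_head` are hypothetical names, and `https://example.com/search` is a placeholder URL. The comment presents the two tags as alternatives, so the sketch emits one or the other rather than both.

```python
from html import escape


def noindex_meta():
    """Robots meta tag: keep the page out of the index but follow its links."""
    return '<meta name="robots" content="noindex, follow">'


def canonical_link(search_page_url):
    """Canonical tag pointing every results page at the bare search page."""
    return '<link rel="canonical" href="{}">'.format(
        escape(search_page_url, quote=True))


def render_head(title, extra_tag):
    """Build a minimal <head> containing the title and one extra tag."""
    return "<head>\n  <title>{}</title>\n  {}\n</head>".format(
        escape(title), extra_tag)
```

For example, `render_head("Results for shoes", noindex_meta())` produces a head section for a results page using the first option, and passing `canonical_link("https://example.com/search")` instead uses the second.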

dgranda, almost 7 years ago

Based on my experience:

A.- I would also add the "nofollow, noarchive" directives [1] to your X-Robots-Tag header:
- "nofollow" -> do not follow (i.e., crawl) any outgoing links on the page.
- "noarchive" -> prevents Google from showing the Cached link for a page.

B.- I would specify in Search Console (formerly Webmaster Console) how Google should handle the "query" parameter [2]

C.- Prevent those spam searches by blocking source IP addresses, User-Agents, combinations of both, etc.

Good luck!

[1] https://support.google.com/webmasters/answer/79812?hl=en
[2] https://www.google.com/webmasters/tools/crawl-url-parameters?hl=en&siteUrl=https://<domain>/
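Point C above could look roughly like the following. This is a sketch under assumptions: the blocklists are hypothetical placeholders (the network is a reserved documentation range, not a real spam source), `is_blocked` is an illustrative name, and in practice the values would come from your own access logs.

```python
import ipaddress

# Hypothetical blocklists; replace with values seen in your own logs.
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]  # documentation range
BLOCKED_AGENT_SUBSTRINGS = ["spambot"]  # matched case-insensitively


def is_blocked(remote_addr, user_agent):
    """Return True if the request matches a blocked network or User-Agent."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        addr = None  # unparseable address: fall back to the User-Agent check

    if addr is not None and any(addr in net for net in BLOCKED_NETWORKS):
        return True

    agent = (user_agent or "").lower()
    return any(fragment in agent for fragment in BLOCKED_AGENT_SUBSTRINGS)
```

A search handler could call `is_blocked(ip, ua)` before running the query and return an error response for matches. Both signals are easy for a determined bot to change, so this reduces noise rather than guaranteeing the spam stops.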

tangue, almost 7 years ago

You should use the canonical tag. Moz has a good page on how it works: https://moz.com/blog/canonical-url-tag-the-most-important-advancement-in-seo-practices-since-sitemaps

sebst, almost 7 years ago

You could also annotate your page: https://schema.org/SearchResultsPage

Edit: Maybe it is also worth annotating the search field (https://developers.google.com/search/docs/data-types/sitelinks-searchbox) so that Google can match it against your search results page.

Jaruzel, almost 7 years ago

Register for Google Webmaster tools. There's an option in there to exclude links that have dynamic parameters. You can define the parameters you want it to ignore.

itamarst, almost 7 years ago

Maybe also add a robots.txt? http://www.robotstxt.org/

eddflrs, almost 7 years ago

Adding <meta name="robots" content="noindex" /> to each page should work. Also as a heads up, having an entry in robots.txt to disallow is not enough since pages can still be indexed if they can be navigated from anywhere else on the web.

computator, almost 7 years ago

Can anyone answer a related question: Are you penalized for not running Google Analytics and/or Google Webmaster tools? In other words, if you have a clean website with no analytics whatsoever, is your ranking likely to be worse?

emilfihlman, almost 7 years ago

Heh, I ran into a similar issue previously: https://news.ycombinator.com/item?id=16302821

GoogleBot is broken.

detaro, almost 7 years ago

Are they still being newly added, or have they just not been purged from the Google index yet?

lgats, almost 7 years ago

Blocking the search function in robots.txt may help as well.

  User-agent: *
  Disallow: /search
  Disallow: /search
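To check what a rule like this does before deploying it, the standard-library `urllib.robotparser` can evaluate it locally. The sketch below assumes the single `Disallow: /search` rule from the comment; `can_crawl` and the `example.com` URLs are illustrative only.

```python
from urllib.robotparser import RobotFileParser

# The rule from the comment above, written one directive per line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def can_crawl(url, user_agent="Googlebot"):
    """Return True if ROBOTS_TXT allows `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)
```

With this rule, `can_crawl("https://example.com/search?query=spam")` is false and `can_crawl("https://example.com/about")` is true. One trade-off to keep in mind: a crawler that obeys the rule never fetches the search page, so it also never sees a noindex header or meta tag on it, which fits the warning elsewhere in the thread that a robots.txt entry alone is not enough.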

known, almost 7 years ago

You can restrict in .htaccess

Ask HN: How to stop Google indexing dynamic search pages?
