Ask HN: How to stop Google indexing dynamic search pages?

83 points by scalesolved, almost 7 years ago
Hey HN folks,

A few months ago I received a manual action penalty from Google after they detected spam pages on our domain. The problem is that when people search on our site, they are directed to a page with the following URL:

https://$domain/search?query=$QUERY

Some users (most likely bots) are generating huge numbers of spam searches on our search page, and somehow Google is indexing the resulting pages even though there are no inbound links to them (at least none that I can find).

To resolve this I did the following:

* On our search page I set the following header: X-Robots-Tag: noindex (based on the documentation here: https://developers.google.com/search/reference/robots_meta_tag).

* Submitted URLs to be dropped from the Google index via the Webmaster console

* Submitted 3 reconsideration requests to Google to get the penalty lifted

In theory this should stop all search pages from being indexed (as they all carry the noindex header), and it has helped drop the number of indexed pages marked as spam by 99%. However, we still have a significant number of URLs marked as spam, so our site still has a penalty from Google.

Has anyone had this issue before? How can I stop these pages from being indexed when I have the noindex header set _and_, if you search for the spam URLs, there are no inbound links to them?

Any help appreciated folks!
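
For illustration only, here is a minimal Python sketch of setting that header at the application layer. It is not the poster's actual setup: the toy WSGI app, the `noindex_search` wrapper and the `response_headers` helper are hypothetical names, and the `/search` prefix is taken from the URL pattern in the post.

```python
from wsgiref.util import setup_testing_defaults


def search_app(environ, start_response):
    """Toy WSGI app standing in for the site's real handlers (hypothetical)."""
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [b"<html><body>results</body></html>"]


def noindex_search(app, prefix="/search"):
    """Wrap a WSGI app so every response under `prefix` carries X-Robots-Tag: noindex."""
    def wrapped(environ, start_response):
        def patched_start(status, headers, exc_info=None):
            if environ.get("PATH_INFO", "").startswith(prefix):
                # Drop any existing value so the header is not duplicated.
                headers = [h for h in headers if h[0].lower() != "x-robots-tag"]
                headers.append(("X-Robots-Tag", "noindex"))
            return start_response(status, headers, exc_info)
        return app(environ, patched_start)
    return wrapped


def response_headers(app, path):
    """Call the app once for `path` and return its response headers as a dict."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured.update(headers)

    list(app(environ, start_response))
    return captured


app = noindex_search(search_app)
print(response_headers(app, "/search").get("X-Robots-Tag"))
print(response_headers(app, "/about").get("X-Robots-Tag"))
```

A helper like `response_headers` can be used to confirm that every response under /search carries the header, rather than only the ones checked by hand.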

13 comments

jacquesm, almost 7 years ago
Hilarious how Google thinks they are now in editorial control of your content, to the point where you are on the hook for fixing *their* bugs. You're being treated as a wayward content provider, when they should be happy to get the benefit of indexing your content.

helij, almost 7 years ago
You need to add <meta name="robots" content="noindex, follow"> to the <head> section of all your search results pages.

You want robots NOT to index the pages but still to follow the links on your search pages.

Create a clean sitemap.xml file and submit it to Search Console.

Another way is to canonicalize all search results pages to your search page.

With Google, these things take time. Once a page is in the index, it will take time to properly clean everything up. How was the traffic before this happened? Did the website rank for any decent keywords? Sometimes when this happens the smart thing to do is to just start from scratch with a new domain.

If you want more extensive help email me.
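
As a rough sketch of the meta tag suggestion above, the following Python renders a `<head>` that includes the robots tag only on search results pages. The `robots_meta` and `render_head` helpers are hypothetical, not part of any framework. The title is escaped because search pages often reflect the user's query, which is an assumption about the site rather than something stated in the thread.

```python
from html import escape


def robots_meta(index: bool, follow: bool) -> str:
    """Build a robots meta tag from two boolean choices."""
    directives = [
        "index" if index else "noindex",
        "follow" if follow else "nofollow",
    ]
    return '<meta name="robots" content="%s">' % escape(", ".join(directives), quote=True)


def render_head(title: str, is_search_results: bool) -> str:
    """Render a minimal <head>; search results pages get noindex, follow."""
    parts = ["<head>", "<title>%s</title>" % escape(title)]
    if is_search_results:
        parts.append(robots_meta(index=False, follow=True))
    parts.append("</head>")
    return "\n".join(parts)


print(render_head("Results for: cheap pills", is_search_results=True))
```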

dgranda, almost 7 years ago
Based on my experience:

A.- I would also add "nofollow, noarchive" tags [1] to your X-Robots-Tag header:

- "nofollow" -> do not follow (i.e., crawl) any outgoing links on the page.

- "noarchive" -> prevents Google from showing the Cached link for a page.

B.- I would specify in Search Console (formerly Webmaster Console) how Google should handle the "query" parameter [2]

C.- Prevent those spam searches by blocking source IP addresses, User-Agents, combinations of both, etc.

Good luck!

[1] https://support.google.com/webmasters/answer/79812?hl=en

[2] https://www.google.com/webmasters/tools/crawl-url-parameters?hl=en&siteUrl=https://<domain>/
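
Points A and C can be sketched in Python roughly as follows. Everything here is illustrative: the blocked network is a reserved documentation range used as a placeholder, the user-agent substring is made up, and `robots_header_value` and `is_blocked` are hypothetical helper names.

```python
import ipaddress

# Placeholder values only: 203.0.113.0/24 is a documentation range,
# and "spam-bot" is an invented user-agent fragment.
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]
BLOCKED_AGENT_SUBSTRINGS = ["spam-bot"]


def robots_header_value(directives=("noindex", "nofollow", "noarchive")):
    """Join directives into a single X-Robots-Tag header value (point A)."""
    return ", ".join(directives)


def is_blocked(remote_addr: str, user_agent: str) -> bool:
    """Return True if the client matches a blocked network or user agent (point C)."""
    addr = ipaddress.ip_address(remote_addr)
    if any(addr in net for net in BLOCKED_NETWORKS):
        return True
    agent = (user_agent or "").lower()
    return any(fragment in agent for fragment in BLOCKED_AGENT_SUBSTRINGS)


print("X-Robots-Tag:", robots_header_value())
print(is_blocked("203.0.113.7", "Mozilla/5.0"))
```

In practice this kind of filtering is often done at the web server or firewall rather than in application code. The sketch only shows the matching logic.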

tangue, almost 7 years ago
You should use the canonical tag. Moz has a good page on how it works.

https://moz.com/blog/canonical-url-tag-the-most-important-advancement-in-seo-practices-since-sitemaps
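
One way to read the canonical tag suggestion is to point every search results URL at the bare search page. A minimal Python sketch under that assumption (the `canonical_url` and `canonical_link` helpers are hypothetical, and `example.com` is a placeholder domain):

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url: str) -> str:
    """Strip the query string and fragment so all searches share one canonical URL."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def canonical_link(url: str) -> str:
    """Render the <link rel="canonical"> tag for the given request URL."""
    return '<link rel="canonical" href="%s">' % escape(canonical_url(url), quote=True)


print(canonical_link("https://example.com/search?query=cheap+pills"))
```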

sebst, almost 7 years ago
You could also annotate your page. https://schema.org/SearchResultsPage

Edit: Maybe it is also worth annotating the search field (https://developers.google.com/search/docs/data-types/sitelinks-searchbox) so that Google can match it against your search results page.
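
A hedged sketch of what such an annotation might look like, emitted as JSON-LD from Python. `SearchResultsPage` is the schema.org type linked above. The helper names and the choice of properties are assumptions, and nothing in the thread establishes how Google treats this markup.

```python
import json


def jsonld_payload(page_url: str) -> str:
    """Serialize a minimal SearchResultsPage annotation as JSON."""
    data = {
        "@context": "https://schema.org",
        "@type": "SearchResultsPage",
        "url": page_url,
    }
    # Escape "<" so a query containing "</script>" cannot close the tag early.
    return json.dumps(data).replace("<", "\\u003c")


def jsonld_script(page_url: str) -> str:
    """Wrap the payload in a script tag for the page's <head>."""
    return '<script type="application/ld+json">%s</script>' % jsonld_payload(page_url)


print(jsonld_script("https://example.com/search?query=widgets"))
```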

Jaruzel, almost 7 years ago
Register for Google Webmaster Tools. There's an option in there to exclude links that have dynamic parameters. You can define the parameters you want it to ignore.

itamarst, almost 7 years ago
Maybe also add a robots.txt? http://www.robotstxt.org/

eddflrs, almost 7 years ago
Adding <meta name="robots" content="noindex" /> to each page should work. Also, as a heads up, a Disallow entry in robots.txt is not enough on its own, since pages can still be indexed if they are linked from anywhere else on the web.

computator, almost 7 years ago
Can anyone answer a related question: Are you penalized for *not* running Google Analytics and/or Google Webmaster Tools? In other words, if you have a clean website with no analytics whatsoever, is your ranking likely to be worse?

emilfihlman, almost 7 years ago
Heh, I ran into a similar issue previously: https://news.ycombinator.com/item?id=16302821

GoogleBot is broken.

detaro, almost 7 years ago
Are they still being newly added, or have they just not been purged from Google's index yet?

lgats, almost 7 years ago
Blocking the search function in robots.txt may help as well.

User-agent: *

Disallow: /search

Disallow: /search*
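
To see which URLs a rule like this would block, the rules can be checked locally with Python's standard `urllib.robotparser`. This is only a sketch: `example.com` is a placeholder, and the wildcard line is left out because, as far as I know, the standard-library parser does plain prefix matching rather than wildcard matching. One caveat: a crawler that is blocked by robots.txt cannot fetch the page, so it will not see a noindex header or meta tag on it, which means the two approaches can work against each other.

```python
from urllib.robotparser import RobotFileParser

# The first two rules from the comment above; the wildcard line is omitted.
RULES = [
    "User-agent: *",
    "Disallow: /search",
]


def allowed(url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the rules above permit `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(RULES)
    return parser.can_fetch(user_agent, url)


for url in (
    "https://example.com/search?query=spam",
    "https://example.com/about",
):
    print(url, allowed(url))
```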

known, almost 7 years ago
You can restrict it in .htaccess