
This applies to sites indexed on Google that hope to gain organic traffic. As an indie blogger and SEO enthusiast, I foolishly updated my robots.txt file to prevent indexing of certain unwanted parts of my site, leading to subtle repercussions that I couldn't have foreseen.

A few days ago, while reading about SEO, I came across the concept of a "crawl budget." Apparently, Google allocates a specific crawl budget to your indexed site, and the more useless content it has to index and store on its servers, the more it affects your site, resulting in delays for new content indexing, favicon updates, and robots.txt crawling.

Being a minimalist and utilitarian, I decided to prevent indexing of the `/uploads/` directory on my site since it mostly contained images used in my articles. I thought blocking this "useless content" would free up more crawling budget for my primary content, i.e., articles. So, I added this directory to my site's robots.txt:

<pre><code>
# Group 1
User-agent: *
Disallow: /public/
Disallow: /drafts/
Disallow: /theme/
Disallow: /page*
Disallow: /uploads/

Sitemap: https://prahladyeri.github.io/sitemap.xml
</code></pre>

The way search engines work means there's typically a 5-7 day gap between updating the robots.txt file and crawlers processing it. After about a week, I noticed that my site's favicon disappeared from SERPs on mobile browsers! Instead, there was a bland (empty) icon in its place. That's when I realized that my favicons also resided in the `/uploads/` directory. After I recently optimized the favicon format by switching from WEBP to PNG, Google was unable to crawl and index the new favicon at all!

Once I realized this mistake, I removed the blocking of `/uploads/` from the robots.txt and requested a recrawl. But who knows how long it will take for Google's systems to sync this change and start showing the site's favicon back in SERPs!

Two lessons learned:

1. The robots.txt file is highly sensitive; avoid modifying it if possible.
2. Applying SEO is like steering an extremely large ship or vessel. You pull a lever now, and the ship only moves after several days!
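A mistake like this can be caught before deploying by testing the draft rules against the URLs that must stay crawlable. The following is a minimal sketch using Python's standard `urllib.robotparser`. The favicon and article paths are hypothetical (the post does not give them), and the `Disallow: /page*` line is left out because the standard-library parser is not assumed to implement Google's wildcard matching.

```python
from urllib.robotparser import RobotFileParser

# Plain-prefix rules taken from the post's robots.txt.
# "Disallow: /page*" is omitted on purpose: wildcard handling is a
# Google extension that this parser is not assumed to support.
RULES = """\
User-agent: *
Disallow: /public/
Disallow: /drafts/
Disallow: /theme/
Disallow: /uploads/
"""

SITE = "https://prahladyeri.github.io"
FAVICON_URL = SITE + "/uploads/favicon.png"    # hypothetical file name
ARTICLE_URL = SITE + "/articles/example.html"  # hypothetical article path


def is_allowed(rules: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


print("favicon crawlable:", is_allowed(RULES, FAVICON_URL))
print("article crawlable:", is_allowed(RULES, ARTICLE_URL))
```

Running a check like this over every asset referenced from the page `<head>` would have flagged the favicon as blocked before the change went live.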

12 comments

andrethegiant 6 months ago

> there's typically a 5-7 day gap between updating the robots.txt file and crawlers processing it

You could try moving your favicon to another dir, or the root dir, for the time being, and update your HTML to match. That way it would be allowed according to the version that Google still has cached. Also, I think browsers look for a favicon at /favicon.ico regardless, so it might be worth making a copy there too.

dazc 7 months ago

Use `X-Robots-Tag: noindex` to prevent files from being indexed, and let Google determine for itself how to crawl your site.

Otherwise, a nightmare scenario can result, where you have content indexed but don't allow Googlebot to crawl it. This does not end well.

<a href="https://developers.google.com/search/docs/crawling-indexing/block-indexing" rel="nofollow">https://developers.google.com/search/docs/crawling-indexing/...</a>
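As a rough illustration of the header approach, here is a minimal sketch using Python's built-in `http.server`. It assumes a server where response headers can be customized, which is not the case for every static host. The `NOINDEX_PREFIXES` value, the `extra_headers` helper, and the `NoIndexHandler` class are hypothetical names chosen for this example.

```python
from http.server import SimpleHTTPRequestHandler

# Assumption: paths under these prefixes may be crawled but should not be indexed.
NOINDEX_PREFIXES = ("/drafts/",)


def extra_headers(path: str) -> dict[str, str]:
    """Return the extra response headers to send for a request path."""
    if path.startswith(NOINDEX_PREFIXES):
        return {"X-Robots-Tag": "noindex"}
    return {}


class NoIndexHandler(SimpleHTTPRequestHandler):
    """Static file handler that adds X-Robots-Tag where configured."""

    def end_headers(self) -> None:
        # Inject the extra headers just before the header block is closed.
        for name, value in extra_headers(self.path).items():
            self.send_header(name, value)
        super().end_headers()

# To try it locally, pass NoIndexHandler to http.server.ThreadingHTTPServer
# and call serve_forever(); the server is deliberately not started here.
```

The crawler can still fetch these files, so it actually sees the `noindex` instruction. With a robots.txt `Disallow`, it never fetches the file and so never sees it.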

hk1337 6 months ago

It's good information but...

1. Why is your favicon in the uploads directory? Usually, those would be at the root of your site or in an image directory.
2. Why is there an uploads directory for a static site hosted on GitHub? I don't believe that is useful on GitHub, is it? You cannot have visitors upload files to it, right?

seanwilson 6 months ago

How big is your site? Crawl budget is likely only relevant for huge sites, not personal blogs.

xnx 6 months ago

The best SEO advice is to not focus on SEO and make a site that people will like.

maciekpaprocki 6 months ago

You don't want to exclude your images. That can very much affect your results: it will remove you from the image tab, and the ranking of articles that contain those images might be affected too.

Theodores 6 months ago

I thought that Google Search Console had tools to test robots.txt and sitemap.xml files, but it has been a while since I have needed to do that.

For those wondering why the favicon is in a directory: nowadays there are half a dozen different favicon files for different devices in different situations, and there are online tools such as The Real Favicon Generator that will take a source image and make the variants for you. These come with a code snippet for the head and the option to use a subdirectory so that you don't clutter the root.

Maybe they should offer a robots.txt snippet too.

Fun fact: for a single page, you can base64 encode the favicon and shove it in the page, thereby not needing a separate file. Why would you want to do that? If you base64 encode all the images and add the scripts and stylesheets in, then you have an HTML page that you don't have to upload; you can email it to someone. This is useful if you want to share a design mockup.
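The base64 trick can be sketched in a few lines of Python. This is only an illustration: `favicon_link_tag` is a hypothetical helper name, and the four bytes used below are a stand-in for a real icon file's contents.

```python
import base64


def favicon_link_tag(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Build a <link> tag that embeds the icon as a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f'<link rel="icon" type="{mime_type}" href="data:{mime_type};base64,{encoded}">'


# Stand-in bytes (the first four bytes of the PNG signature), not a real icon.
# In practice you would read the file, e.g. pathlib.Path(...).read_bytes().
print(favicon_link_tag(b"\x89PNG"))
```

Pasting the resulting tag into the page `<head>` means the icon travels with the HTML file, at the cost of a larger page and no separate caching of the icon.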

liendolucas 6 months ago

I'm completely ignorant when it comes to SEO, so what are the consequences of having neither a robots.txt nor a sitemap.xml at all? Will that be detrimental in a big way?

tiffanyh 6 months ago

Does anyone have suggestions on what a proper robots.txt would be?

How about:

<pre><code>
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
</code></pre>
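One way to sanity-check a permissive file like this is to parse it and confirm that nothing is blocked and that the sitemap is picked up. This is a minimal sketch with Python's standard `urllib.robotparser`. The tested URL is a hypothetical path on `example.com`, and `site_maps()` needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

PERMISSIVE_RULES = """\
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = load_rules(PERMISSIVE_RULES)
# Hypothetical asset path, used only to show that nothing is blocked.
print(parser.can_fetch("Googlebot", "https://example.com/uploads/favicon.png"))
print(parser.site_maps())
```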

dewey 6 months ago

If you don't have millions of pages, crawl budget limitations will most likely have zero impact.

Make sure your basic technical SEO factors are all good and that Search Console looks fine, and then don't keep worrying unless you are a huge site that lives off SEO traffic.

Arech 6 months ago

TL;DR: I shot myself in the foot thinking I was shooting elsewhere. Don't do this!

Thanks for the useful info! /s

KateSterling 6 months ago

SEO can feel like such a balancing act: one tweak, and it's a waiting game to see the impact! Sounds like you've learned a lot about the sensitivity of robots.txt.

If you're into exploring new tech, you might like Rig. It's a Rust library for building scalable, modular apps with LLMs, ideal if you're branching out into AI or complex workflows. Keeps things type-safe and flexible.

Tell HN: Robots.txt pitfalls – what I learned the hard way
