
Ask HN: Best practices for technical SEO for domains with more than 100k pages?

3 points by brntsllvn over 2 years ago
By "technical SEO" I mean SEO problems that require software development.

What's special about domains with 100,000+ pages (e.g. niche search engines and large e-commerce sites) is that most basic SEO advice is necessary but far from sufficient.

I'm on a project with 1,000,000 pages and only about 50,000 are indexed. My "crawl budget" is small in absolute terms, but relatively large. How do I make sure the "right" content gets indexed? How does Google even define "right"?

I seem to be writing a lot of code to answer basic questions like "which of my 1,000,000 pages are indexed?" and "which pages should I be actively seeking to index and de-index?"

If you work on "technical SEO" for something like Stack Overflow or Amazon, I'd love to learn from you.

And if these are the wrong questions to ask, I'd especially like to learn from you.
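One concrete way to answer "which of my pages are indexed?" without inspecting all 1,000,000 URLs is to sample them and query the URL Inspection API in Google's Search Console API, which reports a per-URL coverage state. A minimal sketch, assuming OAuth credentials for a verified property; the property name and sample size are placeholders, and the API's daily inspection quota means a full sweep of a million pages isn't practical:

```python
import random

from googleapiclient.discovery import build  # pip install google-api-python-client

SITE = "sc-domain:example.com"  # placeholder: your verified Search Console property

def sample_index_coverage(creds, urls, sample_size=500):
    """Estimate indexation by inspecting a random sample of URLs."""
    service = build("searchconsole", "v1", credentials=creds)
    histogram = {}
    for url in random.sample(urls, min(sample_size, len(urls))):
        body = {"inspectionUrl": url, "siteUrl": SITE}
        result = service.urlInspection().index().inspect(body=body).execute()
        # coverageState reads e.g. "Submitted and indexed" or
        # "Crawled - currently not indexed"
        state = result["inspectionResult"]["indexStatusResult"]["coverageState"]
        histogram[state] = histogram.get(state, 0) + 1
    return histogram
```

The sample's coverage-state histogram extrapolates to the whole site, and repeating it per URL pattern (product pages vs. category pages, say) shows where the crawl budget is actually going.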

4 comments

vgeek over 2 years ago
I'm curious to hear other people's takes on this, too.

How unique is the content on each page? What do the crawled-but-not-indexed vs. discovered-not-crawled ratios look like relative to indexed pages? What does linking to the detail pages look like (category pages, similar items)? How much authority do category pages have in terms of backlinks? Page speed issues: in GSC, can you manage an average download time under 100ms (makes a huge difference)?

Noindexing may help, if you initially handle it based on page content uniqueness and/or presumed MSV potential for the page (see the sketch below). Otherwise, based on your language, can you request a quota increase from the 200/day notifications for the Indexing API?

If you like reading about the topic, maybe head to the BHW forums (very poor SNR, though) and read about what some of the more unscrupulous people are doing to index MM+ pages/mo, and extract some of the less-risky strategies (e.g. don't abuse the news/job-posting APIs to get spam content indexed).

As additional commentary, the last 2-3 months have been *chaos* in terms of algo updates. There was essentially a 30-day period where Googlebot all but disappeared for most sites, then the two HCUs and other random updates tossed in for good measure, stacked with desktop infinite scroll and the switch to mobile-first indexing. I can't remember as many changes happening in a single quarter in over 15 years of being SEO or SEO-adjacent.
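On the content-uniqueness point above: a cheap way to score near-duplication at scale is word shingling plus Jaccard similarity (graduating to MinHash/LSH once pairwise comparison gets too expensive). A toy sketch of the idea; the 5-word shingle size and 0.8 threshold are arbitrary assumptions, and anything flagged becomes a noindex candidate rather than an automatic removal:

```python
import re

def shingles(text, k=5):
    """Set of k-word shingles: the unit of comparison for near-duplicates."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def noindex_candidates(pages, threshold=0.8):
    """pages: dict of url -> body text. Returns URLs that nearly duplicate
    an earlier page. O(n^2) as written; use MinHash + LSH for millions."""
    kept, candidates = [], []
    for url, text in pages.items():
        sig = shingles(text)
        if any(jaccard(sig, kept_sig) >= threshold for _, kept_sig in kept):
            candidates.append(url)
        else:
            kept.append((url, sig))
    return candidates
```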
135792468 over 2 years ago
Finally an SEO question on HN! While I haven't worked on SO, I've worked on a medical equivalent site that started with 1,500 of 2.5M pages indexed; by the time my contract ended, over 1.4M were indexed.

There are a few factors you need to adjust for: authority, quality, structure.

Authority is mostly outside a technical SEO's scope, but as more content gets indexed, more links and signals will come in. The real culprits are quality and structure. As others have said, low-quality pages need to be addressed. Be ruthless about cutting them now; they can always be worked back in later.

Profile pages, thin-content pages, and duplicates all need to go. Noindex them first, then block them in robots.txt eventually; it doesn't hurt to set canonicals where applicable. If you jump straight to robots.txt, Google will never pick up the noindex directive (the staging is sketched below).

As for structure: a lot of people think of folder structure as the IA, but really it's linking structure. It's important to know where you're placing emphasis on pages by how your internal links are set up. This is also the best way to surface deep pages, since you can link to them to bring Google deeper into your site.

Also, you can try refreshing dates and seeing if that helps. Short-term solution, but it works plenty well.

I know this is pretty general, but it's hard without seeing your issues specifically. If I can help you get on the right track, lmk. I'll gladly take a look. I run a small (free) community with some strong technical SEOs in it who like helping. Not sure how to connect, but lmk if interested and we'll figure it out.
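The staging matters because a robots.txt Disallow stops Googlebot from fetching a page at all, so it can never see the noindex. A minimal sketch of stage one as a Flask hook; the URL prefixes are hypothetical, and the matching robots.txt rules would only be added later, once Search Console shows the pages dropping out:

```python
from flask import Flask, request

app = Flask(__name__)

# Hypothetical path prefixes for pages to remove from the index first.
NOINDEX_PREFIXES = ("/profile/", "/tag/")

@app.after_request
def stage_one_noindex(response):
    """Stage 1: let Googlebot crawl the page so it sees the noindex header.
    Stage 2 (later, once deindexed): Disallow these paths in robots.txt."""
    if request.path.startswith(NOINDEX_PREFIXES):
        response.headers["X-Robots-Tag"] = "noindex, follow"
    return response
```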
Minor49er over 2 years ago
A few questions that come to mind off the bat:

Do you have a sitemap.xml file? Does it include accurate last-updated dates or suggest realistic crawl schedules (e.g. only index static pages every month, but frequently updated pages every day)? A generator sketch follows below.

Have you run your site (or at least some of the ignored pages) through an SEO checker to make sure they are accessible and not being blocked or ignored by a crawler? Any server responses or canonical links that may signal the wrong thing to a bot?

Do you have any deep links to your more overlooked pages on the homepage (or at least on your more popular pages)?

Do you have backlinks in the wild that point to the pages you want crawled? Are they marked as nofollow? Do you post links to these pages on any social media accounts?
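On the sitemap question: lastmod only helps if it reflects real content changes, since stamping every URL with the build date teaches Google to distrust it (and Google has said it largely ignores changefreq and priority). A minimal generator sketch, assuming you can pull each page's true modification time from your own datastore; the `pages` input is hypothetical:

```python
from xml.etree import ElementTree as ET

def build_sitemap(pages):
    """pages: iterable of (url, last_modified) pairs, where last_modified
    is a date/datetime of the content's real last change, not render time."""
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for url, last_modified in pages:
        node = ET.SubElement(urlset, "url")
        ET.SubElement(node, "loc").text = url
        ET.SubElement(node, "lastmod").text = last_modified.isoformat()
    return ET.tostring(urlset, encoding="unicode", xml_declaration=True)
```

At 1,000,000 URLs you would also shard the output into files of at most 50,000 URLs each (the protocol's per-file limit) behind a sitemap index.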
sn0w_crash over 2 years ago
Hmm, maybe you can structure your links so that the high-priority pages you want indexed get 10x more link juice than the rest? This would at least signal to Google which pages you deem more important.
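One way to sanity-check whether the internal link graph actually concentrates weight where intended is to run PageRank over your own crawl of it. A sketch using networkx; the `edges` input (from-URL, to-URL pairs of internal links) is assumed to come from your crawler:

```python
import networkx as nx  # pip install networkx

def internal_pagerank(edges, top_n=20):
    """edges: iterable of (from_url, to_url) internal links from a site crawl.
    Returns the pages the link structure emphasizes most; compare this list
    against the pages you actually want indexed."""
    graph = nx.DiGraph()
    graph.add_edges_from(edges)
    scores = nx.pagerank(graph, alpha=0.85)  # standard damping factor
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```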