Hidden In HTML: Parsing Page Layouts. 2.9B Web Page Analysis

5 points by benwills 4 months ago

1 comment

benwills 4 months ago
This is an analysis I put together of the November 2024 Common Crawl HTML/WARC dataset. I counted HTML tag attribute values to identify the most common values per tag+attribute combination. I've done this analysis several times over the years and have found it invaluable when writing parsers.

The post is interactive, allowing you to search the 500 most common values per tag+attribute. There is also a free SQLite database available for download with the top 1,000 values per tag+attribute.

This is the first post of an 8-part series that builds toward writing an article parser. The lessons from it carry over to writing any other kind of parser you might want.

This is my first time publishing content like this and I'd love any feedback you might have.
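
For readers who want a feel for the counting step, here is a minimal sketch in Python using only the standard library. It is not the code behind the analysis: the class name `AttrValueCounter` and the function `top_values` are hypothetical, the input is a small in-memory list of HTML strings rather than Common Crawl WARC files, and each attribute value is counted as a whole string (a `class` value with several names is not split).

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser


class AttrValueCounter(HTMLParser):
    """Hypothetical helper: tallies attribute values per (tag, attribute) pair."""

    def __init__(self):
        super().__init__()
        # Maps (tag, attribute) -> Counter of attribute values.
        self.counts = defaultdict(Counter)

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        for name, value in attrs:
            if value is None:
                continue  # skip valueless attributes such as `disabled`
            self.counts[(tag, name)][value] += 1


def top_values(html_docs, n=3):
    """Return the n most common values for each (tag, attribute) pair."""
    parser = AttrValueCounter()
    for doc in html_docs:
        parser.feed(doc)
        parser.close()
        parser.reset()  # clear parser state between documents; counts are kept
    return {key: counter.most_common(n) for key, counter in parser.counts.items()}


if __name__ == "__main__":
    docs = [
        '<div class="post"><a rel="nofollow" href="/a">x</a></div>',
        '<div class="post"><div class="sidebar"></div></div>',
    ]
    for key, values in top_values(docs).items():
        print(key, values)
```

At Common Crawl scale the same idea would need streaming input and a more compact store for the counts, but the shape of the result is the same: a ranked list of values for every tag+attribute combination.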