Ask HN: Best way to keep the raw HTML of scraped pages?

32 points by vitorbaptistaa over 2 years ago
I'm scraping information regarding civil servants' calendars. This is all public, text-only information. I'd like to keep a copy of the raw HTML files I'm scraping for historical purposes, and also in case there's a bug and I need to re-run the scrapers.

This sounds like a great use case for a forward proxy like Squid or Apache Traffic Server. However, I couldn't find a way in their docs to do both of the following:

* Keep a permanent history of the cached pages

* Access old versions of the cached pages (think Wayback Machine)

Does anyone know if this is possible? I could potentially mirror the pages using wget or httrack, but a forward cache is a better solution as the caching process is driven by the scraper itself.

Thanks!

6 comments

mdaniel over 2 years ago
If you weren't already aware, Scrapy has strong support for this via its HTTPCache middleware; you can choose whether to have it actually behave like a cache, returning already-scraped content on a match, or merely act as a pass-through cache: https://docs.scrapy.org/en/2.7/topics/downloader-middleware.html#writing-your-own-storage-backend

Their out-of-the-box storage does what the sibling comment says: it SHA-1-hashes the request and then shards the output filename by the first two characters: https://github.com/scrapy/scrapy/blob/2.7.1/scrapy/extensions/httpcache.py#L332-L333
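
A minimal sketch of what enabling that cache might look like in a Scrapy project's settings.py (the cache directory name is just an example); with DummyPolicy and an expiration of 0, every response is stored and nothing expires:

    # settings.py -- turn on Scrapy's HTTP cache middleware
    HTTPCACHE_ENABLED = True
    HTTPCACHE_DIR = "httpcache"      # stored under the project's .scrapy/ directory
    HTTPCACHE_EXPIRATION_SECS = 0    # 0 means cached responses never expire
    HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
    HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"  # cache every response
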
PaulHoule over 2 years ago
Content-addressable storage. Generate names with SHA-3 and split off bits of the names into directories like

    name[0:2]/name[0:4]/name[0:6]/name

to keep any of the directories from getting too big (even if the filesystem can handle huge directories, various tools you use with it might not). Keep a list of where the files came from, plus other metadata, in a database so you can find things.
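
A minimal sketch of that layout in Python, assuming SHA3-256 and a local storage root (both are choices, not requirements):

    import hashlib
    from pathlib import Path

    def store_page(root: Path, html: bytes) -> Path:
        """Write html under a content-addressed, sharded path and return that path."""
        name = hashlib.sha3_256(html).hexdigest()
        # Shard as name[0:2]/name[0:4]/name[0:6]/name so no directory grows too large.
        path = root / name[0:2] / name[0:4] / name[0:6] / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(html)
        return path  # record (url, fetch time, path) in a database separately
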
placidpanda over 2 years ago
When doing this in the past, I settled on an SQLite database with one table that stores the compressed HTML (gzip or lzma) along with other columns (id/date/url/domain/status/etc.).

It also made it easy to alert when something broke (query the table for count(*) where status = error) and re-run the parser for the failures.
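
A rough sketch of that approach with Python's sqlite3 and gzip modules (the table and column names here are only illustrative):

    import datetime
    import gzip
    import sqlite3

    conn = sqlite3.connect("pages.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS pages (
            id     INTEGER PRIMARY KEY,
            date   TEXT,
            url    TEXT,
            domain TEXT,
            status TEXT,
            html   BLOB   -- gzip-compressed raw HTML
        )
    """)

    def save_page(url: str, domain: str, status: str, html: bytes) -> None:
        conn.execute(
            "INSERT INTO pages (date, url, domain, status, html) VALUES (?, ?, ?, ?, ?)",
            (datetime.datetime.utcnow().isoformat(), url, domain, status, gzip.compress(html)),
        )
        conn.commit()

    # Alerting amounts to: SELECT count(*) FROM pages WHERE status = 'error'
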
compressedgas over 2 years ago
WARC.
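
If WARC appeals, here is a minimal sketch of writing one response with the third-party warcio library (the URL, headers, and output filename are placeholders):

    import io
    from warcio.warcwriter import WARCWriter
    from warcio.statusandheaders import StatusAndHeaders

    html = b"<html><body>example calendar page</body></html>"

    with open("calendars.warc.gz", "wb") as out:
        writer = WARCWriter(out, gzip=True)
        http_headers = StatusAndHeaders(
            "200 OK", [("Content-Type", "text/html")], protocol="HTTP/1.1"
        )
        record = writer.create_warc_record(
            "https://example.gov/calendar",
            "response",
            payload=io.BytesIO(html),
            http_headers=http_headers,
        )
        writer.write_record(record)

Replay tools such as pywb can then serve those WARC files much like the Wayback Machine does.
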
sbricks over 2 years ago
I'd just apply an intelligent file-naming strategy based on timestamps and URLs. Keep in mind that a folder shouldn't contain more than about 1,000 files or subfolders, otherwise it's slow to list.
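
A small sketch of one such naming scheme (the slug format and timestamp resolution are arbitrary choices):

    import datetime
    from pathlib import Path
    from urllib.parse import urlparse

    def page_path(root: Path, url: str) -> Path:
        """Build e.g. root/example.gov/2022-11-14T10-30-00_path-to-page.html"""
        parsed = urlparse(url)
        slug = parsed.path.strip("/").replace("/", "-") or "index"
        stamp = datetime.datetime.utcnow().strftime("%Y-%m-%dT%H-%M-%S")
        return root / parsed.netloc / f"{stamp}_{slug}.html"
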
nf-x over 2 years ago
Did you try using some cheap cloud storage, like AWS S3?
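
For what it's worth, a minimal sketch of pushing each fetched page to S3 with boto3 (the bucket name and key scheme are assumptions):

    import datetime
    import boto3

    s3 = boto3.client("s3")

    def archive_page(url: str, html: bytes) -> None:
        # Key the object by URL plus fetch timestamp so every version is kept.
        slug = url.replace("https://", "").replace("http://", "").replace("/", "_")
        key = f"raw-html/{slug}/{datetime.datetime.utcnow().isoformat()}.html"
        s3.put_object(Bucket="my-scraper-archive", Key=key, Body=html, ContentType="text/html")
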