
Ask HN: What is the best way to archive a webpage

16 points by badwolff about 7 years ago
I am working with some peers who have a website that links to and catalogs a number of resources (think blog posts).

It would be ideal for the administrators to be able to archive or have a copy of that web page on their server in case of the original post being deleted, links moving, servers being down, etc.

Currently they are using http://archive.is to implement a half solution to this intent. It does not work for some websites, and ideally they could host their own archived copy.

What are easy solutions to do this?

With Python I was thinking requests - but this would just grab the HTML and not images, or content generated by JavaScript.

Thinking Selenium, you could take a screenshot of the content - not the most user-friendly to read.

What are some other solutions?
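For the requests limitation mentioned above, a minimal sketch of grabbing the HTML plus the images it references could look like the following. The target URL and output layout are illustrative assumptions, and this still misses anything generated by JavaScript:

```python
# Minimal sketch: fetch a page with requests and also download the images it
# references. This does NOT execute JavaScript, so dynamically generated
# content is still missed. URL and output directory are examples.
import os
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def archive_page(url, out_dir="archive"):
    os.makedirs(out_dir, exist_ok=True)
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()

    soup = BeautifulSoup(resp.text, "html.parser")

    # Download each <img> and rewrite its src to point at the local copy.
    for img in soup.find_all("img", src=True):
        img_url = urljoin(url, img["src"])
        name = os.path.basename(urlparse(img_url).path) or "image"
        local_path = os.path.join(out_dir, name)
        try:
            data = requests.get(img_url, timeout=30).content
            with open(local_path, "wb") as f:
                f.write(data)
            img["src"] = name
        except requests.RequestException:
            pass  # keep the original src if the download fails

    with open(os.path.join(out_dir, "index.html"), "w", encoding="utf-8") as f:
        f.write(str(soup))

archive_page("https://example.com/some-blog-post")
```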

6 comments

mdaniel about 7 years ago
I've enjoyed great success with various archiving proxies, including https://github.com/internetarchive/warcprox#readme and https://github.com/zaproxy/zaproxy#readme (which saves the content to an embedded database, and can be easier to work with than WARC files). The benefit of those approaches over just save-as from the browser is that almost by definition the proxy will save all the components required to re-render the page, whereas save will only grab the parts it sees at that time.

If it's a public page, you can also submit the URL to the Internet Archive, which benefits both you and them.
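A minimal sketch of fetching a page through a locally running archiving proxy such as warcprox; the listen address and the disabled certificate verification reflect assumptions about a default local setup, not a confirmed configuration:

```python
# Sketch: route a request through a locally running archiving proxy
# (e.g. warcprox), which records the full traffic as a side effect.
# The proxy address/port below is an assumption about a default local setup.
import requests

PROXY = "http://localhost:8000"  # assumed warcprox listen address

resp = requests.get(
    "https://example.com/some-blog-post",
    proxies={"http": PROXY, "https": PROXY},
    # The proxy intercepts HTTPS with its own CA, so verification is
    # disabled here; in practice you would trust its CA certificate instead.
    verify=False,
    timeout=30,
)
print(resp.status_code)  # the proxy has archived the exchange
```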
cimmanom about 7 years ago
If it's not doing silly things like using JavaScript to load static content, wget can do recursive crawls.
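A minimal sketch of such a crawl, driven from Python for consistency with the rest of the thread; the flags are standard wget options, while the URL, crawl depth, and output directory are examples:

```python
# Sketch: mirror a page and what it needs to render with wget.
# --recursive follows links, --page-requisites grabs images/CSS/JS,
# --convert-links rewrites links for offline viewing, --no-parent stays
# below the starting URL. Target URL and output directory are examples.
import subprocess

subprocess.run(
    [
        "wget",
        "--recursive",
        "--level=1",
        "--page-requisites",
        "--convert-links",
        "--adjust-extension",
        "--no-parent",
        "--directory-prefix=archive",
        "https://example.com/some-blog-post",
    ],
    check=True,
)
```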
adultSwim about 7 years ago
Either curl or wget will get you pretty far. Learn one of them well. They are basically equivalent. I use curl.

For current web apps, there is an interactive archiver written in Python, Web Recorder. It captures the full bi-directional traffic of a session: https://webrecorder.io/ Web Recorder uses an internal Python library, pywb. That might be a good place to look: https://github.com/webrecorder/pywb

It looks like Selenium has done a lot of catching up on its interface. I'd be curious how they compare now.

Talk to librarians about archiving the web. They made the Internet Archive and have a lot of experience.
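Webrecorder's tooling is built around WARC files; a minimal sketch of producing one from Python using warcio, a library from the same Webrecorder project that pywb builds on. The capture_http helper and its import-order requirement are recalled from its documentation, so treat the details as assumptions:

```python
# Sketch: record HTTP traffic made with requests into a WARC file using
# warcio (from the Webrecorder project). Filename and URL are examples.
from warcio.capture_http import capture_http
import requests  # imported after capture_http so it can be instrumented

with capture_http("archive.warc.gz"):
    requests.get("https://example.com/some-blog-post")
# archive.warc.gz now contains the request/response records for that fetch.
```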
inceptionnames about 7 years ago
Save the page using the browser's Save feature and zip the created assets (an HTML file plus a directory with graphics, JS, CSS, etc.) for ease of sharing.
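A minimal sketch of bundling such a browser save into one archive; the "page.html" / "page_files" names follow the layout browsers typically produce but are assumptions about your particular save:

```python
# Sketch: bundle a browser "Save page" result (HTML file + assets folder)
# into a single zip for sharing. The names below follow the usual
# "page.html" / "page_files" layout but are assumptions about your save.
import zipfile
from pathlib import Path

def zip_saved_page(html_file, assets_dir, out_zip):
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(html_file, Path(html_file).name)
        for path in Path(assets_dir).rglob("*"):
            if path.is_file():
                zf.write(path, path.relative_to(Path(assets_dir).parent))

zip_saved_page("page.html", "page_files", "page-archive.zip")
```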
tyingq about 7 years ago
If you're okay with easy, but it saves at a third party: https://www.npmjs.com/package/archive.is
anotheryou about 7 years ago
perma.cc looks sweet, but it's very limited for private individuals.