Ask HN: What is the best way to archive a webpage

16 points, by badwolff, about 7 years ago

I am working with some peers who have a website that links to and catalogs a number of resources (think blog posts).

It would be ideal for the administrators to be able to archive, or keep a copy of, each linked page on their own server in case the original post is deleted, links move, servers go down, etc.

Currently they are using http://archive.is as a half-solution. It does not work for some websites, and ideally they could host their own archived copies.

What are easy ways to do this?

With Python I was thinking of requests, but that would only grab the HTML, not images or content generated by JavaScript.

With Selenium you could take a screenshot of the content, but that is not the most reader-friendly format.

What are some other solutions?
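For reference, a minimal sketch of the two approaches the question mentions, so the trade-off is concrete. The URL and filenames are placeholders, not anything from the original thread: requests saves only the raw HTML, while a headless Selenium browser renders the page (including JS content) but produces an image rather than readable text.

```python
import requests
from selenium import webdriver

url = "https://example.com/some-post"  # placeholder

# 1. Plain HTTP fetch: fast, but no images and no JS-generated content.
html = requests.get(url, timeout=30).text
with open("page.html", "w", encoding="utf-8") as f:
    f.write(html)

# 2. Headless browser: renders the page, then saves a screenshot,
#    which is not the most reader-friendly archive format.
options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get(url)
driver.save_screenshot("page.png")
driver.quit()
```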

6 comments

mdaniel, about 7 years ago
I've enjoyed great success with various archiving proxies, including https://github.com/internetarchive/warcprox#readme and https://github.com/zaproxy/zaproxy#readme (the latter saves content to an embedded database, which can be easier to work with than WARC files). The benefit of these approaches over a plain save-as from the browser is that, almost by definition, the proxy saves every component required to re-render the page, whereas a save only grabs the parts the browser sees at that moment.

If it's a public page, you can also submit the URL to the Internet Archive, which benefits both you and them.
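As a sketch of the proxy approach: you run an archiving proxy locally and point your HTTP client (or headless browser) at it, and the proxy records every request/response it sees. The listen address and port below are assumptions, not anything the commenter specified; use whatever your proxy actually binds to.

```python
import requests

proxies = {
    "http": "http://localhost:8000",   # assumed local archiving-proxy address
    "https": "http://localhost:8000",
}

# verify=False because an intercepting proxy re-signs HTTPS traffic with its
# own certificate; in practice you would trust the proxy's CA instead.
resp = requests.get("https://example.com/", proxies=proxies, verify=False)
print(resp.status_code)
```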
cimmanom, about 7 years ago
If the site isn't doing silly things like using JavaScript to load static content, wget can do recursive crawls.
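A sketch of that wget crawl, run via subprocess so the relevant flags are explicit; the target URL is a placeholder, and depth/domain limits would need tuning for a real site.

```python
import subprocess

subprocess.run([
    "wget",
    "--mirror",            # recursive crawl with timestamping
    "--page-requisites",   # also fetch images, CSS, and JS referenced by pages
    "--convert-links",     # rewrite links so the local copy browses offline
    "--adjust-extension",  # save pages with .html extensions
    "--no-parent",         # stay within the starting directory
    "https://example.com/blog/",  # placeholder
], check=True)
```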
adultSwim, about 7 years ago
Either curl or wget will get you pretty far. Learn one of them well; they are basically equivalent. I use curl.

For current web apps, there is an interactive archiver written in Python, Web Recorder, which captures the full bi-directional traffic of a session: https://webrecorder.io/. Web Recorder is built on an internal Python library, pywb, which might be a good place to look: https://github.com/webrecorder/pywb

It looks like Selenium has done a lot of catching up on its interface. I'd be curious how they compare now.

Talk to librarians about archiving the web; they built the Internet Archive and have a lot of experience.
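For completeness, a sketch of the single-document grab with curl, the commenter's preferred tool (URL and filename are placeholders). Like requests, this captures only the one HTML file, not the assets or JS-generated content, which is the gap the session-recording tools above are meant to close.

```python
import subprocess

subprocess.run([
    "curl",
    "-L",               # follow redirects
    "--compressed",     # request and decode gzip/deflate responses
    "-o", "page.html",  # write the body to a local file
    "https://example.com/some-post",  # placeholder
], check=True)
```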
inceptionnames, about 7 years ago
Save the page using the browser's Save feature and zip the created assets (an HTML file plus a directory with graphics, JS, CSS, etc.) for ease of sharing.
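A sketch of the zipping step, assuming the browser produced a page.html file plus a page_files/ asset directory (the usual layout for a "complete" save; both names here are assumptions).

```python
import zipfile
from pathlib import Path

saved = [Path("page.html"), *Path("page_files").rglob("*")]
with zipfile.ZipFile("archive.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for path in saved:
        if path.is_file():      # skip directories; write files with their paths
            zf.write(path)
```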
tyingq, about 7 years ago
If you're okay with an easy option that saves to a third party: https://www.npmjs.com/package/archive.is
anotheryou, about 7 years ago
perma.cc looks sweet, but it's very limited for private individuals.