Ask HN: What is the best way to archive a webpage

16 points, by badwolff, about 7 years ago

I am working with some peers who have a website that links to and catalogs a number of resources (think blog posts).

It would be ideal for the administrators to be able to archive, or keep a copy of, each linked page on their own server in case the original post is deleted, links move, servers go down, etc.

Currently they are using http://archive.is as a half-solution. It does not work for some websites, and ideally they could host their own archived copies.

What are easy ways to do this?

With Python I was thinking of requests, but that would only grab the HTML, not images or content generated by JavaScript.

With Selenium you could take a screenshot of the content, but that is not the most reader-friendly format.

What are some other solutions?
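For reference, a minimal sketch of the two approaches the question mentions, so the trade-off is concrete. The URL and filenames are placeholders, not anything from the original thread: requests saves only the raw HTML, while a headless Selenium browser renders the page (including JS content) but produces an image rather than readable text.

```python
import requests
from selenium import webdriver

url = "https://example.com/some-post"  # placeholder

# 1. Plain HTTP fetch: fast, but no images and no JS-generated content.
html = requests.get(url, timeout=30).text
with open("page.html", "w", encoding="utf-8") as f:
    f.write(html)

# 2. Headless browser: renders the page, then saves a screenshot,
#    which is not the most reader-friendly archive format.
options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get(url)
driver.save_screenshot("page.png")
driver.quit()
```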

6 comments

mdaniel, about 7 years ago
I've enjoyed great success with various archiving proxies, including https://github.com/internetarchive/warcprox#readme and https://github.com/zaproxy/zaproxy#readme (the latter saves content to an embedded database, which can be easier to work with than WARC files). The benefit of these approaches over a plain save-as from the browser is that, almost by definition, the proxy saves every component required to re-render the page, whereas a save only grabs the parts the browser sees at that moment.

If it's a public page, you can also submit the URL to the Internet Archive, which benefits both you and them.
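As a sketch of the proxy approach: you run an archiving proxy locally and point your HTTP client (or headless browser) at it, and the proxy records every request/response it sees. The listen address and port below are assumptions, not anything the commenter specified; use whatever your proxy actually binds to.

```python
import requests

proxies = {
    "http": "http://localhost:8000",   # assumed local archiving-proxy address
    "https": "http://localhost:8000",
}

# verify=False because an intercepting proxy re-signs HTTPS traffic with its
# own certificate; in practice you would trust the proxy's CA instead.
resp = requests.get("https://example.com/", proxies=proxies, verify=False)
print(resp.status_code)
```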
cimmanom, about 7 years ago
If the site isn't doing silly things like using JavaScript to load static content, wget can do recursive crawls.
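A sketch of that wget crawl, run via subprocess so the relevant flags are explicit; the target URL is a placeholder, and depth/domain limits would need tuning for a real site.

```python
import subprocess

subprocess.run([
    "wget",
    "--mirror",            # recursive crawl with timestamping
    "--page-requisites",   # also fetch images, CSS, and JS referenced by pages
    "--convert-links",     # rewrite links so the local copy browses offline
    "--adjust-extension",  # save pages with .html extensions
    "--no-parent",         # stay within the starting directory
    "https://example.com/blog/",  # placeholder
], check=True)
```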
adultSwim, about 7 years ago
Either curl or wget will get you pretty far. Learn one of them well; they are basically equivalent. I use curl.

For current web apps, there is an interactive archiver written in Python, Web Recorder, which captures the full bi-directional traffic of a session: https://webrecorder.io/. Web Recorder is built on an internal Python library, pywb, which might be a good place to look: https://github.com/webrecorder/pywb

It looks like Selenium has done a lot of catching up on its interface. I'd be curious how they compare now.

Talk to librarians about archiving the web; they built the Internet Archive and have a lot of experience.
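For completeness, a sketch of the single-document grab with curl, the commenter's preferred tool (URL and filename are placeholders). Like requests, this captures only the one HTML file, not the assets or JS-generated content, which is the gap the session-recording tools above are meant to close.

```python
import subprocess

subprocess.run([
    "curl",
    "-L",               # follow redirects
    "--compressed",     # request and decode gzip/deflate responses
    "-o", "page.html",  # write the body to a local file
    "https://example.com/some-post",  # placeholder
], check=True)
```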
inceptionnames, about 7 years ago
Save the page using the browser's Save feature and zip the created assets (an HTML file plus a directory with graphics, JS, CSS, etc.) for ease of sharing.
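A sketch of the zipping step, assuming the browser produced a page.html file plus a page_files/ asset directory (the usual layout for a "complete" save; both names here are assumptions).

```python
import zipfile
from pathlib import Path

saved = [Path("page.html"), *Path("page_files").rglob("*")]
with zipfile.ZipFile("archive.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for path in saved:
        if path.is_file():      # skip directories; write files with their paths
            zf.write(path)
```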
tyingq, about 7 years ago
If you're okay with an easy option that saves to a third party: https://www.npmjs.com/package/archive.is
anotheryou, about 7 years ago
perma.cc looks sweet, but it's very limited for private individuals.