I am working with some peers who have a website that links to and catalogs a number of resources (think blog posts).<p>Ideally, the administrators would be able to keep an archived copy of each linked page on their own server in case the original post is deleted, links move, servers go down, etc.<p>Currently they use http://archive.is as a half solution. It does not work for some websites, and ideally they could host their own archived copies.<p>What are easy solutions for this?<p>With Python I was thinking of requests, but that would only grab the HTML, not images or content generated by JavaScript.<p>With Selenium you could take a screenshot of the content, but that is not the most reader-friendly format.<p>What are some other solutions?
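For reference, the requests approach I had in mind is roughly this minimal sketch (the URL is a placeholder), which only captures the HTML as served:

    import requests

    # Fetches only the raw HTML as served; images, CSS, and anything
    # rendered client-side by JavaScript are not captured.
    resp = requests.get("https://example.com/some-post")
    resp.raise_for_status()
    with open("page.html", "w", encoding="utf-8") as f:
        f.write(resp.text)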
I've enjoyed great success with various archiving proxies, including <a href="https://github.com/internetarchive/warcprox#readme" rel="nofollow">https://github.com/internetarchive/warcprox#readme</a> and <a href="https://github.com/zaproxy/zaproxy#readme" rel="nofollow">https://github.com/zaproxy/zaproxy#readme</a> (which saves content to an embedded database and can be easier to work with than WARC files). The benefit of these approaches over save-as from the browser is that, almost by definition, the proxy saves every component required to re-render the page, whereas save-as only grabs the parts it sees at that moment.<p>If it's a public page, you can also submit the URL to the Internet Archive, benefiting both you and them.
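A rough sketch of the proxy approach, assuming warcprox is running locally on its default port of 8000 (warcprox is a man-in-the-middle proxy, so for HTTPS you either trust the CA certificate it generates or, as in this quick sketch, skip verification):

    import requests

    # Route traffic through the archiving proxy; warcprox records every
    # request/response pair it sees into WARC files as a side effect.
    proxies = {
        "http": "http://localhost:8000",
        "https": "http://localhost:8000",
    }
    # verify=False is a shortcut here; trusting warcprox's CA cert is cleaner.
    resp = requests.get("https://example.com/some-post",
                        proxies=proxies, verify=False)
    print(resp.status_code)  # the WARC record is written by the proxy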
Either curl or wget will get you pretty far. Learn one of them well; they are basically equivalent. I use curl.
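A sketch of the wget side, called from Python here, though the same flags work directly on the command line (the URL is a placeholder):

    import subprocess

    # --page-requisites pulls the images, CSS, and JS the page needs;
    # --convert-links rewrites references so the local copy renders offline.
    subprocess.run([
        "wget",
        "--page-requisites",
        "--convert-links",
        "--adjust-extension",  # save with .html extensions where needed
        "--span-hosts",        # follow requisites served from other hosts (CDNs)
        "https://example.com/some-post",
    ], check=True)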
For current web apps, there is an interactive archiver written in Python, Web Recorder. It captures the full bidirectional traffic of a session.
<a href="https://webrecorder.io/" rel="nofollow">https://webrecorder.io/</a>
Web Recorder uses an internal Python library, pywb. That might be a good place to look.
<a href="https://github.com/webrecorder/pywb" rel="nofollow">https://github.com/webrecorder/pywb</a>
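A hedged sketch of pywb's collection workflow, using the wb-manager and wayback commands the package ships with (the collection and WARC file names here are placeholders):

    import subprocess

    # Create a collection, add an existing WARC capture to it, then serve
    # the archive for browsing (wayback listens on port 8080 by default).
    subprocess.run(["wb-manager", "init", "my-archive"], check=True)
    subprocess.run(["wb-manager", "add", "my-archive", "capture.warc.gz"], check=True)
    subprocess.run(["wayback"], check=True)  # http://localhost:8080/my-archive/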
It looks like Selenium has done a lot of catching up on its interface. I'd be curious how they compare now.<p>Talk to librarians about archiving the web; they made the Internet Archive and have a lot of experience.
Save the page using the browser's Save feature and zip the created assets (an HTML file plus a directory with graphics, JS, CSS, etc.) for ease of sharing.
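For instance, a small sketch that bundles a typical "Save Page As" output (the page.html and page_files names are whatever your browser produced):

    import zipfile
    from pathlib import Path

    # Bundle the saved HTML file and its asset directory into one zip.
    with zipfile.ZipFile("page-archive.zip", "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write("page.html")
        for path in Path("page_files").rglob("*"):
            zf.write(path)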
If you're okay with an easy option that saves to a third party: <a href="https://www.npmjs.com/package/archive.is" rel="nofollow">https://www.npmjs.com/package/archive.is</a>