HTTrack Website Copier

216 points by yamrzou · 2 months ago

27 comments

_Chief · 2 months ago
This brings back so many fond memories. I grew up in a rural part of Kenya where the internet was scarce and tech practically non-existent. I was interested in web dev and taught myself PHP by using HTTrack to download the PHP manual site, then the cprogramming.com website. I remember writing those site contents into a thick notebook to read at school. Cprogramming.com, IMHO, was my programming foundation, as I treated it as programming gospel. That kid back then would be shocked at how far I've come: I'm now a dev at MS. Not sure how I came across HTTrack back then, but I am so glad I did.
icameron · 2 months ago
I’ve used it a few times to “secure” an old but relevant dynamic website: say, a site for a mature project that shouldn’t disappear from the internet, but where it’s not worth upgrading five-year-old code that won’t pass our “cyber security audit” due to unsupported versions of PHP or Rails. So we just convert it to a static site and delete the database. Everything pretty much works fine on the front end, and the CMS functionality is no longer needed. It’s great for that niche use case.
ksec · 2 months ago
Not sure of the context for why this is on HN, but it certainly put a smile on my face. I used to use it during the 56K era, when I would just download everything and read it offline. Basically using it as RSS before RSS was a thing.
jeff_tyrrill · 2 months ago
I've been using HTTrack for almost two decades to create static archives of the website for an annual event.

It doesn't do the job 100%, but it's a start. In particular, HTTrack does not support srcset, so only the default (1x) pixel-density images were archived (though I manually edited the archives to inject the high pixel-density images, along with numerous other necessary fix-ups).

The benefit of the tool is fine control over the crawling process as well as over which files are included. Included files have their URLs rewritten in the archived HTML (and CSS) to account for querystrings, absolute vs. relative URLs, external paths, etc.; non-included files also have their URLs rewritten, changing relative links to absolute ones. Thus you can browse the static archive, and non-included assets still function if they are online at their original URL, even if the static archive is on local storage or hosted at a different domain than the original site.

It became more work each year as the website gradually used script in more places, leading to more and more spots I had to manually touch up to keep the archive browsable. The website was not itself an SPA, but contained SPAs on certain pages; my goal was to capture a snapshot of the initial HTML paint of these SPAs, not to have them functional beyond that. This was (expectedly) beyond HTTrack's capabilities.

At least one other team member wanted to investigate https://github.com/Y2Z/monolith as a potential modern alternative.
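Those srcset fix-ups can also be scripted rather than done by hand. A minimal sketch of that kind of post-crawl pass (the archive/ directory and the photo.jpg → photo@2x.jpg naming convention are hypothetical, not details from the comment):

    # Re-inject 2x pixel-density variants that HTTrack skipped, but only
    # where the high-resolution file actually exists in the archive.
    from pathlib import Path

    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    ARCHIVE_ROOT = Path("archive")  # hypothetical HTTrack output directory

    for html_file in ARCHIVE_ROOT.rglob("*.html"):
        soup = BeautifulSoup(html_file.read_text(encoding="utf-8"), "html.parser")
        changed = False
        for img in soup.find_all("img", src=True):
            stem, dot, ext = img["src"].rpartition(".")
            hi_res = f"{stem}@2x.{ext}"  # hypothetical naming convention
            if dot and (html_file.parent / hi_res).exists():
                img["srcset"] = f"{img['src']} 1x, {hi_res} 2x"
                changed = True
        if changed:
            html_file.write_text(str(soup), encoding="utf-8")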
frozenice · 2 months ago
Funny timing. Just yesterday I was looking for an easy Windows tool to do a simple stress test on a website (legally, of course). A requirement of mine was to just give it the root URL and have the tool discover the rest automatically (staying on the same domain). Also, parameters like parallelism had to be easily manageable. After trying some crawlers / copiers and other tools, I went back to a simple one I already knew from saving static copies of websites in the past: HTTrack. It fit the bill perfectly! You can add the root URL, set it to "scan only" (so it doesn't download everything), and tweak settings like connections and speed (you can even change some settings mid-run, save settings, pause, ...). So thanks xroche for HTTrack! :)
benhoff · 2 months ago
I used this recently to download websites, stuffed them into a SQLite DB, processed them with Mozilla's Readability library, and then used the result and an LLM to ask questions of the webpage itself.

It was helpful to take each step in chunks, as I didn't have a complete processing pipeline when I started.

I had wondered if there was an easier or better way to do this; I probably would have liked to get the sitemap, pass the sitemap to an LLM, and then only download selected HTML pages rather than the entire website.
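A minimal sketch of the ingest step described above, assuming a local HTTrack mirror under mirror/ and the readability-lxml port of Mozilla's Readability (both assumptions, not details from the comment):

    # Walk a local mirror, pull the readable article HTML out of each page,
    # and stash it in SQLite for later LLM question-answering.
    import sqlite3
    from pathlib import Path

    from readability import Document  # pip install readability-lxml

    conn = sqlite3.connect("pages.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (path TEXT PRIMARY KEY, title TEXT, body TEXT)"
    )

    for html_file in Path("mirror").rglob("*.html"):
        doc = Document(html_file.read_text(encoding="utf-8", errors="replace"))
        conn.execute(
            "INSERT OR REPLACE INTO pages VALUES (?, ?, ?)",
            (str(html_file), doc.short_title(), doc.summary()),
        )

    conn.commit()
    conn.close()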
alberth · 2 months ago
Or just do:

    wget -rkpN -e robots=off https://www.example.com/
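For reference, the flags: -r crawls recursively, -k rewrites links in the downloaded pages so they work locally, -p also fetches page requisites such as images and CSS, -N enables timestamping so re-runs only fetch changed files, and -e robots=off tells wget to ignore robots.txt.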
op7 · 2 months ago
This isn't 1998 anymore, so downloading the files from modern websites doesn't really work if you're trying to maintain your own private local / re-hosted copy of a site, especially ones with dynamically loaded content. Some additional processing is needed to fix the files. I have never been able to find a modern scraping solution that works with most modern websites. I suppose the existence of this sort of tool is in conflict with the interests of Big Tech, for it would make the creation of visually identical phishing sites that much easier.
Hard_Space · 2 months ago
I used this all the time twenty years ago. Tried it out again recently for some reason, I think at the suggestion of ChatGPT (!), for some archiving, and it actually did some damage.

I do wish there were a modern version of this that could embed the videos in some of my old blog posts so I could save them entire, locally, as something other than an HTML mystery blob. None of the archive sites preserve video, and neither do extensions like SingleFile. If you're lucky, they'll embed a link to the original file, but that won't help later when the original posts go offline.
nbenitezl · 2 months ago
A long time ago, HTTrack came in very handy for me at work. We had created a PHP/MySQL application to store data for a census of industrial sites and related info. One day my boss told me the customer wanted the census delivered to them on an auto-start CD-ROM, which was very fashionable at the time. I used HTTrack to download every page of our PHP database so it would all be browsable offline from the CD-ROM; the auto-start just launched the browser at the index page.

Very handy.
superjan · 2 months ago
A few years ago my workplace got rid of our on-premise install of FogBugz. I tried to clone the site with HTTrack, but it did not work due to client-side JavaScript and authentication issues.

I was familiar with C#/WebView2 and used that: generate the URLs, load the pages one by one, wait for the HTML to be built, and then save the final page, intercepting and saving the CSS/image requests along the way.

If you have ever integrated a browser view in a desktop or mobile app, you already know how to do this.
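The same load-render-save loop can be sketched with any headless browser. A rough Python/Playwright equivalent of the approach described above (the URL list, output layout, and the networkidle heuristic are all assumptions, not details from the comment):

    # Render each page in a headless browser, save the final HTML, and
    # persist the CSS/image responses that were fetched along the way.
    from pathlib import Path
    from urllib.parse import urlparse

    from playwright.sync_api import sync_playwright  # pip install playwright

    URLS = ["https://example.com/case/1"]  # hypothetical; generate your own list
    OUT = Path("clone")
    OUT.mkdir(exist_ok=True)

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        responses = []
        page.on("response", responses.append)
        for url in URLS:
            responses.clear()
            # "networkidle" waits until the page stops making requests, a
            # rough stand-in for "the client-side JS has finished".
            page.goto(url, wait_until="networkidle")
            name = urlparse(url).path.strip("/").replace("/", "_") or "index"
            (OUT / f"{name}.html").write_text(page.content(), encoding="utf-8")
            # Requests have settled, so response bodies are safe to read now.
            for r in responses:
                if r.request.resource_type in ("stylesheet", "image"):
                    rel = urlparse(r.url).path.lstrip("/")
                    if rel:
                        dest = OUT / rel
                        dest.parent.mkdir(parents=True, exist_ok=True)
                        dest.write_bytes(r.body())
        browser.close()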
NetOpWibby · 2 months ago
I thought there was an update, but it just looks like a random share… so I'll share something random! My favorite forum back in the day was the Rockman.EXE Online forums, and they had server issues a few times. I was afraid of it going offline for good and came across HTTrack. My laptop was crappy as hell, so maybe that's why I didn't have the best experience with it? Or maybe trying to back up a forum from the front end wasn't a good idea, LOL.
gtirloni · 2 months ago
Every techie who lived through the dialup days was constantly downloading stuff to read offline. I'm usually not very nostalgic, as I think even through crazy times like these we are still improving, but I feel I had much better focus back then.

Of course this doesn't translate into better productivity, because we have way better tools today, but it was nice to read, say, the GCC manual in one go.
bruh2 · 2 months ago
I can't recall the details, but this tool had quite some friction the last time I tried downloading a site with it: too many new definitions to learn, too many knobs it asks you to tweak. I opted to use wget with the --recursive flag, which just did what I expected out of the box: crawl all the links it can find and download them. No tweaking needed, and nothing new to learn.
hosteur · 2 months ago
Related: https://news.ycombinator.com/item?id=27789910
mdtrooper · 2 months ago
I still remember NetVampire (https://web.archive.org/web/19990125091054/http://netvampire.com/), an old application I used to do the same thing back in the WinNT days.
deanebarker · 2 months ago
I used the hell out of this, back in the day. We would copy websites down, then use Swish-E from the command line to index everything from the file system, then somehow make a web interface out of it. Good times.
alanh · 2 months ago
How does it compare to SiteSucker (https://ricks-apps.com/osx/sitesucker/index.html)?
arkensaw · 2 months ago
OMG, I used to use HTTrack to archive interesting sites at home, usually some sort of hobbyist boardgame thing or historical resource. I never kept the archives, and the originals are long gone now; I should have!
sfmike · 2 months ago
The good ole days: this, VLC, and lots of other free yet functional, quirkily designed little tools built with critical thought for their use cases.
jmsflknr · 2 months ago
Never found a great alternative to this for Mac.
bomewish · 2 months ago
No one has mentioned Firecrawl? Can anyone compare it to ArchiveBox or HTTrack?
tamim17 · 2 months ago
In the old days, I used to download entire websites using HTTrack and read them later.
solardev · 2 months ago
This doesn't really work with most sites anymore, does it? It can't run JavaScript (unlike headless browsers with Playwright/Puppeteer, for example), has limited support for more modern protocols, etc.

Any suggestions for an easy way to mirror modern web content, like an HTTrack for the enshittified web?
ak007 · 2 months ago
Nostalgia!!
NewEntryHN · 2 months ago
Is this

    wget --mirror

?
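For what it's worth, --mirror is shorthand for -r -N -l inf --no-remove-listing, so it covers the recursion and timestamping of the command above but not the link rewriting (-k) or page requisites (-p) that make a local copy browsable offline.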
shuri · 2 months ago
Time to add an AI mode to this :).