My last contract job was to build a 100% perfect website mirroring program for a group of lawyers who were interested in building class action lawsuits against some of the more heinous scammers out there.<p>I ended up building about 8 versions of it, using pretty much every PHP and Python library and resource I could find.<p>I tried HTTrack, php-ultimate-web-scraper (from GitHub), headless Chromium, headless Selenium, and a few others.<p>By far the biggest problem was dealing with JS links. You wouldn't think at the start that it would be such a big deal, and yet it was.<p>Selenium with Python turned out to be the winning combination, and of course it was the last one I tried. Also, this is an ideal project for recursion, although you have to be careful about exit conditions.<p>One thing that was VERY important for performance was not visiting any page more than once because, obviously, certain links in headers and footers are duplicated, sometimes hundreds of times.<p>JS links often made it very difficult to discover the linked page, and certain library calls that were supposed to get this info for you often didn't work.<p>It was a super fun project, and in the end, considering I only worked on it for 2 months, I shipped some decent code that was getting about 98.6% of the pages perfectly.<p>The final presentation was interesting. For some reason my client had, I think, gotten it into his head that I wasn't a very good programmer or something, and as we ran through his list of sample sites he was expecting my program to error out or mirror a site incorrectly. But it handled all 10 of the sites just about perfectly, and he was rather flabbergasted, because he told me it would have taken him a week of clicking by hand to build the mirror, and instead the program did them all in under an hour.
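<p>For anyone curious, the visit-once rule and the recursion exit conditions could look roughly like this. It is only a minimal sketch under assumptions, not the code from the project: `get_links` is a hypothetical stand-in for whatever extracts hrefs from a rendered page (for example a Selenium-driven browser), and `normalize`, `crawl`, and `max_depth` are names invented for illustration.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize(base, href):
    """Resolve href against base and drop any #fragment so duplicates collapse."""
    absolute, _fragment = urldefrag(urljoin(base, href))
    return absolute


def crawl(start_url, get_links, max_depth=5):
    """Recursively walk one site, visiting each URL at most once.

    get_links is a hypothetical callable: url -> iterable of hrefs found on
    that page. It is an assumption of this sketch, not a real library API.
    """
    host = urlparse(start_url).netloc
    visited = set()

    def visit(url, depth):
        # Exit conditions: too deep, already seen, or pointing off-site.
        if depth > max_depth or url in visited or urlparse(url).netloc != host:
            return
        visited.add(url)  # mark before recursing so link cycles terminate
        for href in get_links(url):
            visit(normalize(url, href), depth + 1)

    visit(normalize(start_url, ""), 0)
    return visited
```

<p>The `visited` set is what keeps header and footer links from being fetched over and over, and `max_depth` keeps the recursion bounded. For deep sites an explicit queue would probably be safer than recursion, since Python caps recursion depth.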