If it simply visits sites, it will face a paywall too. If it identifies itself as archive.is, then other people could identify themselves the same way.
Probably the people who operate archive.is just purchased subscriptions for the most common newspaper sites. And then they can use something like https://pptr.dev/ to automate login and article retrieval.

I guess the business model is to inject their ads into someone else's content, so kinda like Facebook. That would also surely generate more money from the ads than the cost of subscribing to multiple newspapers.
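Puppeteer is a Node library; here's the same idea sketched in Python with Playwright instead (my own stand-in, not anything archive.is is known to use; the URLs, selectors, and credentials are made up):

```python
# Sketch: log in to a subscription site with a headless browser and grab
# the rendered article HTML. URLs, selectors, and credentials are hypothetical.
from playwright.sync_api import sync_playwright

LOGIN_URL = "https://example-newspaper.com/login"      # hypothetical
ARTICLE_URL = "https://example-newspaper.com/article"  # hypothetical

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Log in once with a paid subscriber account.
    page.goto(LOGIN_URL)
    page.fill("#email", "subscriber@example.com")   # hypothetical selector
    page.fill("#password", "hunter2")               # hypothetical selector
    page.click("button[type=submit]")
    page.wait_for_load_state("networkidle")

    # Fetch the article as a logged-in subscriber and keep the rendered HTML.
    page.goto(ARTICLE_URL)
    html = page.content()
    browser.close()

print(html[:500])
```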
Off topic, but for years I've been using a one-off proxy to strip JavaScript and crap from my local newspaper site (sfgate.com). It just reads the site with Python urllib.request and then does some DOM cleanup with Beautiful Soup. I wasn't doing any site crawling or exposing the proxy to 1000s of people or anything like that. It was just improving my own reading experience.

Just in the past day or so, sfgate.com put in some kind of anti-scraping stuff, so urllib, curl, lynx etc. now all fail with 403. Maybe I'll undertake the bigger and slower hassle of trying to read the site with Selenium, or maybe I'll just give up on newspapers and get my news from HN ;).

I wonder if archive.is's experience with sfgate.com has changed too. Just had to mention them to stay slightly on topic.
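For what it's worth, the core of that kind of proxy fits in a few lines; a sketch using the same urllib + Beautiful Soup setup (the header and URL are just illustrative, and this obviously won't get past the new 403s):

```python
# Sketch: fetch a page and strip <script>/<style> tags before reading it.
import urllib.request
from bs4 import BeautifulSoup

def clean_page(url: str) -> str:
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as resp:
        html = resp.read().decode("utf-8", errors="replace")

    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "iframe"]):
        tag.decompose()          # drop JS and other clutter in place
    return str(soup)

if __name__ == "__main__":
    print(clean_page("https://www.sfgate.com/"))
```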
I think they might just try all the user agents in the robots.txt. [1] I've included a picture showing an example. In this second image, [2] I receive the paywall with the user agent left as default. There might also just be an archival user agent that most websites accept, but I haven't looked into it very much.

[1] https://i.imgur.com/lyeRTKo.png

[2] https://i.imgur.com/IlBhObn.png
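A rough sketch of that guess in Python: read the agents out of robots.txt and retry the article with each one (the "paywalled" check at the end is a made-up placeholder, not how any real site signals it):

```python
# Sketch: pull the user agents listed in robots.txt and retry the article
# with each one. The paywall heuristic at the end is a placeholder.
import re
import urllib.error
import urllib.request

def user_agents_from_robots(site: str) -> list[str]:
    with urllib.request.urlopen(f"{site}/robots.txt") as resp:
        text = resp.read().decode("utf-8", errors="replace")
    agents = re.findall(r"(?im)^user-agent:\s*(.+?)\s*$", text)
    return [a for a in agents if a != "*"]

def try_agents(article_url: str, agents: list[str]) -> None:
    for agent in agents:
        req = urllib.request.Request(article_url, headers={"User-Agent": agent})
        try:
            with urllib.request.urlopen(req) as resp:
                body = resp.read().decode("utf-8", errors="replace")
        except urllib.error.HTTPError as err:
            print(agent, "->", err.code)
            continue
        # Placeholder heuristic; a real check would look for the article body.
        print(agent, "->", "paywalled" if "subscribe" in body.lower() else "maybe full text")

try_agents("https://example.com/article", user_agents_from_robots("https://example.com"))
```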
Just archived a website I created. It looks like it runs HTTP requests from a server to pull the HTML, JS, and image files (it shows the individual requests completing before the archival process finishes). It must then snapshot the rendered output and serve those assets from its own domain. Buttons on my site don't work after the snapshot, since the scripts were stripped.
> If it identifies itself as archive.is, then other people could identify themselves the same way.

*Theoretically*, they could just publish the list of IP ranges that canonically "belong" to archive.is. That would allow websites to distinguish whether a request identifying itself as archive.is is *actually* from them (it fits one of the IP ranges) or from a fraudster.
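A minimal sketch of that check on the website's side, assuming archive.is actually published such a list (the ranges below are documentation placeholders, not real archive.is ranges):

```python
# Sketch: verify that a request claiming to be archive.is comes from a
# published IP range. The ranges here are placeholders, not real ones.
import ipaddress

PUBLISHED_RANGES = [ipaddress.ip_network(r) for r in ("192.0.2.0/24", "198.51.100.0/24")]

def is_really_archive_is(remote_ip: str) -> bool:
    addr = ipaddress.ip_address(remote_ip)
    return any(addr in net for net in PUBLISHED_RANGES)

# A request with a spoofed "archive.is" user agent from an outside address fails:
print(is_really_archive_is("192.0.2.14"))   # True  -> trust the UA claim
print(is_really_archive_is("203.0.113.7"))  # False -> treat as a fraudster
```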
According to their blog they use AMP: https://blog.archive.today/post/675805841411178496/how-does-archive-bypass-hard-paywalls-for#notes
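The post doesn't spell out the mechanics, but many publishers advertise an article's AMP version through a standard `<link rel="amphtml">` tag, so one plausible (unconfirmed) reading is something like this sketch:

```python
# Sketch: look for the page's declared AMP version and fetch that instead.
import urllib.request
from bs4 import BeautifulSoup

def fetch(url: str) -> str:
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")

def amp_version(article_url: str) -> str | None:
    soup = BeautifulSoup(fetch(article_url), "html.parser")
    link = soup.select_one('link[rel="amphtml"]')
    return link["href"] if link and link.has_attr("href") else None

amp_url = amp_version("https://example.com/some-article")  # illustrative URL
if amp_url:
    print(fetch(amp_url)[:500])
```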
A lot of sites don't seem to care about their paywall. Plenty of them load the full article, then "block" me from seeing the rest by adding a popup and `overflow: hidden` to `body`, which is super easy to bypass with devtools. Others give you "free articles" via a cookie or localStorage, which you can of course remove to get more free articles.

Some of your readers will see a paywall and then pay; others will try to bypass it or simply not read at all. Articles spread through social media attention, and a paywalled article gets much less of it, so it's non-negligibly beneficial to have people read the article for free who would otherwise not read it at all.

Which is to say: the methods archive.is uses may not be that special. Clear cookies, block JavaScript, and make deals with or special-case the few sites which actually enforce their paywalls. Or identify yourself as archive.is, and if others do that to bypass the paywall, good for them.
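For the "clear cookies, block JavaScript" part, a tiny sketch with Playwright (my illustration, not a claim about what archive.is runs):

```python
# Sketch: load an article in a fresh browser context with no stored cookies
# and JavaScript disabled, so metered or JS-drawn paywalls never run.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    # A new context starts with empty cookies/localStorage; JS is switched off.
    context = browser.new_context(java_script_enabled=False)
    page = context.new_page()
    page.goto("https://example.com/some-article")  # illustrative URL
    print(page.content()[:500])
    browser.close()
```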
Not specifically related to archive.is, but news sites have a tightrope to walk.

They need to allow the full content of their articles to be accessed by crawlers so they can show up in search results, but they also want to restrict access via paywalls. They use two main methods to achieve this: JavaScript DOM manipulation and IP address rate limiting.

Conceivably one could build a system which accesses a given document once from a unique IP address and then caches the HTML version of the page for further serving.
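Leaving out the unique-IP part (that needs a pool of proxies), the fetch-once-then-cache idea is only a few lines; a sketch:

```python
# Sketch: fetch a document once, cache the HTML on disk, and serve the
# cached copy for every later request. Proxy/IP rotation is left out.
import hashlib
import pathlib
import urllib.request

CACHE_DIR = pathlib.Path("cache")
CACHE_DIR.mkdir(exist_ok=True)

def get_cached(url: str) -> str:
    key = hashlib.sha256(url.encode()).hexdigest()
    path = CACHE_DIR / f"{key}.html"
    if path.exists():
        return path.read_text(encoding="utf-8")          # serve the snapshot
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    path.write_text(html, encoding="utf-8")              # first and only fetch
    return html
```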
Many (most?) "big content" sites let Google and Bing spiders scrape the contents of articles so when people search for terms in the article they'll find a hit and then get referred to the paywall.

Google doesn't want everyone to know what a Google indexing request looks like, for fear the SEO mafia will institute shenanigans. And the content providers (NYT, WaPo, etc.) don't want people to know 'cause they don't want people evading their paywall.

Or maybe they're okay with letting the archive index their content...
If the people who know that tell you, they could lose access to said resources.

But it's kind of an open secret; you're just not looking in the right place.
I just tried it with a local newspaper: it did remove the floating pane, but it didn't unblur the text, which is also scrambled (it used to be much less well protected; Firefox reader mode could easily bypass it).

(https://archive.is/1h4UV)
I thought they used this browser extension: https://gitlab.com/magnolia1234/bypass-paywalls-chrome-clean
My hypothesis is that they use a set of generic methods (e.g., robot user agents, a transient cache, and JS filtering) and rely on user reports (they have a Tumblr page for that) to identify and manually fix access to specific sites. Having a look at the source of the Bypass Paywalls Clean extension will give you a good idea of the most useful bypassing methods. Indeed, most publishers are only incentivized to paywall their content to the degree that most of their audience is directed to pay; they have to leave backdoors here and there for purposes such as SEO.
What happens when you first load a paywalled article? 9 times out of 10 it shows the entire article before the JS that runs on the page pops up the paywall. Seems like it probably just takes a snapshot before the paywall JS executes, combined with the Google referrer trick or something along those lines.
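The referrer part is easy to sketch: request the page with a Google Referer header and never execute the page's JS (URL and headers are illustrative, and plenty of sites check more than this):

```python
# Sketch: fetch the article with a Google referer and never run the page's JS,
# so a client-side paywall has no chance to hide the text.
import urllib.request

def fetch_with_google_referer(url: str) -> str:
    req = urllib.request.Request(url, headers={
        "User-Agent": "Mozilla/5.0",
        "Referer": "https://www.google.com/",   # the "came from a search result" hint
    })
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")

print(fetch_with_google_referer("https://example.com/paywalled-article")[:500])
```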
They use you as a proxy. If you (the person archiving it) have access to the site (either because you paid or still have free articles), they can archive it too. If you don't have access, they only archive the paywall.
Alas, it doesn't allow access to the comment section of the WSJ, which is the only reason I would visit the site. WSJ comments reinforce my opinion of the majority of humanity. My father allowed his subscription to lapse, and I won't send them my money, so I will just have to imagine it.