If it simply visits sites, it will face a paywall too. If it identifies itself as archive.is, then other people could identify themselves the same way.
Probably the people who operate archive.is just purchased subscriptions for the most common newspaper sites. And then they can use something like https://pptr.dev/ to automate login and article retrieval.

I guess the business model is to inject their ads into someone else's content, so kinda like Facebook. That would also surely generate more money from the ads than the cost of subscribing to multiple newspapers.
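Puppeteer is a Node library; here's the same idea sketched in Python with Playwright instead (my own stand-in, not anything archive.is is known to use; the URLs, selectors, and credentials are made up):

```python
# Sketch: log in to a subscription site with a headless browser and grab
# the rendered article HTML. URLs, selectors, and credentials are hypothetical.
from playwright.sync_api import sync_playwright

LOGIN_URL = "https://example-newspaper.com/login"      # hypothetical
ARTICLE_URL = "https://example-newspaper.com/article"  # hypothetical

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Log in once with a paid subscriber account.
    page.goto(LOGIN_URL)
    page.fill("#email", "subscriber@example.com")   # hypothetical selector
    page.fill("#password", "hunter2")               # hypothetical selector
    page.click("button[type=submit]")
    page.wait_for_load_state("networkidle")

    # Fetch the article as a logged-in subscriber and keep the rendered HTML.
    page.goto(ARTICLE_URL)
    html = page.content()
    browser.close()

print(html[:500])
```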
Off topic, but for years I've been using a one-off proxy to strip JavaScript and crap from my local newspaper site (sfgate.com). It just reads the site with Python urllib.request and then does some DOM cleanup with Beautiful Soup. I wasn't doing any site crawling or exposing the proxy to 1000s of people or anything like that. It was just improving my own reading experience.

Just in the past day or so, sfgate.com put in some kind of anti-scraping stuff, so urllib, curl, lynx etc. now all fail with 403. Maybe I'll undertake the bigger and slower hassle of trying to read the site with Selenium, or maybe I'll just give up on newspapers and get my news from HN ;).

I wonder if archive.is's experience with sfgate.com has changed too. Just had to mention them to stay slightly on topic.
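For what it's worth, the core of that kind of proxy fits in a few lines; a sketch using the same urllib + Beautiful Soup setup (the header and URL are just illustrative, and this obviously won't get past the new 403s):

```python
# Sketch: fetch a page and strip <script>/<style> tags before reading it.
import urllib.request
from bs4 import BeautifulSoup

def clean_page(url: str) -> str:
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as resp:
        html = resp.read().decode("utf-8", errors="replace")

    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "iframe"]):
        tag.decompose()          # drop JS and other clutter in place
    return str(soup)

if __name__ == "__main__":
    print(clean_page("https://www.sfgate.com/"))
```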
I think they might just try all the user agents in the robots.txt. [1] I've included a picture showing an example. In this second image, [2] I receive the paywall with the user agent left as default. There might also just be an archival user agent that most websites accept, but I haven't looked into it very much.

[1] https://i.imgur.com/lyeRTKo.png

[2] https://i.imgur.com/IlBhObn.png
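A rough sketch of that guess in Python: read the agents out of robots.txt and retry the article with each one (the "paywalled" check at the end is a made-up placeholder, not how any real site signals it):

```python
# Sketch: pull the user agents listed in robots.txt and retry the article
# with each one. The paywall heuristic at the end is a placeholder.
import re
import urllib.error
import urllib.request

def user_agents_from_robots(site: str) -> list[str]:
    with urllib.request.urlopen(f"{site}/robots.txt") as resp:
        text = resp.read().decode("utf-8", errors="replace")
    agents = re.findall(r"(?im)^user-agent:\s*(.+?)\s*$", text)
    return [a for a in agents if a != "*"]

def try_agents(article_url: str, agents: list[str]) -> None:
    for agent in agents:
        req = urllib.request.Request(article_url, headers={"User-Agent": agent})
        try:
            with urllib.request.urlopen(req) as resp:
                body = resp.read().decode("utf-8", errors="replace")
        except urllib.error.HTTPError as err:
            print(agent, "->", err.code)
            continue
        # Placeholder heuristic; a real check would look for the article body.
        print(agent, "->", "paywalled" if "subscribe" in body.lower() else "maybe full text")

try_agents("https://example.com/article", user_agents_from_robots("https://example.com"))
```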
Just archived a website I created. It looks like it runs HTTP requests from a server to pull the HTML, JS, and image files (it shows the individual requests completing before the archival process finishes). It must then snapshot the rendered output and serve those assets from its own domain. Buttons on my site don't work after the snapshot, since the scripts were stripped.
> If it identifies itself as archive.is, then other people could identify themselves the same way.

*Theoretically*, they could just publish the list of IP ranges that canonically "belong" to archive.is. That would allow websites to distinguish whether a request identifying itself as archive.is is *actually* from them (it fits one of the IP ranges) or from a fraudster.
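A minimal sketch of that check on the website's side, assuming archive.is actually published such a list (the ranges below are documentation placeholders, not real archive.is ranges):

```python
# Sketch: verify that a request claiming to be archive.is comes from a
# published IP range. The ranges here are placeholders, not real ones.
import ipaddress

PUBLISHED_RANGES = [ipaddress.ip_network(r) for r in ("192.0.2.0/24", "198.51.100.0/24")]

def is_really_archive_is(remote_ip: str) -> bool:
    addr = ipaddress.ip_address(remote_ip)
    return any(addr in net for net in PUBLISHED_RANGES)

# A request with a spoofed "archive.is" user agent from an outside address fails:
print(is_really_archive_is("192.0.2.14"))   # True  -> trust the UA claim
print(is_really_archive_is("203.0.113.7"))  # False -> treat as a fraudster
```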
According to their blog they use AMP: https://blog.archive.today/post/675805841411178496/how-does-archive-bypass-hard-paywalls-for#notes
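The post doesn't spell out the mechanics, but many publishers advertise an article's AMP version through a standard `<link rel="amphtml">` tag, so one plausible (unconfirmed) reading is something like this sketch:

```python
# Sketch: look for the page's declared AMP version and fetch that instead.
import urllib.request
from bs4 import BeautifulSoup

def fetch(url: str) -> str:
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")

def amp_version(article_url: str) -> str | None:
    soup = BeautifulSoup(fetch(article_url), "html.parser")
    link = soup.select_one('link[rel="amphtml"]')
    return link["href"] if link and link.has_attr("href") else None

amp_url = amp_version("https://example.com/some-article")  # illustrative URL
if amp_url:
    print(fetch(amp_url)[:500])
```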
A lot of sites don't seem to care about their paywall. Plenty of them load the full article, then "block" me from seeing the rest by adding a popup and `overflow: hidden` to `body`, which is super easy to bypass with devtools. Others give you "free articles" via a cookie or localStorage, which you can of course remove to get more free articles.

Some of your readers will see a paywall and then pay; others will try to bypass it or simply not read at all. Articles spread through social media attention, and a paywalled article gets much less of it, so it's non-negligibly beneficial to have people read the article for free who would otherwise not read it at all.

Which is to say: the methods archive.is uses may not be that special. Clear cookies, block JavaScript, and make deals with or special-case the few sites which actually enforce their paywalls. Or identify yourself as archive.is, and if others do that to bypass the paywall, good for them.
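For the "clear cookies, block JavaScript" part, a tiny sketch with Playwright (my illustration, not a claim about what archive.is runs):

```python
# Sketch: load an article in a fresh browser context with no stored cookies
# and JavaScript disabled, so metered or JS-drawn paywalls never run.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    # A new context starts with empty cookies/localStorage; JS is switched off.
    context = browser.new_context(java_script_enabled=False)
    page = context.new_page()
    page.goto("https://example.com/some-article")  # illustrative URL
    print(page.content()[:500])
    browser.close()
```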
Not specifically related to archive.is, but news sites have a tightrope to walk.

They need to allow the full content of their articles to be accessed by crawlers so they can show up in search results, but they also want to restrict access via paywalls. They use two main methods to achieve this: JavaScript DOM manipulation and IP address rate limiting.

Conceivably one could build a system which accesses a given document once from a unique IP address and then caches the HTML version of the page for further serving.
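Leaving out the unique-IP part (that needs a pool of proxies), the fetch-once-then-cache idea is only a few lines; a sketch:

```python
# Sketch: fetch a document once, cache the HTML on disk, and serve the
# cached copy for every later request. Proxy/IP rotation is left out.
import hashlib
import pathlib
import urllib.request

CACHE_DIR = pathlib.Path("cache")
CACHE_DIR.mkdir(exist_ok=True)

def get_cached(url: str) -> str:
    key = hashlib.sha256(url.encode()).hexdigest()
    path = CACHE_DIR / f"{key}.html"
    if path.exists():
        return path.read_text(encoding="utf-8")          # serve the snapshot
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    path.write_text(html, encoding="utf-8")              # first and only fetch
    return html
```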
Many (most?) "big content" sites let Google and Bing spiders scrape the contents of articles so when people search for terms in the article they'll find a hit and then get referred to the paywall.

Google doesn't want everyone to know what a Google indexing request looks like, for fear the SEO mafia will institute shenanigans. And the content providers (NYT, WaPo, etc.) don't want people to know 'cause they don't want people evading their paywall.

Or maybe they're okay with letting the archive index their content...
If the people who know that tell you, they could lose access to said resources.

But it's kind of an open secret; you're just not looking in the right place.
I just tried it with a local newspaper: it did remove the floating pane, but it didn't unblur the text, which is also scrambled (it used to be much less well protected; Firefox reader mode could easily bypass it).

(https://archive.is/1h4UV)
I thought they used this browser extension: https://gitlab.com/magnolia1234/bypass-paywalls-chrome-clean
My hypothesis is that they use a set of generic methods (e.g., robot user agents, a transient cache, and JS filtering) and rely on user reports (they have a Tumblr page for that) to identify and manually fix access to specific sites. Having a look at the source of the Bypass Paywalls Clean extension will give you a good idea of the most useful bypassing methods. Indeed, most publishers are only incentivized to paywall their content to the degree that most of their audience is directed to pay; they have to leave backdoors here and there for purposes such as SEO.
What happens when you first load a paywalled article? 9 times out of 10 it shows the entire article before the JS that runs on the page pops up the paywall. Seems like it probably just takes a snapshot before the paywall JS executes, combined with the Google referrer trick or something along those lines.
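The referrer part is easy to sketch: request the page with a Google Referer header and never execute the page's JS (URL and headers are illustrative, and plenty of sites check more than this):

```python
# Sketch: fetch the article with a Google referer and never run the page's JS,
# so a client-side paywall has no chance to hide the text.
import urllib.request

def fetch_with_google_referer(url: str) -> str:
    req = urllib.request.Request(url, headers={
        "User-Agent": "Mozilla/5.0",
        "Referer": "https://www.google.com/",   # the "came from a search result" hint
    })
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")

print(fetch_with_google_referer("https://example.com/paywalled-article")[:500])
```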
They use you as a proxy. If you (the person archiving it) have access to the site (either because you paid or still have free articles), they can archive it too. If you don't have access, they only archive the paywall.
Alas, it doesn't allow access to the comment section of the WSJ, which is the only reason I would visit the site. WSJ comments reinforce my opinion of the majority of humanity. My father allowed his subscription to lapse, and I won't send them my money, so I will just have to imagine it.