TechEcho
Ask HN: How to remove ads from a downloaded HTML file to output an ad-free file?

1 point by suramya_tomar 6 months ago
Is there a tool/script that will let me filter out ads from a page when downloading it with curl (similar to how uBlock Origin works in a browser)?

Basically, I am downloading a snapshot of a site using curl. But the sites have advertisements in them, which I want to filter out. So is there a tool that will let me do that from the command line, so that the output file doesn't have ads in it?

In short, I want something like uBlock Origin, but for HTML files that I will be converting to PDFs or EPUBs. Something like:

curl https://www.google.com | AdRemover.sh | htmltopdf

Most of the solutions I found require updating the /etc/hosts file to stop the ads from showing, but I would rather avoid that if possible.

3 comments

suramya_tomar 6 months ago
After taking a break and stepping away for a bit, I realized that I was recreating an archiving system for websites, and that there are existing solutions that do the same thing.

I found https://github.com/ArchiveBox/ArchiveBox/, which is a self-hosted web archiving system. It covers most of my use cases (and I can extend it for additional functionality), so I am going to set it up and try it out.

Thanks, all, for the help.
solardev 6 months ago
Do you have to use curl? It wouldn't render a lot of sites correctly anyway (anything that uses JS for rendering).

Can you run a Puppeteer/Playwright instance (which controls a real browser) and add an ad blocker to that? E.g. https://github.com/ghostery/adblocker or https://github.com/microsoft/playwright-python/issues/782
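
For illustration, here is a rough sketch of the browser-based route using Playwright for Python. It does not use the Ghostery ad blocker linked above. Instead it swaps in a much simpler technique: aborting any request whose host appears in a small blocklist. The `AD_HOSTS` set, `is_ad_url`, and `save_pdf_without_ads` are hypothetical names made up for this example, and the three hosts are placeholders rather than a real filter list. It assumes `pip install playwright` and `playwright install chromium` have been run.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist for illustration only; a real setup would load a
# maintained filter list instead of three hard-coded hosts.
AD_HOSTS = {"doubleclick.net", "googlesyndication.com", "adservice.google.com"}


def is_ad_url(url: str) -> bool:
    """Return True if the URL's host is a listed ad host or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in AD_HOSTS)


def save_pdf_without_ads(url: str, pdf_path: str) -> None:
    """Render the page in headless Chromium, aborting requests to ad hosts."""
    # Imported lazily so is_ad_url stays usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    def handle(route):
        # Drop requests to blocklisted hosts; let everything else through.
        if is_ad_url(route.request.url):
            route.abort()
        else:
            route.continue_()

    with sync_playwright() as p:
        browser = p.chromium.launch()  # headless by default
        page = browser.new_page()
        page.route("**/*", handle)  # intercept every request the page makes
        page.goto(url, wait_until="networkidle")
        page.pdf(path=pdf_path)  # PDF export works in headless Chromium only
        browser.close()
```

Calling `save_pdf_without_ads("https://example.com", "out.pdf")` would cover the download and PDF steps in one go. Blocking by host only stops ads loaded from third-party domains; ads served from the site's own domain would still get through.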
inhumantsar 6 months ago
Editing /etc/hosts is going to be the easiest option.

The best option would be to use a programming language and a good HTML parser to do the job, e.g. use Python and BeautifulSoup to dig through the tree looking for any HTML tag that references an ad-serving network.
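
A minimal sketch of that parser-based approach, assuming BeautifulSoup is installed (`pip install beautifulsoup4`). The names `AD_HOSTS`, `URL_ATTRS`, `is_ad_url`, `is_ad_tag`, and `strip_ads` are invented for this example, and the host list is a tiny placeholder, not a real ad-network list.

```python
from urllib.parse import urlsplit

from bs4 import BeautifulSoup

# Hypothetical blocklist for illustration only; a real list would be far longer.
AD_HOSTS = {"doubleclick.net", "googlesyndication.com", "adservice.google.com"}

# Attributes that commonly carry a URL on script, img, iframe, and a tags.
URL_ATTRS = ("src", "href", "data-src")


def is_ad_url(url: str) -> bool:
    """Return True if the URL's host is a listed ad host or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in AD_HOSTS)


def is_ad_tag(tag) -> bool:
    """Return True if any URL-bearing attribute of the tag points at an ad host."""
    return any(is_ad_url(tag.get(attr) or "") for attr in URL_ATTRS)


def strip_ads(html: str) -> str:
    """Remove every tag (and its children) that references an ad host."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(is_ad_tag):
        # extract() detaches the tag together with everything nested inside it.
        tag.extract()
    return str(soup)
```

Wrapped in a script that passes `sys.stdin.read()` to `strip_ads` and prints the result, this could take the place of the `AdRemover.sh` step in the pipeline from the question. It only sees markup present in the downloaded file, so ads injected later by JavaScript are out of its reach.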