Show HN: HtmlWasher – An HTML cleanup tool

75 points by seky, about 8 years ago

14 comments

ElijahLynn, about 8 years ago
The washer makes an XHR to /ajax/paste to do the 'washing'.

Seems like this could be done in client-side JavaScript without an XHR, so your info would not be sent to them.

However, from https://www.htmlwasher.com/privacy/:

"The Operator may collect the personal data, such as, without limitation, (i) name; (ii) age; (iii) sex; (iv) address; (v) homepage URL address; (vi) telephone number; (vii) email address; (viii) bank account number; as well as (ix) any information relating and relevant to the Services, including, without limitation, opening and administering the Account, or getting feedback for improving the Services."

"In the event that the Operator is involved in a bankruptcy, merger, acquisition, reorganization or sale of assets, your personal data may be sold or transferred as part of that transaction."

goodgood, about 8 years ago
Pandoc can do this:

    cat tea-dance.html | pandoc --from=html --to=markdown | pandoc --from=markdown --to=html

I learned that from vimcasts.org: http://vimcasts.org/episodes/using-external-filter-commands-to-reformat-html/

burnbabyburn, about 8 years ago
On this subject, I really like https://github.com/mozilla/bleach.

Is your project any different aside from its "service-oriented" nature? (Also, I don't see any way to use it other than from the browser.)

tshadwell, about 8 years ago
From experience, I wouldn't recommend anything other than context-aware safe templating systems for HTML safety in this day and age.

To an even greater extent than templating systems, sanitization systems of this type need to be built by an expert and align perfectly with how browsers parse tags, which is no small feat.

To give more concrete examples, from a few minutes of testing:

    <a href="javascript://%0Aalert`xss`">1</a>    <- XSS on click

    <img src=javascript:alert(2)>    <- XSS in Opera Mobile, Opera 10, early versions of IE

    <img src="/logout">    <- CSRF, which affects nearly everything built without security know-how
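
One narrow piece of this can be sketched in code: checking the URL scheme of `href`/`src` values against an allowlist instead of searching for the literal string `javascript:`. The Python below is a minimal illustration, not HtmlWasher's method. `ALLOWED_SCHEMES`, `url_scheme`, and `is_allowed_url` are hypothetical names, the whitespace stripping is a deliberately over-strict approximation of how browsers preprocess URLs, and the sketch assumes attribute values have already been entity-decoded by an HTML parser.

```python
import re

# Hypothetical allowlist; adjust to the schemes a given site actually needs.
ALLOWED_SCHEMES = {"http", "https", "mailto"}


def url_scheme(value: str) -> str:
    """Return the lowercased scheme of a URL attribute value, or "" if relative."""
    # Browsers drop tabs and newlines inside URLs and trim leading control
    # characters and spaces; removing every such character is stricter still.
    cleaned = "".join(ch for ch in value if ord(ch) > 0x20)
    # A scheme can only appear before the first "/", "?" or "#".
    head = re.split(r"[/?#]", cleaned, maxsplit=1)[0]
    if ":" not in head:
        return ""
    return head.split(":", 1)[0].lower()


def is_allowed_url(value: str) -> bool:
    """Accept relative URLs and allowlisted schemes; reject everything else."""
    scheme = url_scheme(value)
    return scheme == "" or scheme in ALLOWED_SCHEMES
```

With this sketch, both `javascript:` examples above are rejected, as is a tab-obfuscated `java\tscript:alert(1)`. The value `/logout` is accepted, so a scheme check alone does nothing about the third example.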

Continuous, about 8 years ago
This is brilliant!

I wrote an HTML file in Microsoft Word, then uploaded the resulting .html file, which had 800 lines. HtmlWasher cleaned up all the file content: the endless meta tags, the nonsense IE style tags, etc.

bluetidepro, about 8 years ago
This would be really useful as a service. Send a blob of HTML to their endpoint and get back what this site produces (the cleaned/washed HTML). As a service, it could be more efficient than doing one file at a time on their site. Or better yet, it would be awesome to open source the way this cleans the HTML. Regardless, awesome site. I could see the use for various scenarios.

DvdGiessen, about 8 years ago
Reminds me of a cleaner tool I wrote 10+ years ago: a huge single God-class which would parse an HTML string, let me do various transformations on the object tree, and re-render the entire source code as correct and nicely indented XHTML. Back then I had unused server capacity, so I often used it to compress dynamically rendered pages from, for example, message boards. It also allowed me to place a badge bragging about my 100% W3C validator score, since the original software packages often did not produce such clean HTML. :p The code is actually still being run on every page load for some old sites I never updated much since.

It has a tiny little web interface which remains online today on some underpowered server. Doesn't work well with anything except XHTML though. http://htmlcleaner.blackholestudios.nl/
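
The parse, transform, re-render pipeline described above can be sketched with Python's standard-library `html.parser`. This is an illustration under assumptions, not the original tool: `Node`, `TreeBuilder`, `render`, `reindent`, and the `VOID` set are hypothetical names, whitespace is collapsed everywhere (which would damage `pre` content), and the parser is lenient, not spec-compliant.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never have children or an end tag (partial, illustrative list).
VOID = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    def __init__(self, tag, attrs=()):
        self.tag = tag
        self.attrs = list(attrs)
        self.children = []  # Node instances or plain text strings


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text:
            self.stack[-1].children.append(text)


def render(node, depth=0):
    """Return the children of `node` as a list of indented XHTML-style lines."""
    pad = "  " * depth
    lines = []
    for child in node.children:
        if isinstance(child, str):
            lines.append(pad + escape(child, quote=False))
            continue
        attrs = ""
        for name, value in child.attrs:
            attrs += ' {}="{}"'.format(name, escape(value or "", quote=True))
        if child.tag in VOID:
            lines.append("{}<{}{} />".format(pad, child.tag, attrs))
        else:
            lines.append("{}<{}{}>".format(pad, child.tag, attrs))
            lines.extend(render(child, depth + 1))
            lines.append("{}</{}>".format(pad, child.tag))
    return lines


def reindent(html_text):
    builder = TreeBuilder()
    builder.feed(html_text)
    builder.close()
    # Any tree transformations would be applied to builder.root here.
    return "\n".join(render(builder.root))
```

Calling `reindent("<div><p>Hello<br>world</p></div>")` yields one element or text run per line, indented two spaces per nesting level, with the unclosed `<br>` written as `<br />`.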

tannhaeuser, about 8 years ago
If you're serious about HTML checking and cleanup, consider using SGML and my (unofficial) HTML 5.1 DTD [1].

It doesn't do magic (like indentation or removing/simplifying CSS) if that's what you're after, but it gives you straightforward capabilities to filter out script elements, check/suppress event handler attributes and other places where JavaScript can occur maliciously in HTML, enforce presence of HTML elements, etc. Since it's entirely driven by an SGML DTD grammar for HTML, it can be customized to death really (for context-dependent filtering, injection prevention, whatever).

[1]: http://sgmljs.net/blog/blog1701.html

richardwhiuk, about 8 years ago
"Reduces a HTML document (or fragment) to basic HTML tags and attributes" - meaning what exactly? What counts as a basic attribute?

egfx, about 8 years ago
This should be a library or an API; otherwise I don't really see a use for this. It also seems overly aggressive, and there should be some options for what to keep. I see a need to remove JavaScript from HTML but keep events, for example.

kazinator, about 8 years ago
Here is one in C, with flex-generated lexing, for back-end use:

http://www.kylheku.com/cgit/hc/tree/

I used this to allow HTML in mailing list e-mails to be incorporated into the web archive. (The archiver is a modified version of Lurker.)

P.S. "wl" stands for "whitelist": what elements are allowed to pass through, and of those, which attributes are allowed to pass through. The condensed "wl" config file is translated into compiled-in static tables by the wl.txr script. No run-time config.
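
The allowlist idea described here (which elements pass, and which attributes pass on each of them) can be illustrated with a small Python sketch built on the standard library's `html.parser`. This is not a port of the C/flex tool. The `ALLOWED` table, `DROP_WITH_CONTENT`, `AllowlistFilter`, and `wash` are hypothetical names, attribute values are not validated, and `html.parser` is not a spec-compliant HTML5 parser, so the sketch should not be relied on for untrusted input.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical static table: element name -> attributes allowed on it.
ALLOWED = {
    "p": set(),
    "b": set(),
    "i": set(),
    "pre": set(),
    "a": {"href"},
}

# Elements whose text content is dropped along with the tag.
DROP_WITH_CONTENT = {"script", "style"}


class AllowlistFilter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skipping = False

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skipping = True
            return
        if self.skipping or tag not in ALLOWED:
            return  # drop the tag itself; its text content still passes through
        kept = ""
        for name, value in attrs:
            if name in ALLOWED[tag]:
                kept += ' {}="{}"'.format(name, escape(value or "", quote=True))
        self.out.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skipping = False
        elif not self.skipping and tag in ALLOWED:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        if not self.skipping:
            self.out.append(escape(data, quote=False))


def wash(html_text):
    """Return html_text reduced to the elements and attributes in ALLOWED."""
    parser = AllowlistFilter()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

For example, `wash('<p onclick="x()">Hi <a href="/a" style="color:red">link</a><script>alert(1)</script></p>')` returns `<p>Hi <a href="/a">link</a></p>`: the event handler, the `style` attribute, and the whole `script` element are gone. Keeping the table as a module-level constant mirrors the "no run-time config" choice in the comment.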

JustSomeNobody, about 8 years ago
What is the use case for this?

hsivonen, about 8 years ago
This doesn't appear to use a spec-compliant HTML parser as the first step of the processing. Any tool of this nature created in this day and age really should.

redxblood, about 8 years ago
But... why?