Ask HN: How to organize archived webpages locally?

7 points by linuxfan2718, almost 2 years ago
I've been going through hundreds of bookmarks I made over the years, all carefully tagged and organized, but a lot of the pages have been taken down. I want to start archiving them locally, probably using Firefox's "Save Page As..." feature. Do people here do this? If so, how do you organize and tag them? Folders aren't perfect because some pages deserve multiple tags.

7 comments

networked, almost 2 years ago
Check out https://gwern.net/archiving.

Since your bookmarks are already tagged, perhaps you don't need to tag the files? In some ways, it may be convenient, but at the cost of duplicating the information. As long as you can map a bookmarked URL to a file path or paths, you can find archived copies through your bookmarks.

Here is what I do for external URLs on my personal website. It is inspired by Gwern's approach. A major difference is that he doesn't nest directories; he uses ${domain}/${url-checksum}.ext.

I translate the URL to a file path in my link-archive directory by applying the function dest-dir from the Tcl code below. In the directory, I save whatever is at the URL with a name based on its checksum (b2sum -l 32), so I can have multiple archived copies of the same URL. I use https://github.com/gildas-lormeau/single-file-cli to save the URL. I determine the destination file extension from the MIME type.

This gives you paths like link-archive/365tomorrows.com/2005/10/23/postcard/e5445dff.html for https://365tomorrows.com/2005/10/23/postcard/.

```tcl
proc slug s {
    set s [string tolower $s]
    regsub -all {[^A-Za-z0-9\.\_\~\-]+} $s - s
    string trim $s -
}

proc dest-dir link {
    set slugs [lmap part [file split [regsub {#[^!].*$} $link {}]] {
        set x [string range [slug [regsub {^./} $part {}]] 0 127]
        regsub {^~} $x {./~}
    }]
    # Drop the protocol.
    file join {*}[lrange $slugs 1 end]
}
```
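
For readers who don't use Tcl, here is a rough Python sketch of the same URL-to-path mapping. It is a port under assumptions, not networked's actual tooling: the names `slug`, `dest_dir`, and `archive_path` are chosen for illustration, the Tcl-specific `./~` tilde handling is omitted because Python's path classes do not expand `~`, and the file name assumes that `b2sum -l 32` corresponds to a 4-byte BLAKE2b digest of the saved content.

```python
import hashlib
import re
from pathlib import PurePosixPath


def slug(s: str) -> str:
    """Lowercase s and collapse runs of unsafe characters into a single '-'."""
    s = re.sub(r"[^a-z0-9._~-]+", "-", s.lower())
    return s.strip("-")


def dest_dir(link: str) -> PurePosixPath:
    """Map a URL to a nested relative directory, dropping the protocol."""
    # Remove a trailing #fragment, but keep "#!" style fragments.
    link = re.sub(r"#[^!].*$", "", link, count=1)
    parts = [p for p in link.split("/") if p]
    # Cap each component at 128 characters, as the Tcl code does.
    slugs = [slug(p)[:128] for p in parts]
    return PurePosixPath(*slugs[1:])  # slugs[0] is the protocol


def archive_path(root: str, link: str, content: bytes, ext: str) -> PurePosixPath:
    """Build root/<dest_dir>/<checksum><ext> for one saved copy of a URL."""
    # Assumption: a 32-bit BLAKE2b digest of the content, as 8 hex characters.
    checksum = hashlib.blake2b(content, digest_size=4).hexdigest()
    return PurePosixPath(root) / dest_dir(link) / f"{checksum}{ext}"


print(dest_dir("https://365tomorrows.com/2005/10/23/postcard/"))
# prints 365tomorrows.com/2005/10/23/postcard
```

Because the file name comes from the content rather than the URL, saving the same URL again after the page changes yields a second file in the same directory instead of overwriting the first.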

DantesKite, almost 2 years ago
You should try OpenAI embeddings. They're fairly cheap to run over a large amount of text (should cost <$10 if you have thousands of documents, I believe, but correct me if I'm wrong).

Then you can run searches for content even if the exact words aren't the same.

Like let's say you have a document titled "Measuring canine tooth caries over 2004-2020" and it never once mentions the word "dog".

If you type in "dog" after doing the embeddings, it'll suggest that specific document because "canine" and "dog" are closely related.

It's a great way to organize large groups of texts, there are plenty of YouTube videos on how to do it, and best of all, you don't have to spend time manually organizing everything. You just let the machine model do it for you.

You could even get it to auto-tag your documents based on what it thinks is the best category for the document and make it easier for you to parse that way as well.
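
To make the search step concrete, here is a minimal sketch of ranking documents by cosine similarity once each one has an embedding vector. It does not call any embedding API: the three-dimensional vectors and the file names below are made up for illustration, and real embeddings would come from whichever model you choose and have far more dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query_vec, doc_vecs, top_k=3):
    """Return the names of the top_k documents most similar to the query."""
    scored = [(cosine(query_vec, vec), name) for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


# Hypothetical toy vectors standing in for real embeddings.
docs = {
    "canine-tooth-caries.html": [0.9, 0.1, 0.0],
    "rust-borrow-checker.html": [0.0, 0.2, 0.9],
    "sourdough-starter.html": [0.1, 0.9, 0.1],
}
query_dog = [0.8, 0.2, 0.1]  # pretend this is the embedding of "dog"

print(search(query_dog, docs, top_k=1))
# prints ['canine-tooth-caries.html']
```

The query never has to share a word with the document; it only has to land near it in the vector space, which is the property the "canine" versus "dog" example relies on.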

thriller, almost 2 years ago
I use an extension called SingleFile, and have it save EVERY page I visit. It saves every page locally with a timestamp at the beginning of the filename followed by the page title. Normally, I can find what I'm looking for using search, so no need for tags.
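
As a rough illustration of why a timestamp-plus-title name makes tags unnecessary, here is a small sketch. The exact name format SingleFile produces is not given in the thread, so the pattern below and the helpers `make_filename` and `find` are assumptions for illustration only.

```python
import re
from datetime import datetime


def make_filename(title: str, saved_at: datetime) -> str:
    """Build '<timestamp> <title>.html', replacing characters unsafe in file names."""
    stamp = saved_at.strftime("%Y-%m-%dT%H-%M-%S")
    safe_title = re.sub(r'[\\/:*?"<>|\s]+', " ", title).strip()
    return f"{stamp} {safe_title}.html"


def find(filenames, query: str):
    """Return names containing every word of the query, oldest first."""
    words = query.lower().split()
    # Sorting by name is chronological because the timestamp comes first.
    return sorted(n for n in filenames if all(w in n.lower() for w in words))


name = make_filename(
    "Ask HN: How to organize archived webpages locally?",
    datetime(2023, 5, 1, 12, 30, 0),
)
print(name)
# prints 2023-05-01T12-30-00 Ask HN How to organize archived webpages locally.html
```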

epirogov, almost 2 years ago
It is better to store only the text; in most cases layout and images don't matter. Saving as PDF makes documents hard to search, because Chrome renders everything on the page as an SVG image. You can use online converters to get a well-formatted PDF with selectable text: https://products.aspose.app/pdf/webpage-to-pdf. When I organized my collection, after saving it to DVDs I also created a text file with the names and disc numbers.
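
If you go the text-only route, a saved HTML page can be reduced to plain text with Python's standard library alone. This is a minimal sketch, not a full readability extractor: the `TextExtractor` class and `html_to_text` function are names made up here, and it only drops `script`, `style`, and `noscript` content.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script, style, and noscript content."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._chunks)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.text()


page = "<html><head><style>p{}</style></head><body><h1>Title</h1><p>Hello &amp; bye</p></body></html>"
print(html_to_text(page))
```

The resulting text files are small and can be searched with ordinary tools such as grep, which is the main advantage over image-like PDFs.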

decide1000, almost 2 years ago
I use Mozilla's Pocket for this. The paid version stores a copy for you. Getpocket.com

hamsterbase, almost 2 years ago
You can try hamsterbase.com.

This tool will index all of your HTML. It supports highlighting and full-text search.

rmdes, almost 2 years ago
Wallabag