A Unix-style personal search engine and web crawler for your digital footprint

342 points by amirGi, nearly 4 years ago

26 comments

MisterTea, nearly 4 years ago
> I've wasted many an hour combing through Google and my search history to look up a good article, blog post, or just something I've seen before.

This is the fault of web browser vendors who have yet to give a damn about bookmarks.

> Apollo is a search engine and web crawler to digest your digital footprint. What this means is that you choose what to put in it. When you come across something that looks interesting, be it an article, blog post, website, whatever, you manually add it (with built in systems to make doing so easy).

So it's a searchable database for bookmarks then.

> The first thing you might notice is that the design is reminiscent of the old digital computer age, back in the Unix days. This is intentional for many reasons. In addition to paying homage to the greats of the past, this design makes me feel like I'm searching through something that is authentically my own. When I search for stuff, I genuinely feel like I'm travelling through the past.

This does not make any sense. It's Unix-like because it feels old? It seems like the author thoroughly misses the point of the Unix philosophy.
simonw, nearly 4 years ago
My version of this is https://dogsheep.github.io/ - the idea is to pull your digital footprint from various different sources (Twitter, Foursquare, GitHub etc) into SQLite database files, then run Datasette on top to explore them.

On top of that I built a search engine called Dogsheep Beta which builds a full-text search index across all of the different sources and lets you search in one place: https://github.com/dogsheep/dogsheep-beta

You can see a live demonstration of that search engine on the Datasette website: https://datasette.io/-/beta?q=dogsheep

The key difference I see with Apollo is that Dogsheep separates fetching of data from search and indexing, and uses SQLite as the storage format. I'm using a YAML configuration to define how the search index should work: https://github.com/simonw/datasette.io/blob/main/templates/dogsheep-beta.yml - it defines SQL queries that can be used to build the index from other tables, plus HTML fragments for how those results should be displayed.
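Editor's sketch: the separation described in this comment (data is fetched into SQLite tables first, then a search index is built from those tables by per-source SQL queries) might look roughly like the following. This is not Dogsheep Beta's actual code or configuration format. The table names, column names, the `INDEX_QUERIES` mapping, and the `build_index` and `search` helpers are all hypothetical, and a plain `LIKE` match stands in for a real full-text index to keep the example dependency-free.

```python
import sqlite3

# Hypothetical configuration: one SQL query per source table. Each query must
# return the columns item_id, text, and created so rows can share one index.
INDEX_QUERIES = {
    "tweets": "SELECT id AS item_id, full_text AS text, created_at AS created FROM tweets",
    "commits": "SELECT sha AS item_id, message AS text, committed_at AS created FROM commits",
}


def build_index(conn, queries):
    """Rebuild a single search_index table from the per-source queries."""
    conn.execute("DROP TABLE IF EXISTS search_index")
    conn.execute(
        "CREATE TABLE search_index (source TEXT, item_id TEXT, text TEXT, created TEXT)"
    )
    for source, sql in queries.items():
        # The source name is bound as a parameter; the query comes from config.
        conn.execute(
            "INSERT INTO search_index (source, item_id, text, created) "
            f"SELECT ?, item_id, text, created FROM ({sql})",
            (source,),
        )
    conn.commit()


def search(conn, term):
    """Return matching rows across all sources, newest first (substring match)."""
    rows = conn.execute(
        "SELECT source, item_id, text FROM search_index "
        "WHERE text LIKE ? ORDER BY created DESC",
        (f"%{term}%",),
    )
    return rows.fetchall()


# Stand-in for the separate "fetch" step: the source tables already hold data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (id TEXT, full_text TEXT, created_at TEXT)")
conn.execute("CREATE TABLE commits (sha TEXT, message TEXT, committed_at TEXT)")
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?, ?)",
    [("t1", "Trying out Datasette today", "2021-07-01"), ("t2", "Lunch", "2021-07-02")],
)
conn.execute(
    "INSERT INTO commits VALUES (?, ?, ?)",
    ("abc123", "Add Datasette plugin hook", "2021-07-03"),
)

build_index(conn, INDEX_QUERIES)
results = search(conn, "datasette")
```

Because the index is derived entirely from the source tables, it can be dropped and rebuilt at any time without re-fetching anything, which is the practical benefit of keeping the two steps apart.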
yunruse, nearly 4 years ago
I love this idea, but the name “digital footprint” sort of implies it’s what effect you’ve had on the Internet for helping keep your online persona under control: your tweets, comments, emails, et cetera.

But this is a great idea! Having a search engine for vaguely _anything_ you touch very much does look like it’d increase the signal:noise ratio. It’d be interesting to be able to add whole sites (using, say, DuckDuckGo as an external crawler) to be able to fetch general ideas, such as, say, “Stack Exchange posts marked with these tags”.
SahAssar, nearly 4 years ago
Looks very much like one of the ideas I've been thinking of building! The way I planned to do it was to use a similar approach to rga for files ( https://github.com/phiresky/ripgrep-all ) and having a webextension to pull all webpages I visit (filtered via something like https://github.com/mozilla/readability ), dump that into either sqlite with FTS5 or postgres with FTS for search.

A good search engine for "my stuff" and "stuff I've seen before" is not available for most people in my experience. Pinboard and similar sites fill some of that role, but only for things that you bookmark (and I'm not sure they do full-text search of the documents).

---

Two things I'd mention are:

1. Digital footprint usually means your info on other sites, not just things I've accessed. If I read a blog that is not part of my footprint, but if I leave a comment on that blog that comment is part of it. The term is also mostly used in a tracking and negative context (although there are exceptions), so you might want to change that: https://en.wikipedia.org/wiki/Digital_footprint

2. I don't really get what makes it UNIX-style (or what exactly you mean by that? There seem to be many definitions), and the readme does not seem to clarify much besides expecting me to notice it by myself.
wydfre, nearly 4 years ago
It seems pretty cool - but I think falcon[0] is more practical. You can install it from the chrome extension store[1], if you are too lazy to get it running yourself.

[0]: https://github.com/lengstrom/falcon

[1]: https://chrome.google.com/webstore/detail/falcon/mmifbbohghecjloeklpbinkjpbplfalb?hl=en
etherio, nearly 4 years ago
This is cool! Similar to one of the goals I'm trying to accomplish with Archivy (https://archivy.github.io) with the broader goal of not just storing your digital presence but also acting as a personal knowledge base.
smusamashah, nearly 4 years ago
This sounds similar to Monocle: https://github.com/thesephist/monocle

Demo: https://monocle.surge.sh/

Blog post explaining motivation: https://thesephist.com/posts/monocle/
ryanfox, nearly 4 years ago
I run a similar project: https://apse.io

It runs locally on your laptop/desktop, so you don’t need a server to host anything.

Also, it can index *everything* you do, not just web content.

It works really well for me!
rhn_mk1, nearly 4 years ago
This seems similar to recoll augmented with recoll-we.

https://addons.mozilla.org/en-US/firefox/addon/recoll-we/
jll29, nearly 4 years ago
Microsoft Research's Dr. Susan Dumais is the expert on this kind of personal information management.

Her landmark system (and associated seminal SIGIR'03 paper) "Stuff I've Seen" tackled re-finding material: http://susandumais.com/UMAP2009-DumaisKeynote_Share.pdf
encryptluks2, nearly 4 years ago
Why do all these bookmark projects:

1. Rely on JavaScript for the interface. Being built in Go, why not just paginate the results and utilize Bleve or Xapian for search?

2. Store data in a format that is not easily readable by itself. The only exception to this is nb.

3. Suck at CLI tools. I'm looking to rclone, Hugo, kubectl, etc for the right way to build a CLI.
Minor49er, nearly 4 years ago
This looks really cool. It's beyond the scope of this project, but I think that having something like this as a browser extension would make it easier to use: instead of manually copying and scraping links, it could index and save pages that you've been on, placing much more significance on anything that you've bookmarked. Granted, this is just an immediate thought. I'm going to give this a proper try once I have some more spare time.
ThinkBeat, nearly 4 years ago
I use Evernote for this.

You can set it to save a link, a screenshot, or the content of the page. You can add tags if you want, and it is also easy to annotate it so you can remember the context better. You can also add links to other posts inside Evernote.

Pocket is also a great tool I used for many years. Quite similar and different.

Both have browser extensions, so it is easy to clip.

With Evernote I even have shortcuts defined so I don't have to click for the webpage to be clipped.
cratermoon, nearly 4 years ago
Interesting project but some of what the author writes just sounds flat-out weird. "The first thing you might notice is that the design is reminiscent of the old digital computer age, back in the Unix days."

"Apollo's client side is written in Poseidon."

I had to look that up: Poseidon is not a language, it's just a JavaScript framework for event-driven DOM updates.
kordlessagain, nearly 4 years ago
Cool! It’s great to see others thinking about this. I’ve been working on https://mitta.us for a while now and it uses Solr, a headless browser and Google Vision to snapshot and index full text. The UI is a bit odd but you can just append mitta.us/ to any URL to save it.
zerop, nearly 4 years ago
How is it different from Instapaper-like services? There is also an open source alternative to Instapaper called wallabag.
dpcx, nearly 4 years ago
Similar also to Promnesia (https://github.com/karlicoss/promnesia), which includes a browser extension to search the records.
soheil, nearly 4 years ago
There is something really strange about a lot of recent Go projects including this one. I can't put my finger on it, but the combination of the author and the type of problem they choose to tackle oftentimes seems baffling to me. Most projects seem to be solving a problem that is often misidentified or otherwise badly solved, but somehow the focus ends up being on the code architecture or the UI design. It's like they're trying to solve a problem just for the sake of writing some code and the correct way to use Go idiomatically or something and don't really care about the problem or how well the solution actually works.
toomanyducks, nearly 4 years ago
If nothing else, that README is fantastic!
pantulis, nearly 4 years ago
Reminds me a lot of DEVONthink for Mac
alanh, nearly 4 years ago
A code comment in the readme describes the Record as constituting an 'interverted index'. Typo for inverted? Although it is not obvious to me what would make this an inverted index instead of a normal index.
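Editor's sketch: for readers unfamiliar with the term, an inverted index maps each token to the set of records that contain it, so a query only touches the entries for its own words instead of scanning every record. The example below is a generic illustration, not Apollo's implementation; the `tokenize`, `build_inverted_index`, and `search` helpers and the sample records are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(records):
    """Map each token to the set of record IDs whose text contains it."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for token in tokenize(text):
            index[token].add(record_id)
    return index


def search(index, query):
    """Return the IDs of records containing every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    # Copy the first posting set so intersecting does not mutate the index.
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


records = {
    1: "Unix philosophy and small tools",
    2: "A personal search engine written in Go",
    3: "Search your bookmarks with small Unix tools",
}
index = build_inverted_index(records)
matches = search(index, "unix tools")
```

A forward ("normal") index would instead go from record ID to its tokens, which is the `records` mapping here; the inverted form trades extra storage for fast lookup by word.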
totetsu, nearly 4 years ago
There used to be an activity timeline journal program I ran on Ubuntu that let me see which days I accessed which files. It was very useful as a student.
anthk, nearly 4 years ago
Is this like recoll, Hyper Estraier and so on?
ctocoder, nearly 4 years ago
Wrote something of the same ilk but got distracted: https://github.com/dathan/go-find-hexagonal
dandanua, nearly 4 years ago
A similar tool: https://github.com/go-shiori/shiori
soheil, nearly 4 years ago
Has the author tried pressing CMD+Y to view and search browser history?