Building personal search infrastructure for your knowledge and code

661 points by october_sky over 5 years ago

38 comments

djhworld over 5 years ago

I've given up trying to find The One True Note Taking Tool, so I've ended up writing my own thing that I tinker with now and again to tune it to exactly what I need.

It's essentially a simple web server that sits on top of a bunch of markdown files.

The frontend renders the markdown using markdown-it and supports KaTeX for simple inline mathy things, along with extended markdown features like tables. I've even made it so that you can drag and drop files (including images) into the edit box; it uploads them to the server and inserts the correct markdown syntax so they render when you view the note.

Alongside the files, the data is also stored in a SQLite database file with some metadata, and I'm using the full-text search (FTS5) engine to support search, which seems to work OK.

If the database gets corrupted it can just be rebuilt; it's really just there to augment the notes. If I stop developing it or want to move on, the notes are there as text files.

It works well enough in a mobile browser, although admittedly a bit rubbish if you need offline access.

Works well enough for me. I might open source it one day but I think I'd need to clean up the code a bit first :)

EDIT: the core of the tool was mostly inspired by this article: https://golang.org/doc/articles/wiki/
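
For readers who want to try the "markdown files plus a rebuildable FTS5 index" idea, here is a minimal sketch in Python. It is not djhworld's code (the tool is unpublished and was inspired by a Go article); the table name `notes_fts` and the functions `rebuild_index` and `search` are made up for illustration, and it assumes your Python's SQLite build includes the FTS5 extension.

```python
import sqlite3
from pathlib import Path
from typing import List


def rebuild_index(notes_dir: Path, db_path: str = ":memory:") -> sqlite3.Connection:
    """Drop and recreate the full-text index from the markdown files on disk."""
    conn = sqlite3.connect(db_path)
    conn.execute("DROP TABLE IF EXISTS notes_fts")
    # Hypothetical schema: 'path' is stored but not tokenized, 'body' is searchable.
    conn.execute("CREATE VIRTUAL TABLE notes_fts USING fts5(path UNINDEXED, body)")
    for md_file in sorted(notes_dir.rglob("*.md")):
        conn.execute(
            "INSERT INTO notes_fts (path, body) VALUES (?, ?)",
            (str(md_file.relative_to(notes_dir)), md_file.read_text(encoding="utf-8")),
        )
    conn.commit()
    return conn


def search(conn: sqlite3.Connection, query: str) -> List[str]:
    """Return the paths of matching notes, best match first."""
    rows = conn.execute(
        "SELECT path FROM notes_fts WHERE notes_fts MATCH ? ORDER BY rank",
        (query,),
    )
    return [path for (path,) in rows]
```

Because the files stay the source of truth, a corrupted database only costs one call to `rebuild_index`, which matches the property described in the comment.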

sqs over 5 years ago

Sourcegraph CEO here. I see the doc mentions Sourcegraph for code search (cool!). Something like ripgrep is indeed better for your case: a single person who just needs to search code in local directories on their own machine. I made a PR for our docs at https://github.com/sourcegraph/sourcegraph/pull/8075 that should clarify this.

Sourcegraph is a web-based code search tool that automatically syncs and indexes many repositories from your organization's code host(s). It's intended for every developer at an organization to use for searching across all of the organization's code (and for navigating/cross-referencing with code intelligence). It's self-hosted, and usually there is one Sourcegraph instance per organization. If you love local+personal code search, I bet you and your teammates would love organization-wide code search, so give Sourcegraph a try (https://docs.sourcegraph.com/#quickstart). :)

ssivark over 5 years ago

Meta-observation: this topic seems to have been getting a lot of attention on HN over the last few months, indicating massive interest. Further, looking at the landscape of developments in this space (past all the me-too Markdown note-taking apps): Evernote seems to have a fading presence, Notion seems to be a (too?) well-funded behemoth startup, Roam is trying some exciting things, and Tiago Forte is putting together some interesting things under the BASB banner. (Any others? Oh, btw, there's also Perkeep.)

It's amazing how long Emacs' Org-mode has been largely unparalleled! Apart from the revered desktop setup, there are now a bunch of mobile offerings including Organice: not quite slick, but definitely useful.

I'm sincerely rooting for more experiments in this area. I would love to be able to write by hand or speak to my memex (multi-modal interaction). Vannevar Bush's "As We May Think" has languished uncourted for pitifully long. In some ways, this was supposed to be the first "killer app" for personal computing.

klft over 5 years ago

(1) For note taking, I stumbled across anno[1] via [2] two weeks ago. It's a Python Flask application that you run on localhost. You write markdown, which gets stored locally as a file and is rendered as HTML using pandoc[3]. It's really basic but I love it.

(2) For physical documents I use a Fujitsu ScanSnap iX500[4] for scanning. A runtime license of ABBYY FineReader for OCR is included. The resulting PDF has embedded text, which I extract using pdftotext[5]. I wrote a Python application to search and tag these documents. It loads all the text into memory, which is perfectly fine as I have < 10,000 documents. I've been using it for 5 years and it works OK.

[1] https://github.com/gwgundersen/anno
[2] https://news.ycombinator.com/item?id=22033792
[3] https://pandoc.org/
[4] https://www.fujitsu.com/global/products/computing/peripheral/scanners/scansnap/ix500/
[5] https://en.wikipedia.org/wiki/Pdftotext
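
The in-memory search-and-tag approach in (2) can be sketched in a few lines of Python. This is only an illustration of the idea, not klft's application: the `Document` and `InMemoryIndex` names are invented here, and the text is assumed to have already been extracted (for example with pdftotext) before being added.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Document:
    name: str
    text: str  # assumed to be already-extracted text for one scanned PDF
    tags: Set[str] = field(default_factory=set)


class InMemoryIndex:
    """Keep every document's text in RAM and scan it linearly on each query."""

    def __init__(self) -> None:
        self.docs: Dict[str, Document] = {}

    def add(self, name: str, text: str) -> None:
        self.docs[name] = Document(name, text)

    def tag(self, name: str, *tags: str) -> None:
        self.docs[name].tags.update(tags)

    def search(self, *words: str, tag: str = "") -> List[str]:
        """Names of documents containing all words (case-insensitive), optionally limited to one tag."""
        needles = [w.lower() for w in words]
        hits = []
        for doc in self.docs.values():
            if tag and tag not in doc.tags:
                continue
            haystack = doc.text.lower()
            if all(n in haystack for n in needles):
                hits.append(doc.name)
        return sorted(hits)
```

A linear scan like this is a deliberate simplification; at under 10,000 documents it is likely fast enough that an inverted index is not needed.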

stillwater56 over 5 years ago

Does anyone else find that the simple act of writing notes helps them remember and process better? I spent forever trying to find an ideal note-taking solution, but now I just write things in a single notebook. I rarely review my notes, but I find that simply writing thoughts down consistently has improved my memory and understanding of new concepts.

lcall over 5 years ago

I wrote and use daily http://onemodel.org (AGPL, uses Postgres), for many reasons listed there :). One way to think of its current state is as a text-mode, easy-to-learn (I hope) infinite mind map of things, where I store and can query effectively everything: calendar, reminders, quasi-Anki-like knowledge review, journal, automatic activity log, and notes on subjects, all very efficiently for the user. (It also stores documents, but that is not very smooth compared to other document systems, nor is browser integration smooth at all.)

Edit: It also has a very basic security model (private, public, unspecified), and with that in mind, can export trees of notes as HTML or as outline documents (text), with or without indentation and numbering, which I've found very useful. And anything can be in as many places in the tree as is helpful. I use the export to simple HTML to generate my 2 web sites.

(I plan to move it to Rust, and maybe SQLite, eventually, as well as add features like Anki, internal code attached to entity classes for cheap internal customization/automation, etc., but have been slow lately.)

(Edit: it is currently only self-hosted by each user. I have considered doing hosting for other users, and might some day.)

gricardo99 over 5 years ago

A great time saver for me was simply setting up better bash history and search capabilities[1].

I wrote a wrapper function, sbh (search bash history), that allows me to input date strings like "2 months ago" or "last week", which narrows the search. The Linux 'date' command with the --date string argument is pretty powerful[2].

1 - https://spin.atomicobject.com/2016/05/28/log-bash-history/
2 - https://www.thegeekstuff.com/2013/05/date-command-examples/
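
The date-narrowed history search can be approximated in Python as below. This is a sketch, not gricardo99's sbh function (which is a bash wrapper around 'date --date'). It assumes a hypothetical log format where each line starts with a `YYYY-MM-DD.HH:MM:SS` timestamp followed by a space and the command, and `parse_since` is a deliberately tiny stand-in that understands only a few phrases, with months approximated as 30 days.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, List

UNIT_DAYS = {"day": 1, "week": 7, "month": 30}  # months approximated as 30 days


def parse_since(expr: str, now: datetime) -> datetime:
    """Turn '3 days ago', '2 months ago' or 'last week' into a cutoff datetime."""
    expr = expr.strip().lower()
    m = re.fullmatch(r"(\d+) (day|week|month)s? ago", expr)
    if m:
        return now - timedelta(days=int(m.group(1)) * UNIT_DAYS[m.group(2)])
    m = re.fullmatch(r"last (week|month)", expr)
    if m:
        return now - timedelta(days=UNIT_DAYS[m.group(1)])
    raise ValueError(f"unsupported date expression: {expr!r}")


def search_history(lines: Iterable[str], pattern: str, since: str, now: datetime) -> List[str]:
    """Return commands matching `pattern` that were logged at or after the cutoff."""
    cutoff = parse_since(since, now)
    hits = []
    for line in lines:
        # Assumed line format: "<timestamp> <command>"
        stamp, _, command = line.rstrip("\n").partition(" ")
        try:
            when = datetime.strptime(stamp, "%Y-%m-%d.%H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if when >= cutoff and re.search(pattern, command):
            hits.append(command)
    return hits
```

Passing `now` explicitly keeps the function easy to test; a real wrapper would default it to `datetime.now()`.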

dchichkov over 5 years ago

This reminds me of something similar: the CEO of Wolfram developed a nice way of record keeping: https://writings.stephenwolfram.com/2019/02/seeking-the-productive-life-some-details-of-my-personal-infrastructure/

By the way, is there, by chance, a "note taking/indexing tool from photo"? I'd like to be able to take a photo of the title/abstract of a computer science paper with my phone, and then be able to find it by approximate date and keywords. (I use Android. Seems like something relatively easy to hack, actually, on top of Google Photos.)

napoleond over 5 years ago

I've been thinking a lot about how I manage my own data lately (notes, photos, code, reference material, etc.) and have concluded that the primary feature I'm looking for is longevity. I'm saddened by the amount of data I've lost over the years, either because of hard disk failures or third-party services going out of business/making it difficult to extract things/getting too expensive.

In light of this, I'm biasing toward simple file formats managed by tools I write myself, and optimizing for cost in a way that I otherwise don't, since any recurring costs incurred by the system are effectively a lifelong commitment. I am relying on S3 for primary storage (so that it is accessible anywhere) but with a sync to offline backup.

So far, I've implemented a personal Zettelkasten tool (with built-in spaced repetition, so it doubles as an Anki replacement) and a search engine that's based on Presto (via AWS Athena) so that I don't need to keep an Elasticsearch instance alive. I'm planning to build out other repository tools as I go.

It's been very liberating to build tools that are never meant to be used by anyone other than myself, and with the confidence that the tools don't matter too much anyway since the underlying files are stored in evergreen formats.

spdustin over 5 years ago

I'd really like a personal "correlate all the things!" setup that has a plugin architecture for any source and creates a time series and document-based store of whatever I want. Tweets, e-mails, text messages, time tracking, etc.

There are lots of tools that do the individual moving parts, but a personal aggregator of everything would be interesting. Basically, a tool that lets you become your own personal data broker, just for your own personal data.
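
One way to picture the plugin idea is a registry of per-source callables whose events get merged into a single timeline. The following Python sketch is purely hypothetical (no such tool is named in the comment); `Event`, `Plugin`, and `Aggregator` are illustrative names, and real plugins would call out to Twitter, mail, and so on instead of returning canned data.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Event:
    when: datetime
    source: str
    text: str


# A "plugin" here is just a callable that yields Events from one source.
Plugin = Callable[[], Iterable[Event]]


class Aggregator:
    def __init__(self) -> None:
        self._plugins: Dict[str, Plugin] = {}

    def register(self, name: str, plugin: Plugin) -> None:
        self._plugins[name] = plugin

    def timeline(self) -> List[Event]:
        """Pull events from every registered plugin and merge them by time."""
        events = [e for plugin in self._plugins.values() for e in plugin()]
        return sorted(events, key=lambda e: e.when)
```

Keeping the plugin contract this small (a function returning events) is one possible design choice; it makes adding a new source cheap, at the cost of pushing authentication and pagination into each plugin.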

user00012-ab over 5 years ago

My problem with a lot of the services listed below is that they all eventually go away, and all your data is off somewhere else. Unless you store your data locally in a human-readable format (markdown), you are just putting all your data into a system that WILL go away at some point in the future.

Google has already had 2-3 services to manage your data that they have closed down. Maybe they are the ones that taught me not to trust anything on the web with your data.

Even something like Evernote is iffy; they seem like they are constantly on the verge of shutting down.

I do find it sad, though, that the human race as a whole puts so little value into this type of software, and so much value into sports and politics.

ketzo over 5 years ago

It's been mentioned a few times in these comments, but I want to add a +1 for Roam[1]. It's a note-taking/personal knowledge tool that's very, very different from anything I've seen before -- the closest thing I can compare it to is Wikipedia. It's still in beta with some rough edges, but VERY worth checking out.

[1] roamresearch.com

Fiveplus over 5 years ago

> all digital trace I'm leaving (tweets, internet comments, annotations)

I would be open to the idea of a tool which combines the entirety of my digital presence at any point in time in a single platform. Kinda like a dynamic list which updates itself every time a linked account makes a comment, 'likes' a post, or performs any activity that may link back to me.

wtracy over 5 years ago

This gave me a harebrained idea for a browser extension that drops every web page you visit into a private Lucene database.

jmakov over 5 years ago

No mentions of https://tiddlywiki.com/?

capableweb over 5 years ago

Everything I write about (journal + other things, task lists and what not) is written in plain markdown files currently (about to move it to TiddlyWiki, one of these days...) and to get search, I just use `the-silver-searcher` which searches the entire directory of my files. Simple and scalable (got around 9k documents by now)

insomniacity over 5 years ago

My eternal frustration in this space is that my employer has strict firewalls, web filtering and data-loss prevention software, and remote access is over Citrix with no copy-paste. Consequently, if I build a knowledge base, it is stuck inside the firewall. Equally, if I build it outside, I can't use it at work.

karlicoss over 5 years ago

Hey, author here. Happy to answer any questions!

porker over 5 years ago

> Ideally I want to be able to do fulltext realtime search over anything that I ever had in my visual field. Not even necessarily text, but audio and video as well.

Where I find all these systems break down is recall. They're designed for someone who can recall a word or phrase that was in the content. I can usually recall "It was about X" or "The document/web page/image looked like Y". But an actual word? The author's name? Not a chance.

While it's a more difficult problem, if the tool is to live up to the "Future" section of this page, it's got to go a long way beyond what's in the source data, to what's thought of by the user.

albertzeyer over 5 years ago

This topic comes up again and again. I collected some notes about this here: https://github.com/albertz/wiki/blob/master/personal-knowledge-base.md

E.g. one piece of software I started to use is nvALT, via: https://www.macstories.net/links/organizing-everything-with-plain-text-notes/

But I'm nowhere near a perfect and complete solution yet...

ajphdiv over 5 years ago

I self-host a Confluence server. All my content is available to me offline. Might be a bit overkill, but I have knowledge bases for all my work. If there is a web page I come across, I can just copy/paste the content into a new post. Everything is searchable. It really is great. They offer a starter license, which is $10 per year: https://www.atlassian.com/licensing/starter

tomerbd over 5 years ago

I have fewer notes after being fed up with notes. It's really time consuming to manage notes, so I manage logs. I just log everything I do, each task in its own new page. It's append only.

For notes which I mutate, I just keep a personal web site, and I try to keep it as a cheatsheet and as compact as possible so I don't need to manage it.

So: an append-only log in Quip, with a new folder for each task. Mutable cheatsheet, super compact pages on a personal website.

Oh, and for quick snippets, Alfred.

That's it.

glinkot over 5 years ago

I use a few things for this (on Windows):

- For notes, OneNote, though I'm always on the lookout for an alternative with a decent UI and syncing but using open file formats. Full text search is simple enough with this. Code formatting isn't good, but there's an add-in whose free version keeps the formatting as it was copied.
- To search local files, Voidtools Everything is great. Searching instantly by filename is a real time saver.
- If I want full text search of a large base of documents, I use Likasoft Archivarius, which cost me $30 about 10 years ago and is still handy. It's the only local desktop search I've found that supports full text indexing of tons of formats like Outlook .ost, etc. and can look inside archive files.
- For backups I've continued to stick with external drives, mirrored periodically with FreeFileSync. 3 copies: one as master, two mirrors, ensuring one is offsite.

flaque over 5 years ago

If you're into this sort of thing, you might want to check out Roamresearch: https://roamresearch.com/

dapithor over 5 years ago

I wish things like https://piggydb.net/ had more momentum or competitors... personal knowledge databases seem to be such a tough niche to tackle.

Edit: since this may be a new project to some here, more details from a few years back: http://www.linux-magazine.com/Issues/2014/160/Workspace-Piggydb

jslakro over 5 years ago

We could fill a whole internet with each personal method for storing, classifying and accessing. We're missing an OS for our own memory.

jefurii over 5 years ago

I wish there was a method for printing QR codes or URLs on paper that would be the reverse of scanning a QR code. This would make it easy to write complex URLs in your paper diary/techo/commonplace book/notebook.

andreygrehov over 5 years ago

I keep my knowledge in a private Git repo managed by https://www.gitbook.com/. So far it works out great. Going to make it public soon.

ziyadb over 5 years ago

The holy grail [https://beepb00p.xyz/pkm-search.html#future] of this really resonated with me and fully mirrors what I've been thinking about over the past few months. In my observations, it's input capture, information organization, and subsequent retrieval:

Information Capture:

Input Capture - You're going to have all-encompassing tracking and recording of all activity, but want configurable privacy on the extent to which you want your daily conversations, and observations of external things you encounter and are exposed to, recorded. Capturing input needs to be holistic and incorporate all properties of encounters and new information.

Potential sources of input:

- Vision: point of view recording; see Snapchat Spectacles, etc. as primitive examples.
- Audio (voice notes and multi-party conversations): voice calls, video, etc. and other forms of audio transmission where there is more than a single party in the interaction.
- Digital interactions: you will need to keep track of web pages you visit and at what times, conversations you see on Twitter, etc.

Properties and cues must be extrapolated from the information that is captured on input. In the case of audio, transcriptions are sufficient for retrieval purposes; however, since video is a visual medium, it includes significantly more properties that need to be accounted for.

The aim here is to identify sufficient data points (cues) and represent them in such a way that it is easy to search across things you have encountered but of which you only recall a certain property or cue. This is because human beings tend to remember things in fragments; for instance, you might remember a certain color on a page that you visited within the last 6 months and nothing else.

So long as you are capturing sufficient input and actions, you should be able to go back to any given point in time.

How and where are you going to store this information? Storing everything is going to be a large amount of data. The essence of the information and context must be preserved. If you want to wind back to an arbitrary position in time with the original context intact, you want to retain as much as you can in the most efficient manner possible, so determining which data points to retain is essential. (Once the content structure has been figured out, this will be viable.)

Examples of Primary Cues:

- Time: humans generally keep track of things in a linear time-based fashion.
- Color: invokes emotion and is memorable.
- Physical Location: the efficiency of information retrieval is highly influenced by the location at which it is originally synthesized, encountered, and stored.
- Keywords: the default conventional mode. Can and should be extracted from video/imagery and audio.
- Imagery: search for images based on their contents and ambience.

Potential Secondary Cue: Music. See historical associated input and actions while certain music was played. (What else?)

Meta Cues: Subjects. Automated tagging of keywords/encountered content.

Any combination of these queries is possible, but ultimately the killer feature is the ability to backtrack through time to find a certain piece of information that is made available thanks to the always-on recorded nature of your interactions with the physical and digital worlds combined.

Knowing what to store and how, plus how to display it, needs to be worked on further.
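
To make "any combination of these queries" concrete, here is a small Python sketch of filtering captured moments by whichever cues the user happens to remember. It is only an illustration under heavy assumptions: the `Capture` record and `recall` function are hypothetical, cues are assumed to have been extracted already, and storage is a plain in-memory list rather than anything that would scale to always-on recording.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class Capture:
    """One recorded moment, reduced to a few of the primary cues."""
    when: datetime
    location: str
    summary: str
    colors: FrozenSet[str] = frozenset()
    keywords: FrozenSet[str] = frozenset()


def recall(
    captures: Iterable[Capture],
    *,
    after: Optional[datetime] = None,
    before: Optional[datetime] = None,
    location: Optional[str] = None,
    color: Optional[str] = None,
    keyword: Optional[str] = None,
) -> List[Capture]:
    """Return captures matching every cue that was supplied, oldest first."""
    hits = []
    for c in captures:
        if after is not None and c.when < after:
            continue
        if before is not None and c.when > before:
            continue
        if location is not None and c.location != location:
            continue
        if color is not None and color not in c.colors:
            continue
        if keyword is not None and keyword not in c.keywords:
            continue
        hits.append(c)
    return sorted(hits, key=lambda c: c.when)
```

The "orange page I saw sometime in the last 6 months" example would then be a call with only `after` and `color` set, leaving every other cue unspecified.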

maurits over 5 years ago

I've been pondering building something like this for a while. For now, I've settled on Sphinx because it can be easily exported to Dash and tied into an Alfred workflow for search.

Unsimplified over 5 years ago

Tried the custom webapp and DB solution for a while. Wasn't publicly portable enough (for others to copy-paste/export easily). Currently using markdown files in git repos.

hvasilev over 5 years ago

I use a vim plugin called vimwiki and I export my todos and notes into HTML. Works fine for me.

JabavuAdams over 5 years ago

I basically live in Evernote. Will gradually transition to personal tooling.

rawoke083600 over 5 years ago

Most stuff (links, photos, docs, etc) I just email it to myself

voltagex_ over 5 years ago

Is there anything for people who don't use Vim?

jacquesm over 5 years ago

A search infrastructure for my knowledge would require access to wetware. Code I can see working.

marv3lls over 5 years ago

Ya lost me at $(emacs)!

chimichangga over 5 years ago

I just email links, code, docs, etc. to myself with descriptive subjects and tags.