Taxonomy Is Hard

132 points by crecker over 2 years ago

31 comments

Tagging and categorizing are two subtly different things to do. Having dealt with a lot of real-world data, all I can say is that getting your hands on consistently tagged or categorized data is hard and gets harder the more data sources you have.

A real-world example of how tagging can be both super useful and get out of hand is OpenStreetMap. The only metadata allowed in there are tags. The OSM community depends on people using tags correctly. But of course the tagging is incomplete, inconsistent, and subject to regional and data-source-specific variations, which makes interpreting the tags a bit of a dark art.

But it still adds up to a very complete and rapidly evolving world map. So, there's that.

I've had the pleasure of seeing the documentation for Navteq's internal metadata schema that they used for their maps while I still worked for Nokia Maps, which at the time owned Navteq. These days the whole thing is known as Here Maps. This was a PDF of around 4K pages. Thousands of attributes. Lots of weird little details related to traffic lights, subway entrances and exits, and other features you have on maps. This stuff gets complicated quickly.

Two very different approaches to the same problem. I think I like the OSM way a bit better. Neither is easy.

My job at the time was trying to align some of that data with data we got from external data sources like TripAdvisor, Qype, HRS, and a few others. Way too much time got invested in dumbing down categories, mapping one to the other, and trying to make sense of stuff. We had lots of issues with duplicate POIs because of all sorts of subtle differences in how different data sources were annotated with categories, tags, etc. Some of the data just isn't that good, complete, or consistent and you have to deal with that.

bambax over 2 years ago

All true. Taxonomy is indeed hard. But does it actually matter?

It seems what matters is not how files are stored/organized, but how one can find the files one is looking for. Taxonomy is mostly a search problem.

Yet, although one is usually capable of remembering specific or unique details about a file, it's still incredibly hard to search for a file or its contents effectively.

Dropbox, to pick just one example, ditched all its advanced search functionality a couple of years ago and now only allows searching for any keyword (not all of them), with forced stemming, and no filtering of the results.

I seem to remember there were startups a decade ago trying to address the search problem on the desktop but they all got acquired or folded; I don't understand why. The problem is real and seems quite solvable; yet it seems there's no actual market for it. It's a bit of a mystery.

qwerty456127 over 2 years ago

Taxonomy is a demon which separates people into perfectionists and non-perfectionists just before dragging both kinds into a hell of exceptions, weird relations and impractical locations. Perfectionists get stuck spending infinite amounts of time engineering the taxonomy; non-perfectionists face the quirks later.

Tags are better but can turn out to be even harder (for similar reasons, amplified combinatorially).

Labels are the most practical. The Gmail inventor was a genius. Folders must be gone (except for system files).

dimatura over 2 years ago

Nice article. As a data hoarder this is something I've run into several times over the years. Still haven't found a great solution. If xattrs were more universally supported, then that would probably be the best solution. Instead, I've come to specialized solutions for different data types.

For research papers (in PDF), I have a half-baked Python solution I wrote myself that cobbles together the cermine PDF parser/content extractor, the whoosh full-text search engine, and an ncurses-based interface.

For personal images, I use the elodie CLI tool, but I'd like to move away from it as I don't like how it modifies files by embedding metadata in them. For research/computer vision data, I use custom tooling based on sidecar files and a pg database kept in sync with the sidecars. For audio samples, I just use a commercial solution, sononym, that uses an sqlite database.

For other miscellaneous use cases, I've also used TMSU. Pretty nice as a more general-purpose solution, except for the inherent issues mentioned by the article.

So yeah, I agree it's a hard problem.

karaterobot over 2 years ago

Taxonomies are not only hard, but impossible. Tagging is hardly better. The problem you run into with tagging is that one day you call them `photos` and the next day you call them `pictures` or `pics` or `photography` or `disneyland trip 2019` or nothing at all, and then the day after that you can't find anything. The only solution is constant maintenance. Organization of non-trivial amounts of information is an ongoing problem, not solvable per se. You can hack out a path through the jungle, but the jungle just keeps growing.
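
The synonym drift described above can be sketched in a few lines of Python. This is only an illustration of the "constant maintenance" point, not a fix for it: the `CANONICAL` table and the function names are hypothetical, and someone still has to keep the table up to date as new variants appear.

```python
# Hypothetical synonym table: every known variant maps to one canonical tag.
# Keeping this table current is exactly the ongoing maintenance described above.
CANONICAL = {
    "pictures": "photos",
    "pics": "photos",
    "photography": "photos",
}


def normalize_tag(raw: str) -> str:
    """Lower-case and trim a tag, then map known variants to the canonical form."""
    tag = raw.strip().lower()
    return CANONICAL.get(tag, tag)


def normalize_tags(raw_tags):
    """Canonicalize a list of tags and drop duplicates, keeping first-seen order."""
    seen = {}
    for raw in raw_tags:
        seen.setdefault(normalize_tag(raw), None)
    return list(seen)
```

A tag such as `disneyland trip 2019` passes through unchanged, since no table can anticipate every one-off label.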

nobrains over 2 years ago

Tagging seems like a solution, but it isn't.

Specific problems with tagging:

- You need to tag every file (whereas with folders, you just navigate to the folder and everything you store there is in that folder).
- It takes too long.
- Too much thinking overhead (at the time of storing).
- To be effective, you have to enter the name for all tag entries (e.g. project, type, etc.). If anything is missed for a file, that file will never be found.
- You have to remember what tag categories (e.g. project, type, etc.) you have used. If you don't use them for 3 months, you will have forgotten.
- You have to remember the enumeration you are using for some tag categories (e.g. for type you might decide to use only photo, video and music. Now you have to remember that. You also have to remember it's "photo", not "image").
- If tags were the solution, they would already be used everywhere.

The tagging system SEEMS like a good solution, but once you go deeper, it just doesn't work.

Zaheer over 2 years ago

Fun fact: Taxonomist is actually a role at many of the top tech companies. Many faceted search experiences are manually determined by taxonomists. Example: searching for cars lets you filter by brand, color, engine type, etc., whereas searching for furniture lets you filter by dimensions. Facebook, Walmart, etc. employ a few of these folks.

Joker_vD over 2 years ago

Well, if symlinks seem inelegant, there are hard links as well, you know. Ultimate tag system: for each tag, make a directory with the corresponding name, and fill it with hard links to the appropriate files stored in one big directory of mud.

Of course, it all has to fit on a single disk drive, and actually deleting a file is difficult, but hey, that must be easily solvable; details are left as an exercise for the reader.
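
A rough Python sketch of the hard-link scheme described above, under stated assumptions: the directory layout (one `store` directory plus one directory per tag) and all function names are hypothetical, and everything must live on one filesystem because hard links cannot cross filesystems. It also shows why deletion is the awkward part: every link has to be removed before the data is actually gone.

```python
import os
from pathlib import Path


def tag_file(store: Path, tags_root: Path, filename: str, tags: list[str]) -> None:
    """Hard-link store/filename into tags_root/<tag>/ for every given tag."""
    for tag in tags:
        tag_dir = tags_root / tag
        tag_dir.mkdir(parents=True, exist_ok=True)
        link = tag_dir / filename
        if not link.exists():
            os.link(store / filename, link)


def files_with_tags(tags_root: Path, *tags: str) -> set[str]:
    """Names that appear in every one of the given tag directories."""
    listings = [set(os.listdir(tags_root / tag)) for tag in tags]
    return set.intersection(*listings) if listings else set()


def delete_everywhere(store: Path, tags_root: Path, filename: str) -> None:
    """Remove the stored file and every tag-directory link to the same inode."""
    target = os.stat(store / filename)
    for tag_dir in tags_root.iterdir():
        link = tag_dir / filename
        if link.exists() and os.path.samestat(os.stat(link), target):
            link.unlink()
    (store / filename).unlink()
```

Querying by several tags is then just a set intersection over directory listings, which is about as simple as a tag query gets.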

kkfx over 2 years ago

In my "personal digital garden evolution" I've reached a timeline-level taxonomy, built with Emacs/org-mode/org-roam/org-attach etc.:

- New textual entries (headings) go into monthly notes ($org-roam-directory/timeline/year/$month-name.org); they may or may not have files attached, of course. Attached files are generally linked directly in the heading or inside the heading's textual content, for single-click access and an at-a-glance view. This avoids having too many tiny files, and also avoids big ones that are slow to work with.
- Another subdir of org-roam-directory is for "topics": one note per topic, linking or org-transcluding (slow and a bit limited, but still useful) the collected entries in timeline style.
- Another is a workdir where I craft my catalogue (using org-mode drawers created from templates, to allow easy org-ql queries) and the queries that explore my notes in different views. It's not as easy as TiddlyWiki's transparent transclusion, but it allows a certain degree of practical usability, fine-grained selection and easy composition.

MOST of my files and config live either as org-attachments or are tangled from org-mode. So yes, taxonomy is hard, but we have tools to master it IF we decide to discover them and invest time in really improving our digital garden, instead of leaving the classic mess of files and hoping for some miracle "application" that automagically solves all issues. Unfortunately, due to lack of interest from most people, such systems are too little developed to be as effective as they could be...

My personal experience is:

- We need taxonomy anyway; mere full-text search with extras à la Google suffices for a certain percentage of cases but fails for more than that.
- We need taxonomies that are a bit flexible in storage terms and can change at a slow pace.
- We need integration, which is NOT possible in ALL modern software; for that we need classic desktops where the OS was a framework/live image and everything else is just a module, a bit of code, of it, with end-user programming concepts, because no UI can be effective enough in "no code" style and no "modern programming" style is usable for user programming.

Bottom line: people should learn a bit about information management at school, from how a library or a pharmacy organizes books/meds on its shelves to book indices and personal information archives. Nothing exaggerated, just the bare minimum to understand how to manage data, digital and physical, in various forms for a lifetime...

fungiblecog over 2 years ago

Many people here are saying this is a search problem, but actually there are two distinct ways of finding information: search and browse. Unfortunately it is hard to support both without a lot of work.

Sometimes you want to locate a specific item, in which case you need a good way of searching, and sometimes you want to browse through related information, so you want to see a hierarchical structure.

Google Drive was originally designed on the principle that search was all you needed, so it was all tag-based. And it was terrible as soon as you had a lot of data. So Google was forced to introduce the ability to create a folder structure.

tabtab over 2 years ago

I agree. Set theory is more powerful and flexible than taxonomic trees (although not perfect). It's why I believe the future is Table Oriented Programming (TOP), where code blocks are either in or managed by RDBMS. Code-centric tools rely too much on file trees and other trees. If you instead try to design your stack and/or language around sets, you'll probably end up with something similar to TOP.

https://news.ycombinator.com/item?id=33413124&p=2#33415249

galfarragem over 2 years ago

Been there and tried them all. One day you realise there isn't a perfect approach and you must settle and compromise. I settled on project-based [0]. When you notice too much repetition (and it happens more rarely than you may think), it's time to simply consider a new project and a symlink. Not pure, but it is simple and practical.

[0] https://github.com/slowernews/hamster-system#hamster-folder---organize-your-documents

John_Wilkins over 2 years ago

http://www.alamut.com/subj/artiface/language/johnWilkins.html

nonrandomstring over 2 years ago

Plain old unix find with grep, locate or xapian help me navigate. Remembering to run updatedb is the pain. I won't put it in cron because grinding disks every night annoys the hell out of me.

drsopp over 2 years ago

I store my files chronologically. Every time I want to store something I want to keep, I make a folder with the name format "2022-11-01 something something". The text here is just a short description in natural language. If I feel like it I add a tag here like invoice or photos. The point is to make it searchable. I easily find most things I am looking for with Directory Opus and Recoll.
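
The naming convention above is easy to automate. Here is a minimal Python sketch, assuming the format is an ISO date, then a free-text description, then optional tags; the function name and the tag handling are hypothetical and not part of the commenter's setup.

```python
from datetime import date


def dated_folder_name(description: str, day: date, tags=()) -> str:
    """Build a name like '2022-11-01 something something invoice'."""
    parts = [day.isoformat(), description.strip(), *tags]
    return " ".join(part for part in parts if part)
```

Because the ISO date comes first, a plain alphabetical sort of the folder names is also a chronological sort.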

lob_it over 2 years ago

Taxonomies are one of the funnest things I have worked with (very limited capacities) in the WordPress implementation (about 1.5 million indexed pages).

I think spatial datasets/spatial aptitude either makes them relevant or just an untapped avenue for exploration.

Interesting article and most likely part of 21st century technology on many.... levels.

tsthename over 2 years ago

Separating functions helps us use the right tool for the job. Taxonomies are for semantics, and the file system is for retrievability. The comfort of hierarchies makes it easy to try and do both simultaneously.

- From computer science, we know graphs give us expressive modeling capabilities. I sometimes use mermaid ER diagrams as a concept map to capture complex relationships between files and concepts.
- From library science, faceted classification works well for extensive collections because inserting a new entry does not require thinking about existing entries. I maintain entries in a spreadsheet for extensive collections that matter to me. Note: Facets are meant for unchanging or infrequently changing properties.

Creating a concept map and maintaining a faceted classification system take work, so I only use them for things that are very important to me.

90% of files I only care about for a short amount of time. I use the file system to co-locate the files I'm currently working on (so a project) but then archive all of it when I move on to something else.

The trade-off is that I give up on sharing files between projects. I don't want to deal with references. I copy from the archive when I need to. On the rare occasion when I need to reconcile the same file between projects, I do it manually. What helps is working on only a few projects at the same time.

TL;DR: Archive more. Use high-investment techniques only for the small percentage of files that really matter.
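
To make the faceted-classification idea above concrete, here is a small Python sketch in which each entry is a spreadsheet-like row of independent facets. The entries, facet names, and function names are all made up for illustration; the point is that adding a row never requires looking at the existing ones.

```python
from collections import Counter

# Hypothetical collection: one row per item, one column per facet.
entries = [
    {"name": "lease.pdf", "kind": "contract", "year": 2021, "medium": "scan"},
    {"name": "passport.jpg", "kind": "identity", "year": 2019, "medium": "photo"},
    {"name": "tax-return.pdf", "kind": "tax", "year": 2021, "medium": "scan"},
]


def filter_by_facets(rows, **wanted):
    """Keep the rows whose facets match every requested facet=value pair."""
    return [row for row in rows if all(row.get(k) == v for k, v in wanted.items())]


def facet_counts(rows, facet):
    """Count how many rows carry each value of one facet (for a filter sidebar)."""
    return Counter(row[facet] for row in rows if facet in row)
```

For example, `filter_by_facets(entries, year=2021, medium="scan")` narrows the collection on two facets at once without either facet being "above" the other, which is what a folder hierarchy would force.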

bluetomcat over 2 years ago

Categorization is order, tagging is chaos. No order is perfect, but it enables people to talk about the same entities and to agree on certain ideas. Tagging is when everyone invents their own labels and conventions, requiring tons of "smart" algorithms to make sense of the mess.

Self-Perfection over 2 years ago

> Linux is still in the stone age when it comes to tagging. Common Linux filesystems (including ext4 & ZFS) support extended attributes, but I'm not aware of any Linux distro or file manager that includes tagging features based on them (or embedded metadata, for that matter).

KDE has decent support for tags in extended attributes. Tags are shown and can be edited in Dolphin (the file manager), and they get indexed by Baloo. This is far from the stone age the author claims!

bena over 2 years ago

It's hard for everything in every way.

Biology. Everything is a fish or nothing is a fish. Trees don't exist. Tomatoes are fruits, as are cucumbers, pumpkins, bell peppers, and most things we don't consider fruits. But all fruits are also vegetables. Strawberries are neither straw nor berries. Etc.

When they said the two hardest things in computer science were naming things and cache invalidation, it's partly because naming things is a hard problem in every discipline.

mcv over 2 years ago

The big central problem here is that non-trivial taxonomies aren't trees but graphs. Trying to get a tree-based filesystem to represent a taxonomy means you're forcing a graph into a tree. Symlinks help because they turn your tree into a graph (albeit one that breaks too easily; I think that could be fixed, though). But in the end, a traditional filesystem is a poor way to represent a taxonomy.
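
The tree-versus-graph point above can be shown with a toy Python sketch. The taxonomy below is hypothetical; the only assumption is that each node may list more than one parent, which is exactly what a plain directory tree cannot express without links.

```python
# Hypothetical taxonomy: each node lists ALL of its parents.
PARENTS = {
    "micropython-on-pico": ["programming", "microcontrollers"],
    "programming": ["computing"],
    "microcontrollers": ["electronics"],
    "computing": [],
    "electronics": [],
}


def ancestors(node, parents=PARENTS):
    """Every category reachable by following parent edges upward."""
    found = set()
    stack = list(parents.get(node, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


def fits_in_tree(parents=PARENTS):
    """A directory tree gives every node at most one parent."""
    return all(len(p) <= 1 for p in parents.values())
```

Here `fits_in_tree()` is false because one item has two parents, so storing it in a filesystem means either picking one parent arbitrarily or adding a link.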

legulere over 2 years ago

I guess the lesson to be learned is that you will need to support multiple classification systems. For biological taxonomy, both morphological and genetic classification make sense.

In the health data exchange format FHIR, identifiers and codings have a system and a value/code. Usually you can specify multiple of them.

bjhartin over 2 years ago

Can anyone recommend a paper or book that approaches this from first principles, e.g. are some limitations/abilities due to mathematical structures such as functions (folders), relations (tags), etc?

MrPatan over 2 years ago

https://en.wikipedia.org/wiki/Celestial_Emporium_of_Benevolent_Knowledge

nl over 2 years ago

Taxonomies are a crutch that simplifies a complex problem, usually too much to be useful.

Multidimensional latent spaces based on content and other characteristics of the object are the real solution here.

ChrisMarshallNY over 2 years ago

Totally agree that it's difficult.

> Symlinks are brittle.

On the Mac OS, aliases are a lot less brittle. You can move aliased files around, and they usually won't lose their connection.

encryptluks2 over 2 years ago

Preferably whatever web server you use will allow you to assign sections to multiple categories. I agree project based is usually the way to go though.

pvaldes over 2 years ago

This is not taxonomy; it is classification.

leoc over 2 years ago

This discussion crops up every so often, e.g. at https://news.ycombinator.com/item?id=29141800 . Here I repost a composite of my old comments https://news.ycombinator.com/item?id=14542595 https://news.ycombinator.com/item?id=14546682 from a previous occasion when this was discussed https://news.ycombinator.com/item?id=14537650 . Anyone who is serious about this stuff should probably start here.

> Well, since you ask, here's Hans Reiser's old stuff:

https://reiser4.wiki.kernel.org/index.php/Future_Vision
https://reiser4.wiki.kernel.org/index.php/V4

(and http://lwn.net/2001/1108/a/reiser4-transaction.php3 ). And here's some emails etc. I wrote in response:

https://web.archive.org/web/20040728044342/http://www.st-and...
https://marc.info/?l=linux-kernel&m=111624697710426
https://www.mail-archive.com/reiserfs-list@namesys.com/msg09...
https://www.mail-archive.com/reiserfs-list@namesys.com/msg20...
https://www.mail-archive.com/reiserfs-list@namesys.com/msg20...
https://www.mail-archive.com/reiserfs-list@namesys.com/msg20...
https://www.mail-archive.com/reiserfs-list@namesys.com/msg20...

plus some of the discussion threaded from those posts. (Sorry, my stuff needs rewriting and updating but I'm not in the position to do it at present. If there's anything you would like to ask about please do. https://news.ycombinator.com/item?id=9809041 and https://news.ycombinator.com/item?id=10548477 touch on things that are a bit further down the line, but related, in particular to the handling of "internal metadata" and files with a compound internal structure.)

Garlef over 2 years ago

tl;dr:

* Partitioning files into folders is most likely wrong: Most things need to be in multiple folders.
* (Sidenote: Symlinks are not a good solution)
* Tagging would be best but no good support from the OS for metadata
* We're working on something

blippage over 2 years ago

Categorising information into taxonomies is like trying to hammer a square peg into a round hole; sometimes a necessary but undecidable problem. As someone once said: a book (article, webpage, whatever) is rarely about one thing.

This is a topic that is at the top of my mind, as I grapple to organise my growing gemini/gopher site. Is it better to index, list a table of contents, search, or try to classify it with the Dewey Decimal Classification (DDC)?

The DDC. It has come under criticism, and librarians have justified a lot of their efforts in moving away from it. I doubt that the effort was justified. It boils down to this: you have to put a book in a library somewhere. And that somewhere has to boil down to a taxonomy.

To illustrate the problem, is a book about programming microcontrollers a book about programming, or is it primarily about microcontrollers? The Arduino Cookbook is in DDC 621.3810285536 (yes, really. That's obviously extreme, though). That's part of the electronics section, which seems fair enough to me. So far so good. But "Beginning MicroPython with the Raspberry Pi Pico: Build Electronics ..." is in section 005.13, which is programming. A completely different place. "Programming with STM32: Getting started with the nucleo board" is in 005.262, which is also programming. But why 005.262 rather than 005.13? It almost seems that whoever is classifying these books has no idea what they're doing ;)

I could go on at length about the confusions I have in trying to place my content. In the end, you have to make a somewhat arbitrary decision and just go with it.

Tables of contents work reasonably well within a book. Subjects are often non-intersecting, so they can be treated separately. For the most part, anyway.

A solution which is fairly reasonable is to index your site. Indices are useful because they allow you to take multiple views on something, thereby eliminating the taxonomy problem.

I'm not a great fan of tagging. It is too much of a scattergun approach for my liking. Perhaps some merit, though.

Then there's textual searching. In fact, that's how I relocated some of my notes. So, text search it is, then? Well, not quite. It seemed like a good system for my site, which is focussed. It has problems scaling. I don't want millions of results, a la Google; I want a few relevant ones.

This is even a problem with search engines for the gemini and gopher protocols, where nobody is even trying to game the system. You often end up with a lot of similar stuff at the top which I am not interested in.

Oddly, for gemini, I prefer the "Collaborative Directory of Geminispace" over at gemini://cdg.thegonz.net/ , which is a taxonomy of categories, the very thing that I had doubts about.

So, in summary, it's not easy.
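
The point above about indices giving multiple views can be sketched as a back-of-book style index in Python. The page names and subject terms below are invented for illustration; the only assumption is that each page is filed under one or more terms.

```python
from collections import defaultdict

# Hypothetical site content: each page is filed under one or more subject terms.
pages = {
    "micropython-on-pico.gmi": ["programming", "microcontrollers"],
    "arduino-recipes.gmi": ["electronics", "microcontrollers"],
}


def build_index(page_terms):
    """Map each subject term to the sorted list of pages filed under it."""
    index = defaultdict(list)
    for page, terms in page_terms.items():
        for term in terms:
            index[term].append(page)
    return {term: sorted(found) for term, found in sorted(index.items())}
```

Unlike a shelf position, the index never has to decide whether the MicroPython page is "really" about programming or about microcontrollers; it simply lists the page under both.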