Ask HN: Is there an equivalent of the Dewey Decimal System for data collections?

27 points by Mithriil almost 2 years ago

The Dewey Decimal System is famous as a cataloguing system for books, other writings, and media that have to be physically retrievable.

Is there an equivalent or something similar for data collections? I'm particularly interested in such a system for Machine Learning Operations (MLOps).

7 comments

dang almost 2 years ago

This is recent and (I think?) related:

Johnny Decimal - https://news.ycombinator.com/item?id=36308366 - June 2023 (192 comments)

mindcrime almost 2 years ago

You might find something useful by poking around the literature on "digital libraries".

https://en.wikipedia.org/wiki/Digital_library

There's also some stuff from the Semantic Web space that could be adapted to the purpose you're talking about. See the following example of using VoID with DBpedia categories, for one illustration of this idea.

https://www.w3.org/TR/void/#subject

And with that in mind, since somebody brought up the LOC, you might also find this of interest:

http://www.loc.gov/standards/mods/modsrdf-primer.html

http://www.loc.gov/standards/mods/

https://www.loc.gov/librarians/standards

And from the "see also" category:

https://www.dublincore.org/

genpfault almost 2 years ago

Dublin Core[0]?

[0]: https://en.wikipedia.org/wiki/Dublin_Core

FloorEgg almost 2 years ago

What are you trying to do / get done?

rapjr9 almost 2 years ago

Having worked both in libraries and with large data sets, I don't think a classification system like Dewey Decimal would work for data sets. For example, if you had a large data set of locations for people on a college campus, that data set could be used for sociological research (who meets up with whom?), epidemiology (who potentially spread a virus?), or building maintenance (which buildings have the most people using them?). So what do you classify the data set under? Maybe you could file it under location data (though it is also student+professor+manager data), but what if the basic data is WiFi AP associations? Then the data might also be of interest to wireless network managers and could be classified as wireless management data. So any classification system probably needs a tag system that can apply multiple tags to one data set.

The main point of classification is being able to find what you need, so pigeonholing a data set into one category is not a good idea, because data is much more versatile than a single book (although Dewey Decimal has the same problem with books). There is also the problem of future categories: categories that don't even exist now but might retroactively apply to current data sets. An old data set that was tagged with "location" might in the future also need to be tagged with "privacy". Being able to add tags in the future is important. So a data set is not like a book on a shelf; it's more like a web page, and there should be multiple ways to link to it. Data sets are not physical things, and are not related to a single topic (though they may have been collected for only one purpose initially), so don't use a cataloguing system for physical objects to catalogue data sets.

That said, the range of data sets is infinite. Understanding your own needs may help limit the number of tags/categories, and understanding how your needs may change in the future would help too. Machine Learning data sets are often repurposed. Business aims may expand. At the extreme you may want to do some research into the problem of ontologies; specifically, how to build an ontology of your current and future intentions for these data sets. Look for something flexible, because your needs are likely to change and you may need to transition the "catalog" to a new form in the future. That seems very complicated and difficult, so reduce the problem as much as you can by imposing limits. What is the minimum range of categories that defines your interests?

You might also take a completely different approach: instead of categorizing data sets, just feed them (and all their metadata) to a machine learning algorithm that can search and summarize them for you. That puts the categorization into the search query+process rather than associating it with the data. By its nature raw data is rather category-less; it can later be used for anything. Notes on butterfly migrations from the 1900s later become data on global warming.
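
To make the multi-tag idea concrete, here is a minimal sketch in Python. It is an illustration rather than a recommendation of any particular tool: the `TagCatalog` class, its method names, and the data set name `campus-wifi-associations` are all hypothetical, chosen to mirror the campus WiFi example above.

```python
from collections import defaultdict


class TagCatalog:
    """Toy in-memory catalog: a data set can carry any number of tags."""

    def __init__(self):
        # Two indexes so lookups work in both directions.
        self._tags_by_dataset = defaultdict(set)
        self._datasets_by_tag = defaultdict(set)

    def tag(self, dataset, *tags):
        """Attach one or more tags; can be called again at any later time."""
        for t in tags:
            self._tags_by_dataset[dataset].add(t)
            self._datasets_by_tag[t].add(dataset)

    def find(self, *tags):
        """Return the data sets that carry ALL of the given tags, sorted."""
        if not tags:
            return sorted(self._tags_by_dataset)
        candidates = [self._datasets_by_tag.get(t, set()) for t in tags]
        return sorted(set.intersection(*candidates))

    def tags_of(self, dataset):
        """Return the tags currently attached to a data set, sorted."""
        return sorted(self._tags_by_dataset.get(dataset, set()))


catalog = TagCatalog()
# Hypothetical data set name; tags reflect several possible uses at once.
catalog.tag("campus-wifi-associations", "location", "wireless-management")
catalog.tag("campus-wifi-associations", "epidemiology")
# Years later, a category nobody thought of at collection time:
catalog.tag("campus-wifi-associations", "privacy")

print(catalog.find("location", "privacy"))
```

The point of the two indexes is that the data set is never "shelved" in one place: it is reachable from every tag, and adding a tag later does not disturb the existing ones.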

dredmorbius almost 2 years ago

There are numerous classification and cataloguing systems. Dewey Decimal is only one of a large set. I'd strongly suggest looking at the US Library of Congress's two sets of classifiers: the Library of Congress Classification, a set of 21 alphabetically-denoted categories (A-Z, excluding I, O, W, X, and Y), originally based on a classification devised by Thomas Jefferson, whose personal collection, sold to Congress in 1815, seeded the Library of Congress's holdings.

The LoC also has a set of subject headings, which is a controlled vocabulary used to describe works.

The chief difference between the two is that any given work is assigned one Classification, which is used for shelving and retrieval, but can have multiple subject headings, which are used for general cataloguing.

Paul Otlet, an early 20th-century archivist, created the Universal Decimal Classification, based on Dewey, for a project similar to what you seem to be after: a collection of information rather than works, in a project called the Mundaneum, in many ways a precursor to Google, though based on index cards. Much of the original was destroyed by Nazi Germany during WWII.

The UDC shares with the Dewey and Library of Congress classifications the benefit of being in widespread extant use, which is to say that these classifications reflect current informational needs, have been revised from historical standards which may no longer be especially suitable, and have established bodies and procedures for further updates and revisions.

The Library of Congress classification & subject headings, as works of the US government, are also in the public domain, though useful & usable electronic formats are not readily available to the best of my knowledge. I believe the Library of Congress sells various products, however.

There are other classifications as well, several of which I've heard of though I've not used them: the Colon Classification, Bliss Bibliographic Classification, and several national / language-specific classifications (e.g., German, Nippon, Chinese, Korean, Russian).

Among the more interesting classifications is SuDocs, the Superintendent of Documents Classification, developed and maintained by the U.S. Government Publishing Office. This is not a universal or subject-based system, but is principally organised by Federal department or agency, with additional classes for subordinate offices, category classes (e.g., annual reports, bulletins, law, ...), and book numbers. Whilst likely not directly useful to you, it's an example of a classification designed for the specifics of a particular organisational context.

There are bibliographic standards such as the Dublin Core, which defines 15 metadata elements (though without clearly defining their specifications, meanings, or encodings): Contributor, Coverage, Creator, Date, Description, Format, Identifier, Language, Publisher, Relation, Rights, Source, Subject, Title, and Type. You'll find many content management systems incorporate Dublin Core to some extent.

Another bibliographic standard is MARC: Machine-Readable Cataloging, a standard which emerged from the Library of Congress beginning in the 1960s. It is arcane and heavily influenced by the mainframe-computer (and punch-card data storage) practices of the era.

It's quite useful to think of how any system you specify will be used, by whom, and how it will be applied and maintained:

- Who will be using the classification?
- What purposes will it serve? Especially any processes related to rights management, documentation, provenance, and/or regulatory frameworks.
- What processes will it serve?
- Who will maintain the classification itself?
- Who will maintain the catalogue, e.g., adding new records, updating and correcting existing records, deaccessioning records?
- What if any extant standards are there in my specific field? (I don't know, for example, what if any machine-learning standards exist.)

Keep in mind that classifications tend to be married to the notion of physical records stored in a specific location, which is generally not the case for electronic data. For the latter, useful descriptions of the work or information, information on its provenance, unique item identifier(s), and cross references between related items (e.g., source or derived data) might be more relevant information to capture.

There are various schools of thought on cataloguing and classification generally. Over the past few decades, "self-describing" works, often based on full-text search and some relevance measure, have become popular: essentially Google and other online general Web search indices. Hashes (e.g., SHA-256 checksums) are another self-descriptive tool; they are useful for identifying a specific file but tell you nothing about related works (e.g., the plain text, PDF, and ePub versions of a document, or JPEG, PNG, and SVG versions of a graphic). The advantage of the self-descriptive approach is that human inputs are relatively minimal, and documents self-describe through their contents and relations to other materials and factors. The disadvantage is that this approach lacks any central coordination, uniform classification, quality controls, or validation of self-described contents. The traditional practice (owing much to Melville Dewey himself) of having an independent cataloguer role affords greater control and consistency, but tends to be hopelessly out of date with new incoming information, something of an age-old problem in the library field.

In practice hybrid systems are probably most feasible, with assigned bibliographic characteristics being added to works as time permits and need arises. My own thoughts are that the cataloguing workflow should be explicitly recorded in the bibliographic metadata, and that the levels of automated and manual review and assignment should be coded to works as well.

There are several schools of library & information science, and you might want to poke around their course offerings, syllabi, etc., for information. The School of Information at UC Berkeley (previously SIMS), and the Information programme at Pittsburgh are two of which I'm specifically aware. Wikipedia of course has a more extensive listing: <https://en.wikipedia.org/wiki/List_of_information_schools>

There are also organisations working with electronic data collections at scale, including the Internet Archive and the Wikimedia Foundation, most particularly Wikidata, which might be of interest or relevance to you.

Wikipedia also has articles on most of the classifications and topics I've mentioned here: Library of Congress Classification & Subject Headings, Universal Decimal Classification, Colon Classification, Paul Otlet, and more.
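
As a rough illustration of the hybrid approach described above (a content hash as the unique identifier, plus human-assigned Dublin Core fields and a cross reference from derived data back to its source), here is a small Python sketch. Only the 15 element names and the use of SHA-256 come from the discussion; the `Record` class, the `describe` helper, the `sha256:` prefix, and the sample data are assumptions made up for the example, not part of any standard.

```python
import hashlib
from dataclasses import dataclass, field

# The 15 Dublin Core elements listed above, lower-cased for use as keys.
DUBLIN_CORE_ELEMENTS = (
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
)


@dataclass
class Record:
    """Hypothetical catalogue record: each element maps to a list of values."""
    metadata: dict = field(default_factory=dict)

    def add(self, element, *values):
        # Reject anything outside the 15 elements so records stay uniform.
        if element not in DUBLIN_CORE_ELEMENTS:
            raise KeyError(f"not a Dublin Core element: {element}")
        self.metadata.setdefault(element, []).extend(values)
        return self


def describe(content: bytes, title: str, subjects, source_id=None) -> Record:
    """Build a record whose identifier is derived from the content itself."""
    rec = Record()
    # Self-describing part: the checksum identifies this exact file.
    rec.add("identifier", "sha256:" + hashlib.sha256(content).hexdigest())
    # Human-assigned part: title and (possibly several) subjects.
    rec.add("title", title)
    rec.add("subject", *subjects)
    # Cross reference from derived data back to its source record.
    if source_id is not None:
        rec.add("source", source_id)
    return rec


# Made-up sample data standing in for real files.
raw = describe(b"ap,mac,timestamp\n", "Campus WiFi associations (raw)",
               ["location", "wireless management"])
derived = describe(b"building,hour,count\n", "Building occupancy by hour",
                   ["building maintenance"],
                   source_id=raw.metadata["identifier"][0])

print(derived.metadata["source"] == raw.metadata["identifier"])
```

The hash alone says nothing about how the two files relate; it is the explicit `source` entry, assigned by whoever catalogues the derived file, that records the relationship.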

elijahwright almost 2 years ago

Nobody uses the Dewey Decimal system for cataloguing resources. Unless you're an elementary school librarian, possibly, and even then... no.

You really want the Library of Congress's classification outline.
