Frictionless Data: Lightweight standards and tooling for data sharing

84 points by rkda over 7 years ago

5 comments

vitorbaptistaa over 7 years ago
Happy to see this on the front page. I work for Open Knowledge International, which develops the Frictionless Data standard. Feel free to ask me anything, and I'll make sure that I or someone else from the team answers it.
jasode over 7 years ago
The frictionlessdata landing page has very generalized verbiage, so here's my technical summary of it...

The main idea for "container" or "package" hinges on a file called "datapackage.json"[1].

An analogy would be the "sfv" files like "checksums.sfv" for verifying the integrity of files. Since so many people use "sfv" as a de facto standard, many programs exist to scan it and verify the associated files. Another analogy would be DTD for XML files.

Similarly, if everybody could converge on the file "datapackage.json" as a metadata & schema description standard, a useful ecosystem of utilities and libraries for processing data would take advantage of it.

One example library would be: https://github.com/frictionlessdata/datapackage-py

(In the Python source code for "package.py"[2], Ctrl+F search for "datapackage.json" to see how it looks for that particular file.)

With a data wrangling API like that, one could then do joins on csv files directly[3] and write the results to another csv file with the associated "datapackage.json".

Instead of passing "dumb" csv or raw json files around, add a little "intelligence" to the dataset by way of "datapackage.json" so tools can parse the schema and process csv/json at a higher abstraction level. That leads to more "effortless" and "frictionless" data interoperability.

What I can't tell so far is if "datapackage.json" already has momentum of adoption across many communities such as Julia, Tensorflow, Hadoop, etc. and we need to get on the bandwagon -- or -- adoption is still in its infancy and there are other competing data "container/package" specifications to look at.

[1] http://frictionlessdata.io/guides/data-package/

[2] https://github.com/frictionlessdata/datapackage-py/blob/master/datapackage/package.py

[3] http://frictionlessdata.io/guides/joining-tabular-data-in-python/
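To make the "schema next to the data" idea concrete, here is a minimal sketch using only the Python standard library. It does not use datapackage-py; the descriptor contents, the `read_typed_rows` helper, the `CASTERS` mapping, and the sample values are all made up for illustration, and the descriptor is only modeled on the general `resources` / `schema` / `fields` layout of a "datapackage.json".

```python
import csv
import io
import json

# Hypothetical descriptor, modeled on the general datapackage.json layout.
# The package name, resource name, and field list are invented for this sketch.
DESCRIPTOR = json.loads("""
{
  "name": "example-package",
  "resources": [
    {
      "name": "cities",
      "path": "cities.csv",
      "schema": {
        "fields": [
          {"name": "city", "type": "string"},
          {"name": "population", "type": "integer"}
        ]
      }
    }
  ]
}
""")

# Made-up sample data standing in for the file named by "path".
CSV_TEXT = "city,population\nA,100\nB,250\n"

# Assumed mapping from schema type names to Python constructors.
CASTERS = {"string": str, "integer": int, "number": float}


def read_typed_rows(resource, csv_text):
    """Parse CSV text and cast each column using the resource's schema."""
    fields = resource["schema"]["fields"]
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {f["name"]: CASTERS[f["type"]](raw[f["name"]]) for f in fields}
        for raw in reader
    ]


rows = read_typed_rows(DESCRIPTOR["resources"][0], CSV_TEXT)
print(rows)
```

The point is only that a consumer no longer has to guess that `population` is an integer: the descriptor says so, and any tool that understands the descriptor can apply the same casting.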
craig_peacock over 7 years ago
This is an overly complicated data container format for not much advantage. To be honest, everything you can do with this can be done just as well or better with SQLite, an actual database system. Having to implement 4 different parsers and validation functions spanning a mix of CSV, XML, and JSON just to access what is essentially a CSV file is not feasible.
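For comparison, the SQLite route could look roughly like the sketch below, using Python's built-in `sqlite3` module. The table name, column names, and sample rows are hypothetical; the idea shown is that the column types live in the database schema itself rather than in a separate descriptor file.

```python
import csv
import io
import sqlite3

# Made-up sample data; a real workflow would read this from a CSV file.
CSV_TEXT = "city,population\nA,100\nB,250\n"

# In-memory database for the sketch; pass a file path to persist it instead.
conn = sqlite3.connect(":memory:")

# The schema (names, types, constraints) is stored alongside the data.
conn.execute(
    "CREATE TABLE cities (city TEXT NOT NULL, population INTEGER NOT NULL)"
)

# Convert each CSV record to a typed tuple before inserting.
reader = csv.DictReader(io.StringIO(CSV_TEXT))
records = [(r["city"], int(r["population"])) for r in reader]
conn.executemany("INSERT INTO cities (city, population) VALUES (?, ?)", records)
conn.commit()

# Typed queries now work without any external schema file.
total = conn.execute("SELECT SUM(population) FROM cities").fetchone()[0]
print(total)
```

The trade-off is that the result is a binary database file rather than plain text that can be diffed or opened in an editor.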
jakubp over 7 years ago
CSV as serialization format? Ouch. Could we do better? My experience with CSV has been nothing but pain in the past. Ambiguous formats, quoting issues, incompatible libraries between languages and popular GUI tools like Excel or data vis apps.

I wonder if there's anything better.
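The quoting problem is easy to reproduce. The sketch below, with made-up sample values, writes fields containing a comma, double quotes, and a newline using Python's `csv` module, then compares a naive split on commas with a proper CSV parse.

```python
import csv
import io

# Made-up rows whose fields contain a comma, double quotes, and a newline.
rows = [
    ["name", "note"],
    ["Smith, John", 'said "hi"'],
    ["multi", "line one\nline two"],
]

# Write with the csv module's default dialect (minimal quoting).
buf = io.StringIO(newline="")
csv.writer(buf).writerows(rows)
text = buf.getvalue()

# Naive parsing: splitting the second line on commas breaks the quoted field.
naive = text.splitlines()[1].split(",")

# Proper parsing: the csv reader honors quotes and embedded newlines.
parsed = list(csv.reader(io.StringIO(text, newline="")))

print(len(naive))      # more pieces than the two real fields
print(parsed == rows)  # the round trip preserves every field
```

Tools that disagree on these quoting rules, or that split on commas and newlines directly, are where most of the incompatibilities come from.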
sandGorgon over 7 years ago
It is relevant to point out yesterday's article by Wes McKinney on Apache Arrow and the future of high performance data formats - https://news.ycombinator.com/item?id=15335462