Happy to see this on the front page. I work for Open Knowledge International, which develops the Frictionless Data standard. Feel free to ask me anything, and I'll make sure that I or someone else from the team answers it.
The frictionlessdata landing page has very generalized verbiage, so here's my technical summary of it...

The main idea of the "container" or "package" hinges on a file called "datapackage.json" [1].

An analogy would be "sfv" files like "checksums.sfv" for verifying the integrity of files. Since so many people use "sfv" as a de facto standard, many programs exist to scan it and verify the associated files. Another analogy would be DTD for XML files.

Similarly, if everybody could converge on the file "datapackage.json" as a metadata & schema description standard, a useful ecosystem of utilities and libraries for processing data could take advantage of it.

One example library would be: https://github.com/frictionlessdata/datapackage-py

(In the Python source code for "package.py" [2], Ctrl+F search for "datapackage.json" to see how it looks for that particular file.)

With a data wrangling API like that, one could then do joins on CSV files directly [3] and write the results to another CSV file with the associated "datapackage.json".

Instead of passing "dumb" CSV or raw JSON files around, add a little "intelligence" to the dataset by way of "datapackage.json" so tools can parse the schema and process CSV/JSON at a higher abstraction level. That leads to more "effortless" and "frictionless" data interoperability. (A rough sketch follows the links below.)

What I can't tell so far is whether "datapackage.json" already has momentum of adoption across many communities such as Julia, TensorFlow, Hadoop, etc. and we need to get on the bandwagon -- or -- whether adoption is still in its infancy and there are other competing data "container/package" specifications to look at.

[1] http://frictionlessdata.io/guides/data-package/

[2] https://github.com/frictionlessdata/datapackage-py/blob/master/datapackage/package.py

[3] http://frictionlessdata.io/guides/joining-tabular-data-in-python/
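To make that concrete, here's a rough sketch of the idea. The dataset name, file name, and fields are invented, and the descriptor shape follows the Data Package guide in [1]; I'm consuming it with only the Python standard library here rather than datapackage-py, just to show that the metadata is plain JSON any tool can read:

    import csv
    import json
    from pathlib import Path

    # A toy CSV so the sketch is self-contained (file name and columns are made up).
    Path("flights.csv").write_text("origin,destination,passengers\nOSL,LHR,180\nLHR,JFK,240\n")

    # A minimal Data Package descriptor, shaped per the guide in [1].
    descriptor = {
        "name": "example-flights",
        "resources": [
            {
                "name": "flights",
                "path": "flights.csv",
                "schema": {
                    "fields": [
                        {"name": "origin", "type": "string"},
                        {"name": "destination", "type": "string"},
                        {"name": "passengers", "type": "integer"},
                    ]
                },
            }
        ],
    }
    Path("datapackage.json").write_text(json.dumps(descriptor, indent=2))

    # Any consumer can now read the declared schema instead of guessing column types.
    meta = json.loads(Path("datapackage.json").read_text())
    for resource in meta["resources"]:
        types = {f["name"]: f["type"] for f in resource["schema"]["fields"]}
        with open(resource["path"], newline="") as fh:
            for row in csv.DictReader(fh):
                typed = {k: (int(v) if types[k] == "integer" else v) for k, v in row.items()}
                print(typed)

A real library like datapackage-py layers validation, typed table reading, and so on over that same descriptor, but the point is that the "intelligence" is just a small JSON file sitting next to the CSV.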
This is an overly complicated data container format for not much advantage. To be honest, everything you can do with this can be done at the same level or better with SQLite, an actual database system. Having to implement four different parsers and validation functions spanning a mix of CSV, XML, and JSON just to access what is essentially a CSV file is not feasible.
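For comparison, here's a rough sketch of what I mean with SQLite: the typed schema, the constraints, and the data all live in one file, and any client can introspect them (table and column names are invented just for illustration):

    import sqlite3

    # Schema and data travel together in a single SQLite file.
    con = sqlite3.connect("flights.sqlite")
    con.execute("""
        CREATE TABLE IF NOT EXISTS flights (
            origin      TEXT    NOT NULL,
            destination TEXT    NOT NULL,
            passengers  INTEGER NOT NULL CHECK (passengers >= 0)
        )
    """)
    # Constraints are enforced on insert instead of by a separate validator.
    con.execute("INSERT INTO flights VALUES (?, ?, ?)", ("OSL", "LHR", 180))
    con.commit()

    # The declared column types are introspectable by any SQLite client.
    print(con.execute("PRAGMA table_info(flights)").fetchall())
    con.close()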
CSV as a serialization format? Ouch. My experience with CSV has been nothing but pain: ambiguous formats, quoting issues, and incompatible behavior between language libraries and popular GUI tools like Excel or data-vis apps.

I wonder if there's anything better.
It is relevant to point out yesterday's article by Wes McKinney on Apache Arrow and the future of high-performance data formats: https://news.ycombinator.com/item?id=15335462