Happy to see this on the front page. I work for Open Knowledge International, which develops the Frictionless Data standard. Feel free to ask me anything, and I'll make sure that I or someone else from the team answers it.
The frictionlessdata landing page has very generalized verbiage, so here's my technical summary of it...

The main idea of the "container" or "package" hinges on a file called "datapackage.json" [1].

An analogy would be "sfv" files like "checksums.sfv" for verifying the integrity of files. Since so many people use "sfv" as a de facto standard, many programs exist to scan it and verify the associated files. Another analogy would be DTD for XML files.

Similarly, if everybody could converge on the file "datapackage.json" as a metadata & schema description standard, a useful ecosystem of utilities and libraries for processing data could take advantage of it.

One example library would be: https://github.com/frictionlessdata/datapackage-py

(In the Python source code for "package.py" [2], Ctrl+F search for "datapackage.json" to see how it looks for that particular file.)

With a data wrangling API like that, one could then do joins on CSV files directly [3] and write the results to another CSV file with the associated "datapackage.json".

Instead of passing "dumb" CSV or raw JSON files around, add a little "intelligence" to the dataset by way of "datapackage.json" so tools can parse the schema and process CSV/JSON at a higher abstraction level. That leads to more "effortless" and "frictionless" data interoperability. (A rough sketch follows the links below.)

What I can't tell so far is whether "datapackage.json" already has momentum of adoption across many communities such as Julia, TensorFlow, Hadoop, etc. and we need to get on the bandwagon -- or -- whether adoption is still in its infancy and there are other competing data "container/package" specifications to look at.

[1] http://frictionlessdata.io/guides/data-package/

[2] https://github.com/frictionlessdata/datapackage-py/blob/master/datapackage/package.py

[3] http://frictionlessdata.io/guides/joining-tabular-data-in-python/
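To make that concrete, here's a rough sketch of the idea. The dataset name, file name, and fields are invented, and the descriptor shape follows the Data Package guide in [1]; I'm consuming it with only the Python standard library here rather than datapackage-py, just to show that the metadata is plain JSON any tool can read:

    import csv
    import json
    from pathlib import Path

    # A toy CSV so the sketch is self-contained (file name and columns are made up).
    Path("flights.csv").write_text("origin,destination,passengers\nOSL,LHR,180\nLHR,JFK,240\n")

    # A minimal Data Package descriptor, shaped per the guide in [1].
    descriptor = {
        "name": "example-flights",
        "resources": [
            {
                "name": "flights",
                "path": "flights.csv",
                "schema": {
                    "fields": [
                        {"name": "origin", "type": "string"},
                        {"name": "destination", "type": "string"},
                        {"name": "passengers", "type": "integer"},
                    ]
                },
            }
        ],
    }
    Path("datapackage.json").write_text(json.dumps(descriptor, indent=2))

    # Any consumer can now read the declared schema instead of guessing column types.
    meta = json.loads(Path("datapackage.json").read_text())
    for resource in meta["resources"]:
        types = {f["name"]: f["type"] for f in resource["schema"]["fields"]}
        with open(resource["path"], newline="") as fh:
            for row in csv.DictReader(fh):
                typed = {k: (int(v) if types[k] == "integer" else v) for k, v in row.items()}
                print(typed)

A real library like datapackage-py layers validation, typed table reading, and so on over that same descriptor, but the point is that the "intelligence" is just a small JSON file sitting next to the CSV.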
This is an overly complicated data container format for not much advantage. To be honest, everything you can do with this can be done at the same level or better with SQLite, an actual database system. Having to implement four different parsers and validation functions spanning a mix of CSV, XML, and JSON just to access what is essentially a CSV file is not feasible.
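For comparison, here's a rough sketch of what I mean with SQLite: the typed schema, the constraints, and the data all live in one file, and any client can introspect them (table and column names are invented just for illustration):

    import sqlite3

    # Schema and data travel together in a single SQLite file.
    con = sqlite3.connect("flights.sqlite")
    con.execute("""
        CREATE TABLE IF NOT EXISTS flights (
            origin      TEXT    NOT NULL,
            destination TEXT    NOT NULL,
            passengers  INTEGER NOT NULL CHECK (passengers >= 0)
        )
    """)
    # Constraints are enforced on insert instead of by a separate validator.
    con.execute("INSERT INTO flights VALUES (?, ?, ?)", ("OSL", "LHR", 180))
    con.commit()

    # The declared column types are introspectable by any SQLite client.
    print(con.execute("PRAGMA table_info(flights)").fetchall())
    con.close()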
CSV as a serialization format? Ouch. My experience with CSV has been nothing but pain: ambiguous formats, quoting issues, and incompatible behavior between language libraries and popular GUI tools like Excel or data-vis apps.

I wonder if there's anything better.
It is relevant to point out yesterday's article by Wes McKinney on Apache Arrow and the future of high-performance data formats: https://news.ycombinator.com/item?id=15335462