Moving away from HDF5

116 points | by nippoo | over 9 years ago

16 comments

bhouston | over 9 years ago

Alembic, a data transfer format for 3D graphics, especially high-end VFX, also initially started with HDF5 but found it to have low performance and to be a general bottleneck (especially in multithreaded contexts).

Luckily the authors of Alembic were smart and in their initial design abstracted out the HDF5 interface, so they were able to provide an alternative IO layer based on C++ STL streams. The streams-based interface greatly outperformed the HDF5 layer.

Details on that transition here:

https://groups.google.com/forum/#!msg/alembic-discussion/FTG1HuuO_qA/jUadpxpk3IoJ

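A minimal sketch of that kind of backend abstraction (toy Python, not Alembic's actual C++ interfaces; the class and method names are made up for illustration):

```python
from abc import ABC, abstractmethod

class ArchiveBackend(ABC):
    """The storage layer the rest of the pipeline talks to; swapping the
    concrete backend never touches reader/writer code built on top of it."""

    @abstractmethod
    def write_blob(self, key: str, payload: bytes) -> None: ...

    @abstractmethod
    def read_blob(self, key: str) -> bytes: ...

class StreamBackend(ArchiveBackend):
    """Plain-stream stand-in for a non-HDF5 layer; an HDF5Backend exposing
    the same two methods could be dropped in behind the same interface."""

    def __init__(self):
        self._blobs = {}

    def write_blob(self, key, payload):
        self._blobs[key] = payload

    def read_blob(self, key):
        return self._blobs[key]

backend = StreamBackend()
backend.write_blob("/mesh/points", b"\x00" * 24)
print(backend.read_blob("/mesh/points"))
```
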
ipunchghosts | over 9 years ago

Reading all these comments that bash HDF5 makes me want to tell how HDF5 has really worked for my group.

Although the spec is huge, there is plenty of sample code online to get it working. You do actually have to read it, though, to understand slabs, hyperslabs, strides, etc. Once you do, it's really versatile.

As far as speed, we used it to replace our proprietary data format. We would have to provide readers to all the scientists that use our data. It was a nightmare. Some people want stuff in R, some in Python 2.7, some in Python 3.4, some in Matlab, and the list goes on. HDF5 gets rid of all this.

When in the field and the system shits the bed, it's really easy to open an HDF5 file in HDFView and inspect the file contents. I don't always have Matlab available when I'm in the field, same with Python. Sometimes I just need to look at a time series and I can diagnose the problems with the system.

For me, it's silly in 2016 to have any kind of proprietary binary format when something like HDF5 exists.

Many of the complaints the author had make me think he's the stereotypical scientist: really smart in one area but can't program worth beans. I don't think that's HDF5's fault.

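A minimal h5py sketch of the write-once, read-anywhere workflow described above (the file name, dataset path, chunking, and attribute are illustrative; h5py must be installed):

```python
import numpy as np
import h5py

# Write a 2-D time series plus a little metadata into one self-describing file.
with h5py.File("capture.h5", "w") as f:
    dset = f.create_dataset("sensor/timeseries",
                            data=np.random.rand(100_000, 16),
                            chunks=(4096, 16), compression="gzip")
    dset.attrs["sample_rate_hz"] = 51200.0

# Read back a strided hyperslab: every 10th row of the first 4 columns.
with h5py.File("capture.h5", "r") as f:
    slab = f["sensor/timeseries"][::10, :4]
    rate = f["sensor/timeseries"].attrs["sample_rate_hz"]
print(slab.shape, rate)
```

The same file opens unchanged in R, Matlab, or HDFView, which is the point being made above.
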
batbomb | over 9 years ago

I've done more than my fair share of fucking with FITS and ROOT files, HDF5, SQLite, proprietary struct-based things, etc...

It's easy to get a file format working on one system. It's Herculean getting it working on all systems bug-free. It's nearly impossible to get something portable and performant across many systems.

As for simplicity, people start wanting metadata and headers and this and that, and before you know it you need HDF5 or ROOT again and it's no longer simple. Maybe if you're lucky you can stick with something that looks like FITS. If it's tabular, SQLite still can't be beat. Maybe Parquet would work fine too.

I'd vehemently oppose anyone in the projects I work on trying to standardize on a new in-house format. I'd maybe be okay if they were just building on top of MessagePack or Cap'n Proto/Thrift etc., but nearly every disadvantage the OP references about HDF5 will undoubtedly be in anything they cook up themselves. For example, a "simpler format" that works well on distributed architectures, well... now you're going back to the single-implementation problem.

superbatfish | over 9 years ago

I liked the post (well, as an HDF5 user, I found it depressing...).

My main qualm with it was the claim of 100x worse performance than just using numpy.memmap(). To the author's credit, he posted his benchmarking code so we could try it ourselves. (Much appreciated.) But as it turned out, there were problems with his benchmark. A fair comparison shows a mixed picture: HDF5 is faster in some cases, and numpy.memmap is faster in others. (You can read my back-and-forth about the benchmarking code in the blog's comments.)

One minor complaint about presentation: once the benchmarking claims were shown to be bogus, the author should have removed that section from the post, or added an inline "EDIT:" comment. Instead, he merely revised the text to remove any specific numbers, and he didn't add any inline text indicating that the post had been edited.

I think the rest of the post (without the performance complaints) is strong enough to stand on its own. After all, performance isn't everything. In fact, I'd say it's a minor consideration compared to the other points.

When it comes to performance, I think the main issue is this: when you have to "roll your own" solution, you become intimately aware of the performance trade-offs you're making. HDF5 is so configurable and yet so opaque that it's tough to understand *why* you're not seeing the performance you expect.

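For readers who want to try this kind of comparison themselves, a rough sketch of a column-read benchmark (not the blog author's actual code; the shapes and file names are arbitrary, and the HDF5 dataset is written contiguous and uncompressed to keep the two paths comparable):

```python
import time
import numpy as np
import h5py

shape = (100_000, 1_000)                  # ~0.8 GB of float64
data = np.random.rand(*shape)

with h5py.File("bench.h5", "w") as f:     # contiguous HDF5 copy
    f.create_dataset("a", data=data)
data.tofile("bench.raw")                  # raw copy for numpy.memmap

def time_column_reads(get_col, n=100):
    """Time n single-column reads through the supplied accessor."""
    t0 = time.perf_counter()
    for j in range(n):
        get_col(j).sum()
    return time.perf_counter() - t0

with h5py.File("bench.h5", "r") as f:
    dset = f["a"]
    print("hdf5  :", time_column_reads(lambda j: dset[:, j]))

mm = np.memmap("bench.raw", dtype=np.float64, mode="r", shape=shape)
print("memmap:", time_column_reads(lambda j: mm[:, j]))
```

Which side wins depends heavily on the access pattern (row vs. column reads, chunking, cache settings), consistent with the mixed picture described above.
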
ycmbntrthrwaway | over 9 years ago

One reason to use binary formats like HDF5 is to avoid precision loss when storing floating-point values. I started using HDF5 once for exactly this reason, and it was overkill. HDFView requires Java to be installed, and the HDF library, with its single implementation and complex API, is a problem too.

For simple uses I now use the '%a' printf specifier. It is specifically designed to avoid losing a single bit of information. And you can easily read floats stored this way in numpy by using genfromtxt with the optional converters= argument and the builtin float.fromhex function.

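A small sketch of that round trip in Python (float.hex() produces the same hex-float notation as C's printf("%a"); the file name and single-column layout are illustrative):

```python
import numpy as np

values = [0.1, 2.0 / 3.0, 1e-300]

# Write hex floats, one per line (the C equivalent is printf("%a\n", x)).
with open("data.txt", "w") as f:
    for v in values:
        f.write(float.hex(v) + "\n")

# Read them back bit-exactly with genfromtxt and a fromhex converter.
arr = np.genfromtxt("data.txt", dtype=float, encoding="utf-8",
                    converters={0: float.fromhex})
assert list(arr) == values  # no precision lost in the text round trip
```
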
ycmbntrthrwaway | over 9 years ago

> You can't use standard Unix/Windows tools like awk, wc, grep, Windows Explorer, text editors, and so on, because the structure of HDF5 files is hidden in a binary blob that only the standard libhdf5 understands.

HDF provides command-line tools like h5dump and h5diff, so you can dump an HDF5 file to text and pipe it into standard and non-standard unix tools [1].

[1] https://www.hdfgroup.org/products/hdf5_tools/index.html#h5dist

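For example (a rough sketch; it assumes h5dump is on the PATH, and the file and dataset names are made up), the dump-to-text step can be driven from a script and the output handed to whatever text tooling you like:

```python
import subprocess

# Roughly `h5dump -d /sensor/timeseries capture.h5 | grep DATASPACE`.
dump = subprocess.run(
    ["h5dump", "-d", "/sensor/timeseries", "capture.h5"],
    capture_output=True, text=True, check=True)

matches = [line for line in dump.stdout.splitlines() if "DATASPACE" in line]
print("\n".join(matches))
```
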
x0x0 | over 9 years ago

The problem -- and I've been burned on both sides of this -- is that you need either a container file, or you need users to understand that a directory is essentially a file. Not only does this complicate users' lives when they want to move what they, quite reasonably, view as a single file between different computers or back it up, but they can and will remove individual pieces. Or copy it around, have some of the pieces not show up, and be very confused that copy/move operations -- particularly to/from a NAS -- are now nothing like atomic.

Another thing that will happen is this: if you just use a directory full of files as a single logical file, you will end up writing code that does the equivalent of 'rm -rf ${somedir}', because when users choose to overwrite "files" (really, a directory), you need to clear out any previous data so experiment runs don't get mixed. You can easily see where this can go bad; you will have to take extraordinary care.

zvrba | over 9 years ago

In my previous job we were evaluating HDF5 for implementing a data store a couple of years ago. We had some strict requirements about data corruption (e.g., if the program crashes amid a write operation in another thread, the old data must be left intact and readable), as well as multithreaded access. HDF5 supports parallelism from distinct programs, but its multithreaded story was (is?) very weak.

I ended up designing a transactional file format (a kind of log-structured file system with data versioning and garbage collection, all stored in a single file) from scratch to match our requirements. Works flawlessly with terabytes of data.

jbverschoor | over 9 years ago
Am I the only one misreading this as HDFS?
ajbonkoski | over 9 years ago

"we have a particular use-case where we have a large contiguous array with, say, 100,000 lines and 1000 columns"

This is where they lost me. This is NOT a lot of data. Should we be surprised that memory-mapping works well here?

Below about 100-200 GB you can do everything in memory. You simply don't need *fancy* file systems. These systems are for actual big data sets where you have several terabytes to several petabytes.

Don't try to use a chainsaw to cut a piece of paper and then complain that scissors work better. Of course they do...

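For scale, a back-of-the-envelope check (assuming float64 values, which the post does not specify):

```python
rows, cols, bytes_per_value = 100_000, 1_000, 8    # float64
print(rows * cols * bytes_per_value / 1e9)         # 0.8 -> roughly 0.8 GB, fits in RAM
```
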
iraikov | over 9 years ago
This is a very interesting article, thanks for sharing. I attempted several times to understand the HDF5 C API and create a custom format for storing connectivity data for neuroscience models, but each time I found the API exceedingly complex and bizarre. I am quite impressed that the author managed to write a substantial piece of software based around HDF and relieved to read the sections on the excessive complexity and fragility of the format.
skynetv2 | over 9 years ago

"High risks of data corruption" - HDF is not a simple flat file. It's a complex file format with a lot of in-memory structures. A crash may result in corruption, but there is no *high* risk of corruption. Moreover, if your app crashed, what good is the data? How can you make sense of the partial file? If you just need a flat file which can be parsed and data recovered, then you didn't need HDF in the first place, so that's the wrong motivation to pick HDF. On the other hand, one could write a new file for every checkpoint / iteration / step, which is what most people do. If the app crashed, you just use the last checkpoint.

"Bugs and crashes in the HDF5 library and in the wrappers" - sure, every piece of software has bugs. But in over 15 years of using HDF, I have not seen a bug that stopped me from doing what I want. And the HDF team is very responsive in fixing / suggesting workarounds.

"Poor performance in some situations" - yes & no. A well-built library with a well-designed application should approach POSIX performance. But HDF is not a simple file format, so expect some overhead.

"Limited support for parallel access" - Parallel HDF is one of the most, if not the most, popular libraries for parallel IO. Parallel HDF also uses MPI; if your app is not MPI, you can't use Parallel HDF. If "parallel access" refers to threading, HDF has a thread-safe feature that you need to enable when building the code. If "parallel access" refers to access from multiple processes, then HDF is not the right file format to use. You could do it for read-only purposes but not write. Again, not the right motivation to pick HDF.

"Impossibility to explore datasets with standard Unix/Windows tools" - again, HDF is not a flat file, so how can one expect standard tools to read it? It's like saying I would like to use standard tools to explore a custom binary file format I came up with. Wrong expectations.

"Hard dependence on a single implementation of the library" - afaik there is only one implementation of the spec. The author seems to have known this before deciding on HDF. Why is this an issue if it's already known?

"High complexity of the specification and the implementation" -

"Opacity of the development and slow reactivity of the development team" - slow reactivity to what? HDF source is available, so one can go fix / modify whatever they want.

It seems the author picked HDF with the wrong assumptions.

HDF serves a huge community that has specific requirements, among them preserving precision, portability, parallel access, being able to read/write datasets, querying the file for information about the data it contains, multi-dimensional datasets, fitting a large amount of data in a single file, etc.

https://www.hdfgroup.org/why_hdf/

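A minimal sketch of the one-file-per-checkpoint pattern mentioned above (h5py, with a made-up output directory, step number, and state names):

```python
import os
import numpy as np
import h5py

def write_checkpoint(step, state, outdir="run01"):
    """Write one self-contained HDF5 file per step: a crash can only damage
    the file currently being written, and earlier checkpoints stay readable."""
    os.makedirs(outdir, exist_ok=True)
    path = os.path.join(outdir, f"checkpoint_{step:06d}.h5")
    with h5py.File(path, "w") as f:
        f.attrs["step"] = step
        for name, arr in state.items():
            f.create_dataset(name, data=arr)

write_checkpoint(42, {"temperature": np.zeros((128, 128)),
                      "velocity": np.zeros((128, 128, 2))})
```
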
adolgert | over 9 years ago

I may not agree with Cyrille, but what about alternatives for storing binary data that might be structured and play well with newer tools like Spark? ASN.1 and Google Protocol Buffers both specify a binary file format and generate language-specific encoding and decoding. Is there a set of lightweight binary data tools we're missing?

hzhou321 | over 9 years ago

In the old days, researchers measured the charge of the electron with an oil drop and figured out gravity with pen and paper. I guess nowadays researchers have to spend a million dollars on an electron microscope to look at anything and have to depend on HDF5 to deal with any data.

mydpy | over 9 years ago

We used HDF5 and NetCDF at NASA and it was a constant struggle. I remember when someone dropped the specification on my desk and said, "Should be a good read. Enjoy!" Glad you found a more suitable alternative.

srean | over 9 years ago

Leaving a few links here. From the discussion that has taken place, it seems these two would be of interest.

Extreme IO scaling with HDF5: http://cscads.rice.edu/workshops/summer-2012/slides/datavis/HDF5-CScADS.pdf

HDF5 with a journal: http://algoholic.eu/sec2j-journalling-for-hdf5/