BTables: A fast, compact format for Machine Learning

37 points by thomson over 9 years ago

6 comments

fjordster over 9 years ago
HDF5 isn't perfect, but it does this kind of job pretty well. The C and C++ HDF5 APIs are definitely not fun to use, but there are wonderful and intuitive APIs available in some languages; I'm thinking of Python's h5py here.

Let me add that the OP's experience that HDF5 files were less space efficient than comparable CSV files suggests that something was grossly amiss in his use of HDF5.
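One common way to end up with HDF5 files larger than the equivalent CSV is to write uncompressed, contiguously stored datasets. A minimal sketch (assuming h5py is available; the file and dataset names are illustrative) of enabling chunking and gzip compression:

```python
# Illustrative h5py sketch, not code from the article.
# Chunked storage is required before any compression filter can apply.
import h5py

with h5py.File("demo.h5", "w") as f:
    f.create_dataset(
        "features",
        shape=(10000, 64),
        dtype="f4",
        chunks=(1000, 64),       # enable chunked storage
        compression="gzip",      # per-chunk gzip filter
        compression_opts=4,      # compression level 0-9
    )
```

Omitting the `chunks`/`compression` arguments silently falls back to contiguous, uncompressed storage.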
icsa over 9 years ago
The BTables discussion takes me back to my first college computing class (in FORTRAN). We were asked how we might store a sparse matrix in less memory. The solution was exactly the same as BTables'. We thought we'd done something novel, until the professor pointed out that it had already been implemented in the '60s.

Great ideas never fade. They do get reinvented :).
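The trick the comment describes can be sketched in a few lines (illustrative Python, not BTables code): store only the nonzero entries as (row, col, value) triples, the classic COO representation.

```python
# Sparse-matrix storage as (row, col, value) triples (COO format).
def to_coo(dense):
    """Convert a dense 2-D list into a list of (row, col, value) triples."""
    return [(i, j, v)
            for i, row in enumerate(dense)
            for j, v in enumerate(row)
            if v != 0]

def from_coo(triples, nrows, ncols):
    """Rebuild the dense matrix from its nonzero triples."""
    dense = [[0] * ncols for _ in range(nrows)]
    for i, j, v in triples:
        dense[i][j] = v
    return dense

matrix = [[0, 0, 3],
          [0, 5, 0],
          [0, 0, 0]]
coo = to_coo(matrix)   # only 2 triples instead of 9 cells
```

For a matrix that is mostly zeros, the triples take far less space than the full grid, at the cost of slower random access.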
rspeer over 9 years ago
I haven't tried the BTables format, but I agree with their criticism of HDF5. It seems to be an incredibly over-designed format with under-designed APIs.

(Why would I need a directory tree inside a file that only one process can write to anyway? Why wouldn't I just use the filesystem I already have?)
xaa over 9 years ago
It's too bad that this is for sparse data only. ML datasets have differing degrees of sparsity, and when the sparsity gets low enough, it's more efficient to use dense matrices, even when there are still missing values.

Also, if you have dense data, you can use mmap, which isn't very space efficient but is very fast. I guess it could also be made space efficient if you use a filesystem with transparent compression.
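The mmap approach for dense data can be sketched with the standard library alone (file name and layout here are made up for illustration): write a flat row-major buffer of doubles, then map it and read single elements without loading the whole file.

```python
# Memory-mapped dense matrix sketch using only the stdlib.
import mmap, os, struct, tempfile

NROWS, NCOLS = 4, 3
ITEM = struct.calcsize("d")        # 8 bytes per float64

# Write a dense row-major matrix of doubles to disk.
path = os.path.join(tempfile.mkdtemp(), "dense.bin")
with open(path, "wb") as f:
    for i in range(NROWS):
        for j in range(NCOLS):
            f.write(struct.pack("d", float(i * NCOLS + j)))

# Map it and read one element; the OS pages in only what is touched.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    def at(i, j):
        off = (i * NCOLS + j) * ITEM   # row-major offset arithmetic
        return struct.unpack_from("d", mm, off)[0]
    value = at(2, 1)                   # element written as 2*3 + 1 = 7.0
    mm.close()
```

In practice `numpy.memmap` wraps exactly this pattern with array indexing on top.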
blt over 9 years ago
I'm wondering why they chose row-major storage. I think it's far more common to care about only a subset of columns than a subset of rows.
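The layout trade-off behind this comment can be shown in a small sketch (illustrative Python, not BTables code): in a row-major flat buffer a column's elements are strided apart, while in column-major layout they form one contiguous run.

```python
# Row-major vs column-major access to a single column.
NROWS, NCOLS = 3, 4

# Same 3x4 matrix of values 0..11, flattened two ways.
flat_row_major = [i * NCOLS + j for i in range(NROWS) for j in range(NCOLS)]
flat_col_major = [i * NCOLS + j for j in range(NCOLS) for i in range(NROWS)]

def column_row_major(flat, j):
    # NROWS scattered reads, stride NCOLS apart.
    return flat[j::NCOLS]

def column_col_major(flat, j):
    # One contiguous slice of NROWS elements.
    return flat[j * NROWS:(j + 1) * NROWS]
```

Both return the same column, but on disk the contiguous slice turns into one sequential read while the strided version touches every row's storage.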
zobzu over 9 years ago
Interesting how it jumps from CSV to rewriting stuff without just using SQL and being done with it. Since CSV did the job almost well enough, it seems like SQL would be just fine and dandy, while being easier to manage and implement (minutes, literally).

Note: after reading a little more, I suspect SQL would be faster, in fact.
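The "just use SQL" idea is cheap to try with the stdlib `sqlite3` module; this sketch (schema and names hypothetical, not from the article) stores only the nonzero cells, much like a sparse format would.

```python
# Sparse cells in SQLite: one row per nonzero matrix entry.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cells (row_idx INTEGER, col_idx INTEGER, value REAL)")
con.executemany("INSERT INTO cells VALUES (?, ?, ?)",
                [(0, 2, 3.0), (1, 1, 5.0)])   # nonzero entries only

# Pulling one column back is a single indexed-friendly query.
rows = con.execute(
    "SELECT row_idx, value FROM cells WHERE col_idx = ? ORDER BY row_idx",
    (1,)).fetchall()
```

An index on `col_idx` would make the column query fast at scale; whether this beats a purpose-built binary format is exactly the question the comment raises.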