科技回声

6 comments

fjordster, over 9 years ago

HDF5 isn't perfect, but it does this kind of job pretty well. The C, C++, and HDF5 APIs are definitely not fun to use, but there are wonderful and intuitive APIs available in some languages; I'm thinking of Python's h5py here.

Let me add that the OP's experience that HDF5 files were less space-efficient than comparable CSV files suggests that something was grossly amiss in his use of HDF5.

icsa, over 9 years ago

The BTables discussion takes me back to memories of my first college computing class (in FORTRAN). We were asked how we might store a sparse matrix in less memory. The solution was exactly the same as BTables. We thought we'd done something novel when the professor pointed out that it had already been implemented in the 60s.

Great ideas never fade. They do get reinvented :).
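As a rough illustration of the classic trick described here, a sparse matrix can be stored by keeping only the nonzero entries of each row as (column, value) pairs. This is a minimal sketch in Python; the names `to_sparse_rows` and `get` are hypothetical, and it is not claimed to match the actual BTables layout.

```python
def to_sparse_rows(dense):
    """Keep only nonzero entries: one list of (column, value) pairs per row."""
    return [[(j, v) for j, v in enumerate(row) if v != 0] for row in dense]


def get(sparse_rows, i, j):
    """Look up element (i, j); anything not stored is an implicit zero."""
    for col, value in sparse_rows[i]:
        if col == j:
            return value
    return 0


dense = [
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [5, 0, 0, 7],
]
sparse = to_sparse_rows(dense)
stored = sum(len(row) for row in sparse)  # number of entries actually kept
```

Here the 12-cell matrix needs only 3 stored entries, at the cost of a short scan within a row on lookup.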

rspeer, over 9 years ago

I haven't tried the BTables format, but I agree with their criticism of HDF5. It seems to be an incredibly over-designed format with under-designed APIs.

(Why would I need a directory tree inside a file that only one process can write to anyway? Why wouldn't I just use the filesystem I already have?)

xaa, over 9 years ago

It's too bad that this is for sparse data only. ML datasets have differing degrees of sparsity, and when the sparsity gets low enough, it's more efficient to use dense matrices, even when there are still missing values.

Also, if you have dense data, you can use mmap, which isn't very space efficient but is very fast. I guess it could also be made to be space efficient if you use a filesystem with transparent compression.
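The mmap approach for dense data might look roughly like the sketch below, which uses only the Python standard library. The file layout (row-major, little-endian float64, no header) and the helper name `read_cell` are assumptions made for illustration, not anything specified in the thread.

```python
import mmap
import os
import struct
import tempfile

ROWS, COLS = 3, 4
ITEM = struct.calcsize("<d")  # 8 bytes per little-endian float64

# Write a small dense row-major matrix where element (i, j) == i * 10 + j.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    for i in range(ROWS):
        for j in range(COLS):
            f.write(struct.pack("<d", i * 10 + j))


def read_cell(mm, i, j):
    """Fetch element (i, j) directly from the mapped bytes."""
    offset = (i * COLS + j) * ITEM
    return struct.unpack_from("<d", mm, offset)[0]


with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        value = read_cell(mm, 2, 3)
        file_size = len(mm)

os.remove(path)
```

Every cell costs a fixed 8 bytes whether or not it is zero, which is the space trade-off mentioned above; in exchange, any element is reachable with one offset calculation and no parsing.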

blt, over 9 years ago

Wondering why they chose row-major storage. I think it's far more common to only care about a subset of columns than a subset of rows.
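To make the layout question concrete, here is a small sketch comparing which storage positions are touched when reading a single column under each ordering. It assumes a plain dense array for simplicity; the function names are hypothetical and this says nothing about how BTables itself lays out data.

```python
def row_major_offset(i, j, n_rows, n_cols):
    """Position of element (i, j) when rows are stored one after another."""
    return i * n_cols + j


def col_major_offset(i, j, n_rows, n_cols):
    """Position of element (i, j) when columns are stored one after another."""
    return j * n_rows + i


n_rows, n_cols = 4, 3
column = 1

# Offsets touched when reading every row of a single column.
row_major_reads = [row_major_offset(i, column, n_rows, n_cols) for i in range(n_rows)]
col_major_reads = [col_major_offset(i, column, n_rows, n_cols) for i in range(n_rows)]
```

With row-major storage the column's values are spread out with a stride of `n_cols`, while with column-major storage they sit next to each other, which is why column subsets tend to be cheaper to read in a column-oriented layout.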

zobzu, over 9 years ago

Interesting how it jumps from CSV to rewriting stuff without just using SQL and being done with it. Since CSV did the job almost well enough, it seems like SQL would be just fine and dandy, while being easier to manage and implement (minutes, literally).

Note: after reading a little more, I suspect SQL would be faster, in fact.
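For a sense of what the SQL route could look like, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `cells` and its schema are hypothetical choices for illustration, and the example makes no claim about how its speed compares with BTables or CSV.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cells ("
    " row_id INTEGER NOT NULL,"
    " col_id INTEGER NOT NULL,"
    " value REAL NOT NULL,"
    " PRIMARY KEY (row_id, col_id))"
)

# Only nonzero cells are stored; missing pairs are implicit zeros.
conn.executemany(
    "INSERT INTO cells (row_id, col_id, value) VALUES (?, ?, ?)",
    [(0, 2, 3.0), (2, 0, 5.0), (2, 3, 7.0)],
)

# Fetch one row of the matrix as (column, value) pairs.
row_2 = conn.execute(
    "SELECT col_id, value FROM cells WHERE row_id = ? ORDER BY col_id", (2,)
).fetchall()
conn.close()
```

The composite primary key gives row lookups an index for free; selecting by column instead would likely want a second index on `col_id`.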

BTables: A fast, compact format for Machine Learning
