科技回声

9 comments

agibsonccc, over 7 years ago

The core looks close enough to dataframes that I'd be curious to know how you compare to tablesaw: https://github.com/jtablesaw/tablesaw

This looks neat but I'm not sure why I would care about this. There are already a ton of solutions in the ecosystem with a columnar-like interface.

Granted, we wrote our own as well [1], which uses the builder pattern to define transforms that you then toss to an executor (our main backend is Spark for this). One reason we wrote this is for persistence purposes. Being able to encode and persist a series of transforms that you can then load remotely has been very helpful for us in machine learning.

We've since migrated this project to the Eclipse Foundation and intend on doing a rewrite of the interface, as well as integrating our baked-in tensor library [2] into certain parts of the pipeline for speed purposes and for handling things like computer vision workloads.

In general, I always like seeing new takes on the columnar format processing approach, but I'm just not seeing anything novel here. Clarification of intent would be great!

[1]: https://github.com/deeplearning4j/DataVec
[2]: https://github.com/deeplearning4j/nd4j
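
For readers unfamiliar with the pattern described above, here is a minimal sketch of a builder that records transforms as plain data, persists them as JSON, and hands them to an executor. This is not DataVec's API: `PipelineBuilder`, `OPS`, `execute`, and the two operations are hypothetical names chosen for illustration, and the executor is a simple in-process loop over a dict of columns rather than a Spark backend.

```python
import json

# Hypothetical registry of named operations. Each one takes a table
# (a dict mapping column name -> list of values) and returns a new table.
OPS = {
    "scale": lambda table, column, factor: {
        **table,
        column: [v * factor for v in table[column]],
    },
    "drop_column": lambda table, column: {
        k: v for k, v in table.items() if k != column
    },
}


class PipelineBuilder:
    """Records transforms as data so the pipeline can be persisted."""

    def __init__(self):
        self._steps = []

    def scale(self, column, factor):
        self._steps.append(
            {"op": "scale", "args": {"column": column, "factor": factor}}
        )
        return self  # returning self allows chained calls

    def drop_column(self, column):
        self._steps.append({"op": "drop_column", "args": {"column": column}})
        return self

    def to_json(self):
        # The pipeline is just a list of steps, so it serializes directly.
        return json.dumps(self._steps)


def execute(pipeline_json, table):
    """Load a persisted pipeline and apply its steps in order."""
    for step in json.loads(pipeline_json):
        table = OPS[step["op"]](table, **step["args"])
    return table


spec = PipelineBuilder().scale("x", 2).drop_column("id").to_json()
result = execute(spec, {"id": [1, 2], "x": [3, 4]})
```

Because `spec` is an ordinary string, it could be written to a file or sent to another machine and executed there, which is the persistence benefit the comment mentions.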

buremba, over 7 years ago

Is it in-memory? Does it support replication or sharding? What's the main use-case? How does it differ from ORC, Parquet or Arrow? The repository doesn't have any information.

dgudkov, over 7 years ago

Interesting idea. Columnar ETL can be quite efficient in some scenarios because frequently an ETL transformation (e.g. calculating a new column) effectively modifies an existing table, rather than creating a new one. This allows calculating only the delta, instead of rebuilding a new table from scratch. This helps optimize performance and do calculations in-memory without slow disk I/O.

Another advantage is that it allows performing many transformations (e.g. filtering) directly on dictionary-compressed data, without decompressing it. This works well in Vertica [1] (based on C-Store DB [2]), which was our inspiration for building a light-weight ETL for business users that also uses a columnar in-memory data transformation engine [3].

[1] https://www.vertica.com/
[2] http://db.csail.mit.edu/projects/cstore/
[3] http://easymorph.com/in-memory-engine.html
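
The two points above can be illustrated with a small sketch. It assumes a toy in-memory layout (a table is a dict of column lists, and `DictColumn` is a hypothetical dictionary-encoded column); it is not how Vertica, C-Store, or EasyMorph implement this. Filtering evaluates the predicate once per distinct value and then scans only the integer codes, and adding a derived column leaves the existing columns untouched.

```python
from dataclasses import dataclass


@dataclass
class DictColumn:
    """Dictionary-encoded column: distinct values stored once, rows as codes."""

    dictionary: list  # distinct values, in order of first appearance
    codes: list       # one integer code per row, indexing into dictionary

    @classmethod
    def encode(cls, values):
        index = {}
        codes = []
        for v in values:
            if v not in index:
                index[v] = len(index)
            codes.append(index[v])
        return cls(list(index), codes)

    def filter_rows(self, predicate):
        # Evaluate the predicate once per distinct value, not once per row.
        matching = {
            code for code, value in enumerate(self.dictionary) if predicate(value)
        }
        # Scan the codes only; row values are never decoded.
        return [row for row, code in enumerate(self.codes) if code in matching]


def add_derived_column(table, name, sources, fn):
    """Add a computed column in place; existing columns are not rebuilt."""
    table[name] = [fn(*args) for args in zip(*(table[s] for s in sources))]
    return table


country = DictColumn.encode(["DE", "US", "DE", "FR", "US"])
non_us_rows = country.filter_rows(lambda v: v != "US")

sales = {"price": [10, 20], "qty": [2, 3]}
price_before = sales["price"]
add_derived_column(sales, "total", ("price", "qty"), lambda p, q: p * q)
```

With three distinct values and five rows the saving is trivial, but for a low-cardinality column with millions of rows the predicate runs only a handful of times.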

krat0sprakhar, over 7 years ago

Sorry for being that guy, but I just clicked into a random file in src to read the code, and found the code style (indentation etc.) to be quite weird: https://github.com/asavinov/bistro/blob/master/core/src/main/java/org/conceptoriented/bistro/core/ColumnData.java

Might I suggest using https://github.com/google/google-java-format for formatting?

jitl, over 7 years ago

An example would be great. Can you show how to do a given task with SQL, map/reduce, and your framework?

Because right now I have no idea why I'd choose to learn this new stuff over using google-able tools I already know.

Make your value proposition really clear.

jnordwick, over 7 years ago

Might be a cool idea, but not nearly fleshed out enough. I think a larger example instead of just individual lines of code would be useful. Show a toy widget sales spreadsheet.

What is the use case? Does it support time series? How would you do a moving average or pivot table?

nickpeterson, over 7 years ago

I skimmed the readme but didn't see the answer to what I regard as a basic question. How is this different from a view? I can easily make derived columns based on functions and reference those in other views (performance issues aside).

julienfr112, over 7 years ago

How does that compare to SAS software (https://en.wikipedia.org/wiki/SAS_(software))? Particularly the "DATA" steps.

KasianFranks, over 7 years ago

This is neat. Vectorspace based AI calculations will benefit from this approach. Great work!

Show HN: Bistro – A light-weight column-oriented data processing engine