TechEcho

7 comments

dmpetrov 7 months ago

Yay! Excited to see DataChain on the front page :)

Maintainer and author here. Happy to answer any questions.

We built DataChain because our DVC couldn't fully handle data transformations and versioning directly in S3/GCS/Azure without copying data.

The "DBT for unstructured data" analogy applies very well to DataChain, since it transforms data (using Python, not SQL) inside storage (S3, not a DB). Happy to talk more!

jerednel 7 months ago

Cool! Does this assume the unstructured data already has a corresponding metadata file?

My most common use cases involve getting PDFs or HTML files, and I have to parse the metadata to store along with the embedding.

Would I have to run a process to extract file metadata into JSONs for every embedding/chunk? Would the keys, derived from the document, be title+chunk_no?

Very interested in this, because documents from clients are subject to random changes and I don't have very robust systems in place.

mpeg 7 months ago

It took me a minute to grok what this was for, but I think I like it.

It doesn't really replace any of the tooling we use to wrangle data at scale (like Prefect, Dagster, or Temporal), but as a local library it seems to be excellent. I think what confused me most was the comparison to dbt.

I like the from_* utils, the magic of the Column class operator overloading, and how chains can be used as datasets. Love how easy checkpointing is too. Will give it a go.

whalesalad 7 months ago

> It is made to organize your unstructured data into datasets and wrangle it at scale on your local machine.

How does one wrangle terabytes of data on a local machine?

SOLAR_FIELDS 6 months ago

> Datachain does not abstract or hide the AI models and API calls, but helps to integrate them into the postmodern data stack.

I'm not sure whether the term "postmodern data stack" was invented for the purposes of this copy. Probably not. But terms like this don't really engender a lot of faith that this isn't yet another piece of the now decades-long hype cycle that data engineering products face.

datascientist 7 months ago

How does this relate to https://github.com/lancedb/lance?

tatigabru 6 months ago

Wow! Great instrument, so excited to see it here

DataChain: DBT for Unstructured Data
