
Show HN: I built an open-source data copy tool called ingestr

156 points by karakanb about 1 year ago
Hi there, Burak here. I built an open-source data copy tool called ingestr (https://github.com/bruin-data/ingestr).

I've built quite a few data warehouses, both for the companies I worked at and for consultancy projects. One of the more common pain points I observed was that everyone had to rebuild the same data-ingestion bit over and over again, each in a different way:

- some wrote ingestion code from scratch, to various degrees

- some used off-the-shelf data ingestion tools like Fivetran / Airbyte

I have always disliked both of these approaches, for different reasons, but never got around to working on what I imagined the better way forward to be.

The solutions that required writing code for copying the data came with quite a bit of overhead: how to generalize them, what language/library to use, where to deploy, how to monitor, how to schedule, etc. I ended up figuring out solutions for each of these, but the process always felt suboptimal. I like coding, but for more novel things than copying a table from Postgres to BigQuery. There are libraries like dlt (awesome lib btw, and awesome folks!), but that still required me to write, deploy, and maintain the code.

Then there are solutions like Fivetran or Airbyte, where there's a UI and everything is managed through it. While it was nice not to have to write code for copying the data, I still had to either pay some unknown/hard-to-predict amount of money to these vendors or host Airbyte myself, which is roughly back to square one (for me, since I want to maintain the least amount of tech myself). Nothing was versioned, people were changing things in the UI and breaking the connectors, and what worked yesterday didn't work today.

I had a bit of spare time a couple of weeks ago and wanted to take a stab at the problem. I had been thinking about standardizing the process for quite some time already, and dlt had some abstractions that allowed me to quickly prototype a CLI that copies data from one place to another. I made a few decisions (that I hope I won't regret in the future):

- everything is a URI: every source and every destination is represented as a URI

- only one thing is copied at a time: a single table within a single command, not a full database with an unknown number of tables

- incremental loading is a must, but doesn't have to be super flexible: I decided to support full-refresh, append-only, merge, and delete+insert incremental strategies, because I believe this covers 90% of the use cases out there

- it is CLI-only, and can be configured with flags & environment variables so that it can be automated quickly, e.g. dropped into GitHub Actions and run daily

The result ended up being `ingestr` (https://github.com/bruin-data/ingestr).

I am pretty happy with how the first version turned out, and I plan to add support for more sources & destinations. ingestr is built to be flexible across various source and destination combinations, and I plan to introduce more non-DB sources such as Notion, GSheets, and custom APIs that return JSON (which I am not sure how exactly I'll do, but I'm open to suggestions!).

To be perfectly clear: I don't think ingestr covers 100% of data ingestion/copying needs out there, and it doesn't aim to. My goal is to cover most scenarios with a decent set of trade-offs, so that common scenarios can be solved easily without having to write code or manage infra. There will be more complex needs that require engineering effort by others, and that's fine.

I'd love to hear your feedback on how ingestr can serve data-copying needs better. Looking forward to hearing your thoughts!

Best, Burak
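The "everything is a URI" decision can be sketched roughly like this: the scheme of each URI selects the backing driver for a source or destination. This is an illustrative sketch only, not ingestr's actual internals; `pick_driver` and the `SCHEMES` table are hypothetical names.

```python
from urllib.parse import urlparse

# Hypothetical scheme-to-driver table; real tools would map to driver objects.
SCHEMES = {
    "postgresql": "PostgresDriver",
    "bigquery": "BigQueryDriver",
    "sqlite": "SQLiteDriver",
}

def pick_driver(uri: str) -> str:
    """Resolve a source/destination URI to a driver name via its scheme."""
    # Handle dialect suffixes such as "postgresql+psycopg2".
    scheme = urlparse(uri).scheme.split("+")[0]
    try:
        return SCHEMES[scheme]
    except KeyError:
        raise ValueError(f"unsupported scheme: {scheme!r}")

print(pick_driver("postgresql://user:pass@localhost:5432/prod"))  # PostgresDriver
```

The appeal of this design is that one flat string fully identifies a connection, so the whole copy can be expressed as two flags on a single command.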

18 comments

simonw about 1 year ago
I was surprised to see SQLite listed as a source but not as a destination. Any big reasons for that, or is it just something you haven't got around to implementing yet?

I've been getting a huge amount of useful work done over the past few years sucking data from other systems into SQLite files on my own computer; I even have my own small db-to-sqlite tool for this (built on top of SQLAlchemy): https://github.com/simonw/db-to-sqlite
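The pattern described here (pulling rows from another system into a local SQLite file) needs nothing beyond the standard library, which is part of why SQLite-as-destination is attractive. A minimal sketch; `copy_rows` is a hypothetical helper, not part of db-to-sqlite or ingestr:

```python
import os
import sqlite3
import tempfile

def copy_rows(rows, columns, dest_path, table):
    """Append rows (an iterable of tuples) into a local SQLite file,
    creating the destination table on first use."""
    conn = sqlite3.connect(dest_path)
    cols = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({cols})")
    conn.executemany(f"INSERT INTO {table} ({cols}) VALUES ({placeholders})", rows)
    conn.commit()
    conn.close()

# Rows are hardcoded here; in practice they would be fetched from
# another system (Postgres, an API, etc.).
path = os.path.join(tempfile.mkdtemp(), "local.db")
copy_rows([(1, "alice"), (2, "bob")], ["id", "name"], path, "users")
```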
yevpats about 1 year ago
Firstly, congrats :) (Generalized) ingestion is a very hard problem, because any abstraction you come up with will always have some limitations where you might need to fall back to writing code with full access to the third-party APIs. But in some cases generalized ingestion is definitely much better than re-writing the same ingestion piece, especially for complex connectors. Take a look at CloudQuery (https://github.com/cloudquery/cloudquery), an open-source, high-performance ELT framework powered by Apache Arrow (so you can write plugins in any language). (Maintainer here.)
sascjm about 1 year ago
Hi Burak. I have been testing ingestr with a source and destination Postgres database. What I'm trying to do is copy data from my Prod database to my Test database. I find that when using replace, I get additional dlt columns added to the tables as hints. It also does not work with a defined primary key, only natural keys. Composite keys do not work. Can you tell me the basic, minimal setup that it supports? I would love to use it to keep our Prod and Test databases in sync, but it appears the functionality I need is not there. Thanks very much.
matijash about 1 year ago
This looks pretty cool! What was the hardest part about building this?
kipukun about 1 year ago
Do you think you&#x27;ll add local file support in the future? Also, do you have any plans on making the reading of a source parallel? For example, connectorx uses an optional partition column to read chunks of a table concurrently. Cool how it&#x27;s abstracted.
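The partition-column idea mentioned here boils down to splitting a table's key range into chunks so that each chunk can be fetched by a separate query running concurrently. A rough, library-agnostic sketch of the range-splitting step; `generate_partitions` is a hypothetical helper, not connectorx's or ingestr's API:

```python
def generate_partitions(min_key: int, max_key: int, n: int):
    """Yield (low, high) inclusive ranges that cover [min_key, max_key]
    in n near-equal chunks; earlier chunks absorb any remainder."""
    span = max_key - min_key + 1
    size, rem = divmod(span, n)
    low = min_key
    for i in range(n):
        high = low + size - 1 + (1 if i < rem else 0)
        yield (low, high)
        low = high + 1

# Each range becomes one query, e.g. WHERE id BETWEEN low AND high,
# issued on its own connection/thread.
print(list(generate_partitions(1, 10, 3)))  # [(1, 4), (5, 7), (8, 10)]
```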
e12e about 1 year ago
Looks interesting. ClickHouse seems to be conspicuously missing as a source and destination, although I suppose ClickHouse can masquerade as Postgres: https://clickhouse.com/docs/en/interfaces/postgresql

Ed: there's an issue already: https://github.com/bruin-data/ingestr/issues/1
hermitcrab about 1 year ago
I am very interested in data ingestion. I develop a desktop data wrangling tool in C++ ( Easy Data Transform ). So far it can import files in various formats (CSV, Excel, JSON, XML etc). But I am interested in being able to import from databases, APIs and other sources. Would I be able to ship your CLI as part of my product on Windows and Mac? Or can someone suggest some other approach to importing from lots of data sources without coding them all individually?
jrhizor about 1 year ago
I like the idea of encoding complex connector configs into URIs!
parkcedar about 1 year ago
This looks awesome. I had this exact problem just last week and had to write my own tool in Go to perform the migration. After creating the tool I thought this must be something others would use; glad to see someone beat me to it!

I think it's clever to keep the tool simple and only copy one table at a time. My solution was to generate code based on an SQL schema, but it was going to be messy and require more user introspection before the tool could be run.
chinupbuttercup about 1 year ago
This looks pretty cool. Is there any schema management included or do schema changes need to be in place on both sides first?
andenacitelli about 1 year ago
Any thoughts on how this compares to Meltano and their Singer SDK? We use it at $DAYJOB because it gives us a great hybrid: enough standardization that we don't have to treat each source differently downstream, while still letting us customize.
ab_testing about 1 year ago
If you can add CSV as a source and destination, it will increase the usefulness of this product manifold.

There are many instances where people either have a CSV that they want to load into a database, or want a specific database table exported to CSV.
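The CSV-to-database direction requested here fits the standard library well. A minimal sketch of loading CSV text into a table; `load_csv` is a hypothetical helper, not an ingestr feature, and all values are stored as text for simplicity:

```python
import csv
import io
import sqlite3

def load_csv(csv_text, conn, table):
    """Load CSV text (header row first) into a table; returns the row count."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    cols = ", ".join(header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({cols})")
    rows = [tuple(row) for row in reader]
    conn.executemany(f"INSERT INTO {table} ({cols}) VALUES ({placeholders})", rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
n = load_csv("id,name\n1,alice\n2,bob\n", conn, "people")
print(n)  # 2
```

A production version would also need to handle quoting edge cases, type inference, and very large files, which is presumably where a dedicated connector earns its keep.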
infotropy about 1 year ago
Looks really interesting and definitely a use case I face over and over again. The name just breaks my brain, I want it to be an R package but it’s Python. Just gives me a mild headache.
PeterZaitsev about 1 year ago
Looks great Burak! Appreciate your contribution to Open Source Data ecosystem!
ijidak about 1 year ago
Is there a reason CSV (as a source) isn't supported? I've been looking for exactly this type of tool, but one that supports CSV.

CSV support would be huge.

Please please please provide CSV support. :)
skanga about 1 year ago
Hi Burak, I saw cx_Oracle in requirements.txt, but the support matrix doesn't mention it. Does this mean Oracle support is coming? Or is it a typo?
Phlogi about 1 year ago
I'd love to see support for ODBC; any plans?
yanko about 1 year ago
DB2 is missing, as if it weren't a database that exists in the real world.