
Fast Way to Load Data into PostgreSQL Using Python

246 points by gilad almost 6 years ago

14 comments

gigatexal almost 6 years ago
As a former DBA and closet Python super fan, I found this post to follow all the things I would have done. Kudos to the OP.

Loading into a temp non-logged table is the right way to go, and using the built-in bulk loading options is smart, too. The type annotations looked really good, and kudos for using modern Python here (although for the examples I'm not sure they were needed; then again, they could be useful in the larger context of which this article is just a microcosm, so I could be wrong). Very good article. I've bookmarked the blog to consume other such articles.
louis8799 almost 6 years ago
I haven't done any benchmarking, but for around 200,000 rows of data, which I handle daily at work, the following line of Python takes around 10 seconds:

pandas.read_json(json_obj).to_sql()

Any transformation of the data before writing into the DB takes a split second with vectorized pandas operations.

Pandas is still my first choice when it comes to tasks like this.
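A minimal sketch of this pandas path, assuming a SQLAlchemy engine; the connection URL, table name, and the `amount` column transform are illustrative, not from the comment:

    # Minimal sketch of the pandas route described above; names are illustrative.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:password@localhost/mydb")
    df = pd.read_json(json_obj)                       # json_obj: the raw JSON payload
    df["amount"] = df["amount"].astype(float)         # vectorized transforms are cheap here
    df.to_sql("staging_events", engine, if_exists="append", index=False,
              method="multi", chunksize=1000)         # multi-row INSERTs per round trip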
latch almost 6 years ago
Not sure what the python code translates to, so maybe this is covered, but in my experience, if you can't use COPY, you should:

1 - Generate a static SQL statement to insert N rows (say 10 to 50): insert into blah (id, name) values ($1, $2), ($3, $4), ...

2 - Loop through the data in batches of that size

3 - Generate a dynamic SQL statement for the M rows that are left over

Wrap it in a transaction. You can have an SQL builder where you give it the list of columns and how many rows you want to generate, and use that for both steps 1 and 3. You can prepare statement 1 (and keep reusing it), but I'm not sure how much that'll buy you.

If you need to deal with conflicts, inserting the above into a temp table (on commit drop) and then doing an insert + update from the temp table into the real table works well.

Then there's a handful of PostgreSQL configs to tweak, but if you're willing to risk a bit of data loss (but not corruption), synchronous_commit=off.
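A rough sketch of this batching approach, assuming psycopg2 (which uses %s placeholders rather than $1/$2); the table, columns, and `rows` data are illustrative:

    # Sketch of batched multi-row INSERTs with a reusable SQL builder; names illustrative.
    import psycopg2

    def values_sql(table, columns, nrows):
        # Build "INSERT INTO t (a, b) VALUES (%s, %s), (%s, %s), ..." for nrows rows.
        placeholder_row = "(" + ", ".join(["%s"] * len(columns)) + ")"
        return (f"INSERT INTO {table} ({', '.join(columns)}) VALUES "
                + ", ".join([placeholder_row] * nrows))

    BATCH = 50
    rows = [(1, "foo"), (2, "bar")]                  # illustrative (id, name) tuples
    conn = psycopg2.connect("dbname=mydb")
    with conn, conn.cursor() as cur:                 # one transaction around all the batches
        static_sql = values_sql("blah", ["id", "name"], BATCH)
        for i in range(0, len(rows), BATCH):
            chunk = rows[i:i + BATCH]
            sql = static_sql if len(chunk) == BATCH else values_sql("blah", ["id", "name"], len(chunk))
            cur.execute(sql, [v for row in chunk for v in row])   # flatten the tuples
    conn.close()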
mhd almost 6 years ago
Seems like a pretty good overview, but I'm not surprised at all that in the end it's COPY TO/FROM that won out by a good margin.

Listing all the columns in order seems a bit superfluous and error prone, though. I'd recommend creating the list of columns to be inserted (or copied; copy_from has a 'columns' arg) and slicing the input hash with the same list (dict comprehension in Py).
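A small sketch of that suggestion, assuming psycopg2; the table and column names are illustrative, and it assumes values contain no tabs or newlines (otherwise copy_expert with CSV format is safer):

    # One shared column list drives both the dict slicing and the COPY; names illustrative.
    import io
    import psycopg2

    COLUMNS = ["id", "name", "abv", "brewery_id"]

    def to_copy_buffer(records):
        buf = io.StringIO()
        for rec in records:                               # rec: dict from the JSON source
            fields = [r"\N" if rec.get(c) is None else str(rec.get(c)) for c in COLUMNS]
            buf.write("\t".join(fields) + "\n")           # copy_from's default text format
        buf.seek(0)
        return buf

    conn = psycopg2.connect("dbname=mydb")
    with conn, conn.cursor() as cur:
        cur.copy_from(to_copy_buffer(records), "staging_beers", columns=COLUMNS)
    conn.close()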
kristiandupont almost 6 years ago
I am looking for ways to quickly load data into a PG database for local testing. An initial seed run is OK, but I want test suites to run on identical data sets, and since PG doesn't (yet) support nested transactions, I currently use my seed database as a TEMPLATE for suite-specific databases.

It works reasonably well, but creating the database from a template takes several seconds, even though my test data is fairly simple. I imagine it should be faster than anything I can do from my programming language (Node.js in my case), or is creating with templates slow for some reason that I don't know of?
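For reference, the TEMPLATE approach being described looks roughly like this, sketched from Python with psycopg2 (the commenter uses Node.js; database names are illustrative):

    # CREATE DATABASE cannot run inside a transaction block, hence autocommit.
    import psycopg2

    conn = psycopg2.connect("dbname=postgres")
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute("CREATE DATABASE test_suite_42 TEMPLATE seed_db")
    conn.close()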
drej almost 6 years ago
I concur - we load a fair amount of data into Postgres using Python and we've learned these lessons the hard way. We ended up feeding generators into execute_batch.

We don't use execute_values, because we want to generate dictionaries (since there are usually multiple generators along the way), and we don't use copy_from, because the cognitive overhead of data prep is just way too high for the benefit (for us! your mileage may vary).

Great stuff, I've already circulated it at work.
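A sketch of feeding a generator of dicts into psycopg2's execute_batch, along the lines described above; the table and column names are illustrative:

    # execute_batch groups many parameter sets into fewer round trips; names illustrative.
    import psycopg2
    import psycopg2.extras

    def row_gen(source):
        for rec in source:                        # source can itself be a generator
            yield {"id": rec["id"], "name": rec["name"], "price": rec["price"]}

    conn = psycopg2.connect("dbname=mydb")
    with conn, conn.cursor() as cur:
        psycopg2.extras.execute_batch(
            cur,
            "INSERT INTO products (id, name, price) VALUES (%(id)s, %(name)s, %(price)s)",
            row_gen(source),
            page_size=1000,                       # parameter sets grouped per round trip
        )
    conn.close()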
tda almost 6 years ago
Very nice article! Explores all the tricks I know of, and a few more. Never heard of the execute_values before, only a bit (3x in the example) slower than using copy but much simpler code.
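For reference, a minimal sketch of execute_values, which folds many rows into a single multi-row VALUES statement (assuming psycopg2; table and column names are illustrative):

    # execute_values expands the single %s into "(...), (...), ..."; names illustrative.
    import psycopg2
    import psycopg2.extras

    records = [{"id": 1, "name": "foo", "price": 4.2}]   # illustrative input
    conn = psycopg2.connect("dbname=mydb")
    with conn, conn.cursor() as cur:
        psycopg2.extras.execute_values(
            cur,
            "INSERT INTO products (id, name, price) VALUES %s",   # single %s for all rows
            [(r["id"], r["name"], r["price"]) for r in records],
            page_size=1000,
        )
    conn.close()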
da_chicken almost 6 years ago
Why do you drop and recreate an unlogged table? Why not create a temporary table, which is also not WAL-logged and automatically gets cleaned up? It looks like you're not closing the connection. The only advantage is that the staging table remains in place until the next execution, but presumably so does the data file.

Alternately, if you have an unlogged table already created, why not simply truncate the table if it exists instead of recreating it on every execution? Drop and truncate should take about the same amount of time, but creating a table is going to be slightly more expensive. If you're worried about making sure the db table schema matches the file format, I don't think I would be. File formats change, you're already going to be manually working on it. You can drop the table so the updated script recreates it.

I suppose it doesn't really matter, but I'm wondering if your choice was one of intent or one of convenience.
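A sketch of the temporary-table variant being suggested here, assuming psycopg2; the table names and the data_buffer file-like object are illustrative:

    # Temporary tables are not WAL-logged, and ON COMMIT DROP cleans them up automatically.
    import psycopg2

    conn = psycopg2.connect("dbname=mydb")
    with conn, conn.cursor() as cur:
        cur.execute(
            "CREATE TEMPORARY TABLE staging_beers (LIKE beers INCLUDING DEFAULTS) "
            "ON COMMIT DROP"
        )
        cur.copy_from(data_buffer, "staging_beers")   # data_buffer: file-like object of rows
        cur.execute("INSERT INTO beers SELECT * FROM staging_beers ON CONFLICT DO NOTHING")
    conn.close()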
cridenour almost 6 years ago
I followed a similar path with TEMPORARY tables and using pgcopy [1] to insert data quickly - the copy operation was extremely quick and didn't require passing everything as strings.

[1] https://github.com/altaurog/pgcopy
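A rough sketch of pgcopy's CopyManager as I understand its API (binary COPY with native Python values); the table and column names are illustrative:

    # pgcopy streams rows via COPY's binary format; values stay native Python types.
    from datetime import datetime

    import psycopg2
    from pgcopy import CopyManager

    conn = psycopg2.connect("dbname=mydb")
    records = [
        (1, "Pale Ale", datetime(2019, 7, 1)),    # native types, not pre-rendered strings
        (2, "Stout", datetime(2019, 7, 2)),
    ]
    mgr = CopyManager(conn, "staging_beers", ("id", "name", "brewed_at"))
    mgr.copy(records)
    conn.commit()
    conn.close()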
derefr almost 6 years ago
You can go faster than this. (Mind you, the following suggestions will only matter if the DB is still your real bottleneck. Once you’re using COPY, you start to have to optimize the rest of your ETL pipeline as well.)

I’ve done all of the following (except the last bit about pg_bulkload) in my own ETL stack. They work!

——

1. COPY supports a binary format in place of CSV. It’s proprietary (it’s not even exposed in libpq, only statically linked into the postmaster, pg_dump and pg_restore binaries!) but also pretty easy to implement. Here’s a python package implementing it: https://pypi.org/project/pgcopy/. As long as you’ve got the rows in memory at some point, just generate that instead of CSV. The rest of the architecture is the same.

2. Run concurrent SQL COPY commands on separate connections for each [staging] table you’re populating. This allows PG to populate the tables in parallel.

3. Create (or TRUNCATE) the staging tables in the same transaction that contains the COPY. This way, the COPY won’t need to operate on a differential MVCC state, but rather can operate on the “raw” table. (And, if this is the “base” data load for a given table, rather than an incremental one against an already-populated table, then you don’t need a staging table; the target table itself is now your staging table. Just ensure that any CREATE INDEX or CREATE PRIMARY KEY or ADD CONSTRAINT happens *after* you populate the table with data. No need for a TEMPORARY/unlogged intermediate; the performance would actually be worse if you added one.)

4. Pre-bake your .sql, embedding the `FORMAT binary` data, *then* stream it once you’re done. I can’t tell you how many times the problem with an ETL pipeline wasn’t PG’s ingestion speed, but rather the speed at which the ETL transformer process was writing to the socket. The rest of your pipeline can be asynchronous, but actually streaming into your PG instance takes up valuable shared resources on the instance. Assuming that the instance being ETLed to isn’t just a replication master, and actually receives read traffic, you’ll want writes to that instance to complete as quickly as possible—even at the expense of making the rest of your ETL pipeline more complicated. So pre-bake your .sql files, and then stream them. (One advantage, once you do this: you don’t need a runtime with an SQL client library any more. You can just spawn instances of `psql foo.sql`. These processes don’t even have to coincide with your transform stage any more! Push your `.sql` files to object storage, and then run `psql` on a dedicated loader VM—or from an object-storage-lifecycle-event-triggered Lambda function, if you like.)

Fun fact: when you `pg_dump --format=directory`, pg_restore(1) will do techniques #1 - #4 for you automatically.

Key insight: the pg_dump(1) and pg_restore(1) pair of binaries are optimized to heck and back. If pg_dump(1) or pg_restore(1) has some special way of doing something, it’s probably for performance; look into doing that thing in your pipeline. (Even if there’s no library for your runtime to allow you to do that thing. Write one!)

——

You can go faster than even pg_dump+pg_restore will “naively” do, while still being compatible with “standard Postgres”, if you’re willing to mangle your data architecture a bit. Specifically, you can take advantage of Postgres’s table partitioning support. (Yes, now you have to deal with partitions. But hey, you probably already had to at the data sizes we’re talking about.)

5. Turn your incremental loads into table foo (via fresh staging tables) into base loads of fresh partitions of table foo. Each load creates its own partition. Techniques #1 and #2 will accelerate and parallelize the loading of sibling partitions of a parent table, just as well as they’ll accelerate the loading of separate tables.

Once you do this, you don’t need *any* staging tables, because every load is a “base” load of a table. (Yes, you’ll have a lot of tables. You’ll need some agent sitting around to do async rollups using `CREATE TABLE AS ...` to consolidate these. If you have access to the TimescaleDB extension, you might be able to trick it into doing this for you automatically.)

6 (or 3b). Include your CREATE INDEX statements in the same transaction that CREATEs and COPYs into the partitions (after the data is populated, though). One of the worst things for the production performance of a data warehouse is contention between application reads and an index rebuild. If you always create new partitions, and only ever index them when you’re creating them, then your indices will be created against tables that aren’t visible to new clients yet, and so clients and the indexer will never get in one another’s way.

——

If, after all this, you’re looking for the absolute fastest way to load data into Postgres, *and* you have control over your PG instance (i.e. it’s not a hosted DBaaS), you could try https://pgxn.org/dist/pg_bulkload/. It’s a PG extension that serves as a faster replacement for the COPY command, by bypassing most of the buffers and locks related to PG’s individual request-handling. It shouldn’t cause any problems with MVCC, but only because it assumes/requires that you’re also doing all of the above, so you’re never trying to load into a table users can “see.” (I haven’t tried it personally, but if PG’s ingest speed becomes my bottleneck again, I just might.)
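A minimal sketch of points 2 and 3 above: one connection per staging table, with the TRUNCATE and the COPY in the same transaction (assuming psycopg2; the `buffers` dict of table name to CSV file-like object is illustrative):

    # Parallel COPYs on separate connections; TRUNCATE + COPY share one transaction each.
    import threading

    import psycopg2

    def load_table(table, csv_buffer):
        conn = psycopg2.connect("dbname=warehouse")
        with conn, conn.cursor() as cur:           # one transaction per connection
            cur.execute(f"TRUNCATE {table}")       # same transaction as the COPY below
            cur.copy_expert(f"COPY {table} FROM STDIN WITH (FORMAT csv)", csv_buffer)
        conn.close()

    # buffers: illustrative dict of {table_name: file-like object containing CSV rows}
    threads = [threading.Thread(target=load_table, args=(name, buf))
               for name, buf in buffers.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()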
Bishonen88 almost 6 years ago
Somewhat related: how does one insert a Pandas DataFrame into a Postgres table, in which most fields are merely integers being foreign keys of other tables? The DataFrame itself contains the real values (e.g. Cities, Countries etc.).
2rsf almost 6 years ago
Very nice and educational read! One note though: it's usually good practice in performance testing to repeat your measurements more than once; it will mitigate local problems, or at least point to problems in the measurement.
zarathustraa almost 6 years ago
Would also like to see this with asyncpg. psycopg2 is kind of obsolete today.
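For what it's worth, a minimal sketch of the same kind of bulk load with asyncpg, which exposes COPY through copy_records_to_table; the DSN, table, and columns are illustrative:

    # asyncpg's copy_records_to_table streams tuples via COPY; names illustrative.
    import asyncio
    import asyncpg

    async def load(records):
        conn = await asyncpg.connect("postgresql://user:password@localhost/mydb")
        await conn.copy_records_to_table(
            "staging_beers",
            records=records,                     # iterable of tuples, in column order
            columns=["id", "name", "abv"],
        )
        await conn.close()

    asyncio.run(load([(1, "Pale Ale", 5.6), (2, "Stout", 7.1)]))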
cgio almost 6 years ago
There’s also copy_expert in psycopg2. I believe CSV with copy_expert should be the fastest, but I have not tested it in as much detail as the OP. Great work on the article.
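A sketch of copy_expert, which takes a full COPY statement and so handles proper CSV quoting and NULLs (unlike copy_from's tab-separated text format); names are illustrative:

    # copy_expert lets you specify the COPY options yourself; names illustrative.
    import io

    import psycopg2

    csv_buf = io.StringIO("1,Pale Ale,5.6\n2,Stout,\n")   # trailing empty field -> NULL abv
    conn = psycopg2.connect("dbname=mydb")
    with conn, conn.cursor() as cur:
        cur.copy_expert(
            "COPY staging_beers (id, name, abv) FROM STDIN WITH (FORMAT csv, NULL '')",
            csv_buf,
        )
    conn.close()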