科技回声 (TechEcho)
A tech news platform built with Next.js, serving global tech news and discussion.

© 2025 TechEcho. All rights reserved.

How I slashed a SQL query's runtime with two Unix commands

249 points · by DSpinellis · almost 7 years ago

21 comments

manigandham · almost 7 years ago
The DISTINCT on the date format is the problem; you should probably use a separate query on each table to fill in the date output as a new column first, and then join afterwards.

Also, "wrong tool for the job" comes to mind. MariaDB/MySQL are just not great at joins and query planning, and at this scale, especially with all the work involved to export and use Unix tools, why not use any of the columnar data warehouses that can handle this much faster?

Export it to BigQuery as CSVs and the query will probably finish in 30 seconds.
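The split described above can be sketched in SQL against the article's schema (`commits(id, created_at)` and `project_commits(project_id, commit_id)`); the temporary-table and index names here are invented for illustration:

```sql
-- Materialize the formatted week once per commit, so DATE_FORMAT runs
-- once per commit rather than once per joined row, and the join and
-- DISTINCT then operate on plain columns.
CREATE TEMPORARY TABLE commit_weeks AS
SELECT id, DATE_FORMAT(created_at, '%x%v1') AS week_commit
FROM commits;

CREATE INDEX idx_commit_weeks_id ON commit_weeks (id);

SELECT DISTINCT pc.project_id, cw.week_commit
FROM project_commits pc
JOIN commit_weeks cw ON cw.id = pc.commit_id;
```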
WorkLifeBalance · almost 7 years ago
I don't know MySQL/MariaDB that well, but I know that in MS SQL Server, doing a DISTINCT on a Function([Column]) would not use an index on [Column].

In this case it would be much faster to have an intermediate table or column with that statistic, which could then be properly indexed.
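In MySQL 5.7+/MariaDB, one way to get an indexable version of such a function result is a stored generated column; a minimal sketch (the column and index names are made up, and the `CHAR(7)` width assumes the `'%x%v1'` format, e.g. `2018011`):

```sql
ALTER TABLE commits
  ADD COLUMN week_commit CHAR(7)
    AS (DATE_FORMAT(created_at, '%x%v1')) STORED;

CREATE INDEX idx_commits_week ON commits (week_commit);
```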
lunchladydoris · almost 7 years ago
I'd like to see the runtime for the same query but (1) without DISTINCT, (2) without the date formatting, and (3) with an inner join.
languagehacker · almost 7 years ago
Am I missing something, or is this blog post just someone discovering that ETL exists?
frou_dh · almost 7 years ago
If you initiate the shell commands from inside the database, does it count as less hacky? :) https://github.com/petere/plsh
linsomniac · almost 7 years ago
I've run into situations like this where I was able to use a window function or a materialized view to get huge speed gains. It basically meant restructuring the tables into a new view, much like the shell script did in this case, and that view could then be optimized by the query planner much more effectively.

One was with an internal database we used for tracking employee time and generating invoices and reports. I forget the exact details, but this technique took it down from an hour to a second, or similar.

Another time was more recently, in an interview, using the MySQL employee sample database. I had a naive query that did most of what was needed, but after quite a bit more work I was able to make a materialized view that took the query from using 64+ GB of RAM and dying from OOM after 4 hours to running in 45 minutes in 2 GB of RAM.

It usually takes me hours and hours to put together, though, because I do it so infrequently.
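MySQL/MariaDB have no native materialized views, but the pattern described above is commonly approximated with a pre-aggregated summary table that is rebuilt on a schedule; a sketch with invented names, assuming the article's schema:

```sql
CREATE TABLE project_week_commits AS
SELECT pc.project_id,
       DATE_FORMAT(c.created_at, '%x%v1') AS week_commit,
       COUNT(*) AS n_commits
FROM commits c
JOIN project_commits pc ON pc.commit_id = c.id
GROUP BY pc.project_id, week_commit;

-- Reports then query the small summary table instead of the 5B-row join.
```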
ww520 · almost 7 years ago
Isn't the Unix join essentially doing an inner join? Why is the original SQL doing a left join while the text processing with Unix join is doing an inner join?

Yes, the left join will scan the whole table. A left join means returning all the rows from the left-hand-side table; it won't use the index.

An inner join with an index is way faster than a left join.
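The inner-join semantics of Unix `join` are easy to check on toy data: by default it emits only lines whose join fields appear in *both* (pre-sorted) inputs, which is exactly an inner join; keeping unmatched rows, as the article's SQL left join does, would need `-a 1`/`-a 2`. The file names below are made up:

```shell
# "commits": id -> week; "project_commits": project -> commit id.
printf '1 2018-W01\n2 2018-W02\n4 2018-W03\n' > commits.txt
printf '10 1\n10 2\n11 3\n' | sort -k2 > project_commits.txt

# Inner-join on commit id (field 1 of commits.txt, field 2 of
# project_commits.txt). Commit 3 has no week row and commit 4 has no
# project row, so neither appears in the output.
join -1 1 -2 2 commits.txt project_commits.txt
```

Note that `join` requires both inputs sorted on the join field, which is why the article needs `sort` as its second Unix command.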
ben509 · almost 7 years ago
If you find you're exporting and running via `uniq` or the like, `create temporary table as` is probably what you want. If you think about it, it's essentially what you're doing, without the additional filesystem involvement.

The optimizer *should* be handling this for you, though for very large datasets, constructing a new table from a query essentially avoids the locking issues you might otherwise run into.
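In SQL form, the suggestion above might look like this against the article's schema (the table name is invented); the temporary table plays the role of the exported text file:

```sql
CREATE TEMPORARY TABLE weekly AS
SELECT pc.project_id,
       DATE_FORMAT(c.created_at, '%x%v1') AS week_commit
FROM project_commits pc
JOIN commits c ON c.id = pc.commit_id;

-- The DISTINCT (the `sort -u`/`uniq` step) now runs on the flat table.
SELECT DISTINCT project_id, week_commit FROM weekly;
```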
nextweek2 · almost 7 years ago
I'd also like to see a breakdown of the time taken to complete the project.

RAM is cheap, and jumping from 16GB to 64GB (or even 128GB) might have cost as much as the analysis time.

The one clue that a memory upgrade might have been a quicker fix is that the merged file was 133GB. Seems to me that an upgrade to 128GB of RAM might have led to a dramatically shorter query execution time.
pytyper2 · almost 7 years ago
I think the date_format call forces a table scan; the query could be rewritten to be much faster. Optimizing the SQL statement seems less error-prone than extracting the data to text files.
yread · almost 7 years ago
There is some more information in this Reddit thread:

https://www.reddit.com/r/programming/comments/94r6w6/how_i_slashed_a_sql_query_runtime_from_380_hours/
jerry40 · almost 7 years ago
As far as I understand, project_commits and commits have a one-to-one relationship (project_commits.commit_id = commits.id). From my point of view it's a strange design, since they could just add a project_id column to the commits table; the project_commits table seems redundant here.

I'd write this query:

    select pc.project_id, date_format(c.created_at, '%x%v1'), count(*)
    from commits c
    join project_commits pc on c.id = pc.commit_id
    group by pc.project_id, date_format(c.created_at, '%x%v1')

And I'd use a left join only at the stage when the projects dictionary needs to be joined with the result of the query.
protomyth · almost 7 years ago
Am I missing something fundamental like a WHERE clause? I see a join, but no limiting of rows beyond that.
Animats · almost 7 years ago
MariaDB can do sort/merge joins. It's rare that you want to force one, but you can; that's what IGNORE INDEX is for.

If you're testing, take a sample of the database and test on that.

First try the simple form:

    explain select project_commits.project_id,
        date_format(commits.created_at, '%x%v1') as week_commit
    from commits, project_commits
    where project_commits.commit_id = commits.id;

That ought to call for a sort and merge.
oliverped · almost 7 years ago
The query is simply very badly written. Who makes a left join from a 5B-row table to a smaller table and at the same time expects a DISTINCT on the "left" table? And if the DISTINCT is meant to solve the problem without expecting the join to match, why have a join in the first place?
dekhn · almost 7 years ago
I once had a huge performance problem with a simple self-join that I expected would run quickly.

I talked to our resident SQL expert (retired from IBM, now doing a job as the lab administrative assistant). She suggested sorting the data before inserting it.

That sped up the query.

It still bothers me that data order matters so much. But I can't reproduce the problem in MySQL any more anyway.
mamcx · almost 7 years ago
After reading the post, and considering the time to export/import the data, I think it would be better to denormalize the data (i.e., precalculate the result).

I had something like this in the recent past, but with much heavier calculations. Denormalizing using triggers turned a query that took minutes into one taking less than a second.
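A minimal sketch of the trigger-based denormalization idea, assuming the article's schema; the table and trigger names are invented, and real code would also need UPDATE/DELETE triggers:

```sql
CREATE TABLE project_week_counts (
  project_id  INT NOT NULL,
  week_commit CHAR(7) NOT NULL,
  n_commits   INT NOT NULL DEFAULT 0,
  PRIMARY KEY (project_id, week_commit)
);

CREATE TRIGGER bump_week_count
AFTER INSERT ON project_commits
FOR EACH ROW
  INSERT INTO project_week_counts (project_id, week_commit, n_commits)
  SELECT NEW.project_id, DATE_FORMAT(c.created_at, '%x%v1'), 1
  FROM commits c
  WHERE c.id = NEW.commit_id
  ON DUPLICATE KEY UPDATE n_commits = n_commits + 1;
```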
slifin · almost 7 years ago
How long does it take to run the DISTINCT on its own? I'd be interested to know what happens if the DISTINCT is in a subquery and the result is then joined with the commits table.

Ensure project_commits.project_id is in an index before running the test.
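One reading of the subquery variant asked about above, as a sketch against the article's schema:

```sql
SELECT DISTINCT dpc.project_id,
       DATE_FORMAT(c.created_at, '%x%v1') AS week_commit
FROM (SELECT DISTINCT project_id, commit_id FROM project_commits) dpc
JOIN commits c ON c.id = dpc.commit_id;
```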
thecopy · almost 7 years ago
The question I'm interested in is why MySQL did not use the index for the lookup.
edgarvm · almost 7 years ago
Naive question: are your tables partitioned?
CyanLite2 · almost 7 years ago
TLDR: author doesn’t know how to use database indexes properly.