
Practical SQL for Data Analysis

602 points by steve-chavez, about 4 years ago

23 comments

michelpp about 4 years ago

This is an excellent example of what I call the Copy-Object-Copy effect. It's particularly apparent in frameworks with ORMs like Django. In Pandas' case, devs will do 'SELECT *' and use pandas as a sort of pseudo-ORM.

You run a query to get your data as a bunch of objects, but you're copying the data over the db connection from the postgres wire protocol into Python objects in memory, which are typically garbage collected at the end of the transaction; then you copy the result to a JSON buffer that is also garbage collected, and then send the final result to the browser, which has been waiting this whole time.

I regularly see Django apps require several gigs of RAM per worker, and when you look at the queries, it's just a bunch of poorly rendered SELECTs. It's grotesque.

Contrast PostgREST: 100 megabytes of RAM per worker. I've seen Django DRF workers require 50x the memory of PostgREST for the same REST interface and result, with PostgREST being much faster. Since the database is generating the JSON, it can do so *immediately* upon generating the first result row, which is then *streamed* to the browser. No double buffering.

Unlike Django, it's not that I don't like Pandas; it's that people way overuse it for SQL tasks, as this article points out. I've seen pandas scripts that could be a simpler psql script. If psql can't do what you want, resist the temptation to 'SELECT *' into a data frame; break the problem up into stages so the database does the maximum work before the data reaches the data frame.
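To make the contrast concrete, here is a minimal sketch of the two shapes of code (sqlite3 standing in for Postgres so it runs standalone; the emp table and numbers are invented for illustration):

    import sqlite3
    import pandas as pd

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE emp (name TEXT, department TEXT, salary REAL);
        INSERT INTO emp VALUES
            ('ann', 'R&D', 100), ('bob', 'Sales', 90), ('cy', 'R&D', 120);
    """)

    # Copy-Object-Copy: every row crosses the connection and sits in Python
    # memory, only to be reduced to a few numbers afterwards.
    df = pd.read_sql("SELECT * FROM emp", con)
    print(df.groupby("department")["salary"].mean())

    # Pushing the work down: the database aggregates, and only the final
    # rows are copied into Python.
    print(pd.read_sql(
        "SELECT department, AVG(salary) AS avg_salary "
        "FROM emp GROUP BY department", con))
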
m16ghost about 4 years ago

> Pandas is a very popular tool for data analysis. It comes built-in with many useful features, it's battle tested and widely accepted. However, pandas is not always the best tool for the job.

SQL is very useful, but there are some data manipulations which are much easier to perform in pandas/dplyr/data.table than in SQL. For example, the article discusses how to perform a pivot table, which takes data in a "long" format and makes it "wider".

In the article, the pandas version is:

    pd.pivot_table(df, values='name', index='role', columns='department', aggfunc='count')

Compared to the SQL version:

    SELECT role,
        SUM(CASE department WHEN 'R&D' THEN 1 ELSE 0 END) as "R&D",
        SUM(CASE department WHEN 'Sales' THEN 1 ELSE 0 END) as "Sales"
    FROM emp
    GROUP BY role;

Not only does the SQL code require you to know up front how many distinct columns you are creating, it requires you to write a line out for each new column. This is okay in simple cases, but it is untenable when you are pivoting on a column with hundreds or more distinct values, such as dates or zip codes.

There are some SQL dialects which provide pivot functions like the one in pandas, but they are not universal.

There are other examples in the article where the SQL code is much longer and less flexible, such as binning, where the bins are hardcoded into the query.
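A small sketch of the contrast (toy data; the code-generation workaround at the end is one common way to fake a dynamic pivot, and would need proper quoting in anything less throwaway than this):

    import pandas as pd

    df = pd.DataFrame({
        "name": ["ann", "bob", "cy", "dee"],
        "role": ["eng", "eng", "mgr", "eng"],
        "department": ["R&D", "Sales", "R&D", "R&D"],
    })

    # pandas discovers the output columns from the data itself.
    print(pd.pivot_table(df, values="name", index="role",
                         columns="department", aggfunc="count"))

    # In SQL the columns must be spelled out, so a "dynamic" pivot usually
    # means generating the query with code.
    cases = ", ".join(
        "SUM(CASE department WHEN '{0}' THEN 1 ELSE 0 END) AS \"{0}\"".format(d)
        for d in df["department"].unique())
    print("SELECT role, {} FROM emp GROUP BY role;".format(cases))
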
flakiness about 4 years ago

The more I learn SQL, the less Python I write.

Although SQL syntax is weird and dated, its portability across tools trumps everything else. You finished the EDA and decided to port the insights to a dashboard? With SQL it's trivial. With Python... well, you'll probably have to port it to SQL unless you have a Netflix-like, Jupyter-backed dashboard infrastructure in place. For those of us who only have a much-more-prevalent SQL-based dashboard platform, why not start from SQL? Copy-and-paste is your friend!

I still hate SQL as a programmer, but as a non-expert data analyst I have now accepted it.
beforeolives about 4 years ago

Nice article. Pandas gets the job done, but it's such a step backwards in terms of usability, API consistency, and code feel. You can do anything you could possibly need with it, but you regularly have to look up things you've looked up before, because the different parts of the library are patched together and don't work consistently in an intuitive way. And then you end up with long lines of().chained().['expressions'].like_this(0).
nerdponx about 4 years ago

SQL syntax sucks for doing non-trivial data analysis. I've tried it. Verbose, no composability or code reuse, not really portable across different databases, no easy interoperability with other tools, limited editor/IDE support, etc.

I guess if you have huge amounts of data (10m+ rows) already loaded into a database, then sure, do your basic summary stats in SQL.

For everything else, I'll continue using SQL to get the data from the database and Pandas or Data.Table to actually analyze it.

That said, this is a very comprehensive review of SQL techniques which could be very useful when you *do* have bigger datasets and/or just need to get data out of a database efficiently. Great writeup and, IMO, required reading for anyone looking to be a serious "independent" data scientist (i.e. not relying on data engineers to do basic ETL for you).

I'd be a huge fan of something like the PySpark DataFrame API that "compiles" to SQL* (but that doesn't require you to actually be using PySpark, which is its own can of worms). I think this would be a lot nicer than any traditional ORM style of API, at least for data analysis, while providing better composability and IDE support than writing raw SQL.

*I also want this for Numexpr: https://numexpr.readthedocs.io/en/latest/user_guide.html. But I also want a lot of other things, like Arrow-backed data frames and a Numexpr-like C library to interact with them.
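A toy sketch of what such an API could look like, reduced to the bare idea of an immutable query description that renders to SQL at the end (all names invented, purely illustrative):

    class Query:
        def __init__(self, table, wheres=(), selects=("*",)):
            self.table, self.wheres, self.selects = table, tuple(wheres), tuple(selects)

        def where(self, cond):
            # Each call returns a new Query, so partial queries compose and
            # can be reused -- the thing raw SQL makes hard.
            return Query(self.table, self.wheres + (cond,), self.selects)

        def select(self, *cols):
            return Query(self.table, self.wheres, cols)

        def to_sql(self):
            sql = "SELECT {} FROM {}".format(", ".join(self.selects), self.table)
            if self.wheres:
                sql += " WHERE " + " AND ".join(self.wheres)
            return sql

    base = Query("emp").where("salary > 100")        # reusable fragment
    print(base.select("name", "department").to_sql())
    # SELECT name, department FROM emp WHERE salary > 100
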
spamalot159 about 4 years ago

I think for a lot of people, SQL is a skill that doesn't stick. You learn enough to do the queries you need for your project, they work, and then you forget about them as you work on the rest of your project. These skills are perishable. Left outer join? Yeah, I knew what that was some time ago, but not anymore.

The days of dedicated SQL programmers are mostly gone.
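(A minimal refresher, as a runnable sqlite3 sketch with invented tables: a left outer join keeps every row of the left table and fills in NULLs where the right side has no match.)

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE emp  (name TEXT, dept_id INTEGER);
        CREATE TABLE dept (id INTEGER, name TEXT);
        INSERT INTO emp  VALUES ('ann', 1), ('bob', NULL);
        INSERT INTO dept VALUES (1, 'R&D');
    """)

    # Every emp row survives; unmatched rows get NULL (None) for dept.name.
    rows = con.execute("""
        SELECT emp.name, dept.name
        FROM emp LEFT OUTER JOIN dept ON emp.dept_id = dept.id
    """).fetchall()
    print(rows)  # [('ann', 'R&D'), ('bob', None)]
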
afiodorov about 4 years ago

I have been working with data for more than 5 years now, and my 2 cents is that Pandas shines for exploratory analysis. Because you have the data in memory, it's easy to empirically arrive at the exact data transformation you need and figure out whether your analysis is worth pursuing further. At this stage of development, your ability to iterate and experiment quickly is most valuable. I see no value in worrying about the deployment of the work you are doing if you haven't got any results just yet.

Now, when the analysis is done and there's a conclusion about putting some of it in production, I have yet to see a single company that relies on pandas to do data transforms "online", i.e. on the live production data stream. Typically, in my experience, I would end up rewriting my data transforms in either a compiled language or SQL, i.e. something much more performant and maintainable.

As a result, as you progress in your career you become very proficient in both pandas and SQL, and you also start glancing at the source code of some of the transforms performed by pandas when you need to port them to the language of choice of the company you're working for.
simonw about 4 years ago

This article is so useful. It starts out with SQL basics but then quickly leaps into all kinds of PostgreSQL tricks I didn't know about: binning, efficient sampling, even linear regression.
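A few of those tricks, sketched as queries run from Python; this is a sketch assuming a reachable PostgreSQL server and an invented table t(x, y), with an invented connection string:

    import psycopg2

    con = psycopg2.connect("dbname=test")  # invented connection details
    cur = con.cursor()

    # Binning: width_bucket assigns x to one of 5 equal buckets over [0, 100).
    cur.execute("SELECT width_bucket(x, 0, 100, 5) AS bucket, count(*) "
                "FROM t GROUP BY bucket ORDER BY bucket")

    # Efficient sampling: visit roughly 1% of the table's pages, not all rows.
    cur.execute("SELECT * FROM t TABLESAMPLE SYSTEM (1)")

    # Linear regression as built-in aggregates: slope and intercept of y ~ x.
    cur.execute("SELECT regr_slope(y, x), regr_intercept(y, x) FROM t")
    print(cur.fetchone())
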
codeulike about 4 years ago

I'm so glad to have been around long enough that SQL is now being seen as exotic again.
ojosilva about 4 years ago

I'm not a Pandas dev, or a Pythonista for that matter, but I'm pretty sure a Pandas DataFrame should be able to handle SQL cursors, materializing data just in time for the aggregation. Getting all rows in with a fetchall is an antipattern and just completely impractical for a lot of use cases! Didn't Pandas have a SQLAlchemy integration already? Although I'm not sure it would run the aggregation progressively or, again, with a fetchall internally...
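pandas can at least avoid the fetchall: read_sql accepts a chunksize and yields DataFrames one batch at a time, so a reduction can run incrementally. A sketch with invented file and table names:

    import sqlite3
    import pandas as pd

    con = sqlite3.connect("data.db")  # invented database file

    # chunksize turns read_sql into an iterator of DataFrames instead of one
    # giant materialized frame.
    total = 0
    for chunk in pd.read_sql("SELECT value FROM t", con, chunksize=10_000):
        total += chunk["value"].sum()
    print(total)
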
dmitrykoval about 4 years ago

Following similar observations, I wondered whether one could actually execute SQL queries inside a Python process with access to native Python functions and Numpy as UDFs. Thanks to Apache Arrow, one can essentially combine a DataFrame API with SQL within data analysis workflows, without the need to copy the data or write operators in a mix of C++ and Python, all within the confines of the same Python process.

So I implemented Vinum, which allows executing queries that may invoke Numpy or Python functions as UDFs available to the interpreter. For example: "SELECT value, np.log(value) FROM t WHERE ..".

https://github.com/dmitrykoval/vinum

Finally, DuckDB is making great progress integrating pandas dataframes into its API, with UDF support coming soon. I would certainly recommend giving it a shot for OLAP workflows.
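A taste of that DuckDB integration (toy data; DuckDB's built-in ln() stands in here for the np.log UDF above): it scans the in-scope pandas DataFrame directly, without copying it into a separate database.

    import duckdb
    import pandas as pd

    df = pd.DataFrame({"value": [1.0, 10.0, 100.0]})

    # DuckDB resolves "df" to the local DataFrame and returns a DataFrame.
    out = duckdb.query("SELECT value, ln(value) AS log_value FROM df").df()
    print(out)
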
muxator about 4 years ago

> This benchmark does not mention the memory consumed by the database itself - this is intentional. [...] Whether you decide to use the database or not, the memory is already paid for, so you might as well use it!

This sentence is a big red flag for me. An analysis of a strategy that pushes work towards a subsystem, and then deliberately ignores the performance implications for that subsystem, is methodologically unsound.

Personally, I am all for using the DB and writing good SQL. But if I weren't, this argument would not convince me.
lysecret about 4 years ago

Hmm, one thing people forget in the Pandas vs. SQL debate is that for the kind of data you would use in pandas, being a day, or sometimes even a week or more, out of date doesn't make a big difference. Say you are building some complicated data processing pipeline and you base it on a bunch of SQL queries: you are hitting the database every single time.

Whereas if you just fetch the tables and do most (or at least some) of the merging and processing in pandas, you can load the data in the middle of the night, and that is the only time you hit the db.

I have personally experienced big analytical SQL queries hitting the db at busy times...
bayan1234 about 4 years ago

I think one of the less discussed advantages of SQL, and a reason why it might be so prevalent in the analytics world, is that users don't really need to spend much time setting up their environment. I can send a query with my analysis to a less technical person for them to run. They can put the processed data into a spreadsheet and do whatever they want.
twobitshifter about 4 years ago

I now do most of my data analysis in Julia instead of pandas, but I used pandas for a long time. SQL is a valuable skill, and useful when you need to optimize a query for performance, which can sometimes be necessary with pandas and is much less likely with Julia.

However, even if a pandas query takes 5x longer to run than a SQL query, you must consider developer efficiency. You can chain pandas commands together to accomplish something that takes 10x the lines of SQL. With SQL you're more likely to end up with many intermediate CTEs along the way. So while you can definitely save processor cycles by using SQL, I don't think you'll save wall-clock time by using it for most one-off tasks. Data science is usually column oriented, and Julia and pandas allow you to stay in that world.
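A sketch of such a chain on toy data; each intermediate step would plausibly become its own CTE in SQL:

    import pandas as pd

    df = pd.DataFrame({"dept": ["R&D", "Sales", "R&D"],
                       "salary": [120.0, 90.0, 100.0]})

    # Filter, derive a column, aggregate, and sort in one chain.
    out = (df[df["salary"] > 95]
           .assign(bonus=lambda d: d["salary"] * 0.1)
           .groupby("dept", as_index=False)[["salary", "bonus"]].mean()
           .sort_values("salary", ascending=False))
    print(out)
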
jonas_b about 4 years ago

Question: if I have a CSV file that I'd like to run some quick SQL queries on before moving the results into Pandas, what would be a good resource for this? Preferably compatible with the rest of the Python-dataframe ecosystem and as simple as pd.read_csv().
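One possible approach is DuckDB, which came up elsewhere in this thread: it can query a CSV file in place and hand the result back as a DataFrame (file and column names invented):

    import duckdb

    df = duckdb.query("""
        SELECT department, count(*) AS n
        FROM 'emp.csv'
        GROUP BY department
    """).df()
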
sbuttgereit about 4 years ago

The code:

    WITH dt AS (
        SELECT unnest(array[1, 2]) AS n
    )
    SELECT * FROM dt;

is more complex than necessary. This produces the same result:

    SELECT n FROM unnest(array[1, 2]) n;
    ┌───┐
    │ n │
    ├───┤
    │ 1 │
    │ 2 │
    └───┘
    (2 rows)

I think I see some other opportunities as well.

I know the code is from a section dealing with CTEs, but CTEs aren't needed in every situation, including things like using VALUES lists. Most people in the target audience can probably ignore this next point (they are probably on a newer version of PostgreSQL), but older versions of PostgreSQL would materialize the CTE results before processing the main query, which is not immediately obvious.
closed about 4 years ago

For what it's worth, I maintain a library called siuba that lets you generate SQL code from pandas methods.

It's crazy to me how people use SELECT * -> pandas, but also how people in SQL type a ton of code over and over.

https://github.com/machow/siuba
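A minimal taste of siuba, based on its README (the mtcars sample dataset ships with the library; per the comment above, the same verbs can also target SQL sources, in which case siuba emits SQL):

    from siuba import _, group_by, summarize
    from siuba.data import mtcars

    # Average horsepower by cylinder count.
    result = (mtcars
              >> group_by(_.cyl)
              >> summarize(avg_hp=_.hp.mean()))
    print(result)
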
liprais about 4 years ago

It seems the only reason people on HN hate SQL is that they don't understand SQL and/or have already bought into another toolset. If you only have a hammer at hand, everything is a nail.
truth_seeker about 4 years ago

Pure SQL porn. Very exciting!

Lately I am seeing really creative blog posts on HN. Keep it up, guys.
yewenjie about 4 years ago
OT but what is the best resource for learning SQL?
racl101 about 4 years ago
I think I do a healthy mixture of both.
jplr8922 about 4 years ago

tl;dr: if you hear "but we can do this in SQL", RUN!!!

My eyes hurt as I read this article. There are reasons why analysts don't use SQL to do their job, and they have nothing to do with saving RAM and memory.

1) Data analysis is not a linear process; it involves playing with and manipulating the data in different ways and letting your mind drift a bit. You want your project in an IDE made for that purpose, with the ability to create charts, source control, export and share the information, etc. Pandas is just a piece of that puzzle, which is not possible to replicate in pure SQL.

2) In 2020, there are numerical methods you want to try beyond a traditional regression. Most real-world data problems are not made for the stats101 tools included in SQL. Kurtosis? Autocorrelation?

3) Politics. Most database administrators are control freaks who *hate* the idea of somebody else doing stuff in their DB. Right now, where I work, we still have to use SSIS-2013 instead of stored procedures in order to avoid the DBA refusal bureaucratic process.

4) Eventual professional development. If your analysis is good and creates value, chances are it will become a 'real' program and you will have to explain what you are doing to turn it into an OOP tool. If you have CS101, good coding in Python will make this process much easier than a 3000-line spaghetti SQL script.

5) Data cleaning. Dealing with outliers, NaNs and all that jazz really depends on the problem you are trying to solve. The absence of a one-size-fits-all solution is a good case for R/pandas/etc. These issues will break an SQL script in no time.

6) Debugging in SQL. Hahahahahahahaha

If you are still preoccupied with the RAM usage of your PC for your project, here are two solutions which have infuriated a lot of DBAs I've worked with:

A) https://www.amazon.ca/s?k=ram&__mk_fr_CA=%C3%85M%C3%85%C5%BD%C3%95%C3%91&ref=nb_sb_noss_2

B) https://aws.amazon.com/