Loading Data into Pandas: Tips and Tricks You May or May Not Know

49 points by spacejunkjim, almost 3 years ago | 9 comments

Helmut10001, almost 3 years ago
Not to knock the article, but the most important pandas parameters for me are `iterator=True` and `chunksize=x`, for streamed processing. Here's an example that processes a CSV file with 400 million latitude/longitude coordinates. [1]

[1]: https://ad.vgiscience.org/twitter-global-preview/00_Twitter_datashader.html
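A minimal sketch of the chunked-reading pattern; the file name, chunk size, and per-chunk work are placeholders:

    import pandas as pd

    # With iterator=True/chunksize, read_csv returns a TextFileReader
    # that yields DataFrames lazily instead of loading the whole file.
    reader = pd.read_csv("coordinates.csv", iterator=True, chunksize=1_000_000)

    total_rows = 0
    for chunk in reader:
        total_rows += len(chunk)  # replace with real per-chunk processing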
nl, almost 3 years ago
I find pd.read_sql pretty useful for integration with SQLite too.

https://pandas.pydata.org/docs/reference/api/pandas.read_sql.html
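For reference, a minimal sketch using the standard-library sqlite3 driver; the database file and table name are made up:

    import sqlite3
    import pandas as pd

    # read_sql accepts any DBAPI connection; sqlite3 ships with Python.
    conn = sqlite3.connect("example.db")
    df = pd.read_sql("SELECT * FROM readings LIMIT 100", conn)
    conn.close()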
wenc, almost 3 years ago
I've mostly replaced pd.read_csv and pd.read_parquet with duckdb.query("select * from 'x.csv'") or duckdb.query("select * from 'y/*.parquet'").

It's much faster because DuckDB is vectorized. The result is a Pandas dataframe.

Querying the Pandas dataframe from DuckDB is faster than querying it with Pandas itself.
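A sketch of that workflow; file paths are placeholders, and note that duckdb.query returns a relation, so .df() is what materializes it as pandas:

    import duckdb

    # DuckDB scans CSV and Parquet files directly, including globs.
    df = duckdb.query("SELECT * FROM 'x.csv'").df()
    parquet_df = duckdb.query("SELECT * FROM 'y/*.parquet'").df()

    # DuckDB can also query an in-scope pandas DataFrame by name,
    # which is the "querying pandas from DuckDB" case above.
    top = duckdb.query("SELECT * FROM df LIMIT 10").df()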
ashwal, almost 3 years ago
Some good tips in here; I've found myself reaching for the JSON/Excel methods often.

Despite using it for years, I still haven't decided whether pandas is poorly architected or whether the clunkiness (for lack of a better term) is a result of the inherent difficulty of the tasks.
char101, almost 3 years ago
Another tip: use https://github.com/sfu-db/connector-x to load database query results into pandas without a memory copy, which makes loading faster.
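A minimal sketch of connector-x usage; the connection string, table, and query are placeholders:

    import connectorx as cx

    # connector-x reads query results straight into a DataFrame,
    # avoiding the row-by-row copying a DBAPI cursor would incur.
    df = cx.read_sql(
        "postgresql://user:password@localhost:5432/mydb",
        "SELECT * FROM events WHERE created_at > '2022-01-01'",
    )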
bushbaba, almost 3 years ago
I wish I had known about json_normalize sooner. JSON is great, but needing to transform it into a CSV/Excel sheet is a chore. Great to know about this one-liner!
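A quick sketch of the one-liner; the sample payload is made up for illustration:

    import pandas as pd

    # json_normalize flattens nested JSON into dot-separated columns.
    records = [
        {"id": 1, "user": {"name": "Ada", "country": "UK"}},
        {"id": 2, "user": {"name": "Grace", "country": "US"}},
    ]
    df = pd.json_normalize(records)
    # Columns: id, user.name, user.country
    df.to_csv("users.csv", index=False)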
lettergram, almost 3 years ago
I spent some time working on a Python library called DataProfiler:

https://github.com/capitalone/DataProfiler

The gist is that you can point it at any common dataset and load it directly into pandas.

    from dataprofiler import Data

    data = Data("your_file.csv")  # Auto-Detect & Load: CSV, AVRO, Parquet, JSON, Text, URL

I simply hate dealing with loading data, so it's my go-to.
pineapplejuice, almost 3 years ago
My tip is to keep a dict of all the fields and the data types you expect them to be, particularly strings. In my company we have IDs that start with zeros, or are a mix of numbers and letters, and they get interpreted as numeric types. I'm frequently pulling data out of the DW with the same fields, so I just pass my dict to the dtype= argument and it takes care of that for me.
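A sketch of that pattern with a hypothetical schema dict; the column names are invented:

    import pandas as pd

    # Force ID-like fields to str so values such as "00042"
    # keep their leading zeros instead of being parsed as ints.
    DTYPES = {
        "customer_id": str,
        "zip_code": str,
        "revenue": "float64",
        "order_count": "Int64",  # nullable integer dtype
    }

    df = pd.read_csv("warehouse_export.csv", dtype=DTYPES)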
IntrepidWorm, almost 3 years ago
In my experience, the best way to load data into pandas is https://www.atlassian.com/software/bamboo