Not to knock the article, but the most important pandas parameters for me are `iterator=True` and `chunksize=x`, for streamed processing. Here's an example of processing a CSV file with 400 million latitude and longitude coordinates. [1]

[1]: https://ad.vgiscience.org/twitter-global-preview/00_Twitter_datashader.html
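For context, a minimal sketch of the streamed pattern (the file name, column names, and per-chunk handler are placeholders):

    import pandas as pd

    # Stream the file in 5-million-row chunks instead of loading it all at once.
    reader = pd.read_csv(
        "coords.csv",
        usecols=["latitude", "longitude"],
        iterator=True,
        chunksize=5_000_000,
    )

    for chunk in reader:
        # Each chunk is an ordinary DataFrame; aggregate or write it out here.
        handle_chunk(chunk)  # hypothetical per-chunk handler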
I find pd.read_sql pretty useful for integration with SQLite too.

https://pandas.pydata.org/docs/reference/api/pandas.read_sql.html
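A minimal sketch with the standard-library sqlite3 driver (database file and table name are placeholders):

    import sqlite3
    import pandas as pd

    conn = sqlite3.connect("example.db")
    df = pd.read_sql("SELECT * FROM measurements", conn)
    conn.close()

    print(df.head())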
I've mostly replaced pd.read_csv and pd.read_parquet with `duckdb.query("select * from 'x.csv'")` or `duckdb.query("select * from 'y/*.parquet'")`.

It's much faster because DuckDB is vectorized, and the result converts straight into a pandas DataFrame.

Querying the pandas DataFrame from DuckDB is also faster than querying it with pandas itself.
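A rough sketch of both directions (file path and column names are placeholders):

    import duckdb

    # .df() materialises the query result as a pandas DataFrame.
    df = duckdb.query("SELECT * FROM 'points.parquet'").df()

    # DuckDB can also query the in-memory DataFrame directly by variable name.
    counts = duckdb.query(
        "SELECT longitude, COUNT(*) AS n FROM df GROUP BY longitude ORDER BY n DESC LIMIT 10"
    ).df()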
Some good tips in here; I find myself reaching for the JSON/Excel methods often.

Despite using it for years, I still haven't decided whether pandas is poorly architected or whether the clunkiness (for lack of a better term) is a result of the inherent difficulty of the tasks.
Another tip: use https://github.com/sfu-db/connector-x to load database query results into pandas without an intermediate memory copy, which makes loading noticeably faster.
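A minimal sketch (connection string, query, and column names are placeholders):

    import connectorx as cx

    df = cx.read_sql(
        "postgresql://user:password@localhost:5432/mydb",
        "SELECT id, created_at, amount FROM orders",
        return_type="pandas",  # the default; arrow and polars are also supported
    )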
I wish I had known about json_normalize sooner. JSON is great, but transforming it into a CSV/Excel sheet is a chore. Great to know about this one-liner!
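For anyone who hasn't used it, a small sketch with a made-up nested payload:

    import pandas as pd

    records = [
        {"id": 1, "user": {"name": "Ada", "address": {"city": "London"}}},
        {"id": 2, "user": {"name": "Grace", "address": {"city": "Arlington"}}},
    ]

    # Nested dicts become dotted column names: user.name, user.address.city, ...
    flat = pd.json_normalize(records)
    flat.to_csv("users.csv", index=False)
    # flat.to_excel("users.xlsx", index=False)  # requires openpyxl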
I spent some time working on a Python library called DataProfiler:

https://github.com/capitalone/DataProfiler

The gist is that you can point it at any common dataset and load it directly into pandas.

    from dataprofiler import Data

    data = Data("your_file.csv")  # auto-detects & loads CSV, AVRO, Parquet, JSON, text, URL

I simply hate dealing with loading data, so it's my go-to.
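If you want to drop back into plain pandas afterwards, a minimal sketch (assuming the loaded contents are exposed via the `.data` attribute, as in the project README):

    from dataprofiler import Data

    data = Data("your_file.csv")  # format auto-detected; file name is a placeholder
    df = data.data                # assumed: underlying pandas DataFrame
    print(df.head())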
My tip is to keep a dict of all the fields and the data types you expect them to be, particularly strings. In my company we have IDs that start with zeros, or are a mix of numbers and letters, and they get interpreted as numeric types. I'm frequently pulling data out of the DW with the same fields, so I just use the dtype= arg, point it at my dict, and it takes care of that for me.
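A small sketch of the pattern (the column names and file are made up; the point is keeping one shared dtype map):

    import pandas as pd

    DW_DTYPES = {
        "account_id": "string",   # leading zeros survive
        "order_id": "string",     # mixed letters/digits stay text
        "quantity": "int64",
        "amount": "float64",
    }

    df = pd.read_csv("warehouse_extract.csv", dtype=DW_DTYPES)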
In my experience, the best way to load data into pandas is https://www.atlassian.com/software/bamboo