Delimited files are hell: a comparison of methods to read bad files

8 points by jasonpbecker, over 2 years ago

4 comments

jasonpbecker, over 2 years ago
I tried `polars` via Explorer from Elixir, and it shit the bed in an entirely different way:

```
df = Explorer.DataFrame.from_csv(filename = "my_file.txt", delimiter: "|", infer_schema_length: nil)
{:error, {:polars, "Could not parse `OTHER` as dtype Int64 at column 3.\nThe current offset in the file is 4447442 bytes.\n\nConsider specifying the correct dtype, increasing\nthe number of records used to infer the schema,\nrunning the parser with `ignore_parser_errors=true`\nor adding `OTHER` to the `null_values` list."}}
```

Note: I added `infer_schema_length: nil` on the assumption that type discovery via sampling is just less good in `polars`, since this makes it read the whole file before determining types, but it still failed.
af3d, over 2 years ago
Relying on undefined behaviour can't really be considered much of a solution. Any change to one of those third-party libraries could break your application without warning. I would suggest inserting a sanitization routine right there into the stack to parse and transform the data file accordingly. For the sake of posterity, emitting logs of every "questionable" entry along the way wouldn't be a bad idea either.
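
A minimal sketch of what such a sanitization pass might look like, in Python with only the standard library. The function name `sanitize`, the three-column sample, and the decision to skip bad rows rather than repair them are all assumptions for illustration; none of this comes from the article.

```python
import csv
import io
import logging

logger = logging.getLogger("sanitize")


def sanitize(lines, expected_fields, delimiter="|"):
    """Yield rows with the expected field count; log and skip the rest."""
    # QUOTE_NONE treats quote characters as ordinary data. This is an
    # assumption that suits files where stray quotes are part of the problem.
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != expected_fields:
            # Record every questionable entry with its line number.
            logger.warning(
                "line %d: expected %d fields, got %d: %r",
                reader.line_num, expected_fields, len(row), row,
            )
            continue
        yield row


# Hypothetical sample: the third line has a stray delimiter inside a field.
sample = io.StringIO("id|name|count\n1|alpha|10\n2|be|ta|20\n3|gamma|30\n")
clean = list(sanitize(sample, expected_fields=3))
```

The cleaned rows can then be handed to whichever reader you like, and the log gives you a record of what was dropped and why.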
nuc1e0n, over 2 years ago
The author's assessment that using binary files would prevent such issues is flat-out wrong. If a file is corrupt, as in his data, then undefined behaviour is all you can expect, whether the file format is text or binary.

The thing that would prevent such issues is validation of the data you accept.
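
As a rough illustration of validating on the way in, here is a Python sketch that checks one row against a declared schema and reports why it fails. The names `validate_row` and `schema` and the column layout are hypothetical; the `OTHER` value just echoes the kind of entry from the polars error quoted earlier in the thread.

```python
def validate_row(row, schema, null_values=frozenset({""})):
    """Return a list of error strings for one row; an empty list means valid."""
    if len(row) != len(schema):
        return [f"expected {len(schema)} fields, got {len(row)}"]
    errors = []
    for (name, caster), value in zip(schema, row):
        if value in null_values:
            continue  # declared null markers are accepted for any column
        try:
            caster(value)
        except ValueError:
            errors.append(
                f"column {name!r}: cannot parse {value!r} as {caster.__name__}"
            )
    return errors


# Hypothetical schema: (column name, type used to parse the value).
schema = [("id", int), ("label", str), ("count", int)]
ok = validate_row(["1", "alpha", "10"], schema)
bad = validate_row(["2", "beta", "OTHER"], schema)
```

Whether a failing row means rejecting the whole file or just that row is a policy decision; the point is that the decision is made explicitly instead of being left to the parser.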
jasonpbecker, over 2 years ago
A comparison of duckdb, PostgreSQL, pandas, readr, and fread when reading a delimited file with strange data.