Delimited files are hell – a comparison of methods to read bad files

8 points by jasonpbecker over 2 years ago

4 comments

jasonpbecker over 2 years ago
I tried `polars` via Explorer from Elixir, and it shit the bed in an entirely different way:

```
df = Explorer.DataFrame.from_csv(filename = "my_file.txt", delimiter: "|", infer_schema_length: nil)
{:error, {:polars, "Could not parse `OTHER` as dtype Int64 at column 3.\nThe current offset in the file is 4447442 bytes.\n\nConsider specifying the correct dtype, increasing\nthe number of records used to infer the schema,\nrunning the parser with `ignore_parser_errors=true`\nor adding `OTHER` to the `null_values` list."}}
```

Note: I added `infer_schema_length: nil` on the assumption that type discovery via sampling was simply weaker in `polars`, since this setting makes it read the whole file before determining types, but it still failed.
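The error message offers two ways out: fix the dtype, or add `OTHER` to the null values. Before picking one, it may help to list every token in that column that does not parse as an integer. Below is a minimal sketch in Python using only the standard library. It assumes a pipe-delimited file with a header row and no quoting; the function name `non_integer_tokens`, the 0-based column index, and the sample data are all hypothetical stand-ins, not anything from the real file.

```python
import csv
import io
from collections import Counter


def non_integer_tokens(lines, column, delimiter="|"):
    """Count the values in `column` (0-based) that do not parse as integers."""
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the header row (assumed to be present)
    bad = Counter()
    for row in reader:
        if column >= len(row):
            bad["<missing field>"] += 1  # row is too short to have this column
            continue
        value = row[column].strip()
        if value == "":
            continue  # empty fields are treated as nulls in this sketch
        try:
            int(value)
        except ValueError:
            bad[value] += 1
    return bad


# Hypothetical sample standing in for the real file.
sample = io.StringIO("id|name|code\n1|a|10\n2|b|OTHER\n3|c|OTHER\n4|d|7\n")
print(non_integer_tokens(sample, column=2))
```

If the counter holds only a handful of sentinel strings, listing them as null values is probably enough; if it holds many distinct values, the column is more likely a string column that sampling misjudged.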
af3d over 2 years ago
Relying on undefined behaviour can't really be considered much of a solution. Any change to one of those third-party libraries could break your application without warning. I would suggest inserting a sanitization routine into the stack to parse and transform the data file accordingly. For the sake of posterity, logging every "questionable" entry along the way wouldn't be a bad idea either.
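A sanitization pass of the kind described could look roughly like the following Python sketch. It assumes a pipe-delimited file without quoting and a known field count; the `sanitize` function, the logger name, and the policy of skipping malformed rows (rather than repairing them) are illustrative assumptions, and the right policy depends on the data.

```python
import csv
import io
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("sanitize")  # hypothetical logger name


def sanitize(lines, expected_fields, delimiter="|"):
    """Yield cleaned rows; log and skip rows with the wrong field count."""
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != expected_fields:
            # reader.line_num is the number of source lines read so far
            log.warning("line %d: expected %d fields, got %d: %r",
                        reader.line_num, expected_fields, len(row), row)
            continue
        # The only transformation here: trim stray whitespace around each field.
        yield [field.strip() for field in row]


# Hypothetical input: the third line has one field too many.
raw = io.StringIO("id|name|code\n1| alice |10\n2|bob|x|20\n3|carol|30\n")
clean = list(sanitize(raw, expected_fields=3))
print(clean)
```

Because the routine is a generator, it can sit between the raw file and whichever reader comes next without loading the whole file into memory, and the log gives a record of every row that was set aside.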
nuc1e0n over 2 years ago
The author's assessment that using binary files would prevent such issues is flat-out wrong. If a file is corrupt, as his data is, then undefined behaviour is all you can expect, whether the file format is text or binary.

What would prevent such issues is validating the data you accept.
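To illustrate that a binary format needs the same validation, here is a small Python sketch. The format is invented for the example (each record is a 4-byte little-endian length followed by that many payload bytes), and `read_records` and `max_len` are hypothetical names. Without the length checks, a single flipped byte would make the reader silently misinterpret everything after it.

```python
import struct

HEADER = struct.Struct("<I")  # hypothetical format: 4-byte little-endian length prefix


def read_records(buf, max_len=1024):
    """Parse length-prefixed records, raising ValueError on inconsistent input."""
    records, offset = [], 0
    while offset < len(buf):
        if len(buf) - offset < HEADER.size:
            raise ValueError(f"truncated header at byte {offset}")
        (length,) = HEADER.unpack_from(buf, offset)
        offset += HEADER.size
        # Validate the declared length against a sanity cap and the bytes remaining.
        if length > max_len or length > len(buf) - offset:
            raise ValueError(f"implausible record length {length} at byte {offset}")
        records.append(buf[offset:offset + length])
        offset += length
    return records


good = HEADER.pack(2) + b"hi" + HEADER.pack(3) + b"abc"
corrupt = good[:6] + b"\xff" + good[7:]  # overwrite one byte of the second length field
print(read_records(good))
try:
    read_records(corrupt)
except ValueError as err:
    print("rejected:", err)
```

The checks catch this particular corruption because the declared length no longer fits the remaining bytes; corruption inside a payload would need a checksum or content-level validation on top.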
jasonpbecker over 2 years ago
A comparison of duckdb, PostgreSQL, pandas, readr, and fread when reading a delimited file with strange data.