TechEcho

4 comments

I tried `polars` via Explorer from Elixir, and it shit the bed in an entirely different way:

```
df = Explorer.DataFrame.from_csv(filename = "my_file.txt", delimiter: "|", infer_schema_length: nil)
{:error, {:polars, "Could not parse `OTHER` as dtype Int64 at column 3.\nThe current offset in the file is 4447442 bytes.\n\nConsider specifying the correct dtype, increasing\nthe number of records used to infer the schema,\nrunning the parser with `ignore_parser_errors=true`\nor adding `OTHER` to the `null_values` list."}}
```

Note: I added `infer_schema_length: nil` on the assumption that type discovery via sampling was simply less accurate in `polars`. That setting should make it read the whole file before determining types, but it still failed.

af3d, over 2 years ago

Relying on undefined behaviour can't really be considered much of a solution. Any changes to one of those third-party libraries could possibly break your application without warning. I would suggest inserting a sanitization routine right there into the stack to parse and transform the data file accordingly. For the sake of posterity, emitting logs of every "questionable" entry along the way wouldn't be a bad idea either.
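A minimal sketch of the kind of sanitization routine suggested above, using only the Python standard library. The function name `sanitize_rows`, the `int_columns` parameter, and the sample data are hypothetical; the pipe delimiter and the stray `OTHER` value are borrowed from the error message earlier in the thread. It assumes the first line is a header and that unparsable integers should become `None` rather than abort the load.

```python
import csv
import io
import logging

logger = logging.getLogger("sanitize")  # hypothetical logger name


def sanitize_rows(lines, int_columns, delimiter="|"):
    """Yield the header, then cleaned rows.

    Rows with the wrong number of fields are logged and skipped.
    Values in int_columns that do not parse as integers are logged
    and replaced with None.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader)
    yield header
    for row in reader:
        line_no = reader.line_num  # physical line number in the source
        if len(row) != len(header):
            logger.warning(
                "line %d: expected %d fields, got %d; row skipped",
                line_no, len(header), len(row),
            )
            continue
        cleaned = list(row)
        for col in int_columns:
            try:
                cleaned[col] = int(row[col])
            except ValueError:
                logger.warning(
                    "line %d, column %d: %r is not an integer; set to None",
                    line_no, col, row[col],
                )
                cleaned[col] = None
        yield cleaned


# Usage with made-up data: one clean row, one bad value, one short row.
sample = io.StringIO("id|name|count\n1|a|10\n2|b|OTHER\n3|c\n")
rows = list(sanitize_rows(sample, int_columns=[2]))
```

Running the cleaned rows through the downstream loader, rather than the raw file, means the loader never has to guess at types, and the warning log keeps a record of every entry that was altered or dropped.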


nuc1e0n, over 2 years ago

The author's assessment that using binary files would prevent such issues is flat-out wrong. If a file is corrupt, as his data is, then undefined behaviour is all you can expect, whether the file format is text-based or binary.

The thing that would prevent such issues is validation of the data you accept.
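As a rough illustration of validating data at the point of acceptance, here is a fail-fast sketch that rejects a whole delimited file on the first malformed row. The names `validate_delimited`, `expected_types`, and `ValidationError` are hypothetical, and the schema is assumed to be known in advance as one Python type per column.

```python
import csv
import io


class ValidationError(ValueError):
    """Raised when the input does not match the expected schema."""


def validate_delimited(lines, expected_types, delimiter="|"):
    """Check every row against expected_types and return the row count.

    Raises ValidationError on the first mismatch, so nothing
    downstream ever sees a partially valid file.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader, None)
    if header is None or len(header) != len(expected_types):
        raise ValidationError("header does not match expected column count")
    count = 0
    for row in reader:
        line_no = reader.line_num
        if len(row) != len(expected_types):
            raise ValidationError(
                f"line {line_no}: expected {len(expected_types)} fields, "
                f"got {len(row)}"
            )
        for col, (value, caster) in enumerate(zip(row, expected_types)):
            try:
                caster(value)
            except ValueError as exc:
                raise ValidationError(
                    f"line {line_no}, column {col}: {value!r} is not a "
                    f"valid {caster.__name__}"
                ) from exc
        count += 1
    return count


# Usage with made-up data: this file passes and reports two data rows.
good = io.StringIO("id|name|count\n1|a|10\n2|b|20\n")
accepted = validate_delimited(good, [int, str, int])
```

The same idea carries over to a binary format: the check moves from "does this string parse as an integer" to "does this record match the declared layout", but the decision to reject bad input still has to be made explicitly.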


jasonpbecker, over 2 years ago

A comparison of duckdb, PostgreSQL, pandas, readr, and fread when reading a delimited file with strange data.

Delimited files are hell– a comparison of methods to read bad files
