科技回声

4 comments

I tried `polars` via Explorer from Elixir, and it shit the bed in an entirely different way:

```
df = Explorer.DataFrame.from_csv(filename = "my_file.txt", delimiter: "|", infer_schema_length: nil)
{:error, {:polars, "Could not parse `OTHER` as dtype Int64 at column 3.\nThe current offset in the file is 4447442 bytes.\n\nConsider specifying the correct dtype, increasing\nthe number of records used to infer the schema,\nrunning the parser with `ignore_parser_errors=true`\nor adding `OTHER` to the `null_values` list."}}
```

Note: I added `infer_schema_length: nil` on the assumption that type discovery via sampling was simply weaker in `polars`, since this should make it read the whole file before determining types, but it still failed.

af3d, over 2 years ago

Relying on undefined behaviour can't really be considered much of a solution. Any changes to one of those third-party libraries could possibly break your application without warning. I would suggest inserting a sanitization routine right there into the stack to parse and transform the data file accordingly. For the sake of posterity, emitting logs of every "questionable" entry along the way wouldn't be a bad idea either.
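A sanitization pass like the one suggested above could look roughly like the following. This is only a minimal sketch using the Python standard library, not anything from the article: the function name `sanitize_rows`, the three-field layout, and the choice of which column must be an integer are all assumptions made for illustration.

```python
import csv
import logging

logger = logging.getLogger("sanitize")


def sanitize_rows(lines, delimiter="|", expected_fields=3, int_columns=(2,), has_header=True):
    """Yield cleaned rows; log and skip every row that looks questionable.

    All parameters are hypothetical defaults for a 3-column, pipe-delimited file.
    """
    # QUOTE_NONE: treat quote characters as ordinary data instead of guessing.
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for line_no, row in enumerate(reader, start=1):
        if len(row) != expected_fields:
            logger.warning("line %d: expected %d fields, got %d: %r",
                           line_no, expected_fields, len(row), row)
            continue
        cleaned = [field.strip() for field in row]
        if has_header and line_no == 1:
            yield cleaned  # the header row is passed through unchecked
            continue
        bad = False
        for idx in int_columns:
            value = cleaned[idx]
            if value == "":
                continue  # empty is treated as a missing value, not an error
            try:
                int(value)
            except ValueError:
                logger.warning("line %d: column %d is not an integer: %r",
                               line_no, idx, value)
                bad = True
                break
        if not bad:
            yield cleaned
```

The cleaned rows can then be written back out with `csv.writer` and handed to whichever reader is used downstream, so that reader never sees input it has to guess about. Whether a questionable row should be skipped, repaired, or should abort the whole load is a policy decision; skipping is just the simplest thing to show here.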


nuc1e0n, over 2 years ago

The author's assessment that using binary files would prevent such issues is flat-out wrong. If a file is corrupt, as in his data, then undefined behaviour is all you can expect, whether the file format is text-based or binary.

The thing that would prevent such issues is validation of the data you accept.
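To illustrate the point about validation being independent of the file format, here is a small sketch in which the same check runs on records decoded from a text line and from a binary blob. Everything in it is assumed for illustration: the two-field record, the binary layout (`<ic`, a 4-byte integer plus a 1-byte code), and the allowed category set are hypothetical.

```python
import struct

ALLOWED_CATEGORIES = {"A", "B"}  # hypothetical set of valid values


def decode_text(line):
    """Decode one pipe-delimited text record into a dict of raw values."""
    raw_id, category = line.rstrip("\n").split("|")
    return {"id": raw_id, "category": category}


def decode_binary(blob):
    """Decode one 5-byte binary record: little-endian int32 id + 1-byte category."""
    raw_id, code = struct.unpack("<ic", blob)
    return {"id": raw_id, "category": code.decode("ascii", errors="replace")}


def validate(record):
    """Return a list of problems with a decoded record (empty if it is valid)."""
    problems = []
    try:
        if int(record["id"]) <= 0:
            problems.append("id must be positive")
    except (TypeError, ValueError):
        problems.append(f"id is not an integer: {record['id']!r}")
    if record["category"] not in ALLOWED_CATEGORIES:
        problems.append(f"unknown category: {record['category']!r}")
    return problems
```

A binary record decodes without any parse error even when its contents are nonsense, for example `validate(decode_binary(struct.pack("<ic", -5, b"Z")))` still reports two problems. The format only changes how bad data shows up; it is the validation step that catches it.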


jasonpbecker, over 2 years ago

A comparison of duckdb, PostgreSQL, pandas, readr, and fread when reading a delimited file with strange data.

Delimited files are hell– a comparison of methods to read bad files

4 comments
