TechEcho

Hi HN, I am looking for the most accurate tool which I can use to infer Data Types from Tabular data (csv,tsv,excel)I need to be able to perform some small customization, if possible, to the detection algorithm. For example if I have a 9 digit number, starting with 0, then treat it as a String.So far - I have found Frictionless Framework [0] which seems good, but I can't see any way of specifying customizations to the profiling algorithm, and Data Profiler [1] which uses ML for type detection, and it seems I should be able to train some new rules but I need a CUDA capable machine, which at the moment I do not have.Hoping the collective HN brain can point me to something better if it exists.[0] - https://framework.frictionlessdata.io/ [1] - https://github.com/capitalone/DataProfiler

2 comments

switch007almost 4 years ago

What kind of types?<a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html" rel="nofollow">https://pandas.pydata.org/pandas-docs/stable/reference/api/p...</a> is pretty powerful (see also "parse_dates" and "converters" parameters). See also parse_excel()You can also use procedural code to look at the column data and change the type:<pre><code> # if all values in col c2 when converted to string begin with "0" and and values are of length 9, convert to int64 if df["col"].str.match("^0").all() and set(df["col"].str.len()) == {9}: df["col"] = df["col"].astype("int64")</code></pre>

评论 #27443814 未加载

评论 #27456624 未加载

quickthrower2almost 4 years ago

I’d probably knock one up in nodejs ad follows:Import a csv reader library that can stream.Read each line and apply a series of regex, each one classifying on match.Eg<pre><code> ^0\d{8}$ </code></pre> Means stringThen have a reduction rule e.g.If so far we think it’s a numeric column and we get a string then treat as string.If so far we think it’s a numeric column and we get a number it is still a numeric column.Then doing the regex and reduce in a loop will give you the final answers.Happy to knock up some example code if you wish.

Ask HN: What is the best tool to infer data type of tabular data?

2 comments

Ask HN: What is the best tool to infer data type of tabular data?

2 comments