Hi HN, I am looking for the most accurate tool which I can use to infer Data Types from Tabular data (csv,tsv,excel)<p>I need to be able to perform some small customization, if possible, to the detection algorithm. For example if I have a 9 digit number, starting with 0, then treat it as a String.<p>So far - I have found Frictionless Framework [0] which seems good, but I can't see any way of specifying customizations to the profiling algorithm, and Data Profiler [1] which uses ML for type detection, and it seems I should be able to train some new rules but I need a CUDA capable machine, which at the moment I do not have.<p>Hoping the collective HN brain can point me to something better if it exists.<p>[0] - https://framework.frictionlessdata.io/
[1] - https://github.com/capitalone/DataProfiler
What kind of types?<p><a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html" rel="nofollow">https://pandas.pydata.org/pandas-docs/stable/reference/api/p...</a> is pretty powerful (see also "parse_dates" and "converters" parameters). See also parse_excel()<p>You can also use procedural code to look at the column data and change the type:<p><pre><code> # if all values in col c2 when converted to string begin with "0" and and values are of length 9, convert to int64
if df["col"].str.match("^0").all() and set(df["col"].str.len()) == {9}:
df["col"] = df["col"].astype("int64")</code></pre>
I’d probably knock one up in nodejs ad follows:<p>Import a csv reader library that can stream.<p>Read each line and apply a series of regex, each one classifying on match.<p>Eg<p><pre><code> ^0\d{8}$
</code></pre>
Means string<p>Then have a reduction rule e.g.<p>If so far we think it’s a numeric column and we get a string then treat as string.<p>If so far we think it’s a numeric column and we get a number it is still a numeric column.<p>Then doing the regex and reduce in a loop will give you the final answers.<p>Happy to knock up some example code if you wish.