Ask HN: What is the best tool to infer data type of tabular data?

5 points by mahalel almost 4 years ago

Hi HN, I am looking for the most accurate tool I can use to infer data types from tabular data (CSV, TSV, Excel).

I need to be able to make small customizations to the detection algorithm, if possible. For example, if I have a 9-digit number starting with 0, then treat it as a string.

So far I have found Frictionless Framework [0], which seems good, but I can't see any way of specifying customizations to the profiling algorithm, and Data Profiler [1], which uses ML for type detection. It seems I should be able to train some new rules, but I need a CUDA-capable machine, which at the moment I do not have.

Hoping the collective HN brain can point me to something better if it exists.

[0] - https://framework.frictionlessdata.io/
[1] - https://github.com/capitalone/DataProfiler

2 comments

switch007 almost 4 years ago

What kind of types?

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html is pretty powerful (see also the "parse_dates" and "converters" parameters). See also read_excel().

You can also use procedural code to look at the column data and change the type:

    # If all values in "col" (held as strings) begin with "0" and have length 9, convert to int64.
    if df["col"].str.match("^0").all() and set(df["col"].str.len()) == {9}:
        df["col"] = df["col"].astype("int64")
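
A minimal sketch of that procedural approach, adapted to the original question (keep 9-digit, zero-led values as strings rather than converting them, since a cast to int64 would drop the leading zero). It assumes everything is first read as text with `dtype=str`; the sample data, the column names, and the `infer_column` helper are hypothetical and only for illustration.

```python
import io

import pandas as pd

# Hypothetical sample data; in practice this would be a file path.
CSV_TEXT = """account,amount
012345678,10.5
098765432,3
"""

# Read every column as text first so leading zeros are not lost on load.
df = pd.read_csv(io.StringIO(CSV_TEXT), dtype=str)


def infer_column(series):
    """Hypothetical helper: apply one custom rule, then fall back to numeric."""
    # Custom rule: every value is exactly 9 digits and starts with 0 -> keep as string.
    if series.str.fullmatch(r"0\d{8}", na=False).all():
        return series
    # Otherwise try a numeric conversion; leave the column as text if it fails.
    try:
        return pd.to_numeric(series)
    except (ValueError, TypeError):
        return series


for name in df.columns:
    df[name] = infer_column(df[name])

print(df.dtypes)
```

Further rules (dates, booleans, and so on) could be added as extra checks inside the helper before the numeric fallback.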
quickthrower2 almost 4 years ago

I'd probably knock one up in Node.js as follows:

Import a CSV reader library that can stream.

Read each line and apply a series of regexes, each one classifying the value on a match.

E.g.

    ^0\d{8}$

means string.

Then have a reduction rule, e.g.:

If so far we think it's a numeric column and we get a string, then treat the column as a string.

If so far we think it's a numeric column and we get a number, it is still a numeric column.

Then running the regexes and the reduction in a loop will give you the final answers.

Happy to knock up some example code if you wish.
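
The classify-then-reduce idea above could be sketched roughly as follows. This is written in Python rather than Node.js, to match the other code in the thread, and uses only the standard library. The rule list, the integer < float < string widening order, the skipping of empty cells, and all function names are assumptions made for illustration, not something the comment specifies.

```python
import csv
import io
import re

# Ordered classification rules: the first matching pattern wins.
# The zero-led 9-digit rule comes first so it beats the integer rule.
RULES = [
    ("string", re.compile(r"^0\d{8}$")),
    ("integer", re.compile(r"^-?\d+$")),
    ("float", re.compile(r"^-?\d+\.\d+$")),
]

# Assumed widening order for the reduction step.
RANK = {"integer": 0, "float": 1, "string": 2}


def classify(value):
    """Return the type name of the first rule that matches, else 'string'."""
    for name, pattern in RULES:
        if pattern.match(value):
            return name
    return "string"


def reduce_type(so_far, new):
    """Keep the wider of the two types; once a column is 'string' it stays so."""
    if so_far is None:
        return new
    return so_far if RANK[so_far] >= RANK[new] else new


def infer_types(lines):
    """Stream rows from an iterable of CSV lines and infer one type per column."""
    reader = csv.reader(lines)
    header = next(reader)
    types = [None] * len(header)
    for row in reader:
        for i, cell in enumerate(row[: len(header)]):
            if cell == "":
                continue  # assumption: empty cells are missing values, not evidence
            types[i] = reduce_type(types[i], classify(cell))
    return dict(zip(header, types))


sample = io.StringIO(
    "zip,qty,price,name\n"
    "012345678,1,2.50,foo\n"
    "098765432,20,3,bar\n"
)
print(infer_types(sample))
# {'zip': 'string', 'qty': 'integer', 'price': 'float', 'name': 'string'}
```

Because the reader consumes one row at a time, only the current per-column type is held in memory, so the same loop should work on files too large to load at once.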