Finding data items in one field that contradict data items in another field

32 points by eaguyhn over 6 years ago

2 comments

codeulike over 6 years ago
Ok so the title of the blog is 'Data ops in the Linux command line'. Sounds fun, but in the same way that 'paintball with blindfolds' would be fun.

e.g. this is one of the examples: a kind of sanity check on two fields in a tab-separated file to see if one is less than the other. The fields are identified by their position (31, 33 etc.)

    awk -F"\t" 'NR>1 && $31!="" && $33!="" && $33>$31' fish | wc -l

Surely much better to just import it into a database and do the analysis in SQL. The SQL equivalent of the above would be something like:

    SELECT * FROM FishSpecimenData
    WHERE MinDepth > MaxDepth
    AND MinDepth is not null AND MaxDepth is not null

If you're worried about type conversions while importing into SQL, just import everything as a varchar. You've still got a fairly easy job to compare the numbers:

    SELECT * FROM FishSpecimenData
    WHERE Cast(MinDepth as int) > Cast(MaxDepth as int)
    AND MinDepth is not null AND MaxDepth is not null
    AND IsNumeric(MinDepth) = 1 and IsNumeric(MaxDepth) = 1

edit: To be fair, on this page https://www.polydesmida.info/cookbook/index.html the author explains the rationale for using command line tools:

"I'm a retired scientist and I've been mucking around with data tables for nearly 50 years. I started with printed columns on paper (and a calculator) before moving to spreadsheets and relational databases (Microsoft Access, Filemaker Pro, MySQL, SQLite). In 2012 I discovered the AWK language and realised that every processing job I'd ever done with data tables could be done faster and more simply on the command line. Since then my data tables have been stored as plain text and managed with GNU/Linux command-line tools, especially AWK"

So I guess the point of the blog is to promote that approach. Fair enough.
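For what it's worth, the "fields identified by position" complaint can be softened without leaving the command line. Below is a minimal sketch, not the blog author's method: it reads the header row to map column names to positions and then compares by name. The column names MinDepth and MaxDepth are borrowed from the SQL above, and the sample rows, the temp file and the helper name count_contradictions are all made up for illustration.

```shell
#!/bin/sh
# Build a tiny tab-separated sample file; all values here are invented.
sample=$(mktemp)
printf 'id\tMinDepth\tMaxDepth\n' >  "$sample"
printf '1\t10\t50\n'              >> "$sample"
printf '2\t80\t20\n'              >> "$sample"   # contradiction: min > max
printf '3\t\t30\n'                >> "$sample"   # blank MinDepth, skipped
printf '4\t5\t5\n'                >> "$sample"   # equal, not a contradiction

# Hypothetical helper: count rows where MinDepth exceeds MaxDepth,
# finding both columns by header name rather than by field number.
count_contradictions() {
    awk -F'\t' '
        # Header row: remember the position of every column name.
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        # Data rows: skip blanks, force a numeric comparison with + 0.
        $(col["MinDepth"]) != "" && $(col["MaxDepth"]) != "" &&
            $(col["MinDepth"]) + 0 > $(col["MaxDepth"]) + 0 { n++ }
        END { print n + 0 }
    ' "$1"
}

result=$(count_contradictions "$sample")
echo "$result"
rm -f "$sample"
```

On the sample above this prints 1, since only row 2 has a minimum greater than its maximum. Unlike the IsNumeric version, the sketch does not filter out non-numeric values: `+ 0` silently treats them as 0, so a real check would want to flag those rows separately.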
lolc over 6 years ago
Oh I thought the article would be about scientific fields. Not data fields. I became increasingly irritated about the pedantry before I realized that this was the topic.

Once the confusion lifted I could enjoy the read.