Finding CSV files that start with a BOM using ripgrep

119 points by pcr910303 about 4 years ago

8 comments

asicsp about 4 years ago

> The --multiline option means the search spans multiple lines - I only want to match entire files that begin with my search term, so this means that ^ will match the start of the file, not the start of individual lines.

That's not correct because the `m` flag gets enabled by the multiline option.

    $ printf 'a\nbaz\nabc\n' | rg -U '^b'
    baz

Need to use `\A` to match start of file or disable `m` flag using `(?-m)`, but seems like there's some sort of bug though (will file an issue soon):

    $ printf 'a\nbaz\nabc\n' | rg -U '\Ab'
    baz
    $ printf 'a1\nbaz\nabc\n' | rg -U '\Ab'
    baz
    $ printf 'a12\nbaz\nabc\n' | rg -U '\Ab'
    $
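
The difference between a line anchor and an input anchor can be sketched outside ripgrep. The snippet below uses Python's `re` module, which is a different engine from the one ripgrep uses, so it only illustrates the semantics asicsp describes and does not reproduce the suspected bug. The helper name `first_match_offset` is made up for this illustration.

```python
import re

text = "a\nbaz\nabc\n"

# With the multiline flag, ^ anchors at the start of every line,
# so the "b" that opens the second line matches.
line_anchor = re.compile(r"^b", re.MULTILINE)

# \A anchors only at the very start of the input, whatever the flags are.
input_anchor = re.compile(r"\Ab", re.MULTILINE)


def first_match_offset(pattern, s):
    """Return the offset of the first match, or None if there is none."""
    m = pattern.search(s)
    return None if m is None else m.start()


print(first_match_offset(line_anchor, text))      # 2
print(first_match_offset(input_anchor, text))     # None
print(first_match_offset(input_anchor, "baz\n"))  # 0
```

By this reading, the `\Ab` searches in the comment should print nothing for any of the three inputs, since none of them begins with `b`.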
tyingq about 4 years ago

"BOM" == UTF-8 Byte Order Mark I guess.

I initially thought it was searching for "Bill of Materials" for electronics projects or similar.
nemetroid about 4 years ago

Here's a coreutils (two-liner) version:

    printf '\xEF\xBB\xBF' >bom.dat
    find . -name '*.csv' \
      -exec sh -c 'head --bytes 3 {} | cmp --quiet - bom.dat' \; \
      -print

The -exec option for find can be used as a filter (though -exec disables the default action, -print, so it must be reenabled after).

Could be made into a oneliner by replacing the 'bom.dat' argument to cmp with '<(printf ...)'.
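
The same filter (read the first three bytes, compare them with EF BB BF) can be sketched without shelling out. This is a minimal Python version, not nemetroid's method; the function names `starts_with_utf8_bom` and `csv_files_with_bom` are hypothetical, chosen for this example.

```python
import codecs
from pathlib import Path


def starts_with_utf8_bom(path):
    """Return True if the file's first three bytes are the UTF-8 BOM (EF BB BF)."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


def csv_files_with_bom(root="."):
    """Yield *.csv paths under root whose content begins with a UTF-8 BOM."""
    for path in sorted(Path(root).rglob("*.csv")):
        if path.is_file() and starts_with_utf8_bom(path):
            yield path


if __name__ == "__main__":
    for p in csv_files_with_bom("."):
        print(p)
```

Only three bytes are read per file, so large CSV files cost about the same as small ones.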
superjan about 4 years ago

One large source of byte order marks in UTF-8 is Windows. In MS-DOS and later Windows, 8-bit encoded files are assumed to be in the system code page, which, to cover all the world's writing systems, varies from country to country. When UTF-8 came along, Microsoft tools disambiguated those files from the local code page by prefixing them with a byte order mark. They also do this by default in (for instance) the .NET Framework XML libraries. I don't know what .NET Core does. I suppose it made sense at the time, but I'm sure they regret it by now.
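
For anyone reading files produced this way, here is a small sketch of what the BOM looks like after decoding. It uses Python's standard `utf-8` and `utf-8-sig` codecs; the sample CSV bytes are invented for the example.

```python
import codecs

# Invented sample: a tiny CSV written with a leading UTF-8 BOM.
data = codecs.BOM_UTF8 + "id,name\n1,Ada\n".encode("utf-8")

# Plain "utf-8" keeps the BOM as a U+FEFF character at the start of the text,
# which then ends up glued to the first column name.
as_utf8 = data.decode("utf-8")

# "utf-8-sig" drops a leading BOM if present and is harmless when there is none.
as_utf8_sig = data.decode("utf-8-sig")

print(repr(as_utf8[:3]))      # '\ufeffid'
print(repr(as_utf8_sig[:3]))  # 'id,'
```

The same codec also works in the other direction: encoding with `utf-8-sig` writes the three BOM bytes before the text.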
RedShift1 about 4 years ago

Off topic but related: why do UTF-16 and UTF-32 even exist? Doesn't UTF-8 already have the capability to go up to 32-bit-wide characters?
wodenokoto about 4 years ago

I don't know if I'm ever gonna need this, but I loved learning it!
dazfuller about 4 years ago

I probably won’t ever need this, but I love the write up for a tool which I use daily
nwellnhof about 4 years ago

Is there anything like --multiline in GNU grep?