TechEcho

8 comments

asicsp, about 4 years ago

>The --multiline option means the search spans multiple lines - I only want to match entire files that begin with my search term, so this means that ^ will match the start of the file, not the start of individual lines.

That's not correct, because the `m` flag gets enabled by the multiline option.

<pre><code>
$ printf 'a\nbaz\nabc\n' | rg -U '^b'
baz
</code></pre>

You need to use `\A` to match the start of the file, or disable the `m` flag using `(?-m)`. There seems to be some sort of bug, though (will file an issue soon):

<pre><code>
$ printf 'a\nbaz\nabc\n' | rg -U '\Ab'
baz
$ printf 'a1\nbaz\nabc\n' | rg -U '\Ab'
baz
$ printf 'a12\nbaz\nabc\n' | rg -U '\Ab'
$
</code></pre>
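
To make the distinction concrete without depending on ripgrep, here is a rough sketch using only grep and head. The two helper names are made up for illustration, and the prefix is assumed to be plain ASCII with no regex metacharacters. The first helper behaves like a line-anchored `^`, the second like a file-anchored `\A`.

```shell
#!/bin/sh
# Both helpers read the text to test from standard input.

# count_lines_starting_with PREFIX (hypothetical name): print how many
# lines begin with PREFIX. This is the line-anchored reading of ^.
# Assumes PREFIX contains no regex metacharacters.
count_lines_starting_with() {
    grep -c "^$1"
}

# input_starts_with PREFIX (hypothetical name): succeed only if the very
# first bytes of the input are PREFIX, which is what \A is meant to express.
# Assumes an ASCII PREFIX, so its character count equals its byte count.
input_starts_with() {
    [ "$(head -c "${#1}")" = "$1" ]
}

# Same sample input as in the comment above: the second line starts with "b",
# but the input as a whole starts with "a".
printf 'a\nbaz\nabc\n' | count_lines_starting_with b
printf 'a\nbaz\nabc\n' | input_starts_with b || echo 'input does not start with b'
```

The first call counts one matching line, while the second reports that the input itself does not start with "b".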

tyingq, about 4 years ago

"BOM" == UTF-8 Byte Order Mark, I guess. I initially thought it was searching for "Bill of Materials" for electronics projects or similar.

nemetroid, about 4 years ago

Here's a coreutils (two-liner) version:

<pre><code>
printf '\xEF\xBB\xBF' >bom.dat
find . -name '*.csv' \
  -exec sh -c 'head --bytes 3 {} | cmp --quiet - bom.dat' \; \
  -print
</code></pre>

The -exec option for find can be used as a filter (though -exec disables the default action, -print, so it must be re-enabled afterwards). This could be made into a one-liner by replacing the 'bom.dat' argument to cmp with '<(printf ...)'.
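
For what it's worth, `<(...)` is process substitution, which needs bash (or zsh/ksh) rather than plain sh, and `\x` escapes in printf are not specified by POSIX. Below is a rough sketch of a variant that avoids both, as well as the temporary file, by comparing a hex dump of the first three bytes. The function names `has_bom` and `list_bom_csv` are made up for illustration, and the loop assumes file names contain no newlines.

```shell
#!/bin/sh
# has_bom FILE (hypothetical name): succeed if FILE begins with the
# UTF-8 BOM bytes EF BB BF.
has_bom() {
    # od prints the first three bytes as hex; tr strips spaces and newlines.
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# list_bom_csv DIR (hypothetical name): print every *.csv file under DIR
# that starts with a BOM. Assumes file names contain no newlines.
list_bom_csv() {
    find "$1" -type f -name '*.csv' | while IFS= read -r f; do
        if has_bom "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

Calling `list_bom_csv .` should print roughly what the find command above prints. To create a test file portably, octal escapes work in any POSIX printf: `printf '\357\273\277a,b\n' >with-bom.csv`.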

superjan, about 4 years ago

One large source of byte order marks in UTF-8 is Windows. In MS-DOS and later Windows, 8-bit encoded files are assumed to be in the system code page, which, to cover all the world's writing systems, varies from country to country. When UTF-8 came along, Microsoft tools distinguished UTF-8 files from those in the local code page by prefixing them with a byte order mark. They also do this, by default, in (for instance) the .NET Framework XML libraries. I don’t know what .NET Core does. I suppose it made sense at the time, but I’m sure they regret it by now.

RedShift1, about 4 years ago

Off topic but related: why do UTF-16 and UTF-32 even exist? Doesn't UTF-8 already have the capability to go up to 32-bit-wide characters?

wodenokoto, about 4 years ago

I don’t know if I’m ever gonna need this, but I loved learning it!

dazfuller, about 4 years ago

I probably won’t ever need this, but I love the write-up for a tool which I use daily.

nwellnhof, about 4 years ago

Is there anything like --multiline in GNU grep?

Finding CSV files that start with a BOM using ripgrep