How quickly can you remove spaces from a string?

201 points by deafcalculus over 8 years ago

25 comments

jakobegger over 8 years ago

Fun story: When I implemented the syntax highlighting feature in Sequel Pro, I needed a way to determine the string length of UTF-8 encoded strings. I didn't find a function available on macOS that worked directly on a char*, so I googled and found a really simple UTF-8 strlen function. It was easy to understand and very fast. I think it was this: <a href="http://canonical.org/~kragen/strlen-utf8.html" rel="nofollow">http://canonical.org/~kragen/strlen-utf8.html</a>

I committed my syntax highlighting code, and a few days later someone had replaced the simple UTF-8 strlen function with the really long vectorised version from this page: <a href="http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html" rel="nofollow">http://www.daemonology.net/blog/2008-06-05-faster-utf8-strle...</a>

But the funny thing is that the supposedly fast vectorised strlen was optimised for very long strings. The benchmarks show results for megabytes of text. But we measured the length of tokens, usually only a few characters long, so in most cases the new function was actually slower!

I was a very junior dev, didn't want to piss off anybody, and the strlen wasn't in the hot path anyway, so I didn't say anything. But I was a bit sad that my easy-to-read code was replaced by such a monstrosity.

What's my point? Before you go and use these functions in your code, profile your code to see if it would actually affect performance.
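
For readers unfamiliar with the "really simple" approach: it usually boils down to counting every byte that is not a UTF-8 continuation byte (bit pattern 10xxxxxx). The Python sketch below illustrates that idea only. It is not copied from either linked page, and the name `utf8_strlen` is chosen here for illustration.

```python
def utf8_strlen(data: bytes) -> int:
    """Count code points in UTF-8 data by skipping continuation bytes."""
    # A continuation byte has its top two bits set to 10, i.e. (b & 0xC0) == 0x80.
    # Every other byte starts a new code point.
    return sum(1 for b in data if (b & 0xC0) != 0x80)


# "naïve Ω" written with escapes so the source file encoding does not matter.
sample = "na\u00efve \u03a9".encode("utf-8")
byte_length = len(sample)           # number of bytes
char_length = utf8_strlen(sample)   # number of code points
```

This assumes the input is already valid UTF-8. For short tokens like the ones described above, a plain loop of this kind has essentially no setup cost, which is the point the comment makes.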

dzdt over 8 years ago

Somehow the article fails to mention that the speed will depend heavily on the input string: both its size and distribution of whitespace characters.

Even assuming the question is for the limit of very long strings, the distribution makes a huge difference. Natural English has spaces on average every 5.1 characters [1], so using multi-character tests to speed up the case of runs of 8 or more characters without a space will probably slow it down, not speed it up!

[1] <a href="https://arxiv.org/pdf/1208.6109" rel="nofollow">https://arxiv.org/pdf/1208.6109</a>
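
One way to check this claim on your own data is to measure how often an 8-byte block actually contains no whitespace, since that is the only case where a multi-byte fast path pays off. The sketch below is a hypothetical helper, not part of the article's benchmark. The name `fraction_of_clean_blocks` and the choice of space, LF, and CR as the whitespace set are assumptions.

```python
def fraction_of_clean_blocks(data: bytes, block: int = 8) -> float:
    """Return the share of full, aligned blocks that contain no whitespace."""
    whitespace = b" \n\r"
    full_blocks = len(data) // block
    if full_blocks == 0:
        return 0.0
    clean = 0
    for i in range(full_blocks):
        chunk = data[i * block:(i + 1) * block]
        # Iterating over bytes yields ints; "int in bytes" tests membership.
        if not any(b in whitespace for b in chunk):
            clean += 1
    return clean / full_blocks
```

A value close to 0 on realistic input would support the comment's point that the multi-character test mostly adds overhead.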

computator over 8 years ago

What strikes me is how many sites won't accept a login name, credit card number, phone number, or other field with leading or trailing spaces. This is for something where the speed of the code doesn't matter at all. I can't think of any explanation other than incredible laziness or incompetence for not stripping off spaces.

You might not notice this if you use a password manager or browser autofill, but it's a lot of sites, including major airlines, for example.

Never mind that -- it's just a miracle when you can enter a credit card number formatted as 4123 4567 8901 2345 rather than squished together.
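
As a rough illustration of how little code the forgiving behaviour needs, here is a minimal Python sketch. The function name `normalize_card_number` and the validation rule (digits only after whitespace is removed) are assumptions made for this example, not any particular site's logic.

```python
def normalize_card_number(raw: str) -> str:
    """Remove all whitespace from user input and check that only digits remain."""
    # str.split() with no arguments splits on any run of whitespace and drops
    # leading/trailing whitespace, so joining the pieces removes it everywhere.
    digits = "".join(raw.split())
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("card number must contain only digits and spaces")
    return digits


cleaned = normalize_card_number(" 4123 4567 8901 2345 ")
```

The same `"".join(raw.split())` idiom covers leading and trailing spaces in login names or phone numbers.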

dsp1234 over 8 years ago

"This code will work on all UTF-8 encoded strings… which is the bulk of the strings found on the Internet if you consider that UTF-8 is a superset of ASCII."

Note that the first set of code (and possibly the rest) only works because space, newline, and carriage return are in the 7-bit ASCII set, which is included in UTF-8. However, the extended 8-bit ASCII set is not, but is often included when people speak of ASCII. So for example, if the request was to remove all "Copyright Sign" symbols, which is U+00A9, it would not work correctly. The UTF-8 encoding for this symbol is 0xC2 0xA9, but the code only works on individual bytes, so it would remove the A9 byte, leaving a C2 byte and then whatever byte came next. Additionally, it would hit other UTF-8 characters like the "Greek Capital Letter Omega" (Ω, which is encoded in UTF-8 as 0xCE 0xA9).

tl;dr Only works for the 7-bit ASCII set, but not the common extended 8-bit ASCII sets.
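
The failure mode described above is easy to reproduce. The Python sketch below uses a hypothetical byte-wise filter (`remove_byte`, a stand-in for the article's C loop, not its actual code) and shows that deleting the ASCII space byte is safe while deleting byte 0xA9 corrupts the string.

```python
def remove_byte(data: bytes, target: int) -> bytes:
    """Drop every occurrence of one byte value, ignoring character boundaries."""
    return bytes(b for b in data if b != target)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "\u00a9 2017 \u03a9"        # "© 2017 Ω"
encoded = text.encode("utf-8")     # C2 A9 20 32 30 31 37 20 CE A9

# Bytes below 0x80 never appear inside a multi-byte sequence, so this is safe.
without_spaces = remove_byte(encoded, 0x20)

# 0xA9 is the second byte of both © (C2 A9) and Ω (CE A9); removing it
# leaves the lead bytes C2 and CE dangling.
damaged = remove_byte(encoded, 0xA9)

# Operating on decoded code points removes only the copyright sign.
safe = text.replace("\u00a9", "")
```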

nogracias over 8 years ago

6.5MB for the tables to drive this faster de-spacer? No thank you. It would blow out the L1 cache and make your program slower.

<a href="https://raw.githubusercontent.com/lemire/despacer/master/include/despacer_tables.h" rel="nofollow">https://raw.githubusercontent.com/lemire/despacer/master/inc...</a>

et1337 over 8 years ago

Cached text-only version: <a href="http://webcache.googleusercontent.com/search?q=cache%3Ahttp%3A%2F%2Flemire.me%2Fblog%2F2017%2F01%2F20%2Fhow-quickly-can-you-remove-spaces-from-a-string%2F&strip=1" rel="nofollow">http://webcache.googleusercontent.com/search?q=cache%3Ahttp%...</a>

brianpgordon over 8 years ago

This reminded me of spray-json's JsonParser. It has a bit of Scala code to seek past whitespace:

<pre><code>@tailrec private def ws(): Unit =
  // fast test whether cursorChar is one of " \n\r\t"
  if (((1L << cursorChar) & ((cursorChar - 64) >> 31) & 0x100002600L) != 0L) { advance(); ws() }
</code></pre>

<a href="https://github.com/spray/spray-json/blob/765c83248e0bbe867ddc9d479b4fe79493569a54/src/main/scala/spray/json/JsonParser.scala#L186" rel="nofollow">https://github.com/spray/spray-json/blob/765c83248e0bbe867dd...</a>
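
To unpack the trick: the constant 0x100002600 has exactly bits 9 (tab), 10 (LF), 13 (CR), and 32 (space) set, and the `(cursorChar - 64) >> 31` term is all ones only when the character is below 64, which stops the 64-bit shift from wrapping around. The Python sketch below expresses the same test. It is a paraphrase for illustration, not spray-json code, and because Python integers are unbounded it uses an explicit range check instead of the sign-bit trick.

```python
# One bit per whitespace code point: space, CR, LF, tab.
WHITESPACE_MASK = (1 << 0x20) | (1 << 0x0D) | (1 << 0x0A) | (1 << 0x09)


def is_json_whitespace(code_point: int) -> bool:
    """Return True if code_point is one of space, LF, CR, or tab."""
    # The range check plays the role of ((cursorChar - 64) >> 31) in the
    # Scala version: anything at or above 64 can never match.
    if not 0 <= code_point < 64:
        return False
    return ((WHITESPACE_MASK >> code_point) & 1) == 1
```

Without the guard, a 64-bit shift would treat the backtick (96 = 32 + 64) like a space, because the shift count wraps modulo 64.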

octo_t over 8 years ago

So using 128-bit instructions would imply you had 'words' which were over 16 characters long on average, right?

The average (English) word is ~5 characters long, so most of the time, you'd be forced to check anyway.

mnarayan01 over 8 years ago

I feel like you have to at least include non-breaking spaces if you're going to say you're removing spaces from UTF-8 strings.

criddell over 8 years ago

Is there a clever way to remove all the space characters?

<a href="https://www.cs.tut.fi/~jkorpela/chars/spaces.html" rel="nofollow">https://www.cs.tut.fi/~jkorpela/chars/spaces.html</a>
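
Not clever, and certainly not fast, but for comparison: working on decoded code points instead of bytes makes the broader Unicode case straightforward. The Python sketch below relies on `str.isspace()`, which covers the no-break space (U+00A0) and the other space-separator characters. The function name is made up for this example. Note that it does not treat the zero-width space (U+200B) as whitespace, so it may not match every entry on the linked list.

```python
def remove_unicode_whitespace(text: str) -> str:
    """Drop every character that Python's str.isspace() considers whitespace."""
    return "".join(ch for ch in text if not ch.isspace())


# No-break space, em space, ASCII space, ideographic space, tab, newline.
mixed = "a\u00a0b\u2003c d\u3000e\tf\n"
stripped = remove_unicode_whitespace(mixed)
```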

stelund over 8 years ago

Maybe the asm instruction to scan for a byte in memory is an alternative: repne scasb.

It won't find all characters in a single scan. But maybe do 3 passes over a buffer which fits in cache.

aib over 8 years ago

I wonder how much of a difference using a separate destination buffer would make. Apart from some memory/cache/invalidation/magic thing that I'm not certain might occur, it would allow one to use the extended instructions used for scanning.

Annatar over 8 years ago

<pre><code>awk ' { gsub(/[[:blank:]\015]/, ""); printf("%s", $0); }' input | tee output
</code></pre>

Stripping out LF isn't necessary because AWK does it on every record automatically. For even more speed, the code can be translated into ANSI C and compiled with awka [1] using an optimizing C compiler.

[1] <a href="http://awka.sourceforge.net/index.html" rel="nofollow">http://awka.sourceforge.net/index.html</a>

exDM69 over 8 years ago

I am surprised to see any speedup in this! I'd expect something trivial like this to be completely memory bound with the CPU sitting almost idle waiting for bytes coming in from memory.

Looking at the benchmark code, this is using rdtsc to read the CPU time stamp counter. That does not take waiting for memory into account, does it?

I wonder if there's a difference when measured in wall clock time. It's still somewhat beneficial to have the CPU work efficiently to give an opportunity for hyperthreading to take place when waiting for memory.

If you really wanted to make something like this faster, you should focus on cache utilization and make use of prefetching instructions. x86 has pretty bad prefetching instructions and pretty good speculative fetching, so don't expect massive speedups, but on ARM or AArch64 you have finer-grained control over cache prefetching (L1 and L2 separately) and you could see much bigger differences.

As for benchmarking this kind of problem: you obviously want to measure real world performance, so you need wall clock time as well as time stamp counter, but I'd look for optimization clues in "perf stat" and other CPU perf counters, with an emphasis on cache misses and branch mispredictions.

The figure you should be staring at is the total throughput of the algorithm, measured in gigabytes per second. You should be getting figures close to the memory bandwidth available (25-50 GB/s depending on CPU and memory).

edit: I measured the wall clock time with clock_gettime before/after all the repeats (using a megabyte sized buffer) and there is indeed no significant difference, here are my results:

<pre><code>memcpy(tmpbuffer,buffer,N): 0.122945 cycles / ops 1495907352 nsec (1.495907 sec)
countspaces(buffer, N): 3.657322 cycles / ops 1544915395 nsec (1.544915 sec)
despace(buffer, N): 6.521193 cycles / ops 1621204460 nsec (1.621204 sec)
faster_despace(buffer, N): 1.721657 cycles / ops 1500507217 nsec (1.500507 sec)
despace64(buffer, N): 3.595031 cycles / ops 1544993649 nsec (1.544994 sec)
despace_to(buffer, N, tmpbuffer): 6.307885 cycles / ops 1615101563 nsec (1.615102 sec)
avx2_countspaces(buffer, N): 0.190992 cycles / ops 1460961459 nsec (1.460961 sec)
avx2_despace(buffer, N): 5.750583 cycles / ops 1615971010 nsec (1.615971 sec)
sse4_despace(buffer, N): 0.985002 cycles / ops 1482901389 nsec (1.482901 sec)
sse4_despace_branchless(buffer, N): 0.338737 cycles / ops 1460874704 nsec (1.460875 sec)
sse4_despace_trail(buffer, N): 1.950657 cycles / ops 1502268447 nsec (1.502268 sec)
sse42_despace_branchless(buffer, N): 0.562246 cycles / ops 1468638389 nsec (1.468638 sec)
sse42_despace_branchless_lookup(buffer, N): 0.624913 cycles / ops 1472445127 nsec (1.472445 sec)
sse42_despace_to(buffer, N,tmpbuffer): 1.747046 cycles / ops 1507705780 nsec (1.507706 sec)
</code></pre>

Here's the diff to the original: <a href="http://pasteall.org/208511" rel="nofollow">http://pasteall.org/208511</a>

edit2: surprisingly, Clang is about 10% slower than GCC in my experiments.
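
The throughput figure recommended above is just bytes processed divided by wall-clock time. The Python sketch below shows the arithmetic with a hypothetical harness. `throughput_gb_per_s` and the stand-in `despace` workload are assumptions for illustration, not the benchmark from the article or the linked diff, and Python numbers will be far below memory bandwidth.

```python
import time


def throughput_gb_per_s(func, data: bytes, repeats: int = 20) -> float:
    """Return decimal gigabytes processed per second of wall-clock time."""
    start = time.perf_counter_ns()
    for _ in range(repeats):
        func(data)
    # Clamp to 1 ns so a very coarse timer cannot cause a division by zero.
    elapsed_ns = max(time.perf_counter_ns() - start, 1)
    # Bytes per nanosecond is numerically equal to GB/s (10**9 bytes per second).
    return (len(data) * repeats) / elapsed_ns


def despace(data: bytes) -> bytes:
    # Stand-in workload: delete space, LF, and CR bytes.
    return data.translate(None, b" \n\r")


buffer = b"some text with spaces\r\n" * 50_000
rate = throughput_gb_per_s(despace, buffer)
```

Comparing `rate` for the despacing function against the same measurement for a plain copy gives a rough sense of how close the code is to being memory bound.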

chrismorgan over 8 years ago

A variant of this problem yields further interesting possibilities: suppose you’re trying to remove control characters like CR and LF, but replacing them with whitespace would be acceptable. Then you can work on the buffer in place, without needing to copy memory or allocate or anything like that.
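
A minimal sketch of that variant in Python, assuming the data lives in a mutable `bytearray` and that a single ASCII space is an acceptable replacement (the function name is hypothetical):

```python
def blank_out_line_breaks(buf: bytearray) -> None:
    """Overwrite every CR and LF byte with a space, in place."""
    for i, b in enumerate(buf):
        if b == 0x0D or b == 0x0A:
            # Same-size overwrite: nothing after position i has to move.
            buf[i] = 0x20


line = bytearray(b"a\r\nb\nc")
blank_out_line_breaks(line)
```

Because the length never changes, each byte can be processed independently, which is what makes this variant simpler than true removal.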

saretired over 8 years ago

The “optimized” functions have a bug when the number of remaining bytes is less than the block size.
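
Whether or not the article's code has this bug (not verified here), the usual pattern for avoiding it is to run the block-wise loop only over complete blocks and finish the remainder with a scalar loop. The Python sketch below shows that structure. `BLOCK`, `WHITESPACE`, and `despace_blockwise` are names made up for this example, and the per-block step is a stand-in for a SIMD routine.

```python
BLOCK = 16                 # hypothetical vector width in bytes
WHITESPACE = b" \n\r"


def despace_blockwise(data: bytes) -> bytes:
    """Remove space, LF, and CR using full blocks plus a scalar tail."""
    out = bytearray()
    full = len(data) - (len(data) % BLOCK)   # bytes covered by whole blocks

    # Main loop: only ever touches complete blocks.
    for start in range(0, full, BLOCK):
        block = data[start:start + BLOCK]
        out += block.translate(None, WHITESPACE)

    # Scalar tail: the remaining 0 to BLOCK - 1 bytes, one at a time.
    for byte in data[full:]:
        if byte not in WHITESPACE:
            out.append(byte)
    return bytes(out)
```

Testing with every input length from 0 up to a few blocks, as opposed to only multiples of the block size, is a cheap way to catch this class of bug.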

Too over 8 years ago

Someone file a compiler bug? 14x difference between the readable code and the optimized code is a lot. The first code is extremely straightforward, you shouldn't have to deal with that SIMD mess manually.

fisherjeff over 8 years ago

Adding lookup tables to the naïve implementation gave me a ~3x speedup with virtually no extra effort. Branch penalties are a real killer.
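
The comment does not show its code, so the following is only a guess at the general shape: a 256-entry table says whether each byte is kept, every byte is written unconditionally, and the write position advances by the table value, which removes the data-dependent branch. It is written in Python to show the structure. The branch-prediction benefit itself only shows up in compiled code.

```python
# KEEP[b] is 0 for space, LF, and CR, and 1 for every other byte value.
KEEP = bytes(0 if b in b" \n\r" else 1 for b in range(256))


def despace_lookup(buf: bytearray) -> int:
    """Compact buf in place and return the new logical length."""
    pos = 0
    for b in buf:
        buf[pos] = b       # always write; pos never runs ahead of the read index
        pos += KEEP[b]     # advance only if the byte should be kept
    return pos


data = bytearray(b" a b\r\nc ")
new_length = despace_lookup(data)
```

The caller treats `data[:new_length]` as the result, much as a C version would return the new length of the buffer.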

ascotan over 8 years ago

I guess you could write CUDA code to do this on a GPU, but then the question becomes why? :/

Darge over 8 years ago

Does anyone know how to do such a benchmark?

ebbv over 8 years ago

Optimizing by hand is rarely the fastest thing to do nowadays. I wonder how the original naive approach fares with all optimizations turned on for gcc and with realistic input?

cowardlydragon over 8 years ago

ASCII/1 byte characters? I stopped reading then.

UNICODE or GTFO.

Should be embarrassingly parallel for a contiguous array.

johnnyb9 over 8 years ago

Going to start using this as an interview question!

austincheney over 8 years ago

Isn't this the very kind of thing Regular Expressions were created to solve?

* Remove spaces - myString.replace(/\u0020+/g, "");
* Remove common line terminators - myString.replace(/(\r|\n)+/g, "");
* Remove all white space - myString.replace(/\s+/g, "");

hcrisp over 8 years ago

Not claiming to be the fastest, but here are two Python solutions suggested by Dave Beazley [0], plus one I extended using compiled regular expressions:

<pre><code># string replacement
s = ' hello world \n'
s.replace(' ', '')       # 'helloworld\n' (only spaces are removed, the newline stays)

# regular expression
import re
re.sub(r'\s+', '', s)    # 'helloworld'

# compiled regular expression
pat = re.compile(r'\s+')
pat.sub('', s)           # 'helloworld'
</code></pre>

[0] <a href="https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0ahUKEwiDwpPQytHRAhVJWCwKHSTmD5wQFggaMAA&url=https%3A%2F%2Fwww.safaribooksonline.com%2Flibrary%2Fview%2Fpython-cookbook-3rd%2F9781449357337%2Fch02s11.html&usg=AFQjCNEZqL6SV76UmE4ojyJ5poOgeCexEQ&sig2=I9kGVDVg5DyLxZ3pCJ1FIw&bvm=bv.144224172,d.bGg" rel="nofollow">https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&c...</a>
