Re: the Rust optimized implementation, I was able to get ~20-25% better performance by rewriting the for loops as iterators, using Rust's byte range pattern matching, and a buffered writer, which seems crazy, but it's true. I chalked it up to ILP/SIMD tricks the compiler is doing.<p>I even submitted a PR[0], but Ben was tired of maintaining the project and decided to archive it (which, fair enough!).<p>[0]: <a href="https://github.com/benhoyt/countwords/pull/115" rel="nofollow">https://github.com/benhoyt/countwords/pull/115</a>
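For those curious, a minimal sketch of the three techniques combined (an illustration, not the actual PR diff): an iterator chain over raw bytes, byte-range patterns for classification, and a BufWriter so output isn't one syscall per write.<p><pre><code> use std::io::{self, BufWriter, Read, Write};

 fn main() -> io::Result<()> {
     let mut input = Vec::new();
     io::stdin().read_to_end(&mut input)?;

     // Byte-range patterns classify bytes with no UTF-8 decoding step.
     let normalized: Vec&lt;u8&gt; = input
         .iter()
         .map(|&b| match b {
             b'A'..=b'Z' => b + 32, // ASCII lowercase
             b'a'..=b'z' => b,
             _ => b' ', // treat everything else as a separator
         })
         .collect();

     // Buffered output amortizes syscalls across many small writes.
     let stdout = io::stdout();
     let mut out = BufWriter::new(stdout.lock());
     out.write_all(&normalized)?;
     Ok(())
 }</code></pre>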
For those curious about Rust optimization, the “optimized” version makes three changes compared to the “idiomatic” version:<p>* It uses byte strings instead of UTF-8 strings. In my opinion, that’s not an optimization, that’s changing the problem. Depending on the question you’re asking, only one of the two can be correct.<p>* It uses a faster hash algorithm. It’s not the first time this came up in a benchmark article. Rust’s decision to use a DOS-safe hash by default (and not provide a fast algorithm in the std, like other languages do) really seems to hurt it in that kind of microbenchmark.<p>* It uses get_mut+insert instead of the more convenient HashMap::entry method, because the latter would require redundantly allocating the key even in the repeat case. I’ve hit this problem in the past as well. Maybe the upcoming HashMap::raw_entry_mut will make this kind of optimization cleaner.
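To make that last point concrete, here is a sketch of the two access patterns (types simplified; not the repo's actual code). The entry API needs an owned String up front, even when the word is already in the map, while get_mut+insert only allocates on a miss:<p><pre><code> use std::collections::HashMap;

 fn add_word(counts: &mut HashMap&lt;String, u64&gt;, word: &str) {
     // Convenient, but allocates a String on every call, hit or miss:
     //     *counts.entry(word.to_string()).or_insert(0) += 1;

     // Fast path: the word is already present; no allocation.
     if let Some(n) = counts.get_mut(word) {
         *n += 1;
         return;
     }
     // Slow path: first sighting; allocate the owned key exactly once.
     counts.insert(word.to_string(), 1);
 }</code></pre>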
I love using the UNIX shell (pipes and programs that do "one thing well") to the point where it hurts my ability to improve at other languages. But sometimes all you need is a throwaway command to do something once, not a beautifully optimized function that will be written, improved, compiled, profiled, reoptimized and then executed a billion times. The utter terseness of a bash one-liner just tickles me.
How about extending this to comparing programming language efficiency from a developer's standpoint, looking at bytes of code, versus the execution time?<p>I took the source code file size from the repository, and the runtime from the blog post.<p>Then I made an arbitrary overall "PAIN SCORE" (lower is better) by multiplying code size * runtime. I suggest this is a worthwhile metric simply because lower is better on both axes, but of course, in the "real world" there will be different economic costs to CPU time and developer time depending on the use case. Here are the sorted results, from least "pain" to most:<p><pre><code> LANGUAGE    FILENAME        CODE SIZE   RUNTIME   PAIN SCORE (LOWER IS BETTER)
 Shell       optimized.sh     75 bytes    1.83 s    137.25
 Crystal     simple.cr       240 bytes    1.29 s    309.6
 Nim         simple.nim      424 bytes    0.77 s    326.48
 Python      simple.py       208 bytes    2.21 s    459.68
 Ruby        simple.rb       175 bytes    3.17 s    554.75
 Go          optimized.go   1514 bytes    0.40 s    605.6
 Python      optimized.py    464 bytes    1.33 s    617.12
 Zig         optimized.zig  2688 bytes    0.24 s    645.12
 Zig         simple.zig     1394 bytes    0.55 s    766.7
 Go          simple.go       688 bytes    1.12 s    770.56
 Nim         optimized.nim  1683 bytes    0.49 s    824.67
 Shell       simple.sh        60 bytes   14.81 s    888.6
 Ruby        optimized.rb    401 bytes    2.47 s    990.47
 JavaScript  simple.js       532 bytes    1.88 s   1000.16
 C           optimized.c    4360 bytes    0.23 s   1002.80
 Rust        optimized.rs   3065 bytes    0.43 s   1317.95
 Swift       simple.swift    317 bytes    4.23 s   1340.91
 JavaScript  optimized.js   1501 bytes    1.10 s   1651.1
 C           simple.c       2735 bytes    0.96 s   2625.6
 Rust        simple.rs      2239 bytes    1.38 s   3089.82
</code></pre>
Sorting only by code size, the most concise implementations are: Shell, Ruby, Python, Crystal. Nobody was aiming to play code golf (i.e. minimize source code size), so these are fairly straightforward, idiomatic, readable implementations.<p>I am definitely a Crystal fan, in fact this afternoon I'm continuing to implement suggestions from my recent Show HN <a href="https://news.ycombinator.com/item?id=32081943" rel="nofollow">https://news.ycombinator.com/item?id=32081943</a> comments. :)
This is a rather meaningless comparison, since the differences are going to be dominated by:<p>1) The choice of libraries/datatypes used for strings and word->count map<p>2) How the source file is split into words - probably library function again, although in C/C++ one <i>could</i> choose to implement a super-optimized low level version that would blow the others away<p>IMO a performance comparison between languages is only meaningful if the runtime is not dominated by library functions, or if one admits it's really a standard library performance comparison, not a language comparison.
Recently there was a contest to optimize the performance of a similar program[0] and a Zoom discussion of the optimizations[1].<p>The programs are not comparable in the following ways:<p>- Case: TFA requires (at least) ASCII lowercasing but the contest problem required no lowercasing.<p>- Ordering: TFA does not require sorting, but the contest problem required sorting.<p>- Memory: TFA imposes a requirement phrased as "don't read the whole file into memory" and this sounds like it's a resource-saving constraint, but it's usually a constraint that requires the program to spend additional resources. You could just mmap the file and store pointers into the mapped region. It costs extra to do copies instead of no copies.<p>- Text: TFA is unclear on what assumptions may be made about the lengths of words. For the contest problem, the Hungarian Wikipedia input's longest word is around 80k.<p>- Safe, Hashing, Stdlib: TFA imposes some restrictions on what constructs may be used that are not imposed in the contest problem.<p>For the contest version of this problem, it seems like you can tokenize, hash, and count strings at around 1GB/s. Adapting a solution to solve TFA's problem (but not to conform to its Safe/Hashing/Stdlib requirements) would probably not carry too large of a penalty, since it's like 3 instructions to ASCII-lowercase 32 bytes and 1 string copy per unique string should take negligible time compared to the hash table lookups. So there is some room for the optimized solutions to go a little faster, if more optimizations are permitted.<p>[0]: <a href="https://easyperf.net/blog/2022/05/28/Performance-analysis-and-tuning-contest-6" rel="nofollow">https://easyperf.net/blog/2022/05/28/Performance-analysis-an...</a><p>[1]: <a href="https://www.youtube.com/watch?v=R_yX0XjdSBY" rel="nofollow">https://www.youtube.com/watch?v=R_yX0XjdSBY</a>
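On the cheapness of lowercasing: the scalar form is one compare and one OR per byte, which compilers readily autovectorize over a slice. A sketch (assuming ASCII-only input, as in the contest):<p><pre><code> // Set bit 0x20 on bytes in 'A'..='Z', leaving everything else untouched.
 // Over a slice, LLVM typically autovectorizes this loop to wide SIMD ops.
 fn ascii_lowercase_in_place(buf: &mut [u8]) {
     for b in buf.iter_mut() {
         *b |= ((*b >= b'A') as u8 & (*b <= b'Z') as u8) << 5;
     }
 }</code></pre>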
Funny how you can take the "optimized" C++ and make it significantly faster (vs. gnu/linux standard libraries) with little effort. A lookup table for to_lower, instead of bit math, helps slightly. Building with the LLVM-libc string functions helps a lot, because the GNU std::string_view::operator== is calling an AVX-optimized memcmp via the PLT, which is a terrible strategy for short strings. LLVM-libc has an inlined bcmp (note: bcmp can be significantly faster than memcmp, but on GNU they are aliases) that is resolved for the target platform at build time instead of run time.<p>Edit:<p>Even if you ignore LLVM-libc, just slapping this into the optimized.cpp and replacing the critical `==` with `our_bcmp` makes it 10% faster. IFUNC calls to micro-optimized SIMD functions are counterproductive. It is far, far better that the compiler can see all the code at build time.<p><pre><code> int our_bcmp (const char* a, const char* b, size_t sz) {
for (size_t i = 0; i < sz; i++) {
if (a[i] != b[i]) return 1;
}
return 0;
}</code></pre>
EDIT: I think the use of the term "allocation" might be overloaded here from the original article; after thinking about this a bit more, I think the author is referring to a different way of accessing the map. If you look in godbolt <a href="https://godbolt.org/z/Yf64rodW6" rel="nofollow">https://godbolt.org/z/Yf64rodW6</a> you can see that the pointer version uses<p><pre><code> CALL runtime.mapaccess2_fast64(SB)
</code></pre>
whereas the 'raw' version uses<p><pre><code> CALL runtime.mapassign_fast64(SB)
</code></pre>
When reading the Go solution this bit stood out to me<p>> To reduce the allocations, we’ll use a map[string]*int instead of map[string]int so we only have to allocate once per unique word, instead of for every increment<p>I just tried benchmarking this with a simple setup and I get zero allocations for both approaches, although the "pointer" approach is <i>slightly</i> faster<p><pre><code> package main
import (
"testing"
)
func BenchmarkIncrementMapRawInt(b *testing.B) {
var data = make(map[int]int)
b.ResetTimer()
for i := 0; i < b.N; i++ {
key := i % 1000
data[key]++
}
}
func BenchmarkIncrementMapPointerInt(b *testing.B) {
var data = make(map[int]*int)
b.ResetTimer()
for i := 0; i < b.N; i++ {
key := i % 1000
increment(data, key)
}
}
func increment(counts map[int]*int, value int) {
if p, ok := counts[value]; ok {
*p++
return
}
n := 1
counts[value] = &n
}
$ go test -bench=. -benchmem .
goos: darwin
goarch: arm64
BenchmarkIncrementMapRawInt-8 112288002 10.57 ns/op 0 B/op 0 allocs/op
BenchmarkIncrementMapPointerInt-8 139728302 8.586 ns/op 0 B/op 0 allocs/op</code></pre>
I like the OCaml version. It's both very easy to read and comes out really well in the benchmark.<p>I also looked at the "simple" Zig version, which came out really well in the benchmark, and to me it didn't look simple at all. It seems you need to make a lot of low level details explicit.<p>But IMHO AWK takes the crown here. :)
Wow. Swift, touted as "safe by design and (...) runs lightning-fast"[1] is more of a screw-up than I thought. Almost twice as slow as Lua and behind even Pascal and Forth.<p>[1] <a href="https://developer.apple.com/swift/" rel="nofollow">https://developer.apple.com/swift/</a>
I've checked with ClickHouse and the result is better than I expected... it runs in 0.043 sec. on my machine, which is faster than any other result.<p>The code:<p>SELECT arrayJoin(splitByChar(' ', lower(line))) AS word, count() AS c FROM file('kjvbible.txt', LineAsString) WHERE notEmpty(word) GROUP BY word ORDER BY c DESC FORMAT Null<p>or:<p>clickhouse-local --query "SELECT arrayJoin(splitByChar(' ', lower(line))) AS word, count() AS c FROM file('kjvbible.txt', LineAsString) WHERE notEmpty(word) GROUP BY word ORDER BY c DESC" > /dev/null<p>It is using only a single thread.
In this type of blog post with comparisons including many languages, you usually see some unnatural approach for your favorite language.<p>However, in this case, the Python example is idiomatic.<p>Since the article calls for better approaches, I would take the opportunity to use the walrus operator and rpartition() for the optimized version (rpartition, unlike rsplit, also handles a chunk that contains no newline). Something like this:<p><pre><code> import sys
 from collections import Counter

 remaining = ""
 counts = Counter()
 while (chunk := sys.stdin.read(64 * 1024)):
     pre, _, post = chunk.lower().rpartition('\n')
     counts.update((remaining + pre).split())
     remaining = post
 counts.update(remaining.split())  # flush the last partial line
 </code></pre>
It should not affect performance too much, and the code gets more expressive.<p>However, performance will differ quite a bit depending on the Python version you use. Interestingly, the Python 3.11 beta, which comes with a lot of performance tweaks, is slower for this exercise, while being reported to be faster on real-life tasks.
> <i>Incidentally, this problem set the scene for a wizard duel between two computer scientists several decades ago. In 1986, Jon Bentley asked Donald Knuth to show off “literate programming” with a solution to this problem, and he came up with an exquisite, ten-page Knuthian masterpiece. Then Doug McIlroy (the inventor of Unix pipelines) replied with a one-liner Unix shell version using tr, sort, and uniq.</i><p>Since this writing and other linked resources present only one side of the affair, I will mention: the presentation in <i>Pearls</i> was quite unfair to Knuth, and the conclusions commonly made of it somewhere between moderately and entirely unsound. (That is, as conclusions <i>from the paper</i>. I speak of unsoundness of logic only: these conclusions may be supportable from other sources.)<p>Knuth was challenged to demonstrate literate programming, and so <i>that was what he demonstrated</i>. He showed building something from the ground up assuming a very minimal environment, with problem analysis and all: of course that ended up more verbose than a Unix pipeline that glosses over problem analysis and starts with most of the tools you need to make a probably-acceptable solution!<p>McIlroy himself admitted a few years later that he had been “a little unfair” to criticise on engineering grounds what had only been an illustration of technique. Even in the paper, Bentley did admit a degree of culpability for this mismatch in his “criticism of programs” paragraph. (“He admires the execution of the solution, but faults the problem on engineering grounds. (That is, of course, my responsibility as problem assigner; Knuth solved the problem he was given on grounds that are important to most engineers-the paychecks provided by their problem assigners.)”)<p>But I do wish that Knuth had been given the opportunity to write a review of McIlroy’s review, for simultaneous publication.
... (2021). Great read and fun project. See the discussion from when it was originally posted by the author.[0]<p>[0]: <a href="https://news.ycombinator.com/item?id=26463967" rel="nofollow">https://news.ycombinator.com/item?id=26463967</a>
See also a different comparison (on ~the same problem) at StackExchange: <a href="https://codegolf.stackexchange.com/questions/188133/bentleys-coding-challenge-k-most-frequent-words" rel="nofollow">https://codegolf.stackexchange.com/questions/188133/bentleys...</a><p>(Copying from my comment the last time this was posted: <a href="https://news.ycombinator.com/item?id=26467684" rel="nofollow">https://news.ycombinator.com/item?id=26467684</a>)<p>There's also a nice book "Exercises in Programming Style" about just this problem.
As they call out in the article, accurately counting words is a bit more complicated than splitting by spaces. You need to account for all sorts of punctuation etc.<p>When I've been asked this in programming interviews, I've almost always been expected to produce something like the code in the article (and I do), but usually I'd point to something like the NLTK library as a better approach. It's polished and highly capable, and handles probably just about every edge case there is.
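For a flavor of the same idea outside Python, a sketch assuming the unicode-segmentation crate as a dependency: UAX #29 word boundaries already handle apostrophes and similar edge cases that naive space-splitting gets wrong.<p><pre><code> use unicode_segmentation::UnicodeSegmentation;

 fn main() {
     let text = "It's complicated, isn't it?";
     // unicode_words() yields words per UAX #29 and skips punctuation,
     // so "It's" and "isn't" each come out as a single word.
     for word in text.unicode_words() {
         println!("{word}");
     }
 }</code></pre>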
It's a bit of a shame that Rust's stdlib doesn't have a "fast but less safe" hasher. It's one of the easiest performance optimizations in my experience because it's very rare that an attacker controls input to your hashmap in a way that can actually lead to a DOS.<p>Not that it doesn't happen, it's just not something I've run into.<p>Just today I was testing out the performance difference hashing some Scylla queries and got nearly 2x faster hashing moving to fxhash.
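For anyone who wants to try the swap, a sketch (assuming the fxhash crate as a dependency; FxHashMap is just std's HashMap with the Fx hasher plugged in):<p><pre><code> use fxhash::FxHashMap;

 fn count_words(text: &str) -> FxHashMap&lt;&str, u64&gt; {
     // Same API as the std HashMap, different (non-DoS-resistant) hasher.
     let mut counts: FxHashMap&lt;&str, u64&gt; = FxHashMap::default();
     for word in text.split_ascii_whitespace() {
         *counts.entry(word).or_insert(0) += 1;
     }
     counts
 }</code></pre>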
These days I tend to use R for this type of problem. It’s slow, but vectors as first-class data types simplify things a lot:<p><pre><code> readLines(file) |>
   strsplit(" +") |>
   unlist() |>
   table() |>
   sort(decreasing = TRUE)</code></pre>
Got a bit better C++ version here which uses a couple libraries instead of std:: stuff - <a href="https://gist.github.com/jcelerier/74dfd473bccec8f1bd5d78be5aada569" rel="nofollow">https://gist.github.com/jcelerier/74dfd473bccec8f1bd5d78be5a...</a> ; boost, fmt and <a href="https://github.com/martinus/robin-hood-hashing" rel="nofollow">https://github.com/martinus/robin-hood-hashing</a><p><pre><code> $ g++ -I robin-hood-hashing/src/include -O2 -flto -std=c++20 -fno-exceptions -fno-unwind-tables -fno-asynchronous-unwind-tables -lfmt
$ time ./a.out < kjvbible_x10.txt > /dev/null
0,19s user 0,01s system 99% cpu 0,197 total
</code></pre>
with the same build flags, optimized.cpp gives me<p><pre><code> 0,22s user 0,01s system 99% cpu 0,233 total</code></pre>
C# really needs a GetOrAdd method on the regular Dictionary class (as exists on ConcurrentDictionary). This common pattern (from the optimized version) requires the hash function on "word" to be evaluated twice (once for TryGetValue and once for Add). Then, if there is a collision on the 32 bit hash then the equality check needs to be run twice as well. Not good as all of these are O(N) operations (where N is the size of "word"). I suspect the optimized version would be materially faster if they implemented their own dictionary that includes this.<p><pre><code> if (!counts.TryGetValue(word, out var wordCountRef))
{
counts.Add(word, new Ref<int>(1));
}</code></pre>
Doesn't the Unix Shell implementation break the "threading" constraint, in that each program in the pipe has its own main thread, and all the programs run concurrently?<p>Or should the "threading" constraint really have been stated as "you can use multiple threads, but only if they all have their own address space"?<p>Alternatively, given the multi-core capabilities of even budget/mobile/small systems these days (even the Raspberry Pi 4 has a quad-core CPU), isn't restricting the implementation to a single thread a weird artificial limitation in 2022? Also, a GPU-based implementation would be interesting, and GPUs get most of their power from their massively parallel core architecture.
> When optimizing this, the first thing to do is compile with optimizations enabled (g++ -O2). I kind of like the fact that with Go you don’t have to worry about this – optimizations are always on.<p>I think it is more fair to say that in Go optimizations are always OFF.
Taking it in a slightly different direction – because in my work chunking is nearly never necessary (too little text or too much memory) whereas Unicode characters and punctuation nearly always are – I would tend to something like:<p><pre><code> from collections import Counter
 from re import finditer

 word_counts = Counter()
 with open(PATH, encoding='utf-8') as file:
     doc = file.read()
 word_counts.update(match.group().lower() for match in finditer(r'\w+', doc))
 </code></pre>
In Python, this could be a fairly performant way of taking Unicode into account, because it avoids an explicit Python-level for loop, and regular expressions are rather optimized compared with writing low-level-style Python. (Maybe it should use casefold instead of lower.)
My language of choice, LiveCode, is a superset of HyperTalk. The concept of "words" has been built-in (not a library) since the '80s. Hash tables since the '90s.<p>A solution to every aspect of this problem but the line-by-line bit took just over a minute to write. Adding that took another minute.<p>The trade-off is that LC is slower than almost any of the languages listed, and optimization is almost impossible -- the idiomatic way <i>is</i> the fastest (almost). And if you don't like the way LC parses words, there is almost no way to change it without going into the source code (a big pain).
Hm, I tried comparing the simple.hs vs simple.py (with ghc 8.10.7 which I happened to have here, their stack.yaml uses 8.10.4) and it's 5.7s vs 1.9s, not quite as stark a difference. Maybe because I'm on Linux?
q solution in kdb, runs in about 2.5s on my machine vs simple.py which takes 4.7s. Rough interpolation with the results table puts this on par with Rust/Go/OCaml or other compiled languages.<p><pre><code> \t desc count each group `$lower " " vs raze read0 `:kjvbible_x10.txt
</code></pre>
It's also 1 line!<p>I am sure q pros or k purists can optimize this even more...<p>EDIT: Moving the lower earlier brings this down to 1.9s<p><pre><code> desc count each group `$ " " vs lower raze read0 `:kjvbible_x10.txt</code></pre>
Fun stuff! Did run a similar thing with a simple bioinformatics problem before (calculating the ratio of G and Cs against A+G+C+T), also with a whole bunch of contributors:<p><a href="https://github.com/samuell/gccontent-benchmark#readme" rel="nofollow">https://github.com/samuell/gccontent-benchmark#readme</a><p>Really hard - or impossible - to arrive at a definitive single number for one language, but the whole exercise is a lot of fun and quite informative IMO :)
This was always my approach when I had do “leetcode whiteboard interviews”.<p>Start with something not too far north of FizzBuzz but with a ton of scope to progressively enrich the problem until the clock runs out.<p>At $BIG_CO du jour people tend to talk about how much “signal” an interview produces, and starting somewhere that a meaningful number of candidates can’t do anything with offers very little “signal”, likewise very little Shannon information.<p>Great article!
I'm curious why go is faster than rust in the benchmarks. My best theory is that the program doesn't run long enough to trigger a GC, so rust is paying the cost of freeing memory, but go isn't. (Using an arena allocator in rust would probably produce similar behavior).<p>Or maybe the difference is just noise? It's hard to tell without more details on how the benchmarks were run.
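One cheap way to test the "paying for frees" theory, short of a real arena: leak the map so its destructor never runs, and see whether the gap closes. A sketch, for measurement only:<p><pre><code> use std::collections::HashMap;

 fn main() {
     let mut counts: HashMap&lt;String, u64&gt; = HashMap::new();
     // ... build up the counts as usual ...
     counts.insert("example".to_string(), 1);

     // Skip the per-String frees at teardown; the OS reclaims the whole
     // address space at exit anyway, which is roughly what a GC'd program
     // that never triggers a collection ends up paying.
     std::mem::forget(counts);
 }</code></pre>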
It seems like this is just measuring a single pass at reading a file with about 100k words in it. I ran the benchmark and it finished almost instantly. So languages with startup overhead and warmup time (e.g., JIT compilation) will totally underperform here. Makes sense then why Java is almost identical between optimised and non-optimised versions.
> <i>We usually think of I/O as expensive, but I/O isn’t the bottleneck here...The tokenization and hash table operations are the bottleneck by a long shot</i><p>Interesting. This is something I'll have to keep in mind.<p>Best practice has always been to treat I/O as the slowest part.<p>But it's true... times have changed.<p>Maybe we shouldn't make this assumption anymore.
Some of these efficiency gains are remarkable. Most of the code I write is just the quickest way to implement a mundane function. I'm really curious what the energy savings would be if you went into a huge tech co and devoted ~10% of SWE time solely to optimizing the code base.
For performance comparison between languages, this article is not very useful. Nothing against the author, this task is simply impossible :)<p>On the other hand, the article is a gold mine for learning performance analysis in different languages.
Interesting, it confirms what I've suspected for a while: that by using Python and not C++ for "general text mashing" tasks I'm not losing that much performance. Like, 2x, 5x, perhaps 10x. But not 100x.
> <i>a simple interview problem</i><p>I've been asking different versions of that question since I started interviewing, and it is a surprisingly good filter.<p>But with articles like this, now I need new questions...
Surprised to see Swift 4 times as slow as non-optimized Go.<p>Anyone have an explanation?<p>I know the Swift string type is really complex, but I always assumed it was at least performing well.
You'd need to measure memory usage as well, then take the product of the two (memory and speed) to get a much better single number on performance for comparison.
Another one-minute look: a multiplication on every byte... really?
If this is supposed to be fast, get one big buffer for the words. Get another if you run out.
The hash function can be the word itself for words up to length 8 (see the sketch below).
Longer words would need a mult per 8 bytes.
You can do branchless strchr.
Roll your own memmove - just always copy 24 bytes and terminate at the length you already know.
The whole point of using C or C++ is that you can do non idiomatic things when you have to.
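For the hash trick above, a sketch of the idea (in Rust rather than C, but the shape is the same): a word of 8 bytes or fewer hashes to its own zero-padded value, and longer words cost one multiply per extra 8-byte chunk.<p><pre><code> // A word of &lt;= 8 bytes is its own hash (zero-padded, little-endian);
 // longer words fold in one multiply per additional 8-byte chunk.
 fn word_hash(word: &[u8]) -> u64 {
     let mut h: u64 = 0;
     for chunk in word.chunks(8) {
         let mut buf = [0u8; 8];
         buf[..chunk.len()].copy_from_slice(chunk);
         h = h.wrapping_mul(0x9E37_79B9_7F4A_7C15)
              .wrapping_add(u64::from_le_bytes(buf));
     }
     h
 }</code></pre>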
You can tell it will be BS just from the title treating C and C++ as if they were one language, not two very different languages.<p>Other comments here reveal how, exactly.
Our distributed NewSQL database project actually uses three of them, Go, Rust, and C++, for different layers. There are pros and cons for each of them, but it is still a challenge to maintain three languages in one system's development. Performance-wise, C++ and Rust are good; coding-wise, Go is the easiest, while Rust arguably encourages the best practices.
I would not call it a performance comparison at all. When Python calls functions written in C, it is not Python's performance. Write those functions in plain Python and then see the results. Sure, for the basic example in the article it does not matter from a practical standpoint. But when you need to step away from canned cases, suddenly Python's performance sucks big time.