TE
TechEcho
Home24h TopNewestBestAskShowJobs
GitHubTwitter
Home

TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

GitHubTwitter

Home

HomeNewestBestAskShowJobs

Resources

HackerNews APIOriginal HackerNewsNext.js

© 2025 TechEcho. All rights reserved.

Fast character case conversion (or how to compress sparse arrays)

70 pointsby donmccover 3 years ago

6 comments

jhallenworldover 3 years ago
I did a four-level radix tree for the Unicode character class and case conversion support in JOE (Joe&#x27;s Own Editor, so that you can edit Unicode on non-Unicode systems): index sizes of 7 bits, 5, 5 and 4 plus ASCII has a fast-path table.<p>The main difference is that it computes the tables at startup from the sorted Unicode intervals. So the construction code has to be fast. The same code is also used for user character classes in regular expressions.<p>Anyway, it builds them in two passes. First pass it de-duplicates nodes, but only the previously constructed node is a candidate for de-duplication. This keeps the memory usage low during construction. De-duplicated nodes can still be modified during this construction, so they may be re-duplicated (there is a reference counter to determine when this happens).<p>Second pass (after all data is loaded, no more changes allowed), it globally de-duplicates the leaf nodes using a hash table. Many of the leaf nodes are duplicates (and not just the all zero ones).
评论 #28856669 未加载
willvarfarover 3 years ago
My experience with this is to fast-path the common cases (ie ascii and vast swathes of emptiness) with hardcoded bounds comparisons, so the memory access is only done as a slow path when needed.
评论 #28856261 未加载
edflsafoiewqover 3 years ago
Previously: <a href="https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=26227754" rel="nofollow">https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=26227754</a>
kaetemiover 3 years ago
Wrote a case conversion that processes UTF-8 directly last year for Ryzom Core. The tables look like a mess, but it massively improved performance over the code that was replaced. Case changes seem to be called more often than I expected in the game. I do wonder if there&#x27;s any cleaner and faster way.<p><a href="https:&#x2F;&#x2F;github.com&#x2F;ryzom&#x2F;ryzomcore&#x2F;blob&#x2F;core4&#x2F;nel&#x2F;src&#x2F;misc&#x2F;string_to_upper.cpp" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;ryzom&#x2F;ryzomcore&#x2F;blob&#x2F;core4&#x2F;nel&#x2F;src&#x2F;misc&#x2F;s...</a>
评论 #28862407 未加载
edflsafoiewqover 3 years ago
quickjs has pretty cool case conversion too. It special cases ASCII, then does the rest with a binary search through some kinda run-length encoded table. There&#x27;s more code but the table itself is only about 2K for both case directions.<p><a href="https:&#x2F;&#x2F;github.com&#x2F;bellard&#x2F;quickjs&#x2F;blob&#x2F;b5e62895c619d4ffc75c9d822c8d85f1ece77e5b&#x2F;libunicode.c#L56" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;bellard&#x2F;quickjs&#x2F;blob&#x2F;b5e62895c619d4ffc75c...</a>
评论 #28858654 未加载
dhsysusbsjsiover 3 years ago
How do these benchmark? I’d like to know if the extra cost of the memory lookups is balanced by the smaller table and better cache locality.
评论 #28858442 未加载