Text rendering goes through multiple stages. I remember going through the LibreOffice text rendering code. It basically works like this:<p>* segment the text into paragraphs. You’d think this would be easy, but Unicode has a lot of separators. Heck, in HTML you have break and paragraph tags, but Unicode has about half a dozen things that can count as paragraph separators.<p>* parse the text into style runs - each time the font, colour, slant, weight, or anything like that changes, you start a separate run<p>* parse the text into bidirectional runs - you must work out the points at which the text shifts direction and start a new run at each shift<p>* you then need to figure out how to reconcile the two types of runs into a single combined bidi-and-style run list.<p>Don’t forget that you might need to handle vertical text! And Japanese writing has ruby characters - annotation characters set between the columns.<p>* fun bit of code - working out kashida length in Arabic (the elongation strokes used to justify Arabic text). It took one of the real pros of the LO dev team to work out how to do this. It literally took them years!<p>* you then must work out which font you are actually going to use - commonly known as the itemisation stage.<p>This is a problem for office suites when you don’t have the font installed. There is a complex font substitution and matching algorithm. Normally you get a stack of fonts to choose from and fall back to - everybody has their own font fallback algorithm. The PANOSE system is one such scheme, where a set of font classification metrics is fed through a distance algorithm to work out the best font to select. It is not universally adopted, and most people have bolted on their own font selection stack, but in general it’s some form of this.<p>LibreOffice has a buggy matching algorithm that frankly doesn’t work, due to some problems with logical operators and a running font-match calculation metric they have baked in.
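To make the distance-metric idea concrete, here is a toy sketch of PANOSE-style matching: each font is described by a small vector of classification digits, and the candidate with the smallest distance to the requested font wins. The axes, values, and font names here are invented for illustration - real PANOSE uses ten digits with per-digit weighting rules, and this is nothing like LibreOffice’s actual algorithm.

```python
# Hypothetical classification axes for the font we want.
REQUESTED = {"weight": 7, "serif": 0, "proportion": 3}

# Hypothetical classifications for the fonts actually installed.
CANDIDATES = {
    "DejaVu Sans":      {"weight": 5, "serif": 0, "proportion": 3},
    "Liberation Serif": {"weight": 5, "serif": 1, "proportion": 4},
    "Some Mono":        {"weight": 5, "serif": 0, "proportion": 9},
}

def distance(a, b):
    # Sum of absolute differences across the classification axes.
    return sum(abs(a[k] - b[k]) for k in a)

def best_match(requested, candidates):
    # Pick the installed font with the smallest distance to the request.
    return min(candidates, key=lambda name: distance(requested, candidates[name]))

print(best_match(REQUESTED, CANDIDATES))  # → DejaVu Sans
```

The point is just that “matching” reduces to nearest-neighbour search over font metadata; the hard part in practice is agreeing on the axes and the weights.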
At one point I did extensive unit testing around this in an attempt to document and demonstrate the existing behaviour. I submitted a bunch of patches and tests piecemeal, but only about half of them were accepted because they kept changing how they wanted the patches submitted, and eventually someone who didn’t understand the logic point-blank refused to accept the code - I just gave up at that point, and the code remains bug-riddled and unclear.<p>On top of this, you need to take into account Unicode normalisation rules. Tricky.<p>* now you deal with script mapping. Here you take each character (usually a code point) and work out the glyph to use and where to place it. You get script runs - some scripts have straightforward codepoint-to-glyph rules, others less so. Breaking the text into script runs makes this conversion far easier to work out.<p>* now you get the shaping of the text into clusters. You’ll get situations where a glyph can be positioned and rendered in different ways - in Latin-based scripts an example is “ff”: this can be two “f” characters, but often it’s a single ligature glyph. It gets weirder with Indic scripts, where a character’s shape changes based on the characters before and after it… my mind was blown when I got to this point.<p>This gets complex, fast - luckily there are plenty of great-quality shaping engines that handle it for you. Most open source apps use HarfBuzz, which gets better and better with each iteration.<p>* now you take these text runs, and a lot of the job is done. However, paragraph separation is not line separation. Given a long enough paragraph of text, you must determine where to insert breaks in the lines - basically word wrapping.<p>Whilst this seems very simple, it’s not, because then you get text hyphenation.
This can vary based on language and script.<p>A lot of this I worked out from reading the LO code, but I did stumble onto an amazing primer here:<p><a href="https://raphlinus.github.io/text/2020/10/26/text-layout.html" rel="nofollow noreferrer">https://raphlinus.github.io/text/2020/10/26/text-layout.html</a><p>The LO devs are pretty amazing, for the record. They are dealing with multiple platforms and using each platform’s text rendering subsystem where possible. They are starting to standardise - but it’s certainly not an easy feat. Hats off to them!
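As an aside, the run-reconciliation step mentioned earlier (merging style runs and bidi runs into one list) can be sketched as merging two interval lists: break at the union of both sets of boundaries so each output run carries exactly one style and one direction. The offsets and attribute names below are invented for illustration.

```python
def merge_runs(style_runs, bidi_runs):
    # Each run is (start, end, attr); both lists cover the same text range.
    # The merged runs break at the union of all boundaries.
    boundaries = sorted({s for s, _, _ in style_runs + bidi_runs} |
                        {e for _, e, _ in style_runs + bidi_runs})

    def attr_at(runs, pos):
        # Find the attribute of the run covering position pos.
        for s, e, a in runs:
            if s <= pos < e:
                return a

    merged = []
    for start, end in zip(boundaries, boundaries[1:]):
        merged.append((start, end,
                       attr_at(style_runs, start),
                       attr_at(bidi_runs, start)))
    return merged

style = [(0, 5, "bold"), (5, 12, "regular")]
bidi  = [(0, 8, "LTR"), (8, 12, "RTL")]
for run in merge_runs(style, bidi):
    print(run)
```

The same flattening trick works when you later intersect these runs with script runs and font (itemisation) runs - each extra attribute just adds more boundaries.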