dict and the relevant dictionaries are things i pretty much always install on every new laptop. gcide in particular includes most of the famous 1913 webster dictionary with its sparkling prose:<p><pre><code> : ~; dict glisten
2 definitions found
From The Collaborative International Dictionary of English v.0.48 [gcide]:
Glisten \Glis"ten\ (gl[i^]s"'n), v. i. [imp. & p. p.
{Glistened}; p. pr. & vb. n. {Glistening}.] [OE. glistnian,
akin to glisnen, glisien, AS. glisian, glisnian, akin to E.
glitter. See {Glitter}, v. i., and cf. {Glister}, v. i.]
To sparkle or shine; especially, to shine with a mild,
subdued, and fitful luster; to emit a soft, scintillating
light; to gleam; as, the glistening stars.
Syn: See {Flash}.
[1913 Webster]
</code></pre>
it's interesting to think about how you would implement this service efficiently under the constraints of mid-01990s computers, where a gigabyte was still a lot of disk space and multiuser unix servers commonly had about 100 mips (<a href="https://netlib.org/performance/html/dhrystone.data.col0.html" rel="nofollow">https://netlib.org/performance/html/dhrystone.data.col0.html</a>)<p>totally by coincidence i was looking at the dictzip man page this morning; it produces gzip-compatible files that support random seeks so you can keep the database for your dictd server compressed. (as far as i know, rik faith's dictd is still the only server implementation of the dict protocol, which is incidentally not a very good protocol.) you can see that the penalty for seekability is about 6% in this case:<p><pre><code> : ~; ls -l /usr/share/dictd/jargon.dict.dz
-rw-r--r-- 1 root root 587377 Jan 1 2021 /usr/share/dictd/jargon.dict.dz
: ~; \time gzip -dc /usr/share/dictd/jargon.dict.dz|wc -c
0.01user 0.00system 0:00.01elapsed 100%CPU (0avgtext+0avgdata 1624maxresident)k
0inputs+0outputs (0major+160minor)pagefaults 0swaps
1418350
: ~; gzip -dc /usr/share/dictd/jargon.dict.dz|gzip -9c|wc -c
556102
: ~; units -t 587377/556102 %
105.62397
</code></pre>
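the chunk table itself lives in the gzip header's "extra" field, which is why ordinary gunzip still reads the file. here's a rough python sketch of pulling it out; the 'RA' subfield layout (version, chunk length, chunk count, then 16-bit compressed chunk sizes, all little-endian) is from my reading of the dictzip man page, so treat it as an assumption:<p><pre><code> import struct

def dictzip_chunks(path):
    # walk the gzip header to dictzip's 'RA' extra subfield
    with open(path, 'rb') as f:
        magic, method, flags = struct.unpack('&lt;HBB', f.read(4))
        assert magic == 0x8b1f and method == 8 and flags & 4  # FEXTRA set
        f.read(6)  # skip MTIME, XFL, OS
        xlen, = struct.unpack('&lt;H', f.read(2))
        extra = f.read(xlen)
    while extra:
        subid, sublen = extra[:2], struct.unpack('&lt;H', extra[2:4])[0]
        data, extra = extra[4:4+sublen], extra[4+sublen:]
        if subid == b'RA':
            ver, chlen, chcnt = struct.unpack('&lt;HHH', data[:6])
            return chlen, struct.unpack('&lt;%dH' % chcnt, data[6:6+2*chcnt])
    raise ValueError('no RA subfield; not a dictzip file?')

chlen, sizes = dictzip_chunks('/usr/share/dictd/jargon.dict.dz')
print(chlen, len(sizes), sizes[:4])
</code></pre>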
nowadays computers are fast enough that it probably isn't a big win to gzip in such small chunks (dictzip has a chunk limit of 64k) and you might as well use a zipfile, all implementations of which support random access:<p><pre><code> : ~; mkdir jargsplit
: ~; cd jargsplit
: jargsplit; gzip -dc /usr/share/dictd/jargon.dict.dz|split -b256K
: jargsplit; zip jargon.zip xaa xab xac xad xae xaf
adding: xaa (deflated 60%)
adding: xab (deflated 59%)
adding: xac (deflated 59%)
adding: xad (deflated 61%)
adding: xae (deflated 62%)
adding: xaf (deflated 58%)
: jargsplit; ls -l jargon.zip
-rw-r--r-- 1 user user 565968 Sep 22 09:47 jargon.zip
: jargsplit; time unzip -o jargon.zip xad
Archive: jargon.zip
inflating: xad
real 0m0.011s
user 0m0.000s
sys 0m0.011s
</code></pre>
so you see 256-kibibyte chunks decompress in about a millisecond or less (more like 2 milliseconds on my cellphone) and cost only about a 1.8% size penalty for seekability:<p><pre><code> : jargsplit; units -t 565968/556102 %
101.77413
</code></pre>
and, unlike the dictzip format (which lists the chunks in a backward-compatible extra field in the gzip header), zip also supports efficient appending<p>even in python (3.11.2) it's only about a millisecond:<p><pre><code> In [13]: z = zipfile.ZipFile('jargon.zip')
In [14]: [f.filename for f in z.infolist()]
Out[14]: ['xaa', 'xab', 'xac', 'xad', 'xae', 'xaf']
In [15]: %timeit z.open('xab').read()
1.13 ms ± 16.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
</code></pre>
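with chunks like these, random access to any byte range of the uncompressed file is just integer division to find the chunk; a quick sketch reusing the z object above, assuming the chunk size matches the split -b256K invocation:<p><pre><code> CHUNK = 256 * 1024  # must match the split -b256K above

def read_range(names, offset, length):
    # read [offset, offset+length) of the original byte stream,
    # decompressing only the chunks that overlap it
    out = b''
    while length > 0:
        i, skip = divmod(offset, CHUNK)
        data = z.open(names[i]).read()[skip:skip + length]
        if not data:
            break  # ran off the end of the file
        out += data
        offset += len(data)
        length -= len(data)
    return out

names = sorted(f.filename for f in z.infolist())
print(read_range(names, 300000, 40))  # an arbitrary 40-byte window
</code></pre>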
this kind of performance means that any algorithm that would be efficient reading data stored on a conventional spinning-rust disk will be efficient reading compressed data if you put the data into a zipfile in "files" of around a meg each. (writing is another matter; zstd may help here, with its order-of-magnitude faster compression, but info-zip zip and unzip don't support zstd yet.)<p>dictd keeps an index file in tsv format which uses what looks like base64 to encode each headword's byte offset and length in the uncompressed data:<p><pre><code> : jargsplit; < /usr/share/dictd/jargon.index shuf -n 4 | LANG=C sort | cat -vte
fossil^IB9xE^IL8$
frednet^IB+q5^IDD$
upload^IE/t5^IJ1$
warez d00dz^IFLif^In0$
</code></pre>
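the two base64 fields are the offset and length used as place-value numerals, most significant digit first; decoding them in python (the alphabet here is the standard base64 one, which is what i believe dictd uses, but double-check against its source):<p><pre><code> B64 = ('ABCDEFGHIJKLMNOPQRSTUVWXYZ'
       'abcdefghijklmnopqrstuvwxyz0123456789+/')

def b64_number(s):
    # dictd-style base64 numeral, most significant digit first
    n = 0
    for c in s:
        n = n * 64 + B64.index(c)
    return n

print(b64_number('B9xE'), b64_number('L8'))  # fossil: offset 515140, length 764
</code></pre>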
this is very similar to the index format used by eric raymond's volks-hypertext <a href="https://www.ibiblio.org/pub/Linux/apps/doctools/vh-1.8.tar.gz" rel="nofollow">https://www.ibiblio.org/pub/Linux/apps/doctools/vh-1.8.tar.gz</a> or vi ctags or emacs etags, but it supports random access into the file
2024-09-22 10:44:50 URL:http://canonical.org/~kragen/quotes.txt [49884/49884] -> "quotes.txt" [1]
: ~; strfile quotes.txt
"quotes.txt.dat" created
There were 87 strings
Longest string: 1625 bytes
Shortest string: 92 bytes
: ~; fortune quotes.txt
Get enough beyond FUM [Fuck You Money], and it's merely Nice To Have
Money.
-- Dave Long, <dl@silcom.com>, on FoRK, around 2000-08-16, in
Message-ID <200008162000.NAA10898@maltesecat>
: ~; od -i --endian=big quotes.txt.dat
0000000 2 87 1625 92
0000020 0 620756992 0 933
0000040 1460 2307 2546 3793
0000060 3887 4149 5160 5471
0000100 5661 6185 6616 7000
</code></pre>
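reading that dump, the header seems to be six big-endian 32-bit words (version, number of strings, longest, shortest, flags, and the delimiter character padded out to a word), followed by numstr+1 offsets; a sketch of random lookup under that assumption:<p><pre><code> import random, struct

def fortune(textpath):
    # pick a random string the way fortune(6) does, via the .dat offsets
    with open(textpath + '.dat', 'rb') as f:
        version, numstr, longest, shortest, flags, delim = \
            struct.unpack('>5L4s', f.read(24))
        offsets = struct.unpack('>%dL' % (numstr + 1),
                                f.read(4 * (numstr + 1)))
    i = random.randrange(numstr)
    with open(textpath, 'rb') as f:
        f.seek(offsets[i])
        entry = f.read(offsets[i + 1] - offsets[i])
    return entry.rstrip(delim[:1] + b'\n')  # strip the trailing "%" line

print(fortune('quotes.txt').decode())
</code></pre>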
of course if you were using a zipfile you could keep the index in the zipfile itself, and then there's no point in using base64 for the file offsets, or limiting them to 32 bits
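<p>for instance, you could append an "index" member whose records are the headword, a NUL, and the offset and length as two big-endian 64-bit words; a minimal sketch (the record format is made up here just to show the idea, and zipfile's 'a' mode is the appending mentioned above):<p><pre><code> import struct, zipfile

def write_index(zipname, index):
    # index: {headword: (offset, length)} into the uncompressed stream
    blob = b''.join(word.encode() + b'\0' + struct.pack('>QQ', off, n)
                    for word, (off, n) in sorted(index.items()))
    with zipfile.ZipFile(zipname, 'a') as z:  # 'a' appends in place
        z.writestr('index', blob)

write_index('jargon.zip', {'fossil': (515140, 764)})
</code></pre>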