Reverse-Engineering Apple Dictionary (2020)

278 points by goranmoomin over 3 years ago

18 comments

btn over 3 years ago

Another approach for this is to explore the format through Apple's tools for building dictionaries, as they provide a "Dictionary Development Kit" in Xcode's downloadable "Additional Tools" package (which has documentation for the XML format and a bunch of scripts/binaries for building the bundle).

I wound up doing this a while ago for a similar toy project. After some poking around, it turned out that dictionary bundles are entirely supported by system APIs in CoreServices! The APIs are private, but Apple accidentally shipped a header file with documentation for them in the 10.7 SDK [1]. You can load a dictionary with `IDXCreateIndexObject()`, read through its indices with the search methods (and the convenient `kIDXSearchAllMatch`), and get pointers to its entry data with `IDXGetFieldDataPtrs()`.

It takes a bit of fiddling to figure out the structure (there are multiple indices for headwords, search keywords, cross-references, etc., and the API is a general-purpose trie library) and request the right fields, but those property lists in the bundle are there to help! (As the author of this article discovered, the entries are compressed and are preceded by a 4-byte length marker.)

[1] https://github.com/phracker/MacOSX-SDKs/blob/master/MacOSX10.7.sdk/usr/include/IndexedSearch.h

enragedcacti over 3 years ago

For anyone else needing to tackle something like this, it's definitely worth checking out Binwalk [1]. It is meant for extracting firmware, but it works decently well on most files-in-files type data formats.

[1] https://github.com/ReFirmLabs/binwalk

gilgoomesh over 3 years ago

It seems highly likely (given that this is a dictionary and requires fast lookups) that you're reverse engineering something like CEVFS (i.e. a virtual file system for compressing a database). Which is why the dictionary is broken into chunks... these are the compressed pages of the database.

tim-- over 3 years ago

Another way to extract all of those compressed zip files would have been to use binwalk.

    binwalk --dd='.*' *.asset

Edit: should have read the comments before I posted this, because enragedcacti already mentioned this tool an hour ago.

peterburkimsher over 3 years ago

Thank you for posting this code on GitHub! There has been some reverse-engineering done on the language dictionaries bundled with Mac OS, and it's nice to know that the same model is being used on the Apple Watch! I look forward to seeing your dictionary app.

https://josephg.com/blog/reverse-engineering-apple-dictionaries/

There's also a command-line tool that can query the dictionary:

https://github.com/takumakei/osx-dictionary

Something I haven't yet reverse-engineered is Apple's word segmentation. I can get the word breaks in Chinese by pressing option + right arrow + space, repeatedly. But I have no idea how the backend for that works.

octref over 3 years ago

I have thought about building a vocabulary learning tool for learning Japanese on top of Apple Dictionary. My idea is simple: the user collects dictionary items and the tool offers lookup / spaced repetition.

However, I'm concerned that the dictionary is copyrighted. Is there any precedent that says whether such a tool would be legal/illegal?

ranvel over 3 years ago

This was an amazingly delightful read. I was hoping it would shed some light on a long-time project I had which was reversing the Oxford language dictionaries that were included on CDROM with the big printed texts. (I already did it, but with a debugger instead of by reversing the binary format). Alas, it did not, but it was super encouraging to see the enthusiasm and interest in language dictionaries.

cogburnd02 over 3 years ago

Would be really, really cool if someone could make a small script to convert these into a format understandable by dict:// (https://en.wikipedia.org/wiki/DICT).

dunham over 3 years ago

The "seemingly random bytes" look like a small 32-bit little endian number to me, probably the length of the subsequent payload.
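To make the length-prefix idea concrete, here is a minimal Python sketch of walking a run of records shaped like that. It is only an illustration under assumptions: the layout (a 4-byte little-endian length followed by that many payload bytes) is the guess made in this thread, zlib is assumed as the compression, and `iter_length_prefixed` and `make_record` are hypothetical helper names. The sample data is synthetic, not taken from a real dictionary file.

```python
import struct
import zlib


def iter_length_prefixed(blob: bytes, offset: int = 0):
    """Yield payloads from back-to-back [4-byte LE length][payload] records."""
    while offset + 4 <= len(blob):
        # "<I" = little-endian unsigned 32-bit integer
        (length,) = struct.unpack_from("<I", blob, offset)
        offset += 4
        if length == 0 or offset + length > len(blob):
            break  # stop at padding or at a truncated record
        yield blob[offset:offset + length]
        offset += length


def make_record(text: str) -> bytes:
    """Build one synthetic record (assumed layout, zlib assumed)."""
    payload = zlib.compress(text.encode("utf-8"))
    return struct.pack("<I", len(payload)) + payload


# Synthetic stand-in for a chunk of dictionary data.
sample = make_record("<entry>apple</entry>") + make_record("<entry>pear</entry>")
entries = [zlib.decompress(p).decode("utf-8") for p in iter_length_prefixed(sample)]
print(entries)
```

If the lengths decoded this way stay small and the payloads keep lining up with the start of the next record, that is decent evidence the bytes really are a length marker and not noise.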

atorodius over 3 years ago

Author here, funny to see this popping up on HN :) Was definitely a fun ride making this.

ChrisMarshallNY over 3 years ago

That's awesome!

But be aware (i.e. "beware") that Apple can pull the rug out from under unpublished APIs without warning.

I have been caught out by this myself.

etaioinshrdlu over 3 years ago

This is fun. I have another idea: I'd be interested in calling Siri from the command line, even if that means using private APIs (but without hacky fake drivers or accessibility tools).

gfaure over 3 years ago

I assume one reason Apple has made it more challenging to extract the dictionary resources is in order to satisfy licensing constraints with the dictionary authors. I wonder if they'd block an app like this through the App Store submission process, if submitted.

hyperstar over 3 years ago

At the moment, I'm using a slight modification of the gist plus `sed 's/<[^>]*>//g'` to look up words from the shell on Linux. It would be nice to have some XML parsing into plain text, but it kind of works.
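For the "XML parsing into plain text" part, a rough Python sketch using only the standard library might look like the following. It assumes each entry is well-formed XML (real entries may need cleanup first), and both the `entry_to_text` helper and the sample markup are made up for illustration, not taken from an actual dictionary.

```python
import xml.etree.ElementTree as ET


def entry_to_text(entry_xml: str) -> str:
    """Return the text content of one XML entry with whitespace collapsed."""
    root = ET.fromstring(entry_xml)
    # itertext() walks every text node in document order; joining with a
    # space keeps adjacent elements from running together.
    raw = " ".join(root.itertext())
    return " ".join(raw.split())


# Hypothetical entry markup, for illustration only.
sample = (
    '<entry><span class="hw">apple</span> '
    '<span class="def">a round fruit &amp; its tree</span></entry>'
)
print(entry_to_text(sample))
```

Unlike the `sed` approach, this decodes entities such as `&amp;`. The trade-off of joining with a space is that a word split across inline tags would come out with a gap in it.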

hellothereworld over 3 years ago

I also found this Dictionary API, which imports the dictionaries into Node.js using a utility called "dedict".

https://github.com/nikvdp/dictionary-api/blob/master/convertDicts.js

diimdeep over 3 years ago

Unfortunately, it is not possible to implement a dynamic add-on (an added source) for Dictionary.app, like the built-in Wikipedia one, for example to query urbandictionary.com; only static offline sources are possible. I tried to investigate this years ago, and I don't think anything has changed since then.

bhl over 3 years ago

Has anyone reverse-engineered the Apple emoji dictionary that maps some keywords to emojis? Last time I checked, they only shipped binaries on the newest macOS. Would love to use that mapping to improve search on my custom emoji picker.

cratermoon over 3 years ago

Is this a plist?