Here’s an interpretability idea you may find interesting:

Let's turn an AI model into a place. The goal is to make AI interpretability research fun and widespread by converting a multimodal language model into a place, or a game like The Sims or GTA.

Imagine you have a giant trash pile. How do you make a language model out of it? First you remove duplicates of every item: you don't need a million banana peels, one will suffice. Now you have a grid with one item of trash in each square, say a banana peel in one and a broken chair in another. Then you place related items close together and draw arrows between them.

When a person "prompts" this place-AI, the player themselves runs from one item to another to compute the answer to the prompt.

For example, you stand near the monkey; that's your short prompt. Around you are many items, with arrows pointing toward them. The closest item is a pair of chewing lips, so you step toward it; now your prompt is "monkey chews". The next closest item is a banana, but there are plenty of other possibilities around: an apple a bit farther away, and an old tire far off on the horizon (monkeys rarely chew tires, so the tire is far away).

You are the time-like chooser, and the language model is the space-like library, the game, the place. It's static and safe, while you're dynamic and dangerous.
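To make the walk-as-decoding idea concrete, here is a minimal toy sketch in Python. It assumes a hypothetical hand-made 2D layout of items (in a real version you might project the model's token embeddings down to 2D), and it uses distance from the player as a stand-in for next-token probability: the closer an item, the more likely it is as the continuation. All names here (ITEM_POSITIONS, rank_next_items, walk_prompt, the coordinates) are invented for illustration, not part of the original proposal.

    # Toy sketch: deduplicated "items" from the trash pile laid out on a 2D map.
    # Distance from the player's current position stands in for next-token
    # probability: a closer item means a more likely continuation.
    # Coordinates and item names are invented for illustration only.

    import math

    # Hypothetical 2D layout of deduplicated items (tokens/concepts).
    ITEM_POSITIONS = {
        "monkey":       (0.0, 0.0),
        "chewing lips": (1.0, 0.5),
        "banana":       (2.0, 1.0),
        "apple":        (3.5, 2.0),
        "old tire":     (9.0, 8.0),   # far away: monkeys rarely chew tires
    }

    def distance(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    def rank_next_items(player_pos, exclude=()):
        """Items sorted nearest-first; nearer means 'more probable next word'."""
        candidates = [
            (name, distance(player_pos, pos))
            for name, pos in ITEM_POSITIONS.items()
            if name not in exclude
        ]
        return sorted(candidates, key=lambda pair: pair[1])

    def walk_prompt(start_item, steps=2):
        """Greedy walk: always step to the closest item, growing the 'prompt'."""
        prompt = [start_item]
        pos = ITEM_POSITIONS[start_item]
        for _ in range(steps):
            nearest, _ = rank_next_items(pos, exclude=set(prompt))[0]
            prompt.append(nearest)
            pos = ITEM_POSITIONS[nearest]
        return prompt

    if __name__ == "__main__":
        # Standing near the monkey and walking greedily yields
        # ["monkey", "chewing lips", "banana"] with this toy layout.
        print(walk_prompt("monkey"))
        # The full ranking shows the alternatives the player could have chosen:
        print(rank_next_items(ITEM_POSITIONS["monkey"], exclude={"monkey"}))

The greedy walk corresponds to always picking the most probable next token; a human player is free to wander toward the apple or even the tire instead, which is exactly the "time-like chooser" role described above.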