My wife and I like murder mystery stories. After talking through the plot developments of several stories in tandem, I wondered if we could build a compelling murder mystery game where the suspects were AI characters with an LLM backend. We spent the last month building a game based on a story we wrote together.

If you're a fan of murder mysteries, or curious about using LLMs in a narrative context, check it out! I’d love to hear your feedback.

Some food for thought from the development:

1. It’s been a few years since I last used Streamlit, and I’m impressed by how much the community has grown. I originally intended to use it only for prototyping, but it can now support most basic web app features with minor workarounds for things like accounts. I ended up deploying a separate REST application to handle most of the account work and state mutations, but Streamlit made the UI development very fast.

2. The prompt finessing took much longer than I had thought it would. Each character conversation gets a prompt containing the game state, built as a JSON of “facts” the player has learned so far (along with an ID for each fact). The trick behind each conversation’s evolution was to prompt the model to self-label which facts it used in its responses. Those labels then update the game state, marking the referenced facts as “discovered” by the player, and further conversations with characters are affected by the new state. The entire narrative was built from these JSON facts and their dependencies (i.e. a list of facts that must be “discovered” before a new one can be revealed), which together form a DAG of facts representing the story as a whole. Each fact took a lot of tweaking to ensure it was orthogonal to the other facts; otherwise the model struggled at self-labeling. (There’s a rough sketch of the fact structure at the end of this post.)

3. In my initial attempts, I was getting very stilted responses from each character. They would reference facts, but the answers were generally so incomplete that the player didn’t have enough context to move forward. I tried lots of model setting variations and prompting changes, but the key ended up being modeling the conversation as a meta-conversation: in the input context, every turn is prefixed with the speaker’s name, as if it were the script of a detective play or novel. For example, user: “Where were you the night of the murder?”, assistant: “At home.” became user: “Detective: Where were you the night of the murder?”, assistant: “Julie: At home all night, I went to bed early even though Paul wasn’t home.” It was truly unbelievable how large a difference this made. It is all hidden from the user in the final application. (See the message-formatting sketch below.)

4. During the tweaking, I needed some kind of objective measure of whether my changes were helping or hurting, so for every JSON fact in the narrative DAG I wrote a “leading question” designed to trigger the character to reveal that fact. I could then evaluate the percentage of facts correctly revealed across all of the leading questions in the narrative. There seem to be more than a handful of frameworks now to help with this kind of LLM scoring, but it was fun to build my own. (A sketch of the scoring loop is below.)

5. The whole thing was built with GPT-3.5-turbo, as GPT-4 was slower and pretty much equivalent in the performance testing.
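
For point 2, here is a minimal sketch of what the fact DAG and the discovery update could look like. The field names ("id", "text", "requires", "discovered") and the sample facts are my illustration here, not the exact schema the game uses:

    # Each fact is a node; "requires" lists the fact IDs that must already
    # be discovered before this fact is allowed to surface in a reply.
    facts = [
        {"id": "f1", "text": "Paul was not home on the night of the murder.",
         "requires": [], "discovered": False},
        {"id": "f2", "text": "Paul was seen near the marina after midnight.",
         "requires": ["f1"], "discovered": False},
    ]

    def revealable(facts):
        """Facts that are still hidden but whose prerequisites are all discovered."""
        done = {f["id"] for f in facts if f["discovered"]}
        return [f for f in facts
                if not f["discovered"]
                and all(dep in done for dep in f["requires"])]

    def mark_discovered(facts, referenced_ids):
        """Apply the model's self-labels to the game state."""
        for f in facts:
            if f["id"] in referenced_ids:
                f["discovered"] = True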
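
For point 3, this is roughly what the script-style message building looks like: every turn gets the speaker's name prepended before it goes to the model, and the character prefix is stripped before the reply is shown to the player. The names and helper functions are illustrative:

    def build_messages(system_prompt, history, player_input,
                       player_name="Detective", character_name="Julie"):
        """Format the conversation as a script, one named line per turn."""
        messages = [{"role": "system", "content": system_prompt}]
        for user_turn, assistant_turn in history:
            messages.append({"role": "user",
                             "content": f"{player_name}: {user_turn}"})
            messages.append({"role": "assistant",
                             "content": f"{character_name}: {assistant_turn}"})
        messages.append({"role": "user",
                         "content": f"{player_name}: {player_input}"})
        return messages

    def strip_name(reply, character_name="Julie"):
        """Remove the character prefix before showing the reply to the player."""
        prefix = f"{character_name}:"
        return reply[len(prefix):].strip() if reply.startswith(prefix) else reply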
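
And for point 4, a sketch of the scoring loop, assuming each fact carries a hand-written "leading_question" and that ask_character() is some function returning the set of fact IDs the model self-labeled in its reply (both are assumptions for illustration):

    def score_narrative(facts, ask_character):
        """Fraction of facts revealed when asked their leading question."""
        hits, scored = 0, 0
        for fact in facts:
            question = fact.get("leading_question")
            if not question:
                continue
            scored += 1
            referenced_ids = ask_character(question)  # e.g. {"f1", "f3"}
            if fact["id"] in referenced_ids:
                hits += 1
        return hits / scored if scored else 0.0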