Hello HN, Dawson and Ethan here from Martin (https://trymartin.com). We've been building an AI personal assistant (the elusive dream of a real-life Jarvis) for about a year now, and we recently launched Martin as a web app. Watch our latest demo here: https://youtu.be/ZeafVF8U7Ts.

We're starting with common agentic tasks for consumers/prosumers - Martin can read and draft emails, manage your calendar, text and call people for you, and use Slack. Like any personal assistant, it can also set reminders, track your to-dos, and send you daily briefings. The idea is to eventually tackle everything an on-call virtual assistant does.

Four months ago, we did a Launch HN for Martin's voice-first iOS app. A big piece of product feedback we got was "I don't trust AI to take actions like sending texts/emails on my behalf if it's not 100% reliable."

We're happy to report that Martin's failure rate is now a lot lower than before (though we still have a lot of work to do on more complex actions). We've tackled some pretty interesting problems since our last launch, so we thought we'd share a couple of them here (with a few rough code sketches further down):

First, building a testing suite to concretely measure and improve agent performance is no trivial task. (We're optimistic that someone might build an awesome system for this one day, but we haven't found one, so we're doing it ourselves.) Specifically, we want to run existing test cases against new implementations of our entire LLM processing flow - not just new prompts - and be able to say rigorously whether we've improved and where we've regressed. That means defining tests so they're resilient to major overhauls of code structure, and building a test execution context that mimics production behavior (e.g. a test user with calendar events, emails, and contact info). On top of that, every test case has to be written by hand, painstakingly, with expected outputs sometimes running to many tens of thousands of characters.

On the monitoring side, most of our reliability issues are soft errors that are very hard to catch programmatically. When malfunctions happen, we usually learn about them from customer feedback rather than from any conventional third-party monitoring system. The best we can do without manually sifting through tons of data is to implement rudimentary checks for behavior patterns that we know historically indicate errors (e.g. many similar API calls in quick succession, which implies a function call is failing and being retried).

Another problem we keep coming back to is the stateless nature of LLMs: nothing is stored latently between calls, so every piece of context has to be reintroduced at every invocation. Because of how much information Martin needs (product information, user memory, tool definitions, previous messages, platform-specific instructions, etc.), we have to carefully manage what we expose to Martin and how we balance broad context against specific information. Vanilla RAG can't handle the complexity, so we built custom retrieval and context injection systems for each LLM call. We abstract some information away behind function calls and organize certain tools into modules that share context and instructions. This strategy has helped a lot with reliability.
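To make the testing problem concrete, here's roughly the shape of harness we mean, heavily simplified (run_flow and judge are placeholders, not our real code). The important part is that a test case only touches a single boundary - one entry point into the whole LLM flow plus a seeded test user - so it keeps working across internal rewrites:

    from dataclasses import dataclass, field

    @dataclass
    class TestUserFixture:
        # Mimics a production user: seeded calendar, inbox, and contacts.
        calendar_events: list[dict] = field(default_factory=list)
        emails: list[dict] = field(default_factory=list)
        contacts: list[dict] = field(default_factory=list)

    @dataclass
    class AgentTestCase:
        name: str
        fixture: TestUserFixture
        user_message: str
        expected_output: str  # hand-written; sometimes tens of thousands of characters

    def run_suite(cases, run_flow, judge):
        # run_flow is the single entry point into the entire LLM processing flow,
        # so test cases survive internal refactors as long as that boundary holds.
        # judge compares actual vs. expected output however you trust it to
        # (exact match, structured diff, LLM grader, ...).
        return {c.name: judge(run_flow(c.fixture, c.user_message), c.expected_output)
                for c in cases}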
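Here's a simplified sketch of the retry-pattern check mentioned above (RetryStormDetector is a placeholder name, not our production code): keep a short sliding window of recent function calls per session and flag the session when too many near-identical calls land inside it.

    import time
    from collections import defaultdict, deque

    class RetryStormDetector:
        # Flags a session when many similar function calls (same name, same
        # argument hash) happen within a short window - a pattern that has
        # historically meant the agent is failing and retrying, not progressing.
        def __init__(self, window_seconds=60, threshold=5):
            self.window = window_seconds
            self.threshold = threshold
            self.calls = defaultdict(deque)  # (session, fn, args_hash) -> timestamps

        def record(self, session_id, fn_name, args_hash, now=None):
            now = time.time() if now is None else now
            q = self.calls[(session_id, fn_name, args_hash)]
            q.append(now)
            while q and now - q[0] > self.window:
                q.popleft()
            return len(q) >= self.threshold  # True -> flag for human review

In practice something like this would hang off the function-calling layer and feed an alert channel, since these soft errors rarely show up as exceptions.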
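And here's the rough shape of the tool-module and context-injection idea: related tools share one block of instructions and one retrieval hook, and each LLM call assembles only the always-on context plus whatever the relevant modules fetch for that user and request. Again a simplified sketch with placeholder names (ToolModule, build_context), not our actual implementation:

    from dataclasses import dataclass, field
    from typing import Callable

    @dataclass
    class ToolModule:
        # A group of related tools (e.g. everything calendar-related) that
        # share one set of instructions and one retrieval hook, instead of
        # each tool carrying its own copy of the context.
        name: str
        instructions: str
        tools: dict[str, Callable] = field(default_factory=dict)
        fetch_context: Callable = lambda user_id: ""  # module-specific retrieval

    def build_context(user_id, message, base_instructions, user_memory, modules):
        # Per-call context injection: broad info that is always present, plus
        # the specific context each relevant module retrieves for this request.
        sections = [base_instructions, user_memory]
        for m in modules:
            sections.append(m.name + "\n" + m.instructions + "\n" + m.fetch_context(user_id))
        sections.append("User message: " + message)
        return "\n\n".join(sections)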
Of course, we're still a long way from Jarvis. Whenever one of us struggles with a technical problem, the other will kindly remind him that "Tony Stark built this in a cave, with a box of scraps!"

We're super pumped about where software is headed. It feels like we're tinkering with ideas right at the edge of what's possible. You can try Martin on desktop and iOS at https://trymartin.com. There's a 7-day free trial, and if you find it useful, it's $35/month afterwards for unlimited usage.

We're very excited to hear your thoughts! If you have any ideas about reliability for agents or the future of consumer AI interfaces, we'd love to discuss and trade notes.