*TL;DR*<p>Over a month ago I posted about a really hard problem that I "accidentally" solved (https://news.ycombinator.com/item?id=40460084).<p>The problem is resolving cross-file references across multiple programming languages. I can now generate a graph representation of a codebase.<p>*Why do you need a graph representation of the codebase?*<p>- To understand how code references other code<p>- To track how data is passed around<p>I generated references for the repo https://github.com/dj-stripe/dj-stripe, here is a gist: https://gist.githubusercontent.com/kannthu/6e1bdd2781d2e0a6ded30844d61f089e/raw/f1fa4bc0f34891834ce13ac256eec12f6cc671e1/dj-stripe-references.json<p>The gist is a big JSON blob that contains definitions from the repository.<p>Definitions are:<p>- top-level functions<p>- classes<p>- methods and public properties<p>- top-level variables<p>- exports<p>Each definition contains:<p>- Snippet, path, and range within the file<p>- "references" - a list of places where the definition is used<p>- "expressions" - a list of resolved references (variables, functions, and classes) that are used within the body of the definition<p>*How can this data be useful?*<p>If you are building code generation, code intelligence, or code review products, your product needs to understand the codebase across many programming languages at once. The more accurate the context you feed to an LLM, the better the output you will get, and building this in-house is really expensive and resource-consuming.<p>Let me know if this is interesting to any of you.
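<p>To make the shape of the data more concrete, here is a rough Python sketch of how the "references" and "expressions" lists described above could be turned into a graph. The key names (`name`, `path`, `range`, `snippet`, `line`) and the sample entries are assumptions made up for illustration; the real gist may use different keys and nesting.

```python
import json
from collections import defaultdict

# Hypothetical sample mirroring the fields described in the post.
# Key names and values are assumptions, not the gist's actual schema.
SAMPLE = json.loads("""
[
  {"name": "sync_customer", "path": "app/sync.py",
   "range": {"start": 10, "end": 24},
   "snippet": "def sync_customer(...): ...",
   "references": [{"path": "app/views.py", "line": 42}],
   "expressions": [{"name": "Customer", "path": "app/models.py"}]},
  {"name": "Customer", "path": "app/models.py",
   "range": {"start": 5, "end": 60},
   "snippet": "class Customer: ...",
   "references": [{"path": "app/sync.py", "line": 12}],
   "expressions": []}
]
""")


def build_graph(definitions):
    """Map each definition to the definitions used inside its body."""
    graph = defaultdict(set)
    for d in definitions:
        src = (d["path"], d["name"])
        graph[src]  # make sure every definition appears as a node
        for expr in d.get("expressions", []):
            graph[src].add((expr["path"], expr["name"]))
    return dict(graph)


def files_using(definitions, name):
    """Return the sorted list of files that reference the named definition."""
    paths = set()
    for d in definitions:
        if d["name"] == name:
            paths.update(ref["path"] for ref in d.get("references", []))
    return sorted(paths)


graph = build_graph(SAMPLE)
users = files_using(SAMPLE, "Customer")
```

<p>In this sketch "expressions" gives the outgoing edges (what a definition depends on) and "references" gives the incoming ones (who uses it), so both directions of the graph can be walked without re-parsing any source files.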