AI is better at writing code than reading code. Here’s why.

Daksh Gupta
Aug 7, 2023

Teaching AI to read code, a simple guide from Onboard AI.
www.getonboard.dev

The advice I give to everyone starting college is to take one programming class. Not because everyone should know how to code, though it doesn’t hurt, but because learning to program teaches you how to think. It forces you to examine the flow of logic inside your brain that leads you to each one of your judgements and conclusions, and by giving you the vocabulary to talk about how computers think, it inadvertently gives you the vocabulary and frameworks to express how you think: recursion, iteration, conditional branching, how some thoughts in your head are in memory while others are in storage, and more.

Sometimes I find myself examining this advice. If the way computers think is so similar to how people think, what makes code so hard to read?

1. Why is code so hard to read?

I think there are two reasons.

1.1 The unintuitive nature of directory structure

A web app written in MERN will have a completely different-looking directory structure from a functionally identical one written in Next.js. These are both JavaScript codebases for an identical piece of software. However, where external API calls get made, where backend logic is executed, and where HTML components are generated are completely different. Given a codebase, I couldn’t tell you where certain functionality is without some fumble-y searching.

1.2 The interconnected web of large codebases

If you as a developer look at a standalone script, assuming it is in a language you are familiar with, you are pretty likely to be able to decipher what it does. It isn’t really the code itself that is hard to understand but rather the codebase. When there are hundreds of scripts that are interdependent, it is hard to decipher what each script, module, or package does. Unless, of course, it’s well documented.

2. Documentation and the knowledge gap

The biggest problem with docs, naturally, is that no one wants to write them. However, let’s put that aside for a second and consider whether docs are even the right way to approach learning a new codebase.

Docs, when done right, are an exhaustive explanation of the code’s functionality and its components. The biggest problem with docs, then, is actually that they are the same no matter who is looking at them. Consider a Next.js app that lets users chat with one another. A developer familiar with Next.js but not with chat programs will have a very different set of gaps than a chat developer who has only ever worked with PHP.

The solution to this is usually subject-matter experts. If you’re being onboarded onto a new codebase at your job, you should set up a meeting with a senior dev who knows their way around the code. Of course, that means waiting until their next calendar availability and then taking up an hour of their day. It will likely be great though: you tell them what you know and they fill in the gaps.

At Onboard, our challenge is to teach an AI a codebase so it can be that senior dev, except available 24/7 and at a dramatically lower cost per hour.

3. How to make an AI an expert on a codebase

Working with LLMs is an unrelenting struggle against the context window. OpenAI’s GPT-3.5 caps out at roughly 16,000 tokens, not nearly enough for an LLM to hold the entirety of a codebase in its context and answer your questions.

Interestingly, human beings _also_ don’t have the required memory to store an entire codebase. So we start with how a human senior dev would answer a question like “Where is the auth functionality and how does it work?”

1. The dev likely has a mental map of the codebase and a general idea of where things are.

2. They will load this map into their “memory”.

3. They likely remember an overview of what certain key files do, so they load that into their “memory” too.

4. Now, if they have the codebase open, they can reference specific files to synthesize their answer.
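The four steps above can be sketched as a context-packing routine. This is a minimal illustration, not how Onboard does it: `Snippet`, `rough_token_count`, and `build_context` are hypothetical names, the 4-characters-per-token estimate is a crude assumption, and the list of relevant paths is assumed to come from somewhere else.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    label: str  # e.g. "map", "overview:src/auth.py", "file:src/auth.py"
    text: str


def rough_token_count(text: str) -> int:
    # Crude assumption: about 4 characters per token. A real tokenizer will differ.
    return max(1, len(text) // 4)


def build_context(repo_map, overviews, files, relevant_paths, budget_tokens):
    """Pack context in the order a senior dev recalls it: map, overviews, files."""
    # Steps 1-2: the mental map of the codebase goes in first.
    candidates = [Snippet("map", repo_map)]
    # Step 3: short overviews of the key files.
    candidates += [
        Snippet(f"overview:{p}", overviews[p]) for p in relevant_paths if p in overviews
    ]
    # Step 4: the specific files themselves, if there is room left.
    candidates += [
        Snippet(f"file:{p}", files[p]) for p in relevant_paths if p in files
    ]

    chosen, used = [], 0
    for snippet in candidates:
        cost = rough_token_count(snippet.text)
        if used + cost > budget_tokens:
            continue  # skip anything that would overflow the budget
        chosen.append(snippet)
        used += cost
    return chosen, used
```

The ordering is the design choice here: cheap, high-level material is admitted before expensive file contents, so a tight budget still leaves the model with a map and summaries rather than one large file and nothing else.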

Ultimately, the secret to giving LLMs the ability to synthesize relevant answers from corpora of data is to be extremely deliberate about what goes into their inference context. This is not new information. People have been storing embeddings in vector databases and using nearest-neighbor search to assemble context for months now. What we have found is that this is far from a trivial problem when it comes to large codebases.
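For readers who haven’t seen the embeddings approach, here is a toy sketch of nearest-neighbor retrieval. The `embed` function is a stand-in I made up for illustration: it counts words instead of calling a real embedding model, and a plain dictionary plays the role of the vector database. Only the ranking idea carries over to real systems.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest(query: str, chunks: dict[str, str], k: int = 2) -> list[str]:
    # Rank chunk names by similarity to the query and keep the top k.
    q = embed(query)
    ranked = sorted(chunks, key=lambda name: cosine(q, embed(chunks[name])), reverse=True)
    return ranked[:k]
```

Given descriptions of `auth.py`, `chat.py`, and `db.py`, a query about login and passwords should rank the auth chunk first, because that is where the vocabulary overlaps.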

4. Why is it specifically hard to do this with codebases?

Codebases are non-linear, and unlike how we approached them earlier, they are also not trees. What they are is directed graphs (acyclic, setting recursion aside) where each node is a class or function. Each edge is a connection where `A()` calls `B()`, or `C` inherits from `D`, and so on. By extension, to vectorize a codebase, this graph needs to be built and made semantically searchable. Not just that, every update to a node needs to be propagated one layer out, to its immediate neighbors.

5. Conclusion

These are my thoughts on why the problem of semantic code search is so hard. It is the exact problem we are working to solve at Onboard AI (www.getonboard.dev).

Onboard is free to use for public repos under 100 MB.

You can use it here: getonboard.dev

Thank you for reading!
