Early Adopter Release of Kodit: MCP server to index external repositories
by Dr. Phil Winder , CEO
A lot of my work today is assisted with AI, from editing blog posts to developing proposals. But the number one use case is AI-assisted coding. Tools like Cursor, Cline, Roo, Aider, Claude Code, etc. have disrupted software engineering to levels I hadn’t anticipated. But still, it’s not perfect.
This post is about one particular set of problems and a new open source tool I developed to alleviate it.
AI Coding Assistants Are Incredible, Except
Winder.AI has been operating for more than 12 years, and in that time I’ve seen what is now called AI grow from NLP classifiers into something indispensable. Today I’m using a coding assistant to help me do nearly everything on the engineering side. But the same problem kept popping up time and time again…
Before I dive into this, if you’re reading this post then you’ve probably been following me for a while. And, like me, you probably have a background in data science or machine learning. You know why these issues occur; in fact, you expect them. But it’s important to remember that 99% of people now using AI coding assistants do not have the benefit of ML experience. They’re using it purely as a tool.
That’s more of a note to myself than anyone. I’ll try my best to explain what’s going on.
Two Key Issues: Cut-Offs and Private Data
Language models get their super-human abilities by training on vast amounts of text data. The sheer quantity and breadth allow language models to predict sequences of words and simulate emergent behaviour. In a coding capacity, it’s the underlying mathematical knowledge paired with an understanding of common codebases that makes them so good as coding assistants. But there are at least two major, distinct problems.
The first problem is that the training data has a cut-off. Scraping the entire internet is hard. Curating that content so that only the best of it is retained is even harder. Yet this is what the best models need to do. It’s been shown time and time again that better data is more powerful than better models. But collecting and curating data takes time. Then it takes even more time to train the model. And more to deploy and gain adoption, and so on. In my experience, most state-of-the-art models have a data cut-off approximately one year prior to their release. But what does this mean?
A cut-off of one year means that SQLAlchemy is still releasing 1.4 updates, PyTorch has just released 2.3, and Rust was esoteric. The best one is that ChatGPT still wants to expand MCP to “Multi-Context Provider”! The language model was trained on all of this, yet you’re using it now. The predictions the models make are from the past.
The second key problem is that the training data doesn’t include your private data (we hope!). So if your task is to use, incorporate, or improve upon that code, the model obviously has very little chance of doing a good job. The only way to help it is to include as much of that private data in the context as possible and hope that zero-shot learning works.
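“Including private data in the context” just means pasting relevant private code into the prompt before the question. Here’s a toy sketch of the idea; the function name, the character-based budget, and the prompt wording are all my own illustrative assumptions, not how any particular assistant actually does it:

```python
def build_prompt(question: str, snippets: list[str], budget_chars: int = 8000) -> str:
    """Concatenate as many private code snippets as fit within a crude
    character budget, then append the user's task. Real assistants work
    with token budgets and retrieval, but this is the idea in miniature."""
    parts: list[str] = []
    used = 0
    for snippet in snippets:
        if used + len(snippet) > budget_chars:
            break  # context window is full; remaining snippets are dropped
        parts.append(snippet)
        used += len(snippet)
    context = "\n\n".join(parts)
    return f"Use the following internal code as reference:\n{context}\n\nTask: {question}"
```

The catch, of course, is deciding *which* snippets to include, which is exactly the retrieval problem Kodit tackles below.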
Both of these issues led me to the realisation that we need a way to incorporate private or new codebases into the context of a coding assistant. So I built Kodit.
Kodit: An MCP Server to Index External Repositories
Kodit is a CLI and MCP (Model Context Protocol) server that allows you to index external repositories. It works by scanning a local directory or a remote git repository and building exemplar code snippets from the repository code. These snippets are then enriched with extra information, such as a description of what the snippet is doing.
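To make the idea concrete, here’s a simplified sketch of what “building exemplar snippets” could look like. This is not Kodit’s actual implementation; the `Snippet` shape and the naive top-level-function extraction are my own illustrative assumptions:

```python
import ast
from dataclasses import dataclass

@dataclass
class Snippet:
    path: str
    code: str
    description: str  # later filled in by an enrichment LLM

def extract_snippets(source: str, path: str) -> list[Snippet]:
    """Naively split a Python file into one snippet per top-level function.
    A real indexer handles more languages and structures than this."""
    tree = ast.parse(source)
    snippets = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            code = ast.get_source_segment(source, node)
            snippets.append(Snippet(path=path, code=code, description=""))
    return snippets
```

Each snippet then gets its `description` populated by a language model, so searches can match on intent (“parse a config file”) as well as on the code itself.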
The snippets are then exposed as a search tool on an MCP server. I made this decision because all popular AI coding assistants allow you to add MCP servers.
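For assistants that read an `mcpServers`-style configuration file (Cursor, Claude Desktop, and others use this shape), registering Kodit might look something like the following. The server name, URL, and port here are illustrative assumptions; check the Kodit documentation for the real endpoint:

```json
{
  "mcpServers": {
    "kodit": {
      "url": "http://localhost:8080/sse"
    }
  }
}
```

Once registered, the assistant can call Kodit’s search tool whenever it decides external context would help.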
Search is via keywords (BM25) and semantic search (embeddings for text and code). Results are fused with reciprocal rank fusion.
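Reciprocal rank fusion (RRF) is a simple way to merge ranked lists from different retrievers without having to calibrate their scores against each other: each document earns 1/(k + rank) from every list it appears in, and the sums decide the final order. A minimal sketch (the constant k=60 is the conventional default from the RRF literature, not a Kodit-specific choice):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked result lists (best first) into one.
    Documents ranked highly by multiple retrievers float to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc: scores[doc], reverse=True)

bm25_results = ["a", "b", "c"]      # keyword search ranking
semantic_results = ["b", "c", "d"]  # embedding search ranking
fused = reciprocal_rank_fusion([bm25_results, semantic_results])
# "b" wins: it ranks well in both lists
```

The appeal is that BM25 scores and cosine similarities live on incomparable scales, but ranks are always comparable.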
Local First - Remote for Scale
I’ve tried hard to maintain a local-first approach to create a better user experience out of the box. Code and text representations are created with a tiny CPU embedding model. Enrichment is performed by a tiny CPU LLM. Data is stored in a local SQLite database.
However, for most users I recommend leveraging remote AI APIs, which provide better embeddings and more capable language models. In that configuration, VectorChord provides lightning-fast BM25 and vector search.
This also allows Kodit to be used enterprise-wide. Kodit is basically a webserver, so you can spin it up in a Docker container, connect it to your AI providers of choice (on-premise, if you like!) and your production VectorChord/PostgreSQL database, and share the benefits of improved context with your colleagues.
Try it Now: Help Guide the Future
I think Kodit is a great idea. But it’s still early. There’s lots of missing functionality. And I want to gather feedback to help set the direction for the project.
Try it now (more installation methods are available):
pipx install kodit
Here’s everything you need to get started: