AI coding agents work better when they can search a codebase semantically (“find the authentication logic”) rather than just by filename or grep. A code index backed by Turso can combine full-text search over identifiers with vector similarity search over embeddings, all in a single embedded database. This guide covers the schema and queries for building a code indexer with both FTS and vector search.

Schema

CREATE TABLE IF NOT EXISTS codebases (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    root_path   TEXT NOT NULL UNIQUE,
    name        TEXT NOT NULL DEFAULT '',
    indexed_at  INTEGER NOT NULL DEFAULT 0
);

CREATE TABLE IF NOT EXISTS chunks (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    codebase_id     INTEGER NOT NULL REFERENCES codebases(id),
    file_path       TEXT NOT NULL,
    chunk_key       TEXT NOT NULL UNIQUE,
    language        TEXT NOT NULL,
    kind            TEXT NOT NULL,
    name            TEXT NOT NULL DEFAULT '',
    signature       TEXT NOT NULL DEFAULT '',
    snippet         TEXT NOT NULL,
    start_line      INTEGER NOT NULL,
    end_line        INTEGER NOT NULL,
    file_hash       TEXT NOT NULL,
    indexed_at      INTEGER NOT NULL,
    embedding       F8_BLOB(384),
    embedding_model TEXT DEFAULT ''
);

CREATE TABLE IF NOT EXISTS indexed_files (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    codebase_id INTEGER NOT NULL REFERENCES codebases(id),
    file_path   TEXT NOT NULL,
    file_hash   TEXT NOT NULL,
    chunk_count INTEGER NOT NULL DEFAULT 0,
    indexed_at  INTEGER NOT NULL,
    UNIQUE(codebase_id, file_path)
);

How it fits together

  • codebases registers each project root. A single database can index multiple codebases.
  • chunks stores individual semantic units extracted from source files — functions, structs, classes, impl blocks, etc. Each chunk has a name, signature, code snippet, line range, and optionally a vector embedding. The chunk_key is a unique identifier (e.g. file_path::kind::name) for upsert operations.
  • indexed_files tracks which files have been indexed and their content hashes, enabling incremental re-indexing — only changed files are re-processed.
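As a sketch, a chunk_key can be derived from the chunk's location and identity. The `file_path::kind::name` format is one convention; any deterministic, unique scheme works:

```javascript
// Build a stable, unique key for a chunk so re-indexing the same
// function upserts the existing row instead of inserting a duplicate.
// Hypothetical helper — the exact format is up to the indexer.
function chunkKey(filePath, kind, name) {
  return `${filePath}::${kind}::${name}`;
}

chunkKey("src/auth.rs", "fn", "verify_token");
// → "src/auth.rs::fn::verify_token"
```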

Connecting

The FTS index features require an experimental flag at connection time:
import { connect } from "@tursodatabase/database";

const db = await connect(".index/code.db", {
  experimental: ["index_method"],
});
await db.exec(SCHEMA);
The experimental: ["index_method"] flag enables Turso’s USING fts index syntax and the fts_match() / fts_score() functions.

Indexing files

Upserting chunks

When a file is parsed, upsert its chunks. Setting embedding = NULL on conflict forces re-embedding when content changes:
INSERT INTO chunks (codebase_id, file_path, chunk_key, language, kind, name,
                    signature, snippet, start_line, end_line, file_hash, indexed_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT(chunk_key) DO UPDATE SET
  language = excluded.language,
  kind = excluded.kind,
  name = excluded.name,
  signature = excluded.signature,
  snippet = excluded.snippet,
  start_line = excluded.start_line,
  end_line = excluded.end_line,
  file_hash = excluded.file_hash,
  indexed_at = excluded.indexed_at,
  embedding = NULL,
  embedding_model = '';

Tracking indexed files

INSERT INTO indexed_files (codebase_id, file_path, file_hash, chunk_count, indexed_at)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT(codebase_id, file_path) DO UPDATE SET
  file_hash = excluded.file_hash,
  chunk_count = excluded.chunk_count,
  indexed_at = excluded.indexed_at;

Skipping unchanged files

Before parsing a file, check if it has changed:
SELECT file_hash FROM indexed_files WHERE codebase_id = ? AND file_path = ?;
If the hash matches, skip it entirely.
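One way to produce the content hash, assuming Node's built-in crypto module (any stable hash function works):

```javascript
import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";

// sha-256 over the file's raw bytes; if this matches the stored
// file_hash, the file is unchanged and parsing can be skipped.
function hashContent(bytes) {
  return createHash("sha256").update(bytes).digest("hex");
}

function fileHash(path) {
  return hashContent(readFileSync(path));
}
```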

Storing embeddings

Embeddings can be stored as int8-quantized vectors using Turso’s vector8() function, which reduces storage from 1,536 bytes (float32, 384 dims) to 395 bytes:
UPDATE chunks SET embedding = vector8(?), embedding_model = ? WHERE chunk_key = ?;
The parameter for vector8() is a JSON-stringified float array. Turso handles the quantization internally. Find chunks that need embedding:
SELECT chunk_key, name, signature, file_path, kind, snippet
FROM chunks
WHERE codebase_id = ? AND (embedding IS NULL OR embedding_model != ?)
LIMIT ?;
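A sketch of preparing the vector8() parameter, assuming the embedding model returns a Float32Array:

```javascript
// vector8() takes a JSON array of floats and quantizes it internally.
// A Float32Array must be converted to a plain array first, since
// JSON.stringify does not serialize typed arrays as arrays.
function toVectorParam(embedding) {
  return JSON.stringify(Array.from(embedding));
}

toVectorParam(new Float32Array([0.5, -0.25, 1]));
// → "[0.5,-0.25,1]"
```

The resulting string is bound as the `?` in `UPDATE chunks SET embedding = vector8(?) ...`.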

Full-text search

Turso provides an FTS index with weighted BM25 scoring that is separate from SQLite’s fts5 virtual tables.

Creating the FTS index

Create an FTS table per codebase and populate it from the chunks:
-- Create the FTS table
CREATE TABLE IF NOT EXISTS fts_1 (
    chunk_id    INTEGER NOT NULL REFERENCES chunks(id) ON DELETE CASCADE,
    name        TEXT NOT NULL DEFAULT '',
    signature   TEXT NOT NULL DEFAULT ''
);

-- Populate from chunks
INSERT INTO fts_1 (chunk_id, name, signature)
SELECT id, name, signature FROM chunks WHERE codebase_id = 1;

-- Create the FTS index with weighted columns
CREATE INDEX IF NOT EXISTS idx_fts_1 ON fts_1
    USING fts (name, signature)
    WITH (
        tokenizer = 'default',
        weights = 'name=5.0,signature=3.0'
    );
The weights parameter controls BM25 scoring — here, matches in the function/type name are weighted 5x and signature matches 3x.
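The effect of the weights can be illustrated with a toy scoring function. This is a deliberate simplification, not Turso's actual BM25 implementation:

```javascript
// Toy illustration of per-column weighting — NOT Turso's BM25 internals.
// Each column's match score is scaled by its weight, so an identical
// term hit in `name` (5.0) outranks one in `signature` (3.0).
function weightedScore(cols, weights) {
  return cols.name * weights.name + cols.signature * weights.signature;
}

const weights = { name: 5.0, signature: 3.0 };
weightedScore({ name: 1, signature: 0 }, weights); // → 5 (hit in name)
weightedScore({ name: 0, signature: 1 }, weights); // → 3 (hit in signature)
```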

Searching with FTS

SELECT chunk_id, fts_score(name, signature, ?1) AS score
FROM fts_1
WHERE fts_match(name, signature, ?1)
ORDER BY score DESC
LIMIT ?;
Then fetch the full chunk data:
SELECT chunk_key, file_path, name, kind, signature, snippet, start_line, end_line
FROM chunks WHERE id = ?;

Vector search

For semantic or natural-language queries, use vector cosine distance over the stored embeddings:
SELECT chunk_key, file_path, name, kind, signature, snippet, start_line, end_line,
       vector_distance_cos(embedding, vector8(?)) AS distance
FROM chunks
WHERE embedding IS NOT NULL
ORDER BY distance ASC
LIMIT ?;
The score is 1 - distance (cosine similarity derived from cosine distance).

Hybrid search

For the best results, combine both approaches using Reciprocal Rank Fusion (RRF). Run the FTS and vector queries separately, then merge the ranked lists in application code:
// Run both searches
const ftsResults = await ftsSearch(query, limit);
const vecResults = await vectorSearch(queryEmbedding, limit);

// Reciprocal Rank Fusion
const k = 60; // RRF constant
const scores = new Map();

ftsResults.forEach((r, i) => {
  const key = r.chunk_key;
  scores.set(key, (scores.get(key) || 0) + 1 / (k + i + 1));
});

vecResults.forEach((r, i) => {
  const key = r.chunk_key;
  scores.set(key, (scores.get(key) || 0) + 1 / (k + i + 1));
});

// Sort by combined score
const merged = [...scores.entries()]
  .sort((a, b) => b[1] - a[1])
  .slice(0, limit);

Removing stale files

When files are deleted from the codebase, clean up their chunks and records:
DELETE FROM chunks WHERE codebase_id = ? AND file_path = ?;
DELETE FROM indexed_files WHERE codebase_id = ? AND file_path = ?;
After removing stale data, rebuild the FTS table to keep the index consistent:
DROP TABLE IF EXISTS fts_1;
-- Then recreate and repopulate as shown above
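Deleted files can be detected by diffing the indexed_files rows against a fresh filesystem walk. A sketch, where both inputs are plain arrays of paths:

```javascript
// Files present in the index but absent on disk are stale; their
// chunks and indexed_files rows should be deleted with the queries above.
function findStaleFiles(indexedPaths, currentPaths) {
  const current = new Set(currentPaths);
  return indexedPaths.filter((p) => !current.has(p));
}

findStaleFiles(["src/a.rs", "src/b.rs"], ["src/a.rs"]);
// → ["src/b.rs"]
```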

Key design points

  • Incremental indexing via file hashes means only changed files are re-parsed and re-embedded, making updates fast even on large codebases.
  • Two search modes — FTS for identifier/keyword lookup and vector search for semantic/natural language queries — cover different use cases. Hybrid search combines both.
  • Weighted FTS (name=5.0, signature=3.0) biases results toward function and type names, which is what developers usually search for.
  • Int8 quantized vectors via vector8() reduce storage by roughly 75% compared to float32, with minimal impact on search quality.
  • Everything runs in a single embedded database file with no external services.

Example

codemogger is an MCP server that implements this pattern as a code indexer for AI coding agents.