Compare · Vector databases

TroveFiles vs. vector databases for AI agent retrieval.

Vector databases are the default answer for "how does my agent retrieve information?" — but they're overkill for most agent workflows. Here's an honest comparison: when grep beats embeddings, when embeddings win, and when the right answer is both.

1.0 TL;DR

Two different retrieval models.

Vector databases (Pinecone, pgvector, Weaviate, Qdrant) retrieve by semantic similarity: chunk the documents, embed the chunks, store the vectors, embed the question, return the nearest matches. Filesystem retrieval (TroveFiles) retrieves by structural pattern: the agent issues a grep, awk, or pdftotext command and gets back the literal match.

For most agent tasks — pulling a clause out of a contract, finding a specific number in a filing, listing files matching a pattern — exact pattern retrieval is faster, cheaper, and easier to reason about. Vector search wins where exact match genuinely fails.

2.0 THE PIPELINES

What it costs to retrieve.

The same retrieval task — "find EBITDA in a quarterly filing" — looks dramatically different across the two models. One is a single shell command. The other is a multi-step pipeline.

# Upload once, search forever
trove.upload("workspace/filings/q3-2024.pdf", open("q3.pdf", "rb"))

# Agent retrieves with one shell command per question
bash("grep -r 'EBITDA' workspace/filings/")
bash("pdftotext workspace/filings/q3-2024.pdf - | sed -n '/Risk/,/^$/p'")

# Cost: storage. Latency: ~10ms per grep.
# No embedding pipeline. Deterministic. Files live forever.
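For contrast, here is the vector-database side of the same task, sketched end to end. This is a toy, not any vendor's API: the hash-based embed() stands in for a real embedding model, and the chunker and in-memory index are assumptions made to keep the sketch self-contained.

```python
import hashlib
import math

def chunk(text, size=200):
    # Naive fixed-size chunking; real pipelines tune size and overlap.
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text, dims=64):
    # Toy embedding: hash character trigrams into a fixed-size unit vector.
    # Stands in for a real embedding model call.
    vec = [0.0] * dims
    for i in range(len(text) - 2):
        h = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16)
        vec[h % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

# Ingest: chunk, embed, upsert -- paid again every time the document changes.
doc = "Revenue grew 12% year over year. EBITDA was $4.2M for Q3 2024. " * 10
index = [(c, embed(c)) for c in chunk(doc)]

# Query: embed the question, scan the index, return the nearest chunk.
q = embed("What was EBITDA this quarter?")
best = max(index, key=lambda item: cosine(q, item[1]))
print(best[0][:60])
```

Every step here has a filesystem-side equivalent of "run grep," and every step is a place where chunk boundaries or embedding drift can change the answer.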
3.0 RETRIEVAL IS A PLANNING TASK

The LLM is a better retrieval architect than your pipeline.

The deeper case for filesystem retrieval isn't that grep is fast. It's that an LLM, given a filesystem and a shell, will architect its own retrieval better than any pre-built RAG pipeline.

A 2026-class model handed a workspace doesn't one-shot a top-k query. It does what an analyst does: ls -la workspace/contracts/ to scan by date, find … | xargs grep for hierarchical search, head -200 to skim before committing to a full read, pdftotext file.pdf - | sed -n '/Risk/,/^$/p' to grab a section, then iterates — looks at intermediate results and decides what to do next.

Vector DBs are stuck in 2023 RAG ergonomics: one query, one ranked list, hope it's right. Filesystem retrieval lets the agent do what it's already good at — multi-step reasoning over partial information.
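The iterative pattern above can be sketched as a command sequence. The workspace layout and file contents here are hypothetical, stand-ins for whatever the agent actually has on disk:

```shell
# Set up a toy workspace (stand-in for an agent's real files).
mkdir -p workspace/contracts
printf 'Effective date: 2024-01-15\nTermination: 90 days notice\n' \
  > workspace/contracts/acme-msa.txt

# 1. Scan: what's here, and how recent?
ls -la workspace/contracts/

# 2. Narrow: which files mention termination?
grep -rl 'Termination' workspace/contracts/

# 3. Skim before committing to a full read.
head -200 workspace/contracts/acme-msa.txt

# 4. Extract just the clause that matters.
grep 'Termination' workspace/contracts/acme-msa.txt
```

Each step's output informs the next command, which is exactly the loop a top-k query flattens away.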

4.0 SIDE BY SIDE

The honest tradeoffs.

| Dimension | TroveFiles (filesystem) | Vector database |
| --- | --- | --- |
| Retrieval model | Exact pattern (grep, awk, jq) | Semantic similarity (cosine, dot product) |
| Best for | Keyword search, structured data, exact-match recall | Fuzzy semantic search, ranking, similarity |
| Write cost | One file write | Chunk + embed + upsert (per doc, per change) |
| Read cost | One shell command | Embed query + index query + rerank |
| Determinism | Same command, same answer | Depends on embedding model and chunking strategy |
| Inspectability | cat the file, eyeball the result | Vector inspection; no human-readable representation |
| Multi-tenant isolation | Per-namespace directory roots | Per-namespace collections (tooling varies) |
| Multimodal preprocessing | pdftotext, ffmpeg, convert in the workspace | Separate ETL pipeline before embedding |
| Deletion / GDPR | rm -rf workspace/users/alice/ | Find every chunk, drop from index, hope metadata is clean |
| Portability | A directory of files; copy anywhere | Vendor-locked index format; migrating is a project |
| Cost at small scale (< 100k docs) | Storage only | Embeddings + index hosting |
| Cost at large scale (10M+ docs) | Grep latency grows linearly | Sublinear with proper index |
5.0 PICK ONE (OR BOTH)

Pick TroveFiles when…

  • The agent retrieves by keyword, name, or path.
  • You want the LLM to architect retrieval (grep + awk + sed) rather than pre-index.
  • Source documents change often — re-embedding is a tax you want to skip.
  • Determinism matters more than fuzzy similarity.
  • You want one tool that handles memory, files, and retrieval.

Pick a vector database when…

  • The question is genuinely semantic ("like X but not exactly X").
  • The corpus is huge and stable — pre-indexing pays for itself.
  • You need ranking by similarity, not just match/no-match.
  • The user's query language is far from the document language (translation, paraphrase).

The strongest production setups use both: TroveFiles for the agent's own memory and known-keyword retrieval, a vector database for semantic search over a large stable corpus. See the memory use case for how teams split the work.

6.0 FAQ

Filesystem vs. vector DB, answered.

When should I pick a filesystem over a vector database?

When the agent is retrieving things it knows the keywords for: contract clauses, code identifiers, named entities, exact phrases, structured data. Filesystem retrieval (grep, awk, jq, pdftotext) is faster, cheaper, and deterministic. The agent issues a shell command and gets the exact match.

When does a vector database actually pay off?

When the question doesn't map cleanly to keywords — "find conversations that felt similar to this one," "retrieve documents semantically related to a topic," "rank these passages by relevance." Embeddings shine where exact-match search fails, which is a real but narrow set of agent tasks.

Can I use both?

Yes — most production agents do. TroveFiles for the agent's own memory, scratchpad, and known-keyword corpus retrieval. A vector database for semantic similarity over a large external corpus. They are complements, not competitors.

Doesn't a vector database scale better than grep?

For very large corpora (tens of millions of documents), vector indices win on latency. For typical agent corpora — a customer's files, a knowledge base, last year's contracts — TroveFiles stays sub-second by keeping retrieval close to the data, so the agent isn't paying for round trips between an embed call, an index query, and a rerank.

Does grep hold up under concurrent agents?

Yes. TroveFiles is built so each command runs independently — concurrent agents fan out instead of queuing through a shared index. Throughput scales with parallel readers rather than the index tier you pay for. Vector DBs, by contrast, route every query through a single index, so concurrency is bounded by replicas and pricing tier.

What about chunking and re-embedding when documents change?

Vector pipelines have to re-chunk and re-embed every time a source document changes. With TroveFiles, you just write the new file. The next grep picks it up. No re-indexing job, no embedding cost on writes.
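The update path can be shown in two commands. The workspace path and figures are hypothetical; the point is that a write is just a write:

```shell
# Hypothetical workspace with one filing extracted to text.
mkdir -p workspace/filings
printf 'EBITDA: 3.9M\n' > workspace/filings/q3-2024.txt
grep -r 'EBITDA' workspace/filings/   # returns the old figure

# Document changes: overwrite the file. No chunking, no embedding job.
printf 'EBITDA: 4.2M (restated)\n' > workspace/filings/q3-2024.txt
grep -r 'EBITDA' workspace/filings/   # next grep sees the new text
```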

How do I migrate from a vector database to TroveFiles?

Most migrations are partial: keep the vector DB for true semantic queries, move the keyword and structured-data retrieval onto TroveFiles. Upload the source docs, point the agent's bash tool at TroveFiles, and start removing custom retrieval code. Teams typically find 60-80% of their queries collapse into grep/awk/jq.

Who's running TroveFiles in production?

TroveFiles is the storage layer behind Silvia, our AI CFO with over $30 billion in connected assets. Every Silvia user has a TroveFiles namespace where the agent stores memories, skills, and preferences and retrieves them via shell commands across sessions.

Try retrieval that doesn't need embeddings.

Upload a file, run grep, see the answer. API key in 30 seconds.