# Semantic Indexing & Agentic RAG Infrastructure — 606M vectors, 447 datasets, 12 domains
The deep reasoning model is the last stage, not the first. An orchestrator (Qwen3-Coder-30B-A3B) decides what to retrieve from 447 datasets across 12 domains using tool-calling over FAISS indexes, rerankers, and classifiers; deep reasoning (DeepSeek R1) then runs once over the curated evidence.
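The two-stage flow can be sketched as a tool registry the orchestrator calls into, with a single reasoning pass at the end. This is an illustrative stub, not the repo's code: the tool names, routing plan, and both model calls are hypothetical stand-ins.

```python
# Hypothetical sketch: an orchestrator executes a tool-call plan over
# retrieval tools, then one deep-reasoning pass runs over the evidence.
from typing import Callable

TOOLS: dict[str, Callable[[str], list[str]]] = {}

def tool(name: str):
    """Register a retrieval tool under a name the orchestrator can call."""
    def wrap(fn):
        TOOLS[name] = fn
        return fn
    return wrap

@tool("classifier_route")
def classifier_route(query: str) -> list[str]:
    # Stand-in for a domain classifier narrowing to one of the 12 domains.
    return [f"domain guess for '{query}'"]

@tool("faiss_search")
def faiss_search(query: str) -> list[str]:
    # Stand-in for an exact FAISS lookup over a per-domain index.
    return [f"chunk matching '{query}' from FAISS"]

def orchestrate(query: str, plan: list[str]) -> list[str]:
    """Execute the orchestrator's tool-call plan and collect evidence."""
    evidence: list[str] = []
    for tool_name in plan:
        evidence.extend(TOOLS[tool_name](query))
    return evidence

def deep_reason(query: str, evidence: list[str]) -> str:
    # Stand-in for the single DeepSeek R1 pass over curated evidence.
    return f"answer to '{query}' grounded in {len(evidence)} evidence chunks"

evidence = orchestrate("how to build a FAISS index",
                       ["classifier_route", "faiss_search"])
print(deep_reason("how to build a FAISS index", evidence))
```

The key design point survives the stubs: retrieval tools run as many times as the plan requires, while the expensive reasoning model runs exactly once.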
Output from the mini-index demo (20 documents, runs in under 60 seconds):

```
Query: "machine learning and neural networks"
Top result: Neural Networks and Deep Learning (score: 0.878)

Query: "semantic search and vector retrieval"
Top result: Semantic Search and Dense Retrieval (score: 0.879)

Query: "how to build a FAISS index"
Top result: FAISS: Fast Similarity Search at Scale (score: 0.844)
```
Scores above ~0.83 indicate a strong semantic match. Scores are cosine similarity, computed as the inner product of L2-normalized embeddings.
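Why an inner-product index yields cosine scores: once vectors are L2-normalized, their inner product equals their cosine similarity. A minimal numpy check (no FAISS required; the 384 dimension is just an example):

```python
# After L2 normalization, inner product == cosine similarity.
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=384), rng.normal(size=384)

# Cosine similarity computed directly.
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Inner product of L2-normalized vectors — what IndexFlatIP returns
# when the stored and query vectors are normalized first.
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
inner = a_n @ b_n

assert np.isclose(cosine, inner)
print(round(float(inner), 3))
```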
```shell
git clone https://github.com/whmatrix/semantic-indexing-batch-02
cd semantic-indexing-batch-02/mini-index
pip install sentence-transformers faiss-cpu
python demo_query.py
```
| Deliverable | Format | Guarantee |
|---|---|---|
| Vector index | FAISS IndexFlatIP (exact cosine via L2-normalized inner product) | Deterministic, byte-reproducible |
| Chunk corpus | JSONL with metadata | len(vectors) == len(chunks) == len(metadata) |
| Audit summary | JSON manifest | Pass/fail quality gates per Universal Protocol v4.23 |
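The alignment guarantee in the table is the load-bearing one: FAISS returns row indices, so row i of the index must map back to chunk i and metadata record i. A numpy-only sketch of that invariant plus the exact inner-product search IndexFlatIP performs (chunk texts, metadata fields, and the tiny 8-dim vectors here are all hypothetical):

```python
# Sketch of the alignment guarantee and exact inner-product search.
import numpy as np

chunks = ["Neural nets intro", "FAISS indexing guide", "Dense retrieval"]
metadata = [{"doc_id": i, "source": f"doc_{i}.txt"} for i in range(len(chunks))]

rng = np.random.default_rng(42)
vectors = rng.normal(size=(len(chunks), 8)).astype("float32")
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # L2-normalize rows

# The audit gate: any mismatch breaks the row -> chunk -> metadata mapping.
assert len(vectors) == len(chunks) == len(metadata)

def search(query_vec: np.ndarray, k: int = 2):
    """Exact inner-product search (what IndexFlatIP computes), brute force."""
    scores = vectors @ query_vec          # inner product = cosine here
    top = np.argsort(-scores)[:k]         # highest scores first
    return [(chunks[i], float(scores[i]), metadata[i]) for i in top]

hits = search(vectors[1])  # query with chunk 1's own vector
print(hits[0][0])          # top hit is that chunk itself, score 1.0
```

Brute-force search over normalized vectors is exactly what `IndexFlatIP` does, which is why the results are deterministic: no clustering, no quantization, no approximation.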
What this is not:
- No human-judged relevance labels.
- No MRR/MAP/NDCG claims.
- Scores are cosine similarity (vector alignment), not precision or recall.
- Domain suitability requires independent evaluation.
Reproduce it: `git clone` the repo, `cd mini-index`, then `python demo_query.py` — see mini-index
- Need agentic RAG at scale? Start with `agentic-retrieval-system` — tool-calling orchestration over 606M vectors, 447 datasets
- Need production-scale indexing? See `semantic-indexing-batch-02` — 8.35M vectors, parallel split-merge, checkpointing
- Have research documents to search? Start with `research-corpus-discovery` — 4,600+ docs across 10 institutions, runnable demo