Section 3 — RAG Pipelines in .NET
Summary (what this page covers): Retrieval-Augmented Generation in C#. Ingest and chunk documents, generate embeddings, store and query vectors in Qdrant, retrieve by cosine similarity, and inject that context into the prompt so Claude answers from real data. The recurring theme: RAG quality is determined by chunking, not the LLM. Pair with Lab 3.
1:00 – 2:15 PM · 75 min — 35 min lecture/demo + 40 min lab
Learning objectives
- Explain RAG architecture and how it differs from fine-tuning
- Implement a document ingestion pipeline with chunking strategies in C#
- Generate text embeddings (OpenAI API — or a free/local model via Ollama)
- Store and query vectors using Qdrant in Docker
- Build retrieval with cosine similarity scoring
- Construct augmented prompts that inject retrieved context
- Evaluate and iterate on RAG response quality
Content
Block 3A — RAG architecture & ingestion (≈25 min)
RAG vs fine-tuning. Fine-tuning bakes knowledge into the model's weights: expensive, slow to update, and overkill for "answer from our data." RAG keeps your data outside the model and retrieves the relevant pieces at query time, injecting them into the prompt. New book added? It's searchable as soon as it has been chunked, embedded, and upserted, with no retraining.
The pipeline: ingest → chunk → embed → store → retrieve → augment → generate. You build each stage in C#; the LLM only shows up at the end.
Chunking strategies — where quality is won or lost. Size, overlap, and respecting semantic boundaries (don't split mid-sentence) determine what the retriever can find. This is the lever: the same model gives great or useless answers depending on how you chunked. Too large and retrieval is imprecise; too small and you lose context. Tune it.
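The size/overlap/boundary trade-off above can be sketched as a small sentence-aware chunker. This is a minimal illustration, not the lab's implementation: the `Chunker` class, the `ChunkBySentence` name, and the regex-based sentence splitter are all assumptions made for this example. The character budget is a target rather than a hard cap, since a single long sentence (or overlap plus one sentence) can exceed it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; names are not from the course code.
public static class Chunker
{
    // Groups whole sentences into chunks of roughly maxChars characters and
    // repeats the last overlapSentences sentences at the start of the next chunk.
    public static List<string> ChunkBySentence(string text, int maxChars, int overlapSentences)
    {
        if (maxChars <= 0) throw new ArgumentOutOfRangeException(nameof(maxChars));
        if (overlapSentences < 0) throw new ArgumentOutOfRangeException(nameof(overlapSentences));

        // Naive sentence splitter (assumption): break on whitespace that follows '.', '!' or '?'.
        var sentences = Regex.Split(text.Trim(), @"(?<=[.!?])\s+")
                             .Where(s => s.Length > 0)
                             .ToList();

        var chunks = new List<string>();
        var current = new List<string>();
        var currentLength = 0; // length of string.Join(" ", current)

        foreach (var sentence in sentences)
        {
            // Close the current chunk if this sentence would push it over the budget.
            if (current.Count > 0 && currentLength + 1 + sentence.Length > maxChars)
            {
                chunks.Add(string.Join(" ", current));

                // Carry the tail sentences forward so neighbouring chunks share context.
                current = current.Skip(Math.Max(0, current.Count - overlapSentences)).ToList();
                currentLength = current.Sum(s => s.Length) + Math.Max(0, current.Count - 1);
            }

            current.Add(sentence);
            currentLength += (current.Count > 1 ? 1 : 0) + sentence.Length;
        }

        if (current.Count > 0) chunks.Add(string.Join(" ", current));
        return chunks;
    }
}
```

Calling `Chunker.ChunkBySentence(text, 800, 1)` would give chunks of roughly 800 characters with one sentence of overlap; those two numbers are the knobs to tune against your own retrieval results.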
Block 3B — Embeddings & vector store (≈25 min)
Embeddings. Text goes in, a vector comes out; similar meanings land near each other. Hide the provider behind an EmbeddingService interface so the rest of the pipeline doesn't care where vectors come from. The default uses the OpenAI embedding API.
Free-first path: swap in Ollama + nomic-embed-text (768-dim) behind the same interface; no OpenAI key required. Whatever model you pick, set the Qdrant collection's vector size to match the model's dimensions (768 for nomic-embed-text), or upserts fail.
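One way the provider abstraction and the dimension check might look is sketched below. Everything here is an assumption for illustration: the interface is named `IEmbeddingService` (following C# convention for the EmbeddingService interface mentioned above), and `FakeEmbeddingService` is a deterministic stand-in that buckets characters, not a real embedding model. A real implementation would call OpenAI or Ollama behind the same interface.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical abstraction: the rest of the pipeline depends only on this.
public interface IEmbeddingService
{
    int Dimensions { get; }
    Task<float[]> EmbedAsync(string text);
}

// Stand-in for tests and offline work only. It counts characters into buckets,
// so it captures no meaning; swap in a real provider for actual retrieval.
public sealed class FakeEmbeddingService : IEmbeddingService
{
    public FakeEmbeddingService(int dimensions)
    {
        if (dimensions <= 0) throw new ArgumentOutOfRangeException(nameof(dimensions));
        Dimensions = dimensions;
    }

    public int Dimensions { get; }

    public Task<float[]> EmbedAsync(string text)
    {
        var vector = new float[Dimensions];
        foreach (var ch in text.ToLowerInvariant())
        {
            vector[ch % Dimensions] += 1f;
        }
        return Task.FromResult(vector);
    }
}

public static class EmbeddingGuards
{
    // Fail fast at startup instead of at the first upsert.
    public static void EnsureMatchesCollection(IEmbeddingService embedder, int collectionVectorSize)
    {
        if (embedder.Dimensions != collectionVectorSize)
        {
            throw new InvalidOperationException(
                $"Embedding model produces {embedder.Dimensions}-dim vectors, " +
                $"but the collection expects {collectionVectorSize}.");
        }
    }
}
```

Running the guard once when the app starts surfaces a model/collection mismatch before any documents are ingested.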
Qdrant in Docker. Start a local instance with the command below, then create a collection, upsert your book vectors, and query top-K by cosine similarity:
docker run -p 6333:6333 qdrant/qdrant
If Qdrant won't start, an in-memory vector store is a fine fallback for the lab — same interface, no Docker.
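As a rough idea of what such an in-memory fallback could look like, here is a brute-force store that scores every stored vector by cosine similarity and returns the top K. The `InMemoryVectorStore` and `ScoredChunk` names and the method shapes are assumptions for this sketch, not the lab's actual interface; it also assumes C# 9 or later for records and target-typed `new`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical result type for this sketch.
public sealed record ScoredChunk(string Text, double Score);

public sealed class InMemoryVectorStore
{
    private readonly int _dimensions;
    private readonly Dictionary<string, (float[] Vector, string Text)> _points = new();

    public InMemoryVectorStore(int dimensions) => _dimensions = dimensions;

    // Insert or overwrite the point with this id, rejecting wrong-sized vectors.
    public void Upsert(string id, float[] vector, string text)
    {
        if (vector.Length != _dimensions)
            throw new ArgumentException($"Expected {_dimensions} dimensions, got {vector.Length}.");
        _points[id] = (vector, text);
    }

    // Brute-force search: score every point, keep the topK highest.
    public IReadOnlyList<ScoredChunk> Search(float[] query, int topK)
    {
        return _points.Values
            .Select(p => new ScoredChunk(p.Text, CosineSimilarity(query, p.Vector)))
            .OrderByDescending(s => s.Score)
            .Take(topK)
            .ToList();
    }

    // cos(a, b) = (a . b) / (|a| * |b|)
    public static double CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length) throw new ArgumentException("Vectors must have the same length.");

        double dot = 0, normA = 0, normB = 0;
        for (var i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }

        // A zero vector has no direction; this sketch treats its similarity as 0.
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

A linear scan like this is fine for a lab-sized corpus; a vector database earns its keep once the collection is too large to score exhaustively.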
Block 3C — Retrieval & augmentation (≈25 min)
At query time: embed the user's query → retrieve the top-K most similar chunks → build an augmented prompt that injects those chunks as context → let Claude answer grounded in real data. Reuse prompt caching from Section 1 for the static portion of the prompt (instructions + retrieved context that repeats across a session) to cut cost.
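The augmentation step can be sketched as plain string assembly. The `PromptBuilder` class, the instruction wording, and the `<context>` delimiters are assumptions chosen for this example rather than anything prescribed by the course. The static instructions come first so that any repeated prefix is a candidate for prompt caching.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration; names and wording are not from the course code.
public static class PromptBuilder
{
    public static string BuildAugmentedPrompt(string question, IReadOnlyList<string> chunks)
    {
        var sb = new StringBuilder();

        // Static instructions first: this part is identical for every query.
        sb.AppendLine("Answer using only the context below. If the context does not contain the answer, say so.");
        sb.AppendLine();

        // Retrieved chunks, numbered so the answer can point back to a source.
        sb.AppendLine("<context>");
        for (var i = 0; i < chunks.Count; i++)
        {
            sb.AppendLine($"[{i + 1}] {chunks[i]}");
        }
        sb.AppendLine("</context>");
        sb.AppendLine();

        // The user's question goes last, after all grounding material.
        sb.Append("Question: ").Append(question);
        return sb.ToString();
    }
}
```

Telling the model to admit when the context lacks the answer is one simple way to make ungrounded responses visible during evaluation.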
Demos referenced here
- Ingestion + embedding run (watch vectors land in Qdrant) · A grounded /api/recommend answer vs an ungrounded one for the same query; the difference is the whole point. [Scripts in _instructor/.]
→ Continue to Lab 3.