# Section 3 — RAG Pipelines in .NET

> **Summary — what this page covers**
> Retrieval-Augmented Generation in C#: ingest and chunk documents, generate embeddings, store
> and query vectors in Qdrant, retrieve by cosine similarity, and inject that context into the
> prompt so Claude answers from real data. The recurring theme: **RAG quality is determined by
> chunking, not the LLM.** Pair with **Lab 3**.

**1:00 – 2:15 PM · 75 min** — 35 min lecture/demo + 40 min lab

## Learning objectives

- Explain RAG architecture and how it differs from fine-tuning
- Implement a document ingestion pipeline with **chunking strategies** in C#
- Generate text **embeddings** (OpenAI API — or a **free/local** model via Ollama)
- Store and query vectors using **Qdrant** in Docker
- Build retrieval with **cosine similarity** scoring
- Construct **augmented prompts** that inject retrieved context
- Evaluate and iterate on RAG response quality

## Content

### Block 3A — RAG architecture & ingestion (≈25 min)

**RAG vs fine-tuning.** Fine-tuning bakes knowledge into the model's weights — expensive, slow to
update, and overkill for "answer from *our* data." **RAG** keeps your data outside the model and
*retrieves* the relevant pieces at query time, injecting them into the prompt. New book added? It's
searchable as soon as it has been chunked, embedded, and upserted, with no retraining.

**The pipeline:** ingest → chunk → embed → store → retrieve → augment → generate. You build each
stage in C#; the LLM only shows up at the end.

**Chunking strategies — where quality is won or lost.** Size, overlap, and respecting semantic
boundaries (don't split mid-sentence) determine what the retriever can find. **This is the lever**:
the same model gives great or useless answers depending on how you chunked. Chunks that are too
large make retrieval imprecise; chunks that are too small lose the surrounding context. Tune size
and overlap against real queries.
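
To make the size/overlap/boundary trade-off concrete, here is a minimal sketch of a sentence-aware chunker. The `SentenceChunker` class, its parameter names, and the defaults are hypothetical (not taken from the lab code), and the regex-based sentence split is deliberately naive. `maxChars` is a soft limit: a single long sentence, or carried overlap plus one new sentence, may exceed it.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; names and defaults are assumptions.
public static class SentenceChunker
{
    // Packs whole sentences into chunks of roughly maxChars characters and
    // repeats the last overlapSentences sentences at the start of the next chunk.
    public static List<string> Chunk(string text, int maxChars = 800, int overlapSentences = 1)
    {
        if (maxChars <= 0) throw new ArgumentOutOfRangeException(nameof(maxChars));
        if (overlapSentences < 0) throw new ArgumentOutOfRangeException(nameof(overlapSentences));

        // Naive sentence split: whitespace that follows '.', '!' or '?'.
        string[] sentences = Regex.Split(text.Trim(), @"(?<=[.!?])\s+");

        var chunks = new List<string>();
        var current = new List<string>();
        int fresh = 0; // sentences in 'current' that are not carried-over overlap

        foreach (string sentence in sentences)
        {
            if (sentence.Length == 0) continue;

            int projected = string.Join(" ", current).Length
                            + (current.Count > 0 ? 1 : 0)
                            + sentence.Length;

            // Close the chunk only if it already holds at least one new sentence,
            // so every emitted chunk makes progress through the text.
            if (fresh > 0 && projected > maxChars)
            {
                chunks.Add(string.Join(" ", current));
                int keep = Math.Min(overlapSentences, current.Count);
                current = current.GetRange(current.Count - keep, keep);
                fresh = 0;
            }

            current.Add(sentence);
            fresh++;
        }

        if (fresh > 0) chunks.Add(string.Join(" ", current));
        return chunks;
    }
}
```

Calling `SentenceChunker.Chunk(bookText, maxChars: 800, overlapSentences: 1)` and then varying those two numbers is a cheap way to see how chunking changes what the retriever returns.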

### Block 3B — Embeddings & vector store (≈25 min)

**Embeddings.** Text goes in, a vector comes out; similar meanings land near each other. Hide the
provider behind an **`EmbeddingService` interface** so the rest of the pipeline doesn't care where
vectors come from. The default implementation calls the OpenAI embeddings API.
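
One possible shape for that abstraction is sketched below. The interface is named `IEmbeddingService` here, following the C# naming convention; the course code may name or shape it differently. `HashingEmbeddingService` is a toy stand-in invented for this example: it hashes words into a fixed-size count vector so the pipeline can be exercised offline, and it does not capture meaning the way a real embedding model does.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Assumed interface shape; the lab's actual abstraction may differ.
public interface IEmbeddingService
{
    // Vector length produced by the model; the vector store must be created with this size.
    int Dimensions { get; }

    Task<float[]> EmbedAsync(string text, CancellationToken ct = default);
}

// Toy implementation for offline testing only (NOT a semantic embedding).
public sealed class HashingEmbeddingService : IEmbeddingService
{
    public HashingEmbeddingService(int dimensions = 64)
    {
        if (dimensions <= 0) throw new ArgumentOutOfRangeException(nameof(dimensions));
        Dimensions = dimensions;
    }

    public int Dimensions { get; }

    public Task<float[]> EmbedAsync(string text, CancellationToken ct = default)
    {
        var vector = new float[Dimensions];
        string[] words = text.ToLowerInvariant()
            .Split(new[] { ' ', '\t', '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries);

        // Each word increments one bucket chosen by a stable hash.
        foreach (string word in words)
        {
            int bucket = (int)(StableHash(word) % (uint)Dimensions);
            vector[bucket] += 1f;
        }
        return Task.FromResult(vector);
    }

    // FNV-1a: stable across runs, unlike string.GetHashCode, which is randomized per process.
    private static uint StableHash(string s)
    {
        unchecked
        {
            uint hash = 2166136261;
            foreach (char c in s)
            {
                hash ^= c;
                hash *= 16777619;
            }
            return hash;
        }
    }
}
```

Exposing `Dimensions` on the interface lets the code that creates the vector collection read the size from the service instead of hard-coding it.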

> **Free-first path:** swap in **Ollama + `nomic-embed-text` (768-dim)** behind the same interface —
> no OpenAI key required. Whatever model you pick, **set the Qdrant collection's vector size to
> match the model's dimensions** (768 for `nomic-embed-text`), or upserts fail.

**Qdrant in Docker.** Start Qdrant locally with the command below, then create a collection, upsert
your book vectors, and query **top-K by cosine similarity**:

```bash
# 6333 = REST API; 6334 = gRPC (used by the official Qdrant .NET client)
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant
```
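
The three operations can be sketched against Qdrant's REST API on port 6333 using plain `HttpClient`. This is a sketch, not the lab's implementation: the `QdrantRest` class and the `books` collection name are hypothetical, and the request shapes (`PUT /collections/{name}`, `PUT /collections/{name}/points`, `POST /collections/{name}/points/search`) should be checked against the REST reference for your Qdrant version.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

// Hypothetical helper; builds JSON request bodies for Qdrant's REST API.
public static class QdrantRest
{
    // Body for PUT /collections/{name}: vector size must match the embedding model.
    public static string CreateCollectionBody(int vectorSize) =>
        JsonSerializer.Serialize(new { vectors = new { size = vectorSize, distance = "Cosine" } });

    // Body for PUT /collections/{name}/points: one point with its chunk text as payload.
    public static string UpsertBody(ulong id, float[] vector, string text) =>
        JsonSerializer.Serialize(new { points = new[] { new { id, vector, payload = new { text } } } });

    // Body for POST /collections/{name}/points/search: top-K nearest neighbours.
    public static string SearchBody(float[] queryVector, int topK) =>
        JsonSerializer.Serialize(new { vector = queryVector, limit = topK, with_payload = true });

    // Sends a JSON request and returns the raw response body; throws on non-2xx status.
    public static async Task<string> SendAsync(HttpClient http, HttpMethod method, string path, string json)
    {
        using var request = new HttpRequestMessage(method, path)
        {
            Content = new StringContent(json, Encoding.UTF8, "application/json")
        };
        using HttpResponseMessage response = await http.SendAsync(request);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
}

// Usage (requires a running Qdrant; "books" is an assumed collection name):
//   var http = new HttpClient { BaseAddress = new Uri("http://localhost:6333") };
//   await QdrantRest.SendAsync(http, HttpMethod.Put, "/collections/books",
//       QdrantRest.CreateCollectionBody(768));
```

Keeping body construction separate from sending makes the payloads easy to inspect and unit-test without a running container.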

If Qdrant won't start, an **in-memory vector store** is a fine fallback for the lab — same interface,
no Docker.
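
A minimal sketch of such a fallback follows, with cosine similarity computed by hand. The `InMemoryVectorStore` and `ScoredChunk` names are hypothetical, and the linear scan is only reasonable for lab-sized data. The dimension check mirrors the failure you would see from a vector database when vector sizes do not match.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types for illustration; the lab's fallback store may look different.
public sealed record ScoredChunk(string Text, double Score);

public sealed class InMemoryVectorStore
{
    private readonly int _dimensions;
    private readonly Dictionary<string, (float[] Vector, string Text)> _items = new();

    public InMemoryVectorStore(int dimensions) => _dimensions = dimensions;

    // Inserts a new point or overwrites the existing one with the same id.
    public void Upsert(string id, float[] vector, string text)
    {
        if (vector.Length != _dimensions)
            throw new ArgumentException($"Expected {_dimensions} dimensions, got {vector.Length}.");
        _items[id] = (vector, text);
    }

    // Brute-force top-K: score every stored vector, highest similarity first.
    public IReadOnlyList<ScoredChunk> Search(float[] query, int topK) =>
        _items.Values
              .Select(item => new ScoredChunk(item.Text, CosineSimilarity(query, item.Vector)))
              .OrderByDescending(s => s.Score)
              .Take(topK)
              .ToList();

    // cos(a, b) = (a . b) / (|a| * |b|); 1 = same direction, 0 = orthogonal.
    public static double CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length) throw new ArgumentException("Vectors must have the same length.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            normA += (double)a[i] * a[i];
            normB += (double)b[i] * b[i];
        }

        // A zero vector has no direction; treat it as dissimilar to everything.
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```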

### Block 3C — Retrieval & augmentation (≈25 min)

At query time: embed the user's query → retrieve the **top-K** most similar chunks → build an
**augmented prompt** that injects those chunks as context → let Claude answer *grounded in real
data*. Reuse **prompt caching from [Section 1](03-section-1-sdk.md)** for the static portion of the
prompt (instructions + retrieved context that repeats across a session) to cut cost.
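
The augmentation step is mostly string assembly. The sketch below shows one way to do it; the `PromptBuilder` class, the instruction wording, and the `<context>` delimiters are illustrative assumptions, not the lab's actual prompt. Sending the result to Claude is left to the SDK code from Section 1.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper; the prompt wording and layout are assumptions.
public static class PromptBuilder
{
    // Builds a single prompt: fixed instructions, numbered context chunks, then the question.
    public static string BuildAugmentedPrompt(string question, IReadOnlyList<string> chunks)
    {
        var sb = new StringBuilder();

        // Static instructions come first so they stay identical across requests.
        sb.AppendLine("Answer the question using only the context below.");
        sb.AppendLine("If the context does not contain the answer, say so.");
        sb.AppendLine();

        // Number each retrieved chunk so the answer can point back to its source.
        sb.AppendLine("<context>");
        for (int i = 0; i < chunks.Count; i++)
        {
            sb.AppendLine($"[{i + 1}] {chunks[i]}");
        }
        sb.AppendLine("</context>");
        sb.AppendLine();

        // The user's question goes last, after all context.
        sb.Append("Question: ").Append(question);
        return sb.ToString();
    }
}
```

Placing the unchanging instructions at the top and the question at the bottom keeps the stable part of the prompt as a prefix, which is the part caching can reuse.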

## Demos referenced here

- **Ingestion + embedding run**: watch vectors land in Qdrant.
- **A grounded `/api/recommend` answer vs an ungrounded one** for the same query: the difference is
  the whole point.

Scripts are in `_instructor/`.

→ Continue to [**Lab 3**](08-lab-3-rag.md).