C7 — RAG Recommendations (`POST /api/recommend`)

Summary: what this page covers

Give BookTracker a retrieval-augmented generation (RAG) pipeline. You'll add a new BookTracker.VectorStore project (embeddings, chunking, a vector store), introduce Book.Description as corpus text, ingest books and reviews as vectors, and build a grounded POST /api/recommend endpoint that answers from your data instead of the model's memory.

Time: ~60 minutes · Format: hands-on, solo · You start from: checkpoint/c6-streaming-agent · You end at: checkpoint/c7-rag

C7 is Day 2, Lab 3. The thesis of this lab: retrieval quality, not the model, is what makes answers good. You build the seam where embeddings, chunking, and vector search plug in — then watch a grounded answer beat an ungrounded one for the same question.

The as-built solution defaults to an in-memory vector store, so the app runs without Docker. Qdrant is the "real" vector database you'll wire up, and you can switch to it with one config value (VectorStore:Provider). On either path, embeddings come from local Ollama by default. You still need your Anthropic API key (from C5/C6) for the final generation step.

1. Prerequisites

You already have the .NET 10 SDK and your Anthropic API key wired in from C5/C6. To see real retrieval, also install the tools below. Docker is optional (the in-memory store is the Docker-free default); Ollama is what produces the embeddings at ingestion time.

Tool	Why	Check
Docker	runs Qdrant (the vector DB)	`docker --version`
Ollama + `nomic-embed-text`	local, free embeddings (768-dim)	`ollama --version`

# pull the embedding model and start Ollama (free embeddings on :11434)
ollama pull nomic-embed-text
ollama serve

Everything below builds and starts with no Docker and no Ollama. Without Docker, the in-memory store stands in for Qdrant. Without Ollama, ingestion is skipped (the app logs a warning and still starts), so /api/recommend has nothing to retrieve. Run Ollama to see real retrieval.

2. Start from the C6 checkpoint

Each lab starts from the previous checkpoint; the matching tag (checkpoint/c7-rag) is the answer key.

# from the repo root, branch from the C6 checkpoint
git switch -c my-c7 checkpoint/c6-streaming-agent

# everything below runs inside the solution folder
cd src/BookTracker

3. Create the `BookTracker.VectorStore` project

This new project holds the whole retrieval pipeline. It references Core only — it reads entities and seed text through Core's repository ports, so it never touches EF directly. The Anthropic SDK stays in Api; VectorStore is pure retrieval.

dotnet new classlib -n BookTracker.VectorStore -o BookTracker.VectorStore
dotnet sln BookTracker.sln add BookTracker.VectorStore/BookTracker.VectorStore.csproj
dotnet add BookTracker.VectorStore reference BookTracker.Core/BookTracker.Core.csproj

Add the packages (these exact versions are the as-built set):

dotnet add BookTracker.VectorStore package Qdrant.Client --version 1.18.1
dotnet add BookTracker.VectorStore package Microsoft.Extensions.AI --version 10.7.0
dotnet add BookTracker.VectorStore package OllamaSharp --version 5.4.25
dotnet add BookTracker.VectorStore package Microsoft.Extensions.Options --version 10.0.9

Then have Api reference VectorStore:

dotnet add BookTracker.Api reference BookTracker.VectorStore/BookTracker.VectorStore.csproj

4. Embeddings behind one seam

Wrap Microsoft.Extensions.AI's IEmbeddingGenerator<string, Embedding<float>> in a thin IEmbeddingService / EmbeddingService. The point of the seam: swapping OpenAI ↔ Ollama is just which generator you register — nothing downstream changes. The default is local Ollama.

using Microsoft.Extensions.AI;
using OllamaSharp;

// OllamaSharp implements IEmbeddingGenerator: this is the free local path
IEmbeddingGenerator<string, Embedding<float>> embedder =
    new OllamaApiClient(new Uri("http://localhost:11434"), "nomic-embed-text");

// text and ct come from the caller: the string to embed and its CancellationToken
var result = await embedder.GenerateAsync([text], cancellationToken: ct);
float[] vector = result[0].Vector.ToArray();   // 768-dim for nomic-embed-text

Bind these settings via VectorStoreOptions (IOptions<>): OllamaUrl, EmbeddingModel, VectorSize (768), Provider, ChunkSize, ChunkOverlap, TopK. Defaults run Docker-free.
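To make the settings list concrete, here is a minimal sketch of what the options class could look like. The property names come from the list above and the section name from the VectorStore:Provider key used elsewhere on this page. The TopK default of 5 and the Validate method are assumptions for illustration, not confirmed as-built values.

```csharp
using System;

// Sketch only. TopK = 5 and Validate() are illustrative assumptions.
public sealed class VectorStoreOptions
{
    public const string SectionName = "VectorStore";   // matches the VectorStore:Provider key

    public string Provider { get; set; } = "InMemory";              // or "Qdrant"
    public string OllamaUrl { get; set; } = "http://localhost:11434";
    public string EmbeddingModel { get; set; } = "nomic-embed-text";
    public int VectorSize { get; set; } = 768;                       // must match the model's dimensions
    public int ChunkSize { get; set; } = 500;                        // characters per window
    public int ChunkOverlap { get; set; } = 80;                      // characters shared between windows
    public int TopK { get; set; } = 5;                               // assumed default

    // Fail fast on settings that would break chunking or search.
    public void Validate()
    {
        if (VectorSize <= 0)
            throw new InvalidOperationException("VectorSize must be positive.");
        if (ChunkSize <= 0)
            throw new InvalidOperationException("ChunkSize must be positive.");
        if (ChunkOverlap < 0 || ChunkOverlap >= ChunkSize)
            throw new InvalidOperationException("ChunkOverlap must be in [0, ChunkSize).");
        if (TopK <= 0)
            throw new InvalidOperationException("TopK must be positive.");
    }
}
```

Calling Validate() once at startup, before ingestion, would surface a bad ChunkOverlap or TopK as a clear error rather than as odd retrieval results later.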

5. Chunk the text — the quality lever

Add ITextChunker / TextChunker: split text into size + overlap windows (the as-built defaults are ~500-char windows with ~80-char overlap), respecting sentence boundaries — don't split mid-sentence. This is where retrieval quality is won or lost; you'll tune it in Part C.
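One way to get sentence-aware windows is to split on sentence boundaries first and then pack whole sentences into chunks, carrying the last few sentences forward as overlap. The sketch below takes that approach. The Chunk signature and the regex-based sentence split are assumptions, not necessarily what the as-built TextChunker does.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Assumed interface shape; the as-built signature may differ.
public interface ITextChunker
{
    IReadOnlyList<string> Chunk(string text, int chunkSize = 500, int overlap = 80);
}

public sealed class TextChunker : ITextChunker
{
    // Rough sentence boundary: whitespace that follows '.', '!' or '?'.
    private static readonly Regex SentenceBreak = new(@"(?<=[.!?])\s+", RegexOptions.Compiled);

    public IReadOnlyList<string> Chunk(string text, int chunkSize = 500, int overlap = 80)
    {
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));
        if (overlap < 0 || overlap >= chunkSize) throw new ArgumentOutOfRangeException(nameof(overlap));

        var sentences = SentenceBreak.Split(text ?? string.Empty)
            .Select(s => s.Trim())
            .Where(s => s.Length > 0);

        var chunks = new List<string>();
        var window = new List<string>();   // sentences in the chunk being built
        var fresh = 0;                     // sentences in the window not yet emitted

        foreach (var sentence in sentences)
        {
            var grown = Length(window) + 1 + sentence.Length;
            if (fresh > 0 && grown > chunkSize)
            {
                // Window is full: emit it, then seed the next one with trailing overlap.
                chunks.Add(string.Join(" ", window));
                window = Tail(window, overlap);
                fresh = 0;
                grown = Length(window) + 1 + sentence.Length;
            }
            // Drop the carried overlap if it would push the new chunk past the limit.
            if (fresh == 0 && grown > chunkSize) window.Clear();

            window.Add(sentence);          // sentences are never split, so one long
            fresh++;                       // sentence can exceed chunkSize on its own
        }
        if (fresh > 0) chunks.Add(string.Join(" ", window));
        return chunks;
    }

    // Joined length of the parts, counting one space between neighbours.
    private static int Length(List<string> parts) =>
        parts.Count == 0 ? 0 : parts.Sum(p => p.Length) + parts.Count - 1;

    // Trailing whole sentences whose joined length fits within the overlap budget.
    private static List<string> Tail(List<string> window, int overlap)
    {
        var tail = new List<string>();
        for (var i = window.Count - 1; i >= 0; i--)
        {
            tail.Insert(0, window[i]);
            if (Length(tail) > overlap) { tail.RemoveAt(0); break; }
        }
        return tail;
    }
}
```

Because overlap is measured in whole sentences here, the actual overlap is at most the configured number of characters and can be zero when the last sentence is longer than the budget.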

6. Add the vector store (two implementations)

Define IVectorStore and back it with two implementations selected by VectorStore:Provider:

public interface IVectorStore
{
    Task EnsureCollectionAsync(int vectorSize, CancellationToken ct);
    Task UpsertAsync(IEnumerable<VectorRecord> records, CancellationToken ct);
    Task<IReadOnlyList<VectorHit>> SearchAsync(float[] query, int topK, CancellationToken ct);
    Task<long> CountAsync(CancellationToken ct);   // skip ingestion when already populated
}

InMemoryVectorStore — cosine similarity in memory. This is the default (Provider = "InMemory"), so the app runs with no Docker.
QdrantVectorStore — the real DB over gRPC:

using Qdrant.Client;
using Qdrant.Client.Grpc;   // VectorParams, Distance

var qdrant = new QdrantClient("localhost", 6334);   // gRPC, NOT the REST port 6333
await qdrant.CreateCollectionAsync("books",
    new VectorParams { Size = 768, Distance = Distance.Cosine });  // Size = model dims

Vector size must equal the model's dimensions or upserts fail: nomic-embed-text = 768, OpenAI text-embedding-3-small = 1536.
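For the in-memory provider, a brute-force cosine scan over a dictionary is enough at this corpus size. The sketch below implements the IVectorStore interface shown above. The VectorRecord and VectorHit shapes are hypothetical (the page does not define them), chosen to carry the bookId, title, and chunk text payload.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical record shapes, assumed for this sketch.
public sealed record VectorRecord(string Id, float[] Vector, int BookId, string Title, string Text);
public sealed record VectorHit(VectorRecord Record, float Score);

public interface IVectorStore
{
    Task EnsureCollectionAsync(int vectorSize, CancellationToken ct);
    Task UpsertAsync(IEnumerable<VectorRecord> records, CancellationToken ct);
    Task<IReadOnlyList<VectorHit>> SearchAsync(float[] query, int topK, CancellationToken ct);
    Task<long> CountAsync(CancellationToken ct);
}

public sealed class InMemoryVectorStore : IVectorStore
{
    private readonly ConcurrentDictionary<string, VectorRecord> _records = new();
    private int _vectorSize;

    public Task EnsureCollectionAsync(int vectorSize, CancellationToken ct)
    {
        if (vectorSize <= 0) throw new ArgumentOutOfRangeException(nameof(vectorSize));
        _vectorSize = vectorSize;
        return Task.CompletedTask;
    }

    public Task UpsertAsync(IEnumerable<VectorRecord> records, CancellationToken ct)
    {
        foreach (var record in records)
        {
            // Mirror the real DB: reject vectors whose size does not match the collection.
            if (record.Vector.Length != _vectorSize)
                throw new ArgumentException($"Expected {_vectorSize} dims, got {record.Vector.Length}.");
            _records[record.Id] = record;   // same Id overwrites, which makes this an upsert
        }
        return Task.CompletedTask;
    }

    public Task<IReadOnlyList<VectorHit>> SearchAsync(float[] query, int topK, CancellationToken ct)
    {
        IReadOnlyList<VectorHit> hits = _records.Values
            .Select(r => new VectorHit(r, Cosine(query, r.Vector)))
            .OrderByDescending(h => h.Score)
            .Take(topK)
            .ToList();
        return Task.FromResult(hits);
    }

    public Task<long> CountAsync(CancellationToken ct) => Task.FromResult((long)_records.Count);

    // cos(a, b) = (a . b) / (|a| * |b|); returns 0 for a zero-length vector.
    private static float Cosine(float[] a, float[] b)
    {
        if (a.Length != b.Length) throw new ArgumentException("Vector sizes differ.");
        double dot = 0, normA = 0, normB = 0;
        for (var i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0f;
        return (float)(dot / (Math.Sqrt(normA) * Math.Sqrt(normB)));
    }
}
```

The scan is O(n) per query, which is fine for a seeded demo corpus. An approximate index, such as the one Qdrant provides, is what you would reach for at larger scale.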

7. Add the corpus: `Book.Description`

The corpus you retrieve over = book Title + Genre + Description + each review's Body. Book.Description is introduced here. Add the entity field, seed a blurb on each seeded book, and create the migration:

dotnet ef migrations add AddBookDescription --project BookTracker.Data --startup-project BookTracker.Api

Keep Description entity-only — do not expose it in BookDto. It's source text for retrieval, not part of the public API shape.
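The entity-only rule can be pictured as a mapping that simply never copies the field. The shapes below are hypothetical and trimmed to the fields this page mentions; the real Book and BookDto in the repo carry more.

```csharp
using System;

// Hypothetical, trimmed shapes for illustration only.
public sealed class Book
{
    public int Id { get; set; }
    public string Title { get; set; } = "";
    public string Genre { get; set; } = "";
    public string Description { get; set; } = "";   // retrieval source text only
}

// The public API shape: no Description property exists to serialize.
public sealed record BookDto(int Id, string Title, string Genre);

public static class BookMapping
{
    // Description is deliberately not mapped.
    public static BookDto ToDto(this Book book) => new(book.Id, book.Title, book.Genre);
}
```

Leaving the property off the DTO type, rather than hiding it with a serializer attribute, means the field cannot leak through a later mapping change without a visible edit to the DTO.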

8. Ingest the corpus

Add CorpusIngestionService: read books (with Description) and reviews via the Core repository ports (IBookRepository, IReviewRepository) — not the DbContext — then chunk → embed → upsert each chunk with payload (bookId, title, the chunk text).

Run it best-effort at startup: it's idempotent (skip when CountAsync shows the store is already populated), and if Ollama/the embedding provider is unreachable it logs a warning and continues so the app still starts from a clone.
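The idempotent, best-effort behaviour described above can be sketched as follows. To keep the example self-contained, the store, chunker, and embedder are passed in as delegates; the real CorpusIngestionService would take IVectorStore, ITextChunker, IEmbeddingService, and the repository ports instead. All type and member names here are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical shapes: one document per book description or review body.
public sealed record CorpusDoc(string SourceId, int BookId, string Title, string Text);
public sealed record IngestChunk(string Id, int BookId, string Title, string Text, float[] Vector);

public sealed class CorpusIngestionSketch
{
    private readonly Func<CancellationToken, Task<long>> _count;
    private readonly Func<IReadOnlyList<IngestChunk>, CancellationToken, Task> _upsert;
    private readonly Func<string, IReadOnlyList<string>> _chunk;
    private readonly Func<string, CancellationToken, Task<float[]>> _embed;
    private readonly Action<string> _warn;

    public CorpusIngestionSketch(
        Func<CancellationToken, Task<long>> count,
        Func<IReadOnlyList<IngestChunk>, CancellationToken, Task> upsert,
        Func<string, IReadOnlyList<string>> chunk,
        Func<string, CancellationToken, Task<float[]>> embed,
        Action<string> warn)
    {
        _count = count; _upsert = upsert; _chunk = chunk; _embed = embed; _warn = warn;
    }

    // Returns the number of chunks written (0 when skipped or when embedding failed).
    public async Task<int> IngestAsync(IEnumerable<CorpusDoc> docs, CancellationToken ct)
    {
        // Idempotent: a populated store means a previous run already ingested.
        if (await _count(ct) > 0) return 0;

        try
        {
            var batch = new List<IngestChunk>();
            foreach (var doc in docs)
            {
                var pieces = _chunk(doc.Text);
                for (var i = 0; i < pieces.Count; i++)
                {
                    var vector = await _embed(pieces[i], ct);
                    // String keys suit the in-memory store; Qdrant point IDs must be
                    // unsigned integers or UUIDs, so a real store would map these.
                    batch.Add(new IngestChunk($"{doc.SourceId}:{i}", doc.BookId, doc.Title, pieces[i], vector));
                }
            }
            // Upsert once at the end so a mid-run failure leaves the store empty
            // and the next startup retries the whole corpus.
            await _upsert(batch, ct);
            return batch.Count;
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            // Best-effort: log and let the app start without retrieval data.
            _warn($"Ingestion skipped: {ex.Message}");
            return 0;
        }
    }
}
```

Embedding everything before the single upsert trades memory for a simple retry story, which seems reasonable for a seed-sized corpus.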

9. Build the RAG endpoint

In Api, add RagService (retrieve → augment → generate) and RecommendEndpoints:

1. embed the user query                         (IEmbeddingService)
2. top-K cosine search                          (IVectorStore.SearchAsync)
3. build an augmented prompt:
     system = instructions + retrieved chunks   ← cache this block (reuse C5 caching)
     user   = the query
4. client.Messages.Create(...)                  → grounded answer + source books
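Step 3 is plain string assembly, so it can be sketched without any SDK calls. The RetrievedChunk shape, the instruction wording, and the numbered-context layout below are all assumptions; the as-built RagService may format its prompt differently.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical shape for a search hit after payload extraction.
public sealed record RetrievedChunk(int BookId, string Title, string Text, float Score);

public static class RagPrompt
{
    // Illustrative instruction text, not the as-built wording.
    public const string Instructions =
        "You recommend books. Answer only from the context below. " +
        "If the context does not cover the question, say so.";

    // Builds the system block: instructions followed by numbered context chunks.
    public static string BuildSystem(IReadOnlyList<RetrievedChunk> chunks)
    {
        var sb = new StringBuilder(Instructions);
        sb.Append("\n\nContext:\n");
        for (var i = 0; i < chunks.Count; i++)
            sb.Append($"[{i + 1}] {chunks[i].Title}: {chunks[i].Text}\n");
        return sb.ToString();
    }

    // Distinct source titles, in retrieval order, to return alongside the answer.
    public static IReadOnlyList<string> Sources(IReadOnlyList<RetrievedChunk> chunks) =>
        chunks.Select(c => c.Title).Distinct().ToList();
}
```

The user's query stays out of this string and goes in the user message, so the instructions and context remain one self-contained system block.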

Put the retrieved context in the cached system block using CacheControlEphemeral, reusing the C5 prompt-caching pattern (same minimum-tokens caveat). The endpoint stays thin and delegates to the service:

app.MapPost("/api/recommend", async Task<Results<Ok<RecommendResult>, ValidationProblem>> (
    RecommendRequest request, IRagService rag, CancellationToken ct) =>
{
    if (string.IsNullOrWhiteSpace(request.Query))
        return TypedResults.ValidationProblem(new Dictionary<string, string[]>
        {
            ["query"] = ["Query is required."],
        });

    return TypedResults.Ok(await rag.RecommendAsync(request.Query, ct));
}).WithTags("Recommend");

In Program.cs: register the IEmbeddingGenerator (Ollama or OpenAI), the IVectorStore (by config), CorpusIngestionService, and RagService; ingest on startup; and map the endpoint.

10. Build, run, and verify

dotnet build BookTracker.sln
dotnet run --project BookTracker.Api          # in-memory store; needs Ollama for ingestion

Ask for a grounded recommendation:

curl -X POST http://localhost:5255/api/recommend \
  -H "Content-Type: application/json" \
  -d '{"query":"I liked Clean Code — what should I read next?"}'

You should get a recommendation grounded in retrieved books, with the source titles it drew from. Compare it against the same question asked without retrieval (e.g. your C5/C6 chat endpoint) — the answers should visibly differ.

Part C — the quality lever. Change ChunkSize / ChunkOverlap in config, restart, and re-ask: retrieval changes. That's the whole point — quality comes from chunking, not the model.

Optional — the real Qdrant stack. Bring up Qdrant mapping both ports (the .NET client is gRPC on 6334; 6333 is REST/UI), then flip the provider:

docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant   # or: docker compose up -d

Set VectorStore:Provider to "Qdrant" in appsettings, restart, and confirm the collection is created with size 768 and the corpus ingests (count > 0).

✅ Checkpoint — you're done when:

dotnet build is green; BookTracker.VectorStore exists and Api references it.
Book.Description exists with seed blurbs and the AddBookDescription migration.
The corpus ingests at startup (books + reviews chunked, embedded, upserted — count > 0).
POST /api/recommend returns recommendations grounded in retrieved context, with sources.
Grounded vs ungrounded answers for the same query visibly differ.
Changing chunk size/overlap changes retrieval.
The in-memory provider works as a Docker-free fallback; (optional) Qdrant works with both ports mapped (6333 REST/UI, 6334 gRPC) and a collection vector size of 768.

You're now at checkpoint/c7-rag.

What's next

Lab 4 (C7 → C8): you'll build your own MCP server in C# — exposing BookTracker's Core services as tools an MCP client (like Claude) can call.
