Test retrieval before generation
If the wrong context comes back, the answer is already on bad footing. Check whether the retrieved content is relevant, allowed for the user, recent enough, and complete enough for the task.
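One way to make these retrieval checks concrete is a small assertion helper that runs before any text is generated. The sketch below is illustrative only: the `Chunk` fields, the `check_retrieval` function, and the group-based permission model are assumptions made for the example, not part of any particular RAG framework.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Chunk:
    # Hypothetical shape of a retrieved chunk; real systems will differ.
    doc_id: str
    allowed_groups: frozenset  # groups permitted to read the source document
    updated_at: datetime       # last time the source document changed


def check_retrieval(chunks, user_groups, expected_doc_ids, max_age, now):
    """Return a list of problems found in the retrieved chunks (empty means pass)."""
    problems = []
    retrieved_ids = {c.doc_id for c in chunks}
    # Relevance and completeness: every document marked as needed should appear.
    missing = set(expected_doc_ids) - retrieved_ids
    if missing:
        problems.append(f"missing expected documents: {sorted(missing)}")
    for c in chunks:
        # Permissions: the user must share at least one group with the document.
        if not (c.allowed_groups & set(user_groups)):
            problems.append(f"{c.doc_id}: not permitted for this user")
        # Freshness: the document must have changed within the allowed window.
        if now - c.updated_at > max_age:
            problems.append(f"{c.doc_id}: older than {max_age.days} days")
    return problems


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
chunks = [
    Chunk("refund-policy", frozenset({"support"}), now - timedelta(days=10)),
    Chunk("salary-bands", frozenset({"hr"}), now - timedelta(days=400)),
]
problems = check_retrieval(
    chunks,
    user_groups={"support"},
    expected_doc_ids={"refund-policy"},
    max_age=timedelta(days=180),
    now=now,
)
for p in problems:
    print(p)
```

Here the second chunk is flagged twice, once for permissions and once for age, while the expected document is present and passes. The expected-document set stands in for a relevance judgment, which in practice would come from labeled test queries.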
Test grounding, not just pleasant wording
A strong answer should stay anchored to the retrieved material. Test cases should include situations where the system must stick closely to source content, cite it clearly, or admit that the context is insufficient.
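A grounding test can start very simply. The sketch below flags answer sentences whose words mostly do not appear in the retrieved context. Word overlap is a crude proxy chosen here for illustration, not a recommended metric, and the `unsupported_sentences` function and its `min_overlap` threshold are assumptions of this example; production checks more often rely on entailment models or human review.

```python
import re


def _tokens(text):
    # Lowercased alphanumeric word set; deliberately naive.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def unsupported_sentences(answer, context, min_overlap=0.6):
    """Return answer sentences whose words are mostly absent from the context."""
    context_tokens = _tokens(context)
    flagged = []
    # Split on whitespace that follows sentence-ending punctuation.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = _tokens(sentence)
        if not words:
            continue
        overlap = len(words & context_tokens) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged


context = "Refunds are available within 30 days of purchase with a receipt."
answer = (
    "Refunds are available within 30 days of purchase. "
    "Shipping is always free worldwide."
)
flagged = unsupported_sentences(answer, context)
print(flagged)
```

The first sentence is fully covered by the context and passes; the second shares no words with it and is flagged. A check like this catches invented claims but not paraphrased ones, so it works best as a cheap first filter.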
Include “no answer” and “bad context” cases
Some of the most important RAG tests are negative ones: nothing relevant retrieved, stale content retrieved, contradictory content retrieved, or content the user should not see. Good systems need safe fallback behavior here.
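The negative cases listed above can be written as a small table of inputs paired with a safety predicate. In the sketch below, `toy_rag_answer` is a trivial stand-in for the real system so that the harness runs on its own; the chunk dictionary keys, the fallback message, and the predicates are all assumptions made for illustration.

```python
FALLBACK = "I don't have enough information to answer that."


def toy_rag_answer(question, chunks, user_groups):
    """Stand-in for the real system, kept trivial so the harness can run."""
    visible = [c for c in chunks if c["group"] in user_groups and not c["stale"]]
    if not visible:
        return FALLBACK
    claims = {c["text"] for c in visible}
    if len(claims) > 1:
        return "The sources disagree: " + " / ".join(sorted(claims))
    return claims.pop()


# Each case: (name, retrieved chunks, predicate that is True for a safe answer).
NEGATIVE_CASES = [
    ("nothing retrieved", [], lambda a: a == FALLBACK),
    ("stale only",
     [{"text": "Plan costs $5", "group": "all", "stale": True}],
     lambda a: a == FALLBACK),
    ("contradictory",
     [{"text": "Limit is 10", "group": "all", "stale": False},
      {"text": "Limit is 20", "group": "all", "stale": False}],
     lambda a: "disagree" in a),
    ("forbidden content",
     [{"text": "Salary is X", "group": "hr", "stale": False}],
     lambda a: "Salary" not in a),
]


def run_negative_cases(system, user_groups=("all",)):
    """Return the names of the cases where the system's answer was unsafe."""
    failures = []
    for name, chunks, is_safe in NEGATIVE_CASES:
        answer = system("any question", chunks, user_groups)
        if not is_safe(answer):
            failures.append(name)
    return failures


failures = run_negative_cases(toy_rag_answer)
print(failures)
```

Swapping `toy_rag_answer` for a system that simply echoes whatever was retrieved makes all four cases fail, which is a useful sanity check that the predicates are actually testing something.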
Watch for freshness and permissions
RAG failures are as often operational as they are model-related. An out-of-date index, a missing document, or a permission leak can turn a fluent answer into a trust problem.
Measure the whole flow
Useful test signals include retrieval relevance, groundedness of the answer, source quality, latency, and fallback quality. The model response alone is not the whole system.
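To keep these signals side by side, each test run can record one result per query and summarize them together. The sketch below is a minimal illustration: the `FlowResult` fields and the `summarize` function are hypothetical names, each boolean is assumed to come from a separate check run elsewhere, and source quality is left out because it does not reduce to a simple flag.

```python
from dataclasses import dataclass


@dataclass
class FlowResult:
    retrieval_hit: bool      # did an expected document appear in the retrieved set?
    grounded: bool           # did a grounding check pass on the answer?
    used_fallback: bool      # did the system decline to answer?
    fallback_expected: bool  # should it have declined for this query?
    latency_s: float         # end-to-end time for the request, in seconds


def _rate(flags):
    flags = list(flags)
    return sum(flags) / len(flags) if flags else None


def summarize(results):
    """Aggregate per-query results into one report covering the whole flow."""
    # Groundedness only makes sense for queries that produced a real answer.
    answered = [r for r in results if not r.used_fallback]
    return {
        "retrieval_hit_rate": _rate(r.retrieval_hit for r in results),
        "grounded_rate": _rate(r.grounded for r in answered),
        "fallback_accuracy": _rate(
            r.used_fallback == r.fallback_expected for r in results
        ),
        "max_latency_s": max(r.latency_s for r in results),
    }


results = [
    FlowResult(True, True, False, False, 0.8),
    FlowResult(True, False, False, False, 1.2),
    FlowResult(False, False, True, True, 0.3),
    FlowResult(False, False, False, True, 0.9),
]
summary = summarize(results)
print(summary)
```

In this made-up sample, the last query is the interesting one: retrieval missed, the system should have fallen back, and it answered anyway. Looking only at the final response would hide that the failure began at retrieval.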