AI as a QA Engineer's Assistant: Local RAG for Test Documentation


The idea didn't come from a desire to "automate routine tasks" or follow trends. I wanted something different: a counterpart that knows the documentation as well as I do. One that holds the entire structure in mind — requirements, test methods, test cases, their relationships and contradictions. One that remembers what was in version 2.1 and how it relates to section 4.3.

A local AI agent that lives inside the corporate perimeter and works with the same knowledge base I do.

Not an executor that needs context explained from scratch. Not a keyword search engine. One that understands document structure, grasps the meaning of a query, and operates with the same concepts the team uses.


Documentation in QA: Volume as a Systemic Problem

In QA, documentation is not supporting material. It is the working environment: test programs and methodologies, technical requirements, specifications, test cases, protocols, reports.

On an active project, the volume of these materials grows constantly. New versions appear. Old ones are revised. Documents reference each other. At some point, the documentation ecosystem becomes genuinely complex — too complex for any single person to hold entirely in their head.

The problem is not laziness. The problem is that humans are physically poor at handling large volumes of loosely structured text — especially when wording shifts between versions, a project has many components, and requirements and test cases are scattered across different documents.

I didn't want an executor. I wanted a twin: an agent that already knows the project context, understands the document hierarchy, and can reason at the level of meaning — "what is being tested here", "what does this overlap with", "what is missing". And one that runs entirely locally, without a single byte leaving the company perimeter.


Why Not the Cloud

Technically, the simplest path is to connect a cloud LLM and build RAG on top of it. That can be done in a few days.

In my case, that option wasn't on the table — for several reasons at once.

Confidentiality. The product the team works on is a proprietary development. The documentation contains commercially sensitive information. Sending technical specifications to an external LLM service creates legal risks and directly violates obligations to the client.

Dependency on external infrastructure. Access to cloud APIs can be restricted — and not only for technical reasons. The geopolitical situation has a real impact on service availability: without a dedicated VPN, accessing large models simply isn't possible. Add the risk of cloud infrastructure outages and the inability to pay officially — and the picture is complete.

Cost and predictability. A local system generates no variable operating costs: once the hardware is in place, each query costs nothing beyond electricity and time, and no line in the budget depends on a provider's pricing.

So I set a hard constraint:

Not a single byte of internal documentation should leave the working machine.

This made the task harder, but also more interesting.


What RAG Is and Why It's Necessary

An LLM on its own is a poor source of truth for internal documentation. It knows nothing about a specific product — especially if your documentation isn't available online. Ask it a direct question and it will either refuse to answer or start "making up" a plausible but incorrect response. This phenomenon is known in the professional community as hallucination.

In QA, hallucinations are unacceptable.

RAG (Retrieval-Augmented Generation) solves this problem: before generating a response, the system finds relevant fragments from the knowledge base and passes them into the model's context. The model answers not "from memory" but based on real data explicitly provided to it.

User question
      ↓
Search for relevant fragments in the knowledge base
      ↓
Pass fragments + question to the LLM as context
      ↓
Answer based on real data

The model becomes not a source of knowledge, but a tool for interpreting existing information. The key advantage: when documentation changes, there is no need to retrain the model — just re-index the updated documents.
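
To make the "fragments + question" step concrete, here is a minimal sketch of how such a context prompt can be assembled (the instruction wording is illustrative, not the project's actual prompt):

def build_rag_prompt(question: str, fragments: list[str]) -> str:
    """Assemble the context the LLM answers from: retrieved fragments, then the question."""
    context = "\n\n---\n\n".join(fragments)
    return (
        "Answer strictly from the documentation fragments below. "
        "If the answer is not in them, say so.\n\n"
        f"{context}\n\nQuestion: {question}"
    )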


RAG Evolution: How I Chose the Complexity Level

RAG is not a single fixed approach. Over the past few years, a clear evolution of complexity levels has emerged: from naively dumping an entire document into a prompt to full-fledged agentic systems with knowledge graphs.

| Level | Name           | Approach                              | Search method          | In project |
| ----- | -------------- | ------------------------------------- | ---------------------- | ---------- |
| 0     | Naive RAG      | Entire document → into prompt         | No search              |            |
| 1     | Basic RAG      | Fixed-size chunks + cosine similarity | Dense only             |            |
| 2     | Hybrid RAG     | Dense + Sparse + Reranker             | Dense + BM25 + Rerank  | Current    |
| 3     | Structured RAG | Metadata, filters, document hierarchy | Filtered search        | Current    |
| 4     | Agentic RAG    | LLM plans query chains, multi-hop     | Iterative              | Next step  |
| 5     | Graph RAG      | Knowledge Graph over vectors          | Graph + Dense          | Future     |

Level 0 — Naive. The entire document is placed in the prompt. Works for small files, but unusable for a large knowledge base: every model's context window is limited, and token cost (paid in time rather than money with local inference) makes the approach unviable.

Level 1 — Basic. Documents are split into fixed-size chunks, each chunk is vectorized, and the query is matched against chunks by cosine distance. It works, but handles technical terms, abbreviations, and numeric values poorly — exactly the things that are critical in QA documentation. Furthermore, such a system doesn't scale: the data is just a set of isolated chunks with no structural connections, and adding filtering by section or content type is nearly impossible without rethinking the entire architecture.

Level 2 — Hybrid. Sparse search — BM25 or SPLADE — is added to dense search for exact keyword lookup. Results from both sources are merged via Reciprocal Rank Fusion (RRF; a sketch follows this list), then reranked by a cross-encoder reranker. This substantially improves accuracy on technical terms.

Level 3 — Structured. Chunks are enriched with structural metadata: section number, block type (test case, section, table), name of the tested function. Search can be restricted to a specific section or content type before vector comparison — reducing noise and significantly improving result relevance.

Level 4 — Agentic. The LLM decides what to search for and when, making multiple sequential queries if needed and combining information from different sources. Required for complex questions like "compare the test methodology for module A with module B" — a multi-step task that a linear pipeline cannot handle.

Level 5 — Graph RAG. A graph of entity relationships on top of vector search. Enables answering questions about dependencies that are not explicitly stated in any individual document but emerge from the structure of the entire knowledge base.
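
For reference, the RRF merge mentioned at level 2 is only a few lines. A minimal sketch (k = 60 is the commonly used default, not a value tuned on this project):

def rrf_merge(dense_hits: list[str], sparse_hits: list[str], k: int = 60) -> list[str]:
    """Merge two ranked lists of chunk IDs via Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in (dense_hits, sparse_hits):
        for rank, chunk_id in enumerate(ranking, start=1):
            # an ID ranked high in either list accumulates a larger score
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)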

I settled on the combination of levels 2 and 3. This proved to be the optimal compromise: hybrid search with a reranker ensures search quality, structured metadata ensures filtering precision. Feasible to implement in reasonable time, delivers measurable results, and leaves a clear path for further growth.

Levels 4 and 5 are an interesting prospect, but a premature investment. Agentic loops and knowledge graphs are justified when the basic pipeline is exhausted. Starting with Graph RAG means solving problems I haven't encountered yet.


Solution Architecture

Technology Stack

| Component     | Technology             | Role                                   |
| ------------- | ---------------------- | -------------------------------------- |
| Orchestration | N8N (Docker)           | Workflow automation, triggers          |
| LLM / Embed   | LM Studio              | Local model inference                  |
| Embedding     | bge-m3                 | Multilingual vectors, 1024 dim         |
| Vector DB     | Qdrant                 | Vector storage and search              |
| Reranker      | FlashRank              | Reranking, CPU-only, < 50 ms           |
| API layer     | FastAPI                | HTTP interface between N8N and Python  |
| N8N DB        | PostgreSQL             | Workflow state storage                 |
| Parser        | DOCXConverter (custom) | .docx structure extraction             |

All components run in Docker on a single machine. LM Studio runs on the Windows host and is accessible to containers via host.docker.internal. This is a somewhat unconventional topology, but it allows LM Studio to use the host GPU without additional virtualization.

At first glance the architecture looks complex. In practice it is fairly transparent: indexing and querying are two independent pipelines.


System Overview

┌──────────────────────────── INDEXING PIPELINE ──────────────────────────────────┐

  📄 DOCX  ──►  DOCXConverter  ──►  TableSafeChunker  ──►  Embedder (bge-m3)
                  (hierarchy)        (smart chunking)        (LM Studio)
                                                                 │
                                                                 ▼
                                                              Qdrant
                                                       (vectors + payload)

└──────────────────────────────────────────────────────────────────────────────────┘

┌──────────────────────────── QUERY PIPELINE ─────────────────────────────────────┐
                                                                                  
  💬 N8N  ──►  QueryRouter  ──►  HybridRetriever  ──►  AnswerGenerator          
               (filters)          (dense + rerank)     (LM Studio LLM)           
                                       │                      │                  
                                       ▼                      ▼                  
                                    Qdrant                  Answer               
└─────────────────────────────────────────────────────────────────────────────────┘

Indexing: The Most Important Part

The hardest part of RAG is not search. The hardest part is data preparation.

Restoring DOCX Hierarchy

Standard libraries — for example, python-docx — see a document as a flat list of paragraphs. For QA, this is unacceptable. I wrote a custom DOCXConverter that relies exclusively on Word heading styles (Heading 1, Heading 2, etc.), ignoring manual numbering and text numbers in the document body.

As a result, each block receives a section_hierarchy — a string like "2.1.1.2" that precisely indicates its position in the document tree. Every test case knows its place — for example, 2.1.1.3. This enables accurate filtering by section.
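
The core of that logic fits in a short sketch with python-docx (illustrative; it assumes Word's built-in English style names "Heading 1", "Heading 2", and so on, which localized templates may rename):

from docx import Document

def iter_section_hierarchies(path: str):
    """Yield (hierarchy, heading_text) pairs such as ("2.1.1", "Client startup"),
    derived purely from heading styles; numbering typed into the text is ignored."""
    counters: list[int] = []
    for para in Document(path).paragraphs:
        style = para.style.name            # e.g. "Heading 2"
        if not style.startswith("Heading "):
            continue
        level = int(style.split()[-1])     # "Heading 2" -> 2
        while len(counters) < level:       # entering a deeper level
            counters.append(0)
        counters = counters[:level]        # climbing back up resets deeper counters
        counters[level - 1] += 1
        yield ".".join(map(str, counters)), para.text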

Table-Safe Chunking

Tables in test documentation contain critically important structural relationships: test steps in one column, expected results in another. Splitting a table into chunks destroys this relationship and renders the extracted fragments meaningless.

TableSafeChunker never splits tables. They are always indexed as a single block, converted to Markdown:

| Test method                  | Evaluation criteria          |
| ---------------------------- | ---------------------------- |
| 1. Restart the computer      | 1. Computer booted           |
| 2. Open the Client           | 2. Client opened             |

LLMs process this format significantly better than raw XML or scattered strings. Text blocks are split with a 100-character overlap, which preserves context at fragment boundaries.
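
The text-splitting part reduces to a few lines. A sketch (only the 100-character overlap comes from the project; the chunk size here is an assumed illustration):

def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split plain text into overlapping chunks. Tables never pass through here:
    they are indexed whole by TableSafeChunker."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]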

Structured Payload in Qdrant

Each vector stores not only content but also rich structural metadata:

payload = {
    "node_type":          "test_case",                 # block type: test case, section, table
    "section_hierarchy":  "2.1.1.1",                   # position in the document tree
    "tested_function":    "Client startup",            # the function this block verifies
    "chunk_type":         "table",                     # how the chunk was produced
    "parent_test":        "Verify Client startup...",  # the enclosing test case
    "source_document":    "TestSpec_v2.3.docx",        # originating file
}

This turns the knowledge base into a manageable structure rather than a bag of text fragments. A query like "show only test cases from section 2.1" is executed via metadata filtering before vector search — result quality is significantly higher than global search across the entire base.
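
With the Qdrant Python client, such a filtered search looks roughly like this (the collection name and filter values are illustrative):

from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

qdrant = QdrantClient(url="http://localhost:6333")
query_embedding = [0.0] * 1024  # stand-in for the question's bge-m3 vector

# apply the metadata filter first, then compare vectors only within the subset
hits = qdrant.search(
    collection_name="qa_docs",  # hypothetical collection name
    query_vector=query_embedding,
    query_filter=Filter(
        must=[FieldCondition(key="node_type", match=MatchValue(value="test_case"))]
    ),
    limit=30,
)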

Batch Vectorization

Embedder sends all chunks in a single batch request to LM Studio (bge-m3). Not N sequential requests — one. When indexing hundreds of chunks, this makes a meaningful difference in processing time.
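
The OpenAI-compatible embeddings endpoint accepts a list, so the whole batch goes out as a single request. A sketch (the model identifier depends on how LM Studio names the loaded bge-m3):

from openai import OpenAI

client = OpenAI(base_url="http://host.docker.internal:1234/v1", api_key="lm-studio")

def embed_batch(chunks: list[str]) -> list[list[float]]:
    """One request for the whole batch instead of N sequential calls."""
    response = client.embeddings.create(
        model="bge-m3",  # identifier as reported by LM Studio for the loaded model
        input=chunks,    # the endpoint accepts a list of strings
    )
    return [item.embedding for item in response.data]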


Query Pipeline

When I submit a question, the following happens:

1. QueryRouter analyzes the question and determines the search strategy: filter only by test_case type? Restrict to section 2.1? Or search the entire base without filters?

2. HybridRetriever vectorizes the query, searches Qdrant with filters applied, and intentionally retrieves 3× more candidates than needed — over-retrieval required for quality reranking.

3. FlashRank reranks results using a cross-encoder model. Unlike cosine distance, a cross-encoder evaluates the (query, document) pair jointly — this produces significantly more accurate relevance scoring. Runs on CPU in under 50 ms; a sketch follows this list.

4. AnswerGenerator constructs a prompt with context (tables represented as Markdown) and sends it to the LLM. Temperature 0.1 — a deliberate choice: in QA, reproducibility matters more than creativity.
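
Here is roughly what the reranking step looks like with the flashrank library (its default model is used here; the project may pin a specific cross-encoder):

from flashrank import Ranker, RerankRequest

ranker = Ranker()  # small CPU-only cross-encoder by default

def rerank(query: str, candidates: list[dict]) -> list[dict]:
    """Jointly score each (query, passage) pair and return passages best-first."""
    request = RerankRequest(
        query=query,
        passages=[{"id": c["id"], "text": c["text"]} for c in candidates],
    )
    return ranker.rerank(request)  # each result gains a relevance "score"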

Local Inference via LM Studio

LM Studio provides an OpenAI-compatible API on top of any GGUF model. The same Python code that works with the OpenAI API works locally — just change the base_url:

from openai import OpenAI

client = OpenAI(
    base_url="http://host.docker.internal:1234/v1",
    api_key="lm-studio",  # value doesn't matter, only the format is required
)
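
Generation goes through the same client. A minimal call (the model identifier is a placeholder for whatever GGUF model LM Studio has loaded; temperature 0.1 as in the query pipeline above):

response = client.chat.completions.create(
    model="local-model",  # placeholder: LM Studio serves the currently loaded model
    temperature=0.1,      # reproducibility over creativity
    messages=[
        {"role": "system", "content": "Answer strictly from the provided fragments."},
        {"role": "user", "content": prompt},  # context + question assembled earlier
    ],
)
answer = response.choices[0].message.content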

bge-m3 was chosen for its native Russian language support without additional configuration — critical for documentation written in Russian.


What This Delivers in Practice

The architecture enables several classes of tasks that previously consumed significant time when done manually.

Requirements coverage analysis. A query like "Which functional requirements from section 3.2 are not covered by test cases?" — the AI agent finds all requirements, finds all test cases, compares them, and produces a gap analysis. A task that takes several hours manually.

Test case generation. Input: a description of new functionality and existing test cases for a similar module. The agent generates a draft in the same format and style. The result requires review, but the approach substantially reduces the time needed for an initial draft.

Contradiction detection. "Are there discrepancies between the requirements in section 4 and the evaluation criteria in section 6?" The agent analyzes both sections simultaneously and flags inconsistencies — a task humans easily miss at scale.

Semantic search, not keyword search. "Find all test cases related to authorization" — even though the documents use the terms "authentication", "login", and "sign-in" interchangeably. Vector search finds semantically similar fragments regardless of the specific words used.

This is not a replacement for a QA engineer. It is an accelerator for working with text.


Limitations — They Exist

Resource constraints. Local execution means quantized models only — an unavoidable tradeoff between confidentiality and quality. It requires experimenting with quantization levels (Q4, Q5, Q8) and comparing the quality/speed ratio for specific tasks.

Agent looping. Situations occasionally arise where the agent enters an infinite cycle of re-querying or rephrasing. This is a known problem with agentic systems running on smaller models. I'm considering a switch to llama.cpp as the inference backend — it produces more predictable behavior on GGUF models.

Response latency. Local inference is slower than cloud. Using the host GPU brought throughput from single-digit tokens per second to hundreds — for tasks like "prepare a summary before a review", a 15–20 second delay is acceptable. But this is not the speed cloud users are accustomed to.

No versioning. When a document is re-indexed, the history of previous versions is not preserved. Comparing the current methodology with an earlier one is not currently possible without manual intervention.


What I Plan to Improve

Sparse vectors in Qdrant. bge-m3 can generate sparse vectors — Qdrant supports them natively. Integration is expected to yield a +15–25% accuracy improvement on technical terms and abbreviations.

Parent Document Retrieval. Search using small, precise chunks, but pass the entire parent block to the LLM as context. The model receives more context — answers become more coherent.

Agentic loop. Moving to level 4: the LLM decides how many times to search and what to look for. Required for multi-hop queries: "find all modules whose test cases mention the process client.exe" — a task that requires several sequential searches.

Document versioning. Storing version history via metadata will enable answering questions like "what changed in the test methodology between versions 2.1 and 2.3?"
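
One possible shape for this, with hypothetical field names added to the existing payload:

payload["doc_version"] = "2.3"  # parsed from the file name or document properties
payload["supersedes"] = "2.1"   # the previously indexed version, if any

A "what changed between versions" question then becomes a filtered comparison of two payload slices rather than a manual diff.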

Evaluation pipeline. This is arguably the most important item. There is currently no automated quality assessment for answers — it is unclear whether changes actually improve things or just feel like they do. Implementing RAGAS or an equivalent will provide measurable metrics: contextual precision, recall, relevance. Without this, iterations happen in the dark.


Conclusion

Local AI augmented with RAG is no longer a toy. It is a working tool.

It is not perfect. It is not as fast as the cloud. It requires engineering effort. But it is fully autonomous, protects data, scales with documentation — and genuinely saves time.

LM Studio (local models)
  + bge-m3 (multilingual embedding)
  + Qdrant (vector database)
  + FlashRank (reranker)
  + FastAPI (API layer)
  + N8N (orchestration)
= Fully local RAG agent for QA documentation

No cloud dependencies.
No data leaks.
Runs on a developer workstation.

The main takeaway: start at a reasonable complexity level. Not with knowledge graphs, not with agentic architectures. Start with a well-implemented hybrid + structured RAG. Adding complexity is only justified when the simpler solution stops handling real tasks.

If you're building something similar or have already been down this road — I'd be glad to discuss it in the comments, especially around quality evaluation and the transition to agentic architectures.