Everyone knows how to write a prompt. Open a chat window, type instructions, iterate until the output looks right. But moving that prompt into a production system — where it runs thousands of times a day, handles edge cases, and needs to be maintained by a team — is a fundamentally different problem.
At Sanera, we have built AI-powered features across several client projects. Here is what we have learned about making prompts work reliably at scale.
// Version everything
Prompts are code. Treat them that way. Store them in version control, tag releases, and track which version is running in production. When output quality degrades — and it will — you need to know exactly which prompt version caused the regression.
A simple pattern that works well:
prompts/
  classify-intent/
    v1.0.txt      # original
    v1.1.txt      # added edge-case handling
    v2.0.txt      # restructured for gpt-4o
    eval.json     # test cases + expected outputs
    config.yaml   # model, temperature, max_tokens

// Build an eval suite
You cannot improve what you cannot measure. Before touching a prompt, build a set of test cases: inputs with known-good outputs. Run every prompt change against this suite before deploying.
Start simple. Even 20-30 representative examples catch most regressions. Metrics depend on the task:
- Classification: accuracy, precision, recall against labeled examples.
- Generation: human review scores, format compliance rate, hallucination checks.
- Extraction: exact match or fuzzy match against expected fields.
- Latency and cost: always track tokens consumed and response time per call.
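A minimal eval runner is only a few lines. The sketch below scores a classifier-style prompt by exact-match accuracy against labeled examples; `runEvals`, `EvalCase`, and the injected `classify` function are illustrative names, not part of any framework, and `classify` stands in for your actual LLM call.

```typescript
// Hypothetical shape of one entry in eval.json — adapt the
// field names to however you store your test cases.
type EvalCase = { input: string; expected: string };

// Runs every case through the model and reports exact-match accuracy.
// Failures are returned so they can be inspected (or logged) directly.
async function runEvals(
  classify: (input: string) => Promise<string>,
  cases: EvalCase[],
): Promise<{ passed: number; total: number; failures: EvalCase[] }> {
  const failures: EvalCase[] = [];
  for (const c of cases) {
    const actual = await classify(c.input);
    // Exact match after trimming; swap in fuzzy matching for
    // extraction tasks or a rubric scorer for generation tasks.
    if (actual.trim() !== c.expected) failures.push(c);
  }
  return {
    passed: cases.length - failures.length,
    total: cases.length,
    failures,
  };
}
```

Wiring this into CI so every prompt change runs the suite before merge is the cheapest regression guard you can buy.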
// Guard-rails are not optional
In a chat playground, a bad output is an inconvenience. In production, it is a bug that hits real users. Structured output schemas, output validation, and fallback logic are non-negotiable.
// Validate LLM output before using it
const result = await llm.generate(prompt, input)
const parsed = outputSchema.safeParse(result)
if (!parsed.success) {
  logger.warn("LLM output failed validation", {
    prompt_version: "v2.1",
    input_hash: hash(input),
    errors: parsed.error.issues,
  })
  return fallbackResponse(input)
}
return parsed.data

Log every failure. These logs become your next round of eval cases and your best signal for where the prompt needs work.
// Template, do not concatenate
String concatenation is the prompt engineering equivalent of SQL injection. Use a template system that separates the prompt structure from the dynamic inputs. This makes prompts readable, testable, and safe from accidental injection.
Most frameworks now support this natively. LangChain, LlamaIndex, and even simple Mustache templates work. The key is that the prompt text and the user input are never mixed in raw string operations.
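Even without a framework, the separation takes only a few lines. Below is a minimal Mustache-style renderer, a sketch rather than any library's API: the prompt is a static template, user input fills a named slot, and a missing key fails loudly instead of silently producing a half-formed prompt. `renderPrompt` and `INTENT_PROMPT` are illustrative names.

```typescript
// Replaces {{key}} placeholders from a values map.
// Throws on a missing key rather than emitting an empty slot.
function renderPrompt(template: string, values: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_match, key: string) => {
    if (!(key in values)) throw new Error(`Missing template value: ${key}`);
    return values[key];
  });
}

// Prompt structure lives in one place; the user's text only ever
// fills the dedicated slot at the end, never mid-instruction.
const INTENT_PROMPT = [
  "Classify the user's intent as one of: pricing, cancellation, support.",
  "Respond with the label only.",
  "",
  "User message:",
  "{{message}}",
].join("\n");

const prompt = renderPrompt(INTENT_PROMPT, { message: "How much is the pro tier?" });
```

Because the template is a plain constant, it can live in version control next to its eval cases, which ties this section back to the versioning pattern above.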
// The takeaway
Prompt engineering in production is not about writing clever instructions. It is about building a system around those instructions: version control, evaluation, validation, monitoring, and graceful degradation. The prompt itself is often the simplest part. The infrastructure around it is what makes it reliable.