From Prompt Engineering to AI Programming: Building Enterprise-Ready Generative AI Solutions
As organizations race to integrate generative AI into their business workflows, they are hitting a familiar challenge - the gap between a cool demo and a reliable enterprise solution (one that you'd be confident betting the business, or at the very least your reputation on). To bridge this, we must shift the mindset from prompt engineering to rigorous AI programming and systematic evaluation, just like with any other software engineering project.
The quality challenge in the age of AI
Over the last year or so we have seen the world move at 100mph, with AI integration into every product (whether it's a good fit or not!) and organisations eager to deploy LLMs and AI services into production in their own systems and workflows. With frontier models readily available through cloud APIs, and deep enterprise integration into big data platforms through wrapper/portal services like Microsoft Foundry, it's very easy to get started and get excited about what's possible.
This is exacerbated by the fact that AI is now pretty much ubiquitous for consumers of digital products. If business stakeholders are used to getting instant results from native LLM app interfaces on their devices, then expectations are high from the outset.
But there's a big gap between a PoC and a working, reliable enterprise solution. On the face of it, adding an LLM service to your application is just another API integration, but their behaviour brings with it a new set of engineering concerns that can cause problems if not understood fully.
This post explains what those things are, and why they should be treated the same as any other engineering quality concern, so that you can build AI-integrated solutions with confidence.
A timeline of engineering quality
Before we dig into the specifics of LLM models and AI services, it's useful to look back at other technology and software architecture patterns, and understand how we thought about them in terms of ensuring quality. The easiest way for me to do that is to look back at my own career. Clearly this won't be a fully comprehensive guide, but there's enough experience in there to highlight commonalities across technology stacks and ecosystems.
Establishing the foundations
When I began working as a software developer in 2001, after a brief dabble with Borland Delphi and ASP pages, I quickly found myself immersed in the world of .NET. This was .NET 1.1 territory and ASP.NET Web Forms (around which there were a lot of strong opinions!). As a framework that was designed to magically generate HTML web pages using a set of server-side components, it was inherently difficult to pull apart business logic from the user interface layer, which made it hard to test.
But what we were building was complicated (a government funded, UK-wide, multi-modal journey planner, pre-Google Maps) and we needed to prove that it was working correctly. The "auto-magic" that the framework provided allowed for rapid development of things that were simple, but started to get in the way and make things harder as the logic and interactions became more complex. I learned about unit testing as these tools started to become available for .NET (NUnit, MbUnit and then MSTest), and how to refactor code to be able to validate the things we needed to. Having confidence that things were working as they should shifted the dial from slow and brittle feedback loops to rapid, reliable validation cycles.
By the time the ALT.NET movement gained traction towards the end of the 2000s, I was a full-blown TDD aficionado. Shifting the focus to test-first made for better system design, and along with it came more advanced techniques and tooling like mocking, inversion-of-control containers and continuous integration processes to automate quality gates.
Around this time, I was also lucky enough to attend one of JP Boodhoo's .NET engineering bootcamps in Vancouver, which embedded the value of executable specifications with Behaviour Driven Development (BDD), highlighting the importance of natural language and encouraging closer collaboration between business and technical teams.
These core principles have been the foundation underpinning all software development I've been involved in since, enabling high-quality delivery of well documented code in iterative development cycles. They've been applied across web and native application stacks and across a variety of architecture patterns. But whilst there have been specific implementations and frameworks that became the flavour of the month according to the language or toolset of the moment, it's the concepts and the approach that have always mattered.
Contract-first, observability and resilience in the cloud
Fast forward to the 2010s and the world had moved on to API-first, REST-based architectures. At this point I was leading a small team responsible for building the payment processing engine for a large Middle-Eastern airline. With these APIs any mistakes could literally cost money, so as well as ensuring that we had comprehensive test coverage, we also focused on instrumentation and observability to help us diagnose things in our production environment. This was how I learned (the hard way!) that not all currencies have 2 decimal places! And more generally that if you're building an API, you also need a way to execute it. This was pre-Postman, and pre-Swagger, so the only thing to do was build our own version of an API client - a test harness that could be used to execute and validate the various endpoints.
In parallel to this, we were moving everything into the cloud, and started to encounter a new set of quality challenges. Distributed systems brought transient failures, eventual consistency, and the need for sophisticated retry logic.
This experience taught me that integration testing isn't just about verifying that your code works, it's about understanding the contract between systems, the innate behavioural patterns and capabilities of those systems, and building evaluation harnesses that can systematically validate behaviour across different scenarios.
Data, machine learning, and the challenge of uncertainty
By the time we get to 2015, the cloud had also allowed us to capture a lot more data. And so the landscape shifted again with an increasing focus on data engineering - machine learning, data science, cloud data platforms and advanced analytics. The data space was, and in many ways still is, less mature when it comes to engineering practices. But whilst the challenges were different, I found myself still applying the same "there's always a way to test something" mentality.
With machine learning and data science, the tendency to draw conclusions from patterns in data that are really just random noise could be balanced with structured experimentation, upfront definition of success metrics and rigorous validation.
Cloud data pipelines that were asynchronous and long-running meant you couldn't just run a quick unit test and get instant feedback. So we needed new approaches - schema validation to catch structural changes and data quality issues early, synthetic data generation to test edge cases that might not appear in production for months, and data snapshot testing to validate consistency of pipeline outputs over time. We developed approaches for testing data quality at scale, validating not just that pipelines completed successfully, but that the data they produced was fit for purpose.
The key insight from this is that uncertainty doesn't mean untestable, it just means we need different validation strategies. This sometimes meant shifting from testing specific values to testing behaviours and patterns. My talk at SQL Bits in 2024 - "Do those numbers look right?" summarises how we were thinking about engineering quality in our data solutions, and deep dives into practical approaches for testing Power BI reports, data pipelines and Spark and Python interactive notebooks.
So are LLMs really any different?
Yes and no. On the one hand, they're just another integration - either via an API, or through local model deployment. But on the other hand, they exhibit characteristics that require us to think differently about validation.
They're non-deterministic: Unlike traditional APIs where the same input always produces the same output, LLMs can generate different responses each time due to their probabilistic nature (even when you set the temperature to 0). This makes traditional unit testing approaches which rely on exact output matching ineffective.
They can hallucinate: LLMs can confidently generate plausible-sounding but false information. Unlike a database query that either returns valid data or throws an error, an LLM might return a well-constructed response that is actually wrong - syntactically correct, but semantically and factually incorrect. This requires us to validate not just the structure of responses, but their factual accuracy and relevance.
They produce qualitative, unstructured responses: Traditional software returns structured data (JSON objects, numerical values, boolean flags etc.). LLMs return natural language, which is inherently ambiguous and context-dependent. Despite advancements in the area of structured outputs, this still isn't 100% reliable. And how do you write an assertion that validates "the response should be friendly and helpful"?
However, none of these challenges are entirely new. Non-determinism shows up in async operations, race conditions, rate limiting, or time-dependent behaviour. User experience validation can be qualitative in nature. And we've had to deal with integration points that might return unexpected results.
On that basis we can, and should, apply the same core engineering principles we've always used, albeit adapted to the unique characteristics of LLMs.
1. Break open the black box
Just as ASP.NET Web Forms made it hard to test by tightly coupling UI and logic, LLM integrations can become black boxes if we treat them as magic, closed systems. The solution is the same - refactor to separate concerns. This means treating your LLM interaction as a discrete component with clear inputs and outputs. Which are the configurable bits that you have control over (e.g. the model version, the prompt, the temperature etc.), and which bits are "inside the box" (e.g. the inner workings of the model, system prompts etc.)?
For example, don't embed prompts directly in your application code. Instead, create a prompt management layer that allows you to version, test, and iterate on prompts independently of your application logic.
At endjin, when we build LLM-powered solutions, we structure them so that the prompt construction, model invocation, and response parsing are separate, testable components. This helps to unlock the ability to swap models, adjust parameters, or refine prompts without touching core business logic.
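As a minimal sketch of that separation, the snippet below keeps prompt construction, model invocation and response parsing as discrete, independently testable components. All names here (`PromptTemplate`, `invoke_model`, `parse_response`) are illustrative, and the model call is a stub standing in for a real provider SDK.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptTemplate:
    """A versioned prompt that lives outside application code."""
    name: str
    version: str
    template: str

    def render(self, **values: str) -> str:
        # Prompt construction is a pure function - easy to unit test.
        return self.template.format(**values)

def invoke_model(prompt: str, *, temperature: float = 0.0) -> str:
    # Stand-in for the real LLM API call - the only non-deterministic piece.
    return f"[model response to: {prompt}]"

def parse_response(raw: str) -> dict:
    # Response parsing is again deterministic and testable in isolation.
    return {"text": raw, "length": len(raw)}

summarise = PromptTemplate(
    name="summarise-complaint",
    version="1.2.0",
    template="Summarise this complaint in one sentence: {complaint}",
)

prompt = summarise.render(complaint="My parcel arrived 3 days late.")
result = parse_response(invoke_model(prompt))
```

Because the prompt is a versioned value rather than a string buried in application code, you can swap models or iterate on prompt wording without touching the business logic around it.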
2. Embrace natural language as a feature
One of the biggest insights from Behaviour Driven Development was that natural language specifications bridge the gap between business stakeholders and technical teams. LLMs flip this on its head - natural language isn't just the specification, it's also the programming interface.
This can be seen as an advantage. You can write evaluation criteria in plain English: "The response should identify the customer's primary concern", "The summary should be under 100 words", "The sentiment should be appropriate to the context". Then you can use LLMs themselves to evaluate these criteria. Don't forget that the LLM can also be instructed to return additional numerical or categorical information that can augment the natural language response (for example a confidence level between 0 and 1), which can enable more traditional testing to still be performed.
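One way to sketch this: build an evaluation prompt from plain-English criteria and ask the model to reply with structured fields, including a numeric confidence that conventional assertions can check. Here `call_judge_model` is a stub standing in for a real LLM API call, and the JSON shape is an assumption, not a standard.

```python
import json

CRITERIA = [
    "The response should identify the customer's primary concern",
    "The summary should be under 100 words",
]

def build_evaluation_prompt(response_text: str) -> str:
    criteria = "\n".join(f"- {c}" for c in CRITERIA)
    return (
        "Evaluate the response below against each criterion.\n"
        f"Criteria:\n{criteria}\n"
        f"Response:\n{response_text}\n"
        'Reply as JSON: {"passed": bool, "confidence": float, "reason": str}'
    )

def call_judge_model(prompt: str) -> str:
    # Stub: a real implementation would send `prompt` to a judge model.
    return '{"passed": true, "confidence": 0.92, "reason": "Concern identified."}'

def evaluate(response_text: str) -> dict:
    verdict = json.loads(call_judge_model(build_evaluation_prompt(response_text)))
    # The numeric confidence lets a conventional test assert against a threshold.
    assert 0.0 <= verdict["confidence"] <= 1.0
    return verdict

verdict = evaluate("We're sorry your delivery was late; here is a refund.")
```

The natural-language criteria stay readable by business stakeholders, while the structured fields give the test suite something it can assert on deterministically.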
Taking it a step further, part of the specification could be a feedback loop to improve the specification (akin to getting someone else to review your work) before you execute the steps. This might mean LLMs all the way down, but in a good way.
The key is to be systematic about it. There are possible weaknesses in this approach when you consider the vagaries of language, but creating a feedback loop to explore the context with an LLM can be very powerful. Define your success criteria upfront, create evaluation prompts that assess those criteria, and validate them against labelled examples before you rely on them in production.
3. Test-first, even for prompts
The discipline of Test-Driven Development teaches us to think about desired outcomes before implementation. This is even more important with LLMs, where it's easy to iterate endlessly on prompts without a clear definition of success.
Start by defining your test cases - specific inputs with expected behaviours. Not exact outputs (remember, we're dealing with non-deterministic responses) but behavioural expectations. For a customer service chatbot, you might want to identify what a complaint is about and make sure the right resolution is offered.
For example:
Given the customer comment is "Despite paying extra for speedy postage, the promised delivery date was missed by 3 days!"
When the chatbot generates a customer service response
Then the response should identify the core issue as 'delivery delay'
And the tone should be 'apologetic'
And the resolution offered should be 'refund of premium postage'
Then iterate on your prompt design until your system reliably meets these criteria. Track your success rate over time. If you're getting 85% success on your test suite, that's a quantifiable baseline you can work to improve. This is infinitely better than the "it seems to work pretty well" approach.
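The Given/When/Then scenario above can be expressed as a behavioural test case and a measurable success rate. In this sketch, `chatbot_reply` is a stub for the system under test, and `classify` uses crude keyword matching as a placeholder for whatever analysis (rules or a judge model) you'd really use; a real suite would have many more cases.

```python
TEST_CASES = [
    {
        "comment": ("Despite paying extra for speedy postage, the promised "
                    "delivery date was missed by 3 days!"),
        "expected_issue": "delivery delay",
        "expected_tone": "apologetic",
    },
]

def chatbot_reply(comment: str) -> str:
    # Stub for the system under test - a real chatbot would call an LLM.
    return ("We're very sorry the delivery was late. "
            "We will refund your premium postage.")

def classify(response: str) -> dict:
    # Placeholder keyword classifier standing in for a judge model.
    return {
        "issue": ("delivery delay"
                  if "delivery" in response and "late" in response
                  else "unknown"),
        "tone": "apologetic" if "sorry" in response else "neutral",
    }

def run_suite(cases) -> float:
    passed = 0
    for case in cases:
        facets = classify(chatbot_reply(case["comment"]))
        if (facets["issue"] == case["expected_issue"]
                and facets["tone"] == case["expected_tone"]):
            passed += 1
    return passed / len(cases)  # the success rate you track over time

success_rate = run_suite(TEST_CASES)
```

That single number is the quantifiable baseline - if it's 0.85 today, you know exactly what "better" looks like tomorrow.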
4. Build or use evaluation harnesses
Just as we built and used custom API clients to test our APIs, we need evaluation harnesses for LLM integrations. The good news here is that lots of AI developer services have eval frameworks built in. These aren't just test scripts, they're tools that allow you to systematically evaluate performance across multiple dimensions.
Your evaluation harness should:
- Run your prompts against a diverse set of test inputs
- Capture and version the outputs for comparison
- Apply multiple evaluation metrics (accuracy, relevance, tone, safety)
- Track performance over time as you refine prompts, change models, or upgrade to a new model version
- Handle transient failures and rate limits gracefully
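A minimal harness loop covering those points might look like the sketch below: run a set of inputs, capture versioned outputs, and apply several metrics to each. The `model` function and metric definitions are illustrative stubs, not any particular eval framework.

```python
def model(prompt: str) -> str:
    # Stub for the LLM call under evaluation.
    return f"response to {prompt}"

# Multiple evaluation dimensions, each a simple predicate over the output.
METRICS = {
    "non_empty": lambda out: len(out.strip()) > 0,
    "under_100_words": lambda out: len(out.split()) < 100,
}

def run_harness(inputs, prompt_version="v3"):
    results = []
    for text in inputs:
        output = model(text)
        results.append({
            "prompt_version": prompt_version,  # versioned for later comparison
            "input": text,
            "output": output,
            "metrics": {name: fn(output) for name, fn in METRICS.items()},
        })
    return results

results = run_harness(["summarise complaint A", "summarise complaint B"])
```

Persisting these result records per prompt version is what lets you see whether a prompt tweak or model upgrade actually moved the metrics in the right direction.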
There are parallels here with machine learning models trained and used in data science experiments. Whilst you can't do exact input/output testing, you can test acceptable tolerance ranges, error rates and so on.
5. You need labelled datasets
Just as you can't train a machine learning model without labelled data, you can't validate an LLM integration without example inputs and expected behaviours. Even a small number of well-chosen examples can provide meaningful validation. But also focus on edge cases and failure modes. What happens when the input is ambiguous? When it's in a different language? When it contains unusual formatting or special characters?
As your system matures, invest in building a larger, more diverse evaluation dataset. Involve domain experts who can set the acceptance criteria of the system. This becomes your regression test suite, allowing you to confidently refine prompts or change models while ensuring you haven't broken existing functionality. Given the volume of tests that will be required, don't underestimate the level of effort required to curate them. This contributes to the TCO of the solution - production grade LLM solutions are expensive endeavours.
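A small labelled dataset might start as simply as the sketch below, deliberately including the edge cases mentioned above (ambiguity, another language, unusual formatting). The field names are illustrative; in practice these suites are usually curated as JSONL or CSV files with input from domain experts.

```python
LABELLED_EXAMPLES = [
    {"input": "My order never arrived.",
     "label": "delivery issue", "edge_case": None},
    {"input": "It's fine I guess??",
     "label": "ambiguous sentiment", "edge_case": "ambiguity"},
    {"input": "Mi pedido llegó roto.",
     "label": "damaged item", "edge_case": "non-English"},
    {"input": "REFUND!!! $$$ <asap>",
     "label": "refund request", "edge_case": "unusual formatting"},
]

def coverage_report(examples) -> dict:
    # Check the dataset actually exercises the failure modes you care about.
    covered = {e["edge_case"] for e in examples if e["edge_case"]}
    return {"total": len(examples), "edge_cases_covered": sorted(covered)}

report = coverage_report(LABELLED_EXAMPLES)
```

A coverage check like this is a cheap way to spot when your regression suite has grown large but still only tests the happy path.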
6. Embrace transient failures and build resilience
LLM APIs, like any external service, can experience transient failures, rate limits, performance degradation or varying response times. Your integration needs to handle these gracefully.
Implement retry logic with exponential backoff. Cache responses where appropriate. Monitor latency and error rates. Build fallback strategies for when the LLM service is unavailable. These are the same patterns we use for any cloud API integration, just applied to a different integration point.
7. Test behaviours, not values
Finally, apply the same patterns for validating data quality - focus on behaviours and patterns rather than specific values. Instead of asserting that a generated email contains the exact phrase "Thank you for your inquiry", check that it:
- Addresses the customer's stated concern
- Maintains an appropriate tone
- Includes relevant next steps
- Doesn't contain factual inaccuracies
This is more robust to the natural variation in LLM outputs, while still catching the failure modes that matter. It's often beneficial to use a separate LLM model or service to do this validation so that you avoid the innate bias used in generating the original output being used to validate it (i.e. don't ask an LLM to mark its own work). This technique is referred to as LLM-as-a-judge.
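The LLM-as-a-judge pattern can be sketched as follows: one model generates the response, and a separate model validates it against the behavioural checks above. Both model functions here are stubs (the judge uses a trivial keyword heuristic to stand in for a real second model's yes/no answer), and all names are illustrative.

```python
def generator_model(prompt: str) -> str:
    # Stub for the model that produces the customer-facing response.
    return ("Thanks for raising this - we've reissued your invoice and "
            "a specialist will call you tomorrow to confirm the details.")

def judge_model(instruction: str, response: str) -> bool:
    # Stub for a *different* model that validates the response, so the
    # generator doesn't mark its own work. A keyword heuristic stands in
    # for the judge's verdict here.
    keywords = {
        "addresses the stated concern": "invoice",
        "maintains an appropriate tone": "Thanks",
        "includes relevant next steps": "tomorrow",
    }
    return keywords[instruction] in response

BEHAVIOUR_CHECKS = [
    "addresses the stated concern",
    "maintains an appropriate tone",
    "includes relevant next steps",
]

response = generator_model("My invoice is wrong")
verdicts = {check: judge_model(check, response) for check in BEHAVIOUR_CHECKS}
```

Keeping generator and judge as separate components also means you can swap in a different (or cheaper) judge model without touching the generation path.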
8. Instrument and observe
Across all of these approaches, there's a common thread - you need visibility into what's happening. As more work is handed to the LLM, the need for human oversight grows: supervision, quality checks, curation of new examples, and responding proactively to new situations (e.g. a new type of enquiry driven by external factors). Instrument your LLM integrations thoroughly. Log inputs, outputs, and evaluation metrics. Track latency, error rates, and cost. Monitor for drift in model behaviour over time.
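A minimal instrumentation wrapper might look like the sketch below: each LLM call emits a structured log record with input/output sizes, latency and a rough cost estimate. The names and the cost figure are illustrative assumptions - in production you'd use your provider's real token and usage data.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm")

def instrumented_call(prompt: str, model_call) -> str:
    start = time.perf_counter()
    output = model_call(prompt)
    latency_ms = (time.perf_counter() - start) * 1000
    record = {
        "prompt_chars": len(prompt),
        "output_chars": len(output),
        "latency_ms": round(latency_ms, 2),
        # Illustrative estimate only - use real token counts from your provider.
        "approx_cost_usd": round(len(prompt + output) / 4 * 1e-6, 8),
    }
    log.info(json.dumps(record))  # structured logs feed dashboards and alerts
    return output

output = instrumented_call("Classify this enquiry", lambda p: f"classified: {p}")
```

Emitting these records as structured JSON means the same data can drive dashboards, cost alerts, and drift detection over time.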
At endjin, we treat observability as a first-class concern for LLM integrations, just as we do for any other production system.
Summary: From black box to engineering discipline
The journey from prompt engineering to AI programming is really a journey from treating LLMs as magic to treating them as engineered components. Yes, they have unique characteristics (non-determinism, hallucinations, qualitative outputs), but these aren't deal-breakers, they're just new constraints that require adapted approaches.
To succeed in deploying reliable, enterprise-grade AI solutions, treat LLM integrations with the same engineering standards that you apply to any other system component. That means systematic evaluation, defensive engineering, instrumentation, and continuous improvement.
At endjin, we've seen this transformation happen with our clients. The shift from "let's see if this prompt works" to "here's our evaluation framework and current performance metrics" represents a fundamental change in how teams think about AI integration. And it's that shift that enables moving from demos to production-ready solutions with confidence.
In subsequent posts, I'll dive deeper into the practical tools and frameworks that make this systematic evaluation possible.