Highlights from "Do LLMs Generate Test Oracles That Capture The Actual Or The Expected Program Behaviour?"
Written by Lucian Ghinda
There’s a lot of discussion about using LLMs to build applications, including generating test cases. It’s interesting to read people’s experiences and see how LLMs are affecting software development. Opinions on the usefulness of LLMs range from very positive to quite negative.
To make sense of these different experiences, I decided to look at research studies about using LLMs for testing. I’ll focus on development in another article.
I plan to share some highlights as I review different studies on how generative LLMs are used for testing.
I’ll start with a 2024 study titled “Do LLMs generate test oracles that capture the actual or the expected program behaviour?”
Konstantinou, M., Degiovanni, R., & Papadakis, M. (2024). Do LLMs generate test oracles that capture the actual or the expected program behaviour? arXiv:2410.21136. https://doi.org/10.48550/arXiv.2410.21136
Before I continue, I want to clarify two important points:
- Test Oracle: “A test oracle is a mechanism that determines whether software executed correctly for a test case” (source) or “A source to determine an expected result” (source)
- Context of the study: The paper’s results are based on experiments with OpenAI GPT-3.5-turbo in 2024, so newer models may have improved. Still, I think the insights from this study are valuable for understanding AI-driven test generation.
I’ve picked out some key points from the study and will share my thoughts on each of them.
LLMs Mirror Code, Not Intentions
Interestingly, our results show that LLMs are more likely to generate test oracles that capture the actual program behaviour (what is actually implemented) rather than the expected one, i.e., the intended behaviour. Additionally, we find that the overall performance of the LLMs is relatively low (less than 50% accuracy) meaning that LLMs do not provide a strong oracle correctness signal. Therefore, all LLMs suggestions will need human inspection.
This suggests that when LLMs are asked to generate test cases and assertions without clear requirements, they tend to follow the code itself. In my view, to use LLMs effectively for test generation, you need well-written requirements so the LLM can use them to create test oracles.
Another situation is when your codebase lacks test cases and you want to add tests before refactoring. If you focus on the functional behavior (given certain inputs, the code should produce specific results), an LLM can help generate these kinds of test cases.
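For example, here is a minimal sketch of such a characterization test in Ruby with Minitest, built around a hypothetical legacy_discount method of my own invention. The assertions record what the code does today, as a regression safety net, not a claim that this behaviour is what the business intended:

```ruby
require "minitest/autorun"

# Hypothetical legacy method we want to pin down before refactoring.
def legacy_discount(order_total)
  return 0 if order_total < 100

  (order_total * 0.1).round
end

class LegacyDiscountTest < Minitest::Test
  # These assertions capture the current behaviour so a refactoring
  # cannot silently change it; they do not say the behaviour is correct.
  def test_no_discount_below_one_hundred
    assert_equal 0, legacy_discount(99)
  end

  def test_ten_percent_discount_from_one_hundred_up
    assert_equal 10, legacy_discount(100)
  end
end
```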
LLMs can boost code coverage but require human help for correct behavior
While effective at covering code (or killing mutants), automatic test generation falls short in finding faults, particularly business-logic related faults. This is because of the inherent inability of these techniques to compose test oracles (test assertions) that capture the expected program behaviour. This means that the fault detection ability of these techniques is limited to zero-level oracles, such program crashes or memory violations (when applied at system level).
This supports the idea I mentioned earlier: you can use LLMs to increase test coverage and focus on documenting your code’s current functional behavior.
To reveal business-logic software faults with test generation techniques one need to manually validate and correct, when needed, the generated tests and their respective oracles
This is the approach I recommend (for now) in my workshops: when using LLMs to generate test cases, a human should always review them to ensure the tests check for the intended behavior, not just what the code currently does.
Our experiments showed that the LLM’s accuracy to correctly classify a correct assertion as positive, drops when the given code is buggy. This suggest that the LLM is prone to follow the actual implementation to classify the test oracle rather than the expected behaviour.
If your code contains a bug that isn’t an obvious business-logic mistake, the LLM will likely generate test cases that follow the bug, since it can’t guess your intentions without requirements. However, if you’re working with common patterns or business logic that LLMs have seen during training, they might spot issues. I’ve seen cases where LLMs, when asked to write tests, also reviewed the code and pointed out bugs that needed fixing.
But if your code is specific to your domain and not just boilerplate, the LLM may not notice the bugs, and you can’t rely on it to write good assertions. You’ll need to review what it generates.
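To make the distinction concrete, here is a small, invented Ruby example. The requirement says passwords with at least 8 characters are valid, but the implementation has an off-by-one bug. An assertion derived from the code alone captures the actual behaviour and passes; one derived from the requirement captures the expected behaviour and fails, which is exactly the failure a human reviewer wants to see:

```ruby
# Requirement (expected behaviour): passwords with at least 8 characters are valid.
# Implementation (actual behaviour): an off-by-one bug rejects exactly 8 characters.
def valid_password?(password)
  password.length > 8 # bug: should be >= 8
end

# An assertion derived from the implementation mirrors the bug and passes:
#   refute valid_password?("12345678")
#
# An assertion derived from the requirement captures the intent and fails,
# surfacing the bug for a human reviewer:
#   assert valid_password?("12345678")
```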
Meaningful names help LLMs generate tests
Taken together, our results corroborate the conclusion that unless having meaningful test or variable names LLMs can mainly be used to capture the actual program behaviour (thus to be used for regression testing).
Although the paper focused on Java, I think the idea applies to other languages too. In Ruby, for example, developers often use meaningful names for variables and methods and pay attention to domain context, which, according to the study, should help LLMs generate better test oracles.
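As a small illustration of my own (the paper’s experiments were on Java), compare how much more the second version tells an LLM about the intended oracle before it even reads the method body:

```ruby
# Little for an LLM to infer the intent from, beyond the implementation itself:
def calc(a, b)
  a - (a * b)
end

# The names carry the domain intent: an oracle such as
#   price_after_discount(100, 0.2) #=> 80.0
# is easy to infer from the signature alone, before reading the body.
def price_after_discount(price, discount_rate)
  price - (price * discount_rate)
end
```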
Two-Step Workflow
In TDD, you usually have requirements before writing code. You can use a two-step workflow: first, give the requirements and ask the LLM to identify what’s testable, then ask it to write tests. This way, the requirements act as the test oracle, and the LLM is more likely to write good test assertions based on them instead of just the implementation.
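Here is a minimal sketch of what the two steps can produce, built around an invented requirement for a Cart#total method. The requirement is the oracle; the implementation is included only so the sketch runs on its own:

```ruby
require "minitest/autorun"

# Step 1: from the requirement "shipping is free for carts of 50 or more,
# otherwise a flat fee of 5 is added", the LLM lists the testable behaviours:
#   - a cart under 50 adds the flat fee
#   - a cart of exactly 50 ships free
#   - a cart above 50 ships free
#
# Step 2: the tests are written from that list, not from an implementation.
# The small Cart class below exists only so this sketch runs on its own;
# in TDD you would write it after the tests.
class Cart
  FREE_SHIPPING_THRESHOLD = 50
  SHIPPING_FEE = 5

  def initialize(items_total:)
    @items_total = items_total
  end

  def total
    return @items_total if @items_total >= FREE_SHIPPING_THRESHOLD

    @items_total + SHIPPING_FEE
  end
end

class CartTotalTest < Minitest::Test
  def test_adds_the_flat_fee_below_the_free_shipping_threshold
    assert_equal 54, Cart.new(items_total: 49).total
  end

  def test_ships_free_at_the_threshold
    assert_equal 50, Cart.new(items_total: 50).total
  end

  def test_ships_free_above_the_threshold
    assert_equal 80, Cart.new(items_total: 80).total
  end
end
```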
If you write the implementation first and then want to generate test cases, you can still use a two-step workflow. First, describe the purpose of your code and provide any other sources that can act as a test oracle (requirements, documentation, examples). Ask the LLM to review them and identify what should be tested; then give it the code and ask it to generate the test cases.
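For this code-first variant, the prompts themselves carry the workflow. A hedged sketch of what they might look like (the wording is mine, not from the study); the important detail is that the code is withheld until the second step, so the description and supporting sources are the only oracle available in step one:

```ruby
# Hypothetical prompt skeletons for the code-first variant of the workflow.
STEP_ONE = <<~PROMPT
  Here is what this method is supposed to do, in plain language, plus the
  relevant notes or ticket: ...
  List the behaviours that should be tested. Do not write any tests yet.
PROMPT

STEP_TWO = <<~PROMPT
  Here is the implementation: ...
  Write tests for the behaviours you listed. If the code contradicts the
  description, flag the mismatch instead of asserting the current behaviour.
PROMPT
```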