
Good Tests Begin with Good Samples

Written by Lucian Ghinda

Make your future easier by spending time today to properly set up test data.

The time you spend today setting up test data that matches real, important user cases is an investment.

It will pay off in your daily work, and it will be invaluable during a priority-zero incident, when you will be glad you invested that time early.

To achieve this, understand your users, your product, and your business. Work closely with your product, support, and sales teams, then apply sampling.

Choosing good test data is about sampling well. Borrowing ideas from statistics helps us make smarter, more deliberate choices.

1. Population vs. Sample

Think of your code’s input space as a population: every possible combination of data and state. Your tests are a sample of that population.

The goal isn’t to test everything, but to test a representative subset that helps you discover likely problems early.

2. Random vs. Systematic Sampling

You can:

  • Pick random inputs (great for fuzz testing or probabilistic code; see the sketch after this list), or
  • Pick systematic samples, one from each meaningful category or boundary.
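
For the random branch, property-based testing libraries generate the inputs for you. Here is a minimal sketch in Python using the Hypothesis library; validate_age is a hypothetical stand-in for the code under test:

    # Random sampling via property-based testing (Hypothesis)
    from hypothesis import given, strategies as st

    def validate_age(age):
        # Hypothetical stand-in for the code under test
        return 0 <= age <= 120

    @given(st.integers())
    def test_validate_age_accepts_any_integer(age):
        # Random inputs probe robustness: the validator should
        # answer True or False for any integer, never crash
        assert validate_age(age) in (True, False)

Run under pytest, Hypothesis tries randomly chosen integers on each run, including extremes you would rarely pick by hand.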

For most business logic, structured sampling works best. Example:

If your API accepts ages 0 to 120, you don’t need 100 random ages. You need a few key samples:

  • -1 (invalid, just below the lower boundary)
  • 0 (lower boundary)
  • 1 (just above the lower boundary)
  • 119 and 120 (upper boundaries)
  • 121 (invalid, just above the upper boundary)

This approach is actually the foundation of the boundary value analysis testing technique.
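
As a minimal sketch, assuming a hypothetical validate_age function that implements the 0 to 120 rule, those boundary samples map directly onto a parametrized test:

    import pytest

    def validate_age(age):
        # Hypothetical validator for the "ages 0 to 120" rule
        return 0 <= age <= 120

    @pytest.mark.parametrize("age, expected", [
        (-1, False),   # just below the lower boundary
        (0, True),     # lower boundary
        (1, True),     # just above the lower boundary
        (119, True),   # just below the upper boundary
        (120, True),   # upper boundary
        (121, False),  # just above the upper boundary
    ])
    def test_age_boundaries(age, expected):
        assert validate_age(age) is expected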

3. Stratified Sampling: Covering All Groups

Not all inputs are equal. Divide your input space into logical groups and pick samples from each one.

Example:

  • For a login form, you might stratify by “valid credentials,” “invalid password,” “nonexistent user,” “empty input.” Then pick one or two cases from each.

This prevents over-testing one area while missing others. In testing terms, this is part of the equivalence partitioning technique.
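
Here is a sketch for the login example, with a hypothetical login function standing in for the real authentication code; each row samples one partition:

    import pytest

    def login(username, password):
        # Hypothetical stand-in for the real authentication code
        return (username, password) == ("alice", "s3cret")

    @pytest.mark.parametrize("username, password, expected", [
        ("alice", "s3cret", True),   # valid credentials
        ("alice", "wrong", False),   # invalid password
        ("ghost", "s3cret", False),  # nonexistent user
        ("", "", False),             # empty input
    ])
    def test_login_partitions(username, password, expected):
        assert login(username, password) is expected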

4. Sampling Bias: The Trap of “Happy Paths”

Most developers unintentionally sample the same region, sometimes called the happy path. The result is biased tests that miss the tricky edge cases where bugs often hide.

Ask yourself:

Are my test cases representative of real-world usage and rare conditions?

If 90% of your users are on mobile Safari, don’t just test on desktop Chrome. If your code processes files, test both small and large ones, and consider different file extensions; this is stratified sampling again, as sketched below.
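
For the file case, here is a sketch of stratifying by size with a hypothetical process_file function; the exact sizes are assumptions, so pick ones that reflect your real traffic:

    import pytest

    def process_file(contents):
        # Hypothetical processor; returns the number of bytes handled
        return len(contents)

    @pytest.mark.parametrize("contents", [
        b"",                # empty file
        b"x" * 1024,        # small file (1 KB)
        b"x" * 10_000_000,  # large file (~10 MB)
    ])
    def test_process_file_across_sizes(contents):
        # Loose assertion on purpose: the point is that no size
        # stratum crashes the processor
        assert process_file(contents) >= 0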

5. How Many Samples Are Enough?

Adding more test data isn’t always better. Beyond a certain point, new samples rarely reveal new information, while adding more cases to an already covered space keeps increasing testing costs.

Think of sampling like this:

  • Start broad: cover each risk area once.
  • Go deeper only where you see instability or recent changes.

This is called risk-based coverage: you focus your effort where it matters most and avoid unnecessary work.

6. Always Include Extremes and Weirdness

Include a few “troublemakers”:

  • Empty strings
  • Nulls
  • Maximum lengths
  • Invalid characters
  • Very large numbers

These aren’t just edge cases. They’re bug magnets.
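
A sketch of throwing those troublemakers at a single field, assuming a hypothetical normalize_name sanitizer with an assumed 255-character limit:

    import pytest

    def normalize_name(value):
        # Hypothetical sanitizer with an assumed 255-character limit
        if value is None:
            return None
        return str(value).strip()[:255]

    @pytest.mark.parametrize("value", [
        "",                 # empty string
        None,               # null
        "x" * 255,          # maximum length
        "name\x00;DROP--",  # invalid characters
        "9" * 100,          # a very large number, as text
    ])
    def test_name_field_survives_troublemakers(value):
        # No troublemaker should raise an unexpected exception
        result = normalize_name(value)
        assert result is None or len(result) <= 255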

7. Realistic vs. Synthetic Data

  • Synthetic data helps you test logic cleanly and deterministically.
  • Realistic data reveals integration issues, performance surprises, and messy real-world inputs.

You need both.

For example, your happy path tests can use synthetic data, but your end-to-end suite should include a realistic snapshot of production-like data.
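
A minimal sketch of how the two can coexist in one test suite; the factory defaults and the fixture path are assumptions, not a real API:

    import json

    def synthetic_user(**overrides):
        # Deterministic, minimal record for unit-level logic tests
        user = {"name": "Test User", "age": 30, "plan": "free"}
        user.update(overrides)
        return user

    def realistic_users(path="fixtures/prod_snapshot.json"):
        # Anonymized, production-like snapshot for end-to-end tests
        with open(path) as f:
            return json.load(f)

Unit tests can call synthetic_user(age=121) to control one variable at a time, while the end-to-end suite iterates over realistic_users().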

8. Adaptive Sampling

You don’t have to get it right from the start.

Start small.

When a bug appears, expand your sample around that area.

Think of it as zooming in with a magnifying glass to spot where problems start to appear.
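
For instance, suppose a bug report arrives for an empty username. Here is a sketch of expanding the sample around that region, with a hypothetical normalize_username function:

    import pytest

    def normalize_username(value):
        # Hypothetical function where the bug was reported
        return value.strip().lower()

    @pytest.mark.parametrize("value", [
        "",        # the original bug report
        "   ",     # whitespace only
        "\t\n",    # tabs and newlines
        "\u00a0",  # non-breaking space
    ])
    def test_usernames_near_the_reported_bug(value):
        # Every neighbor of the buggy input should normalize
        # to the empty string
        assert normalize_username(value) == ""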

9. Things to Keep in Mind

Good test data isn’t random.

Good test data is chosen carefully, based on structure, boundaries, and risk.

Next time you write a test, ask:

  • What’s the input space here?
  • Which categories or boundaries matter most?
  • Am I sampling fairly across them?
  • Where might the system behave differently?

That’s not over-engineering. That’s just good sampling, and it makes testing much easier.

Next Workshop

10 October 2025 - 15:00 UTC

3 hours, online

9 spots remaining


Get free samples and be notified when a new workshop is scheduled

You might occasionally get weekly or monthly emails with updates, testing tips, and articles.
I will share early bird prices and discounts with you when the courses are ready.