Test Suites

Test suites define the expected behaviors of your agent. They organize test cases with inputs, expected outputs, and scoring criteria.

Defining Suites

TypeScript:

import { defineSuite, defineTest, defineDataset } from '@neon/sdk'

const suite = defineSuite({
  name: 'core-tests',
  description: 'Core agent functionality tests',
  defaultScorers: [contains, toolSelection, llmJudge({ criteria: '...' })],
  defaultMinScore: 0.7,
  parallel: true,
})

Python:

from neon_sdk import define_suite

suite = define_suite(
    name="core-tests",
    description="Core agent functionality tests",
    default_scorers=[contains, tool_selection],
    default_min_score=0.7,
    parallel=True,
)

Test Cases

Basic Test

defineTest(suite, {
  name: 'simple-query',
  description: 'Test basic question answering',
  input: {
    query: 'What is 2 + 2?',
  },
  expected: {
    contains: ['4'],
  },
  minScore: 0.8,
})

Tool Expectations

// Expect specific tools (order-independent)
defineTest(suite, {
  name: 'search-task',
  input: { query: 'Find the latest news about AI' },
  expectedTools: ['web_search'],
})

// Expect tools in specific order
defineTest(suite, {
  name: 'multi-step-task',
  input: { query: 'Research and summarize topic X' },
  expectedToolSequence: ['web_search', 'web_search', 'summarize'],
})

// Expect NO tools
defineTest(suite, {
  name: 'simple-math',
  input: { query: 'What is 5 * 5?' },
  expectedTools: [],  // Empty = no tools should be called
})

Output Validation

// Check for specific strings
defineTest(suite, {
  name: 'factual-check',
  input: { query: 'What is the capital of Japan?' },
  expected: {
    contains: ['Tokyo', 'Japan'],
  },
})

// Use regex pattern
defineTest(suite, {
  name: 'format-check',
  input: { query: 'List three items' },
  expected: {
    pattern: /1\..+2\..+3\./,
  },
})

With Context

defineTest(suite, {
  name: 'with-context',
  input: {
    query: 'Summarize this document',
    context: {
      document: 'Long document text here...',
      format: 'bullet_points',
    },
  },
  config: {
    maxTokens: 500,
  },
})

Tags for Organization

defineTest(suite, {
  name: 'edge-case-1',
  tags: ['edge-case', 'search', 'regression-v1.2'],
  // ...
})

defineTest(suite, {
  name: 'critical-feature',
  tags: ['critical', 'p0'],
  // ...
})

// Run only specific tags
// neon eval run core-tests --tags critical

Datasets

Define reusable datasets for data-driven testing:

const dataset = defineDataset({
  name: 'weather-queries',
  cases: [
    { input: { query: 'Weather in Tokyo?' }, expected: { contains: ['Tokyo'] } },
    { input: { query: 'Weather in Paris?' }, expected: { contains: ['Paris'] } },
    { input: { query: 'Weather in NYC?' }, expected: { contains: ['New York'] } },
  ],
})

// Use dataset in suite
defineSuite({
  name: 'weather-tests',
  dataset: dataset,
  defaultScorers: [contains, toolSelection({ expected: ['weather_api'] })],
})

Best Practices

1. Cover Failure Modes

Test common agent failure modes:

// Tool selection errors
defineTest(suite, {
  name: 'wrong-tool',
  description: "Shouldn't use search for simple math",
  input: { query: 'What is 10 / 2?' },
  expectedTools: [],  // Should NOT call search
})

// Hallucination check
defineTest(suite, {
  name: 'grounded-response',
  input: {
    query: 'What did the document say about X?',
    context: { document: 'The document mentions Y and Z' },
  },
  expected: {
    contains: ['Y', 'Z'],
    // Should NOT contain X if not in document
  },
  scorers: [grounding()],
})

2. Use Meaningful Names

// Good
defineTest(suite, { name: 'search-factual-current-events' })
defineTest(suite, { name: 'no-tool-simple-arithmetic' })
defineTest(suite, { name: 'multi-step-research-comparison' })

// Bad
defineTest(suite, { name: 'test1' })
defineTest(suite, { name: 'case_a' })

3. Set Appropriate Thresholds

// Critical functionality - high threshold
defineTest(suite, {
  name: 'critical-feature',
  minScore: 0.9,
})

// Experimental feature - lower threshold
defineTest(suite, {
  name: 'experimental-feature',
  minScore: 0.6,
})

// Uses suite default (0.7)
defineTest(suite, {
  name: 'standard-feature',
})

4. Include Regression Tests

When you fix a bug, add a test case:

defineTest(suite, {
  name: 'regression-issue-123',
  description: 'Fixed in v1.2 - agent was calling wrong tool',
  input: { query: 'The specific query that caused the bug' },
  expectedTools: ['correct_tool'],
  tags: ['regression', 'issue-123'],
})

Suite Organization

Organize suites by concern:

evals/
├── core.eval.ts          # Core functionality
├── regression.eval.ts    # Regression tests
├── edge-cases.eval.ts    # Edge cases
├── performance.eval.ts   # Performance-sensitive tests
└── integration.eval.ts   # Integration tests

Each file exports a suite:

// evals/core.eval.ts
import { defineSuite, defineTest } from '@neon/sdk'

export const coreSuite = defineSuite({
  name: 'core-tests',
  // ...
})

defineTest(coreSuite, { /* ... */ })
defineTest(coreSuite, { /* ... */ })

Running Tests

# Run all tests in a suite
bun run eval --suite core-tests

# Run specific tags
bun run eval --suite core-tests --tags critical

# Run with verbose output
bun run eval --suite core-tests --verbose

# Compare against baseline
bun run eval --suite core-tests --compare baseline-run-id

Python Equivalent

from neon_sdk import define_suite, define_test, define_dataset
from neon_sdk.scorers import contains, tool_selection, grounding

suite = define_suite(
    name="core-tests",
    description="Core agent functionality tests",
    default_min_score=0.7,
)

define_test(
    suite,
    name="search-task",
    input={"query": "Find the latest news about AI"},
    expected_tools=["web_search"],
)

define_test(
    suite,
    name="grounded-response",
    input={
        "query": "What did the document say?",
        "context": {"document": "The document mentions Y and Z"},
    },
    expected={"contains": ["Y", "Z"]},
    scorers=[grounding()],
)