Scorers
Neon provides specialized scorers designed for evaluating AI agents. Unlike generic LLM evaluation, these scorers understand agent-specific patterns like tool selection, reasoning chains, and grounded responses.
Built-in Scorers
Rule-Based Scorers
Fast, deterministic scorers that don’t require LLM calls.
exactMatch
Checks for exact string or value match.
import { exactMatch } from '@neon/sdk'
const scorer = exactMatch('expected output')
// Or match any of several accepted values
const anyOf = exactMatch(['option1', 'option2'])
contains
Checks if output contains expected strings.
import { contains } from '@neon/sdk'
const scorer = contains(['Paris', 'France'])
// Case-insensitive matching
const looseScorer = contains(['paris'], { caseSensitive: false })
regex
Pattern matching with regular expressions.
import { regex } from '@neon/sdk'
const scorer = regex(/\d{3}-\d{4}/) // Phone number pattern
toolSelection
Evaluates whether the agent selected appropriate tools.
import { toolSelection } from '@neon/sdk'
const scorer = toolSelection({
  expected: ['web_search', 'calculator'],
  // Optional: require the tools in this exact order
  strictOrder: false,
  // Optional: penalize extra tools
  penalizeExtra: true,
})
Score Calculation:
- Jaccard similarity between the expected and actual tool sets (see the sketch below)
- Sequence similarity (longest common subsequence) when strictOrder: true
- A penalty for unexpected tools when penalizeExtra: true
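For intuition, the base set-similarity term can be sketched as a plain Jaccard ratio. This is illustrative only; how toolSelection weights the order and penalty terms against it is internal to the SDK:

// Illustrative: Jaccard similarity over expected vs. actual tool sets.
function jaccard(expected: string[], actual: string[]): number {
  const e = new Set(expected)
  const a = new Set(actual)
  const intersection = [...e].filter((tool) => a.has(tool)).length
  const union = new Set([...e, ...a]).size
  return union === 0 ? 1 : intersection / union
}

jaccard(['web_search', 'calculator'], ['web_search']) // 0.5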
latency
Measures execution time against thresholds.
import { latency } from '@neon/sdk'
const scorer = latency({
  targetMs: 2000, // Full score at or below this
  maxMs: 5000, // Upper bound; slower runs score progressively lower
})
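Read together, these thresholds suggest a piecewise-linear curve: full credit at or below targetMs, decaying to zero at maxMs. That is an assumption about the SDK's internal formula; a sketch under it:

// Hypothetical scoring curve: full credit under targetMs,
// linear decay between targetMs and maxMs, zero beyond maxMs.
function latencyScore(elapsedMs: number, targetMs: number, maxMs: number): number {
  if (elapsedMs <= targetMs) return 1.0
  if (elapsedMs >= maxMs) return 0.0
  return 1 - (elapsedMs - targetMs) / (maxMs - targetMs)
}

latencyScore(3500, 2000, 5000) // 0.5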
tokenEfficiency
Evaluates token usage relative to output quality.
import { tokenEfficiency } from '@neon/sdk'
const scorer = tokenEfficiency({
  maxTokens: 1000,
  minTokens: 50,
})
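A plausible shape for this score is a window function over the token count: full credit inside the [minTokens, maxTokens] range, tapering off outside it. This is an assumption about the formula, not documented behavior:

// Hypothetical window score: 1.0 inside [minTokens, maxTokens],
// tapering linearly outside the window. Illustrative only.
function tokenScore(tokens: number, minTokens: number, maxTokens: number): number {
  if (tokens >= minTokens && tokens <= maxTokens) return 1.0
  if (tokens < minTokens) return tokens / minTokens
  return Math.max(0, 1 - (tokens - maxTokens) / maxTokens)
}

tokenScore(1500, 50, 1000) // 0.5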
LLM Judge Scorers
Use language models to evaluate subjective criteria.
llmJudge
General-purpose LLM evaluation with custom criteria.
import { llmJudge } from '@neon/sdk'
const scorer = llmJudge({
  criteria: 'Response should be helpful, accurate, and well-structured',
  model: 'claude-3-5-sonnet', // or 'gpt-4o', 'gemini-1.5-pro'
  // Optional: scoring rubric
  rubric: `
    1 - Completely wrong or unhelpful
    2 - Partially correct but missing key information
    3 - Correct but could be clearer
    4 - Good response with minor issues
    5 - Excellent, complete response
  `,
})
reasoning
Evaluates the quality of agent reasoning.
import { reasoning } from '@neon/sdk'
const scorer = reasoning({
  model: 'claude-3-5-sonnet',
  // Evaluates:
  // - Logical coherence (0-3 points)
  // - Information usage (0-3 points)
  // - Problem decomposition (0-2 points)
  // - Completeness (0-2 points)
})
grounding
Evaluates whether responses are grounded in provided context.
import { grounding } from '@neon/sdk'
const scorer = grounding({
  model: 'claude-3-5-sonnet',
  // Evaluates:
  // - Factual accuracy (0-4 points)
  // - Evidence support (0-4 points)
  // - Expected content presence (0-2 points)
})
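Both judges award up to 10 rubric points (3+3+2+2 for reasoning, 4+4+2 for grounding). Assuming those points are normalized onto the 0-1 scale used by the other scorers on this page, the mapping is a plain division; this normalization is an assumption, not documented behavior:

// Assumed normalization: 10 rubric points map onto the 0-1 score range.
const toScore = (points: number): number => points / 10

toScore(7) // 0.7, i.e. "Good" on the interpretation scale below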
Domain-Specific Judges
Pre-configured judges for common domains.
import { codeReviewJudge, safetyJudge, helpfulnessJudge } from '@neon/sdk'
// Code quality evaluation
const codeScorer = codeReviewJudge({
  language: 'typescript',
  checkSecurity: true,
})

// Safety evaluation
const safetyScorer = safetyJudge({
  strictness: 'high',
})
// General helpfulness
const helpfulScorer = helpfulnessJudge()
Python SDK
All scorers are available in Python with identical functionality:
from neon_sdk.scorers import (
    exact_match,
    contains,
    regex,
    tool_selection,
    latency,
    llm_judge,
    reasoning,
    grounding,
)
# Rule-based
scorer = contains(["Paris", "France"], case_sensitive=False)
# LLM Judge
scorer = llm_judge(
    criteria="Response should be accurate and helpful",
    model="claude-3-5-sonnet",
)

# Tool selection
scorer = tool_selection(
    expected=["web_search"],
    strict_order=False,
)
Custom Scorers
Create custom scorers by extending the base class.
TypeScript:
import { BaseScorer, ScorerResult, ScorerContext } from '@neon/sdk'
class MyCustomScorer extends BaseScorer {
  name = 'my-custom-scorer'
  description = 'Evaluates custom criteria'

  async evaluate(context: ScorerContext): Promise<ScorerResult> {
    // The agent output, expected value, and execution trace are all available
    const { output, expected, trace } = context

    // Your scoring logic; this placeholder gives full credit on an exact match
    const score = output === expected ? 1.0 : 0.0

    return {
      score,
      reason: 'Custom evaluation passed',
      evidence: ['Detail 1', 'Detail 2'],
    }
  }
}
// Use in tests
defineTest(suite, {
  name: 'my-test',
  scorers: [new MyCustomScorer()],
})
Python:
from neon_sdk.scorers.base import BaseScorer, ScorerResult
class MyCustomScorer(BaseScorer):
    name = "my-custom-scorer"
    description = "Evaluates custom criteria"

    async def evaluate(self, context) -> ScorerResult:
        output = context.output

        # Your scoring logic; this placeholder gives full credit on an exact match
        score = 1.0 if output == context.expected else 0.0

        return ScorerResult(
            score=score,
            reason="Custom evaluation passed",
            evidence=["Detail 1", "Detail 2"],
        )
Combining Scorers
Use multiple scorers for comprehensive evaluation:
defineTest(suite, {
  name: 'comprehensive-test',
  scorers: [
    // Fast rule-based checks
    contains(['expected', 'keywords']),
    toolSelection({ expected: ['search'] }),
    latency({ maxMs: 5000 }),

    // Deeper LLM evaluation
    llmJudge({ criteria: 'Response quality' }),
    reasoning(),
  ],
  // Minimum average score across all scorers
  minScore: 0.8,
})
Score Aggregation
When multiple scorers are used, scores are aggregated:
| Strategy | Description |
|---|---|
| mean (default) | Average of all scores |
| min | Lowest score (strictest) |
| max | Highest score (most lenient) |
| weighted | Custom weights per scorer |
defineTest(suite, {
  name: 'weighted-test',
  scorers: [
    { scorer: contains(['key']), weight: 0.3 },
    { scorer: llmJudge({ criteria: '...' }), weight: 0.7 },
  ],
  aggregation: 'weighted',
})
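Under the hood, the weighted strategy is presumably a standard weighted mean; the sketch below assumes weights are normalized by their sum:

// Weighted mean over (score, weight) pairs, normalized by total weight.
function weightedMean(results: { score: number; weight: number }[]): number {
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0)
  if (totalWeight === 0) return 0
  return results.reduce((sum, r) => sum + r.score * r.weight, 0) / totalWeight
}

weightedMean([
  { score: 1.0, weight: 0.3 },
  { score: 0.6, weight: 0.7 },
]) // 0.72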
Score Interpretation
| Score | Interpretation |
|---|---|
| 0.9 - 1.0 | Excellent — Agent performed optimally |
| 0.7 - 0.9 | Good — Minor issues, generally acceptable |
| 0.5 - 0.7 | Fair — Significant issues needing attention |
| 0.0 - 0.5 | Poor — Major failures requiring investigation |
Best Practices
- Start with rule-based scorers — They’re fast and deterministic
- Use LLM judges for subjective criteria — Reasoning quality, helpfulness
- Combine multiple scorers — Cover different failure modes
- Set appropriate thresholds — Higher for critical paths, lower for experiments
- Cache LLM judge results — Avoid redundant API calls in CI
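For the caching point, a thin memoizing wrapper around any scorer is usually enough for CI. A minimal sketch, assuming the BaseScorer and ScorerContext shapes from the Custom Scorers section above; the cache key and storage backend (in-memory here) are up to you:

import { BaseScorer, ScorerResult, ScorerContext } from '@neon/sdk'

// Hypothetical cache wrapper: memoizes judge results per input/output pair
// so repeated runs skip redundant LLM calls. Swap the Map for a persistent
// store to share the cache across CI jobs.
class CachedScorer extends BaseScorer {
  name = 'cached-scorer'
  description = 'Caches results of a wrapped scorer'
  private cache = new Map<string, ScorerResult>()

  constructor(private inner: BaseScorer) {
    super()
  }

  async evaluate(context: ScorerContext): Promise<ScorerResult> {
    const key = JSON.stringify([context.output, context.expected])
    const hit = this.cache.get(key)
    if (hit) return hit
    const result = await this.inner.evaluate(context)
    this.cache.set(key, result)
    return result
  }
}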