Frequently Asked Questions

Common questions about the Neon agent evaluation platform.


Getting Started

1. How do I install Neon locally?

Clone the repository and start the infrastructure services with Docker:

git clone https://github.com/Sean-Koval/neon.git
cd neon
docker compose up -d          # Starts ClickHouse + PostgreSQL
bun install                   # Install all workspace dependencies
bun run dev                   # Start frontend + workers

To include Temporal for durable execution:

docker compose --profile temporal up -d

Requirements: Node.js >= 20, Bun 1.2.0, Python 3.11+ (for CLI/Python SDK), Docker.

2. How do I run my first evaluation?

  1. Define a test suite in a YAML file or using the SDK:
// evals/my-first-suite.eval.ts
import { defineSuite, defineTest, contains } from '@neon/sdk'

const suite = defineSuite({
  name: 'my-first-suite',
  defaultScorers: [contains],
})

defineTest(suite, {
  name: 'greeting-test',
  input: { query: 'Say hello' },
  expected: { outputContains: ['hello'] },
})
  2. Run the evaluation:
npx neon eval --suite my-first-suite
  3. View results in the dashboard at http://localhost:3000.

3. Which SDK should I use — TypeScript or Python?

Both SDKs have identical functionality. Choose based on your agent’s language:

| Factor | TypeScript (@neon/sdk) | Python (neon-sdk) |
|---|---|---|
| Install | bun add @neon/sdk | pip install neon-sdk |
| Best for | Node.js/TypeScript agents | Python agents (LangChain, CrewAI, etc.) |
| Async model | async/await | asyncio with context managers |
| Package manager | Bun or npm | uv or pip |

If your agent is in Python, use the Python SDK. If your agent is in TypeScript, use the TypeScript SDK. Both produce identical traces and scores.

4. What infrastructure does Neon require?

| Service | Purpose | Required? |
|---|---|---|
| ClickHouse | Trace storage and analytics queries | Yes |
| PostgreSQL | Metadata (projects, suites, API keys) | Yes |
| Temporal | Durable workflow execution | Optional (needed for managed execution) |
| Redpanda | High-throughput trace streaming | Optional |

All services run via Docker Compose. Use docker compose up -d for the core stack.


Test Suites & Scorers

5. What is a test suite and how do I write one?

A test suite is a collection of test cases that evaluate your agent. Each test case has an input, expected output, and scorers that grade the agent’s response.

import { defineSuite, defineTest, contains, llmJudge } from '@neon/sdk'

const suite = defineSuite({
  name: 'customer-support-agent',
  description: 'Tests for the customer support agent',
  defaultScorers: [contains, llmJudge({ criteria: 'Response is helpful and professional' })],
  defaultMinScore: 0.7,
})

defineTest(suite, {
  name: 'refund-request',
  input: { query: 'I want a refund for my order' },
  expected: {
    toolCalls: ['lookup_order', 'process_refund'],
    outputContains: ['refund', 'processed'],
  },
})

See Test Suites Guide for full documentation.

6. What scorers are available?

Neon provides three categories of scorers:

Rule-based (fast, deterministic):

  • contains — checks if output contains expected keywords
  • exactMatch — checks for exact string match
  • toolSelection — validates the agent called the right tools
  • regex — matches against regular expressions
  • latency — scores based on response time
  • tokenEfficiency — scores based on token usage
  • jsonMatchScorer — validates JSON structure

LLM judges (subjective evaluation):

  • llmJudge — custom criteria with a rubric
  • response_quality_judge — overall quality assessment
  • safety_judge — safety evaluation
  • helpfulness_judge — helpfulness assessment
  • code_review_judge — code quality review
  • reasoning — reasoning quality
  • grounding — factual grounding

Custom scorers:

import { defineScorer } from '@neon/sdk'

const myScorer = defineScorer({
  name: 'word_count',
  evaluate: async (context) => ({
    value: Math.min(context.output.split(/\s+/).length / 100, 1.0),
    reason: 'Word count metric',
  }),
})

See Scorers Guide for detailed documentation on each scorer.

7. How does scoring work? What do the numbers mean?

Scores range from 0.0 to 1.0:

| Range | Rating | Meaning |
|---|---|---|
| 0.9 - 1.0 | Excellent | Agent performs at or above expectations |
| 0.7 - 0.9 | Good | Agent performs well with minor issues |
| 0.5 - 0.7 | Fair | Agent works but needs improvement |
| 0.0 - 0.5 | Poor | Agent fails to meet expectations |

Each test case can have a minScore threshold. A case passes if all scorer averages meet the minimum. You can configure aggregation strategies: mean (default), min, max, or weighted.

8. Can I use datasets to run many test cases?

Yes. You can provide a dataset of items to test against:

const suite = defineSuite({
  name: 'data-driven-tests',
  defaultScorers: [contains],
})

// From an array
const dataset = [
  { input: { query: 'Hello' }, expected: { outputContains: ['hi'] } },
  { input: { query: 'Weather?' }, expected: { outputContains: ['temperature'] } },
]

for (const item of dataset) {
  defineTest(suite, {
    name: `test-${item.input.query}`,
    ...item,
  })
}

The platform executes test cases in parallel for faster results.
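Bounded parallelism can be sketched as a simple worker pool (illustrative only; runInParallel is a hypothetical helper, not part of the SDK):

```typescript
// Illustrative sketch: run up to `limit` test cases concurrently.
// Each worker pulls the next unclaimed item until the list is exhausted.
async function runInParallel<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length)
  let next = 0
  async function worker() {
    while (next < items.length) {
      const i = next++ // safe: JS is single-threaded between awaits
      results[i] = await fn(items[i])
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker))
  return results
}
```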


Execution & Workflows

9. How does an evaluation run work end-to-end?

  1. You submit a suite via the SDK or API (POST /api/runs)
  2. A Temporal workflow (evalRunWorkflow) starts
  3. For each test case, a child workflow (evalCaseWorkflow) runs the agent
  4. Each agent call generates trace spans (stored in ClickHouse)
  5. Scorers evaluate the trace and produce scores
  6. Results are aggregated and the run completes
  7. The dashboard displays real-time progress and final results

10. What happens if a test case fails during a run?

Individual case failures do not stop the entire run. The failed case is recorded with a failed status and error message, and the remaining cases continue executing. Similarly, if a scorer throws an exception, it records a score of 0 with the error reason and other scorers continue.

This graceful degradation ensures you get results for all cases, even when some fail.
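The scorer-level behavior can be sketched like this (illustrative; the actual platform code differs):

```typescript
// Illustrative sketch: run every scorer, converting exceptions into a
// zero score with the error message as the reason, so one failing
// scorer never aborts the rest.
type Scorer = { name: string; evaluate: (output: string) => Promise<number> }
type ScoreRecord = { name: string; value: number; reason?: string }

async function runScorers(scorers: Scorer[], output: string): Promise<ScoreRecord[]> {
  const records: ScoreRecord[] = []
  for (const scorer of scorers) {
    try {
      records.push({ name: scorer.name, value: await scorer.evaluate(output) })
    } catch (err) {
      // Graceful degradation: record 0 with the error, keep going.
      records.push({ name: scorer.name, value: 0, reason: String(err) })
    }
  }
  return records
}
```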

11. Can I pause or cancel a running evaluation?

Yes. Use the run control API:

# Pause a run
curl -X POST /api/runs/{id}/control \
  -H "Content-Type: application/json" \
  -d '{"action": "pause"}'

# Resume a paused run
curl -X POST /api/runs/{id}/control \
  -d '{"action": "resume"}'

# Cancel a run
curl -X POST /api/runs/{id}/control \
  -d '{"action": "cancel"}'

Paused runs automatically resume after 24 hours. Cancelled runs cannot be resumed.

12. What is “durable execution” and why does it matter?

Durable execution (powered by Temporal) means your evaluation workflows survive crashes, restarts, and network failures. If the worker process crashes mid-evaluation, Temporal automatically resumes from the last completed step when the worker restarts — no lost work, no duplicate execution.

This is critical for long-running evaluations with LLM calls that may take minutes or hours.


Dashboard

13. What can I see in the dashboard?

The Neon dashboard provides:

  • Home: Recent traces, active eval runs, score summaries
  • Trace Viewer: Hierarchical span tree with timing, inputs/outputs, and associated scores
  • Trace Comparison: Side-by-side diff of two traces highlighting improvements and regressions
  • Evaluation Runs: List of all runs with status, progress, and pass rates
  • Run Detail: Per-case breakdown with score distributions
  • Analytics: Score trends over time, component health, correlation analysis
  • Human Feedback: Preference collection for RLHF training

See Dashboard Guide for a walkthrough.

14. How do I compare two evaluation runs?

Use the comparison API or dashboard:

curl -X POST /api/compare \
  -H "Content-Type: application/json" \
  -d '{
    "baseline_run_id": "run-abc",
    "candidate_run_id": "run-xyz"
  }'

The comparison shows:

  • Score differences per test case
  • Regressions (scores that got worse)
  • Improvements (scores that got better)
  • Statistical significance of changes

In the CLI, use agent-eval compare to compare runs.

15. How do I view traces for a specific run?

Navigate to the run detail page in the dashboard (/eval-runs/{id}). Each test case links to its trace, where you can see the full span tree: LLM calls, tool executions, retrieval operations, and their timing.

You can also query traces directly via the API:

GET /api/traces?project_id={workspace_id}&limit=50
GET /api/traces/{trace_id}

SDKs

16. How do I trace my agent’s operations?

TypeScript:

import { trace, generation } from '@neon/sdk'

const result = await trace('agent-run', async () => {
  return await generation('llm-call', { model: 'claude-3-5-sonnet' }, async () => {
    return await llm.chat(prompt)
  })
})

Python:

from neon_sdk import trace, generation

async def run_agent():
    with trace("agent-run"):
        with generation("llm-call", model="claude-3-5-sonnet"):
            result = await llm.chat(prompt)

Supported span types: generation, tool, retrieval, reasoning, planning, routing, memory, prompt, and generic span.

17. Can I use Neon without running my agents inside it?

Yes. Neon supports an observe-only mode where you run agents anywhere (Cloud Run, Lambda, Kubernetes) and send traces to the Neon API:

POST /api/traces/ingest
  Headers: X-API-Key: <key>, X-Workspace-Id: <id>
  Body: { trace_id, name, status, duration_ms, spans: [...] }

You can also use OpenTelemetry-compatible instrumentation to send traces.
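Assuming the header and field names shown above, a minimal client-side sketch for building the ingest request could look like this (buildIngestRequest is a hypothetical helper, and the host URL is a placeholder):

```typescript
// Illustrative sketch: assemble the POST /api/traces/ingest request.
// Header and body field names follow the doc; the rest is an assumption.
type Span = { name: string; duration_ms: number }
type TracePayload = {
  trace_id: string
  name: string
  status: 'ok' | 'error'
  duration_ms: number
  spans: Span[]
}

function buildIngestRequest(apiKey: string, workspaceId: string, payload: TracePayload) {
  return {
    method: 'POST' as const,
    headers: {
      'Content-Type': 'application/json',
      'X-API-Key': apiKey,
      'X-Workspace-Id': workspaceId,
    },
    body: JSON.stringify(payload),
  }
}

// Usage (requires a reachable Neon API):
// await fetch('https://your-neon-host/api/traces/ingest',
//   buildIngestRequest(process.env.NEON_API_KEY!, 'ws-123', tracePayload))
```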


CI/CD Integration

18. How do I add evaluations to my CI pipeline?

Add an evaluation step to your GitHub Actions workflow:

- name: Run evaluations
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
  run: bun run eval --suite core-tests --output json > results.json

- name: Check for regressions
  run: |
    if jq -e '.failed > 0' results.json > /dev/null; then
      echo "Evaluation failed"
      exit 1
    fi

You can block PRs on evaluation regressions using branch protection rules that require the eval check to pass.

See CI/CD Guide for detailed setup with GitHub Actions, GitLab CI, and other platforms.

19. How do I detect regressions automatically?

Compare against a baseline run (e.g., from the main branch):

bun run eval:compare \
  --baseline main \
  --candidate ${{ steps.eval.outputs.run_id }} \
  --threshold 0.05 \
  --fail-on-regression

This fails the CI step if any scorer’s average drops by more than 5% compared to the baseline. You can configure the threshold and choose strict mode (fail on any test failure) or lenient mode.
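The underlying check can be sketched as a comparison of per-scorer averages (illustrative; findRegressions is a hypothetical function, and this sketch treats the threshold as an absolute score drop):

```typescript
// Illustrative sketch of regression detection: flag any scorer whose
// average dropped by more than the threshold versus the baseline.
type Averages = Record<string, number>

function findRegressions(baseline: Averages, candidate: Averages, threshold: number): string[] {
  const regressed: string[] = []
  for (const [scorer, base] of Object.entries(baseline)) {
    const cand = candidate[scorer]
    // Scorers missing from the candidate run are skipped here.
    if (cand !== undefined && base - cand > threshold) regressed.push(scorer)
  }
  return regressed
}
```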


Troubleshooting

20. Something isn’t working. Where do I start?

  1. Check services are running: docker compose ps — all containers should be Up
  2. Check API health: GET /api/health returns status of all dependencies
  3. Check logs:
    • Frontend: Terminal running bun run dev
    • Workers: Terminal running bun run workers
    • Infrastructure: docker compose logs <service>
  4. Common fixes:
    • Restart services: docker compose restart
    • Reinstall dependencies: bun install
    • Clear build cache: bun run build --force

For specific error messages, see the Error Reference Guide.

Common issues:

  • 503 errors: An infrastructure service (ClickHouse, PostgreSQL, or Temporal) is down. Start it with docker compose up -d.
  • 401 errors: Authentication is missing or expired. Check your API key or JWT token.
  • Eval run stuck: Check that the Temporal worker is running (bun run workers). You can cancel stuck runs via the API.
  • Scores are all zero: Verify the scorer configuration and that the agent produces output. For LLM judges, ensure ANTHROPIC_API_KEY is set.

Advanced Usage

21. How do I use the TypeScript SDK CLI?

The @neon/sdk package includes a CLI for running evaluations:

# Run all eval files
npx neon eval

# Run specific patterns
npx neon eval "tests/**/*.eval.js"

# With options
npx neon eval --filter "weather" --parallel 5 --timeout 120000

# JSON output for CI/CD
npx neon eval --format json

# CI mode: JSON output + non-zero exit on failure
npx neon eval --ci --threshold 0.8

See the CLI Reference for all options.

22. How do I create custom scorers?

TypeScript:

import { defineScorer, ScorerConfig } from '@neon/sdk'

const wordCount = defineScorer({
  name: 'word_count',
  evaluate: async (context) => ({
    value: Math.min(context.output.split(/\s+/).length / 100, 1.0),
    reason: `Word count: ${context.output.split(/\s+/).length}`,
  }),
})

Python:

from neon_sdk.scorers import define_scorer, ScorerConfig, ScoreResult

custom = define_scorer(ScorerConfig(
    name='word_count',
    evaluate=lambda ctx: ScoreResult(
        value=min(len(ctx.output.split()) / 100, 1.0),
        reason=f"Word count: {len(ctx.output.split())}",
    ),
))

23. What LLM providers are supported for LLM judge scorers?

Neon supports multiple LLM providers for evaluation scoring:

| Provider | Environment Variable | Package Required |
|---|---|---|
| Anthropic | ANTHROPIC_API_KEY | @anthropic-ai/sdk |
| OpenAI | OPENAI_API_KEY | openai |
| Google Vertex AI | GOOGLE_CLOUD_PROJECT | @google-cloud/vertexai |
| Vertex Claude | GOOGLE_CLOUD_PROJECT | @anthropic-ai/vertex-sdk |

24. Can I export evaluation data for fine-tuning?

Yes. The SDK supports exporting results in formats compatible with popular training frameworks:

  • OpenAI fine-tuning format
  • HuggingFace TRL (SFT, DPO, KTO)
  • DSPy optimization format
  • Agent Lightning format

Use the export utilities in @neon/sdk to convert evaluation traces into training data.
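As a sketch of what an export to the OpenAI chat fine-tuning format could look like (the EvalResult shape and the minScore filter are assumptions for illustration, not the SDK's actual export API):

```typescript
// Illustrative sketch: convert eval results into OpenAI chat fine-tuning
// JSONL, one {"messages": [...]} object per line, keeping only
// high-scoring examples.
type EvalResult = { input: string; output: string; score: number }

function toOpenAIFineTuningJSONL(results: EvalResult[], minScore = 0.7): string {
  return results
    .filter((r) => r.score >= minScore) // keep only passing examples
    .map((r) =>
      JSON.stringify({
        messages: [
          { role: 'user', content: r.input },
          { role: 'assistant', content: r.output },
        ],
      })
    )
    .join('\n')
}
```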

25. How do I set up alerts for score regressions?

Create alert rules via the API or dashboard:

curl -X POST /api/alerts/rules \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Score regression",
    "metric": "avg_score",
    "operator": "lt",
    "threshold": 0.7,
    "severity": "critical"
  }'

Supported operators: gt, gte, lt, lte, eq. Supported severities: critical, warning, info.
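How a rule's operator might be applied to a metric value can be sketched as follows (illustrative only; not the platform's alerting code, though the operator names match the list above):

```typescript
// Illustrative sketch: map each alert operator to a comparison and
// decide whether a rule should fire for a given metric value.
type Operator = 'gt' | 'gte' | 'lt' | 'lte' | 'eq'

const ops: Record<Operator, (value: number, threshold: number) => boolean> = {
  gt: (v, t) => v > t,
  gte: (v, t) => v >= t,
  lt: (v, t) => v < t,
  lte: (v, t) => v <= t,
  eq: (v, t) => v === t,
}

function shouldFire(operator: Operator, value: number, threshold: number): boolean {
  return ops[operator](value, threshold)
}
```

For the "Score regression" rule above (metric avg_score, operator lt, threshold 0.7), the alert fires whenever the average score falls below 0.7.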