Architecture

Neon is an agent operations platform built for observability, durable execution, and systematic evaluation of AI agents. This document explains how the system works under the hood.

System Overview

┌─────────────────────────────────────────────────────────────────┐
│                     YOUR AGENTS                                  │
│         (Any runtime: Cloud Run, Lambda, K8s, local)            │
└────────────────────────────┬────────────────────────────────────┘

              ┌──────────────┴──────────────┐
              │      SDK / OpenTelemetry     │
              │   @neon/sdk  |  neon-sdk    │
              └──────────────┬──────────────┘

┌────────────────────────────┼────────────────────────────────────┐
│                      NEON PLATFORM                               │
│  ┌─────────────────────────┴─────────────────────────┐          │
│  │              Next.js Frontend & API                │          │
│  │         Dashboard, tRPC routes, REST API          │          │
│  └────────────┬─────────────────────┬────────────────┘          │
│               │                     │                            │
│     ┌─────────▼─────────┐  ┌───────▼────────┐                   │
│     │    ClickHouse     │  │    Temporal    │                   │
│     │  (Trace Storage)  │  │  (Workflows)   │                   │
│     │                   │  │                │                   │
│     │ • traces          │  │ • evalRun      │                   │
│     │ • spans           │  │ • agentRun     │                   │
│     │ • scores          │  │ • abTest       │                   │
│     └───────────────────┘  └───────┬────────┘                   │
│                                    │                             │
│                          ┌─────────▼─────────┐                   │
│                          │  Temporal Workers │                   │
│                          │                   │                   │
│                          │ • emitSpan()      │                   │
│                          │ • scoreTrace()    │                   │
│                          │ • llmCall()       │                   │
│                          └───────────────────┘                   │
│                                                                  │
│     ┌───────────────────┐                                        │
│     │    PostgreSQL     │  (Metadata: projects, configs, users)  │
│     └───────────────────┘                                        │
└──────────────────────────────────────────────────────────────────┘

Core Components

1. Trace Ingestion

Traces flow into Neon via two paths:

SDK Tracing (Recommended)

import { trace, generation, tool } from '@neon/sdk'

const result = await trace('agent-run', async () => {
  const response = await generation('llm-call', { model: 'claude-3-5-sonnet' }, async () => {
    return await llm.chat(prompt)
  })

  await tool('search', async () => {
    return await searchAPI.query(response.query)
  })

  return response
})

OpenTelemetry (Any Language)

from opentelemetry import trace
tracer = trace.get_tracer("my-agent")

@tracer.start_as_current_span("agent-run")
async def run_agent(query: str):
    # Your agent code
    return await llm.generate(query)

Both paths produce spans that are sent to the /api/traces/ingest endpoint and stored in ClickHouse.
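As a sketch of what a span payload might look like before it is POSTed to the ingest endpoint, the following mirrors the span table columns. The field names here are assumptions drawn from that schema, not the authoritative wire format, and `validateSpan` is an illustrative helper rather than part of the SDK:

```typescript
// Hypothetical span payload mirroring the span table columns.
interface SpanPayload {
  span_id: string
  trace_id: string
  parent_span_id: string | null
  name: string
  span_type: 'span' | 'generation' | 'tool' | 'retrieval'
  start_time: string // ISO 8601
  end_time: string
  attributes?: Record<string, string>
}

// Minimal client-side validation before POSTing to /api/traces/ingest.
function validateSpan(span: SpanPayload): string[] {
  const errors: string[] = []
  if (!span.span_id) errors.push('span_id is required')
  if (!span.trace_id) errors.push('trace_id is required')
  if (Date.parse(span.end_time) < Date.parse(span.start_time)) {
    errors.push('end_time precedes start_time')
  }
  return errors
}
```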

2. ClickHouse Storage

ClickHouse is optimized for analytical queries over time-series data. Neon uses three main tables:

Trace Table

CREATE TABLE trace (
  trace_id String,
  project_id UUID,
  name String,
  status Enum('ok', 'error'),
  start_time DateTime64(3),
  end_time DateTime64(3),
  duration_ms UInt64,
  total_input_tokens UInt32,
  total_output_tokens UInt32,
  tool_call_count UInt16,
  llm_call_count UInt16,
  attributes Map(String, String)
) ENGINE = MergeTree()
ORDER BY (project_id, start_time, trace_id)

Span Table

CREATE TABLE span (
  span_id String,
  trace_id String,
  parent_span_id Nullable(String),
  name String,
  span_type Enum('span', 'generation', 'tool', 'retrieval'),
  component_type Nullable(String),
  start_time DateTime64(3),
  end_time DateTime64(3),
  duration_ms UInt64,
  model Nullable(String),
  input String,
  output String,
  input_tokens UInt32,
  output_tokens UInt32,
  attributes Map(String, String)
) ENGINE = MergeTree()
ORDER BY (trace_id, start_time, span_id)
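The parent_span_id column is what lets flat span rows be rendered as a trace tree. A minimal sketch of that reconstruction, with the row type reduced to only the fields the tree needs:

```typescript
// Simplified span row; only the fields needed to build the tree.
interface SpanRow {
  span_id: string
  parent_span_id: string | null
  name: string
}

interface SpanNode extends SpanRow {
  children: SpanNode[]
}

// Rebuild the span tree for one trace from flat rows.
function buildTree(rows: SpanRow[]): SpanNode[] {
  const nodes = new Map<string, SpanNode>()
  for (const row of rows) nodes.set(row.span_id, { ...row, children: [] })
  const roots: SpanNode[] = []
  for (const node of nodes.values()) {
    const parent = node.parent_span_id ? nodes.get(node.parent_span_id) : undefined
    if (parent) parent.children.push(node)
    else roots.push(node) // root span (or orphan whose parent was not ingested)
  }
  return roots
}
```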

Score Table

CREATE TABLE score (
  score_id UUID,
  trace_id String,
  span_id Nullable(String),
  name String,
  value Float64,
  score_type Enum('numeric', 'categorical', 'boolean'),
  source Enum('api', 'sdk', 'annotation', 'eval', 'temporal'),
  scorer_name Nullable(String),
  reason Nullable(String),
  evidence Array(String),
  created_at DateTime64(3)
) ENGINE = MergeTree()
ORDER BY (trace_id, created_at, score_id)
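Dashboards typically aggregate score rows per scorer name. In practice that aggregation would run as a ClickHouse query; this sketch only shows the shape of the computation over rows with the table's `name` and `value` columns:

```typescript
interface ScoreRow { name: string; value: number }

// Mean score per scorer name, as a dashboard aggregation might compute it.
function meanByScorer(rows: ScoreRow[]): Map<string, number> {
  const sums = new Map<string, { total: number; count: number }>()
  for (const { name, value } of rows) {
    const s = sums.get(name) ?? { total: 0, count: 0 }
    s.total += value
    s.count += 1
    sums.set(name, s)
  }
  return new Map([...sums].map(([name, s]) => [name, s.total / s.count]))
}
```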

3. Temporal Workflows

Temporal provides durable execution for long-running evaluations. Workflows survive crashes and timeouts, and can pause for human approval.

Eval Run Workflow

import * as workflow from '@temporalio/workflow'

// Module-level state backing the progressQuery handler
let progress = { completed: 0, total: 0 }

export async function evalRunWorkflow(input: EvalRunInput): Promise<EvalRunResult> {
  const { projectId, dataset, scorers, config } = input
  const results: EvalCaseResult[] = []

  // Process each test case
  for (const item of dataset.items) {
    const caseResult = await workflow.executeChild(evalCaseWorkflow, {
      args: [{ projectId, item, scorers }],
      workflowId: `eval-case-${item.id}`,
    })
    results.push(caseResult)

    // Update progress (queryable)
    progress = { completed: results.length, total: dataset.items.length }
  }

  return aggregateResults(results)
}
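The aggregateResults helper is not shown above; one plausible shape, assuming each EvalCaseResult carries a passed flag and a numeric score (both field names are assumptions for illustration):

```typescript
interface EvalCaseResult { caseId: string; passed: boolean; score: number }

interface EvalRunResult { total: number; passed: number; passRate: number; meanScore: number }

// Roll individual case results up into run-level stats.
function aggregateResults(results: EvalCaseResult[]): EvalRunResult {
  const total = results.length
  const passed = results.filter(r => r.passed).length
  const meanScore = total ? results.reduce((sum, r) => sum + r.score, 0) / total : 0
  return { total, passed, passRate: total ? passed / total : 0, meanScore }
}
```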

Key Workflow Features:

  • Progress Queries: Poll progressQuery to get real-time status
  • Signals: Send cancelRunSignal or pauseSignal to control execution
  • Child Workflows: Each test case runs in isolation
  • Retries: Automatic retry on transient failures

4. Temporal Activities

Activities are the building blocks that do actual work:

// Emit span to ClickHouse
export async function emitSpan(span: SpanInput): Promise<void> {
  await fetch(`${API_URL}/api/traces/ingest`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(span),
  })
}

// Score a trace using configured scorers
export async function scoreTrace(input: ScoreInput): Promise<ScoreResult[]> {
  const { trace, scorers } = input
  const results: ScoreResult[] = []

  for (const scorer of scorers) {
    const result = await scorer.evaluate({ trace })
    results.push(result)
  }

  return results
}

// Call LLM for generation or judging
import Anthropic from '@anthropic-ai/sdk'

const anthropic = new Anthropic() // reads ANTHROPIC_API_KEY from the environment

export async function llmCall(input: LLMInput): Promise<LLMOutput> {
  const response = await anthropic.messages.create({
    model: input.model,
    max_tokens: 1024, // required by the Messages API
    messages: input.messages,
  })
  return { content: response.content, usage: response.usage }
}

Data Flow

Trace Collection

1. Agent executes with SDK tracing

   ├─ trace("agent-run") creates root span
   │   ├─ generation("llm-call") creates child span
   │   ├─ tool("search") creates child span
   │   └─ retrieval("rag") creates child span

2. On trace completion, SDK batches spans

3. POST /api/traces/ingest

4. API validates and writes to ClickHouse

5. Spans become available for querying immediately

Evaluation Execution

1. SDK calls neon.eval.runSuite(suite)

2. POST /api/runs starts Temporal workflow

   ├─ evalRunWorkflow created
   │   │
   │   ├─ For each test case:
   │   │   ├─ evalCaseWorkflow (child)
   │   │   │   ├─ Execute agent
   │   │   │   ├─ emitSpan() activity
   │   │   │   ├─ scoreTrace() activity
   │   │   │   └─ Return EvalCaseResult
   │   │   │
   │   │   └─ Aggregate results
   │   │
   │   └─ Return EvalRunResult

3. Frontend polls /api/runs/[id]/status

4. Workflow queries return progress

5. On completion, results are stored in ClickHouse + Temporal

Score Computation

1. Trace stored in ClickHouse

2. Scorer requested (via an eval run or manual trigger)

   ├─ Rule-based scorer (fast, local)
   │   ├─ contains() - string matching
   │   ├─ regex() - pattern matching
   │   └─ toolSelection() - tool comparison

   └─ LLM Judge scorer (slower, more accurate)
       ├─ llmJudge() - custom criteria
       ├─ reasoning() - reasoning quality
       └─ grounding() - factual accuracy

3. Score written to ClickHouse

4. Score visible in dashboard + API
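The rule-based path can be sketched with a contains()-style scorer. The ScoreResult fields below mirror the score table columns, but the exact scorer interface is an assumption for illustration:

```typescript
interface ScoreResult { name: string; value: number; reason: string; evidence: string[] }

// A contains()-style rule scorer: fraction of expected substrings present in the output.
function containsScorer(output: string, expected: string[]): ScoreResult {
  const found = expected.filter(s => output.includes(s))
  const value = expected.length ? found.length / expected.length : 1
  return {
    name: 'contains',
    value,
    reason: `${found.length}/${expected.length} expected substrings present`,
    evidence: found,
  }
}
```

Rule-based scorers like this run locally with no LLM call, which is why that branch of the flow is fast enough to score every trace.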

Component Types

Neon tracks different types of agent operations:

Component Type   Description            Example
generation       LLM calls              Claude completion
tool             External tool calls    API request, search
retrieval        RAG/vector search      Document lookup
reasoning        Chain-of-thought       Internal reasoning
planning         Action planning        Task decomposition
routing          Decision routing       Model selection
memory           Memory operations      Context retrieval
prompt           Prompt construction    Template rendering

This taxonomy enables:

  • Filtering spans by type in the dashboard
  • Type-specific scorers (e.g., tool selection)
  • Component-level analytics

Span Attributes

Standard Attributes

Every span includes:

{
  span_id: string
  trace_id: string
  parent_span_id: string | null
  name: string
  span_type: 'span' | 'generation' | 'tool' | 'retrieval'
  start_time: Date
  end_time: Date
  duration_ms: number
}

Generation Attributes

LLM calls include:

{
  model: string              // 'claude-3-5-sonnet'
  input: string              // Prompt text
  output: string             // Response text
  input_tokens: number
  output_tokens: number
  temperature: number
  stop_reason: string
}

Tool Attributes

Tool calls include:

{
  tool_name: string          // 'web_search'
  tool_input: object         // { query: '...' }
  tool_output: object        // { results: [...] }
  tool_status: 'success' | 'error'
  error_message?: string
}

Skill Selection Context

When agents select tools/skills:

{
  skill_category: string           // 'search', 'calculation'
  selection_confidence: number     // 0.0 - 1.0
  selection_reason: string         // 'User asked for weather'
  alternatives_considered: string[] // ['calculator', 'search']
}
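These attributes support analyses such as flagging selections the agent was unsure about. A tiny sketch (the 0.5 threshold is an arbitrary example, not a platform default):

```typescript
interface SkillSelection {
  skill_category: string
  selection_confidence: number
  alternatives_considered: string[]
}

// Flag low-confidence skill selections (threshold is illustrative).
function lowConfidenceSelections(spans: SkillSelection[], threshold = 0.5): SkillSelection[] {
  return spans.filter(s => s.selection_confidence < threshold)
}
```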

Scalability

ClickHouse Partitioning

Tables are partitioned by month for efficient queries:

PARTITION BY toYYYYMM(start_time)

Data Retention

Configure TTL for automatic cleanup:

TTL start_time + INTERVAL 90 DAY

Horizontal Scaling

  • ClickHouse: Add shards for write throughput
  • Temporal: Add workers for workflow throughput
  • Frontend: Deploy multiple instances behind load balancer

Security

Data Isolation

  • Traces are scoped to project_id
  • API routes validate project membership
  • ClickHouse queries always filter by project

Secrets Management

Secret           Storage      Usage
LLM API keys     Environment  Scorer LLM calls
Database URLs    Environment  ClickHouse, PostgreSQL
Session secret   Environment  Auth tokens
API keys         PostgreSQL   External client auth

Network Security

  • ClickHouse: Internal network only (no public access)
  • PostgreSQL: Internal network only
  • Temporal: Internal network only
  • Frontend: Public (with auth)

Deployment Profiles

Development (Minimal)

docker compose up -d
# Starts: ClickHouse, PostgreSQL

With Durable Execution

docker compose --profile temporal up -d
# Adds: Temporal Server, Temporal UI

Production (Full)

docker compose --profile full up -d
# Adds: Workers, Redis, all services

High Throughput

docker compose --profile streaming up -d
# Adds: Redpanda (Kafka-compatible)

Extension Points

Custom Scorers

import { defineScorer } from '@neon/sdk'

const myScorer = defineScorer({
  name: 'my-scorer',
  dataType: 'numeric',
  evaluate: async (context) => {
    // Custom logic
    return { score: 0.9, reason: 'Passed' }
  },
})

Custom Activities

Add new Temporal activities in temporal-workers/src/activities/:

export async function myActivity(input: MyInput): Promise<MyOutput> {
  // Custom logic
}

API Extensions

Add new routes in frontend/app/api/:

// frontend/app/api/my-endpoint/route.ts
export async function GET(request: Request) {
  // Custom endpoint
  return Response.json({ ok: true })
}

Monitoring

Health Endpoints

Endpoint               Service
GET /api/health        Frontend + deps
GET :8123/ping         ClickHouse
pg_isready             PostgreSQL
tctl cluster health    Temporal

Key Metrics

  • Trace ingestion rate (traces/second)
  • Span storage size (GB)
  • Eval workflow duration (seconds)
  • Scorer latency (ms)
  • LLM API cost ($)