Architecture
Neon is an agent operations platform built for observability, durable execution, and systematic evaluation of AI agents. This document explains how the system works under the hood.
System Overview
┌─────────────────────────────────────────────────────────────────┐
│                           YOUR AGENTS                           │
│          (Any runtime: Cloud Run, Lambda, K8s, local)           │
└────────────────────────────┬────────────────────────────────────┘
                             │
              ┌──────────────┴──────────────┐
              │     SDK / OpenTelemetry     │
              │    @neon/sdk | neon-sdk     │
              └──────────────┬──────────────┘
                             │
┌────────────────────────────┼────────────────────────────────────┐
│                          NEON PLATFORM                          │
│  ┌─────────────────────────┴─────────────────────────┐          │
│  │              Next.js Frontend & API               │          │
│  │         Dashboard, tRPC routes, REST API          │          │
│  └────────────┬─────────────────────┬────────────────┘          │
│               │                     │                           │
│     ┌─────────▼─────────┐   ┌───────▼────────┐                  │
│     │    ClickHouse     │   │    Temporal    │                  │
│     │  (Trace Storage)  │   │  (Workflows)   │                  │
│     │                   │   │                │                  │
│     │  • traces         │   │  • evalRun     │                  │
│     │  • spans          │   │  • agentRun    │                  │
│     │  • scores         │   │  • abTest      │                  │
│     └───────────────────┘   └───────┬────────┘                  │
│                                     │                           │
│                           ┌─────────▼─────────┐                 │
│                           │ Temporal Workers  │                 │
│                           │                   │                 │
│                           │  • emitSpan()     │                 │
│                           │  • scoreTrace()   │                 │
│                           │  • llmCall()      │                 │
│                           └───────────────────┘                 │
│                                                                 │
│  ┌───────────────────┐                                          │
│  │    PostgreSQL     │  (Metadata: projects, configs, users)    │
│  └───────────────────┘                                          │
└─────────────────────────────────────────────────────────────────┘
Core Components
1. Trace Ingestion
Traces flow into Neon via two paths:
SDK Tracing (Recommended)
import { trace, generation, tool } from '@neon/sdk'

const result = await trace('agent-run', async () => {
  const response = await generation('llm-call', { model: 'claude-3-5-sonnet' }, async () => {
    return await llm.chat(prompt)
  })
  await tool('search', async () => {
    return await searchAPI.query(response.query)
  })
  return response
})
OpenTelemetry (Any Language)
from opentelemetry import trace

tracer = trace.get_tracer("my-agent")

@tracer.start_as_current_span("agent-run")
async def run_agent(query: str):
    # Your agent code
    return await llm.generate(query)
Both paths produce spans that are sent to the /api/traces/ingest endpoint and stored in ClickHouse.
2. ClickHouse Storage
ClickHouse is optimized for analytical queries over time-series data. Neon uses three main tables:
Trace Table
CREATE TABLE trace (
trace_id String,
project_id UUID,
name String,
status Enum('ok', 'error'),
start_time DateTime64(3),
end_time DateTime64(3),
duration_ms UInt64,
total_input_tokens UInt32,
total_output_tokens UInt32,
tool_call_count UInt16,
llm_call_count UInt16,
attributes Map(String, String)
) ENGINE = MergeTree()
ORDER BY (project_id, start_time, trace_id)
Span Table
CREATE TABLE span (
span_id String,
trace_id String,
parent_span_id Nullable(String),
name String,
span_type Enum('span', 'generation', 'tool', 'retrieval'),
component_type Nullable(String),
start_time DateTime64(3),
end_time DateTime64(3),
duration_ms UInt64,
model Nullable(String),
input String,
output String,
input_tokens UInt32,
output_tokens UInt32,
attributes Map(String, String)
) ENGINE = MergeTree()
ORDER BY (trace_id, start_time, span_id)
Score Table
CREATE TABLE score (
score_id UUID,
trace_id String,
span_id Nullable(String),
name String,
value Float64,
score_type Enum('numeric', 'categorical', 'boolean'),
source Enum('api', 'sdk', 'annotation', 'eval', 'temporal'),
scorer_name Nullable(String),
reason Nullable(String),
evidence Array(String),
created_at DateTime64(3)
) ENGINE = MergeTree()
ORDER BY (trace_id, created_at, score_id)
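The trace table carries rollup columns (duration_ms, total_input_tokens, total_output_tokens, tool_call_count, llm_call_count) that summarize the rows in the span table. The sketch below shows one way those values could be derived from a trace's spans. It assumes the rollup is a simple fold over all spans sharing a trace_id; this document does not say where or how Neon actually computes it. `SpanRow`, `TraceRollup`, and `rollUpTrace` are hypothetical names, and timestamps are represented as epoch milliseconds for simplicity.

```typescript
// Hypothetical subset of a span row; field names follow the span table above.
interface SpanRow {
  span_type: 'span' | 'generation' | 'tool' | 'retrieval'
  start_ms: number // epoch milliseconds, stand-in for start_time
  end_ms: number // epoch milliseconds, stand-in for end_time
  input_tokens: number
  output_tokens: number
}

// Rollup columns as they appear on the trace table.
interface TraceRollup {
  duration_ms: number
  total_input_tokens: number
  total_output_tokens: number
  tool_call_count: number
  llm_call_count: number
}

// Assumption: the rollup is a plain fold over all spans of one trace.
function rollUpTrace(spans: SpanRow[]): TraceRollup {
  const rollup: TraceRollup = {
    duration_ms: 0,
    total_input_tokens: 0,
    total_output_tokens: 0,
    tool_call_count: 0,
    llm_call_count: 0,
  }
  if (spans.length === 0) return rollup

  let earliest = spans[0].start_ms
  let latest = spans[0].end_ms
  for (const span of spans) {
    earliest = Math.min(earliest, span.start_ms)
    latest = Math.max(latest, span.end_ms)
    rollup.total_input_tokens += span.input_tokens
    rollup.total_output_tokens += span.output_tokens
    if (span.span_type === 'tool') rollup.tool_call_count += 1
    if (span.span_type === 'generation') rollup.llm_call_count += 1
  }
  // Trace duration spans the earliest start to the latest end.
  rollup.duration_ms = latest - earliest
  return rollup
}
```

Storing these values on the trace row lets list views and dashboards read one row per trace instead of aggregating spans at query time.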
3. Temporal Workflows
Temporal provides durable execution for long-running evaluations. Workflows survive crashes and timeouts, and they can pause for human approval.
Eval Run Workflow
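import { defineQuery, executeChild, setHandler } from '@temporalio/workflow'

// Query definition that exposes live progress (the query name 'progress' is an assumption)
export const progressQuery = defineQuery<{ completed: number; total: number }>('progress')

export async function evalRunWorkflow(input: EvalRunInput): Promise<EvalRunResult> {
  const { projectId, dataset, scorers, config } = input
  const results: EvalCaseResult[] = []
  // Workflow-local state returned by the query handler
  let progress = { completed: 0, total: dataset.items.length }
  setHandler(progressQuery, () => progress)
  // Process each test case in its own child workflow
  for (const item of dataset.items) {
    const caseResult = await executeChild(evalCaseWorkflow, {
      args: [{ projectId, item, scorers }],
      workflowId: `eval-case-${item.id}`,
    })
    results.push(caseResult)
    // Update progress (queryable)
    progress = { completed: results.length, total: dataset.items.length }
  }
  return aggregateResults(results)
}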
Key Workflow Features:
- Progress Queries: Poll progressQuery to get real-time status
- Signals: Send cancelRunSignal or pauseSignal to control execution
- Child Workflows: Each test case runs in isolation
- Retries: Automatic retry on transient failures
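The eval run workflow ends by calling aggregateResults, which this document does not show. The following is a minimal sketch of what such a function might compute: a pass rate and a mean value per scorer. The `EvalCaseResult` and `EvalRunResult` shapes here are assumptions made for illustration and may differ from Neon's real types.

```typescript
// Hypothetical result shapes; the real EvalCaseResult and EvalRunResult
// types are not shown in this document.
interface CaseScore {
  name: string
  value: number
}

interface EvalCaseResult {
  caseId: string
  passed: boolean
  scores: CaseScore[]
}

interface EvalRunResult {
  total: number
  passed: number
  passRate: number
  meanScores: Record<string, number>
}

function aggregateResults(results: EvalCaseResult[]): EvalRunResult {
  const sums: Record<string, number> = {}
  const counts: Record<string, number> = {}
  let passed = 0

  for (const result of results) {
    if (result.passed) passed += 1
    for (const score of result.scores) {
      sums[score.name] = (sums[score.name] || 0) + score.value
      counts[score.name] = (counts[score.name] || 0) + 1
    }
  }

  // Mean per scorer, taken over the cases that reported that scorer.
  const meanScores: Record<string, number> = {}
  for (const scorerName of Object.keys(sums)) {
    meanScores[scorerName] = sums[scorerName] / counts[scorerName]
  }

  return {
    total: results.length,
    passed,
    // Avoid dividing by zero when the dataset is empty.
    passRate: results.length === 0 ? 0 : passed / results.length,
    meanScores,
  }
}
```

Because the function is pure and deterministic, it is safe to call directly from workflow code rather than from an activity.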
4. Temporal Activities
Activities are the building blocks that do actual work:
// Emit span to ClickHouse
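export async function emitSpan(span: SpanInput): Promise<void> {
  const response = await fetch(`${API_URL}/api/traces/ingest`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(span),
  })
  // Throw on a non-2xx response so Temporal can retry the activity
  if (!response.ok) {
    throw new Error(`Span ingest failed with status ${response.status}`)
  }
}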
// Score a trace using configured scorers
export async function scoreTrace(input: ScoreInput): Promise<ScoreResult[]> {
const { trace, scorers } = input
const results: ScoreResult[] = []
for (const scorer of scorers) {
const result = await scorer.evaluate({ trace })
results.push(result)
}
return results
}
// Call LLM for generation or judging
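export async function llmCall(input: LLMInput): Promise<LLMOutput> {
  const response = await anthropic.messages.create({
    model: input.model,
    // max_tokens is required by the Anthropic Messages API;
    // maxTokens on LLMInput and the 1024 default are assumptions
    max_tokens: input.maxTokens ?? 1024,
    messages: input.messages,
  })
  return { content: response.content, usage: response.usage }
}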
Data Flow
Trace Collection
1. Agent executes with SDK tracing
│
├─ trace("agent-run") creates root span
│ ├─ generation("llm-call") creates child span
│ ├─ tool("search") creates child span
│ └─ retrieval("rag") creates child span
│
2. On trace completion, SDK batches spans
│
3. POST /api/traces/ingest
│
4. API validates and writes to ClickHouse
│
5. Spans available for querying immediately
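Step 2 above says the SDK batches spans when a trace completes. A minimal sketch of that idea follows: finished spans are held per trace and released as one batch when the root span (the one with no parent) ends. `BufferedSpan` and `SpanBuffer` are hypothetical names, and the real SDK may batch differently, for example by size or on a timer.

```typescript
// Hypothetical minimal span shape; only the fields the buffer needs.
interface BufferedSpan {
  span_id: string
  trace_id: string
  parent_span_id: string | null
  name: string
}

// Collects finished spans per trace and hands back the whole batch
// once the root span ends.
class SpanBuffer {
  private pending: Record<string, BufferedSpan[]> = {}

  // Record a finished span. Returns the full batch when this span is the
  // trace root (no parent), otherwise null.
  end(span: BufferedSpan): BufferedSpan[] | null {
    const batch = this.pending[span.trace_id] || []
    batch.push(span)
    if (span.parent_span_id !== null) {
      this.pending[span.trace_id] = batch
      return null
    }
    // Root span ended: the trace is complete, so release its spans.
    delete this.pending[span.trace_id]
    return batch
  }
}
```

Child spans finish before the root because they are nested inside it, so the root's end is a natural flush point. The returned batch would then become the body of the POST to /api/traces/ingest; the exact payload shape is not specified in this document.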
Evaluation Execution
1. SDK calls neon.eval.runSuite(suite)
│
2. POST /api/runs starts Temporal workflow
│
├─ evalRunWorkflow created
│ │
│ ├─ For each test case:
│ │ ├─ evalCaseWorkflow (child)
│ │ │ ├─ Execute agent
│ │ │ ├─ emitSpan() activity
│ │ │ ├─ scoreTrace() activity
│ │ │ └─ Return EvalCaseResult
│ │ │
│ │ └─ Aggregate results
│ │
│ └─ Return EvalRunResult
│
3. Frontend polls /api/runs/[id]/status
│
4. Workflow queries return progress
│
5. On completion, results in ClickHouse + Temporal
Score Computation
1. Trace stored in ClickHouse
│
2. Scorer requested (via eval or manual)
│
├─ Rule-based scorer (fast, local)
│ ├─ contains() - string matching
│ ├─ regex() - pattern matching
│ └─ toolSelection() - tool comparison
│
└─ LLM Judge scorer (slower, accurate)
├─ llmJudge() - custom criteria
├─ reasoning() - reasoning quality
└─ grounding() - factual accuracy
│
3. Score written to ClickHouse
│
4. Score visible in dashboard + API
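To make the rule-based branch concrete, here is a sketch of the three scorers named above written as plain functions. The function names and signatures are illustrative only and are not the SDK's actual API; the `{ score, reason }` result mirrors the shape used in the defineScorer example in this document.

```typescript
// Hypothetical result shape, mirroring { score, reason } from defineScorer.
interface RuleScore {
  score: number
  reason: string
}

// contains(): 1 if the output includes the expected substring, else 0.
function containsScore(output: string, expected: string): RuleScore {
  const hit = output.indexOf(expected) !== -1
  return { score: hit ? 1 : 0, reason: hit ? `found "${expected}"` : `missing "${expected}"` }
}

// regex(): 1 if the pattern matches anywhere in the output, else 0.
function regexScore(output: string, pattern: RegExp): RuleScore {
  // Reset lastIndex so patterns with the g or y flag behave consistently.
  pattern.lastIndex = 0
  const hit = pattern.test(output)
  return { score: hit ? 1 : 0, reason: hit ? 'pattern matched' : 'pattern did not match' }
}

// toolSelection(): fraction of the expected tools that were actually called.
function toolSelectionScore(called: string[], expected: string[]): RuleScore {
  if (expected.length === 0) {
    return { score: 1, reason: 'no tools expected' }
  }
  const matched = expected.filter((tool) => called.indexOf(tool) !== -1)
  return {
    score: matched.length / expected.length,
    reason: `${matched.length}/${expected.length} expected tools called`,
  }
}
```

These checks need no network call, which is why they are fast and can run locally; an LLM judge is needed only when the criterion cannot be expressed as a deterministic rule.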
Component Types
Neon tracks different types of agent operations:
| Component Type | Description | Example |
|---|---|---|
| generation | LLM calls | Claude completion |
| tool | External tool calls | API request, search |
| retrieval | RAG/vector search | Document lookup |
| reasoning | Chain-of-thought | Internal reasoning |
| planning | Action planning | Task decomposition |
| routing | Decision routing | Model selection |
| memory | Memory operations | Context retrieval |
| prompt | Prompt construction | Template rendering |
This taxonomy enables:
- Filtering spans by type in the dashboard
- Type-specific scorers (e.g., tool selection)
- Component-level analytics
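As an illustration of component-level analytics, the sketch below groups spans by component_type and reports a count and total duration per type. `TypedSpan`, `ComponentStats`, and `statsByComponent` are hypothetical names, and bucketing spans without a component type under 'unknown' is an assumption made for this example.

```typescript
// Hypothetical subset of a span row.
interface TypedSpan {
  component_type: string | null
  duration_ms: number
}

interface ComponentStats {
  count: number
  total_duration_ms: number
}

function statsByComponent(spans: TypedSpan[]): Record<string, ComponentStats> {
  const stats: Record<string, ComponentStats> = {}
  for (const span of spans) {
    // component_type is Nullable in the span table; bucket missing values
    // under 'unknown' (an assumption for this sketch).
    const key = span.component_type === null ? 'unknown' : span.component_type
    const entry = stats[key] || { count: 0, total_duration_ms: 0 }
    entry.count += 1
    entry.total_duration_ms += span.duration_ms
    stats[key] = entry
  }
  return stats
}
```

In production this aggregation would more likely run as a GROUP BY in ClickHouse; the in-memory version only shows what the taxonomy makes possible.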
Span Attributes
Standard Attributes
Every span includes:
{
span_id: string
trace_id: string
parent_span_id: string | null
name: string
span_type: 'span' | 'generation' | 'tool' | 'retrieval'
start_time: Date
end_time: Date
duration_ms: number
}
Generation Attributes
LLM calls include:
{
model: string // 'claude-3-5-sonnet'
input: string // Prompt text
output: string // Response text
input_tokens: number
output_tokens: number
temperature: number
stop_reason: string
}
Tool Attributes
Tool calls include:
{
tool_name: string // 'web_search'
tool_input: object // { query: '...' }
tool_output: object // { results: [...] }
tool_status: 'success' | 'error'
error_message?: string
}
Skill Selection Context
When agents select tools/skills:
{
skill_category: string // 'search', 'calculation'
selection_confidence: number // 0.0 - 1.0
selection_reason: string // 'User asked for weather'
alternatives_considered: string[] // ['calculator', 'search']
}
Scalability
ClickHouse Partitioning
Tables are partitioned by month for efficient queries:
PARTITION BY toYYYYMM(start_time)
Data Retention
Configure TTL for automatic cleanup:
TTL start_time + INTERVAL 90 DAY
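To see how the two clauses above affect a given row, the sketch below computes the monthly partition key and the earliest TTL expiry for a timestamp. It assumes UTC; ClickHouse evaluates toYYYYMM in the column's or server's time zone, and it removes expired rows during background merges, so a row can outlive its expiry time. `partitionKey` and `ttlExpiry` are hypothetical helper names.

```typescript
const RETENTION_DAYS = 90 // matches the TTL clause above
const MS_PER_DAY = 24 * 60 * 60 * 1000

// Mirrors toYYYYMM(start_time), e.g. March 2024 becomes 202403. Assumes UTC.
function partitionKey(startTime: Date): number {
  return startTime.getUTCFullYear() * 100 + (startTime.getUTCMonth() + 1)
}

// Earliest moment at which the row becomes eligible for TTL removal.
function ttlExpiry(startTime: Date): Date {
  return new Date(startTime.getTime() + RETENTION_DAYS * MS_PER_DAY)
}
```

A query that filters on a start_time range only needs to read the partitions whose keys fall in that range, which is where the monthly layout pays off.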
Horizontal Scaling
- ClickHouse: Add shards for write throughput
- Temporal: Add workers for workflow throughput
- Frontend: Deploy multiple instances behind load balancer
Security
Data Isolation
- Traces are scoped to project_id
- API routes validate project membership
- ClickHouse queries always filter by project
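One way to make the project filter hard to forget is to build every query through a helper that always includes it and binds the project ID as a parameter instead of concatenating it into the SQL. The sketch below is an assumption about how this could look, not Neon's actual code. It uses ClickHouse's `{name:Type}` query parameter syntax; `scopedTraceQuery` and the `ScopedQuery` shape are hypothetical.

```typescript
// Hypothetical request shape: SQL text plus bound parameters.
interface ScopedQuery {
  query: string
  query_params: Record<string, string>
}

// Builds a trace listing query that always carries the project filter.
function scopedTraceQuery(projectId: string, limit: number): ScopedQuery {
  const uuidPattern = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i
  if (!uuidPattern.test(projectId)) {
    throw new Error('projectId must be a UUID')
  }
  // Clamp to a positive integer so the UInt32 parameter is always valid.
  const safeLimit = Math.max(1, Math.floor(limit))
  return {
    query:
      'SELECT trace_id, name, status, duration_ms FROM trace ' +
      'WHERE project_id = {project_id:UUID} ' +
      'ORDER BY start_time DESC LIMIT {limit:UInt32}',
    query_params: { project_id: projectId, limit: String(safeLimit) },
  }
}
```

Binding the value as a typed parameter means a malformed or malicious project ID is rejected rather than interpreted as SQL.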
Secrets Management
| Secret | Storage | Usage |
|---|---|---|
| LLM API keys | Environment | Scorer LLM calls |
| Database URLs | Environment | ClickHouse, PostgreSQL |
| Session secret | Environment | Auth tokens |
| API keys | PostgreSQL | External client auth |
Network Security
- ClickHouse: Internal network only (no public access)
- PostgreSQL: Internal network only
- Temporal: Internal network only
- Frontend: Public (with auth)
Deployment Profiles
Development (Minimal)
docker compose up -d
# Starts: ClickHouse, PostgreSQL
With Durable Execution
docker compose --profile temporal up -d
# Adds: Temporal Server, Temporal UI
Production (Full)
docker compose --profile full up -d
# Adds: Workers, Redis, all services
High Throughput
docker compose --profile streaming up -d
# Adds: Redpanda (Kafka-compatible)
Extension Points
Custom Scorers
import { defineScorer } from '@neon/sdk'
const myScorer = defineScorer({
name: 'my-scorer',
dataType: 'numeric',
evaluate: async (context) => {
// Custom logic
return { score: 0.9, reason: 'Passed' }
},
})
Custom Activities
Add new Temporal activities in temporal-workers/src/activities/:
export async function myActivity(input: MyInput): Promise<MyOutput> {
// Custom logic
}
API Extensions
Add new routes in frontend/app/api/:
// frontend/app/api/my-endpoint/route.ts
export async function GET(request: Request) {
// Custom endpoint
}
Monitoring
Health Endpoints
| Endpoint | Service |
|---|---|
| GET /api/health | Frontend + deps |
| GET :8123/ping | ClickHouse |
| pg_isready | PostgreSQL |
| tctl cluster health | Temporal |
Key Metrics
- Trace ingestion rate (traces/second)
- Span storage size (GB)
- Eval workflow duration (seconds)
- Scorer latency (ms)
- LLM API cost ($)
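The LLM API cost metric can be derived from the token usage that llmCall returns. The sketch below shows the arithmetic, assuming per-million-token pricing supplied by the caller. `TokenUsage`, `ModelPrice`, and `estimateCostUsd` are hypothetical names, and the prices in the usage example are placeholders rather than real rates.

```typescript
// Token counts as reported in the usage object returned by llmCall.
interface TokenUsage {
  input_tokens: number
  output_tokens: number
}

// Caller-supplied prices in USD per one million tokens (assumed pricing model).
interface ModelPrice {
  inputPerMillion: number
  outputPerMillion: number
}

const TOKENS_PER_MILLION = 1000000

function estimateCostUsd(usage: TokenUsage, price: ModelPrice): number {
  const inputCost = (usage.input_tokens / TOKENS_PER_MILLION) * price.inputPerMillion
  const outputCost = (usage.output_tokens / TOKENS_PER_MILLION) * price.outputPerMillion
  return inputCost + outputCost
}

// Usage with placeholder prices (not real rates):
const exampleCost = estimateCostUsd(
  { input_tokens: 1000000, output_tokens: 500000 },
  { inputPerMillion: 3, outputPerMillion: 15 },
)
```

Summing this value across the generation spans of a trace, or across an eval run, gives the per-trace or per-run cost figure.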