CI/CD Integration

Integrate Neon into your CI/CD pipeline to automatically test agent quality and gate deployments on evaluation results.

GitHub Actions

Basic Setup

# .github/workflows/eval.yml
name: Agent Evaluation

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Bun
        uses: oven-sh/setup-bun@v1

      - name: Install dependencies
        run: bun install

      - name: Run evaluations
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          NEON_API_URL: ${{ secrets.NEON_API_URL }}
        run: |
          bun run eval --suite core-tests --output json > results.json

      - name: Check for regressions
        run: |
          # Fail if any test failed
          if jq -e '.failed > 0' results.json > /dev/null; then
            echo "❌ Evaluation failed"
            jq '.results[] | select(.passed == false)' results.json
            exit 1
          fi
          echo "✅ All evaluations passed"
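The gating step above assumes a results file shaped roughly like the sketch below. The field names are inferred from the `jq` queries used throughout this guide rather than from a documented schema, and the values are illustrative:

```shell
# Sketch of the results.json shape the gating step expects (hypothetical
# values; field names inferred from the jq queries in this guide).
cat > results.json <<'EOF'
{
  "run_id": "run-123",
  "passed": 12,
  "failed": 1,
  "avg_score": 0.87,
  "results": [
    { "case_name": "refund-flow", "passed": false, "avg_score": 0.42 }
  ]
}
EOF

# The same gate as above: jq -e exits 0 when the expression evaluates truthy.
if jq -e '.failed > 0' results.json > /dev/null; then
  echo "regressions detected"
fi
```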

With Baseline Comparison

Compare against a baseline to detect regressions:

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Bun
        uses: oven-sh/setup-bun@v1

      - name: Install dependencies
        run: bun install

      - name: Run evaluation
        id: eval
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          # Run current version
          bun run eval --suite core-tests --output json > candidate.json
          echo "run_id=$(jq -r '.run_id' candidate.json)" >> $GITHUB_OUTPUT

      - name: Compare with baseline
        if: github.event_name == 'pull_request'
        env:
          NEON_API_URL: ${{ secrets.NEON_API_URL }}
        run: |
          # Compare against main branch baseline
          bun run eval:compare \
            --baseline main \
            --candidate ${{ steps.eval.outputs.run_id }} \
            --threshold 0.05 \
            --fail-on-regression

Comment Results on PR

Add a step that posts the evaluation summary to the pull request:

      - name: Comment on PR
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('results.json', 'utf8'));

            const passed = results.passed;
            const failed = results.failed;
            const avgScore = (results.avg_score * 100).toFixed(1);

            const status = failed === 0 ? '✅' : '❌';
            const body = `## ${status} Agent Evaluation Results

            | Metric | Value |
            |--------|-------|
            | Passed | ${passed} |
            | Failed | ${failed} |
            | Avg Score | ${avgScore}% |

            ${failed > 0 ? '### Failed Cases\n' + results.results
              .filter(r => !r.passed)
              .map(r => `- **${r.case_name}**: ${(r.avg_score * 100).toFixed(1)}%`)
              .join('\n') : ''}
            `;

            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: body
            });

Quality Gates

Block PRs on Regression

Configure branch protection to require passing evaluations:

  1. Go to Settings → Branches → Branch protection rules
  2. Add rule for main
  3. Enable Require status checks to pass
  4. Add Agent Evaluation / evaluate as required
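The same rule can be applied from the command line via the GitHub CLI's generic API command. This is a sketch, not a verified recipe: `OWNER/REPO` is a placeholder, and the required-check context for an Actions workflow is typically the job name (`evaluate`), which may differ from the "Agent Evaluation / evaluate" label the UI displays:

```shell
# Sketch only: configure branch protection via the REST API instead of the UI.
# Requires an authenticated GitHub CLI (`gh auth login`) and admin rights.
gh api --method PUT "repos/OWNER/REPO/branches/main/protection" \
  --input - <<'EOF'
{
  "required_status_checks": {
    "strict": true,
    "contexts": ["evaluate"]
  },
  "enforce_admins": false,
  "required_pull_request_reviews": null,
  "restrictions": null
}
EOF
```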

Configurable Thresholds

env:
  # Minimum passing score (0.0 - 1.0)
  MIN_SCORE: 0.8
  # Maximum allowed regression from baseline
  REGRESSION_THRESHOLD: 0.05
  # Fail on any test failure
  STRICT_MODE: true

steps:
  - name: Run evaluation
    run: |
      bun run eval --suite core-tests \
        --min-score $MIN_SCORE \
        --output json > results.json

  - name: Enforce quality gate
    run: |
      AVG_SCORE=$(jq '.avg_score' results.json)
      if (( $(echo "$AVG_SCORE < $MIN_SCORE" | bc -l) )); then
        echo "❌ Average score $AVG_SCORE below threshold $MIN_SCORE"
        exit 1
      fi

Multiple Suites

Run different suites for different scenarios:

jobs:
  core-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: oven-sh/setup-bun@v1
      - run: bun install
      - run: bun run eval --suite core-tests --fail-on-failure

  regression-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: oven-sh/setup-bun@v1
      - run: bun install
      - run: bun run eval --suite regression-tests --fail-on-failure

  performance-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: oven-sh/setup-bun@v1
      - run: bun install
      # Performance tests don't block, just report
      - run: bun run eval --suite performance-tests || true

GitLab CI

# .gitlab-ci.yml
stages:
  - test

agent-eval:
  stage: test
  image: oven/bun:latest
  script:
    - bun install
    - bun run eval --suite core-tests --output json > results.json
    - |
      if [ "$(jq '.failed' results.json)" -gt 0 ]; then
        echo "Evaluation failed"
        exit 1
      fi
  variables:
    ANTHROPIC_API_KEY: $ANTHROPIC_API_KEY
  artifacts:
    paths:
      - results.json

CircleCI

# .circleci/config.yml
version: 2.1

jobs:
  evaluate:
    docker:
      - image: oven/bun:latest
    steps:
      - checkout
      - run:
          name: Install dependencies
          command: bun install
      - run:
          name: Run evaluations
          command: |
            bun run eval --suite core-tests --output json > results.json
      - run:
          name: Check results
          command: |
            if [ "$(jq '.failed' results.json)" -gt 0 ]; then
              exit 1
            fi
      - store_artifacts:
          path: results.json

workflows:
  main:
    jobs:
      - evaluate

Environment Variables

| Variable | Description | Required |
|----------|-------------|----------|
| ANTHROPIC_API_KEY | Anthropic API key for LLM judges | For LLM scorers |
| OPENAI_API_KEY | OpenAI API key | Alternative to ANTHROPIC_API_KEY |
| NEON_API_URL | Neon API endpoint | For remote storage |
| NEON_PROJECT_ID | Project identifier | For multi-project setups |
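A fail-fast preflight step can catch missing secrets before any eval budget is spent. A minimal POSIX-shell sketch, using the variable names above (which variables you actually need depends on your scorers and storage setup):

```shell
# Sketch: verify required environment variables are set before running evals.
# printenv exits non-zero when the named variable is not in the environment.
require_env() {
  missing=0
  for var in "$@"; do
    if ! printenv "$var" > /dev/null; then
      echo "missing required env var: $var" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example: a typical LLM-judged run with remote storage.
if require_env ANTHROPIC_API_KEY NEON_API_URL; then
  echo "environment OK"
fi
```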

Best Practices

1. Cache Dependencies

- name: Cache dependencies
  uses: actions/cache@v4
  with:
    path: ~/.bun/install/cache
    key: ${{ runner.os }}-bun-${{ hashFiles('**/bun.lockb') }}

2. Run Critical Tests First

jobs:
  critical:
    runs-on: ubuntu-latest
    steps:
      - run: bun run eval --suite core-tests --tags critical

  full:
    needs: critical  # Only run if critical passes
    runs-on: ubuntu-latest
    steps:
      - run: bun run eval --suite core-tests

3. Store Results as Artifacts

- name: Upload results
  uses: actions/upload-artifact@v4
  with:
    name: eval-results
    path: results.json
    retention-days: 30

4. Notify on Failures

- name: Notify Slack on failure
  if: failure()
  uses: slackapi/slack-github-action@v1
  with:
    payload: |
      {
        "text": "❌ Agent evaluation failed on ${{ github.ref }}"
      }
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}

Troubleshooting

"API key not found"

Ensure secrets are configured in your repository:

  • GitHub: Settings → Secrets → Actions
  • GitLab: Settings → CI/CD → Variables
  • CircleCI: Project Settings → Environment Variables

Timeouts

Increase timeout for slow evaluations:

- name: Run evaluation
  timeout-minutes: 30
  run: bun run eval --suite core-tests

Rate Limits

If hitting LLM rate limits, add delays or reduce parallelism:

- run: bun run eval --suite core-tests --parallel 2 --delay 1000
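For transient rate-limit failures, wrapping the run in a retry with exponential backoff also helps. A small shell sketch (the attempt count and delays are arbitrary choices, not recommendations from the eval tooling):

```shell
# Sketch: retry a command with exponential backoff. Useful when an eval run
# occasionally fails on LLM rate limits rather than on real regressions.
retry() {
  attempts=$1; shift
  delay=1
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      echo "attempt $i failed; retrying in ${delay}s" >&2
      sleep "$delay"
      delay=$((delay * 2))   # backoff: 1s, 2s, 4s, ...
    fi
    i=$((i + 1))
  done
  return 1
}

# Usage in a CI step (shown as a comment; requires the eval CLI):
# retry 3 bun run eval --suite core-tests --parallel 2
```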