Multi-Agent Orchestration

DevTeam Orchestrator enables distributed multi-agent workflows where different agents run on different machines, use different models, and collaborate through shared task queues and Temporal workflows.

Architecture

                 +-------------------+
                 |   Orchestrator    |
                 |   API + Temporal  |
                 +--------+----------+
                          |
            +-------------+-------------+
            |             |             |
    +-------+------+ +---+--------+ +--+----------+
    | GPU Worker   | | CPU Worker | | CPU Worker  |
    | RTX 5080     | | Surface    | | GR-14       |
    | ollama local | | 8GB RAM    | | 16GB RAM    |
    | Queue: gpu   | | Queue: cpu | | Queue: gr14 |
    +-------+------+ +---+--------+ +--+----------+
            |             |             |
    +-------+------+ +---+--------+ +--+----------+
    | Models:      | | Models:    | | Models:     |
    | - opus       | | - sonnet   | | - haiku     |
    | - local LLM  | | - haiku    | | - sonnet    |
    | - 70B params | | - fast     | | - batch     |
    +--------------+ +------------+ +-------------+

Queue-Based Routing

Route tasks to specific workers based on model requirements and hardware capabilities:

// GPU-intensive task → GPU worker
await client.createTask({
  prompt: 'Complex reasoning task requiring large model...',
  model: 'opus',
  queue: 'gpu-queue',
});
 
// Fast batch processing → CPU workers
await client.createTask({
  prompt: 'Quick classification task...',
  model: 'haiku',
  queue: 'cpu-queue',
});
 
// Local model (no API cost) → GPU worker with Ollama
await client.createTask({
  prompt: 'Process this with local model...',
  model: 'ollama:llama3.3',
  queue: 'gpu-queue',
});
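The queue field is just a string, so model-to-queue decisions are easy to centralize in a small helper. A minimal sketch (the mapping and pickQueue are illustrative, not part of the client API):

```typescript
// Illustrative model-to-queue routing table; adjust to your deployment.
const queueFor: Record<string, string> = {
  opus: "gpu-queue",
  sonnet: "default",
  haiku: "cpu-queue",
};

function pickQueue(model: string): string {
  // Local Ollama models need the GPU worker.
  if (model.startsWith("ollama:")) return "gpu-queue";
  return queueFor[model] ?? "default";
}
```

The result can then be passed as the queue option to createTask, keeping routing policy in one place instead of scattered across call sites.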

Queue Configuration

worker-config.yaml
worker:
  id: worker-asus-gpu
  queues:
    - name: gpu-queue
      concurrency: 4
      models:
        - opus
        - sonnet
        - ollama:*
    - name: default
      concurrency: 2
      models:
        - sonnet
        - haiku
 
  resources:
    gpu: true
    gpuModel: RTX 5080
    ramGB: 24
    cpuCores: 8
 
  heartbeat:
    intervalMs: 30000
    endpoint: https://devteam.marsala.dev/api/workers/heartbeat
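The ollama:* entry in the models list is a wildcard: any model name with the ollama: prefix is accepted on that queue. A possible matcher, sketched below (modelMatches and queueAccepts are illustrative names, not worker internals):

```typescript
// Sketch of wildcard model matching as used in the queue config above.
// "ollama:*" accepts any model whose name starts with "ollama:".
function modelMatches(pattern: string, model: string): boolean {
  if (pattern.endsWith(":*")) {
    return model.startsWith(pattern.slice(0, -1)); // keep the ":" prefix
  }
  return pattern === model;
}

function queueAccepts(models: string[], model: string): boolean {
  return models.some((pattern) => modelMatches(pattern, model));
}
```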

Worker Registration

Workers auto-register when they connect:

import { DevTeamWorker } from 'devteam-worker';
 
const worker = new DevTeamWorker({
  apiUrl: 'https://devteam.marsala.dev',
  apiKey: process.env.DEVTEAM_API_KEY,
  temporalAddress: 'temporal.marsala.dev:7233',
  workerId: 'worker-asus-gpu',
  queues: ['gpu-queue', 'default'],
  concurrency: 4,
  capabilities: {
    gpu: true,
    gpuModel: 'RTX 5080',
    localModels: ['llama3.3:70b', 'dolphin3:8b'],
  },
});
 
worker.on('task', async (task) => {
  console.log(`Processing task ${task.id} with model ${task.model}`);
});
 
worker.on('error', (error) => {
  console.error('Worker error:', error);
});
 
await worker.start();
// Worker worker-asus-gpu registered
// Listening on queues: gpu-queue, default (concurrency: 4)

Agent Specialization

Assign specialized system prompts and tools to different agents:

const plan = await client.createPlan({
  name: 'research-pipeline',
  steps: [
    {
      id: 'researcher',
      prompt: 'Research the topic: {{input.topic}}',
      model: 'sonnet',
      queue: 'research-queue',
      systemPrompt: `You are a research analyst with access to web search
                      and document retrieval tools. Gather comprehensive
                      information from multiple sources.`,
    },
    {
      id: 'analyst',
      prompt: 'Analyze research findings: {{researcher.output}}',
      model: 'opus',
      queue: 'gpu-queue',
      systemPrompt: `You are a senior analyst specializing in quantitative
                      analysis and pattern recognition. Identify key insights,
                      trends, and anomalies in the research data.`,
      dependsOn: ['researcher'],
    },
    {
      id: 'writer',
      prompt: 'Write a report based on analysis: {{analyst.output}}',
      model: 'sonnet',
      queue: 'default',
      systemPrompt: `You are a professional report writer. Create clear,
                      well-structured reports with executive summaries,
                      data tables, and actionable recommendations.`,
      dependsOn: ['analyst'],
    },
  ],
});
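Step prompts reference earlier outputs through {{step.field}} placeholders, which the orchestrator resolves before dispatching each step. Conceptually, that interpolation looks like the sketch below (the actual templating engine may differ):

```typescript
// Illustrative sketch of {{step.field}} substitution in plan prompts.
// Unknown references are left untouched rather than erased.
function renderPrompt(
  template: string,
  context: Record<string, Record<string, string>>,
): string {
  return template.replace(/\{\{(\w+)\.(\w+)\}\}/g, (match, step, field) => {
    return context[step]?.[field] ?? match;
  });
}
```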

Load Balancing

The orchestrator distributes tasks across workers using configurable strategies:

Strategy           Description                                   Best For
round-robin        Distribute tasks evenly across workers        Homogeneous workers
least-loaded       Route to worker with fewest active tasks      Heterogeneous workers
capability-match   Route based on model/GPU requirements         Mixed GPU/CPU clusters
locality           Prefer workers close to the data source       RAG-heavy workloads

const client = new DevTeamClient({
  loadBalancing: {
    strategy: 'capability-match',
    preferences: {
      'opus': ['gpu-queue'],
      'ollama:*': ['gpu-queue'],
      'haiku': ['cpu-queue', 'default'],
    },
  },
});
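For intuition, the least-loaded strategy can be sketched as a pure selection function: among workers serving the requested queue with spare capacity, pick the one with the fewest active tasks (WorkerInfo and leastLoaded are illustrative, not the client's actual types):

```typescript
// Illustrative least-loaded worker selection.
interface WorkerInfo {
  id: string;
  queues: string[];
  activeTasks: number;
  concurrency: number;
}

function leastLoaded(workers: WorkerInfo[], queue: string): WorkerInfo | null {
  // Only consider workers that serve this queue and have free slots.
  const eligible = workers.filter(
    (w) => w.queues.includes(queue) && w.activeTasks < w.concurrency,
  );
  if (eligible.length === 0) return null;
  return eligible.reduce((best, w) =>
    w.activeTasks < best.activeTasks ? w : best,
  );
}
```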

Health Monitoring

Workers report health via heartbeat:

// Check worker status
const workers = await client.getWorkers();
workers.forEach((w) => {
  console.log(`${w.id}: ${w.status} (${w.activeTasks}/${w.concurrency} tasks)`);
  console.log(`  Queues: ${w.queues.join(', ')}`);
  console.log(`  Last heartbeat: ${w.lastHeartbeat}`);
  console.log(`  Uptime: ${(w.uptimeMs / 3600000).toFixed(1)}h`);
});

Auto-Recovery

If a worker stops sending heartbeats:

  1. After 60 seconds: Worker marked as unhealthy
  2. After 120 seconds: Active tasks are re-queued to other workers
  3. After 300 seconds: Worker deregistered

worker-config.yaml
worker:
  heartbeat:
    intervalMs: 30000
    unhealthyAfterMs: 60000
    rebalanceAfterMs: 120000
    deregisterAfterMs: 300000
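The recovery timeline amounts to a small state machine over the time elapsed since the last heartbeat. A sketch using the example thresholds above (the state names are illustrative):

```typescript
// Health classification implied by the heartbeat thresholds above.
type WorkerState = "healthy" | "unhealthy" | "rebalancing" | "deregistered";

function classify(msSinceHeartbeat: number): WorkerState {
  if (msSinceHeartbeat >= 300_000) return "deregistered"; // 300s
  if (msSinceHeartbeat >= 120_000) return "rebalancing";  // 120s: re-queue tasks
  if (msSinceHeartbeat >= 60_000) return "unhealthy";     // 60s
  return "healthy";
}
```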
⚠️ When tasks are re-queued after worker failure, they restart from the beginning. Design tasks to be idempotent to prevent duplicate side effects.
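One common idempotency pattern is to derive a deterministic key from the task's content and check a results store for that key before performing side effects. A sketch (the key scheme is an example, not a built-in feature of the orchestrator):

```typescript
import { createHash } from "node:crypto";

// Derive a deterministic idempotency key from task content. A re-run of
// the same task after a worker failure produces the same key, so a
// handler can skip side effects it has already recorded under that key.
function idempotencyKey(task: { prompt: string; model: string }): string {
  return createHash("sha256")
    .update(`${task.model}\n${task.prompt}`)
    .digest("hex");
}
```

Inside a task handler, look the key up in a shared store (database, Redis, etc.) before writing anything; if it is present, return the stored result instead of re-executing.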

Scaling Patterns

Horizontal Scaling

Add more workers to handle increased load:

# Start additional workers on new machines
ssh worker-node-5 'devteam-worker start --queue cpu-queue --concurrency 8'
ssh worker-node-6 'devteam-worker start --queue cpu-queue --concurrency 8'

Vertical Scaling

Increase concurrency on existing workers:

devteam-worker config set concurrency 8
devteam-worker restart

Auto-Scaling (Kubernetes)

worker-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: devteam-worker
  namespace: workers
spec:
  replicas: 3
  selector:
    matchLabels:
      app: devteam-worker
  template:
    metadata:
      labels:
        app: devteam-worker
    spec:
      containers:
        - name: worker
          image: matwal/devteam-worker:v1
          env:
            - name: DEVTEAM_API_URL
              value: "https://devteam.marsala.dev"
            - name: DEVTEAM_QUEUE
              value: "cpu-queue"
            - name: DEVTEAM_CONCURRENCY
              value: "4"
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
            limits:
              cpu: "4"
              memory: "8Gi"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: devteam-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: devteam-worker
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          name: devteam_queue_depth
        target:
          type: AverageValue
          averageValue: "5"
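For an AverageValue target on an external metric, the HPA sizes the deployment so that queue depth per replica stays near the target: desired = ceil(metric / target), clamped to the min/max bounds. A worked sketch of that formula using the values from the config above:

```typescript
// HPA AverageValue math for the config above:
// desired = ceil(queueDepth / targetAvg), clamped to [min, max].
function desiredReplicas(
  queueDepth: number,
  targetAvg: number,
  min: number,
  max: number,
): number {
  const desired = Math.ceil(queueDepth / targetAvg);
  return Math.min(max, Math.max(min, desired));
}
```

With a target of 5 tasks per replica, a queue depth of 23 yields 5 replicas, while a depth of 100 hits the maxReplicas ceiling of 10. Note that serving the devteam_queue_depth metric to the HPA requires an external metrics adapter (for example, a Prometheus adapter) deployed in the cluster.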

Next Steps