Swarms
Multi-agent swarm coordination — spawning child executions, inter-agent messaging, resource locking, and cascade cancellation.
A Swarm is a group of agent executions that share a common parent root execution. One agent (the parent) spawns one or more child executions, coordinates with them via messaging, and synchronizes access to shared resources via TTL-backed locks. The orchestrator tracks the entire parent-child hierarchy and enforces security boundaries at spawn time.
Swarm Topology
Root Execution (parent)
├── Child Execution A (depth 1)
│   ├── Child Execution A1 (depth 2)
│   └── Child Execution A2 (depth 2)
└── Child Execution B (depth 1)
    └── Child Execution B1 (depth 2)

Maximum recursive depth: 3. An execution at depth 3 cannot spawn further children; attempts to do so are rejected with SpawnError::MaxDepthExceeded.
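The depth rule can be modeled in a few lines. This is only a sketch, not the orchestrator's code: MAX_DEPTH, MaxDepthExceeded, and child_depth are hypothetical names, and it assumes the root execution sits at depth 0, as the tree above suggests.

```python
MAX_DEPTH = 3  # from the text: an execution at depth 3 cannot spawn children


class MaxDepthExceeded(Exception):
    """Hypothetical stand-in for SpawnError::MaxDepthExceeded."""


def child_depth(parent_depth: int) -> int:
    """Return the depth a new child would get, or raise if the parent is at the limit."""
    if parent_depth >= MAX_DEPTH:
        raise MaxDepthExceeded(f"parent at depth {parent_depth} cannot spawn")
    return parent_depth + 1
```

Under these assumptions, a root at depth 0 can spawn a depth-1 child, and a depth-2 execution can still spawn one depth-3 child, which is then a leaf.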
Spawning Child Agents
Agents spawn children by calling the aegis.spawn_child MCP tool from within bootstrap.py. The child executes asynchronously — spawn_child returns immediately with identifiers, and the parent uses aegis.await_child to block until completion.
from aegis import AegisClient

client = AegisClient()

# Spawn a child agent
result = client.call_tool("aegis.spawn_child", {
    "manifest_yaml": open("/agent/worker-manifest.yaml").read(),
    # swarm_id is optional; omit to have the orchestrator create a new swarm
})
child_execution_id = result["execution_id"]
swarm_id = result["swarm_id"]

# Do other work while the child runs...

# Block until child completes (or timeout)
outcome = client.call_tool("aegis.await_child", {
    "execution_id": child_execution_id,
    "timeout_secs": 300
})
if outcome["status"] == "completed":
    print(f"Child succeeded: {outcome['output']}")
else:
    print(f"Child did not succeed: {outcome['status']}")

Security Context Ceiling
A child agent's security_context must be a subset of its parent's security_context. The orchestrator enforces this at spawn time and rejects the call with SpawnError::ContextExceedsParentCeiling if the child requests broader permissions than the parent holds.
This prevents privilege escalation via spawned children: a parent can never grant a child execution broader permissions than it holds itself.
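The ceiling check amounts to a subset test. The sketch below assumes, purely for illustration, that a security context can be flattened into a set of permission strings; the real SecurityContext structure is not described here, and check_ceiling, the exception class, and the permission names are all hypothetical.

```python
class ContextExceedsParentCeiling(Exception):
    """Hypothetical stand-in for SpawnError::ContextExceedsParentCeiling."""


def check_ceiling(parent_perms: frozenset, child_perms: frozenset) -> None:
    """Raise if the child requests any permission the parent does not hold."""
    extra = child_perms - parent_perms
    if extra:
        raise ContextExceedsParentCeiling(
            f"child requests extra permissions: {sorted(extra)}"
        )


# Illustrative permission names only; these are not real AEGIS identifiers.
parent = frozenset({"fs.read", "fs.write", "net.egress"})
check_ceiling(parent, frozenset({"fs.read"}))  # passes: strict subset
check_ceiling(parent, parent)                  # passes: an equal set is still a subset
```

A child asking for anything outside the parent's set, for example frozenset({"fs.read", "proc.exec"}), would raise in this model.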
Inter-Agent Messaging
Agents within a swarm can send messages to each other using unicast (to a specific agent) or broadcast (to all agents in the swarm).
# Unicast to a specific agent
client.call_tool("aegis.send_message", {
    "to_agent_id": "agent-uuid-here",
    "payload": b"<serialized task data>"
})

# Broadcast to all agents in the swarm
client.call_tool("aegis.broadcast_message", {
    "swarm_id": swarm_id,
    "payload": b"<serialized task data>"
})

Messages are raw bytes. Agents are responsible for serialization (e.g., JSON, msgpack). Message payloads are not logged; only the payload size is recorded in MessageSent domain events for audit purposes.
There is no message ordering guarantee between different sender-receiver pairs. Within a single sender-receiver pair, messages are delivered in send order.
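Because payloads are raw bytes, sender and receiver need to agree on an encoding. One possible convention, sketched here with the standard json module, is UTF-8 JSON; encode_payload and decode_payload are hypothetical helper names, and the task fields are illustrative.

```python
import json


def encode_payload(task: dict) -> bytes:
    """Serialize a task dict to compact UTF-8 JSON bytes for use as a message payload."""
    return json.dumps(task, separators=(",", ":"), sort_keys=True).encode("utf-8")


def decode_payload(payload: bytes) -> dict:
    """Inverse of encode_payload, run by the receiving agent."""
    return json.loads(payload.decode("utf-8"))


# Illustrative task; a "seq" field lets a receiver detect gaps from a given sender.
payload = encode_payload({"task": "summarize", "seq": 1})
```

The resulting bytes would be passed as the "payload" value of aegis.send_message or aegis.broadcast_message. Since ordering is only guaranteed within a sender-receiver pair, any sequence number should be interpreted per sender.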
Resource Locking
When multiple child agents need exclusive access to a shared resource (for example, writing to the same file or updating a shared database row), they use the ResourceLock mechanism.
# Acquire a lock
lock = client.call_tool("aegis.acquire_lock", {
    "resource": "workspace/shared-config.json",
    "ttl_secs": 60  # lock auto-expires after 60 seconds even if not released
})
lock_token = lock["lock_token"]
try:
    # ... exclusive work ...
    pass
finally:
    # Release the lock
    client.call_tool("aegis.release_lock", {
        "lock_token": lock_token
    })

Lock Behavior
| Property | Value |
|---|---|
| Default TTL | 300 seconds (5 minutes) |
| TTL on execution end | Lock is automatically released when the holding execution completes or is cancelled. |
| Contention behavior | acquire_lock blocks until the lock is available or the call times out. |
| Expiry | A background GC task sweeps expired locks. LockExpired domain event is emitted. |
To avoid deadlocks:
- Always use try/finally to release locks.
- Set TTLs conservatively: if your critical section takes 10 seconds, use a 30-second TTL.
- Avoid circular lock acquisition (A waits for B's lock while B waits for A's lock).
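The first and third guidelines can be combined in a small wrapper. The sketch below assumes only the call_tool interface and the aegis.acquire_lock / aegis.release_lock arguments shown earlier; held_lock and held_locks are hypothetical helpers, not part of any AEGIS SDK. Acquiring locks in sorted resource order means every agent requests them in the same sequence, which rules out the A-waits-for-B, B-waits-for-A cycle.

```python
from contextlib import ExitStack, contextmanager


@contextmanager
def held_lock(client, resource: str, ttl_secs: int = 60):
    """Acquire one lock and release it on exit, even if the body raises."""
    lock = client.call_tool(
        "aegis.acquire_lock", {"resource": resource, "ttl_secs": ttl_secs}
    )
    try:
        yield lock["lock_token"]
    finally:
        client.call_tool("aegis.release_lock", {"lock_token": lock["lock_token"]})


@contextmanager
def held_locks(client, resources, ttl_secs: int = 60):
    """Acquire several locks in sorted order; release them in reverse order on exit."""
    with ExitStack() as stack:
        tokens = [
            stack.enter_context(held_lock(client, name, ttl_secs))
            for name in sorted(set(resources))
        ]
        yield tokens
```

With a real client, `with held_locks(client, ["workspace/b.json", "workspace/a.json"]):` would request a.json first regardless of the order the caller lists them in. If a later acquisition fails, ExitStack still releases the locks already held.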
Cascade Cancellation
Cancelling a swarm propagates cancellation to all live child executions.
# Cancel by swarm ID (cancels all children)
aegis swarm cancel <swarm-id>
# Cancel the root execution (also cancels all children)
aegis execution cancel <root-execution-id>

The cancellation reason is recorded in ChildCancelled domain events for the audit trail. Possible reasons:
| Reason | Description |
|---|---|
| ParentCancelled | Parent execution was explicitly cancelled. |
| Manual | Operator called aegis swarm cancel directly. |
| AllChildrenComplete | Swarm dissolved naturally after all children finished. |
| SecurityViolation | A security policy violation triggered swarm termination. |
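As a mental model, cascade cancellation is a walk over the execution tree starting at the cancelled node. The sketch below is illustrative only: cascade_cancel and the dictionary layout are hypothetical, and the order in which the orchestrator actually visits children is not specified here.

```python
def cascade_cancel(children: dict, root: str) -> list:
    """Return execution IDs in the order a depth-first cascade from root would reach them."""
    order = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        # Push children reversed so they are visited in their listed order.
        stack.extend(reversed(children.get(node, [])))
    return order


# Same shape as the topology diagram: parent ID -> list of child IDs.
tree = {"root": ["A", "B"], "A": ["A1", "A2"], "B": ["B1"]}
```

Starting from an inner node, for example cascade_cancel(tree, "A"), only that subtree is visited in this model. In the real system, executions that have already finished would be skipped, since cancellation applies to live children.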
Swarm Lifecycle
Created ──▶ Active ──▶ Dissolving ──▶ Dissolved
                    ▲
                    │
    (all children complete, or cancel called)

A swarm enters Dissolving when:
- All child executions have completed or failed, or
- aegis swarm cancel is called.
It transitions to Dissolved once all in-flight child state is cleaned up (locks released, messages drained).
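The lifecycle can be read as a linear state machine. The sketch below encodes only the transitions drawn in the diagram; SwarmState, TRANSITIONS, and advance are hypothetical names, and whether the orchestrator permits other transitions (for example cancelling a swarm that is still in Created) is not stated here.

```python
from enum import Enum


class SwarmState(Enum):
    CREATED = "Created"
    ACTIVE = "Active"
    DISSOLVING = "Dissolving"
    DISSOLVED = "Dissolved"


# Allowed forward transitions, taken from the lifecycle diagram.
TRANSITIONS = {
    SwarmState.CREATED: {SwarmState.ACTIVE},
    SwarmState.ACTIVE: {SwarmState.DISSOLVING},
    SwarmState.DISSOLVING: {SwarmState.DISSOLVED},
    SwarmState.DISSOLVED: set(),
}


def advance(current: SwarmState, target: SwarmState) -> SwarmState:
    """Return target if the diagram allows current -> target, else raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

A consumer of swarm lifecycle events could use a table like this to sanity-check the sequence of states it observes.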
Monitoring Swarms
# List all swarms for a root execution
aegis swarm list --execution <root-execution-id>
# Get swarm details (children, status, locks)
aegis swarm get <swarm-id>
# List child executions in a swarm
aegis swarm children <swarm-id>

Swarm lifecycle events (SwarmCreated, ChildAgentSpawned, ChildAgentCompleted, SwarmDissolved, etc.) are published to the event bus and can be consumed by external systems via the gRPC streaming API.