Description
Replace the current controller-driven deterministic container startup flow (testcontainers + direct configuration injection) with a distributed agent discovery and configuration protocol based on announce queues in Cassini. This requires refactoring all agents to support a standard presence/announce handshake, as well as re-architecting the test harness to rely on presence detection instead of container lifecycle control.
This change aligns the harness with real distributed system behavior, removes tight coupling to testcontainers, and enables multi-node, multi-host, and composition-based deployments.
Motivation
The current architecture assumes the controller is responsible for:
- Spawning each agent via testcontainers
- Knowing which container maps to which agent
- Sending configuration directly over the wire
- Tracking container references for teardown
This model is fragile, unrealistic, and blocks scaling tests beyond a single host or deterministic startup scenarios. It also makes it impossible to:
- Run agents externally (K8s, Podman Compose, remote systems)
- Restart agents independently and still maintain test flow
- Observe real-world distributed dynamics like late joiners, churn, or recovery
- Treat the harness as an actual control plane instead of a babysitter for containers
By introducing well-known announce queues on Cassini (e.g. agent.announce.<AgentType>), the harness shifts into a distributed coordination model:
- Agents announce themselves when online
- The controller waits for the expected set of agents
- Once presence is satisfied, the controller issues configuration
- Agents execute autonomously and report test events back
The controller stops micromanaging infra and becomes an orchestration and validation layer.
This is the correct architectural direction.
High-Level Changes
1. Introduce announce queues
- One per agent type (e.g. agent.announce.ProducerAgent)
- Agents publish presence on startup
- Schema includes:
  - agent_type
  - agent_id
  - topics (if producer)
  - metadata
  - optional heartbeat
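The announce fields listed above could be modeled roughly as follows. This is a minimal sketch, not the project's actual schema: the JSON wire format, the class name Announce, and the heartbeat_interval_s field (one possible way to express the optional heartbeat) are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class Announce:
    """Presence message an agent publishes on startup."""
    agent_type: str
    agent_id: str
    # Only meaningful for producers; empty for other agent types.
    topics: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)
    # Assumption: the optional heartbeat is expressed as an interval in seconds.
    heartbeat_interval_s: Optional[float] = None

    def topic(self) -> str:
        # One announce queue per agent type, e.g. agent.announce.ProducerAgent
        return f"agent.announce.{self.agent_type}"

    def to_json(self) -> str:
        # Assumption: JSON on the wire; the issue does not fix an encoding.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "Announce":
        return cls(**json.loads(raw))
```

A producer would build one of these at startup and publish to_json() on the queue returned by topic().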
2. Update all agents to implement an announce handshake
Every agent must:
- Connect to Cassini
- Publish its announce payload
- (Optionally) heartbeat
- Listen for config on:
  - agent.config.<agent_id>, OR
  - agent.config.<agent_type>
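The handshake steps above can be sketched against an in-memory stand-in for the broker. InMemoryBroker and Agent are hypothetical names, not Cassini's client API, and two choices here are assumptions: the agent subscribes to both config topics rather than picking one, and it subscribes before announcing so a fast controller reply is not missed.

```python
from collections import defaultdict


class InMemoryBroker:
    """Hypothetical stand-in for a Cassini client: synchronous topic pub/sub."""

    def __init__(self):
        self._subs = defaultdict(list)  # topic -> list of handler callables

    def subscribe(self, topic, handler):
        self._subs[topic].append(handler)

    def publish(self, topic, message):
        # Copy the list so handlers may subscribe during delivery.
        for handler in list(self._subs[topic]):
            handler(message)


class Agent:
    def __init__(self, broker, agent_type, agent_id):
        self.broker = broker
        self.agent_type = agent_type
        self.agent_id = agent_id
        self.config = None  # stays None until the controller sends config

    def start(self):
        # Listen on the per-id and per-type config topics.
        for topic in (f"agent.config.{self.agent_id}",
                      f"agent.config.{self.agent_type}"):
            self.broker.subscribe(topic, self._on_config)
        # Announce presence on the well-known queue for this agent type.
        self.broker.publish(
            f"agent.announce.{self.agent_type}",
            {"agent_type": self.agent_type, "agent_id": self.agent_id},
        )

    def _on_config(self, message):
        # A real agent would start its worker here instead of storing the value.
        self.config = message
```

With this shape, worker startup depends on the config message arriving rather than on process start.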
3. Update HarnessController to wait for expected agents
- TestPlan.environment.expectedAgents becomes mandatory
- During pre-test:
  - Subscribe to all agent.announce.*
  - Collect announces until all required agents appear
  - Timeout → test failure
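One possible shape for the presence gate described above is a small tracker that counts announces per agent type and fails on timeout. PresenceTracker, PresenceTimeout, and the poll callback are hypothetical names; how announces actually arrive from Cassini is left to the caller.

```python
import time
from collections import Counter


class PresenceTimeout(Exception):
    """Raised when the expected agents do not all announce in time."""


class PresenceTracker:
    def __init__(self, expected_agents):
        # A Counter lets a plan require several agents of the same type.
        self.expected = Counter(expected_agents)
        # agent_id -> agent_type, so a repeated announce is not double counted.
        self.seen = {}

    def on_announce(self, announce):
        self.seen[announce["agent_id"]] = announce["agent_type"]

    def missing(self):
        # Counter subtraction keeps only types that are still short.
        return self.expected - Counter(self.seen.values())

    def satisfied(self):
        return not self.missing()

    def wait(self, poll, timeout_s, clock=time.monotonic):
        """Call poll() to deliver pending announces until satisfied or timed out."""
        deadline = clock() + timeout_s
        while not self.satisfied():
            if clock() >= deadline:
                raise PresenceTimeout(f"missing agents: {dict(self.missing())}")
            poll()
```

The controller would map PresenceTimeout to a test failure in the setup phase. Keying on agent_id also means a restarted agent that re-announces under the same id does not inflate the count.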
4. Controller stops starting testcontainers for agents
Container lifecycle management moves out-of-band:
- Use Podman Compose
- Or Docker Compose
- Or manual container orchestration
- Or K8s manifests
The harness only coordinates the logical agents.
5. Controller becomes a config distributor
Once presence is satisfied:
- Build combined config (producer topics, sink topics, patterns, etc.)
- Publish over the appropriate config channel
- Agents begin executing
6. Update sink + producer config logic
- Producers: same model (pattern, topics, durations)
- Sinks: now derive their topic list from producer announce metadata
→ They no longer need explicit config in most tests.
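Deriving the sink topic list from producer announces might look like the sketch below. It assumes announces are dictionaries carrying the agent_type and topics fields from the announce schema, and that producers are identified by the type name ProducerAgent; the function name sink_topics is hypothetical.

```python
def sink_topics(announces, producer_type="ProducerAgent"):
    """Return the sorted, de-duplicated union of topics announced by producers."""
    topics = set()
    for announce in announces:
        if announce.get("agent_type") == producer_type:
            # Non-producers may omit the topics field entirely.
            topics.update(announce.get("topics", []))
    return sorted(topics)
```

Sorting keeps the resulting config deterministic regardless of the order in which agents announced.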
7. Update TestPlan schema
- Remove StartAgent actions in most scenarios
- Replace with: environment.expectedAgents = [ ProducerAgent, SinkAgent, ... ]
- Phases become:
  - setup (wait for presence)
  - execution (sleep, custom events)
  - validation (assertions)
8. Update teardown
- Controller publishes STOP to agent.control.<agent_id>
- Agents exit cleanly
- Underlying infra cleaned by compose/K8s
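As a rough illustration of the teardown step, the controller could fan out STOP to every agent it saw during presence detection. The publish callable and the bare string payload "STOP" are assumptions; the real control message format is still to be defined.

```python
def stop_all(publish, agent_ids):
    """Publish STOP to each agent's control topic and return the topics used."""
    topics = [f"agent.control.{agent_id}" for agent_id in sorted(agent_ids)]
    for topic in topics:
        # Assumption: a bare "STOP" string is the whole control payload.
        publish(topic, "STOP")
    return topics
```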
Detailed Tasks
A. Broker Protocol / Topics
- Define announce topic format
- Define config topic format
- Define control/stop topic format
- Add versioning and agent_type verification
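For the versioning and agent_type verification item, one option is to have the controller check each announce against the topic it arrived on. The version field, the PROTOCOL_VERSION constant, and the function name are hypothetical; the issue does not yet specify how versions are carried.

```python
ANNOUNCE_PREFIX = "agent.announce."
PROTOCOL_VERSION = 1  # hypothetical: the current announce protocol version


def verify_announce(topic, payload):
    """Return the agent type if the announce is acceptable, else raise ValueError."""
    if not topic.startswith(ANNOUNCE_PREFIX):
        raise ValueError(f"not an announce topic: {topic}")
    expected_type = topic[len(ANNOUNCE_PREFIX):]
    if payload.get("version") != PROTOCOL_VERSION:
        raise ValueError(f"unsupported protocol version: {payload.get('version')}")
    # The type in the payload must match the queue the message was published on.
    if payload.get("agent_type") != expected_type:
        raise ValueError(
            f"agent_type {payload.get('agent_type')!r} does not match topic {topic}"
        )
    return expected_type
```

Rejecting mismatches early keeps a misconfigured agent from satisfying presence for the wrong type.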
B. Agent Code Refactor
For every agent:
- Implement announce on startup
- Implement announce schema serialization
- Implement config listener
- Refactor worker startup to depend on config message
- Add STOP listener to support controlled shutdown
C. Harness Refactor
- Add presence-tracking subsystem
- Add announce subscription logic
- Add timeout + failure logic
- Replace StartAgent container logic with presence-based gating
- Implement config broadcasting
- Wire presence + config into phase execution
D. TestPlan Updates
- Update Dhall schemas
- Remove/Deprecate StartAgent in test phases
- Add expectedAgents
- Update ProducerConfig/SinkConfig semantics
- Update docs + examples
E. Operational Updates
- New podman-compose.yml for local multi-agent testing
- Update developer scripts to run agents independently
- Provide examples of standalone agent startup outside harness
Migration Strategy
- Implement announce protocol in one agent first (e.g., ProducerAgent)
- Update harness to support presence detection for that one type
- Gradually migrate remaining agents
- Remove old configuration handshake
- Remove testcontainers dependency for agents entirely
- Update all tests to use the presence-based workflow
Migration can be staged without breaking the entire system.
Risks
- Requires refactoring every agent
- Requires coordinated rollout of new protocol
- Some tests will break until both sides speak the new announce protocol
- A few testcontainers tests may need temporary shims
These risks are transitional: once migrated, the distributed model removes the single-host and deterministic-startup constraints described under Motivation.
Acceptance Criteria
- Controller does not start agent containers
- Agents self-announce via Cassini
- Controller waits for expected agents before test execution
- Controller sends config over broker
- Agents receive config and start work
- Sinks automatically subscribe to topics defined by producers
- End-to-end test runs without deterministic container startup
- Example test using podman-compose demonstrates full lifecycle
If you want, I can also write an architectural diagram, message schemas, and the Dhall changes in separate issues or as follow-up comments.