Replace deterministic container startup with agent discovery + announce queues #104

@vonjackets

Description

Replace the current controller-driven deterministic container startup flow (testcontainers + direct configuration injection) with a distributed agent discovery and configuration protocol based on announce queues in Cassini. This requires refactoring all agents to support a standard presence/announce handshake, as well as re-architecting the test harness to rely on presence detection instead of container lifecycle control.

This change aligns the harness with real distributed system behavior, removes tight coupling to testcontainers, and enables multi-node, multi-host, and composition-based deployments.


Motivation

The current architecture assumes the controller is responsible for:

  • Spawning each agent via testcontainers
  • Knowing which container maps to which agent
  • Sending configuration directly over the wire
  • Tracking container references for teardown

This model is fragile and unrealistic, and it blocks scaling tests beyond a single host or a deterministic startup sequence. It also makes it impossible to:

  • Run agents externally (K8s, Podman Compose, remote systems)
  • Restart agents independently and still maintain test flow
  • Observe real-world distributed dynamics like late joiners, churn, or recovery
  • Treat the harness as an actual control plane instead of a babysitter for containers

By introducing well-known announce queues on Cassini (e.g. agent.announce.<AgentType>), the harness shifts into a distributed coordination model:

  • Agents announce themselves when online
  • The controller waits for the expected set of agents
  • Once presence is satisfied, the controller issues configuration
  • Agents execute autonomously and report test events back

The controller stops micromanaging infra and becomes an orchestration and validation layer.

This matches how the agents would be deployed and coordinated outside the harness.


High-Level Changes

1. Introduce announce queues

  • One per agent type (e.g. agent.announce.ProducerAgent)

  • Agents publish presence on startup

  • Schema includes:

    • agent_type
    • agent_id
    • topics (if producer)
    • metadata
    • optional heartbeat
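
A minimal sketch of what the announce payload could look like, assuming JSON on the wire. The field names come from the list above; the `Announce` class, the `heartbeat_interval_s` field (one possible way to express the optional heartbeat), and the JSON encoding are illustrative assumptions, not a settled schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Announce:
    """Presence message an agent publishes on startup (hypothetical shape)."""

    agent_type: str
    agent_id: str
    # Only meaningful for producers; other agents leave it empty.
    topics: list[str] = field(default_factory=list)
    metadata: dict[str, str] = field(default_factory=dict)
    # Assumption: the "optional heartbeat" is expressed as an interval in seconds.
    heartbeat_interval_s: Optional[float] = None

    def to_json(self) -> str:
        # sort_keys keeps the encoding stable, which helps when diffing logs.
        return json.dumps(asdict(self), sort_keys=True)

    @staticmethod
    def from_json(raw: str) -> "Announce":
        return Announce(**json.loads(raw))
```

A producer would then build something like `Announce("ProducerAgent", "producer-1", topics=["orders"])` and publish `to_json()` on its announce queue.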

2. Update all agents to implement an announce handshake

Every agent must:

  1. Connect to Cassini

  2. Publish its announce payload

  3. (Optionally) heartbeat

  4. Listen for config on:

    • agent.config.<agent_id>
    • OR agent.config.<agent_type>
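
The handshake above can be sketched against an in-memory stand-in for the broker. `InMemoryBroker` and `Agent` are hypothetical names used only for illustration; the real Cassini client API is not shown in this issue and will differ. The sketch subscribes to the config topics before announcing, which is one way to avoid missing a config that arrives immediately after the announce (the steps above list them in the other order). Heartbeating is omitted.

```python
from collections import defaultdict
from typing import Any, Callable, Optional


class InMemoryBroker:
    """Synchronous pub/sub stub standing in for a Cassini client (NOT the real API)."""

    def __init__(self) -> None:
        self._subs: dict[str, list[Callable[[Any], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[Any], None]) -> None:
        self._subs[topic].append(handler)

    def publish(self, topic: str, message: Any) -> None:
        for handler in list(self._subs[topic]):
            handler(message)


class Agent:
    def __init__(self, broker: InMemoryBroker, agent_type: str, agent_id: str) -> None:
        self.broker = broker
        self.agent_type = agent_type
        self.agent_id = agent_id
        self.config: Optional[dict] = None  # stays None until the controller sends config

    def start(self) -> None:
        # Listen on both the per-agent and the per-type config topics.
        for topic in (f"agent.config.{self.agent_id}", f"agent.config.{self.agent_type}"):
            self.broker.subscribe(topic, self._on_config)
        # Publish presence on the well-known announce queue for this agent type.
        self.broker.publish(
            f"agent.announce.{self.agent_type}",
            {"agent_type": self.agent_type, "agent_id": self.agent_id},
        )

    def _on_config(self, config: dict) -> None:
        # A real agent would start its worker here.
        self.config = config
```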

3. Update HarnessController to wait for expected agents

  • TestPlan.environment.expectedAgents becomes mandatory

  • During pre-test:

    • Subscribe to all agent.announce.*
    • Collect announces until all required agents appear
    • Timeout → test failure
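
One possible shape for the pre-test wait is sketched below. `PresenceTracker`, `wait_for_presence`, and the `poll` callable (which returns the `(agent_type, agent_id)` pairs received since the last call) are hypothetical; listing an agent type more than once in `expected` to require several instances is also an assumption about how `expectedAgents` might be interpreted.

```python
import time
from collections import Counter
from typing import Callable, Iterable


class PresenceTracker:
    """Tracks which expected agents have announced themselves."""

    def __init__(self, expected: Iterable[str]) -> None:
        # A type listed twice means two distinct instances are required.
        self.expected = Counter(expected)
        self.seen: dict[str, set[str]] = {}

    def on_announce(self, agent_type: str, agent_id: str) -> None:
        # A set makes repeated announces from the same agent idempotent.
        self.seen.setdefault(agent_type, set()).add(agent_id)

    def missing(self) -> dict[str, int]:
        gaps = {}
        for agent_type, needed in self.expected.items():
            have = len(self.seen.get(agent_type, set()))
            if have < needed:
                gaps[agent_type] = needed - have
        return gaps

    def satisfied(self) -> bool:
        return not self.missing()


def wait_for_presence(
    tracker: PresenceTracker,
    poll: Callable[[], Iterable[tuple[str, str]]],
    timeout_s: float,
    poll_interval_s: float = 0.1,
) -> None:
    """Drain announces until all expected agents appear, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    while True:
        for agent_type, agent_id in poll():
            tracker.on_announce(agent_type, agent_id)
        if tracker.satisfied():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"missing agents: {tracker.missing()}")
        time.sleep(poll_interval_s)
```

The controller would treat the `TimeoutError` as a test failure and could include `missing()` in the failure report.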

4. Controller stops starting testcontainers for agents

Container lifecycle management moves out-of-band:

  • Use Podman Compose
  • Or Docker Compose
  • Or manual container orchestration
  • Or K8s manifests

The harness only coordinates the logical agents.

5. Controller becomes a config distributor

Once presence is satisfied:

  • Build combined config (producer topics, sink topics, patterns, etc.)
  • Publish over the appropriate config channel
  • Agents begin executing

6. Update sink + producer config logic

  • Producers: same model (pattern, topics, durations)
  • Sinks: now derive their topic list from producer announce metadata
    → They no longer need explicit config in most tests.
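
As a rough illustration of the sink side, the topic list could be computed as the union of topics advertised by producers. The function name `derive_sink_topics` and the dict-shaped announces are assumptions; whether every sink subscribes to every producer topic is a policy question this issue leaves open.

```python
from typing import Iterable, Mapping


def derive_sink_topics(announces: Iterable[Mapping]) -> list[str]:
    """Return the sorted union of topics advertised by producer announces."""
    topics: set[str] = set()
    for announce in announces:
        # Only producers advertise topics; other agent types are ignored.
        if announce.get("agent_type") == "ProducerAgent":
            topics.update(announce.get("topics", []))
    # Sorting keeps the generated sink config deterministic across runs.
    return sorted(topics)
```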

7. Update TestPlan schema

  • Remove StartAgent actions in most scenarios
  • Replace with:
    environment.expectedAgents = [ ProducerAgent, SinkAgent, ... ]
  • Phases become:

    • setup (wait for presence)
    • execution (sleep, custom events)
    • validation (assertions)

8. Update teardown

  • Controller publishes STOP to agent.control.<agent_id>
  • Agents exit cleanly
  • Underlying infra cleaned by compose/K8s

Detailed Tasks

A. Broker Protocol / Topics

  • Define announce topic format
  • Define config topic format
  • Define control/stop topic format
  • Add versioning and agent_type verification
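
A sketch of how the topic formats and the verification step might fit together. The three topic prefixes are the ones used earlier in this issue; the helper names, the `version` payload field, and `PROTOCOL_VERSION = 1` are placeholders, since the versioning scheme has not been defined yet.

```python
PROTOCOL_VERSION = 1  # assumption: a single integer version carried in each payload

ANNOUNCE_PREFIX = "agent.announce."


def announce_topic(agent_type: str) -> str:
    return f"{ANNOUNCE_PREFIX}{agent_type}"


def config_topic(target: str) -> str:
    # target is either an agent_id or an agent_type.
    return f"agent.config.{target}"


def control_topic(agent_id: str) -> str:
    return f"agent.control.{agent_id}"


def verify_announce(topic: str, payload: dict) -> None:
    """Raise ValueError if an announce is on the wrong queue or uses an unknown version."""
    if not topic.startswith(ANNOUNCE_PREFIX):
        raise ValueError(f"not an announce topic: {topic}")
    topic_type = topic[len(ANNOUNCE_PREFIX):]
    # The agent_type in the payload must match the queue it was published on.
    if payload.get("agent_type") != topic_type:
        raise ValueError(
            f"agent_type {payload.get('agent_type')!r} does not match topic {topic!r}"
        )
    if payload.get("version") != PROTOCOL_VERSION:
        raise ValueError(f"unsupported protocol version: {payload.get('version')!r}")
```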

B. Agent Code Refactor

For every agent:

  • Implement announce on startup
  • Implement announce schema serialization
  • Implement config listener
  • Refactor worker startup to depend on config message
  • Add STOP listener to support controlled shutdown

C. Harness Refactor

  • Add presence-tracking subsystem
  • Add announce subscription logic
  • Add timeout + failure logic
  • Replace StartAgent container logic with presence-based gating
  • Implement config broadcasting
  • Wire presence + config into phase execution

D. TestPlan Updates

  • Update Dhall schemas
  • Remove or deprecate StartAgent in test phases
  • Add expectedAgents
  • Update ProducerConfig/SinkConfig semantics
  • Update docs + examples

E. Operational Updates

  • New podman-compose.yml for local multi-agent testing
  • Update developer scripts to run agents independently
  • Provide examples of standalone agent startup outside harness

Migration Strategy

  1. Implement announce protocol in one agent first (e.g., ProducerAgent)
  2. Update harness to support presence detection for that one type
  3. Gradually migrate remaining agents
  4. Remove old configuration handshake
  5. Remove testcontainers dependency for agents entirely
  6. Update all tests to use the presence-based workflow

Migration can be staged without breaking the entire system.


Risks

  • Requires refactoring every agent
  • Requires coordinated rollout of new protocol
  • Some tests will break until both sides speak the new announce protocol
  • A few testcontainers tests may need temporary shims

In exchange, the distributed model removes the tight coupling to testcontainers and enables multi-node and multi-host deployments.


Acceptance Criteria

  • Controller does not start agent containers
  • Agents self-announce via Cassini
  • Controller waits for expected agents before test execution
  • Controller sends config over broker
  • Agents receive config and start work
  • Sinks automatically subscribe to topics defined by producers
  • End-to-end test runs without deterministic container startup
  • Example test using podman-compose demonstrates full lifecycle

An architectural diagram, message schemas, and the Dhall changes can follow in separate issues or as follow-up comments.
