Replace deterministic container startup with agent discovery + announce queues (#104)

Replace the current controller-driven deterministic container startup flow (testcontainers + direct configuration injection) with a distributed **agent discovery and configuration** protocol based on **announce queues in Cassini**. This requires refactoring all agents to support a standard presence/announce handshake, as well as re-architecting the test harness to rely on presence detection instead of container lifecycle control.

This change aligns the harness with real distributed system behavior, removes tight coupling to testcontainers, and enables multi-node, multi-host, and composition-based deployments.

---

## **Motivation**

The current architecture assumes the controller is responsible for:

* Spawning each agent via testcontainers
* Knowing which container maps to which agent
* Sending configuration directly over the wire
* Tracking container references for teardown

This model is fragile and unrealistic, and it prevents tests from scaling beyond a single host or a deterministic startup order. It also makes it impossible to:

* Run agents externally (K8s, Podman Compose, remote systems)
* Restart agents independently and still maintain test flow
* Observe real-world distributed dynamics like late joiners, churn, or recovery
* Treat the harness as an actual control plane instead of a babysitter for containers

By introducing well-known **announce queues** on Cassini (e.g. `agent.announce.<AgentType>`), the harness shifts into a distributed coordination model:

* Agents announce themselves when online
* The controller waits for the expected set of agents
* Once presence is satisfied, the controller issues configuration
* Agents execute autonomously and report test events back

The controller stops micromanaging infrastructure and becomes an orchestration and validation layer, in line with how real distributed systems behave.

---

## **High-Level Changes**

### **1. Introduce announce queues**

* One per agent type (e.g. `agent.announce.ProducerAgent`)
* Agents publish presence on startup
* Schema includes:

  * `agent_type`
  * `agent_id`
  * `topics` (if producer)
  * `metadata`
  * optional heartbeat
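
A minimal sketch of how the announce payload above could be modeled and serialized. JSON as the wire format, the `Announce` class name, and the `heartbeat_interval_s` field (one possible reading of "optional heartbeat") are assumptions for illustration, not a settled schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Announce:
    """Presence message an agent publishes on startup (illustrative shape)."""

    agent_type: str
    agent_id: str
    topics: List[str] = field(default_factory=list)  # only meaningful for producers
    metadata: Dict[str, Any] = field(default_factory=dict)
    heartbeat_interval_s: Optional[float] = None  # assumed encoding of the optional heartbeat

    def to_json(self) -> str:
        # sort_keys keeps the serialized form stable across runs
        return json.dumps(asdict(self), sort_keys=True)

    @staticmethod
    def from_json(raw: str) -> "Announce":
        return Announce(**json.loads(raw))


msg = Announce("ProducerAgent", "producer-1", topics=["orders"])
roundtrip = Announce.from_json(msg.to_json())
```

Whatever format is chosen, a lossless round trip like this is the property the controller relies on when it reads announces back.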

### **2. Update all agents to implement an announce handshake**

Every agent must:

1. Connect to Cassini
2. Publish its announce payload
3. Optionally send periodic heartbeats
4. Listen for config on:

   * `agent.config.<agent_id>`
   * OR `agent.config.<agent_type>`
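
The handshake steps above can be sketched from the agent's side as follows. `InMemoryBroker` is a hypothetical stand-in for Cassini (its real client API is not shown in this issue), and the JSON payloads are an assumption. The sketch checks the per-id channel first and then falls back to the per-type channel.

```python
import json
import queue
from collections import defaultdict


class InMemoryBroker:
    """Hypothetical stand-in for Cassini: one FIFO queue per topic name."""

    def __init__(self):
        self._queues = defaultdict(queue.Queue)

    def publish(self, topic, payload):
        self._queues[topic].put(payload)

    def receive(self, topic, timeout):
        """Return the next payload on `topic`, or None if none arrives in time."""
        try:
            return self._queues[topic].get(timeout=timeout)
        except queue.Empty:
            return None


def agent_handshake(broker, agent_type, agent_id, timeout=1.0):
    """Announce presence, then wait for config on the per-id or per-type channel."""
    announce = {"agent_type": agent_type, "agent_id": agent_id}
    broker.publish(f"agent.announce.{agent_type}", json.dumps(announce))
    # Prefer config addressed to this agent, then the type-wide channel.
    for topic in (f"agent.config.{agent_id}", f"agent.config.{agent_type}"):
        raw = broker.receive(topic, timeout)
        if raw is not None:
            return json.loads(raw)
    return None  # no config arrived; the caller decides whether to retry or exit


broker = InMemoryBroker()
broker.publish("agent.config.producer-1", json.dumps({"topics": ["orders"]}))
config = agent_handshake(broker, "ProducerAgent", "producer-1", timeout=0.1)
```

A real agent would likely listen on both channels concurrently rather than one after the other; the sequential wait here only keeps the sketch short.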

### **3. Update HarnessController to wait for expected agents**

* `TestPlan.environment.expectedAgents` becomes mandatory
* During pre-test:

  * Subscribe to all `agent.announce.*`
  * Collect announces until all required agents appear
  * Timeout → test failure
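
One way the pre-test gate could look, as a sketch. `wait_for_agents` and the `poll_announce` callback (which would wrap the `agent.announce.*` subscription and block for up to the given number of seconds) are hypothetical names. The sketch treats `expectedAgents` as a set of agent types, which is an assumption.

```python
import time
from typing import Callable, Dict, Iterable, Optional, Set


def wait_for_agents(
    expected: Iterable[str],
    poll_announce: Callable[[float], Optional[Dict[str, str]]],
    timeout_s: float,
) -> Dict[str, Set[str]]:
    """Block until every expected agent type has announced, or raise TimeoutError."""
    missing = set(expected)
    seen: Dict[str, Set[str]] = {}
    deadline = time.monotonic() + timeout_s
    while missing:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            # Timeout maps to a test failure in the harness.
            raise TimeoutError(f"agents never announced: {sorted(missing)}")
        announce = poll_announce(remaining)
        if announce is None:
            continue
        seen.setdefault(announce["agent_type"], set()).add(announce["agent_id"])
        missing.discard(announce["agent_type"])
    return seen


# Usage with a canned list of announces standing in for the broker subscription.
pending = [
    {"agent_type": "SinkAgent", "agent_id": "sink-1"},
    {"agent_type": "ProducerAgent", "agent_id": "producer-1"},
]


def poll(_timeout):
    return pending.pop(0) if pending else None


present = wait_for_agents(["ProducerAgent", "SinkAgent"], poll, timeout_s=1.0)
```

Using a monotonic deadline rather than counting polls keeps the timeout accurate even when announces arrive in bursts.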

### **4. Controller stops starting testcontainers for agents**

Container lifecycle management moves out-of-band:

* Use Podman Compose
* Or Docker Compose
* Or manual container orchestration
* Or K8s manifests

The harness only coordinates the *logical* agents.

### **5. Controller becomes a config distributor**

Once presence is satisfied:

* Build combined config (producer topics, sink topics, patterns, etc.)
* Publish over the appropriate config channel
* Agents begin executing
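
A sketch of the distribution step, assuming the controller already holds the collected announces. The `plan` mapping (agent type to config fragment) and the function name are hypothetical, and each agent is addressed on its per-id channel here; the per-type channel would work the same way.

```python
import json
from typing import Dict, List, Tuple


def build_config_messages(
    announces: List[dict], plan: Dict[str, dict]
) -> List[Tuple[str, str]]:
    """Return one (topic, payload) pair per announced agent."""
    messages = []
    for announce in announces:
        # `plan` maps agent type -> config fragment taken from the test plan.
        config = dict(plan.get(announce["agent_type"], {}))
        topic = f"agent.config.{announce['agent_id']}"
        messages.append((topic, json.dumps(config, sort_keys=True)))
    return messages


announces = [
    {"agent_type": "ProducerAgent", "agent_id": "producer-1"},
    {"agent_type": "SinkAgent", "agent_id": "sink-1"},
]
plan = {
    "ProducerAgent": {"topics": ["orders"], "pattern": "burst"},
    "SinkAgent": {"topics": ["orders"]},
}
messages = build_config_messages(announces, plan)
```

Returning the messages instead of publishing them directly keeps the builder testable without a broker connection.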

### **6. Update sink + producer config logic**

* Producers: same model (pattern, topics, durations)
* Sinks: derive their topic list from producer announce metadata, so most tests no longer need explicit sink config.
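
Deriving sink topics from producer announces could be as small as the sketch below. It assumes announces are dictionaries carrying the `agent_type` and `topics` fields from the announce schema; the function name is hypothetical.

```python
def derive_sink_topics(announces):
    """Collect the de-duplicated, sorted topics announced by producers."""
    topics = set()
    for announce in announces:
        if announce.get("agent_type") == "ProducerAgent":
            topics.update(announce.get("topics", []))
    return sorted(topics)


sink_topics = derive_sink_topics(
    [
        {"agent_type": "ProducerAgent", "agent_id": "p1", "topics": ["orders", "events"]},
        {"agent_type": "ProducerAgent", "agent_id": "p2", "topics": ["orders"]},
        {"agent_type": "SinkAgent", "agent_id": "s1"},
    ]
)
```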

### **7. Update TestPlan schema**

* Remove `StartAgent` actions in most scenarios
* Replace with:

```dhall
environment.expectedAgents = [ ProducerAgent, SinkAgent, ... ]
```

* Phases become:

  * setup (wait for presence)
  * execution (sleep, custom events)
  * validation (assertions)

### **8. Update teardown**

* Controller publishes `STOP` to `agent.control.<agent_id>`
* Agents exit cleanly
* Underlying infra is cleaned up by Compose/K8s
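
A sketch of the agent side of teardown, assuming the agent's `agent.control.<agent_id>` subscription is exposed as a local queue of string messages. `run_until_stop` and the `PING` message are hypothetical; only `STOP` comes from this proposal.

```python
import queue


def run_until_stop(control, do_work, poll_s=0.01):
    """Run `do_work` repeatedly until STOP arrives; return the iteration count."""
    iterations = 0
    while True:
        try:
            if control.get(timeout=poll_s) == "STOP":
                return iterations  # exit cleanly, leaving infra cleanup to Compose/K8s
        except queue.Empty:
            pass  # no control message yet; keep working
        do_work()
        iterations += 1


control = queue.Queue()
control.put("PING")  # any non-STOP message is ignored by this sketch
control.put("STOP")
work_log = []
iterations = run_until_stop(control, lambda: work_log.append("tick"))
```

Checking the control channel between units of work means an agent stops at a well-defined point rather than mid-operation.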

---

## **Detailed Tasks**

### **A. Broker Protocol / Topics**

* [ ] Define announce topic format
* [ ] Define config topic format
* [ ] Define control/stop topic format
* [ ] Add versioning and agent_type verification
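
As a starting point for these tasks, the topic names used throughout this issue can be generated by small helpers, and an announce can be checked against the topic it arrived on. The helper names and the `KNOWN_AGENT_TYPES` registry are hypothetical, and versioning is left out because its format is still to be defined.

```python
KNOWN_AGENT_TYPES = {"ProducerAgent", "SinkAgent"}  # hypothetical registry

ANNOUNCE_PREFIX = "agent.announce."


def announce_topic(agent_type):
    return f"{ANNOUNCE_PREFIX}{agent_type}"


def config_topic(target):
    """`target` is either an agent_id or an agent_type."""
    return f"agent.config.{target}"


def control_topic(agent_id):
    return f"agent.control.{agent_id}"


def verify_announce(topic, payload):
    """True if the payload's agent_type is known and matches the announce topic."""
    if not topic.startswith(ANNOUNCE_PREFIX):
        return False
    topic_type = topic[len(ANNOUNCE_PREFIX):]
    return topic_type in KNOWN_AGENT_TYPES and payload.get("agent_type") == topic_type


ok = verify_announce(
    announce_topic("ProducerAgent"),
    {"agent_type": "ProducerAgent", "agent_id": "producer-1"},
)
```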

### **B. Agent Code Refactor**

For *every agent*:

* [ ] Implement announce on startup
* [ ] Implement announce schema serialization
* [ ] Implement config listener
* [ ] Refactor worker startup to depend on config message
* [ ] Add STOP listener to support controlled shutdown

### **C. Harness Refactor**

* [ ] Add presence-tracking subsystem
* [ ] Add announce subscription logic
* [ ] Add timeout + failure logic
* [ ] Replace `StartAgent` container logic with presence-based gating
* [ ] Implement config broadcasting
* [ ] Wire presence + config into phase execution

### **D. TestPlan Updates**

* [ ] Update Dhall schemas
* [ ] Remove or deprecate `StartAgent` in test phases
* [ ] Add `expectedAgents`
* [ ] Update ProducerConfig/SinkConfig semantics
* [ ] Update docs + examples

### **E. Operational Updates**

* [ ] New podman-compose.yml for local multi-agent testing
* [ ] Update developer scripts to run agents independently
* [ ] Provide examples of standalone agent startup outside harness

---

## **Migration Strategy**

1. Implement announce protocol in *one* agent first (e.g., ProducerAgent)
2. Update harness to support presence detection for that one type
3. Gradually migrate remaining agents
4. Remove old configuration handshake
5. Remove testcontainers dependency for agents entirely
6. Update all tests to use the presence-based workflow

Migration can be staged without breaking the entire system.

---

## **Risks**

* Requires refactoring every agent
* Requires coordinated rollout of new protocol
* Some tests will break until both sides speak the new announce protocol
* A few testcontainers tests may need temporary shims

These costs are the price of a model that supports multi-node, multi-host, and externally managed deployments, which the current testcontainers-driven flow cannot.

---

## **Acceptance Criteria**

* Controller does not start agent containers
* Agents self-announce via Cassini
* Controller waits for expected agents before test execution
* Controller sends config over broker
* Agents receive config and start work
* Sinks automatically subscribe to topics defined by producers
* End-to-end test runs without deterministic container startup
* Example test using podman-compose demonstrates full lifecycle
