Operations — infrabroker
Operational runbook: building, running, adding hosts, hot-reload, PKI, and reference configs. For the design rationale see ARCHITECTURE.md; for the security posture see THREAT_MODEL.md.
Table of contents
- Starting the system
- Adding a host
- Hot reload
- broker-ctl
- Local PKI
- Reference config files
- Monitoring
- Production deployment
1. Starting the system
cd /path/to/infrabroker
# 1. Start the signer (must be running before the broker starts)
./signer.sh start # background, PID in signer.pid, log in signer.log
./signer.sh status
./signer.sh log # tail -f signer.log
./signer.sh stop
./signer.sh restart
# 2. The MCP (mcp-broker) is started by the MCP client (e.g. OpenCode / Claude
# Code) on connect. It requires the signer to be running: if it cannot
# GET /v1/hosts, the broker fails to start.
# 3. Rebuild after changes (make embeds the git-tag version into the binaries)
make install # all binaries → ~/bin
make signer # or just one
Compiled binaries: ~/bin/mcp-broker · ~/bin/mcp-broker-http · ~/bin/signer
· ~/bin/broker-ctl · ~/bin/broker · ~/bin/control-plane. make install injects the version from git describe
--tags; a plain go build ./cmd/... still works but reports a dev-<commit>
version. Run make version to see what would be embedded.
Order matters: always start the signer before opening the MCP client. With multiple broker replicas, note that session/approval/behavior state is in-memory per process (single-instance only — see THREAT_MODEL.md).
What survives a restart: with state_db set (signer and control plane),
runtime grants/waivers and pending or approved-but-uncollected approvals are
persisted (SQLite, write-through) and restored at startup. Live SSH sessions
and the behaviour baseline are intentionally not persisted: a TCP connection
cannot be resurrected, and the baseline re-learns. Without state_db,
restarts drop grants/waivers/approvals as before (fail-safe). Back up the
.db together with its -wal/-shm sidecar files.
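A backup along these lines keeps the database and its sidecars together. This is a minimal sketch, not part of the shipped tooling: `backup_state_db` is a hypothetical helper, and it assumes the service that owns the database is stopped while the files are copied so the copies are mutually consistent.

```shell
# Hypothetical helper (not shipped with infrabroker): copy a state_db file
# together with its -wal/-shm sidecars into a backup directory.
# Assumption: the owning service is stopped while this runs.
backup_state_db() {
    db=$1
    dest=$2
    [ -f "$db" ] || { echo "no such database: $db" >&2; return 1; }
    mkdir -p "$dest" || return 1
    cp -p "$db" "$dest/" || return 1
    # The sidecars may be absent (for example after a clean shutdown);
    # copy them only when they exist.
    for suffix in -wal -shm; do
        if [ -f "$db$suffix" ]; then
            cp -p "$db$suffix" "$dest/" || return 1
        fi
    done
}

# Usage (both paths are placeholders):
#   backup_state_db /path/to/state.db /path/to/backup-dir
```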
2. Adding a host
signer.json is the single source of truth. Edit it (or use broker-ctl
host add) and reload the signer; the broker picks up the change in ≤
hosts_refresh_seconds without a restart.
"hosts": {
"web01": {
"addr": "10.0.0.21:22",
"user": "deploy",
"host_key": "ssh-ed25519 AAAA...",
"principal": "host:web01",
"source_address": "",
"max_ttl_seconds": 120,
"allow_as_bastion": false,
"groups": ["prod-web"], // RBAC: groups this host belongs to
"allow_sudo": true,
"allowed_sudo_users": ["root", "deploy"],
"allow_pty": true
}
},
"callers": {
"broker-1": { "allowed_groups": ["prod-web"] } // CN → allowed groups
}
Bastions: if the host uses "jump": "bastion", the bastion must share the host's groups, or the broker cannot resolve the jump chain.
Backward compatible: a CN absent from callers has no group restriction and sees every host, unless the table has a reserved "_default" entry, which absent CNs then inherit. Recommended for production: "_default": { "allowed_groups": [] } makes the table default-deny, so forgetting to list a new broker CN fails closed instead of open.
Obtain the host_key:
ssh-keyscan -t ed25519 <ip-or-hostname>
# copy only the "ssh-ed25519 AAAA..." part (without the hostname prefix)
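The trimming described in the comment above can be scripted. The sketch below is illustrative only: `host_key_only` is a hypothetical helper, and it assumes the usual `ssh-keyscan` line layout of host, key type, and base64 key separated by spaces. Because `ssh-keyscan` trusts whatever answers on the network, compare the result against a fingerprint obtained out-of-band before putting it in signer.json.

```shell
# Hypothetical helper: reduce ssh-keyscan output lines
# ("<host> <key-type> <base64-key>") to the "<key-type> <base64-key>" part
# that the host_key field expects. Lines starting with '#' are skipped.
host_key_only() {
    awk '!/^#/ && NF >= 3 { print $2, $3 }'
}

# Usage (needs network access to the host):
#   ssh-keyscan -t ed25519 10.0.0.21 2>/dev/null | host_key_only
```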
Remote host configuration
In the target's /etc/ssh/sshd_config:
TrustedUserCAKeys /etc/ssh/infrabroker_ca.pub # copy pki/ssh_ca.pub
AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u
LogLevel VERBOSE
AllowTcpForwarding no # yes on bastions
X11Forwarding no
PermitTunnel no
# PermitTTY yes # default; uncomment only if it was disabled
Create /etc/ssh/auth_principals/<user> with the host's principal (e.g.
host:web01). For elevation, add the sudoers entry described in
ARCHITECTURE.md § Privilege elevation.
See also deploy/sshd_config.snippet.
2.1 Adding a Kubernetes cluster (optional)
The signer can also broker Kubernetes access (credential-broker; see
ARCHITECTURE.md § Kubernetes target).
Clusters live under kubernetes.clusters in signer.json, are default-deny,
and are hot-reloadable like hosts. Setup has two sides — in the cluster and in
signer.json.
In the cluster: create a least-privilege minter ServiceAccount whose only RBAC is minting bound tokens for the agent SAs, and one or more agent SAs with the Roles the agent actually needs (layer B):
# The minter: its ENTIRE RBAC is `create` on serviceaccounts/token for the
# agent SAs. A signer compromise yields token-minting for those SAs, nothing more.
apiVersion: v1
kind: ServiceAccount
metadata: { name: infrabroker-minter, namespace: agents }
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata: { name: infrabroker-minter, namespace: agents }
rules:
  - apiGroups: [""]
    resources: ["serviceaccounts/token"]
    resourceNames: ["broker-platform", "broker-readonly"] # the agent SAs
    verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata: { name: infrabroker-minter, namespace: agents }
roleRef: { apiGroup: rbac.authorization.k8s.io, kind: Role, name: infrabroker-minter }
subjects: [{ kind: ServiceAccount, name: infrabroker-minter, namespace: agents }]
---
# An agent SA (layer B). Give it exactly the cluster RBAC the agent may use;
# the broker's action policy (layer A) can only narrow this, never widen it.
apiVersion: v1
kind: ServiceAccount
metadata: { name: broker-platform, namespace: agents }
# ...bind it to Roles/ClusterRoles for the resources your rules allow.
Mint the minter's own token and store it where token_file points; the signer
re-reads it per mint, so you can rotate it out-of-band. It is a standing
cluster credential — write it 0600 owned by the signer's service user
(infrabroker-signer in the production layout, §8) and never under the
group-readable /etc/infrabroker/pki root:
umask 077
kubectl -n agents create token infrabroker-minter --duration=8760h \
> /var/lib/infrabroker/signer/pki/prod-k8s-minter.token
kubectl config view --raw -o jsonpath='{.clusters[0].cluster.certificate-authority-data}' \
| base64 -d > /var/lib/infrabroker/signer/pki/prod-k8s-ca.crt
chown infrabroker-signer:infrabroker-signer /var/lib/infrabroker/signer/pki/prod-k8s-minter.token
chmod 0600 /var/lib/infrabroker/signer/pki/prod-k8s-minter.token
In signer.json: add the cluster under kubernetes.clusters (see
signer.example.json for a fully-commented block). Cluster names must be
disjoint from host names. Verify end-to-end after a reload:
3. Hot reload
The signer re-reads signer.json without restarting, atomically replacing the
hosts policy, the Kubernetes clusters, max_ttl_seconds,
reload_callers, and the CA key(s). If the new config is invalid (including a
bad cluster rule or an unreadable ca_cert/token_file), the previous state is
preserved. listen, TLS, and audit_log require a full restart.
broker-ctl reload # SIGHUP if local, else POST /v1/reload (mTLS)
# alternatives:
kill -HUP "$(cat signer.pid)"
./signer.sh restart
- POST /v1/reload (mTLS): only CNs in reload_callers may invoke it (others → 403). Empty reload_callers disables the HTTP endpoint.
- SIGHUP: local reload, bypasses the allowlist.
The broker does not need a reload: it refreshes /v1/hosts every
hosts_refresh_seconds for its cached server list. New ssh_execute and
ssh_session_open calls refresh /v1/hosts immediately before building SSH
hops and fail closed if the signer/control-plane cannot provide the current host
view.
Command-policy and target/bastion authorization changes are evaluated by the
signer on every new certificate and on every ssh_session_exec preflight.
Existing mode=exec sessions therefore start enforcing a new policy on their
next command. Existing mode=shell / mode=pty sessions are rejected on their
next command once a policy becomes active, because their stateful command stream
cannot be verified per command.
If a host's physical SSH route changes (addr, user, host_key, or jump),
already-open sessions are rejected on their next command and must be reopened so
they authenticate to the new route.
4. broker-ctl
Global options (before the subcommand):
broker-ctl [--config <signer.json>] [--client-config <broker-ctl.json>] <command> [args]
broker-ctl --version [--verbose] # print the build version
--config is a global option and must precede the subcommand
(broker-ctl --config /etc/signer.json host list), consistent with the other
binaries. It defaults to ./signer.json.
Breaking change (v1.15.0): --config no longer works after the subcommand. Replace broker-ctl host list --config f with broker-ctl --config f host list.
Client configuration (remote commands)
The remote commands (reload, policy add/remove/grant/grants/revoke,
approval list/allow/deny, host list --remote) need a URL and an mTLS
identity. Instead of repeating --url/--cert/--key/--ca on every call, put
them in a client parameters file (this is client-side config — the service
policy stays in signer.json):
{
"signer": { "url": "127.0.0.1:9443", "cert": "pki/broker.crt", "key": "pki/broker.key", "ca": "pki/mtls_ca.crt" },
"control_plane": { "url": "127.0.0.1:7443", "cert": "pki/broker-admin.crt", "key": "pki/broker-admin.key", "ca": "pki/mtls_ca.crt" }
}
The relative pki/* paths above are the lab layout. In the production
per-service install (§8) the admin CLI material is root-only under
/etc/infrabroker/pki/admin/, and the seeded /etc/infrabroker/broker-ctl.json
points both sections at pki/admin/admin.{crt,key} with the shared
pki/mtls_ca.crt — no service user can read it and impersonate the admin.
Search order: --client-config → $BROKER_CTL_CONFIG →
~/.config/broker-ctl/config.json → /etc/infrabroker/broker-ctl.json
(the production installer seeds the last one). The current working directory is
not searched — an implicit ./broker-ctl.json could let a planted file
redirect the CLI's mTLS endpoint and CA trust anchor, so a project-local file
must be named explicitly with --client-config. Per-parameter precedence:
explicit flag > env var > file > built-in default. Environment variables:
BROKER_CTL_SIGNER_{URL,CERT,KEY,CA} for the signer section,
BROKER_CTL_CP_{URL,CERT,KEY,CA} for the control plane. See
broker-ctl.example.json. When a config file omits cert/key/ca, the
built-in ./pki/* default is resolved relative to that file's directory (not
the current working directory), so a partial file cannot pull the mTLS trust
material from wherever broker-ctl happens to run.
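The per-parameter precedence can be pictured as "first non-empty value wins". The sketch below is only an illustration of that ordering, not broker-ctl's actual code: `resolve_param` is a hypothetical function, and treating an empty string as "not provided" is an assumption.

```shell
# Hypothetical illustration of "explicit flag > env var > file > built-in default".
# Assumption: an empty string means "not provided" at that level.
resolve_param() {
    flag_value=$1
    env_value=$2
    file_value=$3
    default_value=$4
    for candidate in "$flag_value" "$env_value" "$file_value"; do
        if [ -n "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    printf '%s\n' "$default_value"
}

# Usage (the last argument is a placeholder default, not a documented one):
#   resolve_param "$url_flag" "${BROKER_CTL_SIGNER_URL:-}" "$url_from_file" "127.0.0.1:9443"
```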
Hosts
# Add host (with automatic ssh-keyscan)
broker-ctl host add --name web01 --addr 10.0.0.1:22 --user deploy --scan \
--sudo --pty --groups prod-web --callers broker-1
# Add host with a manual key
broker-ctl host add --name web01 --addr 10.0.0.1:22 --user deploy \
--host-key "ssh-ed25519 AAAA..." --ttl 120
# Add host with a command policy (allowlist)
broker-ctl host add --name web01 --addr 10.0.0.1:22 --user deploy --scan \
--policy-mode allowlist --allow "^uptime$,^df -h" --shell-parse
# Add host with command-policy audit mode to collect a baseline before enforcing
broker-ctl host add --name web02 --addr 10.0.0.2:22 --user deploy --scan \
--policy-mode allowlist --policy-enforcement audit \
--allow "^uptime$,^df -h,^journalctl "
# Update an existing host preserving its command_policy
broker-ctl host add --name web01 --addr 10.0.0.1:22 --user deploy --scan --force
# (no --policy-* / --allow / --deny flags → CommandPolicy copied from existing entry)
# Update an existing host replacing its command_policy
broker-ctl host add --name web01 --addr 10.0.0.1:22 --user deploy --scan --force \
--policy-mode denylist --deny "rm -rf"
# List hosts (columns: JUMP, SRC_ADDR, SUDO_USERS, CALLERS, POLICY)
broker-ctl host list
# List the LIVE policy from a running signer over mTLS (GET /v1/policy/hosts;
# the client cert CN must be in reload_callers). Same columns as the local
# view, but reflecting the in-memory state after hot-reloads and grants —
# also the recommended post-deploy end-to-end check.
broker-ctl host list --remote
# Remove host
broker-ctl host remove web01
host add flags:
| Flag | Required | Default | Description |
|---|---|---|---|
| --name | ✓ | — | Logical host name |
| --addr | ✓ | — | host:port of the SSH server |
| --user | ✓ | — | Remote SSH account |
| --host-key | ✓* | — | Host key (authorized_keys). - = read stdin |
| --scan | ✓* | — | Fetch the key with ssh-keyscan (alternative to --host-key) |
| --principal | | host:<name> | SSH principal in the cert |
| --ttl | | 120 | max_ttl_seconds |
| --jump | | — | Name of the preceding bastion |
| --source-address | | — | Bastion egress IP/CIDR |
| --sudo | | false | allow_sudo=true |
| --sudo-users | | — | allowed_sudo_users (comma-separated) |
| --pty | | false | allow_pty=true |
| --groups | | — | RBAC groups (comma-separated) |
| --file-transfer | | false | allow_file_transfer=true (ssh_put_file / ssh_get_file) |
| --callers | | — | CNs allowed on this host (comma-separated) |
| --bastion | | false | allow_as_bastion=true |
| --force | | false | Update if it exists, preserving every field whose flag you don't pass (see note) |
| --policy-mode | | — | allowlist \| denylist \| off |
| --policy-enforcement | | — (empty = enforce) | enforce \| audit; audit allows commands but emits would-deny / would-require-approval warnings |
| --allow | | — | Allowlist patterns (RE2 regex, comma-separated) |
| --deny | | — | Denylist patterns (RE2 regex, comma-separated) |
| --require-approval | | — | Require-approval patterns (RE2 regex, comma-separated) |
| --shell-parse | | false | Parse commands as POSIX sh before evaluating the policy |
* Either --host-key or --scan is required, but not both. --scan honours
the port in --addr (and IPv6 literals).
Partial update with --force (v1.12.6): a --force update starts from the existing entry and overrides only the fields whose flags you pass; any field you omit (sudo, groups, callers, TTL, command_policy, …) keeps its current value. So host add --name web01 --addr newip:22 --user deploy --scan --force changes just the address and leaves sudoers/groups/policy intact. A flag set explicitly to empty (--groups "", --sudo=false) still clears its field. (--addr, --user, and --host-key/--scan are always required and thus always written.)
Command-policy sub-flags are also merged field-granularly (v1.13.0): passing e.g. only --require-approval updates that list and keeps the existing --policy-mode / --policy-enforcement / --allow / --deny / --shell-parse. Previously any single policy sub-flag rebuilt the whole command_policy from flag defaults, silently downgrading the host to mode:off (firewall disabled, sessions re-enabled).
Baseline workflow: start a candidate firewall with --policy-enforcement audit, let real ssh_execute and ssh_session_exec traffic run, then inspect warnings in broker-ctl audit show and mine suggestions with broker-ctl policy recommend. Switch to --policy-enforcement enforce only after reviewing the proposed allow/deny rules.
CA keys
broker-ctl ca-keys add --name _default --type pem --path pki/ssh_ca
broker-ctl ca-keys add --name prod-web --type akv \
--vault-url https://myvault.vault.azure.net/ --key-name ssh-ca-web
broker-ctl ca-keys list
broker-ctl ca-keys remove prod-web
Callers (group RBAC table)
broker-ctl callers add --name broker-1 --groups prod-web,staging
broker-ctl callers add --name broker-1 --groups prod-web --force # update
broker-ctl callers add --name _default --groups "" # default-deny unlisted CNs
broker-ctl callers list
broker-ctl callers remove broker-1
An explicitly-empty --groups "" writes allowed_groups: [] (deny every
host); combined with the reserved name _default it applies to every CN not
explicitly listed, turning the table default-deny.
Reload
broker-ctl reload
broker-ctl --config /path/to/signer.json reload # alternative config (global flag)
Command policy: explain, recommend, mutate (v1.17.0)
# Explain a host's composed (group + inline) command policy, evaluate a command offline
broker-ctl policy explain --host web01 --command 'systemctl restart nginx'
# Mine an audit log for advisory suggestions (read-only — changes nothing)
broker-ctl policy recommend --audit signer_audit.log --min-count 5
# [PROMOTE] web01 ^systemctl restart nginx$ 47x, 47 human-approved
# [DEAD] web01 ^journalctl 0 matches in window -> review/remove
# Durable change via the validated mutation API (mTLS; CN must be in reload_callers).
# Validated before persist, written atomically, applied in-memory, audited:
broker-ctl policy add --host web01 --allow '^systemctl status [a-z0-9_.-]+$'
broker-ctl policy remove --host web01 --allow '^journalctl '
Runtime grants: temporary, expiring widening (v1.18.0)
A grant widens an allowlist host for a while without editing signer.json —
it lives in memory and expires on its own. Operator-only (mTLS, CN in
reload_callers), audited, and widen-only: a grant only adds allow patterns,
applies only on a host that is already allowlist-active, and can never override a
baseline deny. Cap the maximum TTL with max_grant_ttl_seconds in signer.json.
# Incident: web01 (allowlist) denies 'systemctl restart nginx'. Grant it for 2 hours.
broker-ctl policy grant --host web01 --allow '^systemctl restart nginx$' --ttl 2h
# → granted on web01: allow "^systemctl restart nginx$" for 2h0m0s (id 42d1..., expires ...Z)
# Verify without running anything (dry-run flips denied -> allowed):
broker-ctl policy explain --host web01 --command 'systemctl restart nginx' # static view
# …and from the agent side, ssh_execute --dry_run now reports ALLOWED.
# Scope a grant to one broker CN or one end user (default = host-wide):
broker-ctl policy grant --host web01 --allow '^systemctl restart nginx$' --ttl 2h --caller broker-1
broker-ctl policy grant --host web01 --allow '^systemctl restart nginx$' --ttl 2h --end-user alice
# List active grants; revoke early (otherwise it just expires):
broker-ctl policy grants
# ID HOST EXPIRES (UTC) SCOPE RULES
# 42d1eabd7c73b474c85e75a7 web01 2026-06-19T14:00:00Z any allow[^systemctl restart nginx$]
broker-ctl policy revoke 42d1eabd7c73b474c85e75a7
Notes: a grant on a non-allowlist host is refused (409 — it would be a no-op
and would invert the host to default-deny); grants survive a config reload, and
with state_db set they also survive a signer restart (without it they are
dropped — fail-safe, the baseline is more restrictive); every create/revoke is in
the signed audit log (grant-created / grant-revoked).
Approvals (mTLS to the control plane, approver cert)
broker-ctl approval list
broker-ctl approval allow <id>
broker-ctl approval deny <id>
# Approve-and-learn (v1.18.0): also waive RE-approval for this exact command for a
# while, so it runs without prompting again until the waiver expires for the same
# broker/end-user subject. The signer mints an approval waiver scoped to the
# original broker CN and end user (honoured only because the control plane is a
# trusted_forwarder); it shows up in 'policy grants' and is revocable like any grant.
broker-ctl approval allow <id> --learn --ttl 2h
broker-ctl policy grants # the waiver appears as waive-approval[^cmd$]
broker-ctl policy revoke <grant-id> # end it early (otherwise it just expires)
A waiver only un-gates an already-allowed command (it never widens allow/deny),
so it is safe even on a default-allow host that carries a require_approval rule. The
waiver is scoped to the approved caller/end-user and elevation, and the TTL is clamped
to max_grant_ttl_seconds if that cap is set. Every mint is audited
(approval-waiver-created, linked to the originating approval id).
Browser UI: the control plane also serves an approval UI at https://<control-plane>/ui/approvals (list) and /ui/approvals/{id} (detail with Approve / Deny and the approve-and-learn TTL). Auth is the browser's mTLS client certificate: import an approver cert (CN in approval.callers) into the browser. Point approval_url_template at https://<control-plane>/ui/approvals/{id} so Teams/webhook notification links land on the request page.
approval.timeout_seconds in control-plane.example.json controls both halves of
the approval lifecycle: a pending request must be decided before that TTL elapses
from creation, and an approved request must be collected by the broker before the
same TTL elapses from the decision. Approved requests are consumed once.
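A worked example may make the two halves concrete. The sketch below only restates the arithmetic described above: `approval_deadlines` is a hypothetical helper, and the timestamps are arbitrary Unix-second values chosen for illustration.

```shell
# Hypothetical helper: given the creation time and decision time (Unix
# seconds) and approval.timeout_seconds, print the two deadlines.
approval_deadlines() {
    created=$1
    decided=$2
    timeout=$3
    echo "decide_by=$((created + timeout)) collect_by=$((decided + timeout))"
}

# A request created at t=1000 and approved at t=1200 with a 300 s timeout:
approval_deadlines 1000 1200 300
```

Because the second window starts at the decision, a request can remain collectable for up to twice the configured timeout after it was created.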
Audit
# Follow the broker log live (shows the last 20 lines first)
broker-ctl audit tail --log audit.log
broker-ctl audit tail --log audit.log -n 50
# Follow the signer log (certificate issuances)
broker-ctl audit tail --log signer_audit.log
# Filter (host, caller, outcome, date; combinable)
broker-ctl audit show --log audit.log --host web01
broker-ctl audit show --log audit.log --outcome denied
broker-ctl audit show --log signer_audit.log --outcome issued --since 2026-06-05
broker-ctl audit show --log audit.log --host db01 --outcome denied --limit 20
# JSON for jq pipelines
broker-ctl audit show --log audit.log --outcome denied --json | jq .
broker-ctl audit show --log audit.log --json | jq 'select(.serial==1042)'
# Verify the hash chain
broker-ctl audit verify --log audit.log
broker-ctl audit verify --log signer_audit.log
# Verify chain + Ed25519 signatures
broker-ctl audit verify --log audit.log --key pki/audit.seed
broker-ctl audit verify --log signer_audit.log --key pki/signer_audit.seed
# Verify the WHOLE chain across rotated segments (<log> plus <log>.<timestamp>),
# checking cross-file linkage so a dropped or truncated segment is detected.
# Single-file verify accepts the first prev_hash as an unchecked seed; --all does not.
broker-ctl audit verify --log audit.log --all --key pki/audit.seed
Recovering a torn audit log (signer won't boot)
The signer is fail-closed on audit-log corruption. If a crash or power loss
tears the final record mid-write (a truncated, unparseable trailing line), the
signer refuses to start — you will see a fatal
audit: restoring audit chain: parsing last log entry: … — rather than silently
continuing over a gap. On a tamper-evident, hash-chained log a truncated tail is
indistinguishable from a truncation attack, so recovery is a deliberate operator
action, never automatic:
# 1. Inspect what would be dropped (dry-run — makes NO changes):
broker-ctl audit repair --log /var/lib/infrabroker/signer/signer_audit.log
# 2. (optional) confirm the kept prefix's signatures are intact first:
broker-ctl audit repair --log signer_audit.log --key pki/signer_audit.seed
# 3. Apply: quarantine the torn bytes to <log>.corrupt-<timestamp> and truncate
# the log to the last well-formed record so the signer can boot. The hash
# chain continues from there; keep the quarantine file for forensics.
broker-ctl audit repair --log signer_audit.log --apply
repair only ever removes a contiguous corrupt suffix. If it finds a malformed
record before a well-formed one (mid-file corruption, which does not block
startup because the signer reads the last line), it refuses and points you to
audit verify to investigate.
See USAGE.md § 7 for the full audit-review guide (jq recipes, field reference, chain-integrity details).
Version
Every binary reports its build version. Short by default (script-friendly),
detailed with --verbose:
broker-ctl --version # e.g. v1.15.0
broker-ctl --version --verbose # version + Go toolchain + os/arch + VCS revision
broker-ctl version # equivalent subcommand form
broker-ctl version --verbose
signer --version # same flags on every binary
broker --version --verbose
The version is injected from the git tag at build time (make build); a plain
go build falls back to the module version or the VCS revision recorded by the
Go toolchain, so it is never a stale hard-coded string.
5. Local PKI
Generated locally — never commit pki/ to git (it holds private keys).
| File | Description | Rotate when |
|---|---|---|
| pki/ssh_ca | SSH CA private key (Ed25519) | CA rotation |
| pki/ssh_ca.pub | SSH CA public key | — (copy to hosts as TrustedUserCAKeys) |
| pki/mtls_ca.{key,crt} | TLS CA (self-signed, 10y) for broker↔signer mTLS | 2036 |
| pki/signer.{key,crt} | Signer server cert (SAN: 127.0.0.1, localhost) | 2036 |
| pki/broker.{key,crt} | Broker client cert (CN=broker-1) | 2036 |
| pki/audit.seed | Ed25519 seed for the broker log | do not rotate (breaks the chain) |
| pki/signer_audit.seed | Ed25519 seed for the signer log | do not rotate (breaks the chain) |
Production CA custody belongs in an HSM/KMS/Secure Enclave. The seam is ready: ca.LoadCAFromPEM returns an ssh.Signer; replace it with ssh.NewSignerFromSigner(kmsClient) (AKV already supported; see ARCHITECTURE.md § Multi-CA).
Rotating keys and certificates
The system issues ephemeral SSH credentials, but its own control-plane PKI is long-lived and must be rotated deliberately. There is no automation for this yet — follow these procedures.
SSH CA key (pki/ssh_ca). Hosts pin it via TrustedUserCAKeys, so rotation
needs a transition window where both the old and new CA are trusted:
1. Generate the new CA key and add it to signer.json as a per-group CA (ca_keys, see ARCHITECTURE.md § Multi-CA) or stage it alongside the current ca_key.
2. Distribute the new public key to every managed host, appending it to the TrustedUserCAKeys file (a host may trust multiple CA keys; keep both lines during the transition). Reload sshd (systemctl reload sshd).
3. Switch issuance to the new CA (point the host group at the new ca_keys entry, or replace ca_key) and broker-ctl reload the signer.
4. Once all live certificates signed by the old CA have expired (≤ max_ttl, i.e. minutes), remove the old public key from every host's TrustedUserCAKeys and reload sshd.
Multi-CA (v1.11.0) makes steps 1–3 per host group, so you can rotate one group at a time instead of the whole fleet.
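The key-distribution step is safer when it is idempotent, so that re-running it on a host never duplicates a line. A minimal sketch, assuming the TrustedUserCAKeys file holds one public key per line: `trust_ca_key` is a hypothetical helper, and reloading sshd afterwards is left to the operator.

```shell
# Hypothetical helper: append a CA public key line to a TrustedUserCAKeys
# file only if that exact line is not already present.
trust_ca_key() {
    ca_file=$1
    pub_line=$2
    touch "$ca_file" || return 1
    # If the file does not end with a newline, add one so lines never fuse.
    if [ -s "$ca_file" ] && [ -n "$(tail -c 1 "$ca_file")" ]; then
        echo >> "$ca_file"
    fi
    if ! grep -qxF -e "$pub_line" "$ca_file"; then
        printf '%s\n' "$pub_line" >> "$ca_file"
    fi
}

# Usage on a managed host (the key file name is a placeholder):
#   trust_ca_key /etc/ssh/infrabroker_ca.pub "$(cat new_ssh_ca.pub)"
```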
mTLS PKI (pki/mtls_ca, signer.crt, broker.crt, control-plane cert).
These are self-signed with a 10-year validity, which is itself a long-lived
credential. To rotate the issuing mtls_ca (the higher-impact case):
1. Generate a new mtls_ca and issue new server/client certs from it.
2. During transition, configure each service's client_ca to trust both the old and new CA (concatenate the two CA PEMs into the file referenced by client_ca). Restart the services (TLS config is not hot-reloaded).
3. Roll out the new client certs (broker.crt, control-plane cert) and server certs (signer.crt).
4. Remove the old CA from the client_ca bundles and restart.
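Concatenating the two CA PEMs for the transition can be done along these lines. This is a sketch under stated assumptions: `build_ca_bundle` is a hypothetical helper and all three paths are chosen by the operator. It writes to a temporary file first so a service restarting mid-write never reads a partial bundle.

```shell
# Hypothetical helper: write a client_ca bundle that contains both the old
# and the new CA certificate.
build_ca_bundle() {
    old_ca=$1
    new_ca=$2
    bundle=$3
    [ -r "$old_ca" ] && [ -r "$new_ca" ] || { echo "CA file not readable" >&2; return 1; }
    tmp_bundle="$bundle.tmp.$$"
    # "awk 1" copies every line and ends each file with a newline, so the
    # END/BEGIN markers of the two PEM blocks never fuse onto one line.
    awk 1 "$old_ca" "$new_ca" > "$tmp_bundle" || { rm -f "$tmp_bundle"; return 1; }
    # Rename into place so a reader never sees a half-written bundle.
    mv "$tmp_bundle" "$bundle"
}

# Usage (file names are placeholders):
#   build_ca_bundle pki/mtls_ca.old.crt pki/mtls_ca.new.crt pki/client_ca_bundle.crt
```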
To rotate only a leaf cert (e.g. a compromised broker.crt) without changing the
CA: issue a new cert from the existing mtls_ca, deploy it, and — because there
is no CRL on the mTLS path — rely on the broker CN allowlists (callers,
allowed_callers, reload_callers, trusted_forwarders) to deny the old CN if
it must be revoked before expiry.
Audit seeds are not certificates and must not be rotated: replacing pki/*.seed breaks the hash/signature chain of existing logs (see the table above). Archive the seed with the log if you ever retire a log file.
6. Reference config files
| File | Purpose |
|---|---|
| config.json | Active broker config (remote mode) |
| config.example.json | Reference with local + remote modes; allow_sudo/allow_pty/command_policy/approval_wait_seconds |
| signer.json | Active signer config (single source of truth for hosts) |
| signer.example.json | Reference with per-host allow_sudo/allowed_sudo_users/allow_pty/groups/command_policy + callers + trusted_forwarders |
| control-plane.example.json | Control plane reference: signer block, sign_callers (broker/approver role separation), approval (notifier/callers/timeout), behavior, trusted_forwarders, mTLS |
| broker-ctl.example.json | Client parameters for the remote broker-ctl commands (signer / control_plane URL + mTLS cert/key/ca); see §4 |
| deploy/sshd_config.snippet | sshd_config fragment + NOPASSWD sudoers for managed hosts |
Common operational notes
- The signer must be running before the broker / MCP client starts.
- hosts_refresh_seconds is optional and defaults to 300 (5 min) when absent or 0, which is already production-appropriate. It is not set in the shipped example configs. Lower it (e.g. 30) only in development to pick up host-list changes from the signer faster.
- To use elevation on a real host: set allow_sudo: true in signer.json, reload the signer, and configure NOPASSWD sudoers on the host. Verify with ssh_execute(server, "id", sudo=true).
- To use PTY: set allow_pty: true and reload. Use ssh_execute(..., pty=true) (one-shot) or ssh_session_open(server, mode="pty") (interactive).
- To use group RBAC (broker mTLS): add "groups" per host and a callers section. Issue a new CN signed by pki/mtls_ca.crt for each restricted broker and add it to callers. Include any bastion in the same groups.
- To use the HTTP+OAuth frontend (cmd/mcp-broker-http): configure the oauth block and resource_url in config.json. Provide server_cert/server_key (no client_ca; auth is the bearer token). For per-user RBAC add "groups_claim": "groups" and the groups field on the relevant hosts.
- Physical broker/signer separation (different machines) requires a new SAN on the signer cert with the real IP/hostname, and updating config.json with that URL.
- Broker/approver role separation (control plane): the signing path (/v1/sign, /v1/hosts, /v1/sign/result) is restricted to brokers. List the broker CNs in sign_callers; with no list, a CN in approval.callers is denied the sign path (an approver is not a broker). This stops an approver certificate, signed by the same client_ca, from originating signing requests.
- Config is strictly validated at load: an unknown or misspelled key (e.g. sign_caller instead of sign_callers) is rejected at startup/reload rather than silently ignored, so a typo cannot quietly leave a setting open. _* comment keys and the reserved _default group are still accepted.
7. Monitoring
Every service accepts an optional monitor_listen config key (empty or absent
= disabled) that starts a separate plain-HTTP listener with two endpoints:
| Endpoint | Purpose |
|---|---|
| /healthz | Liveness: 200 ok while the process is serving. Use it for load-balancer/systemd/container health checks. |
| /metrics | Metrics in the Prometheus text exposition format. |
The broker config key covers all three broker frontends (broker,
mcp-broker, mcp-broker-http); the signer and control plane have their own
key in signer.json / control-plane.json.
Security: the listener has no authentication and no TLS. Bind it to 127.0.0.1 or a private scrape interface, never a public address. It is deliberately a separate listener so the mTLS/OAuth service ports stay clean.
Metrics
| Metric | Service | Meaning |
|---|---|---|
| signer_sign_requests_total{outcome} | signer | POST /v1/sign requests by audit outcome (issued, denied, approval-required, dry_run_*, …) plus rate-limited, which is counted here but deliberately not audited. |
| controlplane_events_total{outcome} | control plane | Audit events by outcome (forwarded, denied, anomaly, rate-limited, approval-*, error). |
| controlplane_approvals_pending | control plane | Approval requests currently awaiting a human decision (gauge, read at scrape time). |
| broker_events_total{outcome} | broker frontends | Audit events by outcome (executed, denied, session_open, session_exec, session_close, error, …). |
| broker_sessions_active | broker frontends | Persistent SSH sessions currently open (gauge). |
| audit_append_failures_total | all | Audit-log Append errors. Alert on any increase: the operation continues by design (threat-model gap #9), so this counter is the only machine-readable signal that the audit trail has a gap. |
Example scrape check:
curl -s http://127.0.0.1:9160/healthz
curl -s http://127.0.0.1:9160/metrics | grep signer_sign_requests_total
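For ad-hoc checks without a Prometheus server, a counter can be totalled straight from the scrape output. The sketch below is illustrative: `metric_total` is a hypothetical helper, and it assumes samples carry no trailing timestamp and label values contain no spaces, so the value is the last field on each line.

```shell
# Hypothetical helper: sum every sample of one metric from Prometheus text
# exposition input on stdin. Exits non-zero if the metric is absent.
metric_total() {
    awk -v name="$1" '
        /^#/ { next }    # skip HELP/TYPE comment lines
        $1 == name || index($1, name "{") == 1 { sum += $NF; seen = 1 }
        END { if (seen) print sum + 0; else exit 1 }
    '
}

# Usage (port 9160 as in the example above):
#   curl -s http://127.0.0.1:9160/metrics | metric_total audit_append_failures_total
```

Comparing the printed value between two runs gives a crude version of the "alert on any increase" rule for audit_append_failures_total.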
8. Production deployment
The manual flow above (signer.sh + make install to ~/bin) is the lab
setup. For production, deploy/ in the repository ships hardened systemd
units for the three daemons (signer, control-plane, mcp-broker-http),
an idempotent installer and a release target:
make dist # dist/infrabroker-<version>.tar.gz
# on the target host, as root:
./deploy/install.sh # per-service users, dirs, binaries, units, seed configs
systemctl enable --now infrabroker-signer # always the signer first
Reference layout: binaries in /usr/local/bin, the control-plane / mcp-http
configs and the mTLS PKI in /etc/infrabroker/ (root-owned, never overwritten
on upgrade), audit logs in /var/lib/infrabroker/<svc>/. The signer config
lives in /var/lib/infrabroker/signer/signer.json (service-owned): the durable
policy-mutation API rewrites it in place, so it cannot sit in the read-only
/etc tree. Policy hot-reload maps to systemctl reload infrabroker-signer
(SIGHUP).
Privilege separation (v1.35.0+): each daemon runs as its own system user
(infrabroker-signer, infrabroker-control-plane, infrabroker-mcp-http); the
shared infrabroker group only grants traversal of /etc/infrabroker and read
of the shared mTLS CA cert. Each service's mTLS key lives in its own
/etc/infrabroker/pki/<svc>/ (0750 root:infrabroker-<svc>), the admin CLI
material in pki/admin/ (root-only), and only the CA cert sits at the
pki/ root. A compromised broker frontend therefore cannot read the signer's
CA key, policy, state, or another service's key. See deploy/README.md for the
full layout and the upgrade steps from a single-user install.
The installer also seeds /etc/infrabroker/broker-ctl.json (client parameters,
see §4) so broker-ctl host list --remote works flag-less as the post-deploy
end-to-end check.
CA custody is the operator's choice, made in signer.json → ca_keys:
"akv" (Azure Key Vault — the private key never leaves the vault;
recommended for production; RSA/EC only) or "pem" (local file — lab/dev,
the signer logs a warning). Credentials for AKV come from
DefaultAzureCredential (managed identity, or a service principal via the
unit's optional EnvironmentFile=/etc/infrabroker/signer.env).
The full checklist — custody trade-offs, default-deny callers, rate
limits, upgrade caveats (in-memory approvals/sessions) — lives in
deploy/README.md in the repository.