Production migration: arne/marcus/tarald → one container #3

Closed
opened 2026-05-09 23:24:14 +02:00 by arne · 2 comments

What to build

One-time operational cutover: collapse arne-msg, marcus-msg, tarald-msg containers on fismen into a single new posta container running the multi-tenant daemon.

This is human-in-the-loop (HITL): the operator drives, the agent assists. The procedure:

  1. Provision posta Incus container (Alpine 3.21, default profile)
  2. Build + push multi-tenant posta-server binary as /usr/local/bin/posta-server
  3. Pre-stage /etc/posta/identities.toml, /var/lib/posta/{arne,marcus,tarald}/ directories
  4. Stop the three legacy daemons (service posta stop per container)
  5. Copy keys.json + inbox.db from each old container to the new container's per-identity directory
  6. Write the new /etc/init.d/posta (no --url/--keys/--db flags)
  7. Start the new daemon, verify each identity's actor doc serves locally inside the container
  8. Update Caddyfile: change arne.posta.no, marcus.posta.no, tarald.posta.no blocks to point at the new container's IP; caddy validate + reload
  9. Smoke-test public URLs over Caddy
  10. Leave old containers stopped (not deleted) as rollback artifacts; cleanup is #6

Outage window: ~1–3 minutes between step 4 and step 8. Wire-side peers retry on transient failures.

Acceptance criteria

  • New posta container running on fismen with multi-tenant binary
  • All three identities (arne, marcus, tarald) serve their actor doc via internal localhost AND via Caddy
  • Existing auth tokens (if any) keep working post-migration
  • Old containers (arne-msg, marcus-msg, tarald-msg) stopped but not deleted
  • Rollback procedure documented (flip Caddy back to old IPs)

Blocked by

  • #1

This was generated by AI during triage.

Ready-for-Human Brief

Category: enhancement
Summary: One-time operational cutover collapsing the three legacy single-tenant containers (arne-msg, marcus-msg, tarald-msg) on fismen into a single new posta container running the multi-tenant daemon.

Why this is human-only:
The procedure mutates live production. Specifically: it stops three live daemons, copies SQLite databases and Ed25519 keys between Incus containers, writes a new OpenRC init script, edits the Caddyfile, and reloads Caddy with a 1–3 minute public outage. Each step needs a real human to verify the previous step landed cleanly before continuing — these are judgment calls about live state (is the snapshot current, did the smoke test really pass, did Caddy actually reload) that an AFK agent cannot make safely. External access is required (fismen, Incus, the Caddyfile) and rollback decisions depend on what the operator sees.

Blocked by: #1 (multi-tenant daemon must exist) and ideally #2 (so the operator can use the identity CLI as a sanity check post-migration, even though manual file placement also works).

Cutover runbook:

  1. Provision a new Incus container named posta on fismen — Alpine 3.21, default profile.
  2. Build the multi-tenant posta-server binary from the latest main and push it to /usr/local/bin/posta-server in the new container.
  3. Pre-stage the new container:
    • Write /etc/posta/identities.toml with three [[identity]] entries (arne, marcus, tarald) using their existing canonical URLs.
    • Create /var/lib/posta/{arne,marcus,tarald}/ directories with appropriate ownership/permissions.
  4. Stop the three legacy daemons: service posta stop inside each of arne-msg, marcus-msg, tarald-msg.
  5. Copy keys.json and inbox.db from each old container to the new container's matching per-identity directory. Verify file sizes and permissions after each copy.
  6. Write the new /etc/init.d/posta in the posta container. The new init script must invoke posta-server serve with no --url / --keys / --db / --name flags — only --listen and --manifest (per #1).
  7. Start the new daemon. Verify each identity's actor doc serves correctly via internal localhost inside the container (one curl per identity to its declared URL via /etc/hosts aliases or Host: header).
  8. Update Caddyfile on fismen: change the arne.posta.no, marcus.posta.no, tarald.posta.no site blocks to point at the new container's IP. Run caddy validate, then reload.
  9. Smoke-test all three public URLs end-to-end over Caddy: actor doc fetch, one authenticated /api/v1/* call per identity using existing tokens.
  10. Leave the legacy containers stopped — do not delete them. They are rollback artifacts. Their cleanup is #6.
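
Step 5 asks for a check after each copy. A minimal sketch of such a check, assuming both the original and the copy are readable from one place (for example staged on the fismen host with incus file pull before incus file push). The name verify_copy is hypothetical and not part of posta; permissions still need a separate look with ls -l.

```shell
#!/bin/sh
# Hypothetical helper: confirm a copied file matches its source by
# byte count and SHA-256 digest. Returns 0 on match, 1 otherwise.
verify_copy() {
    src=$1
    dst=$2
    [ -f "$src" ] && [ -f "$dst" ] || { echo "missing: $src or $dst" >&2; return 1; }

    src_size=$(wc -c < "$src" | tr -d ' ')
    dst_size=$(wc -c < "$dst" | tr -d ' ')
    src_sum=$(sha256sum "$src" | cut -d' ' -f1)
    dst_sum=$(sha256sum "$dst" | cut -d' ' -f1)

    if [ "$src_size" = "$dst_size" ] && [ "$src_sum" = "$dst_sum" ]; then
        echo "ok $dst ($dst_size bytes)"
    else
        echo "MISMATCH $src -> $dst" >&2
        return 1
    fi
}
```

Run once per file and per identity, for example verify_copy staging/arne/keys.json pulled-back/arne/keys.json, where both directory names are placeholders for wherever the operator stages the files.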

Rollback procedure (if any of steps 7–9 fails):

  • Revert the Caddyfile change to point back at the original three container IPs and caddy reload.
  • Restart the three legacy daemons: service posta start in each old container.
  • Public traffic returns to the old single-tenant deployment. Investigate the new container offline, then retry the cutover from step 4 once the issue is understood.
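
Because the operator should see every command before it touches production, the rollback can be sketched as a dry-run planner that only prints the commands. This is an illustration under assumptions: rollback_plan is a hypothetical name, the Caddyfile is assumed to live at /etc/caddy/Caddyfile, the backup path is whatever the operator saved before editing, and the legacy containers are assumed to be running with only the posta service stopped (if the containers themselves were stopped, an incus start per container would come first).

```shell
#!/bin/sh
# Hypothetical dry-run planner: prints the rollback commands in the
# order given in the brief, without executing anything.
# Usage: rollback_plan <caddyfile-backup> <legacy-container>...
rollback_plan() {
    backup=$1
    shift
    # 1. Restore the pre-cutover Caddyfile and reload Caddy.
    echo "cp $backup /etc/caddy/Caddyfile"
    echo "caddy validate --config /etc/caddy/Caddyfile"
    echo "caddy reload --config /etc/caddy/Caddyfile"
    # 2. Restart the legacy daemon inside each old container.
    for c in "$@"; do
        echo "incus exec $c -- service posta start"
    done
}
```

Calling rollback_plan with the backup path and arne-msg marcus-msg tarald-msg prints six lines that the operator can read and then run one at a time.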

Acceptance criteria:

  • A new posta container is running on fismen with the multi-tenant binary.
  • All three identities (arne, marcus, tarald) serve their actor doc via internal localhost and via the public Caddy URL.
  • Existing auth tokens issued before the migration still work post-migration (no token reissuance required).
  • The legacy containers (arne-msg, marcus-msg, tarald-msg) are stopped but not deleted.
  • The rollback procedure above is recorded somewhere durable (e.g. in DEPLOY.md or a runbook in the repo) so a future operator can reverse the cutover without re-deriving the steps.
  • Outage window is bounded to the 1–3 minute target; if it exceeds 5 minutes, the operator should consider rolling back rather than pushing through.
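
The outage criterion reduces to two thresholds: 3 minutes (target) and 5 minutes (consider rollback). A small sketch of a timer check the operator could run during the cutover; outage_status is a hypothetical name, and the start time is assumed to be the epoch second at which step 4 began.

```shell
#!/bin/sh
# Hypothetical helper: classify the elapsed outage against the
# 180 s target and the 300 s rollback threshold from the criteria.
# Usage: outage_status <start-epoch> [now-epoch]
outage_status() {
    started=$1
    now=${2:-$(date +%s)}
    elapsed=$((now - started))
    if [ "$elapsed" -le 180 ]; then
        echo "within-target ${elapsed}s"
    elif [ "$elapsed" -le 300 ]; then
        echo "over-target ${elapsed}s"
    else
        echo "consider-rollback ${elapsed}s"
    fi
}
```

Typical use would be to record start=$(date +%s) just before stopping the legacy daemons and call outage_status "$start" between later steps.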

Out of scope:

  • Renaming, deleting, or repurposing the legacy containers — that is #6.
  • Any code changes to posta-server. This issue is a deploy/migration; code work happens in #1 and #2.
  • Caddy site-block restructuring beyond changing the upstream IP per identity.
  • DNS changes — Caddy routes by IP on fismen, so DNS stays untouched.
  • Onboarding a new fourth identity during the cutover. Migrate the existing three first; new identities use the #2 identity add flow afterwards.

This was generated by AI during cutover assistance.

Cutover complete — 2026-05-10

Executed the runbook above. Outage window ~2 minutes; resolved cleanly.

What's running now:

  • New posta container on fismen at 10.228.107.168, multi-tenant daemon serving all three identities by Host dispatch
  • Binary built from commit 9a656c6 on the multi-tenant-rollout branch (PR #7, not yet merged at cutover time — the deployed binary tracks that branch HEAD)
  • Manifest at /etc/posta/identities.toml lists arne, marcus, tarald
  • Per-identity data at /var/lib/posta/<slug>/{keys.json,inbox.db} — keys + DB copied from each legacy container, schema migrated v2 → v3 on first open
  • Init: /etc/init.d/posta invoking posta-server serve --listen 0.0.0.0:80 --manifest /etc/posta/identities.toml
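
For reference, an OpenRC service script around that command line might look like the sketch below. Only the binary path and the serve arguments come from this issue; the backgrounding, pidfile and net dependency are assumptions about how posta-server behaves, and write_posta_init is a hypothetical helper that writes the file to a path of the operator's choosing.

```shell
#!/bin/sh
# Hypothetical helper: write a minimal OpenRC service script for the
# multi-tenant daemon to the given path and make it executable.
write_posta_init() {
    target=$1
    cat > "$target" <<'EOF'
#!/sbin/openrc-run
# Assumed layout; only command and command_args are taken from the issue.
name="posta"
command="/usr/local/bin/posta-server"
command_args="serve --listen 0.0.0.0:80 --manifest /etc/posta/identities.toml"
# Assumes posta-server stays in the foreground, so OpenRC backgrounds it.
command_background="yes"
pidfile="/run/${RC_SVCNAME}.pid"

depend() {
    need net
}
EOF
    chmod 755 "$target"
}
```

Writing to a scratch path first (write_posta_init ./posta.init) lets the operator review the file before pushing it to /etc/init.d/posta in the container.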

Public smoke:

  • https://arne.posta.no/ → 200, actor doc with name=Arne, original key q1lr+YxzHV…
  • https://marcus.posta.no/ → 200, actor doc with original key pKr0g7IXW9…
  • https://tarald.posta.no/ → 200, actor doc with name=Tarald, original key MuKdTmgft…
  • https://arne.posta.no/api/v1/identity (no bearer) → 401 (auth path intact)
  • https://arne.posta.no/setup → 200 (HTML pairing page served)
  • https://arne.posta.no/api/v1/invite/info?invite=pinv_bogus → 410 (invite path live)
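
The checks above can be repeated after any later change with a short script. A sketch, assuming curl is available where it runs; http_status and expect_status are hypothetical helper names, and the URLs and expected codes in the comments are the ones listed above.

```shell
#!/bin/sh
# Fetch a URL and print only the HTTP status code.
http_status() {
    curl -sS -o /dev/null -w '%{http_code}' --max-time 10 "$1"
}

# Compare an expected status code with the one observed.
# Prints PASS/FAIL and returns non-zero on mismatch.
expect_status() {
    want=$1
    got=$2
    label=$3
    if [ "$got" = "$want" ]; then
        echo "PASS $label ($got)"
    else
        echo "FAIL $label (want $want, got $got)"
        return 1
    fi
}

# Example calls:
#   expect_status 200 "$(http_status https://arne.posta.no/)" arne-actor
#   expect_status 401 "$(http_status https://arne.posta.no/api/v1/identity)" arne-auth
#   expect_status 410 "$(http_status 'https://arne.posta.no/api/v1/invite/info?invite=pinv_bogus')" arne-invite
```

Keeping the fetch and the comparison separate means the comparison can be exercised without network access, and a status-only check does not confirm the actor doc's key, which still needs a manual look.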

Token survival: existing rows in auth_tokens carried over per-DB. token list --slug=arne confirms tui-dev still active with last-seen 2026-05-09 20:16 UTC, so devices that hold pre-cutover bearers keep authenticating.

Caddyfile change: three reverse_proxy IPs swapped from 10.228.107.{201,229,205}:80 to 10.228.107.168:80. Backup at /etc/caddy/Caddyfile.bak.posta-multitenant-cutover-20260510-004551.
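
A swap like this can be previewed before it is installed. The sketch below prints a rewritten copy of the Caddyfile to stdout and leaves the original untouched; swap_upstreams is a hypothetical name, and it assumes each upstream is the last token on its reverse_proxy line, preceded by a single space, which may not match the real file.

```shell
#!/bin/sh
# Hypothetical helper: print <file> with each old upstream replaced by
# the new one on reverse_proxy lines. The input file is not modified.
# Usage: swap_upstreams <file> <new-upstream> <old-upstream>...
swap_upstreams() {
    file=$1
    new=$2
    shift 2
    script=""
    for old in "$@"; do
        # Escape dots so each IP matches literally rather than as a wildcard.
        escaped=$(printf '%s' "$old" | sed 's/\./\\./g')
        script="${script}/reverse_proxy/s/ ${escaped}\$/ ${new}/;"
    done
    sed "$script" "$file"
}
```

One possible flow: write the output to a scratch file, compare it with diff -u against /etc/caddy/Caddyfile, run caddy validate --config on the scratch file with --adapter caddyfile, and only then copy it into place and reload.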

Rollback artifacts: arne-msg, marcus-msg, tarald-msg stopped but not deleted; their data dirs intact. Rollback procedure documented in DEPLOY.md (flip Caddyfile back from the timestamped backup, restart the three legacy services).

Acceptance criteria:

  • [x] New posta container running on fismen with multi-tenant binary
  • [x] All three identities serve their actor doc via internal localhost AND via Caddy
  • [x] Existing auth tokens carried over (token list shows pre-cutover entries with original last_seen_at)
  • [x] Old containers stopped but not deleted
  • [x] Rollback procedure documented in DEPLOY.md

#6 (decommission legacy containers) is now unblocked, but the brief specifies a ≥7-day stability window before running incus delete. Safe to run from 2026-05-17 at the earliest, pending a clean log review.

arne closed this issue 2026-05-10 00:50:54 +02:00
Reference
posta/server#3