S4: Rewrite existing peer_url rows to canonical form (with collision merge) #14

Open
opened 2026-05-12 20:27:09 +02:00 by arne · 1 comment
Owner

Parent

posta/server#10 — Absorb spec §4.1/§4.2 canonicalization and §9 key-management simplification

What to build

A new versioned schema migration in internal/store that walks every table
carrying a peer URL — messages, contacts, and any other peer_url-bearing
table — and re-canonicalizes each row to §4.1 canonical form. Where the strict
form collapses two pre-existing rows into one, merge them according to the
policy below. Ship with a --dry-run mode that prints the merge plan without
writing.
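
The grouping step described above can be sketched roughly as follows. This is a minimal illustration, not the actual migration: `Row`, `Merge`, `Plan`, and `BuildPlan` are hypothetical names, and `canon` is a stand-in parameter with the same `(string, error)` shape as `posta.Canonicalize`. Rows the canonicalizer rejects are collected and left alone here; how to treat them is a separate policy question.

```go
package main

import "sort"

// Row is a hypothetical minimal view of a peer_url-bearing row.
type Row struct {
	ID      int64
	PeerURL string
}

// Merge records one collision: every row in Drop folds into Keep.
type Merge struct {
	Canonical string
	Keep      Row
	Drop      []Row
}

// Plan is what a dry run would print and a real run would apply.
type Plan struct {
	Rewrites map[int64]string // row id -> canonical peer_url, only where it changes
	Merges   []Merge
	Skipped  []Row // rows the canonicalizer rejected; left untouched
}

// BuildPlan groups rows by canonical URL and keeps the lowest id per group.
// canon is an assumed stand-in for posta.Canonicalize.
func BuildPlan(rows []Row, canon func(string) (string, error)) Plan {
	plan := Plan{Rewrites: map[int64]string{}}
	groups := map[string][]Row{}
	for _, r := range rows {
		c, err := canon(r.PeerURL)
		if err != nil {
			plan.Skipped = append(plan.Skipped, r)
			continue
		}
		groups[c] = append(groups[c], r)
	}
	// Sort the canonical URLs so the plan order is deterministic.
	keys := make([]string, 0, len(groups))
	for c := range groups {
		keys = append(keys, c)
	}
	sort.Strings(keys)
	for _, c := range keys {
		g := groups[c]
		sort.Slice(g, func(i, j int) bool { return g[i].ID < g[j].ID })
		keep := g[0]
		if keep.PeerURL != c {
			plan.Rewrites[keep.ID] = c
		}
		if len(g) > 1 {
			plan.Merges = append(plan.Merges, Merge{Canonical: c, Keep: keep, Drop: g[1:]})
		}
	}
	return plan
}
```

On a database that is already canonical, this sketch yields an empty `Rewrites` map and no `Merges`, which is the no-op behaviour the idempotency requirement asks for.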

End-to-end behaviour after this slice:

  • Upgrading an existing inbox.db containing both https://x.example/inbox
    and https://x.example/inbox/ produces one canonical row that keeps the
    older row's id; counts, pinned status, last-message timestamps, and
    per-peer read watermarks are folded into the surviving row according to a
    deterministic policy.
  • Every merge is logged at slog.Info with both original URLs, the resulting
    canonical URL, and the kept/dropped row ids — so an operator can audit
    exactly what the upgrade did.
  • Running with --dry-run (or equivalent CLI surface on the migrate command)
    prints the merge plan and exits without writing.
  • The migration is idempotent: running it on an already-canonical database is
    a no-op.

This slice is operationally irreversible. The dry-run flag is the safety net;
operators should run it on a copy of the production DB first.

Acceptance criteria

  • A new entry is appended to the migrations list in internal/store
    (alongside the existing transactional migration framework). Existing
    applied versions are not edited.
  • The migration rewrites peer_url in messages and contacts to
    canonical form using the adapter / posta.Canonicalize.
  • Collision merge policy:
    • When two rows now share the same canonical peer_url, keep the row with
      the older id (or the older created_at when there is no surrogate id).
    • For messages: do not merge — each row is an independent envelope and
      just gets its peer_url rewritten. (Two duplicate envelopes only collide
      on (peer_url, msg_id) if msg_id matches, which is a pre-existing
      invariant.)
    • For contacts: fold the loser into the surviving row as follows.
      display_name: keep the older row's value; pinned: logical OR;
      last_message_at: max; per-peer read watermark fields: max. Then drop
      the loser row.
  • Every merge emits a slog.Info line tagged with the migration version,
    both input URLs, the canonical URL, and the kept/dropped ids.
  • A --dry-run mode is reachable via the operator surface (CLI flag on
    whichever subcommand triggers migrations on demand, or an env var
    consulted by OpenSQLite). In dry-run, the migration computes the plan,
    emits the would-merge log lines, and rolls back the transaction.
  • A test in internal/store/sqlite_test.go (or new file) builds a
    :memory: SQLite, seeds rows that deliberately collide under §4.1
    (trailing slash, percent-encoded unreserved like %7E vs ~, IDN
    U-label vs A-label, case-only-distinct hosts that survive lowering),
    runs the migration, and asserts: rows merged onto the older id;
    display_name/pinned/last_message_at/watermarks folded per the
    policy; loser rows dropped.
  • A second test asserts the migration is idempotent — running it again
    after success leaves the table identical.
  • A third test asserts --dry-run emits the same plan as a real run
    but leaves the database unchanged.
  • go build ./... and go test ./... pass.

Blocked by

posta/spec shipping Canonicalize() (the precondition documented in
posta/spec/TODO.md). The migration uses the spec library directly; it does
not depend on S3 (the call-site migration). Soft preference to ship after
S3 so newly-written rows are already canonical before the historical sweep
runs — but the migration is correct in either order.

Author
Owner

This was generated by AI during triage.

Precondition resolved. posta/spec commit 5aa3aa3 lands Canonicalize(s) (string, error) in pkg/posta. The migration can call it directly on each row's peer_url.

The server's go.mod already uses replace … => ../spec, so no version bump is needed.

Implementation notes for the agent:

  • Append a new entry to the migrations list in internal/store/sqlite.go (versioned, transactional — same framework as existing migrations).
  • Rows whose pre-migration peer_url is rejected by Canonicalize (e.g. legacy entries with userinfo or non-HTTPS scheme) need a documented policy — recommend: log at slog.Warn with the row id and reject category, leave the row untouched, and surface the count in the migration summary. Worth flagging in the slice comment / PR description.
  • The --dry-run surface is operator-facing; a CLI flag on whichever subcommand currently triggers OpenSQLite is the natural seam, or an env var if no migration-specific subcommand exists. The dry-run should compute the merge plan inside a transaction that's rolled back at the end.
  • The fixture test should include the four collision cases listed in the acceptance criteria (trailing slash, %7E vs ~, IDN U-label vs A-label, case-only-distinct hosts) using the canonical outputs from posta/spec/testdata/vectors/url-canonical/.

This slice is operationally irreversible on production data. The --dry-run output is the artifact a human reviews before applying.

Category: enhancement
State: ready-for-agent
