design: reshape cloud-svc as control plane (two-port split)

Earlier draft archived cloud-svc entirely. Better shape: keep it as a
control plane for the restic backend. Two listeners in one process:

  - provisioning :9091 on automc-net (called by discord-bot)
  - operator     :9092 on 127.0.0.1 (called by automc-setup wizard)

Players still hit restic-rest-server (data plane) directly with their
per-user password. cloud-svc never sits in the player data path, which
keeps its public exposure at zero.
2026-06-02 21:19:45 +02:00
parent 698a7a037c
commit d9a6057c75
+90 -25
@@ -2,7 +2,11 @@
Per-Discord-user state sync for Minecraft. Pulls on launch, pushes on exit. Single JAR drops into Prism / MMC / ATLauncher / frazclient as a pre-launch + post-exit hook.
**Backend:** `restic-rest-server` with `--private-repos --append-only`. No custom server code. cloud-sync.jar is a restic CLI wrapper.
**Data plane:** `restic-rest-server` with `--private-repos --append-only`. Clients hit this directly with their per-user password.
**Control plane:** `cloud-svc` Go service with two listeners — a provisioning port reachable from automc-net (called by discord-bot) and a loopback admin port (called by automc-setup wizard). Players never touch cloud-svc.
**Client:** `cloud-sync.jar` subprocesses restic. ~200 LOC.
## Why this shape
@@ -16,13 +20,15 @@ Per-Discord-user state sync for Minecraft. Pulls on launch, pushes on exit. Sing
| Encryption at rest | Per-repo password, built in |
| Multi-machine support | Restic tags + hostname; if we ever want it, free |
cloud-svc as I'd been building it was a worse re-implementation of all the above. Pivoting before it ships.
cloud-svc as originally designed was a worse re-implementation of all the above. Pivoting before it ships; cloud-svc gets reshaped into the control plane described below.
## Topology
```mermaid
flowchart LR
pl["player PC"]:::external
op["operator
(via SSH)"]:::external
jar["cloud-sync.jar
(in launcher's
pre/post hooks)"]:::deploy
@@ -33,13 +39,24 @@ on first run)"]:::deploy
subgraph john["john (192.168.65.33)"]
rp{{"reverse proxy
:443"}}:::deploy
ao{{"restic-rest-server
subgraph net["automc-net"]
ao{{"restic-rest-server
--private-repos
--append-only
:8002"}}:::deploy
bot{{"discord-bot"}}:::deploy
cs_int{{"cloud-svc
provisioning :9091
(automc-net only)"}}:::deploy
end
cs_admin{{"cloud-svc
admin :9092
(127.0.0.1 only)"}}:::deploy
store[/"/srv/cloud-data
/<discord_id>/..."/]:::pvc
bot{{"discord-bot"}}:::deploy
htp[/"/etc/restic-users
htpasswd"/]:::pvc
end
@@ -50,15 +67,39 @@ htpasswd"/]:::pvc
rp -->|"loopback"| ao
ao -->|"reads"| htp
ao -->|"writes"| store
bot -.->|"on /register:
htpasswd -B add"| htp
POST /admin/users"| cs_int
cs_int -.->|"htpasswd add
restic init
key add"| htp
cs_int -.->|"mints repo"| store
bot -.->|"DM password"| pl
op -.->|"SSH then
automc-setup cloud ..."| cs_admin
cs_admin -.->|"list / prune / revoke"| htp
cs_admin -.->|"prune via
operator master key"| store
classDef deploy fill:#d5e8d4,stroke:#82b366,color:#000
classDef pvc fill:#f5f5f5,stroke:#666,color:#000
classDef external fill:#f5f5f5,stroke:#666,color:#000,stroke-dasharray:5 5
```
`cloud-svc` runs as **one process with two listeners**:
| Listener | Bind | Reachable from | Endpoints |
|---|---|---|---|
| Provisioning | `automc-net:9091` (no PublishPort) | discord-bot via service-net DNS | `POST /admin/users` only |
| Operator | `127.0.0.1:9092` | john's loopback (SSH session) | `GET/DELETE /admin/users`, `POST /admin/users/{id}/prune`, `GET /admin/users/{id}/quota`, etc. |
The split means a compromised discord-bot can mint new accounts but cannot enumerate, prune, or revoke existing ones. Operator-only ops require shell access on john.
Auth model:
- Provisioning listener: shared service token (env `CLOUD_PROVISIONING_KEY`); discord-bot reads the same value from its own env
- Operator listener: no auth — loopback bind is the boundary, same pattern as `server-manager:127.0.0.1:8080`
## Auth & identity
| Element | Value |
@@ -68,9 +109,9 @@ htpasswd -B add"| htp
| URL pattern | `rest:https://cloud.tm.center/<discord_id>/` |
| Server isolation | `--private-repos` enforces URL path matches authenticated user |
discord-bot's `/register` flow extends to mint a random password, `htpasswd -B`-add it to the file, DM the password to the player. Existing flow stays untouched for non-cloud cases.
discord-bot's `/register` flow extends to call `POST cloud-svc:9091/admin/users` with the player's Discord ID. cloud-svc mints a random password, `htpasswd -B`-adds it to the file, runs `restic init` + `restic key add operator-master`, and returns the password. discord-bot DMs it to the player. discord-bot itself never touches restic or htpasswd directly.
Revocation = `htpasswd -D` removes the user. No token store, no scope checks, no auth-service involvement.
Revocation = operator runs `automc-setup cloud revoke <discord_id>` which hits the loopback admin port. No token store, no scope checks, no auth-service involvement.
## Client flow
@@ -138,40 +179,64 @@ Recommendation: **option 1** (multi-key per repo). On `/register`, the bot calls
- Operator UI for "this player has 25 GB of cloud data, what's in it?"
- Cross-machine sync UX (you can play on PC A then PC B; latest snapshot wins. No conflict UI because restic doesn't merge — restore-latest is destructive by design.)
## Migration from cloud-svc
## cloud-svc — reshape, not delete
cloud-svc was never deployed. No user data to migrate. Action:
- Archive `Timemachine/cloud-svc` repo (mark archived, leave commits + DESIGN.md as a record)
- Delete `cloud_pull` / `cloud_push` from `frazclient/client.py`
- Remove `automc_cloud_svc.md` memory entry, replace with `automc_cloud_sync.md` pointing here
cloud-svc gets a new purpose: control plane for the restic backend. Throw away:
- Manifest types + validation (`manifest.go`)
- Blob storage + tarball extraction (`storage.go` body)
- Player-facing `/v1/*` endpoints (`server.go` body)
- Snapshot ID generation, content hash cross-check
Keep:
- Project skeleton (go.mod, Dockerfile, Makefile, CI)
- Auth-cache pattern from `auth.go` (reused for provisioning token verification)
- Per-user mutex pattern from `storage.go` (still needed to serialize concurrent provisioning calls)
- Config loader from `config.go` (adds new vars)
New code:
- Two `http.Server` instances, one per listener
- htpasswd writer that respects bcrypt + file locking
- restic CLI subprocesser (init repo, add key, prune)
- `time.Ticker` for nightly prune job
Estimate: ~300 LOC kept, ~600 LOC new. Net smaller than current cloud-svc.
Also delete `cloud_pull` / `cloud_push` from `frazclient/client.py` (made obsolete by `cloud-sync.jar` calls).
## Topology consequences for `automc/docs/network-exposure.md`
Same one public endpoint (`cloud.tm.center :443`), same reverse-proxy hardening checklist, same threat surface. Differences:
| Layer | Bind | Public? |
|---|---|---|
| `restic-ao` (data plane) | `127.0.0.1:8002` | Via reverse proxy at `cloud.tm.center:443` |
| `cloud-svc` provisioning listener | `automc-net:9091` (no PublishPort) | No |
| `cloud-svc` admin listener | `127.0.0.1:9092` | No |
| Old (cloud-svc) | New (restic-ao) |
The single public HTTPS endpoint stays, but compared with the original plan it now fronts `restic-ao` instead of `cloud-svc`. The same reverse-proxy hardening checklist applies. Threat surface differences:
| Old (cloud-svc as data path) | New (restic-ao as data path) |
|---|---|
| Bearer token via auth-service `/auth/verify-key` | Basic auth via htpasswd in restic-rest-server |
| Token leak = one user's data | Password leak = one user's data |
| Bearer token via auth-service `/auth/verify-key` | HTTP Basic via htpasswd in restic-rest-server |
| Custom Go service, 33 tests | Upstream restic-rest-server, well-audited |
| `127.0.0.1:9091` loopback bind | `127.0.0.1:8002` (existing restic-ao quadlet) |
| 60s in-memory cache of verified tokens | rest-server reads htpasswd per request |
| Player-facing endpoints | None — cloud-svc not public |
Net: fewer moving parts, smaller attack surface.
Operator endpoints are loopback-only and require SSH access to john to reach. No new public surface from the control plane.
## Repo layout post-pivot
| Repo | Purpose |
|---|---|
| `Timemachine/cloud-sync` (this) | Kotlin/Gradle JAR that subprocesses restic |
| `Timemachine/cloud-svc` | **Archived.** Snapshot of the abandoned path; commits + DESIGN.md kept as decision record |
| `Timemachine/discord-bot` | Extended `/register` flow to mint htpasswd creds + init restic repo |
| `Timemachine/automc` | `setup` wizard renders the restic-ao quadlet with the new flags; `database/schema.sql` unchanged |
| `Timemachine/cloud-svc` | **Reshaped** — control plane only. Two-port Go service for provisioning + operator ops. NOT archived. |
| `Timemachine/discord-bot` | Extended `/register` flow calls cloud-svc to provision; DMs returned password |
| `Timemachine/automc` | `setup` wizard adds `automc-setup cloud {list,prune,revoke,quota}` subcommands hitting cloud-svc's loopback admin port. Quadlet templates for both restic-ao (new flags) and cloud-svc (two listeners). `database/schema.sql` unchanged. |
## Pre-implementation checklist
- [ ] User reviews this design doc
- [ ] Confirm: server-side prune via operator master password (option 1 above)
- [ ] Confirm: archive cloud-svc rather than delete
- [x] **Confirmed (2026-06-02): cloud-svc reshapes to control plane, not archived**
- [x] **Confirmed (2026-06-02): two-port split — automc-net for provisioning, loopback for operator**
- [ ] Confirm: server-side prune via an operator master key on each repo
- [ ] Confirm: cloud-sync.jar auto-downloads restic binary vs requires it pre-installed
- [ ] Confirm: nightly prune at 04:00 UTC vs after-each-push
- [ ] Confirm: nightly prune cadence (default proposal: daily 04:00 UTC)
- [ ] Confirm: shared service token between discord-bot and cloud-svc provisioning port (env var on both)