cloud-sync/DESIGN.md
claude-timemachine ffdfb1f9b6
pivot to Python: replace Kotlin/JVM with stdlib zipapp
Reasons stacked up:
  - AV: unsigned JARs that auto-download binaries + upload files trigger
    Windows Defender false-positives more often than Python scripts
    invoked by code-signed python.exe.
  - Qt UI option: PySide6 opens a path to a real Qt UI (matching Prism's
    look) if needed later. JVM Qt bindings are abandoned.
  - frazclient already needs Python; inlining as 'import cloud_sync' is
    zero overhead vs the launcher always shelling out to java.

Implementation:
  - cloud_sync package: cli.py (argparse), creds.py, scope.py,
    restic.py (binary discovery + auto-download + sha256 verify),
    sync.py (pull/push subprocess restic).
  - pyproject.toml with hatchling backend; pip-installable.
  - Makefile builds cloud-sync.pyz via python -m zipapp (~53 KB).
  - 33 pytest tests, stdlib only on runtime.
  - CI workflow runs pytest matrix (3.10/3.11/3.12) + builds pyz.
  - DESIGN.md + README.md updated to reflect Python.

E2E verified against local restic-rest-server:
  pull empty → push initial → rm -rf local → pull restores → modify+push
  creates second snapshot → client forget --prune blocked by --append-only.

Throws away ~565 LOC of Kotlin (and 18 jar tests) committed earlier in
this same session. Net result is ~250 LOC Python + 33 tests = smaller
and more aligned with the rest of the stack.
2026-06-03 01:11:47 +02:00


# cloud-sync — design
Per-Discord-user state sync for Minecraft. Pulls on launch, pushes on exit. Single Python zipapp drops into Prism / MMC / ATLauncher / frazclient as a pre-launch + post-exit hook.
**Data plane:** `restic-rest-server` with `--private-repos --append-only`. Clients hit this directly with their per-user password.
**Control plane:** `cloud-svc` Go service with two listeners — a provisioning port reachable from automc-net (called by discord-bot) and a loopback admin port (called by automc-setup wizard). Players never touch cloud-svc.
**Client:** `cloud-sync.pyz` (Python 3.10+, stdlib only) subprocesses restic. ~300 LOC. Distributed as a zipapp (single-file). Python was chosen over Java for two reasons: (a) the launcher's pre-launch and post-exit hooks can run any subprocess, so the hook imposes no language constraint; (b) custom unsigned JARs that download binaries and upload files are textbook Windows Defender false-positive triggers, whereas a script run by the signed `python.exe` mostly sidesteps that.
## Why this shape
| Concern | How restic solves it |
|---|---|
| Snapshot semantics | Native — every `restic backup` is a snapshot |
| Deduplication | Chunk-level (not just file-level), built in |
| Retention policy | `restic forget --keep-last/daily/weekly/monthly` |
| Append-only enforcement | `restic-rest-server --append-only`: even with a valid password, clients can't delete |
| Per-user isolation | `--private-repos`: URL path must contain the authenticated username |
| Encryption at rest | Per-repo password, built in |
| Multi-machine support | Restic tags + hostname; if we ever want it, free |
cloud-svc as originally designed was a worse re-implementation of all the above. Pivoting before it ships; cloud-svc gets reshaped into the control plane described below.
## Topology
```mermaid
flowchart LR
pl["player PC"]:::external
op["operator
(via SSH)"]:::external
jar["cloud-sync.pyz
(Python; in launcher's
pre/post hooks)"]:::deploy
restic["restic binary
(auto-downloaded
on first run)"]:::deploy
subgraph john["john (192.168.65.33)"]
rp{{"reverse proxy
:443"}}:::deploy
subgraph net["automc-net"]
ao{{"restic-rest-server
--private-repos
--append-only
:8002"}}:::deploy
bot{{"discord-bot"}}:::deploy
cs_int{{"cloud-svc
provisioning :9091
(automc-net only)"}}:::deploy
end
cs_admin{{"cloud-svc
admin :9092
(127.0.0.1 only)"}}:::deploy
store[/"/srv/cloud-data
/<discord_id>/..."/]:::pvc
htp[/"/etc/restic-users
htpasswd"/]:::pvc
end
pl --> jar --> restic
restic ==>|"rest:https
<discord_id>:<password>"| rp
rp -->|"loopback"| ao
ao -->|"reads"| htp
ao -->|"writes"| store
bot -.->|"on /register:
POST /admin/users"| cs_int
cs_int -.->|"htpasswd add
restic init
key add"| htp
cs_int -.->|"mints repo"| store
bot -.->|"DM password"| pl
op -.->|"SSH then
automc-setup cloud ..."| cs_admin
cs_admin -.->|"list / prune / revoke"| htp
cs_admin -.->|"prune via
operator master key"| store
classDef deploy fill:#d5e8d4,stroke:#82b366,color:#000
classDef pvc fill:#f5f5f5,stroke:#666,color:#000
classDef external fill:#f5f5f5,stroke:#666,color:#000,stroke-dasharray:5 5
```
`cloud-svc` runs as **one process with two listeners**:
| Listener | Bind | Reachable from | Endpoints |
|---|---|---|---|
| Provisioning | `automc-net:9091` (no PublishPort) | discord-bot via service-net DNS | `POST /admin/users` only |
| Operator | `127.0.0.1:9092` | john's loopback (SSH session) | `GET/DELETE /admin/users`, `POST /admin/users/{id}/prune`, `GET /admin/users/{id}/quota`, etc. |
The split means a compromised discord-bot can mint new accounts but cannot enumerate, prune, or revoke existing ones. Operator-only ops require shell access on john.
Auth model:
- Provisioning listener: per-caller tokens. cloud-svc reads `CLOUD_PROVISIONING_TOKENS_BOT`, `CLOUD_PROVISIONING_TOKENS_<NAME>` env vars. Header `Authorization: Bearer <token>`. Logs include matched caller name for audit attribution.
- Operator listener: no auth — loopback bind is the boundary, same pattern as `server-manager:127.0.0.1:8080`
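
The per-caller token lookup can be illustrated as follows. cloud-svc itself is a Go service, so this is only a Python sketch of the logic, assuming the caller name is the lower-cased env-var suffix; `load_caller_tokens` and `match_caller` are hypothetical names, not part of cloud-svc.

```python
import hmac
import os

PREFIX = "CLOUD_PROVISIONING_TOKENS_"


def load_caller_tokens(environ=None):
    """Map caller name (lower-cased env-var suffix) to its token."""
    environ = os.environ if environ is None else environ
    return {
        key[len(PREFIX):].lower(): value
        for key, value in environ.items()
        if key.startswith(PREFIX) and len(key) > len(PREFIX) and value
    }


def match_caller(authorization, tokens):
    """Return the caller whose token matches `Authorization: Bearer <token>`, else None."""
    scheme, _, presented = (authorization or "").strip().partition(" ")
    presented = presented.strip()
    if scheme.lower() != "bearer" or not presented:
        return None
    matched = None
    for name, token in tokens.items():
        # compare_digest keeps comparison time independent of where the tokens differ.
        if hmac.compare_digest(presented.encode(), token.encode()):
            matched = name
    return matched
```

The returned name is what would be written to the audit log; a `None` result maps to a 401.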
## Auth & identity
| Element | Value |
|---|---|
| User identity | Discord ID (immutable, from discord-bot's existing account-card flow) |
| User credential | One password per user: stored bcrypt-hashed in the `/etc/restic-users` htpasswd file and also used as the restic repo password |
| URL pattern | `rest:https://cloud.tm.center/<discord_id>/` |
| Server isolation | `--private-repos` enforces URL path matches authenticated user |
discord-bot's `/register` flow extends to call `POST cloud-svc:9091/admin/users` with the player's Discord ID. cloud-svc mints a random password, `htpasswd -B`-adds it to the file, runs `restic init` + `restic key add operator-master`, and returns the password. discord-bot DMs it to the player. discord-bot itself never touches restic or htpasswd directly.
Revocation = operator runs `automc-setup cloud revoke <discord_id>` which hits the loopback admin port. No token store, no scope checks, no auth-service involvement.
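
The provisioning sequence described above can be sketched as a list of commands. This is a Python illustration rather than the Go implementation; `mint_password` and `provisioning_commands` are hypothetical helpers, and the password-file paths are assumptions. It uses `htpasswd -i` (password on stdin) and `restic key add --new-password-file` so that no secret appears in argv.

```python
import posixpath
import secrets


def mint_password():
    """Random URL-safe password; 24 random bytes encode to 32 characters."""
    return secrets.token_urlsafe(24)


def provisioning_commands(discord_id, repo_root, htpasswd_file,
                          user_password_file, operator_password_file):
    """Return the argv lists to run, in order, for one new user."""
    # Discord IDs are numeric snowflakes; rejecting anything else also blocks
    # path tricks such as "../" in the repo path.
    if not (discord_id.isascii() and discord_id.isdigit()):
        raise ValueError(f"not a Discord ID: {discord_id!r}")
    repo = posixpath.join(repo_root, discord_id)
    return [
        # -B selects bcrypt; -i reads the password from stdin.
        ["htpasswd", "-B", "-i", htpasswd_file, discord_id],
        # The repo is created with the player's password as its first key ...
        ["restic", "-r", repo, "--password-file", user_password_file, "init"],
        # ... and the operator master password is added as a second key.
        ["restic", "-r", repo, "--password-file", user_password_file,
         "key", "add", "--new-password-file", operator_password_file],
    ]
```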
## On-disk layout (client)
cloud-sync.pyz stores its state under `<pack-folder>/.cloud-sync/` — per-instance, hidden by leading dot. Auto-excluded from cloud sync so a player can't accidentally upload their own credentials.
```
<pack-folder>/
  mods/                  # managed by packwiz
  config/                # mixed: pack-shipped + player-modified
  options.txt
  journeymap/data/
  .cloud-sync/           # cloud-sync owns this
    token                # "discord_id:password" (mode 0600)
    scope.json           # per-distribution include/exclude rules
    restic-<version>     # auto-downloaded binary
    state.json           # last-pull snapshot ID, last-push time
    logs/                # rolling, capped ~5 MB
```
Per-instance isolation matters: a player running a cracked instance + a premium instance gets two separate `.cloud-sync/` dirs with different Discord credentials. `rm -rf .cloud-sync/` resets one instance entirely.
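
A minimal sketch of reading and writing the `token` file from the layout above, assuming the file holds a single `discord_id:password` line. `save_token` and `load_token` are hypothetical names and may not match what `creds.py` actually exposes.

```python
import os
from pathlib import Path


def save_token(pack_folder, discord_id, password):
    """Write <pack-folder>/.cloud-sync/token, created with mode 0600."""
    state_dir = Path(pack_folder) / ".cloud-sync"
    state_dir.mkdir(parents=True, exist_ok=True)
    token_path = state_dir / "token"
    # Passing the mode to os.open means a newly created file is never world-readable.
    fd = os.open(token_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(f"{discord_id}:{password}\n")
    return token_path


def load_token(pack_folder):
    """Return (discord_id, password); split on the first colon only."""
    path = Path(pack_folder) / ".cloud-sync" / "token"
    text = path.read_text(encoding="utf-8").strip()
    discord_id, sep, password = text.partition(":")
    if not sep or not discord_id or not password:
        raise ValueError("token file must contain discord_id:password")
    return discord_id, password
```

Splitting on the first colon lets the password itself contain colons. On Windows the mode argument has little effect, so the 0600 guarantee only holds on POSIX systems.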
### Restic binary discovery
Probed in order:
1. `<pack-folder>/.cloud-sync/restic-<version>` — pinned copy from first run
2. `$PATH` (`which restic` + version match) — honor existing system install
3. Download from `github.com/restic/restic/releases/download/v<version>/restic_<version>_<os>_<arch>.bz2`, cache to `<pack-folder>/.cloud-sync/`
`--restic-binary <path>` flag overrides discovery for air-gapped operators.
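
The probe order can be sketched as below. `find_restic`, `release_url`, and `platform_tag` are hypothetical names, the version string is supplied by the caller, and the `$PATH` branch skips the version match for brevity. The URL follows the `.bz2` pattern given above; whether every platform's release asset uses that extension is worth checking against the upstream release page.

```python
import platform
import shutil
from pathlib import Path


def platform_tag():
    """Map the host to the os/arch names used in release asset file names."""
    os_name = platform.system().lower()  # e.g. "linux", "darwin", "windows"
    machine = platform.machine().lower()
    arch = {"x86_64": "amd64", "amd64": "amd64",
            "aarch64": "arm64", "arm64": "arm64"}.get(machine, machine)
    return os_name, arch


def release_url(version, os_name, arch):
    return ("https://github.com/restic/restic/releases/download/"
            f"v{version}/restic_{version}_{os_name}_{arch}.bz2")


def find_restic(pack_folder, version, override=None, which=shutil.which):
    """Return (source, location), where source names the branch that matched."""
    if override:  # --restic-binary wins over discovery
        return "override", str(override)
    pinned = Path(pack_folder) / ".cloud-sync" / f"restic-{version}"
    if pinned.is_file():
        return "pinned", str(pinned)
    on_path = which("restic")
    if on_path:
        # A real client would also run `restic version` and compare; omitted here.
        return "path", on_path
    return "download", release_url(version, *platform_tag())
```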
### Zipapp placement
Stateless. Lives wherever the operator put it. Prism / MMC config references its absolute path. One pyz can serve N instances; each gets its own `.cloud-sync/` underneath its own `--pack-folder`.
## Client flow
### `cloud-sync.pyz pull`
```
1. Load creds from <pack-folder>/.cloud-sync/token (format: discord_id:password on one line)
2. Locate or auto-download restic binary into <pack-folder>/.cloud-sync/restic-<version>
3. restic -r rest:https://<url>/<discord_id>/ snapshots --latest 1 --json
4. If no snapshots → exit 0 (first run on this machine, nothing to restore)
5. restic restore latest --target <pack-folder> --include-file cloud-scope.txt
```
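
The pull steps can be sketched as command construction plus a check for the empty-repo case. All function names here are hypothetical. The sketch assumes a restic release that reads `RESTIC_REST_USERNAME` / `RESTIC_REST_PASSWORD` from the environment and that accepts `--include-file` on `restore`; both should be checked against the pinned restic version.

```python
import json
import os


def repo_url(base_url, discord_id):
    return f"rest:{base_url.rstrip('/')}/{discord_id}/"


def restic_env(discord_id, password, base_env=None):
    """Environment for restic: one password serves HTTP Basic and the repo key."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "RESTIC_REST_USERNAME": discord_id,
        "RESTIC_REST_PASSWORD": password,
        "RESTIC_PASSWORD": password,
    })
    return env


def pull_commands(restic, repo, pack_folder, scope_file="cloud-scope.txt"):
    """Return (snapshots argv, restore argv) for steps 3 and 5."""
    snapshots = [restic, "-r", repo, "snapshots", "--latest", "1", "--json"]
    restore = [restic, "-r", repo, "restore", "latest",
               "--target", str(pack_folder), "--include-file", scope_file]
    return snapshots, restore


def latest_snapshot_id(snapshots_json):
    """Parse `restic snapshots --json` output; None means nothing to restore."""
    snapshots = json.loads(snapshots_json) if snapshots_json.strip() else []
    return snapshots[-1]["id"] if snapshots else None
```

A caller would run the first command with `subprocess.run(..., env=restic_env(...))`, exit 0 when `latest_snapshot_id` returns `None`, and otherwise run the restore command.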
### `cloud-sync.pyz push`
```
1. Same creds + restic locator as pull
2. restic backup <pack-folder> --files-from cloud-scope.txt --exclude-file cloud-exclude.txt
3. restic forget --keep-last 20 --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prune
```
The `forget --prune` step never succeeds when a client runs it: `restic-rest-server --append-only` refuses the deletions that forget and prune issue, and we do not relax that for clients. **Pruning happens server-side via a nightly job** running `restic forget` with the operator's full-access password against the repo. Clients can only add, never remove.
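
Because the server refuses forget, a client-side push reduces to a single `backup` call. A minimal sketch, assuming `restic backup --json` emits one JSON object per line with a final `summary` message carrying `snapshot_id`; `backup_cmd` and `snapshot_id_from_backup_output` are hypothetical names.

```python
import json


def backup_cmd(restic, repo, scope_file="cloud-scope.txt",
               exclude_file="cloud-exclude.txt"):
    """argv for the push; there is deliberately no forget/prune on the client."""
    return [restic, "-r", repo, "backup", "--json",
            "--files-from", scope_file, "--exclude-file", exclude_file]


def snapshot_id_from_backup_output(stdout):
    """Return the new snapshot ID from the summary line, or None if absent."""
    for line in stdout.splitlines():
        try:
            message = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON noise such as warnings
        if isinstance(message, dict) and message.get("message_type") == "summary":
            return message.get("snapshot_id")
    return None
```

The returned ID is the kind of value `state.json` could record alongside the last-push time.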
## `cloud-scope.json` → restic args
| Input | Becomes |
|---|---|
| `include: ["options.txt", "config/", "journeymap/data/"]` | Listed in `cloud-scope.txt`, passed as `--files-from cloud-scope.txt` |
| `exclude: ["config/simple-mod-sync*", "**/*.log"]` | Listed in `cloud-exclude.txt`, passed as `--exclude-file cloud-exclude.txt` |
| `max_size_mb_per_file: 50` | restic doesn't have a per-file size cap; we filter during scope generation |
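
One way to realize this mapping is sketched below, including the per-file size filter. It assumes entries are relative to the pack folder (with restic run from that directory) and that oversized files are handled by appending them to the exclude list; `write_scope_files` is a hypothetical name and the real `scope.py` may filter differently.

```python
from pathlib import Path


def write_scope_files(pack_folder, scope, out_dir):
    """Write cloud-scope.txt and cloud-exclude.txt; return both paths."""
    pack = Path(pack_folder)
    includes = list(scope.get("include", []))
    excludes = list(scope.get("exclude", []))
    limit_mb = scope.get("max_size_mb_per_file")
    if limit_mb is not None:
        limit_bytes = int(limit_mb * 1024 * 1024)
        for entry in includes:
            root = pack / entry
            if root.is_file():
                candidates = [root]
            elif root.is_dir():
                candidates = sorted(p for p in root.rglob("*") if p.is_file())
            else:
                candidates = []  # entry does not exist in this instance
            for path in candidates:
                # restic has no per-file size cap, so oversized files become excludes.
                if path.stat().st_size > limit_bytes:
                    excludes.append(path.relative_to(pack).as_posix())
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    scope_path = out / "cloud-scope.txt"
    exclude_path = out / "cloud-exclude.txt"
    scope_path.write_text("".join(f"{line}\n" for line in includes), encoding="utf-8")
    exclude_path.write_text("".join(f"{line}\n" for line in excludes), encoding="utf-8")
    return scope_path, exclude_path
```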
## Retention policy
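A server-side nightly job (04:00 UTC by default, driven by a `time.Ticker` inside cloud-svc rather than an external cron) walks all per-user repos. The equivalent shell loop: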
```bash
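for repo in /srv/cloud-data/*/; do
    user=$(basename "$repo")   # directory name is the Discord ID
    echo "pruning repo for ${user}"
    # The operator key opens every repo; the user's own password is never needed.
    restic -r "$repo" --password-file /etc/restic-master-pass \
        forget --keep-last=20 --keep-daily=7 --keep-weekly=4 --keep-monthly=6 --prune
done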
```
This requires the operator to have a "master password" that opens any user's repo — restic doesn't have that natively. **Options:**
1. **Init each user's repo with TWO keys** — one for the user, one for the operator-side pruner. restic supports multi-key per repo.
2. **Run the cron with each user's own password** — requires storing all user passwords server-side; defeats the encryption.
3. **Don't auto-prune** — let users push forever, trust quota at the rest-server level.
Recommendation: **option 1** (multi-key per repo). On `/register`, the bot calls cloud-svc, which runs `restic init` with the player's password and then `restic key add` to register the operator master password as a SECOND key. The pruner uses the operator master password.
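
The scheduling side of the pruner can be sketched as follows: compute the next 04:00 UTC run and build the `forget` command for one repo. This is a Python illustration of logic that would live in the Go service; `next_prune` and `forget_cmd` are hypothetical names, and the `HH:MM` argument mirrors the `--prune-time` flag from the checklist.

```python
from datetime import datetime, timedelta, timezone


def next_prune(now, prune_time="04:00"):
    """Next occurrence of HH:MM UTC strictly after `now` (must be timezone-aware)."""
    hour, minute = (int(part) for part in prune_time.split(":"))
    now = now.astimezone(timezone.utc)
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


def forget_cmd(repo, operator_password_file):
    """argv for pruning one repo with the operator key and the retention policy above."""
    return ["restic", "-r", repo, "--password-file", operator_password_file,
            "forget", "--keep-last", "20", "--keep-daily", "7",
            "--keep-weekly", "4", "--keep-monthly", "6", "--prune"]
```

Sleeping until `next_prune(datetime.now(timezone.utc))` and recomputing after each run avoids the drift that a fixed 24-hour interval would accumulate.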
## What's in v1
- restic-rest-server with `--private-repos --append-only --htpasswd-file`
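- discord-bot `/register` extension: calls cloud-svc, which mints the password, adds the htpasswd entry, runs `restic init`, and adds the operator key with `restic key add`; the bot then DMs the password to the player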
- cloud-sync.pyz that subprocesses restic for pull/push
- Auto-download restic binary on first run from upstream GitHub release
- Server-side nightly prune cron with operator-side master password key
## What's deferred
- restic version pinning / auto-update of the binary (treat like packwiz-installer self-update)
- Server-side `restic check` cron for repo integrity
- Per-user quota at the rest-server level (rest-server supports `--max-size` per-user via `.maxsize` file in each repo)
- Operator UI for "this player has 25 GB of cloud data, what's in it?"
- Cross-machine sync UX (you can play on PC A then PC B; latest snapshot wins. No conflict UI because restic doesn't merge — restore-latest is destructive by design.)
## cloud-svc — reshape, not delete
cloud-svc gets a new purpose: control plane for the restic backend. Throw away:
- Manifest types + validation (`manifest.go`)
- Blob storage + tarball extraction (`storage.go` body)
- Player-facing `/v1/*` endpoints (`server.go` body)
- Snapshot ID generation, content hash cross-check
Keep:
- Project skeleton (go.mod, Dockerfile, Makefile, CI)
- Auth-cache pattern from `auth.go` (reused for provisioning token verification)
- Per-user mutex pattern from `storage.go` (still needed to serialize concurrent provisioning calls)
- Config loader from `config.go` (adds new vars)
New code:
- Two `http.Server` instances, one per listener
- htpasswd writer that respects bcrypt + file locking
- restic CLI subprocesser (init repo, add key, prune)
- `time.Ticker` for nightly prune job
Estimate: ~300 LOC kept, ~600 LOC new. Net smaller than current cloud-svc.
Also delete `cloud_pull` / `cloud_push` from `frazclient/client.py` (these get obsoleted by `import cloud_sync` calls; frazclient depends on the same package).
## Topology consequences for `automc/docs/network-exposure.md`
| Layer | Bind | Public? |
|---|---|---|
| `restic-ao` (data plane) | `127.0.0.1:8002` | Via reverse proxy at `cloud.tm.center:443` |
| `cloud-svc` provisioning listener | `automc-net:9091` (no PublishPort) | No |
| `cloud-svc` admin listener | `127.0.0.1:9092` | No |
Only one public HTTPS endpoint changes from the original plan: it now fronts `restic-ao` instead of `cloud-svc`. Same reverse-proxy hardening checklist applies. Threat surface differences:
| Old (cloud-svc as data path) | New (restic-ao as data path) |
|---|---|
| Bearer token via auth-service `/auth/verify-key` | HTTP Basic via htpasswd in restic-rest-server |
| Custom Go service, 33 tests | Upstream restic-rest-server, well-audited |
| Player-facing endpoints | None — cloud-svc not public |
Operator endpoints are loopback-only and require SSH access to john to reach. No new public surface from the control plane.
## Repo layout post-pivot
| Repo | Purpose |
|---|---|
| `Timemachine/cloud-sync` (this) | Python 3.10+ package + zipapp that subprocesses restic |
| `Timemachine/cloud-svc` | **Reshaped** — control plane only. Two-port Go service for provisioning + operator ops. NOT archived. |
| `Timemachine/discord-bot` | Extended `/register` flow calls cloud-svc to provision; DMs returned password |
| `Timemachine/automc` | `setup` wizard adds `automc-setup cloud {list,prune,revoke,quota}` subcommands hitting cloud-svc's loopback admin port. Quadlet templates for both restic-ao (new flags) and cloud-svc (two listeners). `database/schema.sql` unchanged. |
## Pre-implementation checklist
All locked 2026-06-02:
- [x] cloud-svc reshapes to control plane, not archived
- [x] Two-port split — automc-net for provisioning, loopback for operator
- [x] Server-side prune via operator master password key on each repo. On `provision`, cloud-svc runs `restic init` then `restic key add` with the operator-master password as a SECOND key. The nightly pruner uses the operator key to open any repo.
- [x] cloud-sync.pyz auto-downloads restic binary. Matches `packwiz-installer-bootstrap` pattern. First run hits `https://github.com/restic/restic/releases` for the matching platform binary, caches under `<pack-folder>/.cloud-sync/restic-<version>`. SHA256 verified against the release's `SHA256SUMS` file. `--no-download` flag for air-gapped operators.
- [x] Nightly prune at 04:00 UTC. `time.Ticker` inside cloud-svc; no external cron. `--prune-time HH:MM` flag in case operators want a different window.
- [x] Per-caller tokens, NOT shared. cloud-svc reads `CLOUD_PROVISIONING_TOKENS_BOT`, `CLOUD_PROVISIONING_TOKENS_<OTHER>` env vars — one per known caller. Logs include the matched caller name so audit trails show which service made each call. Adding a future caller (e.g., a portal) means a new env var, not a token rotation.
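
The checksum step in the auto-download item can be sketched as below, assuming `SHA256SUMS` uses the usual `<hex digest>  <file name>` line format and that the asset is bzip2-compressed. `expected_sha256` and `verify_and_unpack` are hypothetical names, not necessarily what `restic.py` uses.

```python
import bz2
import hashlib


def expected_sha256(sums_text, asset_name):
    """Find the digest listed for asset_name in a SHA256SUMS file, or None."""
    for line in sums_text.splitlines():
        parts = line.split()
        # sha256sum marks binary mode with a leading "*" on the file name.
        if len(parts) == 2 and parts[1].lstrip("*") == asset_name:
            return parts[0].lower()
    return None


def verify_and_unpack(blob, sums_text, asset_name):
    """Verify the downloaded bytes before decompressing; raise ValueError on mismatch."""
    expected = expected_sha256(sums_text, asset_name)
    if expected is None:
        raise ValueError(f"{asset_name} is not listed in SHA256SUMS")
    actual = hashlib.sha256(blob).hexdigest()
    if actual != expected:
        raise ValueError(f"checksum mismatch for {asset_name}")
    return bz2.decompress(blob)
```

Verifying the compressed bytes before decompressing means nothing unverified is parsed or written to `.cloud-sync/`.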