initial: Steam-Cloud-style per-user state sync skeleton
HTTP API + on-disk storage + auth-service token verification + dev mode.
31 tests pass, vet clean. See DESIGN.md for the architecture and
README.md for the operator surface.

Pending: pg-backed per-user quota override, snapshot retention / blob GC,
tarball-vs-manifest content cross-check, end-to-end deploy on john.
2026-06-02 18:52:25 +02:00

# cloud-svc design
Steam-Cloud-style per-user file sync for Minecraft clients. Player launches their client → state pulled. Player exits → state pushed. Across machines, conflicts resolved via per-file mtime + a pre-launch dialog when ambiguity remains.
## Identity
User identity = Discord ID (already issued by automc's account-card flow). Cloud token is a long-lived API key with scope `cloud:rw`, issued via `auth-service` and tied to the Discord ID.
```
client.py ──── Authorization: Bearer <cloud-token> ────► cloud-svc
                                                             │
                                                             ▼  POST /auth/verify-key
                                                        auth-service ──► returns { discord_id, scopes }
```
cloud-svc never sees Discord IDs directly from the client — it always asks auth-service. Token revocation is a single DB UPDATE on `api_keys.revoked`.
## On-disk layout
`~/automc/cloud-data/` on `john`, owned by `automc` user. Per-Discord-ID prefix:
```
cloud-data/
  <discord_id>/
    manifest.json              ← latest snapshot's per-file mtime + sha256 map
    snapshots/
      <snapshot_id>.tar.zst    ← full content tarball (zstd-compressed)
    blobs/                     ← content-addressed dedupe (sha256-prefixed dirs)
      ab/
        abcdef0123...          ← raw file content, referenced by manifest
      cd/
        cdef0123...
```
**Why both tarball AND blob store?**
- Tarballs are the snapshot's immutable historical record (good for restore-to-version).
- Blob store is the on-line read path: per-file fetch on conflict resolution. Tarballs would force decompress-everything for one file.
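- Same content in two places = waste. Proposed fix: **hard-link** instead of copying, so cloud-svc writes one canonical copy of each file. Caveat: tar hardlink entries can only point at other members of the same archive, and a zstd-compressed tarball shares no disk blocks with loose blobs, so this saves space only if a snapshot is kept as a directory of hardlinks into `blobs/` rather than as a `.tar.zst`.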
If hardlinks turn out to be a pain across the rootless-podman boundary, fall back to blob-only and synthesize tarballs on demand for download. Defer the call.
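The blob layout above implies a simple mapping from digest to path. A minimal sketch, assuming the digest arrives as a hex string from the URL; `blobPath` and its parameters are hypothetical names, not taken from the codebase:

```go
package main

import (
	"encoding/hex"
	"errors"
	"path/filepath"
)

// blobPath maps a hex-encoded SHA-256 digest to
// <root>/<discordID>/blobs/<first two hex chars>/<full digest>.
// root and discordID are treated as trusted here; only the digest is validated,
// which also keeps "../" style input out of the filesystem path.
func blobPath(root, discordID, sha256Hex string) (string, error) {
	raw, err := hex.DecodeString(sha256Hex)
	if err != nil || len(raw) != 32 {
		return "", errors.New("blob id must be 64 hex characters")
	}
	// Re-encode so the stored name is always lowercase hex.
	name := hex.EncodeToString(raw)
	return filepath.Join(root, discordID, "blobs", name[:2], name), nil
}
```

Validating the digest before joining paths means `GET /v1/blob/{sha256}` cannot be used to read outside the user's prefix.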
## Snapshot retention
| Trigger | Action |
|---|---|
| `cloud_push` from client | New snapshot row. Tar written. Manifest updated. |
| Snapshot count > `RETAIN_LATEST` (default 30) | Oldest deleted. Hardlinked blobs that lose all refs are GC'd in a periodic job. |
| Per-user quota exceeded | Reject push with HTTP 413 + JSON `{error: "quota", used, limit}`. Client surfaces in UI. |
Snapshot IDs are ULIDs (timestamp-sortable, unique without coordination). Tarball name = `<26-char ULID>.tar.zst`.
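Because ULIDs sort lexicographically by creation time, the retention rule can be sketched without parsing timestamps. This is an illustrative helper under that assumption; `snapshotsToPrune` is a hypothetical name and blob GC is left out:

```go
package main

import "sort"

// snapshotsToPrune returns the IDs that fall outside the newest `retain`
// snapshots, oldest first. It relies on ULIDs sorting lexicographically
// by creation time.
func snapshotsToPrune(ids []string, retain int) []string {
	if retain < 0 {
		retain = 0
	}
	sorted := append([]string(nil), ids...) // copy; leave the caller's slice untouched
	sort.Strings(sorted)                    // oldest first
	if len(sorted) <= retain {
		return nil
	}
	return sorted[:len(sorted)-retain]
}
```

With `RETAIN_LATEST` at 1 or more, the newest snapshot is never in the returned list, which matches the rule that the latest snapshot cannot be deleted.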
## Per-file metadata
`manifest.json` (per user, latest snapshot only):
```json
{
"snapshot_id": "01J9XQK4Z3...",
"created_at": "2026-06-02T18:30:00Z",
"files": {
"options.txt": { "sha256": "ab...", "size": 5234, "mtime": "2026-06-02T18:25:11Z" },
"config/voicechat-client.json": { "sha256": "cd...", "size": 432, "mtime": "2026-06-01T22:14:03Z" },
"journeymap/data/sp/world1/...": { "sha256": "ef...", "size": 88123,"mtime": "2026-06-02T18:29:42Z" }
}
}
```
Stored only for the **latest** snapshot. Older snapshots' manifests live alongside `.tar.zst`. Reading old manifest = open tar header. Manifest mtime is **the file's mtime at push time on the client** — not the upload time. This is the source of truth for conflict resolution.
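A possible Go shape for `manifest.json`, sketched from the example above. The type and function names are assumptions, and the validation shown is deliberately minimal:

```go
package main

import (
	"encoding/json"
	"errors"
	"time"
)

// FileEntry mirrors one value in the manifest's "files" map.
type FileEntry struct {
	SHA256 string    `json:"sha256"`
	Size   int64     `json:"size"`
	MTime  time.Time `json:"mtime"` // client-side mtime at push time (RFC 3339)
}

// Manifest mirrors the top-level manifest.json document.
type Manifest struct {
	SnapshotID string               `json:"snapshot_id"`
	CreatedAt  time.Time            `json:"created_at"`
	Files      map[string]FileEntry `json:"files"`
}

// parseManifest decodes a manifest and rejects obviously broken ones.
func parseManifest(data []byte) (*Manifest, error) {
	var m Manifest
	if err := json.Unmarshal(data, &m); err != nil {
		return nil, err
	}
	if m.SnapshotID == "" {
		return nil, errors.New("manifest: missing snapshot_id")
	}
	for path, f := range m.Files {
		if f.SHA256 == "" || f.Size < 0 {
			return nil, errors.New("manifest: bad entry for " + path)
		}
	}
	return &m, nil
}
```

`time.Time` decodes the RFC 3339 timestamps in the example directly, so no custom unmarshalling is needed for `mtime` or `created_at`.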
## HTTP API
All endpoints under `/v1/`. JSON unless noted. Bearer auth required.
| Method | Path | Purpose |
|---|---|---|
| `GET` | `/v1/manifest` | Return latest manifest for caller. 200 + JSON, or 204 if no snapshots yet. |
| `GET` | `/v1/blob/{sha256}` | Stream raw file content. 200 + bytes, or 404. Used during conflict resolution to fetch a specific file the client wants from the remote. |
| `POST` | `/v1/snapshot` | Upload new snapshot. Body = multipart `{manifest.json, snapshot.tar.zst}`. Server validates manifest matches tar contents (hash check), assigns snapshot_id, stores. Returns `{snapshot_id, snapshot_url}`. |
| `GET` | `/v1/snapshots` | List caller's snapshot IDs + timestamps (newest first). Used for restore UI / debugging. |
| `GET` | `/v1/snapshot/{id}` | Download a specific historical tarball. |
| `DELETE` | `/v1/snapshot/{id}` | Delete a specific snapshot (e.g., compromised data). Cannot delete the latest. |
| `GET` | `/v1/quota` | `{used_bytes, limit_bytes, snapshots, snapshot_limit}` |
All authentication errors return `401 {error: "auth"}`. Quota errors return `413 {error: "quota", ...}`. Schema validation failures return `400`. An unknown user (token revoked or Discord ID stripped) gets `403`.
## Pull semantics (client side)
```
1. GET /v1/manifest → remote manifest
2. Walk local include-paths, compute (path, mtime, sha256) for each file
3. For each path in (remote ∪ local):
     remote_only        → DOWNLOAD via GET /v1/blob/{sha256}, write file, set mtime to remote.mtime
     local_only         → no-op (will push on exit)
     both, sha matches  → no-op
     both, sha differs (checked in this order):
       |remote.mtime − local.mtime| ≤ 2s → CONFLICT: surface in dialog
       remote.mtime > local.mtime        → AUTO_REMOTE: download, overwrite, set mtime
       local.mtime > remote.mtime        → AUTO_LOCAL: keep local (will push on exit)
```
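The 2s threshold absorbs FS-level mtime rounding (FAT32, for example, stores mtimes at 2-second resolution). Files whose mtimes fall inside that window cannot be ordered reliably, so they go to the dialog instead of being auto-resolved.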
## Push semantics (client side)
```
1. Walk local include-paths, build per-file (path, mtime, sha256)
2. GET /v1/manifest, build delta:
     in local, not in remote           → new
     in local, sha differs from remote → changed
     in remote, not in local           → DELETED (manifest entry, no blob)
3. Build manifest.json for new snapshot:
     {snapshot_id, created_at, files: {<full current set>}}
4. Build tarball: only new + changed files (deleted entries omitted)
5. POST /v1/snapshot with manifest + tarball
6. On 200, save snapshot_id to local state file (used for next pull's known-base)
7. On 413 (quota), surface to user; offer pruning or scope reduction
```
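Step 2 of the push flow is a plain set comparison keyed by path. A minimal sketch, again in Go for consistency even though the client is `client.py`; `buildDelta` and the `delta` type are hypothetical, and the inputs are assumed to be path → sha256 maps:

```go
package main

import "sort"

// delta lists the paths in each push category, sorted for stable output.
type delta struct {
	New, Changed, Deleted []string
}

// buildDelta compares local and remote path → sha256 maps.
func buildDelta(local, remote map[string]string) delta {
	var d delta
	for p, sha := range local {
		rsha, ok := remote[p]
		switch {
		case !ok:
			d.New = append(d.New, p)
		case rsha != sha:
			d.Changed = append(d.Changed, p)
		}
	}
	for p := range remote {
		if _, ok := local[p]; !ok {
			d.Deleted = append(d.Deleted, p)
		}
	}
	sort.Strings(d.New)
	sort.Strings(d.Changed)
	sort.Strings(d.Deleted)
	return d
}
```

Only `New` and `Changed` feed the tarball in step 4; `Deleted` affects the manifest alone.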
## Conflict UI
Pre-launch (after packwiz, before MC starts). When `pull` finds files in CONFLICT state, render a dialog. Cross-platform via stdlib `tkinter`:
```
┌──────────────────────────────────────────────────────────┐
│ Cloud sync — manual resolve needed │
├──────────────────────────────────────────────────────────┤
│ Some files differ between this machine and your cloud: │
│ │
│ File Local Remote │
│ options.txt 18:25 +0200 18:24 +0200 │
│ ( ) keep local ( ) use remote │
│ │
│ config/voicechat-client.json 22:14 +0200 22:13 +0200 │
│ ( ) keep local ( ) use remote │
│ │
│ [Use local for all] [Use remote for all] [Cancel launch] │
│ │
│ [Continue launch] │
└──────────────────────────────────────────────────────────┘
```
Defaults per-row to "use remote" (matches Steam's default — pull is destructive but consistent). User can override per file.
**Cancel launch** = abort `client.py`, return non-zero. Player can fix manually then re-run.
## What syncs (configurable per distribution)
`cloud-scope.json` next to `client.py`:
```json
{
"include": [
"options.txt",
"optionsof.txt",
"optionsshaders.txt",
"config/",
"journeymap/data/",
"screenshots/"
],
"exclude": [
"config/simple-mod-sync*",
"config/packwiz*",
"**/.tmp",
"**/cache/"
],
"max_size_mb_per_file": 50,
"max_total_mb": 200
}
```
Defaults are baked into client.py if the file is absent. JourneyMap (`journeymap/data/`) tracks per-server worlds, waypoints, settings — explicitly included.
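The document does not define the glob semantics of `cloud-scope.json`, so the following is only one plausible reading, sketched in Go: include entries ending in `/` are directory prefixes, a leading `**/` matches a name at any depth, and everything else goes through `path.Match`. All names here are hypothetical:

```go
package main

import (
	"path"
	"strings"
)

// scope holds the include/exclude lists from cloud-scope.json.
type scope struct {
	Include []string
	Exclude []string
}

// inScope reports whether a slash-separated relative path should be synced.
func (s scope) inScope(p string) bool {
	included := false
	for _, inc := range s.Include {
		if p == inc || (strings.HasSuffix(inc, "/") && strings.HasPrefix(p, inc)) {
			included = true
			break
		}
	}
	if !included {
		return false
	}
	for _, ex := range s.Exclude {
		if matchExclude(ex, p) {
			return false
		}
	}
	return true
}

// matchExclude implements the assumed exclude semantics described above.
func matchExclude(pattern, p string) bool {
	if rest, anyDepth := strings.CutPrefix(pattern, "**/"); anyDepth {
		if dir, isDir := strings.CutSuffix(rest, "/"); isDir {
			// "**/cache/": any directory segment named "cache".
			segs := strings.Split(p, "/")
			for _, seg := range segs[:len(segs)-1] {
				if seg == dir {
					return true
				}
			}
			return false
		}
		// "**/.tmp": match against the file name only.
		matched, _ := path.Match(rest, path.Base(p))
		return matched
	}
	// "config/packwiz*": "*" does not cross "/" in path.Match.
	matched, _ := path.Match(pattern, p)
	return matched
}
```

Under this reading `config/packwiz.toml` is excluded but `config/packwiz/x.toml` is not; if directories should be excluded too, the pattern semantics need to be pinned down first.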
## Auth-service integration
cloud-svc → auth-service contract (already exists in `auth-service/server.go`):
```
POST http://auth-service:9090/auth/verify-key
Authorization: <SM_API_KEY> # cloud-svc's own service token
Body: { "key": "<player-token>" }
200 { "user_id": "<discord_id>", "scopes": ["cloud:rw"] }
401 { "error": "invalid" }
403 { "error": "revoked" }
```
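cloud-svc caches verified tokens in-memory for 60s to avoid hammering auth-service. The trade-off is that a revoked token can keep working for up to 60s after the `api_keys.revoked` update. Cache invalidated on 401 from client (forced refresh).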
## Encryption at rest
**Optional, deferred.** Today: blobs are raw on `john`'s disk, owned by `automc` user, mode 0600. Filesystem permissions are the only barrier. Acceptable for pre-prod.
For production: per-user symmetric key (derived from Discord ID + master secret) encrypts blobs with AES-GCM. Manifest stored with the encrypted blob mappings; client provides key per request. Adds significant complexity — defer until production scale.
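To give a sense of the moving parts, here is a sketch of per-user AES-256-GCM blob sealing. The key derivation shown (HMAC-SHA256 of the Discord ID under the master secret) is an assumption standing in for "derived from Discord ID + master secret"; a real design would likely pick a reviewed KDF such as HKDF. Function names are hypothetical:

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"errors"
)

// userKey derives a 32-byte per-user key (assumed scheme, see lead-in).
func userKey(master []byte, discordID string) []byte {
	mac := hmac.New(sha256.New, master)
	mac.Write([]byte(discordID))
	return mac.Sum(nil)
}

func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// sealBlob returns nonce || ciphertext || tag.
func sealBlob(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// openBlob reverses sealBlob and fails if the data was modified.
func openBlob(key, sealed []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	n := gcm.NonceSize()
	if len(sealed) < n {
		return nil, errors.New("sealed blob too short")
	}
	return gcm.Open(nil, sealed[:n], sealed[n:], nil)
}
```

Note that random nonces make identical plaintexts encrypt differently, so content-addressed dedupe would have to key on the plaintext hash rather than on the stored bytes.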
## Quadlet template
```ini
[Unit]
Description=automc cloud-svc (player state sync)
After=automc-pg.service auth-service.service
[Container]
ContainerName=cloud-svc
Image=git.timemachine.center/timemachine/cloud-svc:latest
Network=automc-net
NetworkAlias=cloud-svc
Environment=TZ={{ tz }}
EnvironmentFile=%h/automc/secrets/cloud-svc.env
PublishPort=127.0.0.1:9091:9091
Volume=%h/automc/cloud-data:/data:Z
[Service]
Restart=always
[Install]
WantedBy=default.target
```
Bound to `127.0.0.1:9091` — players reach it only via SSH tunnel during dev. In real deployment, exposed via a reverse-proxy on the same hostname they use for the packwiz pack URL (e.g., `packs.timemachine.center/cloud/...`).
## Out of scope (v1)
- **Selective restore from old snapshot** — UI for "go back to last Tuesday's state". The API supports it (`GET /v1/snapshot/{id}` + manual extraction); the UI is deferred.
- **Multi-device live conflict** (player on PC + laptop simultaneously) — single-machine assumption, race documented.
- **Compression tuning** — zstd level 3 default. May tune up to 6 if disk pressure observed.
- **Anti-replay on tokens** — straight bearer auth. If a token leaks, revoke it. Not a primary attack surface.
- **Cross-server modpack-aware filtering** — cloud-scope.json is per-distribution, not per-server. Different servers might want different scopes; defer.
## Stack
- **Go 1.24+** (matches other automc services)
- `jackc/pgx/v5` for pg (if metadata stored there; alternative: sqlite per-deploy)
- `klauspost/compress/zstd` for tarball compression
- Standard `archive/tar` for tarball assembly
- No external HTTP framework — `net/http` + `gorilla/mux` style routing (or stdlib `http.ServeMux` Go 1.22+ pattern syntax)
## File / module layout
```
cloud-svc/
  cmd/cloud-svc/main.go
  internal/
    api/                 ← HTTP handlers per endpoint
    storage/             ← on-disk blob + tarball R/W, GC
    manifest/            ← manifest.json types + (de)serialize + validate
    auth/                ← auth-service client (token verify + cache)
    quota/               ← per-user quota tracking
  database/migrations/   ← if pg-backed metadata
  Dockerfile
  Makefile
  .gitea/workflows/ci.yaml
  docs/ARCHITECTURE.md
```
~1500 LOC Go total estimate.
## Open questions
1. **Metadata in automc-pg or sqlite-per-instance?**
- pg: shared with other services, easier ops, schema migrations same pipeline.
- sqlite: zero coupling, faster local I/O, harder to query externally.
- Recommendation: **pg** for consistency with the rest of automc.
2. **Quota source**: hardcoded per-user, or per-user-row in DB?
- DB row allows admin to bump limits per player. Use `users.cloud_quota_bytes` column (nullable, default to global).
3. **Reverse proxy in front of cloud-svc**: needed for player-facing URL (`packs.timemachine.center/cloud/...`)?
- Either nginx fronting cloud-svc, or expose cloud-svc directly with its own TLS via Caddy/something. Defer.