initial: Steam-Cloud-style per-user state sync skeleton
CI / validate (push) Successful in 26s
CI / docker (push) Failing after 8s

HTTP API + on-disk storage + auth-service token verification + dev mode.
31 tests pass, vet clean. See DESIGN.md for the architecture and
README.md for the operator surface.

Pending: pg-backed per-user quota override, snapshot retention / blob GC,
tarball-vs-manifest content cross-check, end-to-end deploy on john.
2026-06-02 18:52:25 +02:00
commit 1752ef05a6
16 changed files with 2039 additions and 0 deletions
@@ -0,0 +1,266 @@
# cloud-svc design
Steam-Cloud-style per-user file sync for Minecraft clients. Player launches their client → state pulled. Player exits → state pushed. Across machines, conflicts resolved via per-file mtime + a pre-launch dialog when ambiguity remains.
## Identity
User identity = Discord ID (already issued by automc's account-card flow). Cloud token is a long-lived API key with scope `cloud:rw`, issued via `auth-service` and tied to the Discord ID.
```
client.py ──── Authorization: Bearer <cloud-token> ────► cloud-svc
                                                             ▼ POST /auth/verify-key
                                                        auth-service ──► returns { user_id (= Discord ID), scopes }
```
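cloud-svc never takes a Discord ID directly from the client; it always asks auth-service. Token revocation is a single DB UPDATE on `api_keys.revoked`. Because cloud-svc caches verified tokens for 60s (see Auth-service integration), a revoked token can stay usable for up to that long.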
## On-disk layout
`~/automc/cloud-data/` on `john`, owned by `automc` user. Per-Discord-ID prefix:
```
cloud-data/
  <discord_id>/
    manifest.json              ← latest snapshot's per-file mtime + sha256 map
    snapshots/
      <snapshot_id>.tar.zst    ← full content tarball (zstd-compressed)
    blobs/                     ← content-addressed dedupe (sha256-prefixed dirs)
      ab/
        abcdef0123...          ← raw file content, referenced by manifest
      cd/
        cdef0123...
```
**Why both tarball AND blob store?**
- Tarballs are the snapshot's immutable historical record (good for restore-to-version).
- Blob store is the on-line read path: per-file fetch on conflict resolution. Tarballs would force decompress-everything for one file.
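- Same content in two places = waste. Proposed fix: **hard-link** instead of copying, so cloud-svc writes one canonical copy of each file. Caveat: a hard-link entry inside a tar archive can only point at another member of the same archive, not at a file under `blobs/`, so tar-level hardlinks dedupe within a single snapshot at most. Cross-snapshot dedupe has to come from the blob store itself.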
If hardlinks turn out to be a pain across the rootless-podman boundary, fall back to blob-only and synthesize tarballs on demand for download. Defer the call.
## Snapshot retention
| Trigger | Action |
|---|---|
| `cloud_push` from client | New snapshot row. Tar written. Manifest updated. |
| Snapshot count > `RETAIN_LATEST` (default 30) | Oldest deleted. Hardlinked blobs that lose all refs are GC'd in a periodic job. |
| Per-user quota exceeded | Reject push with HTTP 413 + JSON `{error: "quota", used, limit}`. Client surfaces in UI. |
Snapshot IDs are ULIDs (26 characters, timestamp-sortable, unique without coordination). Tarball name = `<snapshot_id>.tar.zst`.
## Per-file metadata
`manifest.json` (per user, latest snapshot only):
```json
{
  "snapshot_id": "01J9XQK4Z3...",
  "created_at": "2026-06-02T18:30:00Z",
  "files": {
    "options.txt": { "sha256": "ab...", "size": 5234, "mtime": "2026-06-02T18:25:11Z" },
    "config/voicechat-client.json": { "sha256": "cd...", "size": 432, "mtime": "2026-06-01T22:14:03Z" },
    "journeymap/data/sp/world1/...": { "sha256": "ef...", "size": 88123, "mtime": "2026-06-02T18:29:42Z" }
  }
}
```
The top-level `manifest.json` is stored only for the **latest** snapshot. Older snapshots' manifests live alongside their `.tar.zst`; reading an old manifest means opening the tar header. Each entry's `mtime` is **the file's mtime on the client at push time**, not the upload time. This is the source of truth for conflict resolution.
## HTTP API
All endpoints under `/v1/`. JSON unless noted. Bearer auth required.
| Method | Path | Purpose |
|---|---|---|
| `GET` | `/v1/manifest` | Return latest manifest for caller. 200 + JSON, or 204 if no snapshots yet. |
| `GET` | `/v1/blob/{sha256}` | Stream raw file content. 200 + bytes, or 404. Used during conflict resolution to fetch a specific file the client wants from the remote. |
| `POST` | `/v1/snapshot` | Upload new snapshot. Body = multipart `{manifest.json, snapshot.tar.zst}`. Server validates manifest matches tar contents (hash check), assigns snapshot_id, stores. Returns `{snapshot_id, snapshot_url}`. |
| `GET` | `/v1/snapshots` | List caller's snapshot IDs + timestamps (newest first). Used for restore UI / debugging. |
| `GET` | `/v1/snapshot/{id}` | Download a specific historical tarball. |
| `DELETE` | `/v1/snapshot/{id}` | Delete a specific snapshot (e.g., compromised data). Cannot delete the latest. |
| `GET` | `/v1/quota` | `{used_bytes, limit_bytes, snapshots, snapshot_limit}` |
All authentication errors return `401 {error: "auth"}`. Quota errors return `413 {error: "quota", ...}`. Schema validation failures return `400`. An unknown user (token revoked / Discord ID stripped) gets `403`.
## Pull semantics (client side)
```
1. GET /v1/manifest → remote manifest
2. Walk local include-paths, compute (path, mtime, sha256) for each file
3. For each path in (remote ∪ local):
     remote_only       → DOWNLOAD via GET /v1/blob/{sha256}, write file, set mtime to remote.mtime
     local_only        → no-op (will push on exit)
     both, sha matches → no-op
     both, sha differs (checked in this order):
       |remote.mtime - local.mtime| ≤ 2s → CONFLICT: surface in dialog
       remote.mtime > local.mtime        → AUTO_REMOTE: download, overwrite, set mtime
       local.mtime > remote.mtime        → AUTO_LOCAL: keep local (will push on exit)
```
The 2s threshold absorbs FS-level mtime rounding (FAT, for example, stores modification times at 2-second resolution).
## Push semantics (client side)
```
1. Walk local include-paths, build per-file (path, mtime, sha256)
2. GET /v1/manifest, build delta:
     in local, not in remote           → new
     in local, sha differs from remote → changed
     in remote, not in local           → DELETED (manifest entry, no blob)
3. Build manifest.json for new snapshot:
     {snapshot_id, created_at, files: {<full current set>}}
4. Build tarball: only new + changed files (deleted entries omitted)
5. POST /v1/snapshot with manifest + tarball
6. On 200, save snapshot_id to local state file (used for next pull's known-base)
7. On 413 (quota), surface to user; offer pruning or scope reduction
```
## Conflict UI
Pre-launch (after packwiz, before MC starts). When `pull` finds files in CONFLICT state, render a dialog. Cross-platform via stdlib `tkinter`:
```
┌──────────────────────────────────────────────────────────┐
│ Cloud sync — manual resolve needed                       │
├──────────────────────────────────────────────────────────┤
│ Some files differ between this machine and your cloud:   │
│                                                          │
│ File                          Local         Remote       │
│ options.txt                   18:25 +0200   18:24 +0200  │
│     ( ) keep local    ( ) use remote                     │
│                                                          │
│ config/voicechat-client.json  22:14 +0200   22:13 +0200  │
│     ( ) keep local    ( ) use remote                     │
│                                                          │
│ [Use local for all] [Use remote for all] [Cancel launch] │
│                                                          │
│ [Continue launch]                                        │
└──────────────────────────────────────────────────────────┘
```
Defaults per-row to "use remote" (matches Steam's default — pull is destructive but consistent). User can override per file.
**Cancel launch** = abort `client.py`, return non-zero. Player can fix manually then re-run.
## What syncs (configurable per distribution)
`cloud-scope.json` next to `client.py`:
```json
{
  "include": [
    "options.txt",
    "optionsof.txt",
    "optionsshaders.txt",
    "config/",
    "journeymap/data/",
    "screenshots/"
  ],
  "exclude": [
    "config/simple-mod-sync*",
    "config/packwiz*",
    "**/.tmp",
    "**/cache/"
  ],
  "max_size_mb_per_file": 50,
  "max_total_mb": 200
}
```
Defaults are baked into client.py if the file is absent. JourneyMap (`journeymap/data/`) tracks per-server worlds, waypoints, settings — explicitly included.
## Auth-service integration
cloud-svc → auth-service contract (already exists in `auth-service/server.go`):
```
POST http://auth-service:9090/auth/verify-key
Authorization: <SM_API_KEY> # cloud-svc's own service token
Body: { "key": "<player-token>" }
200 { "user_id": "<discord_id>", "scopes": ["cloud:rw"] }
401 { "error": "invalid" }
403 { "error": "revoked" }
```
cloud-svc caches verified tokens in-memory for 60s to avoid hammering auth-service. Cache invalidated on 401 from client (forced refresh).
## Encryption at rest
**Optional, deferred.** Today: blobs are raw on `john`'s disk, owned by `automc` user, mode 0600. Filesystem permissions are the only barrier. Acceptable for pre-prod.
For production: per-user symmetric key (derived from Discord ID + master secret) encrypts blobs with AES-GCM. Manifest stored with the encrypted blob mappings; client provides key per request. Adds significant complexity — defer until production scale.
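For a sense of the moving parts, here is a sketch of per-user key derivation plus AES-GCM sealing using only the Go standard library. It makes several assumptions the doc leaves open: the key is derived server-side as HMAC-SHA256(master secret, label + Discord ID), a random nonce is prepended to each ciphertext, and no additional authenticated data is used. It does not cover the "client provides key per request" variant or manifest handling.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"errors"
	"fmt"
)

// deriveUserKey (assumed scheme) produces a 32-byte AES-256 key per Discord ID.
func deriveUserKey(master []byte, discordID string) []byte {
	mac := hmac.New(sha256.New, master)
	mac.Write([]byte("cloud-svc blob key v1:" + discordID))
	return mac.Sum(nil)
}

// newGCM builds an AES-GCM AEAD from a 16-, 24-, or 32-byte key.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// sealBlob encrypts plaintext and returns nonce || ciphertext || tag.
func sealBlob(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// openBlob reverses sealBlob and fails if the data was tampered with
// or the key is wrong.
func openBlob(key, sealed []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("ciphertext too short")
	}
	nonce, ct := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, nil)
}

func main() {
	key := deriveUserKey([]byte("demo master secret"), "123456789012345678")
	sealed, err := sealBlob(key, []byte("options.txt contents"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	plain, err := openBlob(key, sealed)
	fmt.Println(string(plain), err)
}
```

One design consequence worth noting: if blobs stay addressed by the plaintext sha256, the digest itself still reveals when two snapshots share a file.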
## Quadlet template
```ini
[Unit]
Description=automc cloud-svc (player state sync)
After=automc-pg.service auth-service.service
[Container]
ContainerName=cloud-svc
Image=git.timemachine.center/timemachine/cloud-svc:latest
Network=automc-net
NetworkAlias=cloud-svc
Environment=TZ={{ tz }}
EnvironmentFile=%h/automc/secrets/cloud-svc.env
PublishPort=127.0.0.1:9091:9091
Volume=%h/automc/cloud-data:/data:Z
[Service]
Restart=always
[Install]
WantedBy=default.target
```
Bound to `127.0.0.1:9091` — players reach it only via SSH tunnel during dev. In real deployment, exposed via a reverse-proxy on the same hostname they use for the packwiz pack URL (e.g., `packs.timemachine.center/cloud/...`).
## Out of scope (v1)
- **Selective restore from old snapshot** — UI for "go back to last Tuesday's state". The API supports it (`GET /v1/snapshot/{id}` + manual extraction); the UI is deferred.
- **Multi-device live conflict** (player on PC + laptop simultaneously) — single-machine assumption, race documented.
- **Compression tuning** — zstd level 3 default. May tune up to 6 if disk pressure observed.
- **Anti-replay on tokens** — straight bearer auth. If a token leaks, revoke it. Not a primary attack surface.
- **Cross-server modpack-aware filtering** — cloud-scope.json is per-distribution, not per-server. Different servers might want different scopes; defer.
## Stack
- **Go 1.24+** (matches other automc services)
- `jackc/pgx/v5` for pg (if metadata stored there; alternative: sqlite per-deploy)
- `klauspost/compress/zstd` for tarball compression
- Standard `archive/tar` for tarball assembly
- No external HTTP framework — `net/http` + `gorilla/mux` style routing (or stdlib `http.ServeMux` Go 1.22+ pattern syntax)
## File / module layout
```
cloud-svc/
  cmd/cloud-svc/main.go
  internal/
    api/                     ← HTTP handlers per endpoint
    storage/                 ← on-disk blob + tarball R/W, GC
    manifest/                ← manifest.json types + (de)serialize + validate
    auth/                    ← auth-service client (token verify + cache)
    quota/                   ← per-user quota tracking
  database/migrations/       ← if pg-backed metadata
  Dockerfile
  Makefile
  .gitea/workflows/ci.yaml
  docs/ARCHITECTURE.md
```
~1500 LOC Go total estimate.
## Open questions
1. **Metadata in automc-pg or sqlite-per-instance?**
   - pg: shared with other services, easier ops, schema migrations same pipeline.
   - sqlite: zero coupling, faster local I/O, harder to query externally.
   - Recommendation: **pg** for consistency with the rest of automc.
2. **Quota source**: hardcoded per-user, or per-user-row in DB?
   - DB row allows admin to bump limits per player. Use `users.cloud_quota_bytes` column (nullable, default to global).
3. **Reverse proxy in front of cloud-svc**: needed for player-facing URL (`packs.timemachine.center/cloud/...`)?
   - Either nginx fronting cloud-svc, or expose cloud-svc directly with its own TLS via Caddy/something. Defer.