cloud-svc/DESIGN.md
claude-timemachine 1752ef05a6
initial: Steam-Cloud-style per-user state sync skeleton
HTTP API + on-disk storage + auth-service token verification + dev mode.
31 tests pass, vet clean. See DESIGN.md for the architecture and
README.md for the operator surface.

Pending: pg-backed per-user quota override, snapshot retention / blob GC,
tarball-vs-manifest content cross-check, end-to-end deploy on john.
2026-06-02 18:52:25 +02:00

cloud-svc design

Steam-Cloud-style per-user file sync for Minecraft clients. Player launches their client → state pulled. Player exits → state pushed. Across machines, conflicts resolved via per-file mtime + a pre-launch dialog when ambiguity remains.

Identity

User identity = Discord ID (already issued by automc's account-card flow). Cloud token is a long-lived API key with scope cloud:rw, issued via auth-service and tied to the Discord ID.

client.py  ──── Authorization: Bearer <cloud-token> ────►  cloud-svc
                                                              │
                                                              ▼ POST /auth/verify-key
                                                          auth-service ──► returns { user_id: <discord_id>, scopes }

cloud-svc never sees Discord IDs directly from the client — it always asks auth-service. Token revocation is a single DB UPDATE on api_keys.revoked.

On-disk layout

~/automc/cloud-data/ on john, owned by automc user. Per-Discord-ID prefix:

cloud-data/
  <discord_id>/
    manifest.json           ← latest snapshot's per-file mtime + sha256 map
    snapshots/
      <snapshot_id>.tar.zst ← full content tarball (zstd-compressed)
    blobs/                  ← content-addressed dedupe (sha256-prefixed dirs)
      ab/
        abcdef0123...       ← raw file content, referenced by manifest
      cd/
        cdef0123...
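
The mapping from a content hash to its blob location can be sketched as below. This is a Python illustration only (the service itself is planned in Go); `blob_path` and its parameters are hypothetical names, and the two-character fan-out is read off the tree above. The hex validation is an assumption about what the server should enforce, since the hash arrives in a URL path (GET /v1/blob/{sha256}).

```python
from pathlib import PurePosixPath

HEX_DIGITS = set("0123456789abcdef")


def blob_path(root: str, discord_id: str, sha256_hex: str) -> PurePosixPath:
    """Return <root>/<discord_id>/blobs/<first two hex chars>/<full hash>."""
    # Accept only 64 lowercase hex characters, so a caller-supplied value
    # such as "../../etc/passwd" can never be turned into a filesystem path.
    if len(sha256_hex) != 64 or not set(sha256_hex) <= HEX_DIGITS:
        raise ValueError("sha256 must be 64 lowercase hex characters")
    return PurePosixPath(root) / discord_id / "blobs" / sha256_hex[:2] / sha256_hex
```

For example, `blob_path("cloud-data", "42", "ab" * 32)` yields `cloud-data/42/blobs/ab/abab...` with the full 64-character hash as the file name.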

Why both tarball AND blob store?

  • Tarballs are the snapshot's immutable historical record (good for restore-to-version).
  • Blob store is the on-line read path: per-file fetch on conflict resolution. Tarballs would force decompress-everything for one file.
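  • Same content in two places = waste. The intended fix is hard-linking at write time: cloud-svc writes one canonical copy and tarball entries are hardlinks to it. Caveat: a tar hardlink entry can only point at another member of the same archive, and the contents of a zstd-compressed tarball cannot share disk blocks with files outside it, so this approach needs validation before it is relied on.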

If hardlinks turn out to be a pain across the rootless-podman boundary, fall back to blob-only and synthesize tarballs on demand for download. Defer the call.

Snapshot retention

| Trigger | Action |
|---|---|
| cloud_push from client | New snapshot row. Tar written. Manifest updated. |
| Snapshot count > RETAIN_LATEST (default 30) | Oldest deleted. Hardlinked blobs that lose all refs are GC'd in a periodic job. |
| Per-user quota exceeded | Reject push with HTTP 413 + JSON {error: "quota", used, limit}. Client surfaces in UI. |

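Snapshot IDs are ULIDs (timestamp-sortable, unique without coordination). Tarball name = <ULID>.tar.zst. A ULID is 26 Crockford base32 characters in total, so a prefix such as 01J is followed by 23 more characters.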
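
A minimal sketch of why ULIDs suit the retention rule: IDs sort lexicographically by creation time, so "delete the oldest" reduces to a plain string sort. The encoder follows the public ULID layout (48-bit millisecond timestamp, 80 random bits, Crockford base32). `new_ulid` and `snapshots_to_delete` are hypothetical names, and the real service would more likely use an existing Go ULID package.

```python
import os
import time

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def new_ulid(now_ms=None, randomness=None) -> str:
    """Encode a 48-bit millisecond timestamp plus 80 random bits as 26 chars."""
    if now_ms is None:
        now_ms = time.time_ns() // 1_000_000
    if randomness is None:
        randomness = os.urandom(10)
    if not 0 <= now_ms < (1 << 48) or len(randomness) != 10:
        raise ValueError("need a 48-bit timestamp and 10 random bytes")
    value = (now_ms << 80) | int.from_bytes(randomness, "big")
    chars = []
    for _ in range(26):  # 26 base32 digits cover the 128-bit value
        chars.append(CROCKFORD[value & 31])
        value >>= 5
    return "".join(reversed(chars))


def snapshots_to_delete(snapshot_ids, retain_latest=30):
    """Return the oldest IDs beyond the retention limit, oldest first."""
    ordered = sorted(snapshot_ids)  # string order equals creation order
    excess = len(ordered) - retain_latest
    return ordered[:excess] if excess > 0 else []
```

Two IDs minted in the same millisecond are ordered by their random part, which is arbitrary; whether that matters for retention is left open here.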

Per-file metadata

manifest.json (per user, latest snapshot only):

{
  "snapshot_id": "01J9XQK4Z3...",
  "created_at": "2026-06-02T18:30:00Z",
  "files": {
    "options.txt":                   { "sha256": "ab...", "size": 5234, "mtime": "2026-06-02T18:25:11Z" },
    "config/voicechat-client.json":  { "sha256": "cd...", "size": 432,  "mtime": "2026-06-01T22:14:03Z" },
    "journeymap/data/sp/world1/...": { "sha256": "ef...", "size": 88123,"mtime": "2026-06-02T18:29:42Z" }
  }
}

A standalone manifest.json is stored only for the latest snapshot. Older snapshots' manifests live alongside .tar.zst; reading an old manifest = open tar header. Each entry's mtime is the file's mtime on the client at push time, not the upload time. This is the source of truth for conflict resolution.
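
A hedged sketch of loading such a manifest into typed entries, in Python as client.py would need to. `FileEntry`, `parse_timestamp` and `load_manifest` are hypothetical names, and the path safety check is an assumption rather than something the design specifies.

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FileEntry:
    sha256: str
    size: int
    mtime: datetime


def parse_timestamp(text: str) -> datetime:
    # datetime.fromisoformat accepts a trailing "Z" only from Python 3.11 on,
    # so rewrite it as an explicit UTC offset first.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    return datetime.fromisoformat(text)


def load_manifest(raw: str) -> dict:
    """Parse manifest JSON into {"snapshot_id", "created_at", "files"}."""
    doc = json.loads(raw)
    files = {}
    for path, meta in doc["files"].items():
        # Assumed rule: manifest paths are relative and never climb upwards.
        if path.startswith("/") or ".." in path.split("/"):
            raise ValueError(f"unsafe path in manifest: {path!r}")
        files[path] = FileEntry(
            sha256=meta["sha256"],
            size=int(meta["size"]),
            mtime=parse_timestamp(meta["mtime"]),
        )
    return {
        "snapshot_id": doc["snapshot_id"],
        "created_at": parse_timestamp(doc["created_at"]),
        "files": files,
    }
```

Parsing mtime into timezone-aware datetimes keeps later comparisons between local and remote timestamps unambiguous.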

HTTP API

All endpoints under /v1/. JSON unless noted. Bearer auth required.

| Method | Path | Purpose |
|---|---|---|
| GET | /v1/manifest | Return latest manifest for caller. 200 + JSON, or 204 if no snapshots yet. |
| GET | /v1/blob/{sha256} | Stream raw file content. 200 + bytes, or 404. Used during conflict resolution to fetch a specific file the client wants from the remote. |
| POST | /v1/snapshot | Upload new snapshot. Body = multipart {manifest.json, snapshot.tar.zst}. Server validates manifest matches tar contents (hash check), assigns snapshot_id, stores. Returns {snapshot_id, snapshot_url}. |
| GET | /v1/snapshots | List caller's snapshot IDs + timestamps (newest first). Used for restore UI / debugging. |
| GET | /v1/snapshot/{id} | Download a specific historical tarball. |
| DELETE | /v1/snapshot/{id} | Delete a specific snapshot (e.g., compromised data). Cannot delete the latest. |
| GET | /v1/quota | Returns {used_bytes, limit_bytes, snapshots, snapshot_limit}. |

All authentication errors return 401 {error: "auth"}. Quota errors return 413 {error: "quota", ...}. Schema validation failures return 400. An unknown user (token revoked / Discord ID stripped) gets 403.
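
The hash check that POST /v1/snapshot performs could look roughly like this. It is a Python illustration of the logic only, with two loud assumptions: the tarball has already been zstd-decompressed (the stdlib `tarfile` module does not read zstd), and manifest entries without a tar member are acceptable because pushes carry only new and changed files. `check_tar_against_manifest` and `make_tar` are hypothetical helpers.

```python
import hashlib
import io
import tarfile


def check_tar_against_manifest(tar_bytes: bytes, files: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        for member in tar:
            entry = files.get(member.name)
            if not member.isfile():
                # Symlinks, devices and directories are rejected outright.
                problems.append(f"{member.name}: not a regular file")
            elif entry is None:
                problems.append(f"{member.name}: not listed in manifest")
            else:
                data = tar.extractfile(member).read()
                if len(data) != entry["size"]:
                    problems.append(f"{member.name}: size mismatch")
                if hashlib.sha256(data).hexdigest() != entry["sha256"]:
                    problems.append(f"{member.name}: sha256 mismatch")
    return problems


def make_tar(contents: dict) -> bytes:
    """Build an uncompressed in-memory tar from {name: bytes}, for demos."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in contents.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()
```

A non-empty result would map to the 400 response described above. Reading each member fully into memory is fine for a sketch; a real implementation would hash in chunks.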

Pull semantics (client side)

1. GET /v1/manifest → remote manifest
2. Walk local include-paths, compute (path, mtime, sha256) for each file
3. For each path in (remote ∪ local):
     remote_only  → DOWNLOAD via GET /v1/blob/{sha256}, write file, set mtime to remote.mtime
     local_only   → no-op (will push on exit)
     both, sha matches → no-op
     both, sha differs (checked in this order):
       |remote.mtime − local.mtime| ≤ 2s → CONFLICT: surface in dialog
       remote.mtime > local.mtime        → AUTO_REMOTE: download, overwrite, set mtime
       local.mtime > remote.mtime        → AUTO_LOCAL: keep local (will push on exit)

The 2s threshold absorbs FS-level mtime rounding (FAT, for example, stores modification times with 2-second granularity).
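
The per-path decision can be sketched as a pure function, which makes the rules easy to unit-test. This assumes the 2s tolerance check takes precedence over the newer-wins rules; `FileState`, `Action` and `decide` are hypothetical names, not taken from client.py.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import NamedTuple, Optional


class FileState(NamedTuple):
    sha256: str
    mtime: datetime


class Action(Enum):
    DOWNLOAD = "download"        # remote only
    NOOP = "noop"                # local only, or identical content
    AUTO_REMOTE = "auto_remote"  # remote clearly newer
    AUTO_LOCAL = "auto_local"    # local clearly newer
    CONFLICT = "conflict"        # too close to call, ask the user


MTIME_TOLERANCE = timedelta(seconds=2)


def decide(local: Optional[FileState], remote: Optional[FileState]) -> Action:
    """Classify one path given its local and remote state (None = absent)."""
    if local is None and remote is None:
        raise ValueError("path is absent on both sides")
    if local is None:
        return Action.DOWNLOAD
    if remote is None:
        return Action.NOOP
    if local.sha256 == remote.sha256:
        return Action.NOOP
    # Content differs: timestamps within the tolerance are ambiguous.
    if abs(remote.mtime - local.mtime) <= MTIME_TOLERANCE:
        return Action.CONFLICT
    return Action.AUTO_REMOTE if remote.mtime > local.mtime else Action.AUTO_LOCAL
```

Keeping the decision free of I/O means the download and dialog steps can be layered on top of its result.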

Push semantics (client side)

1. Walk local include-paths, build per-file (path, mtime, sha256)
2. GET /v1/manifest, build delta:
     in local, not in remote          → new
     in local, sha differs from remote → changed
     in remote, not in local           → DELETED (manifest entry, no blob)
3. Build manifest.json for new snapshot:
     {snapshot_id, created_at, files: {<full current set>}}   (snapshot_id is assigned by the server on upload)
4. Build tarball: only new + changed files (deleted entries omitted)
5. POST /v1/snapshot with manifest + tarball
6. On 200, save snapshot_id to local state file (used for next pull's known-base)
7. On 413 (quota), surface to user; offer pruning or scope reduction
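
Steps 1, 2 and 4 above can be sketched as follows. This is a minimal illustration: `scan_local` walks everything under a root and does not apply scope filtering, and both function names and the dict shapes are assumptions rather than client.py's actual structure.

```python
import hashlib
import os
from datetime import datetime, timezone


def scan_local(root: str) -> dict:
    """Map each file under root to its sha256, size and mtime (UTC)."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Manifest paths use forward slashes on every platform.
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            digest = hashlib.sha256()
            with open(full, "rb") as fh:
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
            stat = os.stat(full)
            result[rel] = {
                "sha256": digest.hexdigest(),
                "size": stat.st_size,
                "mtime": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc),
            }
    return result


def build_delta(local: dict, remote: dict) -> dict:
    """Classify paths as new, changed or deleted relative to the remote."""
    new = sorted(p for p in local if p not in remote)
    changed = sorted(
        p for p in local
        if p in remote and local[p]["sha256"] != remote[p]["sha256"]
    )
    deleted = sorted(p for p in remote if p not in local)
    # Only new and changed files go into the tarball.
    return {"new": new, "changed": changed, "deleted": deleted,
            "tar_members": new + changed}
```

The delta compares hashes only; mtime plays no part in deciding what to upload.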

Conflict UI

Pre-launch (after packwiz, before MC starts). When pull finds files in CONFLICT state, render a dialog. Cross-platform via stdlib tkinter:

┌──────────────────────────────────────────────────────────┐
│ Cloud sync — manual resolve needed                       │
├──────────────────────────────────────────────────────────┤
│ Some files differ between this machine and your cloud:   │
│                                                          │
│  File                          Local         Remote      │
│  options.txt                   18:25 +0200   18:24 +0200 │
│      ( ) keep local            ( ) use remote            │
│                                                          │
│  config/voicechat-client.json  22:14 +0200   22:13 +0200 │
│      ( ) keep local            ( ) use remote            │
│                                                          │
│ [Use local for all] [Use remote for all] [Cancel launch] │
│                                                          │
│ [Continue launch]                                        │
└──────────────────────────────────────────────────────────┘

Defaults per-row to "use remote" (matches Steam's default — pull is destructive but consistent). User can override per file.

Cancel launch = abort client.py, return non-zero. Player can fix manually then re-run.

What syncs (configurable per distribution)

cloud-scope.json next to client.py:

{
  "include": [
    "options.txt",
    "optionsof.txt",
    "optionsshaders.txt",
    "config/",
    "journeymap/data/",
    "screenshots/"
  ],
  "exclude": [
    "config/simple-mod-sync*",
    "config/packwiz*",
    "**/.tmp",
    "**/cache/"
  ],
  "max_size_mb_per_file": 50,
  "max_total_mb": 200
}

Defaults are baked into client.py if the file is absent. JourneyMap (journeymap/data/) tracks per-server worlds, waypoints, settings — explicitly included.
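
One plausible reading of the include/exclude rules, sketched with the stdlib `fnmatch` module. The matching semantics are assumptions, since the design does not pin them down: include entries ending in "/" are directory prefixes and all others are exact paths; exclude entries are glob patterns in which a trailing "/" means "everything below" and a leading "**/" also matches at the top level. Size limits are not handled here.

```python
from fnmatch import fnmatchcase

DEFAULT_SCOPE = {
    "include": ["options.txt", "optionsof.txt", "optionsshaders.txt",
                "config/", "journeymap/data/", "screenshots/"],
    "exclude": ["config/simple-mod-sync*", "config/packwiz*",
                "**/.tmp", "**/cache/"],
}


def _included(path: str, entry: str) -> bool:
    # "config/" selects a whole directory, "options.txt" a single file.
    return path.startswith(entry) if entry.endswith("/") else path == entry


def _excluded(path: str, pattern: str) -> bool:
    if pattern.endswith("/"):
        pattern += "*"  # directory pattern: match anything beneath it
    variants = [pattern]
    if pattern.startswith("**/"):
        variants.append(pattern[3:])  # let "**/x" match a top-level "x" too
    # Note: fnmatch's "*" also matches "/" characters.
    return any(fnmatchcase(path, v) for v in variants)


def in_scope(path: str, scope: dict = DEFAULT_SCOPE) -> bool:
    """True if a relative, forward-slash path should be synced."""
    if not any(_included(path, e) for e in scope["include"]):
        return False
    return not any(_excluded(path, p) for p in scope["exclude"])
```

Exclusions win over inclusions in this sketch, so `journeymap/data/sp/cache/tile.png` is skipped even though `journeymap/data/` is included.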

Auth-service integration

cloud-svc → auth-service contract (already exists in auth-service/server.go):

POST http://auth-service:9090/auth/verify-key
  Authorization: <SM_API_KEY>     # cloud-svc's own service token
  Body: { "key": "<player-token>" }

200 { "user_id": "<discord_id>", "scopes": ["cloud:rw"] }
401 { "error": "invalid" }
403 { "error": "revoked" }
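
For illustration, the contract above expressed as request construction and response handling in Python (cloud-svc itself would do this in Go). `build_verify_request` and `interpret_verify_response` are hypothetical names, and the Content-Type header is an assumption that the contract does not state.

```python
import json
import urllib.request

VERIFY_URL = "http://auth-service:9090/auth/verify-key"


def build_verify_request(player_token: str, service_key: str) -> urllib.request.Request:
    """Build the POST that asks auth-service to verify a player token."""
    body = json.dumps({"key": player_token}).encode("utf-8")
    return urllib.request.Request(
        VERIFY_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": service_key,            # cloud-svc's own token
            "Content-Type": "application/json",      # assumed, not in the contract
        },
    )


def interpret_verify_response(status: int, body: bytes):
    """Return {"discord_id", "scopes"} on 200, None on 401/403."""
    if status == 200:
        doc = json.loads(body)
        return {"discord_id": doc["user_id"], "scopes": doc["scopes"]}
    if status in (401, 403):
        return None  # invalid or revoked token
    raise RuntimeError(f"unexpected status {status} from auth-service")
```

Separating request building from response handling keeps both halves testable without a running auth-service.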

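cloud-svc caches verified tokens in-memory for 60s to avoid hammering auth-service. Cache invalidated on 401 from client (forced refresh). Because of the cache, a revoked token can keep working for up to 60s after the api_keys.revoked UPDATE.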
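
A minimal sketch of such a TTL cache, written in Python for illustration only. `TokenCache`, its `verify` callback and the injectable `clock` are hypothetical; negative results are not cached here, which is a design assumption.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TokenCache:
    """Remember successful token verifications for a fixed time window."""

    def __init__(self, verify: Callable[[str], Optional[dict]],
                 ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._verify = verify      # calls auth-service; returns None if rejected
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def lookup(self, token: str) -> Optional[dict]:
        now = self._clock()
        cached = self._entries.get(token)
        if cached is not None and cached[0] > now:
            return cached[1]       # still fresh: skip the auth-service call
        identity = self._verify(token)
        if identity is None:
            self._entries.pop(token, None)
            return None
        self._entries[token] = (now + self._ttl, identity)
        return identity

    def invalidate(self, token: str) -> None:
        """Drop a token so the next lookup goes back to auth-service."""
        self._entries.pop(token, None)
```

A concurrent server would need a lock around `_entries` and some bound on its size; both are omitted to keep the sketch short.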

Encryption at rest

Optional, deferred. Today: blobs are raw on john's disk, owned by automc user, mode 0600. Filesystem permissions are the only barrier. Acceptable for pre-prod.

For production: per-user symmetric key (derived from Discord ID + master secret) encrypts blobs with AES-GCM. Manifest stored with the encrypted blob mappings; client provides key per request. Adds significant complexity — defer until production scale.
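
If this route is taken, the per-user key derivation could be sketched as below. Using HKDF-SHA256 (RFC 5869) is an assumption, as the design only says "derived from Discord ID + master secret". `derive_user_key` and the `info` label are hypothetical, and the AES-GCM step itself is not shown because it needs a third-party library.

```python
import hashlib
import hmac


def derive_user_key(master_secret: bytes, discord_id: str) -> bytes:
    """Derive a 32-byte per-user key via single-block HKDF-SHA256."""
    # HKDF-Extract: with no salt supplied, RFC 5869 uses HashLen zero bytes.
    prk = hmac.new(b"\x00" * 32, master_secret, hashlib.sha256).digest()
    # HKDF-Expand for one output block: T(1) = HMAC(PRK, info || 0x01).
    info = b"cloud-svc/blob-key/" + discord_id.encode("utf-8")
    return hmac.new(prk, info + b"\x01", hashlib.sha256).digest()
```

Binding the Discord ID into `info` gives every user a distinct key from one master secret, and rotating the master secret rotates all of them at once.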

Quadlet template

[Unit]
Description=automc cloud-svc (player state sync)
After=automc-pg.service auth-service.service

[Container]
ContainerName=cloud-svc
Image=git.timemachine.center/timemachine/cloud-svc:latest
Network=automc-net
NetworkAlias=cloud-svc
Environment=TZ={{ tz }}
EnvironmentFile=%h/automc/secrets/cloud-svc.env
PublishPort=127.0.0.1:9091:9091
Volume=%h/automc/cloud-data:/data:Z

[Service]
Restart=always

[Install]
WantedBy=default.target

Bound to 127.0.0.1:9091 — players reach it only via SSH tunnel during dev. In real deployment, exposed via a reverse-proxy on the same hostname they use for the packwiz pack URL (e.g., packs.timemachine.center/cloud/...).

Out of scope (v1)

  • Selective restore from old snapshot — UI for "go back to last Tuesday's state". The API supports it (GET /v1/snapshot/{id} + manual extraction); the UI is deferred.
  • Multi-device live conflict (player on PC + laptop simultaneously) — single-machine assumption, race documented.
  • Compression tuning — zstd level 3 default. May tune up to 6 if disk pressure observed.
  • Anti-replay on tokens — straight bearer auth. If a token leaks, revoke it. Not a primary attack surface.
  • Cross-server modpack-aware filtering — cloud-scope.json is per-distribution, not per-server. Different servers might want different scopes; defer.

Stack

  • Go 1.24+ (matches other automc services)
  • jackc/pgx/v5 for pg (if metadata stored there; alternative: sqlite per-deploy)
  • klauspost/compress/zstd for tarball compression
  • Standard archive/tar for tarball assembly
  • No external HTTP framework: net/http with gorilla/mux-style routing, or the stdlib http.ServeMux pattern syntax (Go 1.22+)

File / module layout

cloud-svc/
  cmd/cloud-svc/main.go
  internal/
    api/      ← HTTP handlers per endpoint
    storage/  ← on-disk blob + tarball R/W, GC
    manifest/ ← manifest.json types + (de)serialize + validate
    auth/     ← auth-service client (token verify + cache)
    quota/    ← per-user quota tracking
  database/migrations/  ← if pg-backed metadata
  Dockerfile
  Makefile
  .gitea/workflows/ci.yaml
  docs/ARCHITECTURE.md

~1500 LOC Go total estimate.

Open questions

  1. Metadata in automc-pg or sqlite-per-instance?
    • pg: shared with other services, easier ops, schema migrations same pipeline.
    • sqlite: zero coupling, faster local I/O, harder to query externally.
    • Recommendation: pg for consistency with the rest of automc.
  2. Quota source: hardcoded per-user, or per-user-row in DB?
    • DB row allows admin to bump limits per player. Use users.cloud_quota_bytes column (nullable, default to global).
  3. Reverse proxy in front of cloud-svc: needed for player-facing URL (packs.timemachine.center/cloud/...)?
    • Either nginx fronting cloud-svc, or expose cloud-svc directly with its own TLS via Caddy/something. Defer.