remote-run-ssh
Run CVlization examples on the `ssh l1` GPU host by copying only the needed example directory plus the shared `cvlization/` package into `/tmp`, then launching the example’s Docker scripts.
$ 설치
git clone https://github.com/kungfuai/CVlization /tmp/CVlization && cp -r /tmp/CVlization/.claude/skills/remote-run-ssh ~/.claude/skills/CVlization// tip: Run this command in your terminal to install the skill
name: remote-run-ssh
description: Run CVlization examples on the ssh l1 GPU host by copying only the needed example directory plus the shared cvlization/ package into /tmp, then launching the example’s Docker scripts.
Remote Run over SSH
Operate CVlization examples on the remote GPU reachable as ssh l1. This playbook keeps the remote copy minimal—just the target example folder (e.g., examples/perception/multimodal_multitask/recipe_analysis_torch) and the cvlization/ library—then relies on the example’s own build.sh / train.sh Docker helpers. The user’s long-lived checkout on the remote stays untouched.
When to Use
- Heavy trainings or evaluations that require CUDA (CIFAR10 speed runs, multimodal pipelines, etc.).
- Performance or regression measurements on the remote GPU after local code changes.
- Producing reproducible logs / artifacts for discussions or CI baselines without pushing a branch first.
Prerequisites
- Local repo state ready to sync (uncommitted changes acceptable).
- SSH config already maps the GPU machine to
l1. - Remote host provides NVIDIA GPU (currently A10) and Docker with GPU runtime enabled.
- At least ~15 GB free under
/tmpfor the slim workspace, Docker context, and caches. - Hugging Face tokens or other creds available locally if the example pulls hub assets.
Quick Reference
- Choose the example to run.
rsynconly the example folder,cvlization/, and any required helper dirs to/tmp/cvlization_remoteonl1.- On
l1, run./build.shinside the example folder to build the Docker image. - Run
./train.sh(or the example’s equivalent) to launch the job with GPU access. - Collect logs / metrics and record the run in
var/skills/remote-run-ssh/runs/<timestamp>/log.md.
Detailed Procedure
1. Identify the example and supporting files
- Note the example path relative to repo root (e.g.,
examples/perception/multimodal_multitask/recipe_analysis_torch). - List any extra assets the run needs (custom configs under
examples, top-level scripts, environment files, etc.). - Confirm
cvlization/includes all library modules the example imports.
2. Sync minimal workspace to /tmp
REMOTE_ROOT=/tmp/cvlization_remote
rsync -az --delete \
--include='cvlization/***' \
--include='examples/***' \
--include='scripts/***' \
--include='pyproject.toml' \
--include='setup.cfg' \
--include='README.md' \
--exclude='*' \
./ l1:${REMOTE_ROOT}/
Tips:
- Adjust the include list if the example needs additional files (e.g.,
docker-compose.yml,requirements.txt). The blanket--exclude='*'prevents unrelated directories from syncing. - Keep the remote path structure (
${REMOTE_ROOT}/examples/...) aligned with the local repo so relative paths like../../../..used inside scripts still resolve to the repo root. - Avoid syncing
.git, local datasets,.venv, or heavy cache directories.
3. Build the example image
ssh l1 'cd /tmp/cvlization_remote/examples/perception/multimodal_multitask/recipe_analysis_torch && ./build.sh'
- The Docker build context is limited to the example folder, keeping builds quick.
- Edit
build.shlocally if you need custom base images or dependency tweaks before re-syncing.
4. Confirm GPU availability
ssh l1 'nvidia-smi'
Ensure no conflicting jobs are consuming the GPU before starting a long run.
5. Run the training script (Docker)
ssh l1 'cd /tmp/cvlization_remote/examples/perception/multimodal_multitask/recipe_analysis_torch && ./train.sh > run.log 2>&1'
- Tail logs while the job runs:
ssh l1 'tail -f /tmp/cvlization_remote/examples/perception/multimodal_multitask/recipe_analysis_torch/run.log'
train.shalready mounts the example directory at/workspace, mounts the synced repo root read-only at/cvlization_repo, and setsPYTHONPATH=/cvlization_repo.- Customize environment variables or extra mounts by editing
train.shlocally (e.g., injecting dataset paths, WANDB keys) then re-syncing. - If an example lacks Docker scripts, fall back to running its entrypoint directly (
python train.py) inside the container or a temporary venv—but document the deviation in the run log.
6. Capture metrics
Log files should include per-epoch summaries and final metrics. Record:
- Wall-clock time / throughput.
- Accuracy, loss, or other task metrics.
- Warnings or notable log lines (e.g., retry downloads, CUDA warnings).
7. Retrieve artifacts (optional)
rsync -az l1:/tmp/cvlization_remote/examples/perception/multimodal_multitask/recipe_analysis_torch/run.log \
./remote_runs/$(date +%Y%m%dT%H%M%S)_multimodal.log
Copy checkpoints, TensorBoard logs, or generated samples in the same manner if needed.
8. Document the run
- Create
var/skills/remote-run-ssh/runs/<timestamp>/log.mdsummarizing:- Example path, git commit or diff basis.
- Commands executed (
build.sh,train.shargs). - Key metrics / observations.
- Location of logs or artifacts (local paths or remote references).
9. Cleanup (optional)
- Delete
/tmp/cvlization_remotewhen finished if the disk budget is tight (ssh l1 'rm -rf /tmp/cvlization_remote'). - Clear cached datasets (
rm -rf /root/.cache/...inside Docker) only if future runs should start fresh.
Troubleshooting
- Missing module inside container: Ensure
train.shmounts the repo root and setsPYTHONPATH. Re-runrsyncif files were added after the initial sync. - Docker build fails: Inspect the output of
./build.sh. Some examples assume base images with CUDA toolkits—update the Dockerfile accordingly. - CUDA OOM: Reduce batch sizes or precision, or run one job at a time on
l1. - No GPU detected: Confirm
--gpus=allis present intrain.shand checknvidia-smion the host. - Long sync times: Tighten the rsync include rules to the specific example and library folders required.
Outputs
Every run should leave:
- Remote workspace at
/tmp/cvlization_remotecontaining the synced example + library. - Local or remote logs with the captured metrics.
- A run log under
var/skills/remote-run-ssh/runs/<timestamp>/log.mddocumenting the session.
Repository
