temporal

Manage Temporal workflows: server lifecycle, worker processes, workflow execution, monitoring, and troubleshooting for Python SDK with temporal server start-dev.

allowed_tools: Bash(.claude/skills/temporal/scripts/*:*), Read

$ Install

git clone https://github.com/steveandroulakis/temporal-conductor-migration-agent /tmp/temporal-conductor-migration-agent && cp -r /tmp/temporal-conductor-migration-agent/.claude/skills/temporal ~/.claude/skills/temporal-conductor-migration-agent

// tip: Run this command in your terminal to install the skill


name: temporal
description: "Manage Temporal workflows: server lifecycle, worker processes, workflow execution, monitoring, and troubleshooting for Python SDK with temporal server start-dev."
version: 1.0.1
allowed-tools: "Bash(.claude/skills/temporal/scripts/*:*), Read"

Temporal Skill

Manage Temporal workflows using a local development server. This skill focuses on the execution, validation, and troubleshooting lifecycle of workflows.

| Property | Value |
| --- | --- |
| Target SDK | Python only |
| Server Type | temporal server start-dev (local development) |
| gRPC Port | 7233 |

Critical Concepts

Understanding how Temporal components interact is essential for troubleshooting:

How Workers, Workflows, and Tasks Relate

┌─────────────────────────────────────────────────────────────────┐
│                     TEMPORAL SERVER                              │
│  Stores workflow history, manages task queues, coordinates work │
└─────────────────────────────────────────────────────────────────┘
                              │
                    Task Queue (named queue)
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                         WORKER                                   │
│  Long-running process that polls task queue for work            │
│  Contains: Workflow definitions + Activity implementations       │
│                                                                  │
│  When work arrives:                                              │
│    - Workflow Task → Execute workflow code decisions            │
│    - Activity Task → Execute activity code (business logic)     │
└─────────────────────────────────────────────────────────────────┘

Key Insight: The workflow code runs inside the worker. If worker code is outdated or buggy, workflow execution fails.

Workflow Task vs Activity Task

| Task Type | What It Does | Where It Runs | On Failure |
| --- | --- | --- | --- |
| Workflow Task | Makes workflow decisions (what to do next) | Worker | Stalls the workflow until fixed |
| Activity Task | Executes business logic | Worker | Retries per retry policy |

CRITICAL: Workflow Task errors are fundamentally different from Activity Task errors:

  • Workflow Task Failure → Workflow stops making progress entirely
  • Activity Task Failure → Workflow retries the activity (workflow still progressing)

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| CLAUDE_TEMPORAL_LOG_DIR | /tmp/claude-temporal-logs | Directory for worker log files |
| CLAUDE_TEMPORAL_PID_DIR | /tmp/claude-temporal-pids | Directory for worker PID files |
| CLAUDE_TEMPORAL_PROJECT_DIR | $(pwd) | Project root directory |
| CLAUDE_TEMPORAL_PROJECT_NAME | $(basename "$PWD") | Project name (used for log/PID naming) |
| CLAUDE_TEMPORAL_NAMESPACE | default | Temporal namespace |
| TEMPORAL_ADDRESS | localhost:7233 | Temporal server gRPC address |
| TEMPORAL_CLI | temporal | Path to Temporal CLI binary |
| TEMPORAL_WORKER_CMD | uv run worker | Command to start worker |
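
The defaults in the table can be resolved from Python as in the sketch below. This is an illustration, not the skill's own code: `resolve_settings` is a hypothetical helper, and the `worker-{project}.log` file name is taken from the Log Files section of this document.

```python
import os
from pathlib import Path


def resolve_settings(env=None, cwd=None):
    """Resolve the skill's environment variables, using the documented defaults.

    `resolve_settings` is a hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    cwd = Path.cwd() if cwd is None else Path(cwd)
    settings = {
        "log_dir": env.get("CLAUDE_TEMPORAL_LOG_DIR", "/tmp/claude-temporal-logs"),
        "pid_dir": env.get("CLAUDE_TEMPORAL_PID_DIR", "/tmp/claude-temporal-pids"),
        "project_dir": env.get("CLAUDE_TEMPORAL_PROJECT_DIR", str(cwd)),
        "project_name": env.get("CLAUDE_TEMPORAL_PROJECT_NAME", cwd.name),
        "namespace": env.get("CLAUDE_TEMPORAL_NAMESPACE", "default"),
        "address": env.get("TEMPORAL_ADDRESS", "localhost:7233"),
        "cli": env.get("TEMPORAL_CLI", "temporal"),
        "worker_cmd": env.get("TEMPORAL_WORKER_CMD", "uv run worker"),
    }
    # Worker log path follows the worker-{project}.log pattern from "Log Files".
    log_name = f"worker-{settings['project_name']}.log"
    settings["worker_log"] = str(Path(settings["log_dir"]) / log_name)
    return settings
```

Passing `env` and `cwd` explicitly keeps the helper easy to check without touching the real environment.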

Quick Start

# 1. Start server
./scripts/ensure-server.sh

# 2. Start worker (ensures no old workers, starts fresh one)
./scripts/ensure-worker.sh

# 3. Execute workflow
uv run starter  # Capture workflow_id from output

# 4. Wait for completion
./scripts/wait-for-workflow-status.sh --workflow-id <id> --status COMPLETED

# 5. Get result (IMPORTANT: verify result is correct, not an error message)
./scripts/get-workflow-result.sh --workflow-id <id>

# 6. CLEANUP: Kill workers when done
./scripts/kill-worker.sh

Worker Management

The Golden Rule

Ensure no old workers are running. Stale workers with outdated code cause:

  • Non-determinism errors (history mismatch)
  • Executing old buggy code
  • Confusing behavior

Best practice: Run only ONE worker instance with the latest code.

Starting Workers

# PREFERRED: Smart restart (kills old, starts fresh)
./scripts/ensure-worker.sh

This command:

  1. Finds ALL existing workers for the project
  2. Kills them
  3. Starts a new worker with fresh code
  4. Waits for worker to be ready
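
Steps 1 and 2 depend on knowing which recorded workers are still alive. The sketch below shows one common way to check that from PID files. It assumes each file in `CLAUDE_TEMPORAL_PID_DIR` ends in `.pid` and holds a single process ID, which this document does not specify, so treat the layout as hypothetical. `find_live_workers` is an illustrative name, and the liveness check is POSIX-only.

```python
import os
from pathlib import Path


def pid_is_alive(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without sending a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def find_live_workers(pid_dir):
    """Map each *.pid file name to its PID, keeping only running processes."""
    live = {}
    for pid_file in sorted(Path(pid_dir).glob("*.pid")):
        try:
            pid = int(pid_file.read_text().strip())
        except ValueError:
            continue  # skip files that do not contain a number
        if pid_is_alive(pid):
            live[pid_file.name] = pid
    return live
```

Any entry this returns before a restart is a candidate stale worker.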

Verifying Workers

# List all running workers
./scripts/list-workers.sh

# Check specific worker health
./scripts/monitor-worker-health.sh

# View worker logs
tail -f "${CLAUDE_TEMPORAL_LOG_DIR:-/tmp/claude-temporal-logs}/worker-$(basename "$PWD").log"

What to look for in logs:

  • Worker started, listening on task queue: ... → Worker is ready
  • Worker process died during startup → Startup failure, check logs for error
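
The two markers above can be checked programmatically. The sketch below assumes the marker strings appear verbatim in the worker log, as listed; `worker_state_from_log` is a hypothetical helper name.

```python
READY_MARKER = "Worker started, listening on task queue"
DIED_MARKER = "Worker process died during startup"


def worker_state_from_log(lines):
    """Return 'ready', 'died', or 'unknown' based on the most recent marker line."""
    state = "unknown"
    for line in lines:
        if READY_MARKER in line:
            state = "ready"
        elif DIED_MARKER in line:
            state = "died"
    return state
```

The function accepts any iterable of lines, so an open log file can be passed directly.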

Cleanup (REQUIRED)

Always kill workers when done. Don't leave workers running.

# Kill current project's worker
./scripts/kill-worker.sh

# Kill ALL workers (full cleanup)
./scripts/kill-all-workers.sh

# Kill all workers AND server
./scripts/kill-all-workers.sh --include-server

Workflow Execution

Starting Workflows

# Execute workflow via starter script
uv run starter

CRITICAL: Capture the Workflow ID from output. You need it for all monitoring/troubleshooting.

Checking Status

# Get workflow status
temporal workflow describe --workflow-id <id>

# Wait for specific status
./scripts/wait-for-workflow-status.sh \
  --workflow-id <id> \
  --status COMPLETED \
  --timeout 60

Workflow Status Reference

| Status | Meaning | Action |
| --- | --- | --- |
| RUNNING | Workflow in progress | Wait, or check if stalled |
| COMPLETED | Successfully finished | Get result, verify correctness |
| FAILED | Error during execution | Analyze error |
| CANCELED | Explicitly canceled | Review reason |
| TERMINATED | Force-stopped | Review reason |
| TIMED_OUT | Exceeded timeout | Increase timeout |
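
Waiting for a status amounts to polling with a deadline. The sketch below illustrates that idea under stated assumptions and is not the implementation of wait-for-workflow-status.sh: `get_status` is a hypothetical callable that returns the current status string (for example, a wrapper around `temporal workflow describe`).

```python
import time

# Statuses from the table above after which a workflow no longer changes state.
CLOSED_STATUSES = {"COMPLETED", "FAILED", "CANCELED", "TERMINATED", "TIMED_OUT"}


def wait_for_status(get_status, target, timeout=60.0, interval=1.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns target; return True on success.

    Returns False if the workflow closes in a different status or the
    timeout expires first.
    """
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status == target:
            return True
        if status in CLOSED_STATUSES:
            return False  # closed in another state; target is unreachable
        if clock() >= deadline:
            return False
        sleep(interval)
```

Returning early on a closed status avoids waiting out the full timeout for a workflow that has already failed.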

Getting Results

./scripts/get-workflow-result.sh --workflow-id <id>

IMPORTANT - False Positive Detection:

Workflows may COMPLETE but return undesired results (e.g., error messages in the result payload).

// This workflow COMPLETED but the result is an ERROR!
{"status": "error", "message": "Failed to process request"}

Always verify the result content is correct, not just that the status is COMPLETED.
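
A simple heuristic can flag the kind of false positive shown above. This is a sketch only: the `status` and `error` field names are assumptions based on the example payload, and real workflows define their own result shape. `result_looks_like_error` is a hypothetical helper.

```python
import json


def result_looks_like_error(raw_result):
    """Heuristically flag a result that reports failure despite COMPLETED status."""
    try:
        result = json.loads(raw_result)
    except (TypeError, ValueError):
        return False  # not JSON; inspect the result manually
    if not isinstance(result, dict):
        return False
    status = str(result.get("status", "")).lower()
    # Assumed conventions: an error-like status value or an "error" key.
    return status in {"error", "failed", "failure"} or "error" in result
```

A `False` return only means the heuristic found nothing suspicious; it does not prove the result is correct.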


Troubleshooting

Step 1: Identify the Problem

# Check workflow status
temporal workflow describe --workflow-id <id>

# Check for stalled workflows (workflows stuck in RUNNING)
./scripts/find-stalled-workflows.sh

# Analyze specific workflow errors
./scripts/analyze-workflow-error.sh --workflow-id <id>

Step 2: Diagnose Using This Decision Tree

Workflow not behaving as expected?
│
├── Status: RUNNING but no progress (STALLED)
│   │
│   ├── Is it an interactive workflow waiting for signal/update?
│   │   └── YES → Send the required interaction
│   │
│   └── NO → Run: ./scripts/find-stalled-workflows.sh
│       │
│       ├── WorkflowTaskFailed detected
│       │   │
│       │   ├── Non-determinism error (history mismatch)?
│       │   │   └── See: "Fixing Non-Determinism Errors" below
│       │   │
│       │   └── Other workflow task error (code bug, missing registration)?
│       │       └── See: "Fixing Other Workflow Task Errors" below
│       │
│       └── ActivityTaskFailed (excessive retries)
│           └── Activity is retrying. Fix activity code, restart worker.
│               Workflow will auto-retry with new code.
│
├── Status: COMPLETED but wrong result
│   └── Check result: ./scripts/get-workflow-result.sh --workflow-id <id>
│       Is result an error message? → Fix workflow/activity logic
│
├── Status: FAILED
│   └── Run: ./scripts/analyze-workflow-error.sh --workflow-id <id>
│       Fix code → ./scripts/ensure-worker.sh → Start NEW workflow
│
├── Status: TIMED_OUT
│   └── Increase timeouts → ./scripts/ensure-worker.sh → Start NEW workflow
│
└── Workflow never starts
    └── Check: Worker running? Task queue matches? Workflow registered?
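
The decision tree can also be read as a single function from observed state to next step. The sketch below mirrors the branches above. `recommend_action` and its parameter names are hypothetical, and the caller is assumed to have gathered the inputs with the scripts already mentioned.

```python
def recommend_action(status, *, waiting_for_input=False, failed_event=None,
                     non_deterministic=False, result_is_error=False):
    """Return the next step suggested by the decision tree.

    status is a workflow status string, or None if the workflow never started.
    failed_event is "WorkflowTaskFailed", "ActivityTaskFailed", or None.
    """
    if status is None:
        return "check worker, task queue name, and workflow registration"
    if status == "RUNNING":
        if waiting_for_input:
            return "send the required signal or update"
        if failed_event == "WorkflowTaskFailed":
            if non_deterministic:
                return "terminate, fix code, restart worker, start a new workflow"
            return "fix code and restart worker; the workflow resumes"
        if failed_event == "ActivityTaskFailed":
            return "fix activity code and restart worker; the activity retries"
        return "run find-stalled-workflows.sh"
    if status == "COMPLETED":
        return "fix workflow/activity logic" if result_is_error else "no action"
    if status == "FAILED":
        return "analyze the error, fix code, restart worker, start a new workflow"
    if status == "TIMED_OUT":
        return "increase timeouts, restart worker, start a new workflow"
    if status in ("CANCELED", "TERMINATED"):
        return "review why the workflow was stopped"
    raise ValueError(f"unknown status: {status!r}")
```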

Fixing Workflow Task Errors

Workflow task errors STALL the workflow - it stops making progress entirely until the issue is fixed.

Fixing Non-Determinism Errors

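Non-determinism occurs when workflow code changes while a workflow is running. When the worker replays the recorded history, the new code produces different decisions than the ones already recorded, causing a history mismatch.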

Symptoms:

  • WorkflowTaskFailed events in history
  • "Non-deterministic error" or "history mismatch" in logs

Fix procedure:

# 1. TERMINATE affected workflows (they cannot recover)
temporal workflow terminate --workflow-id <id>

# 2. Kill existing workers
./scripts/kill-worker.sh

# 3. Fix the workflow code if needed

# 4. Restart worker with corrected code
./scripts/ensure-worker.sh

# 5. Verify workflow logic is correct

# 6. Start NEW workflow execution
uv run starter

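Key point: A workflow that hit a non-determinism error cannot make progress with the changed code, because that code no longer matches its recorded history. In this local development flow, you MUST terminate it and start fresh.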

Fixing Other Workflow Task Errors

For workflow task errors that are NOT non-determinism (code bugs, missing registration, etc.):

Symptoms:

  • WorkflowTaskFailed events
  • Error is NOT "history mismatch" or "non-deterministic"

Fix procedure:

# 1. Identify the error
./scripts/analyze-workflow-error.sh --workflow-id <id>

# 2. Fix the root cause (code bug, worker config, etc.)

# 3. Kill and restart worker with fixed code
./scripts/ensure-worker.sh

# 4. NO NEED TO TERMINATE - the workflow will automatically resume
#    The new worker picks up where it left off and continues execution

Key point: Unlike non-determinism, the workflow can recover once you fix the code.
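
The choice between the two procedures comes down to the error text. The sketch below classifies a WorkflowTaskFailed message using the phrases listed under Symptoms. The extra spellings ("nondeterminism", "non-determinism") are assumptions about how the message may be worded, and `must_terminate` is a hypothetical helper.

```python
# Phrases from the Symptoms lists, plus assumed alternative spellings.
NON_DETERMINISM_HINTS = (
    "non-deterministic",
    "non-determinism",
    "nondeterminism",
    "history mismatch",
)


def must_terminate(error_message):
    """Return True when a workflow task failure message points to non-determinism."""
    text = error_message.lower()
    return any(hint in text for hint in NON_DETERMINISM_HINTS)
```

A `True` result selects the terminate-and-restart procedure; `False` selects the fix-and-resume procedure.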


Fixing Activity Task Errors

Activity task errors cause retries, not immediate workflow failure.

Workflow Stalling Due to Retries

Workflows can appear stalled because an activity keeps failing and retrying.
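
The apparent stall comes from growing gaps between attempts. The sketch below computes those gaps using Temporal's default retry policy values (initial interval 1 second, backoff coefficient 2.0, maximum interval 100 times the initial interval). An activity with a custom retry policy behaves differently, and `retry_delays` is a hypothetical helper.

```python
def retry_delays(retries, initial=1.0, coefficient=2.0, maximum=None):
    """Return the wait in seconds before each of the first `retries` retries."""
    if maximum is None:
        maximum = 100 * initial  # default cap: 100x the initial interval
    return [min(initial * coefficient ** n, maximum) for n in range(retries)]


# With the defaults, the eighth retry already waits the capped 100 seconds.
print(retry_delays(8))
```

After a handful of failures, each further attempt is more than a minute apart, which is why a retrying activity can look like a workflow that has stopped.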

Diagnosis:

# Check for excessive activity retries
./scripts/find-stalled-workflows.sh

# Look for ActivityTaskFailed count
# Check worker logs for retry messages
tail -100 "${CLAUDE_TEMPORAL_LOG_DIR:-/tmp/claude-temporal-logs}/worker-$(basename "$PWD").log"

Fix procedure:

# 1. Fix the activity code

# 2. Restart worker with fixed code
./scripts/ensure-worker.sh

# 3. Worker auto-retries with new code
#    No need to terminate or restart workflow

Activity Failure (Retries Exhausted)

When all retries are exhausted, the activity fails permanently.

Fix procedure:

# 1. Analyze the error
./scripts/analyze-workflow-error.sh --workflow-id <id>

# 2. Fix activity code

# 3. Restart worker
./scripts/ensure-worker.sh

# 4. Start NEW workflow (old one has failed)
uv run starter

Common Error Types Reference

| Error Type | Where to Find | What Happened | Recovery |
| --- | --- | --- | --- |
| Non-determinism | WorkflowTaskFailed in history | Code changed during execution | Terminate workflow → Fix → Restart worker → NEW workflow |
| Workflow code bug | WorkflowTaskFailed in history | Bug in workflow logic | Fix code → Restart worker → Workflow auto-resumes |
| Missing workflow | Worker logs | Workflow not registered | Add to worker.py → Restart worker |
| Missing activity | Worker logs | Activity not registered | Add to worker.py → Restart worker |
| Activity bug | ActivityTaskFailed in history | Bug in activity code | Fix code → Restart worker → Auto-retries |
| Activity retries | ActivityTaskFailed (count >2) | Repeated failures | Fix code → Restart worker → Auto-retries |
| Sandbox violation | Worker logs | Bad imports in workflow | Fix workflow.py imports → Restart worker |
| Task queue mismatch | Workflow never starts | Different queues in starter/worker | Align task queue names |
| Timeout | Status = TIMED_OUT | Operation too slow | Increase timeout config |

Interactive Workflows

Interactive workflows pause and wait for external input (signals or updates).

Signals

# Send signal to workflow
temporal workflow signal \
  --workflow-id <id> \
  --name "signal_name" \
  --input '{"key": "value"}'

# Or via interact script (if available)
uv run interact --workflow-id <id> --signal-name "signal_name" --data '{"key": "value"}'
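
When a signal is sent from a Python script, building the command as an argument list avoids shell-quoting problems with the JSON payload. The sketch below uses only the flags shown above. `build_signal_command` is a hypothetical helper, and the `cli` default mirrors the `TEMPORAL_CLI` variable.

```python
import json


def build_signal_command(workflow_id, signal_name, payload, cli="temporal"):
    """Build the argv list for `temporal workflow signal` with a JSON payload."""
    return [
        cli, "workflow", "signal",
        "--workflow-id", workflow_id,
        "--name", signal_name,
        "--input", json.dumps(payload),  # serialize once, no manual quoting
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` without `shell=True`.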

Updates

# Send update to workflow
temporal workflow update \
  --workflow-id <id> \
  --name "update_name" \
  --input '{"approved": true}'

Queries

# Query workflow state (read-only)
temporal workflow query \
  --workflow-id <id> \
  --name "get_status"

Common Recipes

Recipe 1: Clean Start (Fresh Environment)

./scripts/kill-all-workers.sh
./scripts/ensure-server.sh
./scripts/ensure-worker.sh
uv run starter

Recipe 2: Debug Stalled Workflow

# 1. Find what's wrong
./scripts/find-stalled-workflows.sh
./scripts/analyze-workflow-error.sh --workflow-id <id>

# 2. Check worker logs
tail -100 "${CLAUDE_TEMPORAL_LOG_DIR:-/tmp/claude-temporal-logs}/worker-$(basename "$PWD").log"

# 3. Fix based on error type (see decision tree above)

Recipe 3: Clear Stalled Environment

./scripts/find-stalled-workflows.sh
./scripts/bulk-cancel-workflows.sh
./scripts/kill-worker.sh
./scripts/ensure-worker.sh

Recipe 4: Test Interactive Workflow

./scripts/ensure-worker.sh
uv run starter  # Note the workflow ID printed in the output
workflow_id="<id>"  # Set this to the workflow ID from the starter output
./scripts/wait-for-workflow-status.sh --workflow-id "$workflow_id" --status RUNNING
uv run interact --workflow-id "$workflow_id" --signal-name "approval" --data '{"approved": true}'
./scripts/wait-for-workflow-status.sh --workflow-id "$workflow_id" --status COMPLETED
./scripts/get-workflow-result.sh --workflow-id "$workflow_id"
./scripts/kill-worker.sh  # CLEANUP

Recipe 5: Check Recent Workflow Results

# List recent workflows
./scripts/list-recent-workflows.sh --minutes 30

# Check results (verify they're correct, not error messages!)
./scripts/get-workflow-result.sh --workflow-id <id1>
./scripts/get-workflow-result.sh --workflow-id <id2>

Tool Reference

Lifecycle Scripts

| Tool | Description | Key Options |
| --- | --- | --- |
| ensure-server.sh | Start dev server if not running | - |
| ensure-worker.sh | Kill old workers, start fresh one | Uses $TEMPORAL_WORKER_CMD |
| kill-worker.sh | Kill current project's worker | - |
| kill-all-workers.sh | Kill all workers | --include-server |
| list-workers.sh | List running workers | - |

Monitoring Scripts

| Tool | Description | Key Options |
| --- | --- | --- |
| list-recent-workflows.sh | Show recent executions | --minutes N (default: 5) |
| find-stalled-workflows.sh | Detect stalled workflows | --query "..." |
| monitor-worker-health.sh | Check worker status | - |
| wait-for-workflow-status.sh | Block until status | --workflow-id, --status, --timeout |

Debugging Scripts

| Tool | Description | Key Options |
| --- | --- | --- |
| analyze-workflow-error.sh | Extract errors from history | --workflow-id, --run-id |
| get-workflow-result.sh | Get workflow output | --workflow-id, --raw |
| bulk-cancel-workflows.sh | Mass cancellation | --pattern "..." |

Log Files

| Log | Location | Content |
| --- | --- | --- |
| Worker logs | $CLAUDE_TEMPORAL_LOG_DIR/worker-{project}.log | Worker output, activity logs, errors |

Useful searches:

# Find errors
grep -i "error" "${CLAUDE_TEMPORAL_LOG_DIR:-/tmp/claude-temporal-logs}"/worker-*.log

# Check worker startup
grep -i "started" "${CLAUDE_TEMPORAL_LOG_DIR:-/tmp/claude-temporal-logs}"/worker-*.log

# Find activity issues
grep -i "activity" "${CLAUDE_TEMPORAL_LOG_DIR:-/tmp/claude-temporal-logs}"/worker-*.log