Nomad Job Management Guide

This guide walks through creating, deploying, monitoring, troubleshooting, updating, and stopping Nomad jobs with the Nomad MCP service. It is written for both humans and AI assistants who manage containerized applications in a Nomad cluster.

Prerequisites

  • Access to a Nomad cluster
  • Nomad MCP service installed and running
  • Proper environment configuration (NOMAD_ADDR, NOMAD_NAMESPACE, etc.)
  • Python with required packages installed

1. Creating a Nomad Job Specification

A Nomad job specification defines how your application should run. You can write it in either of two formats: an HCL file, or a JSON-style dictionary submitted from Python.

Option A: Using a .nomad HCL File

job "your-job-name" {
  datacenters = ["dc1"]
  type        = "service"
  namespace   = "development"

  group "app" {
    count = 1

    network {
      port "http" {
        to = 8000
      }
    }

    task "app-task" {
      driver = "docker"

      config {
        image = "your-registry/your-image:tag"
        ports = ["http"]
        command = "python"
        args = ["-m", "uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
        
        # Mount volumes if needed
        mount {
          type = "bind"
          source = "local/app-code"
          target = "/app"
          readonly = false
        }
      }

      # Pull code from Git repository if needed
      artifact {
        source      = "git::ssh://git@your-git-server:port/org/repo.git"
        destination = "local/app-code"
        options {
          ref    = "main"
          sshkey = "your-base64-encoded-ssh-key"
        }
      }

      env {
        # Environment variables
        PORT = "8000"
        HOST = "0.0.0.0"
        LOG_LEVEL = "INFO"
        PYTHONPATH = "/app"
        
        # Add any application-specific environment variables
        STATIC_DIR = "/local/app-code/static"
      }

      resources {
        cpu    = 200
        memory = 256
      }

      service {
        name = "your-service-name"
        port = "http"
        tags = [
          "traefik.enable=true",
          "traefik.http.routers.your-service.entryPoints=https",
          "traefik.http.routers.your-service.rule=Host(`your-service.domain.com`)"
        ]

        check {
          type     = "http"
          path     = "/api/health"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}

Option B: Using a Python Deployment Script

#!/usr/bin/env python
from app.services.nomad_client import NomadService

def main():
    # Initialize the Nomad service
    nomad_service = NomadService()
    
    # Create job specification
    job_spec = {
        "Job": {
            "ID": "your-job-name",
            "Name": "your-job-name",
            "Type": "service",
            "Datacenters": ["dc1"],
            "Namespace": "development",
            "TaskGroups": [
                {
                    "Name": "app",
                    "Count": 1,
                    "Networks": [
                        {
                            "DynamicPorts": [
                                {
                                    "Label": "http",
                                    "To": 8000
                                }
                            ]
                        }
                    ],
                    "Tasks": [
                        {
                            "Name": "app-task",
                            "Driver": "docker",
                            "Config": {
                                "image": "your-registry/your-image:tag",
                                "ports": ["http"],
                                "command": "python",
                                "args": ["-m", "uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"],
                                "mount": [
                                    {
                                        "type": "bind",
                                        "source": "local/app-code",
                                        "target": "/app",
                                        "readonly": False
                                    }
                                ]
                            },
                            "Artifacts": [
                                {
                                    "GetterSource": "git::ssh://git@your-git-server:port/org/repo.git",
                                    "RelativeDest": "local/app-code",
                                    "GetterOptions": {
                                        "ref": "main",
                                        "sshkey": "your-base64-encoded-ssh-key"
                                    }
                                }
                            ],
                            "Env": {
                                "PORT": "8000",
                                "HOST": "0.0.0.0",
                                "LOG_LEVEL": "INFO",
                                "PYTHONPATH": "/app",
                                "STATIC_DIR": "/local/app-code/static"
                            },
                            "Resources": {
                                "CPU": 200,
                                "MemoryMB": 256
                            },
                            "Services": [
                                {
                                    "Name": "your-service-name",
                                    "PortLabel": "http",
                                    "Tags": [
                                        "traefik.enable=true",
                                        "traefik.http.routers.your-service.entryPoints=https",
                                        "traefik.http.routers.your-service.rule=Host(`your-service.domain.com`)"
                                    ],
                                    "Checks": [
                                        {
                                            "Type": "http",
                                            "Path": "/api/health",
                                            "Interval": 10000000000,  # 10 seconds in nanoseconds
                                            "Timeout": 2000000000     # 2 seconds in nanoseconds
                                        }
                                    ]
                                }
                            ]
                        }
                    ]
                }
            ]
        }
    }
    
    # Start the job
    response = nomad_service.start_job(job_spec)
    
    print(f"Job deployment response: {response}")
    
    if response.get("status") == "started":
        print("✅ Job deployed successfully!")
        print(f"Job ID: {response.get('job_id')}")
        print(f"Evaluation ID: {response.get('eval_id')}")
    else:
        print("❌ Failed to deploy job.")
        print(f"Status: {response.get('status')}")
        print(f"Message: {response.get('message', 'Unknown error')}")

if __name__ == "__main__":
    main()
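
The HCL file writes check durations as strings such as "10s", while the dictionary above needs the same values in nanoseconds. The helper below is a minimal sketch of that conversion. The function name to_nanoseconds is hypothetical, not part of the Nomad MCP service, and it only handles a single integer followed by one unit suffix.

```python
import re

# Nanoseconds per unit for the suffixes handled by this sketch.
_UNIT_NS = {
    "ns": 1,
    "us": 1_000,
    "ms": 1_000_000,
    "s": 1_000_000_000,
    "m": 60 * 1_000_000_000,
    "h": 3600 * 1_000_000_000,
}

_DURATION_RE = re.compile(r"^(\d+)(ns|us|ms|s|m|h)$")


def to_nanoseconds(duration: str) -> int:
    """Convert a simple duration such as "10s" or "500ms" to nanoseconds."""
    match = _DURATION_RE.match(duration.strip())
    if match is None:
        raise ValueError(f"Unsupported duration: {duration!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_NS[unit]


print(to_nanoseconds("10s"))  # 10000000000
print(to_nanoseconds("2s"))   # 2000000000
```

With this in place, the check could be written as "Interval": to_nanoseconds("10s"), which keeps the Python spec visually close to the HCL one.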

2. Deploying the Nomad Job

Option A: Using the Nomad CLI

# Deploy using a .nomad file
nomad job run your-job-file.nomad

# Verify the job was submitted
nomad job status your-job-name

Option B: Using the Python Deployment Script

# Run the deployment script
python deploy_your_job.py

Option C: Using the Nomad MCP API

# Using curl
curl -X POST http://localhost:8000/api/claude/create-job \
  -H "Content-Type: application/json" \
  -d '{
    "job_id": "your-job-name",
    "name": "Your Job Name",
    "type": "service",
    "datacenters": ["dc1"],
    "namespace": "development",
    "docker_image": "your-registry/your-image:tag",
    "count": 1,
    "cpu": 200,
    "memory": 256,
    "ports": [
      {
        "Label": "http",
        "Value": 0,
        "To": 8000
      }
    ],
    "env_vars": {
      "PORT": "8000",
      "HOST": "0.0.0.0",
      "LOG_LEVEL": "INFO",
      "PYTHONPATH": "/app",
      "STATIC_DIR": "/local/app-code/static"
    }
  }'

# Using PowerShell
Invoke-RestMethod -Uri "http://localhost:8000/api/claude/create-job" -Method POST -Headers @{"Content-Type"="application/json"} -Body '{
  "job_id": "your-job-name",
  "name": "Your Job Name",
  "type": "service",
  "datacenters": ["dc1"],
  "namespace": "development",
  "docker_image": "your-registry/your-image:tag",
  "count": 1,
  "cpu": 200,
  "memory": 256,
  "ports": [
    {
      "Label": "http",
      "Value": 0,
      "To": 8000
    }
  ],
  "env_vars": {
    "PORT": "8000",
    "HOST": "0.0.0.0",
    "LOG_LEVEL": "INFO",
    "PYTHONPATH": "/app",
    "STATIC_DIR": "/local/app-code/static"
  }
}'
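
The same request can also be built from Python using only the standard library. This is a sketch rather than an official client: the URL and payload fields are copied from the curl example above, and build_create_job_request is a hypothetical helper name. The request is built but not sent, so the snippet runs without a live service.

```python
import json
import urllib.request

# Endpoint taken from the curl example above; adjust host and port as needed.
MCP_CREATE_JOB_URL = "http://localhost:8000/api/claude/create-job"


def build_create_job_request(payload: dict, url: str = MCP_CREATE_JOB_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the job payload as JSON."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = {
    "job_id": "your-job-name",
    "name": "Your Job Name",
    "type": "service",
    "datacenters": ["dc1"],
    "namespace": "development",
    "docker_image": "your-registry/your-image:tag",
    "count": 1,
    "cpu": 200,
    "memory": 256,
    "ports": [{"Label": "http", "Value": 0, "To": 8000}],
    "env_vars": {"PORT": "8000", "HOST": "0.0.0.0", "LOG_LEVEL": "INFO"},
}

request = build_create_job_request(payload)
print(request.get_method(), request.full_url)

# To actually submit the job, uncomment the lines below:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(json.load(response))
```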

3. Checking Job Status

After deploying a job, you should check its status to ensure it's running correctly.

Option A: Using the Nomad CLI

# Check job status
nomad job status your-job-name

# Check allocations for the job
nomad job allocs your-job-name

# Inspect a specific allocation (use an ID from the previous command)
nomad alloc status <allocation-id>

Option B: Using the Nomad MCP API

# Using curl
curl -X POST http://localhost:8000/api/claude/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "job_id": "your-job-name",
    "action": "status",
    "namespace": "development"
  }'

# Using PowerShell
Invoke-RestMethod -Uri "http://localhost:8000/api/claude/jobs" -Method POST -Headers @{"Content-Type"="application/json"} -Body '{
  "job_id": "your-job-name",
  "action": "status",
  "namespace": "development"
}'

Option C: Using a Python Script

#!/usr/bin/env python
from app.services.nomad_client import NomadService

def main():
    # Initialize the Nomad service
    service = NomadService()
    
    # Get job information
    job = service.get_job('your-job-name')
    print(f"Job Status: {job.get('Status', 'Unknown')}")
    print(f"Job Type: {job.get('Type', 'Unknown')}")
    print(f"Job Datacenters: {job.get('Datacenters', [])}")
    
    # Get allocations
    allocations = service.get_allocations('your-job-name')
    print(f"\nFound {len(allocations)} allocations")
    
    if allocations:
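        # Pick the newest allocation explicitly instead of relying on list order
        # (falls back to the first entry if CreateTime is absent)
        latest_alloc = max(allocations, key=lambda a: a.get("CreateTime", 0))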
        print(f"Latest allocation ID: {latest_alloc.get('ID', 'Unknown')}")
        print(f"Allocation Status: {latest_alloc.get('ClientStatus', 'Unknown')}")

if __name__ == "__main__":
    main()
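
When a job has several allocations, a per-status count is often easier to read than a single allocation's status. The sketch below assumes each allocation is a dictionary with a ClientStatus key, as in the script above. The function summarize_allocations and the sample data are hypothetical.

```python
from collections import Counter


def summarize_allocations(allocations: list) -> dict:
    """Count allocations per ClientStatus value."""
    return dict(Counter(a.get("ClientStatus", "unknown") for a in allocations))


# Hypothetical sample data shaped like the allocation dicts used above.
sample = [
    {"ID": "a1", "ClientStatus": "running"},
    {"ID": "a2", "ClientStatus": "failed"},
    {"ID": "a3", "ClientStatus": "running"},
]
print(summarize_allocations(sample))
```

In a real script the list would come from service.get_allocations('your-job-name') instead of the sample.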

4. Checking Job Logs

Logs are crucial for diagnosing issues with your job. Here's how to access them:

Option A: Using the Nomad CLI

# First, get the allocation ID
nomad job allocs your-job-name

# Then view the logs for a specific allocation
nomad alloc logs <allocation-id>

# View stderr logs
nomad alloc logs -stderr <allocation-id>

# Follow logs in real-time
nomad alloc logs -f <allocation-id>

Option B: Using the Nomad MCP API

# Using curl
curl -X GET http://localhost:8000/api/claude/job-logs/your-job-name

# Using PowerShell
Invoke-RestMethod -Uri "http://localhost:8000/api/claude/job-logs/your-job-name" -Method GET

Option C: Using a Python Script

#!/usr/bin/env python
from app.services.nomad_client import NomadService

def main():
    # Initialize the Nomad service
    service = NomadService()
    
    # Get allocations for the job
    allocations = service.get_allocations('your-job-name')
    
    if allocations:
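        # Pick the newest allocation explicitly instead of relying on list order
        # (falls back to the first entry if CreateTime is absent)
        latest_alloc = max(allocations, key=lambda a: a.get("CreateTime", 0))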
        alloc_id = latest_alloc["ID"]
        print(f"Latest allocation ID: {alloc_id}")
        
        # Get logs for the allocation
        try:
            # Get stdout logs
            stdout_logs = service.get_allocation_logs(alloc_id, task="your-task-name", log_type="stdout")
            print("\nStandard Output Logs:")
            print(stdout_logs)
            
            # Get stderr logs
            stderr_logs = service.get_allocation_logs(alloc_id, task="your-task-name", log_type="stderr")
            print("\nStandard Error Logs:")
            print(stderr_logs)
        except Exception as e:
            print(f"Error getting logs: {str(e)}")
    else:
        print("No allocations found for your-job-name job")

if __name__ == "__main__":
    main()

5. Troubleshooting Common Issues

Issue: Job Fails to Start

  1. Check the job status:

    nomad job status your-job-name
    
  2. Examine the allocation status, using an allocation ID from nomad job allocs your-job-name:

    nomad alloc status <allocation-id>
    
  3. Check the logs for errors:

    # Get the allocation ID first
    nomad job allocs your-job-name
    # Then check the logs
    nomad alloc logs -stderr <allocation-id>
    
  4. Common errors and solutions:

    a. Missing static directory:

    RuntimeError: Directory 'static' does not exist
    

    Solution: Add an environment variable to specify the static directory path:

    env {
      STATIC_DIR = "/local/app-code/static"
    }
    

    b. Invalid mount configuration:

    invalid mount config for type 'bind': bind source path does not exist
    

    Solution: Ensure the source path exists or is created by an artifact:

    artifact {
      source = "git::ssh://git@your-git-server:port/org/repo.git"
      destination = "local/app-code"
    }
    

    c. Port already allocated:

    Allocation failed: Failed to place allocation: failed to place alloc: port is already allocated
    

    Solution: Let Nomad assign a dynamic host port by omitting static from the port block, or choose a different static port:

    network {
      port "http" {
        to = 8000
      }
    }
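
The three errors above can also be checked for automatically once the logs are available as text. The following is a minimal sketch that matches the error strings quoted in this section against a log and returns the corresponding remedies. The names KNOWN_ISSUES and diagnose are hypothetical, and real log output may word these errors differently.

```python
# Error signatures and remedies copied from the list above.
KNOWN_ISSUES = [
    ("Directory 'static' does not exist",
     "Set STATIC_DIR in the task's env block."),
    ("bind source path does not exist",
     "Make sure the mount source exists or is created by an artifact."),
    ("port is already allocated",
     "Use a dynamic port or choose a different port."),
]


def diagnose(log_text: str) -> list:
    """Return the remedies whose error signature appears in the log text."""
    return [hint for signature, hint in KNOWN_ISSUES if signature in log_text]


sample_log = "RuntimeError: Directory 'static' does not exist"
for hint in diagnose(sample_log):
    print(hint)
```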
    

Issue: Application Errors After Deployment

  1. Check application logs:

    nomad alloc logs <allocation-id>
    
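  2. Verify environment variables:

    nomad alloc exec -task app-task <allocation-id> env
    

    This prints the environment as seen inside the task, so the task must be running and its image must provide the env command. nomad alloc status does not list environment variables; use it for task events and resource usage instead.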

  3. Check resource constraints: Ensure the job has enough CPU and memory allocated:

    resources {
      cpu    = 200  # Increase if needed
      memory = 256  # Increase if needed
    }
    

6. Updating a Job

After fixing issues, you'll need to update the job:

Option A: Using the Nomad CLI

# Update the job with the modified specification
nomad job run your-updated-job-file.nomad

Option B: Using the Nomad MCP API

# Using PowerShell to restart a job
Invoke-RestMethod -Uri "http://localhost:8000/api/claude/jobs" -Method POST -Headers @{"Content-Type"="application/json"} -Body '{
  "job_id": "your-job-name",
  "action": "restart",
  "namespace": "development"
}'

Option C: Using a Python Script

#!/usr/bin/env python
from app.services.nomad_client import NomadService

def main():
    # Initialize the Nomad service
    service = NomadService()
    
    # Get the current job specification
    job = service.get_job('your-job-name')
    
    # Modify the job specification as needed
    # For example, update environment variables (Env may be missing or None):
    task = job["TaskGroups"][0]["Tasks"][0]
    task["Env"] = {**(task.get("Env") or {}), "STATIC_DIR": "/local/app-code/static"}
    
    # Update the job
    response = service.start_job({"Job": job})
    
    print(f"Job update response: {response}")

if __name__ == "__main__":
    main()
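
Editing the job dictionary in place works for a single task, but a job may contain several groups and tasks, and a task's Env can be missing or null. The sketch below applies one variable to every task and returns a modified copy, leaving the original untouched. The helper with_task_env and the sample job are hypothetical.

```python
import copy


def with_task_env(job: dict, key: str, value: str) -> dict:
    """Return a copy of the job with key=value set in every task's Env."""
    updated = copy.deepcopy(job)
    for group in updated.get("TaskGroups") or []:
        for task in group.get("Tasks") or []:
            env = task.get("Env") or {}  # Env may be missing or null
            env[key] = value
            task["Env"] = env
    return updated


# Hypothetical minimal job dict for illustration only.
job = {
    "ID": "your-job-name",
    "TaskGroups": [{"Tasks": [{"Name": "app-task", "Env": None}]}],
}
patched = with_task_env(job, "STATIC_DIR", "/local/app-code/static")
print(patched["TaskGroups"][0]["Tasks"][0]["Env"])
```

The patched dictionary can then be submitted the same way as in the script above, with service.start_job({"Job": patched}).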

7. Stopping a Job

When you're done with a job, you can stop it:

Option A: Using the Nomad CLI

# Stop a job
nomad job stop your-job-name

# Stop and purge a job
nomad job stop -purge your-job-name

Option B: Using the Nomad MCP API

# Using PowerShell
Invoke-RestMethod -Uri "http://localhost:8000/api/claude/jobs" -Method POST -Headers @{"Content-Type"="application/json"} -Body '{
  "job_id": "your-job-name",
  "action": "stop",
  "namespace": "development",
  "purge": true
}'
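
The status, restart, and stop calls in this guide all post a small JSON body to the same /api/claude/jobs endpoint. The sketch below builds that body in one place. The function build_job_action is hypothetical, and the set of actions is limited to the ones shown in this guide; the service may accept others.

```python
import json

# Actions that appear in this guide's examples; the service may accept others.
GUIDE_ACTIONS = {"status", "restart", "stop"}


def build_job_action(job_id: str, action: str,
                     namespace: str = "development", purge: bool = False) -> str:
    """Return the JSON body for POST /api/claude/jobs as used in this guide."""
    if action not in GUIDE_ACTIONS:
        raise ValueError(f"Action not shown in this guide: {action!r}")
    body = {"job_id": job_id, "action": action, "namespace": namespace}
    if action == "stop" and purge:
        body["purge"] = True  # only meaningful when stopping a job
    return json.dumps(body)


print(build_job_action("your-job-name", "stop", purge=True))
```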

Option C: Using a Python Script

#!/usr/bin/env python
from app.services.nomad_client import NomadService

def main():
    # Initialize the Nomad service
    service = NomadService()
    
    # Stop the job
    response = service.stop_job('your-job-name', purge=True)
    
    print(f"Job stop response: {response}")

if __name__ == "__main__":
    main()

8. Complete Workflow Example

Here's a complete workflow for deploying, monitoring, troubleshooting, and updating a job:

#!/usr/bin/env python
import time
from app.services.nomad_client import NomadService

def main():
    # Initialize the Nomad service
    service = NomadService()
    
    # 1. Create and deploy the job
    job_spec = {
        "Job": {
            "ID": "example-app",
            "Name": "Example Application",
            "Type": "service",
            "Datacenters": ["dc1"],
            "Namespace": "development",
            # ... rest of job specification ...
        }
    }
    
    deploy_response = service.start_job(job_spec)
    print(f"Deployment response: {deploy_response}")
    
    # 2. Wait for the job to be scheduled
    print("Waiting for job to be scheduled...")
    time.sleep(5)
    
    # 3. Check job status
    job = service.get_job('example-app')
    print(f"Job Status: {job.get('Status', 'Unknown')}")
    
    # 4. Get allocations
    allocations = service.get_allocations('example-app')
    
    if allocations:
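        # Pick the newest allocation explicitly instead of relying on list order
        # (falls back to the first entry if CreateTime is absent)
        latest_alloc = max(allocations, key=lambda a: a.get("CreateTime", 0))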
        alloc_id = latest_alloc["ID"]
        print(f"Latest allocation ID: {alloc_id}")
        print(f"Allocation Status: {latest_alloc.get('ClientStatus', 'Unknown')}")
        
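        # 5. Check logs for errors (the task name must match the job specification)
        stderr_logs = service.get_allocation_logs(alloc_id, task="your-task-name", log_type="stderr")
        
        # 6. Look for common errors (treat missing logs as empty)
        if "Directory 'static' does not exist" in (stderr_logs or ""):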
            print("Error detected: Missing static directory")
            
            # 7. Update the job to fix the issue (Env may be missing or None)
            task = job["TaskGroups"][0]["Tasks"][0]
            task["Env"] = {**(task.get("Env") or {}), "STATIC_DIR": "/local/app-code/static"}
            update_response = service.start_job({"Job": job})
            print(f"Job update response: {update_response}")
            
            # 8. Wait for the updated job to be scheduled
            print("Waiting for updated job to be scheduled...")
            time.sleep(5)
            
            # 9. Check the updated job status
            updated_job = service.get_job('example-app')
            print(f"Updated Job Status: {updated_job.get('Status', 'Unknown')}")
    else:
        print("No allocations found for the job")

if __name__ == "__main__":
    main()
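
The workflow above waits a fixed five seconds before checking status, which can be too short on a busy cluster and needlessly long on an idle one. A polling loop with a timeout is one alternative. The sketch below is hypothetical: wait_for_status takes any zero-argument callable that returns a status string, so it can be tried here with a stand-in instead of a live cluster.

```python
import time


def wait_for_status(fetch_status, wanted="running", timeout=60.0,
                    interval=2.0, sleep=time.sleep):
    """Call fetch_status() until it returns `wanted` or the timeout elapses.

    Returns the last status seen, which may differ from `wanted` on timeout.
    """
    deadline = time.monotonic() + timeout
    status = fetch_status()
    while status != wanted and time.monotonic() < deadline:
        sleep(interval)
        status = fetch_status()
    return status


# Stand-in for a real status lookup: reports "pending" twice, then "running".
_fake_statuses = iter(["pending", "pending", "running"])
result = wait_for_status(lambda: next(_fake_statuses), interval=0)
print(result)
```

In the workflow script, the callable could be something like lambda: service.get_job('example-app').get('Status'), assuming get_job returns the job dictionary as it does above.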

9. Best Practices

  1. Always check logs after deployment: Logs are your primary tool for diagnosing issues.

  2. Use environment variables for configuration: This makes your jobs more flexible and easier to update.

  3. Implement health checks: Health checks help Nomad determine if your application is running correctly.

  4. Set appropriate resource limits: Allocate enough CPU and memory for your application to run efficiently.

  5. Use artifacts for code deployment: Pull code from a Git repository to ensure consistency.

  6. Implement proper error handling: Your application should handle errors gracefully and provide meaningful error messages.

  7. Use namespaces: Organize your jobs into namespaces based on environment or team.

  8. Document your job specifications: Include comments in your job files to explain configuration choices.

  9. Implement a CI/CD pipeline: Automate the deployment process to reduce errors and improve efficiency.

  10. Monitor job performance: Use Nomad's monitoring capabilities to track resource usage and performance.
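
Some of these practices can be checked mechanically before a job is submitted. The sketch below inspects a JSON-style job specification, in the shape used in section 1, for missing top-level fields, tasks without explicit resources, and services without health checks. The function check_job_spec is hypothetical, and its rules reflect this guide's recommendations rather than requirements enforced by Nomad.

```python
def check_job_spec(job_spec: dict) -> list:
    """Return a list of human-readable problems found in a JSON job spec."""
    problems = []
    job = job_spec.get("Job") or {}
    for field in ("ID", "Datacenters", "TaskGroups"):
        if not job.get(field):
            problems.append(f"Job is missing {field}")
    for group in job.get("TaskGroups") or []:
        for task in group.get("Tasks") or []:
            name = task.get("Name", "<unnamed>")
            resources = task.get("Resources") or {}
            if not resources.get("CPU") or not resources.get("MemoryMB"):
                problems.append(f"Task {name} has no CPU/MemoryMB set")
            for service in task.get("Services") or []:
                if not service.get("Checks"):
                    service_name = service.get("Name", "<unnamed>")
                    problems.append(f"Service {service_name} has no health check")
    return problems


# Hypothetical spec with resources set but an empty health-check list.
spec = {
    "Job": {
        "ID": "example-app",
        "Datacenters": ["dc1"],
        "TaskGroups": [{
            "Name": "app",
            "Tasks": [{
                "Name": "app-task",
                "Resources": {"CPU": 200, "MemoryMB": 256},
                "Services": [{"Name": "your-service-name", "Checks": []}],
            }],
        }],
    }
}
for problem in check_job_spec(spec):
    print(problem)
```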

10. Conclusion

Managing Nomad jobs effectively requires understanding the job lifecycle, from creation to deployment, monitoring, troubleshooting, and updating. By following this guide, you can create robust deployment processes that ensure your applications run reliably in a Nomad cluster.

Remember that the key to successful job management is thorough testing, careful monitoring, and quick response to issues. With the right tools and processes in place, you can efficiently manage even complex applications in a Nomad environment.