Revise Nomad Job Management Guide with comprehensive workflow and best practices

2025-02-26 17:28:18 +07:00
parent e3e19f5099
commit 1c2166111b
2 changed files with 239 additions and 558 deletions


@@ -1,22 +1,24 @@
# Nomad Job Management Guide # Nomad Job Management Guide
This guide explains the complete process of creating, deploying, monitoring, and troubleshooting Nomad jobs using the Nomad MCP service. It's designed to be used by both humans and AI assistants to effectively manage containerized applications in a Nomad cluster. This guide provides comprehensive instructions for managing Nomad jobs, including creating, deploying, monitoring, and troubleshooting.
## Prerequisites ## Prerequisites
- Access to a Nomad cluster Before you begin, ensure you have:
- Nomad MCP service installed and running
- Proper environment configuration (NOMAD_ADDR, NOMAD_NAMESPACE, etc.)
- Python with required packages installed
## 1. Creating a Nomad Job Specification 1. Access to a Nomad cluster
2. Proper authentication credentials (token if ACLs are enabled)
3. Network connectivity to the Nomad API endpoint (default port 4646)
4. Access to the Gitea repository (if using artifact integration)
A Nomad job specification defines how your application should run. This can be created in two formats: ## Job Specifications
### Option A: Using a .nomad HCL File ### HCL Format (Nomad Job File)
Nomad jobs can be defined in HashiCorp Configuration Language (HCL):
```hcl ```hcl
job "your-job-name" { job "example-job" {
datacenters = ["dc1"] datacenters = ["dc1"]
type = "service" type = "service"
namespace = "development" namespace = "development"
@@ -24,677 +26,356 @@ job "your-job-name" {
group "app" { group "app" {
count = 1 count = 1
network { task "server" {
port "http" {
to = 8000
}
}
task "app-task" {
driver = "docker" driver = "docker"
config { config {
image = "your-registry/your-image:tag" image = "nginx:latest"
ports = ["http"] ports = ["http"]
command = "python"
args = ["-m", "uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
# Mount volumes if needed
mount {
type = "bind"
source = "local/app-code"
target = "/app"
readonly = false
}
} }
# Pull code from Git repository if needed
artifact {
source = "git::ssh://git@your-git-server:port/org/repo.git"
destination = "local/app-code"
options {
ref = "main"
sshkey = "your-base64-encoded-ssh-key"
}
}
env {
# Environment variables
PORT = "8000"
HOST = "0.0.0.0"
LOG_LEVEL = "INFO"
PYTHONPATH = "/app"
# Add any application-specific environment variables
STATIC_DIR = "/local/app-code/static"
}
resources { resources {
cpu = 200 cpu = 100
memory = 256 memory = 128
}
service {
name = "your-service-name"
port = "http"
tags = [
"traefik.enable=true",
"traefik.http.routers.your-service.entryPoints=https",
"traefik.http.routers.your-service.rule=Host(`your-service.domain.com`)"
]
check {
type = "http"
path = "/api/health"
interval = "10s"
timeout = "2s"
}
} }
} }
} }
} }
``` ```
### Option B: Using a Python Deployment Script ### Python Format (Dictionary)
For programmatic job management, use Python dictionaries:
```python
job_spec = {
"Job": {
"ID": "example-job",
"Name": "example-job",
"Type": "service",
"Datacenters": ["dc1"],
"Namespace": "development",
"TaskGroups": [
{
"Name": "app",
"Count": 1,
"Tasks": [
{
"Name": "server",
"Driver": "docker",
"Config": {
"image": "nginx:latest"
},
"Resources": {
"CPU": 100,
"MemoryMB": 128
}
}
]
}
]
}
}
```
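A dictionary in this shape serializes directly to the JSON body that Nomad's job registration endpoint (`POST /v1/jobs`) expects. The following is a minimal sketch of that step. The helper name `to_register_payload` and the `REQUIRED_JOB_KEYS` list are hypothetical: they only cover the keys this guide's examples always set, not Nomad's own validation rules, and they are not part of `NomadService`.

```python
import json

# Keys that every example in this guide sets. This is an illustrative
# assumption, not the list Nomad itself enforces.
REQUIRED_JOB_KEYS = ("ID", "Name", "Type", "Datacenters", "TaskGroups")


def to_register_payload(job_spec: dict) -> str:
    """Serialize a job dictionary to JSON after a basic shape check."""
    job = job_spec.get("Job")
    if not isinstance(job, dict):
        raise ValueError('job_spec must contain a top-level "Job" dictionary')
    missing = [key for key in REQUIRED_JOB_KEYS if key not in job]
    if missing:
        raise ValueError(f"job is missing keys: {', '.join(missing)}")
    return json.dumps(job_spec, indent=2)


example_spec = {
    "Job": {
        "ID": "example-job",
        "Name": "example-job",
        "Type": "service",
        "Datacenters": ["dc1"],
        "Namespace": "development",
        "TaskGroups": [
            {
                "Name": "app",
                "Count": 1,
                "Tasks": [
                    {
                        "Name": "server",
                        "Driver": "docker",
                        "Config": {"image": "nginx:latest"},
                        "Resources": {"CPU": 100, "MemoryMB": 128},
                    }
                ],
            }
        ],
    }
}

payload = to_register_payload(example_spec)
```

Checking the shape before submitting may surface a mistyped key locally instead of as an API error.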
## Deployment Methods
> **CRITICAL: Always commit and push your code changes to Gitea before deploying jobs!**
>
> When using Gitea artifacts in your Nomad jobs, the job will pull code from the repository at deployment time. If you don't commit and push your changes first, the job will use the old version of the code, and your changes won't be reflected in the deployed application.
### Using the Nomad CLI
```bash
# Deploy a job using an HCL file
nomad job run job_spec.nomad
# Stop a job
nomad job stop example-job
# Purge a job (completely remove it)
nomad job stop -purge example-job
```
### Using Python with the Nomad API
```python
from app.services.nomad_client import NomadService
# Initialize the service
nomad_service = NomadService()
# Start a job
response = nomad_service.start_job(job_spec)
print(f"Job started: {response}")
# Stop a job
response = nomad_service.stop_job("example-job", purge=False)
print(f"Job stopped: {response}")
```
### Using a Deployment Script
Create a deployment script that handles the job specification and deployment:
```python ```python
#!/usr/bin/env python #!/usr/bin/env python
import os import os
import json
from app.services.nomad_client import NomadService from app.services.nomad_client import NomadService
def main(): def main():
# Initialize the Nomad service # Initialize the Nomad service
nomad_service = NomadService() nomad_service = NomadService()
# Create job specification # Define job specification
job_spec = { job_spec = {
"Job": { "Job": {
"ID": "your-job-name", "ID": "example-job",
"Name": "your-job-name", # ... job configuration ...
"Type": "service",
"Datacenters": ["dc1"],
"Namespace": "development",
"TaskGroups": [
{
"Name": "app",
"Count": 1,
"Networks": [
{
"DynamicPorts": [
{
"Label": "http",
"To": 8000
}
]
}
],
"Tasks": [
{
"Name": "app-task",
"Driver": "docker",
"Config": {
"image": "your-registry/your-image:tag",
"ports": ["http"],
"command": "python",
"args": ["-m", "uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"],
"mount": [
{
"type": "bind",
"source": "local/app-code",
"target": "/app",
"readonly": False
}
]
},
"Artifacts": [
{
"GetterSource": "git::ssh://git@your-git-server:port/org/repo.git",
"RelativeDest": "local/app-code",
"GetterOptions": {
"ref": "main",
"sshkey": "your-base64-encoded-ssh-key"
}
}
],
"Env": {
"PORT": "8000",
"HOST": "0.0.0.0",
"LOG_LEVEL": "INFO",
"PYTHONPATH": "/app",
"STATIC_DIR": "/local/app-code/static"
},
"Resources": {
"CPU": 200,
"MemoryMB": 256
},
"Services": [
{
"Name": "your-service-name",
"PortLabel": "http",
"Tags": [
"traefik.enable=true",
"traefik.http.routers.your-service.entryPoints=https",
"traefik.http.routers.your-service.rule=Host(`your-service.domain.com`)"
],
"Checks": [
{
"Type": "http",
"Path": "/api/health",
"Interval": 10000000000, # 10 seconds in nanoseconds
"Timeout": 2000000000 # 2 seconds in nanoseconds
}
]
}
]
}
]
}
]
} }
} }
# Start the job # Start the job
response = nomad_service.start_job(job_spec) response = nomad_service.start_job(job_spec)
print(f"Job deployment response: {response}")
if response.get("status") == "started": if response.get("status") == "started":
print(f"Job deployed successfully!") print(f"Job started successfully: {response.get('job_id')}")
print(f"Job ID: {response.get('job_id')}")
print(f"Evaluation ID: {response.get('eval_id')}")
else: else:
print(f"Failed to deploy job.") print(f"Failed to start job: {response.get('message')}")
print(f"Status: {response.get('status')}")
print(f"Message: {response.get('message', 'Unknown error')}")
if __name__ == "__main__": if __name__ == "__main__":
main() main()
``` ```
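The dictionary form of a health check expresses `Interval` and `Timeout` in nanoseconds (`10000000000` for ten seconds), which is easy to mistype by a zero. One way to avoid hand-written literals is a small conversion helper. The function `duration_ns` below is a hypothetical sketch, not part of `NomadService`, and it accepts only a single integer followed by one of the units `ms`, `s`, `m`, or `h`.

```python
import re

# Nanoseconds per supported unit.
_UNIT_NS = {
    "ms": 1_000_000,
    "s": 1_000_000_000,
    "m": 60 * 1_000_000_000,
    "h": 3600 * 1_000_000_000,
}


def duration_ns(text: str) -> int:
    """Convert a simple duration string such as "10s" to nanoseconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h)", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_NS[unit]


# The health check from the job dictionary, written with readable durations.
health_check = {
    "Type": "http",
    "Path": "/api/health",
    "Interval": duration_ns("10s"),
    "Timeout": duration_ns("2s"),
}
```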
## 2. Deploying the Nomad Job ## Checking Job Status
### Option A: Using the Nomad CLI ### Using the Nomad CLI
```bash ```bash
# Deploy using a .nomad file # Get job status
nomad job run your-job-file.nomad nomad job status example-job
# Verify the job was submitted # Get detailed allocation information
nomad job status your-job-name nomad alloc status <allocation_id>
``` ```
### Option B: Using the Python Deployment Script ### Using Python
```bash
# Run the deployment script
python deploy_your_job.py
```
### Option C: Using the Nomad MCP API
```bash
# Using curl
curl -X POST http://localhost:8000/api/claude/create-job \
-H "Content-Type: application/json" \
-d '{
"job_id": "your-job-name",
"name": "Your Job Name",
"type": "service",
"datacenters": ["dc1"],
"namespace": "development",
"docker_image": "your-registry/your-image:tag",
"count": 1,
"cpu": 200,
"memory": 256,
"ports": [
{
"Label": "http",
"Value": 0,
"To": 8000
}
],
"env_vars": {
"PORT": "8000",
"HOST": "0.0.0.0",
"LOG_LEVEL": "INFO",
"PYTHONPATH": "/app",
"STATIC_DIR": "/local/app-code/static"
}
}'
# Using PowerShell
Invoke-RestMethod -Uri "http://localhost:8000/api/claude/create-job" -Method POST -Headers @{"Content-Type"="application/json"} -Body '{
"job_id": "your-job-name",
"name": "Your Job Name",
"type": "service",
"datacenters": ["dc1"],
"namespace": "development",
"docker_image": "your-registry/your-image:tag",
"count": 1,
"cpu": 200,
"memory": 256,
"ports": [
{
"Label": "http",
"Value": 0,
"To": 8000
}
],
"env_vars": {
"PORT": "8000",
"HOST": "0.0.0.0",
"LOG_LEVEL": "INFO",
"PYTHONPATH": "/app",
"STATIC_DIR": "/local/app-code/static"
}
}'
```
## 3. Checking Job Status
After deploying a job, you should check its status to ensure it's running correctly.
### Option A: Using the Nomad CLI
```bash
# Check job status
nomad job status your-job-name
# Check allocations for the job
nomad job allocs your-job-name
# Check the most recent allocation
nomad alloc status -job your-job-name
```
### Option B: Using the Nomad MCP API
```bash
# Using curl
curl -X POST http://localhost:8000/api/claude/jobs \
-H "Content-Type: application/json" \
-d '{
"job_id": "your-job-name",
"action": "status",
"namespace": "development"
}'
# Using PowerShell
Invoke-RestMethod -Uri "http://localhost:8000/api/claude/jobs" -Method POST -Headers @{"Content-Type"="application/json"} -Body '{
"job_id": "your-job-name",
"action": "status",
"namespace": "development"
}'
```
### Option C: Using a Python Script
```python ```python
#!/usr/bin/env python
from app.services.nomad_client import NomadService from app.services.nomad_client import NomadService
def main(): # Initialize the service
# Initialize the Nomad service nomad_service = NomadService()
service = NomadService()
# Get job information
job = service.get_job('your-job-name')
print(f"Job Status: {job.get('Status', 'Unknown')}")
print(f"Job Type: {job.get('Type', 'Unknown')}")
print(f"Job Datacenters: {job.get('Datacenters', [])}")
# Get allocations
allocations = service.get_allocations('your-job-name')
print(f"\nFound {len(allocations)} allocations")
if allocations:
latest_alloc = allocations[0]
print(f"Latest allocation ID: {latest_alloc.get('ID', 'Unknown')}")
print(f"Allocation Status: {latest_alloc.get('ClientStatus', 'Unknown')}")
if __name__ == "__main__": # Get job status
main() job = nomad_service.get_job("example-job")
print(f"Job Status: {job.get('Status')}")
# Get allocations
allocations = nomad_service.get_allocations("example-job")
for alloc in allocations:
print(f"Allocation: {alloc.get('ID')}, Status: {alloc.get('ClientStatus')}")
``` ```
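A job rarely reaches its final status the moment it is submitted, so a single status read right after deployment can be misleading. The sketch below polls until a wanted status appears or a timeout passes. `wait_for_status` is a hypothetical helper, and it takes the status lookup as a callable so that it does not depend on how `NomadService` is implemented.

```python
import time
from typing import Callable, Optional


def wait_for_status(
    get_status: Callable[[], Optional[str]],
    wanted: str = "running",
    timeout_s: float = 60.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll get_status() until it returns `wanted` or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        if get_status() == wanted:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Hypothetical usage with the service from this guide:
#   ok = wait_for_status(lambda: nomad_service.get_job("example-job").get("Status"))
```

Compared with a fixed `time.sleep(5)`, polling returns as soon as the job is up and reports a clear failure when it never gets there.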
## 4. Checking Job Logs ## Retrieving Logs
Logs are crucial for diagnosing issues with your job. Here's how to access them: ### Using the Nomad CLI
### Option A: Using the Nomad CLI
```bash ```bash
# First, get the allocation ID # Get stdout logs
nomad job allocs your-job-name nomad alloc logs <allocation_id>
# Then view the logs for a specific allocation # Get stderr logs
nomad alloc logs <allocation-id> nomad alloc logs -stderr <allocation_id>
# View stderr logs
nomad alloc logs -stderr <allocation-id>
# Follow logs in real-time
nomad alloc logs -f <allocation-id>
``` ```
### Option B: Using the Nomad MCP API ### Using Python
```bash
# Using curl
curl -X GET http://localhost:8000/api/claude/job-logs/your-job-name
# Using PowerShell
Invoke-RestMethod -Uri "http://localhost:8000/api/claude/job-logs/your-job-name" -Method GET
```
### Option C: Using a Python Script
```python ```python
#!/usr/bin/env python
from app.services.nomad_client import NomadService from app.services.nomad_client import NomadService
def main(): # Initialize the service
# Initialize the Nomad service nomad_service = NomadService()
service = NomadService()
# Get allocations for the job
allocations = service.get_allocations('your-job-name')
if allocations:
latest_alloc = allocations[0]
alloc_id = latest_alloc["ID"]
print(f"Latest allocation ID: {alloc_id}")
# Get logs for the allocation
try:
# Get stdout logs
stdout_logs = service.get_allocation_logs(alloc_id, task="your-task-name", log_type="stdout")
print("\nStandard Output Logs:")
print(stdout_logs)
# Get stderr logs
stderr_logs = service.get_allocation_logs(alloc_id, task="your-task-name", log_type="stderr")
print("\nStandard Error Logs:")
print(stderr_logs)
except Exception as e:
print(f"Error getting logs: {str(e)}")
else:
print("No allocations found for your-job-name job")
if __name__ == "__main__": # Get allocations for the job
main() allocations = nomad_service.get_allocations("example-job")
if allocations:
# Get logs from the most recent allocation
latest_alloc = allocations[0]
stdout_logs = nomad_service.get_allocation_logs(latest_alloc["ID"], "server", "stdout")
stderr_logs = nomad_service.get_allocation_logs(latest_alloc["ID"], "server", "stderr")
print("STDOUT Logs:")
print(stdout_logs)
print("\nSTDERR Logs:")
print(stderr_logs)
``` ```
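The examples treat `allocations[0]` as the most recent allocation, which holds only if the list arrives sorted newest first. Where that ordering is not guaranteed, selecting explicitly is safer. The sketch below assumes each allocation dictionary carries the `CreateTime` field (nanoseconds since the epoch) found in Nomad's allocation listings; whether `get_allocations` passes that field through unchanged is an assumption, and `latest_allocation` is a hypothetical helper.

```python
from typing import Optional


def latest_allocation(allocations: list) -> Optional[dict]:
    """Return the allocation with the highest CreateTime, or None if empty."""
    if not allocations:
        return None
    # Allocations without CreateTime sort as oldest.
    return max(allocations, key=lambda alloc: alloc.get("CreateTime", 0))


# Illustrative data only; the IDs and timestamps are made up.
sample_allocations = [
    {"ID": "alloc-old", "ClientStatus": "failed", "CreateTime": 1_700_000_000_000_000_000},
    {"ID": "alloc-new", "ClientStatus": "running", "CreateTime": 1_700_000_600_000_000_000},
]
newest = latest_allocation(sample_allocations)
```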
## 5. Troubleshooting Common Issues ## Troubleshooting
### Issue: Job Fails to Start ### Common Issues and Solutions
1. **Check the job status**: #### 1. Job Fails to Start
```bash
nomad job status your-job-name
```
2. **Examine the allocation status**: **Symptoms:**
```bash - Job status shows as "dead"
nomad alloc status -job your-job-name - Allocation status shows as "failed"
```
3. **Check the logs for errors**: **Possible Causes and Solutions:**
```bash
# Get the allocation ID first
nomad job allocs your-job-name
# Then check the logs
nomad alloc logs -stderr <allocation-id>
```
4. **Common errors and solutions**: a) **Resource Constraints:**
- Check if the job is requesting more resources than available
- Reduce CPU or memory requirements in the job specification
a. **Missing static directory**: b) **Missing Static Directory:**
``` - Error: `RuntimeError: Directory 'static' does not exist`
RuntimeError: Directory 'static' does not exist - Solution: Use environment variables to specify the static directory path
```
Solution: Add an environment variable to specify the static directory path:
```hcl ```hcl
env { env {
STATIC_DIR = "/local/app-code/static" STATIC_DIR = "/local/your_app/static"
} }
``` ```
b. **Invalid mount configuration**: c) **Module Import Errors:**
``` - Error: `ModuleNotFoundError: No module named 'app'`
invalid mount config for type 'bind': bind source path does not exist - Solution: Set the correct PYTHONPATH in the job specification
```
Solution: Ensure the source path exists or is created by an artifact:
```hcl ```hcl
artifact { env {
source = "git::ssh://git@your-git-server:port/org/repo.git" PYTHONPATH = "/local/your_app"
destination = "local/app-code"
} }
``` ```
c. **Port already allocated**: d) **Artifact Retrieval Failures:**
``` - Error: `Failed to download artifact: git::ssh://...`
Allocation failed: Failed to place allocation: failed to place alloc: port is already allocated - Solution: Verify SSH key, repository URL, and permissions
```
Solution: Use dynamic ports or choose a different port:
```hcl
network {
port "http" {
to = 8000
}
}
```
### Issue: Application Errors After Deployment e) **Old Code Version Running:**
- Symptom: Your recent changes aren't reflected in the deployed application
- Solution: **Commit and push your code changes to Gitea before deploying**
1. **Check application logs**: #### 2. Network Connectivity Issues
```bash
nomad alloc logs <allocation-id>
```
2. **Verify environment variables**: **Symptoms:**
```bash - Connection timeouts
nomad alloc status <allocation-id> - "Failed to connect to Nomad" errors
```
Look for the "Environment Variables" section.
3. **Check resource constraints**: **Solutions:**
Ensure the job has enough CPU and memory allocated: - Verify Nomad server address and port
```hcl - Check network connectivity and firewall rules
resources { - Ensure proper authentication token is provided
cpu = 200 # Increase if needed
memory = 256 # Increase if needed
}
```
## 6. Updating a Job #### 3. Permission Issues
After fixing issues, you'll need to update the job: **Symptoms:**
- "Permission denied" errors
- "ACL token not found" messages
### Option A: Using the Nomad CLI **Solutions:**
- Verify your token has appropriate permissions
- Check namespace settings in your job specification
- Ensure the token is properly set in the environment
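Many connectivity and permission failures trace back to environment variables that are unset or malformed. The sketch below checks the three standard Nomad variables (`NOMAD_ADDR`, `NOMAD_TOKEN`, `NOMAD_NAMESPACE`) before any API call is made. `check_nomad_env` is a hypothetical helper, and it assumes the client in use reads these same variables.

```python
import os
from typing import List, Mapping, Optional


def check_nomad_env(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return human-readable problems found in the Nomad environment."""
    env = os.environ if env is None else env
    problems = []

    addr = env.get("NOMAD_ADDR", "")
    if not addr:
        problems.append(
            "NOMAD_ADDR is not set; the Nomad CLI falls back to http://127.0.0.1:4646"
        )
    elif not addr.startswith(("http://", "https://")):
        problems.append(f"NOMAD_ADDR should include a scheme, got {addr!r}")

    if not env.get("NOMAD_TOKEN"):
        problems.append("NOMAD_TOKEN is not set (needed when ACLs are enabled)")

    if not env.get("NOMAD_NAMESPACE"):
        problems.append('NOMAD_NAMESPACE is not set; the "default" namespace is used')

    return problems


for problem in check_nomad_env():
    print(f"warning: {problem}")
```

An empty result does not prove the token is valid or has the right policies. It only rules out the simplest misconfigurations.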
## Complete Workflow Example
Here's a complete workflow for managing a Nomad job:
### 1. Develop and Test Your Application
```bash ```bash
# Update the job with the modified specification # Make changes to your application code
nomad job run your-updated-job-file.nomad # Test locally to ensure it works
python -m uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
``` ```
### Option B: Using the Nomad MCP API ### 2. Commit and Push Your Changes to Gitea
```bash ```bash
# Using PowerShell to restart a job # Stage your changes
Invoke-RestMethod -Uri "http://localhost:8000/api/claude/jobs" -Method POST -Headers @{"Content-Type"="application/json"} -Body '{ git add .
"job_id": "your-job-name",
"action": "restart", # Commit your changes
"namespace": "development" git commit -m "Update application with new feature"
}'
# Push to Gitea repository
git push origin main
``` ```
### Option C: Using a Python Script > **CRITICAL:** This step is essential when using Gitea artifacts in your Nomad jobs. Without pushing your changes, the job will pull the old version of the code.
```python ### 3. Deploy the Job
#!/usr/bin/env python
from app.services.nomad_client import NomadService
def main():
# Initialize the Nomad service
service = NomadService()
# Get the current job specification
job = service.get_job('your-job-name')
# Modify the job specification as needed
# For example, update environment variables:
job["TaskGroups"][0]["Tasks"][0]["Env"]["STATIC_DIR"] = "/local/app-code/static"
# Update the job
response = service.start_job({"Job": job})
print(f"Job update response: {response}")
if __name__ == "__main__":
main()
```
## 7. Stopping a Job
When you're done with a job, you can stop it:
### Option A: Using the Nomad CLI
```bash ```bash
# Stop a job # Using a deployment script
nomad job stop your-job-name python deploy_job.py
# Stop and purge a job # Or using the Nomad CLI
nomad job stop -purge your-job-name nomad job run job_spec.nomad
``` ```
### Option B: Using the Nomad MCP API ### 4. Check Job Status
```bash ```bash
# Using PowerShell # Using the Nomad CLI
Invoke-RestMethod -Uri "http://localhost:8000/api/claude/jobs" -Method POST -Headers @{"Content-Type"="application/json"} -Body '{ nomad job status example-job
"job_id": "your-job-name",
"action": "stop", # Or using Python
"namespace": "development", python -c "from app.services.nomad_client import NomadService; service = NomadService(); job = service.get_job('example-job'); print(f'Job Status: {job.get(\"Status\", \"Unknown\")}');"
"purge": true
}'
``` ```
### Option C: Using a Python Script ### 5. Check Logs if Issues Occur
```python ```bash
#!/usr/bin/env python # Get allocations
from app.services.nomad_client import NomadService allocations=$(nomad job status -json example-job | jq -r '.Allocations[0].ID')
def main(): # Check logs
# Initialize the Nomad service nomad alloc logs $allocations
service = NomadService() nomad alloc logs -stderr $allocations
# Stop the job
response = service.stop_job('your-job-name', purge=True)
print(f"Job stop response: {response}")
if __name__ == "__main__":
main()
``` ```
## 8. Complete Workflow Example ### 6. Fix Issues and Update
Here's a complete workflow for deploying, monitoring, troubleshooting, and updating a job: If you encounter issues:
```python 1. Fix the code in your local environment
#!/usr/bin/env python 2. **Commit and push changes to Gitea**
import time 3. Redeploy the job
from app.services.nomad_client import NomadService 4. Check status and logs again
def main(): ### 7. Stop the Job When Done
# Initialize the Nomad service
service = NomadService()
# 1. Create and deploy the job
job_spec = {
"Job": {
"ID": "example-app",
"Name": "Example Application",
"Type": "service",
"Datacenters": ["dc1"],
"Namespace": "development",
# ... rest of job specification ...
}
}
deploy_response = service.start_job(job_spec)
print(f"Deployment response: {deploy_response}")
# 2. Wait for the job to be scheduled
print("Waiting for job to be scheduled...")
time.sleep(5)
# 3. Check job status
job = service.get_job('example-app')
print(f"Job Status: {job.get('Status', 'Unknown')}")
# 4. Get allocations
allocations = service.get_allocations('example-app')
if allocations:
latest_alloc = allocations[0]
alloc_id = latest_alloc["ID"]
print(f"Latest allocation ID: {alloc_id}")
print(f"Allocation Status: {latest_alloc.get('ClientStatus', 'Unknown')}")
# 5. Check logs for errors
stderr_logs = service.get_allocation_logs(alloc_id, log_type="stderr")
# 6. Look for common errors
if "Directory 'static' does not exist" in stderr_logs:
print("Error detected: Missing static directory")
# 7. Update the job to fix the issue
job["TaskGroups"][0]["Tasks"][0]["Env"]["STATIC_DIR"] = "/local/app-code/static"
update_response = service.start_job({"Job": job})
print(f"Job update response: {update_response}")
# 8. Wait for the updated job to be scheduled
print("Waiting for updated job to be scheduled...")
time.sleep(5)
# 9. Check the updated job status
updated_job = service.get_job('example-app')
print(f"Updated Job Status: {updated_job.get('Status', 'Unknown')}")
else:
print("No allocations found for the job")
if __name__ == "__main__": ```bash
main() # Stop without purging (keeps job definition)
nomad job stop example-job
# Stop and purge (completely removes job)
nomad job stop -purge example-job
``` ```
## 9. Best Practices ## Best Practices
1. **Always check logs after deployment**: Logs are your primary tool for diagnosing issues. 1. **Always commit and push code before deployment**: When using Gitea artifacts, ensure your code is committed and pushed before deploying jobs.
2. **Use environment variables for configuration**: This makes your jobs more flexible and easier to update. 2. **Use namespaces**: Organize jobs by environment (development, staging, production).
3. **Implement health checks**: Health checks help Nomad determine if your application is running correctly. 3. **Set appropriate resource limits**: Specify realistic CPU and memory requirements.
4. **Set appropriate resource limits**: Allocate enough CPU and memory for your application to run efficiently. 4. **Implement health checks**: Add service health checks to detect application issues.
5. **Use artifacts for code deployment**: Pull code from a Git repository to ensure consistency. 5. **Use environment variables**: Configure applications through environment variables for flexibility.
6. **Implement proper error handling**: Your application should handle errors gracefully and provide meaningful error messages. 6. **Implement proper error handling**: Add robust error handling in your application.
7. **Use namespaces**: Organize your jobs into namespaces based on environment or team. 7. **Monitor job status**: Regularly check job status and logs.
8. **Document your job specifications**: Include comments in your job files to explain configuration choices. 8. **Version your artifacts**: Use specific tags or commits for reproducible deployments.
9. **Implement a CI/CD pipeline**: Automate the deployment process to reduce errors and improve efficiency. 9. **Document job specifications**: Keep documentation of job requirements and configurations.
10. **Monitor job performance**: Use Nomad's monitoring capabilities to track resource usage and performance. 10. **Test locally before deployment**: Verify application functionality in a local environment.
## 10. Conclusion ## Conclusion
Managing Nomad jobs effectively requires understanding the job lifecycle, from creation to deployment, monitoring, troubleshooting, and updating. By following this guide, you can create robust deployment processes that ensure your applications run reliably in a Nomad cluster. Managing Nomad jobs effectively requires understanding the job lifecycle, proper configuration, and troubleshooting techniques. By following this guide, you should be able to create, deploy, monitor, and troubleshoot Nomad jobs efficiently.
Remember that the key to successful job management is thorough testing, careful monitoring, and quick response to issues. With the right tools and processes in place, you can efficiently manage even complex applications in a Nomad environment. Remember that the most common issues are related to resource constraints, network connectivity, and configuration errors. Always check logs when troubleshooting, and ensure your code is properly committed and pushed to Gitea before deployment.

Binary file not shown.