Revise Nomad Job Management Guide with comprehensive workflow and best practices

2025-02-26 17:28:18 +07:00
parent e3e19f5099
commit 1c2166111b
2 changed files with 239 additions and 558 deletions


# Nomad Job Management Guide
This guide provides comprehensive instructions for creating, deploying, monitoring, and troubleshooting Nomad jobs.
## Prerequisites
Before you begin, ensure you have:
1. Access to a Nomad cluster
2. Proper authentication credentials (token if ACLs are enabled)
3. Network connectivity to the Nomad API endpoint (default port 4646)
4. Access to the Gitea repository (if using artifact integration)
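The Nomad CLI and API clients read their connection settings from standard environment variables. A minimal setup sketch with placeholder values (replace them with your own cluster's address, namespace, and token):
```bash
export NOMAD_ADDR="http://nomad.example.com:4646"   # Nomad API endpoint (default port 4646)
export NOMAD_NAMESPACE="development"                # namespace used for job operations
export NOMAD_TOKEN="<your-acl-token>"               # only needed when ACLs are enabled
```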
## Job Specifications
### HCL Format (Nomad Job File)
Nomad jobs can be defined in HashiCorp Configuration Language (HCL):
```hcl
job "your-job-name" {
job "example-job" {
datacenters = ["dc1"]
type = "service"
namespace = "development"
@ -24,677 +26,356 @@ job "your-job-name" {
group "app" {
count = 1
network {
port "http" {
to = 8000
}
}
task "app-task" {
task "server" {
driver = "docker"
config {
image = "your-registry/your-image:tag"
image = "nginx:latest"
ports = ["http"]
command = "python"
args = ["-m", "uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
# Mount volumes if needed
mount {
type = "bind"
source = "local/app-code"
target = "/app"
readonly = false
}
}
# Pull code from Git repository if needed
artifact {
source = "git::ssh://git@your-git-server:port/org/repo.git"
destination = "local/app-code"
options {
ref = "main"
sshkey = "your-base64-encoded-ssh-key"
}
}
env {
# Environment variables
PORT = "8000"
HOST = "0.0.0.0"
LOG_LEVEL = "INFO"
PYTHONPATH = "/app"
# Add any application-specific environment variables
STATIC_DIR = "/local/app-code/static"
}
resources {
cpu = 200
memory = 256
}
service {
name = "your-service-name"
port = "http"
tags = [
"traefik.enable=true",
"traefik.http.routers.your-service.entryPoints=https",
"traefik.http.routers.your-service.rule=Host(`your-service.domain.com`)"
]
check {
type = "http"
path = "/api/health"
interval = "10s"
timeout = "2s"
}
cpu = 100
memory = 128
}
}
}
}
```
### Python Format (Dictionary)
For programmatic job management, use Python dictionaries:
```python
job_spec = {
    "Job": {
        "ID": "example-job",
        "Name": "example-job",
        "Type": "service",
        "Datacenters": ["dc1"],
        "Namespace": "development",
        "TaskGroups": [
            {
                "Name": "app",
                "Count": 1,
                "Tasks": [
                    {
                        "Name": "server",
                        "Driver": "docker",
                        "Config": {
                            "image": "nginx:latest"
                        },
                        "Resources": {
                            "CPU": 100,
                            "MemoryMB": 128
                        }
                    }
                ]
            }
        ]
    }
}
```
## Deployment Methods
> **CRITICAL: Always commit and push your code changes to Gitea before deploying jobs!**
>
> When using Gitea artifacts in your Nomad jobs, the job will pull code from the repository at deployment time. If you don't commit and push your changes first, the job will use the old version of the code, and your changes won't be reflected in the deployed application.
### Using the Nomad CLI
```bash
# Deploy a job using an HCL file
nomad job run job_spec.nomad
# Stop a job
nomad job stop example-job
# Purge a job (completely remove it)
nomad job stop -purge example-job
```
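Before registering a job, it can help to validate the file and preview what the scheduler would change. A short sketch using the standard `nomad job validate` and `nomad job plan` subcommands:
```bash
# Check the job file for syntax and schema errors
nomad job validate job_spec.nomad
# Dry-run the scheduler to preview placements without deploying
nomad job plan job_spec.nomad
```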
### Using Python with the Nomad API
```python
from app.services.nomad_client import NomadService
# Initialize the service
nomad_service = NomadService()
# Start a job
response = nomad_service.start_job(job_spec)
print(f"Job started: {response}")
# Stop a job
response = nomad_service.stop_job("example-job", purge=False)
print(f"Job stopped: {response}")
```
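`NomadService` is a project-specific wrapper; the operations above ultimately go through Nomad's public HTTP API. A rough curl sketch of the equivalent raw calls, assuming `NOMAD_ADDR` (and `NOMAD_TOKEN`, if ACLs are enabled) are set and `job_spec.json` is a hypothetical file containing the `{"Job": {...}}` payload:
```bash
# Register (start or update) a job
curl -s -X POST "$NOMAD_ADDR/v1/jobs" -H "X-Nomad-Token: $NOMAD_TOKEN" -d @job_spec.json
# Read a job
curl -s "$NOMAD_ADDR/v1/job/example-job" -H "X-Nomad-Token: $NOMAD_TOKEN"
# Stop a job (append ?purge=true to remove it completely)
curl -s -X DELETE "$NOMAD_ADDR/v1/job/example-job" -H "X-Nomad-Token: $NOMAD_TOKEN"
```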
### Using a Deployment Script
Create a deployment script that handles the job specification and deployment:
```python
#!/usr/bin/env python
from app.services.nomad_client import NomadService

def main():
    # Initialize the Nomad service
    nomad_service = NomadService()

    # Define job specification
    job_spec = {
        "Job": {
            "ID": "example-job",
            # ... job configuration ...
        }
    }

    # Start the job
    response = nomad_service.start_job(job_spec)
    if response.get("status") == "started":
        print(f"Job started successfully: {response.get('job_id')}")
    else:
        print(f"Failed to start job: {response.get('message')}")

if __name__ == "__main__":
    main()
```
## Checking Job Status
### Using the Nomad CLI
```bash
# Get job status
nomad job status example-job
# Get detailed allocation information
nomad alloc status <allocation_id>
```
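The `<allocation_id>` placeholder comes from the job's allocation list, which is shown at the bottom of the `nomad job status` output; recent Nomad versions can also list allocations directly:
```bash
# List allocations for the job, then copy the ID of the most recent one
nomad job allocs example-job
```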
### Using Python
```python
from app.services.nomad_client import NomadService
# Initialize the service
nomad_service = NomadService()
# Get job status
job = nomad_service.get_job("example-job")
print(f"Job Status: {job.get('Status')}")
# Get allocations
allocations = nomad_service.get_allocations("example-job")
for alloc in allocations:
    print(f"Allocation: {alloc.get('ID')}, Status: {alloc.get('ClientStatus')}")
```
## Retrieving Logs
Logs are crucial for diagnosing issues with your job. Here's how to access them:
### Using the Nomad CLI
```bash
# Get stdout logs
nomad alloc logs <allocation_id>
# Get stderr logs
nomad alloc logs -stderr <allocation_id>
```
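When debugging a live issue it is often easier to stream logs rather than fetch a snapshot; the CLI supports following output with `-f`:
```bash
# Follow stdout in real time (Ctrl+C to stop)
nomad alloc logs -f <allocation_id>
# Follow stderr instead
nomad alloc logs -f -stderr <allocation_id>
```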
### Using Python
```python
from app.services.nomad_client import NomadService
# Initialize the service
nomad_service = NomadService()
# Get allocations for the job
allocations = nomad_service.get_allocations("example-job")
if allocations:
    # Get logs from the most recent allocation
    latest_alloc = allocations[0]
    stdout_logs = nomad_service.get_allocation_logs(latest_alloc["ID"], "server", "stdout")
    stderr_logs = nomad_service.get_allocation_logs(latest_alloc["ID"], "server", "stderr")
    print("STDOUT Logs:")
    print(stdout_logs)
    print("\nSTDERR Logs:")
    print(stderr_logs)
```
## Troubleshooting
### Common Issues and Solutions
#### 1. Job Fails to Start
**Symptoms:**
- Job status shows as "dead"
- Allocation status shows as "failed"
**Possible Causes and Solutions:**
a) **Resource Constraints:**
- Check if the job is requesting more resources than available
- Reduce CPU or memory requirements in the job specification
b) **Missing Static Directory:**
- Error: `RuntimeError: Directory 'static' does not exist`
- Solution: Use environment variables to specify the static directory path
```hcl
env {
  STATIC_DIR = "/local/your_app/static"
}
```
c) **Module Import Errors:**
- Error: `ModuleNotFoundError: No module named 'app'`
- Solution: Set the correct PYTHONPATH in the job specification
```hcl
env {
  PYTHONPATH = "/local/your_app"
}
```
d) **Artifact Retrieval Failures:**
- Error: `Failed to download artifact: git::ssh://...`
- Solution: Verify SSH key, repository URL, and permissions
e) **Old Code Version Running:**
- Symptom: Your recent changes aren't reflected in the deployed application
- Solution: **Commit and push your code changes to Gitea before deploying**
#### 2. Network Connectivity Issues
**Symptoms:**
- Connection timeouts
- "Failed to connect to Nomad" errors
**Solutions:**
- Verify Nomad server address and port
- Check network connectivity and firewall rules
- Ensure proper authentication token is provided
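A quick sanity check is to query the agent endpoint of the Nomad HTTP API directly; a small sketch, assuming `NOMAD_ADDR` points at your cluster:
```bash
# Should return JSON describing the agent if the address, port, and network path are correct
# (add -H "X-Nomad-Token: $NOMAD_TOKEN" if ACLs are enabled)
curl -s "$NOMAD_ADDR/v1/agent/self"
```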
#### 3. Permission Issues
**Symptoms:**
- "Permission denied" errors
- "ACL token not found" messages
**Solutions:**
- Verify your token has appropriate permissions
- Check namespace settings in your job specification
- Ensure the token is properly set in the environment
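To rule out token problems, you can ask Nomad to describe the token it currently sees; a short check using the standard CLI:
```bash
# Export the token where the CLI and scripts can read it
export NOMAD_TOKEN="<your-acl-token>"
# Show the accessor and policies attached to the current token
nomad acl token self
```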
## Complete Workflow Example
Here's a complete workflow for managing a Nomad job:
### 1. Develop and Test Your Application
```bash
# Make changes to your application code
# Test locally to ensure it works
python -m uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```
### 2. Commit and Push Your Changes to Gitea
```bash
# Stage your changes
git add .
# Commit your changes
git commit -m "Update application with new feature"
# Push to Gitea repository
git push origin main
```
> **CRITICAL:** This step is essential when using Gitea artifacts in your Nomad jobs. Without pushing your changes, the job will pull the old version of the code.
### 3. Deploy the Job
```bash
# Using a deployment script
python deploy_job.py
# Or using the Nomad CLI
nomad job run job_spec.nomad
```
### 4. Check Job Status
```bash
# Using the Nomad CLI
nomad job status example-job
# Or using Python
python -c "from app.services.nomad_client import NomadService; service = NomadService(); job = service.get_job('example-job'); print(f'Job Status: {job.get(\"Status\", \"Unknown\")}')"
```
### 5. Check Logs if Issues Occur
```bash
# Get the ID of the most recent allocation
allocations=$(nomad job status -json example-job | jq -r '.Allocations[0].ID')
# Check logs
nomad alloc logs $allocations
nomad alloc logs -stderr $allocations
```
### 6. Fix Issues and Update
If you encounter issues, work through the following loop (a condensed command sketch follows the list):
1. Fix the code in your local environment
2. **Commit and push changes to Gitea**
3. Redeploy the job
4. Check status and logs again
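A condensed sketch of that loop, reusing commands from the earlier steps (the commit message and file names are placeholders):
```bash
# Fix the code locally, then commit and push so the artifact pulls the new version
git add .
git commit -m "Fix deployment issue"
git push origin main
# Redeploy the job
nomad job run job_spec.nomad
# Verify the new allocation and check its logs
nomad job status example-job
```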
### 7. Stop the Job When Done
```bash
# Stop without purging (keeps job definition)
nomad job stop example-job
# Stop and purge (completely removes job)
nomad job stop -purge example-job
```
## Best Practices
1. **Always commit and push code before deployment**: When using Gitea artifacts, ensure your code is committed and pushed before deploying jobs.
2. **Use namespaces**: Organize jobs by environment (development, staging, production).
3. **Set appropriate resource limits**: Specify realistic CPU and memory requirements.
4. **Implement health checks**: Add service health checks to detect application issues.
5. **Use environment variables**: Configure applications through environment variables for flexibility.
6. **Implement proper error handling**: Add robust error handling in your application.
7. **Monitor job status**: Regularly check job status and logs.
8. **Version your artifacts**: Use specific tags or commits for reproducible deployments.
9. **Document job specifications**: Keep documentation of job requirements and configurations.
10. **Test locally before deployment**: Verify application functionality in a local environment.
## Conclusion
Managing Nomad jobs effectively requires understanding the job lifecycle, proper configuration, and troubleshooting techniques. By following this guide, you should be able to create, deploy, monitor, and troubleshoot Nomad jobs efficiently.
Remember that the most common issues are related to resource constraints, network connectivity, and configuration errors. Always check logs when troubleshooting, and ensure your code is properly committed and pushed to Gitea before deployment.