Revise Nomad Job Management Guide with comprehensive workflow and best practices
@@ -1,22 +1,24 @@
 # Nomad Job Management Guide

-This guide explains the complete process of creating, deploying, monitoring, and troubleshooting Nomad jobs using the Nomad MCP service. It's designed to be used by both humans and AI assistants to effectively manage containerized applications in a Nomad cluster.
+This guide provides comprehensive instructions for managing Nomad jobs, including creating, deploying, monitoring, and troubleshooting.

 ## Prerequisites

-- Access to a Nomad cluster
-- Nomad MCP service installed and running
-- Proper environment configuration (NOMAD_ADDR, NOMAD_NAMESPACE, etc.)
-- Python with required packages installed
+Before you begin, ensure you have:

-## 1. Creating a Nomad Job Specification
+1. Access to a Nomad cluster
+2. Proper authentication credentials (token if ACLs are enabled)
+3. Network connectivity to the Nomad API endpoint (default port 4646)
+4. Access to the Gitea repository (if using artifact integration)
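
The credential and connectivity prerequisites can be checked from Python before any job is submitted. The sketch below is an illustration only and is not part of the commit: it assumes the conventional `NOMAD_ADDR` and `NOMAD_TOKEN` environment variables and Nomad's `/v1/status/leader` HTTP endpoint, and the helper name `build_leader_request` is hypothetical.

```python
import os
import urllib.request


def build_leader_request(env=None):
    """Build (but do not send) a request to Nomad's /v1/status/leader endpoint."""
    env = os.environ if env is None else env
    # Assumption: fall back to a local agent on the default API port 4646.
    addr = env.get("NOMAD_ADDR", "http://127.0.0.1:4646").rstrip("/")
    request = urllib.request.Request(f"{addr}/v1/status/leader")
    token = env.get("NOMAD_TOKEN")
    if token:
        # The token header is only needed when ACLs are enabled.
        request.add_header("X-Nomad-Token", token)
    return request
```

Passing the returned request to `urllib.request.urlopen(request, timeout=5)` should yield the leader address when the cluster is reachable and the token is accepted.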

-A Nomad job specification defines how your application should run. This can be created in two formats:
+## Job Specifications

-### Option A: Using a .nomad HCL File
+### HCL Format (Nomad Job File)

+Nomad jobs can be defined in HashiCorp Configuration Language (HCL):

 ```hcl
-job "your-job-name" {
+job "example-job" {
   datacenters = ["dc1"]
   type = "service"
   namespace = "development"
@@ -24,677 +26,356 @@ job "your-job-name" {
   group "app" {
     count = 1

-    network {
-      port "http" {
-        to = 8000
-      }
-    }

-    task "app-task" {
+    task "server" {
       driver = "docker"


       config {
-        image = "your-registry/your-image:tag"
+        image = "nginx:latest"
-        ports = ["http"]
-        command = "python"
-        args = ["-m", "uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

-        # Mount volumes if needed
-        mount {
-          type = "bind"
-          source = "local/app-code"
-          target = "/app"
-          readonly = false
-        }
       }

-      # Pull code from Git repository if needed
-      artifact {
-        source = "git::ssh://git@your-git-server:port/org/repo.git"
-        destination = "local/app-code"
-        options {
-          ref = "main"
-          sshkey = "your-base64-encoded-ssh-key"
-        }
-      }

-      env {
-        # Environment variables
-        PORT = "8000"
-        HOST = "0.0.0.0"
-        LOG_LEVEL = "INFO"
-        PYTHONPATH = "/app"

-        # Add any application-specific environment variables
-        STATIC_DIR = "/local/app-code/static"
-      }


       resources {
-        cpu = 200
-        memory = 256
-      }

-      service {
-        name = "your-service-name"
-        port = "http"
-        tags = [
-          "traefik.enable=true",
-          "traefik.http.routers.your-service.entryPoints=https",
-          "traefik.http.routers.your-service.rule=Host(`your-service.domain.com`)"
-        ]

-        check {
-          type = "http"
-          path = "/api/health"
-          interval = "10s"
-          timeout = "2s"
-        }
+        cpu = 100
+        memory = 128
       }
     }
   }
 }
 ```

-### Option B: Using a Python Deployment Script
+### Python Format (Dictionary)

+For programmatic job management, use Python dictionaries:

 ```python
+job_spec = {
+    "Job": {
+        "ID": "example-job",
+        "Name": "example-job",
+        "Type": "service",
+        "Datacenters": ["dc1"],
+        "Namespace": "development",
+        "TaskGroups": [
+            {
+                "Name": "app",
+                "Count": 1,
+                "Tasks": [
+                    {
+                        "Name": "server",
+                        "Driver": "docker",
+                        "Config": {
+                            "image": "nginx:latest"
+                        },
+                        "Resources": {
+                            "CPU": 100,
+                            "MemoryMB": 128
+                        }
+                    }
+                ]
+            }
+        ]
+    }
+}
+```
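
When several jobs share this shape, a small helper can build the dictionary instead of copying it by hand. The following is a sketch under assumptions: `build_job_spec` and its parameter names are hypothetical, and it covers only the fields shown in the dictionary above.

```python
import json


def build_job_spec(job_id, image, namespace="development",
                   datacenters=("dc1",), cpu=100, memory_mb=128):
    """Return a minimal single-task Docker service job as a dictionary."""
    return {
        "Job": {
            "ID": job_id,
            "Name": job_id,
            "Type": "service",
            "Datacenters": list(datacenters),
            "Namespace": namespace,
            "TaskGroups": [
                {
                    "Name": "app",
                    "Count": 1,
                    "Tasks": [
                        {
                            "Name": "server",
                            "Driver": "docker",
                            "Config": {"image": image},
                            "Resources": {"CPU": cpu, "MemoryMB": memory_mb},
                        }
                    ],
                }
            ],
        }
    }


# The result is plain data, so it serializes directly to JSON.
example_payload = json.dumps(build_job_spec("example-job", "nginx:latest"))
```

Keeping the specification as plain data means the same dictionary can be logged, diffed, or handed to whichever client submits it.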

+## Deployment Methods

+> **CRITICAL: Always commit and push your code changes to Gitea before deploying jobs!**
+>
+> When using Gitea artifacts in your Nomad jobs, the job will pull code from the repository at deployment time. If you don't commit and push your changes first, the job will use the old version of the code, and your changes won't be reflected in the deployed application.
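
The commit-and-push rule can be enforced with a pre-deployment check. The sketch below is one possible approach, not something the commit provides: `run_git` and `unpublished_changes` are hypothetical helper names, and it assumes the current branch tracks an upstream branch on the Gitea remote.

```python
import subprocess


def run_git(args, cwd="."):
    """Run a git command and return its standard output."""
    result = subprocess.run(["git", *args], cwd=cwd,
                            capture_output=True, text=True, check=True)
    return result.stdout


def unpublished_changes(git=run_git):
    """List reasons why the deployed artifact would not match local code."""
    problems = []
    # Any output from `git status --porcelain` means uncommitted changes.
    if git(["status", "--porcelain"]).strip():
        problems.append("uncommitted changes in the working tree")
    # Commits reachable from HEAD but not from the upstream branch.
    if git(["rev-list", "@{u}..HEAD"]).strip():
        problems.append("local commits not pushed to the upstream branch")
    return problems
```

A deployment script could call `unpublished_changes()` first and refuse to submit the job while the returned list is non-empty.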

+### Using the Nomad CLI

+```bash
+# Deploy a job using an HCL file
+nomad job run job_spec.nomad

+# Stop a job
+nomad job stop example-job

+# Purge a job (completely remove it)
+nomad job stop -purge example-job
+```

+### Using Python with the Nomad API

+```python
+from app.services.nomad_client import NomadService

+# Initialize the service
+nomad_service = NomadService()

+# Start a job
+response = nomad_service.start_job(job_spec)
+print(f"Job started: {response}")

+# Stop a job
+response = nomad_service.stop_job("example-job", purge=False)
+print(f"Job stopped: {response}")
+```

+### Using a Deployment Script

+Create a deployment script that handles the job specification and deployment:

+```python
 #!/usr/bin/env python
 import os
 import json
 from app.services.nomad_client import NomadService

 def main():
     # Initialize the Nomad service
     nomad_service = NomadService()

-    # Create job specification
+    # Define job specification
     job_spec = {
         "Job": {
-            "ID": "your-job-name",
-            "Name": "your-job-name",
-            "Type": "service",
-            "Datacenters": ["dc1"],
-            "Namespace": "development",
-            "TaskGroups": [
-                {
-                    "Name": "app",
-                    "Count": 1,
-                    "Networks": [
-                        {
-                            "DynamicPorts": [
-                                {
-                                    "Label": "http",
-                                    "To": 8000
-                                }
-                            ]
-                        }
-                    ],
-                    "Tasks": [
-                        {
-                            "Name": "app-task",
-                            "Driver": "docker",
-                            "Config": {
-                                "image": "your-registry/your-image:tag",
-                                "ports": ["http"],
-                                "command": "python",
-                                "args": ["-m", "uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"],
-                                "mount": [
-                                    {
-                                        "type": "bind",
-                                        "source": "local/app-code",
-                                        "target": "/app",
-                                        "readonly": False
-                                    }
-                                ]
-                            },
-                            "Artifacts": [
-                                {
-                                    "GetterSource": "git::ssh://git@your-git-server:port/org/repo.git",
-                                    "RelativeDest": "local/app-code",
-                                    "GetterOptions": {
-                                        "ref": "main",
-                                        "sshkey": "your-base64-encoded-ssh-key"
-                                    }
-                                }
-                            ],
-                            "Env": {
-                                "PORT": "8000",
-                                "HOST": "0.0.0.0",
-                                "LOG_LEVEL": "INFO",
-                                "PYTHONPATH": "/app",
-                                "STATIC_DIR": "/local/app-code/static"
-                            },
-                            "Resources": {
-                                "CPU": 200,
-                                "MemoryMB": 256
-                            },
-                            "Services": [
-                                {
-                                    "Name": "your-service-name",
-                                    "PortLabel": "http",
-                                    "Tags": [
-                                        "traefik.enable=true",
-                                        "traefik.http.routers.your-service.entryPoints=https",
-                                        "traefik.http.routers.your-service.rule=Host(`your-service.domain.com`)"
-                                    ],
-                                    "Checks": [
-                                        {
-                                            "Type": "http",
-                                            "Path": "/api/health",
-                                            "Interval": 10000000000, # 10 seconds in nanoseconds
-                                            "Timeout": 2000000000 # 2 seconds in nanoseconds
-                                        }
-                                    ]
-                                }
-                            ]
-                        }
-                    ]
-                }
-            ]
+            "ID": "example-job",
+            # ... job configuration ...
         }
     }

     # Start the job
     response = nomad_service.start_job(job_spec)

-    print(f"Job deployment response: {response}")

     if response.get("status") == "started":
-        print(f"✅ Job deployed successfully!")
-        print(f"Job ID: {response.get('job_id')}")
-        print(f"Evaluation ID: {response.get('eval_id')}")
+        print(f"Job started successfully: {response.get('job_id')}")
     else:
-        print(f"❌ Failed to deploy job.")
-        print(f"Status: {response.get('status')}")
-        print(f"Message: {response.get('message', 'Unknown error')}")
+        print(f"Failed to start job: {response.get('message')}")

 if __name__ == "__main__":
     main()
 ```
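
The health-check `Interval` and `Timeout` values in the dictionary form are written in nanoseconds, while the HCL form uses strings such as `"10s"`. A small converter makes the dictionary easier to read and harder to mistype. This is a sketch only: `to_nanoseconds` is a hypothetical helper that handles just the `ms`, `s`, `m`, and `h` suffixes.

```python
# Nanoseconds per supported unit; "ms" is listed before "s" so that
# "500ms" is not mistaken for a value in seconds.
_UNITS = (
    ("ms", 1_000_000),
    ("s", 1_000_000_000),
    ("m", 60 * 1_000_000_000),
    ("h", 3600 * 1_000_000_000),
)


def to_nanoseconds(duration):
    """Convert a duration string such as "10s" into integer nanoseconds."""
    for suffix, factor in _UNITS:
        if duration.endswith(suffix):
            return int(float(duration[: -len(suffix)]) * factor)
    raise ValueError(f"unsupported duration: {duration!r}")
```

With this helper a check can be written as `"Interval": to_nanoseconds("10s")` rather than as a long literal.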

-## 2. Deploying the Nomad Job
+## Checking Job Status

-### Option A: Using the Nomad CLI
+### Using the Nomad CLI

 ```bash
-# Deploy using a .nomad file
-nomad job run your-job-file.nomad
+# Get job status
+nomad job status example-job

-# Verify the job was submitted
-nomad job status your-job-name
+# Get detailed allocation information
+nomad alloc status <allocation_id>
 ```

-### Option B: Using the Python Deployment Script

-```bash
-# Run the deployment script
-python deploy_your_job.py
-```

-### Option C: Using the Nomad MCP API

-```bash
-# Using curl
-curl -X POST http://localhost:8000/api/claude/create-job \
-  -H "Content-Type: application/json" \
-  -d '{
-    "job_id": "your-job-name",
-    "name": "Your Job Name",
-    "type": "service",
-    "datacenters": ["dc1"],
-    "namespace": "development",
-    "docker_image": "your-registry/your-image:tag",
-    "count": 1,
-    "cpu": 200,
-    "memory": 256,
-    "ports": [
-      {
-        "Label": "http",
-        "Value": 0,
-        "To": 8000
-      }
-    ],
-    "env_vars": {
-      "PORT": "8000",
-      "HOST": "0.0.0.0",
-      "LOG_LEVEL": "INFO",
-      "PYTHONPATH": "/app",
-      "STATIC_DIR": "/local/app-code/static"
-    }
-  }'

-# Using PowerShell
-Invoke-RestMethod -Uri "http://localhost:8000/api/claude/create-job" -Method POST -Headers @{"Content-Type"="application/json"} -Body '{
-  "job_id": "your-job-name",
-  "name": "Your Job Name",
-  "type": "service",
-  "datacenters": ["dc1"],
-  "namespace": "development",
-  "docker_image": "your-registry/your-image:tag",
-  "count": 1,
-  "cpu": 200,
-  "memory": 256,
-  "ports": [
-    {
-      "Label": "http",
-      "Value": 0,
-      "To": 8000
-    }
-  ],
-  "env_vars": {
-    "PORT": "8000",
-    "HOST": "0.0.0.0",
-    "LOG_LEVEL": "INFO",
-    "PYTHONPATH": "/app",
-    "STATIC_DIR": "/local/app-code/static"
-  }
-}'
-```

-## 3. Checking Job Status

-After deploying a job, you should check its status to ensure it's running correctly.

-### Option A: Using the Nomad CLI

-```bash
-# Check job status
-nomad job status your-job-name

-# Check allocations for the job
-nomad job allocs your-job-name

-# Check the most recent allocation
-nomad alloc status -job your-job-name
-```

-### Option B: Using the Nomad MCP API

-```bash
-# Using curl
-curl -X POST http://localhost:8000/api/claude/jobs \
-  -H "Content-Type: application/json" \
-  -d '{
-    "job_id": "your-job-name",
-    "action": "status",
-    "namespace": "development"
-  }'

-# Using PowerShell
-Invoke-RestMethod -Uri "http://localhost:8000/api/claude/jobs" -Method POST -Headers @{"Content-Type"="application/json"} -Body '{
-  "job_id": "your-job-name",
-  "action": "status",
-  "namespace": "development"
-}'
-```

-### Option C: Using a Python Script
+### Using Python

 ```python
-#!/usr/bin/env python
 from app.services.nomad_client import NomadService

-def main():
-    # Initialize the Nomad service
-    service = NomadService()

-    # Get job information
-    job = service.get_job('your-job-name')
-    print(f"Job Status: {job.get('Status', 'Unknown')}")
-    print(f"Job Type: {job.get('Type', 'Unknown')}")
-    print(f"Job Datacenters: {job.get('Datacenters', [])}")

-    # Get allocations
-    allocations = service.get_allocations('your-job-name')
-    print(f"\nFound {len(allocations)} allocations")

-    if allocations:
-        latest_alloc = allocations[0]
-        print(f"Latest allocation ID: {latest_alloc.get('ID', 'Unknown')}")
-        print(f"Allocation Status: {latest_alloc.get('ClientStatus', 'Unknown')}")
+# Initialize the service
+nomad_service = NomadService()

-if __name__ == "__main__":
-    main()
+# Get job status
+job = nomad_service.get_job("example-job")
+print(f"Job Status: {job.get('Status')}")

+# Get allocations
+allocations = nomad_service.get_allocations("example-job")
+for alloc in allocations:
+    print(f"Allocation: {alloc.get('ID')}, Status: {alloc.get('ClientStatus')}")
 ```
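
A single status read right after submission often reports a job that is still being scheduled, so polling with a deadline tends to be more reliable than one check or a fixed sleep. The sketch below assumes nothing about `NomadService` beyond a callable that returns the current status string; `wait_for_status` and its parameters are hypothetical.

```python
import time


def wait_for_status(get_status, wanted="running", timeout=60.0,
                    interval=2.0, sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns `wanted` or the timeout expires."""
    deadline = clock() + timeout
    while True:
        if get_status() == wanted:
            return True
        if clock() >= deadline:
            return False
        # Pause between polls so the API is not queried in a tight loop.
        sleep(interval)
```

With the service shown above, a call could look like `wait_for_status(lambda: nomad_service.get_job("example-job").get("Status"))`. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.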

-## 4. Checking Job Logs
+## Retrieving Logs

-Logs are crucial for diagnosing issues with your job. Here's how to access them:

-### Option A: Using the Nomad CLI
+### Using the Nomad CLI

 ```bash
-# First, get the allocation ID
-nomad job allocs your-job-name
+# Get stdout logs
+nomad alloc logs <allocation_id>

-# Then view the logs for a specific allocation
-nomad alloc logs <allocation-id>

-# View stderr logs
-nomad alloc logs -stderr <allocation-id>

-# Follow logs in real-time
-nomad alloc logs -f <allocation-id>
+# Get stderr logs
+nomad alloc logs -stderr <allocation_id>
 ```

-### Option B: Using the Nomad MCP API

-```bash
-# Using curl
-curl -X GET http://localhost:8000/api/claude/job-logs/your-job-name

-# Using PowerShell
-Invoke-RestMethod -Uri "http://localhost:8000/api/claude/job-logs/your-job-name" -Method GET
-```

-### Option C: Using a Python Script
+### Using Python

 ```python
-#!/usr/bin/env python
 from app.services.nomad_client import NomadService

-def main():
-    # Initialize the Nomad service
-    service = NomadService()

-    # Get allocations for the job
-    allocations = service.get_allocations('your-job-name')

-    if allocations:
-        latest_alloc = allocations[0]
-        alloc_id = latest_alloc["ID"]
-        print(f"Latest allocation ID: {alloc_id}")

-        # Get logs for the allocation
-        try:
-            # Get stdout logs
-            stdout_logs = service.get_allocation_logs(alloc_id, task="your-task-name", log_type="stdout")
-            print("\nStandard Output Logs:")
-            print(stdout_logs)

-            # Get stderr logs
-            stderr_logs = service.get_allocation_logs(alloc_id, task="your-task-name", log_type="stderr")
-            print("\nStandard Error Logs:")
-            print(stderr_logs)
-        except Exception as e:
-            print(f"Error getting logs: {str(e)}")
-    else:
-        print("No allocations found for your-job-name job")
+# Initialize the service
+nomad_service = NomadService()

-if __name__ == "__main__":
-    main()
+# Get allocations for the job
+allocations = nomad_service.get_allocations("example-job")
+if allocations:
+    # Get logs from the most recent allocation
+    latest_alloc = allocations[0]
+    stdout_logs = nomad_service.get_allocation_logs(latest_alloc["ID"], "server", "stdout")
+    stderr_logs = nomad_service.get_allocation_logs(latest_alloc["ID"], "server", "stderr")

+    print("STDOUT Logs:")
+    print(stdout_logs)
+    print("\nSTDERR Logs:")
+    print(stderr_logs)
 ```

-## 5. Troubleshooting Common Issues
+## Troubleshooting

-### Issue: Job Fails to Start
+### Common Issues and Solutions

-1. **Check the job status**:
-```bash
-nomad job status your-job-name
-```
+#### 1. Job Fails to Start

-2. **Examine the allocation status**:
-```bash
-nomad alloc status -job your-job-name
-```
+**Symptoms:**
+- Job status shows as "dead"
+- Allocation status shows as "failed"

-3. **Check the logs for errors**:
-```bash
-# Get the allocation ID first
-nomad job allocs your-job-name
-# Then check the logs
-nomad alloc logs -stderr <allocation-id>
-```
+**Possible Causes and Solutions:**

-4. **Common errors and solutions**:
+a) **Resource Constraints:**
+- Check if the job is requesting more resources than available
+- Reduce CPU or memory requirements in the job specification

-a. **Missing static directory**:
-```
-RuntimeError: Directory 'static' does not exist
-```
-Solution: Add an environment variable to specify the static directory path:
+b) **Missing Static Directory:**
+- Error: `RuntimeError: Directory 'static' does not exist`
+- Solution: Use environment variables to specify the static directory path
 ```hcl
 env {
-  STATIC_DIR = "/local/app-code/static"
+  STATIC_DIR = "/local/your_app/static"
 }
 ```

-b. **Invalid mount configuration**:
-```
-invalid mount config for type 'bind': bind source path does not exist
-```
-Solution: Ensure the source path exists or is created by an artifact:
+c) **Module Import Errors:**
+- Error: `ModuleNotFoundError: No module named 'app'`
+- Solution: Set the correct PYTHONPATH in the job specification
 ```hcl
-artifact {
-  source = "git::ssh://git@your-git-server:port/org/repo.git"
-  destination = "local/app-code"
+env {
+  PYTHONPATH = "/local/your_app"
 }
 ```

-c. **Port already allocated**:
-```
-Allocation failed: Failed to place allocation: failed to place alloc: port is already allocated
-```
-Solution: Use dynamic ports or choose a different port:
-```hcl
-network {
-  port "http" {
-    to = 8000
-  }
-}
-```
+d) **Artifact Retrieval Failures:**
+- Error: `Failed to download artifact: git::ssh://...`
+- Solution: Verify SSH key, repository URL, and permissions

-### Issue: Application Errors After Deployment
+e) **Old Code Version Running:**
+- Symptom: Your recent changes aren't reflected in the deployed application
+- Solution: **Commit and push your code changes to Gitea before deploying**
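
The error strings listed above lend themselves to an automated first pass over stderr output. The sketch below is illustrative: `KNOWN_ERRORS` and `diagnose` are hypothetical names, and the patterns and hints are taken only from the cases named in this section.

```python
# Pairs of (substring to look for in the logs, suggested fix).
KNOWN_ERRORS = [
    ("Directory 'static' does not exist",
     "Set STATIC_DIR in the job's env block to the static directory path."),
    ("No module named 'app'",
     "Set the correct PYTHONPATH in the job's env block."),
    ("Failed to download artifact",
     "Verify the SSH key, repository URL, and permissions."),
]


def diagnose(log_text):
    """Return the hints whose error pattern appears in the log text."""
    return [hint for pattern, hint in KNOWN_ERRORS if pattern in log_text]
```

Feeding it the stderr log of the latest allocation gives a short list of likely fixes. An empty list means none of the known patterns matched and the logs need to be read by hand.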

-1. **Check application logs**:
-```bash
-nomad alloc logs <allocation-id>
-```
+#### 2. Network Connectivity Issues

-2. **Verify environment variables**:
-```bash
-nomad alloc status <allocation-id>
-```
-Look for the "Environment Variables" section.
+**Symptoms:**
+- Connection timeouts
+- "Failed to connect to Nomad" errors

-3. **Check resource constraints**:
-Ensure the job has enough CPU and memory allocated:
-```hcl
-resources {
-  cpu = 200 # Increase if needed
-  memory = 256 # Increase if needed
-}
-```
+**Solutions:**
+- Verify Nomad server address and port
+- Check network connectivity and firewall rules
+- Ensure proper authentication token is provided
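
A plain TCP connection attempt separates network problems from authentication problems, because it succeeds or fails before any token is involved. This is a minimal sketch with hypothetical helper names (`parse_host_port`, `can_reach`); it assumes the address is given as a URL and falls back to the default API port 4646 when none is specified.

```python
import socket
from urllib.parse import urlparse


def parse_host_port(addr, default_port=4646):
    """Split an address such as http://host:4646 into (host, port)."""
    parsed = urlparse(addr)
    return parsed.hostname, parsed.port or default_port


def can_reach(addr, timeout=3.0):
    """Return True if a TCP connection to the Nomad address succeeds."""
    host, port = parse_host_port(addr)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

If `can_reach` returns `False`, the address, routing, or firewall rules are the likely cause. If it returns `True` but API calls still fail, the token is the next thing to check.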

-## 6. Updating a Job
+#### 3. Permission Issues

-After fixing issues, you'll need to update the job:
+**Symptoms:**
+- "Permission denied" errors
+- "ACL token not found" messages

-### Option A: Using the Nomad CLI
+**Solutions:**
+- Verify your token has appropriate permissions
+- Check namespace settings in your job specification
+- Ensure the token is properly set in the environment

+## Complete Workflow Example

+Here's a complete workflow for managing a Nomad job:

+### 1. Develop and Test Your Application

 ```bash
-# Update the job with the modified specification
-nomad job run your-updated-job-file.nomad
+# Make changes to your application code
+# Test locally to ensure it works
+python -m uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
 ```

-### Option B: Using the Nomad MCP API
+### 2. Commit and Push Your Changes to Gitea

 ```bash
-# Using PowerShell to restart a job
-Invoke-RestMethod -Uri "http://localhost:8000/api/claude/jobs" -Method POST -Headers @{"Content-Type"="application/json"} -Body '{
-  "job_id": "your-job-name",
-  "action": "restart",
-  "namespace": "development"
-}'
+# Stage your changes
+git add .

+# Commit your changes
+git commit -m "Update application with new feature"

+# Push to Gitea repository
+git push origin main
 ```

-### Option C: Using a Python Script
+> **CRITICAL:** This step is essential when using Gitea artifacts in your Nomad jobs. Without pushing your changes, the job will pull the old version of the code.

-```python
-#!/usr/bin/env python
-from app.services.nomad_client import NomadService

-def main():
-    # Initialize the Nomad service
-    service = NomadService()

-    # Get the current job specification
-    job = service.get_job('your-job-name')

-    # Modify the job specification as needed
-    # For example, update environment variables:
-    job["TaskGroups"][0]["Tasks"][0]["Env"]["STATIC_DIR"] = "/local/app-code/static"

-    # Update the job
-    response = service.start_job({"Job": job})

-    print(f"Job update response: {response}")

-if __name__ == "__main__":
-    main()
-```

-## 7. Stopping a Job

-When you're done with a job, you can stop it:

-### Option A: Using the Nomad CLI
+### 3. Deploy the Job

 ```bash
-# Stop a job
-nomad job stop your-job-name
+# Using a deployment script
+python deploy_job.py

-# Stop and purge a job
-nomad job stop -purge your-job-name
+# Or using the Nomad CLI
+nomad job run job_spec.nomad
 ```

-### Option B: Using the Nomad MCP API
+### 4. Check Job Status

 ```bash
-# Using PowerShell
-Invoke-RestMethod -Uri "http://localhost:8000/api/claude/jobs" -Method POST -Headers @{"Content-Type"="application/json"} -Body '{
-  "job_id": "your-job-name",
-  "action": "stop",
-  "namespace": "development",
-  "purge": true
-}'
+# Using the Nomad CLI
+nomad job status example-job

+# Or using Python
+python -c "from app.services.nomad_client import NomadService; service = NomadService(); job = service.get_job('example-job'); print(f'Job Status: {job.get(\"Status\", \"Unknown\")}');"
 ```

-### Option C: Using a Python Script
+### 5. Check Logs if Issues Occur

-```python
-#!/usr/bin/env python
-from app.services.nomad_client import NomadService
+```bash
+# Get allocations
+allocations=$(nomad job status -json example-job | jq -r '.Allocations[0].ID')

-def main():
-    # Initialize the Nomad service
-    service = NomadService()

-    # Stop the job
-    response = service.stop_job('your-job-name', purge=True)

-    print(f"Job stop response: {response}")

-if __name__ == "__main__":
-    main()
+# Check logs
+nomad alloc logs $allocations
+nomad alloc logs -stderr $allocations
 ```

-## 8. Complete Workflow Example
+### 6. Fix Issues and Update

-Here's a complete workflow for deploying, monitoring, troubleshooting, and updating a job:
+If you encounter issues:

-```python
-#!/usr/bin/env python
-import time
-from app.services.nomad_client import NomadService
+1. Fix the code in your local environment
+2. **Commit and push changes to Gitea**
+3. Redeploy the job
+4. Check status and logs again

-def main():
-    # Initialize the Nomad service
-    service = NomadService()

-    # 1. Create and deploy the job
-    job_spec = {
-        "Job": {
-            "ID": "example-app",
-            "Name": "Example Application",
-            "Type": "service",
-            "Datacenters": ["dc1"],
-            "Namespace": "development",
-            # ... rest of job specification ...
-        }
-    }

-    deploy_response = service.start_job(job_spec)
-    print(f"Deployment response: {deploy_response}")

-    # 2. Wait for the job to be scheduled
-    print("Waiting for job to be scheduled...")
-    time.sleep(5)

-    # 3. Check job status
-    job = service.get_job('example-app')
-    print(f"Job Status: {job.get('Status', 'Unknown')}")

-    # 4. Get allocations
-    allocations = service.get_allocations('example-app')

-    if allocations:
-        latest_alloc = allocations[0]
-        alloc_id = latest_alloc["ID"]
-        print(f"Latest allocation ID: {alloc_id}")
-        print(f"Allocation Status: {latest_alloc.get('ClientStatus', 'Unknown')}")

-        # 5. Check logs for errors
-        stderr_logs = service.get_allocation_logs(alloc_id, log_type="stderr")

-        # 6. Look for common errors
-        if "Directory 'static' does not exist" in stderr_logs:
-            print("Error detected: Missing static directory")

-            # 7. Update the job to fix the issue
-            job["TaskGroups"][0]["Tasks"][0]["Env"]["STATIC_DIR"] = "/local/app-code/static"
-            update_response = service.start_job({"Job": job})
-            print(f"Job update response: {update_response}")

-            # 8. Wait for the updated job to be scheduled
-            print("Waiting for updated job to be scheduled...")
-            time.sleep(5)

-            # 9. Check the updated job status
-            updated_job = service.get_job('example-app')
-            print(f"Updated Job Status: {updated_job.get('Status', 'Unknown')}")
-    else:
-        print("No allocations found for the job")
+### 7. Stop the Job When Done

-if __name__ == "__main__":
-    main()
+```bash
+# Stop without purging (keeps job definition)
+nomad job stop example-job

+# Stop and purge (completely removes job)
+nomad job stop -purge example-job
 ```

-## 9. Best Practices
+## Best Practices

-1. **Always check logs after deployment**: Logs are your primary tool for diagnosing issues.
+1. **Always commit and push code before deployment**: When using Gitea artifacts, ensure your code is committed and pushed before deploying jobs.

-2. **Use environment variables for configuration**: This makes your jobs more flexible and easier to update.
+2. **Use namespaces**: Organize jobs by environment (development, staging, production).

-3. **Implement health checks**: Health checks help Nomad determine if your application is running correctly.
+3. **Set appropriate resource limits**: Specify realistic CPU and memory requirements.

-4. **Set appropriate resource limits**: Allocate enough CPU and memory for your application to run efficiently.
+4. **Implement health checks**: Add service health checks to detect application issues.

-5. **Use artifacts for code deployment**: Pull code from a Git repository to ensure consistency.
+5. **Use environment variables**: Configure applications through environment variables for flexibility.

-6. **Implement proper error handling**: Your application should handle errors gracefully and provide meaningful error messages.
+6. **Implement proper error handling**: Add robust error handling in your application.

-7. **Use namespaces**: Organize your jobs into namespaces based on environment or team.
+7. **Monitor job status**: Regularly check job status and logs.

-8. **Document your job specifications**: Include comments in your job files to explain configuration choices.
+8. **Version your artifacts**: Use specific tags or commits for reproducible deployments.

-9. **Implement a CI/CD pipeline**: Automate the deployment process to reduce errors and improve efficiency.
+9. **Document job specifications**: Keep documentation of job requirements and configurations.

-10. **Monitor job performance**: Use Nomad's monitoring capabilities to track resource usage and performance.
+10. **Test locally before deployment**: Verify application functionality in a local environment.

-## 10. Conclusion
+## Conclusion

-Managing Nomad jobs effectively requires understanding the job lifecycle, from creation to deployment, monitoring, troubleshooting, and updating. By following this guide, you can create robust deployment processes that ensure your applications run reliably in a Nomad cluster.
+Managing Nomad jobs effectively requires understanding the job lifecycle, proper configuration, and troubleshooting techniques. By following this guide, you should be able to create, deploy, monitor, and troubleshoot Nomad jobs efficiently.

-Remember that the key to successful job management is thorough testing, careful monitoring, and quick response to issues. With the right tools and processes in place, you can efficiently manage even complex applications in a Nomad environment.
+Remember that the most common issues are related to resource constraints, network connectivity, and configuration errors. Always check logs when troubleshooting, and ensure your code is properly committed and pushed to Gitea before deployment.
Binary file not shown.