Nomad Job Management Guide
This guide provides comprehensive instructions for managing Nomad jobs, including creating, deploying, monitoring, and troubleshooting.
Prerequisites
Before you begin, ensure you have:
- Access to a Nomad cluster
- Proper authentication credentials (token if ACLs are enabled)
- Network connectivity to the Nomad API endpoint (default port 4646)
- Access to the Gitea repository (if using artifact integration)
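The connection prerequisites above can be collected in one place before any API call is made. The sketch below assumes the standard `NOMAD_ADDR` and `NOMAD_TOKEN` environment variables and the `X-Nomad-Token` request header. The helper name `nomad_connection_settings` is hypothetical and is not part of the project's `NomadService`.

```python
import os

DEFAULT_NOMAD_ADDR = "http://127.0.0.1:4646"  # Nomad's default HTTP API address


def nomad_connection_settings(env=None):
    """Return (base_url, headers) for Nomad HTTP API calls.

    Hypothetical helper: reads NOMAD_ADDR and NOMAD_TOKEN from the given
    mapping, falling back to os.environ when no mapping is passed.
    """
    env = os.environ if env is None else env
    base_url = env.get("NOMAD_ADDR", DEFAULT_NOMAD_ADDR).rstrip("/")
    headers = {}
    token = env.get("NOMAD_TOKEN")
    if token:
        # The token header is only needed when ACLs are enabled
        headers["X-Nomad-Token"] = token
    return base_url, headers
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise without touching the real environment.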
Job Specifications
HCL Format (Nomad Job File)
Nomad jobs can be defined in HashiCorp Configuration Language (HCL):
job "example-job" {
  datacenters = ["dc1"]
  type        = "service"
  namespace   = "development"

  group "app" {
    count = 1

    # Declare the "http" port label that the task's config refers to
    network {
      port "http" {
        to = 80
      }
    }

    task "server" {
      driver = "docker"

      config {
        image = "nginx:latest"
        ports = ["http"]
      }

      resources {
        cpu    = 100 # MHz
        memory = 128 # MB
      }
    }
  }
}
Python Format (Dictionary)
For programmatic job management, use Python dictionaries:
job_spec = {
    "Job": {
        "ID": "example-job",
        "Name": "example-job",
        "Type": "service",
        "Datacenters": ["dc1"],
        "Namespace": "development",
        "TaskGroups": [
            {
                "Name": "app",
                "Count": 1,
                "Tasks": [
                    {
                        "Name": "server",
                        "Driver": "docker",
                        "Config": {
                            "image": "nginx:latest"
                        },
                        "Resources": {
                            "CPU": 100,
                            "MemoryMB": 128
                        }
                    }
                ]
            }
        ]
    }
}
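Because a dictionary like this is assembled by hand, a quick structural check before submission can catch missing keys early. The following is a minimal sketch: `validate_job_spec` is a hypothetical helper, and the set of keys it checks is an assumed minimum rather than Nomad's full validation, which still happens server-side.

```python
def validate_job_spec(job_spec):
    """Return a list of problems found in a job dictionary (empty if none)."""
    job = job_spec.get("Job")
    if not isinstance(job, dict):
        return ["top-level 'Job' key is missing"]

    problems = []
    # Assumed minimum: an ID and at least one task group
    for key in ("ID", "TaskGroups"):
        if not job.get(key):
            problems.append(f"Job.{key} is missing or empty")

    for group in job.get("TaskGroups") or []:
        group_name = group.get("Name", "<unnamed>")
        tasks = group.get("Tasks") or []
        if not tasks:
            problems.append(f"group {group_name!r} has no tasks")
        for task in tasks:
            task_name = task.get("Name", "<unnamed>")
            if not task.get("Driver"):
                problems.append(
                    f"task {task_name!r} in group {group_name!r} has no Driver"
                )
    return problems
```

A deployment script could call `validate_job_spec(job_spec)` and refuse to continue when the returned list is non-empty.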
Deployment Methods
CRITICAL: Always commit and push your code changes to Gitea before deploying jobs!
When using Gitea artifacts in your Nomad jobs, the job will pull code from the repository at deployment time. If you don't commit and push your changes first, the job will use the old version of the code, and your changes won't be reflected in the deployed application.
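The commit-and-push rule can be checked automatically before a deployment. The sketch below is one possible approach, not part of the project: it assumes `git` is on the `PATH`, that the current branch tracks an upstream branch on Gitea, and that the upstream ref is reasonably fresh. All function names are hypothetical.

```python
import subprocess


def working_tree_is_clean(porcelain_output):
    """True when `git status --porcelain` printed nothing."""
    return porcelain_output.strip() == ""


def unpushed_commit_count(rev_list_output):
    """Parse the output of `git rev-list --count @{u}..HEAD`."""
    return int(rev_list_output.strip() or 0)


def ready_to_deploy(repo_dir="."):
    """Report whether repo_dir has no uncommitted and no unpushed changes."""

    def git(*args):
        # check=True raises CalledProcessError if git itself fails,
        # for example when no upstream branch is configured
        result = subprocess.run(
            ["git", *args], cwd=repo_dir, check=True,
            capture_output=True, text=True,
        )
        return result.stdout

    clean = working_tree_is_clean(git("status", "--porcelain"))
    pushed = unpushed_commit_count(git("rev-list", "--count", "@{u}..HEAD")) == 0
    return clean and pushed
```

The parsing is kept separate from the `git` invocation so it can be tested without a repository.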
Using the Nomad CLI
# Deploy a job using an HCL file
nomad job run job_spec.nomad
# Stop a job
nomad job stop example-job
# Purge a job (completely remove it)
nomad job stop -purge example-job
Using Python with the Nomad API
from app.services.nomad_client import NomadService
# Initialize the service
nomad_service = NomadService()
# Start a job
response = nomad_service.start_job(job_spec)
print(f"Job started: {response}")
# Stop a job
response = nomad_service.stop_job("example-job", purge=False)
print(f"Job stopped: {response}")
Using a Deployment Script
Create a deployment script that handles the job specification and deployment:
#!/usr/bin/env python
"""Deploy the example job through the project's NomadService wrapper."""
from app.services.nomad_client import NomadService


def main():
    # Initialize the Nomad service
    nomad_service = NomadService()

    # Define job specification
    job_spec = {
        "Job": {
            "ID": "example-job",
            # ... job configuration ...
        }
    }

    # Start the job and report the outcome
    response = nomad_service.start_job(job_spec)
    if response.get("status") == "started":
        print(f"Job started successfully: {response.get('job_id')}")
    else:
        print(f"Failed to start job: {response.get('message')}")


if __name__ == "__main__":
    main()
Checking Job Status
Using the Nomad CLI
# Get job status
nomad job status example-job
# Get detailed allocation information
nomad alloc status <allocation_id>
Using Python
from app.services.nomad_client import NomadService
# Initialize the service
nomad_service = NomadService()
# Get job status
job = nomad_service.get_job("example-job")
print(f"Job Status: {job.get('Status')}")
# Get allocations
allocations = nomad_service.get_allocations("example-job")
for alloc in allocations:
    print(f"Allocation: {alloc.get('ID')}, Status: {alloc.get('ClientStatus')}")
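A single status check right after deployment often shows allocations as still pending, so polling is usually more useful. Here is a minimal sketch of a polling loop. `wait_for_running` is a hypothetical helper that takes any callable returning a list of allocation dictionaries with a `ClientStatus` key, as in the example above. It treats any failed allocation as a failure, which may be too strict if the job still lists failed allocations from earlier runs.

```python
import time


def wait_for_running(fetch_allocations, timeout=60.0, interval=2.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until every allocation reports ClientStatus 'running'.

    Returns True on success, and False if any allocation has failed or
    the timeout expires first.
    """
    deadline = clock() + timeout
    while True:
        statuses = [alloc.get("ClientStatus") for alloc in fetch_allocations()]
        if statuses and all(status == "running" for status in statuses):
            return True
        if "failed" in statuses:
            return False
        if clock() >= deadline:
            return False
        sleep(interval)
```

With the service from this guide, a call might look like `wait_for_running(lambda: nomad_service.get_allocations("example-job"))`.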
Retrieving Logs
Using the Nomad CLI
# Get stdout logs
nomad alloc logs <allocation_id>
# Get stderr logs
nomad alloc logs -stderr <allocation_id>
Using Python
from app.services.nomad_client import NomadService
# Initialize the service
nomad_service = NomadService()
# Get allocations for the job
allocations = nomad_service.get_allocations("example-job")
if allocations:
    # Use the first allocation returned; this assumes get_allocations
    # lists the most recent allocation first
    latest_alloc = allocations[0]
    stdout_logs = nomad_service.get_allocation_logs(latest_alloc["ID"], "server", "stdout")
    stderr_logs = nomad_service.get_allocation_logs(latest_alloc["ID"], "server", "stderr")
    print("STDOUT Logs:")
    print(stdout_logs)
    print("\nSTDERR Logs:")
    print(stderr_logs)
Troubleshooting
Common Issues and Solutions
1. Job Fails to Start
Symptoms:
- Job status shows as "dead"
- Allocation status shows as "failed"
Possible Causes and Solutions:
a) Resource Constraints:
- Check if the job is requesting more resources than available
- Reduce CPU or memory requirements in the job specification
b) Missing Static Directory:
- Error:
RuntimeError: Directory 'static' does not exist
- Solution: Use environment variables to specify the static directory path
env {
  STATIC_DIR = "/local/your_app/static"
}
c) Module Import Errors:
- Error:
ModuleNotFoundError: No module named 'app'
- Solution: Set the correct PYTHONPATH in the job specification
env {
  PYTHONPATH = "/local/your_app"
}
d) Artifact Retrieval Failures:
- Error:
Failed to download artifact: git::ssh://...
- Solution: Verify SSH key, repository URL, and permissions
e) Old Code Version Running:
- Symptom: Your recent changes aren't reflected in the deployed application
- Solution: Commit and push your code changes to Gitea before deploying
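The error signatures listed above can be matched against retrieved logs to suggest a likely cause. The sketch below only encodes the messages quoted in this section. The `diagnose` function and the `KNOWN_FAILURES` table are hypothetical, and a real log may word these errors differently.

```python
# (substring to look for, hint taken from the troubleshooting list above)
KNOWN_FAILURES = [
    ("Directory 'static' does not exist",
     "Missing static directory: set STATIC_DIR in the task's env block"),
    ("ModuleNotFoundError",
     "Module import error: set PYTHONPATH in the task's env block"),
    ("Failed to download artifact",
     "Artifact retrieval failure: verify SSH key, repository URL, and permissions"),
]


def diagnose(log_text):
    """Return a hint for each known error signature found in log_text."""
    return [hint for signature, hint in KNOWN_FAILURES if signature in log_text]
```

Feeding it the stderr output of a failed allocation gives a short list of places to look first; an empty list means none of the known signatures matched.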
2. Network Connectivity Issues
Symptoms:
- Connection timeouts
- "Failed to connect to Nomad" errors
Solutions:
- Verify Nomad server address and port
- Check network connectivity and firewall rules
- Ensure proper authentication token is provided
3. Permission Issues
Symptoms:
- "Permission denied" errors
- "ACL token not found" messages
Solutions:
- Verify your token has appropriate permissions
- Check namespace settings in your job specification
- Ensure the token is properly set in the environment
Complete Workflow Example
Here's a complete workflow for managing a Nomad job:
1. Develop and Test Your Application
# Make changes to your application code
# Test locally to ensure it works
python -m uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
2. Commit and Push Your Changes to Gitea
# Stage your changes
git add .
# Commit your changes
git commit -m "Update application with new feature"
# Push to Gitea repository
git push origin main
CRITICAL: This step is essential when using Gitea artifacts in your Nomad jobs. Without pushing your changes, the job will pull the old version of the code.
3. Deploy the Job
# Using a deployment script
python deploy_job.py
# Or using the Nomad CLI
nomad job run job_spec.nomad
4. Check Job Status
# Using the Nomad CLI
nomad job status example-job
# Or using Python
python -c "from app.services.nomad_client import NomadService; service = NomadService(); job = service.get_job('example-job'); print(f'Job Status: {job.get(\"Status\", \"Unknown\")}');"
5. Check Logs if Issues Occur
# Get allocations
# Get the ID of the first allocation listed for the job
alloc_id=$(nomad job status -json example-job | jq -r '.Allocations[0].ID')
# Check logs
nomad alloc logs "$alloc_id"
nomad alloc logs -stderr "$alloc_id"
6. Fix Issues and Update
If you encounter issues:
- Fix the code in your local environment
- Commit and push changes to Gitea
- Redeploy the job
- Check status and logs again
7. Stop the Job When Done
# Stop without purging (keeps job definition)
nomad job stop example-job
# Stop and purge (completely removes job)
nomad job stop -purge example-job
Best Practices
- Always commit and push code before deployment: When using Gitea artifacts, ensure your code is committed and pushed before deploying jobs.
- Use namespaces: Organize jobs by environment (development, staging, production).
- Set appropriate resource limits: Specify realistic CPU and memory requirements.
- Implement health checks: Add service health checks to detect application issues.
- Use environment variables: Configure applications through environment variables for flexibility.
- Implement proper error handling: Add robust error handling in your application.
- Monitor job status: Regularly check job status and logs.
- Version your artifacts: Use specific tags or commits for reproducible deployments.
- Document job specifications: Keep documentation of job requirements and configurations.
- Test locally before deployment: Verify application functionality in a local environment.
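To illustrate the artifact versioning practice, the sketch below builds an artifact entry for the dictionary job format that pins a Git reference. It assumes the `GetterSource`, `GetterOptions`, and `RelativeDest` fields of Nomad's JSON job format and the `ref` option of the Git getter. The helper name and the repository URL in the usage note are hypothetical placeholders.

```python
def pinned_git_artifact(repo_url, ref, dest="local/"):
    """Build an artifact entry that checks out a specific tag or commit."""
    if not ref:
        # Without a ref the job would pull whatever the default branch holds
        raise ValueError("ref must name a tag or commit")
    return {
        "GetterSource": repo_url,
        "GetterOptions": {"ref": ref},
        "RelativeDest": dest,
    }
```

For example, `pinned_git_artifact("git::ssh://git@gitea.example.com/org/your_app.git", "v1.2.0")` returns an entry that could be placed in a task's `"Artifacts"` list, so redeploying the same job specification fetches the same code.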
Conclusion
Managing Nomad jobs effectively requires understanding the job lifecycle, proper configuration, and troubleshooting techniques. By following this guide, you should be able to create, deploy, monitor, and troubleshoot Nomad jobs efficiently.
Remember that the most common issues are related to resource constraints, network connectivity, and configuration errors. Always check logs when troubleshooting, and ensure your code is properly committed and pushed to Gitea before deployment.