# Nomad Job Management Guide

This guide explains how to manage Nomad jobs: creating, deploying, monitoring, and troubleshooting them.

## Prerequisites

Before you begin, ensure you have:

1. Access to a Nomad cluster
2. Valid authentication credentials (an ACL token if ACLs are enabled)
3. Network connectivity to the Nomad API endpoint (default port 4646)
4. Access to the Gitea repository (if using artifact integration)

## Job Specifications

### HCL Format (Nomad Job File)

Nomad jobs can be defined in HashiCorp Configuration Language (HCL):

```hcl
job "example-job" {
  datacenters = ["dc1"]
  type        = "service"
  namespace   = "development"

  group "app" {
    count = 1

    # Declare the "http" port label that the task's `ports` list refers to.
    # nginx listens on port 80 inside the container.
    network {
      port "http" {
        to = 80
      }
    }

    task "server" {
      driver = "docker"

      config {
        image = "nginx:latest"
        ports = ["http"]
      }

      resources {
        cpu    = 100
        memory = 128
      }
    }
  }
}
```

### Python Format (Dictionary)

For programmatic job management, use Python dictionaries:

```python
job_spec = {
    "Job": {
        "ID": "example-job",
        "Name": "example-job",
        "Type": "service",
        "Datacenters": ["dc1"],
        "Namespace": "development",
        "TaskGroups": [
            {
                "Name": "app",
                "Count": 1,
                "Tasks": [
                    {
                        "Name": "server",
                        "Driver": "docker",
                        "Config": {
                            "image": "nginx:latest"
                        },
                        "Resources": {
                            "CPU": 100,
                            "MemoryMB": 128
                        }
                    }
                ]
            }
        ]
    }
}
```

## Deployment Methods

> **CRITICAL: Always commit and push your code changes to Gitea before deploying jobs!**
>
> When a Nomad job uses Gitea artifacts, it pulls code from the repository at deployment time. If you deploy without committing and pushing first, the job runs the old version of the code and your changes do not appear in the deployed application.

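
The rule in the note above can be checked mechanically before a deployment. The following is a minimal sketch, assuming the deployment is launched from inside the Git working tree and that the current branch tracks a branch on the Gitea remote. The helper names `git_output` and `unpushed_work` are hypothetical and not part of this project.

```python
import subprocess


def git_output(args, cwd="."):
    """Run a git command and return its trimmed standard output."""
    result = subprocess.run(
        ["git", *args],
        cwd=cwd,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def unpushed_work(run=git_output):
    """Return the reasons the working tree differs from its upstream branch.

    An empty list means everything is committed and pushed.
    `run` is injectable so the check can be exercised without a real repository.
    """
    problems = []
    # `git status --porcelain` prints one line per modified or untracked file.
    if run(["status", "--porcelain"]):
        problems.append("uncommitted changes")
    # Count commits on HEAD that the upstream tracking branch does not have.
    # This compares against the last fetched state of the remote.
    if int(run(["rev-list", "--count", "@{u}..HEAD"])) > 0:
        problems.append("unpushed commits")
    return problems
```

A deployment script could call `unpushed_work()` first and refuse to submit the job when the returned list is not empty.
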
### Using the Nomad CLI

```bash
# Deploy a job using an HCL file
nomad job run job_spec.nomad

# The example job lives in the "development" namespace. For commands that
# take a job ID, set NOMAD_NAMESPACE=development or pass -namespace=development.

# Stop a job
nomad job stop example-job

# Purge a job (completely remove it)
nomad job stop -purge example-job
```

### Using Python with the Nomad API

```python
from app.services.nomad_client import NomadService

# Initialize the service
nomad_service = NomadService()

# Start a job
response = nomad_service.start_job(job_spec)
print(f"Job started: {response}")

# Stop a job
response = nomad_service.stop_job("example-job", purge=False)
print(f"Job stopped: {response}")
```

### Using a Deployment Script

Create a deployment script that defines the job specification and submits it:

```python
#!/usr/bin/env python
from app.services.nomad_client import NomadService


def main():
    # Initialize the Nomad service
    nomad_service = NomadService()

    # Define job specification
    job_spec = {
        "Job": {
            "ID": "example-job",
            # ... job configuration ...
        }
    }

    # Start the job and report the outcome
    response = nomad_service.start_job(job_spec)

    if response.get("status") == "started":
        print(f"Job started successfully: {response.get('job_id')}")
    else:
        print(f"Failed to start job: {response.get('message')}")


if __name__ == "__main__":
    main()
```

## Checking Job Status

### Using the Nomad CLI

```bash
# Get job status
nomad job status example-job

# Get detailed allocation information
# (replace <alloc-id> with an allocation ID from the job status output)
nomad alloc status <alloc-id>
```

### Using Python

```python
from app.services.nomad_client import NomadService

# Initialize the service
nomad_service = NomadService()

# Get job status
job = nomad_service.get_job("example-job")
print(f"Job Status: {job.get('Status')}")

# Get allocations
allocations = nomad_service.get_allocations("example-job")
for alloc in allocations:
    print(f"Allocation: {alloc.get('ID')}, Status: {alloc.get('ClientStatus')}")
```

## Retrieving Logs

### Using the Nomad CLI

```bash
# Get stdout logs
nomad alloc logs <alloc-id>

# Get stderr logs
nomad alloc logs -stderr <alloc-id>
```

### Using Python

```python
from app.services.nomad_client import NomadService

# Initialize the service
nomad_service = NomadService()

# Get allocations for the job
allocations = nomad_service.get_allocations("example-job")

if allocations:
    # Get logs from the most recent allocation
    # (this assumes get_allocations returns the newest allocation first)
    latest_alloc = allocations[0]
    stdout_logs = nomad_service.get_allocation_logs(latest_alloc["ID"], "server", "stdout")
    stderr_logs = nomad_service.get_allocation_logs(latest_alloc["ID"], "server", "stderr")

    print("STDOUT Logs:")
    print(stdout_logs)

    print("\nSTDERR Logs:")
    print(stderr_logs)
```

## Troubleshooting

### Common Issues and Solutions

#### 1. Job Fails to Start

**Symptoms:**

- Job status shows as "dead"
- Allocation status shows as "failed"

**Possible Causes and Solutions:**

a) **Resource Constraints:**
   - Check whether the job requests more resources than the cluster has available
   - Reduce CPU or memory requirements in the job specification

b) **Missing Static Directory:**
   - Error: `RuntimeError: Directory 'static' does not exist`
   - Solution: Use environment variables to specify the static directory path

   ```hcl
   env {
     STATIC_DIR = "/local/your_app/static"
   }
   ```

c) **Module Import Errors:**
   - Error: `ModuleNotFoundError: No module named 'app'`
   - Solution: Set the correct PYTHONPATH in the job specification

   ```hcl
   env {
     PYTHONPATH = "/local/your_app"
   }
   ```

d) **Artifact Retrieval Failures:**
   - Error: `Failed to download artifact: git::ssh://...`
   - Solution: Verify the SSH key, repository URL, and permissions

e) **Old Code Version Running:**
   - Symptom: Your recent changes aren't reflected in the deployed application
   - Solution: **Commit and push your code changes to Gitea before deploying**

#### 2. Network Connectivity Issues

**Symptoms:**

- Connection timeouts
- "Failed to connect to Nomad" errors

**Solutions:**

- Verify the Nomad server address and port
- Check network connectivity and firewall rules
- Ensure a valid authentication token is provided

#### 3. Permission Issues

**Symptoms:**

- "Permission denied" errors
- "ACL token not found" messages

**Solutions:**

- Verify your token has the permissions the operation requires
- Check namespace settings in your job specification
- Ensure the token is properly set in the environment

## Complete Workflow Example

Here's a complete workflow for managing a Nomad job:

### 1. Develop and Test Your Application

```bash
# Make changes to your application code
# Test locally to ensure it works
python -m uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```

### 2. Commit and Push Your Changes to Gitea

```bash
# Stage your changes
git add .

# Commit your changes
git commit -m "Update application with new feature"

# Push to Gitea repository
git push origin main
```

> **CRITICAL:** This step is essential when using Gitea artifacts in your Nomad jobs. Without pushing your changes, the job will pull the old version of the code.

### 3. Deploy the Job

```bash
# Using a deployment script
python deploy_job.py

# Or using the Nomad CLI
nomad job run job_spec.nomad
```

### 4. Check Job Status

```bash
# Using the Nomad CLI
nomad job status example-job

# Or using Python
python -c "from app.services.nomad_client import NomadService; service = NomadService(); job = service.get_job('example-job'); print(f'Job Status: {job.get(\"Status\", \"Unknown\")}');"
```

### 5. Check Logs if Issues Occur

```bash
# Get the ID of the first allocation listed for the job
alloc_id=$(nomad job status -json example-job | jq -r '.Allocations[0].ID')

# Check logs
nomad alloc logs "$alloc_id"
nomad alloc logs -stderr "$alloc_id"
```

### 6. Fix Issues and Update

If you encounter issues:

1. Fix the code in your local environment
2. **Commit and push changes to Gitea**
3. Redeploy the job
4. Check status and logs again

### 7. Stop the Job When Done

```bash
# Stop without purging (keeps job definition)
nomad job stop example-job

# Stop and purge (completely removes job)
nomad job stop -purge example-job
```

## Best Practices

1. **Always commit and push code before deployment**: When using Gitea artifacts, ensure your code is committed and pushed before deploying jobs.
2. **Use namespaces**: Organize jobs by environment (development, staging, production).
3. **Set appropriate resource limits**: Specify realistic CPU and memory requirements.
4. **Implement health checks**: Add service health checks to detect application issues.
5. **Use environment variables**: Configure applications through environment variables for flexibility.
6. **Implement proper error handling**: Add robust error handling in your application.
7. **Monitor job status**: Regularly check job status and logs.
8. **Version your artifacts**: Use specific tags or commits for reproducible deployments.
9. **Document job specifications**: Keep documentation of job requirements and configurations.
10. **Test locally before deployment**: Verify application functionality in a local environment.

## Conclusion

Managing Nomad jobs effectively requires understanding the job lifecycle, configuring jobs correctly, and knowing how to troubleshoot them. With this guide, you should be able to create, deploy, monitor, and troubleshoot Nomad jobs.

The most common issues involve resource constraints, network connectivity, and configuration errors. Always check logs when troubleshooting, and ensure your code is committed and pushed to Gitea before deployment.