Senior AI Infrastructure & Platform Engineer - Riyadh, KSA
DeepSource Technologies · Posted 2025-12-02

Role Overview

We are seeking a highly skilled Senior AI Infrastructure & Platform Engineer to join our client’s team in Riyadh. In this role, you’ll be responsible for building, managing, and optimizing scalable AI infrastructure and compute environments that support high-performance workloads, including GPU-accelerated AI/ML pipelines, cluster scheduling, and orchestration.

Key Responsibilities

- Deploy, maintain, and optimize GPU-based compute clusters and infrastructure.
- Manage and operate GPU orchestration tools and platforms such as:
    - Nvidia Base Command Manager (critical)
    - Nvidia AI Enterprise Suite
    - Nvidia GPU and Network Operators
    - Nvidia NIMs and Blueprints
- Configure, deploy, and maintain compute workloads using scheduling and orchestration tools including:
    - Slurm (critical)
    - Vanilla Kubernetes
- Install, configure, and maintain the underlying OS (e.g. Canonical Ubuntu) and supporting system software.
- Monitor and troubleshoot infrastructure performance, availability, and reliability; ensure high uptime for AI/ML workloads.
- Work with data scientists, ML engineers, and dev teams to define infrastructure requirements, resource allocation, and deployment workflows.
- Develop automation scripts, CI/CD pipelines, and best practices for infrastructure provisioning and management.
- Document architecture, configurations, and operational procedures; enforce security, compliance, and backup policies.

Required Skills & Experience

- Proven experience managing GPU-based AI/ML infrastructure and compute clusters.
- Hands-on experience with:
    - Nvidia Base Command Manager
    - Nvidia AI Enterprise Suite
    - Nvidia GPU/Network Operators, NIMs, Blueprints
- Strong experience with Slurm and/or Kubernetes orchestration.
- Solid Linux system administration skills, preferably on Ubuntu or similar distributions.
- Strong scripting/automation ability (e.g. Bash, Python, or relevant tooling) for provisioning, deployment, and maintenance.
- Excellent troubleshooting and performance-tuning skills.
- Experience collaborating with ML/data science teams and integrating infrastructure with their workflows.
- Strong understanding of networking, security, resource allocation, and cluster management best practices.

Preferred Qualifications

- Previous experience working in a high-performance computing (HPC) or AI-focused infrastructure team.
- Knowledge of containerization, container orchestration, and GPUs in cloud or on-prem environments.
- Experience with CI/CD, infrastructure-as-code (e.g. Terraform, Ansible), monitoring tools, and logging setups.
- Familiarity with workload scheduling, job queuing, resource quotas, and GPU-shared environments.