About the Role
The AI Infrastructure Engineer is a platform specialist responsible for architecting, building, and operating high-performance AI infrastructure to support advanced AI workloads, including LLMs, GenAI, Computer Vision, and MLOps. This role will focus on managing GPU clusters (NVIDIA A100/H100), deploying and maintaining Red Hat OpenShift AI (RHODS), and ensuring secure, scalable, and cost-efficient AI platforms across SDD’s Sovereign Cloud and hybrid/multi-cloud environments. The engineer will enable enterprise-grade AI adoption for 200+ government entities.
Key Responsibilities & Deliverables
Design and implement GPU-based compute clusters. Define reference architectures for LLM hosting, Vector Databases, MLOps, and high-performance storage/networking.
Deliverables: fully operational GPU-based AI infrastructure; GPU cluster uptime and performance utilization; reduction in cost per training/inference workload.
Key Responsibilities
- Install, configure, and optimize core components: CUDA, cuDNN, NCCL, NVIDIA drivers, and the NVIDIA GPU Operator. Implement GPU partitioning, scheduling, and performance tuning for high-end GPUs (e.g., A100/H100).
- Deliverables: high-availability architecture for all AI workloads; complete documentation and runbooks.
OpenShift AI (RHODS) Management
- Deploy, configure, and maintain the Red Hat OpenShift AI (RHODS) platform for multi-tenant use. Manage integration of the NVIDIA GPU Operator for efficient GPU scheduling, and support data scientists with notebooks, training, and inference endpoints.
How to Apply