* By optimizing compute, storage, and high‑performance networking (e.g., InfiniBand, NCCL), you enable large‑scale AI workloads in industrial contexts.
* You are responsible for developing and operating core infrastructure components such as scheduling and resource management systems (e.g., SLURM, Ray, Run:ai), ensuring efficient utilization of shared GPU resources.
* Using modern tooling, you build and maintain automated, reproducible infrastructure (e.g., Docker, Kubernetes, Terraform, Ansible, CI/CD).
* Experience with distributed systems and high‑performance networking (e.g., InfiniBand, NCCL), combined with experience in cloud environments (AWS, Azure) alongside on‑prem infrastructure.
* Practical experience with resource scheduling and workload orchestration (e.g., SLURM, Ray, NVIDIA Run:ai).
* Strong experience in infrastructure automation (e.g., Docker, Kubernetes, Terraform, Ansible, CI/CD) and ...