Working closely with global infrastructure, development, and operations teams, you will drive everything from daily health monitoring and incident resolution to full GPU cluster bring-up and large-scale hardware deployments.

* Configure and bring up InfiniBand fabric and GPU clusters, including switch configuration, subnet management, and end-to-end validation testing.
* Participate in an on-call rotation and provide on-site or remote support during maintenance windows and operational incidents.
* Working knowledge of InfiniBand networking, including switch configuration and subnet management.
* Experience installing, configuring, and troubleshooting routers, switches, and terminal servers for out-of-band management.