* Own the full post-training pipeline end to end, from data curation and reward modeling through fine-tuning, preference optimization, distillation, safety tuning, evaluation, and deployment
* Advance techniques across the post-training stack (SFT, RLHF, RLAIF, DPO, preference learning, and reward modeling) to align models with human intent and aesthetic judgment
* Identify quality and alignment gaps through rigorous evaluation, then close them through targeted research and engineering
* Deep experience across the post-training stack, not just one slice: reward modeling, preference learning, RLHF/RLAIF, and personalization

Depending on your role, you'll either join us in Freiburg or SF at least 2 days a week (or one full week every other week), or work remotely with a monthly in-person week to stay connected.