
Solutions Architect - AI / ML - Training & GPU infra

's-Hertogenbosch, NL · Remote (Europe, wider region) · via direct hire
// Job Type
Full Time
// Salary
Not disclosed
// Posted
2 weeks ago
// Work Mode
Remote

About the Role

<p>AI/ML Solutions Architect – Distributed Training &amp; GPU Infrastructure</p><p><strong>Company</strong></p><p>Join a fast-moving AI infrastructure team working on the cutting edge of large-scale ML workloads. This role is ideal for engineers who enjoy solving deep technical challenges in distributed training, multi-GPU systems, and scalable AI inference infrastructure. You will work directly with AI-focused clients, helping them get the most out of modern GPUs (H100, B200, etc.) and ML frameworks such as PyTorch (and JAX in some environments).</p><p><strong>Team &amp; Responsibilities</strong></p><p>Work alongside senior AI and infrastructure engineers building large-scale GPU platforms. As part of the customer solutions team, you will:</p><ul><li><p>Design and validate production-grade distributed training (primary) and large-scale inference architectures on large GPU clusters, typically tens to thousands of GPUs</p></li><li><p>Work hands-on with customers to debug, optimize, and scale ML workloads across multi-node GPU environments</p></li><li><p>Act as a technical authority on GPU performance, networking, and schedulers, making trade-offs at scale and translating customer needs into concrete platform requirements</p></li><li><p>Collaborate closely with engineering, product, and R&amp;D to influence roadmap decisions based on real-world ML workloads</p></li></ul><p>This is a hands-on, technical role; you are expected to work directly in customer environments, not only advise at a high level.</p><p><strong>Required skills and experience</strong></p><ul><li><p>Hands-on <strong>experience designing and operating enterprise-scale, production-grade, multi-node GPU workloads</strong> for training (7B+ model size) or inference</p></li><li><p>Strong background in <strong>distributed deep learning</strong> (PyTorch Distributed, DeepSpeed, ...) on GPU clusters</p></li><li><p>Deep understanding of <strong>GPU architecture and interconnects</strong> (H100/A100 class, NVLink, InfiniBand)</p></li><li><p>Experience with <strong>Kubernetes or Slurm</strong></p></li><li><p>Experience with performance tuning using <strong>GPU profiling and monitoring tools</strong></p></li></ul><p>This role is not a fit if your experience is limited to single-node training, high-level AI strategy, or non-production research environments. We are looking for engineers and architects who thrive at the intersection of AI workloads and large-scale infrastructure.</p><p><strong>What's offered</strong></p><p>Location: Remote from anywhere in Europe</p><p>Total compensation: up to EUR 250k (base + variable / OTE), depending on level and experience</p>

Position posted by The Next Chapter W&S.