ML Infrastructure Engineer
About the Role
We are hiring an ML Infrastructure Engineer with deep DevOps and platform engineering expertise to build and manage scalable infrastructure for real-time AI systems. This is not a model development role — we are specifically looking for someone who can design, deploy, and manage robust ML platforms that support production-grade AI workloads across cloud environments.
You will be responsible for cloud infrastructure, container orchestration, CI/CD automation, and GPU inference optimization — working closely with AI researchers and media engineering teams to support high-throughput, low-latency systems.
What You Will Own
Platform Engineering & Architecture
Architect and maintain a production-ready ML platform, integrating training, deployment, and monitoring pipelines.
Design and implement cloud-native infrastructure that is reliable, secure, and scalable (primarily on AWS using EKS).
Build modular systems that are optimized for real-time performance, reproducibility, and automation.
DevOps, Automation & IaC
Build and manage CI/CD pipelines for model packaging, deployment, and rollback using GitHub Actions, ArgoCD, or Jenkins.
Use Terraform or other Infrastructure as Code (IaC) tools to define and provision infrastructure consistently and securely.
Implement model observability and performance diagnostics using Prometheus, Grafana, or equivalent tooling.
Model Serving & Inference Optimization
Deploy and manage inference endpoints using tools like TensorRT, ONNX Runtime, or TorchScript.
Optimize resource utilization and latency in GPU-powered environments.
Handle version control, rollback strategies, and reproducibility standards for deployed models.
Must-Have Qualifications
3+ years of experience as a DevOps, ML Infrastructure, or Platform Engineer (not a research/modeling role).
Strong hands-on experience with:
Cloud platforms (preferably AWS, including EKS)
Kubernetes and Docker in production environments
Terraform or other IaC tools
CI/CD systems for ML pipelines (GitHub Actions, ArgoCD, Jenkins, etc.)
Experience supporting real-time ML inference systems using ONNX Runtime, TensorRT, or similar tools.
Strong understanding of platform monitoring, alerting, and observability best practices.
Nice-to-Haves
Experience deploying generative AI or LLMs at scale.
Familiarity with media processing, VFX systems, or GPU-intensive real-time workloads.
Exposure to security best practices (IAM, encryption, container hardening, etc.).
We’re an equal opportunity employer committed to increasing diversity and inclusion in today’s workforce. All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity or expression, sexual orientation, national origin, genetics, disability, age, or veteran status. Minorities, women, LGBTQ candidates, and individuals with disabilities are encouraged to apply. If you require an accommodation, please review our accessibility policy and reach out to our accessibility officer with any questions.