ML Infrastructure Engineer
About the Role
We are hiring an ML Infrastructure Engineer with deep DevOps and platform engineering expertise to build and manage scalable infrastructure for real-time AI systems. This is not a model development role — we are specifically looking for someone who can design, deploy, and manage robust ML platforms that support production-grade AI workloads across cloud environments.
You will be responsible for cloud infrastructure, container orchestration, CI/CD automation, and GPU inference optimization — working closely with AI researchers and media engineering teams to support high-throughput, low-latency systems.
What You Will Own
Platform Engineering & Architecture
Architect and maintain a production-ready ML platform, integrating training, deployment, and monitoring pipelines.
Design and implement cloud-native infrastructure that is reliable, secure, and scalable (primarily on AWS using EKS).
Build modular systems that are optimized for real-time performance, reproducibility, and automation.
DevOps, Automation & IaC
Build and manage CI/CD pipelines for model packaging, deployment, and rollback using GitHub Actions, ArgoCD, or Jenkins.
Use Terraform or other Infrastructure as Code (IaC) tools to define and provision infrastructure consistently and securely.
Implement model observability and performance diagnostics using Prometheus, Grafana, or equivalent tooling.
Model Serving & Inference Optimization
Deploy and manage inference endpoints using tools like TensorRT, ONNX Runtime, or TorchScript.
Optimize resource utilization and latency in GPU-powered environments.
Handle version control, rollback strategies, and reproducibility standards for deployed models.
Must-Have Qualifications
3+ years of experience as a DevOps, ML Infrastructure, or Platform Engineer (not a research/modeling role).
Strong hands-on experience with:
Cloud platforms (preferably AWS, including EKS)
Kubernetes and Docker in production environments
Terraform or other IaC tools
CI/CD systems for ML pipelines (GitHub Actions, ArgoCD, Jenkins, etc.)
Experience supporting real-time ML inference systems using ONNX Runtime, TensorRT, or similar tools.
Strong understanding of platform monitoring, alerting, and observability best practices.
Nice-to-Haves
Experience deploying generative AI or LLMs at scale.
Familiarity with media processing, VFX systems, or GPU-intensive real-time workloads.
Exposure to security best practices (IAM, encryption, container hardening, etc.).
We’re an equal opportunity employer committed to increasing diversity and inclusion in today’s workforce. All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity or expression, sexual orientation, national origin, genetics, disability, age, or veteran status. Minorities, women, LGBTQ candidates, and individuals with disabilities are encouraged to apply. If you require an accommodation, please review our accessibility policy and reach out to our accessibility officer with any questions.