Coupang
Sr. Staff Observability Engineer (GPU Cloud & Telemetry Platform)
Seoul, South Korea
Posted 4 days ago
What you'd do
- Own end-to-end observability platform design and operations for GPU-as-a-Service infrastructure
- Architect and scale telemetry pipelines using Grafana Alloy, Mimir, Loki, and Vector
- Define multi-year roadmap for GPU infrastructure observability and SLO-driven monitoring
What they want
- Senior IC experience in observability, SRE, or platform engineering (8+ years implied)
- Deep expertise with Grafana stack (Alloy, Mimir, Loki) and Prometheus remote_write
- Experience operating Kubernetes at large scale with GPU workload awareness
Nice to have
- Experience with Datadog Vector for log ingestion and transformation at scale
- Prior work on GPU-as-a-Service or AI/ML infrastructure telemetry at hyperscale