Modernization without reliability is just a faster way to fail. At DuskByte, our "Risk-First" engineering approach means that stability is our North Star. As a Platform Reliability Engineer, you are the guardian of operational continuity. You will build the frameworks, observability, and guardrails that allow us to modernize legacy B2B SaaS platforms without risking a single second of unplanned downtime.
What You Will Do (The Role)
Error Budgeting & SLOs
Define and monitor Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to ensure we move as fast as possible without compromising reliability.
Chaos Engineering
Proactively test system resilience by injecting controlled failures into modernized environments to identify "hidden" technical debt.
Incident Management & Post-Mortems
Lead the "Blameless Post-Mortem" culture, turning every production hiccup into a permanent architectural fix.
Observability Architecture
Build world-class monitoring stacks (Prometheus, Grafana, Datadog) that provide deep visibility into hybrid systems spanning legacy and modern components.
Automated Guardrails
Develop "Self-Healing" infrastructure that automatically scales, rolls back, or isolates failing components across AWS, GCP, and Azure.
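To make the error-budget idea above concrete, here is a minimal sketch of the arithmetic behind an availability SLO. The function names `error_budget` and `budget_remaining` and the 30-day window are illustrative assumptions, not a description of DuskByte's actual tooling.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Return the total downtime an availability SLO permits over a window.

    slo_target is a fraction, e.g. 0.9999 for "four nines".
    (Hypothetical helper, for illustration only.)
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return window * (1.0 - slo_target)


def budget_remaining(
    slo_target: float, window: timedelta, downtime_so_far: timedelta
) -> float:
    """Return the fraction of the error budget still unspent.

    A negative result means the budget is already exhausted.
    """
    budget = error_budget(slo_target, window)
    return 1.0 - downtime_so_far / budget


if __name__ == "__main__":
    window = timedelta(days=30)  # assumed rolling window
    budget = error_budget(0.9999, window)
    print(f"Allowed downtime at 99.99% over 30 days: {budget.total_seconds():.1f} s")
    spent = timedelta(seconds=129.6)
    print(f"Budget remaining: {budget_remaining(0.9999, window, spent):.0%}")
```

At 99.99% over 30 days the budget works out to roughly 4.3 minutes, which is why a single slow rollback can consume most of a month's allowance.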
The SRE Tech Stack
You are the master of the "Safety Net".
Orchestration
Kubernetes (EKS/GKE/AKS), Docker, Service Meshes (Istio/Linkerd)
Observability
Prometheus, Grafana, ELK Stack, New Relic, Datadog
You think in "nines" (99.99%). You are obsessed with edge cases and race conditions.
Automation Over Action
You hate manual tasks. If you have to do something twice, you write a script to do it forever.
The Calm Architect
You are at your best when things are breaking. You have the discipline to follow a runbook while the "fire" is being put out.
Experience
8+ years in DevOps or SRE roles, with a deep understanding of distributed systems and cloud-native safety patterns.
Why This Role is Unique at DuskByte
You aren't just "maintaining" a server. You are the high-level consultant who tells the development team when they are moving too fast. You have the authority to halt a deployment if it doesn't meet our Risk-First standards. You are the reason our enterprise clients sleep soundly at night.