Platform Reliability Engineer
SRE / DevOps Risk Control
The Mission
Modernization without reliability is just a faster way to fail. At DuskByte, our "Risk-First" engineering means that stability is our North Star. As a Platform Reliability Engineer, you are the guardian of operational continuity. You will build the frameworks, observability, and guardrails that allow us to modernize legacy B2B SaaS platforms without risking a single second of unplanned downtime.
What You Will Do (The Role)
Error Budgeting & SLOs
Define and monitor Service Level Objectives (SLOs) and Service Level Indicators (SLIs), and use the resulting error budgets to ensure we move as fast as possible without compromising reliability.
Chaos Engineering
Proactively test system resilience by injecting controlled failures into modernized environments to identify "hidden" technical debt.
Incident Management & Post-Mortems
Lead the "Blameless Post-Mortem" culture, turning every production hiccup into a permanent architectural fix.
Observability Architecture
Build world-class monitoring stacks (Prometheus, Grafana, Datadog) that provide deep visibility into hybrid legacy and modern systems.
Automated Guardrails
Develop "Self-Healing" infrastructure that automatically scales, rolls back, or isolates failing components across AWS, GCP, and Azure.
The SRE Tech Stack
You are the master of the "Safety Net."
Orchestration
Kubernetes (EKS/GKE/AKS), Docker, Service Meshes (Istio/Linkerd)
Observability
Prometheus, Grafana, ELK Stack, New Relic, Datadog
Infrastructure
Terraform, Ansible, cross-cloud networking (AWS/GCP/Azure)
Scripting & Automation
Python, Go, and Bash for automating away "Toil"
Reliability Tools
Gremlin (Chaos Engineering), PagerDuty, Automated Load Testing (k6/JMeter)
Who You Are (Requirements)
The "Risk-Aware" Engineer
You think in "nines" (99.99%). You are obsessed with edge cases and race conditions.
Automation Over Action
You hate manual tasks. If you have to do something twice, you write a script to do it forever.
The Calm Architect
You are at your best when things are breaking. You have the discipline to follow a runbook while the "fire" is being put out.
Experience
8+ years in DevOps or SRE roles, with a deep understanding of distributed systems and cloud-native safety patterns.
Why This Role Is Unique at DuskByte
You aren't just "maintaining" a server. You are the high-level consultant who tells the development team when they are moving too fast. You have the authority to halt a deployment if it doesn't meet our Risk-First standards. You are the reason our enterprise clients sleep soundly at night.
© 2026 DuskByte. Engineering stability for complex platforms.