Platform Reliability Engineer

The Mission

Modernization without reliability is just a faster way to fail. At DuskByte, our "Risk-First" engineering means that Stability is our North Star. As a Platform Reliability Engineer, you are the guardian of operational continuity. You will build the frameworks, observability, and guardrails that allow us to modernize legacy B2B SaaS platforms without risking a single second of unplanned downtime.

What You Will Do (The Role)

Error Budgeting & SLOs

Define and monitor Service Level Objectives (SLOs) and Service Level Indicators (SLIs), and use the resulting error budgets to decide how fast we can ship without compromising reliability.
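
As a rough illustration of how an error budget falls out of an SLO, here is a minimal sketch. It assumes a request-based SLI (good requests divided by total requests); the function name and the sample numbers are hypothetical and do not describe DuskByte's actual tooling.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent.

    1.0 means the budget is untouched; 0.0 or less means it is exhausted.
    slo_target is a fraction, e.g. 0.999 for a 99.9% success-rate SLO.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        # No traffic in the window, so nothing has been spent yet.
        return 1.0
    # The budget is the number of requests the SLO allows to fail in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# Hypothetical window: 1,000,000 requests, 250 failures, 99.9% SLO.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"Error budget remaining: {remaining:.0%}")
```

In this sketch a team might slow or pause releases as the remaining fraction approaches zero; the exact policy is a judgment call and is not specified here.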

Chaos Engineering

Proactively test system resilience by injecting controlled failures into modernized environments to identify "hidden" technical debt.

Incident Management & Post-Mortems

Lead the "Blameless Post-Mortem" culture, turning every production hiccup into a permanent architectural fix.

Observability Architecture

Build world-class monitoring stacks (Prometheus, Grafana, Datadog) that provide deep visibility into hybrid systems spanning legacy and modern components.

Automated Guardrails

Develop "Self-Healing" infrastructure that automatically scales, rolls back, or isolates failing components across AWS, GCP, and Azure.

The SRE Tech Stack

You are the master of the "Safety Net."

Orchestration

Kubernetes (EKS/GKE/AKS), Docker, Service Meshes (Istio/Linkerd)

Observability

Prometheus, Grafana, ELK Stack, New Relic, Datadog

Infrastructure

Terraform, Ansible, Cross-Cloud networking (AWS/GCP/Azure)

Scripting & Automation

Python, Go, Bash for automating away "Toil"

Reliability Tools

Gremlin (Chaos Engineering), PagerDuty, Automated Load Testing (k6/JMeter)

Who You Are (Requirements)

The "Risk-Aware" Engineer

You think in "nines" (99.99%). You are obsessed with edge cases and race conditions.
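
To make "nines" concrete, the arithmetic below converts an availability target into the downtime it permits. This is a minimal sketch; the function name and the 30-day window are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: float = 30.0) -> float:
    """Return the minutes of downtime an availability target permits over a period."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    period_minutes = period_days * 24 * 60
    # The unavailable fraction of the period is whatever the target leaves over.
    return period_minutes * (1.0 - availability_percent / 100.0)


for target in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% over 30 days allows about {minutes:.2f} minutes of downtime")
```

At 99.99%, a 30-day window leaves roughly 4.3 minutes of downtime, which is why each additional nine changes how incidents must be detected and handled.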

Automation Over Action

You hate manual tasks. If you have to do something twice, you write a script to do it forever.

The Calm Architect

You are at your best when things are breaking. You have the discipline to follow a runbook while the "fire" is being put out.

Experience

8+ years in DevOps or SRE roles, with a deep understanding of distributed systems and cloud-native safety patterns.

Why This Role is Unique at DuskByte

You aren't just "maintaining" a server. You are the high-level consultant who tells the development team when they are moving too fast. You have the authority to halt a deployment if it doesn't meet our Risk-First standards. You are the reason our enterprise clients sleep soundly at night.

Apply Now
