Site Reliability Engineer
What we are looking for
Are you looking for the challenge of automating stateful infrastructure at scale? Ditto's core data sync platform runs across a variety of platforms, including cloud environments. Our growing team is looking for a passionate site reliability engineer to help us automate these deployments.
The ideal candidate has professional experience with Kubernetes and cloud-native architectures. Do you know the tradeoffs with StatefulSets? Interested in writing an Operator?
At Ditto, we have no shortage of additional hard technical problems, such as mesh networking, replication protocols, CRDTs, and database design, just to name a few! You will collaborate with other senior engineers and ensure our platform operates reliably. We are investing heavily in Rust as we aim to create bindings for various languages with a one-click deployment.
Work with a remote team, manage your own time, and tackle interesting problems. Ditto is an equal opportunity employer with people from many different cultures and countries. We celebrate diversity and are committed to building a team that represents a variety of backgrounds, skillsets, and perspectives, and to providing our employees with a rewarding and inclusive work environment.
You will be part of our cloud team, working with all of engineering on building new services, automating the infrastructure lifecycle on Kubernetes, and monitoring our services with the goal of offering a reliable, scalable, and high-performance SaaS. One of our primary goals is to run a managed, cloud-based database-as-a-service with 99.5% uptime or better, and this role is critical to that goal.
Build & design Ditto’s cloud infrastructure with reliability and performance in mind.
Build tools & services to allow automated infrastructure management and self-healing, including deployments and upgrades.
Be in charge of end-to-end monitoring of our cloud. Layer observability into our Kubernetes operators. Prioritize what metrics to collect, drive analysis of those metrics, and influence our roadmap based on that analysis.
Participate in on-call rotations, working to keep customer workloads running and incident-free.
You’ll be part of a diverse team with members in both US and international locations!
- 3+ years of experience in an SRE-like role
- Comfortable working with a 100% distributed engineering team, collaborating on GitHub
- Strong experience with public cloud providers
- Experience running highly-scalable production workloads reliably on Kubernetes
- Experience with monitoring at scale
- Experience managing infrastructure predictably through GitOps and IaC
- Solid programming skills
- Willingness to participate in an on-call rotation
- Excellent written communication skills
- A BS in Computer Science or equivalent experience
Nice To Have
- Strong understanding of Rust and Kubernetes
- Experience operating a SaaS platform
- Fluency in a couple of programming languages (for example, Rust or Python)
- Experience with streaming platforms, either as a user or as a provider
- Experience with the Prometheus monitoring stack