Staff Site Reliability Engineer

About Verrus

Data centers should be valuable assets to their customers, the grid, and the local community, so that we can give customers what they want, where they want it, while cities and utilities actively seek us out to build in their regions.

Verrus is a developer of next-generation, greenfield hyperscale data centers that use proprietary technology to flexibly leverage their onsite energy infrastructure (i.e., large-scale battery energy storage systems, or BESS). This allows Verrus to act as a flexible resource to the grid, provide grid services that facilitate interconnection and the effective procurement of power and carbon-free energy, and balance traditional and variable-uptime workloads (e.g., AI/ML). Verrus builds and capitalizes turnkey data centers that are deeply integrated with energy resources, providing significant advantages over traditional designs.

Verrus is led by industry veterans with experience at the largest data center development companies as well as in power and utilities, who have delivered gigawatts of data center capacity across four continents over their careers (representing billions in investment).

About Sidewalk Infrastructure Partners

Verrus is backed by Sidewalk Infrastructure Partners (SIP). With investment from Alphabet, Ontario Teachers’ Pension Plan, and StepStone Group, SIP is a consolidated holding company that builds innovative technology-enabled infrastructure companies and projects that deliver positive social and environmental impact at scale. SIP believes that technology-enabled infrastructure will help to solve some of the world’s most pressing social and environmental challenges.

Following years of research and in collaboration with the Verrus management team, SIP formed Verrus to design, build, and operate the world’s most innovative data centers, leveraging unique technology to develop a scaled project portfolio that delivers significant financial, environmental, and innovation outcomes.

Role

Verrus is looking for a Staff Site Reliability Engineer to design, build, and operate the software infrastructure that underpins everything Verrus does — and to write the production applications that ride on top of it. This is a hands-on individual contributor role. You will spend the majority of your time writing production code, debugging real systems, and shipping the platform, services, and tooling the rest of the engineering organization depends on — not setting strategy from a distance. This is a full-time position based out of the Mountain View, CA office.

The Software Infrastructure team's scope is intentionally broad. You'll build and support the platform that the rest of engineering runs on — Kubernetes, IaC, observability, CI/CD, and the layers that bridge cloud and on-premise — and you'll also write and operate the production services this team directly owns, including the public API gateway, authentication service, edge processors, and others. SRE at Verrus supports more than it owns: the job is to make every other team faster, safer, and more reliable — not to be a gatekeeper. The workloads on top vary widely; your concern is that everything Verrus runs, runs reliably.

Verrus operates one of the most technically distinctive infrastructure environments in the industry: large-scale battery energy storage systems (BESS), hyperconverged infrastructure (HCI) compute with AWS EKS Anywhere (EKS-A) alongside AWS EKS, PLCs and industrial sensors, and microgrid controls, all coordinated by software we build in-house. Reliability at Verrus means reasoning across cloud and HCI control planes, container orchestration, and the cyber-physical boundary where software decisions translate into electrons, heat, and customer SLAs.

You will be one of the most senior engineers on the infrastructure team. You will take on the hardest parts of the codebase, lead designs through prototypes and pull requests rather than slide decks, and raise the technical bar through what you ship and the code you review. You'll be a technical peer to senior and staff engineers across networking, software optimization, controls, mechanical & electrical engineering, and product.

Responsibilities

Hands-on engineering: write production-grade code (primarily Golang) for the platform, services, tooling, and automation the rest of engineering depends on. The highest-leverage and gnarliest problems will land on your queue.
Platform building: design, build, and support the internal platform spanning AWS EKS, on-prem HCI, and EKS-A — including the components that bridge controls and physical-infrastructure boundaries. Make it easy for other teams to ship reliably on top of it.
Application engineering: write and operate the production services this team directly owns — public API gateway, authentication, edge processing, and others — with the same care for testing, observability, and operational ergonomics that you bring to the platform.
Production hardening: when things break, lead the investigation, run rigorous postmortems, and write the structural fixes yourself when the right answer is code rather than process.
Cyber-physical reliability: build the tooling and abstractions that make reliability engineering coherent across software, infrastructure, and BESS/controls layers — including the framework that distinguishes what can be safely chaos-tested in software, what requires structured GameDays, and what must remain simulation-only.
SLOs and observability: implement SLO-driven operations — error budgets, burn-rate alerting, and the monitoring/logging/tracing stack (Prometheus, Grafana, and adjacent tooling) — and partner with service teams to adopt them.
Infrastructure as Code: grow the IaC and developer-tooling foundations other teams rely on, with a strong emphasis on testability, reproducibility, and ergonomics that actually get adopted.
CI/CD: build and improve the CI/CD pipelines across environments, balancing deployment velocity against the safety constraints of a physically deployed product.
Cross-team collaboration: work directly with networking, controls, mechanical and electrical engineering (ME), and product engineers on system designs. Push back on proposals that won't operate reliably and contribute concrete alternatives, often in the form of working prototypes.
Mentorship through doing: raise the bar through code review, design review, and pairing with senior engineers. Lead by what you ship and what you push back on.

Minimum Qualifications

8+ years of experience in Software Engineering, SRE, or Production Engineering, with at least 3 years operating at a Senior or Staff level on production-critical systems.
A track record of building force-multiplier infrastructure: you can point to specific platforms, tools, or systems you designed, wrote significant code for, and shepherded into broad production use.
Strong production-grade proficiency in Golang. You have shipped non-trivial Go services and tooling, and generally understand how best to use Go for infrastructure tooling.
Deep experience with public cloud (AWS, GCP) and container orchestration (Kubernetes) — including production operation and failure-mode reasoning, not just deployment.
Proven track record of managing infrastructure as code at scale.
Strong distributed-systems foundations: consensus, replication, RPC and async messaging architectures, capacity planning under uncertainty, and the failure modes that come with each.
Demonstrated incident-leadership experience: you have served as incident commander on serious production events, run rigorous postmortems, and driven the structural fixes through to closure.
Genuine appetite for being hands-on. You want to spend most of your week writing code and debugging real systems, not running working groups.

Preferred Qualifications

Experience with EKS-A provisioning or hybrid-cloud environments.
Experience with Kubernetes and familiarity with Crossplane.
Familiarity with NATS or other publish/subscribe technologies.
Familiarity with CUE (cuelang).
Some familiarity with Rust and functional programming basics.
An understanding of distributed systems and general RPC architectures.
Familiarity with energy data standards and protocols (e.g., Modbus TCP, DNP3, IEC 61850), IoT protocols, or industrial control systems.
Ability to efficiently identify and resolve issues using problem-solving and communication skills.
Adaptability to work in a rapidly changing, fast-paced environment and to pick up new technical areas of expertise.

Compensation

Total cash compensation (salary + bonus) for this position in California is $250,000 to $300,000. Salary is only one part of Verrus’ comprehensive compensation package, which also includes equity, general health benefits, and paid time off. Compensation is determined by multiple factors, including market location, and may vary based on job-related knowledge, skills, and experience.

How to Apply

If you would like to be considered for this role, please email your resume to careers@verrusdata.com with the job title in the subject line. Due to the amount of interest in Verrus’ work, we may not be able to follow up with you directly after you apply.

We are an equal opportunity employer. For us, this is more than legal boilerplate; it is reflective of a deeply held belief that our success depends on our willingness to recruit, hire, empower, compensate, reward, and promote the best and brightest individuals, of all backgrounds and characteristics.
