Senior Site Reliability Engineer (SRE)

This is a remote, US-based role. As a Sr. Site Reliability Engineer (SRE), your primary goal will be to provide exceptional value to our clients while working across our managed and professional services offerings.

Sr. Site Reliability Engineer - Engineering
Fairwinds Ops, Inc. is a Managed Services Provider specializing in Kubernetes, catering to businesses across North America. Our dynamic team of experts is committed to empowering organizations to thrive and expand. With nearly a decade of experience in Kubernetes Cloud Architectures and Security services, we've supported diverse sectors including SMB, Enterprise, SaaS, Healthcare, Financial, and Not-for-Profit, assisting our partners in meeting their daily IT and business objectives. Our diverse clientele and their mission-critical applications demand a deep understanding of how to build robust architectures and highly secure environments.
Fairwinds is seeking an intellectually curious, collaborative, enthusiastic, and flexible engineer to join our team as a Sr. Site Reliability Engineer. As a Sr. SRE at Fairwinds, you’ll work directly with clients to ensure their goals are met through automation, analysis, and infrastructure configuration. Working collaboratively with the other SREs at Fairwinds, you will help our clients succeed with Kubernetes by building robust infrastructure, solving complex problems, maintaining reliable, secure environments, and standing behind our work as part of our on-call rotation. As a Sr. SRE, you will mentor and develop junior to mid-level engineers, help evolve our internal tooling and processes, and contribute to pre-sales activities. We work with a diverse set of technologies in and around Kubernetes, depending on what our clients need. Commonly, this includes technologies like:

-EKS/GKE/AKS and their associated cloud primitives
-Common Kubernetes add-ons like cert-manager, external-dns, Karpenter, KEDA, etc.
-Load balancing and ingress
-CI/CD tools (GitLab, CircleCI)
-Configuration tools such as Terraform and Helm
-Golang
-Monitoring tools (Datadog, Prometheus)

What we offer:

Competitive salary complemented by a performance-based bonus structure.
Compensation: $140,000 – $170,000 base salary, dependent on experience.
Career growth opportunities in a fast-growing industry.
100% of insurance premiums paid by the company (medical, dental, vision).
401(k) plan.

Qualifications:

5-10 years of experience in Site Reliability Engineering, DevOps, or Infrastructure roles, with a focus on cloud-native environments
Strong hands-on experience with Kubernetes (EKS, GKE, AKS, or self-managed clusters) and container orchestration at scale
Proficiency in Infrastructure as Code (Terraform, Helm, GitOps workflows)
Deep understanding of AWS (or other major cloud providers) including networking, security, and cost optimization
Skilled in observability tools (e.g., Prometheus, Grafana, Loki, ELK, Datadog) for monitoring, logging, and tracing
Experience building and managing CI/CD tools such as CircleCI, Jenkins, and GitLab
Strong knowledge of Linux systems, networking, and security fundamentals
Proven track record of incident management, root cause analysis, and reliability improvements
Familiarity with cloud-native security practices (RBAC, IAM, network policies, secrets management)
Programming/scripting experience in Go or Bash for automation and tooling
Excellent collaboration and communication skills, with experience mentoring junior engineers
Experience working with external customers

Responsibilities:

Design, deploy, and operate highly available and secure Kubernetes clusters for customer workloads across multiple clouds (AWS, GCP, Azure)
Attend regular customer syncs to support our customers with their Kubernetes questions
Automate infrastructure provisioning and management using Terraform, Helm, and GitOps workflows
Implement observability solutions (metrics, logging, tracing, alerting) to ensure visibility and proactive incident detection across clusters
Participate in incident response, driving root cause analysis and long-term reliability improvements
Enhance security posture by managing IAM, RBAC, network policies, and compliance-related controls
Mentor engineers and help establish best practices for reliability, automation, and Kubernetes operations
Participate in pre-sales activities by collaborating with our Sales Director and providing technical support during initial calls
Contribute to tooling and internal platforms that streamline customer cluster onboarding, upgrades, and lifecycle management
Optimize performance and cost of clusters and cloud resources through right-sizing, autoscaling, and capacity planning
Participate in on-call rotations to ensure uptime and responsiveness for our customers’ clusters

Benefits:

Health, dental, vision, and life insurance.
Unmetered PTO.

Schedule: Monday to Friday

Supplemental pay types: Bonus

Experience: 5 years of relevant experience preferred

Required travel: 10%-20%

Work Location: Fully Remote
