Senior Site Reliability Engineer (SRE)
This is a remote, US-based role. As a Sr. Site Reliability Engineer (SRE), your primary goal will be to provide exceptional value to our clients across our managed and professional services offerings.
Fairwinds Ops, Inc. is a Managed Services Provider specializing in Kubernetes, catering to businesses across North America. Our dynamic team of experts is committed to empowering organizations to thrive and expand. With nearly a decade of experience in Kubernetes Cloud Architectures and Security services, we've supported diverse sectors including SMB, Enterprise, SaaS, Healthcare, Financial, and Not-for-Profit, assisting our partners in meeting their daily IT and business objectives. Our diverse clientele and their mission-critical applications demand deep expertise in delivering robust architectures and highly secure environments.
Fairwinds is seeking an intellectually curious, collaborative, enthusiastic, and flexible engineer to join our team as a Sr. Site Reliability Engineer. As a Sr. SRE at Fairwinds, you’ll work directly with clients to ensure their goals are met through automation, analysis, and infrastructure configuration. Working collaboratively with the other SREs at Fairwinds, you will help our clients succeed with Kubernetes by building robust infrastructure, solving complex problems, maintaining reliable, secure environments, and standing behind our work as part of our on-call rotation. As a Sr. SRE, you will mentor and develop junior to mid-level engineers, help evolve our internal tooling and processes, and contribute to pre-sales activities. We work with a diverse set of technologies in and around Kubernetes, depending on what our clients need. Commonly, this includes technologies like:
- EKS/GKE/AKS and their associated cloud primitives
- Common Kubernetes add-ons such as cert-manager, external-dns, Karpenter, and KEDA
- Load balancing and ingress
- CI/CD tools (GitLab, CircleCI)
- Configuration tools such as Terraform and Helm
- Golang
- Monitoring tools (Datadog, Prometheus)
What we offer:
- Competitive salary complemented by a performance-based bonus structure.
- Compensation: $140,000 – $170,000 base salary, dependent on experience.
- Career growth opportunities in a fast-growing industry.
- 100% of insurance premiums paid by the company (medical, dental, vision).
Qualifications:
- 5-10 years of experience in Site Reliability Engineering, DevOps, or Infrastructure roles, with a focus on cloud-native environments
- Strong hands-on experience with Kubernetes (EKS, GKE, AKS, or self-managed clusters) and container orchestration at scale
- Proficiency in Infrastructure as Code (Terraform, Helm, GitOps workflows)
- Deep understanding of AWS (or other major cloud providers) including networking, security, and cost optimization
- Skilled in observability tools (e.g., Prometheus, Grafana, Loki, ELK, Datadog) for monitoring, logging, and tracing
- Experience building and managing CI/CD tools such as CircleCI, Jenkins, and GitLab
- Strong knowledge of Linux systems, networking, and security fundamentals
- Proven track record of incident management, root cause analysis, and reliability improvements
- Familiarity with cloud-native security practices (RBAC, IAM, network policies, secrets management)
- Programming/scripting experience in Go or Bash for automation and tooling
- Excellent collaboration and communication skills, with experience mentoring junior engineers
- Experience working with external customers
Responsibilities:
- Design, deploy, and operate highly available and secure Kubernetes clusters for customer workloads across multiple clouds (AWS, GCP, Azure)
- Attend regular customer syncs to support customers with their Kubernetes questions
- Automate infrastructure provisioning and management using Terraform, Helm, and GitOps workflows
- Implement observability solutions (metrics, logging, tracing, alerting) to ensure visibility and proactive incident detection across clusters
- Participate in incident response, driving root cause analysis and long-term reliability improvements
- Enhance security posture by managing IAM, RBAC, network policies, and compliance-related controls
- Mentor engineers and help establish best practices for reliability, automation, and Kubernetes operations
- Participate in pre-sales activities by collaborating with our Sales Director and providing technical support during initial calls
- Contribute to tooling and internal platforms that streamline customer cluster onboarding, upgrades, and lifecycle management
- Optimize performance and cost of clusters and cloud resources through right-sizing, autoscaling, and capacity planning
- Participate in on-call rotations to ensure uptime and responsiveness for our customers’ clusters
Benefits:
Schedule: Monday to Friday
Supplemental pay types: Bonus
Experience: 5 years of relevant experience preferred
Required travel: 10%-20%
Work Location: Fully Remote