job title: site reliability engineering (sre) manager
location: hyderabad
employment type: full-time
work model: hybrid (3 days from office)
summary:
the sre manager at company will lead the reliability engineering function, ensuring infrastructure resiliency and optimal operational performance. this hybrid role blends technical leadership with team mentorship and cross-functional coordination.
experience required:
10+ years total experience, with 3+ years in a leadership role in sre or cloud operations.
technical knowledge and skills:
mandatory:
• deep understanding of kubernetes, gke, prometheus, terraform
• cloud: advanced gcp administration
• ci/cd: jenkins, argo cd, github actions
• incident management: full lifecycle, tools like opsgenie
nice to have:
• knowledge of service mesh and observability stacks
• strong scripting skills (python, bash)
• bigquery/dataflow exposure for telemetry
scope:
• build and lead a team of sres
• standardize practices for reliability, alerting, and response
• engage with engineering and product leaders
roles and responsibilities:
• establish and lead the implementation of organizational reliability strategies, aligning slas, slos, and error budgets with business goals and customer expectations.
• develop and institutionalize incident response frameworks, including escalation policies, on-call scheduling, service ownership mapping, and rca process governance.
• lead technical reviews for infrastructure reliability design, high-availability architectures, and resiliency patterns across distributed cloud services.
• champion observability and monitoring culture by standardizing tooling, alert definitions, dashboard templates, and telemetry data schemas across all product teams.
• drive continuous improvement through operational maturity assessments, toil elimination initiatives, and sre okrs aligned with product objectives.
• collaborate with cloud engineering and platform teams to introduce self-healing systems, capacity-aware autoscaling, and latency-optimized service mesh patterns.
• act as the principal escalation point for reliability-related concerns and ensure incident retrospectives lead to measurable improvements in uptime and mttr.
• own runbook standardization, capacity planning, failure mode analysis, and production readiness reviews for new feature launches.
• mentor and develop a high-performing sre team, fostering a proactive ownership culture, encouraging cross-functional knowledge sharing, and establishing technical career pathways.
• collaborate with leadership, delivery, and customer stakeholders to define reliability goals, track performance, and demonstrate roi on sre investments.