Ensuring Reliability | Automating Ops | Scaling Systems | Monitoring
I am a Lead Site Reliability Engineer (SRE) and ITSM expert with 9+ years of experience in incident response and in building, automating, and scaling production systems. I am skilled in Kubernetes, Kafka, CI/CD, monitoring, and cloud-native technologies, and I am passionate about bridging the gap between development and operations to ensure uptime, performance, and reliability.
Deployed and managed Apache Kafka clusters on Kubernetes using the Strimzi Operator, enabling real-time data pipelines with monitoring and alerting.
Lead the platform support team, managing and maintaining Kafka clusters on Kubernetes, including GKE upgrades, CFK upgrades, SSL key rotation, and incident and problem management.
Designed Infrastructure as Code (IaC) using Terraform to provision and manage cloud environments with monitoring and scaling. Automated many routine tasks, including the "Daily_Health_check report", using Bash scripts, Java, Selenium, and VB macros.
Connect with me via: