How to become a Site Reliability Engineer (SRE)?

To become an SRE, you first need to understand the problem that existed before SRE concepts emerged.

In the earlier software development process, engineers wrote the code and then handed it over to the operations team, which deployed it, maintained it, and responded to incidents involving it.

But things have changed. Software development has become faster and more complex because businesses rely more and more on the internet and applications. They need to release features every other day, and they also need to run all of the infrastructure without latency, outages, or other performance issues. This is where traditional software teams started having trouble keeping up.

The industry needed skilled people who could help move workflows from development to production and who could increase the reliability and performance of systems. Before SRE, organizations adopted DevOps concepts (to know the difference between DevOps and SRE, you may read this article), which successfully helped with the transition of workflows from development to production. But the reliability and performance of the system were still missing. This is where site reliability engineering skills come in.

SRE may seem like a new concept to much of the industry, but Google’s Benjamin Treynor introduced it within Google back in 2003.

Site reliability engineers are responsible for reliability across the complete software development lifecycle, from the front-end, customer-facing applications to the back-end database and hardware infrastructure. SREs aim to identify and resolve issues more efficiently than a traditional development and operations team, or today's DevOps team, can. The SRE role is ultimately responsible for maintaining systems’ uptime and reliability.
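To make "uptime" a little more concrete, here is a minimal Python sketch of the arithmetic behind an availability target. The function name `allowed_downtime_minutes`, the 30-day period, and the example targets are illustrative assumptions, not figures from this article.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: int = 30) -> float:
    """Return how many minutes of downtime an availability target permits.

    Hypothetical helper for illustration; the 30-day default is an assumption.
    """
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60  # minutes in the whole period
    return total_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Example targets chosen only to show how quickly the allowance shrinks.
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows about {minutes:.1f} minutes of downtime")
```

For example, a 99.9% target over 30 days leaves roughly 43 minutes of downtime, which is why automation and monitoring matter so much in this role.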

To be a successful SRE, you need to acquire the following skill sets:

  • Know how to code:- An understanding of development and coding helps you automate processes and work with systems.
  • Understanding of operating systems:- SREs work with servers at a large scale, which can be stressful if you are not comfortable with operating systems.
  • Continuous integration/Continuous deployment:- The CI/CD process is not limited to DevOps engineers. SREs also need to know how to build a CI/CD pipeline from scratch.
  • How to use version control tools:- When working in a team, especially on code, you need to understand how code is versioned. So learn version control systems and add them to your skill set to become a Site Reliability Engineer.
  • How to use monitoring tools:- Monitoring tools are lifesavers for SREs. System performance and issues cannot be tracked without monitoring tools in place.
  • Understanding of databases:- An understanding of databases is required so that an engineer knows what a data model is, why data models are necessary, and how the data model should inform the choice of database and service architecture.
  • Cloud-native applications:- Understanding cloud-native applications makes your tasks easier in the workplace. Container tools like Docker and Kubernetes are must-haves for SREs.
  • Distributed computing:- As an SRE you need to handle large, distributed systems, so knowledge of how distributed computing works and an understanding of microservices concepts are required.
  • Communication & Collaboration:- As an SRE you need to communicate and collaborate with multiple stakeholders, such as the software engineers working with you, the chief executive officer, the chief technical officer, and your managers. You will also need to report critical incidents as they happen, along with any incident that could affect the application.

You can also refer to this image to visualize the SRE role.

You may learn the toolsets mentioned below, which can help you become a successful SRE:-

  • SDLC Models & Architecture with Agile, DevOps, SRE & DevSecOps, SOA & Microservices – Concept
  • Platform – Operating Systems – CentOS/Ubuntu & VirtualBox & Vagrant
  • Platform – Cloud – AWS
  • Platform – Containers – Docker
  • Planning and Designing – Jira & Confluence
  • Source Code Versioning – Git using GitHub
  • Webserver – Apache HTTP & Nginx
  • Configuration & Deployment Management – Ansible
  • Container Orchestration – Kubernetes & Helm Introduction
  • Infrastructure Coding – Terraform
  • Service Mesh Data Planes & Control Planes – Envoy & Istio
  • Network configurations and Service Discovery – Consul
  • Continuous Integration – Jenkins
  • Securing credentials – HashiCorp Vault & SSL & Certificates
  • Infrastructure Monitoring Tool 1 – Datadog
  • Infrastructure Monitoring Tool 2 – Prometheus with Grafana
  • Log Monitoring Tool 1 – Splunk
  • Log Monitoring Tool 2 – ELK Stack
  • Performance & RUM Monitoring – New Relic
  • Emergency Response, Alerting, Chat & Notification – SMTP, SES, SNS, PagerDuty & Slack

The number of toolsets to learn may seem overwhelming.

But set your mind to it: you can learn all of these things, and it is all about practice. You don’t need to master every single tool. Understanding the concepts and knowing the essentials of each topic will be fine, as long as you are eager to learn.

Learning new things without support is often not the best approach if you want to save time. You may ask for help from the DevOpsSchool team.


SRE Certifications

Mantosh Singh