How to build your SRE team

If you are planning to implement SRE in your process or organization, this article can help make that daunting task easier.

This post explains how you can incorporate SRE practices without tearing the whole organization apart and putting it back together, and how to set the right objectives and goals for the SRE team.

Once you decide to implement SRE in your organization, the first step is to start with a small project.

Starting small means running a pilot SRE project within your existing process. Find a place where you expect to see results and grow from there. In practice, that means choosing a system that currently has reliability issues. It shouldn’t simply be your least reliable system, though: pick the least reliable system whose failures result in lost profit.
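
The selection rule above can be sketched in a few lines of Python. This is only an illustration: the system names, downtime figures, and revenue figures are made-up placeholders, and `Candidate` and `pick_pilot` are hypothetical helpers, not part of any SRE tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """A system being considered for the SRE pilot (illustrative fields)."""
    name: str
    downtime_hours_per_month: float   # how unreliable the system is
    revenue_lost_per_hour: float      # 0 means downtime costs no profit


def pick_pilot(candidates: List[Candidate]) -> Optional[Candidate]:
    """Return the least reliable system among those that lose profit when down."""
    # Step 1: ignore systems whose downtime does not affect profit.
    costly = [c for c in candidates if c.revenue_lost_per_hour > 0]
    if not costly:
        return None
    # Step 2: among the rest, choose the one with the most downtime.
    return max(costly, key=lambda c: c.downtime_hours_per_month)


# Placeholder data, for illustration only.
candidates = [
    Candidate("internal-wiki", downtime_hours_per_month=12, revenue_lost_per_hour=0),
    Candidate("checkout", downtime_hours_per_month=5, revenue_lost_per_hour=2000),
    Candidate("search", downtime_hours_per_month=2, revenue_lost_per_hour=500),
]

pilot = pick_pilot(candidates)
print(pilot.name if pilot else "no suitable pilot")
```

With this placeholder data, the wiki is the least reliable system overall, but it is skipped because its downtime costs nothing, so the checkout system becomes the pilot.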

Why? Because if you work on improving the reliability of an application that has little effect on the business, your SRE pilot will not prove anything.

Once you have selected the application or system, build an SRE team with the single objective of improving its reliability.

Make the team’s first task determining the required reliability level of that particular application or system.
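
A required reliability level is often written as an availability target in percent, and it can help to translate that target into the downtime it permits. The sketch below assumes a 30-day period by default; `allowed_downtime_minutes` is a hypothetical helper name, and the targets in the loop are examples rather than recommendations.

```python
def allowed_downtime_minutes(target_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime permitted by an availability target over a period."""
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be in the range (0, 100]")
    total_minutes = period_days * 24 * 60
    # The unavailable fraction is whatever the target leaves over.
    return total_minutes * (1 - target_percent / 100)


# Compare a few example targets over the default 30-day period.
for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows about {minutes:.1f} minutes of downtime")
```

For example, a 99.9% target over 30 days leaves roughly 43 minutes of downtime, which gives the team a concrete number to measure the pilot against.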

Use iterative, agile processes to build on the successes and lessons of the pilot as you grow. This lets you focus on the problems with the most potential value.

The next step is to staff your team. Avoid building a big team.

Don’t expect the SRE team to take care of all the availability and reliability needs of your organization from day one. Instead, focus the team on high-risk, high-value work first; that will help its members learn and grow as an SRE team.

When you are building your SRE team, you have two options:

  1. Hire externally.
  2. Hire internally from your existing team.

But before going with either option, look for the following skills in each individual:

  • Ability to solve problems and troubleshoot:- An SRE should be able to address a variety of availability and reliability issues. Most of the time, they will be troubleshooting applications they did not write themselves, so they need to be able to debug without deep domain knowledge.
  • Desire to automate:- One of the goals of SRE is automating away toil, so an SRE needs an inner drive to minimize manual work. They should be eager to automate the manual, repetitive work that your traditional ops team cannot.
  • Curiosity:- Curiosity helps SREs discover new solutions to problems. It also helps them find unexpected causes of familiar problems.
  • Teamwork:- An SRE team needs to work together and collaborate with multiple stakeholders, such as the software engineers working alongside it, the chief executive officer, the chief technical officer, and your managers. The team will also need to report on critical incidents as they happen and on any incident that can affect the application. You need people who will band together behind a common goal.
  • Communication:- Whether discussing problems during high-pressure outages or talking more comfortably about long-term automation strategies, SREs should have strong communication skills, because they communicate frequently with multiple stakeholders.
  • Understanding the big picture:- Reliability problems can often be solved in several ways. An SRE needs to be able to see each possible solution in a larger context. Otherwise, they may solve one problem but cause two others.

With these skills in mind, start the staffing process by looking within your own organization. This lets you bring people into the SRE project who already have domain knowledge. Plus, you’ll have an idea of how they work based on what they have already done within the organization.

Try to find people who have the qualities above or who can meet the requirements with little effort. But be wary of staffing your team with the superstars of your organization. Just because someone does well in the development team or the operations team doesn’t mean they will be able to contribute as an SRE. Look for the skills that will allow the person to grow into the role.

Additionally, a superstar who is used to saving the day alone may not be able to build trust and camaraderie with others on the team. Site reliability engineering is a team sport, and one toxic person can make it difficult for a team to succeed.

In the end, bringing a team together must be done carefully. You don’t want to assemble a group of engineers who have never interacted with each other and expect everything to work.

As an alternative to picking individuals from different teams, consider taking an existing high-performing team and turning it into an SRE team focused on reliability.

And of course, if you don’t have the talent within your organization, or if you reach a point of rapid growth, you will have to look for candidates outside it.

After identifying the right people, the next step is to prepare them.

This is where training and development come in!

Becoming an SRE is not just a job title. It is a culture shift and a change in mindset. Without understanding the fundamentals, your new team cannot succeed.

Here you can draw on DevOpsSchool’s SRE consulting services, whose consultants can conduct corporate training sessions for your team. These sessions aim to give the team an in-depth understanding of SRE along with the right skills and culture, providing the base that will carry them further along the SRE journey.

After this step, you can set the charter and governance of your SRE team.

Establish a charter and define what the SRE team’s priorities are and how the team will operate, with an emphasis on metrics and objectives. Make sure that whoever oversees the SRE team has clear guidelines and expectations for it, and that the team has the tools to provide the data required to meet its objectives.


Mantosh Singh