Interview Questions & Answers Sets on SRE

What are the differences between SRE and DevOps?

Google: “One could view DevOps as a generalization of several core SRE principles to a wider range of organizations, management structures, and personnel.”

What is an SRE team responsible for?

Google: “the SRE team is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their services”

What is an error budget?

Atlassian: “An error budget is the maximum amount of time that a technical system can fail without contractual consequences.”

What do you think about the following statement: “100% is the only right availability target for a system”

Wrong. No system can guarantee 100% availability, because no system is immune to downtime. Most systems and services fall somewhere between 99% and 100% uptime (or at least they should).
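
To make the gap between 99% and 100% concrete, the sketch below converts an availability target into the downtime it permits per year. The function name and the 365-day year are assumptions made for illustration, not part of any standard tool.

```python
# Allowed downtime implied by an availability target (illustrative sketch).
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a non-leap, 365-day year


def allowed_downtime_seconds(availability_percent: float,
                             period_seconds: int = SECONDS_PER_YEAR) -> float:
    """Return the downtime, in seconds, that a target permits over a period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return period_seconds * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 100.0):
        hours = allowed_downtime_seconds(target) / 3600
        print(f"{target}% availability allows about {hours:.2f} hours of downtime per year")
```

A 100% target leaves zero seconds for deployments, hardware faults, or maintenance, which is why it is not a realistic goal.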

What are MTTF (mean time to failure) and MTTR (mean time to repair)? What do these metrics help us evaluate?

  • MTTF (mean time to failure), otherwise known as uptime, is how long the system runs before it fails.
  • MTTR (mean time to repair), on the other hand, is the amount of time it takes to repair a broken system.
  • MTBF (mean time between failures) is the amount of time between failures of the system.

Together, these metrics help us evaluate how reliable a system is and how quickly it can be restored after a failure.
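
As a rough illustration, these metrics can be derived from an incident log. The sketch below assumes a hypothetical `Incident` record with failure and recovery times in hours, treats total uptime divided by the number of failures as MTTF (a simplification), and uses the common relations MTBF = MTTF + MTTR and availability = MTTF / (MTTF + MTTR).

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical incident record; times are hours since observation began."""
    failed_at: float
    recovered_at: float


def reliability_metrics(incidents: list[Incident], observation_hours: float) -> dict:
    """Compute MTTF, MTTR, MTBF, and availability from an incident log."""
    if not incidents:
        raise ValueError("at least one incident is needed to compute the means")
    downtime = sum(i.recovered_at - i.failed_at for i in incidents)
    uptime = observation_hours - downtime
    count = len(incidents)
    mttf = uptime / count      # average running time per failure (simplified)
    mttr = downtime / count    # average time spent repairing
    return {
        "mttf": mttf,
        "mttr": mttr,
        "mtbf": mttf + mttr,   # one full failure-and-repair cycle
        "availability": mttf / (mttf + mttr),
    }


if __name__ == "__main__":
    log = [Incident(100, 102), Incident(400, 401), Incident(700, 703)]
    for name, value in reliability_metrics(log, observation_hours=1000).items():
        print(f"{name}: {value:.3f}")
```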

What is the role of monitoring in SRE?

Google: “Monitoring is one of the primary means by which service owners keep track of a system’s health and availability”

Tell me the difference between DevOps & SRE.

  • DevOps focuses on both departments, Dev and Ops, to bridge the two worlds; SRE treats Ops as a software engineering problem.
  • DevOps is more focused on automation; SRE is focused on adopting consistent technologies.
  • DevOps primarily focuses on performance and on improving results based on feedback; SRE requires evaluating SLOs as its principal metrics.

Why do you think you would make a good Site Reliability Engineer?

With this question, the interviewer wants to gauge your motivation for the role and your knowledge of it. A strong answer might look like this:

I have experience in the same role, with a deep understanding of:

  • The principles behind SRE.
  • The relationship of SRE to DevOps and other popular frameworks.
  • SLIs (Service Level Indicators).
  • Practical ways of eliminating toil.
  • Error budgets and the policies associated with them.
  • SRE tools, automation techniques, and the importance of security.

Hence, with all this knowledge and experience, I feel this is the right role for me.

What are error budgets, and what are they used for?

An error budget defines the maximum amount of time that a technical system can fail without contractual consequences.

Error budgets encourage teams to minimize real incidents while maximizing innovation by taking risks within acceptable limits.
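
A minimal sketch of how a team might track a time-based error budget follows. The function names, the 30-day window, and the "release only while budget remains" policy are illustrative assumptions; real policies vary by organization.

```python
def remaining_error_budget_minutes(slo_percent: float,
                                   window_days: int,
                                   downtime_minutes: float) -> float:
    """Return the unspent error budget, in minutes, for the current window."""
    window_minutes = window_days * 24 * 60
    budget_minutes = window_minutes * (1 - slo_percent / 100)
    return budget_minutes - downtime_minutes


def can_take_risk(slo_percent: float, window_days: int, downtime_minutes: float) -> bool:
    """Assumed policy: risky changes are allowed only while budget remains."""
    return remaining_error_budget_minutes(slo_percent, window_days, downtime_minutes) > 0


if __name__ == "__main__":
    # A 99.9% SLO over 30 days gives a budget of about 43.2 minutes.
    left = remaining_error_budget_minutes(99.9, window_days=30, downtime_minutes=30)
    print(f"remaining budget: {left:.1f} minutes")
    print("risky release allowed:", can_take_risk(99.9, 30, 30))
```

When the remaining budget drops to zero or below, a team following this policy would pause risky launches and spend its effort on reliability work instead.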

How do you differentiate between a process and a thread?

  • A process is an instance of a computer program that is being executed; a thread is a component of a process and the smallest unit of execution.
  • Processes are heavyweight; threads are lightweight.
  • Creating a process takes more time; creating a thread takes less time.
  • Processes do not share data with each other by default; threads of the same process share data.
  • Context switching between processes takes more time; context switching between threads takes less time.
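
The data-sharing difference can be observed directly. The sketch below assumes a POSIX system, because it explicitly requests the "fork" start method; the `counter` dictionary and the function names are made up for the demonstration.

```python
import multiprocessing
import threading

counter = {"value": 0}  # module-level state used to observe sharing


def increment() -> None:
    counter["value"] += 1


def value_after_process_increment() -> int:
    """Run increment() in a child process; the parent's data is unaffected."""
    ctx = multiprocessing.get_context("fork")  # assumption: POSIX only
    child = ctx.Process(target=increment)
    child.start()
    child.join()
    return counter["value"]


def value_after_thread_increment() -> int:
    """Run increment() in a thread; it mutates the same memory as the caller."""
    worker = threading.Thread(target=increment)
    worker.start()
    worker.join()
    return counter["value"]


if __name__ == "__main__":
    print("after child process:", value_after_process_increment())
    print("after thread:", value_after_thread_increment())
```

The child process increments its own copy of `counter`, so the parent still sees the old value, while the thread changes the value the caller sees.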

Which activities help reduce toil?

The following activities can reduce toil:

  • Creating internal automation
  • Creating external automation
  • Enhancing services so that they do not need manual maintenance intervention

Have you ever heard of TCP? Please list some TCP connection states.

TCP (Transmission Control Protocol) is one of the main protocols of the Internet protocol suite. It is a communication standard that enables application programs and computing devices to exchange messages over a network.

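TCP connection states are listed below:

  • LISTEN
  • SYN-SENT
  • SYN-RECEIVED
  • ESTABLISHED
  • FIN-WAIT-1
  • FIN-WAIT-2
  • CLOSE-WAIT
  • CLOSING
  • LAST-ACK
  • TIME-WAIT
  • CLOSED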

What is the Linux kill command? List the related Linux kill commands and their functions.

The kill command in Linux sends a signal to the specified processes or process groups.

The related kill commands are listed below:

  • killall: kills all processes with a particular name.
  • pkill: very similar to killall; the difference is that it matches processes by partial name.
  • xkill: lets the user kill a program simply by clicking on its window.
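
Under the hood, these commands all deliver signals to processes. The sketch below shows the same mechanism from Python on a POSIX system: it starts a sleeping child process and sends it SIGTERM, which is also the signal `kill` sends by default. The function name is made up for the example.

```python
import os
import signal
import subprocess
import sys


def terminate_child_with_sigterm() -> int:
    """Start a sleeping child process, send it SIGTERM, and return its exit code."""
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    os.kill(child.pid, signal.SIGTERM)  # comparable to `kill -TERM <pid>` in a shell
    return child.wait(timeout=10)


if __name__ == "__main__":
    # On POSIX, a negative return code -N means the child was ended by signal N.
    print("child exit code:", terminate_child_with_sigterm())
```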

What is cloud computing?

Cloud computing is the on-demand availability of computer system resources, especially data storage (cloud storage) and computing power, without direct active management by the user. The term is generally used to describe data centers that are available to many users over the internet.

What is DHCP, and what is it used for?

DHCP stands for Dynamic Host Configuration Protocol. It is a network management protocol used on IP networks, in which a DHCP server dynamically assigns an IP address and other network configuration parameters to each device on the network so that the devices can communicate with other IP networks.

DHCP is used for:

  • Reducing the need for a network administrator or user to manually assign IP addresses to all network devices.
  • Requesting Internet Protocol (IP) addresses and networking parameters from the ISP (Internet Service Provider).

How will you secure your Docker containers?

To secure Docker containers, follow these guidelines:

  • Choose third-party containers carefully.
  • Enable Docker Content Trust.
  • Set resource limits for your containers.
  • Consider third-party security tools.
  • Use Docker Bench for Security.
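
Two of these guidelines map directly onto Docker CLI settings: the `DOCKER_CONTENT_TRUST=1` environment variable enables content trust, and the `--memory`, `--cpus`, and `--pids-limit` flags of `docker run` set resource limits. The sketch below only assembles such a command without running it; the helper name, the default limits, and the image name `example/app:1.0` are placeholders.

```python
import os
import shlex


def hardened_docker_run(image: str,
                        memory: str = "256m",
                        cpus: str = "0.5",
                        pids_limit: int = 100) -> tuple[list[str], dict[str, str]]:
    """Build a `docker run` command line with resource limits and content trust."""
    argv = [
        "docker", "run", "--rm",
        "--memory", memory,              # cap on container memory
        "--cpus", cpus,                  # cap on CPU share
        "--pids-limit", str(pids_limit), # cap on the number of processes
        image,
    ]
    # Content trust makes the Docker client verify image signatures on pull.
    env = dict(os.environ, DOCKER_CONTENT_TRUST="1")
    return argv, env


if __name__ == "__main__":
    command, environment = hardened_docker_run("example/app:1.0")  # placeholder image
    print("DOCKER_CONTENT_TRUST=" + environment["DOCKER_CONTENT_TRUST"], shlex.join(command))
```

The returned pair could be passed to `subprocess.run(command, env=environment)` on a host where Docker is installed.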

Can you describe the best SRE tools for each stage of DevOps?

The best SRE tools for each stage of DevOps are listed below:

Planning: Jira, Pivotal Tracker, and other popular task management tools.

Creation: GitHub

Verification: CI/CD tools such as Jenkins and CircleCI

Packaging: Container orchestration services such as Mesosphere or Kubernetes

Configuration: Tools like Ansible and Terraform

Have you ever heard of an SLO? If so, explain it.

An SLO (Service Level Objective) is a key element of the SLA (Service Level Agreement) between a service provider and a customer. SLOs are agreed upon as a means of measuring the service provider's performance, and they are written in a way that avoids disputes between the two parties.

An SLO can be a specific measurable characteristic of the SLA, such as availability, throughput, frequency, response time, or quality. Together, these SLOs define the expected service between the provider and the customer, and they vary depending on the service's urgency, resources, and budget. SLOs provide a quantitative means of defining the level of service a customer can expect from a provider.
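
As an illustration of the quantitative side, the sketch below checks a response-time SLO of the form "at least 90% of requests complete within 200 ms". The threshold, the target, the sample latencies, and the function names are all assumptions chosen for the example.

```python
def latency_sli(latencies_ms: list[float], threshold_ms: float) -> float:
    """Return the fraction of requests that finished within the threshold."""
    if not latencies_ms:
        raise ValueError("no requests were measured")
    good = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return good / len(latencies_ms)


def meets_slo(latencies_ms: list[float], threshold_ms: float, target_fraction: float) -> bool:
    """Compare the measured indicator against the objective."""
    return latency_sli(latencies_ms, threshold_ms) >= target_fraction


if __name__ == "__main__":
    sample = [120, 80, 300, 95, 110, 450, 70, 90, 100, 85]  # made-up measurements
    print("fraction within 200 ms:", latency_sli(sample, threshold_ms=200))
    print("meets 90% objective:", meets_slo(sample, threshold_ms=200, target_fraction=0.9))
```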

Rajesh Kumar