Interview Questions & Answers Complete Guide for SRE

What is the difference between DevOps and SRE?

Ans. A. Organizational Silos Reduction:

  • SRE approaches Ops as if it were a software engineering issue.
  • To bridge these two worlds, DevOps focuses on both the Dev and Ops departments.

B. Taking Advantage of Tooling and Automation:

  • SRE is focused on ensuring that all IT teams use the same technologies and have access to the same information.
  • DevOps focuses on technology adoption and automation.

C. Measuring everything:

  • DevOps is primarily concerned with the performance of processes and the results obtained through a feedback loop in order to achieve continuous improvement.
  • Because SRE views Ops issues as software engineering issues, it necessitates the use of SLOs as the primary metric.

What makes you think you’ll do well as a Site Reliability Engineer?

Ans. I have a strong practical understanding and working knowledge of DevOps, including:

  • SRE’s relationship with DevOps and other well-known frameworks
  • The underlying principles of SRE
  • SLOs (Service Level Objectives) and the users they serve
  • The modern monitoring landscape and Service Level Indicators (SLIs)
  • Error budgets and the policies that go with them
  • The impact of toil on a company’s productivity
  • Practical steps that can be taken to help eliminate toil
  • Observability as a way of assessing a service’s health
  • The importance of security, SRE tools, and automation techniques
  • Anti-fragility as an approach to failure and failure testing
  • The impact of implementing SRE on the organisation

As a result, I believe Site Reliability Engineer is the ideal job for me.

Have you heard of SLO before? If yes, please explain.
Ans. An SLO, or Service Level Objective, is a key component of a service-level agreement (SLA) between two parties: a service provider and a customer. SLOs are agreed upon as a way to measure the service provider’s performance and to avoid disputes.

An SLO refers to a measurable characteristic of the SLA, such as availability, throughput, frequency, response time, or quality. SLOs define the expected level of service between the provider and the customer, and they vary depending on the urgency, resources, and budget of the service. In short, SLOs are a numerical way of describing the level of service a customer can expect from a provider.

What is a data structure? Name some data structures.
Ans. A data structure is a format for organising, managing, and storing data that allows for quick access and modification. More precisely, a data structure is a collection of data values, their relationships, and the functions or operations that can be performed on the data.

The following are types of data structures:

  • Linear: Arrays, lists
  • Tree: Binary trees, heaps
  • Graph: Decision graphs, acyclic graphs, etc.
  • Hash: Distributed hash tables, hash trees, etc.
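
As a rough illustration of the categories above, the following Python sketch uses one built-in or standard-library structure per category. The variable names and sample values (queue depths, service names, team names) are made up for the example.

```python
import heapq

# Linear: a list keeps items in insertion order and supports indexing.
queue_depths = [12, 7, 30]
queue_depths.append(4)

# Tree: heapq maintains a binary min-heap inside a plain list.
heap = []
for depth in queue_depths:
    heapq.heappush(heap, depth)
smallest = heapq.heappop(heap)  # removes and returns the minimum value

# Graph: an adjacency list maps each node to its neighbours.
dependencies = {"web": ["api"], "api": ["db"], "db": []}


def reachable(graph, start):
    """Return every node reachable from start using depth-first search."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return seen


# Hash: a dict is a hash table with average constant-time lookup by key.
owners = {"web": "team-a", "db": "team-b"}
```

In an interview it usually helps to say which operation each structure makes cheap: indexing for the list, finding the minimum for the heap, traversal for the graph, and key lookup for the hash table.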

What’s the difference between a process and a thread?


  • A process is a program in execution: it carries out the actions specified in the program.
  • A thread, on the other hand, is a segment of a process.
  • A process is heavyweight. Threads are lightweight.
  • A process takes longer to terminate. Threads take less time to terminate.
  • It takes longer to create a process. It takes less time to create a thread.
  • Context switching takes longer between processes. Context switching takes less time between threads.
  • Processes are isolated from one another. Threads share memory.
  • Processes do not share data with one another by default. Threads share data with each other.
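
The memory-sharing difference can be seen in a small Python sketch. It assumes a Unix-like system, because it uses os.fork to create the child process; the function names are made up for the example.

```python
import os
import threading


def thread_shares_memory():
    """A thread appends to the parent's list, and the parent sees the change."""
    data = []
    worker = threading.Thread(target=data.append, args=("written by thread",))
    worker.start()
    worker.join()
    return data


def child_process_has_own_memory():
    """A forked child appends to its own copy, so the parent's list stays empty."""
    data = []
    pid = os.fork()
    if pid == 0:
        # Child process: this modifies the child's private copy of the list.
        data.append("written by child")
        os._exit(0)
    os.waitpid(pid, 0)  # parent waits for the child to exit
    return data
```

Calling thread_shares_memory() returns a list containing the thread's entry, while child_process_has_own_memory() returns an empty list, because the child's write never reaches the parent's address space.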

What are error budgets, and what are they used for?
Ans. An error budget defines the maximum amount of time a technical system can fail without contractual consequences.
The error budget encourages teams to minimize real-world incidents while maximizing innovation by allowing them to take risks within acceptable limits.
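
As a worked example, an availability target can be turned into an allowed-downtime figure. The sketch below assumes a time-based budget over a rolling 30-day window; the function names and the 99.9% target are illustrative choices, not something the guide prescribes.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining(slo_target, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget
```

With a 99.9% target, a 30-day window has 43,200 minutes, so the budget is about 43.2 minutes of downtime. After 10.8 minutes of downtime, roughly 75% of the budget remains.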

What is an error budget policy?
Ans. When the SLO indicates that a service is not reliable enough, an error budget policy sets out how the business will trade off reliability work against feature work.

What does it mean to “reduce toil”?

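Ans. Toil is manual, repetitive operational work that could be automated and that grows as the service grows. Reducing toil means cutting down on that work, and the following activities can help: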

  • External automation creation
  • Internal automation creation
  • Improving the service so that it does not require maintenance.

Define Service Level Indicators (SLI).
A Service Level Indicator (SLI) is a quantitative measure of the level of service a provider delivers to its customers. SLIs are the foundation for Service Level Objectives (SLOs), which in turn are the foundation for Service Level Agreements (SLAs). An SLI metric is also known as an SLA metric.

Although each system’s services are unique, a common set of SLIs is frequently used. Latency, throughput, availability, and error rate are some of the most common SLIs; others include durability (in storage systems), end-to-end latency (for complex data processing systems, particularly pipelines), and correctness.
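
Two of these SLIs are easy to compute from raw request data. The sketch below is one possible way to do it: it treats availability as the ratio of good requests to total requests, and it uses the nearest-rank method for the latency percentile. The function names and sample numbers are assumptions made for the example.

```python
import math


def availability_sli(good_requests, total_requests):
    """Fraction of requests that were served successfully."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def latency_percentile(latencies_ms, percentile):
    """Latency percentile using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("latencies_ms must not be empty")
    ordered = sorted(latencies_ms)
    rank = math.ceil(percentile * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_slo(sli_value, slo_target):
    """An SLO is met when the measured SLI is at or above the target."""
    return sli_value >= slo_target


sample_latencies_ms = [120, 80, 200, 95, 150, 3000, 110, 90, 130, 100]
p90_ms = latency_percentile(sample_latencies_ms, 90)
```

The SLI is the measurement and the SLO is the target it is compared against, which is why meets_slo takes both values.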

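List the Linux signals you’re aware of.
The following are some of the most common Linux signals:

  • SIGHUP (1): Hangup; the controlling terminal was closed. Many daemons reload their configuration when they receive it.
  • SIGINT (2): Interrupt from the keyboard (Ctrl+C).
  • SIGQUIT (3): Quit from the keyboard (Ctrl+\), usually with a core dump.
  • SIGKILL (9): Immediate termination; it cannot be caught, blocked, or ignored.
  • SIGSEGV (11): Invalid memory reference.
  • SIGTERM (15): Polite termination request; the default signal sent by kill.
  • SIGSTOP: Pauses the process; it cannot be caught or ignored.
  • SIGCONT: Resumes a stopped process.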
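
Signals can also be handled from code. The sketch below uses Python's standard signal module to install a SIGTERM handler, send SIGTERM to the current process, and restore the previous handler. The function name is made up, and the code assumes it runs in the main thread of a Unix-like process, since Python only allows handlers to be installed there.

```python
import os
import signal


def sigterm_is_handled():
    """Install a SIGTERM handler, signal this process, and report the result."""
    received = []

    def on_sigterm(signum, frame):
        # A real service would stop accepting work and flush its state here.
        received.append(signal.Signals(signum).name)

    previous = signal.signal(signal.SIGTERM, on_sigterm)
    try:
        # Same effect as running `kill <pid>` against this process.
        os.kill(os.getpid(), signal.SIGTERM)
    finally:
        # Restore whatever handler was in place before the demonstration.
        restore = previous if previous is not None else signal.SIG_DFL
        signal.signal(signal.SIGTERM, restore)
    return received == ["SIGTERM"]
```

The same approach does not work for SIGKILL or SIGSTOP, because the kernel does not let a process catch or ignore them.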


Do you know what TCP stands for? Please list some TCP connection states.
TCP stands for Transmission Control Protocol, one of the most important protocols in the Internet protocol suite. TCP arose from the initial network implementation, where it complemented the Internet Protocol (IP). As a result, the suite is commonly referred to as TCP/IP.

The following are a few TCP connection states:
1) LISTEN – The server is listening for incoming connections on a specific port, such as port 80 for HTTP.

2) SYN-SENT – The client has sent a SYN and is waiting for a response.

3) SYN-RECEIVED – The server has received the SYN, replied with a SYN-ACK, and is waiting for the client’s ACK.

4) ESTABLISHED – The three-way TCP handshake has completed and data can flow.
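
The handshake states can be walked through on the loopback interface with Python's standard socket module. This is a minimal sketch: the function name is made up, and the intermediate states are only described in comments because the kernel passes through them too quickly to observe from the program itself.

```python
import socket


def handshake_on_loopback():
    """Open a TCP connection to ourselves and send four bytes across it."""
    # LISTEN: bind to an ephemeral port on loopback and start listening.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    port = server.getsockname()[1]

    # The client sends a SYN (SYN-SENT); the kernel on the listening side
    # answers with SYN-ACK (SYN-RECEIVED) and the client replies with an ACK.
    client = socket.create_connection(("127.0.0.1", port))

    # ESTABLISHED: accept() returns a socket for the completed connection.
    connection, _address = server.accept()
    client.sendall(b"ping")
    data = connection.recv(4)

    for sock in (connection, client, server):
        sock.close()
    return data
```

While a real service is running, a tool such as ss or netstat shows the state of each socket in its output.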

Define the term inode.
Ans. In Unix, an inode is a data structure that stores metadata about a file. The following are some of the items found in an inode:

1) File type and permissions (mode)

2) Owner (UID, GID)

3) Size

4) atime, ctime, and mtime (the access, inode-change, and modification timestamps)
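
Most inode fields can be read with os.stat from Python's standard library. The sketch below writes five bytes to a temporary file and collects its metadata; the helper name inode_metadata and the dictionary keys are assumptions made for the example.

```python
import os
import stat
import tempfile


def inode_metadata(path):
    """Collect the inode fields that os.stat exposes for a path."""
    st = os.stat(path)
    return {
        "inode": st.st_ino,                 # inode number
        "mode": stat.filemode(st.st_mode),  # file type and permissions
        "uid": st.st_uid,
        "gid": st.st_gid,
        "size": st.st_size,                 # size in bytes
        "atime": st.st_atime,               # last access
        "mtime": st.st_mtime,               # last content modification
        "ctime": st.st_ctime,               # last inode change on Unix
    }


with tempfile.NamedTemporaryFile() as handle:
    handle.write(b"hello")
    handle.flush()
    info = inode_metadata(handle.name)
```

The file name is not part of the result: names are stored in directory entries that point to the inode, not in the inode itself.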

What is the kill command in Linux? List the Linux kill commands, along with their functions.
Ans. The kill command sends a signal (SIGTERM by default) to a process identified by its PID. The related kill commands in Linux are as follows:

  • killall: Kills all processes with a specific name.
  • pkill: Similar to killall, but it matches process names by pattern, so a partial name is enough.
  • xkill: Lets users kill a graphical application by clicking on its window.
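
Sending a signal to another process by PID can be sketched in Python as well. The example below starts a throwaway child process that sleeps, sends it SIGTERM with os.kill, and returns the child's exit status. The function name is made up, and the code assumes a Unix-like system.

```python
import os
import signal
import subprocess
import sys


def terminate_child():
    """Start a sleeping child process, send it SIGTERM, return its exit code."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    # Equivalent to running `kill -TERM <pid>` from a shell.
    os.kill(child.pid, signal.SIGTERM)
    # On POSIX, a negative return code means "terminated by that signal".
    return child.wait(timeout=10)
```

On POSIX systems the returned value is the negated signal number, so a child ended by SIGTERM reports -15.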

What is cloud computing and how does it work?
Ans. Cloud computing refers to the on-demand availability of computer system resources, particularly data storage (cloud storage) and computing power, without the need for the user to actively manage them. The term generally describes data centres that are accessible via the Internet to a large number of people. In today’s large clouds, functions are often distributed from central servers across multiple locations. A server that is relatively close to the user may be designated an edge server.

How would you describe the roles and responsibilities of a perfect DevOps team?
The functions of an ideal DevOps team are impossible to pinpoint. We all know that the DevOps team is responsible for bridging the gap between development and operations, as well as contributing to continuous delivery.

To begin, the DevOps team should be communicative, knowledgeable about automation, and knowledgeable about the tools used to build CI/CD pipelines.

They should also be capable of handling small, frequent code releases that focus on narrowing the scope of functionality.

What is observability, and how can organisations improve their systems’ observability?
Observability is essentially about how an organization measures and instruments its systems.
To increase an organization’s observability, you must:

Recognize the various types of data that flow from a given environment, as well as which of those data types are relevant and useful to your observability objectives.

Get a clear picture of what the team cares about, then work out how your strategy makes sense of the data by distilling, curating, and transforming it into actionable insights into the performance of your systems.

Observability can provide valuable information about a company’s DevOps maturity.

What is DHCP and how does it work?
Ans. The Dynamic Host Configuration Protocol (DHCP) is a network management protocol that dynamically assigns an IP address and other network configuration parameters to each device on an IP network, so that the devices can communicate with other IP networks.

DHCP is used for the following purposes:

  1. Obtaining IP addresses and networking parameters from the Internet service provider (ISP) automatically.
  2. Removing the need for a network administrator or user to assign IP addresses to all network devices manually.
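
The idea of dynamic assignment can be shown with a toy address pool. The sketch below is not the DHCP protocol (it has no DISCOVER, OFFER, REQUEST, or ACK messages and no lease timers); it only models a server handing out addresses from a range and remembering which client holds which one. The class name, method names, and addresses are made up for the example.

```python
import ipaddress


class LeasePool:
    """Toy address pool: hands out one address per client MAC address."""

    def __init__(self, network):
        # hosts() skips the network and broadcast addresses of the subnet.
        self._free = list(ipaddress.ip_network(network).hosts())
        self._leases = {}

    def request(self, mac):
        """Return the client's existing lease, or assign the next free address."""
        if mac not in self._leases:
            if not self._free:
                raise RuntimeError("address pool exhausted")
            self._leases[mac] = self._free.pop(0)
        return self._leases[mac]

    def release(self, mac):
        """Return the client's address to the pool, if it holds one."""
        address = self._leases.pop(mac, None)
        if address is not None:
            self._free.append(address)
```

A client that asks twice gets the same address back, and an address only returns to the pool once it is released, which is roughly the bookkeeping a DHCP server does with leases.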

What exactly is the distinction between SNAT and DNAT?
SNAT (Source Network Address Translation) is a method for allowing traffic from a private network to reach the internet.

DNAT (Destination Network Address Translation) is a technique for transparently changing the destination IP address of a packet en route, while also performing the inverse translation on any replies. This packet transformation can be performed by any router between the two endpoints.


We have an outside world and an inside world on either side of a NAT device. When the inside world communicates with the outside world, SNAT occurs. When the outside world and the inside world communicate, DNAT occurs.

Static SNAT is when multiple internal private IP addresses are translated to a single public IP address. Dynamic SNAT is when a large number of internal private IP addresses are translated to a large number of public IP addresses.

Source NAT modifies the source address in the packet’s IP header. Destination NAT modifies the destination address in the packet’s IP header.

With SNAT, multiple hosts on the “inside” can connect to any host on the “outside”. With DNAT, multiple hosts on the “outside” can connect to a designated host on the “inside”.
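
The difference comes down to which header field is rewritten. The sketch below models a packet as a source and destination address only; the Packet class, the function names, and the addresses are illustrative, and a real NAT device would also track connections and usually rewrite ports.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str


def snat(packet, public_ip):
    """Source NAT: rewrite the source address of an outbound packet."""
    return replace(packet, src=public_ip)


def dnat(packet, internal_ip):
    """Destination NAT: rewrite the destination address of an inbound packet."""
    return replace(packet, dst=internal_ip)


# Inside host 10.0.0.5 reaches an outside server through public IP 203.0.113.10.
outbound = snat(Packet(src="10.0.0.5", dst="198.51.100.7"), "203.0.113.10")

# An outside client reaches the inside host through the same public IP.
inbound = dnat(Packet(src="198.51.100.7", dst="203.0.113.10"), "10.0.0.5")
```

In both cases the untouched field stays as it was: SNAT leaves the destination alone, and DNAT leaves the source alone.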

How are you going to keep your Docker containers safe?
Follow these guidelines to secure your Docker containers:

  • Carefully select third-party container images.
  • Enable Docker Content Trust.
  • Limit the resources available to your containers.
  • Consider using a third-party security tool.
  • Run Docker Bench for Security to audit your configuration.
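
Two of these guidelines, content trust and resource limits, can be applied when a container is started. The sketch below only builds the docker run command and its environment; it does not run anything. The function name, the default limits, and the image name in the usage note are assumptions, while --memory, --cpus, --pids-limit, and the DOCKER_CONTENT_TRUST variable are standard Docker CLI options.

```python
import os


def hardened_run_command(image, memory="256m", cpus="0.5", pids_limit=100):
    """Build a `docker run` command with resource limits and content trust."""
    command = [
        "docker", "run", "--rm",
        "--memory", memory,               # cap the container's memory
        "--cpus", cpus,                   # cap the container's CPU share
        "--pids-limit", str(pids_limit),  # cap the number of processes
        image,
    ]
    # DOCKER_CONTENT_TRUST=1 makes the CLI refuse unsigned images.
    env = dict(os.environ, DOCKER_CONTENT_TRUST="1")
    return command, env
```

The result can be passed to subprocess.run(command, env=env, check=True), for example with a hypothetical image name such as example/app:1.0.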

Can you explain the best SRE tools for each DevOps stage?
The following SRE tools are appropriate for each stage of DevOps:

  • Plan: Jira, Pivotal Tracker, and other task management tools
  • Create: Source-control tools like GitHub
  • Verify: CI/CD tools like Jenkins or CircleCI
  • Package: Container orchestration services like Kubernetes or Mesosphere
  • Configure: Tools like Terraform and Ansible

What do you think the most important skills are for a DevOps/SRE role?
Ans. By asking this question, you’re basically asking the interviewee to describe the “ideal” DevOps/SRE hire. By enquiring about both hard and soft skills, you can provide some additional context. Encourage the candidate to discuss their level of expertise in a particular skill when they speak about it. It’s now the candidate’s turn to elaborate and present their case.

Conclusion: You should have realized from the above SRE interview questions and answers that you will need both practical and theoretical knowledge to pass an SRE interview.