What is Toil from an SRE perspective?

What is Toil?

In everyday English, toil means exhausting physical labour: to work extremely hard or incessantly.

In human resources, TOIL stands for time off in lieu. It is time off that an employer offers to workers who have gone above and beyond their contracted hours. Essentially, it serves as an alternative to overtime pay, meaning that overtime hours worked by an employee can be taken as leave.

What is Toil from an SRE perspective?

Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.

Toil is a term coined by Google to describe tedious, repetitive tasks associated with running a production environment. For Site Reliability Engineering (SRE) teams, the aim is to reduce or even eliminate toil in order to maximize the time spent on engineering and innovation.

If teams spend the majority of their time on these types of tasks, they have less time for high-value work. As a consequence, operational costs rise and the focus becomes more reactive than proactive. This inhibits innovation.

When software engineers write code, they want it to be simple, fast, and reliable. We refer to this as being free of “bugs and cruft”. SREs want the same thing for operations. In the realm of operations, “cruft and bugs” can be described by one word: toil. In short, toil is operational effort that produces no enduring value. The sections below look at each of its characteristics in turn.

Manual

This includes work such as manually running a script that automates some task. Running a script may be quicker than manually executing each step in the script, but the hands-on time a human spends running that script (not the elapsed time) is still toil time.

Repetitive

If you’re performing a task for the first time ever, or even the second time, this work is not toil. Toil is work you do over and over. If you’re solving a novel problem or inventing a new solution, this work is not toil.

Automatable

If a machine could accomplish the task just as well as a human, or the need for the task could be designed away, that task is toil. If human judgment is essential for the task, there’s a good chance it’s not toil.

Tactical

Toil is interrupt-driven and reactive, rather than strategy-driven and proactive. Handling pager alerts is toil. We may never be able to eliminate this type of work completely, but we have to continually work toward minimizing it.

No enduring value

If your service remains in the same state after you have finished a task, the task was probably toil. If the task produced a permanent improvement in your service, it probably wasn’t toil, even if some amount of grunt work—such as digging into legacy code and configurations and straightening them out—was involved.

O(n) with service growth

If the work involved in a task scales up linearly with service size, traffic volume, or user count, that task is probably toil. An ideally managed and designed service can grow by at least one order of magnitude with zero additional work, other than some one-time efforts to add resources.
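To make the scaling argument concrete, here is a minimal sketch that estimates weekly hands-on time for a manual step and how quickly a one-time automation effort would pay for itself. The function names and all figures (15 minutes per service, 40 hours to automate) are made-up assumptions for illustration, not measurements.

```python
def weekly_toil_hours(services: int, minutes_per_service: float = 15.0) -> float:
    """Hands-on hours per week when every service needs the same manual step."""
    return services * minutes_per_service / 60.0


def weeks_to_break_even(services: int, automation_cost_hours: float = 40.0,
                        minutes_per_service: float = 15.0) -> float:
    """Weeks of manual work that equal the assumed one-time automation cost."""
    if services <= 0:
        raise ValueError("services must be positive")
    return automation_cost_hours / weekly_toil_hours(services, minutes_per_service)


# Manual effort grows linearly with the number of services (O(n)),
# so the automation pays for itself faster as the service grows.
for n in (10, 100, 1000):
    print(n, weekly_toil_hours(n), weeks_to_break_even(n))
```

Under these assumptions, the manual step costs ten times more every time the service count grows tenfold, while the automation cost stays fixed.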

What are examples of toil in SRE?

Some examples of toil may include:

  • Handling quota requests
  • Applying database schema changes
  • Reviewing non-critical monitoring alerts
  • Copying and pasting commands from a playbook

These differ from the everyday sense of the verb “to toil”, which simply means to work hard and long, or to proceed with laborious effort.

For the individual, high levels of toil lead to:

  • Discontent and a lack of feeling of accomplishment
  • Burnout
  • More errors, leading to time-consuming rework to fix
  • No time to learn new skills
  • Career stagnation (hurt by a lack of opportunity to deliver value-adding projects)

For the organization, high levels of toil lead to:

  • Constant shortages of team capacity
  • Excessive operational support costs
  • Inability to make progress on strategic initiatives (the “everybody is busy, but nothing is getting done” syndrome)
  • Inability to retain top talent (and acquire top talent once word gets out about how the organization functions)

How to reduce toil

SRE teams can minimize the cost of toil in many ways. The following six techniques will help your IT organization reduce it.

Standardize

A lack of standardization leads to a more complex IT platform, which then increases toil. Minimize the number of IT platforms in place, for example by consolidating different types of Unix, different versions of Windows Server and multiple separate hardware suppliers. Also, look for duplicated functions. Multiple applications that carry out the same function, such as overlapping customer relationship management and sales force automation applications, increase the complexity, and therefore the toil, of the environment. Standardization also makes the platform easier to manage as the other steps are taken.

Reuse

Many toil tasks are repetitive. Therefore, once a fix is found for a task, engineers should apply it repeatedly to the same task, even on a different part of the platform. A library of callable scripts will help reduce toil. Increasingly, many tools used in SRE come with preexisting libraries that cover the most common areas.
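One way to picture a library of callable scripts is a small registry that maps task names to functions, so a fix written once can be invoked anywhere. This is a minimal sketch only: the names `RUNBOOK`, `runbook_task`, and `run`, and the two placeholder tasks, are hypothetical and do not come from any particular SRE tool.

```python
from typing import Callable, Dict

# Hypothetical registry: task name -> reusable function.
RUNBOOK: Dict[str, Callable[..., str]] = {}


def runbook_task(name: str):
    """Decorator that registers a function under a task name."""
    def register(func: Callable[..., str]) -> Callable[..., str]:
        RUNBOOK[name] = func
        return func
    return register


@runbook_task("rotate-logs")
def rotate_logs(host: str) -> str:
    # Placeholder body: a real task would call the log rotation tooling.
    return f"rotated logs on {host}"


@runbook_task("clear-tmp")
def clear_tmp(host: str) -> str:
    # Placeholder body: a real task would remove stale temporary files.
    return f"cleared /tmp on {host}"


def run(name: str, *args, **kwargs) -> str:
    """Look up a registered task by name and execute it."""
    if name not in RUNBOOK:
        raise KeyError(f"unknown runbook task: {name}")
    return RUNBOOK[name](*args, **kwargs)


print(run("rotate-logs", "web-01"))
```

Registering tasks by name keeps each fix in one place, so the same function can be applied to a different part of the platform by changing only its arguments.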

Monitor

Triage, also called firefighting, is the worst thing that can happen to an IT platform. A problem that affects users harms the business, creates a negative perception of IT within the business, and encourages responders to cut corners. Operations teams must institute a solid procedure for monitoring the entire IT platform: a system that can identify possible problems before they become issues and can then initiate events to fix them.
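The idea of spotting a possible problem early and initiating a fix can be sketched as a set of checks, each with a warning threshold and a remediation action. Everything here is an assumption for illustration: the `Check` type, the `evaluate` function, the metric names, the thresholds, and the hard-coded readings stand in for a real monitoring system.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Check:
    name: str
    read_value: Callable[[], float]   # returns the current metric value
    warn_at: float                    # threshold set below the failure point
    remediate: Callable[[], None]     # action to run when the threshold is hit


def evaluate(checks: List[Check]) -> List[str]:
    """Run each check and trigger remediation for those at or over threshold."""
    triggered = []
    for check in checks:
        if check.read_value() >= check.warn_at:
            check.remediate()
            triggered.append(check.name)
    return triggered


# Hard-coded readings stand in for real metrics in this sketch.
actions: List[str] = []
checks = [
    Check("disk-usage", lambda: 0.87, 0.80, lambda: actions.append("purge old logs")),
    Check("queue-depth", lambda: 120.0, 500.0, lambda: actions.append("add workers")),
]
triggered = evaluate(checks)
print(triggered, actions)
```

Setting the warning threshold below the point of failure is what lets the remediation run before users notice a problem.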

Automate

Humans are, unfortunately, often the root of problems in the IT environment. Unchecked changes can domino into catastrophic issues across the platform. Therefore, look to systems that check any change before implementation, automate that change and roll back if any problems are identified post-deployment.
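The check, apply, and roll back sequence described above might look like the following minimal sketch. The `apply_change` helper, the in-memory `config` dictionary, and the health rule (a limit of 1000 connections) are hypothetical stand-ins for a real change-management system.

```python
from typing import Callable


def apply_change(validate: Callable[[], bool], apply: Callable[[], None],
                 healthy: Callable[[], bool], rollback: Callable[[], None]) -> str:
    """Check a change before applying it, then roll back if health checks fail."""
    if not validate():
        return "rejected"
    apply()
    if healthy():
        return "applied"
    rollback()
    return "rolled back"


# In-memory configuration used only for this illustration.
config = {"max_connections": 100}


def set_limit(value: int) -> str:
    previous = config["max_connections"]
    return apply_change(
        validate=lambda: isinstance(value, int) and value > 0,
        apply=lambda: config.update(max_connections=value),
        # Assumed health rule: the service cannot cope with more than 1000.
        healthy=lambda: config["max_connections"] <= 1000,
        rollback=lambda: config.update(max_connections=previous),
    )


print(set_limit(200), config)
print(set_limit(5000), config)
```

Capturing the previous value before the change is applied is what makes the rollback possible without human intervention.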

Improve

Poor code leads to more problems, which means more toil. Use an integrated DevOps approach with solid testing to improve initial code quality, with automated feedback loops between operations and development to raise any identified issues, along with indications of priority for fixing.

Embrace new technologies

Embrace new technologies, but not too fast, and don’t assume they will remedy all problems. Machine learning, deep learning and AI will increasingly improve SRE capabilities but are still at an early stage of maturation in the market. However, waiting until they are 100% proven will cost your organization in toil levels. Introduce such technologies in small, defined areas and judge their effectiveness. Organizations can then begin to roll them out across the total platform as faith in their capabilities grows.

Toil Identification Checklist in SRE

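A task is probably toil if you can answer yes to most of these questions:

  • Is it manual, requiring hands-on human time?
  • Is it repetitive, something you do over and over rather than a novel problem or a new solution?
  • Is it automatable, meaning a machine could accomplish it just as well as a human, or the need for it could be designed away?
  • Is it tactical, that is, interrupt-driven and reactive?
  • Does it leave the service in the same state, with no enduring value?
  • Does the work scale linearly with service size, traffic volume, or user count?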
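As a rough aid, the six characteristics of toil described earlier can be expressed as a small scoring helper. This is only a sketch: the `TaskTraits` type, the `toil_score` function, and the threshold of four matching traits are assumptions chosen for illustration, not an established rule.

```python
from dataclasses import dataclass, fields


@dataclass
class TaskTraits:
    manual: bool
    repetitive: bool
    automatable: bool
    tactical: bool
    no_enduring_value: bool
    scales_with_growth: bool


def toil_score(traits: TaskTraits) -> int:
    """Count how many of the six toil characteristics the task shows."""
    return sum(bool(getattr(traits, f.name)) for f in fields(traits))


def looks_like_toil(traits: TaskTraits, threshold: int = 4) -> bool:
    """Assumed rule of thumb: flag the task when most traits apply."""
    return toil_score(traits) >= threshold


# Example: handling quota requests by hand, one ticket at a time.
quota_requests = TaskTraits(
    manual=True, repetitive=True, automatable=True,
    tactical=True, no_enduring_value=True, scales_with_growth=True,
)
# Example: designing a new capacity-planning tool.
new_tool_design = TaskTraits(
    manual=True, repetitive=False, automatable=False,
    tactical=False, no_enduring_value=False, scales_with_growth=False,
)
print(toil_score(quota_requests), looks_like_toil(quota_requests))
print(toil_score(new_tool_design), looks_like_toil(new_tool_design))
```

A score like this is a conversation starter for a team review rather than a verdict, since whether human judgment is essential to a task still has to be decided by people.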

Reference

  • https://www.devopsschool.com/blog/toil-identification-checklist-survery-question-in-sre/
  • https://sre.google/sre-book/eliminating-toil/
  • https://cloud.google.com/blog/products/management-tools/identifying-and-tracking-toil-using-sre-principles
  • https://www.rundeck.com/blog/toil-finally-a-name-for-a-problem
  • https://www.devopsschool.com/blog/sre-site-reliability-engineering-summary/
Rajesh Kumar