{"id":29588,"date":"2022-04-17T05:51:59","date_gmt":"2022-04-17T05:51:59","guid":{"rendered":"https:\/\/www.devopsschool.com\/blog\/?p=29588"},"modified":"2022-12-23T06:19:36","modified_gmt":"2022-12-23T06:19:36","slug":"what-is-toil-with-sre-perspective","status":"publish","type":"post","link":"https:\/\/www.devopsschool.com\/blog\/what-is-toil-with-sre-perspective\/","title":{"rendered":"What is Toil from an SRE perspective?"},"content":{"rendered":"\n\n\n<h2 class=\"wp-block-heading\">What is Toil?<\/h2>\n\n\n\n<p>In everyday English, toil means exhausting physical labour: working extremely hard or incessantly.<\/p>\n\n\n\n<p><strong>Time Off in Lieu<\/strong><\/p>\n\n\n\n<p>The acronym TOIL has a separate meaning in employment. Time off in lieu, otherwise known as TOIL, is when an employer offers time off to workers who have worked beyond their contracted hours. Essentially, it serves as an alternative to overtime pay, meaning that any overtime hours worked by an employee can be taken as part of their annual leave. This sense is unrelated to the SRE meaning of toil described in the rest of this post.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"878\" height=\"846\" src=\"https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/04\/sre-toil-5.png\" alt=\"\" class=\"wp-image-29591\" srcset=\"https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/04\/sre-toil-5.png 878w, https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/04\/sre-toil-5-300x289.png 300w, https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/04\/sre-toil-5-768x740.png 768w\" sizes=\"auto, (max-width: 878px) 100vw, 878px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">What is Toil from an SRE perspective?<\/h2>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"637\" src=\"https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/04\/sre-toil-1-1024x637.jpg\" alt=\"\" class=\"wp-image-29592\" srcset=\"https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/04\/sre-toil-1-1024x637.jpg 1024w, 
https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/04\/sre-toil-1-300x187.jpg 300w, https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/04\/sre-toil-1-768x478.jpg 768w, https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/04\/sre-toil-1.jpg 1027w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.<\/p>\n\n\n\n<p>Toil is a term coined by Google to describe tedious, repetitive tasks associated with running a production environment. For Site Reliability Engineering (SRE) teams, the aim is to reduce or even eliminate toil in order to maximize the time spent on engineering and innovation.<\/p>\n\n\n\n<p>If teams spend the majority of their time on these types of tasks, they have less time for high-value work. As a consequence, operational costs rise and the focus becomes more reactive than proactive. This inhibits innovation.<\/p>\n\n\n\n<p>When software engineers write code, they want it to be simple, fast, and reliable. We refer to this as being free of \u201cbugs and cruft.\u201d SREs want the same thing for operations. In the realm of operations, \u201ccruft and bugs\u201d can be described by one word: toil. Put simply, toil is operational work that produces no enduring value.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Manual<\/h3>\n\n\n\n<p>This includes work such as manually running a script that automates some task. 
Running a script may be quicker than manually executing each step in the script, but the hands-on time a human spends running that script (not the elapsed time) is still toil time.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Repetitive<\/h3>\n\n\n\n<p>If you\u2019re performing a task for the first time ever, or even the second time, this work is not toil. Toil is work you do over and over. If you\u2019re solving a novel problem or inventing a new solution, this work is not toil.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Automatable<\/h3>\n\n\n\n<p>If a machine could accomplish the task just as well as a human, or the need for the task could be designed away, that task is toil. If human judgment is essential for the task, there\u2019s a good chance it\u2019s not toil.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Tactical<\/h3>\n\n\n\n<p>Toil is interrupt-driven and reactive, rather than strategy-driven and proactive. Handling pager alerts is toil. We may never be able to eliminate this type of work completely, but we have to continually work toward minimizing it.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">No enduring value<\/h3>\n\n\n\n<p>If your service remains in the same state after you have finished a task, the task was probably toil. If the task produced a permanent improvement in your service, it probably wasn\u2019t toil, even if some amount of grunt work\u2014such as digging into legacy code and configurations and straightening them out\u2014was involved.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">O(n) with service growth<\/h3>\n\n\n\n<p>If the work involved in a task scales up linearly with service size, traffic volume, or user count, that task is probably toil. 
An ideally managed and designed service can grow by at least one order of magnitude with zero additional work, other than some one-time efforts to add resources.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"699\" height=\"353\" src=\"https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/04\/toil-sre-2.png\" alt=\"\" class=\"wp-image-29593\" srcset=\"https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/04\/toil-sre-2.png 699w, https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/04\/toil-sre-2-300x152.png 300w\" sizes=\"auto, (max-width: 699px) 100vw, 699px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">What is an example of toil in SRE?<\/h2>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"733\" height=\"278\" src=\"https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/04\/toil-sre-3.png\" alt=\"\" class=\"wp-image-29594\" srcset=\"https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/04\/toil-sre-3.png 733w, https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/04\/toil-sre-3-300x114.png 300w\" sizes=\"auto, (max-width: 733px) 100vw, 733px\" \/><\/figure>\n\n\n\n<p>Some examples of toil include:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Handling quota requests<\/li><li>Applying database schema changes<\/li><li>Reviewing non-critical monitoring alerts<\/li><li>Copying and pasting commands from a playbook<\/li><\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">For the individual, high levels of toil lead to:<\/h2>\n\n\n\n<ul class=\"wp-block-list\"><li>Discontent and a lack of any sense of accomplishment<\/li><li>Burnout<\/li><li>More errors, leading to time-consuming rework<\/li><li>No time to learn new skills<\/li><li>Career stagnation (hurt by a lack of opportunity to deliver value-adding 
projects)<\/li><\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">For the organization, high levels of toil lead to:<\/h2>\n\n\n\n<ul class=\"wp-block-list\"><li>Constant shortages of team capacity<\/li><li>Excessive operational support costs<\/li><li>Inability to make progress on strategic initiatives (the \u201ceverybody is busy, but nothing is getting done\u201d syndrome)<\/li><li>Inability to retain top talent (and to attract it once word gets out about how the organization functions)<\/li><\/ul>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"963\" height=\"695\" src=\"https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/04\/sre-toil-4.png\" alt=\"\" class=\"wp-image-29595\" srcset=\"https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/04\/sre-toil-4.png 963w, https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/04\/sre-toil-4-300x217.png 300w, https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/04\/sre-toil-4-768x554.png 768w\" sizes=\"auto, (max-width: 963px) 100vw, 963px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"699\" height=\"353\" src=\"https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/04\/toil-sre-2-1.png\" alt=\"\" class=\"wp-image-29596\" srcset=\"https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/04\/toil-sre-2-1.png 699w, https:\/\/www.devopsschool.com\/blog\/wp-content\/uploads\/2022\/04\/toil-sre-2-1-300x152.png 300w\" sizes=\"auto, (max-width: 699px) 100vw, 699px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">How to reduce toil<\/h2>\n\n\n\n<p>SRE teams can reduce the cost of toil in many ways. The following six techniques will help your IT organization keep toil under control.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Standardize<\/h3>\n\n\n\n<p>A lack of standardization leads to a more complex IT platform, which then increases toil. 
Minimize the number of IT platforms in place, for example by consolidating different types of Unix, different versions of Windows Server and multiple separate hardware suppliers. Also, look for duplicated functionality. Multiple applications that carry out the same functions (for example, overlapping customer relationship management and sales force automation applications) increase the complexity, and therefore the toil, of the environment. Standardization also makes the platform easier to manage as the other steps are taken.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Reuse<\/h3>\n\n\n\n<p>Many toil tasks are repetitive. Therefore, once a fix is found for a task, engineers should reuse it whenever the same task recurs, even on a different part of the platform. A library of callable scripts will help reduce toil. Increasingly, tools used in SRE come with preexisting libraries that cover the most common areas.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Monitor<\/h3>\n\n\n\n<p>Triage, also called firefighting, is the worst thing that can happen to an IT platform. A problem that affects users harms the business, gives the business a negative perception of IT and encourages responders to cut corners. Operations teams must institute a solid procedure for monitoring the entire IT platform: a system that can identify possible problems before they become issues and can then initiate events to fix them.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Automate<\/h3>\n\n\n\n<p>Humans are, unfortunately, often the root of problems in the IT environment. Unchecked changes can cascade into catastrophic issues across the platform. Therefore, look to systems that check any change before implementation, automate that change and roll it back if any problems are identified post-deployment.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Improve<\/h3>\n\n\n\n<p>Poor code leads to more problems, which means more toil. 
Use an integrated DevOps approach with solid testing to improve initial code quality, and set up automated feedback loops between operations and development that raise any identified issues along with an indication of their priority for fixing.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Embrace new technologies<\/h3>\n\n\n\n<p>Adopt them, but not too fast, and don&#8217;t assume they will remedy all problems. Machine learning, deep learning and AI will increasingly improve SRE capabilities but are still at an early stage of maturation in the market. However, waiting until they are fully proven means your organization keeps paying in toil. Introduce such technologies in small, defined areas and judge their effectiveness. Organizations can then begin to roll them out across the total platform as faith in their capabilities grows.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Toil Identification Checklist in SRE<\/h2>\n\n\n\n<p>Toil is work you do over and over. If you&#8217;re solving a novel problem or inventing a new solution, this work is not toil. Toil is also automatable. 
If a machine could accomplish the task just as well as a human, or the need for the task could be designed away, that task is toil.<\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-wp-embed is-provider-devopsschool-com wp-block-embed-devopsschool-com\"><div class=\"wp-block-embed__wrapper\">\n<blockquote class=\"wp-embedded-content\" data-secret=\"edBJbTWoFK\"><a href=\"https:\/\/www.devopsschool.com\/blog\/toil-identification-checklist-survery-question-in-sre\/\">Toil Identification Checklist &#038; Survery Question in SRE<\/a><\/blockquote><iframe loading=\"lazy\" class=\"wp-embedded-content\" sandbox=\"allow-scripts\" security=\"restricted\" style=\"position: absolute; clip: rect(1px, 1px, 1px, 1px);\" title=\"&#8220;Toil Identification Checklist &#038; Survery Question in SRE&#8221; &#8212; DevOpsSchool.com\" src=\"https:\/\/www.devopsschool.com\/blog\/toil-identification-checklist-survery-question-in-sre\/embed\/#?secret=oJjdQFl52L#?secret=edBJbTWoFK\" data-secret=\"edBJbTWoFK\" width=\"600\" height=\"338\" frameborder=\"0\" marginwidth=\"0\" marginheight=\"0\" scrolling=\"no\"><\/iframe>\n<\/div><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Reference<\/h2>\n\n\n\n<ul class=\"wp-block-list\"><li>https:\/\/www.devopsschool.com\/blog\/toil-identification-checklist-survery-question-in-sre\/<\/li><li>https:\/\/sre.google\/sre-book\/eliminating-toil\/<\/li><li>https:\/\/cloud.google.com\/blog\/products\/management-tools\/identifying-and-tracking-toil-using-sre-principles<\/li><li>https:\/\/www.rundeck.com\/blog\/toil-finally-a-name-for-a-problem<\/li><li>https:\/\/www.devopsschool.com\/blog\/sre-site-reliability-engineering-summary\/<\/li><\/ul>\n","protected":false},"excerpt":{"rendered":"<p>What is Toil? Time Off in Lieu Exhausting Physical Labour. Work extremely hard or incessantly. 
Time off in lieu, otherwise known as TOIL, is when an employer offers time off&#8230; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_joinchat":[],"footnotes":""},"categories":[2],"tags":[],"class_list":["post-29588","post","type-post","status-publish","format-standard","hentry","category-uncategorised"],"_links":{"self":[{"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/29588","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/comments?post=29588"}],"version-history":[{"count":3,"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/29588\/revisions"}],"predecessor-version":[{"id":29599,"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/29588\/revisions\/29599"}],"wp:attachment":[{"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/media?parent=29588"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/categories?post=29588"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/tags?post=29588"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}