{"id":74774,"date":"2026-04-15T17:58:04","date_gmt":"2026-04-15T17:58:04","guid":{"rendered":"https:\/\/www.devopsschool.com\/blog\/head-of-infrastructure-role-blueprint-responsibilities-skills-kpis-and-career-path\/"},"modified":"2026-04-15T17:58:04","modified_gmt":"2026-04-15T17:58:04","slug":"head-of-infrastructure-role-blueprint-responsibilities-skills-kpis-and-career-path","status":"publish","type":"post","link":"https:\/\/www.devopsschool.com\/blog\/head-of-infrastructure-role-blueprint-responsibilities-skills-kpis-and-career-path\/","title":{"rendered":"Head of Infrastructure: Role Blueprint, Responsibilities, Skills, KPIs, and Career Path"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">1) Role Summary<\/h2>\n\n\n\n<p>The Head of Infrastructure is the senior leader accountable for the reliability, scalability, security, and cost-effectiveness of the company\u2019s production and corporate infrastructure platforms. This role establishes the infrastructure strategy and operating model, leads infrastructure engineering and operations teams, and ensures the infrastructure enables product delivery at the required performance and availability levels.<\/p>\n\n\n\n<p>This role exists in software and IT organizations because infrastructure is a core product dependency: customer experience, developer productivity, and security posture are all constrained by infrastructure design and operational maturity. 
The Head of Infrastructure creates business value by improving uptime and incident outcomes, accelerating delivery through standardized platforms and automation, reducing unit costs (e.g., cost per customer\/transaction), enabling compliance readiness, and building resilient systems that support revenue growth.<\/p>\n\n\n\n<p>Role horizon: <strong>Current<\/strong> (with increasing emphasis on platform engineering, FinOps, and security-by-design).<\/p>\n\n\n\n<p>Typical interaction map includes: Product Engineering, SRE\/Operations, Security (AppSec\/SecOps\/GRC), Architecture, Data\/ML, IT\/Corporate Systems, Finance (FinOps), Procurement\/Vendor Management, Customer Support\/Success, and Executive Leadership.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">2) Role Mission<\/h2>\n\n\n\n<p><strong>Core mission:<\/strong><br\/>\nProvide a resilient, secure, scalable, and cost-efficient infrastructure foundation\u2014delivered through clear standards, automation, and operational excellence\u2014so that product teams can ship safely and customers experience reliable performance.<\/p>\n\n\n\n<p><strong>Strategic importance:<\/strong><br\/>\nInfrastructure is the runtime and delivery backbone for software products. As the organization scales, unmanaged infrastructure complexity becomes a primary source of outages, security risk, delivery friction, and runaway cloud spend. 
The Head of Infrastructure is accountable for preventing infrastructure from becoming a growth limiter.<\/p>\n\n\n\n<p><strong>Primary business outcomes expected:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Sustained high availability and performance of customer-facing services<\/li>\n<li>Reduced incident frequency, severity, and time-to-recovery<\/li>\n<li>Predictable capacity and resilient disaster recovery capability<\/li>\n<li>Secure-by-default infrastructure and improved audit\/compliance readiness<\/li>\n<li>Improved developer experience via self-service platform capabilities<\/li>\n<li>Transparent infrastructure economics and reduced waste (FinOps maturity)<\/li>\n<li>A healthy infrastructure organization with strong on-call practices and talent development<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">3) Core Responsibilities<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Strategic responsibilities<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Infrastructure strategy and roadmap:<\/strong> Define a multi-year infrastructure and platform strategy aligned to product growth, reliability targets, security posture, and financial constraints; maintain a prioritized roadmap with measurable outcomes.<\/li>\n<li><strong>Target architecture and standards:<\/strong> Establish reference architectures and engineering standards for compute, networking, storage, observability, identity, secrets, and CI\/CD infrastructure.<\/li>\n<li><strong>Platform operating model:<\/strong> Determine which capabilities are centralized vs embedded (platform teams, SRE, ops), including service ownership, support boundaries, and escalation models.<\/li>\n<li><strong>Reliability and resilience strategy:<\/strong> Define availability targets and service tiering (e.g., Tier 0\/1\/2), including multi-region strategy, disaster recovery posture, and resilience patterns.<\/li>\n<li><strong>Cloud and vendor strategy:<\/strong> Select cloud approach (single vs multi-cloud), negotiate 
vendor contracts, oversee managed services adoption, and manage vendor performance.<\/li>\n<li><strong>Cost strategy (FinOps):<\/strong> Create cost allocation\/tagging standards, unit cost models, budgeting\/forecasting mechanisms, and optimization initiatives linked to product metrics.<\/li>\n<li><strong>Security and compliance alignment:<\/strong> Partner with Security\/GRC to ensure infrastructure meets policy and compliance needs (e.g., SOC 2, ISO 27001, HIPAA, or PCI DSS, depending on context).<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Operational responsibilities<\/h3>\n\n\n\n<ol class=\"wp-block-list\" start=\"8\">\n<li><strong>Service availability and operations:<\/strong> Own accountability for 24\/7 production reliability and operational health of infrastructure platforms; ensure clear on-call rotations and escalation paths.<\/li>\n<li><strong>Incident management leadership:<\/strong> Ensure incident response processes exist (triage, comms, postmortems), with consistent execution and measurable improvements.<\/li>\n<li><strong>Change management:<\/strong> Implement safe change practices (progressive delivery, maintenance windows where necessary, approval policies for high-risk changes) and reduce the change failure rate.<\/li>\n<li><strong>Capacity and performance management:<\/strong> Build capacity planning processes, load\/performance testing support, and proactive scaling triggers; eliminate bottlenecks before they impact customers.<\/li>\n<li><strong>Business continuity and disaster recovery:<\/strong> Own DR planning, testing schedules, runbooks, and recovery objectives; ensure predictable recoverability and evidence of testing.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Technical responsibilities<\/h3>\n\n\n\n<ol class=\"wp-block-list\" start=\"13\">\n<li><strong>Infrastructure-as-Code and automation:<\/strong> Drive IaC adoption and maturity (provisioning, configuration, policy-as-code) to improve speed, consistency, and 
auditability.<\/li>\n<li><strong>Kubernetes\/container platform leadership (where applicable):<\/strong> Ensure a secure, well-governed container ecosystem with clear cluster lifecycle management, upgrades, and multi-tenancy controls.<\/li>\n<li><strong>Observability platform ownership:<\/strong> Ensure robust monitoring, logging, tracing, and alerting; establish SLOs\/SLIs and actionable alert standards that reduce noise.<\/li>\n<li><strong>Identity, access, and secrets management:<\/strong> Ensure secure IAM patterns, least privilege access, centralized secrets management, and strong operational controls for privileged access.<\/li>\n<li><strong>Network and edge services:<\/strong> Oversee network architecture, DNS\/CDN, ingress, WAF, VPN\/zero-trust connectivity, and connectivity reliability for internal and external services.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Cross-functional or stakeholder responsibilities<\/h3>\n\n\n\n<ol class=\"wp-block-list\" start=\"18\">\n<li><strong>Developer experience enablement:<\/strong> Partner with engineering leaders to reduce friction (environment provisioning, CI reliability, standardized pipelines, golden paths, documentation).<\/li>\n<li><strong>Customer-impact communication:<\/strong> Ensure timely, accurate infrastructure communications for incidents and changes (status pages, customer comms via Support\/Success), especially for material outages.<\/li>\n<li><strong>Executive reporting and decision support:<\/strong> Provide infrastructure health reporting: reliability, risk, cost, capacity, and roadmap progress; communicate trade-offs and recommend decisions.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Governance, compliance, or quality responsibilities<\/h3>\n\n\n\n<ol class=\"wp-block-list\" start=\"21\">\n<li><strong>Policy and control implementation:<\/strong> Implement infrastructure controls (logging retention, backup standards, encryption, vulnerability management integration, patching 
SLAs).<\/li>\n<li><strong>Audit readiness and evidence:<\/strong> Ensure repeatable evidence collection for audits (access reviews, change logs, DR test results, asset inventory, vendor attestations).<\/li>\n<li><strong>Service catalog and ownership governance:<\/strong> Maintain clear ownership, service documentation standards, and lifecycle management for infrastructure components.<\/li>\n<li><strong>Data protection and privacy enablement (as applicable):<\/strong> Ensure infrastructure supports data residency, encryption, retention, and deletion requirements.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Leadership responsibilities<\/h3>\n\n\n\n<ol class=\"wp-block-list\" start=\"25\">\n<li><strong>Team leadership and talent development:<\/strong> Build and lead the infrastructure org (managers, tech leads, SRE, network, cloud engineering); set expectations and career paths, and provide coaching and succession planning.<\/li>\n<li><strong>Budget and resource management:<\/strong> Own the infrastructure budget (cloud spend, tooling, vendors, headcount), run portfolio trade-offs, and justify investments with ROI.<\/li>\n<li><strong>Culture and operational excellence:<\/strong> Establish a blameless, learning-focused culture with strong engineering discipline, documentation, and sustainable on-call health.<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">4) Day-to-Day Activities<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Daily activities<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Review production health dashboards (availability, latency, saturation, error rates), major alerts, and customer-impact signals.<\/li>\n<li>Triage and unblock critical infrastructure issues impacting product teams (CI instability, capacity constraints, access issues, cluster degradation).<\/li>\n<li>Approve or delegate high-risk changes (e.g., network changes, IAM policy changes, major upgrades) according to the change policy.<\/li>\n<li>Coordinate with 
Security on urgent vulnerabilities, zero-days, or policy enforcement actions.<\/li>\n<li>Engage with Engineering leaders on near-term delivery needs and platform bottlenecks.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Weekly activities<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Reliability review: incident summaries, SLO breaches, top alerts, and corrective action progress.<\/li>\n<li>Cost review: spend trends, anomaly detection, savings opportunities, reserved capacity\/commitments management.<\/li>\n<li>Platform roadmap grooming: reprioritize based on business needs, risk, and operational toil.<\/li>\n<li>Team leadership: 1:1s with managers\/leads, hiring pipeline review, performance coaching, on-call health check.<\/li>\n<li>Cross-functional sync: Architecture\/CTO staff meeting; Security sync; Support\/Success escalation review.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Monthly or quarterly activities<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Capacity planning cycle aligned to product roadmap and growth forecasts (traffic, tenants, regions).<\/li>\n<li>Disaster recovery exercise planning and execution; review RTO\/RPO achievement and remediation backlog.<\/li>\n<li>Vendor\/contract reviews: SLAs, spend, renewals, tooling consolidation, and negotiation strategy.<\/li>\n<li>Policy\/compliance readiness review: access reviews, control gaps, audit evidence checks.<\/li>\n<li>Organizational planning: headcount plan, skills gaps, L&amp;D plan, and succession coverage.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Recurring meetings or rituals<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Weekly: Infrastructure leadership staff meeting (delivery, incidents, risk, cost).<\/li>\n<li>Weekly\/biweekly: Change Advisory\/High-Risk Change review (context-specific; avoid bureaucracy where not needed).<\/li>\n<li>Biweekly\/monthly: FinOps review with Finance and engineering cost owners.<\/li>\n<li>Monthly: Reliability Council \/ SRE review 
with product engineering leadership.<\/li>\n<li>Quarterly: Executive infrastructure review (KPIs, roadmap outcomes, risk register, budget forecast).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Incident, escalation, or emergency work (when relevant)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Participate as Incident Commander or Executive Sponsor for SEV-1\/SEV-2 events.<\/li>\n<li>Ensure stakeholder communications: exec briefings, status page updates, customer comms alignment.<\/li>\n<li>Lead \u201cstop-the-line\u201d decisions when risk is high (e.g., rollback releases, disable features, freeze changes).<\/li>\n<li>Sponsor and enforce corrective actions from postmortems, ensuring improvements are implemented and validated.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">5) Key Deliverables<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Infrastructure strategy deck<\/strong> (12\u201324 month horizon) with investment themes, risk posture, and measurable targets.<\/li>\n<li><strong>Infrastructure roadmap and backlog<\/strong> with prioritization rationale (risk, ROI, developer productivity, reliability).<\/li>\n<li><strong>Reference architectures and standards<\/strong> (cloud landing zone, networking, IAM, secrets, encryption, backup).<\/li>\n<li><strong>Service tiering model<\/strong> (Tier 0\/1\/2) with availability targets, DR expectations, and support model.<\/li>\n<li><strong>SLO\/SLI framework<\/strong> for critical platform services (CI, Kubernetes, service mesh, databases if owned, observability).<\/li>\n<li><strong>Incident management playbooks<\/strong> and <strong>postmortem templates<\/strong>; recurring incident trend reports.<\/li>\n<li><strong>Disaster recovery plan<\/strong> and <strong>DR runbooks<\/strong>; evidence of DR tests and improvement actions.<\/li>\n<li><strong>Infrastructure-as-Code repositories<\/strong> and policy-as-code artifacts; module catalogs and golden 
paths.<\/li>\n<li><strong>Cloud cost allocation model<\/strong> (tags\/labels), unit economics dashboards, and cost optimization plans.<\/li>\n<li><strong>Operational dashboards<\/strong> (availability, latency, error budgets, on-call load, MTTR, change failure rate).<\/li>\n<li><strong>Security controls implementation artifacts<\/strong> (access review process, privileged access workflows, secrets rotation plan).<\/li>\n<li><strong>Vendor evaluation and renewal packets<\/strong> (RFP criteria, total cost of ownership, risk assessments).<\/li>\n<li><strong>Team operating model documentation<\/strong> (RACI, escalation paths, on-call, service ownership, support hours).<\/li>\n<li><strong>Training artifacts<\/strong>: onboarding guides, runbooks, internal workshops on IaC, incident response, and platform usage.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">6) Goals, Objectives, and Milestones<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">30-day goals<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Establish a clear understanding of the current-state infrastructure: topology, reliability risks, cost drivers, security gaps, and team capacity.<\/li>\n<li>Build a relationship map with the CTO\/VP Engineering, Security, Product Engineering leaders, Finance, and Support\/Success.<\/li>\n<li>Review incident history (last 6\u201312 months) and identify the top systemic drivers (toil, architectural fragility, noisy alerts).<\/li>\n<li>Validate on-call health and the escalation model; address any immediate burnout or coverage gaps.<\/li>\n<li>Inventory critical vendors\/tools, contracts, and renewal dates; identify high-risk dependencies.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">60-day goals<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Publish a <strong>current-state assessment<\/strong> and <strong>risk register<\/strong> with ranked priorities and mitigation plans.<\/li>\n<li>Define service tiering and baseline reliability targets for 
critical customer journeys.<\/li>\n<li>Implement (or tighten) incident response standards: severity definitions, comms templates, postmortems with action tracking.<\/li>\n<li>Establish FinOps basics: tagging policy, showback dashboards, top 5 cost optimization opportunities.<\/li>\n<li>Align with Security on top infrastructure control gaps and remediation timeline (e.g., IAM cleanup, secrets, logging retention).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">90-day goals<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Deliver a prioritized 2\u20133 quarter <strong>infrastructure roadmap<\/strong> with clear outcomes and owners.<\/li>\n<li>Launch key platform initiatives (examples): standardized IaC modules; Kubernetes baseline hardening; observability upgrades; CI reliability improvements.<\/li>\n<li>Implement or refine SLOs for core platform services and connect to alerting and error budget policies.<\/li>\n<li>Formalize change risk management for high-impact systems (upgrade policy, rollout\/rollback standards).<\/li>\n<li>Present budget forecast and investment case(s) to executive leadership.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">6-month milestones<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Measurable incident reduction (frequency\/severity) and improved MTTR through better detection, runbooks, and automation.<\/li>\n<li>Demonstrable cost governance: cost allocation coverage, anomaly detection, and realized savings.<\/li>\n<li>Documented and tested DR for Tier-0 services, with RTO\/RPO achieved or a clear plan to reach targets.<\/li>\n<li>Platform self-service improvements: faster environment provisioning, reduced ticket volume, more paved roads.<\/li>\n<li>Team maturity improvements: clear role definitions, leveling, hiring plan underway, improved on-call sustainability.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">12-month objectives<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Achieve agreed reliability targets (availability, 
latency, SLO attainment) for Tier-0\/Tier-1 services.<\/li>\n<li>Reduce change failure rate and increase deployment safety through progressive delivery patterns and automated checks.<\/li>\n<li>Strong compliance posture: repeatable evidence for relevant audits; security controls integrated into infrastructure workflows.<\/li>\n<li>Significant automation and toil reduction: decrease manual operational workload and improve developer satisfaction.<\/li>\n<li>Mature vendor portfolio and tooling: consolidation where appropriate, improved SLAs, optimized spend.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Long-term impact goals (12\u201336 months)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Infrastructure becomes a competitive advantage: rapid product experimentation, consistent performance at scale, and trusted security posture.<\/li>\n<li>A \u201cplatform as product\u201d approach: defined platform APIs, internal SLAs, user research with engineering teams, and measurable adoption.<\/li>\n<li>Predictable unit economics: cloud cost per customer\/transaction becomes stable or improving despite growth.<\/li>\n<li>High-performing infrastructure org with strong leadership bench and reduced key-person risk.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Role success definition<\/h3>\n\n\n\n<p>Success is achieved when infrastructure reliability and security are consistently strong, engineering teams can ship with minimal friction, costs are transparent and actively managed, and the infrastructure organization operates sustainably with clear ownership and measurable outcomes.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What high performance looks like<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Proactively identifies and mitigates systemic risks before incidents occur.<\/li>\n<li>Builds leverage via automation, standardization, and strong platform products.<\/li>\n<li>Communicates trade-offs clearly to executives; earns trust through predictable delivery and 
operational discipline.<\/li>\n<li>Develops leaders and creates resilient team structures rather than relying on heroics.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">7) KPIs and Productivity Metrics<\/h2>\n\n\n\n<p>The metrics below are intended to be practical, auditable, and aligned to business outcomes. Targets vary by company stage and service criticality; sample targets represent a mature SaaS environment.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Metric name<\/th>\n<th>What it measures<\/th>\n<th>Why it matters<\/th>\n<th>Example target \/ benchmark<\/th>\n<th>Frequency<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Tier-0 availability<\/td>\n<td>Uptime of most critical customer-facing services<\/td>\n<td>Direct revenue and trust driver<\/td>\n<td>99.9%\u201399.99% monthly<\/td>\n<td>Weekly\/monthly<\/td>\n<\/tr>\n<tr>\n<td>SLO attainment (platform services)<\/td>\n<td>% of time SLOs met for CI, clusters, observability, etc.<\/td>\n<td>Indicates platform reliability and user trust<\/td>\n<td>\u2265 95% of SLOs met monthly<\/td>\n<td>Monthly<\/td>\n<\/tr>\n<tr>\n<td>Error budget burn rate<\/td>\n<td>Rate of SLO budget consumption<\/td>\n<td>Forces reliability vs velocity trade-offs<\/td>\n<td>&lt; 1.0 burn rate sustained<\/td>\n<td>Weekly<\/td>\n<\/tr>\n<tr>\n<td>MTTR (SEV-1\/2)<\/td>\n<td>Mean time to restore service<\/td>\n<td>Measures operational effectiveness<\/td>\n<td>SEV-1: &lt; 60\u201390 mins (context-specific)<\/td>\n<td>Monthly<\/td>\n<\/tr>\n<tr>\n<td>MTTD<\/td>\n<td>Mean time to detect incidents<\/td>\n<td>Observability and alerting quality<\/td>\n<td>&lt; 5\u201310 mins for Tier-0<\/td>\n<td>Monthly<\/td>\n<\/tr>\n<tr>\n<td>Incident recurrence rate<\/td>\n<td>% of incidents repeating within a time window<\/td>\n<td>Measures root cause elimination<\/td>\n<td>&lt; 10% recurring within 90 days<\/td>\n<td>Monthly<\/td>\n<\/tr>\n<tr>\n<td>Change failure rate<\/td>\n<td>% 
of changes causing incidents\/rollback<\/td>\n<td>Deployment safety and maturity<\/td>\n<td>&lt; 10% (elite teams lower)<\/td>\n<td>Monthly<\/td>\n<\/tr>\n<tr>\n<td>Infrastructure deployment frequency<\/td>\n<td>Frequency of infra changes safely deployed<\/td>\n<td>Indicates automation and agility<\/td>\n<td>Multiple times\/week for IaC<\/td>\n<td>Weekly<\/td>\n<\/tr>\n<tr>\n<td>On-call load (pages\/person\/week)<\/td>\n<td>Alert volume and toil<\/td>\n<td>Predicts burnout and quality<\/td>\n<td>Target: sustainable baseline; reduce noisy alerts by 30\u201350%<\/td>\n<td>Weekly<\/td>\n<\/tr>\n<tr>\n<td>Alert quality (actionable %)<\/td>\n<td>% of alerts that require action<\/td>\n<td>Reduces noise and improves response<\/td>\n<td>&gt; 70\u201380% actionable<\/td>\n<td>Monthly<\/td>\n<\/tr>\n<tr>\n<td>Capacity headroom<\/td>\n<td>Available capacity vs peak (CPU\/mem\/IO\/DB connections)<\/td>\n<td>Prevents performance incidents<\/td>\n<td>Maintain 20\u201340% headroom (varies)<\/td>\n<td>Weekly<\/td>\n<\/tr>\n<tr>\n<td>DR test pass rate<\/td>\n<td>Successful completion of DR exercises<\/td>\n<td>Validates recoverability<\/td>\n<td>100% scheduled tests executed; gaps tracked<\/td>\n<td>Quarterly<\/td>\n<\/tr>\n<tr>\n<td>RTO\/RPO attainment<\/td>\n<td>Achieved recovery time\/point objectives<\/td>\n<td>Business continuity<\/td>\n<td>Tier-0 meets targets; Tier-1 plan in progress<\/td>\n<td>Quarterly<\/td>\n<\/tr>\n<tr>\n<td>Patch\/vulnerability SLA compliance<\/td>\n<td>Timely remediation of critical vulnerabilities<\/td>\n<td>Reduces security risk<\/td>\n<td>Critical fixes within 7\u201314 days (context-specific)<\/td>\n<td>Weekly\/monthly<\/td>\n<\/tr>\n<tr>\n<td>IAM policy exceptions<\/td>\n<td>Count\/age of privilege exceptions<\/td>\n<td>Least privilege maturity<\/td>\n<td>Exceptions time-bound; aging &lt; 30\u201360 days<\/td>\n<td>Monthly<\/td>\n<\/tr>\n<tr>\n<td>Logging coverage<\/td>\n<td>% of Tier-0 services with required logs retained<\/td>\n<td>Compliance 
and detection<\/td>\n<td>&gt; 95% coverage<\/td>\n<td>Monthly<\/td>\n<\/tr>\n<tr>\n<td>Cloud cost variance to forecast<\/td>\n<td>Accuracy of cost forecasting<\/td>\n<td>Financial control<\/td>\n<td>Within \u00b15\u201310% monthly<\/td>\n<td>Monthly<\/td>\n<\/tr>\n<tr>\n<td>Unit cost (per customer\/transaction)<\/td>\n<td>Cost efficiency trend<\/td>\n<td>Scales profitably<\/td>\n<td>Stable or decreasing with growth<\/td>\n<td>Monthly\/quarterly<\/td>\n<\/tr>\n<tr>\n<td>Savings realized<\/td>\n<td>Verified savings from optimization<\/td>\n<td>Shows impact of FinOps<\/td>\n<td>5\u201315% annualized savings (depends on baseline)<\/td>\n<td>Quarterly<\/td>\n<\/tr>\n<tr>\n<td>Platform adoption rate<\/td>\n<td>% of teams using standard modules\/golden paths<\/td>\n<td>Standardization leverage<\/td>\n<td>&gt; 70% adoption for target paths<\/td>\n<td>Quarterly<\/td>\n<\/tr>\n<tr>\n<td>Developer satisfaction (platform NPS)<\/td>\n<td>Internal customer satisfaction<\/td>\n<td>Correlates with delivery velocity<\/td>\n<td>Positive trend; target &gt; +20 NPS (example)<\/td>\n<td>Quarterly<\/td>\n<\/tr>\n<tr>\n<td>Roadmap predictability<\/td>\n<td>Delivered vs planned outcomes<\/td>\n<td>Execution health<\/td>\n<td>70\u201385% on-time delivery of committed outcomes<\/td>\n<td>Quarterly<\/td>\n<\/tr>\n<tr>\n<td>Team retention \/ engagement<\/td>\n<td>Stability and culture health<\/td>\n<td>Sustains operational excellence<\/td>\n<td>Healthy retention; engagement improving<\/td>\n<td>Quarterly<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">8) Technical Skills Required<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Must-have technical skills<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Cloud infrastructure architecture (AWS\/Azure\/GCP)<\/strong><br\/>\n   &#8211; Description: Designing secure, scalable cloud environments with appropriate managed services.<br\/>\n   &#8211; Use: Landing zones, 
account\/subscription structure, networking, compute, storage, IAM patterns.<br\/>\n   &#8211; Importance: <strong>Critical<\/strong>.<\/li>\n<li><strong>Infrastructure reliability engineering<\/strong><br\/>\n   &#8211; Description: Applying SRE principles (SLOs, error budgets, toil reduction, incident\/postmortems).<br\/>\n   &#8211; Use: Reliability targets, operational maturity, on-call practices, observability.<br\/>\n   &#8211; Importance: <strong>Critical<\/strong>.<\/li>\n<li><strong>Infrastructure-as-Code (IaC)<\/strong> (e.g., Terraform\/CloudFormation\/Bicep)<br\/>\n   &#8211; Description: Declarative provisioning, versioning, automated reviews, and controlled rollout.<br\/>\n   &#8211; Use: Standard modules, repeatable environments, auditability.<br\/>\n   &#8211; Importance: <strong>Critical<\/strong>.<\/li>\n<li><strong>Networking fundamentals and cloud networking<\/strong><br\/>\n   &#8211; Description: VPC\/VNet design, routing, DNS, load balancing, private connectivity, segmentation.<br\/>\n   &#8211; Use: Secure connectivity, performance, hybrid connectivity if needed.<br\/>\n   &#8211; Importance: <strong>Critical<\/strong>.<\/li>\n<li><strong>Security fundamentals for infrastructure<\/strong><br\/>\n   &#8211; Description: IAM\/least privilege, secrets management, encryption, key management, logging, threat modeling basics.<br\/>\n   &#8211; Use: Secure-by-default platform design; risk mitigation.<br\/>\n   &#8211; Importance: <strong>Critical<\/strong>.<\/li>\n<li><strong>Observability design<\/strong><br\/>\n   &#8211; Description: Metrics\/logs\/traces, alerting strategy, dashboards, SLO reporting.<br\/>\n   &#8211; Use: Operational health, faster detection and diagnosis, capacity\/perf management.<br\/>\n   &#8211; Importance: <strong>Critical<\/strong>.<\/li>\n<li><strong>Linux and systems fundamentals<\/strong><br\/>\n   &#8211; Description: OS-level troubleshooting, performance tuning concepts, automation readiness.<br\/>\n   &#8211; Use: 
Root cause investigations, platform design review, operations leadership.<br\/>\n   &#8211; Importance: <strong>Important<\/strong>.<\/li>\n<li><strong>Operational processes (ITIL-inspired, pragmatic)<\/strong><br\/>\n   &#8211; Description: Incident\/problem\/change management, runbooks, ownership models.<br\/>\n   &#8211; Use: Consistent operations at scale without excessive bureaucracy.<br\/>\n   &#8211; Importance: <strong>Important<\/strong>.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Good-to-have technical skills<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Kubernetes and container platforms<\/strong><br\/>\n   &#8211; Use: Platform standardization, multi-tenancy, cluster lifecycle, workload governance.<br\/>\n   &#8211; Importance: <strong>Important<\/strong> (Critical in K8s-heavy companies).<\/li>\n<li><strong>CI\/CD platform engineering<\/strong><br\/>\n   &#8211; Use: Pipeline reliability, build systems, artifact management, release safety controls.<br\/>\n   &#8211; Importance: <strong>Important<\/strong>.<\/li>\n<li><strong>Database reliability and scaling concepts<\/strong><br\/>\n   &#8211; Use: Partnering with data teams; designing HA\/backup; capacity and performance planning.<br\/>\n   &#8211; Importance: <strong>Optional<\/strong> (depends on ownership boundaries).<\/li>\n<li><strong>Edge services (CDN\/WAF\/DDoS)<\/strong><br\/>\n   &#8211; Use: Protecting and accelerating customer traffic, reducing attack surface.<br\/>\n   &#8211; Importance: <strong>Important<\/strong> for internet-facing SaaS.<\/li>\n<li><strong>Configuration management &amp; automation<\/strong> (Ansible, etc.)<br\/>\n   &#8211; Use: Standardization, patching automation, fleet management (where applicable).<br\/>\n   &#8211; Importance: <strong>Optional<\/strong> (context-specific).<\/li>\n<li><strong>Windows\/enterprise endpoint and corporate IT systems<\/strong><br\/>\n   &#8211; Use: If infrastructure org also owns corporate systems\/identity.<br\/>\n  
 &#8211; Importance: <strong>Optional<\/strong> (context-specific).<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Advanced or expert-level technical skills<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Architecture for multi-region resilience<\/strong><br\/>\n   &#8211; Use: DR design, active-active\/active-passive strategies, data replication trade-offs.<br\/>\n   &#8211; Importance: <strong>Important<\/strong> (Critical for high-availability products).<\/li>\n<li><strong>Performance engineering and capacity modeling<\/strong><br\/>\n   &#8211; Use: Predicting resource needs; managing scaling policies; preventing saturation incidents.<br\/>\n   &#8211; Importance: <strong>Important<\/strong>.<\/li>\n<li><strong>Security architecture &amp; zero trust patterns<\/strong><br\/>\n   &#8211; Use: Privileged access management, network segmentation, policy-as-code, secure service identity.<br\/>\n   &#8211; Importance: <strong>Important<\/strong>.<\/li>\n<li><strong>FinOps and cloud economics<\/strong><br\/>\n   &#8211; Use: Commitment planning, allocation models, unit cost KPIs, optimization governance.<br\/>\n   &#8211; Importance: <strong>Important<\/strong>.<\/li>\n<li><strong>Large-scale incident leadership<\/strong><br\/>\n   &#8211; Use: Coordinating complex outages with many teams; running comms and decision-making.<br\/>\n   &#8211; Importance: <strong>Critical<\/strong> at scale.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Emerging future skills for this role (2\u20135 years)<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Policy-as-code and automated compliance<\/strong><br\/>\n   &#8211; Use: Continuous control monitoring, drift detection, automated evidence capture.<br\/>\n   &#8211; Importance: <strong>Important<\/strong>.<\/li>\n<li><strong>Internal developer platforms (IDPs) and \u201cplatform as product\u201d<\/strong><br\/>\n   &#8211; Use: Golden paths, self-service, developer portals, standardized runtime 
templates.<br\/>\n   &#8211; Importance: <strong>Important<\/strong>.<\/li>\n<li><strong>AI-assisted operations (AIOps)<\/strong><br\/>\n   &#8211; Use: Alert correlation, anomaly detection, incident summarization, faster diagnosis.<br\/>\n   &#8211; Importance: <strong>Optional<\/strong> today; trending <strong>Important<\/strong>.<\/li>\n<li><strong>Confidential computing and advanced data protection<\/strong><br\/>\n   &#8211; Use: High-sensitivity workloads and stronger isolation guarantees.<br\/>\n   &#8211; Importance: <strong>Optional<\/strong> (context-specific).<\/li>\n<li><strong>Sustainability-aware infrastructure optimization<\/strong><br\/>\n   &#8211; Use: Efficiency initiatives that consider energy\/carbon signals (where required).<br\/>\n   &#8211; Importance: <strong>Optional<\/strong>.<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">9) Soft Skills and Behavioral Capabilities<\/h2>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Executive communication and narrative building<\/strong><br\/>\n   &#8211; Why it matters: Infrastructure requires investment and trade-offs; leaders must understand risk, cost, and timelines.<br\/>\n   &#8211; Shows up as: Clear status updates, crisp incident briefings, roadmap proposals with ROI.<br\/>\n   &#8211; Strong performance: Can explain technical risk in business terms; sets expectations without alarmism.<\/li>\n<li><strong>Prioritization under constraints<\/strong><br\/>\n   &#8211; Why it matters: Infrastructure backlogs are always larger than capacity; misprioritization creates outages or waste.<br\/>\n   &#8211; Shows up as: Balancing feature delivery enablement with reliability\/security work.<br\/>\n   &#8211; Strong performance: Uses frameworks (risk\/ROI\/SLOs) and can say \u201cno\u201d with rationale.<\/li>\n<li><strong>Systems thinking<\/strong><br\/>\n   &#8211; Why it matters: Failures often involve multiple interacting components 
(people\/process\/tech).<br\/>\n   &#8211; Shows up as: Root cause prevention, dependency mapping, eliminating systemic toil.<br\/>\n   &#8211; Strong performance: Fixes classes of problems, not one-off symptoms.<\/li>\n<li><strong>Calm leadership in high-severity incidents<\/strong><br\/>\n   &#8211; Why it matters: Outages create urgency and noise; poor leadership increases downtime.<br\/>\n   &#8211; Shows up as: Incident command, decision clarity, communications discipline.<br\/>\n   &#8211; Strong performance: Maintains structure, protects teams from chaos, drives fast restoration.<\/li>\n<li><strong>Stakeholder management and influence<\/strong><br\/>\n   &#8211; Why it matters: Infrastructure teams rarely \u201cown\u201d all dependent services; outcomes depend on influence.<br\/>\n   &#8211; Shows up as: Aligning engineering teams on standards, adoption, and shared ownership.<br\/>\n   &#8211; Strong performance: Builds trust, negotiates boundaries, resolves conflicts pragmatically.<\/li>\n<li><strong>Talent development and coaching<\/strong><br\/>\n   &#8211; Why it matters: Reliability is a team sport; strong leaders build strong leaders.<br\/>\n   &#8211; Shows up as: Clear expectations, mentoring managers, leveling and progression.<br\/>\n   &#8211; Strong performance: Improves team capability over time; reduces key-person risk.<\/li>\n<li><strong>Operational discipline and follow-through<\/strong><br\/>\n   &#8211; Why it matters: Postmortems and roadmaps fail without execution rigor.<br\/>\n   &#8211; Shows up as: Action tracking, validating fixes, ensuring runbooks stay current.<br\/>\n   &#8211; Strong performance: Measurable reductions in recurring incidents and toil.<\/li>\n<li><strong>Product mindset for platforms<\/strong><br\/>\n   &#8211; Why it matters: Internal platforms succeed when treated as products with users and adoption.<br\/>\n   &#8211; Shows up as: Roadmaps tied to developer outcomes; documentation; feedback loops.<br\/>\n   &#8211; 
Strong performance: Platform adoption rises and engineering satisfaction improves.<\/li>\n<li><strong>Financial acumen<\/strong><br\/>\n   &#8211; Why it matters: Cloud and tooling spend can materially impact gross margin and runway.<br\/>\n   &#8211; Shows up as: Forecasting, unit cost metrics, vendor negotiations, ROI cases.<br\/>\n   &#8211; Strong performance: Reduces waste while preserving reliability and velocity.<\/li>\n<li><strong>Integrity and risk stewardship<\/strong><br\/>\n   &#8211; Why it matters: Infrastructure leaders handle privileged access, sensitive incidents, and audit commitments.<br\/>\n   &#8211; Shows up as: Transparent reporting, adherence to controls, responsible decision-making.<br\/>\n   &#8211; Strong performance: No hidden risks; credible and trusted by Security and executives.<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">10) Tools, Platforms, and Software<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Category<\/th>\n<th>Tool \/ platform \/ software<\/th>\n<th>Primary use<\/th>\n<th>Common \/ Optional \/ Context-specific<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Cloud platforms<\/td>\n<td>AWS<\/td>\n<td>Core compute\/storage\/networking managed services<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>Cloud platforms<\/td>\n<td>Microsoft Azure<\/td>\n<td>Alternative core cloud platform<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>Cloud platforms<\/td>\n<td>Google Cloud Platform (GCP)<\/td>\n<td>Alternative core cloud platform<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>Container \/ orchestration<\/td>\n<td>Kubernetes (EKS\/AKS\/GKE)<\/td>\n<td>Container orchestration platform<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>Container \/ orchestration<\/td>\n<td>Helm \/ Kustomize<\/td>\n<td>Kubernetes packaging and configuration<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>Container \/ orchestration<\/td>\n<td>Docker<\/td>\n<td>Container build\/runtime 
basics<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>IaC \/ provisioning<\/td>\n<td>Terraform<\/td>\n<td>Infrastructure provisioning and modules<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>IaC \/ provisioning<\/td>\n<td>CloudFormation \/ CDK \/ Bicep<\/td>\n<td>Cloud-native IaC alternatives<\/td>\n<td>Context-specific<\/td>\n<\/tr>\n<tr>\n<td>Config management<\/td>\n<td>Ansible<\/td>\n<td>Configuration automation for servers<\/td>\n<td>Optional<\/td>\n<\/tr>\n<tr>\n<td>CI\/CD<\/td>\n<td>GitHub Actions<\/td>\n<td>CI workflows and automation<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>CI\/CD<\/td>\n<td>GitLab CI<\/td>\n<td>CI\/CD pipelines and runners<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>CI\/CD<\/td>\n<td>Jenkins<\/td>\n<td>CI orchestration (legacy\/common)<\/td>\n<td>Context-specific<\/td>\n<\/tr>\n<tr>\n<td>Source control<\/td>\n<td>GitHub \/ GitLab<\/td>\n<td>Version control and reviews<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>Observability<\/td>\n<td>Datadog<\/td>\n<td>Metrics, traces, logs, dashboards<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>Observability<\/td>\n<td>Prometheus + Grafana<\/td>\n<td>Metrics collection and visualization<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>Observability<\/td>\n<td>ELK\/Elastic Stack<\/td>\n<td>Logging and search<\/td>\n<td>Context-specific<\/td>\n<\/tr>\n<tr>\n<td>Observability<\/td>\n<td>Splunk<\/td>\n<td>Centralized logging\/security analytics<\/td>\n<td>Context-specific<\/td>\n<\/tr>\n<tr>\n<td>Alerting \/ on-call<\/td>\n<td>PagerDuty<\/td>\n<td>On-call schedules, escalation, incident response<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>Alerting \/ on-call<\/td>\n<td>Opsgenie<\/td>\n<td>Alternative on-call\/alerting<\/td>\n<td>Optional<\/td>\n<\/tr>\n<tr>\n<td>Incident collaboration<\/td>\n<td>Slack \/ Microsoft Teams<\/td>\n<td>Incident coordination and comms<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>ITSM<\/td>\n<td>ServiceNow<\/td>\n<td>Incident\/change\/problem workflows, CMDB<\/td>\n<td>Context-specific (common 
in enterprise)<\/td>\n<\/tr>\n<tr>\n<td>ITSM<\/td>\n<td>Jira Service Management<\/td>\n<td>ITSM-lite ticketing and workflows<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>Project management<\/td>\n<td>Jira<\/td>\n<td>Planning, delivery tracking<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>Documentation<\/td>\n<td>Confluence \/ Notion<\/td>\n<td>Runbooks, standards, internal docs<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>Security (IAM\/SSO)<\/td>\n<td>Okta \/ Entra ID<\/td>\n<td>SSO, identity governance<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>Security (secrets)<\/td>\n<td>HashiCorp Vault<\/td>\n<td>Secrets management<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>Security (cloud native)<\/td>\n<td>AWS KMS \/ Azure Key Vault \/ GCP KMS<\/td>\n<td>Keys, secrets, encryption primitives<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>Security (policy-as-code)<\/td>\n<td>Open Policy Agent (OPA) \/ Gatekeeper<\/td>\n<td>Policy enforcement for K8s and more<\/td>\n<td>Optional<\/td>\n<\/tr>\n<tr>\n<td>Security posture<\/td>\n<td>Wiz \/ Prisma Cloud<\/td>\n<td>Cloud security posture management<\/td>\n<td>Context-specific<\/td>\n<\/tr>\n<tr>\n<td>Networking\/edge<\/td>\n<td>Cloudflare<\/td>\n<td>CDN, WAF, DDoS protection<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>Networking\/edge<\/td>\n<td>AWS CloudFront \/ Azure Front Door<\/td>\n<td>CDN\/edge routing<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>Cost management<\/td>\n<td>CloudHealth \/ Apptio Cloudability<\/td>\n<td>FinOps dashboards and governance<\/td>\n<td>Context-specific<\/td>\n<\/tr>\n<tr>\n<td>Cost management<\/td>\n<td>Native cloud cost tools<\/td>\n<td>Cost allocation, budgets, recommendations<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>Automation\/scripting<\/td>\n<td>Python<\/td>\n<td>Automation, tooling, integrations<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>Automation\/scripting<\/td>\n<td>Bash<\/td>\n<td>Scripting and ops automation<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>Artifact management<\/td>\n<td>Artifactory \/ 
Nexus<\/td>\n<td>Artifact storage, dependency proxy<\/td>\n<td>Context-specific<\/td>\n<\/tr>\n<tr>\n<td>Compliance evidence<\/td>\n<td>Drata \/ Vanta<\/td>\n<td>Automated audit evidence collection<\/td>\n<td>Context-specific<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">11) Typical Tech Stack \/ Environment<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Infrastructure environment<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Predominantly cloud-hosted (AWS\/Azure\/GCP), often multi-account\/subscription with a landing zone design.<\/li>\n<li>Mix of managed services (databases, queues, caches) and containerized workloads on Kubernetes.<\/li>\n<li>Mature organizations may operate multi-region architectures, with tiered recoverability and explicit DR patterns.<\/li>\n<li>Connectivity includes private networking, VPN\/zero-trust access, and secure connectivity for internal systems.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Application environment<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Microservices and APIs (common), with some monolith components in many real companies.<\/li>\n<li>Service-to-service traffic managed through ingress controllers, API gateways, and sometimes a service mesh.<\/li>\n<li>Deployment patterns include rolling deployments, canaries, blue\/green, feature flags (ownership may be shared).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Data environment<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data stores may include managed relational databases, managed NoSQL, object storage, and streaming systems.<\/li>\n<li>The Head of Infrastructure often partners with Data Engineering rather than owning all data systems, but infrastructure must support data reliability, backups, and access controls.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Security environment<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>SSO integrated with IAM; least 
privilege and role-based access patterns.<\/li>\n<li>Centralized secrets management; encryption in transit and at rest.<\/li>\n<li>Vulnerability scanning integrated into CI and runtime environments; logging and monitoring aligned to security detection needs.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Delivery model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Platform and infrastructure delivered as products: self-service modules, templates, golden paths, and internal documentation.<\/li>\n<li>\u201cYou build it, you run it\u201d is common, but platform teams provide paved roads and operational guardrails.<\/li>\n<li>Ownership model typically includes shared responsibility for reliability: product teams own services; infrastructure\/SRE owns platform layers and incident leadership support.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Agile or SDLC context<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Quarterly planning with rolling prioritization; infrastructure work must be visible and justified with business outcomes.<\/li>\n<li>Infrastructure changes treated as code: PR reviews, automated tests, staged rollouts.<\/li>\n<li>Operational readiness reviews for high-risk changes (context-specific).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Scale or complexity context<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Complexity increases with: customer growth, global footprint, compliance requirements, and a growing microservices ecosystem.<\/li>\n<li>The role is designed for a mid-size to enterprise SaaS or IT organization where uptime and security are material.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Team topology<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Typical org components under this role (varies by company):\n<ul class=\"wp-block-list\">\n<li>Cloud Platform \/ Platform Engineering<\/li>\n<li>SRE \/ Production Engineering (sometimes a peer org)<\/li>\n<li>Network\/Edge Engineering<\/li>\n<li>Observability Engineering<\/li>\n<li>Infrastructure 
Security Engineering (sometimes in Security org)<\/li>\n<li>Corporate Infrastructure \/ IAM Operations (context-specific)<\/li>\n<li>FinOps (often dotted line to Finance)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">12) Stakeholders and Collaboration Map<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Internal stakeholders<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>CTO \/ VP Engineering (typical manager):<\/strong> Align on strategy, budgets, platform roadmap, reliability posture, and risk.<\/li>\n<li><strong>Engineering Directors\/Managers (Product Engineering):<\/strong> Platform adoption, operational responsibilities, incident coordination, delivery enablement.<\/li>\n<li><strong>Security leadership (CISO \/ Head of Security, AppSec, SecOps, GRC):<\/strong> Controls, risk management, vulnerability response, audit readiness, incident response.<\/li>\n<li><strong>Enterprise Architecture (if present):<\/strong> Alignment on standards, target state, and technology choices.<\/li>\n<li><strong>Finance \/ FP&amp;A:<\/strong> Budgeting, forecasting, unit economics, commitment planning (Reserved Instances\/Savings Plans).<\/li>\n<li><strong>Procurement \/ Vendor Management:<\/strong> Tool selection, negotiations, renewals, risk assessments.<\/li>\n<li><strong>Customer Support \/ Customer Success:<\/strong> Incident communications, customer impact analysis, post-incident follow-ups.<\/li>\n<li><strong>IT \/ Corporate Systems:<\/strong> Identity, endpoints, SaaS app management (if scope overlaps).<\/li>\n<li><strong>Legal \/ Privacy (context-specific):<\/strong> Data residency, vendor contracts, breach readiness.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">External stakeholders (as applicable)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Cloud providers and critical vendors:<\/strong> Support escalations, roadmap influence, incident coordination, architectural guidance.<\/li>\n<li><strong>Audit 
firms \/ compliance assessors:<\/strong> Evidence review, control testing, remediation follow-up.<\/li>\n<li><strong>Key customers (enterprise):<\/strong> Reliability reviews, security questionnaires, major incident communications (typically via Success\/Support).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Peer roles<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Head of SRE (if separate), Head of Security, Head of Engineering Productivity\/Developer Experience, Head of Data Platform, Head of IT, Head of Architecture.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Upstream dependencies<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Product roadmap and growth forecasts (traffic, regions, SLAs).<\/li>\n<li>Security policies and risk appetite statements.<\/li>\n<li>Financial constraints and margin targets.<\/li>\n<li>Customer contractual uptime\/SLA commitments.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Downstream consumers<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Product engineering teams deploying services.<\/li>\n<li>Data engineering and analytics consumers.<\/li>\n<li>Security operations relying on logs\/telemetry.<\/li>\n<li>Support\/Success requiring status visibility and incident comms.<\/li>\n<li>Executives needing risk\/cost transparency.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Nature of collaboration<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Predominantly partnership-driven with formal governance only where necessary (e.g., high-risk changes, audit controls).<\/li>\n<li>Successful collaboration relies on clear service boundaries, published standards, and pragmatic adoption paths.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Typical decision-making authority and escalation points<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Infrastructure design standards: Head of Infrastructure proposes\/owns; Architecture\/CTO align on exceptions.<\/li>\n<li>Incidents: Incident Commander authority during active events; 
Head of Infrastructure accountable for process maturity and escalations.<\/li>\n<li>Budget and vendor decisions: Head of Infrastructure recommends; CTO\/Finance approve above thresholds.<\/li>\n<li>Security control decisions: joint ownership with Security; escalations to CTO\/CISO for risk acceptance.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">13) Decision Rights and Scope of Authority<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Decisions this role can typically make independently<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Infrastructure team operating processes (on-call structure, runbook standards, postmortem formats).<\/li>\n<li>Tool configuration and standardization within existing approved portfolio.<\/li>\n<li>Prioritization of infrastructure backlog within an agreed roadmap and OKRs.<\/li>\n<li>Approval of routine infrastructure changes within risk policy.<\/li>\n<li>Hiring decisions for approved headcount within the infrastructure org (often with HR\/CTO alignment).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Decisions requiring team approval (or architecture review)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Introduction of major new platform components (e.g., new service mesh, new observability stack).<\/li>\n<li>Breaking changes to shared infrastructure APIs or golden paths.<\/li>\n<li>Standard module interfaces and versioning policies.<\/li>\n<li>Decommissioning widely used services and migration timelines.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Decisions requiring manager\/executive approval (CTO\/VP Eng, Finance, CISO as applicable)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Annual budget and significant unplanned spend increases.<\/li>\n<li>Vendor selection\/renewals above approval thresholds; multi-year commitments.<\/li>\n<li>Major architectural shifts (e.g., single-to-multi-region, cloud provider changes, large re-platforming).<\/li>\n<li>Risk acceptance 
decisions that materially affect compliance, customer SLAs, or security posture.<\/li>\n<li>Major org changes (team splits\/merges, reassigning ownership across orgs).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Budget authority (typical)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Owns cloud infrastructure and tooling budgets with accountability for forecasting and variance management.<\/li>\n<li>Can approve discretionary spend up to a defined threshold (varies by company).<\/li>\n<li>Partners with Finance for commitment purchases and capitalization policy (where applicable).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Architecture authority<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Owns infrastructure reference architectures and required controls.<\/li>\n<li>Can enforce baseline standards; exceptions managed via a documented exception process with time-bound remediation.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Vendor authority<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Leads evaluations, proofs-of-concept, total cost of ownership analysis, and vendor scorecards.<\/li>\n<li>Owns ongoing vendor performance management and escalation.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Delivery authority<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Accountable for delivering infrastructure roadmap outcomes; may gate high-risk launches via operational readiness criteria.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Hiring and organizational authority<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Designs team structure, roles, leveling expectations, and hiring profiles.<\/li>\n<li>Accountable for on-call sustainability and staffing to meet uptime objectives.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">14) Required Experience and Qualifications<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Typical years of experience<\/h3>\n\n\n\n<ul 
class=\"wp-block-list\">\n<li><strong>12\u201318+ years<\/strong> in infrastructure engineering, SRE, platform engineering, or operations in software\/IT organizations.<\/li>\n<li><strong>5\u201310+ years<\/strong> in people leadership (managing managers strongly preferred for \u201cHead\u201d scope).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Education expectations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Bachelor\u2019s degree in Computer Science, Engineering, or related field is common.<\/li>\n<li>Equivalent practical experience is often acceptable in software infrastructure leadership.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Certifications (relevant but not mandatory)<\/h3>\n\n\n\n<p>Labeling: <strong>Optional unless specified by company policy<\/strong>.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Common\/Optional:<\/strong> AWS Solutions Architect Professional, Azure Solutions Architect Expert, GCP Professional Cloud Architect<\/li>\n<li><strong>Optional:<\/strong> Kubernetes certifications (CKA\/CKAD\/CKS) for K8s-heavy environments<\/li>\n<li><strong>Optional:<\/strong> ITIL Foundation (useful in ITSM-heavy enterprises)<\/li>\n<li><strong>Context-specific:<\/strong> Security certifications (CISSP) if infrastructure leadership also covers significant security scope<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Prior role backgrounds commonly seen<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Infrastructure Engineering Manager \/ Director<\/li>\n<li>SRE Manager \/ Director (or Production Engineering leader)<\/li>\n<li>Platform Engineering leader<\/li>\n<li>Senior Cloud Architect with leadership progression<\/li>\n<li>Network\/Systems Engineering leader (modern cloud context)<\/li>\n<li>DevOps leader in organizations where \u201cDevOps\u201d functionally covers platform and operations<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Domain knowledge expectations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>SaaS reliability and production operations 
practices<\/li>\n<li>Cloud security and shared responsibility models<\/li>\n<li>Compliance-adjacent knowledge sufficient to partner with GRC (controls, evidence, audit cadence)<\/li>\n<li>Vendor and cost management in cloud environments<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Leadership experience expectations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Leading multi-team organizations with mixed disciplines (cloud, SRE, network, observability, tooling).<\/li>\n<li>Operating in 24\/7 environments with mature incident response.<\/li>\n<li>Demonstrated ability to influence peer engineering leaders and create adoption of standards.<\/li>\n<li>Experience managing budgets, forecasts, and vendor relationships.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">15) Career Path and Progression<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Common feeder roles into Head of Infrastructure<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Director of Infrastructure \/ Director of Platform Engineering<\/li>\n<li>Head\/Director of SRE or Production Engineering<\/li>\n<li>Senior Engineering Manager (Infrastructure\/Platform)<\/li>\n<li>Principal\/Lead Cloud Architect with strong people leadership trajectory (less common but possible)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Next likely roles after this role<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>VP Infrastructure \/ VP Platform Engineering<\/strong> (larger orgs)<\/li>\n<li><strong>VP Engineering<\/strong> (broader scope including product engineering)<\/li>\n<li><strong>CTO<\/strong> (especially in infrastructure-heavy or B2B enterprise contexts)<\/li>\n<li><strong>Chief Reliability Officer \/ Head of Technology Operations<\/strong> (rare, larger orgs)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Adjacent career paths<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Security leadership:<\/strong> Head of Cloud Security \/ Infrastructure 
Security (if strong security orientation)<\/li>\n<li><strong>Enterprise Architecture leadership:<\/strong> Chief Architect \/ Head of Architecture (if architecture-heavy and governance-focused)<\/li>\n<li><strong>Operations leadership:<\/strong> Head of SRE \/ Head of Operations (if incident and service management dominant)<\/li>\n<li><strong>Engineering Productivity\/Developer Experience leadership:<\/strong> if platform is strongly developer-facing<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Skills needed for promotion<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Proven track record delivering multi-quarter platform outcomes tied to business metrics.<\/li>\n<li>Strong executive presence and ability to manage cross-org strategy.<\/li>\n<li>Mature budget ownership and vendor portfolio optimization.<\/li>\n<li>Ability to build a leadership bench (managers of managers), not just a strong IC core.<\/li>\n<li>Demonstrated reliability improvements at scale (SLOs, DR readiness, reduced MTTR) and sustained cost control.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">How this role evolves over time<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Early phase: heavy stabilization, incident reduction, standardization, and foundational platform builds.<\/li>\n<li>Mid phase: platform as product, self-service expansion, SLO and cost governance maturity.<\/li>\n<li>Mature phase: strategic differentiation, multi-region expansion, compliance automation, and continuous optimization with strong internal SLAs.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">16) Risks, Challenges, and Failure Modes<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Common role challenges<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Competing priorities:<\/strong> Product delivery pressure vs reliability\/security investments.<\/li>\n<li><strong>Inherited complexity:<\/strong> Legacy infrastructure, inconsistent standards, and fragmented 
tooling.<\/li>\n<li><strong>Ambiguous ownership:<\/strong> Unclear lines between product teams, SRE, and infrastructure leading to gaps.<\/li>\n<li><strong>Scale inflection points:<\/strong> Rapid growth causing capacity, cost, and operational maturity gaps.<\/li>\n<li><strong>Cultural resistance:<\/strong> Teams avoiding standardization due to perceived loss of autonomy.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Bottlenecks<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Manual access provisioning and poor IAM hygiene leading to slow delivery and audit risk.<\/li>\n<li>Noisy alerting and immature observability causing slow detection and high on-call load.<\/li>\n<li>Lack of standardized IaC modules causing drift and inconsistent environments.<\/li>\n<li>Unowned dependencies (shared clusters, shared accounts, unmanaged vendor services).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Anti-patterns<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Hero culture:<\/strong> Reliance on a few experts to keep production alive.<\/li>\n<li><strong>Big-bang migrations:<\/strong> High-risk replatforming without incremental milestones or rollback.<\/li>\n<li><strong>Tool sprawl:<\/strong> Too many overlapping observability\/CI\/security tools increasing cost and complexity.<\/li>\n<li><strong>Ticket-driven platform:<\/strong> Platform team becomes a bottleneck instead of enabling self-service.<\/li>\n<li><strong>Over-governance:<\/strong> Heavy change processes that slow delivery without improving safety.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Common reasons for underperformance<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Treating infrastructure as a back-office function rather than a product enabler.<\/li>\n<li>Inability to articulate ROI or risk trade-offs to executives, leading to underinvestment.<\/li>\n<li>Weak incident leadership and lack of follow-through on corrective actions.<\/li>\n<li>Poor stakeholder management resulting in low 
adoption of standards.<\/li>\n<li>Neglecting team health (burnout, attrition) and failing to build leadership depth.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Business risks if this role is ineffective<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Increased outages, SLA breaches, customer churn, and reputational damage.<\/li>\n<li>Material security incidents due to poor IAM, secrets handling, and weak detection.<\/li>\n<li>Cloud spend grows faster than revenue, compressing margins or shortening runway.<\/li>\n<li>Inability to pass customer security reviews or audits, slowing enterprise sales.<\/li>\n<li>Slow product delivery due to unreliable CI, platform friction, and operational chaos.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">17) Role Variants<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">By company size<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Startup (Series A\u2013B):<\/strong>\n<ul class=\"wp-block-list\">\n<li>Scope often includes hands-on architecture and direct contribution; team may be small (2\u20138).<\/li>\n<li>Focus: stabilize production, establish IaC, baseline monitoring, pragmatic security controls, cost visibility.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Mid-size (Series C\u2013D \/ scaling SaaS):<\/strong>\n<ul class=\"wp-block-list\">\n<li>Leads multiple teams; formalizes SRE, DR, and platform self-service.<\/li>\n<li>Focus: scale reliability practices, plan for multi-region, reduce toil, build a developer platform.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Enterprise:<\/strong>\n<ul class=\"wp-block-list\">\n<li>Manages managers and interfaces with enterprise architecture, GRC, procurement, and formal ITSM.<\/li>\n<li>Focus: compliance automation, vendor governance, standardized enterprise platforms, complex stakeholder landscape.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">By industry<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>B2B SaaS (general):<\/strong> Strong emphasis on uptime, customer trust, and enterprise security reviews.  
<\/li>\n<li><strong>Fintech\/Payments:<\/strong> Higher bar for resiliency, auditability, and security controls; tighter change governance.  <\/li>\n<li><strong>Healthcare:<\/strong> Strong compliance and privacy controls; DR and access governance are critical.  <\/li>\n<li><strong>Consumer internet:<\/strong> Performance, scale, and cost efficiency at high traffic volumes; edge optimization and incident rigor.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">By geography<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Single-region customer base:<\/strong> DR may be simpler; focus on stability and cost.  <\/li>\n<li><strong>Global footprint:<\/strong> Multi-region strategy, latency management, data residency, and follow-the-sun incident response become more important.  <\/li>\n<li><strong>Region-specific regulations:<\/strong> Data residency and encryption standards may drive architecture and vendor selection.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Product-led vs service-led company<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Product-led:<\/strong> Infrastructure optimized for rapid iteration, self-service, and standardized runtime paths.  <\/li>\n<li><strong>Service-led \/ IT services:<\/strong> May emphasize client-specific environments, stricter project management, and SLAs; more variation management and isolation patterns.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Startup vs enterprise operating model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Startup:<\/strong> Faster decisions, fewer controls, more direct ownership.  
<\/li>\n<li><strong>Enterprise:<\/strong> More governance and compliance, greater vendor and process complexity, larger-scale cost optimization.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Regulated vs non-regulated environment<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Regulated:<\/strong> Formal control mapping, evidence retention, access reviews, change records, DR testing cadence.  <\/li>\n<li><strong>Non-regulated:<\/strong> More flexibility, but enterprise customers may still demand SOC2-like practices.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">18) AI \/ Automation Impact on the Role<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Tasks that can be automated (increasingly)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Alert correlation and noise reduction:<\/strong> AIOps can cluster events, suppress duplicates, and suggest likely root causes.<\/li>\n<li><strong>Incident summarization:<\/strong> Automated timelines, impact analysis drafts, and customer-facing status updates (with human review).<\/li>\n<li><strong>Infrastructure drift detection and remediation:<\/strong> Continuous validation of IaC vs runtime with automated rollback\/repair patterns.<\/li>\n<li><strong>Cost anomaly detection and optimization recommendations:<\/strong> Automated detection of spend spikes, idle resources, and commitment planning suggestions.<\/li>\n<li><strong>Automated compliance checks:<\/strong> Policy-as-code enforcement and continuous control monitoring for encryption, logging, IAM constraints.<\/li>\n<li><strong>Runbook automation:<\/strong> Self-healing actions triggered by known failure patterns (restart loops, failovers, scaling actions).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Tasks that remain human-critical<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Strategy and trade-off decisions:<\/strong> Reliability vs cost vs delivery speed; selecting architectural direction 
and sequencing investments.<\/li>\n<li><strong>Risk acceptance and governance:<\/strong> Determining when exceptions are justified and communicating impacts.<\/li>\n<li><strong>Cross-functional influence and culture:<\/strong> Driving adoption, changing behaviors, and managing stakeholder expectations.<\/li>\n<li><strong>High-stakes incident leadership:<\/strong> Humans remain accountable for decision-making under uncertainty and ensuring safe actions.<\/li>\n<li><strong>Vendor and contract negotiation:<\/strong> Context, leverage, and relationship management remain critical.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">How AI changes the role over the next 2\u20135 years<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The Head of Infrastructure will be expected to adopt <strong>automation-first operations<\/strong>, reducing manual toil and raising the baseline for reliability.<\/li>\n<li>Increased emphasis on <strong>internal platform product management<\/strong>, as AI accelerates development and increases demand for fast, safe environments.<\/li>\n<li>More sophisticated <strong>capacity and performance forecasting<\/strong> using predictive analytics tied to product usage signals.<\/li>\n<li>Expanded responsibility for <strong>governance of AI-enabled operational actions<\/strong>, including guardrails and approval flows for automated changes.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">New expectations caused by AI, automation, or platform shifts<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Establish policies for AI tooling in operations: data handling, audit logs, approval thresholds, and safe automation patterns.<\/li>\n<li>Build \u201cpaved roads\u201d that include AI-enabled workflows without creating uncontrolled change vectors.<\/li>\n<li>Demonstrate measurable reductions in toil and improved incident outcomes attributable to automation investments.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 
class=\"wp-block-heading\">19) Hiring Evaluation Criteria<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">What to assess in interviews<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Infrastructure strategy and architecture judgment<\/strong>\n   &#8211; Can the candidate define a target state that matches business needs?\n   &#8211; Do they understand trade-offs: managed services vs self-managed, multi-region complexity, vendor lock-in, reliability vs cost?<\/li>\n<li><strong>Reliability leadership<\/strong>\n   &#8211; Depth in incident management, postmortems, SLOs, and operational maturity.\n   &#8211; Evidence of reducing MTTR and recurring incidents through systemic fixes.<\/li>\n<li><strong>Cloud and security competence<\/strong>\n   &#8211; IAM and least privilege, secrets handling, encryption, logging, network segmentation.\n   &#8211; Ability to partner with Security and meet compliance expectations pragmatically.<\/li>\n<li><strong>Operating model and platform mindset<\/strong>\n   &#8211; Experience building platform teams and self-service capabilities.\n   &#8211; Clear thinking on ownership boundaries and reducing ticket-driven bottlenecks.<\/li>\n<li><strong>FinOps and cost governance<\/strong>\n   &#8211; Ability to build cost visibility, allocation, and optimization processes tied to unit economics.<\/li>\n<li><strong>Leadership and org building<\/strong>\n   &#8211; Managing managers, team health, hiring, coaching, and building a sustainable on-call model.<\/li>\n<li><strong>Stakeholder influence<\/strong>\n   &#8211; Proven ability to drive adoption across engineering orgs and align executives on investments.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Practical exercises or case studies (recommended)<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Infrastructure maturity assessment case<\/strong>\n   &#8211; Provide a short \u201ccompany brief\u201d (traffic growth, incident history, cloud spend trend, compliance 
needs).<br\/>\n   &#8211; Candidate produces a 90-day plan: top risks, quick wins, roadmap themes, KPIs, and operating model changes.<\/li>\n<li><strong>Incident leadership simulation<\/strong>\n   &#8211; Walk through a SEV-1 scenario: partial outage, unclear root cause, multiple teams involved.<br\/>\n   &#8211; Evaluate command structure, comms discipline, decision-making, and follow-up actions.<\/li>\n<li><strong>Cost and reliability trade-off exercise<\/strong>\n   &#8211; Present a cost spike and reliability risk; candidate proposes optimizations without degrading SLOs (or explains trade-offs).<\/li>\n<li><strong>Architecture review<\/strong>\n   &#8211; Candidate reviews a reference diagram (K8s + managed DB + CDN) and identifies failure points, security gaps, and DR improvements.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Strong candidate signals<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Clear examples of reliability improvements with measurable outcomes (MTTR reduction, fewer SEV-1s, improved SLO attainment).<\/li>\n<li>Evidence of building self-service infrastructure capabilities and increasing platform adoption.<\/li>\n<li>Mature, pragmatic security posture and strong partnership with Security\/GRC.<\/li>\n<li>Demonstrated cost optimization tied to allocation and unit metrics (not just one-off savings).<\/li>\n<li>Calm, structured incident leadership with strong communication patterns.<\/li>\n<li>Strong org-building skills: hiring, leveling, developing managers, on-call sustainability.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Weak candidate signals<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Over-indexing on tools rather than outcomes; lacks a measurement framework.<\/li>\n<li>Limited incident leadership experience; cannot articulate postmortem-driven improvement loops.<\/li>\n<li>Treats security\/compliance as external constraints rather than integrated design requirements.<\/li>\n<li>Cannot discuss cloud cost drivers or 
budgeting with confidence.<\/li>\n<li>Excessive reliance on heroics or \u201cbest people fix it\u201d culture.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Red flags<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Blame-oriented incident culture; unwillingness to run blameless postmortems.<\/li>\n<li>Dismissive attitude toward documentation, runbooks, or operational discipline.<\/li>\n<li>Makes large architectural bets without migration strategy, milestones, and rollback plans.<\/li>\n<li>Poor ethics around access controls or audit expectations (\u201cwe can just bypass it\u201d).<\/li>\n<li>Chronic underinvestment in team health (unsustainable on-call as the norm).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Scorecard dimensions (recommended)<\/h3>\n\n\n\n<p>Use a consistent 1\u20135 scale (1 = insufficient, 3 = meets, 5 = exceptional).<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Dimension<\/th>\n<th>What \u201cmeets\u201d looks like<\/th>\n<th>What \u201cexceptional\u201d looks like<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Infrastructure strategy<\/td>\n<td>Coherent roadmap aligned to growth<\/td>\n<td>Clear target state + sequencing + ROI\/risk narrative<\/td>\n<\/tr>\n<tr>\n<td>Reliability\/SRE<\/td>\n<td>Understands SLOs and incident practice<\/td>\n<td>Proven track record transforming reliability outcomes<\/td>\n<\/tr>\n<tr>\n<td>Cloud architecture<\/td>\n<td>Solid cloud fundamentals<\/td>\n<td>Deep judgment on managed services, multi-region, scaling<\/td>\n<\/tr>\n<tr>\n<td>Security &amp; compliance<\/td>\n<td>Partners well; applies baseline controls<\/td>\n<td>Builds secure-by-default platforms and automates controls<\/td>\n<\/tr>\n<tr>\n<td>FinOps<\/td>\n<td>Can allocate and optimize costs<\/td>\n<td>Ties cost to unit economics; drives sustained efficiency<\/td>\n<\/tr>\n<tr>\n<td>Platform engineering<\/td>\n<td>Enables teams with standards<\/td>\n<td>Platform-as-product mindset; high adoption 
outcomes<\/td>\n<\/tr>\n<tr>\n<td>Leadership &amp; org building<\/td>\n<td>Manages teams effectively<\/td>\n<td>Builds leaders, succession, sustainable on-call culture<\/td>\n<\/tr>\n<tr>\n<td>Communication<\/td>\n<td>Clear and structured updates<\/td>\n<td>Executive-ready narratives; strong influence across org<\/td>\n<\/tr>\n<tr>\n<td>Execution<\/td>\n<td>Delivers planned outcomes<\/td>\n<td>Predictable delivery, strong operational follow-through<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">20) Final Role Scorecard Summary<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Item<\/th>\n<th>Executive summary<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Role title<\/td>\n<td>Head of Infrastructure<\/td>\n<\/tr>\n<tr>\n<td>Role purpose<\/td>\n<td>Lead infrastructure strategy, engineering, and operations to ensure secure, scalable, reliable, and cost-effective platforms that enable product delivery and protect customer experience.<\/td>\n<\/tr>\n<tr>\n<td>Top 10 responsibilities<\/td>\n<td>1) Infrastructure strategy\/roadmap 2) Reference architectures\/standards 3) Reliability\/SLO program 4) Incident management maturity 5) IaC and automation 6) Observability platform ownership 7) DR and business continuity 8) Cloud cost governance\/FinOps 9) Security controls implementation with Security 10) Org leadership, hiring, and on-call sustainability<\/td>\n<\/tr>\n<tr>\n<td>Top 10 technical skills<\/td>\n<td>1) Cloud architecture 2) SRE\/reliability engineering 3) IaC (Terraform etc.) 
4) Observability design 5) Cloud networking 6) IAM\/secrets\/encryption fundamentals 7) Incident management systems 8) Kubernetes\/container platforms (common) 9) FinOps\/cloud economics 10) Multi-region resilience patterns<\/td>\n<\/tr>\n<tr>\n<td>Top 10 soft skills<\/td>\n<td>1) Executive communication 2) Prioritization 3) Systems thinking 4) Calm incident leadership 5) Influence without authority 6) Talent development 7) Operational discipline 8) Platform product mindset 9) Financial acumen 10) Integrity and risk stewardship<\/td>\n<\/tr>\n<tr>\n<td>Top tools\/platforms<\/td>\n<td>Cloud (AWS\/Azure\/GCP), Kubernetes, Terraform, Datadog\/Prometheus\/Grafana, PagerDuty, GitHub\/GitLab, Jira, Confluence\/Notion, Vault\/KMS\/Key Vault, Cloudflare\/CDN\/WAF, ServiceNow\/Jira Service Management (context-specific)<\/td>\n<\/tr>\n<tr>\n<td>Top KPIs<\/td>\n<td>Tier-0 availability, MTTR\/MTTD, SLO attainment &amp; error budget burn, change failure rate, incident recurrence rate, on-call load &amp; actionable alert %, DR test pass rate &amp; RTO\/RPO attainment, cloud cost variance to forecast, unit cost trend, platform adoption &amp; developer satisfaction<\/td>\n<\/tr>\n<tr>\n<td>Main deliverables<\/td>\n<td>Infrastructure strategy and roadmap; reference architectures; SLO framework; incident playbooks and postmortem program; DR plans and test evidence; IaC modules and golden paths; observability dashboards; cost allocation and unit economics reporting; security control implementation artifacts; vendor evaluation\/renewal packets; operating model and RACI documentation<\/td>\n<\/tr>\n<tr>\n<td>Main goals<\/td>\n<td>Stabilize production reliability; reduce incident recurrence and MTTR; mature DR; build self-service platform capabilities; implement strong cost governance; strengthen security posture and audit readiness; build a healthy infrastructure org with sustainable operations<\/td>\n<\/tr>\n<tr>\n<td>Career progression options<\/td>\n<td>VP Platform\/Infrastructure; 
VP Engineering; CTO (context-dependent); Head of SRE\/Technology Operations; Head of Cloud Security\/Infrastructure Security (adjacent path)<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n","protected":false},"excerpt":{"rendered":"<p>The Head of Infrastructure is the senior leader accountable for the reliability, scalability, security, and cost-effectiveness of the company\u2019s production and corporate infrastructure platforms. This role establishes the infrastructure strategy and operating model, leads infrastructure engineering and operations teams, and ensures the infrastructure enables product delivery at the required performance and availability levels.<\/p>\n","protected":false},"author":61,"featured_media":0,"comment_status":"open","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"_joinchat":[],"footnotes":""},"categories":[24486,24483],"tags":[],"class_list":["post-74774","post","type-post","status-publish","format-standard","hentry","category-engineering-leadership","category-leadership"],"_links":{"self":[{"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/74774","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/users\/61"}],"replies":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/comments?post=74774"}],"version-history":[{"count":0,"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/74774\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/media?parent=74774"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/categories?post=74774"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.devopssc
hool.com\/blog\/wp-json\/wp\/v2\/tags?post=74774"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}