{"id":74775,"date":"2026-04-15T18:02:10","date_gmt":"2026-04-15T18:02:10","guid":{"rendered":"https:\/\/www.devopsschool.com\/blog\/head-of-infrastructure-engineering-role-blueprint-responsibilities-skills-kpis-and-career-path\/"},"modified":"2026-04-15T18:02:10","modified_gmt":"2026-04-15T18:02:10","slug":"head-of-infrastructure-engineering-role-blueprint-responsibilities-skills-kpis-and-career-path","status":"publish","type":"post","link":"https:\/\/www.devopsschool.com\/blog\/head-of-infrastructure-engineering-role-blueprint-responsibilities-skills-kpis-and-career-path\/","title":{"rendered":"Head of Infrastructure Engineering: Role Blueprint, Responsibilities, Skills, KPIs, and Career Path"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">1) Role Summary<\/h2>\n\n\n\n<p>The Head of Infrastructure Engineering is accountable for designing, building, and operating the company\u2019s infrastructure platforms and core reliability capabilities that enable product engineering teams to ship software safely, quickly, and cost-effectively. This role leads the infrastructure engineering organization (often including cloud infrastructure, Kubernetes\/platform engineering, networking, observability, and incident management) and ensures infrastructure is scalable, secure, and operationally mature.<\/p>\n\n\n\n<p>This role exists in software and IT organizations to translate business growth and product requirements into dependable infrastructure capabilities\u2014compute, storage, network, CI\/CD enablement, and operational tooling\u2014while reducing operational risk and controlling unit cost. 
The business value comes from improved availability and performance, faster delivery cycles through platform leverage, reduced incident impact, predictable capacity and cost, and audit-ready controls that protect customer trust.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Role horizon: <strong>Current<\/strong> (well-established leadership role in modern software organizations)<\/li>\n<li>Typical interactions: Product Engineering, SRE, Security, Architecture, IT\/Enterprise Systems, Finance\/FinOps, Customer Support, Data Engineering, Compliance\/Risk, Vendors\/Cloud providers<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">2) Role Mission<\/h2>\n\n\n\n<p><strong>Core mission:<\/strong> Build and lead an infrastructure engineering function that delivers a secure, scalable, and cost-efficient platform that enables product teams to deliver customer value with high reliability and strong operational governance.<\/p>\n\n\n\n<p><strong>Strategic importance:<\/strong> Infrastructure is the execution layer for the company\u2019s technology strategy. The Head of Infrastructure Engineering ensures the organization can grow (traffic, customers, data volume, global reach) without disproportionately increasing operational burden, downtime risk, or cloud spend. 
This leader also defines the \u201cpaved roads\u201d (standard patterns and self-service capabilities) that make engineering execution repeatable and resilient.<\/p>\n\n\n\n<p><strong>Primary business outcomes expected:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>High service availability and predictable performance aligned to customer expectations (SLOs\/SLAs)<\/li>\n<li>Reduced incident frequency and faster recovery (lower operational risk)<\/li>\n<li>Faster delivery enablement via standardized platforms and automation (higher engineering throughput)<\/li>\n<li>Cloud and infrastructure cost efficiency measured in unit economics and budget predictability<\/li>\n<li>Security and compliance alignment through embedded controls and auditable operations<\/li>\n<li>Sustainable on-call and operations model that retains talent and reduces burnout<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">3) Core Responsibilities<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Strategic responsibilities<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Infrastructure strategy and roadmap:<\/strong> Define and maintain a 12\u201324 month infrastructure roadmap aligned with product growth, architectural direction, and business objectives (availability, expansion, cost, speed).<\/li>\n<li><strong>Platform operating model:<\/strong> Establish and evolve the operating model for infrastructure engineering (platform team topology, ownership boundaries, SLAs\/SLOs, self-service strategy, escalation and on-call design).<\/li>\n<li><strong>Cloud and data center strategy (context-dependent):<\/strong> Own cloud strategy (single vs multi-cloud), hosting patterns, regional expansion, and deprecation plans for legacy infrastructure.<\/li>\n<li><strong>Reliability strategy with SRE\/Engineering:<\/strong> Partner with SRE and application leaders to set reliability targets and define shared accountability models (SLOs, error budgets, operational readiness).<\/li>\n<li><strong>FinOps and unit cost strategy:<\/strong> Co-own infrastructure 
unit economics (cost per customer\/tenant, cost per request, cost per environment) with Finance\/FinOps; drive cost optimization and forecasting discipline.<\/li>\n<li><strong>Vendor and tooling strategy:<\/strong> Select and rationalize critical infrastructure tooling (observability, CI\/CD, secrets management, CDNs, DDoS protection), including vendor negotiations and renewal governance.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Operational responsibilities<\/h3>\n\n\n\n<ol class=\"wp-block-list\" start=\"7\">\n<li><strong>Operational excellence and incident leadership:<\/strong> Ensure a 24\/7 operational coverage model, incident response execution, post-incident reviews (PIRs), and systemic remediation to prevent recurrence.<\/li>\n<li><strong>Capacity and performance management:<\/strong> Establish capacity planning, load testing strategy (with performance engineering), and operational readiness gates for major launches.<\/li>\n<li><strong>Change management and release governance:<\/strong> Own infrastructure change practices (change windows, progressive delivery for the platform, rollout risk management, rollback readiness).<\/li>\n<li><strong>Service health reporting:<\/strong> Provide regular reporting on uptime, latency, error rates, incident trends, and operational risks for executive and stakeholder consumption.<\/li>\n<li><strong>Environment management:<\/strong> Ensure healthy, consistent environments (dev\/test\/stage\/prod), including provisioning, isolation, data handling controls, and lifecycle management.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Technical responsibilities<\/h3>\n\n\n\n<ol class=\"wp-block-list\" start=\"12\">\n<li><strong>Infrastructure architecture direction:<\/strong> Set reference architectures and reusable patterns for networking, compute, storage, Kubernetes, and IaC modules; enforce standards through automation and reviews.<\/li>\n<li><strong>Infrastructure as Code and automation:<\/strong> Drive 
standardization and adoption of IaC and automation (e.g., Terraform modules, GitOps pipelines) to reduce manual work and drift.<\/li>\n<li><strong>Observability and telemetry:<\/strong> Ensure robust logging, metrics, tracing, alerting, and runbooks exist and are actionable; evolve alert quality and reduce toil.<\/li>\n<li><strong>Security engineering alignment:<\/strong> Partner with Security to implement secure-by-default configurations, secrets management, identity and access controls, network segmentation, and vulnerability remediation processes.<\/li>\n<li><strong>Resilience, backup, and disaster recovery:<\/strong> Define and test DR strategy (RTO\/RPO targets), backup\/restore procedures, and regional failover approaches; lead game days and resilience testing.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Cross-functional or stakeholder responsibilities<\/h3>\n\n\n\n<ol class=\"wp-block-list\" start=\"17\">\n<li><strong>Enablement of product engineering:<\/strong> Provide self-service platform capabilities, golden paths, and documentation that reduce dependencies and accelerate product delivery.<\/li>\n<li><strong>Customer-impact collaboration:<\/strong> Coordinate with Support\/Customer Success for incident communications, maintenance notifications, and problem management for recurring customer-impacting issues.<\/li>\n<li><strong>Executive communication:<\/strong> Translate technical risk and infrastructure investment needs into business outcomes, trade-offs, and prioritized funding asks.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Governance, compliance, or quality responsibilities<\/h3>\n\n\n\n<ol class=\"wp-block-list\" start=\"20\">\n<li><strong>Controls and audit readiness:<\/strong> Ensure infrastructure operations are auditable (access reviews, change controls, logging retention, asset inventory, secure configurations) supporting frameworks like SOC 2\/ISO 27001 (context-specific).<\/li>\n<li><strong>Policy and standards 
management:<\/strong> Maintain infrastructure standards and policies (naming, tagging, data handling, encryption, key rotation, lifecycle management).<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Leadership responsibilities (core to the \u201cHead of\u201d title)<\/h3>\n\n\n\n<ol class=\"wp-block-list\" start=\"22\">\n<li><strong>Org leadership and talent strategy:<\/strong> Build and lead managers and senior engineers; define job architecture, leveling, hiring plan, team structure, and career paths for infrastructure engineering.<\/li>\n<li><strong>Culture and execution management:<\/strong> Establish a culture of ownership, blameless learning, quality, and operational rigor; manage priorities and dependencies across multiple teams.<\/li>\n<li><strong>Budget ownership:<\/strong> Own or co-own the infrastructure engineering budget (headcount, tooling, cloud commitments), including forecasting and investment governance.<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\">4) Day-to-Day Activities<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Daily activities<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Review service health dashboards: availability, latency, error budgets, key customer journeys, and platform saturation indicators.<\/li>\n<li>Monitor incident queue\/escalations; ensure proper triage, severity assignment, and ownership.<\/li>\n<li>Approve or review high-risk infrastructure changes (network changes, cluster upgrades, IAM policy changes, database platform changes\u2014context-specific).<\/li>\n<li>Unblock engineering teams on infrastructure dependencies: environment constraints, provisioning issues, access requests, capacity constraints.<\/li>\n<li>Review key alerts and on-call quality signals: noisy alerts, paging frequency, time-to-acknowledge, time-to-mitigate.<\/li>\n<li>Quick check-ins with direct reports (managers\/tech leads) to align priorities and remove obstacles.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Weekly 
activities<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Leadership staff meeting: roadmap progress, risks, hiring pipeline, operational issues, dependency alignment with Product Engineering\/SRE\/Security.<\/li>\n<li>Reliability review: major incidents, near misses, error budget status, problem management items, top operational risks.<\/li>\n<li>Change review board (lightweight): upcoming platform upgrades, deprecations, or migrations; ensure rollback plans and comms.<\/li>\n<li>FinOps review: cost anomalies, savings opportunities (commitments\/reservations), tagging compliance, environment sprawl.<\/li>\n<li>Partner meetings: Security (risk and controls), Architecture (standards), Data Engineering (platform needs), Support (customer-impact trends).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Monthly or quarterly activities<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Quarterly planning: infrastructure roadmap updates, headcount planning, major initiatives sequencing, dependency mapping.<\/li>\n<li>DR tests and resilience game days (monthly or quarterly depending on risk profile).<\/li>\n<li>Vendor reviews: renewal decisions, SLA performance, support escalations, feature roadmaps.<\/li>\n<li>Talent reviews: performance, promotions, retention risks, succession planning.<\/li>\n<li>Metrics and governance reporting to VP Engineering\/CTO: reliability trend, cost trends, operational maturity progress.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Recurring meetings or rituals<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Weekly incident\/problem management review<\/li>\n<li>Monthly reliability council (multi-team)<\/li>\n<li>Quarterly architecture review board (infrastructure standards and patterns)<\/li>\n<li>Quarterly business review (QBR) for infrastructure engineering function<\/li>\n<li>On-call retrospective (monthly) focusing on toil reduction and alert hygiene<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Incident, escalation, or 
emergency work (as relevant)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Serve as executive incident commander for Sev-1\/Sev-0 events when needed.<\/li>\n<li>Coordinate cross-functional response: product engineering, SRE, security, vendor support, and customer communications.<\/li>\n<li>Ensure post-incident reviews happen within agreed timelines and that corrective actions are prioritized and executed.<\/li>\n<li>Manage emergency capacity events (traffic spikes), vendor outages, or compromised credentials (in partnership with Security).<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">5) Key Deliverables<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Infrastructure strategy and roadmap (12\u201324 months):<\/strong> Investment plan, deprecations, platform modernization, and capacity growth.<\/li>\n<li><strong>Reference architectures and \u201cpaved road\u201d patterns:<\/strong> Standard architectures for services, networking, Kubernetes deployments, secrets, ingress, and observability.<\/li>\n<li><strong>Infrastructure-as-Code repositories and module catalogs:<\/strong> Versioned Terraform modules, policy-as-code, and templates enabling self-service provisioning.<\/li>\n<li><strong>Platform reliability framework:<\/strong> SLO\/SLI definitions, error budget policy, service tiering, operational readiness checklist.<\/li>\n<li><strong>Incident response program artifacts:<\/strong> Severity model, incident roles, runbooks, PIR templates, problem management backlog.<\/li>\n<li><strong>Observability standards:<\/strong> Logging schema guidance, metrics naming conventions, tracing instrumentation requirements, alert quality rules.<\/li>\n<li><strong>Capacity plans and performance readiness reports:<\/strong> Forecasting models, load test results, bottleneck remediation plans.<\/li>\n<li><strong>Disaster recovery plan and test reports:<\/strong> RTO\/RPO definitions, dependency mapping, DR runbooks, evidence of testing outcomes.<\/li>\n<li><strong>Security 
and compliance evidence:<\/strong> Access review process, change control evidence, configuration baselines, audit response artifacts (context-specific).<\/li>\n<li><strong>Service dashboards and executive reporting:<\/strong> Reliability scorecards, cost dashboards, toil metrics, delivery enablement metrics.<\/li>\n<li><strong>Vendor contracts and renewal recommendations:<\/strong> Business justification, cost-benefit analysis, risk assessments.<\/li>\n<li><strong>Team operating model documentation:<\/strong> Ownership boundaries (RACI), escalation paths, SLAs for platform services, engagement model.<\/li>\n<li><strong>Training and enablement materials:<\/strong> Platform onboarding, runbook writing guides, incident response training, internal workshops.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">6) Goals, Objectives, and Milestones<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">30-day goals (orientation and assessment)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Build a full inventory of infrastructure services, environments, critical dependencies, and operational pain points.<\/li>\n<li>Assess current reliability posture: incident history, top failure modes, monitoring gaps, on-call health.<\/li>\n<li>Review cloud spend structure and cost drivers; baseline current unit costs and major spend categories.<\/li>\n<li>Meet key stakeholders across Engineering, Security, Support, and Finance; confirm expectations and pain points.<\/li>\n<li>Validate team structure, skills coverage, and current roadmap; identify immediate execution risks.<\/li>\n<\/ul>\n\n\n\n<p><strong>Success definition (30 days):<\/strong> Clear baseline of current state, prioritized risk register, and aligned stakeholder expectations.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">60-day goals (stabilization and near-term improvements)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Deliver a prioritized infrastructure roadmap draft with clear outcomes, owners, and 
sequencing.<\/li>\n<li>Implement top 3 operational improvements (e.g., noisy alert reduction, incident response improvements, critical runbook gaps).<\/li>\n<li>Establish a consistent metrics cadence: reliability scorecard, cost dashboard, and operational review rhythm.<\/li>\n<li>Confirm DR posture and schedule first resilience test\/game day if maturity is low.<\/li>\n<li>Improve team execution visibility: dependency tracking, staffing plan, and hiring priorities.<\/li>\n<\/ul>\n\n\n\n<p><strong>Success definition (60 days):<\/strong> Operational cadence running; visible improvements to reliability\/toil; roadmap agreed in principle.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">90-day goals (execution and operating model establishment)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Launch paved road initiatives: self-service provisioning improvements, standardized IaC modules, baseline security controls.<\/li>\n<li>Formalize SLOs\/SLIs and service tiering for critical systems (with SRE and product leadership).<\/li>\n<li>Implement capacity planning framework and performance readiness gates for launches.<\/li>\n<li>Produce an annual budget view for tooling and cloud commitments; align with Finance\/FinOps.<\/li>\n<li>Clarify ownership boundaries (Platform vs SRE vs Product teams), escalation paths, and service engagement model.<\/li>\n<\/ul>\n\n\n\n<p><strong>Success definition (90 days):<\/strong> Predictable operating model, measurable reliability improvements, and clear plan for scaling.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">6-month milestones (platform leverage and modernization)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Reduce high-severity incidents and\/or MTTR by a meaningful margin through systemic fixes and operational maturity.<\/li>\n<li>Improve platform self-service adoption (e.g., more teams using standardized modules\/pipelines; reduced provisioning cycle time).<\/li>\n<li>Achieve measurable cost optimization outcomes (commitment coverage, 
waste reduction, environment lifecycle controls).<\/li>\n<li>Complete at least one major platform modernization initiative (e.g., Kubernetes upgrade program, network segmentation improvements, observability platform consolidation).<\/li>\n<li>Demonstrate successful DR test execution with documented learnings and closed action items.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">12-month objectives (business-grade reliability and efficiency)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Meet agreed reliability targets for Tier-0\/Tier-1 services (availability, latency, error budget compliance).<\/li>\n<li>Establish infrastructure engineering as a product-like function with customer (engineering) satisfaction metrics, SLAs, and clear service catalogs.<\/li>\n<li>Achieve audit-ready infrastructure operations (as required): consistent access controls, logging, change management evidence, configuration baselines.<\/li>\n<li>Reduce infrastructure unit cost or stabilize cost growth relative to business growth through architecture and FinOps practices.<\/li>\n<li>Build a strong leadership bench: succession plan, manager capability, and senior technical leadership maturity.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Long-term impact goals (multi-year)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Create a scalable platform foundation enabling rapid product expansion (regions, enterprise features, new workloads) without linear headcount growth.<\/li>\n<li>Mature the company toward high reliability and delivery performance: fewer outages, faster launches, safer changes.<\/li>\n<li>Position infrastructure as a competitive advantage (performance, security posture, enterprise readiness).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">What high performance looks like<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Reliability is predictable; incidents are handled with discipline and result in lasting remediation.<\/li>\n<li>Engineering teams ship faster 
because infrastructure is self-service, standardized, and well-documented.<\/li>\n<li>Costs are explainable and optimized; trade-offs are data-driven.<\/li>\n<li>Security controls are embedded and do not rely on heroics.<\/li>\n<li>Team health is strong: sustainable on-call, clear priorities, high retention, and strong hiring signal.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">7) KPIs and Productivity Metrics<\/h2>\n\n\n\n<p>The metrics below are designed for an infrastructure engineering leader with accountability for reliability, cost, and platform enablement. Benchmarks vary significantly by business maturity and architecture; targets should be calibrated to service tier and customer commitments.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">KPI framework<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Metric name<\/th>\n<th>Type<\/th>\n<th>What it measures<\/th>\n<th>Why it matters<\/th>\n<th>Example target \/ benchmark<\/th>\n<th>Frequency<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Availability (Tier-0\/Tier-1)<\/td>\n<td>Outcome \/ Reliability<\/td>\n<td>Uptime of critical services or platform components<\/td>\n<td>Directly impacts customers and revenue<\/td>\n<td>Tier-0: 99.95\u201399.99%; Tier-1: 99.9\u201399.95%<\/td>\n<td>Weekly\/Monthly<\/td>\n<\/tr>\n<tr>\n<td>SLO compliance \/ Error budget burn<\/td>\n<td>Outcome \/ Reliability<\/td>\n<td>Whether services stay within SLOs and how fast error budgets burn<\/td>\n<td>Makes reliability trade-offs explicit<\/td>\n<td>&lt;25% burn mid-period; avoid budget exhaustion<\/td>\n<td>Weekly<\/td>\n<\/tr>\n<tr>\n<td>MTTR (Mean Time to Restore)<\/td>\n<td>Efficiency \/ Reliability<\/td>\n<td>Time from incident start to mitigation\/restore<\/td>\n<td>Reduces customer impact and operational risk<\/td>\n<td>Improve by 20\u201340% YoY (baseline-dependent)<\/td>\n<td>Monthly<\/td>\n<\/tr>\n<tr>\n<td>MTTD (Mean Time to Detect)<\/td>\n<td>Quality \/ Observability<\/td>\n<td>Time to detect 
incidents<\/td>\n<td>Indicates monitoring effectiveness<\/td>\n<td>Target minutes, not hours, for Tier-0<\/td>\n<td>Monthly<\/td>\n<\/tr>\n<tr>\n<td>Incident rate (Sev-0\/1\/2)<\/td>\n<td>Output \/ Outcome<\/td>\n<td>Count of incidents by severity<\/td>\n<td>Tracks stability trend<\/td>\n<td>Downward trend; focus on Sev-1 reduction<\/td>\n<td>Monthly<\/td>\n<\/tr>\n<tr>\n<td>Change failure rate (infrastructure)<\/td>\n<td>Quality<\/td>\n<td>Percentage of changes causing incidents\/rollbacks<\/td>\n<td>Measures release safety<\/td>\n<td>&lt;10\u201315% depending on maturity<\/td>\n<td>Monthly<\/td>\n<\/tr>\n<tr>\n<td>Infrastructure deployment frequency<\/td>\n<td>Output<\/td>\n<td>How often infra\/platform changes are shipped<\/td>\n<td>Indicates automation and throughput<\/td>\n<td>Weekly cadence minimum; daily for mature teams<\/td>\n<td>Weekly\/Monthly<\/td>\n<\/tr>\n<tr>\n<td>Lead time for platform changes<\/td>\n<td>Efficiency<\/td>\n<td>Time from request\/PR to production<\/td>\n<td>Measures internal customer experience<\/td>\n<td>Days, not weeks; tiered by risk<\/td>\n<td>Monthly<\/td>\n<\/tr>\n<tr>\n<td>Provisioning cycle time<\/td>\n<td>Efficiency \/ Enablement<\/td>\n<td>Time to provision environments\/resources via self-service<\/td>\n<td>Indicates platform leverage<\/td>\n<td>Minutes to hours for standard resources<\/td>\n<td>Monthly<\/td>\n<\/tr>\n<tr>\n<td>On-call load (pages per engineer)<\/td>\n<td>Leadership \/ Sustainability<\/td>\n<td>Paging frequency per on-call shift<\/td>\n<td>Signals toil and burnout risk<\/td>\n<td>Trending down; alert quality improvements<\/td>\n<td>Weekly\/Monthly<\/td>\n<\/tr>\n<tr>\n<td>Alert quality (actionable %)<\/td>\n<td>Quality<\/td>\n<td>Percent of alerts requiring action<\/td>\n<td>Reduces distraction and increases trust in monitoring<\/td>\n<td>&gt;70\u201385% actionable<\/td>\n<td>Monthly<\/td>\n<\/tr>\n<tr>\n<td>Infra cost vs budget<\/td>\n<td>Outcome \/ Financial<\/td>\n<td>Total infra spend vs 
forecast<\/td>\n<td>Budget predictability and governance<\/td>\n<td>\u00b15\u201310% variance<\/td>\n<td>Monthly<\/td>\n<\/tr>\n<tr>\n<td>Unit cost (e.g., cost per customer\/tenant\/request)<\/td>\n<td>Outcome \/ Financial<\/td>\n<td>Cost efficiency normalized to usage<\/td>\n<td>Aligns infra investment to growth<\/td>\n<td>Stable or improving trend<\/td>\n<td>Monthly\/Quarterly<\/td>\n<\/tr>\n<tr>\n<td>Waste reduction (idle resources, orphaned volumes, env sprawl)<\/td>\n<td>Efficiency \/ Financial<\/td>\n<td>Amount of eliminated waste<\/td>\n<td>Direct cost savings<\/td>\n<td>Quarterly savings target<\/td>\n<td>Monthly\/Quarterly<\/td>\n<\/tr>\n<tr>\n<td>Tagging\/chargeback coverage<\/td>\n<td>Governance<\/td>\n<td>Percentage of spend properly attributed<\/td>\n<td>Enables accountability<\/td>\n<td>&gt;90\u201395% tagged<\/td>\n<td>Monthly<\/td>\n<\/tr>\n<tr>\n<td>Backup success rate \/ Restore success<\/td>\n<td>Quality \/ Resilience<\/td>\n<td>Backup completion and tested restore success<\/td>\n<td>Ensures recoverability<\/td>\n<td>&gt;99% backups; quarterly restore tests<\/td>\n<td>Weekly\/Quarterly<\/td>\n<\/tr>\n<tr>\n<td>DR readiness score (RTO\/RPO achieved in tests)<\/td>\n<td>Outcome \/ Resilience<\/td>\n<td>Ability to meet DR objectives<\/td>\n<td>Protects business continuity<\/td>\n<td>Meet defined RTO\/RPO for Tier-0<\/td>\n<td>Quarterly<\/td>\n<\/tr>\n<tr>\n<td>Security baseline compliance (CIS, policy-as-code pass rate)<\/td>\n<td>Governance \/ Security<\/td>\n<td>Config compliance across infra<\/td>\n<td>Reduces breach risk<\/td>\n<td>&gt;95% compliance with exceptions tracked<\/td>\n<td>Monthly<\/td>\n<\/tr>\n<tr>\n<td>Audit findings (count and severity)<\/td>\n<td>Governance<\/td>\n<td>Number\/severity of infra-related audit issues<\/td>\n<td>Measures control maturity<\/td>\n<td>0 high severity; rapid remediation<\/td>\n<td>Quarterly<\/td>\n<\/tr>\n<tr>\n<td>Internal customer satisfaction (engineering 
NPS)<\/td>\n<td>Stakeholder<\/td>\n<td>Product engineering satisfaction with platform<\/td>\n<td>Measures enablement success<\/td>\n<td>Trending upward; target set locally<\/td>\n<td>Quarterly<\/td>\n<\/tr>\n<tr>\n<td>Hiring plan attainment<\/td>\n<td>Leadership<\/td>\n<td>Hiring progress vs plan<\/td>\n<td>Capacity to deliver roadmap<\/td>\n<td>90\u2013110% of plan<\/td>\n<td>Monthly<\/td>\n<\/tr>\n<tr>\n<td>Retention \/ regretted attrition<\/td>\n<td>Leadership<\/td>\n<td>Team stability and talent health<\/td>\n<td>Reduces execution risk<\/td>\n<td>Low regretted attrition; intervene early<\/td>\n<td>Quarterly<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<p><strong>Implementation guidance:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Segment metrics by <strong>service tier<\/strong> (Tier-0 vs Tier-2) to avoid unrealistic uniform targets.<\/li>\n<li>Pair every reliability metric with a <strong>remediation mechanism<\/strong> (problem management backlog, error budget policy).<\/li>\n<li>At early-stage maturity, track <strong>trend lines<\/strong> rather than point-in-time values.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">8) Technical Skills Required<\/h2>\n\n\n\n<p>The Head of Infrastructure Engineering is a technical leader. Depth expectations are highest in architecture, cloud primitives, operational maturity, and platform automation. 
Hands-on coding may vary by company size, but the ability to review designs and challenge assumptions is mandatory.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Must-have technical skills<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Cloud infrastructure architecture (AWS\/Azure\/GCP)<\/strong>\n<ul class=\"wp-block-list\">\n<li>Use: Setting standards, guiding designs, capacity\/cost decisions, risk reviews<\/li>\n<li>Importance: <strong>Critical<\/strong><\/li>\n<\/ul>\n<\/li>\n<li><strong>Linux systems and networking fundamentals (TCP\/IP, DNS, TLS, routing, load balancing)<\/strong>\n<ul class=\"wp-block-list\">\n<li>Use: Root cause analysis, architectural guardrails, performance and availability design<\/li>\n<li>Importance: <strong>Critical<\/strong><\/li>\n<\/ul>\n<\/li>\n<li><strong>Infrastructure as Code (Terraform common; alternatives context-specific)<\/strong>\n<ul class=\"wp-block-list\">\n<li>Use: Standardization, automation, drift control, scalable provisioning<\/li>\n<li>Importance: <strong>Critical<\/strong><\/li>\n<\/ul>\n<\/li>\n<li><strong>Containers and orchestration (Kubernetes common)<\/strong>\n<ul class=\"wp-block-list\">\n<li>Use: Platform strategy, cluster lifecycle, multi-tenancy patterns, upgrades, reliability<\/li>\n<li>Importance: <strong>Critical<\/strong> (in containerized orgs); <strong>Important<\/strong> otherwise<\/li>\n<\/ul>\n<\/li>\n<li><strong>Observability (metrics, logs, traces, alerting discipline)<\/strong>\n<ul class=\"wp-block-list\">\n<li>Use: Monitoring strategy, SLO measurement, incident reduction<\/li>\n<li>Importance: <strong>Critical<\/strong><\/li>\n<\/ul>\n<\/li>\n<li><strong>Incident management and operational readiness<\/strong>\n<ul class=\"wp-block-list\">\n<li>Use: Severity management, PIRs, runbooks, escalation models<\/li>\n<li>Importance: <strong>Critical<\/strong><\/li>\n<\/ul>\n<\/li>\n<li><strong>Security fundamentals for infrastructure (IAM, secrets, encryption, network segmentation)<\/strong>\n<ul class=\"wp-block-list\">\n<li>Use: Secure-by-default standards, risk remediation partnership with Security<\/li>\n<li>Importance: <strong>Critical<\/strong><\/li>\n<\/ul>\n<\/li>\n<li><strong>CI\/CD and delivery enablement (pipelines, artifact 
management, rollout strategies)<\/strong>\n<ul class=\"wp-block-list\">\n<li>Use: Platform automation, safe deployment patterns, standard workflows<\/li>\n<li>Importance: <strong>Important<\/strong><\/li>\n<\/ul>\n<\/li>\n<li><strong>Performance and capacity planning<\/strong>\n<ul class=\"wp-block-list\">\n<li>Use: Scaling decisions, launch readiness, cost control<\/li>\n<li>Importance: <strong>Important<\/strong><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Good-to-have technical skills<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Service mesh \/ ingress architectures (Envoy-based, API gateways)<\/strong>\n<ul class=\"wp-block-list\">\n<li>Use: Traffic management, security, observability at network layer<\/li>\n<li>Importance: <strong>Optional<\/strong> (depends on architecture)<\/li>\n<\/ul>\n<\/li>\n<li><strong>Policy-as-code (OPA\/Gatekeeper, cloud policy tools)<\/strong>\n<ul class=\"wp-block-list\">\n<li>Use: Guardrails, compliance automation, reducing manual approvals<\/li>\n<li>Importance: <strong>Important<\/strong><\/li>\n<\/ul>\n<\/li>\n<li><strong>Configuration management (Ansible, Chef, Puppet) and image building (Packer)<\/strong>\n<ul class=\"wp-block-list\">\n<li>Use: Standard images, baseline configs, legacy VM fleets<\/li>\n<li>Importance: <strong>Optional<\/strong> (more relevant outside Kubernetes-first)<\/li>\n<\/ul>\n<\/li>\n<li><strong>Database platform awareness (managed databases, backup\/restore patterns, HA concepts)<\/strong>\n<ul class=\"wp-block-list\">\n<li>Use: Partnering with DBAs\/Data teams, resilience planning<\/li>\n<li>Importance: <strong>Important<\/strong><\/li>\n<\/ul>\n<\/li>\n<li><strong>CDN\/DDoS\/WAF concepts<\/strong>\n<ul class=\"wp-block-list\">\n<li>Use: Edge performance, availability protection, security posture<\/li>\n<li>Importance: <strong>Important<\/strong> (internet-facing products)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Advanced or expert-level technical skills<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Distributed systems reliability patterns<\/strong> (graceful degradation, circuit breakers, multi-region design)\n<ul class=\"wp-block-list\">\n<li>Use: Architecture reviews, reliability posture for critical 
journeys<\/li>\n<li>Importance: <strong>Important<\/strong> (often critical at scale)<\/li>\n<\/ul>\n<\/li>\n<li><strong>FinOps and cloud economics<\/strong> (commitment strategies, cost allocation, unit economics)\n<ul class=\"wp-block-list\">\n<li>Use: Cost governance, forecasting, architecture trade-offs<\/li>\n<li>Importance: <strong>Important<\/strong><\/li>\n<\/ul>\n<\/li>\n<li><strong>Complex migration leadership<\/strong> (data center to cloud, re-platforming, cluster migrations)\n<ul class=\"wp-block-list\">\n<li>Use: De-risking major transitions and continuity planning<\/li>\n<li>Importance: <strong>Important<\/strong><\/li>\n<\/ul>\n<\/li>\n<li><strong>Advanced network design<\/strong> (multi-region routing, private connectivity, zero trust patterns)\n<ul class=\"wp-block-list\">\n<li>Use: Enterprise readiness, regulated environments, latency-sensitive systems<\/li>\n<li>Importance: <strong>Optional to Important<\/strong> (context-specific)<\/li>\n<\/ul>\n<\/li>\n<li><strong>Reliability engineering program design<\/strong> (SLO taxonomy, error budgets, toil frameworks)\n<ul class=\"wp-block-list\">\n<li>Use: Institutionalizing reliability practices<\/li>\n<li>Importance: <strong>Important<\/strong><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Emerging future skills for this role (2\u20135 year horizon)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>AI-assisted operations (AIOps) and anomaly detection<\/strong>\n<ul class=\"wp-block-list\">\n<li>Use: Faster detection, alert correlation, incident summarization<\/li>\n<li>Importance: <strong>Important<\/strong> (becoming more common)<\/li>\n<\/ul>\n<\/li>\n<li><strong>Platform engineering product management<\/strong> (service catalogs, developer experience metrics)\n<ul class=\"wp-block-list\">\n<li>Use: Treating platform as a product; adoption and satisfaction focus<\/li>\n<li>Importance: <strong>Important<\/strong><\/li>\n<\/ul>\n<\/li>\n<li><strong>Confidential computing \/ advanced workload isolation<\/strong>\n<ul class=\"wp-block-list\">\n<li>Use: Higher assurance for sensitive workloads<\/li>\n<li>Importance: <strong>Optional<\/strong> (regulated\/high-security contexts)<\/li>\n<\/ul>\n<\/li>\n<li><strong>Software supply chain 
security (SLSA, provenance, artifact signing)<\/strong> <\/li>\n<li>Use: Reducing build and deployment tampering risk  <\/li>\n<li>Importance: <strong>Important<\/strong> (rising expectation)<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">9) Soft Skills and Behavioral Capabilities<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Systems thinking and prioritization<\/strong> <\/li>\n<li>Why it matters: Infrastructure work competes across reliability, cost, speed, and security; trade-offs must be coherent.  <\/li>\n<li>How it shows up: Clear decision frameworks, tiering, sequencing roadmaps, avoiding reactive thrash.  <\/li>\n<li>\n<p>Strong performance: Can explain \u201cwhy this, why now\u201d with data; avoids local optimizations that create systemic fragility.<\/p>\n<\/li>\n<li>\n<p><strong>Executive communication and narrative building<\/strong> <\/p>\n<\/li>\n<li>Why it matters: Infrastructure investments are often non-obvious but business-critical.  <\/li>\n<li>How it shows up: Concise risk reporting, budget justification, incident communications, roadmap storytelling.  <\/li>\n<li>\n<p>Strong performance: Earns trust from CTO\/VP Eng\/CFO; translates technical constraints into business outcomes and options.<\/p>\n<\/li>\n<li>\n<p><strong>Operational leadership under pressure<\/strong> <\/p>\n<\/li>\n<li>Why it matters: Sev-1 incidents require calm, clarity, and coordination.  <\/li>\n<li>How it shows up: Incident commander behavior, triage discipline, stakeholder updates, post-incident follow-through.  <\/li>\n<li>\n<p>Strong performance: Reduces chaos; drives fast mitigation; ensures learning without blame; closes actions.<\/p>\n<\/li>\n<li>\n<p><strong>Talent development and coaching (managers and senior ICs)<\/strong> <\/p>\n<\/li>\n<li>Why it matters: Infrastructure organizations depend on rare skills; retention and growth are strategic.  
<\/li>\n<li>How it shows up: Coaching, clear expectations, growth plans, effective delegation, building leadership bench.  <\/li>\n<li>\n<p>Strong performance: Improves team autonomy; increases internal promotions; creates resilient org not dependent on heroes.<\/p>\n<\/li>\n<li>\n<p><strong>Cross-functional influence and partnership<\/strong> <\/p>\n<\/li>\n<li>Why it matters: Reliability and security are shared outcomes; infrastructure teams cannot succeed unilaterally.  <\/li>\n<li>How it shows up: Joint roadmaps with SRE\/Security, negotiated ownership, shared OKRs, aligned standards.  <\/li>\n<li>\n<p>Strong performance: Builds durable agreements; reduces friction; decisions stick because stakeholders co-own them.<\/p>\n<\/li>\n<li>\n<p><strong>Customer-centric mindset (internal and external)<\/strong> <\/p>\n<\/li>\n<li>Why it matters: Platform engineering serves product teams; outages affect paying customers.  <\/li>\n<li>How it shows up: Developer experience metrics, clear SLAs, empathetic incident comms, pragmatic usability.  <\/li>\n<li>\n<p>Strong performance: Platform adoption increases because it is easier than bespoke alternatives.<\/p>\n<\/li>\n<li>\n<p><strong>Decision-making with incomplete information<\/strong> <\/p>\n<\/li>\n<li>Why it matters: Infrastructure incidents and scaling constraints require action before perfect data exists.  <\/li>\n<li>How it shows up: Risk-based decisions, staging approaches, reversible choices, fast experimentation.  <\/li>\n<li>\n<p>Strong performance: Makes timely decisions; monitors outcomes; adjusts quickly without losing credibility.<\/p>\n<\/li>\n<li>\n<p><strong>Conflict management and boundary setting<\/strong> <\/p>\n<\/li>\n<li>Why it matters: Tension is common between speed and control, product urgency and platform constraints.  <\/li>\n<li>How it shows up: Clear engagement models, escalation paths, transparent prioritization, saying \u201cno\u201d with alternatives.  
<\/li>\n<li>Strong performance: Prevents shadow infrastructure; keeps teams aligned without becoming a bottleneck.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">10) Tools, Platforms, and Software<\/h2>\n\n\n\n<p>Tooling varies by company size and cloud strategy. Items below reflect common enterprise-grade infrastructure engineering environments. Labels indicate applicability.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Category<\/th>\n<th>Tool \/ Platform<\/th>\n<th>Primary use<\/th>\n<th>Common \/ Optional \/ Context-specific<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Cloud platforms<\/td>\n<td>AWS<\/td>\n<td>Core compute\/storage\/network primitives<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>Cloud platforms<\/td>\n<td>Azure<\/td>\n<td>Alternative\/secondary cloud or enterprise-aligned workloads<\/td>\n<td>Context-specific<\/td>\n<\/tr>\n<tr>\n<td>Cloud platforms<\/td>\n<td>GCP<\/td>\n<td>Data\/analytics-heavy workloads or secondary cloud<\/td>\n<td>Context-specific<\/td>\n<\/tr>\n<tr>\n<td>Container\/orchestration<\/td>\n<td>Kubernetes (EKS\/AKS\/GKE or self-managed)<\/td>\n<td>Container orchestration platform<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>Container\/orchestration<\/td>\n<td>Helm \/ Kustomize<\/td>\n<td>Kubernetes packaging and configuration<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>Container\/orchestration<\/td>\n<td>Argo CD \/ Flux (GitOps)<\/td>\n<td>Declarative deployments for platform\/workloads<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>IaC \/ provisioning<\/td>\n<td>Terraform<\/td>\n<td>Infrastructure-as-Code provisioning<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>IaC \/ provisioning<\/td>\n<td>CloudFormation \/ ARM \/ Pulumi<\/td>\n<td>Cloud-native IaC alternatives<\/td>\n<td>Context-specific<\/td>\n<\/tr>\n<tr>\n<td>Config mgmt<\/td>\n<td>Ansible<\/td>\n<td>Configuration automation, legacy fleets<\/td>\n<td>Optional<\/td>\n<\/tr>\n<tr>\n<td>Networking\/edge<\/td>\n<td>Cloudflare \/ Akamai<\/td>\n<td>CDN, 
WAF, DDoS protection<\/td>\n<td>Context-specific<\/td>\n<\/tr>\n<tr>\n<td>CI\/CD<\/td>\n<td>GitHub Actions \/ GitLab CI<\/td>\n<td>Pipeline automation<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>CI\/CD<\/td>\n<td>Jenkins<\/td>\n<td>Legacy CI or specialized workflows<\/td>\n<td>Context-specific<\/td>\n<\/tr>\n<tr>\n<td>Artifact mgmt<\/td>\n<td>Artifactory \/ Nexus \/ ECR\/GAR<\/td>\n<td>Artifact and container registry<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>Observability<\/td>\n<td>Datadog<\/td>\n<td>Metrics\/APM\/logs unified observability<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>Observability<\/td>\n<td>Prometheus + Grafana<\/td>\n<td>Metrics and dashboards<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>Observability<\/td>\n<td>ELK\/OpenSearch<\/td>\n<td>Logging analytics<\/td>\n<td>Context-specific<\/td>\n<\/tr>\n<tr>\n<td>Observability<\/td>\n<td>Splunk<\/td>\n<td>Enterprise log analytics\/SIEM feed<\/td>\n<td>Context-specific<\/td>\n<\/tr>\n<tr>\n<td>Alerting\/on-call<\/td>\n<td>PagerDuty \/ Opsgenie<\/td>\n<td>On-call scheduling and incident response<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>ITSM<\/td>\n<td>ServiceNow<\/td>\n<td>Change management, incident\/problem records<\/td>\n<td>Context-specific<\/td>\n<\/tr>\n<tr>\n<td>Security\/IAM<\/td>\n<td>Okta \/ Entra ID<\/td>\n<td>SSO, identity federation<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>Security\/secrets<\/td>\n<td>HashiCorp Vault \/ cloud secrets manager<\/td>\n<td>Secrets management and rotation<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>Security\/scanning<\/td>\n<td>Wiz \/ Prisma Cloud<\/td>\n<td>Cloud security posture management<\/td>\n<td>Context-specific<\/td>\n<\/tr>\n<tr>\n<td>Security\/scanning<\/td>\n<td>Snyk \/ Trivy<\/td>\n<td>Container\/image vulnerability scanning<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>Policy-as-code<\/td>\n<td>OPA\/Gatekeeper<\/td>\n<td>Kubernetes admission control\/guardrails<\/td>\n<td>Optional to Common<\/td>\n<\/tr>\n<tr>\n<td>Collaboration<\/td>\n<td>Slack \/ 
Microsoft Teams<\/td>\n<td>Operational coordination<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>Documentation<\/td>\n<td>Confluence \/ Notion<\/td>\n<td>Runbooks, standards, knowledge base<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>Work management<\/td>\n<td>Jira \/ Azure DevOps<\/td>\n<td>Backlog, planning, dependency tracking<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>Source control<\/td>\n<td>GitHub \/ GitLab \/ Bitbucket<\/td>\n<td>Repo management, code review<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>Automation\/scripting<\/td>\n<td>Python<\/td>\n<td>Tooling automation, operational scripts<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>Automation\/scripting<\/td>\n<td>Bash<\/td>\n<td>Systems automation and troubleshooting<\/td>\n<td>Common<\/td>\n<\/tr>\n<tr>\n<td>Data\/analytics<\/td>\n<td>BigQuery\/Snowflake\/Databricks (awareness)<\/td>\n<td>Cost\/reliability analytics inputs<\/td>\n<td>Context-specific<\/td>\n<\/tr>\n<tr>\n<td>Feature flags (adjacent)<\/td>\n<td>LaunchDarkly<\/td>\n<td>Progressive delivery enablement (partner)<\/td>\n<td>Optional<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">11) Typical Tech Stack \/ Environment<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Infrastructure environment<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Predominantly cloud-hosted infrastructure with multiple environments (dev\/stage\/prod) and strong separation controls.<\/li>\n<li>Mix of managed services (managed Kubernetes, managed databases) and platform-managed components (service mesh, ingress, internal tooling).<\/li>\n<li>High availability patterns: multi-AZ for most Tier-0 services; multi-region for critical customer-facing components (context-dependent).<\/li>\n<li>Network architecture includes VPC\/VNet segmentation, private endpoints, secure egress, and centralized DNS\/TLS management.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Application environment<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Microservices 
and\/or modular monoliths running on Kubernetes or PaaS services.<\/li>\n<li>Standardized deployment patterns via GitOps or CI\/CD pipelines.<\/li>\n<li>Progressive delivery patterns (canary, blue\/green) increasingly common at higher maturity.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Data environment<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Managed relational and NoSQL databases (cloud-native), object storage, and streaming systems (Kafka or cloud equivalents).<\/li>\n<li>Data pipelines and analytics platforms exist but are usually owned by Data Engineering; infrastructure engineering ensures shared foundations (networking, security, observability, cost governance).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Security environment<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Centralized identity and access management with least-privilege policies, role-based access controls, and periodic access reviews.<\/li>\n<li>Secrets management and encryption-by-default.<\/li>\n<li>Security posture management and vulnerability management integrated into pipelines (policy gates, scanning, patch SLAs).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Delivery model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Platform engineering model with self-service: reusable modules, templates, service catalog, paved roads.<\/li>\n<li>Clear ownership boundaries between Infrastructure Engineering, SRE, and Product Engineering:<\/li>\n<li>Infrastructure Engineering: platform foundations and shared infrastructure services<\/li>\n<li>SRE: reliability practices, production readiness, service ownership support (varies)<\/li>\n<li>Product Engineering: application-level ownership and service SLOs<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Agile or SDLC context<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Quarterly planning and OKR-based execution; sprint or kanban at team level.<\/li>\n<li>Change management discipline scaled to risk: automated guardrails 
for low-risk changes; explicit approvals for high-risk changes.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Scale or complexity context<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Typical scale for this role: tens to hundreds of services, multiple clusters, multi-region traffic, 24\/7 operations, enterprise customer expectations for uptime and security.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Team topology<\/h3>\n\n\n\n<p>A common topology under this role:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Cloud Platform \/ Runtime (Kubernetes, compute, base images)<\/li>\n<li>Network &amp; Edge (DNS, ingress, WAF, connectivity)<\/li>\n<li>Observability &amp; Incident Tooling (monitoring, logging, alerting, on-call tooling)<\/li>\n<li>Infrastructure Automation (IaC modules, GitOps tooling, developer self-service)<\/li>\n<li>(Optional) Database Platform or Reliability Enablement (if not owned elsewhere)<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">12) Stakeholders and Collaboration Map<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Internal stakeholders<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>CTO \/ VP Engineering (this role typically reports to one of them):<\/strong> Funding, strategy alignment, risk posture, org design.<\/li>\n<li><strong>Product Engineering leaders:<\/strong> Platform needs, delivery constraints, migration coordination, operational expectations.<\/li>\n<li><strong>SRE leadership (if separate):<\/strong> SLO frameworks, incident processes, error budgets, operational readiness.<\/li>\n<li><strong>CISO \/ Security Engineering:<\/strong> IAM, secrets, vulnerability management, compliance controls, incident response (security incidents).<\/li>\n<li><strong>Finance \/ FinOps:<\/strong> Budgeting, forecasting, unit economics, cost allocation, savings plans.<\/li>\n<li><strong>Customer Support \/ Customer Success:<\/strong> Incident communication, RCA sharing, customer-impact patterns.<\/li>\n<li><strong>Enterprise IT (if distinct):<\/strong> Identity systems, 
endpoint policies, network constraints, shared tooling.<\/li>\n<li><strong>Compliance \/ Risk \/ Internal Audit (context-specific):<\/strong> Evidence collection, control design, audit timelines.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">External stakeholders<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Cloud providers and strategic vendors:<\/strong> Support escalations, roadmap alignment, incident coordination, contract negotiation.<\/li>\n<li><strong>Audit partners (context-specific):<\/strong> SOC 2\/ISO audit processes and evidence requirements.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Peer roles<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Head of SRE (if separate)<\/li>\n<li>Head of Platform Engineering (in some orgs this is the same function; in others it is a peer)<\/li>\n<li>Head of Security Engineering<\/li>\n<li>Head of Engineering Productivity \/ Developer Experience (DX)<\/li>\n<li>Enterprise Architect \/ Chief Architect<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Upstream dependencies<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Product strategy and traffic forecasts<\/li>\n<li>Security policies and risk assessments<\/li>\n<li>Finance budget and procurement processes<\/li>\n<li>Architecture standards and technology choices<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Downstream consumers<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Product engineering teams consuming infrastructure services, pipelines, and environments<\/li>\n<li>SRE\/on-call teams consuming observability and incident tooling<\/li>\n<li>Support teams consuming incident updates and reliability narratives<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Nature of collaboration<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Co-ownership model:<\/strong> Many outcomes are shared (reliability, security posture). 
A clear RACI is essential.<\/li>\n<li><strong>Service provider model (internal):<\/strong> Infrastructure engineering offers platform services with defined SLAs and a defined support model.<\/li>\n<li><strong>Enablement model:<\/strong> Provide guardrails and automation so teams can move independently without lowering standards.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Decision-making authority and escalation points<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Infrastructure architecture and tooling decisions typically sit with the Head of Infrastructure Engineering, with consultation from Security and Architecture governance.<\/li>\n<li>Escalate to CTO\/VP Engineering for:\n<ul class=\"wp-block-list\">\n<li>Major spend or vendor commitments<\/li>\n<li>Multi-quarter migrations impacting product roadmaps<\/li>\n<li>Material risk acceptance decisions<\/li>\n<li>Organization-wide operating model changes<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">13) Decision Rights and Scope of Authority<\/h2>\n\n\n\n<p>Decision rights should be explicit to avoid bottlenecks and shadow infrastructure.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Can decide independently<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Infrastructure engineering internal priorities and sequencing within the approved roadmap.<\/li>\n<li>Standards for IaC modules, baseline configurations, naming\/tagging, operational readiness checklists.<\/li>\n<li>On-call processes and incident management rituals (severity definitions may be jointly agreed).<\/li>\n<li>Tool configuration and implementation choices within approved vendor\/tool categories.<\/li>\n<li>Hiring decisions for roles within allocated headcount (following HR process).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Requires team or peer approval (collaborative governance)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Changes to shared reliability targets (SLOs) and error budget policy (typically with SRE and product leaders).<\/li>\n<li>Changes to 
security-sensitive baselines (IAM model, secrets approach, network segmentation) with Security sign-off.<\/li>\n<li>Major architectural patterns impacting app teams (service mesh adoption, cluster multi-tenancy model) via architecture review forum.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Requires manager\/executive approval (VP Eng\/CTO\/CFO depending on scope)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Budget increases, major vendor contracts, multi-year cloud commitments (Reserved Instances\/Savings Plans\/committed use).<\/li>\n<li>Large migrations or platform changes that materially impact product delivery timelines.<\/li>\n<li>Risk acceptance decisions where reliability\/security posture is knowingly reduced.<\/li>\n<li>Org restructuring beyond the function (e.g., merging SRE and platform teams; changing on-call ownership model at org scale).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Budget, architecture, vendor, delivery, hiring, compliance authority<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Budget:<\/strong> Commonly owns tooling budget and influences cloud spend; may directly own cloud cost center in some orgs.<\/li>\n<li><strong>Architecture:<\/strong> Owns infrastructure reference architectures; co-governs with Chief Architect\/Architecture Council where present.<\/li>\n<li><strong>Vendors:<\/strong> Owns evaluation and negotiation for infrastructure tooling; partners with Procurement and Security for due diligence.<\/li>\n<li><strong>Delivery:<\/strong> Accountable for delivery of infrastructure roadmap; responsible for change governance and platform lifecycle.<\/li>\n<li><strong>Hiring:<\/strong> Owns staffing plan for infrastructure engineering; approves final hiring decisions within allocated headcount.<\/li>\n<li><strong>Compliance:<\/strong> Accountable for operational controls implementation and evidence readiness for infrastructure scope.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">14) Required 
Experience and Qualifications<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Typical years of experience<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>12\u201318+ years<\/strong> in infrastructure engineering, SRE, platform engineering, or adjacent systems engineering roles.<\/li>\n<li><strong>5\u201310+ years<\/strong> leading teams (including managing managers) is common for \u201cHead of\u201d scope.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Education expectations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Bachelor\u2019s degree in Computer Science, Engineering, or equivalent practical experience is common.<\/li>\n<li>Advanced degrees are optional; real-world operational leadership is usually more predictive.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Certifications (optional; context-dependent)<\/h3>\n\n\n\n<p>Certifications are not mandatory but can be useful signals, especially in regulated or enterprise-heavy environments:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Common\/Optional:<\/strong> AWS\/Azure\/GCP professional-level certifications<\/li>\n<li><strong>Optional:<\/strong> Kubernetes (CKA\/CKAD\/CKS)<\/li>\n<li><strong>Context-specific:<\/strong> ITIL (if ITSM-heavy), security certifications (CISSP) for security-aligned environments<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Prior role backgrounds commonly seen<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Infrastructure Engineering Manager \/ Director<\/li>\n<li>Site Reliability Engineering Manager \/ Director<\/li>\n<li>Platform Engineering Lead \/ Director<\/li>\n<li>Senior Systems Engineer \/ Staff SRE transitioning into leadership<\/li>\n<li>DevOps Manager (in orgs where \u201cDevOps\u201d owns platform and operations)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Domain knowledge expectations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Broad software\/IT domain applicability; deeper domain expectations vary:<\/li>\n<li>B2B SaaS: enterprise readiness, compliance, tenant isolation, 
predictable SLAs<\/li>\n<li>Consumer: traffic spikes, low-latency performance, global delivery, edge strategies<\/li>\n<li>Regulated: audit evidence, strict access controls, change management rigor<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Leadership experience expectations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Proven experience scaling an infrastructure\/platform function, including:<\/li>\n<li>Building a roadmap and delivering multi-quarter initiatives<\/li>\n<li>Managing reliability and incident programs<\/li>\n<li>Hiring and developing managers and senior engineers<\/li>\n<li>Owning or influencing significant budgets and vendor relationships<\/li>\n<li>Driving cross-functional alignment with Security and Product Engineering<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">15) Career Path and Progression<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Common feeder roles into this role<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Director of Infrastructure \/ Platform Engineering<\/li>\n<li>Director\/Manager of SRE<\/li>\n<li>Senior Manager of DevOps \/ Cloud Engineering<\/li>\n<li>Principal\/Staff SRE or Infrastructure Architect with demonstrated leadership and org impact<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Next likely roles after this role<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>VP Engineering (Platform\/Infrastructure)<\/strong> or <strong>VP of Engineering<\/strong> (broader scope)<\/li>\n<li><strong>CTO (in smaller organizations)<\/strong> where infrastructure leadership expands into overall technology strategy<\/li>\n<li><strong>Head of Engineering Operations<\/strong> (broader operational maturity across engineering)<\/li>\n<li><strong>Chief Architect \/ Head of Technology Strategy<\/strong> (for leaders with strong architecture focus)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Adjacent career paths<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Security leadership track: Head of Security 
Engineering (for leaders with deep security orientation)<\/li>\n<li>Reliability track: VP\/Head of SRE (if distinct)<\/li>\n<li>Developer experience\/productivity leadership: Head of Developer Platform \/ DX<\/li>\n<li>Cloud cost leadership: FinOps leadership (less common but plausible in cost-driven orgs)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Skills needed for promotion<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Demonstrated multi-org influence and ability to align strategy across Engineering, Security, and Finance.<\/li>\n<li>Strong portfolio of outcomes: measurable reliability gains, cost improvements, and increased delivery velocity.<\/li>\n<li>Mature executive presence: board-level risk framing (as applicable), clear investment narratives.<\/li>\n<li>Succession and scaling: ability to build leaders and delegate effectively.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">How this role evolves over time<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Early phase: stabilize reliability and reduce operational pain; standardize baseline tooling and practices.<\/li>\n<li>Growth phase: scale platform adoption through self-service and paved roads; reduce dependencies and manual work.<\/li>\n<li>Mature phase: optimize unit economics, embed governance and compliance, drive global resilience and continuous modernization.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">16) Risks, Challenges, and Failure Modes<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Common role challenges<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Ambiguous ownership boundaries:<\/strong> SRE vs platform vs product teams; leads to gaps or duplication.<\/li>\n<li><strong>Competing priorities:<\/strong> Reliability work competes with product delivery; infra investments are often underfunded until outages occur.<\/li>\n<li><strong>Legacy complexity:<\/strong> Accumulated tech debt (ad hoc scripts, snowflake environments, unowned services) increases incident 
risk.<\/li>\n<li><strong>Tool sprawl:<\/strong> Multiple overlapping observability tools, CI systems, IaC patterns creating operational overhead.<\/li>\n<li><strong>On-call burnout:<\/strong> Excessive paging and lack of toil reduction leads to attrition.<\/li>\n<li><strong>Security friction:<\/strong> Misalignment on controls vs speed if guardrails are not automated and standardized.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Bottlenecks<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Manual provisioning and approvals; lack of self-service.<\/li>\n<li>Lack of standard architectures leading to one-off solutions that are hard to operate.<\/li>\n<li>Insufficient capacity planning or unclear forecasts from product teams.<\/li>\n<li>Vendor constraints or cloud service limits not proactively managed.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Anti-patterns<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Hero culture:<\/strong> Reliance on a few experts for critical systems; weak documentation and single points of failure.<\/li>\n<li><strong>\u201cTicket factory\u201d platform team:<\/strong> Team becomes a bottleneck, doing manual tasks instead of enabling self-service.<\/li>\n<li><strong>Metrics without action:<\/strong> Dashboards exist but do not drive remediation or prioritization.<\/li>\n<li><strong>Reliability theater:<\/strong> Incident reviews occur but systemic issues remain unaddressed; repeat incidents persist.<\/li>\n<li><strong>Cost optimization by blunt cuts:<\/strong> Reducing spend without understanding reliability\/performance impact; leads to hidden cost elsewhere.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Common reasons for underperformance<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Insufficient technical depth to challenge designs or guide incident RCA effectively.<\/li>\n<li>Weak stakeholder management; inability to secure resources or alignment.<\/li>\n<li>Over-indexing on tools instead of operating model and 
automation discipline.<\/li>\n<li>Inconsistent execution management: too many initiatives, unclear ownership, missed dependencies.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Business risks if this role is ineffective<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Increased downtime, degraded performance, and customer churn.<\/li>\n<li>Security incidents or audit failures due to weak controls and operational discipline.<\/li>\n<li>Uncontrolled cloud spend and poor cost predictability.<\/li>\n<li>Slowed product delivery due to unstable environments and brittle platform.<\/li>\n<li>Talent loss due to burnout and chaos, compounding operational risk.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">17) Role Variants<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">By company size<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Startup (Series A\u2013B):<\/strong> <\/li>\n<li>More hands-on; may also own SRE and DevOps directly.  <\/li>\n<li>Focus on foundational automation, basic observability, and early reliability practices.  <\/li>\n<li>Fewer layers; may report directly to CTO.<\/li>\n<li><strong>Mid-size (Series C\u2013pre-IPO):<\/strong> <\/li>\n<li>Strong focus on scaling, formalizing SLOs, DR, and cost governance.  <\/li>\n<li>Usually manages managers; multiple platform sub-teams emerge.  <\/li>\n<li>Heavy cross-functional work with Security and Finance.<\/li>\n<li><strong>Enterprise \/ Public company:<\/strong> <\/li>\n<li>Higher governance burden: audit evidence, change controls, multi-region resilience, vendor risk management.  <\/li>\n<li>Clear separation between SRE, platform, and IT; more formal architecture governance.  
<\/li>\n<li>Larger budgets and procurement complexity.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">By industry<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Fintech\/healthcare (regulated):<\/strong> Controls, audit readiness, encryption, access governance, and DR testing are more stringent and frequent.<\/li>\n<li><strong>Consumer internet\/media:<\/strong> Emphasis on latency, global traffic, edge delivery, and traffic surge readiness.<\/li>\n<li><strong>B2B SaaS:<\/strong> Emphasis on tenant isolation, enterprise security requirements, predictable SLAs, and migration safety.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">By geography<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Global teams require stronger async operating model, handoffs, and follow-the-sun on-call design.<\/li>\n<li>Data residency requirements may influence region strategy and access controls (context-specific).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Product-led vs service-led company<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Product-led:<\/strong> Platform is a leverage engine; focus on self-service, paved roads, developer experience metrics.<\/li>\n<li><strong>Service-led\/IT org:<\/strong> More emphasis on ITSM, change control, standard operating procedures, and internal SLAs.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Startup vs enterprise<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Startup: fewer tools, faster iteration; risk is under-investing in controls and DR.<\/li>\n<li>Enterprise: more governance and stakeholders; risk is bureaucracy and slow delivery without automation.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Regulated vs non-regulated environment<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Regulated: strong evidence collection, segregation of duties, formal access reviews, DR testing schedules.<\/li>\n<li>Non-regulated: more flexibility; still needs discipline to meet customer trust expectations and scale 
safely.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">18) AI \/ Automation Impact on the Role<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Tasks that can be automated (increasingly)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Alert enrichment and correlation:<\/strong> AI-assisted clustering of related alerts and surfacing likely root causes.<\/li>\n<li><strong>Incident summarization:<\/strong> Auto-generated timelines, impacted systems, and customer-facing summaries (with human review).<\/li>\n<li><strong>Runbook execution:<\/strong> Automated remediation for common failures (self-healing scripts, scaling actions, certificate renewals).<\/li>\n<li><strong>IaC generation and validation:<\/strong> AI-assisted module scaffolding, drift detection explanations, policy compliance suggestions.<\/li>\n<li><strong>Cost anomaly detection:<\/strong> Automated detection and explanation of cost spikes; recommendation of right-sizing actions.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Tasks that remain human-critical<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Accountable decision-making under risk:<\/strong> Choosing trade-offs (e.g., failover vs partial degradation, cost vs reliability).<\/li>\n<li><strong>Architecture judgment:<\/strong> Designing systems that fit the business context; avoiding over-engineering.<\/li>\n<li><strong>Cross-functional alignment:<\/strong> Negotiating ownership, budgets, priorities, and timelines.<\/li>\n<li><strong>Culture and leadership:<\/strong> Coaching, performance management, building trust during incidents.<\/li>\n<li><strong>Security risk acceptance:<\/strong> Evaluating threat models and approving exceptions with appropriate controls.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">How AI changes the role over the next 2\u20135 years<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The leader will be expected to <strong>operationalize AIOps responsibly<\/strong>: governance for AI-generated actions, 
audit trails, rollback and safety constraints.<\/li>\n<li>Observability and incident response will shift from manual triage toward <strong>AI-augmented diagnosis<\/strong>, increasing expectations for shorter MTTD\/MTTR.<\/li>\n<li>Platform teams will increasingly measure and optimize <strong>developer experience<\/strong> with more granular telemetry (time-to-first-deploy, friction points) and AI-assisted documentation\/support.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">New expectations caused by AI, automation, or platform shifts<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Stronger focus on <strong>policy-as-code and automated guardrails<\/strong> so teams can move quickly without manual approvals.<\/li>\n<li>Greater emphasis on <strong>data quality for operations<\/strong> (clean service catalogs, accurate ownership metadata, consistent logging\/metrics schemas).<\/li>\n<li>Increased need for <strong>tooling rationalization<\/strong>: avoid overlapping AI features across vendors; maintain clarity of source of truth.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">19) Hiring Evaluation Criteria<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">What to assess in interviews<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Infrastructure architecture depth:<\/strong> Can the candidate design scalable, secure, resilient platforms and critique trade-offs?<\/li>\n<li><strong>Operational maturity:<\/strong> Evidence of incident program leadership, SLO adoption, and post-incident remediation discipline.<\/li>\n<li><strong>Platform enablement mindset:<\/strong> Ability to build self-service capabilities and reduce friction for product teams.<\/li>\n<li><strong>Leadership capability:<\/strong> Managing managers, building org structure, hiring, performance management, and culture shaping.<\/li>\n<li><strong>Cross-functional influence:<\/strong> Ability to align with Security, Finance, and Product Engineering; track record of getting hard things 
done.<\/li>\n<li><strong>Cost and capacity discipline:<\/strong> Understanding of cloud economics, forecasting, and unit metrics.<\/li>\n<li><strong>Communication:<\/strong> Executive-level clarity in risk reporting, roadmap narratives, and incident comms.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Practical exercises or case studies (recommended)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Case study 1: Reliability and operating model design (60\u201390 minutes)<\/strong><br\/>\n  Provide: incident history summary, org chart, platform architecture sketch.<br\/>\n  Ask: propose changes to operating model, top 5 remediation initiatives, and metrics.<br\/>\n  Evaluate: prioritization, realism, stakeholder alignment, and measurable outcomes.<\/li>\n<li><strong>Case study 2: Cloud cost + scaling scenario (45\u201360 minutes)<\/strong><br\/>\n  Provide: cost breakdown, usage growth forecast, performance constraints.<br\/>\n  Ask: propose cost optimization plan without harming reliability, including governance mechanisms.<br\/>\n  Evaluate: unit economics thinking, technical options, and risk management.<\/li>\n<li><strong>Case study 3: DR and resilience plan review (45 minutes)<\/strong><br\/>\n  Provide: RTO\/RPO requirements and a simplified dependency map.<br\/>\n  Ask: propose DR strategy, test plan, and readiness reporting.<br\/>\n  Evaluate: pragmatism, completeness, and operational realism.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Strong candidate signals<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Demonstrated outcomes: reduced incident rates\/MTTR, improved SLO compliance, major platform modernization delivered.<\/li>\n<li>Clear approach to \u201cplatform as product\u201d: service catalogs, paved roads, adoption metrics, stakeholder feedback loops.<\/li>\n<li>Mature incident leadership: calm, structured, metrics-driven; champions blameless learning with accountability for fixes.<\/li>\n<li>Good judgment on tooling: rationalizes 
rather than accumulates tools; prioritizes automation and standards.<\/li>\n<li>Has built leaders: can describe how they developed managers and created sustainable teams.<\/li>\n<li>Can speak credibly about security and compliance without turning it into bureaucracy.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Weak candidate signals<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Relies on buzzwords; cannot explain trade-offs at the level of cloud primitives and failure modes.<\/li>\n<li>Treats infrastructure as a ticket-taking function rather than an enablement platform.<\/li>\n<li>Over-focus on tools rather than operating model, automation, and standards.<\/li>\n<li>No evidence of closing the loop after incidents (repeat failures accepted as normal).<\/li>\n<li>Blames other teams for reliability issues without proposing shared-accountability mechanisms.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Red flags<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Downplays on-call health and sustainability; accepts burnout as normal.<\/li>\n<li>Avoids ownership during incidents or cannot articulate incident command practices.<\/li>\n<li>Cannot discuss cost governance or has a history of uncontrolled spend.<\/li>\n<li>Poor security posture awareness (e.g., overly broad IAM, weak secrets practices) or dismisses compliance needs.<\/li>\n<li>Creates brittle single points of failure (people or systems) through centralized control without self-service.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Scorecard dimensions (sample)<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Dimension<\/th>\n<th>What \u201cmeets bar\u201d looks like<\/th>\n<th style=\"text-align: right;\">Weight<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Infrastructure architecture &amp; cloud depth<\/td>\n<td>Strong design judgment; can reason about failure modes and trade-offs<\/td>\n<td style=\"text-align: 
right;\">15%<\/td>\n<\/tr>\n<tr>\n<td>Kubernetes\/platform engineering (if applicable)<\/td>\n<td>Understands lifecycle, upgrades, multi-tenancy, operational patterns<\/td>\n<td style=\"text-align: right;\">10%<\/td>\n<\/tr>\n<tr>\n<td>Operational excellence &amp; incident leadership<\/td>\n<td>Mature incident program, PIR discipline, SLO usage<\/td>\n<td style=\"text-align: right;\">15%<\/td>\n<\/tr>\n<tr>\n<td>Observability &amp; reliability engineering<\/td>\n<td>Can define actionable telemetry strategy and error budgets<\/td>\n<td style=\"text-align: right;\">10%<\/td>\n<\/tr>\n<tr>\n<td>Security &amp; compliance alignment<\/td>\n<td>Implements guardrails and controls pragmatically<\/td>\n<td style=\"text-align: right;\">10%<\/td>\n<\/tr>\n<tr>\n<td>FinOps \/ cost &amp; capacity management<\/td>\n<td>Uses unit metrics; can forecast and optimize<\/td>\n<td style=\"text-align: right;\">10%<\/td>\n<\/tr>\n<tr>\n<td>Leadership (managing managers)<\/td>\n<td>Org design, coaching, performance management, hiring strategy<\/td>\n<td style=\"text-align: right;\">15%<\/td>\n<\/tr>\n<tr>\n<td>Cross-functional influence &amp; communication<\/td>\n<td>Executive-ready narratives; alignment and negotiation strength<\/td>\n<td style=\"text-align: right;\">15%<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">20) Final Role Scorecard Summary<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Category<\/th>\n<th>Summary<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Role title<\/td>\n<td>Head of Infrastructure Engineering<\/td>\n<\/tr>\n<tr>\n<td>Role purpose<\/td>\n<td>Lead the infrastructure engineering function to deliver secure, scalable, reliable, and cost-efficient platforms that enable product teams to ship quickly and safely while meeting customer expectations and compliance needs.<\/td>\n<\/tr>\n<tr>\n<td>Top 10 responsibilities<\/td>\n<td>1) Infrastructure strategy\/roadmap 2) Platform operating model 3) 
Reliability\/incident leadership 4) IaC and automation standards 5) Observability and alerting quality 6) Capacity\/performance planning 7) DR and resilience testing 8) Security guardrails and IAM\/secrets alignment 9) FinOps and cost\/unit economics governance 10) Org leadership: hiring, coaching, budgeting<\/td>\n<\/tr>\n<tr>\n<td>Top 10 technical skills<\/td>\n<td>1) Cloud architecture 2) Linux\/networking fundamentals 3) Terraform\/IaC 4) Kubernetes\/platform lifecycle 5) Observability (metrics\/logs\/traces) 6) Incident management discipline 7) IAM\/secrets\/encryption fundamentals 8) CI\/CD enablement 9) Capacity and performance engineering 10) FinOps\/cloud economics<\/td>\n<\/tr>\n<tr>\n<td>Top 10 soft skills<\/td>\n<td>1) Systems thinking 2) Prioritization under constraints 3) Executive communication 4) Incident leadership under pressure 5) Cross-functional influence 6) Coaching and talent development 7) Decision-making with incomplete info 8) Conflict management\/boundary setting 9) Customer-centric enablement mindset 10) Accountability and follow-through<\/td>\n<\/tr>\n<tr>\n<td>Top tools or platforms<\/td>\n<td>AWS (or Azure\/GCP), Kubernetes, Terraform, GitHub\/GitLab, Argo CD\/Flux, Datadog\/Prometheus+Grafana, PagerDuty\/Opsgenie, Vault\/Secrets Manager, Jira\/Confluence, Okta\/Entra ID<\/td>\n<\/tr>\n<tr>\n<td>Top KPIs<\/td>\n<td>Availability\/SLO compliance, MTTR\/MTTD, Sev-1 incident rate, change failure rate, provisioning cycle time, infra cost vs budget, unit cost trend, alert quality, DR test success (RTO\/RPO), internal engineering satisfaction<\/td>\n<\/tr>\n<tr>\n<td>Main deliverables<\/td>\n<td>Infrastructure roadmap, reference architectures, IaC module catalog, observability standards, incident response program artifacts, DR plans and test reports, cost dashboards and forecasting, governance policies and audit evidence, service catalogs and platform docs<\/td>\n<\/tr>\n<tr>\n<td>Main goals<\/td>\n<td>Stabilize reliability and operations; 
scale platform self-service; improve cost predictability and unit economics; embed security\/compliance guardrails; build a strong, sustainable infrastructure engineering organization<\/td>\n<\/tr>\n<tr>\n<td>Career progression options<\/td>\n<td>VP Engineering (Platform\/Infrastructure), VP of Engineering, CTO (smaller orgs), Head\/VP of SRE, Head of Developer Platform\/DX, Chief Architect\/Technology Strategy (context-dependent)<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n","protected":false},"excerpt":{"rendered":"<p>The Head of Infrastructure Engineering is accountable for designing, building, and operating the company\u2019s infrastructure platforms and core reliability capabilities that enable product engineering teams to ship software safely, quickly, and cost-effectively. This role leads the infrastructure engineering organization (often including cloud infrastructure, Kubernetes\/platform engineering, networking, observability, and incident management) and ensures infrastructure is scalable, secure, and operationally 
mature.<\/p>\n","protected":false},"author":61,"featured_media":0,"comment_status":"open","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"_joinchat":[],"footnotes":""},"categories":[24486,24483],"tags":[],"class_list":["post-74775","post","type-post","status-publish","format-standard","hentry","category-engineering-leadership","category-leadership"],"_links":{"self":[{"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/74775","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/users\/61"}],"replies":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/comments?post=74775"}],"version-history":[{"count":0,"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/posts\/74775\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/media?parent=74775"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/categories?post=74775"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.devopsschool.com\/blog\/wp-json\/wp\/v2\/tags?post=74775"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}