Category
Compute
1. Introduction
What this service is
Container-Optimized OS is a Google-managed operating system image for Google Cloud Compute Engine virtual machines (VMs) that is specifically designed to run containers securely and efficiently.
Simple explanation (one paragraph)
If you want to run containers on a VM in Google Cloud without managing a general-purpose Linux distribution (packages, frequent configuration drift, large attack surface), Container-Optimized OS gives you a minimal OS that boots fast, stays locked down, and is tuned for container workloads.
Technical explanation (one paragraph)
Container-Optimized OS (often abbreviated as COS) is a hardened, minimal OS image maintained by Google, based on Chromium OS concepts (immutable / read-only root filesystem, verified boot design patterns, and automatic updates). It’s intended to be used as the host OS for container runtimes (commonly containerd, and in some contexts Docker compatibility—verify current runtime options in the official docs). COS integrates naturally with Compute Engine features (instance metadata, Managed Instance Groups, load balancing, service accounts, VPC networking) and is also a common node OS choice for Google Kubernetes Engine (GKE) node images (for example, COS variants used with containerd).
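To see which COS images Google currently publishes, you can query the public cos-cloud image project. This is a CLI sketch that requires an authenticated gcloud session, and the families shown (for example cos-stable, cos-beta, cos-dev) will vary over time:

```shell
# List Container-Optimized OS images published in Google's public
# cos-cloud image project. Release-channel families such as
# cos-stable track different update cadences (verify current families).
gcloud compute images list \
  --project=cos-cloud \
  --no-standard-images
```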
What problem it solves
It solves the “VM host management tax” for container hosting: OS patching risk, drift across fleets, oversized base images, inconsistent security baselines, and operational toil when you only need a stable host to run containers.
2. What is Container-Optimized OS?
Official purpose
Container-Optimized OS is designed by Google to provide a secure, efficient, and maintainable host environment for running containers on Compute Engine.
Core capabilities
– Run containerized workloads on Compute Engine VMs with a minimal host OS footprint.
– Reduce host attack surface compared to a general-purpose Linux OS.
– Provide automated updates and a consistent base image across fleets.
– Support container-focused deployment patterns (for example, “run a container as the VM workload” via Compute Engine’s container-on-VM workflows).
Major components (conceptual)
– Minimal OS userland: fewer packages/tools than a general-purpose distro.
– Hardened/immutable design: read-only root filesystem patterns help reduce drift and persistence of unwanted changes.
– Container runtime support: commonly containerd (and sometimes Docker-related tooling depending on the image family and use case—verify current details).
– Update system: designed for automated, reliable OS updates.
– Compute Engine integration points: instance metadata, startup configuration patterns, logging/monitoring integration paths, and compatibility with fleet constructs like Managed Instance Groups (MIGs).
Service type
Container-Optimized OS is an operating system image provided by Google Cloud for Compute Engine. It is not a separate hosted “service” with its own control plane; you select it as the boot disk image for VMs (or implicitly via workflows that create COS-based instances).
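Selecting COS as the boot disk image is an ordinary instance-creation choice. A minimal sketch (the VM name, zone, and machine type below are placeholders):

```shell
# Create a plain Compute Engine VM that boots from the latest image
# in the cos-stable family. No container is configured yet; this only
# selects Container-Optimized OS as the host OS.
gcloud compute instances create my-cos-vm \
  --zone=us-central1-a \
  --machine-type=e2-small \
  --image-family=cos-stable \
  --image-project=cos-cloud
```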
Scope (how it’s “scoped” in Google Cloud)
– Image availability: COS images are published by Google and are accessible within projects when you create Compute Engine instances (subject to permissions).
– Compute Engine resources: VMs are zonal resources; Managed Instance Groups can be zonal or regional; load balancers are global or regional depending on type.
– Operational scope: you manage COS usage per project/VPC/instance template just like other Compute Engine images.
How it fits into the Google Cloud ecosystem
– Compute Engine: primary place you use COS—single instances, MIGs, container-on-VM patterns.
– GKE: COS is widely used as a node OS option (GKE manages nodes; you choose node image type).
– Artifact Registry: store container images securely and pull from COS-hosted runtimes.
– Cloud Logging/Monitoring: standard observability stack for VM and workload telemetry (implementation details depend on your chosen agents/approach; verify COS support for specific agents).
– VPC + Cloud Load Balancing + Cloud Armor: front and secure COS-based workloads.
– IAM + Service Accounts: authorize workloads to call Google APIs without embedding credentials.
Service name status
As of the latest generally available Google Cloud documentation, the product is still called Container-Optimized OS. (If you are using it via GKE node images, you may see COS variants referenced by image type names; verify the current image type labels in GKE docs.)
Official docs entry point: https://cloud.google.com/container-optimized-os/docs
3. Why use Container-Optimized OS?
Business reasons
- Lower operational overhead: fewer OS-level tickets (patching cadence, baseline hardening, drift remediation) when your real product is the container workload.
- Standardization: a consistent host OS across dev/test/prod and across teams reduces “snowflake VM” risk.
- Faster time to production: fewer decisions about OS packages and configuration; focus on image build + deployment.
Technical reasons
- Optimized for containers: COS is built for the “container is the unit of deployment” model.
- Reduced footprint: smaller OS surface area than a typical general-purpose distro.
- Immutability patterns: a read-only root filesystem approach discourages ad-hoc changes on the host.
Operational reasons
- Fleet-friendly: works well with instance templates and Managed Instance Groups; replace instances rather than repair them.
- Predictable updates: designed to be updated regularly in a controlled way (pin image versions when necessary, or use channels—verify exact mechanics in docs).
- Faster boot and simpler host: in many environments, COS boots quickly and has fewer moving parts.
Security / compliance reasons
- Smaller attack surface: fewer packages and services.
- Hardening patterns: immutable root filesystem design, strong defaults, and automatic updates reduce exposure windows.
- Better separation of concerns: app dependencies go into container images rather than the host OS.
Scalability / performance reasons
- Works well with MIG autoscaling: you can scale out stateless container workloads by adding instances.
- Container-centric resource usage: host overhead is typically smaller than full-featured distros (workload-dependent).
When teams should choose it
Choose Container-Optimized OS when:
– You primarily run one or more containers as the VM workload.
– You want standardized, hardened hosts with minimal customization.
– You plan to use MIGs for elasticity and immutable infrastructure practices.
– You want a stepping stone between “serverless” and “full Kubernetes”:
  – more control than Cloud Run
  – less platform complexity than managing Kubernetes for small deployments
When teams should not choose it
Avoid or reconsider COS when:
– You need extensive OS customization, third-party agents that require package managers, or kernel/module tinkering.
– You rely on interactive debugging with many common Linux tools installed by default.
– Your workload expects a general-purpose VM environment (custom services, cron-heavy hosts, configuration management tools that assume writable root).
– You want a managed container platform (consider Cloud Run or GKE Autopilot).
4. Where is Container-Optimized OS used?
Industries
- SaaS and web: standardized container fleets behind load balancers.
- Fintech and regulated industries: hardened baseline + controlled patching (always validate compliance needs against official attestations; COS itself isn’t automatically a compliance certification).
- Gaming and media: burstable stateless services or edge-like services on VM fleets.
- Data platforms: containerized sidecars, lightweight services, ingestion endpoints (not the place to run full data stacks unless designed carefully).
Team types
- Platform engineering teams building VM-based container platforms.
- DevOps/SRE teams maintaining fleets of stateless services.
- Security teams standardizing hardened VM images.
- App teams that want containers on VMs without adopting Kubernetes immediately.
Workloads
- HTTP APIs and web front ends (Nginx, Envoy, app services).
- Background workers / job processors (pull from Pub/Sub, process tasks).
- Proxies, gateways, and lightweight network appliances packaged as containers.
- Build runners or CI agents packaged in containers (be careful with privilege needs).
- Internal tools that don’t justify Kubernetes overhead.
Architectures
- Single VM running a container with a public IP (small dev/test).
- MIG of COS instances pulling images from Artifact Registry, fronted by Cloud Load Balancing.
- Blue/green or canary using multiple MIGs or rolling updates of instance templates.
- Hybrid patterns: COS VMs for specific components; GKE for the rest.
Production vs dev/test usage
- Dev/test: quick, low-maintenance way to run containers on VMs; useful for validation and demos.
- Production: common when you want VM-level control (custom networking, instance types, GPUs, specialized disks) but still want container immutability and a hardened host.
5. Top Use Cases and Scenarios
Below are realistic scenarios where Container-Optimized OS on Google Cloud Compute Engine is a strong fit.
1) Single-container web service on a VM (simple hosting)
- Problem: You need to host a small web service quickly, but don’t want to maintain Ubuntu patching and packages.
- Why COS fits: Minimal host; run your container as the primary workload.
- Example: A small internal dashboard served by nginx plus a backend container on one VM for a dev environment.
2) Stateless API fleet with Managed Instance Groups
- Problem: Scale an API horizontally with predictable, repeatable hosts.
- Why COS fits: Great with instance templates + MIG; immutable rollout by replacing instances.
- Example: A regional MIG of COS instances runs my-api:1.2.3, autoscaled by CPU, fronted by an external HTTP(S) load balancer.
3) Edge proxy / gateway layer
- Problem: You need high-performance L7 proxying with strong OS hardening.
- Why COS fits: Minimal OS + containerized proxy simplifies patching and upgrades.
- Example: Envoy containers in a MIG terminate mTLS and route traffic to internal services.
4) Batch/worker nodes pulling tasks from Pub/Sub
- Problem: Worker processes must scale up and down quickly and remain consistent.
- Why COS fits: Fast to boot and easy to “replace instead of fix.”
- Example: A MIG of worker VMs runs a container that pulls jobs from Pub/Sub and writes results to Cloud Storage.
5) Secure “jump workload” containers (not jump hosts)
- Problem: You need controlled administrative tools without turning a VM into a long-lived snowflake.
- Why COS fits: Host stays minimal; tools live in container images; access is audited via IAM and OS Login/IAP.
- Example: Run a locked-down container image containing database admin CLI tools and short-lived credentials.
6) CI/CD self-hosted runners packaged as containers
- Problem: You need runners that can be replaced easily and remain clean after jobs.
- Why COS fits: Immutable host; runners in containers; replace on compromise.
- Example: GitHub Actions runners or GitLab runners in a MIG where each instance is recycled frequently.
7) Dedicated network function appliances (containerized)
- Problem: You need custom routing, NAT helpers, or observability sidecars in a controlled environment.
- Why COS fits: Predictable baseline and fewer host services.
- Example: A containerized forward proxy or DNS caching tier.
8) Pre-GKE stepping stone for teams adopting containers
- Problem: Team wants containers but isn’t ready for Kubernetes complexity.
- Why COS fits: Container workflow with VM primitives (firewall, load balancer) is simpler than Kubernetes.
- Example: Two services deployed as two MIGs; rollouts via instance template version changes.
9) Multi-tenant internal services with strict baseline controls
- Problem: Multiple teams run services on shared platform; need consistent OS baseline.
- Why COS fits: Reduced drift; centralized image selection; strong defaults.
- Example: Platform team provides an opinionated COS instance template and teams provide only container image + config.
10) Specialized Compute Engine shapes (high-memory, local SSD, etc.)
- Problem: Your workload needs VM-specific features but you still want containers.
- Why COS fits: You get Compute Engine flexibility with containerized apps.
- Example: A high-memory VM runs a containerized in-memory service with persistent disks for snapshots.
11) Blue/green rollouts using instance template versions
- Problem: You need controlled rollouts with easy rollback.
- Why COS fits: New template references new container image digest; rollback is simply switching MIG template.
- Example: Two MIGs (blue and green) behind a load balancer; shift traffic gradually.
12) Hardened internal developer preview environments
- Problem: You want short-lived preview environments without long-term host maintenance.
- Why COS fits: Easy to create and delete; predictable baseline.
- Example: Per-branch preview service runs in a COS VM for a few hours, then deleted.
6. Core Features
Note: Some implementation details (exact runtime, channels, update controls, logging agent support) can change over time. Where appropriate, this section calls out what to verify in official docs.
6.1 Minimal, container-focused OS image
- What it does: Provides a slim host OS designed primarily to run containers.
- Why it matters: Fewer packages and services typically reduce attack surface and patching scope.
- Practical benefit: Less OS maintenance; smaller baseline to secure.
- Limitations/caveats: Not suited for workloads that assume a full Linux distro with package manager-based customization.
6.2 Read-only / immutable root filesystem patterns
- What it does: Uses an immutable-style root filesystem (read-only root) design approach.
- Why it matters: Reduces configuration drift and persistence of unauthorized host changes.
- Practical benefit: Encourages immutable infrastructure (replace rather than patch-in-place).
- Limitations/caveats: Installing host packages or modifying system files is intentionally constrained; you must plan for debugging and customization differently.
6.3 Automatic updates (designed for consistent patching)
- What it does: COS is designed to receive updates from Google to address security and stability issues.
- Why it matters: Shortens exposure window to vulnerabilities and reduces manual patch operations.
- Practical benefit: Better baseline hygiene across fleets.
- Limitations/caveats: Updates can require reboots; for production, use MIG rolling updates and capacity planning. Verify current controls for update strategy in official docs.
6.4 Support for container runtimes and OCI images
- What it does: Runs standard container images (OCI/Docker image format).
- Why it matters: Your build pipeline stays standard (Cloud Build, GitHub Actions, etc.).
- Practical benefit: Build once, run anywhere containers; pull from Artifact Registry.
- Limitations/caveats: Runtime tooling differs across images and use cases (for example, containerd vs Docker). Verify the current recommended runtime and tooling in COS docs.
6.5 Tight integration with Compute Engine primitives
- What it does: COS is used like other Compute Engine images and works with:
- instance templates and MIGs
- VPC networks and firewall rules
- load balancing
- service accounts
- metadata and startup configuration patterns
- Why it matters: Lets you build production architectures with standard Google Cloud Compute building blocks.
- Practical benefit: Consistent operations with the rest of Compute Engine.
- Limitations/caveats: Some “traditional VM administration” approaches (configuration management writing to root) are not a great fit.
6.6 “Run a container as the VM workload” workflows
- What it does: Compute Engine supports deploying a container to a VM in a way that starts the container on boot (commonly done with gcloud compute instances create-with-container and/or container declarations in instance metadata).
- Why it matters: You can treat the VM as a container host appliance.
- Practical benefit: Very fast path to “container on VM” without building a custom image.
- Limitations/caveats: This is not Kubernetes. Health checks, rollouts, and multi-container orchestration are more manual unless you build them (or use MIG patterns). Confirm current container declaration capabilities in the Compute Engine containers documentation.
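A minimal sketch of this workflow, assuming a private image in Artifact Registry (the VM name, image path, and environment variable below are placeholders):

```shell
# Create a COS-based VM whose primary workload is a container that
# starts on boot. --container-env and --container-restart-policy are
# optional refinements of the basic create-with-container flow.
gcloud compute instances create-with-container web-1 \
  --zone=us-central1-a \
  --container-image=us-docker.pkg.dev/PROJECT_ID/my-repo/my-app:1.0.0 \
  --container-env=APP_ENV=production \
  --container-restart-policy=always
```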
6.7 Strong compatibility with immutable/fleet operations (MIG)
- What it does: Encourages immutable operations: update instance templates, roll instances.
- Why it matters: Predictable deployments; simpler rollback; better reliability than repairing pets.
- Practical benefit: Easier to standardize across teams.
- Limitations/caveats: Stateful workloads require extra design (persistent disks, careful draining, database patterns).
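The “update the template, roll the instances” motion can be sketched with a MIG rolling update (the MIG and template names are placeholders; verify current rollout options such as surge and unavailable limits in the MIG documentation):

```shell
# Start a rolling update that replaces instances in a regional MIG
# with instances built from a new template (which typically references
# a new container image version).
gcloud compute instance-groups managed rolling-action start-update my-mig \
  --region=us-central1 \
  --version=template=my-template-v2 \
  --max-unavailable=1
```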
6.8 Works well with Artifact Registry + private images
- What it does: Pull images securely from Artifact Registry with IAM-controlled access.
- Why it matters: Avoid unauthenticated public pulls; control provenance.
- Practical benefit: Enterprise-ready image governance.
- Limitations/caveats: You must ensure the VM’s service account has the right Artifact Registry permissions and that egress/firewall allows registry access.
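A sketch of the IAM side, assuming the VM runs as a dedicated service account (repository, location, and account names are placeholders):

```shell
# Allow the VM's service account to pull images from a specific
# Artifact Registry repository.
gcloud artifacts repositories add-iam-policy-binding my-repo \
  --location=us-central1 \
  --member="serviceAccount:vm-sa@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/artifactregistry.reader"
```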
6.9 Designed for secure boot patterns (verify features used)
- What it does: COS is designed with verified boot concepts (Chromium OS heritage).
- Why it matters: Integrity of the host OS is central to container security.
- Practical benefit: Better baseline trust.
- Limitations/caveats: Compute Engine also has Shielded VM features; confirm compatibility and best practices for COS + Shielded VM in official docs.
7. Architecture and How It Works
High-level architecture
At its simplest, Container-Optimized OS is:
– A Compute Engine VM
– Booting from a COS image
– Running one or more containers as the workload
– Connected to a VPC network
– Observed via Cloud Logging/Monitoring and governed via IAM
Control flow vs data flow
- Control plane (Google Cloud):
- You define a VM or instance template referencing a COS image family/version.
- You optionally provide container configuration (image, env vars, restart policy) via metadata or “create-with-container”.
- IAM decides who can create/modify instances, firewall rules, service accounts, and who can SSH (for example via OS Login).
- Data plane (your workload):
- Traffic hits a VM external IP or a load balancer.
- The container receives traffic on its exposed port.
- The container calls other services (databases, Pub/Sub, Storage) using service account credentials.
Integrations with related services
Common and practical integrations include:
– Artifact Registry for private container images.
– Cloud Load Balancing for global/regional front ends.
– Managed Instance Groups for scaling and rolling updates.
– Cloud DNS for naming.
– Secret Manager (recommended) for secrets retrieved at runtime by the app, rather than stored in instance metadata.
– Cloud Logging and Cloud Monitoring for logs/metrics (agent approach varies; verify the recommended agent approach for COS in official docs).
– Cloud Armor to protect HTTP(S) services from common attacks when using HTTP(S) load balancing.
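For the Secret Manager pattern, the app (or an entrypoint script) fetches secrets at runtime instead of reading them from instance metadata. A CLI sketch, assuming the secret name is a placeholder and the VM’s service account has roles/secretmanager.secretAccessor:

```shell
# Fetch the latest version of a secret at container startup instead
# of baking it into instance metadata or the image.
DB_PASSWORD="$(gcloud secrets versions access latest --secret=db-password)"
export DB_PASSWORD
```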
Dependency services
- Compute Engine API is required.
- If using private images: Artifact Registry API and IAM bindings.
- If using load balancing: additional networking and load balancing APIs/resources.
Security/authentication model
- Google Cloud IAM: controls who can create/modify instances and associated resources.
- Service accounts: attached to VMs to grant workload access to Google APIs.
- OS Login / IAM-based SSH (recommended): use IAM to control SSH access and log it.
- Firewall rules: enforce network exposure at VPC level.
- Container image security: depends on your build pipeline, scanning, and provenance controls.
Networking model
- VMs attach to a VPC network/subnet.
- Ingress is controlled by firewall rules and (optionally) load balancers.
- Egress follows VPC routing/NAT; consider Cloud NAT if you want private instances without external IPs.
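If you drop external IPs, outbound access (for example, pulling public container images) needs Cloud NAT. A minimal sketch (router and NAT names and the region are placeholders):

```shell
# Create a Cloud Router and a Cloud NAT gateway so private COS VMs
# (no external IPs) can still make outbound internet connections.
gcloud compute routers create cos-router \
  --network=default \
  --region=us-central1

gcloud compute routers nats create cos-nat \
  --router=cos-router \
  --region=us-central1 \
  --auto-allocate-nat-external-ips \
  --nat-all-subnet-ip-ranges
```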
Monitoring/logging/governance considerations
- Decide how you will:
- collect host and container logs
- collect metrics and traces
- patch/roll instances safely (MIG rolling update)
- tag and label resources for cost allocation
- The best practice is to treat COS instances as replaceable and to externalize state.
Simple architecture diagram (Mermaid)
flowchart LR
User((User)) -->|HTTP| FW[Firewall rule]
FW --> VM[COS VM<br/>Container-Optimized OS]
VM --> C[Container<br/>Web App]
C --> GCP[(Google APIs<br/>via Service Account)]
Production-style architecture diagram (Mermaid)
flowchart TB
subgraph Internet[Internet]
U((Users))
end
subgraph GCP[Google Cloud Project]
direction TB
LB["External HTTP(S) Load Balancer"]
ARMOR[Cloud Armor Policy]
DNS[Cloud DNS]
subgraph VPC[VPC Network]
direction TB
MIG[Regional Managed Instance Group<br/>COS instances]
HC[Health Checks]
FW2[Firewall Rules]
NAT["Cloud NAT (optional)"]
end
AR[Artifact Registry<br/>Private Images]
SM[Secret Manager]
LOG[Cloud Logging]
MON[Cloud Monitoring]
IAM[IAM + Service Accounts]
end
U -->|DNS| DNS --> LB
LB --> ARMOR --> MIG
HC --> MIG
FW2 --> MIG
MIG -->|pull image| AR
MIG -->|fetch secrets at runtime| SM
MIG --> LOG
MIG --> MON
IAM --> MIG
MIG --> NAT
8. Prerequisites
Account / project requirements
- A Google Cloud project with billing enabled.
- Ability to enable required APIs in the project.
Permissions / IAM roles
For a lab in a personal sandbox project, Project Owner is simplest.
For least-privilege in a real environment, you typically need:
– Permissions to create and manage Compute Engine instances (for example, roles/compute.instanceAdmin.v1)
– Permissions to create firewall rules if you do that in the lab (for example, roles/compute.securityAdmin or roles/compute.networkAdmin)
– Permission to use a service account if attaching one (roles/iam.serviceAccountUser on that service account)
Exact least-privilege depends on your organization policies; verify with your IAM admins.
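Granting one of these roles looks like the following (the project ID and principal are placeholders; in real organizations, prefer groups and run your own least-privilege review):

```shell
# Grant instance administration on the project to a user.
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="user:dev@example.com" \
  --role="roles/compute.instanceAdmin.v1"

# Allow that user to attach a specific service account to VMs.
gcloud iam service-accounts add-iam-policy-binding \
  vm-sa@PROJECT_ID.iam.gserviceaccount.com \
  --member="user:dev@example.com" \
  --role="roles/iam.serviceAccountUser"
```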
Billing requirements
- Compute Engine resources incur charges (VM core/RAM time, disks, IPs, load balancing, egress).
- Container-Optimized OS itself is an image; pricing is primarily for the underlying Compute Engine resources.
CLI / tools
- Cloud Shell (recommended) or a local installation of the gcloud CLI: https://cloud.google.com/sdk/docs/install
- Optional: curl for testing endpoints.
Region availability
- COS images are used in Compute Engine, which is available across many regions/zones. Choose a zone close to your users and other dependencies.
- Some machine types and features are region/zone dependent. Verify in Compute Engine docs if you need specific hardware.
Quotas / limits
Common quotas to check:
– vCPU quota in your chosen region
– In-use IP addresses
– Firewall rules quota (usually not an issue in small labs)
– If using MIG/LB later: forwarding rules and backend service quotas
Prerequisite services
- Compute Engine API must be enabled.
9. Pricing / Cost
Pricing model (accurate framing)
Container-Optimized OS does not have a separate SKU you pay for like a managed service. Your costs come from the Compute Engine resources you run COS on, plus any connected services (load balancer, disks, logs, egress, Artifact Registry, etc.).
Primary official pricing references:
– Compute Engine pricing: https://cloud.google.com/compute/pricing
– Google Cloud Pricing Calculator: https://cloud.google.com/products/calculator
Pricing dimensions to understand
You typically pay for:
1. VM runtime: vCPU + memory pricing by machine type and region.
2. Boot disk and attached disks: persistent disk type (balanced, SSD, standard), size (GB-month), and IOPS/throughput characteristics depending on disk type.
3. Networking:
   – Egress to the internet (often a major driver)
   – Cross-region traffic
   – Load balancer data processing (if used)
4. External IP: depending on how the IP is used (ephemeral vs reserved, attached vs unused), pricing can vary; verify current external IP pricing in official docs.
5. Operations suite (Logging/Monitoring): logs ingestion/retention and metrics beyond free allocations.
6. Artifact Registry:
   – storage for container images
   – network egress when pulling images across regions (and general network costs)
7. Optional security:
   – Cloud Armor policies and rules
   – KMS usage if you add customer-managed encryption keys (CMEK) to disks or other resources
Free tier (if applicable)
Google Cloud has an “Always Free” tier for some resources in some regions (historically including a small VM). Eligibility and details change over time and vary by region and usage. Verify current Always Free eligibility in official docs before assuming a workload is free.
Cost drivers (what surprises teams)
- Internet egress from serving traffic publicly can exceed compute costs.
- Overprovisioned machine types: using a larger instance than necessary for a small container.
- Log volume: chatty containers can generate expensive log ingestion.
- Load balancer + multiple zones: great for reliability, but adds cost.
- Image pull patterns: frequent instance recreation can cause frequent image pulls (and potential egress) if not regionally optimized.
Hidden/indirect costs
- Engineering time for:
- secure image supply chain
- rollouts/rollbacks
- secrets management
- observability
- If you move from “single VM” to “production fleet”, costs often shift to:
- load balancing
- monitoring/logging
- security controls (Armor, WAF-like policies)
- multi-zone redundancy
How to optimize cost (practical)
- Right-size instances (start small; measure CPU/memory).
- Use Managed Instance Groups with autoscaling for variable traffic.
- Take advantage of Sustained Use Discounts (applied automatically where applicable) and evaluate Committed Use Discounts for steady workloads (verify current discount applicability in the Compute Engine pricing docs).
- Reduce log volume:
- tune application logging levels
- apply log exclusions in Cloud Logging if appropriate
- Keep Artifact Registry in the same region as your compute fleet to minimize latency and cross-region egress.
- Prefer private instances behind a load balancer + Cloud NAT if you don’t need per-VM public IPs.
Example low-cost starter estimate (no fabricated prices)
A minimal lab setup often includes:
– 1 small VM instance (e.g., an E2-family small machine type)
– 1 small boot disk
– 1 firewall rule
– Minimal internet egress (a few MB for testing)
To estimate your real cost:
1. Open the Pricing Calculator: https://cloud.google.com/products/calculator
2. Add “Compute Engine”.
3. Select your region, machine type, usage hours (e.g., a few hours), disk type/size.
4. Add expected internet egress (even small amounts).
Because compute and network pricing are region-dependent, do not rely on a single universal number.
Example production cost considerations
A typical production pattern (MIG + load balancer) adds:
– Multiple instances across zones (or a regional MIG)
– Load balancer components (forwarding rules, proxies, backend service)
– Health checks
– Higher log and metric volume
– Potential Cloud Armor usage
– More egress volume
Use the calculator with:
– your steady-state instance count
– expected peak scaling
– expected requests/GB egress
– log volume (if you can estimate it)
10. Step-by-Step Hands-On Tutorial
This lab deploys a real container to a Compute Engine VM running Container-Optimized OS and exposes it over HTTP for quick validation.
Objective
- Create a Compute Engine VM that uses Container-Optimized OS and automatically runs an nginx container.
- Allow inbound HTTP traffic.
- Validate the service.
- Clean up resources to avoid ongoing charges.
Lab Overview
You will:
1. Set your project and enable the Compute Engine API.
2. Create a firewall rule to allow inbound TCP port 80.
3. Create a COS-based VM using gcloud compute instances create-with-container.
4. Validate with curl.
5. Troubleshoot common issues.
6. Delete the VM and firewall rule.
Why create-with-container?
It’s the most beginner-friendly way to run a container as the “main” VM workload on Container-Optimized OS without building a custom image.
Expected cost
Low, if you:
– use a small VM
– keep the lab running only briefly
– generate minimal egress
Always verify pricing for your region and account.
Step 1: Select a project, region, and enable the API
In Cloud Shell, run:
gcloud auth list
gcloud config list project
Set your project:
export PROJECT_ID="YOUR_PROJECT_ID"
gcloud config set project "${PROJECT_ID}"
Pick a zone (example: us-central1-a). Choose one close to you:
export ZONE="us-central1-a"
gcloud config set compute/zone "${ZONE}"
Enable the Compute Engine API:
gcloud services enable compute.googleapis.com
Expected outcome
– Compute Engine API is enabled.
– Your gcloud default project and zone are set.
Step 2: Create a firewall rule to allow HTTP (port 80)
Create a firewall rule that allows inbound TCP:80 to instances with a specific network tag.
gcloud compute firewall-rules create allow-http-80 \
--direction=INGRESS \
--priority=1000 \
--network=default \
--action=ALLOW \
--rules=tcp:80 \
--source-ranges=0.0.0.0/0 \
--target-tags=cos-http
Expected outcome
– A firewall rule named allow-http-80 exists in your project.
– Only instances tagged cos-http will be reachable on port 80.
Verification
gcloud compute firewall-rules describe allow-http-80 --format="value(name,network,direction,allowed[].IPProtocol,allowed[].ports)"
Step 3: Create a Container-Optimized OS VM that runs Nginx
Create a VM and specify a container image to run. This command will:
– create the VM
– use a COS-based container-VM workflow
– start the container on boot
export VM_NAME="cos-nginx-1"
gcloud compute instances create-with-container "${VM_NAME}" \
--tags=cos-http \
--machine-type=e2-micro \
--container-image=nginx:stable
Notes:
– e2-micro is a small machine type commonly used for labs, but availability and cost depend on region. If it fails due to quota or availability, try e2-small.
– The container image is pulled from a public registry in this example. For production, prefer Artifact Registry with IAM-controlled access.
Expected outcome
– A VM instance is created.
– The Nginx container starts automatically.
– The VM has an external IP (by default, in the default VPC unless you changed defaults).
Verification
Describe the instance and capture its external IP:
gcloud compute instances describe "${VM_NAME}" --format="get(networkInterfaces[0].accessConfigs[0].natIP)"
Store it:
export EXTERNAL_IP="$(gcloud compute instances describe "${VM_NAME}" --format="get(networkInterfaces[0].accessConfigs[0].natIP)")"
echo "External IP: ${EXTERNAL_IP}"
Step 4: Test the web server from Cloud Shell
Run:
curl -i "http://${EXTERNAL_IP}/"
Expected outcome
– You receive an HTTP response (typically HTTP/1.1 200 OK) and see the Nginx welcome HTML.
If you want to see headers only:
curl -I "http://${EXTERNAL_IP}/"
Step 5: Basic operational checks (instance + container)
Check instance status:
gcloud compute instances describe "${VM_NAME}" --format="value(status)"
If you need to SSH for deeper debugging:
gcloud compute ssh "${VM_NAME}"
Once connected, you can inspect system logs with journalctl (available on many systemd-based systems). The exact unit names and container supervisor depend on the container-on-VM implementation. If you don’t immediately see container logs, use:
sudo journalctl --no-pager -n 200
If the container-on-VM workflow uses a dedicated service unit, you can list units and search:
sudo systemctl list-units --type=service | head
sudo systemctl list-units --type=service | grep -i -E "container|konlet|docker|containerd" || true
If you require a precise “which service starts the container” answer for your chosen image family, verify in the official Compute Engine containers documentation, because the underlying components and naming can evolve.
Exit SSH:
exit
Validation
You have successfully validated that:
– Container-Optimized OS can run a container workload on Compute Engine.
– The workload is reachable over HTTP.
– You can operate it using standard Compute Engine tooling.
A quick final validation summary:
echo "VM: ${VM_NAME}"
echo "IP: ${EXTERNAL_IP}"
curl -I "http://${EXTERNAL_IP}/" | head -n 1
Troubleshooting
Issue: curl times out / cannot connect
Common causes and fixes:
1. Firewall rule missing or wrong tag
– Ensure the VM has the tag cos-http:
gcloud compute instances describe "${VM_NAME}" --format="value(tags.items)"
– Ensure the firewall rule targets that tag:
gcloud compute firewall-rules describe allow-http-80 --format="value(targetTags)"
2. Wrong IP
– Re-check the external IP:
gcloud compute instances describe "${VM_NAME}" --format="get(networkInterfaces[0].accessConfigs[0].natIP)"
3. Container not running
– SSH in and inspect logs (journalctl) as shown above.
– Recreate the instance if needed (in immutable style, replacing is often faster than deep repair).
Issue: create-with-container fails with permissions error
- Ensure you have permissions to create instances.
- In managed orgs, Organization Policy may block external IPs or public firewall rules. If so:
- Use an internal load balancer / private access patterns
- Or request policy exceptions in a sandbox project
Issue: Machine type not available / quota exceeded
- Try a different zone:
gcloud compute zones list --filter="region:(us-central1)" --format="value(name)"
- Try a different machine type (e.g., e2-small).
Cleanup
Delete the VM:
gcloud compute instances delete "${VM_NAME}" --quiet
Delete the firewall rule:
gcloud compute firewall-rules delete allow-http-80 --quiet
Verify cleanup:
gcloud compute instances list --filter="name=${VM_NAME}"
gcloud compute firewall-rules list --filter="name=allow-http-80"
11. Best Practices
Architecture best practices
- Prefer Managed Instance Groups for production:
- Enables rolling updates and autohealing.
- Makes “replace instances” the standard remediation.
- Externalize state:
- Store data in managed services (Cloud SQL, Spanner, Firestore) or persistent disks designed for that purpose.
- Keep COS VMs as stateless as possible.
- Use load balancers instead of per-VM public IPs:
- Better security posture and easier TLS, health checks, and scaling.
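As an illustrative sketch of the MIG pattern above (the names api-template, api-mig, and api-health-check, plus the image path, are placeholders — verify current flags in the gcloud reference):

```shell
# Create a reusable instance template that runs the container.
gcloud compute instance-templates create-with-container api-template \
  --machine-type=e2-small \
  --container-image=us-central1-docker.pkg.dev/my-project/my-repo/api@sha256:<digest> \
  --tags=cos-http

# Create a regional MIG from the template for multi-zone resilience.
gcloud compute instance-groups managed create api-mig \
  --region=us-central1 \
  --size=3 \
  --template=api-template

# Attach autohealing via a pre-created health check.
gcloud compute instance-groups managed update api-mig \
  --region=us-central1 \
  --health-check=api-health-check \
  --initial-delay=120
```

With this in place, "replace instances" becomes the standard remediation: delete an unhealthy VM and the MIG recreates it from the template.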
IAM and security best practices
- Use dedicated service accounts per workload and grant least privilege.
- Use OS Login (and ideally IAP for SSH) to avoid unmanaged SSH keys.
- Restrict firewall rules:
  - Avoid 0.0.0.0/0 unless necessary.
  - Limit inbound ports; default deny.
- Pin and verify container images:
  - Prefer immutable image references (digests) in production rather than mutable tags like latest.
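A hedged sketch of the firewall and digest-pinning points (the rule name, source ranges, and image path are examples; 130.211.0.0/22 and 35.191.0.0/16 are Google load balancer/health check ranges — verify against current docs before relying on them):

```shell
# Restrict HTTP ingress to known source ranges and a target tag, not 0.0.0.0/0.
gcloud compute firewall-rules create allow-http-restricted \
  --allow=tcp:80 \
  --source-ranges=130.211.0.0/22,35.191.0.0/16 \
  --target-tags=cos-http

# Reference the container image by immutable digest instead of a mutable tag.
gcloud compute instances create-with-container web-1 \
  --container-image=us-central1-docker.pkg.dev/my-project/my-repo/web@sha256:<digest>
```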
Cost best practices
- Right-size aggressively; measure real CPU/memory usage.
- Use autoscaling with MIGs for variable workloads.
- Minimize egress and cross-region pulls (keep Artifact Registry close to compute).
- Control log volume; implement log exclusions where appropriate.
Performance best practices
- Keep container images small; optimize layer caching.
- Use regional placement to reduce latency to dependencies.
- Ensure health checks are representative and not overly expensive.
Reliability best practices
- Run across multiple zones (regional MIG) for high availability.
- Use load balancer health checks and autohealing.
- Design for instance replacement during updates and failures.
- Implement graceful shutdown in your application so rolling updates don’t drop requests.
Operations best practices
- Standardize instance templates and use labels for ownership and environment (env=prod,team=payments).
- Maintain a documented rollout process (update template, rolling update parameters, rollback).
- Keep a break-glass procedure for emergency access that is auditable and time-bound.
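For the labeling practice above, a minimal example (label values mirror the env=prod,team=payments convention; flags are standard gcloud but verify against the current reference):

```shell
# Apply ownership/environment labels to an existing instance.
gcloud compute instances add-labels "${VM_NAME}" --labels=env=prod,team=payments

# Labels then drive filtering for inventory and cost attribution.
gcloud compute instances list --filter="labels.env=prod" \
  --format="table(name,labels.list())"
```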
Governance/tagging/naming best practices
- Consistent naming: svc-env-region-role-### (example: api-prod-uscentral1-web-001)
- Use labels for:
- cost center
- data sensitivity tier
- owner/oncall
- Track COS image family/version and container image digest in deployment records.
12. Security Considerations
Identity and access model
- IAM governs:
- who can create/modify/delete instances, templates, firewall rules
- who can attach service accounts and what scopes/permissions workloads get
- Service accounts are the recommended way for apps to access Google Cloud APIs.
- OS Login integrates Linux account access with IAM and helps centralize auditability.
Encryption
- At rest: Compute Engine disks are encrypted by default. For stricter requirements, consider CMEK (customer-managed keys) for disks (verify the current CMEK support and configuration in Compute Engine docs).
- In transit: Use TLS termination at the load balancer or in the container. Prefer managed certificates and modern TLS policies where applicable.
Network exposure
- Avoid giving every instance a public IP in production.
- Use:
- External HTTP(S) Load Balancer (public entry)
- Private instances in subnets
- Cloud NAT for outbound access without inbound exposure
- Use firewall rules with least exposure and target tags/service accounts.
Secrets handling
- Avoid storing secrets in:
- instance metadata
- container image layers
- source control
- Prefer Secret Manager and fetch secrets at runtime using the VM’s service account identity.
- Rotate secrets and use short-lived credentials where possible.
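A sketch of runtime secret retrieval using the VM’s service account (the secret name api-key and project my-project are placeholders; run this from your application container, since tooling availability on the COS host itself is limited):

```shell
# Option A: gcloud, if present in your tooling container.
gcloud secrets versions access latest --secret=api-key

# Option B: raw REST call using the metadata-server token (no gcloud needed).
TOKEN="$(curl -s -H 'Metadata-Flavor: Google' \
  'http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token' \
  | python3 -c 'import sys,json; print(json.load(sys.stdin)["access_token"])')"
curl -s -H "Authorization: Bearer ${TOKEN}" \
  "https://secretmanager.googleapis.com/v1/projects/my-project/secrets/api-key/versions/latest:access" \
  | python3 -c 'import sys,json,base64; print(base64.b64decode(json.load(sys.stdin)["payload"]["data"]).decode())'
```

The key property: no static credential is baked into the image or metadata; access is governed by the service account’s IAM bindings.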
Audit/logging
- Use Cloud Audit Logs for administrative actions (instance creation, firewall changes, IAM changes).
- Ensure you can attribute:
- who deployed a new container version
- who changed network exposure
- who accessed instances (OS Login + IAP logs, where used)
- For workload logs:
- centralize to Cloud Logging (agent/collection method depends on your approach; verify the recommended method for COS).
Compliance considerations
- COS can support secure operations, but compliance depends on:
- your configuration
- identity controls
- logging/retention
- vulnerability management
- network boundaries
Always validate requirements against Google Cloud compliance documentation and your auditor’s needs.
Common security mistakes
- Leaving SSH open to the internet with weak key management.
- Using wide firewall rules (0.0.0.0/0) for admin ports.
- Running containers as root unnecessarily.
- Pulling public images without provenance checks.
- Using mutable tags (latest) in production.
- Treating COS hosts as “pet servers” and making manual changes that aren’t reproducible.
Secure deployment recommendations
- Prefer MIG + LB architecture.
- Use private images in Artifact Registry with IAM.
- Use image scanning/provenance in your CI pipeline (verify your chosen tooling).
- Restrict metadata exposure and avoid sensitive data in metadata.
- Implement runtime security controls in the application and container configuration (non-root user, read-only filesystem in container where possible).
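As one way to express the non-root/read-only recommendation (shown with plain docker run flags; adapt to whatever supervisor your container-on-VM workflow uses — the UID, ports, and image path are placeholders):

```shell
# Runtime hardening flags: non-root user, read-only root FS, minimal capabilities.
docker run -d \
  --read-only \
  --user 1000:1000 \
  --cap-drop=ALL \
  --tmpfs /tmp \
  -p 80:8080 \
  us-central1-docker.pkg.dev/my-project/my-repo/web@sha256:<digest>
```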
13. Limitations and Gotchas
Known limitations / design constraints
- Not a general-purpose Linux distro: package installation and customization are intentionally limited.
- Debugging friction: fewer built-in tools; you may need “debug containers” or dedicated debugging workflows.
- Host persistence model differs: immutable root patterns mean some changes won’t persist or are discouraged.
- Multi-container orchestration is limited without Kubernetes or additional tooling: the simplest workflows assume “one main container per VM” or require you to build your own supervisor approach.
Quotas and scaling constraints
- vCPU quota and IP quota can block scaling.
- Load balancing quotas can surprise teams when moving to production patterns.
Regional constraints
- Some machine types and accelerators are zone-specific.
- Keep Artifact Registry and compute in compatible regions to avoid latency/egress.
Pricing surprises
- Internet egress for public services.
- Log ingestion volume from chatty containers.
- External IP charges depending on usage type (verify current billing rules).
Compatibility issues
- Some third-party monitoring/security agents assume they can install packages or write broadly to the filesystem.
- Kernel module requirements can be tricky; verify whether your workload needs specific kernel modules/drivers.
Operational gotchas
- If you treat instances as mutable pets, you’ll fight the platform.
- Updates/reboots must be planned for (MIG rolling updates help).
- Container image pull failures (auth, network) can cause instances to come up “healthy VM but unhealthy app.”
Migration challenges
- Moving from Ubuntu to COS may require:
- rebuilding host-installed software into container images
- redesigning log collection
- changing SSH/debug habits
Vendor-specific nuances
- COS is deeply integrated with Google Cloud’s Compute Engine model. If you need portability across clouds at the VM OS level, consider whether a more generic OS (or Kubernetes) is a better abstraction.
14. Comparison with Alternatives
In Google Cloud (nearest options)
- Ubuntu/Debian on Compute Engine: flexible general-purpose OS, more host maintenance.
- GKE Standard / Autopilot: managed Kubernetes; more features for orchestration and scale, but more platform complexity.
- Cloud Run: serverless containers; simplest ops model but less VM-level control and some workload constraints.
In other clouds (nearest conceptual peers)
- AWS Bottlerocket: container-optimized OS for ECS/EKS.
- Azure Linux / CBL-Mariner-based container host patterns: Microsoft has container host OS patterns; exact product choices vary—verify current Azure recommendations.
- Self-managed minimal OS: Fedora CoreOS, Flatcar, etc., when you want an immutable OS with different ecosystem tradeoffs.
Comparison table
| Option | Best For | Strengths | Weaknesses | When to Choose |
|---|---|---|---|---|
| Container-Optimized OS (Google Cloud) | Container workloads on Compute Engine VMs | Minimal/hardened host, designed for containers, good for MIG patterns | Limited host customization, different debugging model | You want containers on VMs with strong baseline and low OS toil |
| Ubuntu/Debian on Compute Engine | Mixed workloads, custom agents, traditional VM ops | Familiar tooling, package managers, broad compatibility | Larger attack surface, more patching/drift risk | You need broad OS flexibility or legacy software |
| GKE Standard | Kubernetes-managed container platforms | Rich orchestration, scaling, service discovery, policies | Kubernetes operational overhead (though managed) | You have multiple services and want Kubernetes features |
| GKE Autopilot | “Kubernetes with less ops” | Less node management, opinionated best practices | Less infrastructure control, different cost model | You want Kubernetes but don’t want to manage nodes |
| Cloud Run | Stateless HTTP services and event-driven containers | Very low ops, fast deploys, scale to zero | Platform constraints (request/response model, execution limits), less network control | You want serverless simplicity and fit the model |
| AWS Bottlerocket (AWS) | Container hosts in AWS | Minimal immutable OS for containers | Different cloud, different integrations | Multi-cloud comparison; choose if you’re on AWS |
| Fedora CoreOS / Flatcar (self-managed) | Immutable OS approach with broader control | Strong immutability story, flexible environments | You manage lifecycle and integration | You need an immutable OS but prefer non-cloud-vendor images |
15. Real-World Example
Enterprise example: Secure API fleet on Compute Engine with controlled rollout
- Problem: A large enterprise has strict security requirements and wants to reduce VM drift. They run containerized APIs that must integrate with existing VPCs, shared load balancers, and IAM.
- Proposed architecture:
- Artifact Registry for private images
- Cloud Build pipeline builds and signs images (signing approach depends on chosen tooling; verify)
- Regional MIG of COS instances using an instance template that references a pinned COS image family/version
- External HTTP(S) Load Balancer + Cloud Armor in front
- Workloads use service accounts to access Pub/Sub and Cloud SQL
- Centralized logging/monitoring with alerting tied to SLOs
- Why COS was chosen:
- Minimal host OS reduces attack surface and drift
- Automated updates fit enterprise patching goals when paired with MIG rolling updates
- Clear separation: “host is appliance, app is container”
- Expected outcomes:
- Faster, safer rollouts (template update → rolling update)
- Reduced OS vulnerabilities window and fewer manual patch cycles
- Consistent baseline across environments
Startup/small-team example: Simple container hosting without Kubernetes
- Problem: A startup needs a reliable service host for one API and one worker, but Kubernetes is too heavy for current team size.
- Proposed architecture:
- COS VM(s) running container workloads
- A small MIG for the API behind a load balancer
- Worker service on a separate MIG without public ingress
- Artifact Registry for images
- Secret Manager for API keys
- Why COS was chosen:
- Reduced ops compared to Ubuntu patch management
- Easier than Kubernetes while still enabling immutable deployments
- Expected outcomes:
- Simple deploy pipeline: build image → update template → roll
- Lower operational burden and predictable environment
- A clear growth path to GKE later if needed
16. FAQ
1) Is Container-Optimized OS a separate billed product?
No. You pay for the Compute Engine resources (VMs, disks, network, etc.). COS is an OS image you choose for instances.
2) Can I SSH into a COS VM?
Yes, you can SSH like other Compute Engine VMs (subject to IAM and network controls). Use OS Login/IAP where possible for better security.
3) Can I install packages with apt or yum?
Typically, no—COS is not meant to be managed like a general-purpose distro. Put dependencies in container images instead.
4) How do OS updates work?
COS is designed to receive automated updates. For production, plan for reboots and use MIG rolling updates. Verify current update controls/channels in official docs.
5) Is COS only for a single container per VM?
Many common workflows assume one primary container, but you can run additional containers depending on your chosen approach. If you need multi-container orchestration with service discovery and rollouts, consider GKE.
6) Should I use COS or GKE?
Use COS on Compute Engine when you want VM-based control and simpler operations for a smaller set of services. Use GKE when you need Kubernetes orchestration features, multi-service scheduling, and Kubernetes-native policies.
7) Does COS work with Managed Instance Groups?
Yes. COS is often used with MIGs for autohealing, autoscaling, and rolling updates.
8) How do I pull private images from Artifact Registry?
Attach a service account to the VM with Artifact Registry read permissions and ensure network access to the registry endpoint. Verify the exact required IAM role(s) in Artifact Registry docs.
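A sketch of the IAM grant (repository, location, and service account are placeholders; roles/artifactregistry.reader is the commonly documented read role, but verify current role names in the Artifact Registry docs):

```shell
# Grant the VM's service account read access to one repository.
gcloud artifacts repositories add-iam-policy-binding my-repo \
  --location=us-central1 \
  --member="serviceAccount:my-vm-sa@my-project.iam.gserviceaccount.com" \
  --role="roles/artifactregistry.reader"
```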
9) Where should I store secrets for COS workloads?
Use Secret Manager and fetch secrets at runtime using the VM’s service account identity. Avoid embedding secrets in metadata or images.
10) How do I handle persistent storage?
Prefer managed services. If you must persist files, use persistent disks or other Google Cloud storage products. Design carefully so instance replacement does not lose state.
11) Is COS “more secure” than Ubuntu by default?
It’s designed with a smaller footprint and hardened patterns, which can reduce attack surface. Security still depends heavily on your container image, IAM, network exposure, and operational practices.
12) Can I run non-container workloads on COS?
COS is intended for containers. If you need general-purpose workloads or host-installed software, use a general-purpose OS image.
13) How do I do blue/green deployments with COS?
Commonly: create a new instance template (new container image digest), roll a new MIG or update an existing MIG with controlled rollout, and switch traffic via load balancer backends.
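A sketch of the in-place controlled rollout variant (names api-template-v2 and api-mig are placeholders; verify flags in the gcloud rolling-action reference):

```shell
# New template pinned to the new image digest.
gcloud compute instance-templates create-with-container api-template-v2 \
  --container-image=us-central1-docker.pkg.dev/my-project/my-repo/api@sha256:<new-digest>

# Roll the MIG forward one instance at a time, keeping capacity constant.
gcloud compute instance-groups managed rolling-action start-update api-mig \
  --region=us-central1 \
  --version=template=api-template-v2 \
  --max-surge=1 --max-unavailable=0
```

For true blue/green, create a second MIG instead and shift traffic between load balancer backends.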
14) How do I observe container logs and metrics?
Use Cloud Logging/Monitoring. The exact agent/collection method depends on COS support and your chosen approach. Verify the current recommended method in official docs.
15) What’s the difference between COS on Compute Engine and COS as GKE node image?
On Compute Engine, you manage the VM lifecycle and container startup method. On GKE, Google (or you, depending on mode) manages nodes and Kubernetes orchestrates containers.
16) Can I use COS for highly regulated environments?
Possibly, but you must validate the entire system (IAM, logging, encryption, network boundaries, patching processes) against your compliance framework. Don’t assume compliance from OS choice alone.
17) Do I need a public IP for a COS VM?
No. Many production designs use private VMs behind a load balancer, and use Cloud NAT for outbound access.
17. Top Online Resources to Learn Container-Optimized OS
| Resource Type | Name | Why It Is Useful |
|---|---|---|
| Official documentation | Container-Optimized OS docs — https://cloud.google.com/container-optimized-os/docs | Primary source for COS concepts, images, security model, and operations |
| Official release notes | Container-Optimized OS release notes — https://cloud.google.com/container-optimized-os/docs/release-notes | Track security fixes, version changes, and behavioral updates |
| Official Compute Engine containers guide | Deploying containers on VMs (Compute Engine) — https://cloud.google.com/compute/docs/containers | Authoritative guide for create-with-container and container declaration patterns |
| Official pricing | Compute Engine pricing — https://cloud.google.com/compute/pricing | COS cost is primarily Compute Engine cost; this is the base pricing reference |
| Pricing calculator | Google Cloud Pricing Calculator — https://cloud.google.com/products/calculator | Build a region-specific estimate including disks, egress, and load balancing |
| Architecture guidance | Google Cloud Architecture Center — https://cloud.google.com/architecture | Patterns for MIGs, load balancing, security, and operations |
| Observability | Cloud Operations suite docs — https://cloud.google.com/products/operations | Logging/Monitoring patterns that apply to VM/container architectures |
| Container image registry | Artifact Registry docs — https://cloud.google.com/artifact-registry/docs | Secure private image storage and IAM-controlled access |
| Security/IAM | IAM overview — https://cloud.google.com/iam/docs/overview | Correct identity model for VM/container workloads |
| Tutorials (official) | Compute Engine tutorials — https://cloud.google.com/compute/docs/tutorials | VM patterns that often pair well with COS (MIGs, LBs, networking) |
18. Training and Certification Providers
| Institute | Suitable Audience | Likely Learning Focus | Mode | Website URL |
|---|---|---|---|---|
| DevOpsSchool.com | DevOps engineers, SREs, platform teams | DevOps tooling, cloud operations, CI/CD, container operations | Check website | https://www.devopsschool.com/ |
| ScmGalaxy.com | Beginners to intermediate DevOps learners | SCM, DevOps fundamentals, build/release practices | Check website | https://www.scmgalaxy.com/ |
| CLoudOpsNow.in | Cloud ops practitioners | Cloud operations, monitoring, automation | Check website | https://www.cloudopsnow.in/ |
| SreSchool.com | SREs and reliability-focused engineers | SRE practices, reliability engineering, observability | Check website | https://www.sreschool.com/ |
| AiOpsSchool.com | Ops teams exploring AIOps | AIOps concepts, automation, monitoring analytics | Check website | https://www.aiopsschool.com/ |
19. Top Trainers
| Platform/Site | Likely Specialization | Suitable Audience | Website URL |
|---|---|---|---|
| RajeshKumar.xyz | DevOps/Cloud training content (verify offering) | Individuals and teams seeking guided learning | https://www.rajeshkumar.xyz/ |
| devopstrainer.in | DevOps training (verify course catalog) | Beginners to advanced DevOps learners | https://www.devopstrainer.in/ |
| devopsfreelancer.com | Freelance DevOps guidance/services (treat as a resource platform unless verified) | Teams needing short-term expert help | https://www.devopsfreelancer.com/ |
| devopssupport.in | DevOps support/training resources (verify scope) | Engineers needing practical support | https://www.devopssupport.in/ |
20. Top Consulting Companies
| Company Name | Likely Service Area | Where They May Help | Consulting Use Case Examples | Website URL |
|---|---|---|---|---|
| cotocus.com | DevOps and cloud consulting (verify exact offerings) | Cloud migration, CI/CD, infrastructure automation | COS-based MIG design, secure container hosting on Compute Engine, rollout/rollback automation | https://www.cotocus.com/ |
| DevOpsSchool.com | DevOps consulting and enablement | Platform engineering, training + implementation | Designing container-on-VM reference architectures, setting up Artifact Registry + CI pipelines | https://www.devopsschool.com/ |
| DEVOPSCONSULTING.IN | DevOps consulting (verify exact offerings) | Operational readiness, automation, reliability practices | MIG + load balancer production setup, logging/monitoring baseline, security review | https://www.devopsconsulting.in/ |
21. Career and Learning Roadmap
What to learn before Container-Optimized OS
- Google Cloud fundamentals:
- projects, billing, IAM, service accounts
- VPC networking and firewall rules
- Compute Engine basics:
- instances, images, disks
- instance templates and MIGs (recommended)
- Containers fundamentals:
- Docker/OCI images, registries
- container networking and ports
- basic security (non-root, minimal images)
What to learn after Container-Optimized OS
- Production architectures:
- external HTTP(S) load balancing
- Cloud Armor basics
- multi-zone design and SLOs
- CI/CD and supply chain:
- Cloud Build or other CI
- Artifact Registry permissions and lifecycle policies
- vulnerability scanning and provenance (verify your selected tooling)
- Kubernetes (optional but common next step):
- GKE Standard/Autopilot
- deployment strategies, services, ingress, policies
Job roles that use it
- Cloud engineer (Compute Engine + container hosting)
- DevOps engineer / platform engineer
- SRE (especially VM fleet operations)
- Security engineer (hardened baseline, workload identity, network boundaries)
Certification path (if available)
There is no “Container-Optimized OS certification” specifically. Relevant Google Cloud certifications typically include:
– Associate Cloud Engineer
– Professional Cloud Architect
– Professional Cloud DevOps Engineer
Verify current certification names and requirements: https://cloud.google.com/learn/certification
Project ideas for practice
- Build a small service and deploy it to a COS MIG behind a load balancer.
- Implement blue/green via two MIGs and controlled traffic switching.
- Store images in Artifact Registry and restrict access via service accounts.
- Implement Secret Manager integration and rotate secrets.
- Add Cloud Monitoring alerts on HTTP error rate and latency.
- Create a cost dashboard using labels for team/environment.
22. Glossary
- Artifact Registry: Google Cloud service to store container images and other artifacts with IAM-based access control.
- Compute Engine: Google Cloud’s IaaS VM service.
- Container image: A packaged filesystem and metadata used to run a container (OCI/Docker format).
- Container runtime: Software that runs containers on a host (commonly containerd; Docker Engine historically in some contexts—verify for your COS image).
- COS: Common abbreviation for Container-Optimized OS.
- Firewall rule (VPC): Network rule controlling allowed/denied traffic to VM instances.
- IAM: Identity and Access Management, controls permissions in Google Cloud.
- Instance template: A reusable VM configuration used by Managed Instance Groups.
- Managed Instance Group (MIG): A group of identical VMs managed as a single entity for scaling, autohealing, and rolling updates.
- OS Login: IAM-integrated method for managing SSH access to VMs.
- Service account: A Google identity used by workloads to access Google Cloud APIs.
- Shielded VM: Compute Engine features for protecting against boot-level and rootkit attacks (verify COS compatibility and best practices).
- VPC: Virtual Private Cloud network in Google Cloud.
- Workload identity (VM): Using a VM’s service account credentials to access Google Cloud APIs without static keys.
23. Summary
Container-Optimized OS is a Google-managed, container-focused operating system image for Google Cloud Compute Engine. It matters because it reduces OS maintenance overhead, limits host attack surface, and aligns well with immutable infrastructure practices—especially when combined with Managed Instance Groups for rolling updates and autohealing.
Cost-wise, COS itself is not a separate billed service; your spend is driven by Compute Engine VM runtime, disks, networking (especially egress), load balancing, and observability. Security-wise, COS helps by providing a minimal and hardened baseline, but real security still depends on IAM least privilege, firewall design, image provenance, secrets handling, and logging/auditing.
Use Container-Optimized OS when you want to run containers on VMs with a strong baseline and straightforward operations. If you need full orchestration and Kubernetes-native features, plan for GKE; if you want maximum simplicity and your workload fits, consider Cloud Run.
Next step: take the lab further by putting your COS instances into a Managed Instance Group behind an HTTP(S) load balancer, using Artifact Registry (private images) and Secret Manager (runtime secrets).