{"id":550,"date":"2026-04-14T11:41:27","date_gmt":"2026-04-14T11:41:27","guid":{"rendered":"https:\/\/www.devopsschool.com\/tutorials\/google-cloud-deep-learning-vm-images-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-ai-and-ml\/"},"modified":"2026-04-14T11:41:27","modified_gmt":"2026-04-14T11:41:27","slug":"google-cloud-deep-learning-vm-images-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-ai-and-ml","status":"publish","type":"post","link":"https:\/\/www.devopsschool.com\/tutorials\/google-cloud-deep-learning-vm-images-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-ai-and-ml\/","title":{"rendered":"Google Cloud Deep Learning VM Images Tutorial: Architecture, Pricing, Use Cases, and Hands-On Guide for AI and ML"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Category<\/h2>\n\n\n\n<p>AI and ML<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">1. Introduction<\/h2>\n\n\n\n<p>Deep Learning VM Images is a Google Cloud offering that provides preconfigured virtual machine (VM) images for machine learning and deep learning work on Compute Engine. These images are designed to reduce setup time by including commonly used frameworks and tooling (for example, Python environments, GPU tooling for GPU-enabled images, and other ML developer utilities).<\/p>\n\n\n\n<p>In simple terms: you launch a Compute Engine VM using a Deep Learning VM Images image, connect to the VM, and start building or running ML workloads without spending hours installing drivers and frameworks.<\/p>\n\n\n\n<p>Technically, Deep Learning VM Images are public Compute Engine images published by Google (in a Google-managed image project) and intended to be used as the boot disk for your VM instances. 
You choose a specific image (or image family), select machine type and accelerators (GPUs), configure storage and networking, and then run training\/inference jobs directly on the VM\u2014optionally integrating with Cloud Storage, Artifact Registry, Cloud Logging\/Monitoring, and IAM service accounts.<\/p>\n\n\n\n<p><strong>What problem it solves:<\/strong> ML projects often fail early due to environment friction: incompatible CUDA\/cuDNN versions, missing dependencies, framework version mismatch, and inconsistent developer setups. Deep Learning VM Images provide a repeatable, supported baseline that speeds up experimentation and reduces operational risk when moving from laptops to cloud compute.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">2. What is Deep Learning VM Images?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Official purpose<\/h3>\n\n\n\n<p>Deep Learning VM Images provide Google-maintained VM images for Compute Engine that are optimized for deep learning workflows, typically including popular ML frameworks and supporting tools. 
You use these images to create VMs that are ready for ML development, training, or inference with minimal manual setup.<\/p>\n\n\n\n<p>Official documentation (start here):<br\/>\nhttps:\/\/cloud.google.com\/deep-learning-vm<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Core capabilities<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Launch Compute Engine VMs preconfigured for ML development.<\/li>\n<li>Choose CPU-only or GPU-capable images (exact availability depends on current image catalog\u2014verify in official docs).<\/li>\n<li>Use curated environments for common frameworks and workflows.<\/li>\n<li>Integrate with standard Google Cloud services (VPC networking, IAM, Cloud Storage, Cloud Logging\/Monitoring).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Major components<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Deep Learning VM Images catalog<\/strong>: Public images published by Google in a Google-managed project (commonly referenced in docs; verify the current image project and naming in official docs).<\/li>\n<li><strong>Compute Engine instances<\/strong>: Zonal VMs created from those images.<\/li>\n<li><strong>Persistent Disk (boot and data disks)<\/strong>: Storage backing the VM.<\/li>\n<li><strong>Optional GPU accelerators<\/strong>: NVIDIA GPUs attached to a VM (separately billed).<\/li>\n<li><strong>Networking<\/strong>: VPC, firewall rules, optional public IP, Cloud NAT for private egress, and routes\/DNS.<\/li>\n<li><strong>Identity<\/strong>: IAM and instance service accounts to access other Google Cloud APIs.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Service type<\/h3>\n\n\n\n<p>Deep Learning VM Images is not a managed training service; it is <strong>curated VM images<\/strong> for <strong>Compute Engine<\/strong>. You still operate the VM (patching strategy, disk sizing, network exposure, user access, etc.) 
like any other IaaS VM\u2014just with a faster ML-ready starting point.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Scope (regional\/global\/zonal\/project-scoped)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Images<\/strong>: Compute Engine images are generally <strong>global resources<\/strong> (published once and usable across regions), but you should confirm the current publication model in the docs.<\/li>\n<li><strong>VM instances<\/strong>: Compute Engine VMs are <strong>zonal<\/strong> resources.<\/li>\n<li><strong>Access control<\/strong>: Primarily <strong>project-scoped<\/strong> via IAM (who can create VMs, use images, attach GPUs, access buckets, etc.).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">How it fits into the Google Cloud ecosystem<\/h3>\n\n\n\n<p>Deep Learning VM Images sits in the \u201cbuild\/run ML on infrastructure\u201d space:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Works well when you want <strong>full control<\/strong> of the runtime and dependencies.<\/li>\n<li>Complements:\n<ul class=\"wp-block-list\">\n<li><strong>Cloud Storage<\/strong> for datasets and checkpoints<\/li>\n<li><strong>Artifact Registry<\/strong> for container images (if you run containers on the VM)<\/li>\n<li><strong>Cloud Logging and Cloud Monitoring<\/strong> for ops visibility<\/li>\n<li><strong>Vertex AI<\/strong> services when you want managed pipelines, managed training, managed endpoints, or notebook management (choose based on responsibility boundaries\u2014see comparison section)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">3. 
Why use Deep Learning VM Images?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Business reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Faster time to first experiment<\/strong>: reduces environment setup time.<\/li>\n<li><strong>Consistency across teams<\/strong>: standardizes base images for training and inference.<\/li>\n<li><strong>Predictable operational baseline<\/strong>: fewer \u201cit works on my machine\u201d issues.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Technical reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Prebuilt ML environments<\/strong>: avoids manually assembling Python, libraries, and system dependencies.<\/li>\n<li><strong>Better alignment for GPU workloads<\/strong>: reduces the chance of driver\/runtime mismatch (still verify driver\/framework compatibility for your specific GPU and framework version).<\/li>\n<li><strong>Compute Engine flexibility<\/strong>: choose machine types, disks, networking, and GPUs suited to your workload.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Operational reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Repeatable provisioning<\/strong>: you can automate instance creation via <code>gcloud<\/code>, Terraform, or instance templates (automation is critical for repeatability).<\/li>\n<li><strong>Integration with standard ops tooling<\/strong>: OS Login, IAP TCP forwarding, Cloud Logging\/Monitoring, startup scripts, and managed instance groups (when applicable).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Security\/compliance reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Google-maintained images<\/strong>: curated base reduces exposure from random community images (still requires your patching and hardening strategy).<\/li>\n<li><strong>IAM + service accounts<\/strong>: apply least privilege to dataset\/model access.<\/li>\n<li><strong>VPC controls<\/strong>: private networking, Cloud NAT, firewall policies, VPC Service 
Controls (where applicable to services you access).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Scalability\/performance reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Scale up<\/strong>: larger machine types, faster disks, and GPUs.<\/li>\n<li><strong>Scale out<\/strong>: multiple VMs (manual, managed instance groups for stateless workloads, or batch-style orchestration with other services).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">When teams should choose it<\/h3>\n\n\n\n<p>Choose Deep Learning VM Images when you:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Need <strong>full control<\/strong> of the environment.<\/li>\n<li>Want a curated baseline for <strong>interactive development<\/strong> or <strong>custom training<\/strong>.<\/li>\n<li>Run <strong>GPU-accelerated<\/strong> training\/inference on VMs.<\/li>\n<li>Need to install custom system packages or drivers, or to use bespoke frameworks.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">When they should not choose it<\/h3>\n\n\n\n<p>Consider alternatives when you:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Want a <strong>fully managed<\/strong> training platform (look at Vertex AI Training; verify in Vertex AI docs).<\/li>\n<li>Prefer <strong>container-first<\/strong> execution with orchestrators (GKE + Deep Learning Containers, or Vertex AI custom jobs).<\/li>\n<li>Need <strong>multi-tenant notebook governance<\/strong> and lifecycle management at scale (Vertex AI Workbench-managed setups may be a better fit; verify official docs).<\/li>\n<li>Don\u2019t want to manage VM patching, SSH access, disk lifecycle, or network hardening.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">4. 
Where is Deep Learning VM Images used?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Industries<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Software\/SaaS: model training, recommendation systems, NLP, computer vision.<\/li>\n<li>Healthcare &amp; life sciences: imaging models, research pipelines (subject to compliance needs).<\/li>\n<li>Finance: fraud detection, time-series modeling (governance and auditability matter).<\/li>\n<li>Retail &amp; e-commerce: demand forecasting, personalization.<\/li>\n<li>Manufacturing: defect detection, predictive maintenance.<\/li>\n<li>Media &amp; gaming: content classification, generation workflows, real-time inference.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Team types<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>ML engineering teams needing repeatable environments.<\/li>\n<li>Data science teams doing exploration and prototyping (often dev\/test).<\/li>\n<li>Platform teams building standardized ML compute \u201cgolden paths\u201d.<\/li>\n<li>DevOps\/SRE teams enabling GPU infrastructure and cost controls.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Workloads<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Interactive notebooks and experimentation on VMs.<\/li>\n<li>Batch training jobs that run for minutes to days.<\/li>\n<li>Inference services hosted on a VM (often behind a load balancer or internal service).<\/li>\n<li>ETL + feature generation jobs near the model training runtime.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Architectures<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Single VM prototyping (common early stage).<\/li>\n<li>Multi-VM distributed training (requires careful network and framework configuration).<\/li>\n<li>VM + Cloud Storage \u201cdata lake\u201d pattern.<\/li>\n<li>Hybrid: VM-based training + deployment to managed serving (Vertex AI endpoints or GKE), depending on requirements.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Real-world deployment 
contexts<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Dev\/test<\/strong>: rapid experiments, short-lived spot VMs, small disks, minimal security exposure.<\/li>\n<li><strong>Production<\/strong>: hardened images, private networking, least privilege IAM, controlled data egress, monitoring, and change management.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">5. Top Use Cases and Scenarios<\/h2>\n\n\n\n<p>Below are realistic, field-tested patterns where Deep Learning VM Images fits well.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1) Fast GPU workstation for model prototyping<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Data scientists lose days configuring CUDA, drivers, and frameworks.<\/li>\n<li><strong>Why it fits:<\/strong> Deep Learning VM Images provides a preconfigured base aligned to ML workflows.<\/li>\n<li><strong>Scenario:<\/strong> Create a VM with an attached GPU in a dev VPC; connect via SSH\/IAP; iterate on PyTorch prototypes with minimal setup.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">2) Reproducible training environment for a research team<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Different laptops and OS versions produce inconsistent results.<\/li>\n<li><strong>Why it fits:<\/strong> Standard VM images reduce environment drift.<\/li>\n<li><strong>Scenario:<\/strong> Lab standardizes on one Deep Learning VM Images image and provisions per-user VMs using instance templates.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">3) Scheduled batch training on VMs<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Training needs to run nightly\/weekly with consistent dependencies.<\/li>\n<li><strong>Why it fits:<\/strong> VM images + startup scripts allow repeatable batch runs.<\/li>\n<li><strong>Scenario:<\/strong> A scheduler (external or internal tooling) creates a VM, runs training, uploads artifacts 
to Cloud Storage, then deletes the VM.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">4) Data preprocessing close to training compute<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Preprocessing is slow on local machines and expensive on managed platforms if mis-sized.<\/li>\n<li><strong>Why it fits:<\/strong> Compute Engine flexibility and local SSD\/Persistent Disk choices.<\/li>\n<li><strong>Scenario:<\/strong> Launch a CPU-heavy VM from a DL image, preprocess data, store TFRecords\/Parquet in Cloud Storage.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">5) Inference on a VM with GPU acceleration<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Need low-latency GPU inference with custom system libraries.<\/li>\n<li><strong>Why it fits:<\/strong> Full VM control plus GPU attach.<\/li>\n<li><strong>Scenario:<\/strong> Host an internal inference service on a GPU VM, controlled by firewall rules and IAM.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">6) Framework\/version pinning for regulated environments<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Production requires pinned versions and controlled updates.<\/li>\n<li><strong>Why it fits:<\/strong> You can select and pin a specific image version and then bake your own hardened custom image.<\/li>\n<li><strong>Scenario:<\/strong> Start from Deep Learning VM Images, apply patches and hardening, then create a custom image for production rollout.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">7) Multi-user \u201cjump box\u201d for ML tools (controlled)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Teams need shared access to tools and datasets.<\/li>\n<li><strong>Why it fits:<\/strong> Centralized VM with controlled access and OS Login.<\/li>\n<li><strong>Scenario:<\/strong> A secure VM in a private subnet hosts tools; access is granted via IAM groups and OS Login.<\/li>\n<\/ul>\n\n\n\n<h3 
class=\"wp-block-heading\">8) Migration from on-prem GPU servers to cloud<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> On-prem GPU servers are overloaded and hard to upgrade.<\/li>\n<li><strong>Why it fits:<\/strong> Similar VM-based operational model; easier lift-and-shift.<\/li>\n<li><strong>Scenario:<\/strong> Port training scripts to run on a VM, store datasets in Cloud Storage, adopt snapshot-based backups.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">9) Hybrid workflows: VM training + managed model registry\/serving<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Want custom training control but managed deployment.<\/li>\n<li><strong>Why it fits:<\/strong> Train on VMs; store artifacts in Cloud Storage; then deploy via managed services.<\/li>\n<li><strong>Scenario:<\/strong> Train on DL VM, export SavedModel, register\/deploy using Vertex AI (verify current best practices in Vertex AI docs).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">10) Education and workshops with consistent lab environments<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Training sessions break due to laptop dependency issues.<\/li>\n<li><strong>Why it fits:<\/strong> Everyone uses the same cloud image and tools.<\/li>\n<li><strong>Scenario:<\/strong> Instructor provisions per-student VMs with budgets\/quotas and teardown scripts.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">6. Core Features<\/h2>\n\n\n\n<blockquote>\n<p>Note: The exact set of included frameworks\/tools depends on the specific Deep Learning VM Images image you select. 
Always validate the current catalog and included components in official docs and by inspecting the image on a running VM.<\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\">1) Google-maintained public ML VM images<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Provides curated VM images designed for ML work.<\/li>\n<li><strong>Why it matters:<\/strong> Reduces setup time and risk of incompatible dependencies.<\/li>\n<li><strong>Practical benefit:<\/strong> Faster onboarding and fewer \u201cdependency hell\u201d incidents.<\/li>\n<li><strong>Caveat:<\/strong> You still own ongoing OS-level operations (patching, accounts, network exposure).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">2) CPU and GPU-oriented options (image-dependent)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Offers images suitable for CPU-only or GPU-enabled Compute Engine instances.<\/li>\n<li><strong>Why it matters:<\/strong> GPU stacks are complex; curated images can reduce driver\/runtime mismatches.<\/li>\n<li><strong>Practical benefit:<\/strong> Less time spent debugging CUDA\/cuDNN issues.<\/li>\n<li><strong>Caveat:<\/strong> GPU availability also depends on region\/zone quotas and supported accelerator types. 
Verify compatibility for your target GPU model and framework version.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">3) Works with standard Compute Engine primitives<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> You create VMs the same way you would for any Compute Engine workload.<\/li>\n<li><strong>Why it matters:<\/strong> Integrates with existing infra-as-code, networking, and IAM practices.<\/li>\n<li><strong>Practical benefit:<\/strong> Use instance templates, startup scripts, OS Login, and shielded VM settings.<\/li>\n<li><strong>Caveat:<\/strong> Misconfiguration risk is similar to any VM (open SSH to the internet, oversized disks, etc.).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">4) Integration with IAM via instance service accounts<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Lets the VM access Google Cloud APIs using a service account.<\/li>\n<li><strong>Why it matters:<\/strong> Avoids embedding long-lived keys on disk.<\/li>\n<li><strong>Practical benefit:<\/strong> Fine-grained access to Cloud Storage buckets, Artifact Registry, BigQuery, etc.<\/li>\n<li><strong>Caveat:<\/strong> Over-privileged service accounts are a common security mistake.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">5) Storage options for datasets and checkpoints<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Supports Persistent Disk, Hyperdisk (where available), local SSD, and Cloud Storage for object storage.<\/li>\n<li><strong>Why it matters:<\/strong> ML workloads are storage- and throughput-sensitive.<\/li>\n<li><strong>Practical benefit:<\/strong> Keep large datasets in Cloud Storage; use PD\/SSD for scratch and checkpoints.<\/li>\n<li><strong>Caveat:<\/strong> Data locality (zone\/region) affects performance and egress.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">6) Observability with Cloud Logging and Cloud Monitoring (agent\/config 
dependent)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> VMs can send logs\/metrics to Google Cloud\u2019s ops suite.<\/li>\n<li><strong>Why it matters:<\/strong> Training jobs fail\u2014visibility reduces time to resolution.<\/li>\n<li><strong>Practical benefit:<\/strong> Centralized logs, metrics, alerting.<\/li>\n<li><strong>Caveat:<\/strong> Some telemetry requires installing\/configuring agents or enabling features; verify current recommended setup.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">7) Automation hooks: startup scripts and image customization<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Automate dependency setup, dataset sync, job start, and shutdown.<\/li>\n<li><strong>Why it matters:<\/strong> Reproducibility and cost control.<\/li>\n<li><strong>Practical benefit:<\/strong> \u201cCreate VM \u2192 run job \u2192 upload results \u2192 delete VM\u201d pattern.<\/li>\n<li><strong>Caveat:<\/strong> Ensure scripts are idempotent and don\u2019t leak secrets into metadata.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">7. Architecture and How It Works<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">High-level architecture<\/h3>\n\n\n\n<p>At a high level:\n1. You select a <strong>Deep Learning VM Images image<\/strong>.\n2. You create a <strong>Compute Engine VM<\/strong> from that image in a chosen <strong>zone<\/strong>.\n3. You optionally attach <strong>GPUs<\/strong>, add <strong>data disks<\/strong>, and set up <strong>networking<\/strong>.\n4. Your workload reads datasets (often from <strong>Cloud Storage<\/strong>) and writes outputs (Cloud Storage, disks, or other services).\n5. 
Logs and metrics go to <strong>Cloud Logging\/Monitoring<\/strong> (depending on configuration).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Request\/data\/control flow<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Control plane<\/strong>: You (or automation) call Google Cloud APIs to create\/stop\/delete instances, attach disks, and set IAM.<\/li>\n<li><strong>Data plane<\/strong>: Your VM reads training data, writes checkpoints\/models, and optionally pulls\/pushes container images and packages.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Integrations with related services<\/h3>\n\n\n\n<p>Common integrations:\n&#8211; <strong>Cloud Storage<\/strong>: datasets, checkpoints, model artifacts.\n&#8211; <strong>Artifact Registry<\/strong>: store containers if you run containerized training\/inference on the VM.\n&#8211; <strong>Cloud Logging\/Monitoring<\/strong>: logs\/metrics\/alerts.\n&#8211; <strong>Secret Manager<\/strong>: store external API keys (if needed).\n&#8211; <strong>Cloud NAT<\/strong>: allow private VMs to reach the internet for package installs without public IPs.\n&#8211; <strong>IAM \/ OS Login<\/strong>: controlled SSH access.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Dependency services<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Compute Engine<\/strong> (required): Deep Learning VM Images are used to create Compute Engine instances.<\/li>\n<li><strong>VPC<\/strong> (required): networking, firewall rules.<\/li>\n<li><strong>Cloud Storage<\/strong> (optional but common): object storage.<\/li>\n<li><strong>Cloud Logging\/Monitoring<\/strong> (recommended): operational visibility.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Security\/authentication model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Access to create\/manage instances: IAM roles on the project.<\/li>\n<li>VM access to APIs: instance <strong>service account<\/strong> and OAuth scopes (use IAM permissions; scopes are still relevant for some legacy 
flows\u2014verify current Compute Engine recommendations).<\/li>\n<li>User login: <strong>OS Login<\/strong> (recommended) or SSH keys (legacy).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Networking model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>VMs attach to a VPC network and subnet in the selected region.<\/li>\n<li>You can expose a public IP (simple but riskier) or use private IP only + IAP\/Cloud VPN\/Interconnect for access.<\/li>\n<li>Firewall rules control ingress\/egress; open only the ports and source ranges you actually need.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Monitoring\/logging\/governance considerations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Enable Cloud Logging\/Monitoring for VMs and standardize labels (env, owner, cost-center).<\/li>\n<li>Use budgets\/alerts for GPU and storage spend.<\/li>\n<li>Track VM lifecycles to prevent \u201cforgotten GPU VM\u201d incidents.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Simple architecture diagram (Mermaid)<\/h3>\n\n\n\n<pre><code class=\"language-mermaid\">flowchart LR\n  U[Engineer \/ Data Scientist] --&gt;|gcloud \/ Console| CE[Compute Engine API]\n  CE --&gt; VM[VM from Deep Learning VM Images]\n\n  VM --&gt;|Read\/Write| GCS[Cloud Storage Bucket]\n  VM --&gt; LOG[Cloud Logging]\n  VM --&gt; MON[Cloud Monitoring]\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Production-style architecture diagram (Mermaid)<\/h3>\n\n\n\n<pre><code class=\"language-mermaid\">flowchart TB\n  subgraph Project[Google Cloud Project]\n    subgraph VPC[VPC Network]\n      subgraph Subnet[&quot;Private Subnet (Regional)&quot;]\n        VM[Compute Engine VM\\nBoot: Deep Learning VM Images\\nNo public IP]\n      end\n\n      NAT[&quot;Cloud NAT\\n(Egress for updates\/packages)&quot;]\n      FW[Firewall Policies \/ Rules]\n    end\n\n    GCS[(Cloud Storage\\nDatasets &amp; Artifacts)]\n    SM[Secret Manager]\n    OPS[Cloud Logging + Monitoring]\n    IAM[IAM \/ OS Login]\n  end\n\n  Admin[Admin\/CI\/CD] --&gt;|IAM-authenticated API calls| 
VM\n  VM --&gt;|Private egress| NAT --&gt; Internet[(Internet)]\n  VM --&gt;|HTTPS| GCS\n  VM --&gt;|&quot;Fetch secrets (optional)&quot;| SM\n  VM --&gt; OPS\n  IAM --&gt; VM\n  FW --- VM\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">8. Prerequisites<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Account\/project requirements<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A Google Cloud account and an active <strong>Google Cloud project<\/strong>.<\/li>\n<li><strong>Billing enabled<\/strong> on the project.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Permissions \/ IAM roles<\/h3>\n\n\n\n<p>Minimum roles vary by your org\u2019s policies, but typically you need:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>To create\/manage VMs: <strong>Compute Instance Admin<\/strong> (<code>roles\/compute.instanceAdmin.v1<\/code>) or a custom role with required permissions.<\/li>\n<li>To use networks: <strong>Compute Network User<\/strong> (<code>roles\/compute.networkUser<\/code>) on the target VPC\/subnet (common in shared VPC setups).<\/li>\n<li>To create service accounts (optional): <strong>Service Account Admin<\/strong> (<code>roles\/iam.serviceAccountAdmin<\/code>) or have one pre-created.<\/li>\n<li>To attach a service account to a VM: <strong>Service Account User<\/strong> (<code>roles\/iam.serviceAccountUser<\/code>) on that service account.<\/li>\n<li>To access Cloud Storage: scoped permissions like <strong>Storage Object Admin<\/strong> on a specific bucket, not broad project-wide access.<\/li>\n<\/ul>\n\n\n\n<p>Follow least privilege. 
If your organization uses a centralized platform team, ask for a pre-approved project and roles.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Billing requirements<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Compute Engine charges for VM runtime, attached GPUs, disks, and network egress.<\/li>\n<li>Cloud Storage charges for storage and some operations.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">CLI\/SDK\/tools needed<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Google Cloud CLI<\/strong> (<code>gcloud<\/code>): https:\/\/cloud.google.com\/sdk\/docs\/install<\/li>\n<li>Optional: an <code>ssh<\/code> client, plus basic Python and Linux command-line knowledge.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Region availability<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Compute Engine resources are regional or zonal, and GPU availability varies by region and zone.<\/li>\n<li>Deep Learning VM Images can typically be used across regions, but confirm current image availability and any constraints in official docs.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Quotas\/limits<\/h3>\n\n\n\n<p>Common quota constraints:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>GPUs<\/strong> per region<\/li>\n<li><strong>CPUs<\/strong> per region<\/li>\n<li><strong>Persistent Disk<\/strong> total GB<\/li>\n<li><strong>External IP addresses<\/strong><\/li>\n<\/ul>\n\n\n\n<p>Check quotas in the Google Cloud console: <strong>IAM &amp; Admin \u2192 Quotas<\/strong> (or \u201cQuotas\u201d in relevant service pages). GPU quotas are frequently the first blocker.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Prerequisite services<\/h3>\n\n\n\n<p>Enable APIs:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Compute Engine API<\/li>\n<li>Cloud Storage API (commonly used)<\/li>\n<\/ul>\n\n\n\n<p>You can enable them with <code>gcloud services enable<\/code> (shown in the lab).<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">9. 
Pricing \/ Cost<\/h2>\n\n\n\n<p>Deep Learning VM Images itself is typically not priced as a separate \u201cmanaged service.\u201d Costs come from the Google Cloud resources you run <strong>using<\/strong> these images\u2014primarily Compute Engine.<\/p>\n\n\n\n<p>Pricing references (official):\n&#8211; Compute Engine pricing: https:\/\/cloud.google.com\/compute\/pricing (and VM instance pricing pages)\n&#8211; GPU pricing: https:\/\/cloud.google.com\/compute\/gpus-pricing\n&#8211; Cloud Storage pricing: https:\/\/cloud.google.com\/storage\/pricing\n&#8211; Pricing calculator: https:\/\/cloud.google.com\/products\/calculator<\/p>\n\n\n\n<blockquote>\n<p>Pricing varies by region, machine type, GPU type, disk type, and sustained usage\/commitments. Use the calculator for your exact region and configuration.<\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\">Pricing dimensions<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>\n<p><strong>Compute Engine VM runtime<\/strong>\n   &#8211; Charged per second\/minute depending on VM type and billing model (verify current billing granularity in Compute Engine docs).\n   &#8211; Machine type (vCPU\/RAM) is a major driver.<\/p>\n<\/li>\n<li>\n<p><strong>GPU accelerators<\/strong>\n   &#8211; Charged per GPU attached, per time.\n   &#8211; Different GPU models have very different prices and availability.<\/p>\n<\/li>\n<li>\n<p><strong>Disk storage<\/strong>\n   &#8211; Boot disk (Persistent Disk) and any additional data disks.\n   &#8211; Disk type (balanced\/performance\/extreme\/hyperdisk depending on availability) affects cost and performance.<\/p>\n<\/li>\n<li>\n<p><strong>Network<\/strong>\n   &#8211; Ingress is typically free; egress to the internet or cross-region is often charged (verify current networking pricing).\n   &#8211; If you use Cloud NAT, there are charges for NAT usage and IPs.<\/p>\n<\/li>\n<li>\n<p><strong>Cloud Storage<\/strong>\n   &#8211; Storage GB-month\n   &#8211; Operations (PUT\/GET\/LIST) and 
egress depending on access patterns and location.<\/p>\n<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Free tier<\/h3>\n\n\n\n<p>Google Cloud offers a general free tier for some products, but <strong>GPU usage is not free<\/strong>, and many ML workloads will exceed free-tier limits quickly. Verify current free-tier offerings here:\nhttps:\/\/cloud.google.com\/free<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Cost drivers (what usually makes bills spike)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Leaving GPU VMs running idle overnight\/weekend.<\/li>\n<li>Large, fast disks provisioned but underutilized.<\/li>\n<li>Significant internet egress (downloading datasets repeatedly, or serving inference to internet clients).<\/li>\n<li>Training logs and artifacts accumulating in Cloud Storage indefinitely.<\/li>\n<li>Overprovisioned machine types \u201cjust in case.\u201d<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Hidden or indirect costs<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Snapshots\/backups<\/strong>: snapshot storage costs can accumulate.<\/li>\n<li><strong>Static external IPs<\/strong>: can be charged when reserved and unused (verify current policy).<\/li>\n<li><strong>Artifact\/container pulls<\/strong>: if pulling images across regions, egress can apply.<\/li>\n<li><strong>Support and compliance tooling<\/strong>: not a direct Deep Learning VM Images cost, but often required in production.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">How to optimize cost<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use <strong>Spot VMs<\/strong> (preemptible-style) for fault-tolerant training jobs when feasible (verify current Compute Engine Spot VM behavior).<\/li>\n<li>Use smaller machine types for dev; scale up only for training runs.<\/li>\n<li>Automate shutdown with:<\/li>\n<li>a fixed schedule, or<\/li>\n<li>a \u201cjob runner\u201d script that powers off the instance when training completes.<\/li>\n<li>Store datasets in the same 
region as compute to reduce egress and latency.<\/li>\n<li>Use lifecycle policies on Cloud Storage buckets to transition\/delete old artifacts.<\/li>\n<li>Use committed use discounts for always-on production inference (where applicable).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Example low-cost starter setup<\/h3>\n\n\n\n<p>A low-cost starter setup for learning:\n&#8211; 1 small CPU VM (no GPU)\n&#8211; A small standard Persistent Disk boot disk\n&#8211; A Cloud Storage bucket for a few artifacts<\/p>\n\n\n\n<p>Because pricing varies by region and machine type, get an accurate estimate with the calculator:\nhttps:\/\/cloud.google.com\/products\/calculator<br\/>\nSearch for <strong>Compute Engine<\/strong> and <strong>Cloud Storage<\/strong>, choose your region, and enter expected hours.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Example production cost considerations<\/h3>\n\n\n\n<p>For production training\/inference:\n&#8211; GPU(s) dominate costs; confirm GPU utilization with monitoring.\n&#8211; Consider separate environments (dev\/test\/prod) and enforce budgets\/quotas per environment.\n&#8211; Use centralized artifact storage and retention policies.\n&#8211; Consider private networking and Cloud NAT costs for locked-down environments.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">10. 
Step-by-Step Hands-On Tutorial<\/h2>\n\n\n\n<p>This lab provisions a Compute Engine VM from <strong>Deep Learning VM Images<\/strong>, runs a small training job (CPU-friendly), stores a model artifact in Cloud Storage, and then cleans up resources.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Objective<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Create a Compute Engine instance using <strong>Deep Learning VM Images<\/strong><\/li>\n<li>Verify the ML environment on the VM<\/li>\n<li>Run a tiny TensorFlow training job (or install dependencies if needed)<\/li>\n<li>Upload the trained model artifact to Cloud Storage<\/li>\n<li>Clean up safely to avoid unexpected cost<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Lab Overview<\/h3>\n\n\n\n<p>You will:\n1. Prepare a project, APIs, and variables.\n2. Create a Cloud Storage bucket for artifacts.\n3. Discover available Deep Learning VM Images and select one.\n4. Create a service account with least privilege for the bucket.\n5. Create a VM from the selected Deep Learning VM Images image.\n6. SSH into the VM, run a small training script, and upload results.\n7. Validate outputs.\n8. Troubleshoot common issues.\n9. Clean up all created resources.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 1: Set project, region\/zone, and enable APIs<\/h3>\n\n\n\n<p>Pick a region\/zone near you. For this tutorial we\u2019ll use a zone variable; choose one that supports the machine type you want.<\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud auth login\ngcloud config set project YOUR_PROJECT_ID\n\n# Choose a zone (example). 
Change as needed.\ngcloud config set compute\/zone us-central1-a\n\n# Enable required APIs\ngcloud services enable compute.googleapis.com storage.googleapis.com\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome:<\/strong> APIs are enabled, and <code>gcloud<\/code> points to your project and zone.<\/p>\n\n\n\n<p><strong>Verify:<\/strong><\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud services list --enabled --filter=\"name:compute.googleapis.com OR name:storage.googleapis.com\"\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 2: Create a Cloud Storage bucket for artifacts<\/h3>\n\n\n\n<p>Bucket names must be globally unique. Choose a region aligned to your compute region to reduce latency and potential egress.<\/p>\n\n\n\n<pre><code class=\"language-bash\">export BUCKET_NAME=\"dlvm-artifacts-$RANDOM-$RANDOM\"\nexport BUCKET_LOCATION=\"us-central1\"   # Adjust to your preferred region\n\ngcloud storage buckets create \"gs:\/\/$BUCKET_NAME\" --location=\"$BUCKET_LOCATION\"\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome:<\/strong> A new bucket is created.<\/p>\n\n\n\n<p><strong>Verify:<\/strong><\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud storage buckets describe \"gs:\/\/$BUCKET_NAME\"\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 3: Discover Deep Learning VM Images and select an image<\/h3>\n\n\n\n<p>Deep Learning VM Images are published as public images. 
The recommended way is to <strong>list the images<\/strong> and pick one that matches your framework and CPU\/GPU preference.<\/p>\n\n\n\n<p>Run:<\/p>\n\n\n\n<pre><code class=\"language-bash\"># List available images from the Google-managed image project used for Deep Learning VM Images.\n# This project name is commonly referenced in Google documentation; verify in official docs if it changes.\ngcloud compute images list \\\n  --project=deeplearning-platform-release \\\n  --no-standard-images \\\n  --format=\"table(name, family, status, diskSizeGb)\"\n<\/code><\/pre>\n\n\n\n<p>Now select an image:\n&#8211; For a low-cost lab, pick a <strong>CPU<\/strong> image if available.\n&#8211; For GPU work, pick a GPU-oriented image (you\u2019ll also need to attach a GPU and have quota).<\/p>\n\n\n\n<p>Set an environment variable with the exact image name you chose from the output:<\/p>\n\n\n\n<pre><code class=\"language-bash\">export DLVM_IMAGE_NAME=\"PASTE_AN_IMAGE_NAME_FROM_THE_LIST\"\nexport DLVM_IMAGE_PROJECT=\"deeplearning-platform-release\"\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome:<\/strong> You have a concrete image name to use when creating the VM.<\/p>\n\n\n\n<p><strong>Verify:<\/strong><\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud compute images describe \"$DLVM_IMAGE_NAME\" --project=\"$DLVM_IMAGE_PROJECT\"\n<\/code><\/pre>\n\n\n\n<p>If you cannot find images or the project name differs, <strong>verify in official docs<\/strong>:\nhttps:\/\/cloud.google.com\/deep-learning-vm\/docs<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 4: Create a least-privilege service account for the VM<\/h3>\n\n\n\n<p>This VM only needs to write artifacts to your bucket for this lab.<\/p>\n\n\n\n<pre><code class=\"language-bash\">export SA_NAME=\"dlvm-lab-sa\"\nexport SA_EMAIL=\"$SA_NAME@$(gcloud config get-value project).iam.gserviceaccount.com\"\n\ngcloud iam service-accounts create \"$SA_NAME\" \\\n  
--display-name=\"Deep Learning VM Images lab service account\"\n<\/code><\/pre>\n\n\n\n<p>Grant bucket-scoped permissions (recommended over project-wide roles):<\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud storage buckets add-iam-policy-binding \"gs:\/\/$BUCKET_NAME\" \\\n  --member=\"serviceAccount:$SA_EMAIL\" \\\n  --role=\"roles\/storage.objectAdmin\"\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome:<\/strong> Service account exists and can write objects to the lab bucket.<\/p>\n\n\n\n<p><strong>Verify:<\/strong><\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud iam service-accounts describe \"$SA_EMAIL\"\ngcloud storage buckets get-iam-policy \"gs:\/\/$BUCKET_NAME\" --format=\"json\" | head\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 5: Create a VM from Deep Learning VM Images<\/h3>\n\n\n\n<p>Use a small machine type to keep costs low. If your chosen image expects more CPU\/RAM, adjust.<\/p>\n\n\n\n<pre><code class=\"language-bash\">export VM_NAME=\"dlvm-lab-vm\"\n\ngcloud compute instances create \"$VM_NAME\" \\\n  --image=\"$DLVM_IMAGE_NAME\" \\\n  --image-project=\"$DLVM_IMAGE_PROJECT\" \\\n  --machine-type=\"e2-standard-2\" \\\n  --boot-disk-size=\"50GB\" \\\n  --service-account=\"$SA_EMAIL\" \\\n  --scopes=\"https:\/\/www.googleapis.com\/auth\/cloud-platform\" \\\n  --labels=\"purpose=dlvm-lab,env=dev\"\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome:<\/strong> A Compute Engine VM is created and running.<\/p>\n\n\n\n<p><strong>Verify:<\/strong><\/p>\n\n\n\n<pre><code class=\"language-bash\"># initializeParams is input-only and is not returned by describe, so inspect the boot disk's licenses instead.\ngcloud compute instances describe \"$VM_NAME\" --format=\"get(status,machineType,disks[0].licenses)\"\n<\/code><\/pre>\n\n\n\n<blockquote>\n<p>Note on scopes: modern best practice is to rely on IAM permissions and keep scopes appropriately set. Many tutorials still use <code>cloud-platform<\/code> for simplicity. 
In tightly controlled environments, use narrower scopes and least-privileged IAM. Verify your organization\u2019s policy.<\/p>\n<\/blockquote>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 6: SSH into the VM and verify the environment<\/h3>\n\n\n\n<pre><code class=\"language-bash\">gcloud compute ssh \"$VM_NAME\"\n<\/code><\/pre>\n\n\n\n<p>On the VM, run:<\/p>\n\n\n\n<pre><code class=\"language-bash\">python3 --version || true\npython --version || true\n\n# Check disk space\ndf -h\n\n# Confirm you can access metadata identity (should succeed if service account is attached)\ncurl -s -H \"Metadata-Flavor: Google\" \\\n  \"http:\/\/metadata.google.internal\/computeMetadata\/v1\/instance\/service-accounts\/default\/email\"\necho\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome:<\/strong> You can SSH in, see Python, and see the service account email.<\/p>\n\n\n\n<p>Now, check if TensorFlow is already available:<\/p>\n\n\n\n<pre><code class=\"language-bash\">python3 -c \"import tensorflow as tf; print('TensorFlow:', tf.__version__)\"\n<\/code><\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li>If this prints a version: proceed to Step 7.<\/li>\n<li>If it fails with <code>ModuleNotFoundError: No module named 'tensorflow'<\/code>, you have two options:\n  1. Choose a different Deep Learning VM Images image that includes TensorFlow (repeat Step 3 and Step 5), or\n  2. 
Install TensorFlow into a virtual environment (shown next).<\/li>\n<\/ul>\n\n\n\n<p>To install TensorFlow (CPU) safely in a venv:<\/p>\n\n\n\n<pre><code class=\"language-bash\">python3 -m venv ~\/venv\nsource ~\/venv\/bin\/activate\npip install --upgrade pip\npip install tensorflow\npython -c \"import tensorflow as tf; print('TensorFlow:', tf.__version__)\"\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome:<\/strong> TensorFlow import works.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 7: Run a small training job and save a model artifact<\/h3>\n\n\n\n<p>Create a simple TensorFlow script:<\/p>\n\n\n\n<pre><code class=\"language-bash\">cat &gt; ~\/train_mnist.py &lt;&lt;'PY'\nimport os\nimport tensorflow as tf\n\nprint(\"TensorFlow version:\", tf.__version__)\n\n# Load MNIST (downloads data the first time)\n(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()\n\n# Normalize\nx_train = x_train \/ 255.0\nx_test = x_test \/ 255.0\n\nmodel = tf.keras.Sequential([\n    tf.keras.layers.Flatten(input_shape=(28, 28)),\n    tf.keras.layers.Dense(128, activation=\"relu\"),\n    tf.keras.layers.Dense(10)\n])\n\nmodel.compile(\n    optimizer=\"adam\",\n    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),\n    metrics=[\"accuracy\"],\n)\n\nhistory = model.fit(x_train, y_train, epochs=1, validation_split=0.1, batch_size=128)\ntest_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)\n\nprint(\"Test accuracy:\", test_acc)\n\nout_dir = os.path.expanduser(\"~\/model_artifact\")\nos.makedirs(out_dir, exist_ok=True)\n\n# Save in SavedModel format. Keras 3 (TensorFlow 2.16+) rejects extensionless\n# paths in model.save(), so use model.export() when it is available.\nsave_path = os.path.join(out_dir, \"savedmodel\")\nif hasattr(model, \"export\"):\n    model.export(save_path)\nelse:\n    model.save(save_path)\n\n# Write a small text summary\nwith open(os.path.join(out_dir, \"metrics.txt\"), \"w\") as f:\n    f.write(f\"test_accuracy={test_acc}\\n\")\n\nprint(\"Saved model to:\", save_path)\nprint(\"Wrote metrics to:\", os.path.join(out_dir, \"metrics.txt\"))\nPY\n\npython3 
~\/train_mnist.py\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome:<\/strong> Training runs for 1 epoch and outputs test accuracy. A directory <code>~\/model_artifact\/<\/code> is created with <code>savedmodel\/<\/code> and <code>metrics.txt<\/code>.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 8: Upload artifacts to Cloud Storage<\/h3>\n\n\n\n<p>Still on the VM. The variables you exported in your local terminal are not carried into the SSH session, so set <code>BUCKET_NAME<\/code> and <code>VM_NAME<\/code> again before listing the bucket:<\/p>\n\n\n\n<pre><code class=\"language-bash\">export BUCKET_NAME=\"PASTE_YOUR_BUCKET_NAME\"   # the bucket created in Step 2\nexport VM_NAME=\"dlvm-lab-vm\"                 # the instance name from Step 5\n# gsutil is commonly available on Google-provided images; if not, install Google Cloud CLI or use gcloud storage.\ngsutil ls \"gs:\/\/$BUCKET_NAME\" || true\n<\/code><\/pre>\n\n\n\n<p>If <code>gsutil<\/code> is present, upload:<\/p>\n\n\n\n<pre><code class=\"language-bash\">gsutil -m cp -r ~\/model_artifact \"gs:\/\/$BUCKET_NAME\/$VM_NAME\/\"\n<\/code><\/pre>\n\n\n\n<p>If <code>gsutil<\/code> is not installed, use <code>gcloud storage<\/code> (recommended newer interface):<\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud storage cp -r ~\/model_artifact \"gs:\/\/$BUCKET_NAME\/$VM_NAME\/\"\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome:<\/strong> Your model and metrics file are in the bucket path <code>gs:\/\/BUCKET\/VM_NAME\/model_artifact\/...<\/code>.<\/p>\n\n\n\n<p>Exit the VM:<\/p>\n\n\n\n<pre><code class=\"language-bash\">exit\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Validation<\/h3>\n\n\n\n<p>From your local terminal:<\/p>\n\n\n\n<p>1) Confirm the VM exists and is running:<\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud compute instances list --filter=\"name=$VM_NAME\"\n<\/code><\/pre>\n\n\n\n<p>2) Confirm artifacts in Cloud Storage:<\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud storage ls \"gs:\/\/$BUCKET_NAME\/$VM_NAME\/model_artifact\/\"\ngcloud storage ls \"gs:\/\/$BUCKET_NAME\/$VM_NAME\/model_artifact\/savedmodel\/\"\ngcloud storage cat 
\"gs:\/\/$BUCKET_NAME\/$VM_NAME\/model_artifact\/metrics.txt\"\n<\/code><\/pre>\n\n\n\n<p>You should see a <code>test_accuracy=...<\/code> line.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Troubleshooting<\/h3>\n\n\n\n<p>Common issues and fixes:<\/p>\n\n\n\n<p>1) <strong><code>PERMISSION_DENIED<\/code> uploading to the bucket<\/strong>\n&#8211; Cause: Service account lacks bucket permissions, or VM is using a different identity than expected.\n&#8211; Fix:\n  &#8211; Confirm VM\u2019s service account:\n    <code>gcloud compute instances describe \"$VM_NAME\" --format=\"get(serviceAccounts[0].email)\"<\/code>\n  &#8211; Confirm bucket IAM binding includes that email.\n  &#8211; Re-add the IAM policy binding (Step 4).<\/p>\n\n\n\n<p>2) <strong>No Deep Learning VM Images appear in <code>gcloud compute images list<\/code><\/strong>\n&#8211; Cause: The image project name could change, or org policy restricts public images.\n&#8211; Fix:\n  &#8211; Verify the current instructions in official docs: https:\/\/cloud.google.com\/deep-learning-vm\/docs\n  &#8211; If org policy blocks public images, request an exception or mirror the image to a private project (platform team pattern).<\/p>\n\n\n\n<p>3) <strong>Quota errors (CPU\/GPU\/external IP)<\/strong>\n&#8211; Cause: Project quota limits.\n&#8211; Fix: Reduce machine size, use a different region\/zone, or request quota increase.<\/p>\n\n\n\n<p>4) <strong>TensorFlow import fails<\/strong>\n&#8211; Cause: Chosen image doesn\u2019t include TensorFlow, or you selected a different framework image.\n&#8211; Fix: Install TensorFlow in a venv (Step 6) or pick a TensorFlow-focused image (Step 3).<\/p>\n\n\n\n<p>5) <strong>Unexpected cost risk<\/strong>\n&#8211; Fix: Set a reminder to delete the VM, or add a shutdown script. 
For production, enforce org policies and budgets.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Cleanup<\/h3>\n\n\n\n<p>To avoid charges, delete the VM and bucket.<\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud compute instances delete \"$VM_NAME\" --quiet\n<\/code><\/pre>\n\n\n\n<p>Delete the bucket (this deletes all objects inside):<\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud storage rm -r \"gs:\/\/$BUCKET_NAME\"\n<\/code><\/pre>\n\n\n\n<p>Optionally delete the service account:<\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud iam service-accounts delete \"$SA_EMAIL\" --quiet\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome:<\/strong> The lab VM and bucket are deleted, along with the service account if you chose to remove it.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">11. Best Practices<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Architecture best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Separate dev\/test\/prod projects<\/strong> (or at least separate networks and IAM boundaries).<\/li>\n<li>Keep datasets in <strong>Cloud Storage<\/strong> and mount\/copy only what\u2019s needed to the VM.<\/li>\n<li>Use <strong>instance templates<\/strong> for reproducibility; avoid hand-built snowflake VMs.<\/li>\n<li>Consider building a <strong>custom image<\/strong> derived from Deep Learning VM Images for production (patches, agents, hardening, pinned dependencies).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">IAM\/security best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use <strong>OS Login<\/strong> and IAM groups for SSH access.<\/li>\n<li>Use <strong>least-privilege service accounts<\/strong> with bucket-level permissions instead of broad project roles.<\/li>\n<li>Avoid long-lived service account keys on disk; prefer the VM\u2019s attached service account, whose short-lived credentials are served by the metadata server and governed by IAM.<\/li>\n<li>Limit who can attach external IPs and who 
can create GPU VMs (these are both risk and cost controls).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Cost best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use labels: <code>env<\/code>, <code>owner<\/code>, <code>cost-center<\/code>, <code>workload<\/code>, <code>expiration<\/code>.<\/li>\n<li>Automate shutdown for dev VMs and require justification for always-on GPU instances.<\/li>\n<li>Use <strong>budgets and alerts<\/strong> at project level.<\/li>\n<li>Use Spot VMs for retryable training to reduce cost (verify suitability and interruption handling).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Performance best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Place compute and storage in the <strong>same region<\/strong>.<\/li>\n<li>Choose disk types appropriate for IO patterns (sequential reads vs random reads, checkpoint writes, etc.).<\/li>\n<li>For GPU workloads, monitor utilization; if GPU utilization is low, the job is likely bound by the CPU or the data pipeline.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Reliability best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Store checkpoints and outputs in Cloud Storage to survive VM termination.<\/li>\n<li>Use startup scripts that are <strong>idempotent<\/strong> so you can recreate instances.<\/li>\n<li>For distributed training, validate network throughput and plan for failure\/restart semantics.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Operations best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Standardize logging locations (local + Cloud Logging).<\/li>\n<li>Capture metadata about runs (git commit, dataset version, hyperparameters) and store with artifacts.<\/li>\n<li>Use a consistent directory structure for outputs and retention.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Governance\/tagging\/naming best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Naming convention example: 
<code>dlvm-&lt;team&gt;-&lt;env&gt;-&lt;purpose&gt;-&lt;id&gt;<\/code><\/li>\n<li>Mandatory labels: <code>owner<\/code>, <code>env<\/code>, <code>data-classification<\/code>, <code>cost-center<\/code>, <code>expiry-date<\/code><\/li>\n<li>Restrict public IP usage via org policy where possible.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">12. Security Considerations<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Identity and access model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Users<\/strong>: grant access via IAM + OS Login; avoid unmanaged SSH keys.<\/li>\n<li><strong>Workloads<\/strong>: assign a dedicated service account per workload class (training vs inference) with least privilege.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Encryption<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data at rest is encrypted by default in Google Cloud storage systems.<\/li>\n<li>For stricter requirements, consider <strong>Customer-Managed Encryption Keys (CMEK)<\/strong> for disks and buckets (verify current CMEK support for Compute Engine disks and Cloud Storage).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Network exposure<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Avoid exposing SSH or notebook ports to the internet.<\/li>\n<li>Prefer:<\/li>\n<li>private instances (no public IP)<\/li>\n<li>IAP TCP forwarding \/ bastion host<\/li>\n<li>VPN\/Interconnect for enterprise access<\/li>\n<li>Use firewall rules narrowly scoped by source ranges and tags\/service accounts.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Secrets handling<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Do not store secrets in:<\/li>\n<li>instance metadata startup scripts<\/li>\n<li>Git repos on the VM<\/li>\n<li>plain text in home directories<\/li>\n<li>Use <strong>Secret Manager<\/strong> and retrieve secrets at runtime with IAM-controlled access.<\/li>\n<\/ul>\n\n\n\n<h3 
class=\"wp-block-heading\">Audit\/logging<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use <strong>Cloud Audit Logs<\/strong> for admin actions (VM creation, IAM changes).<\/li>\n<li>Ensure OS-level logs are retained if needed; route key application logs to Cloud Logging.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Compliance considerations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data residency: keep data and compute in the correct region.<\/li>\n<li>Access controls: implement least privilege and strong identity controls (MFA, group-based access).<\/li>\n<li>Artifact governance: define retention and deletion policies for datasets, checkpoints, and logs.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Common security mistakes<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Leaving a GPU VM with a public IP open to <code>0.0.0.0\/0<\/code> on SSH.<\/li>\n<li>Reusing the default Compute Engine service account with Editor-like permissions.<\/li>\n<li>Downloading datasets to local disk without lifecycle controls.<\/li>\n<li>Installing arbitrary packages as root without tracking changes.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Secure deployment recommendations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Create private VMs and use Cloud NAT for outbound.<\/li>\n<li>Enforce OS Login + 2FA.<\/li>\n<li>Use a hardened baseline and patch cadence; consider building a custom image.<\/li>\n<li>Use organization policy constraints (where available) to restrict risky configurations.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">13. 
Limitations and Gotchas<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>It\u2019s still a VM<\/strong>: You manage lifecycle, patching, users, and disk growth.<\/li>\n<li><strong>GPU quotas and availability<\/strong>: Many teams are blocked by GPU quotas or zone capacity.<\/li>\n<li><strong>Framework\/driver compatibility<\/strong>: Even with curated images, verify your exact framework version, CUDA requirements, and GPU model support.<\/li>\n<li><strong>Public image governance<\/strong>: Some organizations block public images; you may need to mirror images into a private project.<\/li>\n<li><strong>Notebook exposure risk<\/strong>: If you run Jupyter, do not bind it to all interfaces with weak auth on a public IP.<\/li>\n<li><strong>Storage performance mismatches<\/strong>: Training performance can bottleneck on disk or data pipeline rather than GPU.<\/li>\n<li><strong>Cost surprise: idle GPU<\/strong>: The most common bill shock is \u201cGPU VM left running.\u201d<\/li>\n<li><strong>Cross-region data egress<\/strong>: Moving large datasets across regions can be expensive and slow.<\/li>\n<li><strong>Reproducibility<\/strong>: If you always use \u201clatest\u201d images, updates can change environments. Pin specific image versions for production.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">14. 
Comparison with Alternatives<\/h2>\n\n\n\n<p>Deep Learning VM Images is one option in a broader ML platform landscape.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Key alternatives<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Vertex AI Workbench<\/strong> (managed notebooks; verify current product scope): better for managed notebook lifecycle and governance.<\/li>\n<li><strong>Vertex AI Training \/ Custom Jobs<\/strong>: managed training execution; less VM ops burden.<\/li>\n<li><strong>Deep Learning Containers<\/strong>: container images for ML, often used with GKE\/Vertex AI; better for container-first workflows.<\/li>\n<li><strong>GKE (Kubernetes)<\/strong>: great for standardized container orchestration; more platform engineering overhead.<\/li>\n<li><strong>Other clouds\u2019 equivalents<\/strong>: AWS Deep Learning AMIs, Azure Data Science VM (compare carefully on governance and pricing).<\/li>\n<li><strong>Self-managed images<\/strong>: rolling your own base OS + install scripts; maximum control but highest setup\/maintenance cost.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Comparison table<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Option<\/th>\n<th>Best For<\/th>\n<th>Strengths<\/th>\n<th>Weaknesses<\/th>\n<th>When to Choose<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>Deep Learning VM Images (Google Cloud)<\/strong><\/td>\n<td>VM-based ML dev\/training with quick start<\/td>\n<td>Curated ML-ready VM images; Compute Engine flexibility; good for custom deps<\/td>\n<td>You manage VM ops; risk of idle cost; version pinning needed<\/td>\n<td>You want fast setup and full VM control<\/td>\n<\/tr>\n<tr>\n<td><strong>Vertex AI Workbench<\/strong><\/td>\n<td>Managed notebooks and team governance<\/td>\n<td>Managed user experience; integrates with Vertex AI<\/td>\n<td>Less low-level control than raw VMs; may impose patterns<\/td>\n<td>You want managed notebook lifecycle and 
governance<\/td>\n<\/tr>\n<tr>\n<td><strong>Vertex AI Training (Custom Jobs)<\/strong><\/td>\n<td>Managed training runs<\/td>\n<td>Less infrastructure management; better job tracking<\/td>\n<td>Less OS-level control; needs job packaging<\/td>\n<td>You want managed execution and repeatable training jobs<\/td>\n<\/tr>\n<tr>\n<td><strong>Deep Learning Containers<\/strong><\/td>\n<td>Container-first ML runtimes<\/td>\n<td>Reproducible containers; works across services<\/td>\n<td>Requires container workflow; not a VM image<\/td>\n<td>You standardize on containers across environments<\/td>\n<\/tr>\n<tr>\n<td><strong>GKE + ML containers<\/strong><\/td>\n<td>Platform teams running many ML services\/jobs<\/td>\n<td>Standard orchestration; scaling; multi-tenant patterns<\/td>\n<td>Higher operational overhead; cluster management<\/td>\n<td>You need Kubernetes-based standardization<\/td>\n<\/tr>\n<tr>\n<td><strong>AWS Deep Learning AMIs<\/strong><\/td>\n<td>Similar VM-first approach on AWS<\/td>\n<td>Familiar to AWS users<\/td>\n<td>Different IAM\/networking\/pricing models<\/td>\n<td>You are standardized on AWS<\/td>\n<\/tr>\n<tr>\n<td><strong>Azure Data Science VM<\/strong><\/td>\n<td>Similar VM-first approach on Azure<\/td>\n<td>Azure ecosystem integration<\/td>\n<td>Different governance and service boundaries<\/td>\n<td>You are standardized on Azure<\/td>\n<\/tr>\n<tr>\n<td><strong>Self-managed custom images<\/strong><\/td>\n<td>Maximum customization<\/td>\n<td>Full control; internal compliance hardening<\/td>\n<td>Highest maintenance burden<\/td>\n<td>Strict compliance or highly custom stacks<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">15. 
Real-World Example<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Enterprise example: Regulated analytics team migrating GPU training to Google Cloud<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> An enterprise analytics team needs GPU training for computer vision but must meet strict security controls (private networking, audited access, restricted egress).<\/li>\n<li><strong>Proposed architecture:<\/strong><\/li>\n<li>Compute Engine VMs created from <strong>Deep Learning VM Images<\/strong> in a private subnet (no public IP)<\/li>\n<li>Cloud NAT for controlled outbound updates<\/li>\n<li>Cloud Storage bucket in-region for datasets and artifacts with bucket-level IAM and retention policies<\/li>\n<li>OS Login for access; Cloud Logging\/Monitoring for audit and operations<\/li>\n<li>Optional: custom hardened image derived from the base Deep Learning VM Images image for production consistency<\/li>\n<li><strong>Why this service was chosen:<\/strong><\/li>\n<li>VM-first model matches enterprise operational controls and change management.<\/li>\n<li>Faster setup than building GPU images from scratch.<\/li>\n<li>Flexibility for custom dependencies and internal security agents.<\/li>\n<li><strong>Expected outcomes:<\/strong><\/li>\n<li>Reduced time to provision compliant GPU environments<\/li>\n<li>Standardized training platform with repeatable builds<\/li>\n<li>Better auditability and reduced environment drift<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Startup\/small-team example: Fast experimentation without a platform team<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> A startup needs to iterate quickly on an NLP model without investing in Kubernetes or a managed training pipeline yet.<\/li>\n<li><strong>Proposed architecture:<\/strong><\/li>\n<li>Single VM from Deep Learning VM Images<\/li>\n<li>Cloud Storage for datasets and checkpoints<\/li>\n<li>Simple scripts for \u201cstart training \u2192 upload 
\u2192 shutdown\u201d<\/li>\n<li><strong>Why this service was chosen:<\/strong><\/li>\n<li>Minimal platform overhead; fast to start.<\/li>\n<li>Pay-as-you-go with the flexibility to scale up to GPU when needed.<\/li>\n<li><strong>Expected outcomes:<\/strong><\/li>\n<li>Faster iteration cycles<\/li>\n<li>Clear path to production hardening later (custom images, private networking, or migration to managed training)<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">16. FAQ<\/h2>\n\n\n\n<p>1) <strong>Is Deep Learning VM Images a managed ML service?<\/strong><br\/>\nNo. It provides curated VM images. You still manage the Compute Engine instance lifecycle, OS configuration, patching strategy, and access controls.<\/p>\n\n\n\n<p>2) <strong>Do Deep Learning VM Images include GPUs?<\/strong><br\/>\nThe images do not \u201cinclude\u201d GPUs; GPUs are attached to a VM as accelerators and billed separately. Some images are designed to work well with GPUs. Verify the image\u2019s intended use and current documentation.<\/p>\n\n\n\n<p>3) <strong>How do I find the correct Deep Learning VM Images image name?<\/strong><br\/>\nUse <code>gcloud compute images list --project=deeplearning-platform-release --no-standard-images<\/code> and choose an image that matches your needs. Verify the current image project and naming in official docs.<\/p>\n\n\n\n<p>4) <strong>Can I use these images with private VMs (no public IP)?<\/strong><br\/>\nYes. Use private IPs and Cloud NAT for outbound access if needed, plus IAP\/VPN for admin access.<\/p>\n\n\n\n<p>5) <strong>What\u2019s the safest way to give the VM access to Cloud Storage?<\/strong><br\/>\nAttach a dedicated service account to the VM and grant it bucket-level permissions (least privilege). Avoid storing service account keys on disk.<\/p>\n\n\n\n<p>6) <strong>Do I need to enable any APIs?<\/strong><br\/>\nAt minimum, Compute Engine API. 
Commonly Cloud Storage API as well for artifacts\/datasets.<\/p>\n\n\n\n<p>7) <strong>What\u2019s the best practice for reproducibility\u2014use \u201clatest\u201d images or pin versions?<\/strong><br\/>\nFor production, pin to a specific image version and control updates. Using \u201clatest\u201d is convenient for experimentation but can introduce changes unexpectedly.<\/p>\n\n\n\n<p>8) <strong>Can I create my own custom image from a Deep Learning VM Images instance?<\/strong><br\/>\nYes. A common production pattern is to start from the curated base, apply hardening and pinned dependencies, then create a custom image for consistent rollout.<\/p>\n\n\n\n<p>9) <strong>How do I avoid surprise costs?<\/strong><br\/>\nAutomate shutdown, use labels and budgets, and be especially careful with GPU VMs. Consider Spot VMs for interruptible workloads.<\/p>\n\n\n\n<p>10) <strong>Is it better to use Vertex AI instead?<\/strong><br\/>\nVertex AI is often better when you want managed training, managed pipelines, managed endpoints, and less VM operations. Deep Learning VM Images is better when you need full VM control.<\/p>\n\n\n\n<p>11) <strong>Can I run containers on a Deep Learning VM Images VM?<\/strong><br\/>\nYes, you can run Docker containers on a VM if Docker is installed (many ML images include developer tooling, but verify). Alternatively use Deep Learning Containers directly with a container platform.<\/p>\n\n\n\n<p>12) <strong>How do I securely run Jupyter on the VM?<\/strong><br\/>\nAvoid exposing it publicly. Use SSH tunneling or IAP TCP forwarding, bind to localhost, and enforce strong auth. 
Verify current best practices for notebooks in Google Cloud docs.<\/p>\n\n\n\n<p>13) <strong>What if my organization blocks public images?<\/strong><br\/>\nYou may need a platform-team process to import\/mirror approved images into a private project or build an internal base image pipeline.<\/p>\n\n\n\n<p>14) <strong>How do I choose a machine type and disk?<\/strong><br\/>\nStart small for dev, then benchmark. Training often needs sufficient RAM and fast disk for data pipelines. Use Monitoring to see bottlenecks.<\/p>\n\n\n\n<p>15) <strong>Do these images guarantee performance improvements?<\/strong><br\/>\nThey mainly reduce setup friction and improve consistency. Performance still depends on machine type, GPU, disk throughput, data pipeline, and model architecture.<\/p>\n\n\n\n<p>16) <strong>Can I use TPUs with Deep Learning VM Images?<\/strong><br\/>\nTPUs are provided through separate Google Cloud TPU\/Vertex AI mechanisms. If you need TPUs, verify the recommended approach in current Cloud TPU and Vertex AI documentation.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">17. 
Top Online Resources to Learn Deep Learning VM Images<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Resource Type<\/th>\n<th>Name<\/th>\n<th>Why It Is Useful<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Official documentation<\/td>\n<td>Deep Learning VM documentation: https:\/\/cloud.google.com\/deep-learning-vm<\/td>\n<td>Primary reference for images, creation steps, and supported configurations<\/td>\n<\/tr>\n<tr>\n<td>Official docs (Compute Engine)<\/td>\n<td>Compute Engine documentation: https:\/\/cloud.google.com\/compute\/docs<\/td>\n<td>Core VM, disk, networking, IAM, and ops fundamentals used by DL VM images<\/td>\n<\/tr>\n<tr>\n<td>Official pricing<\/td>\n<td>Compute Engine pricing: https:\/\/cloud.google.com\/compute\/pricing<\/td>\n<td>Understand VM, disk, and related compute charges<\/td>\n<\/tr>\n<tr>\n<td>Official pricing<\/td>\n<td>GPU pricing: https:\/\/cloud.google.com\/compute\/gpus-pricing<\/td>\n<td>GPU SKUs, regions, and cost drivers<\/td>\n<\/tr>\n<tr>\n<td>Official pricing<\/td>\n<td>Cloud Storage pricing: https:\/\/cloud.google.com\/storage\/pricing<\/td>\n<td>Storage cost model for datasets and artifacts<\/td>\n<\/tr>\n<tr>\n<td>Official tool<\/td>\n<td>Pricing Calculator: https:\/\/cloud.google.com\/products\/calculator<\/td>\n<td>Build region-accurate estimates without guessing numbers<\/td>\n<\/tr>\n<tr>\n<td>Official getting started<\/td>\n<td>Deep Learning VM getting started (see docs navigation): https:\/\/cloud.google.com\/deep-learning-vm\/docs<\/td>\n<td>Step-by-step instructions and current best practices (verify latest)<\/td>\n<\/tr>\n<tr>\n<td>Official security<\/td>\n<td>IAM documentation: https:\/\/cloud.google.com\/iam\/docs<\/td>\n<td>Least privilege and service account design patterns<\/td>\n<\/tr>\n<tr>\n<td>Official ops<\/td>\n<td>Cloud Logging: https:\/\/cloud.google.com\/logging\/docs<\/td>\n<td>Centralize training and system logs<\/td>\n<\/tr>\n<tr>\n<td>Official ops<\/td>\n<td>Cloud 
Monitoring: https:\/\/cloud.google.com\/monitoring\/docs<\/td>\n<td>GPU\/CPU\/disk utilization dashboards and alerting<\/td>\n<\/tr>\n<tr>\n<td>Official learning<\/td>\n<td>Google Cloud Skills Boost: https:\/\/www.cloudskillsboost.google\/<\/td>\n<td>Hands-on labs (search for Compute Engine, ML, and Deep Learning VM topics)<\/td>\n<\/tr>\n<tr>\n<td>Official YouTube<\/td>\n<td>Google Cloud Tech YouTube: https:\/\/www.youtube.com\/@googlecloudtech<\/td>\n<td>Architecture, best practices, and demos (search relevant topics)<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">18. Training and Certification Providers<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Institute<\/th>\n<th>Suitable Audience<\/th>\n<th>Likely Learning Focus<\/th>\n<th>Mode<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>DevOpsSchool.com<\/td>\n<td>DevOps engineers, SREs, platform teams, cloud engineers<\/td>\n<td>DevOps\/cloud fundamentals, automation, operational practices around cloud workloads<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.devopsschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>ScmGalaxy.com<\/td>\n<td>Beginners to intermediate engineers<\/td>\n<td>DevOps, CI\/CD, SCM, and foundational cloud\/ops practices<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.scmgalaxy.com\/<\/td>\n<\/tr>\n<tr>\n<td>CloudOpsNow.in<\/td>\n<td>Cloud ops and operations-focused teams<\/td>\n<td>Cloud operations practices, monitoring, governance, cost controls<\/td>\n<td>Check website<\/td>\n<td>https:\/\/cloudopsnow.in\/<\/td>\n<\/tr>\n<tr>\n<td>SreSchool.com<\/td>\n<td>SREs, reliability engineers, platform teams<\/td>\n<td>Reliability engineering, monitoring, incident response, operational maturity<\/td>\n<td>Check website<\/td>\n<td>https:\/\/sreschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>AiOpsSchool.com<\/td>\n<td>Ops teams adopting AIOps<\/td>\n<td>Observability, automation, operations 
analytics, AIOps concepts<\/td>\n<td>Check website<\/td>\n<td>https:\/\/aiopsschool.com\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">19. Top Trainers<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Platform\/Site<\/th>\n<th>Likely Specialization<\/th>\n<th>Suitable Audience<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>RajeshKumar.xyz<\/td>\n<td>Cloud\/DevOps training and guidance (verify current offerings on site)<\/td>\n<td>Beginners to professionals seeking practical coaching<\/td>\n<td>https:\/\/rajeshkumar.xyz\/<\/td>\n<\/tr>\n<tr>\n<td>devopstrainer.in<\/td>\n<td>DevOps training resources (verify current offerings on site)<\/td>\n<td>DevOps engineers, sysadmins moving to cloud<\/td>\n<td>https:\/\/devopstrainer.in\/<\/td>\n<\/tr>\n<tr>\n<td>devopsfreelancer.com<\/td>\n<td>Freelance DevOps support\/training platform (verify current offerings on site)<\/td>\n<td>Teams needing short-term help or mentoring<\/td>\n<td>https:\/\/devopsfreelancer.com\/<\/td>\n<\/tr>\n<tr>\n<td>devopssupport.in<\/td>\n<td>DevOps support and enablement (verify current offerings on site)<\/td>\n<td>Ops\/DevOps teams needing troubleshooting and guidance<\/td>\n<td>https:\/\/devopssupport.in\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">20. 
Top Consulting Companies<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Company<\/th>\n<th>Likely Service Area<\/th>\n<th>Where They May Help<\/th>\n<th>Consulting Use Case Examples<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>cotocus.com<\/td>\n<td>Cloud and DevOps consulting (verify offerings on site)<\/td>\n<td>Cloud architecture, CI\/CD, infrastructure automation, operations enablement<\/td>\n<td>Standardizing VM provisioning, IAM guardrails, cost controls for ML VMs<\/td>\n<td>https:\/\/cotocus.com\/<\/td>\n<\/tr>\n<tr>\n<td>DevOpsSchool.com<\/td>\n<td>Training + consulting (verify offerings on site)<\/td>\n<td>DevOps transformation, platform enablement, automation practices<\/td>\n<td>Building repeatable infra-as-code patterns for Compute Engine ML workloads<\/td>\n<td>https:\/\/www.devopsschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>DEVOPSCONSULTING.IN<\/td>\n<td>DevOps consulting (verify offerings on site)<\/td>\n<td>DevOps processes, automation, reliability practices<\/td>\n<td>Implementing monitoring\/alerting and governance for VM-based ML environments<\/td>\n<td>https:\/\/devopsconsulting.in\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">21. 
Career and Learning Roadmap<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">What to learn before this service<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Google Cloud fundamentals: projects, billing, IAM<\/li>\n<li>Compute Engine basics: instances, images, disks, networks, firewall rules<\/li>\n<li>Linux basics: SSH, system services, package managers, permissions<\/li>\n<li>Python fundamentals: venv\/conda, pip, running scripts<\/li>\n<li>Storage fundamentals: Cloud Storage buckets and IAM<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">What to learn after this service<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>GPU operations: quotas, utilization monitoring, performance tuning<\/li>\n<li>Infrastructure as Code: Terraform for repeatable VM provisioning<\/li>\n<li>Security hardening: OS Login, least privilege IAM, private networking, Cloud NAT<\/li>\n<li>ML platform scaling:\n<ul class=\"wp-block-list\">\n<li>Vertex AI Training for managed jobs (verify)<\/li>\n<li>Vertex AI Workbench for managed notebooks (verify)<\/li>\n<li>Containerization and Deep Learning Containers<\/li>\n<li>GKE if you need orchestration at scale<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Job roles that use it<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Cloud Engineer \/ Infrastructure Engineer supporting ML teams<\/li>\n<li>ML Engineer operating training\/inference systems<\/li>\n<li>DevOps \/ SRE enabling GPU capacity, monitoring, and cost controls<\/li>\n<li>Data Scientist (especially in early-stage or research-heavy teams)<\/li>\n<li>Solutions Architect designing ML reference architectures<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Certification path (if available)<\/h3>\n\n\n\n<p>Google Cloud certifications that commonly align (verify current certifications and exam coverage):<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Associate Cloud Engineer<\/li>\n<li>Professional Cloud Architect<\/li>\n<li>Professional Machine Learning Engineer<\/li>\n<\/ul>\n\n\n\n<p>Official certification overview: 
https:\/\/cloud.google.com\/learn\/certification<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Project ideas for practice<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Build a \u201ccreate VM \u2192 run training \u2192 upload \u2192 delete VM\u201d automation script.<\/li>\n<li>Create a custom hardened image derived from a Deep Learning VM Images base.<\/li>\n<li>Implement private-only DL VM instances with Cloud NAT and IAP access.<\/li>\n<li>Add monitoring dashboards for GPU\/CPU\/memory\/disk and alert on idle GPU.<\/li>\n<li>Implement artifact retention policies in Cloud Storage.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">22. Glossary<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Deep Learning VM Images<\/strong>: Google-maintained VM images intended for ML\/deep learning workloads on Compute Engine.<\/li>\n<li><strong>Compute Engine<\/strong>: Google Cloud\u2019s IaaS VM service.<\/li>\n<li><strong>Image<\/strong>: A boot disk template used to create VM instances.<\/li>\n<li><strong>Image family<\/strong>: A pointer to the latest non-deprecated image in a family (useful but can reduce reproducibility if you always track \u201clatest\u201d).<\/li>\n<li><strong>Persistent Disk<\/strong>: Network-attached block storage for Compute Engine.<\/li>\n<li><strong>GPU (Graphics Processing Unit)<\/strong>: Hardware accelerator commonly used for deep learning training and inference.<\/li>\n<li><strong>IAM (Identity and Access Management)<\/strong>: Controls who can do what in your Google Cloud environment.<\/li>\n<li><strong>Service account<\/strong>: Non-human identity used by workloads to access Google Cloud APIs.<\/li>\n<li><strong>OS Login<\/strong>: Google Cloud feature to manage Linux SSH access using IAM.<\/li>\n<li><strong>Cloud Storage<\/strong>: Google Cloud object storage for datasets and model artifacts.<\/li>\n<li><strong>Cloud NAT<\/strong>: Managed NAT for outbound internet access from private VMs 
without public IPs.<\/li>\n<li><strong>Cloud Logging \/ Cloud Monitoring<\/strong>: Observability services for logs, metrics, dashboards, and alerting.<\/li>\n<li><strong>Least privilege<\/strong>: Security principle of granting only the minimal permissions required.<\/li>\n<li><strong>Egress<\/strong>: Outbound network traffic, often billable when leaving a region or going to the internet.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">23. Summary<\/h2>\n\n\n\n<p>Deep Learning VM Images on Google Cloud provides curated VM images for Compute Engine that accelerate AI and ML work by reducing environment setup and improving consistency. It matters because deep learning environments are complex\u2014frameworks, drivers, and dependencies can easily drift\u2014and standardized images help teams move faster with fewer failures.<\/p>\n\n\n\n<p>In the Google Cloud ecosystem, Deep Learning VM Images fits best when you want VM-level control for training, experimentation, or inference, while still integrating cleanly with Cloud Storage, IAM, and Cloud Logging\/Monitoring.<\/p>\n\n\n\n<p>Cost and security are primarily governed by how you run Compute Engine:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Cost drivers: VM size, GPU type\/count, disk size\/type, and egress.<\/li>\n<li>Security drivers: IAM\/OS Login, service account least privilege, and minimizing network exposure.<\/li>\n<\/ul>\n\n\n\n<p>Use Deep Learning VM Images when you want a practical ML-ready VM baseline and are prepared to manage VM operations. 
If you want fully managed training and notebook governance, evaluate Vertex AI options next (verify current best practices in official docs).<\/p>\n\n\n\n<p>Next step: read the official Deep Learning VM documentation and then productionize your lab by adding private networking, budgets\/alerts, and an image\/version pinning strategy: https:\/\/cloud.google.com\/deep-learning-vm<\/p>\n","protected":false},"excerpt":{"rendered":"<p>AI and ML<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[53,51],"tags":[],"class_list":["post-550","post","type-post","status-publish","format-standard","hentry","category-ai-and-ml","category-google-cloud"],"_links":{"self":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts\/550","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/comments?post=550"}],"version-history":[{"count":0,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts\/550\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/media?parent=550"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/categories?post=550"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/tags?post=550"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}