{"id":549,"date":"2026-04-14T11:36:20","date_gmt":"2026-04-14T11:36:20","guid":{"rendered":"https:\/\/www.devopsschool.com\/tutorials\/google-cloud-deep-learning-containers-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-ai-and-ml\/"},"modified":"2026-04-14T11:36:20","modified_gmt":"2026-04-14T11:36:20","slug":"google-cloud-deep-learning-containers-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-ai-and-ml","status":"publish","type":"post","link":"https:\/\/www.devopsschool.com\/tutorials\/google-cloud-deep-learning-containers-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-ai-and-ml\/","title":{"rendered":"Google Cloud Deep Learning Containers Tutorial: Architecture, Pricing, Use Cases, and Hands-On Guide for AI and ML"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Category<\/h2>\n\n\n\n<p>AI and ML<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">1. Introduction<\/h2>\n\n\n\n<p>Deep Learning Containers is a set of Google-maintained Docker container images for machine learning and deep learning on Google Cloud. The images come pre-installed with popular frameworks (for example TensorFlow and PyTorch), common ML libraries, and (for GPU images) NVIDIA CUDA\/cuDNN dependencies, so you can start training or serving models without spending hours building and debugging your environment.<\/p>\n\n\n\n<p>In simple terms: <strong>Deep Learning Containers gives you a ready-to-run ML environment packaged as a container image<\/strong>, so you can run the same environment on your laptop, Compute Engine, Google Kubernetes Engine (GKE), or Vertex AI\u2014reducing \u201cworks on my machine\u201d issues.<\/p>\n\n\n\n<p>Technically, Deep Learning Containers are <strong>curated container images<\/strong> published by Google and designed to work well with Google Cloud infrastructure. 
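<\/p>\n\n\n\n<p>For a first hands-on feel, the short Python snippet below composes an illustrative image URI and the <code>docker<\/code> commands to pull and run it locally. The registry and tag pattern are assumptions based on the historical <code>gcr.io\/deeplearning-platform-release<\/code> naming; verify current image URIs in the official documentation.<\/p>\n\n\n\n<pre><code class=\"language-python\">REGISTRY = 'gcr.io/deeplearning-platform-release'  # illustrative; verify in official docs\n\ndef image_uri(framework, version, accelerator):\n    # Builds a name like 'tf2-cpu.2-11'; the tag pattern is illustrative.\n    return '{}/{}-{}.{}'.format(REGISTRY, framework, accelerator, version.replace('.', '-'))\n\ndef docker_commands(uri):\n    # Pull the image, then open an interactive shell inside it.\n    return ['docker pull ' + uri, 'docker run -it --rm ' + uri + ' /bin/bash']\n\nuri = image_uri('tf2', '2.11', 'cpu')\nprint(uri)\nfor cmd in docker_commands(uri):\n    print(cmd)\n<\/code><\/pre>\n\n\n\n<p>In practice you would pin an exact published tag (or an immutable digest) rather than composing one, but the shape of the workflow is the same: pick an image, pull it, run your code inside it.<\/p>\n\n\n\n<p>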
You pull a specific image (CPU or GPU, framework version, and sometimes CUDA version) from Google\u2019s published registries, then run it on a compute platform of your choice. Deep Learning Containers is not a managed training service by itself; it is an <strong>opinionated, reproducible runtime layer<\/strong> you can use across multiple services.<\/p>\n\n\n\n<p>The main problem it solves is <strong>environment standardization<\/strong>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Reproducible and supportable ML runtime environments<\/li>\n<li>Faster onboarding and fewer dependency conflicts<\/li>\n<li>Easier migration between dev\/test and production<\/li>\n<li>A consistent base image for your own custom ML containers<\/li>\n<\/ul>\n\n\n\n<blockquote>\n<p>Naming\/status note: \u201cDeep Learning Containers\u201d remains the official name on Google Cloud at the time of writing. However, Google has been transitioning container image hosting from older registries to Artifact Registry in many products. <strong>Verify the current recommended image registry and image URIs in official docs<\/strong> before you standardize on a specific URI pattern.<\/p>\n<\/blockquote>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">2. What is Deep Learning Containers?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Official purpose<\/h3>\n\n\n\n<p>Deep Learning Containers provide <strong>optimized, tested container images<\/strong> for deep learning workloads on Google Cloud. 
They are built to help you run ML frameworks quickly and consistently on Google Cloud compute services.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Core capabilities<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Pre-built container images for common ML frameworks and versions<\/li>\n<li>CPU and GPU variants (GPU variants include NVIDIA user-space libraries; you still need GPU drivers on the host where applicable)<\/li>\n<li>A stable base image you can extend with your code and dependencies<\/li>\n<li>Compatibility with common Google Cloud runtimes (e.g., GKE, Compute Engine, Vertex AI custom training\u2014verify supported patterns in official docs)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Major components<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Container images<\/strong>: Framework-specific images (TensorFlow, PyTorch, and others as published).<\/li>\n<li><strong>Tags\/versions<\/strong>: Image tags encode framework version and runtime flavor (CPU\/GPU; sometimes CUDA).<\/li>\n<li><strong>Registries<\/strong>: Images are hosted in Google container registries (increasingly Artifact Registry). 
<strong>Verify the authoritative registry and image list<\/strong> in the official Deep Learning Containers documentation.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Service type<\/h3>\n\n\n\n<p>Deep Learning Containers is best understood as a <strong>Google-managed image catalog<\/strong> (curated container artifacts), not a standalone managed service.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Scope (regional\/global\/project-scoped)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The images themselves are published by Google and pulled from Google-hosted registries.<\/li>\n<li>Your usage is <strong>project-scoped<\/strong> in the sense that:\n<ul class=\"wp-block-list\">\n<li>Pulling images may be controlled by your project\u2019s egress policies, VPC Service Controls, and IAM policies (where applicable).<\/li>\n<li>Costs are incurred in the project where compute runs and where artifacts\/logs are stored.<\/li>\n<\/ul>\n<\/li>\n<li>Runtime resources are typically <strong>zonal or regional<\/strong> depending on the platform you run on (Compute Engine zones, GKE clusters in regions\/zones, Vertex AI training in a region).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">How it fits into the Google Cloud ecosystem<\/h3>\n\n\n\n<p>Deep Learning Containers sits in the middle of Google Cloud\u2019s AI and ML stack:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Development<\/strong>: standardize environments for engineers and data scientists<\/li>\n<li><strong>Training<\/strong>: run training on Vertex AI custom jobs, GKE Jobs, or Compute Engine instances<\/li>\n<li><strong>Serving<\/strong>: deploy inference services on GKE (and sometimes on other runtimes if compatible with your networking and GPU needs)<\/li>\n<li><strong>MLOps<\/strong>: integrate with Cloud Logging\/Monitoring, Artifact Registry, Cloud Build, Cloud Storage, and Vertex AI<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">3. 
Why use Deep Learning Containers?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Business reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Faster time-to-value<\/strong>: reduce setup time for ML projects by starting from a known-good runtime.<\/li>\n<li><strong>Lower project risk<\/strong>: fewer failures due to dependency conflicts and \u201cmysterious\u201d CUDA mismatches.<\/li>\n<li><strong>Portability<\/strong>: a consistent runtime across teams and environments.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Technical reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Reproducibility<\/strong>: pin an image tag to reproduce experiments and deployments.<\/li>\n<li><strong>Framework alignment<\/strong>: choose an image matching the framework version you need.<\/li>\n<li><strong>GPU readiness<\/strong>: GPU images include user-space CUDA\/cuDNN libraries (host drivers still required).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Operational reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Standard base images<\/strong>: platform teams can approve a small set of base images.<\/li>\n<li><strong>Simplified CI\/CD<\/strong>: build your own container images <em>FROM<\/em> Deep Learning Containers, then deploy through standard pipelines.<\/li>\n<li><strong>Observability<\/strong>: containers integrate cleanly with Cloud Logging and Cloud Monitoring when run on Google Cloud runtimes.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Security\/compliance reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Controlled supply chain<\/strong>: start from Google-published images, then extend with your own code and scanning policies.<\/li>\n<li><strong>Consistent patching workflow<\/strong>: you can periodically rebuild your derived images from updated base images.<\/li>\n<li><strong>Policy enforcement<\/strong>: standard containers enable consistent runtime hardening (non-root user patterns, minimal packages, 
private networking).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Scalability\/performance reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Optimized runtimes<\/strong>: curated images are generally tested for compatibility with Google Cloud environments.<\/li>\n<li><strong>Horizontal scaling<\/strong>: on GKE, you can scale training\/inference using Kubernetes primitives.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">When teams should choose Deep Learning Containers<\/h3>\n\n\n\n<p>Choose Deep Learning Containers when you want:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A consistent runtime across local\/dev\/prod<\/li>\n<li>A vetted base image for TensorFlow\/PyTorch workloads<\/li>\n<li>To run containers on Compute Engine, GKE, or Vertex AI custom jobs<\/li>\n<li>To reduce environment setup and debugging time<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">When teams should not choose it<\/h3>\n\n\n\n<p>Consider alternatives when:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>You need a <strong>fully managed<\/strong> environment without container operations (consider Vertex AI prebuilt training\/serving, Vertex AI Workbench, or managed pipelines).<\/li>\n<li>You need extremely minimal images for fast cold starts (Deep Learning Containers can be large; startup\/pull times can be significant).<\/li>\n<li>You require a framework\/runtime combination not provided\u2014then you may need to build your own base image.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">4. 
Where is Deep Learning Containers used?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Industries<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Healthcare and life sciences (imaging, genomics)<\/li>\n<li>Retail and e-commerce (recommendations, forecasting)<\/li>\n<li>Media and entertainment (content understanding)<\/li>\n<li>Manufacturing (quality inspection with computer vision)<\/li>\n<li>Financial services (fraud detection, risk models)<\/li>\n<li>Public sector and education (research workloads)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Team types<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>ML engineering teams standardizing training\/serving environments<\/li>\n<li>Platform\/DevOps teams providing \u201cgolden images\u201d and guardrails<\/li>\n<li>Research teams needing reproducible environments<\/li>\n<li>SRE\/operations teams supporting ML services at scale<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Workloads<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Model training (single-node, distributed\u2014framework-dependent)<\/li>\n<li>Batch inference (offline scoring jobs)<\/li>\n<li>Online inference (API services on Kubernetes)<\/li>\n<li>Experimentation and notebooks (when using container-based dev environments)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Architectures<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Vertex AI custom training jobs using custom containers<\/li>\n<li>GKE-based training orchestration (Jobs, Argo Workflows, Kubeflow\u2014verify current Kubeflow support patterns separately)<\/li>\n<li>Compute Engine instances running Docker for ad-hoc training<\/li>\n<li>Hybrid: on-prem development + cloud training (portable container runtime)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Real-world deployment contexts<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Development: local Docker + Cloud Build for reproducible builds<\/li>\n<li>Test: GKE namespaces \/ Vertex AI staging<\/li>\n<li>Production: hardened 
images in Artifact Registry, deployed with policy controls<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Production vs dev\/test usage<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Dev\/test<\/strong>: fast iteration, frequent rebuilds, smaller datasets, CPU-only jobs<\/li>\n<li><strong>Production<\/strong>: pinned image versions, vulnerability scanning, private networking, autoscaling, and strict IAM<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">5. Top Use Cases and Scenarios<\/h2>\n\n\n\n<p>Below are realistic ways teams use Deep Learning Containers on Google Cloud.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1) Standardized TensorFlow training environment<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Engineers use different TensorFlow and dependency versions, producing inconsistent results.<\/li>\n<li><strong>Why Deep Learning Containers fits<\/strong>: A pinned container tag ensures the same runtime everywhere.<\/li>\n<li><strong>Example<\/strong>: Data science trains a Keras model locally; the same container runs in Vertex AI custom training.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">2) GPU-enabled PyTorch training without CUDA dependency chaos<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: CUDA\/cuDNN mismatches cause runtime errors when moving to GPU machines.<\/li>\n<li><strong>Why it fits<\/strong>: GPU images bundle user-space GPU libraries aligned to a compatible CUDA stack.<\/li>\n<li><strong>Example<\/strong>: Team runs PyTorch training on GKE GPU nodes using the same base image across clusters.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">3) Building a \u201cgolden base image\u201d for ML<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Security requires a controlled, scanned base image for all ML workloads.<\/li>\n<li><strong>Why it fits<\/strong>: Use Deep Learning Containers as an approved 
base and extend it with only what you need.<\/li>\n<li><strong>Example<\/strong>: Platform team publishes <code>company-ml-base:tf2-cpu<\/code> derived from Deep Learning Containers, scanned and signed.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">4) Repeatable research experiments<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Reproducing experiments months later is hard due to changing dependencies.<\/li>\n<li><strong>Why it fits<\/strong>: Containers encode dependencies; tags encode versions.<\/li>\n<li><strong>Example<\/strong>: Research team stores the exact container digest in experiment metadata.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">5) Batch inference on scheduled infrastructure<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Nightly scoring needs a known environment and consistent outputs.<\/li>\n<li><strong>Why it fits<\/strong>: Run batch jobs on GKE or Compute Engine with a fixed container.<\/li>\n<li><strong>Example<\/strong>: A nightly job reads data from BigQuery export in Cloud Storage and writes predictions back.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">6) Distributed training on Kubernetes (advanced)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Need to scale training across multiple nodes and GPUs.<\/li>\n<li><strong>Why it fits<\/strong>: Containers are the standard unit of deployment on Kubernetes.<\/li>\n<li><strong>Example<\/strong>: Team uses a Kubernetes operator (framework-dependent) and a Deep Learning Containers image to run multi-worker training.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">7) Rapid onboarding for new ML engineers<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: New hires spend days setting up GPU drivers and framework dependencies.<\/li>\n<li><strong>Why it fits<\/strong>: Provide a container-based workflow as the default.<\/li>\n<li><strong>Example<\/strong>: <code>docker 
run<\/code> or Kubernetes pod runs the same environment used in production.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">8) A\/B testing model server stacks (GKE)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Need to compare serving performance across framework versions.<\/li>\n<li><strong>Why it fits<\/strong>: Swap container tags and redeploy.<\/li>\n<li><strong>Example<\/strong>: Canary deployment in GKE using different container tags for TF Serving-based stacks (if used).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">9) CI pipeline for model training images<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Building ML images is slow and inconsistent across developer machines.<\/li>\n<li><strong>Why it fits<\/strong>: Use Cloud Build to build a derived image from Deep Learning Containers.<\/li>\n<li><strong>Example<\/strong>: On merge to main, Cloud Build builds and pushes <code>trainer:gitsha<\/code> images to Artifact Registry.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">10) Hybrid portability (dev on-prem, train in cloud)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: On-prem GPU resources are constrained; cloud burst training is needed.<\/li>\n<li><strong>Why it fits<\/strong>: Same container runs anywhere Docker\/Kubernetes runs.<\/li>\n<li><strong>Example<\/strong>: Developers validate training in local container; production training runs on Vertex AI.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">6. Core Features<\/h2>\n\n\n\n<blockquote>\n<p>Important: Deep Learning Containers is an image catalog. 
Many \u201cfeatures\u201d are realized through how you use the images with compute services.<\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\">1) Curated framework images (TensorFlow\/PyTorch, etc.)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Provides prebuilt containers with ML frameworks and common dependencies.<\/li>\n<li><strong>Why it matters<\/strong>: Saves time and reduces dependency conflicts.<\/li>\n<li><strong>Practical benefit<\/strong>: Faster environment setup; fewer broken builds.<\/li>\n<li><strong>Limitations\/caveats<\/strong>: Framework and version availability changes over time\u2014<strong>verify the current image list and tags in official docs<\/strong>.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">2) CPU and GPU image variants<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Offers CPU-only images and GPU-enabled images (with user-space NVIDIA libraries).<\/li>\n<li><strong>Why it matters<\/strong>: Enables the same workflow for both dev (CPU) and training (GPU).<\/li>\n<li><strong>Practical benefit<\/strong>: Developers can iterate cheaply on CPU, then scale to GPU.<\/li>\n<li><strong>Limitations\/caveats<\/strong>: GPU usage still requires compatible GPU drivers on the host (Compute Engine\/GKE nodes). CUDA versions must align\u2014<strong>verify compatibility<\/strong>.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">3) Version pinning through tags\/digests<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Lets you select a specific version of an image.<\/li>\n<li><strong>Why it matters<\/strong>: Reproducibility and predictable behavior.<\/li>\n<li><strong>Practical benefit<\/strong>: Stable builds and repeatable experiments.<\/li>\n<li><strong>Limitations\/caveats<\/strong>: Tags can sometimes be moved in some ecosystems; using immutable image digests provides stronger guarantees. 
Verify how Google publishes these images.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">4) Designed for Google Cloud runtimes<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Images are tested for typical Google Cloud usage patterns.<\/li>\n<li><strong>Why it matters<\/strong>: Fewer surprises when moving to Vertex AI or GKE.<\/li>\n<li><strong>Practical benefit<\/strong>: Less time troubleshooting OS\/package-level issues.<\/li>\n<li><strong>Limitations\/caveats<\/strong>: \u201cDesigned for\u201d is not \u201csupported everywhere.\u201d Always validate your workload on your chosen runtime.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">5) Extensibility (use as a base image)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: You can build your own containers on top of Deep Learning Containers.<\/li>\n<li><strong>Why it matters<\/strong>: Real workloads require custom code and dependencies.<\/li>\n<li><strong>Practical benefit<\/strong>: Standard base + your app layer = consistent and maintainable.<\/li>\n<li><strong>Limitations\/caveats<\/strong>: Large base images can increase build and pull times.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">6) Container-first MLOps alignment<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Works well with CI\/CD, Artifact Registry, and policy-based governance.<\/li>\n<li><strong>Why it matters<\/strong>: Production ML is software delivery plus data.<\/li>\n<li><strong>Practical benefit<\/strong>: Promote the same container through dev\/stage\/prod.<\/li>\n<li><strong>Limitations\/caveats<\/strong>: You must implement your own release controls (scanning, signing, provenance).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">7) Compatibility with Kubernetes<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: As containers, images are naturally deployable to GKE.<\/li>\n<li><strong>Why it 
matters<\/strong>: Kubernetes is a standard platform for ML ops at scale.<\/li>\n<li><strong>Practical benefit<\/strong>: Unified scheduling, secrets, networking, and autoscaling.<\/li>\n<li><strong>Limitations\/caveats<\/strong>: Kubernetes adds operational complexity; not ideal for small teams unless you already run GKE.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">7. Architecture and How It Works<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">High-level service architecture<\/h3>\n\n\n\n<p>Deep Learning Containers sits at the artifact layer:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Google publishes framework container images (Deep Learning Containers).<\/li>\n<li>You pull an image into your runtime environment (Compute Engine \/ GKE \/ Vertex AI).<\/li>\n<li>You run training or inference inside the container.<\/li>\n<li>You store data and artifacts in Cloud Storage \/ BigQuery \/ databases.<\/li>\n<li>You observe and govern workloads through Cloud Logging, Cloud Monitoring, IAM, and network controls.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Request\/data\/control flow<\/h3>\n\n\n\n<p>Typical training flow:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Control plane<\/strong>: You submit a job (Vertex AI \/ Kubernetes \/ scripts).<\/li>\n<li><strong>Image pull<\/strong>: The runtime pulls the container image from Google\u2019s registry (and possibly your derived image from Artifact Registry).<\/li>\n<li><strong>Data access<\/strong>: Container reads training data from Cloud Storage\/BigQuery or other sources.<\/li>\n<li><strong>Compute<\/strong>: Training executes on CPU or GPU hardware.<\/li>\n<li><strong>Outputs<\/strong>: Model artifacts written to Cloud Storage; metrics\/logs to Cloud Logging.<\/li>\n<li><strong>Promotion<\/strong>: Optionally register model and deploy (Vertex AI model registry\/endpoints or GKE).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Integrations with related services<\/h3>\n\n\n\n<p>Common pairings:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Artifact Registry<\/strong>: store your derived training images.<\/li>\n<li><strong>Cloud Build<\/strong>: build derived images from Deep Learning Containers.<\/li>\n<li><strong>Vertex AI<\/strong>: run custom training jobs with your container image.<\/li>\n<li><strong>Compute Engine<\/strong>: ad-hoc training\/inference by running the container on VMs.<\/li>\n<li><strong>GKE<\/strong>: production orchestration for training pipelines and inference services.<\/li>\n<li><strong>Cloud Storage<\/strong>: datasets, checkpoints, and final model artifacts.<\/li>\n<li><strong>Cloud Logging\/Monitoring<\/strong>: operational observability.<\/li>\n<li><strong>IAM<\/strong>: access to storage, Vertex AI, registries, and logs.<\/li>\n<li><strong>VPC \/ Private Google Access<\/strong>: keep traffic private where possible.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Dependency services<\/h3>\n\n\n\n<p>Deep Learning Containers itself does not allocate compute. You typically depend on:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A runtime (Compute Engine \/ GKE \/ Vertex AI)<\/li>\n<li>A registry endpoint (Google-published image registry + your Artifact Registry)<\/li>\n<li>Data storage (Cloud Storage\/BigQuery)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Security\/authentication model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Pulling images: may be public or authenticated depending on the registry and image access rules. 
<strong>Verify current access requirements<\/strong> for the official image registry.<\/li>\n<li>Access to your resources: handled via IAM service accounts attached to the VM\/node\/pod\/job.<\/li>\n<li>Best practice: use dedicated service accounts with least privilege.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Networking model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Image pulls and data access happen over Google\u2019s network.<\/li>\n<li>You can reduce public internet exposure by:\n<ul class=\"wp-block-list\">\n<li>Using private clusters (GKE)<\/li>\n<li>Using Private Google Access \/ Private Service Connect (where applicable)<\/li>\n<li>Restricting egress with Cloud NAT and firewall rules<\/li>\n<li>Using VPC Service Controls for supported services (verify coverage and constraints)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Monitoring\/logging\/governance considerations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Ensure containers write logs to stdout\/stderr for collection.<\/li>\n<li>Export metrics (framework metrics, custom metrics) using supported agents on your runtime.<\/li>\n<li>Track:\n<ul class=\"wp-block-list\">\n<li>Training duration and GPU utilization<\/li>\n<li>Image pull times<\/li>\n<li>Data read throughput and egress<\/li>\n<li>Cost per experiment\/job<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Simple architecture diagram (Mermaid)<\/h3>\n\n\n\n<pre><code class=\"language-mermaid\">flowchart LR\n  Dev[Developer \/ CI] --&gt;|Build derived image| CB[Cloud Build]\n  CB --&gt; AR[Artifact Registry]\n  AR --&gt;|Pull image| VA[Vertex AI Custom Job]\n  VA --&gt;|Read data| GCS[(Cloud Storage)]\n  VA --&gt;|Write model artifacts| GCS\n  VA --&gt; LOG[Cloud Logging]\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Production-style architecture diagram (Mermaid)<\/h3>\n\n\n\n<pre><code class=\"language-mermaid\">flowchart TB\n  subgraph CICD[CI\/CD]\n    Git[Source Repo] --&gt; CB2[Cloud Build]\n    CB2 --&gt; Scan[Container Scanning \/ Policy]\n    Scan 
--&gt; AR2[\"Artifact Registry (Private Repo)\"]\n  end\n\n  subgraph Network[VPC]\n    NAT[Cloud NAT]\n    FW[Firewall Rules]\n  end\n\n  subgraph Train[Training Plane]\n    VA2[Vertex AI Custom Training]\n    SA[\"Service Account (Least Privilege)\"]\n    VA2 --- SA\n  end\n\n  subgraph Data[Data Plane]\n    GCS2[(\"Cloud Storage: datasets &amp; artifacts\")]\n    BQ[(BigQuery - optional)]\n    KMS[Cloud KMS - optional]\n  end\n\n  subgraph Ops[Operations]\n    CL[Cloud Logging]\n    CM[Cloud Monitoring]\n    Audit[Cloud Audit Logs]\n  end\n\n  AR2 --&gt;|Pull container| VA2\n  VA2 --&gt;|Read\/Write| GCS2\n  VA2 --&gt;|Query| BQ\n  GCS2 --- KMS\n  VA2 --&gt; CL\n  VA2 --&gt; CM\n  VA2 --&gt; Audit\n  VA2 -. egress .-&gt; NAT\n  NAT -. controlled .-&gt; FW\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">8. Prerequisites<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Account\/project requirements<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A Google Cloud project with billing enabled.<\/li>\n<li>Ability to enable APIs and create resources in that project.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Permissions \/ IAM roles<\/h3>\n\n\n\n<p>Exact roles depend on the runtime you choose. 
For the lab in this tutorial (Cloud Build + Artifact Registry + Vertex AI + Cloud Storage), you typically need permissions equivalent to:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Artifact Registry: <code>roles\/artifactregistry.admin<\/code> (or narrower: create repo + write)<\/li>\n<li>Cloud Build: <code>roles\/cloudbuild.builds.editor<\/code> (or admin depending on org policy)<\/li>\n<li>Cloud Storage: <code>roles\/storage.admin<\/code> (or narrower: bucket create + object admin)<\/li>\n<li>Vertex AI: <code>roles\/aiplatform.user<\/code> (and possibly additional roles for job creation\/reading logs)<\/li>\n<\/ul>\n\n\n\n<p>If you use a dedicated service account for the training job, grant it:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>roles\/storage.objectAdmin<\/code> on the specific bucket\/prefix used for outputs<\/li>\n<li>Other least-privilege roles as needed (e.g., read dataset buckets)<\/li>\n<\/ul>\n\n\n\n<blockquote>\n<p>In organizations with strict IAM, you may also need permissions for <code>serviceusage.services.enable<\/code> to enable APIs, and for creating service accounts.<\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\">Billing requirements<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Billing must be enabled (compute and storage costs apply).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">CLI\/SDK\/tools needed<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/cloud.google.com\/sdk\/docs\/install\">Google Cloud SDK (<code>gcloud<\/code>)<\/a> (or Cloud Shell)<\/li>\n<li>Docker (local, or use Cloud Build so you don\u2019t need local Docker)<\/li>\n<li>Optional: <code>gsutil<\/code> (included with Cloud SDK)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Region availability<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Choose a Vertex AI region supported in your project. 
Vertex AI and GPU availability vary by region.<\/li>\n<li>Artifact Registry is regional; choose the same region as your training jobs when practical to reduce latency\/egress.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Quotas\/limits<\/h3>\n\n\n\n<p>You may encounter quotas for:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Vertex AI custom training resources (CPU\/GPU)<\/li>\n<li>Compute capacity in a region<\/li>\n<li>Artifact Registry storage and operations<\/li>\n<li>Cloud Build build minutes (depending on billing plan)<\/li>\n<\/ul>\n\n\n\n<p>Always check:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>IAM deny policies \/ org policies<\/strong> (e.g., restriction on external IPs, allowed regions)<\/li>\n<li><strong>GPU quota<\/strong> if you plan to use GPUs<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Prerequisite services (APIs)<\/h3>\n\n\n\n<p>For the lab, enable:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Vertex AI API (<code>aiplatform.googleapis.com<\/code>)<\/li>\n<li>Artifact Registry API (<code>artifactregistry.googleapis.com<\/code>)<\/li>\n<li>Cloud Build API (<code>cloudbuild.googleapis.com<\/code>)<\/li>\n<li>Cloud Storage (enabled by default in most projects; no separate enablement step is usually required)<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">9. Pricing \/ Cost<\/h2>\n\n\n\n<p>Deep Learning Containers images themselves are not typically billed as a separate line item. 
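<\/p>\n\n\n\n<p>As a rough mental model, total cost is compute time multiplied by machine and accelerator rates, plus storage and network charges. The sketch below makes that arithmetic explicit; every rate is a hypothetical placeholder, not a Google Cloud price, so use the official pricing pages and calculator for real numbers.<\/p>\n\n\n\n<pre><code class=\"language-python\">def estimate_job_cost(hours, machine_rate, gpu_rate, num_gpus,\n                      storage_gb, storage_rate, egress_gb, egress_rate):\n    # All rates are hypothetical placeholders (per hour, or per GB per month).\n    compute = hours * (machine_rate + num_gpus * gpu_rate)\n    storage = storage_gb * storage_rate\n    network = egress_gb * egress_rate\n    return round(compute + storage + network, 2)\n\n# Example: a 2-hour single-GPU job with placeholder rates.\nprint(estimate_job_cost(hours=2, machine_rate=1.0, gpu_rate=2.5, num_gpus=1,\n                        storage_gb=50, storage_rate=0.02, egress_gb=0, egress_rate=0.12))\n<\/code><\/pre>\n\n\n\n<p>Doubling GPU count or job duration scales the compute term linearly, which is why GPU hours dominate most deep learning budgets.<\/p>\n\n\n\n<p>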
<strong>You pay for the infrastructure and services you use to run them and store artifacts.<\/strong><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Pricing dimensions<\/h3>\n\n\n\n<p>Costs depend on where you run the container:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>\n<p><strong>Compute<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li>Vertex AI custom training: billed based on machine type, accelerators (GPUs), and runtime duration.<\/li>\n<li>Compute Engine: billed based on VM type, attached GPUs, disks, and runtime.<\/li>\n<li>GKE: billed for cluster management (mode-dependent) plus underlying nodes\/pods and GPUs.<\/li>\n<\/ul>\n<\/li>\n<li>\n<p><strong>Storage<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li>Cloud Storage for datasets, checkpoints, and model artifacts.<\/li>\n<li>Artifact Registry storage for your derived images.<\/li>\n<\/ul>\n<\/li>\n<li>\n<p><strong>Networking<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li>Egress charges if data moves across regions or out to the internet.<\/li>\n<li>Container image pulls may incur network and performance overhead; charges depend on network path and location.<\/li>\n<\/ul>\n<\/li>\n<li>\n<p><strong>Build<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li>Cloud Build minutes and resources used to build images.<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Free tier (if applicable)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Google Cloud has limited free tiers for some products, but <strong>ML training and large container pulls typically exceed free-tier allowances<\/strong>. 
Verify current free-tier details in official pricing pages.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Key cost drivers<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>GPU hours (largest driver for deep learning training)<\/li>\n<li>Training job duration (including idle\/wait time)<\/li>\n<li>Data size and repeated reads (especially cross-region)<\/li>\n<li>Artifact retention (checkpoints can be large)<\/li>\n<li>Image size and pull frequency (DLC images can be large)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Hidden or indirect costs<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Repeated container pulls<\/strong> in autoscaled Kubernetes environments can increase network usage and slow startup.<\/li>\n<li><strong>Cross-region storage access<\/strong>: training in one region reading data in another can create both latency and egress charges.<\/li>\n<li><strong>Logging volume<\/strong>: extremely verbose training logs can increase logging ingestion and storage costs.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Network\/data transfer implications<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Keep the Artifact Registry repo, the Cloud Storage bucket, and the training runtime in the <strong>same region<\/strong> when possible.<\/li>\n<li>Use caching strategies for datasets and containers where appropriate (GKE node image caching, pre-pulled images\u2014implementation-specific).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">How to optimize cost<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Start with CPU for development and small experiments.<\/li>\n<li>Use small machine types for smoke tests.<\/li>\n<li>Use preemptible\/Spot VMs where supported by your chosen runtime (availability and behavior differ\u2014verify in official docs).<\/li>\n<li>Put datasets and outputs in the same region as training.<\/li>\n<li>Apply lifecycle policies to Cloud Storage buckets for old 
checkpoints.<\/li>\n<li>Build smaller derived images: remove build tools from the runtime stage and avoid adding large unused packages.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Example low-cost starter estimate (conceptual)<\/h3>\n\n\n\n<p>A realistic \u201cstarter\u201d cost profile might include:\n&#8211; 1 small CPU training job for 10\u201330 minutes on Vertex AI or Compute Engine\n&#8211; A few GB of Cloud Storage for dataset and outputs\n&#8211; One container image build in Cloud Build\n&#8211; Minimal egress if all in one region<\/p>\n\n\n\n<p>To compute your real estimate, use:\n&#8211; Google Cloud pricing pages for:\n  &#8211; Vertex AI training pricing: https:\/\/cloud.google.com\/vertex-ai\/pricing\n  &#8211; Compute Engine pricing: https:\/\/cloud.google.com\/compute\/pricing\n  &#8211; Artifact Registry pricing: https:\/\/cloud.google.com\/artifact-registry\/pricing\n  &#8211; Cloud Storage pricing: https:\/\/cloud.google.com\/storage\/pricing\n&#8211; Pricing Calculator: https:\/\/cloud.google.com\/products\/calculator<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Example production cost considerations<\/h3>\n\n\n\n<p>In production, costs usually come from:\n&#8211; Continuous training\/retraining schedules\n&#8211; GPU training fleets\n&#8211; Multi-environment deployments (dev\/stage\/prod)\n&#8211; Large-scale artifact retention (models + datasets + features)\n&#8211; High availability inference clusters (if you use containers for serving)<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">10. 
Step-by-Step Hands-On Tutorial<\/h2>\n\n\n\n<p>This lab demonstrates a practical pattern used in real teams:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Start from a <strong>Deep Learning Containers<\/strong> base image (CPU to keep cost low)<\/li>\n<li>Add your training code<\/li>\n<li>Build and push the derived image to <strong>Artifact Registry<\/strong><\/li>\n<li>Run the container as a <strong>Vertex AI custom training job<\/strong><\/li>\n<li>Write model artifacts to <strong>Cloud Storage<\/strong><\/li>\n<\/ul>\n\n\n\n<p>Because Deep Learning Containers image URIs and tags change over time, this tutorial is written to be executable without guessing specific tags. You will <strong>select a current DLC base image URI from official docs<\/strong> and paste it into an environment variable.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Objective<\/h3>\n\n\n\n<p>Build a small TensorFlow or PyTorch training container derived from <strong>Deep Learning Containers<\/strong>, run it on <strong>Vertex AI custom training<\/strong>, and store the output model artifact in <strong>Cloud Storage<\/strong>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Lab Overview<\/h3>\n\n\n\n<p>You will:\n1. Set up project variables and enable APIs.\n2. Create an Artifact Registry repository and a Cloud Storage bucket.\n3. Choose a Deep Learning Containers base image (CPU).\n4. Write a tiny training script (fast and low-cost).\n5. Build a derived container with Cloud Build and push it to Artifact Registry.\n6. Run a Vertex AI custom training job using the derived image.\n7. 
Validate outputs and clean up.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 1: Set up project, region, and APIs<\/h3>\n\n\n\n<p>1) Open <strong>Cloud Shell<\/strong> (recommended) or a terminal with <code>gcloud<\/code> authenticated.<\/p>\n\n\n\n<p>2) Set environment variables:<\/p>\n\n\n\n<pre><code class=\"language-bash\">export PROJECT_ID=\"YOUR_PROJECT_ID\"\nexport REGION=\"us-central1\"   # choose a Vertex AI-supported region\nexport AR_REPO=\"dlc-lab\"\nexport IMAGE_NAME=\"dlc-train-demo\"\nexport BUCKET=\"gs:\/\/${PROJECT_ID}-dlc-lab-artifacts\"\n<\/code><\/pre>\n\n\n\n<p>3) Configure gcloud:<\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud config set project \"${PROJECT_ID}\"\ngcloud config set ai\/region \"${REGION}\"\n<\/code><\/pre>\n\n\n\n<p>4) Enable required APIs:<\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud services enable \\\n  aiplatform.googleapis.com \\\n  artifactregistry.googleapis.com \\\n  cloudbuild.googleapis.com\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome<\/strong>: APIs enable successfully.<\/p>\n\n\n\n<p><strong>Verification<\/strong>:<\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud services list --enabled --filter=\"name:(aiplatform.googleapis.com artifactregistry.googleapis.com cloudbuild.googleapis.com)\"\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 2: Create an Artifact Registry Docker repository and a Cloud Storage bucket<\/h3>\n\n\n\n<p>1) Create the Artifact Registry repository (Docker format):<\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud artifacts repositories create \"${AR_REPO}\" \\\n  --repository-format=docker \\\n  --location=\"${REGION}\" \\\n  --description=\"Deep Learning Containers lab repo\"\n<\/code><\/pre>\n\n\n\n<p>2) Create the Cloud Storage bucket (regional bucket in the same region is usually a good default):<\/p>\n\n\n\n<pre><code class=\"language-bash\">gsutil mb -l 
\"${REGION}\" \"${BUCKET}\"\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome<\/strong>:\n&#8211; Artifact Registry repo exists.\n&#8211; Cloud Storage bucket exists.<\/p>\n\n\n\n<p><strong>Verification<\/strong>:<\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud artifacts repositories describe \"${AR_REPO}\" --location=\"${REGION}\"\ngsutil ls -b \"${BUCKET}\"\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 3: Select a Deep Learning Containers base image (CPU)<\/h3>\n\n\n\n<p>You must choose a current Deep Learning Containers image URI from the official docs.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Official Deep Learning Containers documentation (start here and locate the image list):<br\/>\n  https:\/\/cloud.google.com\/deep-learning-containers\/docs<\/li>\n<\/ul>\n\n\n\n<p>Set the base image URI:<\/p>\n\n\n\n<pre><code class=\"language-bash\">export DLC_BASE_IMAGE=\"PASTE_OFFICIAL_DLC_CPU_IMAGE_URI_HERE\"\n<\/code><\/pre>\n\n\n\n<p>Examples vary by framework\/version and registry location, and change over time. <strong>Do not proceed until you paste a valid official image URI<\/strong>.<\/p>\n\n\n\n<p><strong>Verification<\/strong> (optional but recommended): try to inspect\/pull the base image locally. In Cloud Shell, Docker availability may vary; if available:<\/p>\n\n\n\n<pre><code class=\"language-bash\">docker pull \"${DLC_BASE_IMAGE}\"\n<\/code><\/pre>\n\n\n\n<p>If Docker is not available locally, you can rely on Cloud Build to pull it during the build step (next step). 
A failure there usually indicates a bad image URI or access restriction.<\/p>\n\n\n\n<p><strong>Expected outcome<\/strong>: You have a valid DLC base image URI stored in <code>DLC_BASE_IMAGE<\/code>.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 4: Create a small training application (fast CPU demo)<\/h3>\n\n\n\n<p>Create a working directory:<\/p>\n\n\n\n<pre><code class=\"language-bash\">mkdir -p dlc-lab\/app\ncd dlc-lab\n<\/code><\/pre>\n\n\n\n<p>Create <code>app\/train.py<\/code>:<\/p>\n\n\n\n<pre><code class=\"language-python\">import argparse\nimport os\nimport time\n\ndef main():\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\"--out_dir\", required=True, help=\"Output directory (e.g., \/gcs\/BUCKET\/path or gs:\/\/... depending on your approach)\")\n    args = parser.parse_args()\n\n    # Keep the demo simple: write a small artifact and logs.\n    os.makedirs(args.out_dir, exist_ok=True)\n\n    # Simulate a tiny \"training\" workload\n    for i in range(3):\n        print(f\"Training step {i+1}\/3 ...\")\n        time.sleep(2)\n\n    artifact_path = os.path.join(args.out_dir, \"model.txt\")\n    with open(artifact_path, \"w\") as f:\n        f.write(\"demo-model-artifact\\n\")\n\n    print(f\"Wrote artifact to: {artifact_path}\")\n\nif __name__ == \"__main__\":\n    main()\n<\/code><\/pre>\n\n\n\n<p>This script intentionally avoids large dependencies so it runs quickly. 
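<\/p>

<p>Because the script has no cloud dependencies, you can smoke-test it locally before building any container. A minimal sketch, assuming <code>python3<\/code> is on your PATH and you run it from the <code>dlc-lab\/<\/code> directory:<\/p>

```shell
# Optional local smoke test: the script should exit 0 and leave a
# non-empty model.txt behind.
if [ -f app/train.py ]; then
  python3 app/train.py --out_dir /tmp/dlc-smoke
  test -s /tmp/dlc-smoke/model.txt && echo "local smoke test passed"
else
  echo "app/train.py not found - run this from the dlc-lab/ directory" >&2
fi
```

<p>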
In a real workload you would import TensorFlow\/PyTorch and write a SavedModel\/torchscript artifact.<\/p>\n\n\n\n<p><strong>Expected outcome<\/strong>: You have a Python script that writes an output file to a directory.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 5: Create a Dockerfile that extends Deep Learning Containers<\/h3>\n\n\n\n<p>Create <code>Dockerfile<\/code> in <code>dlc-lab\/<\/code>:<\/p>\n\n\n\n<pre><code class=\"language-dockerfile\"># Base image is a Google Cloud Deep Learning Containers image (CPU)\nARG DLC_BASE_IMAGE\nFROM ${DLC_BASE_IMAGE}\n\nWORKDIR \/app\nCOPY app\/train.py \/app\/train.py\n\n# For many DLC images, Python is already present.\n# If your chosen image does not include python\/pip as expected, select a different DLC image\n# or add install steps that match the base image OS (verify in official docs).\n\nENTRYPOINT [\"python3\", \"\/app\/train.py\"]\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome<\/strong>: A Dockerfile that builds a runnable container based on Deep Learning Containers.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 6: Build and push the derived image with Cloud Build<\/h3>\n\n\n\n<p>Construct the Artifact Registry image URI:<\/p>\n\n\n\n<pre><code class=\"language-bash\">export DERIVED_IMAGE_URI=\"${REGION}-docker.pkg.dev\/${PROJECT_ID}\/${AR_REPO}\/${IMAGE_NAME}:v1\"\necho \"${DERIVED_IMAGE_URI}\"\n<\/code><\/pre>\n\n\n\n<p>Submit the build:<\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud builds submit . \\\n  --tag \"${DERIVED_IMAGE_URI}\" \\\n  --substitutions=_DLC_BASE_IMAGE=\"${DLC_BASE_IMAGE}\"\n<\/code><\/pre>\n\n\n\n<p>Cloud Build will try to pull the base image and build your derived image.<\/p>\n\n\n\n<blockquote>\n<p>If the build fails because the Dockerfile ARG wasn\u2019t substituted: Cloud Build\u2019s <code>--substitutions<\/code> flag populates variables in a build config template; it is not automatically forwarded as a Docker <code>--build-arg<\/code>, so the <code>--tag<\/code> shortcut above may fail or build against an empty base image reference. 
If you run into issues, you can pass the build arg explicitly with a <code>cloudbuild.yaml<\/code>. The simplest portable approach is to create a <code>cloudbuild.yaml<\/code> (shown below).<\/p>\n<\/blockquote>\n\n\n\n<p>If you get substitution issues, create <code>cloudbuild.yaml<\/code> in <code>dlc-lab\/<\/code>:<\/p>\n\n\n\n<pre><code class=\"language-bash\">cat &gt; cloudbuild.yaml &lt;&lt;'EOF'\nsteps:\n- name: 'gcr.io\/cloud-builders\/docker'\n  args:\n  - 'build'\n  - '--build-arg'\n  - 'DLC_BASE_IMAGE=${_DLC_BASE_IMAGE}'\n  - '-t'\n  - '${_DERIVED_IMAGE_URI}'\n  - '.'\nimages:\n- '${_DERIVED_IMAGE_URI}'\nsubstitutions:\n  _DLC_BASE_IMAGE: ''\n  _DERIVED_IMAGE_URI: ''\nEOF\n<\/code><\/pre>\n\n\n\n<p>Then run:<\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud builds submit . \\\n  --config=cloudbuild.yaml \\\n  --substitutions=_DLC_BASE_IMAGE=\"${DLC_BASE_IMAGE}\",_DERIVED_IMAGE_URI=\"${DERIVED_IMAGE_URI}\"\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome<\/strong>: A new image tag <code>v1<\/code> exists in Artifact Registry.<\/p>\n\n\n\n<p><strong>Verification<\/strong>:<\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud artifacts docker images list \"${REGION}-docker.pkg.dev\/${PROJECT_ID}\/${AR_REPO}\" --include-tags\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 7: Run a Vertex AI custom training job using the derived image<\/h3>\n\n\n\n<p>Vertex AI custom training will run your container on managed infrastructure. You will pass an output directory and write artifacts to a local path inside the container, then upload results to Cloud Storage.<\/p>\n\n\n\n<p>There are multiple correct patterns for outputs. 
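<\/p>

<p>One pattern worth knowing: Vertex AI custom training can expose Cloud Storage buckets through a Cloud Storage FUSE mount, conventionally under <code>\/gcs\/BUCKET_NAME<\/code> (verify availability and the exact mount path in the Vertex AI docs), so writing an artifact becomes plain file I\/O. A hedged sketch; <code>artifact_dir<\/code> and <code>write_artifact<\/code> are illustrative helpers, not SDK functions:<\/p>

```python
import os

def artifact_dir(bucket_name: str, prefix: str, base: str = "/gcs") -> str:
    """Map gs://<bucket_name>/<prefix> to its FUSE-style local path."""
    return os.path.join(base, bucket_name, prefix)

def write_artifact(out_dir: str, payload: str = "demo-model-artifact\n") -> str:
    """Write model.txt into out_dir with ordinary file I/O and return its path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, "model.txt")
    with open(path, "w") as f:
        f.write(payload)
    return path

if __name__ == "__main__":
    # On Vertex AI this would target the mounted bucket; locally, point `base`
    # at any writable directory to exercise the same code path.
    print(artifact_dir("my-bucket", "outputs"))  # -> /gcs/my-bucket/outputs
```

<p>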
A simple pattern is:\n&#8211; Write outputs to the container filesystem (e.g., <code>\/tmp\/out<\/code>)\n&#8211; Copy outputs to Cloud Storage at the end of the job<\/p>\n\n\n\n<p>To keep this lab simple and avoid guessing runtime-specific mounted paths, we will implement the upload inside the container by using <code>gsutil<\/code>. Many DLC images include common tools, but <strong>do not assume<\/strong> <code>gsutil<\/code> is present. The most reliable approach is to add the Cloud SDK (or just <code>gsutil<\/code>) in your Dockerfile\u2014but that adds size.<\/p>\n\n\n\n<p>Instead, we\u2019ll use a clean pattern:\n&#8211; Write outputs to <code>\/tmp\/out<\/code>\n&#8211; Let Vertex AI capture logs\n&#8211; Then manually download artifacts (for a real job, you\u2019d explicitly upload to GCS)<\/p>\n\n\n\n<p>However, for a realistic ML workflow, you usually want outputs in GCS. So below are two options:<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Option A (recommended for real projects): bake upload tooling into the image<\/h4>\n\n\n\n<p>Add this to your Dockerfile (only if compatible with your base image OS; <strong>verify in official docs<\/strong>):<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Install <code>google-cloud-storage<\/code> Python client and upload via Python (lighter than full Cloud SDK), or<\/li>\n<li>Install <code>gsutil<\/code> via Cloud SDK.<\/li>\n<\/ul>\n\n\n\n<p>To keep this tutorial broadly compatible, we\u2019ll use <strong>Option B<\/strong> in the lab and mention Option A as best practice.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Option B (lab): write artifacts to <code>\/tmp\/out<\/code> and inspect logs<\/h4>\n\n\n\n<p>Run the job and confirm it ran successfully; then iterate to add uploads later.<\/p>\n\n\n\n<p>1) Create a job spec file <code>job.json<\/code>:<\/p>\n\n\n\n<pre><code class=\"language-bash\">cat &gt; job.json &lt;&lt;EOF\n{\n  \"displayName\": \"dlc-train-demo\",\n  \"jobSpec\": {\n    \"workerPoolSpecs\": [\n      
{\n        \"machineSpec\": {\n          \"machineType\": \"n1-standard-4\"\n        },\n        \"replicaCount\": 1,\n        \"containerSpec\": {\n          \"imageUri\": \"${DERIVED_IMAGE_URI}\",\n          \"args\": [\"--out_dir\", \"\/tmp\/out\"]\n        }\n      }\n    ]\n  }\n}\nEOF\n<\/code><\/pre>\n\n\n\n<p>Notes:\n&#8211; <code>machineType<\/code> availability varies. If <code>n1-standard-4<\/code> is not available in your region or restricted by policy, choose a supported machine type. <strong>Verify in Vertex AI docs<\/strong>: https:\/\/cloud.google.com\/vertex-ai\/docs\/training\/create-custom-job\n&#8211; This job runs on CPU and should finish quickly.<\/p>\n\n\n\n<p>2) Submit the custom job:<\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud ai custom-jobs create --region=\"${REGION}\" --file=job.json\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome<\/strong>: Vertex AI starts the job and then completes it successfully.<\/p>\n\n\n\n<p><strong>Verification<\/strong>:\n&#8211; List jobs:<\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud ai custom-jobs list --region=\"${REGION}\" --format=\"table(name,displayName,state,createTime)\"\n<\/code><\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Describe the job (replace <code>JOB_ID<\/code> with the returned name or ID):<\/li>\n<\/ul>\n\n\n\n<pre><code class=\"language-bash\">export JOB_NAME=\"PASTE_JOB_NAME_HERE\"\ngcloud ai custom-jobs describe \"${JOB_NAME}\" --region=\"${REGION}\"\n<\/code><\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li>View logs in Cloud Logging:<\/li>\n<li>Go to <strong>Logging<\/strong> in the Google Cloud console.<\/li>\n<li>Filter by Vertex AI training resource labels (exact labels vary). 
Use the job name to search.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 8 (Optional but strongly recommended): Make the job write artifacts to Cloud Storage<\/h3>\n\n\n\n<p>For real ML workflows, your job should write model artifacts to Cloud Storage. A simple, robust approach is to upload using the Python Cloud Storage client.<\/p>\n\n\n\n<p>1) Update <code>app\/train.py<\/code> to optionally upload to GCS:<\/p>\n\n\n\n<pre><code class=\"language-bash\">cat &gt; app\/train.py &lt;&lt;'EOF'\nimport argparse\nimport os\nimport time\nfrom urllib.parse import urlparse\n\ndef is_gcs_uri(uri: str) -&gt; bool:\n    return uri.startswith(\"gs:\/\/\")\n\ndef write_local(out_dir: str) -&gt; str:\n    os.makedirs(out_dir, exist_ok=True)\n    for i in range(3):\n        print(f\"Training step {i+1}\/3 ...\")\n        time.sleep(2)\n    artifact_path = os.path.join(out_dir, \"model.txt\")\n    with open(artifact_path, \"w\") as f:\n        f.write(\"demo-model-artifact\\n\")\n    print(f\"Wrote local artifact to: {artifact_path}\")\n    return artifact_path\n\ndef upload_to_gcs(local_path: str, gcs_uri: str):\n    # Import lazily so the base image doesn't need it unless you enable it.\n    from google.cloud import storage\n\n    parsed = urlparse(gcs_uri)\n    bucket_name = parsed.netloc\n    blob_path = parsed.path.lstrip(\"\/\")\n    client = storage.Client()\n    bucket = client.bucket(bucket_name)\n    blob = bucket.blob(blob_path)\n    blob.upload_from_filename(local_path)\n    print(f\"Uploaded {local_path} to {gcs_uri}\")\n\ndef main():\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\"--out_dir\", required=True, help=\"Local output dir (e.g., \/tmp\/out)\")\n    parser.add_argument(\"--gcs_out\", required=False, help=\"GCS URI to upload model.txt (e.g., gs:\/\/bucket\/path\/model.txt)\")\n    args = parser.parse_args()\n\n    local_path = write_local(args.out_dir)\n\n    if args.gcs_out:\n        
upload_to_gcs(local_path, args.gcs_out)\n\nif __name__ == \"__main__\":\n    main()\nEOF\n<\/code><\/pre>\n\n\n\n<p>2) Update the Dockerfile to install the Python client library:<\/p>\n\n\n\n<pre><code class=\"language-bash\">cat &gt; Dockerfile &lt;&lt;'EOF'\nARG DLC_BASE_IMAGE\nFROM ${DLC_BASE_IMAGE}\n\nWORKDIR \/app\nCOPY app\/train.py \/app\/train.py\n\n# Install only what we need for uploading to Cloud Storage.\n# This assumes pip is available. If not, choose a DLC image that includes it or adjust accordingly.\nRUN python3 -m pip install --no-cache-dir google-cloud-storage\n\nENTRYPOINT [\"python3\", \"\/app\/train.py\"]\nEOF\n<\/code><\/pre>\n\n\n\n<p>3) Rebuild and push as <code>v2<\/code>:<\/p>\n\n\n\n<pre><code class=\"language-bash\">export DERIVED_IMAGE_URI_V2=\"${REGION}-docker.pkg.dev\/${PROJECT_ID}\/${AR_REPO}\/${IMAGE_NAME}:v2\"\n\ngcloud builds submit . \\\n  --config=cloudbuild.yaml \\\n  --substitutions=_DLC_BASE_IMAGE=\"${DLC_BASE_IMAGE}\",_DERIVED_IMAGE_URI=\"${DERIVED_IMAGE_URI_V2}\"\n<\/code><\/pre>\n\n\n\n<p>4) Create a new job spec that uploads to GCS:<\/p>\n\n\n\n<pre><code class=\"language-bash\">export GCS_MODEL_URI=\"${BUCKET}\/outputs\/model.txt\"\n\ncat &gt; job-v2.json &lt;&lt;EOF\n{\n  \"displayName\": \"dlc-train-demo-v2\",\n  \"jobSpec\": {\n    \"workerPoolSpecs\": [\n      {\n        \"machineSpec\": {\n          \"machineType\": \"n1-standard-4\"\n        },\n        \"replicaCount\": 1,\n        \"containerSpec\": {\n          \"imageUri\": \"${DERIVED_IMAGE_URI_V2}\",\n          \"args\": [\"--out_dir\", \"\/tmp\/out\", \"--gcs_out\", \"${GCS_MODEL_URI}\"]\n        }\n      }\n    ]\n  }\n}\nEOF\n<\/code><\/pre>\n\n\n\n<p>5) Submit:<\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud ai custom-jobs create --region=\"${REGION}\" --file=job-v2.json\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome<\/strong>: Job completes and <code>model.txt<\/code> exists in your 
bucket.<\/p>\n\n\n\n<p><strong>Verification<\/strong>:<\/p>\n\n\n\n<pre><code class=\"language-bash\">gsutil ls \"${BUCKET}\/outputs\/\"\ngsutil cat \"${BUCKET}\/outputs\/model.txt\"\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Validation<\/h3>\n\n\n\n<p>You have successfully validated Deep Learning Containers usage if:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A derived container image exists in Artifact Registry:<\/li>\n<\/ul>\n\n\n\n<pre><code class=\"language-bash\">gcloud artifacts docker images list \"${REGION}-docker.pkg.dev\/${PROJECT_ID}\/${AR_REPO}\" --include-tags\n<\/code><\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Vertex AI custom training job completes successfully:<\/li>\n<\/ul>\n\n\n\n<pre><code class=\"language-bash\">gcloud ai custom-jobs list --region=\"${REGION}\" --format=\"table(displayName,state,createTime)\"\n<\/code><\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li>(If you did Step 8) the artifact exists in Cloud Storage:<\/li>\n<\/ul>\n\n\n\n<pre><code class=\"language-bash\">gsutil ls \"${BUCKET}\/outputs\/model.txt\"\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Troubleshooting<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\">Build fails: \u201cbase image not found\u201d or \u201cdenied\u201d<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Cause: incorrect DLC base image URI or restricted egress \/ registry access.<\/li>\n<li>Fix:<\/li>\n<li>Re-check the official Deep Learning Containers image list: https:\/\/cloud.google.com\/deep-learning-containers\/docs<\/li>\n<li>Confirm your org policy allows pulling from the registry.<\/li>\n<li>If using private networking, ensure Private Google Access\/NAT allows registry access.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Build fails: <code>python3: not found<\/code> or <code>pip: not found<\/code><\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Cause: selected DLC image 
doesn\u2019t include the expected tools in PATH.<\/li>\n<li>Fix:<\/li>\n<li>Choose a DLC image intended for Python-based workflows (verify in official docs).<\/li>\n<li>Adjust your Dockerfile to match the base OS package manager and install Python\/pip (only if appropriate).<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Vertex AI job fails immediately<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Cause: machine type not available, quota exceeded, or permissions missing.<\/li>\n<li>Fix:<\/li>\n<li>Check job error in Vertex AI UI and Cloud Logging.<\/li>\n<li>Confirm quotas in the region.<\/li>\n<li>Try a smaller\/different machine type supported in your region.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Upload to GCS fails with \u201c403\u201d or auth errors<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Cause: training runtime service account lacks bucket permissions.<\/li>\n<li>Fix:<\/li>\n<li>Identify the service account used by Vertex AI training (verify in job details).<\/li>\n<li>Grant it <code>roles\/storage.objectAdmin<\/code> on the bucket (or a narrower role as needed).<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Cleanup<\/h3>\n\n\n\n<p>To avoid ongoing costs, delete resources you created.<\/p>\n\n\n\n<p>1) Delete Vertex AI custom jobs (optional; jobs generally stop billing after completion, but you may want to remove artifacts):\n&#8211; You can keep job history, but delete any running jobs if needed.<\/p>\n\n\n\n<p>2) Delete the Artifact Registry repository (deletes images):<\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud artifacts repositories delete \"${AR_REPO}\" --location=\"${REGION}\"\n<\/code><\/pre>\n\n\n\n<p>3) Delete the Cloud Storage bucket (this is irreversible):<\/p>\n\n\n\n<pre><code class=\"language-bash\">gsutil rm -r \"${BUCKET}\"\n<\/code><\/pre>\n\n\n\n<p>4) (Optional) Disable APIs if this was a one-time lab (usually not necessary).<\/p>\n\n\n\n<hr 
class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">11. Best Practices<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Architecture best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Separate base vs app layers<\/strong>: use Deep Learning Containers as a base, add only your code and minimal dependencies.<\/li>\n<li><strong>Pin versions<\/strong>: pin the base image tag (or digest) and document it with each model release.<\/li>\n<li><strong>Keep data close to compute<\/strong>: co-locate training jobs and Cloud Storage buckets in the same region.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">IAM\/security best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use <strong>dedicated service accounts<\/strong> per workload (training vs serving).<\/li>\n<li>Grant <strong>least privilege<\/strong>:<\/li>\n<li>Training job SA: read dataset bucket(s), write output bucket(s)<\/li>\n<li>Build pipeline SA: write to Artifact Registry<\/li>\n<li>Prefer <strong>private Artifact Registry repos<\/strong> for your derived images.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Cost best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Run <strong>smoke tests on CPU<\/strong> before GPU jobs.<\/li>\n<li>Right-size machine types; scale up only when metrics show CPU\/GPU saturation.<\/li>\n<li>Use <strong>lifecycle rules<\/strong> to expire old checkpoints and intermediate artifacts.<\/li>\n<li>Reduce image bloat to cut storage and pull times.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Performance best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use local SSD or optimized disks when training is I\/O bound (platform-dependent).<\/li>\n<li>Avoid reading the same data repeatedly from remote storage; cache when feasible.<\/li>\n<li>For Kubernetes, consider node-local caching patterns for large images\/datasets.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Reliability best practices<\/h3>\n\n\n\n<ul 
class=\"wp-block-list\">\n<li>Make training resumable:<\/li>\n<li>checkpoint periodically to Cloud Storage<\/li>\n<li>store training state and metadata<\/li>\n<li>Use retries carefully: distinguish between transient infrastructure errors and deterministic code\/data errors.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Operations best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Standardize logging:<\/li>\n<li>log to stdout\/stderr<\/li>\n<li>log key metrics (epoch time, loss, accuracy, GPU utilization if relevant)<\/li>\n<li>Add health checks for serving containers on GKE.<\/li>\n<li>Track container versions and model versions together (release metadata).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Governance\/tagging\/naming best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use consistent naming:<\/li>\n<li><code>ar-repo<\/code>: <code>ml-platform<\/code>, <code>training-images<\/code><\/li>\n<li>image tags: <code>modelname:gitsha<\/code>, <code>trainer:release-2026-04<\/code><\/li>\n<li>Use labels\/tags on jobs and storage paths:<\/li>\n<li><code>env=dev|stage|prod<\/code>, <code>team=\u2026<\/code>, <code>cost_center=\u2026<\/code><\/li>\n<li>Record:<\/li>\n<li>base DLC image version<\/li>\n<li>your derived image digest<\/li>\n<li>dataset version<\/li>\n<li>code commit SHA<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">12. 
Security Considerations<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Identity and access model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Workloads authenticate using:<\/li>\n<li>service accounts attached to Vertex AI jobs \/ GKE nodes \/ Compute Engine VMs<\/li>\n<li>Enforce least privilege:<\/li>\n<li>read-only access to input datasets<\/li>\n<li>write-only access to output locations<\/li>\n<li>separate roles for build vs runtime<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Encryption<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>In transit: Google Cloud services use TLS for API access.<\/li>\n<li>At rest:<\/li>\n<li>Cloud Storage and Artifact Registry encrypt data at rest by default.<\/li>\n<li>For stricter requirements, use <strong>Customer-Managed Encryption Keys (CMEK)<\/strong> where supported (verify CMEK support for each service you use).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Network exposure<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Prefer private networking patterns:<\/li>\n<li>GKE private clusters (if using GKE)<\/li>\n<li>No public IPs on training VMs (if using Compute Engine)<\/li>\n<li>Controlled egress via Cloud NAT<\/li>\n<li>Limit inbound access with firewall rules and Kubernetes NetworkPolicies.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Secrets handling<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Do not bake credentials into images.<\/li>\n<li>Use Secret Manager (and workload identity patterns where applicable).<\/li>\n<li>For Kubernetes, use Secret Manager CSI driver or similar patterns (verify current recommended integration).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Audit\/logging<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Cloud Audit Logs record admin and data access events for many services.<\/li>\n<li>Ensure logs are retained per compliance requirements.<\/li>\n<li>Monitor image build\/push events (Cloud Build, Artifact Registry logs).<\/li>\n<\/ul>\n\n\n\n<h3 
class=\"wp-block-heading\">Compliance considerations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Keep track of:<\/li>\n<li>where data is stored (region)<\/li>\n<li>where training occurs (region)<\/li>\n<li>access patterns and audit trails<\/li>\n<li>For regulated workloads, validate that each dependent service meets your compliance needs.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Common security mistakes<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Running containers as root without necessity<\/li>\n<li>Broad IAM roles like project-wide Owner\/Editor for workloads<\/li>\n<li>Pulling arbitrary public images instead of curated bases<\/li>\n<li>Allowing unrestricted egress from training environments<\/li>\n<li>Storing sensitive data in logs or model artifacts<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Secure deployment recommendations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Maintain an approved list of DLC base images (by digest, if possible).<\/li>\n<li>Scan and sign derived images (use your organization\u2019s supply chain tooling).<\/li>\n<li>Use private Artifact Registry and restrict who can push\/pull.<\/li>\n<li>Segment environments by project (dev\/stage\/prod).<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">13. 
Limitations and Gotchas<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Not a managed service<\/strong>: Deep Learning Containers doesn\u2019t schedule jobs or manage GPUs; you must run it on a compute platform.<\/li>\n<li><strong>Image size<\/strong>: DLC images can be large, leading to:\n<ul>\n<li>slower cold starts<\/li>\n<li>longer Kubernetes rollout times<\/li>\n<li>higher network usage when scaling<\/li>\n<\/ul>\n<\/li>\n<li><strong>GPU compatibility<\/strong>: GPU containers require:\n<ul>\n<li>compatible host GPU drivers<\/li>\n<li>compatible CUDA stack alignment<\/li>\n<li>correct runtime configuration (e.g., NVIDIA container runtime on Kubernetes nodes)<\/li>\n<\/ul>\n<\/li>\n<li><strong>Version availability changes<\/strong>: Framework versions and tags evolve. Always verify current images in official docs.<\/li>\n<li><strong>Registry transitions<\/strong>: Google Cloud has been moving from older registries to Artifact Registry across products. Verify the recommended image URIs and migration guidance.<\/li>\n<li><strong>Quotas<\/strong>: GPU quota and machine availability are common blockers.<\/li>\n<li><strong>Cross-region costs<\/strong>: Training in one region while reading data from another can create egress charges and slow training.<\/li>\n<li><strong>Operational complexity on GKE<\/strong>: Kubernetes-based ML is powerful but increases operational overhead (node pools, GPU drivers, autoscaling, security hardening).<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">14. Comparison with Alternatives<\/h2>\n\n\n\n<p>Deep Learning Containers is one way to standardize ML runtimes. 
Here\u2019s how it compares.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Option<\/th>\n<th>Best For<\/th>\n<th>Strengths<\/th>\n<th>Weaknesses<\/th>\n<th>When to Choose<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>Deep Learning Containers (Google Cloud)<\/strong><\/td>\n<td>Teams needing curated framework containers<\/td>\n<td>Fast startup for environment, curated compatibility, easy to extend<\/td>\n<td>Not a managed service; images can be large; GPU compatibility still requires host setup<\/td>\n<td>You want a standard base image for training\/serving on Google Cloud runtimes<\/td>\n<\/tr>\n<tr>\n<td><strong>Vertex AI prebuilt training containers<\/strong><\/td>\n<td>Quick managed training on Vertex AI<\/td>\n<td>Integrated with Vertex AI workflows; fewer custom steps<\/td>\n<td>Less flexibility than fully custom images; may not match exact dependency needs<\/td>\n<td>You primarily train on Vertex AI and prefer simplified configuration<\/td>\n<\/tr>\n<tr>\n<td><strong>Deep Learning VM Images<\/strong><\/td>\n<td>VM-based workflows, SSH-centric teams<\/td>\n<td>Turnkey VMs with frameworks\/drivers; easy interactive debugging<\/td>\n<td>Less portable than containers; VM config drift<\/td>\n<td>You want a ready VM environment and are not standardizing on containers yet<\/td>\n<\/tr>\n<tr>\n<td><strong>Vertex AI Workbench<\/strong><\/td>\n<td>Notebook-first development<\/td>\n<td>Managed notebooks, integrated auth and storage<\/td>\n<td>Not a deployment artifact by itself; still need productionization<\/td>\n<td>You want managed notebooks for experimentation and prototyping<\/td>\n<\/tr>\n<tr>\n<td><strong>GKE + custom Docker images (self-built)<\/strong><\/td>\n<td>Platform teams needing full control<\/td>\n<td>Minimal images possible; full control over OS\/packages<\/td>\n<td>More effort to maintain; more debugging<\/td>\n<td>You have custom requirements not met by DLC images<\/td>\n<\/tr>\n<tr>\n<td><strong>AWS Deep Learning 
Containers<\/strong><\/td>\n<td>Teams running on AWS<\/td>\n<td>Similar curated images<\/td>\n<td>Different ecosystem; migration friction<\/td>\n<td>You are primarily on AWS<\/td>\n<\/tr>\n<tr>\n<td><strong>Azure ML curated environments<\/strong><\/td>\n<td>Teams on Azure<\/td>\n<td>Integrated with Azure ML<\/td>\n<td>Different ecosystem<\/td>\n<td>You are primarily on Azure<\/td>\n<\/tr>\n<tr>\n<td><strong>On-prem\/self-managed containers<\/strong><\/td>\n<td>Air-gapped or strict control<\/td>\n<td>Full control<\/td>\n<td>Higher ops burden<\/td>\n<td>You must run outside cloud or under strict constraints<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">15. Real-World Example<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Enterprise example: regulated manufacturing quality inspection<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: A manufacturing enterprise trains computer vision models for defect detection. 
The team must ensure reproducible builds, auditability, and secure access to sensitive image data.<\/li>\n<li><strong>Proposed architecture<\/strong>:\n<ul>\n<li>Deep Learning Containers as the approved base images (pinned)<\/li>\n<li>Derived images built by Cloud Build and stored in private Artifact Registry<\/li>\n<li>Vertex AI custom training jobs in a controlled region<\/li>\n<li>Cloud Storage buckets with CMEK for datasets and artifacts (where required)<\/li>\n<li>Cloud Logging\/Monitoring + Audit Logs for traceability<\/li>\n<\/ul>\n<\/li>\n<li><strong>Why Deep Learning Containers was chosen<\/strong>:\n<ul>\n<li>Standardized framework runtime across multiple teams<\/li>\n<li>Reduced environment drift and improved supportability<\/li>\n<li>Easier governance with a small approved base image list<\/li>\n<\/ul>\n<\/li>\n<li><strong>Expected outcomes<\/strong>:\n<ul>\n<li>Faster onboarding and fewer \u201cdependency incidents\u201d<\/li>\n<li>Traceable training runs (code + container + data references)<\/li>\n<li>Improved compliance posture via consistent supply chain controls<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Startup\/small-team example: recommendation model MVP<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: A startup needs to iterate quickly on a recommendation model, moving from laptops to cloud training without spending weeks on infrastructure.<\/li>\n<li><strong>Proposed architecture<\/strong>:\n<ul>\n<li>Pick a TensorFlow\/PyTorch Deep Learning Containers CPU image for development and CI builds<\/li>\n<li>Use Cloud Build to create a derived training image<\/li>\n<li>Run small Vertex AI custom jobs for experiments<\/li>\n<li>Store outputs in a single Cloud Storage bucket with lifecycle rules<\/li>\n<\/ul>\n<\/li>\n<li><strong>Why Deep Learning Containers was chosen<\/strong>:\n<ul>\n<li>Reduced setup time and simplified reproducibility<\/li>\n<li>Container-based workflow keeps future options open (GKE, Vertex AI, Compute Engine)<\/li>\n<\/ul>\n<\/li>\n<li><strong>Expected outcomes<\/strong>:\n<ul>\n<li>Faster iteration cycles with fewer environment issues<\/li>\n<li>A clear path to productionization (same image promoted forward)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">16. FAQ<\/h2>\n\n\n\n<p>1) <strong>Is Deep Learning Containers a managed training service?<\/strong><br\/>\nNo. Deep Learning Containers provides container images. You still need to run them on a compute platform like Vertex AI, Compute Engine, or GKE.<\/p>\n\n\n\n<p>2) <strong>Do Deep Learning Containers images cost money?<\/strong><br\/>\nTypically, you\u2019re billed for the services used to run and store them (compute, storage, network, build). The images themselves aren\u2019t usually billed as a separate product line. Verify details in official pricing docs for the services you use.<\/p>\n\n\n\n<p>3) <strong>Can I use Deep Learning Containers with Vertex AI?<\/strong><br\/>\nYes, commonly via custom training (custom container jobs). Verify the latest recommended integration patterns in Vertex AI documentation.<\/p>\n\n\n\n<p>4) <strong>Can I use Deep Learning Containers with GKE?<\/strong><br\/>\nYes. They are container images, so they can run on GKE like other containers, assuming node configuration (especially GPU runtime) is correct.<\/p>\n\n\n\n<p>5) <strong>Do GPU images include NVIDIA drivers?<\/strong><br\/>\nContainers typically include user-space libraries, but GPU drivers must be installed on the host (VM or Kubernetes node). Verify the exact requirements for your chosen image and runtime.<\/p>\n\n\n\n<p>6) <strong>How do I pick the right image tag?<\/strong><br\/>\nPick based on framework (TensorFlow\/PyTorch), framework version, and CPU vs GPU needs. Use official docs to find the current image list and tag semantics.<\/p>\n\n\n\n<p>7) <strong>Should I pin by tag or digest?<\/strong><br\/>\nFor strong reproducibility, pin by immutable digest. 
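<\/p>\n\n\n\n<p>As a quick illustration, you can distinguish a mutable tag reference from an immutable digest reference programmatically; the registry path below is hypothetical:<\/p>\n\n\n\n

```python
# Hypothetical image references; substitute your real registry path.
TAG_REF = "us-docker.pkg.dev/my-project/ml/trainer:v1"                        # mutable tag
DIGEST_REF = "us-docker.pkg.dev/my-project/ml/trainer@sha256:" + "ab12" * 16  # immutable digest


def is_digest_pinned(ref: str) -> bool:
    """Return True when the image reference pins an immutable sha256 digest."""
    _, sep, digest = ref.partition("@sha256:")
    return bool(sep) and len(digest) == 64 and all(c in "0123456789abcdef" for c in digest)


print(is_digest_pinned(TAG_REF))     # False
print(is_digest_pinned(DIGEST_REF))  # True
```

\n\n\n\n<p>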
Tags are easier for humans but can be updated in some ecosystems. Verify how Google publishes and updates tags.<\/p>\n\n\n\n<p>8) <strong>Are Deep Learning Containers suitable for production inference?<\/strong><br\/>\nThey can be, but you should consider image size, startup time, and attack surface. Many production teams build slimmer inference images derived from a known base.<\/p>\n\n\n\n<p>9) <strong>How do I keep images patched?<\/strong><br\/>\nRegularly rebuild derived images using updated base images, scan them, and promote them through environments. Keep change logs tied to image digests.<\/p>\n\n\n\n<p>10) <strong>Can I run Deep Learning Containers on Cloud Run?<\/strong><br\/>\nCloud Run is optimized for stateless HTTP containers and has constraints (including no GPU in many configurations; verify current GPU support status and region availability). DLC images may be large and not ideal for fast cold starts. Validate carefully.<\/p>\n\n\n\n<p>11) <strong>How do I reduce container pull time on Kubernetes?<\/strong><br\/>\nUse smaller derived images, consider image caching strategies, and keep nodes and registries in the same region. For large-scale setups, pre-pulling can help (implementation-specific).<\/p>\n\n\n\n<p>12) <strong>How do I store model artifacts from training?<\/strong><br\/>\nUse Cloud Storage. Write checkpoints and outputs to a GCS bucket. Ensure the training job service account has the right IAM permissions.<\/p>\n\n\n\n<p>13) <strong>What\u2019s the difference between Deep Learning VM Images and Deep Learning Containers?<\/strong><br\/>\nVM Images are VM-based environments; Containers are portable Docker images. Containers are generally better for CI\/CD and portability; VMs may be simpler for interactive SSH workflows.<\/p>\n\n\n\n<p>14) <strong>Can I extend a DLC image with my own dependencies?<\/strong><br\/>\nYes\u2014this is a common pattern. 
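<\/p>\n\n\n\n<p>A minimal sketch of such a derived image, assuming an illustrative base image URI and a hypothetical <code>trainer\/<\/code> package; verify the current image list and registry in the official Deep Learning Containers docs before pinning anything:<\/p>\n\n\n\n

```dockerfile
# Illustrative base image URI; look up the current CPU/GPU image list in the
# official Deep Learning Containers docs and pin a specific version (or digest).
FROM gcr.io/deeplearning-platform-release/pytorch-cpu:latest

# Add only what the base lacks, with pinned versions for reproducibility.
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt

# Bring in training code and define the job entrypoint.
COPY trainer/ /app/trainer/
WORKDIR /app
ENTRYPOINT ["python", "-m", "trainer.task"]
```

\n\n\n\n<p>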
Keep additions minimal and document them.<\/p>\n\n\n\n<p>15) <strong>What if I need a framework version not available in DLC images?<\/strong><br\/>\nYou may need to build your own base image, or use a different curated source. Consider Vertex AI prebuilt containers or self-managed images if needed.<\/p>\n\n\n\n<p>16) <strong>How do I ensure supply chain security for ML images?<\/strong><br\/>\nUse private registries, scanning, signing\/provenance, and least-privilege access controls. Promote images through environments with approvals.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">17. Top Online Resources to Learn Deep Learning Containers<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Resource Type<\/th>\n<th>Name<\/th>\n<th>Why It Is Useful<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Official documentation<\/td>\n<td>Deep Learning Containers docs: https:\/\/cloud.google.com\/deep-learning-containers\/docs<\/td>\n<td>Authoritative overview, image list, supported workflows, and best practices<\/td>\n<\/tr>\n<tr>\n<td>Official pricing<\/td>\n<td>Vertex AI pricing: https:\/\/cloud.google.com\/vertex-ai\/pricing<\/td>\n<td>Training costs are typically where most spend happens when using DLC with Vertex AI<\/td>\n<\/tr>\n<tr>\n<td>Official pricing<\/td>\n<td>Artifact Registry pricing: https:\/\/cloud.google.com\/artifact-registry\/pricing<\/td>\n<td>Understand storage and operations costs for derived images<\/td>\n<\/tr>\n<tr>\n<td>Official pricing<\/td>\n<td>Cloud Build pricing: https:\/\/cloud.google.com\/build\/pricing<\/td>\n<td>Understand build minutes and build resource costs<\/td>\n<\/tr>\n<tr>\n<td>Official pricing<\/td>\n<td>Cloud Storage pricing: https:\/\/cloud.google.com\/storage\/pricing<\/td>\n<td>Key for datasets and model artifacts<\/td>\n<\/tr>\n<tr>\n<td>Pricing calculator<\/td>\n<td>Google Cloud Pricing Calculator: https:\/\/cloud.google.com\/products\/calculator<\/td>\n<td>Build 
region-accurate estimates without guessing<\/td>\n<\/tr>\n<tr>\n<td>Official training docs<\/td>\n<td>Vertex AI custom training overview: https:\/\/cloud.google.com\/vertex-ai\/docs\/training\/overview<\/td>\n<td>How to run custom jobs (common place to use DLC-derived images)<\/td>\n<\/tr>\n<tr>\n<td>Official tutorials<\/td>\n<td>Vertex AI training tutorials (index): https:\/\/cloud.google.com\/vertex-ai\/docs\/tutorials<\/td>\n<td>End-to-end examples you can adapt to DLC-based images<\/td>\n<\/tr>\n<tr>\n<td>Official architecture<\/td>\n<td>Google Cloud Architecture Center: https:\/\/cloud.google.com\/architecture<\/td>\n<td>Reference architectures for MLOps, security, networking, and operations<\/td>\n<\/tr>\n<tr>\n<td>Official videos<\/td>\n<td>Google Cloud Tech YouTube: https:\/\/www.youtube.com\/@googlecloudtech<\/td>\n<td>Product deep-dives and practical demos (search within channel for Vertex AI and containers)<\/td>\n<\/tr>\n<tr>\n<td>Samples<\/td>\n<td>GoogleCloudPlatform GitHub org: https:\/\/github.com\/GoogleCloudPlatform<\/td>\n<td>Trusted samples for Vertex AI, ML pipelines, and container workflows (verify relevance per repo)<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">18. 
Training and Certification Providers<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Institute<\/th>\n<th>Suitable Audience<\/th>\n<th>Likely Learning Focus<\/th>\n<th>Mode<\/th>\n<th>Website<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>DevOpsSchool.com<\/td>\n<td>DevOps engineers, platform teams, cloud engineers<\/td>\n<td>DevOps, CI\/CD, containers, cloud ops foundations that support ML platforms<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.devopsschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>ScmGalaxy.com<\/td>\n<td>Beginners to intermediate engineers<\/td>\n<td>Software delivery, SCM, DevOps practices useful for container-based ML workflows<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.scmgalaxy.com\/<\/td>\n<\/tr>\n<tr>\n<td>CloudOpsNow.in<\/td>\n<td>Cloud operations teams<\/td>\n<td>Cloud operations and practical administration patterns<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.cloudopsnow.in\/<\/td>\n<\/tr>\n<tr>\n<td>SreSchool.com<\/td>\n<td>SREs, reliability engineers<\/td>\n<td>Reliability engineering practices, monitoring, incident response for production services<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.sreschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>AiOpsSchool.com<\/td>\n<td>Ops + ML\/automation practitioners<\/td>\n<td>AIOps concepts, monitoring\/automation patterns<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.aiopsschool.com\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">19. 
Top Trainers<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Platform\/Site<\/th>\n<th>Likely Specialization<\/th>\n<th>Suitable Audience<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>RajeshKumar.xyz<\/td>\n<td>DevOps\/cloud tooling and practical engineering guidance (verify current offerings)<\/td>\n<td>Engineers seeking hands-on coaching<\/td>\n<td>https:\/\/rajeshkumar.xyz\/<\/td>\n<\/tr>\n<tr>\n<td>devopstrainer.in<\/td>\n<td>DevOps training and coaching<\/td>\n<td>Beginners to intermediate DevOps practitioners<\/td>\n<td>https:\/\/www.devopstrainer.in\/<\/td>\n<\/tr>\n<tr>\n<td>devopsfreelancer.com<\/td>\n<td>Freelance DevOps services\/training platform (verify current scope)<\/td>\n<td>Teams needing practical help with CI\/CD and ops<\/td>\n<td>https:\/\/www.devopsfreelancer.com\/<\/td>\n<\/tr>\n<tr>\n<td>devopssupport.in<\/td>\n<td>DevOps support and training (verify current offerings)<\/td>\n<td>Teams needing operational assistance<\/td>\n<td>https:\/\/www.devopssupport.in\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">20. 
Top Consulting Companies<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Company<\/th>\n<th>Likely Service Area<\/th>\n<th>Where They May Help<\/th>\n<th>Consulting Use Case Examples<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>cotocus.com<\/td>\n<td>Cloud\/DevOps consulting (verify current offerings)<\/td>\n<td>Architecture, cloud migrations, container platforms<\/td>\n<td>Container registry strategy, CI\/CD for ML images, cost controls<\/td>\n<td>https:\/\/cotocus.com\/<\/td>\n<\/tr>\n<tr>\n<td>DevOpsSchool.com<\/td>\n<td>DevOps and platform consulting\/training<\/td>\n<td>DevOps transformation, CI\/CD, platform engineering<\/td>\n<td>Building secure pipelines for DLC-derived images, operational readiness<\/td>\n<td>https:\/\/www.devopsschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>DEVOPSCONSULTING.IN<\/td>\n<td>DevOps consulting (verify current offerings)<\/td>\n<td>DevOps implementation support<\/td>\n<td>Container build\/release workflows, monitoring and governance foundations<\/td>\n<td>https:\/\/www.devopsconsulting.in\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">21. 
Career and Learning Roadmap<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">What to learn before this service<\/h3>\n\n\n\n<p>To use Deep Learning Containers effectively on Google Cloud, learn:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Docker fundamentals (images, tags\/digests, layers, Dockerfile best practices)<\/li>\n<li>Basic Google Cloud:\n<ul>\n<li>projects, billing, IAM, service accounts<\/li>\n<li>Cloud Storage<\/li>\n<\/ul>\n<\/li>\n<li>One runtime environment:\n<ul>\n<li>Vertex AI custom training, <strong>or<\/strong><\/li>\n<li>GKE basics, <strong>or<\/strong><\/li>\n<li>Compute Engine basics<\/li>\n<\/ul>\n<\/li>\n<li>ML basics: training vs inference, datasets, metrics, model artifacts<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">What to learn after this service<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Vertex AI MLOps:\n<ul>\n<li>pipelines, model registry, endpoints (as applicable)<\/li>\n<\/ul>\n<\/li>\n<li>Artifact Registry security:\n<ul>\n<li>vulnerability scanning, access controls, promotion workflows<\/li>\n<\/ul>\n<\/li>\n<li>Observability for ML workloads:\n<ul>\n<li>structured logging, metrics, tracing where appropriate<\/li>\n<\/ul>\n<\/li>\n<li>Cost optimization for GPU workloads<\/li>\n<li>Data engineering foundations:\n<ul>\n<li>BigQuery, Dataflow, feature stores (service choice depends on your stack)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Job roles that use it<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>ML Engineer<\/li>\n<li>Platform Engineer (ML Platform \/ Internal Developer Platform)<\/li>\n<li>DevOps Engineer supporting ML workloads<\/li>\n<li>SRE for ML serving systems<\/li>\n<li>Cloud Solutions Architect (AI\/ML)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Certification path (if available)<\/h3>\n\n\n\n<p>Deep Learning Containers itself is not a certification, but relevant Google Cloud certifications include:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Professional Cloud Developer<\/li>\n<li>Professional Data Engineer<\/li>\n<li>Professional Machine Learning Engineer<\/li>\n<\/ul>\n\n\n\n<p>Verify current certification names and outlines: 
https:\/\/cloud.google.com\/learn\/certification<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Project ideas for practice<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Build a reproducible training image pipeline:\n<ul>\n<li><code>FROM<\/code> DLC base image<\/li>\n<li>build with Cloud Build<\/li>\n<li>run on Vertex AI<\/li>\n<li>store artifacts in GCS with lifecycle rules<\/li>\n<\/ul>\n<\/li>\n<li>Create a GPU training job (once you have quota):\n<ul>\n<li>benchmark a small CNN on CPU vs GPU<\/li>\n<\/ul>\n<\/li>\n<li>Create a minimal inference API on GKE using a DLC-derived runtime (then shrink it)<\/li>\n<li>Implement image governance:\n<ul>\n<li>allowlist base images<\/li>\n<li>pin digests<\/li>\n<li>scan and block high-severity vulnerabilities (tooling varies)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">22. Glossary<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Deep Learning Containers (DLC)<\/strong>: Google-maintained Docker images with ML frameworks and dependencies for running on Google Cloud.<\/li>\n<li><strong>Artifact Registry<\/strong>: Google Cloud service to store and manage container images and other artifacts.<\/li>\n<li><strong>Container image<\/strong>: A packaged filesystem and metadata used to create container instances.<\/li>\n<li><strong>Tag<\/strong>: A human-readable label pointing to an image version (e.g., <code>:v1<\/code>).<\/li>\n<li><strong>Digest<\/strong>: An immutable identifier for an image (stronger reproducibility than tags).<\/li>\n<li><strong>Vertex AI custom training<\/strong>: Running training jobs on managed infrastructure using your code and (optionally) custom containers.<\/li>\n<li><strong>GKE (Google Kubernetes Engine)<\/strong>: Managed Kubernetes service on Google Cloud.<\/li>\n<li><strong>CUDA\/cuDNN<\/strong>: NVIDIA GPU computing libraries commonly needed for deep learning on GPUs.<\/li>\n<li><strong>Service account<\/strong>: An identity used by workloads to access Google Cloud 
APIs.<\/li>\n<li><strong>Least privilege<\/strong>: Security principle of granting only the minimum permissions required.<\/li>\n<li><strong>Egress<\/strong>: Network traffic leaving a region, VPC, or cloud provider, often billable.<\/li>\n<li><strong>Lifecycle policy<\/strong>: Automated rule for deleting\/moving objects in Cloud Storage after a period.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">23. Summary<\/h2>\n\n\n\n<p>Deep Learning Containers on Google Cloud is a curated catalog of container images for AI and ML workloads, designed to give you reproducible, framework-ready environments without rebuilding everything from scratch. It matters because it reduces dependency problems, improves portability across dev\/test\/prod, and supports a clean platform engineering approach where teams build their own derived images on top of approved bases.<\/p>\n\n\n\n<p>Deep Learning Containers fits best as the <strong>runtime foundation<\/strong> underneath services like Vertex AI custom training, GKE, and Compute Engine. Cost-wise, the major drivers are the compute you run (especially GPUs), data storage, and network movement\u2014not the images themselves. Security-wise, treat DLC as a base: pin versions, control access through IAM and private registries, scan derived images, and keep data and networking locked down.<\/p>\n\n\n\n<p>Use Deep Learning Containers when you want standardization and speed with containers; consider managed Vertex AI prebuilt paths or notebook services when you want less container operational work. 
A strong next step is to expand the lab into a real workflow: train a small model, checkpoint to Cloud Storage, and promote an immutable image digest through staging and production.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>AI and ML<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[53,51],"tags":[],"class_list":["post-549","post","type-post","status-publish","format-standard","hentry","category-ai-and-ml","category-google-cloud"],"_links":{"self":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts\/549","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/comments?post=549"}],"version-history":[{"count":0,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts\/549\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/media?parent=549"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/categories?post=549"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/tags?post=549"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}