Google Cloud Service Directory Tutorial: Architecture, Pricing, Use Cases, and Hands-On Guide for Distributed, hybrid, and multicloud

Category

Distributed, hybrid, and multicloud

1. Introduction

Service Directory is Google Cloud’s managed service registry for organizing, publishing, and discovering services across environments—Google Cloud, on‑prem, and multicloud—using a consistent API and IAM security model.

In simple terms, Service Directory is an “address book for services.” You register service endpoints (IP/port, or other connection details) and attach metadata. Clients then look up a service name and retrieve the endpoints and metadata they need to connect.

Technically, Service Directory provides a regional, project-scoped resource model (namespaces → services → endpoints) with metadata at each level. It exposes APIs for registration (create/update/delete) and lookup/resolve (discover endpoints) and is designed to integrate with service discovery patterns in distributed systems, including hybrid and multicloud topologies.

The main problem it solves is reliable service discovery and service metadata management when you have many microservices, multiple environments, and multiple runtime platforms—and you need a central registry that is governed by IAM, auditable, and consistent across teams.

2. What is Service Directory?

Official purpose

Service Directory is a fully managed service registry in Google Cloud that helps you discover services and their endpoints, and store service metadata in a structured way. It is commonly used as a foundational building block for service discovery in distributed, hybrid, and multicloud architectures.

Official documentation: https://cloud.google.com/service-directory/docs

Core capabilities

  • Service registration: Create and manage a hierarchy of namespaces, services, and endpoints.
  • Service discovery: Look up a service and retrieve its endpoints (optionally using filters and selection logic—verify supported filtering in the current docs).
  • Metadata management: Attach key/value metadata to namespaces, services, and endpoints to support routing decisions, environment selection, ownership, versioning, and policy enforcement.
  • IAM-governed access: Control who can register services and who can discover them.
  • Auditability: API activity is captured via Cloud Audit Logs (Admin Activity and Data Access logging behavior depends on configuration—verify in your org).

Major components (resource model)

Service Directory organizes data into a simple hierarchy:

  1. Namespace – A logical grouping (often “team”, “domain”, “environment”, or “platform boundary”). – Example: payments-prod, shared-platform, onprem-dc1.

  2. Service – Represents a discoverable service within a namespace. – Example: orders-api, users-grpc, inventory.

  3. Endpoint – A concrete endpoint for a service (commonly address + port), plus metadata. – Example: VM IP and port, an internal load balancer IP and port, or another reachable address in your network.

Important boundary: Service Directory stores endpoint information; it does not route traffic, perform health checks, or load balance by itself.
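The hierarchy maps directly onto resource names. A minimal sketch of how the three levels compose (the project, namespace, and service IDs here are hypothetical):

```python
# Service Directory resource names follow a fixed hierarchy:
# project -> location -> namespace -> service -> endpoint.
PROJECT_ID = "my-project"   # hypothetical project ID
LOCATION = "us-central1"    # Service Directory resources are regional

namespace = f"projects/{PROJECT_ID}/locations/{LOCATION}/namespaces/payments-prod"
service = f"{namespace}/services/orders-api"
endpoint = f"{service}/endpoints/vm-1"

print(endpoint)
# projects/my-project/locations/us-central1/namespaces/payments-prod/services/orders-api/endpoints/vm-1
```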

Service type

  • Managed control-plane registry (metadata + discovery API).
  • Clients/consumers connect directly to returned endpoints (data plane remains your responsibility).

Scope: regional, project-scoped resources

  • Service Directory resources are created in a location (typically a region) and are project-scoped.
  • You typically create: projects/PROJECT_ID/locations/REGION/namespaces/...
  • Design implication: if you operate across multiple regions, you’ll usually model replication or separate registries per region (see architecture section).

Exact location semantics and supported locations can evolve—verify current availability in the official docs.

How it fits into the Google Cloud ecosystem

Service Directory is frequently used alongside:

  • Compute Engine and GKE workloads that need a registry outside Kubernetes-native discovery.
  • Hybrid connectivity (Cloud VPN / Cloud Interconnect) where services span VPCs and on‑prem.
  • Service mesh / Envoy-based discovery patterns (often via other Google Cloud products that can consume service registries—verify current integration guidance in the docs for your specific mesh/Envoy setup).
  • Cloud IAM, Cloud Audit Logs, and Cloud Monitoring/Logging for governance and operations.

3. Why use Service Directory?

Business reasons

  • Standardize service discovery across teams and environments, reducing “tribal knowledge” and hard-coded endpoints.
  • Accelerate onboarding: new services are discoverable by convention and metadata instead of spreadsheets or ad-hoc documentation.
  • Enable platform governance: consistent naming, ownership metadata, and access controls.

Technical reasons

  • Decouple clients from infrastructure: clients discover endpoints at runtime rather than embedding IPs/DNS names.
  • Support hybrid and multicloud: store endpoints that live in Google Cloud, on‑prem, or another cloud (as long as the network path exists).
  • Metadata-driven discovery: clients can select endpoints based on metadata (version, environment, zone, shard, compliance domain), within the supported API capabilities.

Operational reasons

  • Central control plane: one place to register and update endpoints during migrations, failovers, or scaling events.
  • Auditable changes: “who changed endpoints” can be tracked via audit logs.
  • Safer rollouts: publish new endpoints alongside old ones and shift consumers gradually (client-side logic required).

Security/compliance reasons

  • IAM-based controls: restrict who can register/modify services vs who can only discover.
  • Least privilege: separate roles for platform team (registration) and application team (lookup).
  • Audit logging: meet operational and compliance expectations for change tracking.

Scalability/performance reasons

  • Avoid DIY registry pitfalls: building and operating your own Consul-, Eureka-, or etcd-style registry can be expensive and operationally risky.
  • Designed for distributed architectures: offers API-based lookup suited for modern service discovery workflows.

When teams should choose Service Directory

Choose it when you need one or more of the following:

  • A Google-managed registry with IAM and audit logs.
  • A service registry that works across runtimes (VMs, containers, on‑prem).
  • A structured way to attach and query service metadata.
  • A registry that can support hybrid and multicloud service discovery patterns.

When teams should not choose it

Avoid or reconsider Service Directory when:

  • You only need Kubernetes-native service discovery inside a single cluster (Kubernetes Services + CoreDNS is usually sufficient).
  • You need traffic routing, load balancing, or health checking from the registry itself (you’ll need Cloud Load Balancing, a service mesh, or your own discovery + routing logic).
  • You need a configuration store or secrets vault (use Secret Manager, Config Connector, or a dedicated config system).
  • You require global active-active registry semantics without region-aware design (Service Directory is location-based; multi-region design is on you).

4. Where is Service Directory used?

Industries

  • Financial services: strict environment separation, audit trails for endpoint changes, hybrid data centers.
  • Retail/e-commerce: microservices with frequent deployment and scaling.
  • Healthcare: controlled discovery across segmented networks; strong governance requirements.
  • Media/gaming: multi-region service deployments and latency-aware client selection.
  • Manufacturing/IoT: hybrid factories/on‑prem services combined with cloud analytics platforms.

Team types

  • Platform engineering teams building internal developer platforms (IDPs).
  • SRE/operations teams standardizing discovery and ownership metadata.
  • DevOps teams supporting multi-environment pipelines (dev/test/stage/prod).
  • Security teams enforcing IAM boundaries and auditing changes.

Workloads

  • Microservices on GKE and Compute Engine.
  • Hybrid services connected via Cloud VPN / Cloud Interconnect.
  • Multi-tenant internal APIs, shared platform services, and internal tools.

Architectures

  • Hub-and-spoke VPCs: central registry with controlled cross-VPC discovery.
  • Multi-region: per-region registries with replication pipelines.
  • Hybrid service catalog: on‑prem endpoints published to cloud consumers (and vice versa).

Real-world deployment contexts

  • Migrations: register both old (on‑prem) and new (cloud) endpoints during phased cutovers.
  • Shared services: publish internal platform services (auth, billing, logging collectors) used by many apps.
  • Partner ecosystems: controlled discovery for internal partner integration endpoints (within private networks).

Production vs dev/test usage

  • Dev/test: useful for validating naming standards, metadata conventions, and client lookup logic before production.
  • Production: most valuable when tightly integrated with CI/CD or automation that updates endpoints and metadata during deployments.

5. Top Use Cases and Scenarios

Below are realistic patterns where Service Directory is a good fit.

1) Hybrid service discovery (on‑prem to Google Cloud)

  • Problem: Cloud workloads need to call on‑prem services, but endpoints change and ownership is unclear.
  • Why Service Directory fits: Central registry with IAM; on‑prem endpoints can be registered and discovered by cloud clients.
  • Example: A GKE workload discovers the current on‑prem SAP proxy endpoint via Service Directory and connects over Cloud Interconnect.

2) Multi-environment endpoint management (dev/stage/prod)

  • Problem: Teams accidentally call prod from dev due to misconfigured endpoints.
  • Why it fits: Use namespaces per environment and strict IAM to reduce mistakes.
  • Example: payments-dev namespace is readable by dev apps; payments-prod is readable only by prod service accounts.

3) Service catalog for shared internal APIs

  • Problem: Teams don’t know which internal APIs exist, which versions are supported, or where to route.
  • Why it fits: Metadata (owner, SLA tier, version, contact) and standardized naming.
  • Example: A platform team publishes identity/auth service with endpoints for regional deployments and metadata for escalation.

4) Gradual migration from legacy endpoints

  • Problem: You must migrate clients from legacy VMs to new services without breaking everything.
  • Why it fits: Register both old and new endpoints; clients can select based on metadata (or use a staged rollout logic).
  • Example: Endpoints tagged legacy=true and version=v1 are phased out as clients switch to version=v2.

5) Blue/green backend discovery (client-side)

  • Problem: You want blue/green releases without relying on a load balancer for every internal call.
  • Why it fits: Two sets of endpoints registered with metadata color=blue/green; clients choose.
  • Example: Canary clients resolve only color=green during validation; once green is validated, all clients switch to color=green and the blue endpoints are retired.

6) Service mesh registry backing (integration-dependent)

  • Problem: Envoy-based service-to-service discovery needs a consistent registry across heterogeneous runtimes.
  • Why it fits: Service Directory can act as a registry used by control planes (integration specifics vary).
  • Example: A hybrid mesh uses Service Directory as one registry source for VM workloads (verify current recommended setup in your mesh docs).

7) Central registry for multi-cluster GKE workloads

  • Problem: Multiple clusters host services; clients need a stable place to find endpoints.
  • Why it fits: Externalized registry not tied to one cluster.
  • Example: A client in cluster A resolves a service that runs in cluster B via endpoints published by automation.

8) Operational ownership and routing metadata

  • Problem: Incidents are slowed by unclear ownership and missing service details.
  • Why it fits: Store on-call, repo link, runbook link, criticality, and region metadata.
  • Example: metadata: {ownerTeam=platform, oncall=pagerduty://..., runbook=https://...}.

9) Network-segmented discovery (shared VPC / multiple projects)

  • Problem: Different projects need to discover shared services, but you must restrict modification rights.
  • Why it fits: IAM controls plus project organization patterns; discovery can be granted without registration privileges.
  • Example: A shared services project hosts Service Directory; app projects get viewer/lookup access only.

10) Disaster recovery endpoint publishing

  • Problem: During failover, clients must discover DR endpoints quickly and safely.
  • Why it fits: Update endpoints or metadata to shift consumers; audit trail helps governance.
  • Example: Add DR endpoints with priority=1 during incident; clients prefer lower priority numbers (client logic).
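The priority convention above is entirely client-side logic; Service Directory only stores the metadata. A sketch of the selection, assuming endpoints carry a priority metadata key (the key name and dict shape are conventions for this example, not product features):

```python
def pick_preferred(endpoints, default_priority=100):
    """Pick the endpoint with the lowest numeric 'priority' metadata value.

    `endpoints` is a list of dicts shaped like resolved Service Directory
    endpoints: {"address": ..., "port": ..., "metadata": {...}}.
    """
    def priority(ep):
        try:
            return int(ep.get("metadata", {}).get("priority", default_priority))
        except ValueError:
            return default_priority
    return min(endpoints, key=priority)

eps = [
    {"address": "10.0.0.10", "port": 80, "metadata": {"priority": "10"}},
    {"address": "10.1.0.10", "port": 80, "metadata": {"priority": "1"}},  # DR endpoint, preferred during incident
]
print(pick_preferred(eps)["address"])  # 10.1.0.10
```

During normal operation the DR endpoint would simply be absent or carry a high priority value; publishing it with priority=1 during an incident shifts clients without any code change.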

11) Internal tooling and automation

  • Problem: Scripts and operators need an authoritative source of service endpoints.
  • Why it fits: API-driven registry; can integrate with CI/CD.
  • Example: A deployment pipeline registers a new VM MIG’s internal load balancer address after rollout.

12) Multicloud shared service discovery (with network connectivity)

  • Problem: Services run in multiple clouds; you want one registry for discovery.
  • Why it fits: Endpoints can represent any reachable IP/hostname; IAM governs access.
  • Example: A Google Cloud workload discovers an AWS-hosted internal service endpoint reachable via VPN and uses it for cross-cloud calls.

6. Core Features

1) Hierarchical resource organization (namespaces → services → endpoints)

  • What it does: Provides structured grouping for service discovery.
  • Why it matters: Prevents “flat list chaos” and enables clear ownership and boundaries.
  • Practical benefit: You can map namespaces to teams/environments and services to APIs, with endpoints representing backends.
  • Caveats: Naming conventions are your responsibility; poor naming leads to confusing discovery.

2) Endpoint registration (address + port + metadata)

  • What it does: Stores endpoint connection details and metadata for discovery.
  • Why it matters: Clients can connect to the correct backend without hardcoding.
  • Practical benefit: Supports VM IPs, internal load balancers, on‑prem IPs, and more.
  • Caveats: Service Directory does not validate endpoint reachability; you must ensure networking and health separately.

3) Metadata at multiple levels

  • What it does: Lets you attach key/value metadata to namespaces, services, and endpoints.
  • Why it matters: Enables ownership, routing decisions, and environment separation.
  • Practical benefit: Tag endpoints with region, zone, version, complianceDomain, etc.
  • Caveats: Metadata is not a secret store. Don’t store credentials or sensitive data.

4) Lookup and discovery APIs

  • What it does: Clients query a service name and retrieve endpoint data.
  • Why it matters: Enables runtime discovery and reduces manual configuration.
  • Practical benefit: A client can resolve endpoints at startup or periodically refresh.
  • Caveats: Clients must implement retry/backoff and caching as appropriate.
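The caching caveat is worth making concrete. A sketch of a TTL cache around a resolve call; the `fetch` callable is a stand-in for an actual Service Directory resolve request:

```python
import time

class CachedResolver:
    """Cache resolve results for `ttl` seconds to avoid resolving per request."""

    def __init__(self, fetch, ttl=30.0, clock=time.monotonic):
        self._fetch = fetch      # callable returning the endpoint list
        self._ttl = ttl
        self._clock = clock
        self._cached = None
        self._expires = 0.0

    def endpoints(self):
        now = self._clock()
        if self._cached is None or now >= self._expires:
            self._cached = self._fetch()
            self._expires = now + self._ttl
        return self._cached

# Example with a stubbed fetch; in real use, fetch would call the
# Service Directory resolve API.
calls = []
resolver = CachedResolver(lambda: calls.append(1) or [("10.0.0.10", 80)], ttl=30.0)
resolver.endpoints()
resolver.endpoints()
print(len(calls))  # 1 -- the second call is served from cache
```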

5) IAM-based access control

  • What it does: Controls who can create/update/delete vs who can view/resolve.
  • Why it matters: Prevents unauthorized endpoint registration and reduces supply-chain-style risks.
  • Practical benefit: Platform team can own registration; apps can have read-only discovery.
  • Caveats: Misconfigured IAM (overbroad roles) can let unintended parties redirect traffic by changing endpoints.

6) Audit logging via Cloud Audit Logs

  • What it does: Captures administrative actions and (depending on settings) data access events.
  • Why it matters: Supports governance, investigations, and compliance.
  • Practical benefit: You can trace “who changed endpoint X at time Y”.
  • Caveats: Data Access logs may be disabled by default in some orgs; verify your logging configuration.

7) Regional location model

  • What it does: Resources are created in a specific location.
  • Why it matters: Impacts latency, availability patterns, and multi-region design.
  • Practical benefit: You can align registry location with service region.
  • Caveats: Cross-region discovery strategies are on you (replicate, or design clients to query multiple locations).
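One way to handle the cross-region caveat is to query a per-region registry and merge the results, tagging each endpoint with its source region. A minimal sketch (the resolve calls are stubbed; in real use each callable would resolve against that region's Service Directory location):

```python
def merge_regional_endpoints(resolvers):
    """Query one registry per region and merge the endpoint lists.

    `resolvers` maps region -> callable returning that region's endpoints.
    """
    merged = []
    for region, resolve in resolvers.items():
        for ep in resolve():
            merged.append({**ep, "region": region})
    return merged

resolvers = {
    "us-central1": lambda: [{"address": "10.0.0.10", "port": 80}],
    "us-east1": lambda: [{"address": "10.1.0.10", "port": 80}],
}
eps = merge_regional_endpoints(resolvers)
print([ep["region"] for ep in eps])  # ['us-central1', 'us-east1']
```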

8) Automation-friendly (CLI, REST, client libraries)

  • What it does: Provides APIs and tools to manage registrations.
  • Why it matters: Enables integration with CI/CD and infrastructure automation.
  • Practical benefit: Pipelines can register endpoints after deploy; cleanup can deregister on teardown.
  • Caveats: Ensure automation uses least-privilege service accounts and is protected from tampering.

7. Architecture and How It Works

High-level architecture

Service Directory is a managed registry control plane. Producers (deployment automation, platform tools, or operators) register services and endpoints. Consumers (applications, gateways, or proxies) query the registry to retrieve endpoints and metadata, then connect directly.

Key idea: Service Directory is not in the data path. It does not proxy your traffic; it helps clients find where to send traffic.

Control flow (registration)

  1. A deployment pipeline (or operator) creates or updates namespaces, services, and endpoints.
  2. Metadata is attached to help discovery and governance.
  3. IAM governs who can perform each action.
  4. Changes are captured in audit logs.

Data flow (discovery)

  1. A client authenticates to Google Cloud (service account).
  2. Client calls Service Directory lookup/resolve API.
  3. Client receives service + endpoints + metadata.
  4. Client chooses an endpoint (e.g., random, round-robin, metadata-based selection).
  5. Client connects to that endpoint over the network path you’ve configured.
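Steps 4–5 are entirely client logic; Service Directory returns the data and the client decides. A sketch of metadata-based selection followed by a random pick among the matches (the metadata keys are conventions you define, not product semantics):

```python
import random

def select_endpoint(endpoints, required_metadata, rng=random):
    """Keep endpoints whose metadata matches all required key/value pairs,
    then pick one at random; raise if nothing matches."""
    matches = [
        ep for ep in endpoints
        if all(ep.get("metadata", {}).get(k) == v for k, v in required_metadata.items())
    ]
    if not matches:
        raise LookupError(f"no endpoint matches {required_metadata}")
    return rng.choice(matches)

eps = [
    {"address": "10.0.0.10", "port": 80, "metadata": {"env": "prod", "version": "v1"}},
    {"address": "10.0.0.11", "port": 80, "metadata": {"env": "prod", "version": "v2"}},
]
chosen = select_endpoint(eps, {"version": "v2"})
print(chosen["address"])  # 10.0.0.11
```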

Integrations with related services (common patterns)

  • Cloud IAM: enforce least privilege for registration and discovery.
  • Cloud Audit Logs: record endpoint changes for governance.
  • Cloud Logging/Monitoring: observe API usage patterns and investigate failures (exact metrics vary; verify available metrics in Cloud Monitoring).
  • Compute Engine / GKE / on‑prem: service endpoints typically live here.
  • Hybrid networking: Cloud VPN / Cloud Interconnect to make endpoints reachable across environments.
  • Service meshes / Envoy-based solutions: may consume Service Directory as a registry source depending on product and configuration—verify the current recommended integration path in the docs for your mesh/control plane.

Dependency services

  • Service Directory API (servicedirectory.googleapis.com)
  • IAM for authorization
  • Cloud Resource Manager / Service Usage for enabling APIs and managing quotas
  • Network connectivity between consumers and endpoints (VPC, VPN, Interconnect)

Security/authentication model

  • Uses standard Google Cloud authentication:
  • User credentials (developer workflows)
  • Service account credentials (production workloads)
  • Authorization is enforced by IAM roles granted at org/folder/project/resource level.
  • Recommended: use dedicated service accounts for registrars and consumers.

Networking model

  • Service Directory itself is accessed via Google APIs (control plane).
  • Endpoints returned can be:
  • Private RFC1918 IPs in VPCs
  • On‑prem IPs reachable via VPN/Interconnect
  • Internal load balancer addresses
  • Consumers must have network reachability to endpoints; Service Directory does not create routes or firewall rules.

Monitoring/logging/governance considerations

  • Audit logs are essential for “endpoint tampering” detection.
  • Create alerts on:
  • Unusual spikes in endpoint updates
  • Unauthorized attempts (permission denied)
  • CI/CD service account anomalies
  • Consider building policy checks:
  • Enforce metadata keys (owner, env, data classification)
  • Validate endpoint address ranges (e.g., only allow private IP blocks)
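The address-range check in the last bullet can be implemented with the standard library as a pre-registration policy gate; a sketch using `ipaddress` (the allowed blocks here are the RFC 1918 private ranges, as an example policy):

```python
import ipaddress

# Example policy: endpoints may only use private (RFC 1918) address blocks.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]

def address_allowed(address):
    """Return True if the endpoint address falls inside an allowed block."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return False  # not an IP literal; hostnames need a separate policy
    return any(ip in net for net in ALLOWED_NETWORKS)

print(address_allowed("10.20.30.40"))  # True
print(address_allowed("203.0.113.5"))  # False -- public address rejected
```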

Simple architecture diagram (Mermaid)

flowchart LR
  A[Deployment pipeline / Operator] -->|Register endpoints| SD[(Service Directory)]
  C[Client service] -->|Lookup/Resolve| SD
  C -->|Connect using returned address:port| E1[Endpoint 1]
  C -->|Connect using returned address:port| E2[Endpoint 2]

Production-style architecture diagram (Mermaid)

flowchart TB
  subgraph Org[Organization]
    subgraph Shared[Shared Services Project]
      SD[(Service Directory<br/>regional)]
      LOG[Cloud Logging / Audit Logs]
      IAM[Cloud IAM]
    end

    subgraph ProdVPC[Prod VPC / Shared VPC]
      subgraph RegionA[us-central1]
        SVC1[Service: orders-api]
        EP1[(Endpoint A1<br/>VM/MIG/ILB)]
        EP2[(Endpoint A2<br/>VM/MIG/ILB)]
      end

      subgraph RegionB[us-east1]
        EP3[(Endpoint B1<br/>DR/secondary)]
      end

      subgraph Clients[Client Workloads]
        GKE[GKE workloads]
        VM[Compute Engine clients]
      end
    end
  end

  IAM --> SD
  SD --> LOG

  SVC1 -.metadata/endpoints.-> SD
  EP1 -.registered.-> SD
  EP2 -.registered.-> SD
  EP3 -.registered.-> SD

  GKE -->|Lookup/Resolve via Google APIs| SD
  VM -->|Lookup/Resolve via Google APIs| SD

  GKE -->|Private traffic| EP1
  GKE -->|Private traffic| EP2
  GKE -->|Failover / selection logic| EP3

8. Prerequisites

Account/project requirements

  • A Google Cloud project with billing enabled.
  • Ability to enable APIs in the project.

Permissions / IAM roles

You will typically need:

  • Permission to enable APIs: roles/serviceusage.serviceUsageAdmin (or equivalent).
  • Service Directory administration for the lab: a role such as roles/servicedirectory.admin (recommended for learning in a sandbox).
  • Compute Engine admin permissions for VM creation: roles/compute.admin (or a limited set: instance admin + network admin).

Role names and least-privilege combinations can vary; verify in official IAM role docs for Service Directory: – https://cloud.google.com/service-directory/docs/access-control

Billing requirements

  • Service Directory usage may incur charges (see Pricing section).
  • Compute Engine VMs used in the tutorial can incur compute and disk charges.

CLI/SDK/tools needed

  • Cloud Shell (recommended) or local installation of:
  • Google Cloud CLI (gcloud)
  • Optional for the lab:
  • Python 3 on a client VM (we’ll install via apt)
  • pip to install the Service Directory client library

Region availability

  • Choose a region supported by Service Directory (commonly used examples include us-central1).
  • Verify current supported locations: https://cloud.google.com/service-directory/docs/locations

Quotas/limits

  • Service Directory quotas exist for resources and API usage (namespaces, services, endpoints, requests).
  • Compute Engine quotas apply for VM creation.
  • Verify quotas in:
  • Google Cloud Console → IAM & Admin → Quotas
  • Service Directory quotas documentation (verify current page in official docs)

Prerequisite services/APIs

Enable at minimum:

  • Service Directory API: servicedirectory.googleapis.com
  • Compute Engine API: compute.googleapis.com

9. Pricing / Cost

Service Directory is a managed Google Cloud service with usage-based pricing. Exact SKUs, rates, and free-tier details can change and may differ by location. Do not rely on blog posts or old numbers.

Official pricing page: – https://cloud.google.com/service-directory/pricing

Google Cloud Pricing Calculator: – https://cloud.google.com/products/calculator

Pricing dimensions (typical model to verify)

Service registries commonly charge based on a combination of:

  • Number of registered resources (e.g., endpoints stored)
  • Number of API operations (registrations, lookups/resolves)
  • Possibly “stored metadata” or other dimensions

Service Directory’s exact billing dimensions should be confirmed on the official pricing page. If you are planning production use, validate:

  • What counts as a billable lookup/resolve
  • Whether endpoint storage is billed per endpoint per hour/month
  • Any free tier or always-free usage thresholds (if offered)

Cost drivers

Direct cost drivers (verify in pricing docs):

  • High number of endpoints (especially ephemeral endpoints that are frequently created and destroyed)
  • High lookup QPS (clients that resolve too frequently without caching)
  • Automation that updates endpoints very often

Indirect cost drivers:

  • Compute/networking: the endpoints you register might live behind load balancers, VMs, or interconnect links that have their own costs.
  • Logging: Audit/Data Access logs can increase Logging ingestion/storage costs if enabled at high volume.
  • Cross-region traffic: if discovery results in cross-region calls, your application may incur inter-region network charges.

Network/data transfer implications

  • API calls to Service Directory are Google API calls; network egress from Google Cloud to Google APIs is typically not billed the same way as general internet egress, but billing and routing depend on environment (Cloud Shell vs VM vs on‑prem). Verify your specific scenario.
  • The real network cost often comes from service-to-service traffic between clients and the discovered endpoints:
  • Same-zone/region internal traffic patterns
  • Cross-region traffic
  • Cross-cloud or on‑prem via VPN/Interconnect

How to optimize cost

  • Cache discovery results on the client side with a reasonable TTL (your own caching policy).
  • Avoid resolving on every request. Resolve:
  • At startup
  • On a schedule
  • On failure with backoff
  • Keep endpoint churn low. Prefer registering stable endpoints (e.g., internal load balancer VIPs) when possible.
  • Use metadata wisely to reduce unnecessary endpoint sets returned to clients.

Example low-cost starter estimate (no fabricated numbers)

For a small lab:

  • A few namespaces/services/endpoints
  • Occasional lookups from a handful of clients
  • Low API volume

Cost should typically be small, but verify with:

  • The Service Directory pricing page (for storage + requests)
  • The Pricing Calculator (to model lookups and endpoint counts)
  • Compute Engine VM costs if you run the hands-on lab VMs

Example production cost considerations

In production, cost planning should include:

  • Number of services and endpoints across regions/environments
  • Expected lookup/resolve QPS per client and total across the fleet
  • Logging/audit requirements (Data Access logs can be high volume)
  • Network topology (cross-region and hybrid traffic patterns)
  • Whether you can register load balancer VIPs instead of every pod/VM endpoint

10. Step-by-Step Hands-On Tutorial

This lab builds a small, real service discovery workflow:

  • Two backend VMs running NGINX (each returns a different response)
  • One client VM that queries Service Directory to discover endpoints
  • The client then curls the discovered endpoints over internal IPs

This demonstrates what Service Directory is (registry + metadata) and what it is not (it won’t load balance; the client chooses endpoints).

Objective

Create a Service Directory namespace and service, register two VM endpoints with metadata, and perform discovery from a client VM using the Service Directory API.

Lab Overview

You will:

  1. Enable required APIs and set environment variables.
  2. Create two backend VMs and one client VM in a region.
  3. Create a Service Directory namespace and service.
  4. Register endpoints using the backend VMs’ internal IPs and port 80.
  5. Run a Python discovery script on the client VM to fetch endpoints and call them.
  6. Clean up all resources.

Step 1: Set project, region, and enable APIs

Open Cloud Shell and run:

gcloud auth list
gcloud config list project

Set variables (edit values if needed):

export PROJECT_ID="$(gcloud config get-value project)"
export REGION="us-central1"
export ZONE="us-central1-a"

# Names for the lab
export SD_NAMESPACE="lab-namespace"
export SD_SERVICE="hello-service"

# VM names
export VM_BACKEND_1="sd-backend-1"
export VM_BACKEND_2="sd-backend-2"
export VM_CLIENT="sd-client-1"

Enable APIs:

gcloud services enable servicedirectory.googleapis.com compute.googleapis.com

Expected outcome – APIs enable successfully (may take 30–90 seconds).

Verification

gcloud services list --enabled --filter="name:(servicedirectory.googleapis.com compute.googleapis.com)"

Step 2: Create two backend VMs that serve distinct responses

We’ll create small Compute Engine VMs with a startup script that installs NGINX and sets a unique home page.

Backend 1:

gcloud compute instances create "$VM_BACKEND_1" \
  --zone "$ZONE" \
  --machine-type "e2-micro" \
  --image-family "debian-12" \
  --image-project "debian-cloud" \
  --metadata startup-script='#! /bin/bash
set -e
apt-get update
apt-get install -y nginx
echo "Hello from backend-1" > /var/www/html/index.html
systemctl enable nginx
systemctl restart nginx
'

Backend 2:

gcloud compute instances create "$VM_BACKEND_2" \
  --zone "$ZONE" \
  --machine-type "e2-micro" \
  --image-family "debian-12" \
  --image-project "debian-cloud" \
  --metadata startup-script='#! /bin/bash
set -e
apt-get update
apt-get install -y nginx
echo "Hello from backend-2" > /var/www/html/index.html
systemctl enable nginx
systemctl restart nginx
'

Expected outcome – Two VMs are created and start NGINX on port 80.

Verification

Get internal IPs:

export BACKEND_1_IP="$(gcloud compute instances describe "$VM_BACKEND_1" --zone "$ZONE" --format='value(networkInterfaces[0].networkIP)')"
export BACKEND_2_IP="$(gcloud compute instances describe "$VM_BACKEND_2" --zone "$ZONE" --format='value(networkInterfaces[0].networkIP)')"

echo "$BACKEND_1_IP"
echo "$BACKEND_2_IP"

At this point you can’t directly curl internal IPs from Cloud Shell. We’ll do that from a client VM next.

Step 3: Create a client VM to perform discovery and connectivity tests

Create the client VM:

gcloud compute instances create "$VM_CLIENT" \
  --zone "$ZONE" \
  --machine-type "e2-micro" \
  --image-family "debian-12" \
  --image-project "debian-cloud" \
  --scopes "cloud-platform"

The cloud-platform scope lets the VM’s default service account call Google APIs such as Service Directory (the default Compute Engine scopes do not include it). The service account also needs an IAM role that allows lookups—for example roles/servicedirectory.viewer; verify exact role names in the access-control docs.

SSH into the client VM:

gcloud compute ssh "$VM_CLIENT" --zone "$ZONE"

From inside the VM, verify you can reach both backends on their internal IPs. Note that variables exported in Cloud Shell are not available inside this SSH session, so substitute the literal IPs you printed earlier:

curl -s "http://BACKEND_1_INTERNAL_IP/" && echo
curl -s "http://BACKEND_2_INTERNAL_IP/" && echo

Expected outcome – Output:

Hello from backend-1
Hello from backend-2

Exit SSH for now:

exit

Step 4: Create a Service Directory namespace and service

In Cloud Shell, create the namespace:

gcloud service-directory namespaces create "$SD_NAMESPACE" \
  --location "$REGION"

Create the service:

gcloud service-directory services create "$SD_SERVICE" \
  --location "$REGION" \
  --namespace "$SD_NAMESPACE"

Optionally add metadata (useful in real environments):

gcloud service-directory services update "$SD_SERVICE" \
  --location "$REGION" \
  --namespace "$SD_NAMESPACE" \
  --metadata owner=platform-team,env=lab,protocol=http

Note: --metadata replaces the service’s full metadata map; verify the exact flag behavior with gcloud service-directory services update --help.

Expected outcome – A namespace and service exist in the chosen region.

Verification

gcloud service-directory namespaces describe "$SD_NAMESPACE" --location "$REGION"
gcloud service-directory services describe "$SD_SERVICE" --location "$REGION" --namespace "$SD_NAMESPACE"

Step 5: Register the two backend endpoints (internal IP + port 80)

Create endpoint entries. We’ll also attach endpoint metadata like version and zone.

Endpoint 1:

gcloud service-directory endpoints create "backend-1" \
  --location "$REGION" \
  --namespace "$SD_NAMESPACE" \
  --service "$SD_SERVICE" \
  --address "$BACKEND_1_IP" \
  --port "80" \
  --metadata version=v1,instance=backend-1,zone="$ZONE"

Endpoint 2:

gcloud service-directory endpoints create "backend-2" \
  --location "$REGION" \
  --namespace "$SD_NAMESPACE" \
  --service "$SD_SERVICE" \
  --address "$BACKEND_2_IP" \
  --port "80" \
  --annotations version=v1,instance=backend-2,zone="$ZONE"

Expected outcome – Two endpoints are registered under the service.

Verification

gcloud service-directory endpoints list \
  --location "$REGION" \
  --namespace "$SD_NAMESPACE" \
  --service "$SD_SERVICE"

Describe one endpoint:

gcloud service-directory endpoints describe "backend-1" \
  --location "$REGION" \
  --namespace "$SD_NAMESPACE" \
  --service "$SD_SERVICE"

Step 6: Discover endpoints from the client VM using the Service Directory API (Python)

Now we’ll run a discovery script from the client VM. This is closer to a real workload pattern: a runtime uses its service account to query the registry.

SSH into the client VM:

gcloud compute ssh "$VM_CLIENT" --zone "$ZONE"

Install Python tooling and the client library. Debian 12 marks the system Python as externally managed (PEP 668), so a virtual environment is the most reliable approach:

sudo apt-get update
sudo apt-get install -y python3-pip python3-venv
python3 -m venv ~/sd-venv
source ~/sd-venv/bin/activate
pip install google-cloud-service-directory

Create a script discover.py:

cat > discover.py <<'PY'
import os
from google.cloud import servicedirectory_v1

PROJECT_ID = os.environ["PROJECT_ID"]
REGION = os.environ["REGION"]
NAMESPACE = os.environ["SD_NAMESPACE"]
SERVICE = os.environ["SD_SERVICE"]

service_name = f"projects/{PROJECT_ID}/locations/{REGION}/namespaces/{NAMESPACE}/services/{SERVICE}"

# The Lookup API resolves a service name to its current endpoints.
client = servicedirectory_v1.LookupServiceClient()
response = client.resolve_service(request={"name": service_name})
svc = response.service

print(f"Service: {svc.name}")
print(f"Annotations: {dict(svc.annotations)}")
print("Endpoints:")
for ep in svc.endpoints:
    print(f"- {ep.name}: {ep.address}:{ep.port} annotations={dict(ep.annotations)}")
PY

Export environment variables on the VM (use the same values as Cloud Shell):

export PROJECT_ID="$(gcloud config get-value project)"
export REGION="us-central1"
export SD_NAMESPACE="lab-namespace"
export SD_SERVICE="hello-service"

Run the script:

python3 discover.py

Expected outcome – You see the service name and two endpoints with their internal IPs and ports.

Step 7: Use discovery results to call the endpoints

From the client VM, curl each backend using the addresses printed by discover.py (the variables exported in Cloud Shell are not set inside the VM, so paste the literal IPs):

curl -s "http://ENDPOINT_IP/" && echo

Run it once per endpoint address. In a real app, you would parse the endpoint list and connect accordingly.

Expected outcome – Each backend again returns its greeting:

Hello from backend-1
Hello from backend-2
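The "parse the endpoint list and connect" step can be sketched as follows. This is a minimal illustration, not the library's prescribed pattern: resolve_endpoints assumes the same PROJECT_ID/REGION/SD_NAMESPACE/SD_SERVICE environment variables as discover.py, and round_robin is ordinary client-side selection logic.

```python
# Minimal sketch: resolve endpoints once, then rotate across them.
import itertools
import os
import urllib.request


def resolve_endpoints():
    """Return (address, port) tuples for the lab service from Service Directory."""
    from google.cloud import servicedirectory_v1  # deferred so the selector stays reusable

    name = (
        f"projects/{os.environ['PROJECT_ID']}/locations/{os.environ['REGION']}"
        f"/namespaces/{os.environ['SD_NAMESPACE']}/services/{os.environ['SD_SERVICE']}"
    )
    client = servicedirectory_v1.LookupServiceClient()
    response = client.resolve_service(request={"name": name})
    return [(ep.address, ep.port) for ep in response.service.endpoints]


def round_robin(endpoints):
    """Simple client-side selection: cycle through the endpoint list forever."""
    return itertools.cycle(endpoints)


def call(endpoint, timeout=5):
    """Fetch '/' from one endpoint and return the response body."""
    address, port = endpoint
    with urllib.request.urlopen(f"http://{address}:{port}/", timeout=timeout) as resp:
        return resp.read().decode()
```

On the client VM (with the environment variables exported) you could then run `picker = round_robin(resolve_endpoints())` and call `print(call(next(picker)))` in a loop to alternate between backend-1 and backend-2.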

Validation

From Cloud Shell: – Confirm registry contents:

gcloud service-directory endpoints list \
  --location "$REGION" \
  --namespace "$SD_NAMESPACE" \
  --service "$SD_SERVICE"

From the client VM (re-activate your virtualenv first if you used one): – Confirm lookup returns endpoints and metadata:

python3 discover.py

Network validation: – Confirm internal connectivity:

curl -s "http://<endpoint-ip>/" 

Troubleshooting

Common issues and fixes:

  1. PERMISSION_DENIED when calling Service Directory – Cause: The VM’s service account (or your user) lacks lookup permissions. – Fix:

    • In a lab, grant a role like roles/servicedirectory.viewer (or least privilege needed) to the VM service account.
    • Verify required permissions in: https://cloud.google.com/service-directory/docs/access-control
  2. API not enabled or servicedirectory.googleapis.com has not been used – Fix: run gcloud services enable servicedirectory.googleapis.com

  3. Python dependency errors – Fix: Ensure pip3 is installed and you used pip3 install --user .... – If your environment blocks user installs, use a virtualenv: python3 -m venv venv && source venv/bin/activate && pip install google-cloud-service-directory

  4. Client VM cannot reach backend internal IP – Cause: Network/firewall issue or NGINX not started yet. – Fix:

    • Wait 1–2 minutes after VM creation (startup script time).
    • SSH to the backend and check: sudo systemctl status nginx --no-pager
    • Confirm you’re using the internal IP and both VMs are in the same VPC (default network in this lab).
  5. gcloud: Invalid choice: 'service-directory' – Cause: Older Google Cloud CLI. – Fix: Update gcloud with gcloud components update – If the command group differs in your environment, verify the current CLI reference: https://cloud.google.com/sdk/gcloud/reference

Cleanup

To avoid ongoing costs, delete Service Directory resources and VMs.

Delete endpoints, service, namespace:

gcloud service-directory endpoints delete "backend-1" \
  --location "$REGION" --namespace "$SD_NAMESPACE" --service "$SD_SERVICE" --quiet

gcloud service-directory endpoints delete "backend-2" \
  --location "$REGION" --namespace "$SD_NAMESPACE" --service "$SD_SERVICE" --quiet

gcloud service-directory services delete "$SD_SERVICE" \
  --location "$REGION" --namespace "$SD_NAMESPACE" --quiet

gcloud service-directory namespaces delete "$SD_NAMESPACE" \
  --location "$REGION" --quiet

Delete VMs:

gcloud compute instances delete "$VM_CLIENT" "$VM_BACKEND_1" "$VM_BACKEND_2" \
  --zone "$ZONE" --quiet

Expected outcome – All lab resources are removed.

11. Best Practices

Architecture best practices

  • Prefer stable endpoints when possible: Register internal load balancer VIPs or gateway addresses rather than every ephemeral instance, unless you truly need per-instance discovery.
  • Design multi-region intentionally:
    • Use per-region namespaces/services, or
    • Replicate entries across regions with automation, or
    • Have clients query multiple locations (if that fits your latency/availability goals).
  • Separate environments cleanly: Use namespaces per environment (dev, stage, prod) and separate projects when appropriate.

IAM/security best practices

  • Split registrar vs consumer identities:
    • Registrar service account: create/update/delete endpoints.
    • Consumer service accounts: lookup/resolve only.
  • Use least privilege:
    • Avoid granting admin rights broadly.
    • Grant access at the narrowest resource scope you can (project vs namespace vs service—verify supported IAM granularity in current docs).
  • Protect the registrar pipeline:
    • CI/CD credentials should be stored securely.
    • Use workload identity where possible.
  • Implement guardrails:
    • Validate endpoint address ranges (e.g., only allow RFC 1918).
    • Require metadata keys like owner, env, dataClassification.
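The guardrail ideas above can be sketched as a pre-registration check run in CI/CD before calling the registration API. The allowed ranges and required keys here are illustrative assumptions to adapt to your own schema.

```python
import ipaddress

# Allowed internal ranges (RFC 1918) and a required metadata schema —
# both are illustrative assumptions; adapt them to your standards.
RFC1918 = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]
REQUIRED_KEYS = {"owner", "env"}


def validate_endpoint(address, metadata):
    """Return a list of violations; an empty list means registration may proceed."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return [f"{address!r} is not a valid IP address"]
    problems = []
    if not any(ip in net for net in RFC1918):
        problems.append(f"{address} is outside the allowed RFC 1918 ranges")
    missing = REQUIRED_KEYS - metadata.keys()
    if missing:
        problems.append(f"missing required metadata keys: {sorted(missing)}")
    return problems


print(validate_endpoint("10.1.2.3", {"owner": "platform-team", "env": "lab"}))  # []
print(validate_endpoint("8.8.8.8", {"owner": "platform-team"}))  # two violations
```

A pipeline would fail the deployment whenever the returned list is non-empty, before any gcloud or API call runs.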

Cost best practices

  • Cache lookup results in clients to reduce API calls.
  • Avoid high-frequency polling; use refresh intervals and exponential backoff on errors.
  • Minimize endpoint churn: frequent create/delete cycles can raise operational overhead and cost (verify pricing dimensions).
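The caching and backoff advice above can be sketched as a small wrapper. This is a sketch, not a library feature: fetch stands in for whatever resolve call your client makes, and the TTL and retry values are illustrative.

```python
import random
import time


class CachedResolver:
    """Cache endpoint lookups with a TTL; retry with backoff and fall back to stale data."""

    def __init__(self, fetch, ttl_seconds=300, max_retries=3):
        self._fetch = fetch              # callable returning a list of endpoints
        self._ttl = ttl_seconds
        self._max_retries = max_retries
        self._cached = None
        self._expires = 0.0

    def endpoints(self):
        now = time.monotonic()
        if self._cached is not None and now < self._expires:
            return self._cached          # fresh cache hit: no API call
        for attempt in range(self._max_retries):
            try:
                self._cached = self._fetch()
                self._expires = time.monotonic() + self._ttl
                return self._cached
            except Exception:
                if attempt + 1 < self._max_retries:
                    # exponential backoff with jitter, capped at 10 seconds
                    time.sleep(min(2 ** attempt + random.random(), 10))
        if self._cached is not None:
            return self._cached          # serve stale data rather than fail hard
        raise RuntimeError("lookup failed and no cached endpoints are available")
```

Wrapping the resolve call this way means steady-state traffic costs one API call per TTL window instead of one per request, which directly addresses the cost and churn points above.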

Performance best practices

  • Client-side selection: Implement efficient endpoint selection (round robin/random) and keep a small in-memory cache.
  • Use timeouts and retries on discovery calls. Treat registry calls as dependencies and plan for transient failures.
  • Avoid oversharing endpoints: if filters are supported for your use case, reduce the returned endpoint set to what the client needs.

Reliability best practices

  • Fail-safe behavior:
    • If lookup fails, use cached endpoints (within a safe TTL) rather than failing hard immediately.
  • Health awareness:
    • Service Directory doesn’t health check endpoints; integrate with health checks at your load balancer/mesh, or implement client-side failover.
  • Change management:
    • Use staged endpoint updates and observe client behavior.

Operations best practices

  • Logging and auditing:
    • Enable and retain audit logs appropriate to your compliance requirements.
    • Monitor who changes endpoints and when.
  • Naming conventions:
    • Make names predictable and searchable (team-env, domain-service, etc.).
  • Automation:
    • Keep registry updates in pipelines rather than manual steps.
    • Build a cleanup process to remove stale endpoints.
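The stale-endpoint cleanup could be sketched as a reconciliation job. This is an assumed pattern, not a built-in feature: the source of live addresses (your MIG, CMDB, etc.) and the dry-run default are choices you would adapt, and the client method names should be verified against the current Python library reference.

```python
# Sketch of a cleanup job: compare registered endpoint addresses against
# the addresses you know are live, and delete the rest.
def stale_endpoints(registered, live_addresses):
    """registered: iterable of (endpoint_name, address); return names not in live_addresses."""
    return [name for name, address in registered if address not in live_addresses]


def reconcile(project, region, namespace, service, live_addresses, dry_run=True):
    """List registered endpoints, then delete any whose address is no longer live."""
    from google.cloud import servicedirectory_v1  # deferred: the diff logic stays testable

    client = servicedirectory_v1.RegistrationServiceClient()
    parent = (
        f"projects/{project}/locations/{region}"
        f"/namespaces/{namespace}/services/{service}"
    )
    registered = [
        (ep.name, ep.address)
        for ep in client.list_endpoints(request={"parent": parent})
    ]
    for name in stale_endpoints(registered, live_addresses):
        print(("DRY RUN would delete" if dry_run else "deleting"), name)
        if not dry_run:
            client.delete_endpoint(request={"name": name})
```

Running this on a schedule with dry_run=True first gives you a safe audit trail before enabling deletions.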

Governance/tagging/naming best practices

  • Use a standard metadata schema:
    • ownerTeam, env, serviceTier, repo, runbook, oncall, region, version
    • Document what each key means and enforce it in CI/CD.
  • Avoid using metadata as an uncontrolled dumping ground; define a schema and review process.

12. Security Considerations

Identity and access model

  • Service Directory uses Cloud IAM.
  • Treat “who can register or update endpoints” as a high-risk permission because it can redirect production traffic.

Recommendations:

  • Give update permissions only to trusted automation identities.
  • Give read-only/lookup permissions to application service accounts that need discovery.
  • Use separate projects or namespaces for prod vs non-prod and enforce IAM boundaries.

Encryption

  • Data in Google Cloud managed services is typically encrypted at rest and in transit; confirm the specific guarantees in the product security documentation for Service Directory (verify in official docs).
  • Clients connect to the Service Directory API over TLS (standard Google APIs).

Network exposure

  • The registry is accessed via Google APIs; consider:
    • Using private connectivity approaches for Google APIs if required by your environment (e.g., Private Google Access for VMs without external IP—verify applicability to your network design).
  • The discovered endpoints might be private or public; you must enforce network policy:
    • VPC firewall rules
    • Segmentation between environments
    • VPN/Interconnect routing controls

Secrets handling

  • Do not store secrets (API keys, passwords, certificates) in Service Directory metadata.
  • Store secrets in Secret Manager and reference them indirectly (e.g., by secret resource name if appropriate, but consider whether that still leaks sensitive info).
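One hedged pattern for the indirection above: store only a Secret Manager resource name in the registry metadata and resolve it at runtime. The tlsCertSecret key is an invented convention, not a Service Directory feature, and whether exposing even the secret's name is acceptable depends on your threat model.

```python
def secret_from_metadata(metadata, key="tlsCertSecret"):
    """Resolve a secret whose Secret Manager resource name is stored in metadata.

    The metadata holds only a reference like
    projects/PROJECT/secrets/NAME/versions/latest — never the secret itself.
    """
    secret_name = metadata.get(key)
    if secret_name is None:
        return None
    from google.cloud import secretmanager  # deferred; only needed when a reference exists

    client = secretmanager.SecretManagerServiceClient()
    response = client.access_secret_version(request={"name": secret_name})
    return response.payload.data.decode()
```

The consumer needs secretmanager.versions.access on that secret in addition to its Service Directory lookup role, which keeps secret access auditable separately from discovery.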

Audit/logging

  • Use Cloud Audit Logs to track endpoint changes and suspicious activity.
  • Consider exporting logs to SIEM and alerting on:
    • Endpoint address changes in prod namespaces
    • Large-scale deletions
    • Changes outside deployment windows

Compliance considerations

  • Audit trails help with compliance controls (change management, least privilege).
  • If you have residency requirements, confirm the location behavior and data handling in official docs.

Common security mistakes

  • Granting servicedirectory.admin to broad groups.
  • Allowing developers to modify prod endpoints directly.
  • Storing secrets in metadata.
  • Registering endpoints that are reachable from unintended networks (e.g., accidentally publishing a public IP).

Secure deployment recommendations

  • Use separate projects for prod registries with restricted IAM.
  • Require CI/CD approvals for endpoint changes to critical services.
  • Implement automated validation:
    • Endpoint must be within allowed CIDR ranges
    • Required metadata keys present
    • Namespace/service naming conventions enforced

13. Limitations and Gotchas

Service Directory is intentionally narrow in scope. Plan for these realities:

  • Not a load balancer: it returns endpoints; it does not distribute traffic.
  • No built-in health checking: it won’t remove unhealthy endpoints automatically unless you build automation to do so.
  • Regional resource model: multi-region discovery requires explicit design (replication, per-region registries, or multi-location queries).
  • Network reachability is your job: registry entries do not create routing/firewall rules.
  • Metadata is not a config/secrets store: keep metadata non-sensitive and small.
  • Quotas apply: resources (namespaces/services/endpoints) and request rates are quota-controlled. Verify current quotas in the console and docs.
  • Consistency expectations: treat registry updates as eventually consistent unless the docs guarantee otherwise—verify consistency behavior if you need strong guarantees.
  • Cross-project discovery: possible via IAM, but governance and ownership can become complex; define clear boundaries and naming.
  • Operational drift: stale endpoints can accumulate if you don’t automate deregistration on decommission.
  • Pricing surprises: high-frequency resolution without caching can drive up API usage charges (verify exact pricing dimensions).

14. Comparison with Alternatives

Service discovery overlaps with DNS, load balancing, service mesh, and self-managed registries. Here’s how to choose.

Common alternatives in Google Cloud

  • Cloud DNS (private zones): great for name-to-IP mapping; less suited for rich service metadata and structured service registry workflows.
  • GKE/Kubernetes service discovery (Service + CoreDNS): best inside a cluster; doesn’t naturally span hybrid/multicloud without additional patterns.
  • Service mesh registries/routing (product-dependent): typically handle routing and telemetry, but may still rely on or integrate with registries.
  • Cloud Load Balancing: excellent for traffic distribution and health checking, but not a general service registry.

Alternatives in other clouds

  • AWS Cloud Map: AWS’s managed service discovery and registry.
  • HashiCorp Consul (self-managed or managed depending on environment): popular cross-platform service registry with health checks (operational overhead).
  • Netflix Eureka / etcd-based registries: self-managed patterns with significant operational costs.

Comparison table

  • Google Cloud Service Directory
    Best for: Registry + metadata + IAM-governed discovery across hybrid/multicloud
    Strengths: Managed, IAM integration, structured resources, auditability
    Weaknesses: Not in the data path, no health checks, regional design required
    Choose when: You want a managed service registry in Google Cloud for distributed, hybrid, and multicloud discovery
  • Cloud DNS (private zones)
    Best for: Simple name resolution in VPCs
    Strengths: Simple, ubiquitous, works with legacy apps
    Weaknesses: Limited metadata model; not a service registry; update workflows differ
    Choose when: You only need DNS-based resolution and simple records
  • Kubernetes Services + CoreDNS
    Best for: Discovery inside a Kubernetes cluster
    Strengths: Native, automatic, low friction
    Weaknesses: Cluster-scoped; hybrid/multicloud needs extra tooling
    Choose when: Your services and clients are in the same cluster and DNS is enough
  • Cloud Load Balancing
    Best for: L4/L7 routing, health checks, stable VIPs
    Strengths: Health checks, traffic distribution, reliability
    Weaknesses: Not a registry; doesn’t store service catalog metadata
    Choose when: You need routing/load balancing; register the LB VIP in Service Directory if desired
  • AWS Cloud Map
    Best for: AWS-native service registry/discovery
    Strengths: AWS integration, managed
    Weaknesses: Tied to the AWS ecosystem
    Choose when: Your workloads are primarily on AWS
  • HashiCorp Consul
    Best for: Cross-platform service discovery with health checks
    Strengths: Rich features, service mesh integration, health checks
    Weaknesses: Operational overhead, scaling and upgrades
    Choose when: You need advanced discovery + health checking and accept the ops burden

15. Real-World Example

Enterprise example: Hybrid banking platform with strict governance

Problem

A bank runs customer and transaction services on-prem for regulatory and latency reasons, while analytics and new microservices run on Google Cloud. Teams struggle with endpoint sprawl, unclear ownership, and risky manual changes during migrations.

Proposed architecture

  • A dedicated “shared services” Google Cloud project hosts Service Directory in each primary region.
  • Namespaces reflect environment and domain: core-prod, core-stage, analytics-prod.
  • On‑prem services (reachable via Cloud Interconnect) are registered as endpoints with metadata: ownerTeam, pciScope=true/false, region, drTier, runbook.
  • Application workloads in GKE use service accounts with lookup-only permissions.
  • CI/CD pipelines (restricted service accounts) update endpoints during releases and failovers.
  • Audit logs are exported to a central logging project and SIEM, with alerts on endpoint changes in prod.

Why Service Directory was chosen

  • IAM-governed registry with auditability fits regulated change management.
  • Works across hybrid endpoints (on‑prem + cloud) without forcing everything into Kubernetes.
  • Metadata supports operational ownership and compliance tagging.

Expected outcomes

  • Reduced endpoint misconfiguration incidents.
  • Faster migrations and controlled cutovers.
  • Improved audit readiness due to centralized, logged endpoint changes.

Startup/small-team example: Multi-region SaaS with shared internal APIs

Problem

A small SaaS team runs services across two regions for availability. They need a simple way for background workers and internal services to discover the correct API endpoints without hardcoding and without running a self-managed registry.

Proposed architecture

  • One Service Directory namespace per environment (prod, stage), per region.
  • Register internal load balancer VIPs as endpoints for each service.
  • Clients cache discovery results and refresh every few minutes.
  • Use metadata: region, priority, version.
  • Simple selection logic prefers local-region endpoints and fails over to the secondary region.

Why Service Directory was chosen

  • Low operational overhead compared to self-managed Consul/Eureka.
  • Integrates cleanly with Google Cloud IAM and supports automation via gcloud/API.

Expected outcomes

  • Faster iteration with fewer config changes.
  • Controlled failover behavior without manually updating many clients.
  • Clear ownership metadata as the team grows.

16. FAQ

  1. Is Service Directory a load balancer?
    No. Service Directory provides discovery (returns endpoints). Load balancing and routing require a load balancer, service mesh, or client-side balancing logic.

  2. Does Service Directory health check endpoints?
    Not by itself. If you need health-based endpoint removal, build automation or rely on a load balancer/mesh that performs health checks.

  3. Is Service Directory global or regional?
    Service Directory resources are created in a specific location (commonly a region). Multi-region designs require explicit planning (replication or per-region registries). Verify current location behavior in the docs.

  4. Can I register on‑prem endpoints?
    Yes—if clients can reach those endpoints over VPN/Interconnect and IAM allows discovery.

  5. Can I register endpoints from another cloud (AWS/Azure)?
    You can register any reachable endpoint address/port. Practical success depends on network connectivity and governance.

  6. Should I store secrets in Service Directory metadata?
    No. Use Secret Manager for secrets. Metadata should be non-sensitive.

  7. How do clients authenticate to Service Directory?
    Using Google Cloud authentication (service accounts for workloads). Client libraries and ADC (Application Default Credentials) are typical.

  8. How do I restrict who can change endpoints?
    Use IAM: grant registration/update privileges only to CI/CD or platform operators; grant lookup privileges to consumers.

  9. Can multiple projects share one registry?
    Often yes by granting IAM access across projects, but governance becomes important. Many organizations host registries in a shared services project.

  10. How should I model namespaces?
    Common patterns: namespace per environment (prod, stage) and domain/team (payments-prod). Choose a model that matches ownership and access boundaries.

  11. Does Service Directory replace DNS?
    Not necessarily. DNS is still useful for many workloads. Service Directory is a richer registry for service discovery + metadata. Some architectures use both.

  12. How often should clients call lookup/resolve?
    Avoid per-request resolution. Cache results and refresh periodically or on failure. The right interval depends on how often endpoints change.

  13. What happens if Service Directory is temporarily unavailable?
    Treat it like any dependency: use cached endpoints, apply retries with backoff, and fail gracefully.

  14. Can I use Service Directory with GKE?
    Yes, especially when you need discovery outside cluster boundaries or want a centralized registry. For purely in-cluster discovery, Kubernetes Services may be enough.

  15. Is Service Directory suitable for internet-facing service discovery?
    It’s primarily used for internal discovery in distributed, hybrid, and multicloud setups. If you publish public endpoints, carefully control IAM and consider whether DNS or an API gateway is more appropriate.

  16. How do I prevent stale endpoints?
    Automate deregistration on instance termination and run periodic reconciliation (compare registry entries to actual backends).

  17. Can I attach arbitrary metadata keys?
    You can attach key/value metadata, but limits apply (size/count). Verify the current limits in official docs and standardize a schema.

17. Top Online Resources to Learn Service Directory

  • Official documentation: Service Directory Docs (https://cloud.google.com/service-directory/docs) — canonical overview, concepts, APIs, and operational guidance
  • Pricing: Service Directory Pricing (https://cloud.google.com/service-directory/pricing) — current billing model and SKU dimensions (verify before production)
  • API reference: Service Directory API Reference (https://cloud.google.com/service-directory/docs/reference/rest) — REST resources/methods, request/response fields
  • Access control: Service Directory Access Control (https://cloud.google.com/service-directory/docs/access-control) — IAM roles/permissions and secure patterns
  • Locations: Service Directory Locations (https://cloud.google.com/service-directory/docs/locations) — where the service is available and location behavior
  • CLI reference: gcloud reference, search “service-directory” (https://cloud.google.com/sdk/gcloud/reference) — up-to-date CLI commands and flags for automation
  • Client libraries: Google Cloud Client Libraries (https://cloud.google.com/apis/docs/client-libraries-explained) — how to use ADC and client libs consistently
  • Python library: google-cloud-service-directory, verify latest (https://cloud.google.com/python/docs/reference/servicedirectory/latest) — practical Python API usage for lookup/registration (library surface may evolve)
  • Architecture guidance: Google Cloud Architecture Center (https://cloud.google.com/architecture) — broader distributed/hybrid patterns relevant to registries and discovery
  • Hands-on labs: Google Cloud Skills Boost catalog, search “Service Directory” (https://www.cloudskillsboost.google/catalog) — guided labs if available for your subscription (catalog changes over time)
  • Videos: Google Cloud Tech / YouTube, search “Service Directory” (https://www.youtube.com/@googlecloudtech) — talks and demos that help with conceptual understanding

18. Training and Certification Providers

  1. DevOpsSchool.com
    Suitable audience: DevOps engineers, SREs, platform teams, cloud engineers
    Likely learning focus: Google Cloud fundamentals, DevOps practices, automation, service discovery patterns
    Mode: check website
    Website: https://www.devopsschool.com/

  2. ScmGalaxy.com
    Suitable audience: Beginners to intermediate DevOps learners, engineers moving into cloud/DevOps
    Likely learning focus: SCM/CI-CD foundations, DevOps tooling, cloud basics
    Mode: check website
    Website: https://www.scmgalaxy.com/

  3. CloudOpsNow.in
    Suitable audience: Cloud operations and DevOps practitioners
    Likely learning focus: Cloud operations, monitoring, automation, operational readiness
    Mode: check website
    Website: https://cloudopsnow.in/

  4. SreSchool.com
    Suitable audience: SREs, operations teams, reliability-focused engineers
    Likely learning focus: SRE practices, reliability engineering, incident response, monitoring
    Mode: check website
    Website: https://sreschool.com/

  5. AiOpsSchool.com
    Suitable audience: Ops teams exploring AIOps, monitoring/observability engineers
    Likely learning focus: AIOps concepts, automation, observability, operational analytics
    Mode: check website
    Website: https://aiopsschool.com/

19. Top Trainers

  1. RajeshKumar.xyz
    Likely specialization: DevOps/cloud training content and workshops (verify current offerings on site)
    Suitable audience: Beginners to working professionals
    Website: https://rajeshkumar.xyz/

  2. devopstrainer.in
    Likely specialization: DevOps training programs (tools, CI/CD, cloud)
    Suitable audience: DevOps engineers, students, career switchers
    Website: https://devopstrainer.in/

  3. devopsfreelancer.com
    Likely specialization: Freelance DevOps guidance/training and practical support (verify offerings)
    Suitable audience: Small teams and individuals needing targeted help
    Website: https://devopsfreelancer.com/

  4. devopssupport.in
    Likely specialization: DevOps support and training resources (verify current scope)
    Suitable audience: Teams needing operational support and skill-building
    Website: https://devopssupport.in/

20. Top Consulting Companies

  1. cotocus.com
    Likely service area: Cloud/DevOps consulting (verify current practice areas on website)
    Where they may help: Architecture reviews, platform modernization, automation pipelines
    Consulting use case examples:

    • Designing a service discovery strategy for hybrid workloads
    • Automating endpoint registration/deregistration in CI/CD
    • IAM and audit logging review for registries
    Website: https://cotocus.com/
  2. DevOpsSchool.com
    Likely service area: DevOps consulting, implementation support, training-led delivery
    Where they may help: CI/CD, cloud migration support, SRE/DevOps practices adoption
    Consulting use case examples:

    • Implementing Google Cloud landing zones and shared services projects
    • Building automation for Service Directory registrations
    • Operational runbooks and incident response processes
    Website: https://www.devopsschool.com/
  3. DEVOPSCONSULTING.IN
    Likely service area: DevOps and cloud consulting (verify current offerings)
    Where they may help: DevOps toolchains, cloud operations, reliability improvements
    Consulting use case examples:

    • Standardizing service discovery patterns across environments
    • Security hardening and least-privilege IAM for registries
    • Observability and audit logging integration
    Website: https://devopsconsulting.in/

21. Career and Learning Roadmap

What to learn before Service Directory

  • Google Cloud fundamentals:
    • Projects, IAM, service accounts
    • VPC networking basics (subnets, firewall rules, internal vs external IPs)
  • Basics of distributed systems:
    • Service discovery concepts (client-side vs server-side)
    • Failure modes (partial failures, retries, backoff)
  • Basic automation:
    • gcloud CLI usage
    • Infrastructure-as-code fundamentals (Terraform concepts help, even if not required)

What to learn after Service Directory

  • Cloud Load Balancing patterns (internal/external) for traffic distribution and health checks
  • Service mesh fundamentals (Envoy/Istio concepts) if you need routing, mTLS, and telemetry
  • Hybrid connectivity: Cloud VPN, Cloud Interconnect, DNS design
  • Observability:
    • Cloud Logging, Cloud Monitoring
    • Audit log analysis and alerting
  • Policy and governance:
    • Organization policies
    • CI/CD controls and approvals

Job roles that use it

  • Cloud/Platform Engineer
  • DevOps Engineer
  • Site Reliability Engineer (SRE)
  • Solutions Architect
  • Security Engineer (for IAM/audit governance)
  • Backend Engineer working on microservices/platform integration

Certification path (Google Cloud)

Service Directory is not typically a standalone certification topic, but it supports skills tested in broader certifications:

  • Associate Cloud Engineer
  • Professional Cloud Architect
  • Professional Cloud DevOps Engineer

Verify current exam guides on Google Cloud’s certification site: https://cloud.google.com/learn/certification

Project ideas for practice

  • Build a small microservices demo where clients discover services via Service Directory and apply metadata-based selection (e.g., prefer same-zone endpoints).
  • Create a CI/CD pipeline step that registers a new internal load balancer VIP after deployment and deregisters on rollback.
  • Implement an endpoint reconciliation job that removes stale entries by comparing registry endpoints with your actual backends (MIGs, GKE services, etc.).
  • Add security guardrails: validate that registered endpoints are only in approved CIDR ranges and contain required metadata.

22. Glossary

  • Service discovery: The process of finding the network location (and sometimes metadata) of a service at runtime.
  • Service registry: A database/system that stores service names and their endpoints for discovery.
  • Namespace (Service Directory): A grouping container for services, often mapped to an environment, domain, or team boundary.
  • Service (Service Directory): A named service within a namespace that clients can discover.
  • Endpoint (Service Directory): A concrete address/port (and metadata) representing where a service can be reached.
  • Metadata: Key/value attributes attached to namespaces/services/endpoints (e.g., owner, version, region).
  • IAM (Identity and Access Management): Google Cloud’s authorization system controlling who can do what.
  • Audit Logs: Logs that record administrative and data-access events for Google Cloud resources.
  • Hybrid cloud: Architecture spanning on‑prem and cloud environments.
  • Multicloud: Architecture spanning multiple cloud providers.
  • Client-side load balancing: Clients choose an endpoint from a discovered set (random/round-robin/weighted) rather than using a centralized load balancer.
  • Control plane: Management layer (registration/discovery APIs, policies). Not the same as traffic/data plane.
  • Data plane: The actual application traffic between clients and service endpoints.

23. Summary

Service Directory is Google Cloud’s managed service registry for distributed, hybrid, and multicloud architectures. It provides a structured model (namespaces, services, endpoints) and an API for registering endpoints and discovering them at runtime, with strong integration into IAM and Cloud Audit Logs.

It matters because it helps teams standardize service discovery, reduce hard-coded configuration, and improve governance—especially when workloads span GKE, VMs, on‑prem, and multiple regions. Cost is usage-based (verify exact SKUs on the official pricing page), and the biggest operational cost drivers are typically endpoint churn and excessive discovery calls without caching. Security hinges on strict IAM for who can modify endpoints and on audit log monitoring.

Use Service Directory when you need a Google-managed registry with metadata and governance. Pair it with load balancers, service mesh, and good client-side caching for production-grade reliability.

Next step: review the official docs and implement a production-ready pattern that includes least-privilege IAM, automated registration/deregistration, caching, and clear multi-region design decisions.