Azure Foundry Models Tutorial: Architecture, Pricing, Use Cases, and Hands-On Guide for AI + Machine Learning

Category

AI + Machine Learning

1. Introduction

What this service is

Foundry Models is the Azure experience for discovering, selecting, and using foundation models (and related model assets) inside Azure AI Foundry. It focuses on the “model layer” of building AI applications—helping you evaluate which model to use, deploy it (where applicable), and call it from your apps securely and at scale.

Simple explanation (one paragraph)

If you want to add generative AI (chat, summarization, extraction, embeddings, classification) to an application, Foundry Models helps you choose a model and use it in Azure without stitching together a lot of separate product pages and guesswork.

Technical explanation (one paragraph)

Technically, Foundry Models sits in the Azure AI Foundry portal and documentation ecosystem and works with underlying Azure resources (commonly Azure OpenAI and/or other Azure AI model hosting/inference offerings, depending on what models are available to you). It provides a catalog-style experience, model details (model cards/metadata), and deployment/inference entry points. Exact deployment mechanics (serverless vs provisioned, key vs Microsoft Entra ID auth, networking options, quotas) depend on the specific model family and the Azure service hosting the model—so you must confirm the supported path for the model you choose in the official docs.

What problem it solves

Teams often struggle with: – Picking the right model (capabilities, cost, latency, safety constraints) – Understanding how to deploy and call it properly in Azure – Governing access, monitoring usage, and controlling costs

Foundry Models centralizes model discovery and provides an Azure-aligned workflow to move from “model selection” to “working endpoint” with governance in mind.

Naming note (verify in official docs): Microsoft has been evolving the Azure AI Studio/Foundry naming. If you see Azure AI Studio in documentation URLs, it may still be the canonical docs path even when the portal branding shows Azure AI Foundry. The service term Foundry Models is used throughout this tutorial exactly as requested.


2. What is Foundry Models?

Official purpose

Foundry Models is intended to help you find and use models in Azure AI Foundry. This typically includes: – Browsing a catalog of available models (foundation models and related variants) – Understanding model capabilities and constraints (context limits, modalities, licensing/terms, intended use) – Deploying or connecting to a model endpoint (when supported) – Getting code snippets or guidance to integrate the model into your application

Because Microsoft’s model offerings and hosting options evolve, treat Foundry Models as the Azure AI Foundry model consumption layer, and confirm the exact model hosting backend for your chosen model in official docs.

Core capabilities

Common capabilities you can expect around Foundry Models include: – Model discovery: search/filter by task (chat, embeddings, vision), provider, capability, etc. – Model details: model card-like information, supported input/output types, limitations – Model usage: deployment or “use this model” guidance to call an inference API – Governance hooks: integration with Azure identity, logging, and Azure policy patterns via underlying services

Major components (conceptual)

Foundry Models is best understood as an experience composed of: 1. Azure AI Foundry portal UI (model browsing and configuration) 2. Model catalog metadata (model cards, version info, constraints) 3. Underlying inference/hosting service (often Azure OpenAI for OpenAI models; other Azure AI model inference paths may apply) 4. Security boundary and access (resource-based IAM, keys, Microsoft Entra ID) 5. Observability (usage metrics/logs depending on hosting service)

Service type

Foundry Models is primarily a managed cloud experience inside the Azure AI Foundry ecosystem. Billing and runtime behavior depend on the underlying Azure service used to host/serve the model.

Scope: regional/global/zonal and subscription/project alignment

  • Model availability is region-dependent. Many foundation models are only available in certain Azure regions, and availability can change.
  • Access is subscription- and resource-scoped. You typically consume models through an Azure resource (for example, an Azure OpenAI resource) tied to a subscription and region.
  • Project scoping (portal experience) may apply. Azure AI Foundry often organizes work into projects/hubs/workspaces (terminology can evolve). Confirm current project structure in docs.

How it fits into the Azure ecosystem

Foundry Models fits into a typical Azure AI + Machine Learning stack like this: – App layer: Web apps, APIs, Azure Functions, container apps – AI orchestration: Prompt workflows, evaluations, retrieval augmentation patterns – Model layer (Foundry Models): selection, deployment, endpoint usage – Data layer: Azure AI Search, Storage, Cosmos DB, SQL – Security and ops: Key Vault, Private Link, Monitor, Policy, Defender for Cloud


3. Why use Foundry Models?

Business reasons

  • Faster time-to-value: reduce the “which model do we pick and how do we use it?” cycle time.
  • Risk reduction: model metadata and governance patterns help avoid accidental misuse.
  • Cost transparency: model choice impacts token usage, latency, and cost; Foundry Models helps compare at the decision point.

Technical reasons

  • Centralized model discovery aligned to Azure deployment paths.
  • Repeatable integration patterns for apps (endpoint-based inference with standard auth).
  • Model lifecycle clarity: versions, capabilities, and constraints can be tracked more systematically than ad-hoc model selection.

Operational reasons

  • Easier standardization: platform teams can recommend a small set of approved models.
  • Monitoring alignment: usage can be tracked through underlying Azure resources.
  • Scalability: select deployment/consumption patterns that match production requirements (through supported hosting services).

Security/compliance reasons

  • Azure IAM integration: managed identity / Microsoft Entra ID is often available depending on the underlying model service.
  • Network controls: options like Private Link may exist depending on the model hosting service.
  • Auditability: resource-level logs and metrics support compliance operations.

Scalability/performance reasons

  • Model choice optimization: pick smaller/faster models where appropriate.
  • Regional deployment: choose regions closer to users (where model availability allows).
  • Quota-aware scaling: plan around token-per-minute / request-per-minute style quotas (varies by service/model).

When teams should choose it

Choose Foundry Models when: – You want a curated Azure workflow for selecting and consuming models. – You need enterprise controls (IAM, logging, private networking where supported). – You want to standardize model selection for multiple teams.

When teams should not choose it

Consider alternatives when: – You must run models fully offline/on-prem with no cloud calls. – You require a specific open-source model variant not offered in your Azure regions. – You need deep training pipelines beyond what your selected model/hosting service supports (you may need Azure Machine Learning for full customization). – You are locked into another cloud’s model ecosystem (e.g., AWS Bedrock or Vertex AI) for organizational reasons.


4. Where is Foundry Models used?

Industries

Foundry Models patterns appear across: – Financial services (document understanding, support copilots, compliance summarization) – Healthcare (clinical note summarization with strong governance requirements) – Retail/e-commerce (product Q&A, support automation, personalization) – Manufacturing (maintenance logs summarization, knowledge search) – Software/SaaS (in-app assistants, developer copilots, ticket triage) – Public sector (case summarization, citizen support—subject to compliance constraints)

Team types

  • Application development teams integrating AI into products
  • Platform/Cloud Center of Excellence teams standardizing AI usage
  • Security and compliance teams reviewing AI service controls
  • DevOps/SRE teams operating production endpoints
  • Data/ML teams doing evaluation and model selection

Workloads

  • Chatbots and copilots
  • Summarization and extraction pipelines
  • Embeddings generation and semantic search (RAG)
  • Classification, routing, moderation-assisted flows
  • Multimodal use cases (text + image) where supported by chosen models

Architectures

  • API-driven microservices calling model endpoints
  • Event-driven pipelines (Functions) that summarize or extract from documents
  • RAG: AI Search + model for grounded responses
  • Multi-region active/active or active/passive designs (model availability permitting)

Real-world deployment contexts

  • Production: private networking, key rotation, monitoring/alerting, quota management, fallback models
  • Dev/Test: smaller models, strict spend caps, sampling-based evaluation, sandbox subscriptions

5. Top Use Cases and Scenarios

Below are realistic scenarios where Foundry Models helps you get from “idea” to “working model endpoint” in Azure.

1) Customer support chat assistant

  • Problem: Support agents waste time searching knowledge bases and writing repetitive answers.
  • Why Foundry Models fits: Helps select a chat-capable model and integrate it with an application workflow.
  • Example: A support portal calls a model endpoint to draft replies; agents approve/edit before sending.

2) RAG-based internal knowledge search

  • Problem: Employees can’t find policies and technical docs quickly.
  • Why it fits: Lets you choose a model for grounded Q&A and pair it with Azure AI Search.
  • Example: Index SharePoint exports into Azure AI Search; model generates answers with citations.

3) Meeting notes and action item extraction

  • Problem: Teams lose follow-ups buried in transcripts.
  • Why it fits: Choose a cost-effective model for summarization and structured extraction.
  • Example: After a Teams meeting, a function triggers and produces summary + JSON action items.

4) Invoice/contract clause summarization (human-in-the-loop)

  • Problem: Legal and finance teams need fast review but must keep humans in control.
  • Why it fits: Select a model suitable for summarization with governance controls.
  • Example: A review app shows extracted clauses with confidence notes; legal staff approves.

5) Ticket triage and routing

  • Problem: ITSM tickets pile up; wrong routing increases resolution time.
  • Why it fits: Pick a lightweight classification model for routing to correct team.
  • Example: New tickets are categorized (billing/bug/outage) and assigned automatically.

6) PII/PHI-aware redaction helper

  • Problem: Documents contain sensitive fields that must be redacted before sharing.
  • Why it fits: Combine model extraction with policy-driven controls and audit logs.
  • Example: A pipeline identifies likely PII segments; a reviewer confirms redactions.

7) Code review summarization for PRs

  • Problem: Large pull requests are hard to review consistently.
  • Why it fits: Use a model that can summarize diffs and highlight risk areas.
  • Example: CI posts a PR comment with summary and suggested tests (reviewer validates).

8) Product catalog enrichment

  • Problem: Product descriptions are inconsistent and missing attributes.
  • Why it fits: Model can generate standardized descriptions and structured attributes.
  • Example: A batch job reads raw supplier data and outputs normalized descriptions.

9) Multi-lingual translation with terminology constraints

  • Problem: Support content must be translated while preserving brand terminology.
  • Why it fits: Choose a model that handles translation well; enforce terminology via prompts/glossaries.
  • Example: Knowledge articles are translated; QA checks run on sampled outputs.

10) Security log summarization and incident assistant

  • Problem: Analysts need quick understanding of alerts and timelines.
  • Why it fits: Use summarization/extraction on event bundles; integrate with SOC workflows.
  • Example: Sentinel alert triggers a function that summarizes related events and suggests next steps.

11) Marketing content drafting with guardrails

  • Problem: Content teams need drafts quickly but must comply with policy.
  • Why it fits: Model selection + content filtering patterns reduce risk.
  • Example: Drafts are generated with style constraints; compliance review is mandatory.

12) Data-to-text executive reporting

  • Problem: Executives want plain-language narratives from KPI dashboards.
  • Why it fits: Use a model for concise narrative generation from structured data.
  • Example: Nightly job produces a narrative for sales KPIs with “what changed” highlights.

6. Core Features

Because Foundry Models is a model-focused experience that depends on the underlying model hosting service, treat these features as common, model-dependent capabilities. Always confirm what your chosen model supports.

1) Model catalog (discovery and comparison)

  • What it does: Lets you browse and filter available models, typically by task and capability.
  • Why it matters: Model choice drives quality, latency, and cost.
  • Practical benefit: Faster shortlisting; fewer failed POCs due to mismatched capabilities.
  • Limitations/caveats: Catalog content and available models vary by region and subscription eligibility.

2) Model metadata and “model card” style details

  • What it does: Provides information such as intended use, limitations, and usage notes.
  • Why it matters: Reduces the chance of deploying an unsuitable model to production.
  • Practical benefit: Better governance and safer adoption.
  • Limitations/caveats: Depth of metadata varies by provider/model.

3) Deployment guidance / endpoint creation (model-dependent)

  • What it does: Provides a path to deploy or connect to a model endpoint.
  • Why it matters: Deployment mechanics differ across model families.
  • Practical benefit: Reduces trial-and-error when moving from selection to integration.
  • Limitations/caveats: Some models may be “use via API” rather than “deploy your own,” depending on Azure’s offering.

4) Inference integration patterns (REST/SDK snippets)

  • What it does: Shows how to call the model from code and what parameters to use (temperature, max tokens, etc.).
  • Why it matters: Teams need reproducible integration patterns.
  • Practical benefit: Faster integration into apps and CI environments.
  • Limitations/caveats: SDK support varies; verify the recommended SDK for the specific endpoint type.

5) Authentication options (keys and/or Microsoft Entra ID)

  • What it does: Enables secure access to model endpoints.
  • Why it matters: Production systems should avoid hard-coded secrets and support least privilege.
  • Practical benefit: Use managed identity where supported; rotate keys otherwise.
  • Limitations/caveats: Not all endpoint types support Entra ID equally—confirm per service/model.

6) Quotas and capacity awareness (service-dependent)

  • What it does: Helps you understand limits like requests per minute or tokens per minute.
  • Why it matters: Prevents production incidents caused by throttling.
  • Practical benefit: Better capacity planning and load testing.
  • Limitations/caveats: Quotas vary by region, model, and your approval level; exact numbers are not universal.

7) Content safety and policy alignment (via underlying services)

  • What it does: Supports safe use patterns (e.g., moderation filters, policy checks).
  • Why it matters: Reduces harmful output and compliance risk.
  • Practical benefit: Safer applications with defined guardrails.
  • Limitations/caveats: Filtering capabilities depend on the underlying service and configuration. Verify how content filtering is applied for your model.

8) Observability hooks (metrics/logs via Azure Monitor patterns)

  • What it does: Enables operational monitoring via the hosting resource’s metrics/logs.
  • Why it matters: You need to understand usage, errors, latency, and throttling.
  • Practical benefit: Better SRE operations and cost control.
  • Limitations/caveats: Logs can include sensitive prompts/outputs if enabled—review carefully.

9) Environment organization (projects/workspaces) (portal-dependent)

  • What it does: Organizes model work across teams/environments.
  • Why it matters: Separates dev/test/prod and clarifies ownership.
  • Practical benefit: Cleaner access controls and cost allocation.
  • Limitations/caveats: Exact “hub/project/workspace” structure changes over time—verify current portal terminology.

7. Architecture and How It Works

High-level architecture

Foundry Models acts like a front door for model selection and usage inside Azure AI Foundry. The actual inference traffic typically goes to an underlying Azure resource endpoint.

At a high level: 1. You select a model in Foundry Models. 2. You create a deployment or choose a consumption method (depends on model). 3. Your application calls the resulting endpoint. 4. Identity, networking, and logging are enforced by Azure at the resource boundary.

Request / data / control flow

  • Control plane: Model discovery, deployment configuration, and IAM are control-plane actions performed in Azure portals/APIs.
  • Data plane: Prompts and responses flow over HTTPS to the model endpoint.

Integrations with related services

Common integrations around Foundry Models-based apps include: – Azure OpenAI (frequently the hosting layer for OpenAI models) – Azure AI Search for RAG (index + retrieval + grounding) – Azure Storage for documents, transcripts, and prompt artifacts – Azure Key Vault for API keys/secrets and certificate management – Azure Monitor / Log Analytics for metrics, logs, and alerting – Private Link to keep traffic on private networks (where supported) – Azure Functions / App Service / Container Apps / AKS to host AI application backends

Dependency services

Foundry Models itself is not typically billed as a standalone runtime; your main dependencies (and costs) come from: – The model hosting/inference service (for example, Azure OpenAI) – Your app hosting – Data and networking services

Security/authentication model

Common patterns: – API keys (simpler; use Key Vault and rotation) – Microsoft Entra ID (recommended for production) where supported: – Managed identity from your compute – Role-based access control at the resource level

Verify the supported authentication method for your model endpoint in official docs.

Networking model

Typical options (depending on underlying service): – Public endpoint with HTTPS + firewall rules (IP allowlists) – Private Endpoint (Private Link) to keep endpoint traffic off the public internet – VNet integration for your app service + private DNS

Monitoring/logging/governance considerations

  • Track request counts, latency, errors, throttling via Azure Monitor metrics (varies by service).
  • Decide carefully whether to log prompts/responses (privacy/compliance).
  • Use resource tags, budgets, and policy to enforce environment separation and cost controls.

Simple architecture diagram (Mermaid)

flowchart LR
  U[User / Client App] --> A[App Backend<br/>Function/App Service]
  A -->|HTTPS inference| M[Model Endpoint<br/>(via Foundry Models selection)]
  A --> KV[Azure Key Vault]
  A --> MON[Azure Monitor]

Production-style architecture diagram (Mermaid)

flowchart TB
  subgraph Internet
    C[Web/Mobile Clients]
  end

  subgraph Azure["Azure Subscription"]
    FD[Front Door / App Gateway (optional)]
    AP[App Backend (AKS/Container Apps/App Service)]
    MI[Managed Identity]
    KV[Key Vault]
    AI[Azure AI Search (RAG)]
    ST[Storage Account (docs, logs, artifacts)]
    MON[Azure Monitor + Log Analytics]
    PE1[Private Endpoint (Model)]
    PE2[Private Endpoint (Search/Storage)]
    DNS[Private DNS Zones]
    MODEL[Model Endpoint (Hosting service behind Foundry Models)]
  end

  C --> FD --> AP
  AP --- MI
  AP -->|Retrieve secrets (if needed)| KV
  AP -->|Retrieve docs/chunks| AI
  AI --> ST
  AP -->|Private HTTPS| PE1 --> MODEL
  AP -->|Private HTTPS| PE2 --> AI
  AP -->|Metrics/Logs| MON
  DNS --- PE1
  DNS --- PE2

8. Prerequisites

Account/subscription/tenant requirements

  • An active Azure subscription.
  • Ability to create resource groups and AI resources in your target region.
  • If your organization uses policies, ensure AI resources are allowed.

Permissions / IAM roles

Minimum recommended: – Subscription/resource group: Contributor (for lab work) – For production: separate roles for deployment vs runtime access. – For Azure OpenAI (if used): ensure you have the required permissions to create and manage deployments.

Billing requirements

  • A billing-enabled subscription.
  • Some models/services require approval or eligibility. If the portal indicates access is restricted, follow the official access request process.

CLI/SDK/tools needed

For the hands-on lab in this tutorial: – Azure CLI: https://learn.microsoft.com/cli/azure/install-azure-cli – Python 3.9+ (3.10/3.11 recommended) – Optional: jq for JSON formatting

Region availability

  • Model availability is region-specific.
  • Choose a region where your target model is available. Verify in:
  • Azure AI Foundry portal model catalog
  • Azure OpenAI region/model availability docs (if using Azure OpenAI)

Quotas/limits

Expect some combination of: – Requests per minute (RPM) – Tokens per minute (TPM) – Max context window / max output tokens – Deployment limits per resource/region

Quotas vary significantly—verify in official docs and in the Azure portal quota views.

Prerequisite services

Depending on your architecture: – Azure OpenAI resource (common for OpenAI models) – Key Vault (recommended) – Azure Monitor / Log Analytics (recommended) – Azure AI Search (for RAG)


9. Pricing / Cost

Current pricing model (accurate, non-fabricated)

Foundry Models is best treated as a selection and usage experience; most of your cost comes from the underlying model hosting/inference service and the supporting Azure resources.

Common pricing dimensions you may encounter: – Token-based pricing (typical for chat and embeddings models): you pay per input/output tokens. – Provisioned throughput / capacity (some offerings): you pay for reserved capacity to achieve predictable latency/throughput. – Hosting/compute pricing (if you deploy models on managed compute): hourly compute + storage + networking.

Because models and hosting options evolve, do not assume one universal pricing scheme. Always review pricing for the exact model/provider/hosting method you choose.

Free tier

  • The Azure AI Foundry portal experience is generally not the cost driver.
  • Free tiers (if any) are typically tied to specific services (not guaranteed for model inference).
  • If you see promotional credits or trial offers, treat them as time-limited and verify eligibility.

Cost drivers

Primary drivers: – Total tokens processed (input + output) – Average prompt size (system prompt + retrieved context + user message) – Response length (max tokens) – Request rate (RPM) – Model choice (larger models often cost more and respond slower) – Region (prices can vary)

Secondary/indirect drivers: – Azure AI Search (index size, replicas/partitions) – Storage (documents, embeddings, logs) – Networking (Private Link, outbound data transfer in some architectures) – App hosting (Functions, App Service, containers) – Monitoring (Log Analytics ingestion and retention)

Network/data transfer implications

  • Inbound data to Azure services is typically free; outbound may be billed depending on path and service.
  • Private Link can add cost for private endpoints and data processing (service-dependent). Verify Private Link pricing when used.

How to optimize cost

  • Use the smallest model that meets quality requirements.
  • Reduce tokens:
  • Trim system prompts
  • Use retrieval with strict top-k and chunk sizing
  • Summarize long contexts
  • Set max_tokens and stop sequences.
  • Cache responses where safe.
  • Use budgets and alerts.
  • Separate dev/test/prod resources and enforce spend limits.

Example low-cost starter estimate (no fabricated prices)

A low-cost dev/test pattern typically looks like: – A small Azure OpenAI deployment (or other model endpoint) used intermittently – Minimal app hosting (local dev or small App Service) – Limited logging retention

Your monthly cost will mainly be driven by how many prompts you send and how long the responses are. Use the Azure Pricing Calculator to estimate using expected token volumes: – Azure Pricing Calculator: https://azure.microsoft.com/pricing/calculator/

Example production cost considerations (no fabricated prices)

Production adds: – Higher request volume (token spend scales linearly) – More expensive models for quality – Redundancy (multiple regions/resources) – Private networking and enterprise monitoring – RAG components (Search, Storage)

Official pricing pages to use

  • Azure OpenAI pricing (if using Azure OpenAI as the hosting layer):
    https://azure.microsoft.com/pricing/details/cognitive-services/openai-service/
  • Azure Pricing Calculator:
    https://azure.microsoft.com/pricing/calculator/
  • For other models offered through Azure AI Foundry catalog: verify in official docs and model-specific pricing links shown in the portal.

10. Step-by-Step Hands-On Tutorial

This lab uses a practical and widely supported path: deploy a model that is accessible through Foundry Models and call it via an Azure endpoint. The most common execution path today is via Azure OpenAI deployments managed in Azure and discoverable/usable through Foundry Models experiences.

If your Foundry Models catalog offers other endpoint types, the same app-side pattern applies (HTTPS call + auth), but the endpoint URL and API shape may differ—verify in official docs for your selected model.

Objective

Create a minimal, production-shaped setup where you: 1. Create an Azure OpenAI resource (model hosting) 2. Deploy a chat model 3. Call the model from Python 4. Add basic safety/cost controls 5. Clean up all resources

Lab Overview

  • Time: ~30–60 minutes
  • Cost: Low if you keep requests small (token-based); avoid long prompts and large outputs.
  • Outcome: You can send a chat prompt and receive a response from your deployed model endpoint.

Step 1: Choose a region and confirm model availability

  1. Decide on an Azure region.
  2. In Azure AI Foundry (portal UI) or Azure docs, confirm: – The model you want is available in that region. – Your subscription is allowed to use it.

Expected outcome: You know the region and model family you will deploy.

Verification: – In Azure AI Foundry portal (commonly https://ai.azure.com/), confirm the model shows as available for your region/subscription.
If you cannot access the portal UI due to org policy, use Azure documentation for Azure OpenAI model availability:
https://learn.microsoft.com/azure/ai-services/openai/concepts/models


Step 2: Install tools and sign in to Azure

Install Azure CLI and sign in:

az version
az login
az account show

Set your subscription (replace with your subscription ID):

az account set --subscription "<SUBSCRIPTION_ID>"

Expected outcome: Azure CLI is authenticated to the correct subscription.

Verification:

az account show --query "{name:name, id:id, tenantId:tenantId}" -o table

Step 3: Create a resource group

Choose names and region:

RG="rg-foundry-models-lab"
LOC="eastus"   # example; choose a region that supports your target model
az group create -n "$RG" -l "$LOC"

Expected outcome: A resource group exists.

Verification:

az group show -n "$RG" -o table

Step 4: Create an Azure OpenAI resource (model hosting backend)

Note: Azure OpenAI resource creation and naming constraints apply. Some subscriptions require approval to create and use Azure OpenAI. If creation fails due to policy/approval, follow the error guidance and request access per official docs.

Create the resource (choose a globally unique name):

AOAI_NAME="aoai$RANDOM$RANDOM"
az cognitiveservices account create \
  -n "$AOAI_NAME" \
  -g "$RG" \
  -l "$LOC" \
  --kind OpenAI \
  --sku S0 \
  --yes

Expected outcome: An Azure OpenAI resource is created.

Verification:

az cognitiveservices account show -n "$AOAI_NAME" -g "$RG" --query "{name:name, endpoint:properties.endpoint}" -o json

Save the endpoint:

ENDPOINT=$(az cognitiveservices account show -n "$AOAI_NAME" -g "$RG" --query "properties.endpoint" -o tsv)
echo "$ENDPOINT"

Step 5: Create a model deployment

You must pick a model name/version that is available to you. Azure OpenAI uses a deployment name that your app calls.

  1. Decide: – DEPLOYMENT_NAME (your label) – MODEL_NAME (the model identifier) – MODEL_VERSION (if required by the command)

Because model identifiers change over time, get the correct model name/version from official docs or the Azure portal: – Models concept doc: https://learn.microsoft.com/azure/ai-services/openai/concepts/models

Example pattern (replace placeholders with valid values):

DEPLOYMENT_NAME="chat"
MODEL_NAME="<MODEL_NAME_FROM_DOCS_OR_PORTAL>"
MODEL_VERSION="<MODEL_VERSION_IF_REQUIRED>"

Create the deployment (command structure can vary by API/CLI version). Try this form first:

az cognitiveservices account deployment create \
  -g "$RG" \
  -n "$AOAI_NAME" \
  --deployment-name "$DEPLOYMENT_NAME" \
  --model-name "$MODEL_NAME" \
  --model-version "$MODEL_VERSION" \
  --model-format OpenAI \
  --sku-name "Standard"

If your CLI reports that parameters differ, consult the up-to-date CLI reference: – Azure CLI az cognitiveservices account deployment reference: https://learn.microsoft.com/cli/azure/cognitiveservices/account/deployment

Expected outcome: A deployment exists and is in a succeeded state.

Verification:

az cognitiveservices account deployment show \
  -g "$RG" -n "$AOAI_NAME" --deployment-name "$DEPLOYMENT_NAME" -o json

Step 6: Get an API key (for lab simplicity)

For a quick lab, use an API key. For production, prefer Microsoft Entra ID where supported and appropriate.

KEY=$(az cognitiveservices account keys list -g "$RG" -n "$AOAI_NAME" --query "key1" -o tsv)
echo "Got key length: ${#KEY}"

Expected outcome: You have an API key to call the endpoint.

Verification: Key length is non-zero.


Step 7: Call the model using Python (Chat Completions)

Create a virtual environment and install dependencies.

python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install requests

Create chat.py:

import os
import requests

endpoint = os.environ["AZURE_OPENAI_ENDPOINT"].rstrip("/")
key = os.environ["AZURE_OPENAI_API_KEY"]
deployment = os.environ["AZURE_OPENAI_DEPLOYMENT"]

# API versions change; verify the newest stable API version in official docs.
# Azure OpenAI REST reference:
# https://learn.microsoft.com/azure/ai-services/openai/reference
api_version = os.environ.get("AZURE_OPENAI_API_VERSION", "2024-02-15-preview")

url = f"{endpoint}/openai/deployments/{deployment}/chat/completions?api-version={api_version}"

headers = {
    "Content-Type": "application/json",
    "api-key": key
}

payload = {
    "messages": [
        {"role": "system", "content": "You are a concise assistant. Respond in 3 bullet points maximum."},
        {"role": "user", "content": "Explain what Foundry Models is in Azure and when to use it."}
    ],
    "temperature": 0.2,
    "max_tokens": 150
}

resp = requests.post(url, headers=headers, json=payload, timeout=60)
print("Status:", resp.status_code)
print(resp.text)
resp.raise_for_status()

Set environment variables and run:

export AZURE_OPENAI_ENDPOINT="$ENDPOINT"
export AZURE_OPENAI_API_KEY="$KEY"
export AZURE_OPENAI_DEPLOYMENT="$DEPLOYMENT_NAME"
python chat.py

Expected outcome: You receive a JSON response containing an assistant message.

Verification: – HTTP status is 200 – Response includes choices[0].message.content


Step 8: Add basic guardrails (cost + reliability)

Apply these controls in your app: – Hard limit output size (max_tokens) – Lower temperature for deterministic outputs – Timeouts and retries for transient failures – Backoff on 429 throttling

Example retry wrapper (minimal):

import time
import random
import requests

def post_with_retry(url, headers, payload, max_attempts=5):
    for attempt in range(1, max_attempts + 1):
        r = requests.post(url, headers=headers, json=payload, timeout=60)
        if r.status_code in (429, 500, 502, 503, 504):
            sleep_s = min(30, (2 ** attempt) + random.random())
            time.sleep(sleep_s)
            continue
        r.raise_for_status()
        return r
    r.raise_for_status()

Expected outcome: Your client behaves better under throttling or transient issues.


Validation

Run a few controlled prompts and confirm: 1. Responses return successfully. 2. Token usage stays within expectations (check response fields if provided by the API). 3. You can reproduce results by using the same prompt and low temperature.

Recommended checks: – Send 5–10 test requests. – Watch for 429 responses (quota throttling). – Keep prompts short to keep cost low.


Troubleshooting

Common issues and fixes:

1) Resource creation fails (policy/approval) – Symptom: AuthorizationFailed, RequestDisallowedByPolicy, or “not eligible”. – Fix: Request Azure OpenAI access (if required) and/or ask your Azure admin to allow resource creation.

2) Deployment create fails (model not found / invalid version) – Symptom: CLI error about model name/version. – Fix: Use the exact model identifier shown in the Azure portal for your region; verify in docs:
https://learn.microsoft.com/azure/ai-services/openai/concepts/models

3) HTTP 401 / 403 when calling the endpoint – Symptom: Unauthorized. – Fix: – Confirm you used the correct endpoint URL from the resource. – Confirm API key is correct. – Confirm you’re calling the correct deployment name.

4) HTTP 404 – Symptom: Not found. – Fix: – Verify deployment name matches exactly. – Verify API version path is correct; confirm the latest REST reference:
https://learn.microsoft.com/azure/ai-services/openai/reference

5) HTTP 429 throttling – Symptom: Too many requests. – Fix: – Reduce request rate and tokens – Implement retries with backoff – Request quota increase (process varies by org/service)


Cleanup

To avoid ongoing charges, delete the resource group:

az group delete -n "$RG" --yes --no-wait

Expected outcome: All lab resources are removed.

Verification:

az group exists -n "$RG"

When it returns false, cleanup is complete.


11. Best Practices

Architecture best practices

  • Prefer RAG for enterprise knowledge scenarios instead of stuffing large documents into prompts.
  • Separate model selection from application orchestration so you can swap models with minimal code changes.
  • Design for fallbacks: a smaller/cheaper model for degraded mode, or a rules-based response for critical workflows.

IAM/security best practices

  • Prefer Microsoft Entra ID / managed identity when supported by the underlying model endpoint.
  • If using API keys:
  • Store in Key Vault
  • Rotate keys on a schedule
  • Never commit keys into repos
  • Use least privilege at the resource group/resource scope.

Cost best practices

  • Measure and control token usage:
  • Cap max_tokens
  • Keep system prompts short
  • Summarize conversation history instead of replaying it
  • Use Azure Budgets and alerts per environment/team.
  • Tag resources (env, owner, costCenter, app) for chargeback/showback.

Performance best practices

  • Choose the smallest model that meets requirements.
  • Reduce prompt size and avoid unnecessary context.
  • Cache stable outputs and embeddings.
  • Use asynchronous patterns for batch workloads.

Reliability best practices

  • Handle 429 and transient 5xx errors with backoff and jitter.
  • Use circuit breakers for downstream dependency failures.
  • Consider multi-region only if your chosen model is available in multiple regions and your compliance posture allows it.

Operations best practices

  • Monitor:
  • Request count, errors, latency
  • Throttling events
  • Token usage (where reported)
  • Centralize logs and secure them (PII concerns).
  • Define SLOs for latency and availability; load test against quotas.

Governance/tagging/naming best practices

  • Naming example:
  • rg-<app>-<env>-<region>
  • aoai-<app>-<env>-<region>
  • kv-<app>-<env>-<region>
  • Tag consistently:
  • env=dev|test|prod
  • owner=email
  • costCenter=...
  • dataSensitivity=low|medium|high

12. Security Considerations

Identity and access model

  • Resource-level access is governed by Azure RBAC.
  • For data-plane calls:
  • Some endpoints support Microsoft Entra ID (recommended where available).
  • Otherwise use API keys with strict handling.

Encryption

  • Data in transit: HTTPS/TLS.
  • Data at rest: depends on the underlying services (Azure services generally encrypt at rest by default).
  • If logging prompts/outputs, treat them as sensitive data.

Network exposure

  • Prefer Private Link where supported for the model endpoint and dependent services (Search, Storage).
  • Restrict public network access:
  • IP allowlists/firewall rules where supported
  • WAF in front of your app APIs
  • Use private DNS zones correctly for private endpoints.

Secrets handling

  • Use Key Vault for:
  • API keys
  • Certificates
  • Connection strings
  • Use managed identity to access Key Vault.
  • Rotate secrets; monitor access.

Audit/logging

  • Enable Azure Activity Logs for control-plane auditing.
  • Use diagnostic settings for the underlying model resource (if available).
  • Carefully evaluate whether to log:
  • Full prompts/responses (high sensitivity)
  • Metadata only (safer): timestamps, token counts, request IDs, latency

Compliance considerations

  • Data residency: pick regions that satisfy requirements.
  • Data retention: configure logs with retention policies.
  • Model terms: some models have usage restrictions—review model/provider terms in the portal and docs.

Common security mistakes

  • Hardcoding API keys in code or CI logs
  • Overly permissive RBAC (Owner everywhere)
  • Logging sensitive prompts and outputs without access controls
  • Exposing model endpoints directly to the public internet without an application layer
  • Not planning for prompt injection (especially in RAG systems)

Secure deployment recommendations

  • Put the model behind an application API that:
  • Authenticates users
  • Applies authorization and rate limiting
  • Enforces prompt policies and redaction
  • Implements monitoring and abuse detection
  • Use Private Link for internal enterprise apps where possible.
  • Add content filtering and human review for high-risk outputs.

13. Limitations and Gotchas

These are common for model services integrated through Foundry Models; exact behavior depends on the underlying model hosting service and model family.

Known limitations

  • Region limitations: Not all models are available in all regions.
  • Quota throttling: You may hit RPM/TPM limits unexpectedly during load tests.
  • Approval requirements: Some subscriptions require access approval for certain models/services.
  • Model behavior variability: Even with low temperature, outputs can change across model versions.

Quotas

  • Requests per minute, tokens per minute, and concurrency limits are common.
  • Quota increases may require a request process and justification.

Regional constraints

  • Multi-region design can be blocked by model availability.
  • Compliance requirements may restrict you to specific regions where a model may not exist.

Pricing surprises

  • Token costs scale linearly; large prompts (RAG context) can be expensive.
  • Logging and monitoring can become significant at high volume.
  • Private networking can add cost (service-dependent).

Compatibility issues

  • API versions change; older API versions can be retired.
  • SDKs may lag behind API features; verify recommended SDK in official docs.

Operational gotchas

  • Rotating keys can break clients if not coordinated.
  • Over-logging prompts/responses can create sensitive data stores.
  • Content filtering/policy settings can block responses in unexpected ways—test with representative prompts.

Migration challenges

  • Swapping models can change output style and tool/function-calling behaviors.
  • Prompt tuning is often required when upgrading models.
  • Regression testing is essential for production assistants.

Vendor-specific nuances

  • Azure services typically separate control-plane and data-plane permissions.
  • Some features are only available in certain portal experiences—confirm current UI flow in Azure AI Foundry documentation.

14. Comparison with Alternatives

Foundry Models is best compared as a model selection + consumption experience in Azure. Alternatives vary depending on whether you want a curated model catalog, a model training platform, or a cross-cloud abstraction.

Comparison table

Option Best For Strengths Weaknesses When to Choose
Foundry Models (Azure) Teams selecting and consuming foundation models in Azure AI Foundry Centralized model discovery; Azure-aligned governance; integrates with Azure AI ecosystem Exact capabilities depend on underlying model hosting service; region/model availability constraints You want an Azure-native path from model selection to integration
Azure OpenAI (Azure) Using OpenAI models with Azure governance Mature enterprise patterns; strong docs; token-based pricing; private networking options (verify per region) Limited to models offered; quotas/availability can vary You know you want OpenAI model families in Azure
Azure Machine Learning (Azure) Training, fine-tuning, and MLOps for custom ML Full lifecycle ML (data, training, registry, deployment); flexible compute More complexity; higher ops burden than pure API consumption You need custom training, managed endpoints, registries, and MLOps pipelines
Azure AI Services (non-OpenAI) Prebuilt AI APIs (vision, language, speech) Task-specific APIs; predictable behavior; enterprise controls Less flexible than foundation models for open-ended generation You need deterministic APIs for specific tasks
AWS Bedrock (AWS) Managed foundation models in AWS Model variety; managed access patterns Different cloud; migration effort; governance differs Your platform is AWS-first
Google Vertex AI (Google Cloud) Model garden + MLOps in GCP Integrated MLOps + model offerings Different cloud; different IAM/networking Your platform is GCP-first
Self-hosted open-source models (AKS/VMs) Maximum control, offline needs Full control over weights/runtime; can be cost-effective at scale Significant ops, scaling, security patching; GPU cost and capacity You require offline/air-gapped or custom runtime control

15. Real-World Example

Enterprise example: Internal policy assistant with RAG

  • Problem: A global enterprise has thousands of HR/IT/security policies. Employees ask repetitive questions and get inconsistent answers.
  • Proposed architecture:
  • Documents in SharePoint exported to Azure Storage
  • Indexed into Azure AI Search
  • Application backend in Azure Container Apps
  • Model selected via Foundry Models, hosted via the appropriate Azure model endpoint (commonly Azure OpenAI for chat)
  • Private networking using Private Link (model endpoint + search)
  • Secrets in Key Vault
  • Monitoring via Azure Monitor / Log Analytics
  • Why Foundry Models was chosen:
  • Central model discovery and governance alignment for multiple teams
  • Easier standardization of approved models across departments
  • Expected outcomes:
  • Faster employee support
  • Reduced helpdesk volume
  • Better compliance via cited sources and controlled rollout

Startup/small-team example: SaaS ticket triage automation

  • Problem: A small SaaS company’s support queue grows quickly; manual triage is inconsistent.
  • Proposed architecture:
  • Webhook receives new tickets (Azure Functions)
  • Calls a model endpoint chosen via Foundry Models to classify and route tickets
  • Writes results to a database (Azure SQL or Cosmos DB)
  • Simple dashboard for overrides and auditing
  • Why Foundry Models was chosen:
  • Fast model selection and integration without building ML infrastructure
  • Token-based costs scale with usage and can be capped with prompt limits
  • Expected outcomes:
  • Faster first response time
  • More consistent routing
  • Low operational overhead

16. FAQ

1) Is Foundry Models a standalone Azure service with its own bill?
Foundry Models is primarily an Azure AI Foundry experience for selecting/using models. Costs generally come from the underlying model hosting/inference service (for example, Azure OpenAI) and supporting resources.

2) Do I always need Azure OpenAI to use Foundry Models?
Not necessarily. Foundry Models may surface different model providers and consumption paths. However, Azure OpenAI is a common hosting layer for many generative AI workloads in Azure. Confirm the hosting method per model in the portal/docs.

3) Can I deploy any model I see in the catalog into my subscription?
Not always. Availability depends on region, subscription eligibility, quotas, and the specific model’s offering (some may be “use via API,” some may require special approval). Verify in the Azure AI Foundry portal and official docs.

4) How do I keep prompts and outputs private?
Use least privilege, restrict network exposure (Private Link where supported), limit logging of prompts/outputs, and apply encryption and access controls to any stored conversation data.

5) What authentication should I use for production?
Prefer Microsoft Entra ID / managed identity where supported by the endpoint. If you must use API keys, store them in Key Vault and rotate them.

6) How do I estimate cost before going live?
Estimate by tokens: average prompt tokens + average response tokens × expected request volume. Use the Azure Pricing Calculator and the model/service pricing page.

7) Why do I get 429 errors?
You’re hitting quota/rate limits (requests/tokens per minute). Implement backoff retries, reduce prompt size, and request quota increases if needed.

8) Can I use Foundry Models for embeddings and semantic search?
Yes, if your chosen model endpoint supports embeddings. For full RAG, combine with Azure AI Search and store embeddings appropriately.

9) Does Foundry Models support fine-tuning?
Fine-tuning is model- and service-dependent. Some models/services support fine-tuning; others do not. Verify in official docs for the specific model family.

10) How do I version and test prompts?
Treat prompts like code: store in Git, add evaluation tests, and maintain a prompt changelog. Use sampling-based regression tests on model upgrades.

11) What’s the difference between model selection and deployment?
Selection is choosing a model that fits your workload. Deployment is creating/using an endpoint with quotas and authentication that your app calls.

12) Can I restrict model usage to a specific VNet?
Often yes using Private Link (service-dependent). If Private Link isn’t supported for your endpoint type, enforce access via an application layer in your VNet.

13) Should I send entire documents in prompts?
Usually no. Use RAG: retrieve small relevant chunks and include only those. This reduces cost and improves answer grounding.

14) How do I prevent prompt injection in RAG?
Apply input sanitization, isolate instructions from retrieved content, use strict tool/function calling policies, and add refusal policies and auditing.

15) What should I monitor in production?
Latency, error rate, throttling rate (429), token usage, and user feedback. Also monitor cost budgets and unusual usage spikes.

16) How do I handle model updates?
Pin versions where possible, test new versions in staging, run regression evaluations, and roll out gradually.

17) Can I use Foundry Models for regulated workloads?
Potentially, but you must validate compliance requirements (region, logging, data handling, vendor terms). Engage security/compliance early.


17. Top Online Resources to Learn Foundry Models

Resource Type Name Why It Is Useful
Official portal Azure AI Foundry portal (commonly: https://ai.azure.com/) Entry point for model catalog and Foundry Models experience (UI may evolve)
Official documentation Azure AI Foundry / Azure AI Studio docs (verify current path) https://learn.microsoft.com/azure/ai-studio/ Core docs for the Foundry/Studio experience and model usage workflows
Official documentation Azure OpenAI documentation https://learn.microsoft.com/azure/ai-services/openai/ If your Foundry Models usage is backed by Azure OpenAI, these are the canonical API and deployment docs
Official reference Azure OpenAI REST API reference https://learn.microsoft.com/azure/ai-services/openai/reference Required for correct endpoints, API versions, and payload shapes
Official concept doc Azure OpenAI models https://learn.microsoft.com/azure/ai-services/openai/concepts/models Helps confirm model availability, versions, and capabilities
Official pricing Azure OpenAI pricing https://azure.microsoft.com/pricing/details/cognitive-services/openai-service/ Pricing dimensions and links to calculator
Official calculator Azure Pricing Calculator https://azure.microsoft.com/pricing/calculator/ Estimate token usage and supporting resources
Official CLI reference Azure CLI cognitive services deployment commands https://learn.microsoft.com/cli/azure/cognitiveservices/account/deployment Ensures your deployment commands match current CLI
Architecture guidance Azure Architecture Center https://learn.microsoft.com/azure/architecture/ Reference architectures (RAG, security, networking) to productionize model endpoints
Samples Azure Samples on GitHub https://github.com/Azure-Samples Many Azure AI samples live here; search for Azure OpenAI + RAG patterns (verify repo relevance)

18. Training and Certification Providers

Institute Suitable Audience Likely Learning Focus Mode Website URL
DevOpsSchool.com DevOps engineers, cloud engineers, architects Azure operations, DevOps, and cloud-native practices applied to AI projects check website https://www.devopsschool.com/
ScmGalaxy.com Beginners to intermediate IT professionals DevOps, CI/CD, cloud fundamentals that support AI deployments check website https://www.scmgalaxy.com/
CLoudOpsNow.in Cloud ops and platform teams Cloud operations, governance, monitoring, cost controls check website https://www.cloudopsnow.in/
SreSchool.com SREs, reliability engineers, platform teams Reliability engineering practices for production systems (including AI endpoints) check website https://www.sreschool.com/
AiOpsSchool.com Ops teams adopting AI for IT operations AIOps concepts, monitoring automation, operational analytics check website https://www.aiopsschool.com/

19. Top Trainers

Platform/Site Likely Specialization Suitable Audience Website URL
RajeshKumar.xyz DevOps/cloud training content (verify current offerings) Engineers seeking practical cloud/DevOps learning https://rajeshkumar.xyz/
devopstrainer.in DevOps training platform (verify course catalog) Beginners to experienced DevOps practitioners https://www.devopstrainer.in/
devopsfreelancer.com Freelance DevOps guidance/services (verify current scope) Teams needing hands-on DevOps help or mentorship https://www.devopsfreelancer.com/
devopssupport.in DevOps support and training resources (verify offerings) Ops teams and learners needing support-oriented training https://www.devopssupport.in/

20. Top Consulting Companies

Company Likely Service Area Where They May Help Consulting Use Case Examples Website URL
cotocus.com Cloud/DevOps consulting (verify service catalog) Platform engineering, DevOps automation, cloud operations CI/CD for AI apps, infrastructure as code, operational readiness for model endpoints https://cotocus.com/
DevOpsSchool.com DevOps and cloud consulting/training DevOps transformation and cloud best practices Secure deployment pipelines, monitoring strategy, cost governance https://www.devopsschool.com/
DEVOPSCONSULTING.IN DevOps consulting (verify portfolio) DevOps process, automation, cloud adoption Azure landing zones, standardized deployment patterns, SRE practices for AI workloads https://devopsconsulting.in/

21. Career and Learning Roadmap

What to learn before Foundry Models

  • Azure fundamentals: subscriptions, RBAC, resource groups, regions
  • Networking basics: VNets, Private Link, DNS
  • Security basics: Key Vault, managed identity, logging/auditing
  • API basics: REST, auth headers, rate limiting
  • AI basics: tokens, context windows, embeddings, RAG concepts

What to learn after Foundry Models

  • RAG at scale:
  • Azure AI Search indexing strategies
  • Chunking, embedding models, evaluation
  • Production operations:
  • SLOs, alerting, incident response
  • Cost governance and chargeback
  • Model evaluation and safety:
  • Prompt injection defenses
  • Red-teaming and regression testing
  • MLOps and customization:
  • Azure Machine Learning pipelines (when you need custom training/fine-tuning)

Job roles that use it

  • Cloud Solutions Architect (AI workloads)
  • DevOps / Platform Engineer for AI-enabled applications
  • SRE supporting AI APIs
  • Application Developer integrating generative AI
  • Security Engineer reviewing AI service deployments
  • ML Engineer (model selection + evaluation + integration)

Certification path (if available)

There is not necessarily a single certification named “Foundry Models.” Practical paths usually combine: – Azure fundamentals certifications – Azure AI-related certifications (verify current Microsoft certification names/paths in official Microsoft Learn)

Start here: – Microsoft Learn: https://learn.microsoft.com/training/

Project ideas for practice

  1. RAG assistant for internal docs with citations and access control
  2. Ticket triage and routing microservice with evaluation harness
  3. Document summarization pipeline with human approval workflow
  4. Cost dashboard tracking token usage + budgets + alerts
  5. Private networking proof-of-concept with Private Link + private DNS

22. Glossary

  • Foundry Models: Azure AI Foundry experience for discovering and using models; relies on underlying Azure model hosting services.
  • Azure AI Foundry: Azure portal experience for building AI apps (branding may evolve; verify official naming in docs).
  • Azure OpenAI: Azure service providing OpenAI model deployments with Azure governance and networking options.
  • Control plane: Management operations (create resources, deployments, RBAC).
  • Data plane: Runtime inference calls (prompts/responses).
  • Tokens: Units of text used for billing and limits in many LLM APIs.
  • Context window: Maximum tokens a model can consider (prompt + response).
  • RAG (Retrieval-Augmented Generation): Pattern combining retrieval (search) with generation to ground answers in your data.
  • Embeddings: Vector representations of text used for semantic search and similarity.
  • Quota: Service-imposed throughput limits (RPM/TPM).
  • Private Link: Azure private networking feature to access services via private endpoints.
  • Managed identity: Azure identity for services to access other resources without storing secrets.
  • Key Vault: Azure service for secret, key, and certificate management.
  • 429 throttling: HTTP status code indicating too many requests due to rate limits.
  • Temperature: Parameter influencing randomness in model outputs.
  • Max tokens: Parameter to cap response length (cost and safety control).

23. Summary

Foundry Models (Azure) is the model selection and usage experience inside Azure AI Foundry for teams building AI + Machine Learning solutions with foundation models. It helps you move from model discovery to practical integration while staying aligned with Azure governance patterns.

Key takeaways: – Foundry Models itself is not usually the main cost center—model inference and supporting services drive spend. – Security depends heavily on the underlying endpoint: use least privilege, prefer managed identity, and restrict networking with Private Link where supported. – Production readiness requires planning for quotas, throttling, monitoring, and prompt/data governance.

When to use it: you want an Azure-native path to select and operationalize foundation models with enterprise controls.
Next learning step: build a small RAG service (Azure AI Search + model endpoint) and add evaluation, monitoring, and cost budgets before scaling to production.