Category
AI and ML
1. Introduction
Vertex AI Datasets is the dataset management capability inside Vertex AI on Google Cloud. It lets you register, organize, and reuse training/evaluation data for ML workflows (AutoML and custom training) in a consistent way—without everyone on the team manually tracking “which bucket/path/table did we train on?”
In simple terms: Vertex AI Datasets is a catalog of ML-ready datasets (tabular, image, text, video) that points to your source data (typically Cloud Storage or BigQuery) and can hold labeling/annotation metadata. It provides a standard entry point for downstream ML tasks like training, evaluation, and labeling jobs.
Technically: a Vertex AI Dataset is a regional Vertex AI resource (projects/*/locations/*/datasets/*) that stores dataset metadata (display name, schema, data item references, labels/annotations) and links to underlying data sources. You can import data from Cloud Storage and/or BigQuery (depending on dataset type), and then use the dataset as the input to Vertex AI training pipelines (AutoML or custom) and data labeling workflows.
The problem it solves: ML teams often struggle with dataset sprawl: many versions of CSVs, folders, and tables with unclear lineage. Vertex AI Datasets provides a structured dataset object that makes it easier to:
- collaborate across data/ML/ops teams
- standardize training inputs
- apply consistent access controls
- reduce mistakes (training on the wrong snapshot/path)
- operationalize dataset-driven MLOps workflows
Naming note (verify if your org uses legacy terms): Vertex AI is the successor to “AI Platform.” Dataset management is now part of Vertex AI and is commonly referred to as Vertex AI Datasets in docs and console.
2. What is Vertex AI Datasets?
Official purpose
Vertex AI Datasets is the Vertex AI data management layer for creating and managing dataset resources used in ML workflows. It is designed to help teams prepare and manage data for training, evaluation, and labeling inside the Vertex AI ecosystem.
Core capabilities
At a practical level, Vertex AI Datasets enables you to:
- Create datasets for supported data types (commonly tabular, image, text, and video).
- Import data from supported sources (commonly BigQuery for tabular; Cloud Storage for media/text).
- Manage labels/annotations (often via Vertex AI Data Labeling integration).
- Reuse datasets across experiments, training jobs, and pipelines.
- Control access using Google Cloud IAM.
Major components (conceptual)
While the exact objects vary by dataset type, common concepts include:
- Dataset resource: the top-level container in Vertex AI (regional).
- Data items: references to individual records (rows, files, documents, frames/clips).
- Annotations/labels: metadata created by labeling jobs or imported labels.
- Schema: metadata schema describing the dataset type and expected fields.
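A rough mental model of how these concepts relate can be sketched in code. The classes below are purely illustrative (they are not the actual Vertex AI API types, and the field names are assumptions for teaching purposes):

```python
from dataclasses import dataclass, field

# Simplified conceptual model of the objects above; illustrative only,
# not the actual Vertex AI API types.

@dataclass
class Annotation:
    label: str                  # e.g. "defect" or "churned"
    source: str = "imported"    # "imported" or "labeling_job"

@dataclass
class DataItem:
    reference: str              # e.g. a GCS URI or a BigQuery table reference
    annotations: list[Annotation] = field(default_factory=list)

@dataclass
class Dataset:
    display_name: str
    data_type: str              # "tabular" | "image" | "text" | "video"
    location: str               # Vertex AI region, e.g. "us-central1"
    items: list[DataItem] = field(default_factory=list)

# The dataset holds references and metadata; the data itself stays in GCS/BQ.
ds = Dataset("customer_churn_tabular", "tabular", "us-central1")
ds.items.append(DataItem("bq://project.ds.churn_features_v3"))
print(len(ds.items))  # 1
```

The key design point this models: the dataset resource is a metadata container of references, not a copy of the data.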
Service type
- Managed service within Vertex AI (control plane managed by Google).
- You interact with it via:
- Google Cloud Console (Vertex AI → Datasets)
- Vertex AI API (aiplatform.googleapis.com)
- gcloud CLI (gcloud ai datasets ...)
- Vertex AI SDKs (commonly Python)
Scope: regional and project-scoped
- Project-scoped: datasets live inside a Google Cloud project.
- Regional: datasets are created in a specific Vertex AI location (for example, us-central1 or europe-west4).
Resource name format resembles: projects/PROJECT_ID/locations/LOCATION/datasets/DATASET_ID
Important: Even if your dataset data lives in Cloud Storage or BigQuery, the Vertex AI dataset resource is regional. For data residency, performance, and compliance, align:
- the Vertex AI dataset location
- the underlying storage locations (BigQuery dataset location; Cloud Storage bucket location)
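To make the location rule concrete, here is a small sketch (hypothetical helper functions, not part of any Google SDK) that parses a dataset resource name and applies a rough alignment heuristic. Real location-compatibility rules are more nuanced; verify them in official docs:

```python
import re

# Hypothetical helpers for illustration; not part of any Google SDK.
RESOURCE_NAME_RE = re.compile(
    r"^projects/(?P<project>[^/]+)/locations/(?P<location>[^/]+)/datasets/(?P<dataset>[^/]+)$"
)

# Rough mapping of multi-region names to region prefixes (heuristic only).
MULTI_REGION_PREFIXES = {"us": "us-", "eu": "europe-"}

def parse_dataset_name(name: str) -> dict:
    """Split a Vertex AI dataset resource name into its components."""
    match = RESOURCE_NAME_RE.match(name)
    if not match:
        raise ValueError(f"Not a dataset resource name: {name}")
    return match.groupdict()

def locations_aligned(vertex_region: str, storage_location: str) -> bool:
    """Heuristic check: exact match, or a regional Vertex AI location that
    falls inside a matching multi-region (e.g. us-central1 within US)."""
    storage = storage_location.lower()
    if vertex_region.lower() == storage:
        return True
    prefix = MULTI_REGION_PREFIXES.get(storage)
    return prefix is not None and vertex_region.lower().startswith(prefix)

name = "projects/my-proj/locations/us-central1/datasets/1234567890"
parts = parse_dataset_name(name)
print(parts["location"])                           # us-central1
print(locations_aligned(parts["location"], "US"))  # True
print(locations_aligned(parts["location"], "EU"))  # False
```

A check like this can run in CI before an import job to fail fast on cross-location mistakes.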
How it fits into the Google Cloud ecosystem
Vertex AI Datasets is typically used alongside:
- Cloud Storage: raw files and media assets
- BigQuery: tabular datasets and analytics
- Vertex AI Training / AutoML: training jobs that consume datasets
- Vertex AI Pipelines: orchestrating repeatable ML workflows
- Vertex AI Data Labeling: human labeling/annotation operations
- IAM + Cloud Audit Logs: access control and auditing
- Dataplex / Data Catalog (governance): governing the underlying storage and metadata (Vertex AI Datasets is not a full governance suite by itself)
Official docs starting point:
– https://cloud.google.com/vertex-ai/docs/datasets/introduction
3. Why use Vertex AI Datasets?
Business reasons
- Faster time to production: dataset resources become reusable building blocks for training workflows.
- Reduced risk: fewer “trained on the wrong file/table” incidents because datasets are tracked and referenced consistently.
- Better collaboration: a shared dataset registry is easier than passing around paths and ad-hoc spreadsheets.
Technical reasons
- Standardized ML inputs: downstream Vertex AI services can consume dataset IDs rather than fragile storage paths.
- Support for multiple data modalities: separate dataset types for tabular, image, text, and video (verify supported types for your region and workflow in official docs).
- Labeling integration: labeling workflows can attach annotations to the dataset resource.
Operational reasons
- Repeatability: stable dataset resources fit better into CI/CD and MLOps patterns.
- Central visibility: teams can discover datasets via console/API and inspect schema/metadata.
- Lifecycle management: you can delete datasets, rotate permissions, and standardize naming conventions.
Security/compliance reasons
- IAM-based access control: control who can view/manage datasets and who can access underlying data sources.
- Auditability: dataset actions are logged via Cloud Audit Logs (subject to your org’s logging configuration).
- Data residency alignment: choose dataset locations aligned to regulatory needs and storage locations.
Scalability/performance reasons
- Decouples control plane from data plane: the dataset resource is metadata, while the heavy data stays in BigQuery/Cloud Storage.
- Works with large sources: BigQuery tables and Cloud Storage buckets scale independently.
When teams should choose Vertex AI Datasets
Choose Vertex AI Datasets when:
- You are standardizing ML workflows on Vertex AI.
- Multiple people/teams share training data and need consistent references and permissions.
- You want to integrate labeling, AutoML, training pipelines, and model registry around consistent dataset assets.
- You need a managed dataset registry without building your own dataset metadata service.
When teams should not choose Vertex AI Datasets
You might skip Vertex AI Datasets if:
- You are not using Vertex AI for training or MLOps (a dataset registry may not add value).
- Your workflow is fully external (for example, training entirely on-prem) and you only use Google Cloud for storage.
- You require advanced dataset versioning/branching semantics (Git-like) and governance features; consider complementary tools (DVC, lakeFS, Dataplex) and integrate as needed.
- Your primary need is enterprise data governance and cataloging; Vertex AI Datasets is not a replacement for a data governance platform.
4. Where is Vertex AI Datasets used?
Industries
- Retail/e-commerce: product categorization, demand forecasting, personalization datasets
- Financial services: fraud and risk tabular datasets, document/text classification
- Healthcare/life sciences: imaging datasets (subject to compliance controls), NLP datasets
- Manufacturing: quality inspection image/video datasets
- Media/advertising: content classification and moderation datasets
- Transportation/logistics: ETA prediction, route optimization tabular data
Team types
- Data science teams building models and experiments
- ML engineering teams operationalizing training pipelines
- Platform teams standardizing Vertex AI usage
- Security and governance teams enforcing IAM and audit controls
- Data engineering teams managing upstream BigQuery/Storage sources
Workloads
- Supervised learning with labels/annotations
- Computer vision: classification, object detection (verify exact supported annotation formats per dataset type)
- NLP: classification, entity extraction (verify supported dataset types and formats)
- Tabular classification/regression
- Video classification/object tracking (verify supported capabilities)
Architectures
- Cloud-native MLOps (Vertex AI Pipelines + datasets + training + registry)
- BigQuery-centric ML where data stays in BigQuery and Vertex AI consumes it
- Data lake on Cloud Storage feeding labeled training datasets
Real-world deployment contexts
- Production: curated datasets feeding repeatable training pipelines, controlled by IAM and CI/CD
- Dev/test: smaller sandbox datasets used for experimentation, model prototyping, and pipeline validation
5. Top Use Cases and Scenarios
Below are realistic scenarios where Vertex AI Datasets fits well.
1) Tabular churn prediction dataset registry
- Problem: Analysts create many versions of churn tables in BigQuery; ML engineers lose track of which table was used for training.
- Why Vertex AI Datasets fits: A tabular dataset resource can reference the canonical BigQuery table and become the stable input to training pipelines.
- Example: Create a customer_churn_tabular dataset in us-central1 referencing bq://project.ds.churn_features_v3.
2) Image classification for product categories
- Problem: Product images stored in Cloud Storage are not consistently labeled; training data is scattered across folders.
- Why it fits: Vertex AI Datasets organizes images as data items with labels/annotations, and integrates with labeling.
- Example: A retail team imports images from gs://.../products/ and assigns category labels for AutoML training.
3) Defect detection via object detection labels
- Problem: Manufacturing needs bounding boxes for defects across many assembly-line photos.
- Why it fits: Image datasets can hold object detection annotations (verify the supported import/annotation formats for your workflow).
- Example: Labelers annotate defects; training pipeline consumes the dataset for detection model training.
4) Document/text classification for support ticket routing
- Problem: Support tickets in text form require labeling by category/priority; labels need to be reused for retraining.
- Why it fits: Text datasets help centralize labeled text samples and feed supervised training.
- Example: Import ticket text from Cloud Storage, label intents, and reuse the dataset for monthly retraining.
5) Sentiment analysis dataset across regions
- Problem: Regional teams store training text in different buckets; compliance requires data locality.
- Why it fits: Datasets are regional resources; you can create region-specific datasets aligned to storage.
- Example: sentiment-eu in europe-west4 referencing EU storage; a separate dataset sentiment-us in us-central1.
6) Video dataset for content moderation
- Problem: Moderation needs labeled video clips and consistent training splits.
- Why it fits: Video datasets can organize video data items and annotations (verify supported formats and labeling tasks).
- Example: Import clips from Cloud Storage, label unsafe content categories, train classifier.
7) Central dataset catalog for an MLOps platform team
- Problem: Each squad builds its own dataset conventions; onboarding is slow.
- Why it fits: Platform team defines standards: naming, IAM groups, and dataset locations.
- Example: A “dataset registry” per domain: fraud_*, search_*, vision_*.
8) Reproducible training input for Vertex AI Pipelines
- Problem: Pipelines reference raw paths; refactors break training jobs.
- Why it fits: Pipelines can reference dataset IDs, reducing fragile path dependencies.
- Example: Pipeline step fetches dataset resource and triggers training with the dataset as input.
9) Controlled external labeling with auditability
- Problem: Need to let a labeling vendor annotate data without broad bucket access.
- Why it fits: With careful IAM and storage permissions, you can limit access and audit operations (design carefully; verify best practices in official docs).
- Example: Vendor gets minimal permissions; dataset annotation changes are auditable.
10) Multi-model training from a shared “golden dataset”
- Problem: Multiple models (baseline, advanced, interpretable) should train on the same curated dataset.
- Why it fits: A single dataset resource becomes the canonical input; different training jobs reuse it.
- Example: Train baseline logistic regression and more complex models from the same dataset asset.
6. Core Features
Feature availability and exact dataset type support can change by region and over time. Verify in official docs if you rely on a specific dataset type, annotation format, or import path.
1) Dataset resources for multiple data modalities
- What it does: Lets you create datasets for different ML modalities (commonly tabular, image, text, video).
- Why it matters: ML workflows differ by modality; schema and import formats vary.
- Practical benefit: Teams can standardize dataset creation per modality and use consistent tooling.
- Caveats: Not all dataset types and labeling tasks are available in all regions. Verify supported locations and dataset types in Vertex AI docs.
2) Import from Cloud Storage and/or BigQuery (depending on dataset type)
- What it does: Creates dataset data items by importing references from GCS URIs or BigQuery tables.
- Why it matters: Keeps your data in scalable systems (GCS/BQ) while enabling ML workflows in Vertex AI.
- Practical benefit: Avoids ad-hoc local file management; supports larger datasets.
- Caveats: Location mismatches (Vertex AI region vs bucket/BQ dataset location) can cause friction or performance issues. Align locations where possible.
3) Labeling and annotation integration
- What it does: Supports attaching labels/annotations to dataset items (often via Vertex AI Data Labeling workflows).
- Why it matters: Supervised learning depends on high-quality labels.
- Practical benefit: Central place to store labeling output tied to data items.
- Caveats: Labeling incurs cost and requires careful IAM design. Some labeling workflows have task-specific formats and constraints.
4) Dataset metadata and organization
- What it does: Provides display names, resource labels/tags (where supported), schemas, and dataset-level metadata.
- Why it matters: Discoverability and governance.
- Practical benefit: Standard naming conventions and labels help manage many datasets across teams.
- Caveats: Vertex AI Datasets is not a full enterprise data catalog; rely on Dataplex/Data Catalog for broader governance.
5) API/SDK/CLI management
- What it does: Create/list/describe/delete datasets programmatically.
- Why it matters: Enables automation and MLOps.
- Practical benefit: Integrate dataset creation into CI/CD or environment bootstrapping.
- Caveats: Quotas and permissions apply; ensure least privilege.
6) Integration with Vertex AI training workflows
- What it does: Many Vertex AI training flows (including AutoML for supported modalities) can consume a dataset resource.
- Why it matters: Reduces glue code and makes training inputs consistent.
- Practical benefit: Easier reproducibility when training jobs reference a dataset ID.
- Caveats: Some custom training workflows may still read directly from GCS/BQ; dataset resources are helpful but not always required.
7) Regional resource control
- What it does: Dataset resources are created in a chosen Vertex AI region.
- Why it matters: Data residency, latency, and compliance.
- Practical benefit: Align datasets to regulated regions and keep workflows consistent.
- Caveats: Moving a dataset between regions is not typically a “move” operation; you often recreate/import in the target region.
7. Architecture and How It Works
High-level architecture
Vertex AI Datasets separates dataset metadata management (the Vertex AI control plane) from data storage (Cloud Storage/BigQuery). The dataset resource:
- stores schema and dataset metadata,
- stores references to the underlying data items (file URIs, table references),
- stores labeling/annotation metadata (depending on dataset type and workflow),
- is used by downstream Vertex AI services for training and labeling.
Request/data/control flow (typical)
- You create a dataset in a Vertex AI region.
- You run an import (via console, API, SDK, or CLI).
- Vertex AI records dataset items and metadata, referencing your data in GCS or BigQuery.
- You optionally run labeling jobs and attach annotations to dataset items.
- Training jobs consume the dataset resource (or underlying sources), producing models and artifacts.
Integrations with related services
Common integrations include:
- Cloud Storage: file-based sources for image/video/text.
- BigQuery: tabular sources and feature tables.
- Vertex AI Training / AutoML: training consumes dataset resources.
- Vertex AI Pipelines: orchestrates recurring dataset import + training.
- Cloud Logging / Cloud Monitoring: operational observability for API calls and jobs.
- IAM / Cloud Audit Logs: access control and auditing.
- Dataplex / Data Catalog: governance of underlying data stores (complementary).
Dependency services
- aiplatform.googleapis.com (Vertex AI API)
- BigQuery API (if using BigQuery sources)
- Cloud Storage API (if using GCS sources)
- IAM and Service Usage for API enablement
- Cloud Logging/Audit Logs (for monitoring/auditing)
Security/authentication model
- Uses Google Cloud IAM for dataset resource access.
- Uses service accounts for programmatic access (SDK/CLI).
- Underlying data access is enforced by the data plane service:
- BigQuery IAM for tables
- Cloud Storage IAM for buckets/objects
A common pitfall is granting Vertex AI dataset permissions without granting access to the referenced BigQuery table or GCS objects (or vice versa). You need both.
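This two-sided requirement can be expressed as a toy pre-flight check. The role names below are real IAM roles, but the function itself is an illustrative sketch, not an API:

```python
# Toy pre-flight check illustrating that control-plane (Vertex AI) and
# data-plane (BigQuery/GCS) access must BOTH be granted. Illustrative only.

CONTROL_PLANE_ROLES = {"roles/aiplatform.user", "roles/aiplatform.admin"}
DATA_PLANE_ROLES = {
    "bigquery": {"roles/bigquery.dataViewer", "roles/bigquery.admin"},
    "gcs": {"roles/storage.objectViewer", "roles/storage.objectAdmin"},
}

def missing_access(granted_roles: set, source_kind: str) -> list:
    """Return human-readable gaps for a dataset import from the given source."""
    gaps = []
    if not granted_roles & CONTROL_PLANE_ROLES:
        gaps.append("no Vertex AI role: cannot create/describe the dataset")
    if not granted_roles & DATA_PLANE_ROLES[source_kind]:
        gaps.append(f"no {source_kind} read role: import will fail on the source data")
    return gaps

# A principal with only the Vertex AI role still fails on the data plane:
print(missing_access({"roles/aiplatform.user"}, "bigquery"))
# ['no bigquery read role: import will fail on the source data']
```

The same logic applies in reverse: table access without a Vertex AI role blocks dataset creation.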
Networking model
- Vertex AI is a managed Google Cloud service accessed via Google APIs.
- Most usage is over public Google API endpoints, secured by IAM and TLS.
- Enterprises often restrict access using:
- Private Google Access (for VMs in VPC accessing Google APIs without external IPs)
- VPC Service Controls (service perimeter around Vertex AI, BigQuery, Storage)
Verify the latest Vertex AI + VPC SC guidance in official docs.
Monitoring/logging/governance considerations
- Cloud Audit Logs: dataset create/delete/import operations are typically auditable.
- Cloud Logging: job logs (for import/labeling) can appear depending on the operation.
- Resource labels: use consistent labels for ownership, environment, cost center.
- Data governance: govern underlying BigQuery/Storage with Dataplex, IAM conditions, bucket policies, retention, and DLP as required.
Simple architecture diagram (Mermaid)
flowchart LR
U[User / CI Pipeline] -->|Console / API / SDK| VAI["Vertex AI Datasets (regional)"]
VAI -->|References| GCS[Cloud Storage bucket]
VAI -->|References| BQ[BigQuery table]
VAI -->|Dataset ID| TR[Vertex AI Training / AutoML]
TR --> M[Model artifacts]
Production-style architecture diagram (Mermaid)
flowchart TB
subgraph Org[Organization]
IAM[IAM + Groups]
AL[Cloud Audit Logs]
VPCSC["VPC Service Controls (optional)"]
end
subgraph Data[Data Layer]
GCSRAW[(Cloud Storage - raw/curated)]
BQDW[(BigQuery - feature tables)]
DLP["DLP/Policy checks (optional)"]
DPX["Dataplex/Data Catalog (governance)"]
end
subgraph ML["Vertex AI (Regional)"]
DS[Vertex AI Datasets]
LAB["Vertex AI Data Labeling (optional)"]
PIPE[Vertex AI Pipelines]
TRAIN[Vertex AI Training / AutoML]
REG["Model Registry (Vertex AI)"]
end
subgraph Ops[Operations]
LOG[Cloud Logging]
MON[Cloud Monitoring]
CI[CI/CD System]
end
IAM --> DS
IAM --> GCSRAW
IAM --> BQDW
DS -->|imports references| GCSRAW
DS -->|imports references| BQDW
DS --> LAB
DS --> PIPE
PIPE --> TRAIN
TRAIN --> REG
DS --> LOG
PIPE --> LOG
TRAIN --> LOG
LOG --> MON
DS --> AL
CI -->|API-driven automation| DS
CI --> PIPE
DPX --- GCSRAW
DPX --- BQDW
DLP --- GCSRAW
DLP --- BQDW
VPCSC --- DS
VPCSC --- GCSRAW
VPCSC --- BQDW
8. Prerequisites
Account/project requirements
- A Google Cloud project with billing enabled.
- Ability to enable APIs in the project.
Permissions / IAM roles
At minimum (principle of least privilege; adjust for your org):
– Vertex AI:
– roles/aiplatform.user for basic usage, or
– roles/aiplatform.admin for full control (use sparingly)
– BigQuery (if using BigQuery sources):
– roles/bigquery.dataViewer on source tables
– roles/bigquery.jobUser may be needed for some operations
– Cloud Storage (if using GCS sources):
– roles/storage.objectViewer (read)
– roles/storage.objectAdmin (if uploading/managing objects in the lab)
– Project setup:
– roles/serviceusage.serviceUsageAdmin to enable APIs (or project owner)
Verify role requirements in official docs (they evolve):
https://cloud.google.com/vertex-ai/docs/general/access-control
Billing requirements
- Dataset metadata operations are typically low cost, but you will pay for:
- BigQuery storage/query if used
- Cloud Storage storage/operations if used
- Labeling jobs if used
- Any training jobs if launched
CLI/SDK/tools
- Google Cloud SDK (gcloud). Install: https://cloud.google.com/sdk/docs/install
- Optional: bq CLI (ships with the Cloud SDK)
- Optional: Python 3.9+ and the google-cloud-aiplatform SDK (if automating)
Region availability
- Choose a Vertex AI region supported by your organization.
- Align with data location:
- BigQuery dataset location (US/EU or specific region)
- Cloud Storage bucket location (region/multi-region)
Quotas/limits
Vertex AI enforces quotas (API request rates, resource counts, etc.). Check:
https://cloud.google.com/vertex-ai/quotas
Prerequisite services to enable
In most cases:
– Vertex AI API: aiplatform.googleapis.com
– Cloud Storage: storage.googleapis.com
– BigQuery: bigquery.googleapis.com (if using BigQuery sources)
9. Pricing / Cost
Vertex AI Datasets cost is best understood as (a) dataset management metadata + (b) underlying storage and jobs.
Official pricing sources
- Vertex AI pricing: https://cloud.google.com/vertex-ai/pricing
- Google Cloud Pricing Calculator: https://cloud.google.com/products/calculator
- Cloud Storage pricing: https://cloud.google.com/storage/pricing
- BigQuery pricing: https://cloud.google.com/bigquery/pricing
Pricing dimensions (what you actually pay for)
You typically pay for:
1. Data storage
- Cloud Storage: GB stored per month, operations (PUT/GET/LIST), retrieval (depending on storage class), replication, and potential egress.
- BigQuery: table storage; queries (on-demand TB processed) or capacity-based reservations.
2. Data processing jobs
- Dataset imports may trigger data processing/validation steps (behavior depends on dataset type). Any compute-like operations are usually priced under Vertex AI or the underlying service. Verify in official docs whether a specific import path triggers billable processing.
3. Labeling
- Human labeling is billed by task type, volume, and workforce.
4. Training
- AutoML/custom training is billed by compute, duration, and configuration.
5. Networking
- Data egress if data crosses regions or leaves Google Cloud.
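These dimensions can be combined into a simple what-if estimate. All unit prices in the sketch below are placeholders you must replace with real values from the official pricing pages; nothing here is an actual Google Cloud price:

```python
def monthly_estimate(
    gcs_gb: float, gcs_price_per_gb: float,
    bq_storage_gb: float, bq_storage_price_per_gb: float,
    bq_query_tb: float, bq_price_per_tb: float,
) -> float:
    """Sum the main recurring storage/query dimensions. All prices are
    user-supplied; look them up for your region and pricing model."""
    return (
        gcs_gb * gcs_price_per_gb
        + bq_storage_gb * bq_storage_price_per_gb
        + bq_query_tb * bq_price_per_tb
    )

# Demo with made-up placeholder prices (NOT real prices):
print(monthly_estimate(100, 0.02, 50, 0.02, 2, 5.0))  # 13.0
```

Labeling and training costs are job-shaped rather than storage-shaped, so they are better estimated per campaign/run with the Pricing Calculator.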
Free tier
Vertex AI has some free usage tiers for certain products, but do not assume a free tier applies to dataset operations. Verify current free tier details on the Vertex AI pricing page.
Cost drivers (common “gotchas”)
- BigQuery query costs when you repeatedly transform/export data for training.
- Copying data into multiple buckets/regions for convenience.
- Labeling costs scaling with number of items and complexity.
- Training costs triggered accidentally from the console (AutoML training can run for hours).
- Storage class choices: using Standard vs Nearline/Coldline; retrieval fees can surprise you if you repeatedly read cold data.
Hidden or indirect costs
- Logging and monitoring ingestion (usually modest, but can grow with verbose logs).
- Inter-region data transfer if your training region differs from data region.
- CI/CD runner costs if you automate frequent dataset imports.
How to optimize cost
- Keep data and Vertex AI region aligned to reduce egress and improve performance.
- Use BigQuery views/materialized views carefully—understand query cost implications.
- Avoid duplicating full datasets for every experiment; use curated “golden” datasets and track versions via tables/snapshots.
- Use lifecycle rules on Cloud Storage buckets for raw/intermediate data.
- For labeling, start with small pilot batches to estimate cost/quality.
Example low-cost starter estimate (no fabricated numbers)
A minimal lab can be kept low cost by:
- creating a small BigQuery table (KB/MB scale),
- creating a Vertex AI tabular dataset referencing that table,
- avoiding training and labeling jobs.
Costs will primarily be small BigQuery storage and minimal operations. Exact cost depends on region and pricing model—use the Pricing Calculator for your region and expected usage.
Example production cost considerations
In production, the biggest drivers are usually:
- large-scale data storage (TBs) in BigQuery/Cloud Storage,
- recurring labeling campaigns,
- recurring training runs (AutoML or custom),
- orchestration and compute for data prep pipelines (Dataflow/Dataproc/BigQuery).
A good practice is to separate:
- raw data (cheap, long retention),
- curated training datasets (stable tables/partitions),
- experiment subsets (temporary, aggressively TTL’d).
10. Step-by-Step Hands-On Tutorial
This lab focuses on creating a real Vertex AI tabular dataset from a BigQuery table with minimal cost. You will:
- create a small CSV locally,
- load it into BigQuery,
- create a Vertex AI dataset that references that BigQuery table,
- verify it exists via the console and CLI,
- clean up everything.
Objective
Create and manage a Vertex AI Datasets tabular dataset in Google Cloud and understand the required permissions, location alignment, verification, and cleanup steps.
Lab Overview
You will set up:
- A Cloud Storage bucket (for staging the CSV)
- A BigQuery dataset + table (loaded from the CSV)
- A Vertex AI dataset (tabular) importing from the BigQuery table
You will validate by:
– viewing the dataset in Vertex AI console
– listing/describing the dataset using gcloud
You will clean up by:
- deleting the Vertex AI dataset
- deleting the BigQuery dataset (and its table)
- deleting the Cloud Storage bucket
Step 1: Set environment variables and enable APIs
Expected outcome: Your project is set, APIs are enabled, and you have a chosen region.
1) Authenticate and set your project:
gcloud auth login
gcloud config set project YOUR_PROJECT_ID
2) Choose a Vertex AI region. This example uses us-central1:
export REGION=us-central1
gcloud config set ai/region $REGION
3) Enable required APIs:
gcloud services enable aiplatform.googleapis.com
gcloud services enable bigquery.googleapis.com
gcloud services enable storage.googleapis.com
Verify
gcloud services list --enabled --filter="name:aiplatform.googleapis.com OR name:bigquery.googleapis.com OR name:storage.googleapis.com"
Step 2: Create a Cloud Storage bucket for staging
Expected outcome: A bucket exists to store a small CSV file.
Choose a globally unique bucket name:
export BUCKET="YOUR_PROJECT_ID-vertex-datasets-lab"
Create the bucket (regional to match your Vertex AI region where possible):
gcloud storage buckets create gs://$BUCKET --location=$REGION
Verify
gcloud storage buckets describe gs://$BUCKET
Step 3: Create a small CSV dataset locally and upload it
Expected outcome: You have a CSV in Cloud Storage.
Create a file named customer_churn_sample.csv:
cat > customer_churn_sample.csv << 'EOF'
customer_id,tenure_months,monthly_charges,has_internet,contract_type,churned
C001,1,29.85,true,month-to-month,true
C002,34,56.95,true,one-year,false
C003,2,53.85,true,month-to-month,true
C004,45,42.30,false,two-year,false
C005,8,70.70,true,month-to-month,true
C006,22,89.10,true,one-year,false
C007,60,25.00,false,two-year,false
C008,12,99.65,true,month-to-month,true
EOF
Upload it:
gcloud storage cp customer_churn_sample.csv gs://$BUCKET/
Verify
gcloud storage ls gs://$BUCKET/
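Optionally, a quick local check can catch header or row-width problems before the BigQuery load. This helper is illustrative; the expected header matches the sample CSV created above:

```python
import csv
import io

# Expected header from the sample CSV created above.
EXPECTED_HEADER = ["customer_id", "tenure_months", "monthly_charges",
                   "has_internet", "contract_type", "churned"]

def check_rows(lines) -> int:
    """Validate the header and row widths; return the number of data rows."""
    reader = csv.reader(lines)
    header = next(reader)
    if header != EXPECTED_HEADER:
        raise ValueError(f"Unexpected header: {header}")
    rows = [row for row in reader if any(row)]
    for row in rows:
        if len(row) != len(EXPECTED_HEADER):
            raise ValueError(f"Bad row width: {row}")
    return len(rows)

# Inline demo; against the lab file, use:
#   check_rows(open("customer_churn_sample.csv"))
sample = io.StringIO(
    "customer_id,tenure_months,monthly_charges,has_internet,contract_type,churned\n"
    "C001,1,29.85,true,month-to-month,true\n"
)
print(check_rows(sample))  # 1
```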
Step 4: Create a BigQuery dataset and load the CSV into a table
Expected outcome: BigQuery dataset + table exists and contains rows.
1) Create a BigQuery dataset (use US multi-region for simplicity if you picked a US Vertex AI region).
If you are using an EU Vertex AI region, use EU instead.
export BQ_LOCATION=US
export BQ_DATASET=vertex_datasets_lab
bq --location=$BQ_LOCATION mk -d \
--description "Vertex AI Datasets lab dataset" \
$BQ_DATASET
2) Load the CSV from Cloud Storage into a table:
bq --location=$BQ_LOCATION load \
--source_format=CSV \
--skip_leading_rows=1 \
--autodetect \
${BQ_DATASET}.customer_churn_sample \
gs://$BUCKET/customer_churn_sample.csv
3) Query to confirm rows:
bq --location=$BQ_LOCATION query --use_legacy_sql=false \
"SELECT contract_type, COUNT(*) AS n, SUM(CAST(churned AS INT64)) AS churned
FROM \`${BQ_DATASET}.customer_churn_sample\`
GROUP BY contract_type
ORDER BY n DESC;"
Notes on locations
– BigQuery datasets are created in locations like US, EU, or a specific region.
– Vertex AI datasets are created in a Vertex AI region (like us-central1).
– Location compatibility can matter for some workflows. If you hit location-related errors later, align BigQuery dataset region with your Vertex AI region as closely as possible (or follow Google’s recommended compatible location combinations in official docs).
Step 5: Create a Vertex AI Datasets tabular dataset (Console)
Using the console avoids having to specify metadata schema URIs and import schema URIs by hand.
Expected outcome: A Vertex AI Dataset exists in your chosen region.
1) Open the Vertex AI Datasets page:
https://console.cloud.google.com/vertex-ai/datasets
2) Select the same project and confirm the region (top bar or dataset creation flow).
3) Click Create dataset.
4) Configure:
– Dataset name: customer_churn_tabular_lab
– Data type: Tabular
– Select a data source: BigQuery
– Choose the table:
– Dataset: vertex_datasets_lab
– Table: customer_churn_sample
5) Create/import.
Verify in console
– You should see the dataset appear in the datasets list.
– Open it and confirm you see the schema/columns and the data source reference.
Step 6: Verify with gcloud CLI
Expected outcome: You can list and describe the dataset resource.
List datasets in the region:
gcloud ai datasets list --region=$REGION
Describe the dataset (replace DATASET_ID with the ID from the list output):
export DATASET_ID="PASTE_DATASET_ID_HERE"
gcloud ai datasets describe $DATASET_ID --region=$REGION
You should see fields like:
– name (the full resource name)
– displayName
– createTime
– metadataSchemaUri (internal schema reference)
Validation
You have successfully completed the lab if:
– BigQuery table vertex_datasets_lab.customer_churn_sample exists and returns rows.
– Vertex AI dataset customer_churn_tabular_lab exists in the Vertex AI console.
– gcloud ai datasets list shows your dataset.
– gcloud ai datasets describe returns dataset details without permission errors.
Troubleshooting
Common issues and fixes:
1) Permission denied creating dataset
– Cause: Missing Vertex AI role.
– Fix: Grant roles/aiplatform.user (or admin) to your user/service account.
2) Permission denied reading BigQuery table
– Cause: You can create the Vertex AI dataset but can’t access the BigQuery table.
– Fix: Grant roles/bigquery.dataViewer on the dataset/table.
3) Location mismatch errors
– Cause: BigQuery dataset in EU, Vertex AI region in US (or vice versa), or incompatible combination.
– Fix: Recreate the BigQuery dataset in a compatible location, or choose a Vertex AI region aligned with your data.
4) API not enabled
– Cause: aiplatform.googleapis.com not enabled.
– Fix: Enable it with gcloud services enable aiplatform.googleapis.com.
5) gcloud ai datasets command not found
– Cause: Old Cloud SDK components.
– Fix: Update Cloud SDK:
gcloud components update
Cleanup
To avoid ongoing costs, delete created resources.
1) Delete the Vertex AI dataset:

```bash
gcloud ai datasets delete $DATASET_ID --region=$REGION --quiet
```

2) Delete the BigQuery dataset (this also deletes the tables inside it):

```bash
bq --location=$BQ_LOCATION rm -r -f $BQ_DATASET
```

3) Delete the Cloud Storage bucket and its contents (the `-r` flag removes the objects and then the bucket):

```bash
gcloud storage rm -r gs://$BUCKET
```

4) Optional: remove the local file:

```bash
rm -f customer_churn_sample.csv
```
11. Best Practices
Architecture best practices
- Align locations: Keep Vertex AI dataset region aligned with BigQuery dataset location and Cloud Storage bucket location to reduce latency and avoid cross-region constraints.
- Separate raw vs curated: Store raw data in a raw zone, curate a stable training dataset, and reference the curated dataset from Vertex AI Datasets.
- Design for reproducibility:
- Use immutable BigQuery tables (or snapshots) for training inputs.
- Use partitioned tables and explicit partitions when appropriate.
- Use naming conventions like `features_vYYYYMMDD` or `features_v3`.
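The snapshot-naming convention above is easy to enforce with a small helper. This is a minimal sketch (the function name is illustrative): it derives a `features_vYYYYMMDD`-style table name from a date, so a pipeline can resolve the name once, record it, and have every downstream step read the same immutable table.

```python
from datetime import date

def snapshot_table_name(base: str, snapshot_date: date) -> str:
    """Build an immutable, versioned table name like features_v20240131.

    The features_vYYYYMMDD convention is one example pattern; adapt it
    to your team's naming standards.
    """
    return f"{base}_v{snapshot_date.strftime('%Y%m%d')}"

# Resolve once, log it, then pass the same name to every pipeline step.
print(snapshot_table_name("features", date(2024, 1, 31)))  # features_v20240131
```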
IAM/security best practices
- Use least privilege:
- dataset viewers should not automatically be bucket admins
- separate “dataset metadata admin” from “data plane access” where possible
- Prefer group-based access (Google Groups / Cloud Identity).
- Use service accounts for automation (CI/CD) with narrow roles.
Cost best practices
- Avoid duplicating large datasets for experiments; use subsets or views carefully.
- For BigQuery:
- Minimize repeated full scans (use partitioning and clustering).
- Consider materialized views for recurring features if it reduces processing.
- For Cloud Storage:
- Set lifecycle rules for intermediate artifacts.
- Choose storage class based on access patterns.
Performance best practices
- Keep data close to compute (region alignment).
- Avoid cross-region reads during training.
- For tabular sources, optimize BigQuery table layout (partitioning/clustering) when query-based prep is used.
Reliability best practices
- Treat dataset creation/import as code where possible (SDK/CLI).
- Use CI validation steps:
- check table schema compatibility
- check row counts and null rates
- confirm IAM access
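The CI checks above can be sketched as one pure function. The shape of `stats` below is an assumption: gather it upstream however you prefer (for example from a BigQuery `INFORMATION_SCHEMA` query or a dry run) and pass it in, so the check itself stays testable without cloud access.

```python
def validate_training_input(stats: dict,
                            expected_columns: set,
                            min_rows: int = 1000,
                            max_null_rate: float = 0.05) -> list:
    """Return human-readable problems; an empty list means the table passes.

    Assumed `stats` keys (gathered upstream, e.g. from BigQuery):
      columns: set of column names
      row_count: int
      null_rates: dict of column -> fraction of NULLs
    """
    problems = []
    missing = expected_columns - stats["columns"]
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if stats["row_count"] < min_rows:
        problems.append(f"row count {stats['row_count']} below {min_rows}")
    for col, rate in stats["null_rates"].items():
        if rate > max_null_rate:
            problems.append(f"column {col} null rate {rate:.2%} exceeds limit")
    return problems

# A clean table produces no problems:
ok = {"columns": {"churned"}, "row_count": 5000, "null_rates": {"churned": 0.0}}
print(validate_training_input(ok, {"churned"}))  # []
```

Failing the CI job when the returned list is non-empty is a cheap way to catch the "wrong input table" class of errors before a training run spends money.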
Operations best practices
- Use labels on dataset resources for:
  - `env=dev|prod`
  - `owner=team-x`
  - `cost-center=...`
- Monitor:
- failed import/labeling jobs
- permission-related errors in logs
- Document dataset contracts:
- schema expectations
- label definitions
- update cadence
- known caveats
Governance/tagging/naming best practices
- Naming pattern example:
  `domain_modality_purpose_env`, e.g., `support_text_intent_prod`
- Tag underlying BigQuery tables and GCS buckets with consistent labels.
- For sensitive data, formalize:
- retention policy
- access approval workflow
- de-identification controls
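A naming convention is only useful if it is checked. This hedged sketch validates the `domain_modality_purpose_env` pattern with a regular expression; the allowed modality and environment values are assumptions to adapt to your organization.

```python
import re

# Illustrative pattern for domain_modality_purpose_env; the modality and
# env vocabularies here are assumptions, not an official standard.
NAME_RE = re.compile(
    r"[a-z][a-z0-9]*_(tabular|image|text|video)_[a-z][a-z0-9]*_(dev|test|prod)"
)

def is_valid_dataset_name(name: str) -> bool:
    """Return True when the whole name matches the convention."""
    return NAME_RE.fullmatch(name) is not None

print(is_valid_dataset_name("support_text_intent_prod"))  # True
print(is_valid_dataset_name("Support-Text-Intent"))       # False
```

A check like this fits naturally in the same CI job that creates dataset resources, so non-conforming names never reach the project.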
12. Security Considerations
Identity and access model
- Vertex AI Datasets access is controlled by IAM on Vertex AI resources.
- Underlying data access is controlled separately:
- BigQuery IAM for datasets/tables
- Cloud Storage IAM for buckets/objects
Secure design principle: grant access to the dataset resource only to users who also have the appropriate access to the data source—and vice versa.
Encryption
- Google Cloud encrypts data at rest and in transit by default across managed services.
- If you require customer-managed encryption keys (CMEK), verify:
- whether CMEK applies to Vertex AI dataset metadata and/or to related jobs,
- and how it applies to your BigQuery tables and Cloud Storage buckets.
CMEK support varies by product and region—verify in official docs.
Network exposure
- Access is via Google APIs; secure it with:
- IAM
- organization policy constraints
- VPC Service Controls (common for sensitive ML environments)
- If running from GCE/GKE without external IPs, use Private Google Access to reach Google APIs.
Secrets handling
- Don’t embed credentials in notebooks/scripts.
- Use:
- Workload Identity (GKE) or service accounts (GCE/Cloud Run)
- Secret Manager for API keys/secrets (when needed)
Audit/logging
- Enable and retain Cloud Audit Logs according to your compliance needs.
- Ensure dataset create/import/delete actions are logged and reviewable.
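To make those create/import/delete actions reviewable, you can query Admin Activity audit logs with a filter like the one this helper builds. The `methodName` values follow the usual audit-log naming for the Vertex AI API (`aiplatform.googleapis.com`), but treat them as assumptions and confirm them against real entries in your project's logs first.

```python
def dataset_audit_filter(project: str) -> str:
    """Build a Cloud Logging filter for Vertex AI dataset admin activity.

    Assumption: methodName values follow the v1 DatasetService naming;
    verify against actual Admin Activity log entries in your project.
    """
    methods = " OR ".join(
        f'"google.cloud.aiplatform.v1.DatasetService.{m}"'
        for m in ("CreateDataset", "ImportData", "DeleteDataset")
    )
    return (
        f'logName="projects/{project}/logs/'
        'cloudaudit.googleapis.com%2Factivity" '
        'AND protoPayload.serviceName="aiplatform.googleapis.com" '
        f"AND protoPayload.methodName=({methods})"
    )

print(dataset_audit_filter("my-project"))
```

The resulting string can be pasted into the Logs Explorer or passed to `gcloud logging read` as part of a periodic review job.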
Compliance considerations
- Data residency: choose Vertex AI region and data locations that match regulatory requirements.
- PII/PHI: apply de-identification, DLP scanning, and strict IAM on underlying data stores.
- Vendor labeling: if you use external labelers, ensure contractual and technical controls.
Common security mistakes
- Giving `roles/storage.admin` broadly just to "fix access."
- Putting sensitive training data in public buckets or under overly permissive IAM.
- Mixing dev/prod data in the same bucket without clear separation and controls.
- Not aligning VPC Service Controls perimeters across Vertex AI, BigQuery, and Storage.
Secure deployment recommendations
- Use separate projects for dev/test/prod.
- Apply org policies (e.g., restrict external IPs, restrict service account key creation).
- Use VPC Service Controls for sensitive environments.
- Use structured approvals for dataset promotion to production.
13. Limitations and Gotchas
Always validate current limits and supported formats in official docs. Limits and capabilities evolve.
Common limitations/gotchas include:
- Region and location constraints
- Vertex AI dataset resources are regional.
- BigQuery and Cloud Storage sources have locations; mismatches can cause issues.
- Dataset is not a data warehouse
- Vertex AI Datasets is not meant to replace BigQuery or a data lake.
- Not a full governance/catalog solution
- Use Dataplex/Data Catalog for broader governance and discovery.
- Underlying access still required
- Having permission to a dataset resource doesn’t automatically grant permission to the BigQuery table or GCS objects.
- Quota constraints
- API rate limits and resource quotas can affect automation at scale. Check quotas.
- Import format requirements
- Image/text/video dataset imports often require specific manifest/CSV formats depending on the task. Verify the current required formats.
- Pricing surprises
- Labeling and training can become the dominant cost quickly.
- BigQuery repeated scans during feature creation can be expensive.
- Migration challenges
- If you migrate from another MLOps platform, you may need to re-map dataset identifiers and re-import metadata.
14. Comparison with Alternatives
Vertex AI Datasets is part of the Vertex AI ecosystem; alternatives depend on whether you need ML dataset metadata management, labeling integration, or general data governance.
Comparison table
| Option | Best For | Strengths | Weaknesses | When to Choose |
|---|---|---|---|---|
| Vertex AI Datasets (Google Cloud) | Teams standardizing ML workflows on Vertex AI | Native integration with Vertex AI training/AutoML and labeling; regional resource control; IAM integration | Not a full data governance tool; relies on underlying stores; modality-specific import formats | You use Vertex AI for training/MLOps and want a dataset registry tied to ML workflows |
| BigQuery (tables/views) + conventions | Tabular-only ML with strong SQL governance | Great analytics, governance controls, performance, lineage tooling | No ML-native dataset object for multi-modality; labeling not native | Your ML is tabular and you already manage “training tables” well in BigQuery |
| Cloud Storage + folder conventions | File-based datasets and simple pipelines | Simple, cheap, flexible | Easy to lose track of versions/labels; governance is manual | Small teams or early-stage projects, or as the underlying storage layer |
| Dataplex / Data Catalog (Google Cloud) | Enterprise governance and discovery | Governance, cataloging, policies, lineage (for supported sources) | Not a replacement for ML dataset objects and labeling workflows | You need enterprise-wide governance plus ML workflows—use alongside Vertex AI Datasets |
| Vertex AI Feature Store (if used) | Serving/monitoring ML features | Feature reuse and online/offline serving patterns | Not a general dataset registry; different scope | You need feature management for training/serving consistency (complementary, not a substitute) |
| AWS SageMaker (Data Wrangler / Ground Truth / Feature Store) | AWS-native ML platform | Tight AWS integration and labeling (Ground Truth) | Different cloud ecosystem; migration overhead | Your stack is on AWS and you want native dataset/labeling tooling there |
| Azure Machine Learning Data assets | Azure-native ML platform | Data asset registry integrated with AML | Different ecosystem; migration overhead | Your stack is on Azure ML |
| DVC / lakeFS (self-managed) | Git-like dataset versioning and branching | Strong dataset versioning semantics; toolchain flexibility | Operational overhead; integration work | You need advanced dataset versioning and are willing to run/operate tooling |
15. Real-World Example
Enterprise example: regulated customer-risk modeling
- Problem: A bank trains multiple risk models with strict audit requirements. Data lives in BigQuery with tight controls. Teams need consistent dataset references and repeatable retraining.
- Proposed architecture
- BigQuery hosts curated feature tables (partitioned by snapshot date).
- Vertex AI Datasets registers a tabular dataset per model family, referencing the curated table or snapshot tables.
- Vertex AI Pipelines orchestrates monthly snapshot creation → dataset update/import → training → evaluation → registry.
- IAM groups enforce who can view datasets and who can access underlying BigQuery tables.
- Cloud Audit Logs retained to support audits.
- Why Vertex AI Datasets was chosen
- Provides a consistent, Vertex-AI-native dataset object for pipelines and training.
- Simplifies reproducibility and reduces “wrong input table” errors.
- Expected outcomes
- More repeatable retraining.
- Cleaner audit story (dataset IDs + table snapshot references).
- Faster onboarding for new ML engineers.
Startup/small-team example: ecommerce image categorization
- Problem: A startup needs to classify product images into categories. Images are in Cloud Storage; labels are evolving.
- Proposed architecture
- Cloud Storage bucket holds product images.
- Vertex AI Datasets stores an image dataset with label metadata.
- Vertex AI Data Labeling (optional) used in small batches to improve labels.
- AutoML training triggered when label quality reaches threshold.
- Why Vertex AI Datasets was chosen
- Minimal operational overhead compared to building a custom dataset registry.
- Tight path from dataset → labeling → training.
- Expected outcomes
- Faster iteration on label taxonomy.
- Repeatable training input.
- Reduced manual data management.
16. FAQ
1) Is Vertex AI Datasets the same as a BigQuery dataset?
No. A BigQuery dataset is a container for BigQuery tables. Vertex AI Datasets is an ML dataset resource in Vertex AI that references data in BigQuery and/or Cloud Storage (depending on type) and stores ML-specific metadata.
2) Does Vertex AI Datasets copy my data into Vertex AI?
Usually, it stores metadata and references to underlying data (GCS URIs or BigQuery tables). Exact behavior can vary by dataset type and workflow—verify in official docs for your modality and import method.
3) Is Vertex AI Datasets required to train models on Vertex AI?
Not always. Many custom training workflows can read directly from GCS/BigQuery. Vertex AI Datasets is most helpful for standardized workflows, reuse, and labeling/AutoML integration.
4) What dataset types are supported (tabular/image/text/video)?
Vertex AI commonly supports tabular, image, text, and video datasets, but exact supported tasks, formats, and regions can change. Verify in: https://cloud.google.com/vertex-ai/docs/datasets/introduction
5) Are Vertex AI datasets global or regional?
They are regional resources in a specified Vertex AI location.
6) Can I move a dataset to another region?
Typically you recreate the dataset in the target region and re-import from the source data. Verify whether any migration tooling exists for your dataset type.
7) How do permissions work?
You need IAM permissions for:
– the Vertex AI dataset resource (Vertex AI roles),
– and the underlying data (BigQuery roles and/or Cloud Storage roles).
8) Can multiple projects share the same Vertex AI dataset?
Vertex AI datasets are project-scoped. Cross-project sharing is usually done by sharing the underlying data (BQ/GCS) and recreating dataset resources in each project, or by centralizing ML in one project. Design depends on org policies.
9) How do I version datasets?
Vertex AI Datasets is primarily a dataset resource/metadata layer. For versioning, teams often use:
– BigQuery snapshot tables or partitioned snapshots,
– GCS object versioning and manifests,
– and MLOps metadata in pipelines.
Verify if any native dataset version features exist for your dataset type in current docs.
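As one example of the BigQuery snapshot approach, this sketch builds standard `CREATE SNAPSHOT TABLE ... CLONE ...` DDL; the `_snap_<suffix>` naming is only an illustrative convention, not a requirement.

```python
def snapshot_ddl(project: str, dataset: str, table: str, suffix: str) -> str:
    """Return BigQuery DDL that snapshots a table into an immutable copy.

    CREATE SNAPSHOT TABLE ... CLONE ... is standard BigQuery DDL; the
    <table>_snap_<suffix> destination name is just a convention here.
    """
    src = f"`{project}.{dataset}.{table}`"
    dst = f"`{project}.{dataset}.{table}_snap_{suffix}`"
    return f"CREATE SNAPSHOT TABLE {dst} CLONE {src};"

print(snapshot_ddl("my-proj", "features", "customer_churn", "20240131"))
```

Running the generated statement (for example via `bq query --use_legacy_sql=false`) yields a frozen table you can register as the training input, giving each retraining run an auditable, immutable source.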
10) What’s the difference between Vertex AI Datasets and Vertex AI Feature Store?
Datasets manage training/evaluation data assets; Feature Store (where used) focuses on feature reuse and online/offline feature serving patterns. They solve different problems and are often complementary.
11) Can I use VPC Service Controls with Vertex AI Datasets?
Many enterprises use VPC SC with Vertex AI, BigQuery, and Cloud Storage. Verify the latest supported configurations in official VPC SC docs and Vertex AI docs.
12) What’s the cheapest way to try Vertex AI Datasets?
Create a small tabular dataset referencing a small BigQuery table and avoid training/labeling jobs until you’re ready.
13) Does using Vertex AI Datasets improve model accuracy?
Not directly. It improves manageability, consistency, and operational reliability, which can indirectly improve outcomes by reducing data mistakes and supporting better iteration.
14) How do I automate dataset creation?
Use the Vertex AI API, gcloud ai datasets commands, or the Vertex AI Python SDK. Validate quotas and IAM.
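A minimal Python SDK sketch (using `google-cloud-aiplatform`, which needs credentials and the API enabled) might look like the following; the SDK import is deferred so the pure URI helper stays usable without the library installed.

```python
def bq_source_uri(project: str, dataset: str, table: str) -> str:
    """Build the bq:// URI form the Vertex AI SDK accepts for tabular sources."""
    return f"bq://{project}.{dataset}.{table}"

def create_tabular_dataset(project: str, region: str,
                           display_name: str, source_uri: str):
    """Sketch of tabular dataset creation with the Vertex AI Python SDK.

    Requires the google-cloud-aiplatform package, valid credentials, and
    aiplatform.googleapis.com enabled; verify current SDK parameters in
    the official reference before using in automation.
    """
    # Deferred import so the pure helper above works without the SDK.
    from google.cloud import aiplatform
    aiplatform.init(project=project, location=region)
    return aiplatform.TabularDataset.create(
        display_name=display_name,
        bq_source=source_uri,
    )

# The pure helper can be exercised without any cloud access:
print(bq_source_uri("my-proj", "vertex_datasets_lab", "customer_churn_sample"))
```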
15) What should I monitor in production?
Monitor:
– import/labeling job failures,
– permission errors,
– underlying data pipeline health (BigQuery jobs, Dataflow pipelines),
– cost anomalies (BigQuery scans, labeling spend, training runs).
17. Top Online Resources to Learn Vertex AI Datasets
| Resource Type | Name | Why It Is Useful |
|---|---|---|
| Official documentation | Vertex AI Datasets introduction — https://cloud.google.com/vertex-ai/docs/datasets/introduction | Canonical overview of dataset concepts, types, and workflows |
| Official documentation | Vertex AI Access control (IAM) — https://cloud.google.com/vertex-ai/docs/general/access-control | Role guidance and permission model for Vertex AI resources |
| Official CLI reference | gcloud ai datasets reference — https://cloud.google.com/sdk/gcloud/reference/ai/datasets | Command syntax for listing/creating/describing/deleting datasets |
| Official pricing | Vertex AI pricing — https://cloud.google.com/vertex-ai/pricing | Current pricing model for Vertex AI services |
| Official pricing tool | Google Cloud Pricing Calculator — https://cloud.google.com/products/calculator | Region-specific estimates without guessing |
| Official docs | Vertex AI Quotas — https://cloud.google.com/vertex-ai/quotas | Quota limits and how to request increases |
| Official docs | Vertex AI Data Labeling overview — https://cloud.google.com/vertex-ai/docs/data-labeling/overview | How labeling integrates with datasets and what to expect operationally |
| Official BigQuery pricing | BigQuery pricing — https://cloud.google.com/bigquery/pricing | Key cost drivers if you use BigQuery as a dataset source |
| Official Cloud Storage pricing | Cloud Storage pricing — https://cloud.google.com/storage/pricing | Key cost drivers for file-based datasets |
| Official architecture guidance | MLOps on Google Cloud (Architecture Center) — https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning | Reference architecture for dataset→pipeline→training operationalization |
| Official SDK docs | Vertex AI Python SDK reference — https://cloud.google.com/python/docs/reference/aiplatform/latest | Programmatic dataset operations and end-to-end ML automation |
| Official samples (GitHub) | GoogleCloudPlatform vertex-ai samples — https://github.com/GoogleCloudPlatform/vertex-ai-samples | Practical notebooks and code patterns (verify dataset examples relevant to your modality) |
| Official videos | Google Cloud Tech (YouTube) — https://www.youtube.com/@googlecloudtech | Product walkthroughs; search within channel for Vertex AI datasets/labeling |
18. Training and Certification Providers
| Institute | Suitable Audience | Likely Learning Focus | Mode | Website URL |
|---|---|---|---|---|
| DevOpsSchool.com | DevOps/Platform engineers, cloud engineers, SREs | MLOps/DevOps practices, automation, Google Cloud operations basics | check website | https://www.devopsschool.com/ |
| ScmGalaxy.com | Developers, build/release engineers, platform teams | SCM/CI/CD concepts, automation practices that support MLOps workflows | check website | https://www.scmgalaxy.com/ |
| CLoudOpsNow.in | Cloud operations teams, sysadmins | Cloud operations fundamentals, operational readiness | check website | https://www.cloudopsnow.in/ |
| SreSchool.com | SREs, reliability engineers, platform teams | Reliability engineering practices applicable to ML platforms | check website | https://www.sreschool.com/ |
| AiOpsSchool.com | Ops teams adopting AIOps, ML ops | Monitoring/automation practices; AIOps concepts that can complement ML operations | check website | https://www.aiopsschool.com/ |
19. Top Trainers
| Platform/Site | Likely Specialization | Suitable Audience | Website URL |
|---|---|---|---|
| RajeshKumar.xyz | Cloud/DevOps training content (verify specific Vertex AI coverage) | Beginners to intermediate engineers | https://rajeshkumar.xyz/ |
| devopstrainer.in | DevOps training and workshops | DevOps engineers, platform teams | https://www.devopstrainer.in/ |
| devopsfreelancer.com | Freelance DevOps/automation help (as a platform) | Teams needing short-term expertise | https://www.devopsfreelancer.com/ |
| devopssupport.in | DevOps support and training resources | Ops teams and engineers | https://www.devopssupport.in/ |
20. Top Consulting Companies
| Company Name | Likely Service Area | Where They May Help | Consulting Use Case Examples | Website URL |
|---|---|---|---|---|
| cotocus.com | Cloud/DevOps consulting (verify current offerings) | Cloud adoption, automation, platform engineering | Designing CI/CD for ML pipelines, IAM hardening, cost reviews | https://cotocus.com/ |
| DevOpsSchool.com | DevOps and cloud consulting/training | Team enablement, DevOps transformation | Building operational runbooks, setting up observability, improving deployment practices | https://www.devopsschool.com/ |
| DEVOPSCONSULTING.IN | DevOps consulting (verify current offerings) | CI/CD, infrastructure automation, reliability practices | Automation pipelines, infrastructure-as-code standardization, production readiness reviews | https://www.devopsconsulting.in/ |
21. Career and Learning Roadmap
What to learn before Vertex AI Datasets
- Google Cloud fundamentals:
- projects, IAM, service accounts, billing
- Cloud Storage basics (buckets, IAM, lifecycle)
- BigQuery basics (datasets, tables, locations, pricing)
- ML fundamentals:
- supervised learning concepts
- train/validation/test splits
- feature engineering basics
- Basic MLOps concepts:
- reproducibility
- data lineage
- automation and CI/CD
What to learn after Vertex AI Datasets
- Vertex AI training options:
- AutoML (where applicable)
- Custom training jobs
- Vertex AI Pipelines for orchestration
- Model Registry and model deployment patterns
- Monitoring and drift detection patterns (Vertex AI Model Monitoring where applicable)
- Data governance on Google Cloud (Dataplex, IAM Conditions, DLP)
Job roles that use it
- ML Engineer / Senior ML Engineer
- Cloud Engineer supporting AI platforms
- Data Engineer collaborating with ML teams
- Platform Engineer / MLOps Engineer
- SRE supporting ML systems
- Security Engineer reviewing AI/ML data access patterns
Certification path (Google Cloud)
Google Cloud certifications change over time. Commonly relevant tracks include:
- Professional Machine Learning Engineer
- Professional Cloud Architect
- Associate Cloud Engineer
Verify current certification names and requirements here:
https://cloud.google.com/learn/certification
Project ideas for practice
- Create a “golden dataset” pattern:
- raw → curated BigQuery table → Vertex AI dataset → pipeline training
- Build a dataset importer script:
- validates schema and row counts
- creates/updates dataset resources
- Implement least-privilege IAM:
- separate dataset viewers from data viewers
- audit with Cloud Logging queries
- Cost governance exercise:
- estimate BigQuery scan cost for feature creation
- optimize table partitioning and pipeline schedules
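For the cost-governance exercise, a tiny estimator keeps the arithmetic honest. The on-demand price is deliberately a parameter rather than a constant, because it varies by region and changes over time; look it up on the BigQuery pricing page.

```python
def estimate_scan_cost_usd(bytes_scanned: int, price_per_tib_usd: float) -> float:
    """Estimate BigQuery on-demand query cost from bytes scanned.

    Pass the current on-demand price for your region explicitly; it is
    deliberately not hardcoded here because pricing changes.
    """
    tib = bytes_scanned / (1024 ** 4)  # bytes -> TiB
    return tib * price_per_tib_usd

# Example: a 2 TiB scan at a hypothetical $6.25/TiB
print(round(estimate_scan_cost_usd(2 * 1024 ** 4, 6.25), 2))  # 12.5
```

Feed it the `totalBytesProcessed` from a `bq query --dry_run`, and you can compare feature-creation queries before and after adding partitioning or clustering.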
22. Glossary
- Vertex AI Datasets: Vertex AI service capability to create/manage dataset resources used for ML workflows.
- Dataset resource: A regional Vertex AI object that stores metadata and references to underlying data.
- BigQuery dataset (BQ dataset): A container of BigQuery tables (not the same as Vertex AI dataset).
- Cloud Storage bucket: Storage container for objects (files) used by ML workflows.
- Data item: An individual unit in a dataset (row/file/document/clip) represented in dataset metadata.
- Annotation/label: Supervised learning metadata attached to data items (class label, bounding box, etc.).
- IAM (Identity and Access Management): Google Cloud access control system based on roles and permissions.
- Service account: Non-human identity used by applications/automation to call Google APIs.
- Region/location: Geographic placement for resources; Vertex AI datasets are regional.
- VPC Service Controls: A Google Cloud security feature to reduce data exfiltration risk by defining service perimeters.
- MLOps: Operational practices for deploying and maintaining ML systems (automation, monitoring, governance).
23. Summary
Vertex AI Datasets in Google Cloud (AI and ML category) is a managed way to create regional dataset resources that reference your ML data in BigQuery and Cloud Storage, and optionally store labeling/annotation metadata. It matters because it standardizes dataset handling across teams, improves reproducibility, and integrates cleanly with Vertex AI training and MLOps workflows.
From a cost perspective, dataset metadata is usually not the main driver; the real costs typically come from storage (BQ/GCS), labeling, and training, plus any data processing and cross-region transfer. From a security perspective, success depends on designing IAM for both the dataset resource and the underlying data, aligning regions/locations, and enabling auditability.
Use Vertex AI Datasets when you want a consistent dataset registry tightly integrated with Vertex AI workflows. Next step: connect your dataset to a controlled training workflow (Vertex AI training and/or Vertex AI Pipelines) and apply production IAM, logging, and cost controls.