Category
Analytics and AI
1. Introduction
Oracle Cloud Data Labeling is a managed service in Oracle Cloud Infrastructure (OCI) that helps teams create labeled datasets for supervised machine learning (ML). In simple terms, it provides tools and workflows to tag raw data—such as images or text—with the “correct answers” (labels) that ML models need for training and evaluation.
From a technical perspective, Data Labeling provides dataset management, label set definition, and a labeling workflow/UI and APIs that connect strongly with OCI foundational services like Object Storage (for raw and exported labeled data), IAM (for access control), and Audit (for governance and traceability). It is designed to support repeatable labeling operations that can be integrated into MLOps pipelines.
The problem it solves is practical and common: most ML projects fail or stall because training data is messy, unlabeled, inconsistently labeled, or difficult to manage across teams. Data Labeling provides structure, access control, and export mechanisms so that labeled datasets can reliably feed downstream training in services like OCI Data Science (or other training platforms).
Naming/status note: The service is commonly referred to as OCI Data Labeling or Data Labeling service in Oracle documentation. Verify the exact current console placement and terminology in your region/tenancy because OCI console navigation can change over time.
2. What is Data Labeling?
Official purpose (OCI-aligned):
Data Labeling helps you create, manage, and export labeled datasets so you can train and evaluate machine learning models.
Core capabilities
- Create and manage datasets for labeling
- Define label sets (the categories/tags you want labelers to apply)
- Assign and perform labeling work (often via labeling jobs/workflows—verify exact current UI terms in official docs)
- Track labeling progress and dataset state
- Export labeled data/annotations for ML training
Major components (conceptual model)
While exact resource names can vary by UI/API version, the service typically revolves around:
- Dataset: A collection of records (examples) to label.
- Record: An individual item (for example, an image file or text file stored in Object Storage).
- Label set / labels: The controlled vocabulary of labels (e.g., positive, negative).
- Annotations: The labeling output attached to each record.
- Work assignment / labeling job (if exposed in your tenancy): A workflow that assigns work to one or more labelers and tracks completion.
If any of these terms differ in your tenancy, treat them as conceptual equivalents and verify in official docs.
Service type
- Managed cloud service for human-in-the-loop dataset labeling and export.
- Accessed through the OCI Console, REST APIs, and typically the OCI CLI/SDKs (verify current CLI command group availability in your installed CLI version).
Scope: regional and compartment-scoped
In OCI, services are generally:
– Region-specific for resource creation and operations (datasets and related resources typically exist in a region).
– Compartment-scoped for access control and organization.
Data itself generally resides in OCI Object Storage buckets in a specific region, and the Data Labeling service references that data and writes exports back to Object Storage.
How it fits into the Oracle Cloud ecosystem
Data Labeling sits in the Analytics and AI category and commonly supports:
– OCI Data Science model training pipelines
– OCI AI Services projects that require custom training data (where applicable)
– Enterprise governance via IAM, Audit, Tagging, and Compartments
– Storage and lifecycle via Object Storage (and potentially Archive Storage for long-term retention)
3. Why use Data Labeling?
Business reasons
- Faster time-to-model: standardized workflows reduce delays caused by ad hoc labeling tools.
- Better model outcomes: consistent labeling improves training signal and reduces rework.
- Cross-team collaboration: shared datasets and controlled access reduce duplication and confusion.
- Traceability: labeling artifacts can be governed like other cloud assets.
Technical reasons
- Tight integration with OCI IAM and Object Storage means your data stays inside your OCI environment.
- Exported labels can feed training workflows in OCI Data Science or external training systems.
- Programmatic control through APIs (and often CLI/SDK) supports automation.
Operational reasons
- Centralized management of datasets and progress tracking (as exposed by the service).
- Uses OCI standard constructs—compartments, policies, tags, Audit logs—which most platform teams already operate.
Security/compliance reasons
- Access controlled by least privilege using IAM policies.
- Data typically remains in Object Storage, enabling encryption, retention policies, and access logs.
- API calls are captured by OCI Audit.
Scalability/performance reasons
- Object Storage scales for large datasets without you managing capacity.
- Multiple labelers can work in parallel (subject to your workflow design and tenancy setup).
When teams should choose Data Labeling
Choose Oracle Cloud Data Labeling when:
– You already store data in OCI and want labeling to remain in the same cloud boundary.
– You need strong tenancy governance (IAM, Audit, tagging, compartments).
– You want a managed approach instead of running and patching your own labeling platform.
– You need an auditable, repeatable process to produce training datasets.
When teams should not choose it
Avoid (or reconsider) Data Labeling when:
– You need specialized annotation types not supported by the service in your region (verify supported data/annotation types).
– You require a built-in external workforce/managed labeling workforce. OCI Data Labeling is commonly used with your own labelers (employees/contractors) rather than providing a marketplace workforce—verify if your Oracle offering includes any workforce options.
– You are already heavily invested in a different labeling ecosystem (e.g., Label Studio/CVAT) with established pipelines and integrations.
4. Where is Data Labeling used?
Industries
- Healthcare: imaging classification, clinical text categorization (with appropriate compliance controls)
- Manufacturing: defect detection datasets for computer vision
- Retail/e-commerce: product categorization and moderation datasets
- Financial services: document classification, text categorization, fraud-related training sets
- Telecom: ticket classification, NER for network incident descriptions
- Media: content tagging, policy compliance datasets
Team types
- Data science and ML engineering teams
- Platform/Cloud engineering teams (setting up secure workflows)
- Data governance teams (access controls, audit requirements)
- Product teams building AI features
- Annotation teams/operations teams performing labeling
Workloads
- Supervised learning training dataset creation
- Human-in-the-loop dataset cleanup and normalization
- Ongoing labeling for model retraining and drift response
Architectures
- “Object Storage → Data Labeling → Export → Data Science training”
- Hybrid pipelines where training happens outside OCI but labeling and storage stay in OCI
- Secure multi-compartment separation: raw data in one compartment, labeled exports in another
Production vs dev/test usage
- Dev/test: small datasets, quick iteration on label definitions, sampling strategies, QA rules.
- Production: controlled label taxonomy, review workflows, audit requirements, and reproducible exports integrated into CI/CD or MLOps processes.
5. Top Use Cases and Scenarios
Below are realistic scenarios where Oracle Cloud Data Labeling is a good fit. Each use case assumes your raw data is stored in OCI Object Storage and you need consistent labels for ML training.
1) Customer support ticket classification
- Problem: Thousands of tickets need category labels to train an auto-routing classifier.
- Why this service fits: Managed dataset organization + controlled label set; labelers can tag text records.
- Example scenario: Support ops label 20,000 historical tickets into categories like billing, outage, account, and feature_request.
2) Product review sentiment labeling
- Problem: Train a sentiment model but reviews are unlabeled.
- Why this service fits: Simple label sets like positive/neutral/negative; easy export.
- Example scenario: E-commerce team labels 5,000 reviews and exports annotations for training.
3) Image classification for quality inspection
- Problem: Determine whether a product photo indicates pass or fail.
- Why this service fits: Human labeling workflow over image records.
- Example scenario: Manufacturing QA labels 10,000 assembly-line images as ok or defect.
4) Object detection dataset preparation (if supported)
- Problem: Need bounding boxes for objects in images.
- Why this service fits: Some labeling services support bounding box annotation; verify OCI Data Labeling annotation types in official docs for your region.
- Example scenario: Logistics team labels pallets/forklifts in warehouse images for a safety model.
5) Content moderation categorization
- Problem: Train a classifier to detect policy-violating content.
- Why this service fits: Strong governance (IAM/Audit), consistent label taxonomy.
- Example scenario: Trust & Safety labels text snippets into spam, hate, safe, and adult.
6) Document classification (if document datasets are supported)
- Problem: Label inbound PDFs into document types.
- Why this service fits: Central dataset management and export; verify document support and annotation features.
- Example scenario: Finance team labels invoices vs receipts vs statements.
7) Named entity recognition (NER) training data (if supported)
- Problem: Extract entities like customer_name and account_id from text.
- Why this service fits: If NER annotation is supported; otherwise you may need a specialized tool—verify.
- Example scenario: Telecom team labels service notes for entity extraction.
8) Retraining dataset for model drift response
- Problem: Model accuracy drops due to new data patterns.
- Why this service fits: Add new records, label them, export incremental dataset for retraining.
- Example scenario: Monthly labeling batches feed retraining pipeline.
9) Human QA pass on weakly labeled data
- Problem: Labels produced by heuristics are noisy.
- Why this service fits: Use human labelers to correct and validate a subset; export gold dataset.
- Example scenario: Start with rule-based tagging, then correct 10% sample.
10) Multilingual text categorization
- Problem: Need labeled data across multiple languages.
- Why this service fits: Central management and separate datasets per language; labelers assigned by language skill.
- Example scenario: Create datasets tickets-en, tickets-es, and tickets-fr with a shared label taxonomy.
11) Model evaluation holdout dataset labeling
- Problem: Need a trusted test set to measure model performance.
- Why this service fits: Controlled process; tighter access to avoid leakage.
- Example scenario: Security team labels a test dataset only accessible to evaluators.
12) Data governance and auditability for regulated labeling
- Problem: Must prove who labeled what and when.
- Why this service fits: OCI Audit captures API actions; IAM enforces access.
- Example scenario: Healthcare org labels imaging metadata with strict compartment access and audit retention.
6. Core Features
Important: OCI capabilities can vary by region and over time. For exact supported data types and annotation modes, verify in the official Data Labeling documentation for your tenancy/region.
Feature 1: Dataset management (create, organize, lifecycle)
- What it does: Lets you create and manage datasets within compartments.
- Why it matters: Provides structure for labeling projects; reduces ad hoc sprawl.
- Practical benefit: Consistent dataset naming, tagging, and lifecycle controls.
- Caveats: Dataset operations are typically regional; plan for data locality.
Feature 2: Object Storage integration (source and export)
- What it does: Uses OCI Object Storage as the durable store for input records and exported labeled output.
- Why it matters: Object Storage is scalable and supports encryption and lifecycle policies.
- Practical benefit: Easy to integrate exports into ML training pipelines.
- Caveats: Ensure buckets and policies are in the correct region/compartment; watch for egress if exporting across regions.
Feature 3: Label sets and controlled taxonomy
- What it does: Define allowed labels/categories for a dataset or project.
- Why it matters: Prevents label drift and inconsistent categories.
- Practical benefit: Higher-quality training data and cleaner evaluation metrics.
- Caveats: Changing label sets mid-stream can complicate versioning; plan label governance.
Feature 4: Human labeling workflow (UI-based labeling)
- What it does: Provides a console experience for labelers to open records and apply labels/annotations.
- Why it matters: Human-in-the-loop labeling remains essential for many datasets.
- Practical benefit: Reduces need for third-party labeling tools when your workflow fits OCI capabilities.
- Caveats: Complex annotation types may not be supported; validate before committing.
Feature 5: Collaboration through OCI IAM users and groups
- What it does: Enables multiple labelers and reviewers via OCI identity and policy.
- Why it matters: Enterprise governance and least privilege are easier when integrated with OCI IAM.
- Practical benefit: You can separate duties (admins vs labelers vs export operators).
- Caveats: Requires thoughtful policy design to avoid over-broad access to buckets.
Feature 6: Export labeled datasets for training
- What it does: Exports labels/annotations to Object Storage.
- Why it matters: Training systems generally consume files, not labeling UI state.
- Practical benefit: Repeatable training runs from exported artifacts (store them immutably if needed).
- Caveats: Export format and schema must match your training pipeline; verify supported export formats.
Feature 7: API-driven operations (automation-ready)
- What it does: Supports programmatic dataset operations through OCI APIs (and typically SDKs/CLI).
- Why it matters: Enables integration into MLOps pipelines and CI/CD.
- Practical benefit: Automate dataset creation, record import, export, and reporting.
- Caveats: API feature coverage can differ from UI; confirm in API reference.
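As an illustration of API-driven operations, the OCI CLI exposes a Data Labeling command group. This is a sketch: the group and option names (data-labeling-service, dataset list) reflect a recent CLI version and should be confirmed with "oci data-labeling-service --help" in your installed CLI.

```shell
# Sketch: list Data Labeling datasets in a compartment with the OCI CLI.
# Verify the command group name in your CLI version before relying on it.
COMPARTMENT_OCID="ocid1.compartment.oc1..replace-me"   # placeholder OCID

oci data-labeling-service dataset list \
  --compartment-id "$COMPARTMENT_OCID" \
  --output table
```

The same listing is available through the SDKs, which is usually the better fit for MLOps pipelines.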
Feature 8: Governance via compartments, tags, and Audit
- What it does: Uses OCI resource organization and logging primitives.
- Why it matters: Regulated teams need traceability and access boundaries.
- Practical benefit: Standard OCI governance model; easy to align with landing zone patterns.
- Caveats: Audit captures API calls but may not capture every user action detail within a labeling UI—verify what is logged.
7. Architecture and How It Works
High-level service architecture
At a high level:
1. You store raw data (images/text/docs) in OCI Object Storage.
2. You create a Data Labeling dataset that references those objects.
3. Labelers authenticate with OCI IAM and label records in the Console UI.
4. The service stores label state/metadata and can export labeled results back to Object Storage.
5. You feed exported annotations into training (e.g., OCI Data Science jobs/notebooks) and deploy models.
Request/data/control flow
- Control plane: Dataset creation, label set creation, user permissions, export operations.
- Data plane: Object Storage objects are the raw inputs and exported outputs; Data Labeling references them.
- Identity flow: Users authenticate via OCI IAM; access determined by IAM policies.
- Audit flow: OCI Audit records API operations (create/update/delete/export, etc.).
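To make the audit flow concrete, here is a hedged OCI CLI sketch that pulls Audit events for a compartment over a time window. The compartment OCID and dates are placeholders; confirm option names with "oci audit event list --help" in your installed CLI.

```shell
# Sketch: list recent Audit events for the labeling compartment to answer
# "who did what, when". Times are RFC3339; adjust the window as needed.
COMPARTMENT_OCID="ocid1.compartment.oc1..replace-me"   # placeholder OCID

oci audit event list \
  --compartment-id "$COMPARTMENT_OCID" \
  --start-time 2024-01-01T00:00:00Z \
  --end-time 2024-01-02T00:00:00Z
```

This requires a configured CLI and an authorized identity; it cannot run outside your tenancy.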
Integrations with related OCI services
Common integrations include:
– OCI Object Storage: input records + export target
– OCI IAM: users/groups/policies for labelers and admins
– OCI Audit: governance and activity trails
– OCI Events / Notifications: optionally trigger automation when exports complete (verify available event types)
– OCI Data Science: training pipelines consume exported data
– OCI Vault / KMS: encryption key management for Object Storage (and potentially other integrated components)
Dependency services (typical)
- Object Storage bucket(s)
- IAM policies
- Network access to OCI Console endpoints (for labelers)
Security/authentication model
- OCI IAM user authentication (console) and OCI API request signing (SDK/CLI)
- Authorization via IAM policies scoped to compartments and resource families
- Object Storage access controlled via IAM policies and bucket policies (as configured)
Networking model
- Labelers typically use the public OCI Console over HTTPS.
- Data stays in Object Storage; network egress charges can apply if you download/export outside the region or cloud boundary.
- For enterprise environments, consider:
- OCI Cloud Guard and security zones (where applicable)
- Private access patterns for Object Storage (e.g., via Service Gateway in VCN) for compute-based pipelines—labeling UI itself is console-based.
Monitoring/logging/governance
- Audit is the baseline for “who did what” at the API/resource level.
- Operational monitoring is often indirect: track export objects created, dataset status, and downstream training success.
- Use tagging (cost-center, project, data-classification) for cost allocation and governance.
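As a hedged sketch of that tagging suggestion, freeform tags can be applied to the export bucket from the CLI (bucket name and tag values are lab examples; many organizations use defined tags instead, and the --freeform-tags option should be confirmed in your CLI version):

```shell
# Sketch: apply freeform tags to the export bucket for cost allocation.
# Tag keys mirror the ones suggested above; adapt to your tagging standard.
oci os bucket update \
  --bucket-name dl-lab-export \
  --freeform-tags '{"project":"dl-lab","cost-center":"ml-platform","data-classification":"internal"}'
```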
Simple architecture diagram
flowchart LR
U[Labelers<br/>OCI Users] -->|Console (HTTPS)| DL[OCI Data Labeling]
DL -->|Reads raw records| OS[(OCI Object Storage<br/>Raw Data Bucket)]
DL -->|Exports annotations| OS2[(OCI Object Storage<br/>Labeled Export Bucket)]
OS2 --> DS[OCI Data Science<br/>Training/Notebooks]
IAM[(OCI IAM)] --> DL
AUD[(OCI Audit)] --> DL
Production-style architecture diagram
flowchart TB
subgraph Tenancy[OCI Tenancy]
subgraph CompartmentA[Compartment: ml-raw]
OSRAW[(Object Storage Bucket<br/>raw-data)]
end
subgraph CompartmentB[Compartment: ml-labeling]
DL[Data Labeling Datasets<br/>Label Sets / Jobs]
TAGS[Tagging & Cost Tracking]
end
subgraph CompartmentC[Compartment: ml-train]
OSLBL[(Object Storage Bucket<br/>labeled-exports)]
DS[OCI Data Science<br/>Projects/Jobs]
ART[(Model Artifacts<br/>Object Storage)]
end
IAM[(OCI IAM<br/>Groups/Policies)]
AUD[(OCI Audit)]
EVT[OCI Events]
NOTIF[OCI Notifications]
end
OSRAW --> DL
DL --> OSLBL
OSLBL --> DS
DS --> ART
IAM --> DL
IAM --> OSRAW
IAM --> OSLBL
AUD --> DL
AUD --> OSRAW
EVT --> NOTIF
DL -.optional events.-> EVT
8. Prerequisites
Tenancy/account requirements
- An active Oracle Cloud (OCI) tenancy with permissions to use Analytics and AI services.
- Access to a region where Data Labeling is available. Availability varies—verify in official docs and in your Console service list.
Permissions / IAM roles
You typically need:
– Permission to create and manage Data Labeling resources in a compartment.
– Permission to read input objects from Object Storage.
– Permission to write export objects to Object Storage.
OCI IAM is policy-based; exact policy statements depend on your compartment structure. The resource family name for Data Labeling in IAM policies must match the official documentation—verify in official docs.
Example policy patterns to validate and adapt (do not paste blindly without verification):
– Manage Data Labeling resources in a compartment
– Read objects from a specific bucket/prefix
– Write objects to an export bucket
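Those policy patterns can be sketched as CLI-created policy statements. Treat every name here as an assumption to verify: in particular, datalabeling-family is a guess at the Data Labeling resource-family name, and the where-clauses assume the lab bucket names.

```shell
# Sketch only. Verify the exact resource-family and verb names in the
# official IAM policy reference before creating this policy.
cat > statements.json << 'EOF'
[
  "Allow group DataLabelers to use datalabeling-family in compartment dl-lab",
  "Allow group DataLabelers to read objects in compartment dl-lab where target.bucket.name='dl-lab-raw'",
  "Allow group DataLabelers to manage objects in compartment dl-lab where target.bucket.name='dl-lab-export'"
]
EOF

oci iam policy create \
  --compartment-id "$COMPARTMENT_OCID" \
  --name dl-lab-labeler-policy \
  --description "Least-privilege access for labelers (verify statements)" \
  --statements file://statements.json
```

Note that manage-objects on the export bucket includes delete; tighten the verbs if your governance model requires write-only exports.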
Billing requirements
- You need billing enabled for the tenancy (even if the service itself is no-charge, dependent services like storage or egress are billable).
- You need a budget owner/cost center tag strategy for tracking.
CLI/SDK/tools needed
For this tutorial:
– OCI Console access (required for interactive labeling UI)
– Optional: OCI CLI for creating buckets and uploading sample records
Install/verify OCI CLI: https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/cliinstall.htm
Region availability
- Data Labeling might not be in every region. Check:
- OCI Services availability in your region (Console service list)
- Official docs for region support (service documentation)
Quotas/limits
- OCI enforces tenancy and compartment quotas for many services.
- Object Storage has practical limits (object count, request rates) and service limits (see official docs).
- Data Labeling may have dataset/job limits—verify in official docs.
Prerequisite services
- OCI Object Storage for input and export buckets
- OCI IAM for groups/policies
- Recommended: OCI Audit (enabled by default in OCI)
9. Pricing / Cost
Current pricing model (what to verify)
Oracle Cloud pricing changes over time and can be region- or contract-dependent. For Data Labeling, the direct service charge may be:
– No additional charge (in some OCI services, labeling tools are provided without separate metering), or
– Metered by usage dimensions (less common for basic labeling tools), or
– Included as part of broader AI/ML offerings.
Because billing should never be based on guesses, do this first:
– Check the official OCI pricing pages and the Data Labeling docs “Pricing” section (if present).
– Use the official OCI cost estimator.
Official pricing entry points:
– OCI Pricing overview: https://www.oracle.com/cloud/pricing/
– OCI Cost Estimator: https://www.oracle.com/cloud/costestimator.html
– OCI price list (for SKU-level detail): https://www.oracle.com/cloud/price-list/
Pricing dimensions to consider (most common in practice)
Even if Data Labeling itself is low-cost/no-charge, you still pay for dependencies:
- Object Storage
  – Storage capacity (GB-month)
  – Requests (PUT/GET/LIST)
  – Retrieval (if using Archive tier)
- Network egress
  – Downloading exported datasets outside OCI/region can incur egress charges.
- Compute for training
  – OCI Data Science job runs/notebooks and GPU usage (separate from labeling).
- People cost
  – Human labeling time is often the biggest cost driver. This is not an OCI bill, but it’s real.
Free tier (if applicable)
OCI has an Always Free tier for certain services. Whether Data Labeling is included or free in your tenancy depends on Oracle’s current program—verify at https://www.oracle.com/cloud/free/
Cost drivers (direct and indirect)
- Dataset size (number of objects/records)
- Export frequency (daily/weekly exports increase storage + request counts)
- Duplicate datasets and poor lifecycle policies (extra storage)
- Cross-region movement of exports
- Labeling team size and throughput (people cost, operations overhead)
- QA/review cycles (re-labeling increases time and complexity)
Hidden or indirect costs
- Data duplication: exporting multiple versions of annotations without lifecycle rules.
- Large objects: high-resolution images drive storage and slower human workflows.
- Egress surprise: downloading labeled exports to on-prem or other cloud.
- Operational overhead: IAM policy mistakes causing delays.
How to optimize cost
- Keep raw data in the most cost-effective tier (but don’t use Archive if you need frequent access).
- Use lifecycle rules for old exports (e.g., move to Archive after 30–90 days).
- Export only what you need (e.g., incremental exports, if supported and appropriate).
- Reduce object size where it doesn’t harm training signal (resize images, compress).
- Use sampling strategies to label fewer, higher-value examples first.
Example low-cost starter estimate (no fabricated numbers)
A small pilot typically includes:
– 10–100 MB of text files or compressed images in Object Storage
– 1–2 exports
– 1–3 labelers for a few hours
Costs will usually be dominated by human time, while OCI costs are mainly Object Storage requests and storage. Use the OCI Cost Estimator to model your region and expected storage and requests.
Example production cost considerations
In production, typical cost planning includes:
– Raw data: 1–10+ TB in Object Storage
– Exported versions: multiple TB over time
– High request rates due to frequent dataset refresh
– Substantial people costs for labeling and QA
– Downstream training GPU costs (often much larger than storage)
10. Step-by-Step Hands-On Tutorial
This lab is designed to be safe, low-cost, and beginner-friendly, while still reflecting a real workflow: store raw data in Object Storage, create a dataset in Oracle Cloud Data Labeling, label records, export labeled data, and clean up.
Objective
Create a small text classification dataset in Oracle Cloud Data Labeling, label a handful of records, export the labeled dataset to Object Storage, and verify the export.
Lab Overview
You will:
1. Create two Object Storage buckets: one for raw records and one for exports.
2. Upload a few small .txt files that represent records to be labeled.
3. Create a Data Labeling dataset and a label set (e.g., positive, negative).
4. Label records in the OCI Console labeling UI.
5. Export the labeled dataset to an export bucket.
6. Validate the exported objects exist in Object Storage.
7. Clean up all resources.
Note on UI navigation: The Console location for Data Labeling can change. If you don’t see it under “Analytics and AI,” use the Console search bar for Data Labeling.
Step 1: Create a compartment (recommended)
Why: Keeps lab resources isolated for cleanup and least privilege.
Console
1. Open OCI Console.
2. Go to Identity & Security → Compartments.
3. Click Create Compartment.
4. Name: dl-lab
5. Click Create.
Expected outcome: A new compartment dl-lab exists.
Step 2: Create Object Storage buckets
You will create:
– dl-lab-raw (raw text records)
– dl-lab-export (exported labeled data)
Option A: Console
1. Go to Storage → Object Storage → Buckets
2. Select compartment: dl-lab
3. Click Create Bucket
4. Bucket name: dl-lab-raw
Default storage tier is fine for a lab.
5. Create another bucket: dl-lab-export
Option B: OCI CLI (optional)
Prereqs:
– OCI CLI configured (oci setup config)
– You know your namespace
Get namespace:
oci os ns get
Create buckets (replace <COMPARTMENT_OCID>):
oci os bucket create \
--compartment-id <COMPARTMENT_OCID> \
--name dl-lab-raw
oci os bucket create \
--compartment-id <COMPARTMENT_OCID> \
--name dl-lab-export
Expected outcome: Two buckets exist in your selected region.
Step 3: Upload sample text records to the raw bucket
Create a local folder and add a few .txt files.
On your machine
mkdir -p dl-lab-records
cat > dl-lab-records/001.txt << 'EOF'
I love how fast the delivery was. Great experience.
EOF
cat > dl-lab-records/002.txt << 'EOF'
The item arrived broken and support was unhelpful.
EOF
cat > dl-lab-records/003.txt << 'EOF'
It is okay. Not bad, not great—just average.
EOF
Upload to Object Storage.
Option A: Console
1. Open Storage → Object Storage → Buckets → dl-lab-raw
2. Click Upload
3. Upload 001.txt, 002.txt, 003.txt
Option B: OCI CLI
Replace <NAMESPACE>:
oci os object put --namespace-name <NAMESPACE> \
--bucket-name dl-lab-raw \
--name records/001.txt --file dl-lab-records/001.txt
oci os object put --namespace-name <NAMESPACE> \
--bucket-name dl-lab-raw \
--name records/002.txt --file dl-lab-records/002.txt
oci os object put --namespace-name <NAMESPACE> \
--bucket-name dl-lab-raw \
--name records/003.txt --file dl-lab-records/003.txt
Expected outcome: You can see three objects in the bucket under prefix records/.
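You can confirm the upload with the same CLI used above (replace <NAMESPACE> as before):

```shell
# Verify the upload by listing objects under the records/ prefix.
oci os object list --namespace-name <NAMESPACE> \
  --bucket-name dl-lab-raw \
  --prefix records/
```

The output should include records/001.txt, records/002.txt, and records/003.txt.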
Step 4: Create IAM access for labelers (minimum required)
If you are doing this lab as an administrator in your tenancy, you might already have access. For real teams, create a group (e.g., DataLabelers) and grant the minimum necessary permissions.
Because IAM policy syntax and resource families must be exact, use this step as a checklist and verify in official docs:
– Data labelers need to:
– use/manage Data Labeling resources (dataset operations and labeling)
– read input objects from dl-lab-raw
– write export objects to dl-lab-export
Expected outcome: Your user (or group) can create datasets and read/write the relevant buckets.
Verification step (practical):
– In the Console, confirm you can:
– list objects in dl-lab-raw
– create a Data Labeling dataset
If either fails, troubleshoot IAM before proceeding.
Step 5: Create a Data Labeling dataset (text)
Console
1. Navigate to Data Labeling in the OCI Console (use search if needed).
2. Select compartment: dl-lab
3. Click Create dataset
4. Name: sentiment-lab
5. Choose dataset type: Text (or the closest equivalent shown)
6. Create the dataset.
Expected outcome: Dataset sentiment-lab exists and is empty (no records yet) or ready for record import.
Step 6: Add/import records from Object Storage
Console
1. Open dataset sentiment-lab
2. Find the option to Add records / Import data (exact wording varies)
3. Select:
– Bucket: dl-lab-raw
– Prefix: records/
4. Start the import.
Expected outcome: Dataset shows 3 records available for labeling.
Verification:
– The dataset record list displays 001.txt, 002.txt, 003.txt (or equivalent record identifiers).
Step 7: Create a label set for sentiment
Console
1. In dataset settings (or label configuration), create a label set with labels:
– positive
– negative
– neutral
2. Save the label set.
Expected outcome: Labelers can choose only these labels, improving consistency.
Step 8: Label the records in the labeling UI
Console
1. Open the dataset and choose Start labeling / Label (exact wording varies)
2. For each record:
– 001.txt → positive
– 002.txt → negative
– 003.txt → neutral
3. Save/submit labels.
Expected outcome: Each record shows as labeled, and dataset progress indicates 3/3 labeled (or similar).
Step 9: Export the labeled dataset to Object Storage
Console
1. In the dataset, find Export (or “Export annotations”)
2. Choose target:
– Bucket: dl-lab-export
– Prefix: exports/sentiment-lab/ (recommended)
3. Choose export format (if prompted).
If multiple formats exist, choose the one best aligned with your training toolchain. If unsure, choose the default and inspect the output.
4. Start export.
Expected outcome: Export completes successfully and objects appear in dl-lab-export under exports/sentiment-lab/.
Validation
Validate from Object Storage.
Console
1. Go to Storage → Object Storage → Buckets → dl-lab-export
2. Open exports/sentiment-lab/
3. Confirm one or more export files exist (e.g., manifest/annotation files).
Optional CLI validation
oci os object list --namespace-name <NAMESPACE> \
--bucket-name dl-lab-export \
--prefix exports/sentiment-lab/
You should see exported objects listed.
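If you only want the object names, the CLI's global --query option (JMESPath) trims the output; this is optional and uses the same placeholders:

```shell
# List only the names of exported objects under the export prefix.
oci os object list --namespace-name <NAMESPACE> \
  --bucket-name dl-lab-export \
  --prefix exports/sentiment-lab/ \
  --query 'data[].name' \
  --all
```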
Troubleshooting
Issue: “Not authorized” when creating dataset or importing records
Likely cause: Missing IAM policy for Data Labeling and/or Object Storage access.
Fix:
– Confirm your user/group has permission for Data Labeling resources in the correct compartment.
– Confirm read access to dl-lab-raw objects and write access to dl-lab-export.
Issue: Records import fails or shows zero records
Likely causes:
– Wrong bucket/prefix
– Objects are not in the expected region
– Unsupported file types for the chosen dataset type
Fix:
– Confirm objects exist under the prefix.
– Try importing without a prefix to validate visibility.
– Verify supported input file formats in official docs.
Issue: Export completes but you can’t find output files
Likely causes:
– Exported to a different bucket/prefix
– You lack permission to list objects in export bucket
Fix:
– Re-check export configuration (bucket and prefix).
– Verify IAM permissions on dl-lab-export.
Issue: Labeling UI is slow or errors in browser
Likely causes: Browser extensions, network restrictions, or session timeouts.
Fix:
– Try an incognito/private window.
– Use a supported browser per OCI Console requirements.
– Ensure corporate proxy rules allow OCI Console domains.
Cleanup
To avoid ongoing cost and clutter:
- Delete the Data Labeling dataset
  – Go to Data Labeling → dataset sentiment-lab → Delete
- Delete exported objects
  – Empty the dl-lab-export bucket (delete objects)
- Delete raw objects
  – Empty the dl-lab-raw bucket (delete objects)
- Delete buckets
  – Delete dl-lab-export
  – Delete dl-lab-raw
- Delete compartment (optional)
  – If you created dl-lab, delete it after confirming it contains no resources.
Expected outcome: No lab resources remain.
11. Best Practices
Architecture best practices
- Keep data close: store raw data and exports in the same region as Data Labeling to reduce latency and avoid cross-region transfer.
- Separate raw vs labeled buckets: use distinct buckets or prefixes and separate permissions.
- Version your exports: export to a versioned prefix (`exports/<dataset>/<YYYY-MM-DD>/`) so training runs are reproducible.
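The versioned-prefix convention above is easy to enforce in pipeline code. A minimal sketch (the function name `export_prefix` and the `exports/<dataset>/<date>/` layout are illustrative choices, not part of the service):

```python
from datetime import date, datetime, timezone

def export_prefix(dataset_name, run_date=None):
    """Build a date-versioned export prefix like exports/<dataset>/<YYYY-MM-DD>/."""
    run_date = run_date or datetime.now(timezone.utc).date()
    return f"exports/{dataset_name}/{run_date.isoformat()}/"

print(export_prefix("sentiment-lab", date(2024, 5, 1)))
# → exports/sentiment-lab/2024-05-01/
```

Passing the generated prefix to each export (and never reusing an old one) keeps every training run pointed at an immutable snapshot.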
IAM/security best practices
- Use least privilege:
  - Labelers often need read access to raw records and write access only to exports (not delete).
  - Admins manage datasets, label sets, and export configuration.
- Use groups (`DataLabelers`, `DataLabelAdmins`) rather than individual user policies.
- Use compartment boundaries:
  - Keep raw data in a compartment with stricter access.
  - Keep labeling projects in a separate compartment.
- Require MFA for labeler accounts where possible.
Cost best practices
- Use Object Storage lifecycle policies for old exports.
- Avoid repeatedly exporting full datasets if incremental export strategies work for your pipeline (verify what export options exist).
- Keep objects compressed and reasonably sized.
Performance best practices
- Organize objects with sensible prefixes (`records/`, `exports/`) for manageable listing and operations.
- Avoid extremely large single objects that are slow for labelers to open.
Reliability best practices
- Treat export artifacts as immutable training inputs: don’t overwrite; write new versions.
- Keep a backup of label taxonomy and labeling guidelines outside the tool (e.g., a controlled doc) to prevent drift.
Operations best practices
- Standardize naming:
  - Dataset names: `<project>-<datatype>-<purpose>`
  - Buckets: `<env>-<team>-<purpose>`
- Use tags: `project`, `environment`, `owner`, `cost-center`, `data-classification`
- Establish a review process (human QA) for label quality:
  - Sampling-based review
  - Inter-annotator agreement checks (if your workflow supports multiple labelers)
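If your workflow does produce multiple annotations per record, a simple agreement check is cheap to run over exported labels. This sketch computes pairwise percent agreement; the input shape (a dict of per-annotator label maps) is an assumption for illustration, not an export format:

```python
from itertools import combinations

def percent_agreement(labels_by_annotator):
    """Fraction of shared-record comparisons where two annotators agree.

    labels_by_annotator: dict mapping annotator name -> {record_id: label}.
    Only records labeled by both annotators in a pair are compared.
    """
    matches = total = 0
    for a, b in combinations(labels_by_annotator.values(), 2):
        for record_id in a.keys() & b.keys():  # records both annotators labeled
            total += 1
            matches += a[record_id] == b[record_id]
    return matches / total if total else 0.0

labels = {
    "annotator1": {"r1": "positive", "r2": "neutral", "r3": "negative"},
    "annotator2": {"r1": "positive", "r2": "negative", "r3": "negative"},
}
print(percent_agreement(labels))  # 2 of 3 shared records agree
```

For production QA you would likely replace raw percent agreement with a chance-corrected statistic such as Cohen's kappa, but the plumbing (joining annotations by record ID across labelers) is the same.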
Governance best practices
- Define data classification and retention rules:
- Are you labeling PII? If yes, enforce access controls and minimize exposure.
- Keep audit logs retention aligned with compliance policies.
12. Security Considerations
Identity and access model
- OCI IAM governs access using:
- Users, groups, policies
- Compartments as authorization boundaries
- Use separate groups for:
- Labelers: can label, view records, export if needed
- Admins: can create datasets, manage label sets, manage exports
Encryption
- Object Storage supports encryption at rest (Oracle-managed keys by default).
- For stricter control, consider customer-managed keys with OCI Vault/KMS (verify supported configurations and organizational requirements).
Network exposure
- Console-based labeling uses HTTPS to OCI endpoints.
- For data processing pipelines that run in a VCN (e.g., training jobs), use a Service Gateway for private access to Object Storage where appropriate.
Secrets handling
- Prefer OCI-native auth (IAM principals, instance principals, resource principals) over embedding API keys in scripts.
- If using API keys for OCI CLI/SDK, store and rotate them securely.
Audit/logging
- OCI Audit records API calls for supported services.
- Use Audit to track dataset creation, deletion, and export actions.
- If you need additional operational observability, log and monitor:
- Export object creation in Object Storage
- Downstream training pipeline results
Compliance considerations
- If labeling data includes PII/PHI or regulated content:
- Minimize access to raw data (need-to-know)
- Consider redaction or anonymization before labeling
- Establish retention and deletion policies
- Document your labeling SOPs (standard operating procedures)
Common security mistakes
- Granting labelers broad `manage object-family` permissions across the tenancy.
- Storing raw sensitive data and exports in the same bucket with permissive policies.
- Exporting labeled datasets to public buckets or generating pre-authenticated requests without controls.
Secure deployment recommendations
- Use dedicated compartments for raw, labeling, and training.
- Use strict bucket policies; limit to specific buckets and prefixes.
- Apply consistent tagging and ownership.
- Periodically review IAM policies and group membership.
13. Limitations and Gotchas
These are common limitations/pitfalls seen in managed labeling workflows. For service-specific hard limits, verify in official Data Labeling docs.
Known limitations (verify specifics)
- Region availability may be limited compared to core OCI services.
- Supported data types and annotation types may not cover all needs (e.g., advanced polygon segmentation, 3D point clouds).
- Export formats may require transformation before training.
Quotas and service limits
- Dataset count, record count, or concurrency limits may apply.
- Object Storage request rate limits can be hit during bulk operations.
Regional constraints
- Buckets are regional. Keep raw and export buckets in the same region as the dataset for simpler operations.
Pricing surprises
- Even if Data Labeling has minimal direct cost, you can be charged for:
- Object Storage capacity and requests
- Network egress for downloads
- Downstream compute training (often the major cloud cost)
Compatibility issues
- Your ML training framework expects a particular annotation schema; you may need a conversion step.
- File naming conventions and character sets can cause import issues.
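A conversion step between the export and your training framework is often a few lines of code. The sketch below flattens hypothetical JSONL export records into `(source_path, label)` pairs — the field names (`sourceDetails`, `annotations`, `label_name`) are placeholders; verify the real schema from a small pilot export before relying on this:

```python
import json

# Hypothetical exported record shape -- verify the real schema from a pilot export.
exported_jsonl = """\
{"sourceDetails": {"path": "records/review-001.txt"}, "annotations": [{"entities": [{"labels": [{"label_name": "positive"}]}]}]}
{"sourceDetails": {"path": "records/review-002.txt"}, "annotations": [{"entities": [{"labels": [{"label_name": "negative"}]}]}]}
"""

def to_training_rows(jsonl_text):
    """Flatten each exported record to a (source_path, label) pair."""
    rows = []
    for line in jsonl_text.splitlines():
        record = json.loads(line)
        path = record["sourceDetails"]["path"]
        label = record["annotations"][0]["entities"][0]["labels"][0]["label_name"]
        rows.append((path, label))
    return rows

print(to_training_rows(exported_jsonl))
# → [('records/review-001.txt', 'positive'), ('records/review-002.txt', 'negative')]
```

Keeping this conversion in version control alongside the training code makes schema changes in future exports visible as test failures rather than silent data corruption.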
Operational gotchas
- Changing label definitions mid-project creates dataset versioning challenges.
- Mixed labeling standards across labelers reduce model accuracy; invest in guidelines and QA.
- Browser-based labeling can be impacted by session timeouts; plan work accordingly.
Migration challenges
- Migrating from Label Studio/CVAT/doccano requires mapping label schemas and export formats.
- Ensure consistent class names, IDs, and annotation coordinate conventions.
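Label-schema mapping during a migration is another place where a small, strict script pays off. This sketch (the `LABEL_MAP` contents and the tuple record shape are hypothetical examples) renames classes from a source tool's taxonomy and fails loudly on anything unmapped, so silent class drift can't slip through:

```python
# Hypothetical mapping from a source tool's taxonomy to the target label set.
LABEL_MAP = {"POS": "positive", "NEU": "neutral", "NEG": "negative"}

def remap_labels(records, label_map=LABEL_MAP, strict=True):
    """Rename class labels in (record_id, label) pairs.

    With strict=True, an unmapped label raises instead of being dropped,
    which surfaces taxonomy mismatches early in a migration.
    """
    out = []
    for record_id, label in records:
        if label not in label_map:
            if strict:
                raise KeyError(f"Unmapped label {label!r} for record {record_id}")
            continue  # non-strict mode: skip unmapped records
        out.append((record_id, label_map[label]))
    return out

print(remap_labels([("r1", "POS"), ("r2", "NEG")]))
# → [('r1', 'positive'), ('r2', 'negative')]
```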
Vendor-specific nuances
- OCI resource organization via compartments is powerful but can confuse new teams; document your compartment strategy early.
14. Comparison with Alternatives
Data labeling exists across clouds and in open-source tools. The best choice depends on annotation complexity, governance needs, workforce model, and integration targets.
Comparison table
| Option | Best For | Strengths | Weaknesses | When to Choose |
|---|---|---|---|---|
| Oracle Cloud Data Labeling | Teams already on OCI needing governed labeling workflows | Tight OCI IAM/compartment integration, Object Storage-native, good for governed environments | Feature set and annotation types may be narrower than specialized tools; region availability can vary | You want labeling inside OCI with standard governance and easy export to OCI Data Science |
| OCI Data Science (adjacent) | End-to-end OCI ML lifecycle | Training and MLOps features; integrates with Object Storage | Not a labeling tool by itself | Use alongside Data Labeling for training and deployment |
| AWS SageMaker Ground Truth | AWS-native labeling with managed workforce options | Mature ecosystem, workforce options, strong integrations | AWS lock-in; pricing and workforce features vary | You’re AWS-first and want built-in workforce and tight SageMaker integration |
| Google Cloud Data Labeling Service / Vertex AI labeling | GCP-native ML pipelines | Vertex AI integration, managed workflows | GCP lock-in; feature availability varies | You’re on Vertex AI and want cloud-native labeling |
| Azure Machine Learning Data Labeling | Azure ML users | Integrated with Azure ML pipelines | Azure lock-in; workflow complexity can vary | You’re Azure-first and training in Azure ML |
| Label Studio (open-source / self-managed) | Custom workflows, extensibility | Highly flexible, plugin ecosystem | You manage hosting, scaling, security, upgrades | You need custom annotation types and can operate the platform |
| CVAT (open-source) | Computer vision annotation (boxes, polygons, etc.) | Strong CV annotation capabilities | Self-managed ops burden | You need advanced vision annotation beyond managed service capabilities |
| doccano (open-source) | NLP labeling (classification/NER) | Good for text workflows | Self-managed; limited enterprise governance out-of-box | You need NLP-focused annotation with customization |
15. Real-World Example
Enterprise example: Regulated customer communications classification
- Problem: A financial services enterprise needs labeled text data from customer communications (emails/tickets) to train a classifier for routing and compliance flagging. Data includes sensitive content and requires strict audit trails.
- Proposed architecture:
  - Raw communications stored in OCI Object Storage in a restricted compartment (`fin-raw`).
  - Data Labeling datasets in the `fin-labeling` compartment.
  - Exported labeled datasets written to the `fin-exports` compartment/bucket with stricter write controls and immutable versioning.
  - Training in OCI Data Science using exported datasets; model artifacts stored in Object Storage.
  - Governance with IAM groups (`Labelers`, `Reviewers`, `MLAdmins`), tagging, and Audit retention.
- Why Data Labeling was chosen:
  - OCI IAM/compartment governance, centralized workflows, and auditability.
  - Keeps data inside Oracle Cloud boundaries.
- Expected outcomes:
  - Faster dataset creation with controlled label taxonomy.
  - Improved model performance due to consistent labels.
  - Audit-ready processes demonstrating controlled access and traceable exports.
Startup/small-team example: Quick sentiment model MVP
- Problem: A startup wants to launch a sentiment feature in 2 weeks. They have 2,000 reviews but no labels.
- Proposed architecture:
  - Reviews stored as `.txt` objects in a single Object Storage bucket.
  - One Data Labeling dataset with three labels (`positive`/`neutral`/`negative`).
  - Weekly export to Object Storage; training done in a small OCI Data Science notebook/job.
- Why Data Labeling was chosen:
  - Minimal infrastructure; no need to host an annotation platform.
  - Simple workflow with fast iteration.
- Expected outcomes:
  - MVP dataset labeled quickly.
  - Repeatable export and retraining loop.
  - Clear path to add QA and multi-labeler review later.
16. FAQ
- Is Oracle Cloud Data Labeling a separate product or part of OCI?
  It is an OCI service commonly referred to as Data Labeling (or Data Labeling service). It integrates with OCI services like Object Storage and IAM.
- Do I need OCI Object Storage to use Data Labeling?
  In most practical workflows, yes—raw records and exported labels are commonly stored in Object Storage.
- Does Data Labeling provide human labelers (a workforce)?
  Typically, you use your own OCI users (employees/contractors). If you need a managed workforce, verify current Oracle offerings and your contract terms.
- What data types can I label?
  Commonly images and text, and sometimes documents depending on the service version and region. Verify supported data types in official docs.
- Can I automate dataset creation and export?
  Usually yes via OCI APIs (and often CLI/SDK). Confirm API coverage in the Data Labeling API reference.
- How do I control who can label data?
  Use OCI IAM groups and policies scoped to the dataset compartment and Object Storage buckets.
- How do I prevent label taxonomy drift?
  Lock down who can edit label sets, document labeling guidelines, and version your exports.
- Where are labels stored?
  Labels/annotations are managed by the service and can be exported to Object Storage for training and archiving.
- What export formats are supported?
  Export formats vary by dataset type and service version. Verify supported formats and validate with a small pilot export.
- Can I use the labeled output with OCI Data Science?
  Yes—export to Object Storage and consume the exported artifacts in OCI Data Science jobs/notebooks.
- How do I track labeling progress?
  The Console typically shows dataset/job progress and labeled counts. For automation, use APIs to query status (verify exact endpoints).
- Is Data Labeling suitable for large-scale annotation (millions of records)?
  Potentially, but operational planning is needed: quotas, throughput, object organization, and QA processes. Validate limits in official docs.
- How do I handle sensitive data like PII?
  Restrict access, consider redaction/anonymization, enforce encryption and audit retention, and follow compliance requirements.
- Can multiple labelers label the same record for agreement checks?
  Some labeling systems support multiple annotations per record; verify whether OCI Data Labeling supports this natively in your tenancy.
- What’s the fastest way to start?
  Store a small set of records in Object Storage, create a dataset, define labels, label a few records, export, and confirm your training pipeline can read the export.
17. Top Online Resources to Learn Data Labeling
| Resource Type | Name | Why It Is Useful |
|---|---|---|
| Official documentation | OCI Data Labeling Documentation: https://docs.oracle.com/en-us/iaas/data-labeling/ | Primary source for supported data types, workflows, APIs, and limits |
| Official API reference | OCI APIs (start here, then navigate to Data Labeling): https://docs.oracle.com/en-us/iaas/api/ | Authoritative API operations and schemas for automation |
| Official CLI install | OCI CLI Installation: https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/cliinstall.htm | Helps you automate Object Storage + OCI operations |
| Official Object Storage docs | Object Storage Overview: https://docs.oracle.com/en-us/iaas/Content/Object/Concepts/objectstorageoverview.htm | Required to manage input/output data and lifecycle policies |
| Official IAM docs | IAM Overview: https://docs.oracle.com/en-us/iaas/Content/Identity/Concepts/overview.htm | Required for secure access design for labelers/admins |
| Official Audit docs | Audit Overview: https://docs.oracle.com/en-us/iaas/Content/Audit/Concepts/auditoverview.htm | Understand audit trails and governance |
| Official pricing | OCI Pricing: https://www.oracle.com/cloud/pricing/ | Understand cost model and billing dimensions |
| Official cost estimator | OCI Cost Estimator: https://www.oracle.com/cloud/costestimator.html | Create region-specific estimates without guessing |
| Architecture center | OCI Architecture Center: https://docs.oracle.com/solutions/ | Reference architectures that help design production ML platforms |
| Training (official) | Oracle Cloud training portal: https://education.oracle.com/ | Look for OCI Data Science / AI learning paths that reference labeling workflows |
| Community learning | Oracle Cloud Infrastructure blog: https://blogs.oracle.com/cloud-infrastructure/ | Practical posts and updates (verify against docs) |
| Samples (check availability) | Oracle GitHub (search OCI AI/Data Science samples): https://github.com/oracle | May include reference code; validate compatibility and recency |
18. Training and Certification Providers
| Institute | Suitable Audience | Likely Learning Focus | Mode | Website URL |
|---|---|---|---|---|
| DevOpsSchool.com | DevOps engineers, cloud engineers, platform teams | OCI fundamentals, MLOps/DevOps practices, automation basics | Check website | https://www.devopsschool.com/ |
| ScmGalaxy.com | Students, engineers moving into DevOps/Cloud | CI/CD, SCM, cloud fundamentals | Check website | https://www.scmgalaxy.com/ |
| CloudOpsNow.in | Cloud operations and engineering teams | Cloud ops, monitoring, governance | Check website | https://www.cloudopsnow.in/ |
| SreSchool.com | SREs, reliability engineers, platform teams | SRE practices, production ops, reliability patterns | Check website | https://www.sreschool.com/ |
| AiOpsSchool.com | Ops + data teams | AIOps concepts, monitoring automation | Check website | https://www.aiopsschool.com/ |
19. Top Trainers
| Platform/Site | Likely Specialization | Suitable Audience | Website URL |
|---|---|---|---|
| RajeshKumar.xyz | DevOps/cloud training content (verify specific offerings) | Beginners to intermediate engineers | https://rajeshkumar.xyz/ |
| devopstrainer.in | DevOps training and mentoring (verify OCI coverage) | Engineers seeking structured training | https://www.devopstrainer.in/ |
| devopsfreelancer.com | Freelance DevOps support/training (verify scope) | Teams needing hands-on guidance | https://www.devopsfreelancer.com/ |
| devopssupport.in | DevOps support services and learning resources (verify scope) | Ops teams and practitioners | https://www.devopssupport.in/ |
20. Top Consulting Companies
| Company | Likely Service Area | Where They May Help | Consulting Use Case Examples | Website URL |
|---|---|---|---|---|
| cotocus.com | Cloud/DevOps consulting (verify OCI specialization) | Architecture, automation, operationalization | Designing secure OCI compartments; setting up Object Storage governance; pipeline automation | https://cotocus.com/ |
| DevOpsSchool.com | DevOps & cloud consulting/training | Platform enablement, CI/CD, cloud ops | Building MLOps-ready landing zones; IAM best practices; operational playbooks | https://www.devopsschool.com/ |
| DEVOPSCONSULTING.IN | DevOps consulting services | Delivery acceleration, DevOps toolchains | Implementing CI/CD; infrastructure automation; operational readiness reviews | https://www.devopsconsulting.in/ |
21. Career and Learning Roadmap
What to learn before this service
- OCI basics: regions, compartments, VCN concepts
- OCI IAM: users, groups, policies, dynamic groups
- OCI Object Storage: buckets, prefixes, lifecycle policies
- ML fundamentals: supervised learning, train/validation/test splits
- Data governance basics: labeling guidelines, QA processes
What to learn after this service
- OCI Data Science: projects, notebooks, jobs, model deployment
- MLOps practices: versioning datasets, reproducible training runs, pipeline automation
- Monitoring ML systems: data drift detection concepts, evaluation pipelines
- Security deep dives: Vault/KMS, security zones, Cloud Guard (where applicable)
Job roles that use it
- Data Scientist
- ML Engineer
- MLOps Engineer
- Cloud/Platform Engineer supporting AI platforms
- Data/AI Program Manager (for governance and throughput planning)
- Security Engineer (reviewing access and compliance controls)
Certification path (if available)
Oracle certifications change over time. Start with:
- OCI foundations certifications
- OCI Data Science/AI learning paths (if offered)
Check the official Oracle training portal: https://education.oracle.com/
Project ideas for practice
- Build a sentiment classifier using labeled text exported from Data Labeling and trained in OCI Data Science.
- Create an end-to-end pipeline: upload new raw records daily → label weekly → export → retrain monthly.
- Implement dataset governance: compartment design + IAM least privilege + tagging + lifecycle rules.
- Create a conversion script that transforms exported labels into the exact format required by your ML framework.
22. Glossary
- Annotation: The label information applied to a record (e.g., class label for text, bounding box for image).
- Compartment (OCI): A logical container for organizing and isolating OCI resources for access control and billing.
- Dataset: A managed collection of records to be labeled.
- Export: The process of writing labeled annotations to a file/object format in Object Storage for training use.
- IAM Policy (OCI): A statement defining who can do what on which resources in OCI.
- Label set: The defined list of allowed labels/categories used for consistent tagging.
- Object Storage: OCI service for storing unstructured data as objects in buckets.
- Record: A single data item to label (e.g., one text file or image object).
- Supervised learning: ML training method where the model learns from labeled examples.
- Tenancy (OCI): Your OCI account boundary containing compartments, IAM, and resources.
23. Summary
Oracle Cloud Data Labeling (Analytics and AI) is a managed OCI service for creating and exporting labeled datasets used in supervised machine learning. It fits naturally into OCI architectures by integrating with Object Storage for data, IAM for access control, and Audit for governance.
Cost planning should focus on the real drivers: Object Storage usage, export/versioning strategy, network egress if data leaves the region, downstream training compute, and—most importantly—human labeling time. Security should be built around least-privilege IAM, compartment separation, encryption, and auditability.
Use Data Labeling when you want a governed, OCI-native workflow for labeling data that will feed training pipelines such as OCI Data Science. The best next step is to run a small pilot (like the lab above), inspect export formats, and then formalize labeling guidelines, QA checks, and dataset versioning for production.