{"id":993,"date":"2026-05-18T03:48:42","date_gmt":"2026-05-18T03:48:42","guid":{"rendered":"https:\/\/www.devopsschool.com\/tutorials\/?p=993"},"modified":"2026-05-18T03:48:43","modified_gmt":"2026-05-18T03:48:43","slug":"datadog-application-error-tracking-in-eks-using-datadog-dogstatsd-apm-logs-and-error-tracking","status":"publish","type":"post","link":"https:\/\/www.devopsschool.com\/tutorials\/datadog-application-error-tracking-in-eks-using-datadog-dogstatsd-apm-logs-and-error-tracking\/","title":{"rendered":"Datadog: Application Error Tracking in EKS using Datadog, DogStatsD, APM, Logs, and Error Tracking"},"content":{"rendered":"\n<h1 class=\"wp-block-heading\">Master Guide: Application Error Tracking in EKS using Datadog, DogStatsD, APM, Logs, and Error Tracking<\/h1>\n\n\n\n<p>First, tiny naming correction: it is <strong>DogStatsD<\/strong>, not DogStashD. DogStatsD is Datadog\u2019s StatsD-compatible custom metrics service. It is excellent for counting application errors, but it is <strong>not a full Sentry replacement by itself<\/strong>. For Sentry-like error debugging, you should combine:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>DogStatsD metrics\n+ Application logs with stack traces\n+ Datadog APM traces\n+ Datadog Error Tracking\n+ Kubernetes \/ EKS metadata\n+ Unified service tagging\n<\/code><\/pre>\n\n\n\n<p>This is the best implementation pattern for your app running inside <strong>containers\/pods on EKS<\/strong>.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">1. 
What we are trying to build<\/h1>\n\n\n\n<p>The target outcome is:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Application error happens\n        \u2193\nDatadog captures error count, log, trace, stack trace\n        \u2193\nDatadog links it to service, env, version\n        \u2193\nDatadog adds Kubernetes context\n        \u2193\nYou can identify the exact pod\/container\/deployment\/node\n<\/code><\/pre>\n\n\n\n<p>Final relationship should look like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Error Issue\n  \u251c\u2500\u2500 service: checkout-api\n  \u251c\u2500\u2500 env: prod\n  \u251c\u2500\u2500 version: 1.8.4\n  \u251c\u2500\u2500 error.type: PaymentTimeoutException\n  \u251c\u2500\u2500 endpoint: \/api\/checkout\n  \u251c\u2500\u2500 kube_namespace: prod\n  \u251c\u2500\u2500 kube_deployment: checkout-api\n  \u251c\u2500\u2500 pod_name: checkout-api-7c9d8f98c9-xz2lp\n  \u251c\u2500\u2500 container_name: checkout-api\n  \u251c\u2500\u2500 node: ip-10-0-12-25\n  \u2514\u2500\u2500 trace_id \/ log correlation\n<\/code><\/pre>\n\n\n\n<p>Datadog\u2019s unified service tagging is built around the standard <code>env<\/code>, <code>service<\/code>, and <code>version<\/code> tags, which are used to correlate metrics, traces, logs, containers, and deployment versions. (<a href=\"https:\/\/docs.datadoghq.com\/getting_started\/tagging\/unified_service_tagging\/?tab=kubernetes\">Datadog Monitoring<\/a>)<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">2. 
High-level architecture<\/h1>\n\n\n\n<pre class=\"wp-block-code\"><code>flowchart TD\n    U&#91;User \/ Client Request] --&gt; ING&#91;Ingress \/ ALB \/ API Gateway]\n    ING --&gt; SVC&#91;Kubernetes Service]\n    SVC --&gt; POD&#91;Application Pod in EKS]\n\n    POD --&gt; APP&#91;Application Container]\n\n    APP --&gt;|DogStatsD custom error metrics| DSD&#91;Datadog Agent DogStatsD]\n    APP --&gt;|APM traces and exceptions| APM&#91;Datadog Agent APM Receiver]\n    APP --&gt;|stdout\/stderr structured logs| LOGS&#91;Kubernetes Node Log Files]\n\n    LOGS --&gt; AGENT&#91;Datadog Agent DaemonSet]\n    DSD --&gt; AGENT\n    APM --&gt; AGENT\n\n    KUBE&#91;Kubernetes API \/ Kubelet Metadata] --&gt; AGENT\n    CLUSTER&#91;Datadog Cluster Agent] --&gt; AGENT\n\n    AGENT --&gt; DD&#91;Datadog Platform]\n\n    DD --&gt; METRICS&#91;Metrics Explorer \/ Dashboards]\n    DD --&gt; LOGEXP&#91;Logs Explorer]\n    DD --&gt; TRACE&#91;APM Traces \/ Service Map]\n    DD --&gt; ERR&#91;Error Tracking]\n    DD --&gt; MON&#91;Monitors \/ Alerts]\n\n    ERR --&gt; RCA&#91;Root Cause Analysis]\n    TRACE --&gt; RCA\n    LOGEXP --&gt; RCA\n    METRICS --&gt; RCA\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">3. 
Sentry to Datadog mapping<\/h1>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Sentry capability<\/th><th>Datadog equivalent<\/th><\/tr><\/thead><tbody><tr><td>Error issue grouping<\/td><td>Datadog Error Tracking<\/td><\/tr><tr><td>Stack trace<\/td><td>APM error span or structured error log<\/td><\/tr><tr><td>Release\/version tracking<\/td><td><code>version<\/code> tag<\/td><\/tr><tr><td>Environment<\/td><td><code>env<\/code> tag<\/td><\/tr><tr><td>Project\/service<\/td><td><code>service<\/code> tag<\/td><\/tr><tr><td>Error count<\/td><td>DogStatsD custom metric<\/td><\/tr><tr><td>Request trace<\/td><td>Datadog APM<\/td><\/tr><tr><td>Breadcrumb-style context<\/td><td>Logs, trace spans, custom tags<\/td><\/tr><tr><td>Alert on new error<\/td><td>Error Tracking monitor<\/td><\/tr><tr><td>Alert on error volume<\/td><td>Metric monitor or APM monitor<\/td><\/tr><tr><td>Find pod\/container<\/td><td>Kubernetes tags from Datadog Agent<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>DogStatsD is useful for <strong>custom error counters<\/strong>, but Error Tracking, logs, and APM are what give you the Sentry-like debugging experience.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">4. 
Recommended implementation model<\/h1>\n\n\n\n<p>Use four data streams together:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>flowchart LR\n    A&#91;Application Error] --&gt; B&#91;DogStatsD Metric]\n    A --&gt; C&#91;Structured Error Log]\n    A --&gt; D&#91;APM Trace \/ Span Error]\n    A --&gt; E&#91;Kubernetes Metadata]\n\n    B --&gt; F&#91;Dashboards and Metric Alerts]\n    C --&gt; G&#91;Log Search and Error Tracking]\n    D --&gt; H&#91;Trace Debugging and Service Map]\n    E --&gt; I&#91;Pod \/ Container \/ Deployment \/ Node Relationship]\n\n    F --&gt; J&#91;Datadog Incident View]\n    G --&gt; J\n    H --&gt; J\n    I --&gt; J\n<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Data type<\/th><th>Purpose<\/th><th>Example<\/th><\/tr><\/thead><tbody><tr><td>DogStatsD metric<\/td><td>Count and alert<\/td><td><code>app.error.count<\/code><\/td><\/tr><tr><td>Error log<\/td><td>Stack trace and message<\/td><td>JSON log with <code>error.stack<\/code><\/td><\/tr><tr><td>APM trace<\/td><td>Request path and dependency failure<\/td><td><code>\/checkout<\/code> \u2192 <code>payment-service<\/code> timeout<\/td><\/tr><tr><td>Kubernetes metadata<\/td><td>Pod\/container relationship<\/td><td><code>pod_name<\/code>, <code>kube_deployment<\/code>, <code>kube_namespace<\/code><\/td><\/tr><tr><td>Error Tracking issue<\/td><td>Group similar errors<\/td><td><code>PaymentTimeoutException<\/code> grouped as one issue<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Datadog Error Tracking groups errors into issues and can alert on new, regressed, or high-impact errors. (<a href=\"https:\/\/docs.datadoghq.com\/monitors\/types\/error_tracking\/?tab=newissue\">Datadog Monitoring<\/a>)<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">5. 
Install Datadog Agent in EKS<\/h1>\n\n\n\n<p>Datadog supports installation through <strong>Datadog Operator<\/strong>, <strong>Helm<\/strong>, or manual DaemonSet. Datadog currently recommends the Operator for Kubernetes because it reduces misconfiguration risk, but Helm is also a very common production approach. (<a href=\"https:\/\/docs.datadoghq.com\/containers\/kubernetes\/installation\/?tab=datadogoperator\">Datadog Monitoring<\/a>)<\/p>\n\n\n\n<p>For EKS, the standard model is:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Datadog Agent = DaemonSet\nDatadog Cluster Agent = Deployment\nApplication Pod sends logs\/traces\/metrics to local node Agent\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">5.1 Create namespace and secret<\/h2>\n\n\n\n<pre class=\"wp-block-code\"><code>kubectl create namespace datadog\n\nkubectl -n datadog create secret generic datadog-secret \\\n  --from-literal api-key=\"$DD_API_KEY\"\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">5.2 Example <code>datadog-values.yaml<\/code><\/h2>\n\n\n\n<p>This is a practical production-style baseline for EKS application error tracking:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>targetSystem: linux\n\ndatadog:\n  apiKeyExistingSecret: datadog-secret\n\n  # Example: datadoghq.com, datadoghq.eu, us3.datadoghq.com, us5.datadoghq.com, ap1.datadoghq.com\n  site: datadoghq.com\n\n  clusterName: eks-prod-apne1-01\n\n  kubeStateMetricsCore:\n    enabled: true\n\n  collectEvents: true\n\n  logs:\n    enabled: true\n    containerCollectAll: true\n\n  apm:\n    socketEnabled: true\n    portEnabled: false\n\n  dogstatsd:\n    originDetection: true\n    useSocketVolume: true\n    socketPath: \/var\/run\/datadog\/dsd.socket\n    tagCardinality: orchestrator\n\n  tags:\n    - cloud:aws\n    - platform:eks\n    - owner:devops\n\nclusterAgent:\n  enabled: true\n  admissionController:\n    enabled: true\n    mutateUnlabelled: false\n\nagents:\n  containers:\n    agent:\n      resources:\n        
requests:\n          cpu: 200m\n          memory: 256Mi\n        limits:\n          memory: 512Mi\n<\/code><\/pre>\n\n\n\n<p>Important notes:<\/p>\n\n\n\n<p><code>logs.enabled<\/code> and <code>containerCollectAll<\/code> allow the Agent to collect container logs. Datadog\u2019s Kubernetes log collection docs show enabling <code>features.logCollection.enabled<\/code> and <code>containerCollectAll<\/code> with the Operator; the Helm values above express the same intent for Helm-based installs. (<a href=\"https:\/\/docs.datadoghq.com\/containers\/kubernetes\/log\/?tab=datadogoperator\">Datadog Monitoring<\/a>)<\/p>\n\n\n\n<p><code>dogstatsd.originDetection<\/code> helps the Agent identify which container\/pod emitted DogStatsD metrics. Datadog documents that DogStatsD origin detection can tag metrics with the same pod tags as Autodiscovery metrics, but the Agent-side origin detection is not enabled by default unless configured. (<a href=\"https:\/\/docs.datadoghq.com\/extend\/dogstatsd\/?tab=hostagent\">Datadog Monitoring<\/a>)<\/p>\n\n\n\n<p><code>apm.socketEnabled<\/code> and <code>dogstatsd.useSocketVolume<\/code> use Unix Domain Socket communication. For Kubernetes APM, Datadog supports UDS, host IP, or Kubernetes service communication, and recommends UDS for trace submission. 
(<a href=\"https:\/\/docs.datadoghq.com\/tracing\/guide\/setting_up_apm_with_kubernetes_service.md?utm_source=chatgpt.com\">Datadog Monitoring<\/a>)<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">5.3 Install or upgrade Agent<\/h2>\n\n\n\n<pre class=\"wp-block-code\"><code># Register the Datadog chart repository (needed once per workstation)\nhelm repo add datadog https:\/\/helm.datadoghq.com\nhelm repo update\n\nhelm upgrade --install datadog-agent datadog\/datadog \\\n  -n datadog \\\n  -f datadog-values.yaml\n<\/code><\/pre>\n\n\n\n<p>Verify:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>kubectl -n datadog get pods\nkubectl -n datadog get ds\nkubectl -n datadog get deploy\n<\/code><\/pre>\n\n\n\n<p>Expected resources:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>datadog-agent DaemonSet\ndatadog-cluster-agent Deployment\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">6. Add unified service tags to your application<\/h1>\n\n\n\n<p>This is the most important step for correlating metrics, logs, and traces with each other.<\/p>\n\n\n\n<p>Every application Deployment should have:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>env\nservice\nversion\n<\/code><\/pre>\n\n\n\n<p>Example:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>apiVersion: apps\/v1\nkind: Deployment\nmetadata:\n  name: checkout-api\n  namespace: prod\n  labels:\n    tags.datadoghq.com\/env: \"prod\"\n    tags.datadoghq.com\/service: \"checkout-api\"\n    tags.datadoghq.com\/version: \"1.8.4\"\nspec:\n  replicas: 3\n  selector:\n    matchLabels:\n      app: checkout-api\n  template:\n    metadata:\n      labels:\n        app: checkout-api\n        tags.datadoghq.com\/env: \"prod\"\n        tags.datadoghq.com\/service: \"checkout-api\"\n        tags.datadoghq.com\/version: \"1.8.4\"\n        # Opt-in for the Datadog Admission Controller (matched as a pod label, not an annotation)\n        admission.datadoghq.com\/enabled: \"true\"\n      annotations:\n        ad.datadoghq.com\/checkout-api.logs: '&#91;{\"source\":\"java\",\"service\":\"checkout-api\"}]'\n    spec:\n      containers:\n        - name: checkout-api\n          image: myrepo\/checkout-api:1.8.4\n          env:\n            - name: 
DD_ENV\n              valueFrom:\n                fieldRef:\n                  fieldPath: metadata.labels&#91;'tags.datadoghq.com\/env']\n\n            - name: DD_SERVICE\n              valueFrom:\n                fieldRef:\n                  fieldPath: metadata.labels&#91;'tags.datadoghq.com\/service']\n\n            - name: DD_VERSION\n              valueFrom:\n                fieldRef:\n                  fieldPath: metadata.labels&#91;'tags.datadoghq.com\/version']\n\n            - name: DD_ENTITY_ID\n              valueFrom:\n                fieldRef:\n                  fieldPath: metadata.uid\n\n            - name: DD_TRACE_AGENT_URL\n              value: \"unix:\/\/\/var\/run\/datadog\/apm.socket\"\n\n            - name: DOGSTATSD_SOCKET\n              value: \"\/var\/run\/datadog\/dsd.socket\"\n\n          volumeMounts:\n            - name: datadog-socket\n              mountPath: \/var\/run\/datadog\n              readOnly: true\n\n      volumes:\n        - name: datadog-socket\n          hostPath:\n            path: \/var\/run\/datadog\n<\/code><\/pre>\n\n\n\n<p>Datadog\u2019s Kubernetes unified service tagging documentation recommends applying <code>tags.datadoghq.com\/env<\/code>, <code>tags.datadoghq.com\/service<\/code>, and <code>tags.datadoghq.com\/version<\/code> labels at the Deployment and pod template levels, and exposing them to the container as <code>DD_ENV<\/code>, <code>DD_SERVICE<\/code>, and <code>DD_VERSION<\/code>. (<a href=\"https:\/\/docs.datadoghq.com\/getting_started\/tagging\/unified_service_tagging\/?tab=kubernetes\">Datadog Monitoring<\/a>)<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">7. 
Understand the exact error-tracking flow<\/h1>\n\n\n\n<pre class=\"wp-block-code\"><code>sequenceDiagram\n    participant User\n    participant App as App Container\n    participant DogStatsD as DogStatsD Socket\n    participant Logs as Container Logs\n    participant APM as APM Tracer\n    participant Agent as Datadog Agent\n    participant DD as Datadog\n    participant ET as Error Tracking\n\n    User-&gt;&gt;App: API request\n    App-&gt;&gt;App: Exception occurs\n\n    App-&gt;&gt;DogStatsD: increment app.error.count\n    DogStatsD-&gt;&gt;Agent: custom metric with tags\n\n    App-&gt;&gt;Logs: write structured ERROR log with stack trace\n    Logs-&gt;&gt;Agent: collect stdout\/stderr logs\n\n    App-&gt;&gt;APM: mark span as error\n    APM-&gt;&gt;Agent: send trace\/span data\n\n    Agent-&gt;&gt;Agent: attach Kubernetes metadata\n    Agent-&gt;&gt;DD: send metrics, logs, traces\n\n    DD-&gt;&gt;ET: group similar errors into issues\n    ET-&gt;&gt;DD: issue with service\/env\/version\/pod\/container context\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">8. 
Implement DogStatsD error metrics<\/h1>\n\n\n\n<p>DogStatsD should be used for <strong>counting and alerting<\/strong>, not for full stack traces.<\/p>\n\n\n\n<p>Recommended metric:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>app.error.count\n<\/code><\/pre>\n\n\n\n<p>Recommended tags:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>env\nservice\nversion\nerror_type\noperation\nendpoint\nhttp_status\nhandled\n<\/code><\/pre>\n\n\n\n<p>Avoid high-cardinality tags:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>request_id\nuser_id\nsession_id\norder_id\nfull_url\nfull_error_message\nstack_trace\npod_name unless intentionally needed\n<\/code><\/pre>\n\n\n\n<p>Bad metric design:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>app.error.count{user_id:12345,request_id:abc,error_message:payment failed for order 998877}\n<\/code><\/pre>\n\n\n\n<p>Good metric design:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>app.error.count{\n  env:prod,\n  service:checkout-api,\n  version:1.8.4,\n  error_type:PaymentTimeoutException,\n  operation:checkout,\n  endpoint:\/api\/checkout,\n  http_status:500,\n  handled:false\n}\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">8.1 Generic application pattern<\/h2>\n\n\n\n<pre class=\"wp-block-code\"><code>try:\n    process_request()\nexcept Exception as err:\n    dogstatsd.increment(\n        \"app.error.count\",\n        tags=&#91;\n            \"error_type:\" + err.class_name,\n            \"operation:checkout\",\n            \"endpoint:\/api\/checkout\",\n            \"http_status:500\",\n            \"handled:false\"\n        ]\n    )\n\n    logger.error(\n        \"Checkout failed\",\n        error=err,\n        stack_trace=true,\n        fields={\n            \"error.kind\": err.class_name,\n            \"error.message\": err.message,\n            \"error.stack\": err.stack,\n            \"operation\": \"checkout\",\n            \"endpoint\": \"\/api\/checkout\",\n            \"http.status_code\": 500\n        }\n    )\n\n    
raise\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">8.2 Python example<\/h2>\n\n\n\n<pre class=\"wp-block-code\"><code>import os\nimport traceback\nimport logging\nfrom datadog import DogStatsd\n\nlogger = logging.getLogger(__name__)\n\n# Send metrics over the Unix Domain Socket mounted from the node Agent\nstatsd = DogStatsd(\n    socket_path=os.getenv(\"DOGSTATSD_SOCKET\", \"\/var\/run\/datadog\/dsd.socket\")\n)\n\ndef checkout(request):\n    try:\n        # business logic here\n        process_payment(request)\n\n    except Exception as exc:\n        error_type = exc.__class__.__name__\n        stack = traceback.format_exc()\n\n        # Low-cardinality counter used for dashboards and alerts\n        statsd.increment(\n            \"app.error.count\",\n            tags=&#91;\n                f\"error_type:{error_type}\",\n                \"operation:checkout\",\n                \"endpoint:\/api\/checkout\",\n                \"http_status:500\",\n                \"handled:false\",\n            ],\n        )\n\n        # The extra fields reach the log output only if a JSON formatter emits them\n        logger.error(\n            \"Checkout failed\",\n            extra={\n                \"status\": \"error\",\n                \"error.kind\": error_type,\n                \"error.message\": str(exc),\n                \"error.stack\": stack,\n                \"operation\": \"checkout\",\n                \"endpoint\": \"\/api\/checkout\",\n                \"http.status_code\": 500,\n            },\n        )\n\n        raise\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">8.3 Node.js example<\/h2>\n\n\n\n<pre class=\"wp-block-code\"><code>const StatsD = require(\"hot-shots\");\nconst logger = require(\".\/logger\");\n\n\/\/ hot-shots defaults to UDP; protocol \"uds\" is required for the socket path to be used\nconst dogstatsd = new StatsD({\n  protocol: \"uds\",\n  path: process.env.DOGSTATSD_SOCKET || \"\/var\/run\/datadog\/dsd.socket\"\n});\n\nasync function checkout(req, res) {\n  try {\n    await processPayment(req.body);\n    res.status(200).send({ status: \"ok\" });\n  } catch (err) {\n    dogstatsd.increment(\"app.error.count\", 1, &#91;\n      `error_type:${err.name}`,\n      \"operation:checkout\",\n      \"endpoint:\/api\/checkout\",\n      \"http_status:500\",\n      
\"handled:false\"\n    ]);\n\n    logger.error({\n      status: \"error\",\n      message: \"Checkout failed\",\n      \"error.kind\": err.name,\n      \"error.message\": err.message,\n      \"error.stack\": err.stack,\n      operation: \"checkout\",\n      endpoint: \"\/api\/checkout\",\n      \"http.status_code\": 500\n    });\n\n    throw err;\n  }\n}\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">9. Implement structured logs for Error Tracking<\/h1>\n\n\n\n<p>This is where you get the Sentry-like stack trace.<\/p>\n\n\n\n<p>For Datadog Error Tracking from backend logs, the log should include:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>status = ERROR \/ CRITICAL \/ ALERT \/ EMERGENCY\nservice\nerror.kind or error.stack\n<\/code><\/pre>\n\n\n\n<p>Datadog documents that backend error logs need either <code>error.kind<\/code> or a valid <code>error.stack<\/code>, a service attribute, and an error-level status. For better grouping, include <code>error.message<\/code> and <code>error.stack<\/code>. 
(<a href=\"https:\/\/docs.datadoghq.com\/logs\/error_tracking\/backend\/?tab=serilog\">Datadog Monitoring<\/a>)<\/p>\n\n\n\n<p>Recommended JSON log:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>{\n  \"timestamp\": \"2026-05-18T10:00:00.000Z\",\n  \"status\": \"error\",\n  \"service\": \"checkout-api\",\n  \"env\": \"prod\",\n  \"version\": \"1.8.4\",\n  \"message\": \"Checkout failed\",\n  \"error.kind\": \"PaymentTimeoutException\",\n  \"error.message\": \"Payment provider timed out\",\n  \"error.stack\": \"PaymentTimeoutException: Payment provider timed out\\n    at CheckoutService.pay...\",\n  \"operation\": \"checkout\",\n  \"endpoint\": \"\/api\/checkout\",\n  \"http.status_code\": 500\n}\n<\/code><\/pre>\n\n\n\n<p>Recommended log rule:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Application logs should go to stdout\/stderr.\nDatadog Agent should collect container logs from the node.\nLogs should be JSON if possible.\nEach error log should contain service, env, version, error.kind, error.message, error.stack.\n<\/code><\/pre>\n\n\n\n<p>For Kubernetes, Datadog recommends Agent-based log collection and can collect logs from Kubernetes log files. File-based collection is preferred over Docker socket-based collection for performance and reliability in containerized environments. (<a href=\"https:\/\/docs.datadoghq.com\/containers\/troubleshooting\/log-collection\/?tab=datadogoperator&amp;utm_source=chatgpt.com\">Datadog Monitoring<\/a>)<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">10. 
Implement APM for request-level debugging<\/h1>\n\n\n\n<p>APM is what lets you answer:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Which API failed?\nWhich downstream service failed?\nWas it database, cache, third-party API, timeout, or code exception?\nWhich trace\/log belongs to this error?\n<\/code><\/pre>\n\n\n\n<p>Flow:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>flowchart TD\n    REQ&#91;Incoming Request \/api\/checkout] --&gt; SPAN1&#91;checkout-api span]\n    SPAN1 --&gt; SPAN2&#91;payment-service HTTP call]\n    SPAN1 --&gt; SPAN3&#91;database query]\n    SPAN2 --&gt; ERR&#91;Timeout Exception]\n    ERR --&gt; TRACE&#91;Trace marked as error]\n    TRACE --&gt; ET&#91;Error Tracking Issue]\n    TRACE --&gt; LOG&#91;Connected Logs]\n    TRACE --&gt; POD&#91;Pod and Container Metadata]\n<\/code><\/pre>\n\n\n\n<p>Recommended APM environment variables:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>env:\n  - name: DD_ENV\n    value: \"prod\"\n\n  - name: DD_SERVICE\n    value: \"checkout-api\"\n\n  - name: DD_VERSION\n    value: \"1.8.4\"\n\n  - name: DD_TRACE_AGENT_URL\n    value: \"unix:\/\/\/var\/run\/datadog\/apm.socket\"\n\n  - name: DD_LOGS_INJECTION\n    value: \"true\"\n\n  - name: DD_RUNTIME_METRICS_ENABLED\n    value: \"true\"\n<\/code><\/pre>\n\n\n\n<p>Datadog APM on Kubernetes supports UDS, host IP, or Kubernetes service routing for traces. In containerized environments, sending traces to <code>localhost<\/code> is usually wrong because the Agent is in another container\/pod; for Kubernetes, use UDS, node host IP, Admission Controller injection, or a Kubernetes service pattern. (<a href=\"https:\/\/docs.datadoghq.com\/tracing\/troubleshooting\/connection_errors\/\">Datadog Monitoring<\/a>)<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">11. 
How Error Tracking groups errors<\/h1>\n\n\n\n<p>Datadog Error Tracking groups similar errors into issues based on properties such as:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>service\nerror.type \/ error.kind\nerror.message\nerror.stack\ntop meaningful stack frame\n<\/code><\/pre>\n\n\n\n<p>So two errors may become separate issues if they happen in different services or have different error types\/stack-frame locations. (<a href=\"https:\/\/docs.datadoghq.com\/tracing\/error_tracking\/error_grouping\/\">Datadog Monitoring<\/a>)<\/p>\n\n\n\n<p>Example:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>checkout-api + PaymentTimeoutException + CheckoutService.pay()\n= One Error Tracking issue\n<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>payment-service + PaymentTimeoutException + PaymentClient.call()\n= Different Error Tracking issue\n<\/code><\/pre>\n\n\n\n<p>This is why <code>service<\/code>, <code>error.kind<\/code>, and <code>error.stack<\/code> matter so much.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">12. 
Recommended tag strategy<\/h1>\n\n\n\n<h2 class=\"wp-block-heading\">Mandatory tags<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Tag<\/th><th>Example<\/th><th>Purpose<\/th><\/tr><\/thead><tbody><tr><td><code>env<\/code><\/td><td><code>prod<\/code><\/td><td>Separate prod\/stage\/dev<\/td><\/tr><tr><td><code>service<\/code><\/td><td><code>checkout-api<\/code><\/td><td>Service-level ownership<\/td><\/tr><tr><td><code>version<\/code><\/td><td><code>1.8.4<\/code><\/td><td>Release\/deployment tracking<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Strongly recommended tags<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Tag<\/th><th>Example<\/th><th>Purpose<\/th><\/tr><\/thead><tbody><tr><td><code>team<\/code><\/td><td><code>payments<\/code><\/td><td>Ownership<\/td><\/tr><tr><td><code>product<\/code><\/td><td><code>motoshare<\/code><\/td><td>Product\/application grouping<\/td><\/tr><tr><td><code>component<\/code><\/td><td><code>api<\/code><\/td><td>API\/worker\/consumer grouping<\/td><\/tr><tr><td><code>operation<\/code><\/td><td><code>checkout<\/code><\/td><td>Business flow<\/td><\/tr><tr><td><code>endpoint<\/code><\/td><td><code>\/api\/checkout<\/code><\/td><td>API route<\/td><\/tr><tr><td><code>error_type<\/code><\/td><td><code>PaymentTimeoutException<\/code><\/td><td>Error classification<\/td><\/tr><tr><td><code>handled<\/code><\/td><td><code>true\/false<\/code><\/td><td>Handled vs unhandled error<\/td><\/tr><tr><td><code>cloud<\/code><\/td><td><code>aws<\/code><\/td><td>Cloud provider<\/td><\/tr><tr><td><code>platform<\/code><\/td><td><code>eks<\/code><\/td><td>Runtime platform<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Kubernetes tags Datadog can add<\/h2>\n\n\n\n<pre 
class=\"wp-block-code\"><code>kube_cluster_name\nkube_namespace\nkube_deployment\nkube_replica_set\npod_name\ncontainer_name\nimage_name\nimage_tag\nnode\navailability_zone\n<\/code><\/pre>\n\n\n\n<p>For DogStatsD metrics, be careful with tag cardinality. Datadog notes that for UDP DogStatsD, <code>pod_name<\/code> is not added by default to avoid creating too many custom metrics, and tag cardinality can be controlled globally or per metric. (<a href=\"https:\/\/docs.datadoghq.com\/extend\/dogstatsd\/?tab=hostagent\">Datadog Monitoring<\/a>)<\/p>\n\n\n\n<p>My recommendation:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Use service\/version-level DogStatsD metrics for alerting.\nUse logs\/APM\/Error Tracking for exact pod\/container investigation.\nUse pod-level metric tagging only when you really need it.\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">13. Complete application telemetry flow<\/h1>\n\n\n\n<pre class=\"wp-block-code\"><code>flowchart TD\n    A&#91;Exception in Application] --&gt; B{Telemetry Type}\n\n    B --&gt; C&#91;DogStatsD Counter]\n    C --&gt; C1&#91;app.error.count]\n    C1 --&gt; C2&#91;Alert: Error spike by service\/version]\n\n    B --&gt; D&#91;Structured Error Log]\n    D --&gt; D1&#91;error.kind]\n    D --&gt; D2&#91;error.message]\n    D --&gt; D3&#91;error.stack]\n    D3 --&gt; D4&#91;Error Tracking Issue]\n\n    B --&gt; E&#91;APM Trace]\n    E --&gt; E1&#91;Trace marked error]\n    E1 --&gt; E2&#91;Request path]\n    E2 --&gt; E3&#91;Downstream dependency failure]\n\n    B --&gt; F&#91;Kubernetes Metadata]\n    F --&gt; F1&#91;pod_name]\n    F --&gt; F2&#91;container_name]\n    F --&gt; F3&#91;kube_deployment]\n    F --&gt; F4&#91;node]\n\n    C2 --&gt; G&#91;Datadog Incident \/ Monitor]\n    D4 --&gt; G\n    E3 --&gt; G\n    F4 --&gt; G\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 
class=\"wp-block-heading\">14. Build dashboards<\/h1>\n\n\n\n<h2 class=\"wp-block-heading\">14.1 Error count by service<\/h2>\n\n\n\n<pre class=\"wp-block-code\"><code>sum:app.error.count{env:prod} by {service}.as_count()\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">14.2 Error count by version<\/h2>\n\n\n\n<pre class=\"wp-block-code\"><code>sum:app.error.count{env:prod,service:checkout-api} by {version}.as_count()\n<\/code><\/pre>\n\n\n\n<p>Use this to answer:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Did the new release increase errors?\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">14.3 Error count by operation<\/h2>\n\n\n\n<pre class=\"wp-block-code\"><code>sum:app.error.count{env:prod,service:checkout-api} by {operation}.as_count()\n<\/code><\/pre>\n\n\n\n<p>Use this to answer:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Which business flow is failing?\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">14.4 Error count by error type<\/h2>\n\n\n\n<pre class=\"wp-block-code\"><code>sum:app.error.count{env:prod,service:checkout-api} by {error_type}.as_count()\n<\/code><\/pre>\n\n\n\n<p>Use this to answer:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Which exception is most common?\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">14.5 Error count by Kubernetes deployment<\/h2>\n\n\n\n<pre class=\"wp-block-code\"><code>sum:app.error.count{env:prod} by {kube_namespace,kube_deployment}.as_count()\n<\/code><\/pre>\n\n\n\n<p>Use this to answer:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Which deployment is producing the errors?\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">14.6 Pod-level view<\/h2>\n\n\n\n<p>Only use this if your DogStatsD metric cardinality\/tagging supports it:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>sum:app.error.count{env:prod,service:checkout-api} by {pod_name}.as_count()\n<\/code><\/pre>\n\n\n\n<p>For exact pod-level investigation, I would rely more on logs\/APM\/Error Tracking because 
pod-level metrics can create high cardinality and cost\/noise.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">15. Build monitors and alerts<\/h1>\n\n\n\n<h2 class=\"wp-block-heading\">15.1 Metric monitor: service error spike<\/h2>\n\n\n\n<pre class=\"wp-block-code\"><code>sum(last_5m):sum:app.error.count{env:prod,service:checkout-api} by {service,env,version}.as_count() &gt; 50\n<\/code><\/pre>\n\n\n\n<p>Alert message (the <code>{{service.name}}<\/code>, <code>{{env.name}}<\/code>, and <code>{{version.name}}<\/code> template variables resolve only for tags listed in the query\u2019s <code>by<\/code> clause, which is why the query groups by them):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>High application error count detected.\n\nService: {{service.name}}\nEnvironment: {{env.name}}\nVersion: {{version.name}}\n\nCheck:\n- Error Tracking issue\n- APM trace\n- Logs for error.stack\n- Kubernetes pod\/container details\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">15.2 Metric monitor: new version error spike<\/h2>\n\n\n\n<pre class=\"wp-block-code\"><code>sum(last_10m):sum:app.error.count{env:prod,service:checkout-api} by {version}.as_count() &gt; 100\n<\/code><\/pre>\n\n\n\n<p>Use this after deployments to catch a release that raises the error count. The thresholds in 15.1 and 15.2 are examples; tune them to each service\u2019s normal error volume.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">15.3 Error Tracking monitor: new issue<\/h2>\n\n\n\n<p>Use this for Sentry-like behavior:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Alert when a new backend issue appears for service:checkout-api env:prod\n<\/code><\/pre>\n\n\n\n<p>Datadog Error Tracking monitors support alerting on new issues, regressions, and high-impact errors. (<a href=\"https:\/\/docs.datadoghq.com\/monitors\/types\/error_tracking\/?tab=newissue\">Datadog Monitoring<\/a>)<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">15.4 APM monitor: error rate<\/h2>\n\n\n\n<p>Example logic:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Error rate for checkout-api &gt; 5% during last 5 minutes\n<\/code><\/pre>\n\n\n\n<p>Use this for service reliability monitoring.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">16. 
Recommended alerting strategy<\/h1>\n\n\n\n<p>Do not create only one giant alert.<\/p>\n\n\n\n<p>Use layered alerting:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>flowchart TD\n    A&#91;Application Errors] --&gt; B&#91;Metric Alert]\n    A --&gt; C&#91;Error Tracking New Issue Alert]\n    A --&gt; D&#91;APM Error Rate Alert]\n    A --&gt; E&#91;Kubernetes Pod Restart Alert]\n\n    B --&gt; F&#91;High volume problem]\n    C --&gt; G&#91;New code issue]\n    D --&gt; H&#91;Request failure problem]\n    E --&gt; I&#91;Runtime\/container problem]\n\n    F --&gt; J&#91;Incident]\n    G --&gt; J\n    H --&gt; J\n    I --&gt; J\n<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Alert type<\/th><th>Detects<\/th><th>Best for<\/th><\/tr><\/thead><tbody><tr><td>DogStatsD metric alert<\/td><td>Error volume spike<\/td><td>Fast service-level alert<\/td><\/tr><tr><td>Error Tracking alert<\/td><td>New\/regressed grouped error<\/td><td>Sentry-like issue detection<\/td><\/tr><tr><td>APM error rate alert<\/td><td>Request failure percentage<\/td><td>API\/SLO reliability<\/td><\/tr><tr><td>Log alert<\/td><td>Specific log pattern<\/td><td>Known failure modes<\/td><\/tr><tr><td>Kubernetes alert<\/td><td>CrashLoopBackOff\/restarts<\/td><td>Pod\/container health<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">17. 
Best practice: use DogStatsD for counters, not stack traces<\/h1>\n\n\n\n<p>DogStatsD should answer:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>How many errors happened?\nWhich service\/version\/operation is failing?\nDid errors increase after deployment?\n<\/code><\/pre>\n\n\n\n<p>DogStatsD should not answer:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>What is the stack trace?\nWhich line of code failed?\nWhat was the exception body?\nWhat user\/request caused this?\n<\/code><\/pre>\n\n\n\n<p>Those belong in:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>APM\nLogs\nError Tracking\nTrace\/log correlation\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">18. Best practice: standardize error classification<\/h1>\n\n\n\n<p>Create a small taxonomy across all services.<\/p>\n\n\n\n<p>Example:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>validation_error\ndependency_timeout\ndatabase_error\nauthentication_error\nauthorization_error\nbusiness_rule_error\nunexpected_exception\n<\/code><\/pre>\n\n\n\n<p>Then tag DogStatsD metrics like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>error_category:dependency_timeout\nerror_type:PaymentTimeoutException\noperation:checkout\n<\/code><\/pre>\n\n\n\n<p>This gives clean dashboards:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Errors by category\nErrors by operation\nErrors by service\nErrors by version\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">19. 
Best practice: release\/version tracking<\/h1>\n\n\n\n<p>Every deployment should set a unique <code>version<\/code>.<\/p>\n\n\n\n<p>Good:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>version: 1.8.4\nversion: git-sha-a8f91cd\nversion: 2026.05.18.1\n<\/code><\/pre>\n\n\n\n<p>Bad:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>version: latest\nversion: prod\nversion: main\n<\/code><\/pre>\n\n\n\n<p>Datadog expects <code>version<\/code> to change with each application deployment so deployment impact can be identified cleanly. (<a href=\"https:\/\/docs.datadoghq.com\/getting_started\/tagging\/unified_service_tagging\/?tab=kubernetes\">Datadog Monitoring<\/a>)<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">20. Best practice: log format<\/h1>\n\n\n\n<p>Use JSON logs.<\/p>\n\n\n\n<p>Recommended fields:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>{\n  \"status\": \"error\",\n  \"service\": \"checkout-api\",\n  \"env\": \"prod\",\n  \"version\": \"1.8.4\",\n  \"message\": \"Checkout failed\",\n  \"error.kind\": \"PaymentTimeoutException\",\n  \"error.message\": \"Payment provider timed out\",\n  \"error.stack\": \"...\",\n  \"operation\": \"checkout\",\n  \"endpoint\": \"\/api\/checkout\",\n  \"http.method\": \"POST\",\n  \"http.status_code\": 500,\n  \"customer_impact\": true\n}\n<\/code><\/pre>\n\n\n\n<p>Avoid logging sensitive data:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>password\ntoken\ncredit card\npersonal identity data\nfull request payloads\nauthorization headers\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">21. 
Best practice: deployment annotation for logs<\/h1>\n\n\n\n<p>For each application container, add Datadog log annotation:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>annotations:\n  ad.datadoghq.com\/checkout-api.logs: &gt;\n    &#91;{\n      \"source\": \"java\",\n      \"service\": \"checkout-api\",\n      \"tags\": &#91;\"team:payments\",\"component:api\"]\n    }]\n<\/code><\/pre>\n\n\n\n<p>Use the right <code>source<\/code> value:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>App language\/runtime<\/th><th><code>source<\/code><\/th><\/tr><\/thead><tbody><tr><td>Java<\/td><td><code>java<\/code><\/td><\/tr><tr><td>Node.js<\/td><td><code>nodejs<\/code><\/td><\/tr><tr><td>Python<\/td><td><code>python<\/code><\/td><\/tr><tr><td>Go<\/td><td><code>go<\/code><\/td><\/tr><tr><td>.NET<\/td><td><code>csharp<\/code> or configured .NET source<\/td><\/tr><tr><td>Ruby<\/td><td><code>ruby<\/code><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>The <code>source<\/code> tag matters because Datadog\u2019s Error Tracking for logs uses language-specific handling, and Datadog recommends ensuring the source tag is properly configured. (<a href=\"https:\/\/docs.datadoghq.com\/logs\/error_tracking\/backend\/?tab=serilog\">Datadog Monitoring<\/a>)<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">22. 
Pod\/container relationship design<\/h1>\n\n\n\n<p>The relationship is built from three places:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>flowchart TD\n    A&#91;Application Deployment Labels] --&gt; D&#91;env\/service\/version]\n    B&#91;Datadog Agent Kubernetes Metadata] --&gt; E&#91;pod\/container\/deployment\/node]\n    C&#91;Application Logs\/APM\/DogStatsD] --&gt; F&#91;error\/trace\/metric]\n\n    D --&gt; G&#91;Unified Datadog View]\n    E --&gt; G\n    F --&gt; G\n\n    G --&gt; H&#91;Which service failed?]\n    G --&gt; I&#91;Which version failed?]\n    G --&gt; J&#91;Which pod\/container failed?]\n    G --&gt; K&#91;Which node hosted it?]\n<\/code><\/pre>\n\n\n\n<p>To make this work:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>1. Datadog Agent must run in the cluster.\n2. App pods must have unified service tags.\n3. Logs\/APM\/DogStatsD must use the same service\/env\/version.\n4. Error logs must include error.kind\/error.stack.\n5. APM tracer should inject trace\/log correlation where supported.\n6. DogStatsD origin detection should be enabled.\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">23. EKS-specific implementation notes<\/h1>\n\n\n\n<h2 class=\"wp-block-heading\">Standard EKS with EC2 worker nodes<\/h2>\n\n\n\n<p>Recommended:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Datadog Agent as DaemonSet\nUse UDS for APM\nUse UDS for DogStatsD\nCollect container logs from nodes\nUse Cluster Agent\nUse Admission Controller where possible\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">EKS Fargate<\/h2>\n\n\n\n<p>Be careful. On EKS Fargate you do not manage the underlying nodes, so DaemonSets and <code>hostPath<\/code> volumes are not available and the Datadog Agent typically runs as a sidecar container in each pod. Datadog\u2019s DogStatsD origin detection docs specifically mention setting <code>shareProcessNamespace: true<\/code> in the pod spec to assist the Agent with origin detection on EKS Fargate. 
(<a href=\"https:\/\/docs.datadoghq.com\/extend\/dogstatsd\/?tab=hostagent\">Datadog Monitoring<\/a>)<\/p>\n\n\n\n<p>If you are using Fargate, validate the Datadog deployment pattern separately.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">24. End-to-end sample implementation<\/h1>\n\n\n\n<h2 class=\"wp-block-heading\">24.1 Datadog Agent values<\/h2>\n\n\n\n<pre class=\"wp-block-code\"><code>targetSystem: linux\n\ndatadog:\n  apiKeyExistingSecret: datadog-secret\n  site: datadoghq.com\n  clusterName: eks-prod-apne1-01\n\n  logs:\n    enabled: true\n    containerCollectAll: true\n\n  apm:\n    socketEnabled: true\n    portEnabled: false\n\n  dogstatsd:\n    originDetection: true\n    useSocketVolume: true\n    socketPath: \/var\/run\/datadog\/dsd.socket\n    tagCardinality: orchestrator\n\n  kubeStateMetricsCore:\n    enabled: true\n\n  collectEvents: true\n\nclusterAgent:\n  enabled: true\n  admissionController:\n    enabled: true\n    mutateUnlabelled: false\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">24.2 App deployment<\/h2>\n\n\n\n<pre class=\"wp-block-code\"><code>apiVersion: apps\/v1\nkind: Deployment\nmetadata:\n  name: checkout-api\n  namespace: prod\n  labels:\n    tags.datadoghq.com\/env: \"prod\"\n    tags.datadoghq.com\/service: \"checkout-api\"\n    tags.datadoghq.com\/version: \"1.8.4\"\nspec:\n  replicas: 3\n  selector:\n    matchLabels:\n      app: checkout-api\n  template:\n    metadata:\n      labels:\n        app: checkout-api\n        tags.datadoghq.com\/env: \"prod\"\n        tags.datadoghq.com\/service: \"checkout-api\"\n        tags.datadoghq.com\/version: \"1.8.4\"\n      annotations:\n        admission.datadoghq.com\/enabled: \"true\"\n        ad.datadoghq.com\/checkout-api.logs: &gt;\n          &#91;{\n            \"source\": \"java\",\n            \"service\": \"checkout-api\",\n            \"tags\": &#91;\"team:payments\",\"component:api\"]\n          }]\n    
spec:\n      containers:\n        - name: checkout-api\n          image: myrepo\/checkout-api:1.8.4\n          env:\n            - name: DD_ENV\n              valueFrom:\n                fieldRef:\n                  fieldPath: metadata.labels&#91;'tags.datadoghq.com\/env']\n\n            - name: DD_SERVICE\n              valueFrom:\n                fieldRef:\n                  fieldPath: metadata.labels&#91;'tags.datadoghq.com\/service']\n\n            - name: DD_VERSION\n              valueFrom:\n                fieldRef:\n                  fieldPath: metadata.labels&#91;'tags.datadoghq.com\/version']\n\n            - name: DD_ENTITY_ID\n              valueFrom:\n                fieldRef:\n                  fieldPath: metadata.uid\n\n            - name: DD_TRACE_AGENT_URL\n              value: \"unix:\/\/\/var\/run\/datadog\/apm.socket\"\n\n            - name: DD_LOGS_INJECTION\n              value: \"true\"\n\n            - name: DD_RUNTIME_METRICS_ENABLED\n              value: \"true\"\n\n            - name: DOGSTATSD_SOCKET\n              value: \"\/var\/run\/datadog\/dsd.socket\"\n\n          volumeMounts:\n            - name: datadog-socket\n              mountPath: \/var\/run\/datadog\n              readOnly: true\n\n      volumes:\n        - name: datadog-socket\n          hostPath:\n            path: \/var\/run\/datadog\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">25. 
Validation checklist<\/h1>\n\n\n\n<h2 class=\"wp-block-heading\">25.1 Validate Datadog Agent<\/h2>\n\n\n\n<pre class=\"wp-block-code\"><code>kubectl -n datadog get pods\nkubectl -n datadog get ds\nkubectl -n datadog get deploy\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">25.2 Check Agent status<\/h2>\n\n\n\n<pre class=\"wp-block-code\"><code>kubectl -n datadog exec -it &lt;datadog-agent-pod-name&gt; -c agent -- agent status\n<\/code><\/pre>\n\n\n\n<p>Look for:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>APM Agent: Running\nDogStatsD: Running\nLogs Agent: Running\n<\/code><\/pre>\n\n\n\n<p>Datadog\u2019s APM troubleshooting guide says the Agent status output should show the APM Agent as running; otherwise traces cannot be submitted properly. (<a href=\"https:\/\/docs.datadoghq.com\/tracing\/troubleshooting\/connection_errors\/\">Datadog Monitoring<\/a>)<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">25.3 Validate app tags<\/h2>\n\n\n\n<p>Check pod labels:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>kubectl -n prod get pod &lt;pod-name&gt; --show-labels\n<\/code><\/pre>\n\n\n\n<p>Expected:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>tags.datadoghq.com\/env=prod\ntags.datadoghq.com\/service=checkout-api\ntags.datadoghq.com\/version=1.8.4\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">25.4 Validate logs<\/h2>\n\n\n\n<p>Generate a test exception, then search logs by:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>service:checkout-api env:prod status:error\n<\/code><\/pre>\n\n\n\n<p>Expected fields:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>error.kind\nerror.message\nerror.stack\nkube_namespace\npod_name\ncontainer_name\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">25.5 Validate DogStatsD metric<\/h2>\n\n\n\n<p>Search metric:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>app.error.count\n<\/code><\/pre>\n\n\n\n<p>Group by:<\/p>\n\n\n\n<pre 
class=\"wp-block-code\"><code>service\nversion\nerror_type\noperation\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">25.6 Validate APM<\/h2>\n\n\n\n<p>Search service:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>service:checkout-api env:prod\n<\/code><\/pre>\n\n\n\n<p>Expected:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Traces visible\nError traces visible\nService map visible\nTrace\/log correlation working\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">25.7 Validate Error Tracking<\/h2>\n\n\n\n<p>Search backend issues for:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>service:checkout-api env:prod\n<\/code><\/pre>\n\n\n\n<p>Expected:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Grouped error issue\nStack trace visible\nOccurrences visible\nRelated logs\/traces visible\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">26. Common problems and fixes<\/h1>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Problem<\/th><th>Likely cause<\/th><th>Fix<\/th><\/tr><\/thead><tbody><tr><td>Error metric appears but no pod\/container<\/td><td>DogStatsD origin detection\/cardinality issue<\/td><td>Enable origin detection; use UDS; review tag cardinality<\/td><\/tr><tr><td>Error Tracking issue not created<\/td><td>Logs missing <code>error.kind<\/code> or <code>error.stack<\/code><\/td><td>Add structured error fields<\/td><\/tr><tr><td>Logs visible but service name wrong<\/td><td>Missing log annotation or unified tags<\/td><td>Add <code>service<\/code> in log config and <code>DD_SERVICE<\/code><\/td><\/tr><tr><td>APM traces missing<\/td><td>App cannot reach Agent<\/td><td>Use UDS or correct <code>DD_AGENT_HOST<\/code>; check Agent status<\/td><\/tr><tr><td>Trace\/log correlation missing<\/td><td>Log injection not enabled<\/td><td>Enable tracer log injection<\/td><\/tr><tr><td>Too many custom metrics<\/td><td>High-cardinality metric 
tags<\/td><td>Remove <code>request_id<\/code>, <code>user_id<\/code>, <code>pod_name<\/code> from metrics<\/td><\/tr><tr><td>New release not visible<\/td><td>Static or missing <code>version<\/code><\/td><td>Set unique <code>DD_VERSION<\/code> per deployment<\/td><\/tr><tr><td>Pod error not visible in metric<\/td><td>Pod tag not included for cardinality reasons<\/td><td>Use logs\/APM for pod-level RCA or adjust cardinality carefully<\/td><\/tr><tr><td>Logs not collected<\/td><td>Agent log collection disabled<\/td><td>Enable container log collection<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">27. Best implementation pattern for your migration<\/h1>\n\n\n\n<p>Do not migrate like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Sentry \u2192 DogStatsD only\n<\/code><\/pre>\n\n\n\n<p>That will give weak debugging.<\/p>\n\n\n\n<p>Migrate like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Sentry\n  \u2192 Datadog Error Tracking\n  \u2192 Datadog APM\n  \u2192 Datadog Logs\n  \u2192 DogStatsD custom error metrics\n  \u2192 Kubernetes metadata correlation\n<\/code><\/pre>\n\n\n\n<p>Recommended production pattern:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>flowchart TD\n    A&#91;Sentry Replacement Requirement] --&gt; B&#91;Error Tracking]\n    A --&gt; C&#91;APM]\n    A --&gt; D&#91;Logs]\n    A --&gt; E&#91;DogStatsD Metrics]\n\n    B --&gt; F&#91;Grouped Issues]\n    C --&gt; G&#91;Trace and Dependency RCA]\n    D --&gt; H&#91;Stack Trace and Context]\n    E --&gt; I&#91;Fast Error Count Alerts]\n\n    F --&gt; J&#91;Service \/ Env \/ Version]\n    G --&gt; J\n    H --&gt; J\n    I --&gt; J\n\n    J --&gt; K&#91;Kubernetes Pod \/ Container \/ Deployment \/ Node]\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">28. 
Final recommended standard<\/h1>\n\n\n\n<p>For every service running in EKS, implement this standard:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>1. Add Datadog unified service labels:\n   - tags.datadoghq.com\/env\n   - tags.datadoghq.com\/service\n   - tags.datadoghq.com\/version\n\n2. Add application env vars:\n   - DD_ENV\n   - DD_SERVICE\n   - DD_VERSION\n   - DD_TRACE_AGENT_URL\n   - DD_LOGS_INJECTION\n   - DD_ENTITY_ID\n\n3. Enable Datadog Agent features:\n   - logs\n   - APM\n   - DogStatsD\n   - DogStatsD origin detection\n   - Kubernetes metadata\n   - Cluster Agent\n\n4. Application must emit:\n   - DogStatsD metric: app.error.count\n   - Structured error log with error.kind\/error.message\/error.stack\n   - APM trace\/span errors\n\n5. Dashboards should show:\n   - errors by service\n   - errors by version\n   - errors by operation\n   - errors by error_type\n   - errors by namespace\/deployment\n   - related pods\/containers through logs\/APM\n\n6. Alerts should include:\n   - new Error Tracking issue\n   - high error count\n   - high APM error rate\n   - pod restart\/crashloop alerts\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">29. 
Final conclusion<\/h1>\n\n\n\n<p>The best Datadog design for application error tracking in EKS is:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>DogStatsD for custom error counters\nLogs for stack traces\nAPM for request\/dependency tracing\nError Tracking for Sentry-like issue grouping\nUnified service tagging for service\/env\/version relationship\nKubernetes metadata for pod\/container\/node relationship\n<\/code><\/pre>\n\n\n\n<p>In short:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>DogStatsD tells you how many errors happened.\nLogs tell you what exception happened.\nAPM tells you where in the request path it failed.\nError Tracking groups the issue.\nKubernetes metadata tells you which pod\/container\/deployment\/node caused it.\n<\/code><\/pre>\n\n\n\n<p>That combination gives you a clean, production-grade replacement for Sentry while also giving stronger EKS infrastructure correlation than Sentry alone.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Master Guide: Application Error Tracking in EKS using Datadog, DogStatsD, APM, Logs, and Error Tracking First, tiny naming correction: it is DogStatsD, not DogStashD. 
DogStatsD is Datadog\u2019s&#8230; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-993","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts\/993","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/comments?post=993"}],"version-history":[{"count":1,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts\/993\/revisions"}],"predecessor-version":[{"id":994,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts\/993\/revisions\/994"}],"wp:attachment":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/media?parent=993"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/categories?post=993"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/tags?post=993"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}