Adding metrics to Datadog

commcare-hq supports both Datadog and Prometheus (a FOSS alternative to Datadog).

I will discuss Datadog.

Types

Counter

A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart. For example, you can use a counter to represent the number of requests served, tasks completed, or errors.

Do not use a counter to expose a value that can decrease. For example, do not use a counter for the number of currently running processes; instead use a gauge.

Gauge

A gauge is a metric that represents a single numerical value that can arbitrarily go up and down.

Gauges are typically used for measured values like temperatures or current memory usage, but also "counts" that can go up and down, like the number of concurrent requests.
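The difference between the two types can be sketched with a pair of toy classes. `Counter` and `Gauge` here are hypothetical, in-memory stand-ins written only for illustration; they are not the commcare-hq, Datadog, or Prometheus APIs.

```python
class Counter:
    """Toy counter: the value can only increase (hypothetical class)."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """Toy gauge: the value can be set to anything (hypothetical class)."""

    def __init__(self):
        self.value = 0

    def set(self, value):
        self.value = value


# Requests served only ever goes up, so a counter fits.
requests_served = Counter()
for _ in range(3):
    requests_served.inc()

# The number of running processes goes up and down, so a gauge fits.
running_processes = Gauge()
running_processes.set(5)
running_processes.set(2)  # going down is fine for a gauge
```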

Histogram

A histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets.

https://en.wikipedia.org/wiki/Histogram#Examples

Histograms work slightly differently in Datadog and Prometheus.
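As a rough illustration of "counting observations in configurable buckets", the sketch below sorts a handful of made-up durations into buckets by upper bound. The function name and the sample data are assumptions, and the counts are per bucket (not cumulative), which is a simplification rather than a description of either backend.

```python
from bisect import bisect_left


def count_into_buckets(observations, buckets):
    """Count each observation in the first bucket whose upper bound is >= it.

    Values above the largest bound are counted under float("inf").
    """
    bounds = sorted(buckets)
    counts = [0] * (len(bounds) + 1)
    for value in observations:
        # bisect_left gives the index of the first bound that is >= value
        counts[bisect_left(bounds, value)] += 1
    return dict(zip([*bounds, float("inf")], counts))


# Made-up request durations in milliseconds
durations_ms = [42, 180, 950, 1200, 7000]
counts = count_into_buckets(durations_ms, (100, 500, 1000, 5000))
```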

Code

corehq/util/metrics/__init__.py

https://github.com/dimagi/commcare-hq/blob/master/corehq/util/metrics/__init__.py

When naming your metrics, the name must start with "commcare.".
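If you want to check names in a test or a script, the rule is easy to express. `is_valid_metric_name` is a hypothetical helper for illustration, not something provided by `corehq.util.metrics`.

```python
METRIC_PREFIX = "commcare."


def is_valid_metric_name(name):
    """Return True if the name starts with the required prefix
    and has something after it (hypothetical helper)."""
    return name.startswith(METRIC_PREFIX) and len(name) > len(METRIC_PREFIX)
```

For example, `is_valid_metric_name("commcare.repeaters.check.locked_out")` is true, while `is_valid_metric_name("repeaters.check.locked_out")` is false.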

Counter

`metrics_counter()`

commcare.repeaters.check.locked_out

commcare-hq:

metrics_counter("commcare.repeaters.check.locked_out", tags={'partition': partition})

Datadog

check_repeaters task canceled: locked out

commcare.repeaters.check.attempt_forward

commcare-hq:

metrics_counter("commcare.repeaters.check.attempt_forward")

Datadog

Retries from check repeaters task

Gauge

`metrics_gauge()`

commcare-hq:

metrics_gauge(
    'commcare.pillowtop.error_queue',
    row['num_errors'],
    tags={
        'pillow_name': row['pillow'],
        'host': 'celery',
        'group': 'celery'
    },
    multiprocess_mode=MPM_MAX  # (Prometheus only)
)

`metrics_gauge_task()`

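metrics_gauge_task() wraps metrics_gauge() and defines a periodic task that calls the given function and reports its return value as the gauge, on the schedule given by run_every.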

commcare-hq:

metrics_gauge_task(
    'commcare.repeaters.overdue',
    RepeatRecord.objects.count_overdue,
    run_every=crontab(),  # every minute
    multiprocess_mode=MPM_MAX  # (Prometheus only)
)

Datadog

Number of overdue repeat records
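The idea of wrapping a gauge call in a task can be sketched without Celery. Everything below (`record_gauge`, `make_gauge_job`, the metric name, and the data) is a hypothetical stand-in, and the scheduling part is left out. The point is that the function is passed uncalled, so it is evaluated again each time the job runs.

```python
recorded = {}


def record_gauge(name, value):
    """Stand-in for a real gauge call: remember the latest value."""
    recorded[name] = value


def make_gauge_job(name, fn):
    """Return a job that calls fn() and reports the result as a gauge."""
    def job():
        record_gauge(name, fn())
    return job


overdue_ids = [101, 102, 103]  # made-up data
job = make_gauge_job("commcare.example.overdue", lambda: len(overdue_ids))

job()              # a scheduler would call this periodically
overdue_ids.pop()  # one record is no longer overdue
job()              # the next run reports the new, lower value
```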

Histogram

`metrics_histogram()`

commcare-hq:

metrics_histogram(
    'commcare.repeaters.repeat_record_processing.timing',
    processing_time * 1000,
    buckets=(100, 500, 1000, 5000),
    bucket_tag='duration',
    bucket_unit='ms',
    tags={
        'domain': repeat_record.domain,
        'action': action,
    },
)
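One way to think about `buckets`, `bucket_tag`, and `bucket_unit` is that each observation is labelled with the bucket it falls into, and that label is sent as the value of the `duration` tag. The sketch below shows the idea only; the `lt_`/`over_` label format is an assumption and may not match what commcare-hq actually emits.

```python
def bucket_tag_value(value, buckets, unit=""):
    """Return a label for the first bucket that the value is below.

    The "lt_..."/"over_..." format is an assumption for illustration.
    """
    bounds = sorted(buckets)
    for bound in bounds:
        if value < bound:
            return f"lt_{bound}{unit}"
    return f"over_{bounds[-1]}{unit}"


buckets = (100, 500, 1000, 5000)
tags = {
    "domain": "example-domain",  # made-up tag values
    "action": "example-action",
    "duration": bucket_tag_value(250, buckets, unit="ms"),
}
```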

Datadog: Tags allow you to filter your data

Click the "Edit" tab on each of those graphs. Compare their entries under "Graph your data". Notice that in the "from" field, the data is filtered by specific values of the "action" tag.

Datadog: Tags allow users to filter the graph

Click the "Edit" tab on each of those graphs. Compare their entries under "Graph your data". Notice that the metric and the "from" field are the same. One graph has "sum by" set to the "duration" tag, and the other graph has "sum by" set to the "domain" tag. This difference allows users to filter the same graph by different tags.
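The two uses of tags, filtering in the "from" field and grouping with "sum by", can be mimicked on a few made-up data points. The `sum_by` function and all tag values are hypothetical; this is only a model of what the graph editor does, not Datadog's query API.

```python
from collections import defaultdict

# Made-up data points, each with tags and a count
points = [
    {"tags": {"domain": "alpha", "action": "sent", "duration": "lt_500ms"}, "count": 3},
    {"tags": {"domain": "alpha", "action": "failed", "duration": "lt_500ms"}, "count": 1},
    {"tags": {"domain": "beta", "action": "sent", "duration": "lt_1000ms"}, "count": 2},
    {"tags": {"domain": "beta", "action": "sent", "duration": "lt_500ms"}, "count": 4},
]


def sum_by(points, group_tag, **filters):
    """Keep points whose tags match the filters, then total them per group_tag."""
    totals = defaultdict(int)
    for point in points:
        tags = point["tags"]
        if all(tags.get(key) == value for key, value in filters.items()):
            totals[tags[group_tag]] += point["count"]
    return dict(totals)


# Same metric, same filter, different "sum by" tag
by_domain = sum_by(points, "domain", action="sent")
by_duration = sum_by(points, "duration", action="sent")
```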

`metrics_histogram_timer()`

metrics_histogram_timer() is a context manager for timing a block of code, and organizing the timings into buckets.

commcare-hq:

with metrics_histogram_timer(
    "commcare.repeaters.check.processing",
    timing_buckets=_check_repeaters_buckets,
):
    ...  # the block of code to time goes here
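To show what "a context manager for timing a block of code, and organizing the timings into buckets" can look like, here is a minimal standalone sketch. `bucketed_timer`, the `timings` list, the metric name, the label format, and the assumption that buckets are in seconds are all hypothetical; this is not the commcare-hq implementation.

```python
import time
from contextlib import contextmanager

timings = []  # stand-in for sending the measurement to a metrics backend


@contextmanager
def bucketed_timer(name, timing_buckets):
    """Time the wrapped block and record which bucket (in seconds) it fell into."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        bounds = sorted(timing_buckets)
        label = next(
            (f"lt_{bound}s" for bound in bounds if elapsed < bound),
            f"over_{bounds[-1]}s",
        )
        timings.append((name, label))


with bucketed_timer("commcare.example.processing", timing_buckets=(1, 10, 60)):
    sum(range(1000))  # the work being timed
```

Recording in a `finally` clause means the timing is still captured if the timed block raises.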

Datadog

Adding a new widget

In your editor or IDE, copy the name of the metric from the code.
In Datadog, scroll to the group for the new widget.
Click "+ Add Widgets".
Drag "Timeseries" into the group.
The graph initially shows the "system.cpu.user" metric. Click in the field, paste the name of the metric over "system.cpu.user", and select the metric.
Scroll to the bottom of the modal, and give your widget a name.
Click "Save".

kaapstorm/2024-10-24_datadog.md
