commcare-hq supports both Datadog and Prometheus (a FOSS alternative to Datadog).
I will discuss Datadog.
From the Prometheus documentation:
A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart. For example, you can use a counter to represent the number of requests served, tasks completed, or errors.
Do not use a counter to expose a value that can decrease. For example, do not use a counter for the number of currently running processes; instead use a gauge.
A gauge is a metric that represents a single numerical value that can arbitrarily go up and down.
Gauges are typically used for measured values like temperatures or current memory usage, but also "counts" that can go up and down, like the number of concurrent requests.
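The difference between the two metric types can be sketched in plain Python. The `Counter` and `Gauge` classes below are hypothetical illustrations only; they are not the Prometheus client library or the commcare-hq metrics API.

```python
class Counter:
    """Hypothetical counter: the value can only go up."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("a counter cannot decrease")
        self.value += amount


class Gauge:
    """Hypothetical gauge: the value can be set to any number."""

    def __init__(self):
        self.value = 0

    def set(self, value):
        self.value = value


# Requests served only ever accumulate, so a counter fits.
requests_served = Counter()
requests_served.inc()
requests_served.inc(2)

# The number of running processes rises and falls, so a gauge fits.
running_processes = Gauge()
running_processes.set(5)
running_processes.set(3)  # going down is fine for a gauge
```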
A histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets.
https://en.wikipedia.org/wiki/Histogram#Examples
Histograms work slightly differently in Datadog and Prometheus.
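A minimal sketch of two ways to count observations in buckets. Prometheus histogram buckets are cumulative: each bucket counts every observation less than or equal to its upper bound. The second function assumes a tag-style approach in which each observation is assigned to exactly one bucket; the function names and the label format (`le_...`, `over_...`) are hypothetical and are not taken from commcare-hq or Datadog.

```python
from bisect import bisect_left

BUCKETS = (100, 500, 1000, 5000)  # upper bounds, e.g. in milliseconds


def cumulative_counts(observations, buckets=BUCKETS):
    """Prometheus-style: each bucket counts observations <= its bound."""
    counts = {bound: 0 for bound in buckets}
    counts[float("inf")] = 0  # the implicit "+Inf" bucket
    for value in observations:
        for bound in counts:
            if value <= bound:
                counts[bound] += 1
    return counts


def single_bucket(value, buckets=BUCKETS):
    """Tag-style (assumed): return the one bucket the value falls into."""
    index = bisect_left(buckets, value)
    if index == len(buckets):
        return f"over_{buckets[-1]}"
    return f"le_{buckets[index]}"


observations = [50, 750, 6000]
cumulative = cumulative_counts(observations)
labels = [single_bucket(value) for value in observations]
```

With the cumulative form, the 50 ms observation is counted in every bucket, whereas the tag-style form places it only in the smallest bucket that contains it.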
corehq/util/metrics/__init__.py
https://github.com/dimagi/commcare-hq/blob/master/corehq/util/metrics/__init__.py
When naming your metrics, the name must start with "commcare.".
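The naming rule can be expressed as a simple check. This is only a sketch: `check_metric_name` is a hypothetical helper written for illustration, not a function from `corehq.util.metrics`.

```python
REQUIRED_PREFIX = "commcare."


def check_metric_name(name):
    """Return the name unchanged, or raise if the prefix is missing."""
    if not name.startswith(REQUIRED_PREFIX):
        raise ValueError(
            f"metric name {name!r} must start with {REQUIRED_PREFIX!r}"
        )
    return name


valid_name = check_metric_name("commcare.repeaters.check.attempt_forward")
```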
metrics_counter("commcare.repeaters.check.locked_out", tags={'partition': partition})
metrics_counter("commcare.repeaters.check.attempt_forward")
metrics_gauge(
    'commcare.pillowtop.error_queue',
    row['num_errors'],
    tags={
        'pillow_name': row['pillow'],
        'host': 'celery',
        'group': 'celery'
    },
    multiprocess_mode=MPM_MAX  # (Prometheus only)
)
metrics_gauge_task() wraps metrics_gauge() and defines a task.
metrics_gauge_task(
    'commcare.repeaters.overdue',
    RepeatRecord.objects.count_overdue,
    run_every=crontab(),  # every minute
    multiprocess_mode=MPM_MAX  # (Prometheus only)
)
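Note that the second argument above is a callable passed without parentheses, so it can be called again each time the task runs. The idea of wrapping a gauge in a task can be sketched as follows. Everything here is a hypothetical stand-in: `report_gauge`, `make_gauge_task`, and `count_overdue` illustrate the pattern only, and no scheduler is involved.

```python
reported = {}  # hypothetical sink standing in for a metrics backend


def report_gauge(name, value):
    """Stand-in for reporting a gauge value to a backend."""
    reported[name] = value


def make_gauge_task(name, fn):
    """Return a task that calls fn() and reports the result as a gauge."""
    def task():
        report_gauge(name, fn())
    return task


overdue = [3]  # mutable so the value can change between runs


def count_overdue():
    """Hypothetical measurement function."""
    return overdue[0]


task = make_gauge_task("commcare.example.overdue", count_overdue)
task()            # a scheduler would call this periodically
first_report = reported["commcare.example.overdue"]
overdue[0] = 1
task()            # the next run picks up the new value
second_report = reported["commcare.example.overdue"]
```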
metrics_histogram(
    'commcare.repeaters.repeat_record_processing.timing',
    processing_time * 1000,
    buckets=(100, 500, 1000, 5000),
    bucket_tag='duration',
    bucket_unit='ms',
    tags={
        'domain': repeat_record.domain,
        'action': action,
    },
)
- Time spent processing postponed repeat records
- Time spent processing attempted repeat records (excludes waiting for endpoint)
Click the "Edit" tab on each of those graphs. Compare their entries under "Graph your data". Notice that in the "from" field, data is filtered by specific values of the "action" tag.
Click the "Edit" tab on each of those graphs. Compare their entries under "Graph your data". Notice that the metric and the "from" field are the same. One graph has "sum by" set to the "duration" tag, and the other graph has "sum by" set to the "domain" tag. This difference allows users to filter the same graph by different tags.
metrics_histogram_timer() is a context manager for timing a block of code and organizing the timings into buckets.
with metrics_histogram_timer(
    "commcare.repeaters.check.processing",
    timing_buckets=_check_repeaters_buckets,
):
    ...  # the block of code to time
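How a timing context manager of this kind might work can be sketched with the standard library. This is not the commcare-hq implementation: `histogram_timer`, the `recorded` list, and the metric name are hypothetical, and the sketch assumes the bucket thresholds are given in seconds.

```python
import time
from contextlib import contextmanager

recorded = []  # hypothetical sink standing in for a metrics backend


@contextmanager
def histogram_timer(name, timing_buckets):
    """Time the wrapped block and record the bucket it fell into."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        # Smallest threshold that covers the elapsed time, or None
        # if the block took longer than every threshold.
        bucket = next(
            (b for b in sorted(timing_buckets) if elapsed <= b),
            None,
        )
        recorded.append((name, bucket, elapsed))


with histogram_timer("commcare.example.processing", timing_buckets=(1, 10, 60)):
    total = sum(range(1000))  # the code being timed
```

Recording in a `finally` clause means the timing is captured even if the timed block raises an exception.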
- In your editor or IDE, copy the name of the metric from the code.
- In Datadog, scroll to the group for the new widget.
- Click "+ Add Widgets".
- Drag "Timeseries" into the group.
- The graph initially shows the "system.cpu.user" metric. Click in the field, paste the name of the metric over "system.cpu.user", and select the metric.
- Scroll to the bottom of the modal, and give your widget a name.
- Click "Save".