Quickwit+Vector for Kubernetes (EKS-flavored) Observability

Quickwit filled with Kube Logs (With Vector+ArgoCD+Grafana)

An experienced operator's guide to streaming Kubernetes workload logs into Quickwit.

  1. Overview

  2. How does it work?

  3. Quickwit Costs

  4. Argo Configurations

  5. AWS Configuration

  6. Using Quickwit

  7. Log Management

  8. Grafana Integration

How does it work?

Every Kubernetes pod has its STDOUT and STDERR streams written to the kube node filesystem. Ever wondered how kubectl logs ... works? Well, the container logs are always written to disk and automatically trimmed before they can fill up the kube node's filesystem. The node's kubelet streams them to you on-demand -- going from filesystem to kubelet to kube API and finally over to your kubectl! Read more about Kubernetes standard log management.
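
You can see this for yourself from a node shell (or a privileged debug pod). A minimal sketch -- the paths and output are illustrative, following the /var/log/pods/<namespace>_<pod>_<uid>/<container>/N.log layout and the CRI log line format of timestamp, stream, partial/full tag, then message:

% ls /var/log/pods/
argo_argo-cd-repo-server-869d695dc8-fgmqc_eab867f0-389b-4b6b-9b7f-69c7c3474c45
[[[ SNIP ]]]
% tail -n 1 /var/log/pods/argo_argo-cd-repo-server-*/repo-server/2.log
2024-10-31T02:28:03.185047076Z stderr F time="2024-10-31T02:28:03Z" level=info msg="finished unary call with code OK" ...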

We want to create an alternative path for these logs -- they will still be available for use with commands such as kubectl logs ... but we want to tail the files and ship their contents off to Quickwit as soon as possible. Getting to the point where everything works can be an odyssey, so this is my attempt at writing the book -- er, pamphlet.


Quickwit Costs

I have been accumulating data with Quickwit+Vector for about two days. The visibility has already helped me identify orphaned workloads and misconfigured logging in important cronjobs. And it has done so with a minimal Kubernetes footprint and negligible S3 usage.

1.9GB of compressed data in S3:

[screenshot: the S3 console showing ~1.9GB stored in the quickwit-logs bucket]

and the kube footprint is minuscule (and also not configured for high-availability just yet!)

% kubectl -n vector top pods
NAME                 CPU(cores)   MEMORY(bytes)
vector-agent-2fw7s   2m           25Mi
vector-agent-448zq   1m           26Mi
vector-agent-45m98   2m           27Mi
vector-agent-4b4r4   1m           21Mi
vector-agent-67lkg   1m           18Mi
[[[ SNIP ]]]

and

% kubectl -n quickwit top pods
NAME                                              CPU(cores)   MEMORY(bytes)
quickwit-logs-control-plane-7447dfb4d9-xgb2m      2m           11Mi
quickwit-logs-indexer-0                           25m          166Mi
quickwit-logs-janitor-9cb844987-whg7g             2m           19Mi
quickwit-logs-metastore-54d7c68f59-gnfjx          2m           15Mi
quickwit-logs-searcher-0                          2m           249Mi

Around these parts, we use ArgoCD to manage our apps. We don't manage helm repos locally and we don't run helm release updates by hand. We rely on gitops and active state sync to keep our many apps in sync. Kubernetes is complex and ArgoCD adds just a smidge more complexity in order to give unified visibility.

The best part of ArgoCD: it's really easy to share definitive Application specifications!

Argo Project Definitions

You have to create a tenancy with some guard rails. We want to limit which helm charts are installed into which namespaces (and what objects the charts can and cannot install in the cluster). Here are the relevant AppProjects, edited for clarity:

apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: kube-system
  namespace: argo
spec:
  destinations:
    - namespace: 'kube-system'
      server: '*'
    - namespace: 'prometheus' # creates a ton of metric-gathering pods
      server: '*'
    - namespace: 'vector' # creates a ton of vector agent pods
      server: '*'
  sourceRepos:
    - 'https://prometheus-community.github.io/helm-charts'
    - 'https://helm.vector.dev'
  clusterResourceWhitelist: # required to install the Prometheus CRDs
    - group: "*"
      kind: "*"

apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: quickwit
  namespace: argo
spec:
  destinations:
    - namespace: "quickwit"
      server: "*"
  sourceRepos:
    - "https://helm.quickwit.io"
    - "https://github.com/xrl/quickwit-helm-charts.git"

these AppProject definitions are naturally managed by Git but that whole state-sync-webhooktastic architecture is out of scope.

Argo Application Definitions

Vector runs in agent mode:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: vector-agent
  namespace: argo
spec:
  project: kube-system
  syncPolicy:
    automated:
      prune: true
  source:
    repoURL: https://helm.vector.dev
    targetRevision: 0.37.0
    chart: vector
    helm:
      releaseName: vector-agent
      values: |
        fullnameOverride: "vector-agent"
        role: Agent
        customConfig:
          data_dir: /vector-data-dir
          api:
            enabled: true
            address: 0.0.0.0:8686
          sources:
            kubernetes_logs:
              type: kubernetes_logs
          transforms:
            filtered_logs:
              type: remap
              inputs: ["kubernetes_logs"]
              source: |
                .message = string!(.message)
                if contains(.message, "GET /ready HTTP/1.1") {
                  abort # we don't care about RX health messages
                }
            kube_logs_to_otel:
              type: remap
              inputs: ["filtered_logs"]
              source: |
                .timestamp_nanos = to_unix_timestamp!(.timestamp, unit: "nanoseconds")
                .severity_text = "INFO"
                .body = {
                  "message": .message,
                  "stream": .stream
                }
                .attributes = .kubernetes

                del(.file)
                del(.timestamp)
                del(.source_type)
                del(.stream)
                del(.kubernetes)
                del(.message)
          sinks:
            quickwit_logs:
              type: http
              method: post
              inputs: ["kube_logs_to_otel"]
              encoding:
                codec: "json"
              framing:
                method: "newline_delimited"
              uri: "http://quickwit-logs-indexer.quickwit.svc.cluster.local:7280/api/v1/otel-logs-v0_7/ingest"
        # livenessProbe -- Override default liveness probe settings, if customConfig is used requires customConfig.api.enabled true
        ## Requires Vector's API to be enabled
        livenessProbe:
          httpGet:
            path: /health
            port: api

        # readinessProbe -- Override default readiness probe settings, if customConfig is used requires customConfig.api.enabled true
        ## Requires Vector's API to be enabled
        readinessProbe:
          httpGet:
            path: /health
            port: api
  destination:
    server: https://kubernetes.default.svc
    namespace: vector

good to notice in this config:

  • kubernetes logs are filtered to remove nuisance messages
  • remap transform is used to painfully convert to OTel-compatible log format
    • use kubectl -n vector exec -it $vector_pod -- vector tap kubernetes_logs to see what a kube log looks like (I passed the output through printf '%s' "$json" | jq to format it):
{
  "file": "/var/log/pods/argo_argo-cd-repo-server-869d695dc8-fgmqc_eab867f0-389b-4b6b-9b7f-69c7c3474c45/repo-server/2.log",
  "kubernetes": {
    "container_id": "containerd://d3e051e44f0fe97790d4615998b4747abe3a1f3adae0d7a9395934d526386615",
    "container_image": "quay.io/argoproj/argocd:v2.11.7",
    "container_image_id": "quay.io/argoproj/argocd@sha256:47e3e00dc501680e77b2496c67ed2e6bff8de1c71e55b56b37b9b11fc34f2ed4",
    "container_name": "repo-server",
    "namespace_labels": {
      "kubernetes.io/metadata.name": "argo"
    },
    "node_labels": {
      "arch": "amd64",
      "beta.kubernetes.io/arch": "amd64",
      "beta.kubernetes.io/instance-type": "r5.xlarge",
      "beta.kubernetes.io/os": "linux",
      "eks.amazonaws.com/capacityType": "ON_DEMAND",
      "eks.amazonaws.com/nodegroup": "ondemand-1b-2024083115310830840000000d",
      "eks.amazonaws.com/nodegroup-image": "ami-039bdded3573af90a",
      "failure-domain.beta.kubernetes.io/region": "eu-central-1",
      "failure-domain.beta.kubernetes.io/zone": "eu-central-1b",
      "k8s.io/cloud-provider-aws": "3a3320977962e39cf45d0123eecd5f54",
      "kubernetes.io/arch": "amd64",
      "kubernetes.io/hostname": "ip-172-30-50-168.eu-central-1.compute.internal",
      "kubernetes.io/os": "linux",
      "lifecycle": "ondemand",
      "node.kubernetes.io/instance-type": "r5.xlarge",
      "nodegroup": "ondemand-eu-central-1b",
      "topology.ebs.csi.aws.com/zone": "eu-central-1b",
      "topology.k8s.aws/zone-id": "euc1-az3",
      "topology.kubernetes.io/region": "eu-central-1",
      "topology.kubernetes.io/zone": "eu-central-1b"
    },
    "pod_annotations": {
      "checksum/cm": "860c7d2900972fc99c6d7059e06a25d9646dcbf74da82484611321c8cce79377",
      "checksum/cmd-params": "4c016fc0004793cf74267de6a9da23ad69fb79f0f9cd503ffae016297898f41d"
    },
    "pod_ip": "172.30.34.204",
    "pod_ips": [
      "172.30.34.204"
    ],
    "pod_labels": {
      "app.kubernetes.io/component": "repo-server",
      "app.kubernetes.io/instance": "argo-cd",
      "app.kubernetes.io/managed-by": "Helm",
      "app.kubernetes.io/name": "argocd-repo-server",
      "app.kubernetes.io/part-of": "argocd",
      "app.kubernetes.io/version": "v2.11.7",
      "helm.sh/chart": "argo-cd-7.3.11",
      "pod-template-hash": "869d695dc8"
    },
    "pod_name": "argo-cd-repo-server-869d695dc8-fgmqc",
    "pod_namespace": "argo",
    "pod_node_name": "ip-172-30-50-168.eu-central-1.compute.internal",
    "pod_owner": "ReplicaSet/argo-cd-repo-server-869d695dc8",
    "pod_uid": "eab867f0-389b-4b6b-9b7f-69c7c3474c45"
  },
  "message": "time=\"2024-10-31T02:28:03Z\" level=info msg=\"finished unary call with code OK\" grpc.code=OK grpc.method=Check grpc.service=grpc.health.v1.Health grpc.start_time=\"2024-10-31T02:28:03Z\" grpc.time_ms=0.019 span.kind=server system=grpc",
  "source_type": "kubernetes_logs",
  "stream": "stderr",
  "timestamp": "2024-10-31T02:28:03.185047076Z"
}
  • use kubectl -n vector exec -it $vector_pod -- vector top to see how many messages are moving through the agent
  • quickwit is hit with vector's generic http sink, values taken from the quickwit vector docs
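
If you want to sanity-check that ingest endpoint without Vector in the loop, port-forward to the indexer and POST a newline-delimited JSON document yourself. A minimal sketch, assuming the service name and index from the config above (run the curl from a second terminal):

% kubectl -n quickwit port-forward svc/quickwit-logs-indexer 7280:7280
% curl -X POST http://localhost:7280/api/v1/otel-logs-v0_7/ingest \
    --data '{"body":{"message":"hello from curl","stream":"stdout"},"severity_text":"INFO","timestamp_nanos":1730342779504241000}'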

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: quickwit-logs
  namespace: argo
spec:
  project: quickwit
  syncPolicy:
    automated:
      prune: true
  source:
    repoURL: "https://github.com/xrl/quickwit-helm-charts.git"
    path: charts/quickwit
    targetRevision: per-service-env-from
    helm:
      releaseName: quickwit-logs
      values: |
        fullnameOverride: quickwit-logs
        config:
          default_index_root_uri: s3://quickwit-logs
          storage:
            s3:
              region: eu-central-1
        metastore:
          extraEnv:
            - name: QW_METASTORE_URI
              valueFrom:
                secretKeyRef:
                  name: quickwitlogs-secret
                  key: POSTGRES_URL
        searcher:
          replicaCount: 1
        serviceAccount:
          create: true
          annotations:
            eks.amazonaws.com/role-arn: "arn:aws:iam::1234567890:role/quickwit-logs"
  destination:
    server: https://kubernetes.default.svc
    namespace: quickwit

good to notice in this config:

  • uses a fork of the helm chart until this PR can be addressed
    • I have an RDS postgres instance I want to connect to. I will never put postgres credentials in my helm values 🫡
  • I only need one searcher for now
  • uses the EKS service account mechanism to inject an AWS session token into the pod
    • I will never roll/manage AWS service account credentials again 🫡
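
For reference, the secretKeyRef above expects a plain Kubernetes Secret shaped roughly like this. A sketch -- the connection string is made up, and in a gitops world you'd create this out-of-band (external-secrets, sealed-secrets, etc.) rather than commit it:

apiVersion: v1
kind: Secret
metadata:
  name: quickwitlogs-secret
  namespace: quickwit
type: Opaque
stringData:
  # hypothetical RDS connection string -- substitute your own
  POSTGRES_URL: postgres://quickwit:CHANGE_ME@my-rds.eu-central-1.rds.amazonaws.com:5432/quickwit_metastore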

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kube-prometheus-stack
  namespace: argo
spec:
  project: kube-system
  syncPolicy:
    automated:
      prune: true
    syncOptions:
      - ServerSideApply=true
  source:
    repoURL: https://prometheus-community.github.io/helm-charts
    targetRevision: 65.3.2
    chart: kube-prometheus-stack
    helm:
      releaseName: prometheus
      values: |
        fullnameOverride: "prometheus"

        grafana:
          env:
            GF_INSTALL_PLUGINS: "quickwit-quickwit-datasource"
          persistence:
            enabled: true
          additionalDataSources:
            - name: Quickwit Logs
              type: quickwit-quickwit-datasource
              url: http://quickwit-logs-searcher.quickwit.svc.cluster.local:7280/api/v1
              jsonData:
                index: otel-logs-v0_7
                logMessageField: body
                logLevelField: severity_text
          grafana.ini:
            auth:
              disable_login_form: true
              disable_signout_menu: true
            auth.anonymous:
              enabled: true
              org_name: Main Org.
              org_role: Editor
              # database:
              # type: postgres
              # url: "${POSTGRES_URL}"

        prometheusOperator:
          kubeletService:
            enabled: false

        prometheus:
          prometheusSpec:
            resources:
              requests:
                memory: "28Gi"
                cpu: "2000m"
            ## Prometheus StorageSpec for persistent data
            ## ref: https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/user-guides/storage.md
            ##
            storageSpec:
              volumeClaimTemplate:
                spec:
                  storageClassName: gp2
                  accessModes: ["ReadWriteOnce"]
                  resources:
                    requests:
                      storage: 200Gi
        kube-state-metrics:
          podSecurityPolicy:
            enabled: false
  destination:
    server: https://kubernetes.default.svc
    namespace: prometheus

good to notice in this config:

  • this configures a prometheus instance AND a daemonset (node-exporter) that exposes metrics from every kube node for prometheus to scrape
  • the GF_INSTALL_PLUGINS ENV var lets us install the quickwit plugin on every container boot (that was new to me!)
  • we configure the data source right in the helm chart values (also new to me, I usually did clickops for that)
  • the persistence configuration kind of stinks: the default rollout strategy can deadlock over the PVC. deleting the stuck replicaset works around it
    • the ultimate goal should be to use postgres to store my grafana dashboard data, no PVCs
  • I disable the grafana auth machinery because I have an OIDC gateway in front of the service
    • out of scope from this documentation
    • you can go straight to the grafana service with kubectl -n prometheus port-forward svc/prometheus-grafana 8080:80, then open http://localhost:8080 in your browser. no login required.

AWS Role IAM Terraform

resource "aws_iam_role" "quickwit-logs" {
  name = "quickwit-logs"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRoleWithWebIdentity"
        Effect = "Allow"
        Principal = {
          Federated = format("arn:aws:iam::%s:oidc-provider/%s", var.aws_account_id, var.oidc_provider_id)
        }
        Condition = {
          StringLike = {
            "${var.oidc_provider_id}:sub" : "system:serviceaccount:quickwit:quickwit-logs",
            "${var.oidc_provider_id}:aud" : "sts.amazonaws.com"
          }
        }
      }
    ]
  })

  inline_policy {
    name = "s3-access"
    policy = jsonencode({
      "Version" : "2012-10-17",
      "Statement" : [
        {
          "Effect" : "Allow",
          "Action" : [
            "s3:ListBucket"
          ],
          "Resource" : [
            "arn:aws:s3:::quickwit-logs"
          ]
        },
        {
          "Effect" : "Allow",
          "Action" : [
            "s3:GetObject",
            "s3:PutObject",
            "s3:DeleteObject",
            "s3:ListMultipartUploadParts",
            "s3:AbortMultipartUpload"
          ],
          "Resource" : [
            "arn:aws:s3:::quickwit-logs/*"
          ]
        }
      ]
    })
  }

  managed_policy_arns = []
}

good to notice in this terraform:

  • OIDC trust relationship with the Kubernetes cluster (see the link to the IAM docs above)
    • limited access to kubernetes service accounts in the quickwit namespace
  • grants a variety of S3 permissions to the quickwit-logs bucket
    • the S3 permissions were found in quickwit's AWS S3 storage docs
    • remember you want to follow the principle of least privilege, grant only what is strictly necessary

the exact mechanism of running this terraform is out of scope -- but take comfort in knowing that it was applied through a gitops workflow.
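
For completeness, the role above assumes two input variables. A sketch of their declarations (the example issuer value is hypothetical):

variable "aws_account_id" {
  description = "AWS account that owns the EKS OIDC provider"
  type        = string
}

variable "oidc_provider_id" {
  description = "EKS OIDC issuer host and path, e.g. oidc.eks.eu-central-1.amazonaws.com/id/EXAMPLE1234567890"
  type        = string
}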

Accessing Quickwit Directly

Port-forward into the quickwit searcher service: kubectl -n quickwit port-forward svc/quickwit-logs-searcher 7280:7280, then open your browser to http://localhost:7280 and you'll see this:

Quickwit UI Homepage
The Quickwit UI homepage showing the search interface

and if you go to look at the automatic otel logs index, you'll see this:

[screenshot: the auto-created otel-logs-v0_7 index overview in the Quickwit UI]

things to notice:

  • compression is a great thing. we pay for ~250MB of S3 storage for almost 5GB of JSON (there's a lot of junk in there I plan to strip out with vector's VRL)
    • at the time of writing this incarnation of quickwit had been running for less than 8 hours. this is a production kube cluster with ~150 nodes handling an enterprise workload.
    • the constant object writes, reads, write-backs might add up. I'll keep an eye on things best I can.
  • the number of splits changes often as the quickwit service creates, merges, and garbage-collects them
    • a split is a single file with all the contents of a tantivy segment in one compressed, seekable blob. check out the contents in S3 like this:
% aws s3 ls s3://quickwit-logs/otel-logs-v0_7/
2024-10-30 16:24:59    7336474 01JBFHMNRV7XQJAF3B4PHBJ2CY.split
2024-10-30 16:26:38    9817784 01JBFHQNS1J8KGC50MJ3BP6S9V.split
2024-10-30 16:27:33   29905109 01JBFHSATR3TXDDDD65R3K4RHN.split
2024-10-30 17:25:12      63586 01JBFN2RTR4Y10873NBMP0A3NQ.split
2024-10-30 17:25:17      80415 01JBFN2XQ4PB5PJJBD2G8SKMCQ.split
2024-10-30 19:22:47  149153365 01JBFVT4BJ6DYAB86SK8V6B439.split
[[[ SNIP ]]]
  • the quickwit project has not tackled the unenviable task of authentication at the quickwit level. don't expose quickwit to the open internet.
    • grafana is the full-fledged dashboard builder so it's probably best to leave the auth to them. when I get grafana OIDC working I'll update this document.


Querying the Kubernetes Logs from the Quickwit UI

When visiting http://localhost:7280, I can query the Kubernetes logs and just make sure I understand what a document looks like.

The Query Editor panel of the Quickwit UI looks like this when our index has data flowing:

[screenshot: the Quickwit UI Query Editor with log hits for the otel-logs-v0_7 index]

and one log's JSON looks like:

{
  "attributes": {
    "container_id": "containerd://9f6e5be434e97b7e37628b5f7a2423c4ec293939fbf58b22a66446ebff54ba87",
    "container_image": "registry.k8s.io/ingress-nginx/controller:v1.11.3@sha256:d56f135b6462cfc476447cfe564b83a45e8bb7da2774963b00d12161112270b7",
    "container_image_id": "registry.k8s.io/ingress-nginx/controller@sha256:d56f135b6462cfc476447cfe564b83a45e8bb7da2774963b00d12161112270b7",
    "container_name": "controller",
    "namespace_labels": {
      "kubernetes.io/metadata.name": "ingress-nginx"
    },
    "node_labels": {
      "arch": "amd64",
      "beta.kubernetes.io/arch": "amd64",
      "beta.kubernetes.io/instance-type": "r5.xlarge",
      "beta.kubernetes.io/os": "linux",
      "eks.amazonaws.com/capacityType": "ON_DEMAND",
      "eks.amazonaws.com/nodegroup": "ondemand-1a-20240831153108298900000007",
      "eks.amazonaws.com/nodegroup-image": "ami-039bdded3573af90a",
      "failure-domain.beta.kubernetes.io/region": "eu-central-1",
      "failure-domain.beta.kubernetes.io/zone": "eu-central-1a",
      "k8s.io/cloud-provider-aws": "3a3320977962e39cf45d0123eecd5f54",
      "kubernetes.io/arch": "amd64",
      "kubernetes.io/hostname": "ip-172-30-22-198.eu-central-1.compute.internal",
      "kubernetes.io/os": "linux",
      "lifecycle": "ondemand",
      "node.kubernetes.io/instance-type": "r5.xlarge",
      "nodegroup": "ondemand-eu-central-1a",
      "topology.ebs.csi.aws.com/zone": "eu-central-1a",
      "topology.k8s.aws/zone-id": "euc1-az2",
      "topology.kubernetes.io/region": "eu-central-1",
      "topology.kubernetes.io/zone": "eu-central-1a"
    },
    "pod_annotations": {
      "kubectl.kubernetes.io/restartedAt": "2023-12-06T01:04:59Z"
    },
    "pod_ip": "172.30.11.35",
    "pod_ips": [
      "172.30.11.35"
    ],
    "pod_labels": {
      "app.kubernetes.io/component": "controller",
      "app.kubernetes.io/instance": "ingress-nginx",
      "app.kubernetes.io/managed-by": "Helm",
      "app.kubernetes.io/name": "ingress-nginx",
      "app.kubernetes.io/part-of": "ingress-nginx",
      "app.kubernetes.io/version": "1.11.3",
      "helm.sh/chart": "ingress-nginx-4.11.3",
      "pod-template-hash": "6bc959cb88"
    },
    "pod_name": "ingress-nginx-controller-6bc959cb88-fp97t",
    "pod_namespace": "ingress-nginx",
    "pod_node_name": "ip-172-30-22-198.eu-central-1.compute.internal",
    "pod_owner": "ReplicaSet/ingress-nginx-controller-6bc959cb88",
    "pod_uid": "3d301f4e-b13a-45b3-8853-99b836e464a1"
  },
  "body": {
    "message": "172.30.9.182 - - [31/Oct/2024:02:46:19 +0000] \"GET /inventory?id=8934812a-40c7-4df9-8b79-32a02f358282 HTTP/1.1\" 200 11645 \"-\" \"Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)\" 392 0.159 [production-rx-production-80] [] 172.30.55.162:80 11624 0.160 200 2d56343d741dbdbbc2a2d0dfbbcbe7f8",
    "stream": "stdout"
  },
  "severity_text": "INFO",
  "timestamp_nanos": 1730342779504241000
}

Things to note:

  • the object structure is dictated by my VRL from the vector-agent Application definition above, but copied here:
kube_logs_to_otel:
  type: remap
  inputs: ["filtered_logs"]
  source: |
    .timestamp_nanos = to_unix_timestamp!(.timestamp, unit: "nanoseconds")
    .severity_text = "INFO"
    .body = {
      "message": .message,
      "stream": .stream
    }
    .attributes = .kubernetes

    del(.file)
    del(.timestamp)
    del(.source_type)
    del(.stream)
    del(.kubernetes)
    del(.message)
  • This is the first time I have tried to match the otel-logs schema

My must-haves to perform field-based search (each sketched in query syntax after this list):

  • a field must be an exact value: attributes.container_id:"containerd://9f6e5be434e97b7e37628b5f7a2423c4ec293939fbf58b22a66446ebff54ba87"
  • a field must be one of a list of values
  • a field must not be a value -attributes.container_id:"containerd://9f6e5be434e97b7e37628b5f7a2423c4ec293939fbf58b22a66446ebff54ba87" (note the minus)
  • a field should be present (no specific value in mind)
  • a field should not be present
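
For reference, here's how I'd express those in Quickwit's query language. A sketch -- double-check the query-language docs for your Quickwit version, especially the IN set query and the field-presence wildcard:

attributes.container_name:"controller"                  # exact value
attributes.pod_namespace: IN [quickwit vector]          # one of a list of values
-attributes.container_name:"controller"                 # must not be a value
attributes.pod_owner:*                                  # field is present
-attributes.pod_owner:*                                 # field is absent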

The hits just keep coming!

[screenshot: more log hits streaming into the Quickwit UI]

Managing Log Retention

How do we tweak the otel logs retention? 🤔
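
I haven't wired this up yet, but per my read of the Quickwit docs, retention is a policy in the index config and the janitor deletes expired splits. A sketch of the relevant stanza (whether you can apply it in place to the auto-created otel-logs-v0_7 index depends on your Quickwit version):

version: 0.7
index_id: otel-logs-v0_7
# ... doc_mapping elided ...
retention:
  period: 30 days   # drop splits once they age out
  schedule: daily   # how often the janitor evaluates the policy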

Grafana

You don't want to use the Quickwit UI for day-to-day observability/incident response. It's handy, for sure, but you want to build dashboards that lay out all the data at once.

This tutorial focuses on the Grafana version bundled with the chart pinned above (11.x at the time of writing).

Creating a dashboard

Remember that we're building on top of the popular kube-prometheus-stack helm chart. This chart injects a whole suite of prometheus-centric dashboards into Grafana, creating what is essentially a stdlib of monitoring. And we're totally able to add more dashboards.

  1. Make a Logs folder where we can store our quickwit-backed dashboards:

Creating Logs folder in Grafana
Creating a new folder to organize Quickwit log dashboards

  2. Select "New Dashboard" from the dropdown menu:

New Dashboard option
Selecting the New Dashboard option from Grafana menu

  3. And we'll create a new dashboard to lay out some smart widgets:

Dashboard creation screen
Empty dashboard ready for new panels

  4. Add your first visualization panel:

Add visualization panel
Adding a new visualization panel to the dashboard

  5. Let's just start by saving the empty dashboard:

Save dashboard dialog
Saving the dashboard with initial configuration

  6. Confirm the save operation:

Dashboard save confirmation
Confirming dashboard save operation

Summarizing Logs with Aggregations

Aggregations, or bucketing, are used for generating summary statistics of a dataset. The dataset is split into multiple buckets and we can ask Quickwit to generate summary statistics on each bucket. The usual suspects for summarizing a bucket: count, average, min, max, sum, percentiles, etc. The docs will tell you that aggregations are only performed on fast fields -- stats are calculated from the columnar portion of the quickwit split without having to read all the data.

Let's use aggregations to identify the noisiest kube cluster namespaces. We want to group by kube namespace and emit a count metric for each group. We'll further aggregate our data by time so we get a sense for the trends.
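
Under the hood, the panel we're about to build boils down to an aggregation query against Quickwit's search API. A hand-rolled sketch you can try through a searcher port-forward (grouping on attributes.pod_namespace for simplicity; the aggregation syntax is the Elasticsearch-compatible subset, so verify against the aggregations docs for your version):

% kubectl -n quickwit port-forward svc/quickwit-logs-searcher 7280:7280
% curl -X POST http://localhost:7280/api/v1/otel-logs-v0_7/search \
    -H 'Content-Type: application/json' \
    --data '{
      "query": "*",
      "max_hits": 0,
      "aggs": {
        "per_5m": {
          "date_histogram": {"field": "timestamp_nanos", "fixed_interval": "5m"},
          "aggs": {
            "per_namespace": {"terms": {"field": "attributes.pod_namespace", "size": 10}}
          }
        }
      }
    }'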

  1. Add a visualization, choose the Quickwit Logs data source (remember we configured this as part of the helm values for the kube-prometheus-stack):

Selecting Quickwit Logs datasource
Selecting the Quickwit Logs data source for the visualization

  2. You'll then be met with an intimidating blank panel editing screen:

Blank panel editor
The initial blank panel editing interface

  3. Be sure your panel is the Metric type and specify the timestamp_nanos field for the aggregation:

Metric type configuration
Configuring the metric type and timestamp field

I don't like the line graph. In the top-right you can change the visualization type to Bar Chart:

Changing to bar chart
Switching the visualization type to Bar Chart

Set the first group by statement to build a date histogram on the timestamp_nanos field, then hit Grafana's dashboard refresh icon and you'll see:

Initial date histogram
Initial date histogram visualization

You'll need to hit the Grafana refresh button often; the quickwit plugin doesn't seem to reissue the query when you change things in the UI. This button:

Refresh button location
Location of the refresh button for updating visualizations

Make the bar chart more intelligible by increasing the bucketing interval; set it to 5m:

Adjusting bucket interval
Setting the bucket interval to 5 minutes

And we'll further subdivide those buckets by adding another term aggregation; let's group by attributes.namespace_labels.kubernetes.io/metadata.name. Click on the + icon to the right of the Group By expression builder:

Adding term aggregation
Adding a new term aggregation

Group by configuration
Configuring the group by settings

And thankfully you can type-ahead to discover relevant fields:

Type-ahead field discovery
Using type-ahead to find relevant fields

And you'll get something like this, with the bars side-by-side:

Side by side bars
Visualization showing side-by-side bar chart

Search the options on the right side for stacking:

Stacking option search
Locating the stacking option in settings

And then choose normal:

Normal stacking selection
Selecting normal stacking mode

Give the panel a smart title:

Panel title entry
Adding a descriptive title to the panel

And that's enough of the screenshot parade. Hit Save on the page and now you have your first starter dashboard:

Final dashboard view
The completed dashboard with configured visualization


Viewing Logs

Aggregations are great for summarizing what's going on. But what about when it's time to dig into specifics? Thankfully, Grafana has a built-in panel type for displaying log data. It has a few gotchas but let's go ahead and add a new panel to our dashboard:

Note: I am filtering to quickwit's pods

  1. Add another visualization

Adding new visualization
Adding a new visualization panel to the dashboard

  2. Change the visualization type from "Time series" to "Logs"

Changing to Logs visualization
Switching visualization type to Logs view

  3. Change query type to "Logs"

Changing query type
Setting the query type to Logs

  4. Click "refresh dashboard" to fetch data and populate the panel

Refreshing dashboard
Initial view after refreshing the dashboard

Notice the little chevrons and the lack of log textual data. Grafana is definitely fetching the data from quickwit but the data is not coming back in a format matching the panel's conventions. It's not really documented anywhere, but the logs panel presents data based on its position in the dataset returned by the datasource plugin. The logs panel does not depend on the name of the field, just the position. You can see the data using the table view toggle:

Table view of logs
Table view showing the raw data structure

Notice how $qw_message is the second column and it's blank. I'm not sure what the $qw_message template variable is used for, but we want to reorder the dataset and put body.message as the second column. Good news for everyone: Grafana has data transformations (they were new to me). Transformations seem like a powerful feature, so let's try one out here. Switch from Query to Transform data (0):

Transform data option
Accessing the data transform options

Click "Add transformation" and search for "organize":

Adding organize transformation
Adding the organize transformation

This will then show the columns as they are currently sorted:

Current column sorting
Current order of columns before reorganization

Scroll down, find the body.message column and drag it up. You'll be rewarded with an instantaneous re-rendering showing the logs:

Reordered columns with logs
Logs displaying correctly after column reordering
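
If you keep dashboards as code, this same reordering lands in the panel JSON as an organize transformation, roughly like so (field names from our schema; the exact index values depend on which other columns your query returns):

"transformations": [
  {
    "id": "organize",
    "options": {
      "excludeByName": {},
      "renameByName": {},
      "indexByName": {
        "timestamp_nanos": 0,
        "body.message": 1
      }
    }
  }
]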

Good time to hit "save" -- click on the dashboard's name in the breadcrumbs ("Kube logs") to leave the editor:

Dashboard breadcrumb navigation
Using breadcrumb navigation to exit editor

Which now shows:

Updated dashboard view
Dashboard view after saving changes

Let's drag the logs panel below the aggregations and make it full width:

Final dashboard layout
Final dashboard layout with full-width logs panel

Looking good folks!

@ToroNZ commented Feb 1, 2025

Thanks for putting this together! 🍻
