
Node Problem Detector Plugin Architecture Guide

Table of Contents

  1. Introduction
  2. Core Architecture
  3. Plugin Types Overview
  4. Quick Start Guide
  5. Problem Daemon Plugins
  6. Exporter Plugins
  7. Configuration System
  8. The Status Data Model
  9. Complete Integration Example
  10. Best Practices
  11. Testing Guide
  12. Performance Tuning
  13. Plugin Registration & Lifecycle
  14. Implementing Custom Monitor Types
  15. Problem Metrics Manager
  16. Troubleshooting

Introduction

The Node Problem Detector (NPD) is a Kubernetes daemon designed to detect various node problems and report them to the Kubernetes control plane. At its core, NPD implements a highly modular plugin architecture that separates problem detection from problem reporting, enabling extensibility and customization for different environments and use cases.

This guide provides a comprehensive overview of the plugin system for developers familiar with Kubernetes who want to understand, extend, or customize NPD's behavior.

Core Architecture

NPD implements a two-layer plugin architecture with clear separation of concerns:

Layer 1: Problem Detection (Monitor Plugins)

Monitor plugins detect problems and emit status information through Go channels. They implement the Monitor interface and run independently in their own goroutines.

Layer 2: Problem Export (Exporter Plugins)

Exporter plugins consume status information from monitors and export it to various backends (Kubernetes API, Prometheus, cloud monitoring services, etc.).

Core Interfaces

The foundation of the plugin system rests on two key interfaces defined in pkg/types/types.go:

// Monitor detects problems and reports status
type Monitor interface {
    Start() (<-chan *Status, error)  // Returns channel for problem reporting
    Stop()                           // Clean shutdown
}

// Exporter exports detected problems to external systems
type Exporter interface {
    ExportProblems(*Status)  // Process and export problem status
}

Data Flow

┌─────────────┐    ┌──────────────────┐    ┌─────────────┐
│   Monitor   │───▶│ Problem Detector │───▶│  Exporter   │
│  Plugins    │    │   (Orchestrator) │    │  Plugins    │
└─────────────┘    └──────────────────┘    └─────────────┘

The Status struct serves as the central data model exchanged between monitors and exporters:

type Status struct {
    Source     string      // Problem daemon name (e.g., "kernel-monitor")
    Events     []Event     // Temporary problems (generate Kubernetes Events)
    Conditions []Condition // Permanent node conditions (update Node status)
}
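
To make the contract concrete, here is a minimal, self-contained sketch (not taken from the NPD codebase) of a monitor and an exporter wired through these interfaces. The demoMonitor and stdoutExporter names are invented for illustration; the interface methods and Status fields follow the definitions shown above.

package main

import (
    "fmt"
    "time"

    "k8s.io/node-problem-detector/pkg/types"
)

// demoMonitor emits a single warning event and then idles.
type demoMonitor struct {
    ch chan *types.Status
}

func (m *demoMonitor) Start() (<-chan *types.Status, error) {
    m.ch = make(chan *types.Status, 1) // buffered so Start never blocks
    m.ch <- &types.Status{
        Source: "demo-monitor",
        Events: []types.Event{{
            Severity:  types.Warn,
            Timestamp: time.Now(),
            Reason:    "DemoProblem",
            Message:   "an illustrative problem",
        }},
    }
    return m.ch, nil
}

func (m *demoMonitor) Stop() { close(m.ch) }

// stdoutExporter prints every status it receives.
type stdoutExporter struct{}

func (e *stdoutExporter) ExportProblems(s *types.Status) {
    for _, ev := range s.Events {
        fmt.Printf("[%s] %s: %s\n", s.Source, ev.Reason, ev.Message)
    }
}

func main() {
    var m types.Monitor = &demoMonitor{}
    var e types.Exporter = &stdoutExporter{}

    ch, _ := m.Start()
    e.ExportProblems(<-ch) // in NPD, the problem detector performs this fan-out
    m.Stop()
}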

Plugin Types Overview

Problem Daemon Plugins (Monitors)

  • System Log Monitor: parses system logs using regex patterns. Config file: kernel-monitor.json. Key features: multi-source log reading, pattern matching.
  • Custom Plugin Monitor: executes external scripts/binaries. Config file: custom-plugin-monitor.json. Key features: script execution, exit code interpretation.
  • System Stats Monitor: collects system metrics. Config file: system-stats-monitor.json. Key features: CPU, memory, disk, network metrics (metrics-only).

Note: HealthChecker and LogCounter are standalone helper binaries that work WITH the Custom Plugin Monitor, not separate plugin types.

Exporter Plugins

  • Kubernetes Exporter: reports to the Kubernetes API server. Output: Events and Node Conditions.
  • Prometheus Exporter: exposes a metrics endpoint. Output: Prometheus metrics format.
  • Stackdriver Exporter: exports to Google Cloud Monitoring. Output: cloud monitoring metrics.

Quick Start Guide

This section provides a practical introduction for developers who want to get started with NPD plugins immediately.

Understanding the Flow

  1. Monitors detect problems and send Status objects through channels
  2. Problem Detector orchestrates monitors and forwards status to exporters
  3. Exporters process status and send to external systems (Kubernetes API, Prometheus, etc.)

Your First Custom Plugin Script

Let's create a simple custom plugin that checks if a service is running:

Step 1: Create the script (my-service-check.sh):

#!/bin/bash
readonly OK=0
readonly NONOK=1
readonly UNKNOWN=2

# Check if my-service is running
if systemctl -q is-active my-service; then
    echo "my-service is running"
    exit $OK
else
    echo "my-service is not running"
    exit $NONOK
fi

Step 2: Create configuration (my-custom-monitor.json):

{
  "plugin": "custom",
  "source": "my-service-monitor",
  "conditions": [{
    "type": "MyServiceHealthy",
    "reason": "MyServiceIsHealthy",
    "message": "my-service is functioning properly"
  }],
  "rules": [{
    "type": "permanent",
    "condition": "MyServiceHealthy",
    "reason": "MyServiceUnhealthy",
    "path": "/path/to/my-service-check.sh",
    "timeout": "10s"
  }]
}

Step 3: Run NPD with your plugin:

./node-problem-detector --config.custom-plugin-monitor=my-custom-monitor.json

Step 4: Check results:

# View node conditions (may take up to invoke_interval to appear)
kubectl describe node <node-name>

# View events
kubectl get events --field-selector involvedObject.name=<node-name>

Key Concepts to Remember

  1. Exit codes matter: 0=healthy, 1=problem detected, 2+=error
  2. Status vs Events vs Conditions:
    • Events: Temporary problems (like crashes)
    • Conditions: Persistent node states (like "OutOfDisk")
    • Status: Contains both events and conditions
  3. Configuration drives behavior: Rules define what to monitor and how to react

Problem Daemon Plugins

System Log Monitor

The System Log Monitor is NPD's most sophisticated plugin, designed to parse system logs and detect problems using configurable regex patterns.

Architecture

┌─────────────────┐    ┌─────────────────┐    ┌──────────────────┐
│   Log Watcher   │───▶│   Log Monitor   │───▶│ Problem Detector │
│   (pluggable)   │    │   (core logic)  │    │   (orchestrator) │
└─────────────────┘    └─────────────────┘    └──────────────────┘

Log Watcher Sub-Plugin System

The System Log Monitor implements a two-level plugin architecture: the monitor itself is a plugin, and it has its own sub-plugin system for different log sources. This demonstrates how monitors can be internally extensible.

Available Log Watchers
  • kmsg: Reads kernel messages from /dev/kmsg (always available)
  • filelog: Reads from log files (e.g., /var/log/syslog) (always available)
  • journald: Reads from systemd journal (requires journald build tag)
Log Watcher Registration Pattern

Log watchers have their own registration system separate from problem daemons:

// pkg/systemlogmonitor/logwatchers/log_watchers.go
var createFuncs = map[string]types.WatcherCreateFunc{}

func registerLogWatcher(name string, create types.WatcherCreateFunc) {
    createFuncs[name] = create
}
Conditional Registration with Build Tags

Each log watcher registers in separate files with specific build tags:

// pkg/systemlogmonitor/logwatchers/register_kmsg.go
//go:build !disable_system_log_monitor
// +build !disable_system_log_monitor

package logwatchers

import (
    "k8s.io/node-problem-detector/pkg/systemlogmonitor/logwatchers/kmsg"
)

func init() {
    registerLogWatcher("kmsg", kmsg.NewKmsgWatcher)
}
// pkg/systemlogmonitor/logwatchers/register_journald.go
//go:build journald && !disable_system_log_monitor
// +build journald,!disable_system_log_monitor

package logwatchers

import (
    "k8s.io/node-problem-detector/pkg/systemlogmonitor/logwatchers/journald"
)

func init() {
    registerLogWatcher("journald", journald.NewJournaldWatcher)
}

Key Insights:

  • Two-Level Architecture: Problem Daemon Plugins → Log Watcher Sub-Plugins
  • Conditional Compilation: journald requires special build tag AND systemd libraries
  • Extensible Pattern: This same pattern could be used for other monitors that need pluggable backends
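
To illustrate that last point, the registration mechanism boils down to a small factory registry. The sketch below is generic and uses invented names (Backend, BackendCreateFunc, register, New); it mirrors the log watcher pattern above rather than reproducing any specific NPD code.

package backends

// Backend is a hypothetical pluggable backend interface.
type Backend interface {
    Run() error
}

// BackendCreateFunc builds a backend from its (opaque) configuration.
type BackendCreateFunc func(config interface{}) Backend

var createFuncs = map[string]BackendCreateFunc{}

// register is called from init() in per-backend files, optionally guarded by
// build tags, exactly as registerLogWatcher is used above.
func register(name string, create BackendCreateFunc) {
    createFuncs[name] = create
}

// New looks up a backend factory by the name found in configuration.
func New(name string, config interface{}) (Backend, bool) {
    create, ok := createFuncs[name]
    if !ok {
        return nil, false
    }
    return create(config), true
}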

Configuration Example

The kernel monitor configuration (config/kernel-monitor.json) demonstrates the power and flexibility of pattern-based problem detection:

{
  "plugin": "kmsg",
  "logPath": "/dev/kmsg",
  "lookback": "5m",
  "source": "kernel-monitor",
  "conditions": [
    {
      "type": "KernelDeadlock",
      "reason": "KernelHasNoDeadlock",
      "message": "kernel has no deadlock"
    }
  ],
  "rules": [
    {
      "type": "temporary",
      "reason": "OOMKilling",
      "pattern": "Killed process \\d+ (.+) total-vm:\\d+kB.*"
    },
    {
      "type": "permanent",
      "condition": "KernelDeadlock",
      "reason": "DockerHung",
      "pattern": "task docker:\\w+ blocked for more than \\w+ seconds\\."
    }
  ]
}

Key features:

  • Pattern Matching: Uses regex patterns to identify problems in log output
  • Problem Types: Supports both temporary events and permanent conditions
  • Buffer Management: Configurable buffer for multi-line pattern matching
  • Lookback: Can process historical log entries on startup

Usage Patterns

Temporary Events from Log Patterns:

func (l *logMonitor) generateStatus(logs []*Log, rule Rule) *types.Status {
    message := generateMessage(logs, rule.PatternGeneratedMessageSuffix)

    if rule.Type == types.Temp {
        // Temporary rule: generate event only
        return &types.Status{
            Source: l.config.Source,
            Events: []types.Event{
                {
                    Severity:  types.Warn,
                    Timestamp: logs[0].Timestamp,
                    Reason:    rule.Reason,     // e.g., "TaskHung"
                    Message:   message,         // Extracted from log
                },
            },
            Conditions: l.conditions,  // Unchanged
        }
    } else {
        // Permanent rule (simplified): update the matching condition and
        // return a status containing a condition-change event plus all
        // current conditions.
        return l.updateConditionAndGenerateEvent(rule, message, logs[0].Timestamp)
    }
}

Example Temporary Events:

  • OOMKilling: Process killed due to out of memory
  • TaskHung: Process blocked for extended time
  • KernelOops: Kernel crash or error

Example Permanent Conditions:

  • KernelDeadlock: System deadlock detected
  • FilesystemCorruption: Filesystem errors found

Custom Plugin Monitor

The Custom Plugin Monitor enables integration of external scripts and binaries as problem detection plugins, following the Unix philosophy of small, composable tools.

Helper Communication Protocol

Custom plugin monitors communicate with external helper binaries using a simple, Unix-philosophy protocol focused on exit codes and text output.

Input Protocol

CLI Arguments Only: Helpers receive inputs exclusively through command-line arguments:

{
  "rules": [{
    "path": "/usr/local/bin/my-checker",
    "args": ["--timeout=5s", "--component=kubelet", "--enable-repair=true"]
  }]
}

No Other Input Channels:

  • No stdin: NPD doesn't send data via stdin
  • No custom environment variables: NPD doesn't set special env vars
  • System environment: Helpers inherit standard environment (PATH, HOME, etc.)
Output Protocol

Exit Codes (Primary Communication):

const (
    OK      Status = 0    // Healthy/working correctly
    NonOK   Status = 1    // Problem detected
    Unknown Status = 2    // Error/timeout/unable to determine
)

Exit Code Mapping:

  • Exit 0 (OK): Helper executed successfully, no problems → Condition Status = False (Healthy)
  • Exit 1 (NonOK): Helper detected a problem → Condition Status = True (Problem)
  • Exit 2+ (Unknown): Plugin error/timeout → Condition Status = Unknown (Error)
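
In code, this mapping is just a small switch. The sketch below uses the cpmtypes alias for pkg/custompluginmonitor/types (as in the Plugin.run excerpt later in this section) and the condition status constants from pkg/types; the helper name toConditionStatus and its placement are hypothetical.

package custompluginmonitor // placement is illustrative, not the actual NPD file

import (
    cpmtypes "k8s.io/node-problem-detector/pkg/custompluginmonitor/types"
    "k8s.io/node-problem-detector/pkg/types"
)

// toConditionStatus is a hypothetical helper; the mapping itself follows the
// exit-code convention described above.
func toConditionStatus(s cpmtypes.Status) types.ConditionStatus {
    switch s {
    case cpmtypes.OK:
        return types.False // no problem: the condition stays False (healthy)
    case cpmtypes.NonOK:
        return types.True // problem detected: the condition becomes True
    default:
        return types.Unknown // helper error or timeout
    }
}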

Stdout (Status Messages):

  • Purpose: Human-readable status message for node conditions/events
  • Max capture: 4KB total buffer per execution
  • Max usage: 80 bytes by default (configurable via max_output_length)
  • Processing: Trimmed of whitespace, truncated if exceeds limit

Stderr (Debug Only):

  • Purpose: Debug logging only - NOT used in conditions
  • Visibility: Only logged at debug verbosity levels
  • Usage: Troubleshooting helper execution
Helper Binary Examples

Simple Health Check Script:

#!/bin/bash
readonly OK=0
readonly NONOK=1
readonly UNKNOWN=2

# Check if NTP service is running
if systemctl -q is-active ntp.service; then
  echo "ntp.service is running"
  exit $OK
else
  echo "ntp.service is not running"
  exit $NONOK
fi

Go-based Health Checker (cmd/healthchecker/health_checker.go):

func main() {
    // Parse CLI arguments
    hco := options.NewHealthCheckerOptions()
    hco.AddFlags(pflag.CommandLine)
    pflag.Parse()

    // Perform health check
    hc, err := healthchecker.NewHealthChecker(hco)
    healthy, err := hc.CheckHealth()

    // Output result and exit with appropriate code
    if err != nil {
        fmt.Printf("error checking %v health: %v\n", hco.Component, err)
        os.Exit(int(types.Unknown))
    }
    if !healthy {
        fmt.Printf("%v:%v was found unhealthy\n", hco.Component, hco.Service)
        os.Exit(int(types.NonOK))
    }
    fmt.Printf("%v:%v is healthy\n", hco.Component, hco.Service)
    os.Exit(int(types.OK))
}

Configuration for Health Checker:

{
  "rules": [{
    "type": "permanent",
    "condition": "KubeletUnhealthy",
    "reason": "KubeletUnhealthy",
    "path": "/home/kubernetes/bin/health-checker",
    "args": [
      "--component=kubelet",
      "--enable-repair=true",
      "--cooldown-time=1m"
    ],
    "timeout": "3m"
  }]
}
NPD Processing Flow

Helper Execution (pkg/custompluginmonitor/plugin/plugin.go):

func (p *Plugin) run(rule cpmtypes.CustomRule) (exitStatus cpmtypes.Status, output string) {
    // 1. Set up timeout context
    ctx, cancel := context.WithTimeout(context.Background(), timeout)
    defer cancel()

    // 2. Create command with arguments
    cmd := util.Exec(rule.Path, rule.Args...)

    // 3. Capture stdout/stderr
    stdoutPipe, _ := cmd.StdoutPipe()
    stderrPipe, _ := cmd.StderrPipe() // stderr is only logged at debug verbosity

    // 4. Start, read output, and wait for completion (simplified)
    cmd.Start()
    stdout, _ := io.ReadAll(stdoutPipe)
    _, _ = io.ReadAll(stderrPipe)
    cmd.Wait()

    // 5. Parse output and exit code
    output = strings.TrimSpace(string(stdout))
    if len(output) > maxOutputLength {
        output = output[:maxOutputLength]  // Truncate long messages
    }

    exitCode := cmd.ProcessState.Sys().(syscall.WaitStatus).ExitStatus()
    switch exitCode {
    case 0: return cpmtypes.OK, output
    case 1: return cpmtypes.NonOK, output
    default: return cpmtypes.Unknown, output
    }
}

Timing and Concurrency:

  • Invoke Interval: pluginConfig.invoke_interval (default: 30s)
  • Timeout: Per-rule timeout or global pluginConfig.timeout (default: 5s)
  • Concurrency: Max concurrent executions via pluginConfig.concurrency (default: 3)
  • Timeout Handling: Process group killed on timeout
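
The timeout handling described above can be sketched with the standard library alone: run the helper in its own process group and kill the whole group if it overruns. This is an illustration, not NPD's actual implementation; runWithTimeout and the package name are invented, and the process-group handling is Linux-specific.

package pluginutil // hypothetical package for this sketch

import (
    "bytes"
    "fmt"
    "os/exec"
    "syscall"
    "time"
)

// runWithTimeout starts the helper in a new process group, captures stdout,
// and kills the whole group if the timeout elapses. On timeout it returns
// exit code 2, which maps to the Unknown status in the NPD convention.
func runWithTimeout(path string, timeout time.Duration, args ...string) (int, string, error) {
    cmd := exec.Command(path, args...)
    cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true} // new process group (Linux)

    var out bytes.Buffer
    cmd.Stdout = &out

    if err := cmd.Start(); err != nil {
        return 2, "", err
    }

    done := make(chan error, 1)
    go func() { done <- cmd.Wait() }()

    select {
    case <-time.After(timeout):
        // A negative pid targets the whole process group, so children
        // spawned by the helper are killed as well.
        syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)
        <-done
        return 2, out.String(), fmt.Errorf("helper timed out after %s", timeout)
    case err := <-done:
        return cmd.ProcessState.ExitCode(), out.String(), err
    }
}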

Complete Data Flow:

Helper Binary
├─ Receives: CLI args from rule.args
├─ Outputs: Exit code + stdout message + stderr (debug)
└─ Timeout: Per-rule or global
    ↓
Plugin.run()
├─ Captures output (max 4KB buffer)
├─ Maps exit code to Status enum
├─ Trims and truncates message (max 80 bytes used)
└─ Returns Result{ExitStatus, Message}
    ↓
CustomPluginMonitor.generateStatus()
├─ Converts Status to ConditionStatus
├─ Updates node condition if changed
├─ Generates Kubernetes event if needed
└─ Returns Status for exporters
Exit Code Convention

All custom plugins and helper binaries use standardized exit codes defined in pkg/custompluginmonitor/types/types.go:

type Status int

const (
    // OK means everything is fine.
    OK Status = 0
    // NonOK means error or unhealthy.
    NonOK Status = 1
    // Unknown means plugin returns unknown error.
    Unknown Status = 2
)

These codes are used by:

  • Custom Plugin Monitor (executes and interprets exit codes)
  • HealthChecker binary (cmd/healthchecker/health_checker.go)
  • LogCounter binary (cmd/logcounter/log_counter.go)

Any external binary that follows this convention can be used as a custom plugin helper.

Helper Binary Template
#!/bin/bash

# Exit code constants (MUST match NPD protocol)
readonly OK=0
readonly NONOK=1
readonly UNKNOWN=2

# Best Practice: Make your check idempotent and safe to run repeatedly
# NPD will invoke this script at every invoke_interval

# Function to output message and exit
die() {
    echo "$1"
    exit "$2"
}

# Parse arguments (optional)
while [[ $# -gt 0 ]]; do
    case $1 in
        --verbose) VERBOSE=true; shift ;;
        --timeout) TIMEOUT=$2; shift 2 ;;
        *) die "Unknown option: $1" $UNKNOWN ;;
    esac
done

# Perform your health check logic here
if check_system_health; then
    echo "System is healthy"
    exit $OK
else
    echo "Health check failed: $(get_failure_reason)"
    exit $NONOK
fi

Critical Requirements:

  • Always explicitly exit with 0, 1, or 2+
  • Keep stdout messages concise (under 80 bytes recommended)
  • Use CLI arguments for input, not stdin or custom environment variables
  • Handle timeouts gracefully
  • Return meaningful status messages for debugging
Standard Helper Binaries

NPD includes several standalone helper binaries that work WITH the custom plugin monitor:

Health Checker: Monitors the health of critical Kubernetes components (kubelet, container runtime, kube-proxy) with auto-repair capabilities.

LogCounter: Counts occurrences of log patterns and reports whether thresholds are exceeded.

Both follow the standard exit code convention and are executed by the custom plugin monitor.

Configuration and Features

{
  "plugin": "custom",
  "pluginConfig": {
    "invoke_interval": "30s",
    "timeout": "5s",
    "max_output_length": 80,
    "concurrency": 3
  },
  "rules": [
    {
      "type": "permanent",
      "condition": "NTPProblem",
      "reason": "NTPIsDown",
      "path": "./config/plugin/check_ntp.sh",
      "timeout": "3s"
    }
  ]
}

Key features:

  • Concurrent Execution: Configurable concurrency for plugin execution
  • Timeout Management: Global and per-rule timeout configuration
  • Output Management: Configurable output length limits
  • Process Lifecycle: Graceful termination with escalation to SIGKILL

Migration Guide: Converting Existing Scripts to NPD

Many teams have existing monitoring scripts that run via cron, systemd timers, or standalone processes. This guide helps you migrate these scripts to work seamlessly with NPD's Custom Plugin Monitor.

Assessment: Is Your Script a Good Candidate?

Ideal Candidates:

  • Health checks: Scripts that test component availability
  • Resource monitors: Scripts checking disk space, memory, network
  • Service validation: Scripts testing service responsiveness
  • Configuration audits: Scripts validating system configuration

Requires Modification:

  • ⚠️ Complex workflows: Multi-step processes with intermediate state
  • ⚠️ Interactive scripts: Scripts requiring user input
  • ⚠️ Long-running processes: Continuous monitoring (consider native monitor instead)

Not Suitable:

  • Deployment scripts: One-time setup operations
  • Data processing pipelines: ETL or batch processing jobs
  • High-frequency polling: Sub-second monitoring intervals
Step-by-Step Migration Process
Step 1: Analyze Current Script Structure

Start by understanding your existing script's behavior:

# Example: Existing cron-based disk monitor
#!/bin/bash
# /usr/local/bin/old-disk-monitor.sh
# Runs every 5 minutes via cron

THRESHOLD=90
USAGE=$(df / | awk 'NR==2 {print $5}' | sed 's/%//')

if [ "$USAGE" -gt "$THRESHOLD" ]; then
    echo "CRITICAL: Disk usage at ${USAGE}% (threshold: ${THRESHOLD}%)" >&2
    logger "Disk space critical on $(hostname)"
    exit 1
else
    echo "OK: Disk usage at ${USAGE}%"
    exit 0
fi

Analysis:

  • Clear exit codes: 0 for success, 1 for problem
  • Descriptive output: Human-readable status messages
  • Configurable threshold: Parameterizable via environment or args
  • ⚠️ Hardcoded values: Threshold and mount point need parameterization
Step 2: Adapt for NPD Communication Protocol

Modify the script to follow NPD's helper binary protocol:

#!/bin/bash
# /usr/local/bin/npd-disk-monitor.sh
# NPD-compatible disk space monitor

# Parse command-line arguments (NPD helper protocol requirement)
THRESHOLD=90
MOUNT_POINT="/"

while [[ $# -gt 0 ]]; do
    case $1 in
        --threshold=*)
            THRESHOLD="${1#*=}"
            shift
            ;;
        --mount=*)
            MOUNT_POINT="${1#*=}"
            shift
            ;;
        --help)
            echo "Usage: $0 [--threshold=N] [--mount=PATH]"
            echo "Monitors disk usage and reports problems to NPD"
            exit 0
            ;;
        *)
            echo "Unknown option: $1" >&2
            exit 2  # NPD Unknown status
            ;;
    esac
done

# Validate inputs
if ! [[ "$THRESHOLD" =~ ^[0-9]+$ ]] || [ "$THRESHOLD" -lt 1 ] || [ "$THRESHOLD" -gt 100 ]; then
    echo "Error: threshold must be 1-100" >&2
    exit 2  # NPD Unknown status
fi

if [ ! -d "$MOUNT_POINT" ]; then
    echo "Error: mount point '$MOUNT_POINT' not found" >&2
    exit 2  # NPD Unknown status
fi

# Perform the actual check
USAGE=$(df "$MOUNT_POINT" | awk 'NR==2 {print $5}' | sed 's/%//')

if ! [[ "$USAGE" =~ ^[0-9]+$ ]]; then
    echo "Error: unable to determine disk usage for $MOUNT_POINT" >&2
    exit 2  # NPD Unknown status
fi

# Best Practice: Make your check idempotent and safe to run repeatedly
# NPD will invoke this script at every invoke_interval

# Return status based on NPD protocol
if [ "$USAGE" -gt "$THRESHOLD" ]; then
    echo "Disk usage at ${USAGE}% exceeds threshold ${THRESHOLD}% for $MOUNT_POINT"
    exit 1  # NPD NonOK status (problem detected)
else
    echo "Disk usage at ${USAGE}% within threshold ${THRESHOLD}% for $MOUNT_POINT"
    exit 0  # NPD OK status (healthy)
fi

Key Changes Made:

  1. Argument parsing: Uses --key=value format NPD expects
  2. Error handling: Uses exit code 2 for NPD Unknown status
  3. Input validation: Validates all parameters before proceeding
  4. Idempotency: Safe to run repeatedly at configured intervals
  5. Clear output: Status messages for both success and failure cases
Step 3: Create NPD Configuration

Create the JSON configuration for your converted script:

{
  "plugin": "custom",
  "pluginConfig": {
    "invoke_interval": "300s",
    "timeout": "30s",
    "max_output_length": 200,
    "concurrency": 1,
    "enable_message_change_based_condition_update": false
  },
  "source": "disk-monitor",
  "conditions": [
    {
      "type": "DiskPressure",
      "reason": "DiskSpaceRunningLow",
      "message": "Disk usage exceeds configured threshold"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "DiskPressure",
      "reason": "DiskSpaceRunningLow",
      "path": "/usr/local/bin/npd-disk-monitor.sh",
      "args": ["--threshold=85", "--mount=/"]
    },
    {
      "type": "permanent",
      "condition": "DiskPressure",
      "reason": "DiskSpaceRunningLow",
      "path": "/usr/local/bin/npd-disk-monitor.sh",
      "args": ["--threshold=90", "--mount=/var"]
    }
  ]
}

Configuration Notes:

  • invoke_interval: Matches your original cron frequency (5 minutes = 300s)
  • timeout: Reasonable timeout for script execution
  • Multiple rules: Monitor different mount points with different thresholds
  • Permanent condition: Disk pressure persists until resolved
Step 4: Test the Migration

Before deploying, thoroughly test your converted script:

# Test script directly with NPD-style arguments
/usr/local/bin/npd-disk-monitor.sh --threshold=85 --mount=/
echo "Exit code: $?"

# Test error conditions
/usr/local/bin/npd-disk-monitor.sh --threshold=invalid
echo "Exit code: $?"  # Should be 2 (Unknown)

# Test NPD integration (if running locally)
node-problem-detector --config=/etc/node-problem-detector/config/disk-monitor.json --logtostderr --v=3

Validation Checklist:

  • ✅ Script handles all expected argument combinations
  • ✅ Exit codes follow NPD protocol (0, 1, 2)
  • ✅ Output messages are informative and concise
  • ✅ Script executes within configured timeout
  • ✅ No hanging processes or resource leaks
Step 5: Replace Original Monitoring

Once validated, replace your original monitoring setup:

# Disable old cron job
crontab -e
# Comment out: */5 * * * * /usr/local/bin/old-disk-monitor.sh

# Or disable systemd timer
systemctl disable old-disk-monitor.timer
systemctl stop old-disk-monitor.timer

# Deploy NPD configuration
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: disk-monitor-config
  namespace: kube-system
data:
  disk-monitor.json: |
    $(cat /path/to/your/disk-monitor.json)
EOF

# Update NPD DaemonSet to include new config
kubectl patch daemonset node-problem-detector -n kube-system --patch='...'
Common Migration Patterns
Pattern 1: Health Check Scripts

Before (standalone health check):

#!/bin/bash
curl -f http://localhost:8080/health || exit 1

After (NPD-compatible):

#!/bin/bash
SERVICE_URL="http://localhost:8080/health"
TIMEOUT=5

while [[ $# -gt 0 ]]; do
    case $1 in
        --url=*) SERVICE_URL="${1#*=}"; shift ;;
        --timeout=*) TIMEOUT="${1#*=}"; shift ;;
        *) echo "Unknown option: $1" >&2; exit 2 ;;
    esac
done

if curl -f --max-time "$TIMEOUT" "$SERVICE_URL" >/dev/null 2>&1; then
    echo "Service at $SERVICE_URL is healthy"
    exit 0
else
    echo "Service at $SERVICE_URL is unhealthy or unreachable"
    exit 1
fi
Pattern 2: Resource Threshold Monitors

Before (memory monitor):

#!/bin/bash
USED=$(free | awk 'NR==2{print int($3/$2*100)}')
[ "$USED" -gt 80 ] && exit 1 || exit 0

After (NPD-compatible):

#!/bin/bash
THRESHOLD=80

while [[ $# -gt 0 ]]; do
    case $1 in
        --threshold=*) THRESHOLD="${1#*=}"; shift ;;
        *) echo "Unknown option: $1" >&2; exit 2 ;;
    esac
done

USED=$(free | awk 'NR==2{print int($3/$2*100)}')
if [ -z "$USED" ] || ! [[ "$USED" =~ ^[0-9]+$ ]]; then
    echo "Error: unable to determine memory usage"
    exit 2
fi

if [ "$USED" -gt "$THRESHOLD" ]; then
    echo "Memory usage at ${USED}% exceeds threshold ${THRESHOLD}%"
    exit 1
else
    echo "Memory usage at ${USED}% within threshold ${THRESHOLD}%"
    exit 0
fi
Pattern 3: Configuration Validation

Before (config checker):

#!/bin/bash
nginx -t && exit 0 || exit 1

After (NPD-compatible):

#!/bin/bash
CONFIG_FILE="/etc/nginx/nginx.conf"

while [[ $# -gt 0 ]]; do
    case $1 in
        --config=*) CONFIG_FILE="${1#*=}"; shift ;;
        *) echo "Unknown option: $1" >&2; exit 2 ;;
    esac
done

if [ ! -f "$CONFIG_FILE" ]; then
    echo "Error: configuration file '$CONFIG_FILE' not found"
    exit 2
fi

if nginx -t -c "$CONFIG_FILE" >/dev/null 2>&1; then
    echo "Nginx configuration is valid"
    exit 0
else
    echo "Nginx configuration validation failed"
    exit 1
fi
Migration Best Practices
  1. Start Small: Migrate one script at a time to validate the process
  2. Preserve Semantics: Maintain the same detection logic and thresholds
  3. Add Observability: Include relevant context in output messages
  4. Handle Edge Cases: Robust error handling for NPD Unknown status
  5. Test Thoroughly: Validate all exit code paths and argument combinations
  6. Document Changes: Update runbooks and documentation
  7. Monitor Metrics: Ensure migrated checks appear in Problem Metrics Manager
Troubleshooting Migration Issues

Issue: Script works standalone but fails in NPD

  • Check: File permissions, SELinux contexts, AppArmor profiles
  • Solution: Ensure NPD user can execute script and access dependencies

Issue: Conditions not appearing in kubectl

  • Check: Configuration syntax, condition/reason name consistency
  • Solution: Validate JSON config with jq or similar tool

Issue: Script execution timeouts

  • Check: Realistic timeout values, network dependencies
  • Solution: Optimize script performance or increase timeout

Issue: Different behavior in NPD vs standalone

  • Check: Environment variables, working directory, PATH
  • Solution: Make script environment-independent or set explicit paths

This migration approach ensures your existing monitoring logic is preserved while gaining the benefits of NPD's centralized problem detection, Kubernetes integration, and metrics export capabilities.

System Stats Monitor

The System Stats Monitor focuses purely on metrics collection without problem detection, following the principle that metrics and alerting should be separate concerns.

Collector Architecture

The monitor implements a collector pattern with specialized collectors for different metric categories:

// Core interface for all collectors
type Collector interface {
    Collect() (map[string]interface{}, error)
}
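
A collector only needs to satisfy that one method. The sketch below is a hypothetical uptime collector (the name, metric key, and parsing are illustrative and not NPD's actual host collector) showing the general shape: read a /proc file and return named values.

package systemstatsmonitor // placement is illustrative

import (
    "fmt"
    "os"
    "strconv"
    "strings"
)

// uptimeCollector is a hypothetical collector reading /proc/uptime.
type uptimeCollector struct{}

// Collect returns the host uptime in seconds under an illustrative metric key.
func (c *uptimeCollector) Collect() (map[string]interface{}, error) {
    data, err := os.ReadFile("/proc/uptime")
    if err != nil {
        return nil, err
    }
    fields := strings.Fields(string(data)) // "<uptime seconds> <idle seconds>"
    if len(fields) < 1 {
        return nil, fmt.Errorf("unexpected /proc/uptime content: %q", data)
    }
    uptime, err := strconv.ParseFloat(fields[0], 64)
    if err != nil {
        return nil, err
    }
    return map[string]interface{}{"host/uptime_seconds": uptime}, nil
}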

Available Collectors

  • CPU Collector: load averages, CPU usage, process counts (sources: /proc/loadavg, /proc/stat)
  • Disk Collector: I/O statistics, usage (sources: /proc/diskstats, filesystem info)
  • Memory Collector: memory usage, swap (source: /proc/meminfo)
  • Network Collector: interface statistics (source: /proc/net/dev)
  • Host Collector: system uptime (source: /proc/uptime)

Configuration

{
  "cpu": {
    "metricsConfigs": {
      "cpu/load_1m": {"displayName": "cpu/load_1m"},
      "cpu/usage_time": {"displayName": "cpu/usage_time"}
    }
  },
  "disk": {
    "includeAllAttachedBlk": true,
    "metricsConfigs": {
      "disk/io_read_bytes": {"displayName": "disk/io_read_bytes"}
    }
  },
  "invokeInterval": "60s"
}

Integration with Metrics

func (ssm *systemStatsMonitor) Start() (<-chan *types.Status, error) {
    go ssm.monitorLoop()  // Collects CPU, memory, disk metrics
    return nil, nil       // No status channel - metrics only!
}

Important: The System Stats Monitor returns a nil status channel since it only collects metrics and doesn't detect problems. Metrics are exposed through the Prometheus exporter and Problem Metrics Manager.

Exporter Plugins

Exporter plugins consume status information from monitors and export it to various backends. NPD includes three main exporter types with different integration patterns.

Kubernetes Exporter

The Kubernetes Exporter is the primary mechanism for integrating NPD with Kubernetes, translating internal problem status into Kubernetes API objects.

Core Functions

  1. Event Creation: Temporary problems become Kubernetes Events
  2. Node Condition Management: Permanent problems update Node conditions
  3. Condition Heartbeat: Maintains condition freshness
  4. Health Endpoint: Provides /healthz for liveness probes

Implementation Detail

The exporter maintains a condition manager that handles the lifecycle of node conditions:

// pkg/exporters/k8sexporter/condition/manager.go
type ConditionManager interface {
    // Start starts the condition manager.
    Start(ctx context.Context)
    // UpdateCondition updates a specific condition.
    UpdateCondition(types.Condition)
    // GetConditions returns all current conditions.
    GetConditions() []types.Condition
}

The condition manager uses RWMutex for thread-safe concurrent access and implements sophisticated synchronization logic to prevent flooding the API server while ensuring timely updates.
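
That synchronization can be pictured as a periodic loop that only writes to the API server when there are pending updates or a heartbeat is due. The sketch below is a simplification: the lastSync and heartbeatPeriod fields and the exact periods are assumptions, while updates, Lock/Unlock, and sync(ctx) correspond to the excerpts shown further down in this section.

// Simplified sketch of the sync decision loop. The lastSync and
// heartbeatPeriod fields are assumptions for illustration; updates,
// Lock/Unlock, and sync(ctx) correspond to the excerpts below.
func (c *conditionManager) syncLoop(ctx context.Context) {
    ticker := time.NewTicker(1 * time.Second) // check for pending work every second
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            c.Lock()
            pending := len(c.updates) > 0
            c.Unlock()
            heartbeatDue := time.Since(c.lastSync) >= c.heartbeatPeriod
            if pending || heartbeatDue {
                c.sync(ctx) // one batched API call, see below
                c.lastSync = time.Now()
            }
        case <-ctx.Done():
            return
        }
    }
}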

Processing Flow

func (ke *k8sExporter) ExportProblems(status *types.Status) {
    // Convert events to Kubernetes Events
    for _, event := range status.Events {
        ke.client.Eventf(
            util.ConvertToAPIEventType(event.Severity),  // Normal/Warning
            status.Source,      // Event source
            event.Reason,       // Event reason
            event.Message,      // Event message
        )
    }

    // Update node conditions (batched for efficiency)
    for _, condition := range status.Conditions {
        ke.conditionManager.UpdateCondition(condition)
    }
}

Condition Batching and Deduplication

// Batches condition updates to prevent API server flooding
func (c *conditionManager) UpdateCondition(condition types.Condition) {
    c.Lock()
    defer c.Unlock()
    // Only keep newest condition per type (deduplication)
    c.updates[condition.Type] = condition
}

// Syncs every 1 second with collected updates
func (c *conditionManager) sync(ctx context.Context) {
    conditions := []v1.NodeCondition{}
    for _, condition := range c.conditions {
        apiCondition := problemutil.ConvertToAPICondition(condition)
        conditions = append(conditions, apiCondition)
    }
    // Single API call to update all conditions
    c.client.SetConditions(ctx, conditions)
}

Prometheus Exporter

Integrates with OpenCensus/OpenTelemetry for metrics export, providing a /metrics endpoint in Prometheus format.

Key Features

  • Metrics Integration: Automatically exposes metrics from System Stats Monitor
  • Problem Metrics: Converts problem counters and gauges to Prometheus format
  • Standard Format: Uses standard Prometheus exposition format
  • Auto-discovery: Metrics from Problem Metrics Manager are automatically available

Configuration

The Prometheus exporter is enabled by default and requires no additional configuration. Metrics are exposed at:

  • Default endpoint: http://localhost:20257/metrics
  • Configurable port: Set via command-line flags

Note: The Prometheus exporter processes metrics from the System Stats Monitor and Problem Metrics Manager. Problem events and conditions are not converted to Prometheus metrics - they go to the Kubernetes exporter.

Stackdriver Exporter

Exports metrics to Google Cloud Monitoring with automatic GCE metadata integration.

Features

  • Cloud Integration: Automatic GCE instance metadata detection
  • Resource Mapping: Maps NPD metrics to Cloud Monitoring resource types
  • Authentication: Uses default service account credentials
  • Conditional Compilation: Can be disabled at compile time

Configuration

The Stackdriver exporter uses the registry pattern and can be configured via command-line options:

--exporter.stackdriver.project-id=my-project
--exporter.stackdriver.cluster-name=my-cluster
--exporter.stackdriver.zone=us-central1-a

This exporter is optional and can be disabled with build tags.

Exporter Registration Patterns

NPD uses three different patterns for exporter initialization, reflecting their importance and dependencies:

Pattern 1: First-Class Default Exporters (Direct Initialization)

Kubernetes and Prometheus exporters are considered essential and are initialized directly in cmd/nodeproblemdetector/node_problem_detector.go:

// Direct initialization - always attempted
defaultExporters := []types.Exporter{}
if ke := k8sexporter.NewExporterOrDie(ctx, npdo); ke != nil {
    defaultExporters = append(defaultExporters, ke)
}
if pe := prometheusexporter.NewExporterOrDie(npdo); pe != nil {
    defaultExporters = append(defaultExporters, pe)
}

Pattern 2: Pluggable Exporters (Registry Pattern)

Optional exporters like Stackdriver use the registry pattern similar to problem daemons:

// pkg/exporters/stackdriver/stackdriver_exporter.go
func init() {
    clo := commandLineOptions{}
    exporters.Register(exporterName, types.ExporterHandler{
        CreateExporterOrDie: NewExporterOrDie,
        Options:             &clo,
    })
}

Pluggable exporters are initialized from the registry:

// Initialize from registry
plugableExporters := exporters.NewExporters()

Pattern 3: Combined Approach

The final exporter list combines both patterns:

allExporters := append(defaultExporters, plugableExporters...)

Why This Architecture?

  • K8s & Prometheus: Core functionality, always enabled
  • Stackdriver: Optional, cloud-specific, can be disabled with build tags
  • Future Exporters: Can use either pattern based on importance
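
As an illustration of the pluggable path, the sketch below shows a hypothetical exporter that appends problems to a local file. The package, type, and field names are invented; its init() would call exporters.Register with an ExporterHandler, exactly as in the Stackdriver snippet above.

package fileexporter // hypothetical exporter, for illustration only

import (
    "fmt"
    "os"

    "k8s.io/node-problem-detector/pkg/types"
)

type fileExporter struct {
    path string
}

// ExportProblems satisfies types.Exporter by appending events and conditions
// to a plain text file.
func (e *fileExporter) ExportProblems(status *types.Status) {
    f, err := os.OpenFile(e.path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
    if err != nil {
        return // a real exporter would log the error
    }
    defer f.Close()
    for _, ev := range status.Events {
        fmt.Fprintf(f, "%s event %s: %s\n", status.Source, ev.Reason, ev.Message)
    }
    for _, cond := range status.Conditions {
        fmt.Fprintf(f, "%s condition %s=%v (%s)\n", status.Source, cond.Type, cond.Status, cond.Reason)
    }
}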

Configuration System

The configuration system in NPD provides a consistent, flexible way to configure different plugin types while allowing plugin-specific customization.

Configuration File Structure

All monitor plugins follow a consistent configuration schema:

{
  "plugin": "plugin_type",           // Plugin type identifier
  "source": "monitor_name",          // Source name for status reports
  "metricsReporting": true,          // Enable metrics reporting
  "conditions": [...],               // Default node conditions
  "rules": [...],                    // Problem detection rules
  "pluginConfig": {...}              // Plugin-specific configuration
}

Core Configuration Elements

Plugin Type Identifiers

Each plugin type has a specific identifier used in the plugin field:

  • System Log Monitor: "kmsg", "filelog", or "journald" (identifies the log source type)
  • Custom Plugin Monitor: "custom" (external script execution)
  • System Stats Monitor: "system-stats" (metrics collection)

Source Names

The source field identifies the monitor in status reports and logs:

  • Must be unique across all monitors
  • Used as the source field in types.Status
  • Appears in Kubernetes events and node conditions
  • Examples: "kernel-monitor", "custom-plugin-monitor", "system-stats-monitor"

Metrics Reporting

The metricsReporting field controls integration with the Problem Metrics Manager:

{
  "metricsReporting": true   // Enable problem counter/gauge reporting
}

When enabled:

  • Problem events increment counters
  • Condition changes update gauges
  • Metrics are exposed via Prometheus exporter

Rule Types

NPD supports two fundamental rule types across all monitor plugins:

Temporary Rules

Generate Kubernetes Events for transient problems:

{
  "type": "temporary",
  "reason": "OOMKilling",
  "pattern": "Killed process \\d+ (.+).*"
}

Characteristics:

  • Create Kubernetes Events only
  • Don't update node conditions
  • Used for alerts and notifications
  • Examples: Process crashes, resource spikes, kernel warnings

Permanent Rules

Update Node Conditions for persistent issues:

{
  "type": "permanent",
  "condition": "KernelDeadlock",
  "reason": "DockerHung",
  "pattern": "task docker:\\w+ blocked.*"
}

Characteristics:

  • Update node conditions AND generate events
  • Reflect persistent node state
  • Used for scheduling decisions
  • Examples: Service failures, persistent resource issues, hardware problems

Plugin-Specific Configuration

System Log Monitor Configuration

{
  "plugin": "kmsg",
  "logPath": "/dev/kmsg",
  "lookback": "5m",
  "bufferSize": 10,
  "source": "kernel-monitor",
  "conditions": [
    {
      "type": "KernelDeadlock",
      "reason": "KernelHasNoDeadlock",
      "message": "kernel has no deadlock"
    }
  ],
  "rules": [
    {
      "type": "temporary",
      "reason": "OOMKilling",
      "pattern": "Killed process \\d+ (.+) total-vm:\\d+kB.*",
      "patternGeneratedMessageSuffix": " - Check memory usage"
    }
  ]
}

Key Fields:

  • logPath: Path to log source
  • lookback: How far back to read on startup
  • bufferSize: Lines to buffer for multi-line patterns
  • patternGeneratedMessageSuffix: Appended to matched patterns

Custom Plugin Monitor Configuration

{
  "plugin": "custom",
  "source": "health-monitor",
  "metricsReporting": true,
  "pluginConfig": {
    "invoke_interval": "30s",
    "timeout": "5s",
    "max_output_length": 80,
    "concurrency": 3,
    "enable_message_change_based_condition_update": false
  },
  "conditions": [
    {
      "type": "KubeletHealthy",
      "reason": "KubeletIsHealthy",
      "message": "kubelet is functioning properly"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "KubeletHealthy",
      "reason": "KubeletUnhealthy",
      "path": "/home/kubernetes/bin/health-checker",
      "args": ["--component=kubelet", "--enable-repair=true"],
      "timeout": "3m"
    }
  ]
}

Key Fields:

  • invoke_interval: How often to run checks
  • timeout: Global timeout for helper execution
  • max_output_length: Truncate helper output
  • concurrency: Max parallel helper executions
  • path: Path to helper binary/script
  • args: Command-line arguments for helper

System Stats Monitor Configuration

{
  "invokeInterval": "60s",
  "cpu": {
    "metricsConfigs": {
      "cpu/load_1m": {
        "displayName": "cpu/load_1m"
      },
      "cpu/usage_time": {
        "displayName": "cpu/usage_time"
      }
    }
  },
  "disk": {
    "includeAllAttachedBlk": true,
    "includeRootBlk": true,
    "lsblkTimeout": "5s",
    "metricsConfigs": {
      "disk/io_read_bytes": {
        "displayName": "disk/io_read_bytes"
      },
      "disk/io_write_bytes": {
        "displayName": "disk/io_write_bytes"
      }
    }
  },
  "memory": {
    "metricsConfigs": {
      "memory/anonymous_bytes": {
        "displayName": "memory/anonymous_bytes"
      },
      "memory/available_bytes": {
        "displayName": "memory/available_bytes"
      }
    }
  },
  "network": {
    "interfaceIncludeRegexp": "^(en|eth|wlan)\\\\d+",
    "interfaceExcludeRegexp": "^(docker|br-|veth)",
    "metricsConfigs": {
      "network/rx_bytes": {
        "displayName": "network/rx_bytes"
      },
      "network/tx_bytes": {
        "displayName": "network/tx_bytes"
      }
    }
  }
}

Key Fields:

  • invokeInterval: Metrics collection frequency
  • includeAllAttachedBlk: Include all block devices
  • interfaceIncludeRegexp: Network interfaces to monitor
  • metricsConfigs: Which metrics to collect and expose

Configuration Validation

Each plugin implements configuration validation in its factory function, ensuring invalid configurations are caught at startup rather than runtime.

Common Validation Patterns

// Validate required fields
if config.Source == "" {
    return nil, fmt.Errorf("source field is required")
}

// Validate rule patterns
for _, rule := range config.Rules {
    if _, err := regexp.Compile(rule.Pattern); err != nil {
        return nil, fmt.Errorf("invalid pattern %q: %v", rule.Pattern, err)
    }
}

// Validate timeouts
if config.PluginConfig.Timeout <= 0 {
    return nil, fmt.Errorf("timeout must be positive")
}

Plugin-Specific Validation

Each plugin validates its specific configuration requirements:

  • System Log Monitor: Validates regex patterns, log paths, buffer sizes
  • Custom Plugin Monitor: Validates script paths, timeouts, concurrency limits
  • System Stats Monitor: Validates metric configurations, intervals, regex patterns

Configuration Best Practices

  1. Use Meaningful Source Names: Source names appear in events and logs
  2. Test Regex Patterns: Use tools like grep to test patterns before deployment (see the Go sketch after this list)
  3. Set Appropriate Timeouts: Balance responsiveness with reliability
  4. Enable Metrics Reporting: Provides valuable observability data
  5. Provide Default Conditions: Establish baseline health states
  6. Document Custom Scripts: Include clear documentation for helper binaries
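
For item 2, a pattern can also be verified with a few lines of Go before a config ships. The pattern below is the OOMKilling rule from kernel-monitor.json shown earlier; the sample log line is made up for the test.

package main

import (
    "fmt"
    "regexp"
)

func main() {
    // Pattern from the OOMKilling rule; the sample line is illustrative.
    pattern := regexp.MustCompile(`Killed process \d+ (.+) total-vm:\d+kB.*`)
    sample := "Killed process 1234 (chrome) total-vm:2048000kB, anon-rss:102400kB"
    if m := pattern.FindStringSubmatch(sample); m != nil {
        fmt.Printf("matched, captured process: %s\n", m[1])
    } else {
        fmt.Println("no match - fix the pattern before deploying")
    }
}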

The Status Data Model

The types.Status struct is the unified interface between problem detection (monitors) and problem reporting (exporters). Understanding its structure and usage patterns is crucial for working with NPD plugins.

Core Structure

// pkg/types/types.go
type Status struct {
    // Source identifies which monitor generated this status
    Source string `json:"source"`

    // Events are temporary problems (sorted oldest to newest)
    Events []Event `json:"events"`

    // Conditions are permanent node states (ALL conditions for this monitor)
    Conditions []Condition `json:"conditions"`
}

type Event struct {
    Severity  Severity  `json:"severity"`   // Info or Warn
    Timestamp time.Time `json:"timestamp"`  // When event occurred
    Reason    string    `json:"reason"`     // Short identifier
    Message   string    `json:"message"`    // Human-readable description
}

type Condition struct {
    Type       string          `json:"type"`       // e.g., "KernelDeadlock"
    Status     ConditionStatus `json:"status"`     // True/False/Unknown
    Transition time.Time       `json:"transition"` // When status last changed
    Reason     string          `json:"reason"`     // Short cause description
    Message    string          `json:"message"`    // Detailed explanation
}

Key Design Principles

  1. Immutable Events: Events represent point-in-time occurrences
  2. Stateful Conditions: Conditions represent current node state
  3. Complete Condition Sets: Status always contains ALL conditions for a monitor
  4. Chronological Events: Events are sorted from oldest to newest

Status Creation Patterns

Pattern 1: Initial Status (Monitor Startup)

Every monitor sends an initial status establishing default conditions:

func (l *logMonitor) initializeStatus() {
    // Initialize all conditions to False (healthy state)
    l.conditions = initialConditions(l.config.DefaultConditions)
    l.output <- &types.Status{
        Source:     l.config.Source,    // e.g., "kernel-monitor"
        Events:     nil,                // No events on startup
        Conditions: l.conditions,       // All default conditions
    }
}

Pattern 2: Temporary Problems (Events Only)

For transient issues that don't affect persistent node state:

// System Log Monitor detecting OOM kill from kernel logs
status := &types.Status{
    Source: "kernel-monitor",
    Events: []types.Event{
        {
            Severity:  types.Warn,
            Timestamp: time.Now(),
            Reason:    "OOMKilling",
            Message:   "Killed process 1234 (chrome) total-vm:2048MB",
        },
    },
    Conditions: l.conditions,  // Unchanged from previous state
}

Pattern 3: Permanent Problems (Condition Changes)

For persistent issues that affect node health status:

// Custom Plugin Monitor detecting component failure
// 1. Update the affected condition
for i := range c.conditions {
    condition := &c.conditions[i]
    if condition.Type == "KubeletHealthy" {
        condition.Status = types.True        // Problem detected
        condition.Transition = time.Now()
        condition.Reason = "KubeletUnhealthy"
        condition.Message = "Kubelet health check failed: timeout"
        break
    }
}

// 2. Generate event for the condition change
changeEvent := util.GenerateConditionChangeEvent(
    "KubeletHealthy", types.True,
    "KubeletUnhealthy", "Kubelet health check failed: timeout",
    time.Now(),
)

// 3. Send status with both event and all current conditions
status := &types.Status{
    Source:     "custom-plugin-monitor",
    Events:     []types.Event{changeEvent},
    Conditions: c.conditions,  // ALL conditions, not just the changed one
}

Type Conversions

Internal Types → Kubernetes API

// types.Condition → v1.NodeCondition
func ConvertToAPICondition(condition types.Condition) v1.NodeCondition {
    return v1.NodeCondition{
        Type:               v1.NodeConditionType(condition.Type),
        Status:             ConvertToAPIConditionStatus(condition.Status),
        LastTransitionTime: ConvertToAPITimestamp(condition.Transition),
        Reason:             condition.Reason,
        Message:            condition.Message,
    }
}

// types.ConditionStatus → v1.ConditionStatus
func ConvertToAPIConditionStatus(status types.ConditionStatus) v1.ConditionStatus {
    switch status {
    case types.True:    return v1.ConditionTrue      // Problem present
    case types.False:   return v1.ConditionFalse     // Healthy state
    default:            return v1.ConditionUnknown   // Cannot determine
    }
}

// types.Event severity → Kubernetes event type
func ConvertToAPIEventType(severity types.Severity) string {
    switch severity {
    case types.Warn:    return v1.EventTypeWarning   // Problem detected
    default:            return v1.EventTypeNormal    // Informational (types.Info)
    }
}

Status Processing Flow

Problem Detector (Orchestrator)

func (p *problemDetector) Run(ctx context.Context) error {
    // Collect status channels from all monitors
    var chans []<-chan *types.Status
    for _, monitor := range p.monitors {
        ch, err := monitor.Start()
        if err != nil {
            return err
        }
        if ch != nil {
            chans = append(chans, ch)  // metrics-only monitors return nil channels
        }
    }

    // Multiplex all status channels
    statusCh := groupChannel(chans)

    // Fan out to all exporters
    for {
        select {
        case status := <-statusCh:
            for _, exporter := range p.exporters {
                exporter.ExportProblems(status)  // Parallel processing
            }
        case <-ctx.Done():
            return nil
        }
    }
}
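
The groupChannel helper referenced above is a fan-in: every monitor channel is forwarded into one channel that the exporter loop reads. A minimal sketch (the actual NPD helper may differ in details):

// Fan-in sketch: forward every status from every monitor channel into a
// single channel. The output channel is never closed here; NPD shuts the
// loop down via context cancellation instead.
func groupChannel(chans []<-chan *types.Status) <-chan *types.Status {
    out := make(chan *types.Status)
    for _, ch := range chans {
        go func(ch <-chan *types.Status) {
            for status := range ch {
                out <- status
            }
        }(ch)
    }
    return out
}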

Best Practices for Status Usage

For Monitor Developers

  1. Always send ALL conditions: Don't send only changed conditions

    // ✅ Correct
    status := &types.Status{
        Source:     source,
        Conditions: allConditions,  // All conditions for this monitor
    }
    
    // ❌ Incorrect
    status := &types.Status{
        Source:     source,
        Conditions: []types.Condition{changedCondition},  // Only changed one
    }
  2. Sort events chronologically: Oldest to newest

    sort.Slice(events, func(i, j int) bool {
        return events[i].Timestamp.Before(events[j].Timestamp)
    })
  3. Use appropriate severities:

    • types.Info: Normal operations, condition resolved
    • types.Warn: Problems detected, requires attention
  4. Update transition time only when needed:

    if condition.Status != newStatus || condition.Reason != newReason {
        condition.Transition = time.Now()  // Only update on actual change
    }

For Exporter Developers

  1. Handle nil/empty arrays gracefully:

    for _, event := range status.Events {  // Safe even if Events is nil
        processEvent(event)
    }
  2. Process conditions efficiently:

    // Batch condition updates rather than individual API calls
    for _, condition := range status.Conditions {
        conditionManager.UpdateCondition(condition)  // Batched internally
    }
  3. Convert types appropriately: Use utility functions for type conversions

Common Patterns and Gotchas

Initialization Pattern

// Every monitor should send initial status
func (m *monitor) Start() (<-chan *types.Status, error) {
    // 1. Initialize conditions to healthy state
    m.conditions = initializeConditions(m.config.DefaultConditions)

    // 2. Send initial status (the channel should be buffered, or this send
    //    should happen inside the monitoring goroutine, so Start doesn't block)
    m.output <- &types.Status{
        Source:     m.config.Source,
        Conditions: m.conditions,
    }

    // 3. Start monitoring loop
    go m.monitorLoop()
    return m.output, nil
}

Condition Change Detection

// Only generate events when conditions actually change
func needsUpdate(old, new types.Condition) bool {
    return old.Status != new.Status ||
           old.Reason != new.Reason ||
           (enableMessageChanges && old.Message != new.Message)
}

Metrics Integration

// Report to Problem Metrics Manager for Prometheus/Stackdriver
if metricsEnabled {
    for _, event := range status.Events {
        problemmetrics.GlobalProblemMetricsManager.IncrementProblemCounter(
            event.Reason, 1)
    }
    for _, condition := range status.Conditions {
        problemmetrics.GlobalProblemMetricsManager.SetProblemGauge(
            condition.Type, condition.Reason, condition.Status == types.True)
    }
}

This types.Status design enables NPD's flexible architecture where different monitors can report different types of problems through a unified interface, while multiple exporters can process the same information in different ways (Kubernetes API, metrics, external systems, etc.).

Complete Integration Example

This section walks through creating a complete custom monitor from start to finish, demonstrating all aspects of NPD plugin development.

Scenario: Network Connectivity Monitor

Let's create a monitor that checks network connectivity to critical services and reports failures as node conditions.

Step 1: Define the Requirements

Goal: Monitor network connectivity to essential services

Problem Types:

  • Temporary: Connection timeouts or transient failures
  • Permanent: Persistent connectivity issues that affect node scheduling

Services to Monitor:

  • DNS servers
  • Kubernetes API server
  • Container registry

Step 2: Create the Helper Script

File: /usr/local/bin/network-connectivity-check.sh

#!/bin/bash

# Exit code constants (MUST match NPD protocol)
readonly OK=0
readonly NONOK=1
readonly UNKNOWN=2

# Configuration
DEFAULT_TIMEOUT=5
DEFAULT_DNS="8.8.8.8"
DEFAULT_REGISTRY="gcr.io"

# Parse arguments
TIMEOUT=$DEFAULT_TIMEOUT
DNS_SERVER=$DEFAULT_DNS
REGISTRY=$DEFAULT_REGISTRY
VERBOSE=false

while [[ $# -gt 0 ]]; do
    case $1 in
        --timeout) TIMEOUT="$2"; shift 2 ;;
        --dns) DNS_SERVER="$2"; shift 2 ;;
        --registry) REGISTRY="$2"; shift 2 ;;
        --verbose) VERBOSE=true; shift ;;
        --help)
            echo "Usage: $0 [--timeout SECONDS] [--dns DNS_IP] [--registry REGISTRY] [--verbose]"
            exit $OK ;;
        *)
            echo "Unknown option: $1"
            exit $UNKNOWN ;;
    esac
done

# Function to log verbose messages
log() {
    if [[ "$VERBOSE" == "true" ]]; then
        echo "[DEBUG] $*" >&2
    fi
}

# Function to test connectivity
test_connectivity() {
    local service="$1"
    local test_command="$2"

    log "Testing connectivity to $service..."

    if timeout "$TIMEOUT" bash -c "$test_command" >/dev/null 2>&1; then
        log "$service connectivity: OK"
        return 0
    else
        log "$service connectivity: FAILED"
        return 1
    fi
}

# Track failures
failures=()

# Test DNS connectivity
if ! test_connectivity "DNS ($DNS_SERVER)" "nslookup google.com $DNS_SERVER"; then
    failures+=("DNS")
fi

# Test Kubernetes API server (if we can find it)
if [[ -n "$KUBERNETES_SERVICE_HOST" ]]; then
    if ! test_connectivity "Kubernetes API" "curl -k -s --connect-timeout $TIMEOUT https://$KUBERNETES_SERVICE_HOST:${KUBERNETES_SERVICE_PORT:-443}/healthz"; then
        failures+=("K8s API")
    fi
fi

# Test container registry connectivity
if ! test_connectivity "Registry ($REGISTRY)" "curl -s --connect-timeout $TIMEOUT https://$REGISTRY/v2/"; then
    failures+=("Registry")
fi

# Generate output and exit
if [[ ${#failures[@]} -eq 0 ]]; then
    echo "All network services reachable"
    exit $OK
else
    # Join failures with commas
    failed_services=$(IFS=','; echo "${failures[*]}")
    echo "Network connectivity issues: $failed_services"
    exit $NONOK
fi

Make the script executable:

chmod +x /usr/local/bin/network-connectivity-check.sh

Step 3: Create the Configuration

File: /etc/node-problem-detector/config/network-monitor.json

{
  "plugin": "custom",
  "source": "network-connectivity-monitor",
  "metricsReporting": true,
  "pluginConfig": {
    "invoke_interval": "60s",
    "timeout": "30s",
    "max_output_length": 200,
    "concurrency": 1,
    "enable_message_change_based_condition_update": false
  },
  "conditions": [
    {
      "type": "NetworkConnectivity",
      "reason": "NetworkConnectivityHealthy",
      "message": "All critical network services are reachable"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "NetworkConnectivity",
      "reason": "NetworkConnectivityUnhealthy",
      "path": "/usr/local/bin/network-connectivity-check.sh",
      "args": [
        "--timeout=10",
        "--dns=8.8.8.8",
        "--registry=gcr.io",
        "--verbose"
      ],
      "timeout": "25s"
    }
  ]
}

Step 4: Test the Helper Script

Test the helper script independently first:

# Test successful case
/usr/local/bin/network-connectivity-check.sh --verbose
echo "Exit code: $?"

# Test with unreachable DNS
/usr/local/bin/network-connectivity-check.sh --dns=192.0.2.1 --timeout=2 --verbose
echo "Exit code: $?"

# Test help
/usr/local/bin/network-connectivity-check.sh --help

Step 5: Test the Configuration

Validate the JSON configuration:

# Check JSON syntax
cat /etc/node-problem-detector/config/network-monitor.json | jq .

# Test NPD with dry-run (if available)
node-problem-detector --config.custom-plugin-monitor=/etc/node-problem-detector/config/network-monitor.json --dry-run

Step 6: Deploy and Run

Deploy the monitor:

# Run NPD with the network monitor
node-problem-detector \
  --config.custom-plugin-monitor=/etc/node-problem-detector/config/network-monitor.json \
  --v=2

Step 7: Verify the Integration

Check that the monitor is working:

# 1. Check NPD logs for successful startup
kubectl logs <npd-pod> | grep "network-connectivity-monitor"

# 2. Check node conditions
kubectl describe node <node-name> | grep -A 5 "NetworkConnectivity"

# 3. Check events
kubectl get events --field-selector involvedObject.name=<node-name> | grep NetworkConnectivity

# 4. Check metrics (if Prometheus is configured)
curl http://<node>:20257/metrics | grep problem

Step 8: Advanced Configuration

Create a more sophisticated configuration with different checks:

File: /etc/node-problem-detector/config/network-monitor-advanced.json

{
  "plugin": "custom",
  "source": "network-connectivity-monitor",
  "metricsReporting": true,
  "pluginConfig": {
    "invoke_interval": "30s",
    "timeout": "25s",
    "max_output_length": 200,
    "concurrency": 2
  },
  "conditions": [
    {
      "type": "DNSConnectivity",
      "reason": "DNSConnectivityHealthy",
      "message": "DNS resolution is working"
    },
    {
      "type": "RegistryConnectivity",
      "reason": "RegistryConnectivityHealthy",
      "message": "Container registry is reachable"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "DNSConnectivity",
      "reason": "DNSConnectivityUnhealthy",
      "path": "/usr/local/bin/dns-check.sh",
      "args": ["--timeout=5"],
      "timeout": "10s"
    },
    {
      "type": "permanent",
      "condition": "RegistryConnectivity",
      "reason": "RegistryConnectivityUnhealthy",
      "path": "/usr/local/bin/registry-check.sh",
      "args": ["--registry=gcr.io", "--timeout=10"],
      "timeout": "15s"
    },
    {
      "type": "temporary",
      "reason": "NetworkTimeout",
      "path": "/usr/local/bin/network-latency-check.sh",
      "args": ["--threshold=1000"],
      "timeout": "20s"
    }
  ]
}

Expected Outcomes

After successful deployment, you should see:

  1. Node Conditions: New conditions appearing in kubectl describe node

    NetworkConnectivity    False   NetworkConnectivityHealthy   All critical network services are reachable
    
  2. Events: Condition change events when connectivity issues occur

    Warning  NetworkConnectivityUnhealthy  Network connectivity issues: DNS,Registry
    Normal   NetworkConnectivityHealthy    All critical network services are reachable
    
  3. Metrics: Problem metrics available via Prometheus

    problem_gauge{type="NetworkConnectivity",reason="NetworkConnectivityHealthy"} 0
    problem_counter_total{reason="NetworkConnectivityUnhealthy"} 3
    
  4. Logs: Monitor execution logs in NPD output

    I0101 12:00:00.000000 1 custom_plugin_monitor.go:123] network-connectivity-monitor: All network services reachable
    

Error Handling and Edge Cases

The helper script handles common edge cases:

  1. Command not found: Returns UNKNOWN status
  2. Timeout scenarios: Uses timeout command for reliability
  3. Argument validation: Provides help and validates inputs
  4. Verbose logging: Enables debugging via stderr
  5. Multiple failure reporting: Reports all failed services

Integration Benefits

This complete example demonstrates:

  1. Problem Detection: Automated network connectivity monitoring
  2. Kubernetes Integration: Node conditions affect pod scheduling
  3. Observability: Events and metrics provide visibility
  4. Flexibility: Configurable timeouts, services, and thresholds
  5. Reliability: Proper error handling and timeout management

This pattern can be extended for any custom monitoring requirement, from hardware health to application-specific checks.

Best Practices

Plugin Development

  1. Use the Tomb Pattern: Implement graceful shutdown using the tomb library (pkg/util/tomb) for goroutine lifecycle management. NPD's tomb provides a simple pattern: call tomb.Stop() to signal shutdown, check tomb.Stopping() in select loops, and call tomb.Done() when cleanup is complete. Example:
    type monitor struct {
        tomb   *tomb.Tomb
        output chan *types.Status
    }
    
    func (m *monitor) Start() (<-chan *types.Status, error) {
        m.output = make(chan *types.Status)
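        // m.tomb is assumed to be created with tomb.NewTomb() in the monitor's constructor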
    
        // Start goroutine manually
        go m.monitorLoop()
    
        return m.output, nil
    }
    
    func (m *monitor) monitorLoop() {
        // Always defer Done() to signal completion
        defer func() {
            close(m.output)
            m.tomb.Done()
        }()
    
        for {
            select {
            case <-m.tomb.Stopping():
                // Stop signal received - cleanup and return
                return
            // ... other monitoring cases
            }
        }
    }
    
    func (m *monitor) Stop() {
        m.tomb.Stop()  // Blocks until Done() is called
    }
  2. Handle Context Cancellation: Respect context cancellation for clean shutdown (see the sketch after this list)
  3. Validate Configuration: Fail fast with clear error messages for invalid configurations
  4. Log Appropriately: Use structured logging with appropriate log levels
  5. Metrics Integration: Expose relevant metrics through the problem metrics manager
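
A minimal sketch for item 2 (context cancellation), assuming the monitor owns a context that Stop() cancels so helper goroutines exit promptly; the names below are illustrative, not NPD APIs, and the context, time, and NPD types imports are assumed:

type ctxMonitor struct {
    ctx    context.Context
    cancel context.CancelFunc
    output chan *types.Status
}

func newCtxMonitor() *ctxMonitor {
    ctx, cancel := context.WithCancel(context.Background())
    return &ctxMonitor{ctx: ctx, cancel: cancel, output: make(chan *types.Status)}
}

func (m *ctxMonitor) helperLoop() {
    for {
        select {
        case <-m.ctx.Done():
            return // Stop() cancelled the context; exit cleanly
        case <-time.After(30 * time.Second):
            // periodic work goes here
        }
    }
}

func (m *ctxMonitor) Stop() {
    m.cancel() // every goroutine watching m.ctx.Done() unblocks and returns
}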

Configuration Management

  1. Default Conditions: Always provide sensible default conditions for permanent rules
  2. Timeout Configuration: Provide both global and per-rule timeout settings
  3. Resource Limits: Configure appropriate limits for output length, concurrency, etc.
  4. Pattern Testing: Thoroughly test regex patterns with sample log data (see the sketch below)
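
Item 4 in practice: a small standalone test that exercises a rule's regex against representative log lines before the config ships (the pattern and sample lines are illustrative):

package rules_test

import (
    "regexp"
    "testing"
)

func TestTaskHungPattern(t *testing.T) {
    pattern := regexp.MustCompile(`task \S+:\w+ blocked for more than \w+ seconds\.`)

    samples := map[string]bool{ // sample log line -> expected match
        "INFO: task docker:20744 blocked for more than 120 seconds.":  true,
        "systemd[1]: Started Daily apt upgrade and clean activities.": false,
    }
    for line, want := range samples {
        if got := pattern.MatchString(line); got != want {
            t.Errorf("MatchString(%q) = %v, want %v", line, got, want)
        }
    }
}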

Performance Considerations

  1. Channel Buffering: Use appropriately sized buffers for status channels
  2. Regex Compilation: Pre-compile regex patterns for efficiency
  3. Resource Monitoring: Monitor CPU and memory usage of monitor plugins
  4. Rate Limiting: Implement rate limiting for high-frequency events (see the sketch below)
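
For item 4, a minimal per-reason rate limiter sketch (not an NPD API; sync and time imports assumed) that a monitor can consult before emitting an event:

type eventLimiter struct {
    mu       sync.Mutex
    lastSent map[string]time.Time
    minGap   time.Duration // minimum interval between events with the same reason
}

func newEventLimiter(minGap time.Duration) *eventLimiter {
    return &eventLimiter{lastSent: make(map[string]time.Time), minGap: minGap}
}

// allow reports whether an event with the given reason may be emitted now.
func (l *eventLimiter) allow(reason string) bool {
    l.mu.Lock()
    defer l.mu.Unlock()
    if time.Since(l.lastSent[reason]) < l.minGap {
        return false // too soon after the previous event with this reason
    }
    l.lastSent[reason] = time.Now()
    return true
}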

Thread Safety Considerations

  1. Monitor Internal State: Monitors run in their own goroutines

    type monitor struct {
        mutex      sync.RWMutex
        conditions []types.Condition
        tomb       *tomb.Tomb
        output     chan *types.Status
    }
    
    func (m *monitor) updateConditions() {
        m.mutex.Lock()
        defer m.mutex.Unlock()
        // Safe to modify conditions
    }
    
    func (m *monitor) getConditions() []types.Condition {
        m.mutex.RLock()
        defer m.mutex.RUnlock()
        // Safe to read conditions
        return append([]types.Condition(nil), m.conditions...)
    }
  2. Status Channel: Only one goroutine should send to the status channel

    // ✅ Correct: Single goroutine sends
    func (m *monitor) monitorLoop() {
        for {
            select {
            case <-m.tomb.Stopping():
                return
            case <-ticker.C:
                status := m.generateStatus()
                m.output <- status  // Only this goroutine sends
            }
        }
    }
    
    // ❌ Incorrect: Multiple goroutines sending
    go func() { m.output <- status1 }()
    go func() { m.output <- status2 }()  // Race condition
  3. Exporter Concurrency: Multiple exporters process the same status

    // Exporters are invoked sequentially with the same *Status pointer,
    // so they must treat the status as read-only
    for _, exporter := range p.exporters {
        exporter.ExportProblems(status)
    }
  4. Problem Metrics Manager: Thread-safe (uses sync.Mutex internally)

    // Safe: Can be called from multiple goroutines
    problemmetrics.GlobalProblemMetricsManager.IncrementProblemCounter(reason, 1)
    problemmetrics.GlobalProblemMetricsManager.SetProblemGauge(problemType, reason, value)
  5. Configuration Access: Read-only after initialization

    // Safe: Configuration is read-only after monitor creation
    func (m *monitor) someMethod() {
        timeout := m.config.Timeout  // No synchronization needed
    }
  6. Common Patterns for Shared State:

    // Pattern 1: Atomic operations for simple counters
    type monitor struct {
        eventCount int64
    }
    
    func (m *monitor) incrementEvents() {
        atomic.AddInt64(&m.eventCount, 1)
    }
    
    // Pattern 2: Channel-based communication
    type monitor struct {
        configUpdates chan Config
    }
    
    func (m *monitor) updateConfig(newConfig Config) {
        select {
        case m.configUpdates <- newConfig:
        default:
            // Non-blocking send
        }
    }

Security Considerations

  1. Input Validation: Validate all external inputs, especially in custom plugins
  2. Process Isolation: Run custom plugin scripts with minimal privileges
  3. Resource Limits: Enforce timeout and output limits to prevent resource exhaustion (see the sketch after this list)
  4. Path Validation: Validate and sanitize file paths in configurations
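
For item 3, a sketch of invoking a helper binary with both a hard timeout and an output cap (the function name and limits are illustrative; context, fmt, os/exec, and time imports assumed):

// runPluginWithLimits enforces a deadline via the context and truncates output
// to maxOutput bytes so a misbehaving helper cannot exhaust memory.
func runPluginWithLimits(path string, args []string, timeout time.Duration, maxOutput int) (string, error) {
    ctx, cancel := context.WithTimeout(context.Background(), timeout)
    defer cancel()

    out, err := exec.CommandContext(ctx, path, args...).CombinedOutput()
    if len(out) > maxOutput {
        out = out[:maxOutput]
    }
    if ctx.Err() == context.DeadlineExceeded {
        return string(out), fmt.Errorf("helper %s timed out after %v", path, timeout)
    }
    return string(out), err
}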

Testing Guide

Unit Testing Patterns

Testing Monitor Interfaces

func TestMonitorInterface(t *testing.T) {
    monitor := NewMyCustomMonitor(validConfig)

    // Test Start() returns proper channel
    statusCh, err := monitor.Start()
    assert.NoError(t, err)
    assert.NotNil(t, statusCh)

    // Test status generation
    select {
    case status := <-statusCh:
        assert.Equal(t, "my-monitor", status.Source)
        assert.NotEmpty(t, status.Conditions)
    case <-time.After(5 * time.Second):
        t.Fatal("No status received within timeout")
    }

    // Test Stop() cleans up properly
    monitor.Stop()

    // Channel should be closed
    select {
    case _, ok := <-statusCh:
        assert.False(t, ok, "Channel should be closed")
    default:
        t.Fatal("Channel not closed")
    }
}

Testing Configuration Validation

func TestConfigValidation(t *testing.T) {
    tests := []struct {
        name        string
        config      MyConfig
        expectError bool
    }{
        {
            name: "valid config",
            config: MyConfig{
                Source: "test-monitor",
                Rules:  []Rule{{Pattern: "valid.*pattern"}},
            },
            expectError: false,
        },
        {
            name: "empty source",
            config: MyConfig{
                Source: "",
                Rules:  []Rule{{Pattern: "valid.*pattern"}},
            },
            expectError: true,
        },
        {
            name: "invalid regex",
            config: MyConfig{
                Source: "test-monitor",
                Rules:  []Rule{{Pattern: "[invalid"}},
            },
            expectError: true,
        },
    }

    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            _, err := NewMyCustomMonitor(tt.config)
            if tt.expectError {
                assert.Error(t, err)
            } else {
                assert.NoError(t, err)
            }
        })
    }
}

Integration Testing

Testing Plugin Registration

func TestPluginRegistration(t *testing.T) {
    // Record how many handlers are already registered
    handlers := problemdaemon.GetRegisteredHandlers()
    originalCount := len(handlers)

    // Register test plugin
    problemdaemon.Register("test-plugin", types.ProblemDaemonHandler{
        CreateProblemDaemonOrDie: func(string) types.Monitor {
            return &mockMonitor{}
        },
    })

    // Verify registration
    newHandlers := problemdaemon.GetRegisteredHandlers()
    assert.Equal(t, originalCount+1, len(newHandlers))
    assert.Contains(t, newHandlers, "test-plugin")
}

Testing End-to-End Flow

func TestEndToEndFlow(t *testing.T) {
    // Create test monitor
    monitor := &testMonitor{
        statusCh: make(chan *types.Status, 1),
    }

    // Create test exporter
    exporter := &testExporter{
        receivedStatus: make([]*types.Status, 0),
    }

    // Create problem detector
    pd := NewProblemDetector([]types.Monitor{monitor}, []types.Exporter{exporter})

    // Start in background
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()
    go pd.Run(ctx)

    // Send test status
    testStatus := &types.Status{
        Source: "test-monitor",
        Events: []types.Event{{Reason: "TestEvent"}},
    }
    monitor.statusCh <- testStatus

    // Verify exporter received status
    assert.Eventually(t, func() bool {
        return len(exporter.receivedStatus) > 0
    }, 5*time.Second, 100*time.Millisecond)

    assert.Equal(t, testStatus, exporter.receivedStatus[0])
}
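
The testMonitor and testExporter used above are not shown; a minimal sketch of what they might look like follows (real tests should synchronize access to receivedStatus, since ExportProblems runs on the problem detector's goroutine while the test polls it):

type testMonitor struct {
    statusCh chan *types.Status
}

func (m *testMonitor) Start() (<-chan *types.Status, error) { return m.statusCh, nil }
func (m *testMonitor) Stop()                                { close(m.statusCh) }

type testExporter struct {
    receivedStatus []*types.Status
}

func (e *testExporter) ExportProblems(status *types.Status) {
    e.receivedStatus = append(e.receivedStatus, status)
}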

Testing Helper Binaries

Script Testing Framework

#!/bin/bash
# test-helper-script.sh - Framework for testing helper binaries

set -euo pipefail

readonly SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
readonly TEMP_DIR="$(mktemp -d)"
readonly OK=0
readonly NONOK=1
readonly UNKNOWN=2

cleanup() {
    rm -rf "$TEMP_DIR"
}
trap cleanup EXIT

# Test function that verifies exit code and output
test_helper() {
    local test_name="$1"
    local expected_exit="$2"
    local expected_output="$3"
    shift 3
    local args=("$@")

    echo "Testing: $test_name"

    local output
    local exit_code

    # Capture output and exit code
    set +e
    output=$("${SCRIPT_DIR}/my-helper-script.sh" "${args[@]}" 2>&1)
    exit_code=$?
    set -e

    # Verify exit code
    if [[ $exit_code -ne $expected_exit ]]; then
        echo "FAIL: Expected exit code $expected_exit, got $exit_code"
        return 1
    fi

    # Verify output contains expected text
    if [[ ! "$output" =~ $expected_output ]]; then
        echo "FAIL: Output '$output' doesn't match expected pattern '$expected_output'"
        return 1
    fi

    echo "PASS: $test_name"
}

# Test cases
test_helper "successful check" $OK "System is healthy" --component=test
test_helper "failed check" $NONOK "System unhealthy" --component=failing-test
test_helper "invalid argument" $UNKNOWN "Unknown option" --invalid-arg
test_helper "help message" $OK "Usage:" --help

echo "All tests passed!"

Mock Services for Testing

#!/bin/bash
# mock-service.sh - Create mock services for testing

# Start mock HTTP server
start_mock_server() {
    local port="$1"
    local response="$2"

    # Simple HTTP server using nc (netcat)
    # Note: netcat might not be available on all systems
    # Consider Python's http.server or httpbin for production testing
    while true; do
        echo -e "HTTP/1.1 200 OK\r\n\r\n$response" | nc -l -p "$port"
    done &

    echo $!  # Return PID
}

# Usage in tests
mock_pid=$(start_mock_server 8080 "OK")
trap "kill $mock_pid" EXIT

# Test helper script against mock server
test_helper "mock server check" $OK "Service available" --url=http://localhost:8080

Performance Testing

Load Testing Monitors

func BenchmarkMonitorPerformance(b *testing.B) {
    monitor := NewMyCustomMonitor(benchConfig)
    statusCh, _ := monitor.Start()
    defer monitor.Stop()

    // Drain status channel in background
    go func() {
        for range statusCh {
            // Consume status updates
        }
    }()

    b.ResetTimer()

    for i := 0; i < b.N; i++ {
        // Trigger monitor activity
        monitor.TriggerCheck()
    }
}

Memory Leak Testing

func TestMonitorMemoryLeak(t *testing.T) {
    var m1, m2 runtime.MemStats
    runtime.GC()
    runtime.ReadMemStats(&m1)

    // Run monitor for extended period
    monitor := NewMyCustomMonitor(testConfig)
    statusCh, _ := monitor.Start()

    // Simulate extended operation
    for i := 0; i < 1000; i++ {
        select {
        case <-statusCh:
            // Process status
        case <-time.After(10 * time.Millisecond):
            // Continue if no status
        }
    }

    monitor.Stop()

    runtime.GC()
    runtime.ReadMemStats(&m2)

    // Check for significant memory increase
    memIncrease := int64(m2.Alloc) - int64(m1.Alloc)
    if memIncrease > 1024*1024 { // 1MB threshold
        t.Errorf("Potential memory leak: %d bytes increase", memIncrease)
    }
}

Configuration Testing

Validation Testing

func TestConfigurationValidation(t *testing.T) {
    testCases := []struct {
        name   string
        config string
        valid  bool
    }{
        {
            name: "valid config",
            config: `{
                "plugin": "custom",
                "source": "test-monitor",
                "rules": [{"type": "temporary", "reason": "Test"}]
            }`,
            valid: true,
        },
        {
            name: "invalid JSON",
            config: `{
                "plugin": "custom",
                "source": "test-monitor"
                "rules": []
            }`,
            valid: false,
        },
        {
            name: "missing required field",
            config: `{
                "plugin": "custom",
                "rules": []
            }`,
            valid: false,
        },
    }

    for _, tc := range testCases {
        t.Run(tc.name, func(t *testing.T) {
            var config MyConfig
            err := json.Unmarshal([]byte(tc.config), &config)

            if tc.valid {
                assert.NoError(t, err)
                assert.NoError(t, validateConfig(config))
            } else {
                // Either JSON parsing or semantic validation should fail
                if err == nil {
                    err = validateConfig(config)
                }
                assert.Error(t, err)
            }
        })
    }
}

Performance Tuning

Monitor Performance Optimization

Channel Buffer Sizing

// Size buffers based on expected throughput
func (m *monitor) Start() (<-chan *types.Status, error) {
    // For high-frequency monitors, use larger buffers
    bufferSize := 100
    if m.config.HighFrequency {
        bufferSize = 1000
    }

    m.output = make(chan *types.Status, bufferSize)
    go m.monitorLoop()
    return m.output, nil
}

Efficient Regex Compilation

type logMonitor struct {
    compiledPatterns map[string]*regexp.Regexp
}

func (l *logMonitor) compilePatterns() error {
    l.compiledPatterns = make(map[string]*regexp.Regexp)

    for _, rule := range l.config.Rules {
        compiled, err := regexp.Compile(rule.Pattern)
        if err != nil {
            return fmt.Errorf("failed to compile pattern %q: %v", rule.Pattern, err)
        }
        l.compiledPatterns[rule.Pattern] = compiled
    }

    return nil
}

// Use pre-compiled patterns in hot path
func (l *logMonitor) matchPattern(line string, pattern string) bool {
    compiled := l.compiledPatterns[pattern]
    return compiled.MatchString(line)
}

Resource Pool Management

// Pool objects to reduce GC pressure
type statusPool struct {
    pool sync.Pool
}

func newStatusPool() *statusPool {
    return &statusPool{
        pool: sync.Pool{
            New: func() interface{} {
                return &types.Status{
                    Events:     make([]types.Event, 0, 10),
                    Conditions: make([]types.Condition, 0, 5),
                }
            },
        },
    }
}

func (p *statusPool) Get() *types.Status {
    status := p.pool.Get().(*types.Status)
    // Reset for reuse
    status.Source = ""
    status.Events = status.Events[:0]
    status.Conditions = status.Conditions[:0]
    return status
}

func (p *statusPool) Put(status *types.Status) {
    p.pool.Put(status)
}
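
A sketch of how a monitor might use the pool above; ownership travels with the status, so Put() belongs wherever processing finishes (m.output and m.config.Source follow the monitor pattern used elsewhere in this guide):

func (m *monitor) emitPooled(pool *statusPool, events []types.Event) {
    status := pool.Get()
    status.Source = m.config.Source
    status.Events = append(status.Events, events...)

    m.output <- status
    // Whoever finishes with the status must call pool.Put(status); returning it
    // while an exporter still holds the pointer would reintroduce races.
}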

Memory Management

Preventing Memory Leaks

func (m *monitor) monitorLoop() {
    defer func() {
        // Cleanup resources
        close(m.output)
        m.tomb.Done()
    }()

    // Use bounded channels and proper cleanup
    ticker := time.NewTicker(m.config.Interval)
    defer ticker.Stop()

    for {
        select {
        case <-ticker.C:
            status := m.generateStatus()
            select {
            case m.output <- status:
                // Status sent successfully
            default:
                // Channel full - log warning but don't block
                klog.Warning("Status channel full, dropping status")
            }

        case <-m.tomb.Stopping():
            return
        }
    }
}

String Optimization

// Avoid string concatenation in hot paths
func (m *monitor) formatMessage(template string, args ...interface{}) string {
    var buf strings.Builder
    buf.Grow(len(template) + 100) // Pre-allocate capacity

    fmt.Fprintf(&buf, template, args...)
    return buf.String()
}

// Use byte slices for log processing
func (l *logMonitor) processLogLine(line []byte) {
    // Process directly on byte slice to avoid string allocation
    if bytes.Contains(line, []byte("ERROR")) {
        // Only convert to string when necessary
        errorLine := string(line)
        l.handleError(errorLine)
    }
}

CPU Optimization

Efficient Pattern Matching

// Use efficient string matching for simple patterns
func (l *logMonitor) fastMatch(line string, pattern string) bool {
    switch pattern {
    case "OOM":
        return strings.Contains(line, "Out of memory")
    case "PANIC":
        return strings.Contains(line, "kernel panic")
    default:
        // Fall back to regex for complex patterns
        return l.compiledPatterns[pattern].MatchString(line)
    }
}

Parallel Processing

func (m *monitor) processRulesParallel(line string) []types.Event {
    if len(m.config.Rules) <= 1 {
        // Not worth parallelizing for small rule sets
        return m.processRulesSequential(line)
    }

    var wg sync.WaitGroup
    results := make(chan types.Event, len(m.config.Rules))

    for _, rule := range m.config.Rules {
        wg.Add(1)
        go func(r Rule) {
            defer wg.Done()
            if event := m.processRule(line, r); event != nil {
                results <- *event
            }
        }(rule)
    }

    // Close channel when all goroutines complete
    go func() {
        wg.Wait()
        close(results)
    }()

    var events []types.Event
    for event := range results {
        events = append(events, event)
    }

    return events
}
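
For completeness, the sequential path referenced above could look like this (assuming the same processRule helper):

func (m *monitor) processRulesSequential(line string) []types.Event {
    var events []types.Event
    for _, rule := range m.config.Rules {
        if event := m.processRule(line, rule); event != nil {
            events = append(events, *event)
        }
    }
    return events
}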

I/O Optimization

Efficient Log Reading

func (l *logMonitor) readLogs() error {
    // Use buffered I/O for better performance
    file, err := os.Open(l.config.LogPath)
    if err != nil {
        return err
    }
    defer file.Close()

    // Large buffer for reading
    scanner := bufio.NewScanner(file)
    scanner.Buffer(make([]byte, 64*1024), 1024*1024) // 1MB max line

    for scanner.Scan() {
        line := scanner.Bytes() // Get bytes directly
        l.processLogLine(line)
    }

    return scanner.Err()
}

Batch Processing

func (m *monitor) processBatch(items []LogItem) {
    const batchSize = 100

    for i := 0; i < len(items); i += batchSize {
        end := i + batchSize
        if end > len(items) {
            end = len(items)
        }

        batch := items[i:end]
        m.processBatchItems(batch)

        // Yield CPU periodically
        if i%1000 == 0 {
            runtime.Gosched()
        }
    }
}

Configuration Tuning

Optimal Configuration Settings

{
  "pluginConfig": {
    "invoke_interval": "30s",     // Balance between responsiveness and overhead
    "timeout": "10s",             // Prevent hanging helper scripts
    "max_output_length": 200,     // Limit memory per helper execution
    "concurrency": 3,             // Match CPU cores for I/O bound tasks
    "enable_message_change_based_condition_update": false  // Reduce noise
  },
  "bufferSize": 50,               // Match expected log volume
  "lookback": "5m"                // Reasonable startup history
}

Note: the inline // comments above are annotations for this guide; JSON itself does not allow comments, so strip them from the real configuration file.

Resource Limits

# Kubernetes resource limits for NPD
resources:
  requests:
    cpu: 100m        # Baseline CPU
    memory: 128Mi    # Baseline memory
  limits:
    cpu: 500m        # Burst capacity
    memory: 512Mi    # Prevent OOM

Monitoring Performance

Built-in Metrics

// Add performance metrics to monitors
type performanceMetrics struct {
    processedLines    *prometheus.CounterVec
    processingLatency *prometheus.HistogramVec
    memoryUsage      *prometheus.GaugeVec
}

func (m *monitor) recordMetrics(duration time.Duration, lineCount int) {
    m.metrics.processedLines.WithLabelValues(m.config.Source).Add(float64(lineCount))
    m.metrics.processingLatency.WithLabelValues(m.config.Source).Observe(duration.Seconds())
}

Performance Benchmarks

# Monitor resource usage
kubectl top pods -l name=node-problem-detector

# Check memory usage over time
kubectl logs -l name=node-problem-detector | grep "memory usage"

# Monitor processing latency
curl http://node:20257/metrics | grep processing_latency

Common Performance Issues

  1. Unbuffered Channels: Can cause goroutine blocking
  2. Large Regex Sets: Pre-compile and optimize patterns
  3. String Operations: Use byte slices in hot paths
  4. Memory Leaks: Ensure proper cleanup in defer statements
  5. Excessive Logging: Use appropriate log levels
  6. I/O Blocking: Use timeouts and buffered I/O

Following these performance tuning guidelines ensures NPD operates efficiently even under high load conditions.

Plugin Registration & Lifecycle

Registration Pattern

NPD uses a centralized registry pattern with Go's init() function for plugin registration:

// pkg/problemdaemon/problem_daemon.go
var handlers = make(map[types.ProblemDaemonType]types.ProblemDaemonHandler)

func Register(problemDaemonType types.ProblemDaemonType, handler types.ProblemDaemonHandler) {
    handlers[problemDaemonType] = handler
}

Each plugin registers itself during package initialization using string constants:

// pkg/systemlogmonitor/log_monitor.go
const SystemLogMonitorName = "system-log-monitor"

func init() {
    problemdaemon.Register(
        SystemLogMonitorName,
        types.ProblemDaemonHandler{
            CreateProblemDaemonOrDie: NewLogMonitorOrDie,
            CmdOptionDescription:     "Set to config file paths.",
        })
}

All three monitor types use this pattern with their respective constants:

  • SystemLogMonitorName = "system-log-monitor"
  • CustomPluginMonitorName = "custom-plugin-monitor"
  • SystemStatsMonitorName = "system-stats-monitor"

Plugin Import Mechanism

Plugins are imported through separate files with build tags that use blank imports to trigger registration. Each plugin has its own file to enable selective compilation:

// cmd/nodeproblemdetector/problemdaemonplugins/custom_plugin_monitor_plugin.go
//go:build !disable_custom_plugin_monitor
// +build !disable_custom_plugin_monitor

package problemdaemonplugins

import (
    _ "k8s.io/node-problem-detector/pkg/custompluginmonitor"
)

// cmd/nodeproblemdetector/problemdaemonplugins/system_log_monitor_plugin.go
//go:build !disable_system_log_monitor
// +build !disable_system_log_monitor

package problemdaemonplugins

import (
    _ "k8s.io/node-problem-detector/pkg/systemlogmonitor"
)

// cmd/nodeproblemdetector/problemdaemonplugins/system_stats_monitor_plugin.go
//go:build !disable_system_stats_monitor
// +build !disable_system_stats_monitor

package problemdaemonplugins

import (
    _ "k8s.io/node-problem-detector/pkg/systemstatsmonitor"
)

Build-time Plugin Selection

Plugins can be selectively excluded using build tags. NPD uses both modern and legacy build tag syntax for compatibility:

# Disable specific plugins
BUILD_TAGS="disable_custom_plugin_monitor disable_stackdriver_exporter" make

# Only include system log monitor
BUILD_TAGS="disable_custom_plugin_monitor disable_system_stats_monitor" make

# Enable journald support (opt-in)
BUILD_TAGS="journald" make

Available Build Tags

  • disable_system_log_monitor - Exclude system log monitor
  • disable_custom_plugin_monitor - Exclude custom plugin monitor
  • disable_system_stats_monitor - Exclude system stats monitor
  • disable_stackdriver_exporter - Exclude Stackdriver exporter
  • journald - Enable journald log watcher (requires systemd libraries)

Build Tag Syntax

NPD uses both formats for Go version compatibility:

//go:build !disable_custom_plugin_monitor    // Modern format (Go 1.17+)
// +build !disable_custom_plugin_monitor     // Legacy format (compatibility)

Verifying Build Configuration

# Check what's included in a build
go list -tags "disable_custom_plugin_monitor" ./...

# Verbose build to see what's excluded
go build -v -x -tags "disable_system_stats_monitor"

Lifecycle Management

The Problem Detector orchestrates plugin lifecycle through a robust, channel-based architecture with error handling:

// Actual implementation from pkg/problemdetector/problem_detector.go
func (p *problemDetector) Run(ctx context.Context) error {
    // Start the log monitors one by one.
    var chans []<-chan *types.Status
    failureCount := 0
    for _, m := range p.monitors {
        ch, err := m.Start()
        if err != nil {
            // Do not return error and keep on trying the following config files.
            klog.Errorf("Failed to start problem daemon %v: %v", m, err)
            failureCount++
            continue
        }
        if ch != nil {
            chans = append(chans, ch)
        }
    }
    allMonitors := p.monitors

    if len(allMonitors) == failureCount {
        return fmt.Errorf("no problem daemon is successfully setup")
    }

    defer func() {
        for _, m := range allMonitors {
            m.Stop()
        }
    }()

    ch := groupChannel(chans)
    klog.Info("Problem detector started")

    for {
        select {
        case <-ctx.Done():
            return nil
        case status := <-ch:
            for _, exporter := range p.exporters {
                exporter.ExportProblems(status)
            }
        }
    }
}

Key aspects of the actual implementation:

  • Error Handling: Logs errors and continues, doesn't silently skip failed monitors
  • Failure Counting: Returns error only when ALL monitors fail to start
  • Cleanup: Uses defer to ensure Stop() is called on all monitors
  • Graceful Degradation: Continues operation with partially failed monitor setup
  • Nil Channel Filtering: Metrics-only monitors return nil channels which are excluded
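
The groupChannel helper used in Run fans the per-monitor channels into a single channel. A minimal fan-in sketch (not the exact NPD implementation) looks like this:

func groupChannel(chans []<-chan *types.Status) <-chan *types.Status {
    statuses := make(chan *types.Status)
    for _, ch := range chans {
        go func(c <-chan *types.Status) {
            // Forward every status from this monitor to the merged channel.
            for status := range c {
                statuses <- status
            }
        }(ch)
    }
    return statuses
}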

Implementing Custom Monitor Types

This section provides a comprehensive guide for advanced developers who need to create entirely new monitor types from scratch, extending NPD's core monitoring capabilities beyond the existing System Log, Custom Plugin, and System Stats monitors.

Understanding Monitor Types

NPD supports three core monitor types out of the box:

  1. System Log Monitor: Parses system logs (syslog, journald) using configurable patterns
  2. Custom Plugin Monitor: Executes external scripts/binaries for custom monitoring
  3. System Stats Monitor: Collects system metrics (CPU, memory, disk) for Prometheus export

When these don't meet your needs, you can implement a completely new monitor type.

When to Create a New Monitor Type

Consider implementing a new monitor type when:

  • Unique Data Sources: You need to monitor data sources not covered by existing monitors (e.g., hardware sensors, network interfaces, container runtimes)
  • Complex Logic: The monitoring logic is too complex for script-based custom plugins
  • Performance Requirements: You need high-performance monitoring with minimal overhead
  • Integration Needs: You need deep integration with NPD's lifecycle and status reporting

Architecture Overview

A new monitor type consists of several components:

// Core interface that all monitors must implement
type Monitor interface {
    Start() (<-chan *types.Status, error)
    Stop()
}

// Configuration parser for your monitor type
type ConfigParser interface {
    Parse(config []byte) (interface{}, error)
}

// Factory function for creating monitor instances
type CreateMonitorFunc func(config interface{}) Monitor

Implementation Steps

Step 1: Define the Monitor Structure

Create a new package for your monitor type:

// pkg/monitors/networkmonitor/network_monitor.go
package networkmonitor

import (
    "context"
    "sync"
    "time"

    "k8s.io/node-problem-detector/pkg/types"
    "k8s.io/node-problem-detector/pkg/util/tomb"
)

type networkMonitor struct {
    config        *NetworkMonitorConfig
    conditions    []types.Condition
    tomb          *tomb.Tomb
    output        chan *types.Status
    mutex         sync.RWMutex

    // Monitor-specific fields
    interfaces    map[string]*InterfaceState
    lastCheck     time.Time
}

type NetworkMonitorConfig struct {
    Source           string        `json:"source"`
    MetricsReporting bool          `json:"metricsReporting"`
    CheckInterval    time.Duration `json:"checkInterval"`
    Interfaces       []string      `json:"interfaces"`

    // Monitor-specific configuration
    PacketLossThreshold    float64 `json:"packetLossThreshold"`
    BandwidthThreshold     int64   `json:"bandwidthThreshold"`
    ErrorRateThreshold     float64 `json:"errorRateThreshold"`
}

type InterfaceState struct {
    Name         string
    PacketsSent  uint64
    PacketsRecv  uint64
    Errors       uint64
    LastUpdated  time.Time
}

Step 2: Implement the Monitor Interface

func (nm *networkMonitor) Start() (<-chan *types.Status, error) {
    klog.Info("Starting network monitor")

    // Initialize monitoring state
    if err := nm.initialize(); err != nil {
        return nil, fmt.Errorf("failed to initialize network monitor: %v", err)
    }

    // Start the monitoring goroutine
    go nm.monitorLoop()

    return nm.output, nil
}

func (nm *networkMonitor) Stop() {
    klog.Info("Stopping network monitor")
    nm.tomb.Stop()
}

func (nm *networkMonitor) monitorLoop() {
    defer nm.tomb.Done()
    defer close(nm.output)

    ticker := time.NewTicker(nm.config.CheckInterval)
    defer ticker.Stop()

    for {
        select {
        case <-ticker.C:
            nm.checkNetworkHealth()
        case <-nm.tomb.Stopping():
            klog.Info("Network monitor stopping")
            return
        }
    }
}

func (nm *networkMonitor) checkNetworkHealth() {
    nm.mutex.Lock()
    defer nm.mutex.Unlock()

    var problems []types.Event
    var conditions []types.Condition

    // Check each configured interface
    for _, ifName := range nm.config.Interfaces {
        if state, exists := nm.interfaces[ifName]; exists {
            // Perform network health checks
            if issues := nm.analyzeInterface(state); len(issues) > 0 {
                problems = append(problems, issues...)
                conditions = append(conditions, nm.createCondition(ifName, issues))
            }
        }
    }

    // Send status update if problems detected
    if len(problems) > 0 || len(conditions) > 0 {
        status := &types.Status{
            Source:     nm.config.Source,
            Events:     problems,
            Conditions: conditions,
        }

        select {
        case nm.output <- status:
        case <-nm.tomb.Stopping():
            return
        }
    }
}

func (nm *networkMonitor) analyzeInterface(state *InterfaceState) []types.Event {
    var events []types.Event

    // Read current network statistics
    stats, err := nm.readInterfaceStats(state.Name)
    if err != nil {
        events = append(events, types.Event{
            Severity:  types.Warn,
            Timestamp: time.Now(),
            Reason:    "NetworkInterfaceReadError",
            Message:   fmt.Sprintf("Failed to read stats for interface %s: %v", state.Name, err),
        })
        return events
    }

    // Calculate error rate
    totalPackets := stats.PacketsSent + stats.PacketsRecv
    if totalPackets > 0 {
        errorRate := float64(stats.Errors) / float64(totalPackets)
        if errorRate > nm.config.ErrorRateThreshold {
            events = append(events, types.Event{
                Severity:  types.Warn,
                Timestamp: time.Now(),
                Reason:    "HighNetworkErrorRate",
                Message:   fmt.Sprintf("Interface %s error rate %.2f%% exceeds threshold %.2f%%",
                    state.Name, errorRate*100, nm.config.ErrorRateThreshold*100),
            })
        }
    }

    // Update state for next check
    state.PacketsSent = stats.PacketsSent
    state.PacketsRecv = stats.PacketsRecv
    state.Errors = stats.Errors
    state.LastUpdated = time.Now()

    return events
}
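
The initialize, readInterfaceStats, and createCondition helpers are left to the implementer. As one possibility, readInterfaceStats could read the Linux sysfs counters (the sysfs paths are standard, but the helper itself is a sketch; os, path/filepath, strconv, and strings imports assumed, with field names matching the usage above):

type interfaceStats struct {
    PacketsSent uint64
    PacketsRecv uint64
    Errors      uint64
}

func (nm *networkMonitor) readInterfaceStats(name string) (*interfaceStats, error) {
    readCounter := func(stat string) (uint64, error) {
        raw, err := os.ReadFile(filepath.Join("/sys/class/net", name, "statistics", stat))
        if err != nil {
            return 0, err
        }
        return strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
    }

    tx, err := readCounter("tx_packets")
    if err != nil {
        return nil, err
    }
    rx, err := readCounter("rx_packets")
    if err != nil {
        return nil, err
    }
    txErrs, err := readCounter("tx_errors")
    if err != nil {
        return nil, err
    }
    rxErrs, err := readCounter("rx_errors")
    if err != nil {
        return nil, err
    }
    return &interfaceStats{PacketsSent: tx, PacketsRecv: rx, Errors: txErrs + rxErrs}, nil
}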

Step 3: Create Configuration Parser

// pkg/monitors/networkmonitor/config.go
package networkmonitor

import (
    "encoding/json"
    "fmt"
    "time"
)

type NetworkConfigParser struct{}

func (p *NetworkConfigParser) Parse(config []byte) (interface{}, error) {
    var networkConfig NetworkMonitorConfig

    if err := json.Unmarshal(config, &networkConfig); err != nil {
        return nil, fmt.Errorf("failed to parse network monitor config: %v", err)
    }
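
    // Note: encoding/json cannot decode a string like "30s" into a time.Duration
    // field; supply checkInterval as integer nanoseconds, or give the config
    // type a custom UnmarshalJSON that parses duration strings.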

    // Set defaults
    if networkConfig.CheckInterval == 0 {
        networkConfig.CheckInterval = 30 * time.Second
    }
    if networkConfig.ErrorRateThreshold == 0 {
        networkConfig.ErrorRateThreshold = 0.01 // 1%
    }
    if len(networkConfig.Interfaces) == 0 {
        networkConfig.Interfaces = []string{"eth0"} // Default interface
    }

    // Validate configuration
    if err := p.validateConfig(&networkConfig); err != nil {
        return nil, err
    }

    return &networkConfig, nil
}

func (p *NetworkConfigParser) validateConfig(config *NetworkMonitorConfig) error {
    if config.Source == "" {
        return fmt.Errorf("source field is required")
    }
    if config.CheckInterval < time.Second {
        return fmt.Errorf("checkInterval must be at least 1 second")
    }
    if config.ErrorRateThreshold < 0 || config.ErrorRateThreshold > 1 {
        return fmt.Errorf("errorRateThreshold must be between 0 and 1")
    }
    return nil
}

Step 4: Implement Monitor Factory

// pkg/monitors/networkmonitor/factory.go
package networkmonitor

import (
    "fmt"

    "k8s.io/node-problem-detector/pkg/types"
    "k8s.io/node-problem-detector/pkg/util/tomb"
)

func CreateNetworkMonitor(config interface{}) types.Monitor {
    networkConfig, ok := config.(*NetworkMonitorConfig)
    if !ok {
        panic(fmt.Sprintf("invalid config type for network monitor: %T", config))
    }

    return &networkMonitor{
        config:     networkConfig,
        conditions: []types.Condition{},
        tomb:       tomb.NewTomb(),
        output:     make(chan *types.Status),
        interfaces: make(map[string]*InterfaceState),
    }
}

Step 5: Register the Monitor Type

// pkg/monitors/networkmonitor/register.go
package networkmonitor

import (
    "k8s.io/node-problem-detector/pkg/problemdaemon"
    "k8s.io/node-problem-detector/pkg/types"
)

const NetworkMonitorName = "network-monitor"

func init() {
    // Register the monitor type with NPD using the same handler pattern as the
    // built-in monitors (see Plugin Registration & Lifecycle above).
    problemdaemon.Register(
        NetworkMonitorName,
        types.ProblemDaemonHandler{
            CreateProblemDaemonOrDie: NewNetworkMonitorOrDie,
            CmdOptionDescription:     "Set to network monitor config file paths.",
        })
}
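
A sketch of the NewNetworkMonitorOrDie factory referenced above, wiring the config parser from Step 3 to the constructor from Step 4 (reading the file and panicking on error mirrors the built-in monitors' "OrDie" convention):

package networkmonitor

import (
    "fmt"
    "os"

    "k8s.io/node-problem-detector/pkg/types"
)

func NewNetworkMonitorOrDie(configPath string) types.Monitor {
    data, err := os.ReadFile(configPath)
    if err != nil {
        panic(fmt.Sprintf("failed to read network monitor config %q: %v", configPath, err))
    }
    parsed, err := (&NetworkConfigParser{}).Parse(data)
    if err != nil {
        panic(fmt.Sprintf("failed to parse network monitor config %q: %v", configPath, err))
    }
    return CreateNetworkMonitor(parsed)
}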

Step 6: Integration with Build System

Add a dedicated plugin file with its own build tag, following the one-file-per-plugin convention described under Plugin Import Mechanism above:

// cmd/nodeproblemdetector/problemdaemonplugins/network_monitor_plugin.go
//go:build !disable_network_monitor
// +build !disable_network_monitor

package problemdaemonplugins

import (
    // Blank import triggers the init() registration of the network monitor.
    _ "k8s.io/node-problem-detector/pkg/monitors/networkmonitor"
)

Configuration Example

Create a configuration file for your new monitor:

{
  "plugin": "network",
  "pluginConfig": {
    "invoke_interval": "30s",
    "timeout": "5s",
    "max_output_length": 80,
    "concurrency": 3,
    "enable_message_change_based_condition_update": false
  },
  "source": "network-monitor",
  "metricsReporting": true,
  "conditions": [
    {
      "type": "NetworkNotReady",
      "reason": "NetworkInterfaceDown",
      "message": "Network interface is down"
    },
    {
      "type": "NetworkNotReady",
      "reason": "HighNetworkErrorRate",
      "message": "Network interface error rate is high"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "NetworkNotReady",
      "reason": "NetworkInterfaceDown",
      "pattern": "Interface .* is down"
    },
    {
      "type": "permanent",
      "condition": "NetworkNotReady",
      "reason": "HighNetworkErrorRate",
      "pattern": "Interface .* error rate .* exceeds threshold"
    }
  ],

  // Monitor-specific configuration
  "checkInterval": "30s",
  "interfaces": ["eth0", "eth1"],
  "packetLossThreshold": 0.05,
  "bandwidthThreshold": 1000000000,
  "errorRateThreshold": 0.01
}

Testing Your Monitor

Create comprehensive tests for your monitor:

// pkg/monitors/networkmonitor/network_monitor_test.go
package networkmonitor

import (
    "testing"
    "time"

    "k8s.io/node-problem-detector/pkg/types"
    "k8s.io/node-problem-detector/pkg/util/tomb"
)

func TestNetworkMonitor_HighErrorRate(t *testing.T) {
    config := &NetworkMonitorConfig{
        Source:             "test-network-monitor",
        CheckInterval:      100 * time.Millisecond,
        Interfaces:         []string{"test0"},
        ErrorRateThreshold: 0.01,
    }

    monitor := &networkMonitor{
        config:     config,
        tomb:       tomb.NewTomb(),
        output:     make(chan *types.Status, 10),
        interfaces: make(map[string]*InterfaceState),
    }

    // Initialize test interface with high error rate
    monitor.interfaces["test0"] = &InterfaceState{
        Name:        "test0",
        PacketsSent: 1000,
        PacketsRecv: 1000,
        Errors:      50, // 2.5% error rate, above 1% threshold
    }

    statusChan, err := monitor.Start()
    if err != nil {
        t.Fatalf("Failed to start monitor: %v", err)
    }

    defer monitor.Stop()

    // Wait for status update
    select {
    case status := <-statusChan:
        if len(status.Events) == 0 {
            t.Error("Expected events for high error rate, got none")
        }

        found := false
        for _, event := range status.Events {
            if event.Reason == "HighNetworkErrorRate" {
                found = true
                break
            }
        }
        if !found {
            t.Error("Expected HighNetworkErrorRate event")
        }
    case <-time.After(200 * time.Millisecond):
        t.Error("Timeout waiting for status update")
    }
}

Integration Patterns

Metrics Integration

If your monitor produces metrics:

func (nm *networkMonitor) reportMetrics() {
    if !nm.config.MetricsReporting {
        return
    }

    for ifName, state := range nm.interfaces {
        // Report interface-specific metrics (RecordInterfacePackets and RecordInterfaceErrors are illustrative helpers, not NPD APIs)
        metrics.RecordInterfacePackets(ifName, state.PacketsSent, state.PacketsRecv)
        metrics.RecordInterfaceErrors(ifName, state.Errors)
    }
}

Problem Metrics Integration

Connect to the Problem Metrics Manager:

import "k8s.io/node-problem-detector/pkg/problemmetrics"

func (nm *networkMonitor) reportProblem(reason string) {
    if nm.config.MetricsReporting {
        problemmetrics.GlobalProblemMetricsManager.IncrementProblemCounter(reason, 1)
    }
}

Best Practices for Custom Monitors

  1. Resource Management: Always implement proper cleanup in Stop()
  2. Error Handling: Log errors but continue monitoring when possible
  3. Thread Safety: Use mutexes for shared state access
  4. Configuration Validation: Validate all configuration parameters
  5. Graceful Degradation: Handle partial failures in multi-component monitoring
  6. Testing: Create unit tests for all monitoring logic
  7. Documentation: Document configuration options and expected behavior

Common Patterns

State Management

type MonitorState struct {
    lastUpdate  time.Time
    counters    map[string]uint64
    mutex       sync.RWMutex
}

func (s *MonitorState) Update(key string, value uint64) {
    s.mutex.Lock()
    defer s.mutex.Unlock()
    s.counters[key] = value
    s.lastUpdate = time.Now()
}

Configuration Hot Reload

func (nm *networkMonitor) updateConfig(newConfig *NetworkMonitorConfig) error {
    nm.mutex.Lock()
    defer nm.mutex.Unlock()

    // Validate new configuration
    parser := &NetworkConfigParser{}
    if err := parser.validateConfig(newConfig); err != nil {
        return err
    }

    // Apply configuration changes
    nm.config = newConfig
    return nil
}

This comprehensive guide provides the foundation for implementing custom monitor types that integrate seamlessly with NPD's architecture while maintaining the same reliability and performance standards as the built-in monitors.

Problem Metrics Manager

The Problem Metrics Manager serves as the bridge between monitors and metric exporters, providing a standardized way to expose problem-related metrics without direct coupling between monitors and exporters.

Core Implementation

// pkg/problemmetrics/problem_metrics.go
type ProblemMetricsManager struct {
    problemCounter           metrics.Int64MetricInterface
    problemGauge             metrics.Int64MetricInterface
    problemTypeToReason      map[string]string
    problemTypeToReasonMutex sync.Mutex
}

// Key methods:
func (pmm *ProblemMetricsManager) SetProblemGauge(problemType, reason string, value bool) error
func (pmm *ProblemMetricsManager) IncrementProblemCounter(reason string, count int64) error

// Global singleton instance
var GlobalProblemMetricsManager *ProblemMetricsManager

Integration with Monitors

Monitors use the global metrics manager to report both events and conditions:

// Example from pkg/systemlogmonitor/log_monitor.go
if *l.config.EnableMetricsReporting {
    initializeProblemMetricsOrDie(l.config.Rules)
}

// When problems are detected:
problemmetrics.GlobalProblemMetricsManager.IncrementProblemCounter(event.Reason, 1)
problemmetrics.GlobalProblemMetricsManager.SetProblemGauge(condition.Type, condition.Reason, condition.Status == types.True)

Two Metric Types

  1. Problem Gauges: Track the current state of conditions

    • Value: true (condition present) or false (condition resolved)
    • Used for: Permanent problems like "KernelDeadlock", "OutOfDisk"
  2. Problem Counters: Count occurrences of events

    • Value: Incremented each time an event occurs
    • Used for: Temporary problems like "OOMKilling", "TaskHung"

Integration with Exporters

The metrics manager uses OpenCensus/OpenTelemetry under the hood, making metrics automatically available to:

  • Prometheus Exporter: Exposes metrics at /metrics endpoint
  • Stackdriver Exporter: Sends metrics to Google Cloud Monitoring
  • Future Metric Exporters: Automatic integration through OpenCensus

This architecture ensures that problem detection (monitors) remains decoupled from metric export (exporters) while providing a consistent interface for all monitoring integrations.

Troubleshooting

Given the complexity of plugin registration, build tags, and initialization, developers often encounter issues when working with NPD. This section covers common problems and their solutions.

Debug Flowchart

When encountering issues with NPD plugins, follow this systematic debugging approach:

┌─────────────────────────────────────┐
│ NPD Plugin Issue?                   │
└─────────────┬───────────────────────┘
              │
              ▼
┌─────────────────────────────────────┐
│ Does NPD start successfully?        │
├─────────────┬───────────────────────┤
│     NO      │         YES           │
└─────────────┼───────────────────────┘
              │                       │
              ▼                       ▼
┌─────────────────────────┐   ┌──────────────────────────┐
│ Build/Startup Issues    │   │ Is your plugin loaded?   │
│ • Check build tags      │   │ grep "Registered.*plugin"│
│ • Check dependencies    │   │ /var/log/npd.log        │
│ • Check system logs     │   └─────────┬────────────────┘
└─────────────────────────┘             │
                                        ▼
                              ┌─────────────────────────────┐
                              │ Plugin loaded?              │
                              ├──────────┬──────────────────┤
                              │    NO    │       YES        │
                              └──────────┼──────────────────┘
                                         │                  │
                                         ▼                  ▼
                              ┌─────────────────────┐   ┌─────────────────────────┐
                              │ Registration Issues │   │ Does plugin generate    │
                              │ • Check init() call │   │ status messages?        │
                              │ • Check build tags  │   │ Monitor output channel  │
                              │ • Check imports     │   └─────────┬───────────────┘
                              └─────────────────────┘             │
                                                                  ▼
                                                        ┌─────────────────────────┐
                                                        │ Status messages?        │
                                                        ├──────────┬──────────────┤
                                                        │    NO    │     YES      │
                                                        └──────────┼──────────────┘
                                                                   │              │
                                                                   ▼              ▼
                                                        ┌─────────────────┐   ┌─────────────────┐
                                                        │ Monitor Issues  │   │ Are conditions  │
                                                        │ • Check config  │   │ appearing in    │
                                                        │ • Check logic   │   │ kubectl?        │
                                                        │ • Check inputs  │   └─────────┬───────┘
                                                        └─────────────────┘             │
                                                                                        ▼
                                                                              ┌─────────────────┐
                                                                              │ Conditions?     │
                                                                              ├─────────┬───────┤
                                                                              │   NO    │  YES  │
                                                                              └─────────┼───────┘
                                                                                        │       │
                                                                                        ▼       ▼
                                                                              ┌─────────────┐  ┌─────────┐
                                                                              │ Exporter    │  │ SUCCESS │
                                                                              │ Issues      │  │ ✓       │
                                                                              │ • K8s API   │  └─────────┘
                                                                              │ • RBAC      │
                                                                              │ • Network   │
                                                                              └─────────────┘

Debugging Commands Reference

Quick Health Check

# 1. Check NPD pod status
kubectl get pods -n kube-system | grep node-problem-detector

# 2. Check recent logs
kubectl logs -n kube-system daemonset/node-problem-detector --tail=50

# 3. Check node conditions
kubectl describe node $(hostname) | grep -A 10 "Conditions:"

# 4. Check for custom conditions
kubectl get nodes -o json | jq '.items[].status.conditions[] | select(.type | contains("Custom"))'

Build Verification

# Check what plugins are compiled in
strings /usr/bin/node-problem-detector | grep -i "registered.*monitor"

# Verify build tags
go list -tags "$BUILD_TAGS" ./pkg/monitors/...

# Check plugin registration in logs
grep -i "register" /var/log/node-problem-detector.log

Runtime Monitoring

# Monitor status channel output (requires debug build)
kubectl logs -n kube-system node-problem-detector-xyz | grep "status.*channel"

# Check configuration loading
kubectl logs -n kube-system node-problem-detector-xyz | grep "config.*loaded"

# Monitor problem detection
kubectl logs -n kube-system node-problem-detector-xyz | grep -E "(problem|condition|event)"

Plugin Loading Issues

Issue: "My custom plugin isn't loading"

Symptoms: Plugin code exists but never runs, no registration messages in logs

Common Causes & Solutions:

  1. Missing Blank Import

    # Check if import exists in plugin files
    grep -r "your-plugin-package" cmd/nodeproblemdetector/problemdaemonplugins/

    Solution: Add blank import in appropriate plugin file:

    import _ "your-module/pkg/yourcustomplugin"
  2. Missing init() Function

    # Verify init() function exists and calls Register()
    grep -A 5 "func init()" pkg/yourplugin/

    Solution: Ensure init() function calls problemdaemon.Register()

  3. Wrong Build Tags

    # Check what's included in build
    go list -tags "$BUILD_TAGS" ./...

    Solution: Verify BUILD_TAGS environment variable doesn't exclude your plugin

Issue: "Build fails with undefined symbols"

Symptoms: Linker errors about missing functions, undefined references

Common Causes & Solutions:

  1. Missing Build Tags for Dependencies

    # Example: journald requires special build tag
    BUILD_TAGS="journald" make
  2. Missing System Dependencies

    # For journald support
    sudo apt-get install libsystemd-dev
  3. Platform-Specific Code Issues

    # Check for platform-specific files
    find . -name "*_unix.go" -o -name "*_windows.go"

Monitor Runtime Issues

Issue: "Monitor starts but doesn't report problems"

Symptoms: Monitor shows in logs but no events/conditions generated

Diagnostic Steps:

  1. Check Channel Return Value

    // Metrics-only monitors return a nil channel from Start():
    func (m *metricsOnlyMonitor) Start() (<-chan *types.Status, error) {
        return nil, nil
    }

    // Problem-detecting monitors must return a real channel and keep sending on it:
    func (m *monitor) Start() (<-chan *types.Status, error) {
        statusCh := make(chan *types.Status)
        go m.monitorLoop(statusCh)
        return statusCh, nil
    }
  2. Add Debug Logging

    // Add debug logs in monitor loop
    klog.V(2).Infof("Monitor loop iteration, checking conditions...")
  3. Test Configuration Patterns

    # For log monitors, test regex patterns
    echo "sample log line" | grep -E "your-regex-pattern"
  4. Check Metrics Reporting Flag

    // In configuration file
    {
      "metricsReporting": true,  // Enables problem metrics
      "rules": [...]
    }

Issue: "Exporter not receiving status updates"

Symptoms: Monitor detects problems but exporter doesn't process them

Diagnostic Steps:

  1. Check Exporter Initialization

    # Look for exporter startup logs
    kubectl logs <pod> | grep -i "exporter"
  2. Verify Problem Detector Startup

    # Look for successful startup message
    kubectl logs <pod> | grep "Problem detector started"
  3. Check Exporter List

    # Enable verbose logging to see exporter initialization
    /node-problem-detector -v=5

Configuration Issues

Issue: "Invalid configuration file"

Symptoms: Startup errors, configuration validation failures

Common Problems:

  1. JSON Syntax Errors

    # Validate JSON syntax
    cat config.json | jq .
  2. Wrong Plugin Type

    // Ensure the plugin type matches an available log watcher: kmsg, filelog, or journald
    {
      "plugin": "kmsg"
    }
  3. Invalid Regex Patterns

    # Test regex patterns
    echo "test log line" | grep -E "your-pattern-here"

Build and Development Issues

Issue: "Custom plugin code not executing"

Debug Techniques:

  1. Add Registration Logging

    func init() {
        klog.Info("Registering my custom plugin")  // Add this
        problemdaemon.Register(...)
    }
  2. Verify Plugin Handler

    # Check if handler is registered
    # Add debug code to list registered handlers
    go run -tags debug cmd/debug/list_handlers.go
  3. Check Factory Function

    func NewMyPluginOrDie(configPath string) types.Monitor {
        klog.Infof("Creating plugin with config: %s", configPath)  // Add this
        // ... rest of function
    }

Useful Debugging Commands

# Check what plugins are compiled in
go list -tags "$(echo $BUILD_TAGS)" ./...

# Verbose build to see compilation details
go build -v -x -tags "$(echo $BUILD_TAGS)"

# Run with verbose logging
/node-problem-detector -v=5 --config.custom-plugin-monitor=config.json

# Check systemd status (for journald issues)
systemctl --version
systemctl is-active systemd-journald

# Test custom plugin scripts directly
/path/to/your/script && echo "Exit code: $?"

Performance Debugging

Issue: "High CPU or memory usage"

Investigation Steps:

  1. Profile the Application

    # Profile CPU via the pprof endpoint (requires net/http/pprof to be enabled in the binary)
    go tool pprof http://localhost:6060/debug/pprof/profile
  2. Check Regex Compilation

    // Pre-compile regex patterns
    var compiledPattern = regexp.MustCompile("your-pattern")
  3. Monitor Channel Buffers

    // Add buffered channels for high-throughput scenarios
    statusCh := make(chan *types.Status, 100)
  4. Check Log Volume

    # Monitor log file growth
    tail -f /var/log/messages | wc -l

When debugging NPD plugin issues, start with the most common causes (missing imports, build tags, configuration syntax) and work your way through the more complex scenarios. The verbose logging flag (-v=5) is particularly helpful for understanding initialization flow.
