Investigation Report: Volume Migration Failures — NetApp REST Client Session Race

Summary

Sustained volume migration failures (InvalidVolume: volume is not assigned to a host) affecting 60,000+ operations over 7+ days. Root cause identified as a race condition in the NetApp REST client session handling. A fix has been deployed to QA.

Symptoms

  • Error: cinder.exception.InvalidVolume: Invalid volume: volume is not assigned to a host
  • Operation: volume_migrate (specifically native_cross_vc_migrate_volume → migrate_unattached → get_file_sizes_by_dir)
  • Error location: cinder/volume/volume_utils.py:791 (extract_host(None))
  • Duration: 7+ days of sustained failures
  • Scale: ≥60,000 failures across multiple pools and vCenters

Call Chain

cinder/volume/manager.py:2994 (migrate_volume)
  → cinder/volume/drivers/vmware/fcd.py:1163 (native_cross_vc_migrate_volume)
    → self.get_netapp_cinder_host(netapp_fqdn)
      → vmdk.py:3647 (_get_all_pools → filters for netapp_server_hostname)
      → Returns None when NetApp pool not found in cached get_pools() response
    → volumeops.py:2780 (migrate_unattached_qtree)
      → netapp_api.get_file_sizes_by_dir(context, host=None, path=...)
        → remote.py:50 (_get_cctxt(host=None))
          → volume_utils.extract_host(None) → RAISES InvalidVolume

Root Cause: NetApp REST Client Session Race Condition

The Bug

In cinder/volume/drivers/netapp/dataontap/client/api.py, the RestNaServer._build_session() method stored the session on self._session (a shared instance attribute). When multiple greenthreads made concurrent REST calls:

  1. Greenthread A calls _build_session(), sets self._session with SVM-A headers
  2. Greenthread B calls _build_session(), overwrites self._session with SVM-B headers
  3. Greenthread A uses self._session — now has wrong SVM headers
  4. REST call goes to wrong SVM → returns empty results → Volume not found → cascading failure

This caused the NetApp backend's REST client to fail intermittently, which could lead to:

  • Direct get_file_sizes_by_dir failures (the 0ms exceptions observed)
  • Capability reporting failures (making the backend disappear from scheduler's pool cache)
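
A minimal, self-contained sketch of the failure mode (illustrative only; FakeClient and its methods are stand-ins, not the actual driver code), showing how two eventlet greenthreads interleaving on a shared session attribute can leak one caller's headers into another:

# Illustrative reproduction of the shared-session race (not driver code).
import eventlet


class FakeClient(object):
    def __init__(self):
        self._session = None

    def _build_session(self, headers):
        self._session = {'headers': headers}   # shared instance attribute

    def send_request(self, svm):
        self._build_session({'X-Dot-SVM-Name': svm})
        eventlet.sleep(0)                       # yield, as real network I/O would
        return self._session['headers']['X-Dot-SVM-Name']


client = FakeClient()
threads = [eventlet.spawn(client.send_request, svm) for svm in ('svm-A', 'svm-B')]
print([t.wait() for t in threads])              # can print ['svm-B', 'svm-B']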

The Fix

# BEFORE (race-prone):
def _build_session(self, headers):
    self._session = requests.Session()
    self._session.auth = self._create_basic_auth_handler()
    self._session.verify = self._ssl_verify
    self._session.headers = headers

# In send_http_request:
    self._build_session(headers)
    request_method = self._get_request_method(method, self._session)

# AFTER (race-safe):
def _build_session(self, headers):
    session = requests.Session()
    session.auth = self._create_basic_auth_handler()
    session.verify = self._ssl_verify
    session.headers = headers
    return session  # local variable, no shared state

# In send_http_request:
    session = self._build_session(headers)
    request_method = self._get_request_method(method, session)

Each greenthread now gets its own isolated session object, eliminating the shared mutable state.

Verification

Scheduler was healthy during failures

  • The scheduler was receiving capability updates from all NetApp backends throughout the failure window
  • All backends reported netapp_server_hostname correctly
  • No services were disabled, frozen, or down
  • No RabbitMQ connectivity issues (0 MessagingTimeout for get_pools, 0 connection errors)
  • ~3,000 pools actively updating in the scheduler cache

Key evidence

  • Failures occurred at 0ms (no RPC was ever attempted — get_netapp_cinder_host returned None immediately from cache)
  • The scheduler had all the data but the volume service's cached get_pools() response was missing NetApp pools
  • No message size limits being hit (rpc_response_timeout = 600, RabbitMQ default max 128MB)
  • Failures were sustained (not transient startup issues) — consistent with a race that occurs on every cache refresh

Pre-existing Issue: EMS Looping Task Vserver Race

During QA validation, a separate pre-existing race condition was identified in send_ems_log_message() (client_cmode_rest.py).

What is EMS logging?

EMS (Event Management System) is NetApp's autosupport mechanism. The _handle_ems_logging periodic task sends heartbeat messages to the ONTAP cluster that:

  1. Identifies the Cinder driver — records driver name, version, and app version so NetApp knows what software is managing their filer
  2. Reports pool/volume usage — includes which FlexVols are being managed, enabling NetApp support to correlate issues
  3. Enables NetApp support entitlement — NetApp uses these messages to verify the storage is used with a supported integration (important for support contracts)

It's purely telemetry for NetApp support tracking — no functional impact on Cinder operations.

The Race

Location: cinder/volume/drivers/netapp/dataontap/client/client_cmode_rest.py:858

def send_ems_log_message(self, message_dict):
    """Sends a message to the Data ONTAP EMS log."""
    body = { ... }

    bkp_connection = copy.copy(self.connection)
    bkp_timeout = self.connection.get_timeout()
    bkp_vserver = self.vserver

    self.connection.set_timeout(25)
    try:
        self.connection.set_vserver(
            self._get_ems_log_destination_vserver())   # ← MUTATES shared state
        self.send_request('/support/ems/application-logs', 'post', body=body)
    except netapp_api.NaApiError as e:
        LOG.warning('Failed to invoke EMS. %s', e)
    finally:
        timeout = (bkp_timeout if bkp_timeout is not None else DEFAULT_TIMEOUT)
        self.connection.set_timeout(timeout)
        self.connection = copy.copy(bkp_connection)
        self.connection.set_vserver(bkp_vserver)       # ← restores

How the shared state mutation works

self.connection is a single RestNaServer instance (defined in api.py) shared by all operations in the client. It holds self._vserver — the SVM name used for REST API tunneling:

# api.py:760
def set_vserver(self, vserver):
    """Set the vserver to use if tunneling gets enabled."""
    self._vserver = vserver

# api.py:764
def get_vserver(self):
    """Get the vserver to use in tunneling."""
    return self._vserver

When any REST request is made, invoke_successfully (api.py:862) builds headers by reading the current vserver:

# api.py:862
def invoke_successfully(self, action_url, method, body=None, query=None,
                        enable_tunneling=False):
    headers = self._build_headers(enable_tunneling)  # ← reads self._vserver HERE
    ...
    session = self._build_session(headers)            # ← session fix: local session
    ...

# api.py:819
def _build_headers(self, enable_tunneling):
    headers = {"Accept": "application/json", "Content-Type": "application/json"}
    if enable_tunneling:
        headers["X-Dot-SVM-Name"] = self.get_vserver()  # ← reads self._vserver
    return headers

The race sequence

  1. self.connection._vserver is normally set to the data SVM (e.g., svm_data_001)
  2. EMS task fires → calls self.connection.set_vserver("cluster_admin_svm") → mutates shared state
  3. Concurrent greenthread (e.g., _update_ssc → get_flexvol_capacity → _get_volume_by_args) calls self.connection.invoke_successfully(..., enable_tunneling=True)
  4. _build_headers() reads self.get_vserver() → gets "cluster_admin_svm" (WRONG)
  5. REST call goes to ONTAP with X-Dot-SVM-Name: cluster_admin_svm
  6. ONTAP queries the cluster admin SVM which has no FlexVols → returns 0 results
  7. _get_volume_by_args raises: VolumeBackendAPIException: Could not find unique volume. Volumes found: []
  8. This is caught by get_flexvol_capacity and re-raised as: NetAppDriverException: Volume /path not found
  9. After EMS completes, send_ems_log_message restores the original vserver in the finally block

The session fix does NOT help here because the race occurs in step 4 — _build_headers() reads the wrong _vserver value BEFORE creating the session. The session fix correctly isolates headers once they're built, but the headers themselves already contain the wrong X-Dot-SVM-Name.

Timing

The EMS task:

  • Fires immediately on startup (initial_delay=0 in nfs_base.py:116)
  • Runs every 1 hour thereafter
  • Temporarily mutates self.connection's vserver to the cluster-level vserver

If _update_ssc() (volume discovery) runs concurrently during startup, it reads the wrong vserver from self.connection, queries the wrong SVM, and gets Volume not found.

Evidence

ERROR oslo_service.service: Error starting thread.:
  NetAppDriverException: Volume /nfs_volume_ds02 not found.

VolumeBackendAPIException: Could not find unique volume. Volumes found: [].

Both NetApp backends failed on startup, couldn't find ANY of their volumes, but recovered on retry and are now running normally. This is not a regression from the session fix — it's a pre-existing startup race.

Suggested Fixes for EMS Vserver Race

Option A: Add initial_delay to EMS task (minimal, lowest risk)

# nfs_base.py
self.loopingcalls.add_task(
    self._handle_ems_logging,
    loopingcalls.ONE_HOUR,
    loopingcalls.ONE_MINUTE)  # Add 60s initial_delay

Gives _update_ssc() time to complete before the first EMS call. Simple, but it only mitigates the startup race; the hourly EMS call could still race with other operations.

Option B: Use a local connection copy in send_ems_log_message (recommended)

def send_ems_log_message(self, message_dict):
    # Create a completely independent connection for EMS
    ems_connection = copy.deepcopy(self.connection)
    ems_connection.set_vserver(self._get_ems_log_destination_vserver())
    ems_connection.set_timeout(25)
    # Send using the isolated connection — self.connection is never mutated
    ...

Eliminates the race entirely by never mutating shared state.
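
A slightly fuller sketch of Option B, assuming the copied connection can be invoked directly via the invoke_successfully signature shown earlier; the exact call and payload construction are assumptions, not verified driver code:

def send_ems_log_message(self, message_dict):
    """Sketch: send EMS telemetry without mutating self.connection."""
    body = dict(message_dict)   # payload built as in the existing method

    # Independent connection: changing its vserver/timeout cannot leak into
    # concurrent operations that still use self.connection.
    ems_connection = copy.deepcopy(self.connection)
    ems_connection.set_vserver(self._get_ems_log_destination_vserver())
    ems_connection.set_timeout(25)

    try:
        # Assumption: call the copied connection directly instead of going
        # through self.send_request (which reads self.connection).
        ems_connection.invoke_successfully(
            '/support/ems/application-logs', 'post',
            body=body, enable_tunneling=True)
    except netapp_api.NaApiError as e:
        LOG.warning('Failed to invoke EMS. %s', e)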

Option C: Lock around vserver mutation (safe but adds contention)

def send_ems_log_message(self, message_dict):
    with self._connection_lock:
        bkp_vserver = self.connection.get_vserver()
        self.connection.set_vserver(self._get_ems_log_destination_vserver())
        try:
            self.send_request(...)
        finally:
            self.connection.set_vserver(bkp_vserver)

Prevents concurrent access but adds lock contention.

Option D: Pass vserver as parameter to send_request (cleanest, most invasive)

Refactor send_request / _build_headers to accept an optional vserver override parameter instead of reading from self.connection. EMS would pass its destination vserver explicitly without mutating shared state.
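
A hedged sketch of the header-building side of Option D; vserver_override is a hypothetical parameter, and the plumbing through send_request / invoke_successfully is only outlined:

# api.py (sketch only): 'vserver_override' does not exist today.
def _build_headers(self, enable_tunneling, vserver_override=None):
    headers = {"Accept": "application/json",
               "Content-Type": "application/json"}
    if enable_tunneling:
        # Prefer an explicit per-call vserver over the shared attribute.
        headers["X-Dot-SVM-Name"] = vserver_override or self.get_vserver()
    return headers

invoke_successfully and send_request would thread the override down to _build_headers, and send_ems_log_message would pass self._get_ems_log_destination_vserver() explicitly instead of calling set_vserver().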

Test Plan

Objective

Verify the session race fix prevents InvalidVolume: volume is not assigned to a host under concurrent cross-vc migrations.

Steps

# 1. Create 10 volumes concurrently on vmware_fcd backend
for i in $(seq 1 10); do
  openstack volume create --size 1 --type vmware --availability-zone <az> test-race-$i &
done
wait

# 2. Verify creation
openstack volume list --name test-race -c ID -c Status -c Host

# 3. Migrate all concurrently to a different vCenter (cross-vc triggers the code path)
for vol_id in $(openstack volume list --name test-race -f value -c ID); do
  cinder migrate $vol_id <destination-pool-on-different-vc> &
done
wait

# 4. Wait and check status
sleep 60
openstack volume list --name test-race -c ID -c Status -c "Migration Status"

# 5. Verify no errors in logs
# Check for "not assigned to a host" errors
# Check for get_file_sizes_by_dir exceptions
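# e.g. (log paths are deployment-specific assumptions):
#   grep -c "not assigned to a host" /var/log/cinder/cinder-volume.log
#   grep -c "get_file_sizes_by_dir"  /var/log/cinder/cinder-volume.log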

# 6. Cleanup
for vol_id in $(openstack volume list --name test-race -f value -c ID); do
  openstack volume delete $vol_id
done

Success Criteria

  • All migrations complete with migration_status=success
  • Zero InvalidVolume: volume is not assigned to a host errors
  • Zero get_file_sizes_by_dir exceptions

Recommendations

  1. Deploy session fix to production after successful QA validation
  2. Address the EMS vserver race separately (Option A or B above) — file as a follow-up issue
  3. Consider adding a retry with cache refresh in get_netapp_cinder_host() — if None is returned, force _get_all_pools(refresh=True) once before failing (defense-in-depth)
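
A minimal sketch of recommendation 3; _lookup_netapp_host is a hypothetical helper standing in for the existing filtering logic, and the refresh=True keyword mirrors the recommendation's wording rather than a confirmed _get_all_pools signature:

def get_netapp_cinder_host(self, netapp_fqdn):
    """Sketch: retry once with a forced pool-cache refresh before giving up."""
    host = self._lookup_netapp_host(netapp_fqdn, self._get_all_pools())
    if host is None:
        # The cached pool list may be stale after a transient REST failure;
        # refresh once before failing ('refresh=True' is assumed, not verified).
        host = self._lookup_netapp_host(
            netapp_fqdn, self._get_all_pools(refresh=True))
    return host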