Investigation Report: Volume Migration Failures — NetApp REST Client Session Race

Summary

Sustained volume migration failures (InvalidVolume: volume is not assigned to a host) affecting 60,000+ operations over 7+ days. Root cause identified as a race condition in the NetApp REST client session handling. A fix has been deployed to QA.

Symptoms

  • Error: cinder.exception.InvalidVolume: Invalid volume: volume is not assigned to a host
  • Operation: volume_migrate (specifically native_cross_vc_migrate_volume → migrate_unattached → get_file_sizes_by_dir)
  • Error location: cinder/volume/volume_utils.py:791 (extract_host(None))
  • Duration: 7+ days of sustained failures
  • Scale: ≥60,000 failures across multiple pools and vCenters

Call Chain

cinder/volume/manager.py:2994 (migrate_volume)
  → cinder/volume/drivers/vmware/fcd.py:1163 (native_cross_vc_migrate_volume)
    → self.get_netapp_cinder_host(netapp_fqdn)
      → vmdk.py:3647 (_get_all_pools → filters for netapp_server_hostname)
      → Returns None when NetApp pool not found in cached get_pools() response
    → volumeops.py:2780 (migrate_unattached_qtree)
      → netapp_api.get_file_sizes_by_dir(context, host=None, path=...)
        → remote.py:50 (_get_cctxt(host=None))
          → volume_utils.extract_host(None) → RAISES InvalidVolume

Root Cause: NetApp REST Client Session Race Condition

The Bug

In cinder/volume/drivers/netapp/dataontap/client/api.py, the RestNaServer._build_session() method stored the session on self._session (a shared instance attribute). When multiple greenthreads made concurrent REST calls:

  1. Greenthread A calls _build_session(), sets self._session with SVM-A headers
  2. Greenthread B calls _build_session(), overwrites self._session with SVM-B headers
  3. Greenthread A uses self._session — now has wrong SVM headers
  4. REST call goes to wrong SVM → returns empty results → Volume not found → cascading failure

This caused the NetApp backend's REST client to fail intermittently, which could lead to:

  • Direct get_file_sizes_by_dir failures (the 0ms exceptions observed)
  • Capability reporting failures (making the backend disappear from scheduler's pool cache)
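
A minimal, self-contained sketch of the failure mode (illustrative only; FakeClient and its methods are stand-ins, not the actual driver code), showing how two eventlet greenthreads interleaving on a shared session attribute can leak one caller's headers into another:

# Illustrative reproduction of the shared-session race (not driver code).
import eventlet


class FakeClient(object):
    def __init__(self):
        self._session = None

    def _build_session(self, headers):
        self._session = {'headers': headers}   # shared instance attribute

    def send_request(self, svm):
        self._build_session({'X-Dot-SVM-Name': svm})
        eventlet.sleep(0)                       # yield, as real network I/O would
        return self._session['headers']['X-Dot-SVM-Name']


client = FakeClient()
threads = [eventlet.spawn(client.send_request, svm) for svm in ('svm-A', 'svm-B')]
print([t.wait() for t in threads])              # can print ['svm-B', 'svm-B']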

The Fix

# BEFORE (race-prone):
def _build_session(self, headers):
    self._session = requests.Session()
    self._session.auth = self._create_basic_auth_handler()
    self._session.verify = self._ssl_verify
    self._session.headers = headers

# In send_http_request:
    self._build_session(headers)
    request_method = self._get_request_method(method, self._session)

# AFTER (race-safe):
def _build_session(self, headers):
    session = requests.Session()
    session.auth = self._create_basic_auth_handler()
    session.verify = self._ssl_verify
    session.headers = headers
    return session  # local variable, no shared state

# In send_http_request:
    session = self._build_session(headers)
    request_method = self._get_request_method(method, session)

Each greenthread now gets its own isolated session object, eliminating the shared mutable state.

Verification

Scheduler was healthy during failures

  • The scheduler was receiving capability updates from all NetApp backends throughout the failure window
  • All backends reported netapp_server_hostname correctly
  • No services were disabled, frozen, or down
  • No RabbitMQ connectivity issues (0 MessagingTimeout for get_pools, 0 connection errors)
  • ~3,000 pools actively updating in the scheduler cache

Key evidence

  • Failures occurred at 0ms (no RPC was ever attempted — get_netapp_cinder_host returned None immediately from cache)
  • The scheduler had all the data but the volume service's cached get_pools() response was missing NetApp pools
  • No message size limits being hit (rpc_response_timeout = 600, RabbitMQ default max 128MB)
  • Failures were sustained (not transient startup issues) — consistent with a race that occurs on every cache refresh

Pre-existing Issue: EMS Looping Task Vserver Race

During QA validation, a separate pre-existing race condition was identified in send_ems_log_message() (client_cmode_rest.py).

What is EMS logging?

EMS (Event Management System) is NetApp's autosupport mechanism. The _handle_ems_logging periodic task sends heartbeat messages to the ONTAP cluster that:

  1. Identifies the Cinder driver — records driver name, version, and app version so NetApp knows what software is managing their filer
  2. Reports pool/volume usage — includes which FlexVols are being managed, enabling NetApp support to correlate issues
  3. Enables NetApp support entitlement — NetApp uses these messages to verify the storage is used with a supported integration (important for support contracts)

It's purely telemetry for NetApp support tracking — no functional impact on Cinder operations.

The Race

Location: cinder/volume/drivers/netapp/dataontap/client/client_cmode_rest.py:858

def send_ems_log_message(self, message_dict):
    """Sends a message to the Data ONTAP EMS log."""
    body = { ... }

    bkp_connection = copy.copy(self.connection)
    bkp_timeout = self.connection.get_timeout()
    bkp_vserver = self.vserver

    self.connection.set_timeout(25)
    try:
        self.connection.set_vserver(
            self._get_ems_log_destination_vserver())   # ← MUTATES shared state
        self.send_request('/support/ems/application-logs', 'post', body=body)
    except netapp_api.NaApiError as e:
        LOG.warning('Failed to invoke EMS. %s', e)
    finally:
        timeout = (bkp_timeout if bkp_timeout is not None else DEFAULT_TIMEOUT)
        self.connection.set_timeout(timeout)
        self.connection = copy.copy(bkp_connection)
        self.connection.set_vserver(bkp_vserver)       # ← restores

How the shared state mutation works

self.connection is a single RestNaServer instance (defined in api.py) shared by all operations in the client. It holds self._vserver — the SVM name used for REST API tunneling:

# api.py:760
def set_vserver(self, vserver):
    """Set the vserver to use if tunneling gets enabled."""
    self._vserver = vserver

# api.py:764
def get_vserver(self):
    """Get the vserver to use in tunneling."""
    return self._vserver

When any REST request is made, invoke_successfully (api.py:862) builds headers by reading the current vserver:

# api.py:862
def invoke_successfully(self, action_url, method, body=None, query=None,
                        enable_tunneling=False):
    headers = self._build_headers(enable_tunneling)  # ← reads self._vserver HERE
    ...
    session = self._build_session(headers)            # ← session fix: local session
    ...

# api.py:819
def _build_headers(self, enable_tunneling):
    headers = {"Accept": "application/json", "Content-Type": "application/json"}
    if enable_tunneling:
        headers["X-Dot-SVM-Name"] = self.get_vserver()  # ← reads self._vserver
    return headers

The race sequence

  1. self.connection._vserver is normally set to the data SVM (e.g., svm_data_001)
  2. EMS task fires → calls self.connection.set_vserver("cluster_admin_svm") → mutates shared state
  3. Concurrent greenthread (e.g., _update_ssc → get_flexvol_capacity → _get_volume_by_args) calls self.connection.invoke_successfully(..., enable_tunneling=True)
  4. _build_headers() reads self.get_vserver() → gets "cluster_admin_svm" (WRONG)
  5. REST call goes to ONTAP with X-Dot-SVM-Name: cluster_admin_svm
  6. ONTAP queries the cluster admin SVM which has no FlexVols → returns 0 results
  7. _get_volume_by_args raises: VolumeBackendAPIException: Could not find unique volume. Volumes found: []
  8. This is caught by get_flexvol_capacity and re-raised as: NetAppDriverException: Volume /path not found
  9. After EMS completes, send_ems_log_message restores the original vserver in the finally block

The session fix does NOT help here because the race occurs in step 4 — _build_headers() reads the wrong _vserver value BEFORE creating the session. The session fix correctly isolates headers once they're built, but the headers themselves already contain the wrong X-Dot-SVM-Name.

Timing

The EMS task:

  • Fires immediately on startup (initial_delay=0 in nfs_base.py:116)
  • Runs every 1 hour thereafter
  • Temporarily mutates self.connection's vserver to the cluster-level vserver

If _update_ssc() (volume discovery) runs concurrently during startup, it reads the wrong vserver from self.connection, queries the wrong SVM, and gets Volume not found.

Evidence

ERROR oslo_service.service: Error starting thread.:
  NetAppDriverException: Volume /nfs_volume_ds02 not found.

VolumeBackendAPIException: Could not find unique volume. Volumes found: [].

Both NetApp backends failed on startup, couldn't find ANY of their volumes, but recovered on retry and are now running normally. This is not a regression from the session fix — it's a pre-existing startup race.

Suggested Fixes for EMS Vserver Race

Option A: Add initial_delay to EMS task (minimal, lowest risk)

# nfs_base.py
self.loopingcalls.add_task(
    self._handle_ems_logging,
    loopingcalls.ONE_HOUR,
    loopingcalls.ONE_MINUTE)  # Add 60s initial_delay

Gives _update_ssc() time to complete before the first EMS call. Simple, but it only mitigates the startup race; the hourly EMS call could still race with other operations.

Option B: Use a local connection copy in send_ems_log_message (recommended)

def send_ems_log_message(self, message_dict):
    # Create a completely independent connection for EMS
    ems_connection = copy.deepcopy(self.connection)
    ems_connection.set_vserver(self._get_ems_log_destination_vserver())
    ems_connection.set_timeout(25)
    # Send using the isolated connection — self.connection is never mutated
    ...

Eliminates the race entirely by never mutating shared state.
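
A slightly fuller sketch of Option B, assuming the copied connection can be invoked directly via the invoke_successfully signature shown earlier; the exact call and payload construction are assumptions, not verified driver code:

def send_ems_log_message(self, message_dict):
    """Sketch: send EMS telemetry without mutating self.connection."""
    body = dict(message_dict)   # payload built as in the existing method

    # Independent connection: changing its vserver/timeout cannot leak into
    # concurrent operations that still use self.connection.
    ems_connection = copy.deepcopy(self.connection)
    ems_connection.set_vserver(self._get_ems_log_destination_vserver())
    ems_connection.set_timeout(25)

    try:
        # Assumption: call the copied connection directly instead of going
        # through self.send_request (which reads self.connection).
        ems_connection.invoke_successfully(
            '/support/ems/application-logs', 'post',
            body=body, enable_tunneling=True)
    except netapp_api.NaApiError as e:
        LOG.warning('Failed to invoke EMS. %s', e)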

Option C: Lock around vserver mutation (safe but adds contention)

def send_ems_log_message(self, message_dict):
    with self._connection_lock:
        bkp_vserver = self.connection.get_vserver()
        self.connection.set_vserver(self._get_ems_log_destination_vserver())
        try:
            self.send_request(...)
        finally:
            self.connection.set_vserver(bkp_vserver)

Prevents concurrent access but adds lock contention.

Option D: Pass vserver as parameter to send_request (cleanest, most invasive)

Refactor send_request / _build_headers to accept an optional vserver override parameter instead of reading from self.connection. EMS would pass its destination vserver explicitly without mutating shared state.
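
A hedged sketch of the header-building side of Option D; vserver_override is a hypothetical parameter, and the plumbing through send_request / invoke_successfully is only outlined:

# api.py (sketch only): 'vserver_override' does not exist today.
def _build_headers(self, enable_tunneling, vserver_override=None):
    headers = {"Accept": "application/json",
               "Content-Type": "application/json"}
    if enable_tunneling:
        # Prefer an explicit per-call vserver over the shared attribute.
        headers["X-Dot-SVM-Name"] = vserver_override or self.get_vserver()
    return headers

invoke_successfully and send_request would thread the override down to _build_headers, and send_ems_log_message would pass self._get_ems_log_destination_vserver() explicitly instead of calling set_vserver().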

Test Plan

Objective

Verify the session race fix prevents InvalidVolume: volume is not assigned to a host under concurrent cross-vc migrations.

Steps

# 1. Create 10 volumes concurrently on vmware_fcd backend
for i in $(seq 1 10); do
  openstack volume create --size 1 --type vmware --availability-zone <az> test-race-$i &
done
wait

# 2. Verify creation
openstack volume list --name test-race -c ID -c Status -c Host

# 3. Migrate all concurrently to a different vCenter (cross-vc triggers the code path)
for vol_id in $(openstack volume list --name test-race -f value -c ID); do
  cinder migrate $vol_id <destination-pool-on-different-vc> &
done
wait

# 4. Wait and check status
sleep 60
openstack volume list --name test-race -c ID -c Status -c "Migration Status"

# 5. Verify no errors in logs
# Check for "not assigned to a host" errors
# Check for get_file_sizes_by_dir exceptions
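# e.g. (log paths are deployment-specific assumptions):
#   grep -c "not assigned to a host" /var/log/cinder/cinder-volume.log
#   grep -c "get_file_sizes_by_dir"  /var/log/cinder/cinder-volume.log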

# 6. Cleanup
for vol_id in $(openstack volume list --name test-race -f value -c ID); do
  openstack volume delete $vol_id
done

Success Criteria

  • All migrations complete with migration_status=success
  • Zero InvalidVolume: volume is not assigned to a host errors
  • Zero get_file_sizes_by_dir exceptions

Recommendations

  1. Deploy session fix to production after successful QA validation
  2. Address the EMS vserver race separately (Option A or B above) — file as a follow-up issue
  3. Consider adding a retry with cache refresh in get_netapp_cinder_host() — if None is returned, force _get_all_pools(refresh=True) once before failing (defense-in-depth)
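
A minimal sketch of recommendation 3; _lookup_netapp_host is a hypothetical helper standing in for the existing filtering logic, and the refresh=True keyword mirrors the recommendation's wording rather than a confirmed _get_all_pools signature:

def get_netapp_cinder_host(self, netapp_fqdn):
    """Sketch: retry once with a forced pool-cache refresh before giving up."""
    host = self._lookup_netapp_host(netapp_fqdn, self._get_all_pools())
    if host is None:
        # The cached pool list may be stale after a transient REST failure;
        # refresh once before failing ('refresh=True' is assumed, not verified).
        host = self._lookup_netapp_host(
            netapp_fqdn, self._get_all_pools(refresh=True))
    return host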