Sustained volume migration failures (InvalidVolume: volume is not assigned to a host) affecting 60,000+ operations over 7+ days. Root cause identified as a race condition in the NetApp REST client session handling. A fix has been deployed to QA.
- Error: `cinder.exception.InvalidVolume: Invalid volume: volume is not assigned to a host`
- Operation: volume_migrate (specifically `native_cross_vc_migrate_volume` → `_migrate_unattached` → `get_file_sizes_by_dir`)
- Error location: `cinder/volume/volume_utils.py:791` (`extract_host(None)`)
- Duration: 7+ days of sustained failures
- Scale: ≥60,000 failures across multiple pools and vCenters
cinder/volume/manager.py:2994 (migrate_volume)
→ cinder/volume/drivers/vmware/fcd.py:1163 (native_cross_vc_migrate_volume)
→ self.get_netapp_cinder_host(netapp_fqdn)
→ vmdk.py:3647 (_get_all_pools → filters for netapp_server_hostname)
→ Returns None when NetApp pool not found in cached get_pools() response
→ volumeops.py:2780 (migrate_unattached_qtree)
→ netapp_api.get_file_sizes_by_dir(context, host=None, path=...)
→ remote.py:50 (_get_cctxt(host=None))
→ volume_utils.extract_host(None) → RAISES InvalidVolume
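For context, the guard that raises at the end of this chain looks roughly like the following — a paraphrased, self-contained sketch based on the error message and location above; the exact code in `volume_utils.py` may differ by release:

# Paraphrased sketch of cinder/volume/volume_utils.py:791 (not verbatim)
class InvalidVolume(Exception):
    def __init__(self, reason):
        super().__init__(f"Invalid volume: {reason}")

def extract_host(host, level='backend', default_pool_name=False):
    """Extract the backend portion of a 'host@backend#pool' string."""
    if host is None:
        # The failure mode in this incident: get_netapp_cinder_host()
        # returned None from the stale pool cache, and None flowed through
        # unchecked to this guard.
        raise InvalidVolume(reason="volume is not assigned to a host")
    return host.split('#')[0] if level == 'backend' else host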
In cinder/volume/drivers/netapp/dataontap/client/api.py, the RestNaServer._build_session() method stored the session on self._session (a shared instance attribute). When multiple greenthreads made concurrent REST calls:
- Greenthread A calls `_build_session()`, sets `self._session` with SVM-A headers
- Greenthread B calls `_build_session()`, overwrites `self._session` with SVM-B headers
- Greenthread A uses `self._session` — now has the wrong SVM headers
- The REST call goes to the wrong SVM → returns empty results → `Volume not found` → cascading failure
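The interleaving is easy to reproduce in isolation. Below is a minimal sketch — the `Client` class is a hypothetical stand-in, not the driver code — assuming eventlet-style cooperative scheduling, where the network I/O inside a REST call yields between building and using the session:

import eventlet

class Client:
    """Toy stand-in for RestNaServer's shared-session pattern."""
    def _build_session(self, headers):
        self._session = {'headers': headers}  # shared instance attribute

    def send(self, svm):
        self._build_session({'X-Dot-SVM-Name': svm})
        eventlet.sleep(0)  # simulates the I/O yield point inside a REST call
        return self._session['headers']['X-Dot-SVM-Name']

client = Client()
pool = eventlet.GreenPool()
print(list(pool.imap(client.send, ['SVM-A', 'SVM-B'])))
# Typically prints ['SVM-B', 'SVM-B']: greenthread A's request went out with
# greenthread B's headers, i.e. to the wrong SVM.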
This caused the NetApp backend's REST client to intermittently fail, which could cause:
- Direct `get_file_sizes_by_dir` failures (the 0ms exceptions observed)
- Capability reporting failures (making the backend disappear from the scheduler's pool cache)
# BEFORE (race-prone):
def _build_session(self, headers):
    self._session = requests.Session()
    self._session.auth = self._create_basic_auth_handler()
    self._session.verify = self._ssl_verify
    self._session.headers = headers

# In send_http_request:
self._build_session(headers)
request_method = self._get_request_method(method, self._session)

# AFTER (race-safe):
def _build_session(self, headers):
    session = requests.Session()
    session.auth = self._create_basic_auth_handler()
    session.verify = self._ssl_verify
    session.headers = headers
    return session  # local variable, no shared state

# In send_http_request:
session = self._build_session(headers)
request_method = self._get_request_method(method, session)

Each greenthread now gets its own isolated session object, eliminating the shared mutable state.
- The scheduler was receiving capability updates from all NetApp backends throughout the failure window
- All backends reported `netapp_server_hostname` correctly
- No services were disabled, frozen, or down
- No RabbitMQ connectivity issues (0 `MessagingTimeout` for `get_pools`, 0 connection errors)
- ~3,000 pools actively updating in the scheduler cache
- Failures occurred at 0ms (no RPC was ever attempted — `get_netapp_cinder_host` returned `None` immediately from cache)
- The scheduler had all the data, but the volume service's cached `get_pools()` response was missing NetApp pools
- No message size limits being hit (`rpc_response_timeout = 600`, RabbitMQ default max 128MB)
- Failures were sustained (not transient startup issues) — consistent with a race that occurs on every cache refresh
During QA validation, a separate pre-existing race condition was identified in send_ems_log_message() (client_cmode_rest.py).
EMS (Event Management System) is NetApp's autosupport mechanism. The _handle_ems_logging periodic task sends heartbeat messages to the ONTAP cluster that:
- Identifies the Cinder driver — records driver name, version, and app version so NetApp knows what software is managing their filer
- Reports pool/volume usage — includes which FlexVols are being managed, enabling NetApp support to correlate issues
- Enables NetApp support entitlement — NetApp uses these messages to verify the storage is used with a supported integration (important for support contracts)
It's purely telemetry for NetApp support tracking — no functional impact on Cinder operations.
Location: cinder/volume/drivers/netapp/dataontap/client/client_cmode_rest.py:858
def send_ems_log_message(self, message_dict):
    """Sends a message to the Data ONTAP EMS log."""
    body = { ... }
    bkp_connection = copy.copy(self.connection)
    bkp_timeout = self.connection.get_timeout()
    bkp_vserver = self.vserver
    self.connection.set_timeout(25)
    try:
        self.connection.set_vserver(
            self._get_ems_log_destination_vserver())  # ← MUTATES shared state
        self.send_request('/support/ems/application-logs', 'post', body=body)
    except netapp_api.NaApiError as e:
        LOG.warning('Failed to invoke EMS. %s', e)
    finally:
        timeout = (bkp_timeout if bkp_timeout is not None else DEFAULT_TIMEOUT)
        self.connection.set_timeout(timeout)
        self.connection = copy.copy(bkp_connection)
        self.connection.set_vserver(bkp_vserver)  # ← restores

`self.connection` is a single `RestNaServer` instance (defined in api.py) shared by all operations in the client. It holds `self._vserver` — the SVM name used for REST API tunneling:
# api.py:760
def set_vserver(self, vserver):
    """Set the vserver to use if tunneling gets enabled."""
    self._vserver = vserver

# api.py:764
def get_vserver(self):
    """Get the vserver to use in tunneling."""
    return self._vserver

When any REST request is made, `invoke_successfully` (api.py:862) builds headers by reading the current vserver:
# api.py:862
def invoke_successfully(self, action_url, method, body=None, query=None,
                        enable_tunneling=False):
    headers = self._build_headers(enable_tunneling)  # ← reads self._vserver HERE
    ...
    session = self._build_session(headers)  # ← session fix: local session
    ...

# api.py:819
def _build_headers(self, enable_tunneling):
    headers = {"Accept": "application/json", "Content-Type": "application/json"}
    if enable_tunneling:
        headers["X-Dot-SVM-Name"] = self.get_vserver()  # ← reads self._vserver
    return headers

The race unfolds as follows:
1. `self.connection._vserver` is normally set to the data SVM (e.g., `svm_data_001`)
2. The EMS task fires → calls `self.connection.set_vserver("cluster_admin_svm")` — mutates shared state
3. A concurrent greenthread (e.g., `_update_ssc` → `get_flexvol_capacity` → `_get_volume_by_args`) calls `self.connection.invoke_successfully(..., enable_tunneling=True)`
4. `_build_headers()` reads `self.get_vserver()` → gets `"cluster_admin_svm"` (WRONG)
5. The REST call goes to ONTAP with `X-Dot-SVM-Name: cluster_admin_svm`
6. ONTAP queries the cluster admin SVM, which has no FlexVols → returns 0 results
7. `_get_volume_by_args` raises: `VolumeBackendAPIException: Could not find unique volume. Volumes found: []`
8. This is caught by `get_flexvol_capacity` and re-raised as: `NetAppDriverException: Volume /path not found`
9. After EMS completes, `send_ems_log_message` restores the original vserver in the `finally` block
The session fix does NOT help here because the race occurs in step 4 — _build_headers() reads the wrong _vserver value BEFORE creating the session. The session fix correctly isolates headers once they're built, but the headers themselves already contain the wrong X-Dot-SVM-Name.
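This, too, can be shown in miniature. The sketch below uses a hypothetical `Connection` stand-in (not the driver code) and assumes eventlet scheduling; the shared vserver is baked into the headers before any session exists, so isolating the session cannot repair them:

import eventlet

class Connection:
    """Toy stand-in for the shared RestNaServer connection."""
    def __init__(self):
        self._vserver = 'svm_data_001'

    def set_vserver(self, vserver):
        self._vserver = vserver

    def invoke(self):
        headers = {'X-Dot-SVM-Name': self._vserver}  # step 4: reads shared state
        eventlet.sleep(0)                  # yield point inside the REST call
        session = {'headers': headers}     # per-call session, but headers stale
        return session['headers']['X-Dot-SVM-Name']

conn = Connection()
pool = eventlet.GreenPool()

def ems():
    conn.set_vserver('cluster_admin_svm')  # EMS retargets the shared connection
    eventlet.sleep(0)                      # EMS request "in flight"
    conn.set_vserver('svm_data_001')       # finally: restore

pool.spawn(ems)
print(pool.spawn(conn.invoke).wait())
# Typically prints 'cluster_admin_svm': the concurrent call tunnelled to the
# wrong SVM even though its session object was private.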
The EMS task:
- Fires immediately on startup (`initial_delay=0` in nfs_base.py:116)
- Runs every 1 hour thereafter
- Temporarily mutates `self.connection`'s vserver to the cluster-level vserver
If _update_ssc() (volume discovery) runs concurrently during startup, it reads the wrong vserver from self.connection, queries the wrong SVM, and gets Volume not found.
ERROR oslo_service.service: Error starting thread.:
NetAppDriverException: Volume /nfs_volume_ds02 not found.
VolumeBackendAPIException: Could not find unique volume. Volumes found: [].
Both NetApp backends failed on startup, couldn't find ANY of their volumes, but recovered on retry and are now running normally. This is not a regression from the session fix — it's a pre-existing startup race.
Option A: delay the first EMS call.

# nfs_base.py
self.loopingcalls.add_task(
    self._handle_ems_logging,
    loopingcalls.ONE_HOUR,
    loopingcalls.ONE_MINUTE)  # Add 60s initial_delay

Ensures `_update_ssc()` completes before the first EMS call. Simple, but only fixes the startup race — the hourly EMS call could still race with other operations.
Option B: give EMS an isolated connection.

def send_ems_log_message(self, message_dict):
    # Create a completely independent connection for EMS
    ems_connection = copy.deepcopy(self.connection)
    ems_connection.set_vserver(self._get_ems_log_destination_vserver())
    ems_connection.set_timeout(25)
    # Send using the isolated connection — self.connection is never mutated
    ...

Eliminates the race entirely by never mutating shared state.
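One hedged way to complete the elided send, assuming the `invoke_successfully` signature shown in api.py above (and assuming `self.send_request` still routes through `self.connection`, so the isolated connection must be invoked directly):

ems_connection.invoke_successfully(
    '/support/ems/application-logs', 'post', body=body,
    enable_tunneling=True)  # tunnel to the EMS destination vserver set above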
Option C: serialize access with a lock.

def send_ems_log_message(self, message_dict):
    with self._connection_lock:
        bkp_vserver = self.connection.get_vserver()
        self.connection.set_vserver(self._get_ems_log_destination_vserver())
        try:
            self.send_request(...)
        finally:
            self.connection.set_vserver(bkp_vserver)

Prevents concurrent access but adds lock contention.
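Note that `self._connection_lock` would need to be created once in the client's `__init__` (a `threading.Lock()` is green under eventlet monkey-patching), and, to be airtight, every other code path that reads `self.connection`'s vserver would have to acquire the same lock.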
Option D: pass the vserver explicitly. Refactor `send_request` / `_build_headers` to accept an optional vserver override parameter instead of reading from `self.connection`. EMS would pass its destination vserver explicitly without mutating shared state; a sketch follows.
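A minimal sketch of Option D (the `vserver` keyword argument and its plumbing are assumptions, not the current driver API):

# api.py (sketch): accept an optional per-call override
def _build_headers(self, enable_tunneling, vserver=None):
    headers = {"Accept": "application/json",
               "Content-Type": "application/json"}
    if enable_tunneling:
        # Prefer the explicit override; fall back to the shared value.
        headers["X-Dot-SVM-Name"] = vserver or self.get_vserver()
    return headers

# client_cmode_rest.py (sketch): EMS names its destination per call
def send_ems_log_message(self, message_dict):
    body = { ... }  # as in the current implementation
    self.send_request('/support/ems/application-logs', 'post', body=body,
                      vserver=self._get_ems_log_destination_vserver())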
Verify the session race fix prevents InvalidVolume: volume is not assigned to a host under concurrent cross-vc migrations.
# 1. Create 10 volumes concurrently on vmware_fcd backend
for i in $(seq 1 10); do
openstack volume create --size 1 --type vmware --availability-zone <az> test-race-$i &
done
wait
# 2. Verify creation
openstack volume list --name test-race -c ID -c Status -c Host
# 3. Migrate all concurrently to a different vCenter (cross-vc triggers the code path)
for vol_id in $(openstack volume list --name test-race -f value -c ID); do
cinder migrate $vol_id <destination-pool-on-different-vc> &
done
wait
# 4. Wait and check status
sleep 60
openstack volume list --name test-race -c ID -c Status -c "Migration Status"
# 5. Verify no errors in logs
# Check for "not assigned to a host" errors
# Check for get_file_sizes_by_dir exceptions
# 6. Cleanup
for vol_id in $(openstack volume list --name test-race -f value -c ID); do
openstack volume delete $vol_id
done

Expected results:
- All migrations complete with `migration_status=success`
- Zero `InvalidVolume: volume is not assigned to a host` errors
- Zero `get_file_sizes_by_dir` exceptions
- Deploy session fix to production after successful QA validation
- Address the EMS vserver race separately (Option A or B above) — file as a follow-up issue
- Consider adding a retry with cache refresh in `get_netapp_cinder_host()` — if `None` is returned, force `_get_all_pools(refresh=True)` once before failing (defense-in-depth); see the sketch below
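A minimal sketch of that retry (the `_find_pool_host` helper is hypothetical; `_get_all_pools(refresh=True)` is the refresh hook named above):

def get_netapp_cinder_host(self, netapp_fqdn):
    """Resolve the cinder host for a NetApp filer, retrying once on a stale
    pool cache. Sketch only — helper names are illustrative."""
    host = self._find_pool_host(netapp_fqdn, self._get_all_pools())
    if host is None:
        # The pool cache can transiently miss NetApp pools (see the session
        # race above); force one refresh before failing.
        host = self._find_pool_host(
            netapp_fqdn, self._get_all_pools(refresh=True))
    return host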