This is the rough outline of how we successfully did an in-place control + data plane upgrade from Istio `1.4.7` -> `1.5.4` via the official Helm charts.
The upgrade was:

- applied via scripting/automation
- on a mesh using:
  - mTLS
  - Istio RBAC via `AuthorizationPolicy`
  - telemetry v1
  - tracing enabled, but Jaeger not deployed via the istio chart
  - an istio ingress gateway + a secondary istio ingress gateway
- performed with active traffic flowing through, without any observed increase in error rates
This ignores anything specifically mentioned in the upgrade notes.
- Bug in RBAC backward compatibility with `1.4` in `1.5.0` -> `1.5.2`, fixed in `1.5.3`
- Issue with the visibility of `ServiceEntry`s being scoped using a `Sidecar` resource - istio/istio#24251, subsequently added to the upgrade notes
- All traffic ports are now captured by default; this caused our non-mTLS metrics ports to start enforcing mTLS, which they previously did not do on `1.4.7`
  - Fix: exclude the metrics ports via the sidecar annotation `traffic.sidecar.istio.io/excludeInboundPorts: "9080, 15090"` (see the sketch just below)
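For illustration, a minimal sketch of applying that annotation to a workload with `kubectl patch`; the deployment name and namespace are placeholders, and the port list is just the example above:

```bash
# Hypothetical example: add the excludeInboundPorts annotation to a workload's pod
# template so the sidecar stops capturing (and enforcing mTLS on) those metrics ports.
# "my-service" / "my-namespace" are placeholders - adjust for your own services.
kubectl patch deployment my-service -n my-namespace -p \
  '{"spec":{"template":{"metadata":{"annotations":{"traffic.sidecar.istio.io/excludeInboundPorts":"9080, 15090"}}}}}'
```

Patching the pod template rolls the pods, so the change takes effect as the replacement pods come up.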
```bash
#!/usr/bin/env bash
# In 1.4 Galley manages the webhook configuration and it carries `ownerReferences`;
# in 1.5 Helm manages it and Galley only patches it dynamically, without
# `ownerReferences` - so the presence of `ownerReferences` tells us whether we still
# have an un-upgraded 1.4 Galley.
if kubectl get validatingwebhookconfiguration/istio-galley -n istio-system -o yaml | grep -q ownerReferences; then
  echo "Detected 1.4 installation - preparing Helm upgrade to 1.5.x by deleting galley-managed webhook..."

  # Disable webhook reconciliation so Galley does not recreate the webhook once we delete it
  kubectl get deployment/istio-galley -n istio-system -o yaml | \
    sed 's/enable-reconcileWebhookConfiguration=true/enable-reconcileWebhookConfiguration=false/' | \
    kubectl apply -f -

  # Wait for Galley to come back up with reconciliation disabled
  kubectl rollout status deployment/istio-galley -n istio-system --timeout 60s

  # Delete the galley-managed webhook
  kubectl delete validatingwebhookconfiguration/istio-galley -n istio-system

  # Now we can proceed to `helm upgrade` to 1.5, which will recreate the webhook
fi
```
Not to be taken literally - this is pseudo-script...
```bash
helm upgrade --install --wait --atomic --cleanup-on-fail istio-init istio-init-1.5.4.tgz
# scripting to wait for the istio-init jobs to complete goes here

helm upgrade --install --wait --atomic --cleanup-on-fail istio istio-1.5.4.tgz
# scripting to bounce Deployments for injected services goes here (see the sketch below)
```
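Not our exact script, but a minimal sketch of that last placeholder - restart every injected `Deployment` so it is re-created with the new `1.5.4` sidecar, assuming namespace-level injection via the `istio-injection=enabled` label:

```bash
#!/usr/bin/env bash
# Sketch only: roll Deployments in namespaces with sidecar injection enabled so
# their pods come back with the 1.5.4 proxy. Assumes injection is enabled per
# namespace via the istio-injection=enabled label; adjust the selector to suit.
for ns in $(kubectl get namespaces -l istio-injection=enabled -o jsonpath='{.items[*].metadata.name}'); do
  for deploy in $(kubectl get deployments -n "$ns" -o jsonpath='{.items[*].metadata.name}'); do
    kubectl rollout restart deployment "$deploy" -n "$ns"
    kubectl rollout status deployment "$deploy" -n "$ns" --timeout 300s
  done
done
```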
We noticed issues with ingress gateways coming up during the control plane upgrade. There appeared to be some kind of race condition when starting new `1.5.4` `ingressgateway` instances while parts of the `1.4.7` control plane were still running - perhaps a problem with the new ingressgateway talking to the old-version Pilot?
Symptom:

- lots of weird errors about invalid configuration being received from Pilot, relating to tracing, in the new-version ingressgateway logs
- a subset of new-version ingress gateways would not become ready, which could cause the `helm upgrade --wait` to get stuck
Fix:

- delete the pods that fail to become ready (manual intervention in our case, although technically possible to automate - see the sketch below)
- the automatically re-created replacement pods always came up ready, so the `helm upgrade` would then run to completion
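For completeness, a minimal sketch of how that manual step could be automated; the `app=istio-ingressgateway` label selector and the grace period are assumptions, not what we actually ran:

```bash
#!/usr/bin/env bash
# Hypothetical automation of the manual fix: delete ingress gateway pods that are
# still not Ready after a grace period, and let the ReplicaSet recreate them.
# The app=istio-ingressgateway label and the sleep are assumptions - adjust as needed.
sleep 120  # give newly rolled gateway pods a chance to become ready on their own
for pod in $(kubectl get pods -n istio-system -l app=istio-ingressgateway \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
    | awk '$2 != "True" {print $1}'); do
  echo "Deleting stuck ingress gateway pod: $pod"
  kubectl delete pod "$pod" -n istio-system
done
```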
Great news @jlcrow - that's fantastic! And yes, the removal of an older Helmed install is really sensitive - lots of overlapping resources and easy to make a mistake :-(