Created May 14, 2020
# Introduction
With the rise in popularity of CI/CD over the last few years, we have seen an explosion of different solutions. As these CI/CD solutions grow larger, taking on more of the application release lifecycle, they become less and less manageable in an ad-hoc way. As with most tooling, we see a move towards declarative configuration.
# Background & Context
This talk isn't focused so much on the declarative nature of the pipelines themselves, but more on the management of the declarative configuration itself (Kubernetes manifest files, large Terraform projects).
The project I worked on served multiple websites for different companies across the world. This naturally came with its share of regional policies & peculiarities. Additionally, each website was managed & maintained by a subsidiary company, and these subsidiaries had all been acquired at different times, meaning a large variation in the technical landscape of each company.
This brought a large amount of complexity to managing each of these environments & clearing the barriers required to process a release.
The existing solution was a hybrid between something built for a single company & a tool that had been put together to try to capture the requirements of all the businesses & manage all of these environments.
It would do a number of things:
- apply Terraform for the website's dependencies
- inject the secrets into a number of environment files
- merge those secrets with manually defined secrets
- dynamically inject the secrets into the configuration
- dynamically tag images for deployment
- apply the final result to Kubernetes
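The secret-handling steps in the middle of that list can be sketched as follows. This is a minimal illustration of the idea, not the actual tool: the function names, the dictionary shapes, and the `${NAME}` placeholder convention are all assumptions.

```python
from string import Template

def merge_secrets(generated: dict, manual: dict) -> dict:
    """Merge secrets fetched by the pipeline with manually defined
    overrides; a manually defined secret wins on conflict."""
    return {**generated, **manual}

def inject_secrets(manifest: str, secrets: dict) -> str:
    """Substitute ${NAME} placeholders in a manifest template
    with the merged secret values."""
    return Template(manifest).substitute(secrets)

# Illustrative data: values the pipeline generated vs. a manual override.
generated = {"DB_PASSWORD": "s3cret", "API_KEY": "abc123"}
manual = {"API_KEY": "override-key"}

manifest = "db-password: ${DB_PASSWORD}\napi-key: ${API_KEY}\n"
rendered = inject_secrets(manifest, merge_secrets(generated, manual))
print(rendered)
```

The key design point is the precedence order: dynamically fetched secrets form the base, and manually defined secrets override them.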
# What was the problem / why did it matter?
When I arrived it was not a simple tool to understand; I was only the second person to understand it.
Secondly, it was not easy to see (without inspecting the clusters themselves) what versions were running in any one environment.
It also wasn't easy to see what _should_ be running in a particular environment.
Additionally, it was incredibly difficult to trace a change all the way back to its source.
The support burden on our team was high, as it was too hard to debug a failed deployment without understanding this tool.
Infrastructure was intermingled with service code, meaning any service update would trigger all infrastructure pipelines to redeploy. This put a lot of load on our CI/CD pipelines whenever a service was updated.
There were also a few odd issues that would occasionally arise & were difficult to resolve.
# What did we try to solve & how did it go?
We decided to split the problem into three sections:
- secrets
- infrastructure
- service-level
We wanted to limit the amount of churn in the CI/CD pipelines whenever a change was made to one of the dependencies: essentially, refactoring the dependency tree so that a service update was less of a catastrophe.
We wanted to increase the clarity of each step, so devs could more clearly see where their change had been applied & where it was yet to be applied. We also wanted oversight of the whole process while needing to step in as little as possible.
Our solution dramatically increased insight into what was going on & where each change had come from. This increased developer engagement in the release process & improved auditability regarding which service versions were running for which websites & who had approved their release.
There were a few issues, though. One was that the volume of information at the beginning was quite large: all the version pinning & updating overwhelmed our ability to do much other than test & approve PRs for new versions. This was partially mitigated by tweaking the behaviours & rules around environmental controls.
Another issue was that deploying a service into acceptance & production went from 10-12 minutes (over a couple of chained pipelines) to 1-2 hours. Testing was unaffected. I don't believe this was a huge issue, as the release cadence was in the region of fortnightly.
# How did we solve (or try to solve) the problem?
I focused first on infrastructure, as we had full control of all the source & could iterate on solutions faster.
We started by pulling apart the first stage of the tool, which would clone a number of repos & create symlinks between them. We refactored the dependent repos into Terraform modules, pinned them & started versioning them semantically.
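A pinned module reference of this kind looks roughly like the following; the repository URL and version tag are illustrative, not the project's actual values.

```hcl
module "website" {
  # Pin the shared infrastructure module to a semantic version tag,
  # replacing the old clone-and-symlink arrangement.
  source = "git::https://example.com/infra/website-module.git?ref=v1.2.0"
}
```

Because each website references the module at an explicit tag, one website can be upgraded to a new module version without forcing the change on all the others.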
This allowed us to manage each website on a different version of the infrastructure with less of a headache.
As the secrets & services were pretty tightly coupled, we started to pull out what we could & refactored the services so their manifests were no longer versioned alongside the source code.
We introduced a tool (Renovate/Dependabot) which would watch a number of repos & update their dependencies when they got out of date.
This meant every update to a production environment had a corresponding PR, which could be approved according to the website's policies. We could revert a service deployment quickly & easily, either via the CLI or by simply reverting the PR. It was also clear who had signed off on each PR, & we could wait until all parties were ready before pushing a change to production.
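As a rough idea of how per-environment policies attach to this workflow, a Renovate configuration along these lines disables automerge for production paths so every production update waits for an explicit approval. This is a sketch, not our actual config; the paths are invented, and the exact rule fields vary between Renovate versions.

```json
{
  "extends": ["config:base"],
  "packageRules": [
    {
      "matchPaths": ["environments/production/**"],
      "automerge": false
    }
  ]
}
```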
# What did I learn & what would I do differently?
Managing a (bespoke) multi-tenant service with many different stakeholders introduces an absurd level of complexity.
- decouple services from infrastructure

Our service updates were triggering infrastructure redeployments & vice versa; this was definitely causing a fair amount of confusion & should be separated out.
- decouple secrets from the deployment pipeline

This is a tough one, as managing secrets effectively is hard: it needs to be both performant & secure.
There are effectively two ways to go about this:
- the 'force everyone to conform' method
- the 'give everyone the tools to do everything they would like' method

This is naturally a spectrum, & it is a line that must be walked without going too far in either direction.