Rolling updates for SURFconext
At SURFconext, we strive for 100% availability. That’s why SURFconext is configured redundantly and the platform runs at 3 different locations. The next issue that we want to improve is the prevention of disruptions after releases. Despite extensive testing and a solid release process, we cannot give complete assurance that everything will keep working for everyone. That’s why we are going to work with rolling updates: updates that are implemented gradually, per user group. By doing so, problems stay manageable and old software does not have to be restored if something goes wrong. How do rolling updates work and which problems do they solve?
Current SURFconext release process
Over the years, we have steadily improved the SURFconext test and release process, both by refining it and by releasing regularly. In the past academic year, for example, we performed 18 releases.
A lot happens before the software goes into production:
- All software is tested both manually and automatically, using fixed protocols.
- Next, the software and configurations are rolled out. This is fully automated for all environments (test, acceptance and production) with the help of Ansible.
- As a last check that the new software works well with production data, we first put it on one production server for a short period. Only the SURFconext team can access that server.
- By agreement, the new software is then rolled out to all users during the maintenance window. This rollout does not cause any downtime.
New releases are always announced a week in advance, so if there are any problems, institutions and service providers know that something has changed on our side.
Disadvantages of the current release process
This process works well, but there are still a number of disadvantages with this method:
- Many institutions and services are linked to SURFconext, and they use lots of different SAML products and configurations. In principle, there are some 200,000 possible combinations of IdP and SP, so we cannot test every combination beforehand.
- Restoring the old software after an error is discovered carries risks, because we often have to do it at the busiest time of the day.
- The release process is relatively demanding for the team and our management partners. The manager and the technical product managers at SURFconext have to get up at 5am to test and execute the release. This means that we release less often than we would like. Our preference would be to do this more often, so that we could make the changes smaller each time.
We would like to eliminate these drawbacks; to do so we need another way to perform releases: rolling updates.
What is a rolling update?
With a rolling update, a small percentage of users gets the new software first. If this goes well, with no deviations in the monitoring and no complaints from users, the percentage is increased step by step until everyone uses the new software. This is a predictable and automated process.
If the new software does not appear to be successful, users are simply switched back to the old version. Both versions run alongside one another and can each handle the full load, so restoring the old software is no longer necessary. This way, rolling updates remove the drawbacks of the current release method.
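The mechanism described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the SURFconext implementation: the function names, the hash-based bucketing and the step sizes (1, 5, 25, 50, 100) are all assumptions made for the example.

```python
import hashlib

# Hypothetical rollout steps, in percent of users; not SURFconext's actual values.
ROLLOUT_STEPS = (1, 5, 25, 50, 100)


def bucket(user_id: str) -> int:
    """Map a user to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def uses_new_version(user_id: str, rollout_percent: int) -> bool:
    """A user gets the new software if their bucket falls below the current percentage."""
    return bucket(user_id) < rollout_percent


def next_percent(current: int, healthy: bool) -> int:
    """Move to the next step while monitoring is healthy; otherwise send everyone to the old version."""
    if not healthy:
        return 0
    for step in ROLLOUT_STEPS:
        if step > current:
            return step
    return 100
```

Because the bucket is derived from the user identifier, a user who has been given the new version keeps it as the percentage grows, and setting the percentage to 0 sends all users back to the old version without restoring anything.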
We started using rolling updates for a number of non-critical applications in June. This went well and the plan is to adopt this way of updating for the rest of the platform as well, after September. Before we do this, however, there are still a number of challenges that we have to tackle:
- Our monitoring is quite extensive, but is principally focused on detecting larger disruptions. To be able to detect smaller problems (for example, one that affects only a specific institution-service combination), we have to improve our monitoring further.
- Not all releases can be executed as rolling updates, for example because they require major database changes. We have to take this into account when developing new software. Major changes that cannot be done without downtime will still be carried out within the maintenance window.
- A user or institution does not know if they are working with the current or new software version. In the event of problems, we would like to know which version is being used without the user having to do complicated things. It is not only the SURFconext team that would like to know this, but the institutions and service providers too.
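As an illustration of the monitoring challenge in the list above, the following Python sketch compares the login failure rate per institution-service combination between the old and the new version. It is a sketch under stated assumptions: the event format `(idp, sp, version, ok)`, the version labels `"old"` and `"new"`, and the thresholds are all hypothetical and do not describe the actual SURFconext monitoring.

```python
from collections import Counter


def suspicious_combinations(events, min_logins=20, max_increase=0.05):
    """Return the (idp, sp) pairs whose failure rate is clearly higher on the new version.

    events: iterable of (idp, sp, version, ok) tuples, with version "old" or "new"
    (a hypothetical log format chosen for this example).
    """
    totals, failures = Counter(), Counter()
    for idp, sp, version, ok in events:
        totals[(idp, sp, version)] += 1
        if not ok:
            failures[(idp, sp, version)] += 1

    flagged = []
    for (idp, sp, version), n_new in totals.items():
        # Only judge the new version, and only once there is enough traffic.
        if version != "new" or n_new < min_logins:
            continue
        n_old = totals[(idp, sp, "old")]
        old_rate = failures[(idp, sp, "old")] / n_old if n_old else 0.0
        new_rate = failures[(idp, sp, "new")] / n_new
        if new_rate - old_rate > max_increase:
            flagged.append((idp, sp))
    return sorted(flagged)
```

Recording the version with each login in this way would also address the third point: in the event of a problem, the version in use can be looked up without asking the user to do anything.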
We have even more questions for which we would like to find solutions by working together with the member institutions. What is, for example, the best time to perform a rolling update? How long should a rolling update take? How do I ensure that my help desk is informed? Are you an IT manager, service manager or help desk team member who would like to collaborate with us on the rolling update process? If so, send an email to email@example.com.