At the beginning of 2019, we replaced the foundation of our federation, SURFconext’s key material, in a so-called ‘key rollover’. This was a large project that required all service providers (or SPs) to take action. Why did we do this, how did we do it and what did it yield?
All logins pass through our central proxy. The login information is sent by the proxy to the service provider in a message via the user’s browser. To prevent the user from falsifying the information en route (so that John could, for example, claim to be Peter), these XML messages are signed by SURFconext with a key that is trusted by the SP. This way, the service provider can check whether the login information really comes from SURFconext (red arrow in Figure 1).
SURFconext publishes so-called metadata in which an SP can find that key. The metadata itself is also signed with a key with which the SP can check the origin (Figure 2).
These keys became five years old in May 2019, and keys need to be replaced regularly. But, in addition to the fact that it was high time, we also wanted to make a number of important improvements at the same time.
Hardware Security Module
The first improvement is that the metadata process is now separated from the login process. This allows us to create and sign the metadata on a separate, less exposed machine, and to publish it at a separate URL. It is also integrated with our international metadata flow to and from eduGAIN, which could then become fully automatic. In the new situation, the metadata is signed with a different key than the assertions.
But perhaps the most important change is that in the new situation, we’re using a Hardware Security Module (HSM) to sign the metadata. An HSM is a device specially designed to handle key material safely: the metadata key never leaves the device, which makes it practically impossible to steal. In our case, these are machines manufactured by Utimaco. SURFnet also uses the same machines for the DNSSEC signatures. Using an HSM, however, meant that we had to start all over again with new keys that were known only to the device from the start, and not outside it. As a result, every SP had to make this change, effectively reconfiguring its trust in SURFconext. This may sound drastic, but it usually boils down to updating a few configuration items. Nevertheless, the system would only continue to work if all SPs actually did this, and SURFconext currently has more than eight hundred connected SPs.
With 884 service providers required to take action, a ‘flag day’ migration on a set date is not feasible. We have therefore used a clever feature of SURFconext. To start an authentication, an SP calls the SingleSignOn URL of SURFconext, which it finds in the metadata. The new metadata at the new location now contains an extra parameter of that URL: ‘key:20181213’. The old metadata did not contain that parameter. This enables us to already see at the start of an authentication which metadata an SP is relying on: the old one, or the new one? In the case of the new one, we also directly sign the message about successful authentication back to the SP with the new key. This way, the moment the SP has configured the new metadata, it has effectively already switched to the new situation and it will immediately be clear that everything is working successfully. This means that each SP can carry out the change at a time of its choosing and will then have completed the migration.
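The selection logic described above could look roughly like the following sketch. It assumes the key identifier appears as a `key:<id>` segment in the path of the SingleSignOn URL; the host name, the function name and the `"default"` label for the old key are hypothetical and only serve as illustration.

```python
from urllib.parse import urlparse

# Label used here for the old key; hypothetical, not taken from SURFconext.
OLD_KEY_ID = "default"


def select_signing_key(sso_url: str) -> str:
    """Return the key identifier implied by the SingleSignOn URL an SP called.

    Assumption: new metadata publishes a URL with a 'key:<id>' path segment,
    while old metadata publishes the same URL without it.
    """
    path = urlparse(sso_url).path
    for segment in path.split("/"):
        if segment.startswith("key:"):
            return segment[len("key:"):]
    # No key segment: the SP is still relying on the old metadata.
    return OLD_KEY_ID


if __name__ == "__main__":
    print(select_signing_key("https://idp.example.org/single-sign-on/key:20181213"))
    print(select_signing_key("https://idp.example.org/single-sign-on"))
```

Because the decision depends only on the URL the SP itself calls, no per-SP switch has to be flipped on the proxy side: the SP migrates simply by loading the new metadata.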
An additional advantage of this extra parameter in the new metadata is that SURFconext also includes it in the log entry for the login concerned. That in turn enabled us to closely monitor progress by counting the number of logins per SP using the old and the new keys. This allowed us to create graphs showing the percentage of logins using the new key (which, of course, had to be 100% by the deadline).
In eduGAIN, our metadata signing key is trusted only by the central eduGAIN aggregator, so it had to be changed in just one place for all eduGAIN SPs. For the assertion signing key, the SingleSignOn locations we publish in eduGAIN also contain the key identifier. We could see that when we changed our metadata feed towards eduGAIN, most eduGAIN SPs quickly picked up the new key. Those eduGAIN SPs that were still using old information after a few days turned out to have a broken metadata refresh process, which could then be fixed.
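As an illustration of this kind of monitoring, here is a minimal sketch that computes the share of logins per SP that used the new key. The log record shape (a pair of SP identifier and key identifier) and the function name are assumptions made for the example, not the actual SURFconext log format.

```python
from collections import Counter

NEW_KEY_ID = "20181213"  # the key identifier mentioned in the new metadata


def migration_percentage(logins):
    """Return {sp_id: percentage of logins signed with the new key}.

    `logins` is an iterable of (sp_id, key_id) pairs, one per logged login.
    """
    total = Counter()
    with_new_key = Counter()
    for sp_id, key_id in logins:
        total[sp_id] += 1
        if key_id == NEW_KEY_ID:
            with_new_key[sp_id] += 1
    return {sp_id: 100.0 * with_new_key[sp_id] / total[sp_id] for sp_id in total}


if __name__ == "__main__":
    sample = [
        ("sp-a", "20181213"),
        ("sp-a", "default"),
        ("sp-b", "20181213"),
    ]
    print(migration_percentage(sample))
```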
Because which key is used is logged per login, we can also determine our priorities this way. All SPs are important to us, but to be honest, it is more important that an SP with 100,000 logins per day completes the migration on time than an SP that is logged onto once every two weeks.
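A simple way to turn this into a priority list is to rank SPs by the number of logins that still use the old key. The sketch below assumes the same hypothetical (SP identifier, key identifier) log pairs; it is an illustration of the idea, not our actual tooling.

```python
from collections import Counter


def migration_priorities(logins, new_key_id="20181213"):
    """Return SP identifiers ordered by logins still using the old key.

    SPs with the most old-key logins come first; SPs that only use the
    new key do not appear in the result at all.
    """
    old_key_logins = Counter(
        sp_id for sp_id, key_id in logins if key_id != new_key_id
    )
    return [sp_id for sp_id, _count in old_key_logins.most_common()]


if __name__ == "__main__":
    sample = (
        [("big-sp", "default")] * 3
        + [("small-sp", "default")]
        + [("done-sp", "20181213")] * 5
    )
    print(migration_priorities(sample))
```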
Communication, communication, communication
Once the technology was in place for the rollover, we could start informing people. We began on 5 February with an e-mail containing instructions to all SPs. In our graphs we saw a number of SPs taking action immediately, sometimes within an hour. That was great news! But after the first week, the percentage soon stagnated. As the deadline approached, we started e-mailing SPs every week, indicating the urgency with the concrete number of logins via the old key that we could see. As 1 May approached, our team got on the phone and called the SPs that hadn’t migrated yet, starting with the most used SPs, of course. This was often enlightening, as some people thought that the change had already been implemented. But even more often the response was that our contact person had long since left the company. In the end, however, it was almost always possible to find someone who could implement the change. In a number of cases, we were also able to get the necessary changes made via our institutions, as ‘customers’ of the service.
For our peace of mind we would have preferred it otherwise, but many large SPs only switched to the new key during the last week before the deadline. This can be clearly seen in the graph, where green has already migrated and pink is still using the old key. So a lot of work came our way over those last few weeks. Late, but not too late: when we eventually technically disabled the old key on 10 May, we received hardly any complaints. Quite an achievement when you’re processing 500,000 logins a day, if we may say so ourselves.
What did we learn?
We are scheduled to do this again in five years’ time. Perhaps even earlier – because the previous key rollover, in 2014, was done not because the key had expired, but because of the infamous Heartbleed security incident. Knock on wood, but such an unscheduled rollover could happen again, of course.
We expect the next one to at least be easier. This time, in order to make a big leap, we had to make a lot of changes at once, including changing the trust anchor. Next time, however, if we only need to replace the assertion signing keys, people will only need to ‘refresh’ the metadata – and this is in many cases done automatically.
The most important lesson, however, was that in many cases our contact information was outdated. Once an SP has finished joining the federation, we are usually no longer informed about personnel changes. We’re looking at ways to resolve this. In any case, from now on we will push harder for SPs to register with functional addresses (a shared role mailbox rather than a named individual’s) instead of personal e-mail addresses. We also think that our SP Dashboard offers more possibilities to keep the data up to date. But it might also be a good idea to periodically send a test message to the contact persons to see if they respond.
Even though there are things we could have done better, the key rollover was certainly a success. We now have a safe, long-term set-up and a much better technical foundation than we did at the beginning of the year. Trust is the foundation of the Federation. So, all our efforts were well worthwhile.