20 September 2020

Gather round…

By admin@labtinker.net

Most commercially available stories we read, hear or view are of exceptional or fantastical events, but few of us experience these regularly. In contrast, I have been reading ‘The Wrench’ by Primo Levi, which celebrates the day-to-day working life of a rigger: a man who assembles cranes and bridges. My everyday work is in IT, an area that does not lend itself to tales that can be told to non-technical folk around a campfire, but I’m going to have a go. There are elements of this tale that do seem quite unlikely; well, one element.

Many years ago, I supported a website which allowed people to connect from the Internet to a restricted network after they had gone through a fairly elaborate authentication procedure, similar to the one some banks use with a card and card-reader. I’ll call this website a portal from here on in.

I had to perform a simple change to alter a menu or function, and it amounted to little more than copying two pre-prepared files to specific directories on the portal devices. The change began at 6 pm, and before starting I checked that the login process was working OK by logging in with a test account, card and card-reader.

The change itself took ten minutes, and thereafter I re-checked the login process to make sure the new feature/menu item was available. I got an error. I shrugged and reversed the change; I’d give the files back to the devs to debug in the morning. Having restored the original files, I re-checked the login process and got the same error. Puzzled, I checked that the original files had been copied back correctly and that the permissions hadn’t changed: they had, and they hadn’t. I restarted some services: same error. I rebooted the portal devices: same error. The change had been so simple that I couldn’t think of much else to try, and did I say the error I was receiving was generic and unhelpful? The logging on these portal devices (Microsoft IAGs – if you’re interested) was fairly rudimentary and gave me no clues. I persisted trying things for another couple of hours… and then, working on the assumption that one of the files might have got corrupted, I called out the storage guy to come and do a full restore of the two portal devices. He biked in, and by 11 pm we had done an image restore from the previous night’s backup. I tried my test again and, to my horror, saw the same error.

So five hours into a problem, having totally restored the systems I’d changed, having checked everything I could think to check, and with no useful error or logs to work on, I was struggling. There seemed to be no logic, no clues. The rest is a little blurry; it is only the stingingly unlikely (and, dammit, unfair) denouement that has stayed with me. I think there were further call-outs and vendor escalations, but I can’t in all honesty remember when the eureka moment came about or who found it.

Furthermore, the solution was not found on the portal devices I supported and had changed, or indeed on any devices I had access to. The portals themselves didn’t do much of the checking in the authentication process but handed this off to some backend servers. When servers speak to each other securely, the channel is generally secured with certificates exactly like the ones e-commerce websites use (well, pretty much all websites now – including this one), the ones that lead to that little padlock being displayed in your browser. These certificates are minted for a given time period: typically, a year. You may see where this is going: the certificate which secured the access between the portals and the backend servers had expired in the ten minutes between my pre-change test and my post-change test. This meant the backend server couldn’t be trusted, hence the error.
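If you want a feel for what that looks like from the client side, here is a rough Python sketch (the hostname is made up, and this is not what the IAG itself was doing internally): a handshake against an expired certificate fails verification before any application data is exchanged, so all the portal could sensibly report upstream was a generic error.

    import socket
    import ssl

    # Hypothetical backend host; the real names from this story are long gone.
    BACKEND_HOST, BACKEND_PORT = "auth-backend.internal.example", 443

    ctx = ssl.create_default_context()
    try:
        with socket.create_connection((BACKEND_HOST, BACKEND_PORT), timeout=5) as sock:
            with ctx.wrap_socket(sock, server_hostname=BACKEND_HOST) as tls:
                # If we get here the handshake succeeded and the cert is trusted.
                print("Handshake OK; certificate valid until",
                      tls.getpeercert()["notAfter"])
    except ssl.SSLCertVerificationError as err:
        # An expired certificate fails here with something like
        # "certificate verify failed: certificate has expired".
        print("TLS verification failed:", err.verify_message)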

What are the chances? Well, people buy lottery tickets, don’t they? There was a syndicate at the very same place where the above events took place. I was offered the chance to join but I always pooh-poohed their chances of winning as infinitesimal. It took them ten years, but the last time I met my mate who’s still in said syndicate he told me they’d won a million pounds; he did give me a lift back from the curry house in a very nice car. I didn’t ask him if they’re still buying tickets.

I have seen expired certs cause other outages. It’s never the certs on important, high-visibility websites that expire, but always those on backend servers which are intrinsic to a service but whose function is only half-remembered until it fails. For such internally used certs, I think there is an argument either to install certificates that will last the lifetime of the service or to go the other way and issue certs that last six months, so people remember they need to update them. A sketch of the sort of check I mean follows.
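Here is a minimal sketch in Python, with a couple of invented internal hostnames: it connects, reads the certificate’s notAfter date and complains when renewal is getting close. Nothing clever, but something like it pointed at those backend servers would have saved me an evening. (An already-expired cert will make the handshake itself fail, which is also a loud enough signal.)

    import socket
    import ssl
    import time

    def days_until_expiry(host, port=443):
        """Return days until the certificate presented by host:port expires."""
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=5) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                not_after = tls.getpeercert()["notAfter"]
        # cert_time_to_seconds parses the "Jun  1 12:00:00 2021 GMT" format.
        return (ssl.cert_time_to_seconds(not_after) - time.time()) / 86400

    # Hypothetical internal hosts: substitute your own.
    for host in ("portal.internal.example", "auth-backend.internal.example"):
        days = days_until_expiry(host)
        warning = "  <-- renew soon!" if days < 30 else ""
        print(f"{host}: {days:.0f} days remaining{warning}")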