Single Points of Failure
I have suffered at the hands of two single points of failure this week and I couldn't do anything about either of them, yet both could easily have been avoided with a modest amount of investment. For those of you unfamiliar with the term, a 'single point of failure' is where a critical information flow, business process, system or service fails and there is inadequate provision for dealing with such an event. Let me explain what happened to us.
We process DAIP, or bus positions, in real time for buses running in Bristol, Bath and Birmingham. The data from the buses is transmitted over the mobile data network, passes through the data aggregator and then terminates on our customer's network. We have a small piece of software we call an 'end point', which effectively acts as a proxy and forwards the messages on to our Icarus cloud. The end point is installed on one of our customer's virtualised servers in their hosted environment. Unfortunately the area is prone to power cuts, and on Thursday the servers and the environment were up and down all day. Within our system this is such a small and insignificant piece of software that when the loss of data occurred we scratched our heads, as the usual suspects were all still working. It was only when we had ruled out all of the heavyweight business processing apps that we realised one of the few SPoFs we have was indeed living up to its name.
So what happened when this failure occurred? With the loss of bus information, our systems effectively became glorified and expensive electronic timetables. As the failure was not in a part of the system we manage or provide, there was nothing we could do, but it still sits uncomfortably with me. You see, the solution to the SPoF is really simple but comes at a price, and this is the crux of the issue. Why spend lots of money 'just in case'?
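To make the 'just in case' spend a little more concrete, here is a minimal sketch of one common way to remove this kind of SPoF: run a second end point somewhere that does not share the same power supply, and fail over to it when the first is unreachable. This is purely illustrative and is not how our end point or the Icarus cloud actually works; the function name `forward_with_failover` and the two stand-in end points are hypothetical.

```python
from typing import Callable, Sequence

# A forwarder is anything that sends one message onwards and raises
# ConnectionError if it cannot (hypothetical interface for this sketch).
Forwarder = Callable[[bytes], None]


def forward_with_failover(message: bytes, end_points: Sequence[Forwarder]) -> int:
    """Try each end point in order; return the index of the one that accepted the message."""
    errors = []
    for index, forward in enumerate(end_points):
        try:
            forward(message)
            return index
        except ConnectionError as exc:
            # This end point is down: remember why and try the next one.
            errors.append(exc)
    raise ConnectionError(f"all {len(end_points)} end points failed: {errors}")


# Stand-ins for real end points, for illustration only.
received = []


def primary_down(message: bytes) -> None:
    raise ConnectionError("primary end point unreachable (power cut)")


def secondary_up(message: bytes) -> None:
    received.append(message)


used = forward_with_failover(b"bus 42 position", [primary_down, secondary_up])
print(f"message delivered by end point {used}")
```

The code is the cheap part. The price is in paying for, hosting and monitoring a second end point that will sit idle on most days.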
The second time we suffered at the hands of a SPoF was a classic business process error. Another one of our customers is currently transferring the maintenance of their passenger information system from the incumbent supplier to ourselves. We currently provide the information through the dissemination channels whilst they gather the data from the vehicles. Again, one day we stopped receiving vehicle updates from the system and wondered what had gone wrong. It turned out that the supplier had neglected to pay the bill for a critical communications channel, and as a result over 80% of the bus data had stopped being transmitted.
Now these types of business process failures are regrettable but not unique to this supplier. What compounded the issue was that a mitigation plan for this very failure (or a similar event) had been put in place when the system was initially commissioned. Over time, however, the equipment needed replacing, and rather than replacing like for like, a less capable piece of equipment was used which did not support the required failover function. This last error is almost unforgivable, and if I were the customer, I would not be happy!
That being said, it is all too easy to look in from the sidelines and snigger from afar, confident in one's own robust systems and processes, but as pointed out earlier we had a failure ourselves only this week and it wasn't even our fault. Having recently moved house and taken on a larger mortgage, I decided to take out a larger insurance policy to cover it. Obviously I hope never to need it for myself or my wife; it is there 'just in case', and indeed it is slightly uncomfortable even thinking that such an event could occur. I wonder if we shouldn't look at SPoFs in the same light and place more value on the protection and safety net that redundancy provides to our products and services?