For over 36 hours, 32 million mobile users lost access to data and other services provided by mobile network operator O2, affecting individuals as well as businesses, both of whom are becoming ever more reliant on mobile as their only means of communication.

What Can We Learn From the O2 Data Failure?

All good companies should seek to learn from events so that they improve and, in this instance, become more resilient. If you can learn from someone else’s disruptive event rather than your own, all the better.

Whether you were affected by yesterday’s O2 outage or not, please read on as we take a look at the scale of the disruption, the effect on businesses across the UK and how O2 managed the situation, so that we can all learn from these events.

As learning is always best in a collaborative environment, we welcome your comments, questions and suggestions on possible alternative approaches.

The Disruption

At 0530hrs GMT on Thursday 6 December 2018, O2 experienced a failure of their data systems that affected 25 million of their own customers and a further 7 million customers of Sky, Tesco, giffgaff and Lycamobile, among others.  The failure meant that customers could not access mobile data other than through a Wi-Fi connection, significantly disrupting businesses and individuals.

Early on Friday 7 December, O2 announced that the problem had been resolved and that a root cause analysis had identified it as an “expired certificate in the software versions installed with these customers”.

The Effect on Businesses

Given the outcry in the news and on social media over the impact of this outage, with customers stating that “O2 are responsible” and many seeking compensation for lost earnings, what can be learned from such an event?

1.    Single Point of Failure

Good business continuity practice is that for any critical component of your business, there should be a back-up or a workable alternative. In this case, how would you arrange an alternative telecoms supplier?

Larger companies might consider having more than one provider so that, if the worst occurs, handsets can be prioritised for those who need them most.  But is this practical for short-term disruptions, or for smaller companies? A single supplier may offer a modest discount or simpler paperwork, but the loss of a day’s earnings may far outweigh that saving.

2.    Workable Alternatives

Have we reached the point in most businesses where the loss of access to the internet is a catastrophic event?  One of our clients recently shut down their systems voluntarily for a week to ensure they weren’t infected by the WannaCry ransomware, despite having identified IT as a critical capability whose loss they could not cope with for more than an hour.  By reverting to pen, paper and the fax machine they kept working and avoided infection. Are there similar alternatives for most businesses if we plan ahead?

3.    Having a Plan

As with all emergency events, we don’t know when they will occur, but when they do, they cause disruption and stress.  Time and again we read stories of equipment or facilities failing and companies being caught unawares. We know that the world experiences severe weather and we know that IT fails, and yet all too often I hear that companies are either too busy to plan for this type of resilience measure or are happy to rely on a “tick box” approach with no depth to their plan.  How do we get beyond this, so that we have effective plans for all the likely types of disruption, not just fire?

The Response From O2

Turning now to O2 and their partner Ericsson: what can we learn from their response?

1.    Prior Preparation and Planning? I am not a software engineer, so I would welcome comments from those who are; however, how is it possible for 32 million people to be left exposed by an expiring certificate?  Should there not be multiple fail-safe mechanisms to ensure that this type of thing doesn’t occur (see the short sketch below for one example)? Given the size of O2, if their systems failed to pick this issue up, how confident are you that your business is not vulnerable to the same issue, particularly if you rely on the internet for sales or the general running of your business?
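By way of illustration only (and very much a sketch rather than a description of how O2 or Ericsson actually manage their certificates), here is a short Python check, using hypothetical hostnames, of the kind of automated fail-safe that flags a TLS certificate approaching its expiry date so that renewal can be scheduled well inside the lead time:

```python
# Minimal sketch of an automated certificate-expiry check.
# The hostnames below are hypothetical placeholders, not O2/Ericsson systems.
import socket
import ssl
from datetime import datetime, timezone

LEAD_TIME_DAYS = 30  # flag anything due to run out within this window

def days_until_expiry(host: str, port: int = 443) -> float:
    """Return the number of days until the TLS certificate presented by host expires."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    not_after = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
    )
    return (not_after - datetime.now(timezone.utc)).total_seconds() / 86400

if __name__ == "__main__":
    # Replace with the services your business actually depends on.
    for endpoint in ["example.com", "api.example.com"]:
        remaining = days_until_expiry(endpoint)
        status = "WARNING" if remaining < LEAD_TIME_DAYS else "OK"
        print(f"{status}: certificate for {endpoint} expires in {remaining:.0f} days")
```

Whether a check like this lives in a simple script or a dedicated monitoring tool matters less than the principle: an expiry date is known well in advance, so running out of certificate should never come as a surprise.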

2.    Crisis Communication.  The BBC reported that in a joint statement, O2 boss Mark Evans said:

“I want to let our customers know how sorry I am for the impact our network data issue has had on them, and reassure them that our teams, together with Ericsson, are doing everything we can.

We fully appreciate it’s been a poor experience and we are really sorry.”

Is saying sorry good enough?  Some reports note that customers first became aware of the outage via BBC News online.  What more should customers expect from a key supplier? What messages should O2 have aimed for?  The usual approach is to admit the issue as soon as possible, apologise, and recognise that an investigation will be conducted to identify the root cause so that it can be prevented from occurring again.  All of that is covered in the O2 statement, so what more does a supplier need to do, and what do customers expect?

Some customers and industry experts are noting that compensation may be possible for lost earnings.  For one day’s lost earnings, how will the process work? If O2 require masses of time-consuming paperwork, this may leave them exposed to further reputational damage.  Too little paperwork and it leaves them at risk of fraudulent claims, as was sadly witnessed even in the aftermath of the Grenfell Tower fire. So what is the acceptable middle ground?

What do you think?

As stated at the start of this short article, there is a lot to consider for all of our future activities as we grow increasingly reliant on digital communications from a handful of providers.  All thoughts and observations are appreciated so that we can grow the conversation and learn from each other’s experiences.

Afternote: at 1425hrs on Friday 7 December 2018, I noted an article published on BBC Online reporting that a compensation package has been announced by O2; it seems like a textbook crisis communications response. Time will tell whether customers think that two days’ worth of airtime, credited on January 2019 bills, is the end of the matter.


Update: Mon 10 Dec 2018

We reached out to the wider industry for comment on the situation to further the discussion:

Toby Ingram  Senior Consultant at Inverroy Crisis Management Limited:

For me, the killer phrase is “O2 are responsible”, which smacks of the Macondo reaction “BP are responsible”. In both cases the customer-facing organisation (O2 and BP) could mount a credible argument that they were NOT in fact responsible, but any such argument will always fall on deaf ears. So the learning point is that if the name on the tin is your name, you will be held responsible, even if the mistake belonged to one of your subcontractors, and you should shape your corporate comms accordingly.

Jim Mathieson FInstLM AMBCI  HSE Incident Investigation and Competence Lead at Shell:

I would like to expand on your comments, Toby. If I had a contract with O2, I would expect them to provide the service. The fact that they chose to outsource the software certification is their choice. O2 are responsible for the service of their subcontractors in the same way that BP were responsible for the service provided by their contractor on the Deepwater Horizon. The point here is: had O2 conducted stress tests and asked questions of their contractor about lead times to certificate run-out dates, then this could have been avoided. What failed was robust contractor management.

John Duncan MBE MSc DipHE  Senior Advisor Emergency Preparedness & Security at Total Exploration & Production:

Really good response, Matthew. Having gone through a cyber attack that caused us to shut down our IT, I know that you have to have a back-up plan which allows you to deliver the critical aspects of your business. If you are reliant on IT services, have you thought about what you would actually do if you had no access? IT infrastructure is being attacked on a daily basis by criminal gangs and rogue states. As we are so reliant on our IT, we must have a back-up plan to deal with this. Matthew’s article hits this home for me. As the old adage goes, if you fail to plan then you plan to fail.

Sandra Riddell Business Resilience Specialist (Emergency Response, Business Continuity & Crisis Management):

O2 have done the best they can to respond, taking responsibility and communicating reasonably well. This type of event is like water: it finds its way out to social media quickly. Crisis comms go through an approval process, and for situations like this, teams will always be on the back foot, trying to get ahead. O2 are undoubtedly now looking to learn from this. On preparation and planning, what is clear is that you and your suppliers need to be aligned on service goals, critical success factors (certificate renewal) and roles and responsibilities, both in BAU and ‘when things go wrong’. In addition, exercises designed around credible threats to service goals and critical success factors will help a team practise ‘when things go wrong’ and ensure processes, plans and strategies are validated. Organisations are complex. Unpicking this and updating processes etc. to plug this gap will take time. O2 are not alone; none of us can be complacent. We all have to do more preparation and planning and be ready for crisis. No two are the same.

Joshua Hutchinson  CEO @ ARX Maritime | Safety, Operations & Security Technology | Shipping:

This is a really good read of the situation that occurred; I assume it was similar to the one that happened in Glasgow with giffgaff? My initial thought is that when the big companies apologise, we seem to accept it. Given the size and power they have in the market, what can we do? Has O2 taken into account the effect this would have had on SMEs that are reliant upon the network? We could also ask the same question of O2: do they believe it won’t happen to them because they are a big comms company? Everyone can take something away from this, regardless of how big or small: we need to have a plan in place… a plan for A, B and C… because at some point things will always go wrong.

David Freeman  Crisis, Risk and Resilience Manager – EMEA at Amazon:

My first, and rather cynical, question is: who really lost earnings from a 24-hour mobile data outage? There may be completely valid use cases, but aside from remote communities I struggle to conceive of any that amount to more than minor inconvenience.

Second observation: if O2 are to be believed, then this was entirely avoidable.

Thirdly, their response was pretty good overall: nothing exceptional, but a solid covering of the basics.  Could they have done more? Of course they could have. This question needs to be asked with a common-sense bubble around it and regulated properly, though.  We use our phones for a lot more than work, and our data can be a source of dopamine, meaning that when data is cut off, so is the supply of the dopamine hit we all crave.

Viewing it through this lens adds a different perspective to the otherwise functional and practical arguments, much like the one I opened with. https://bit.ly/2oQaJj4 (SAFE)

Maybe the utility company/airline model of compensation should be adopted, with fixed, time-dependent compensation payments and an ombudsman to regulate?  Data is as much of a utility as water and electricity, so why not follow suit?

This post will be updated as the case continues.

