Incident Response - OpenCRM Outage 01/01/2021

Resolved

Issues experienced: Loss of access to OpenCRM systems for a number of customers. Customers may have been able to load their login pages but were unable to log in, with a "Maintenance" message displayed. Some customers may have been able to connect but experienced intermittent stability issues.

Total time period affected: 11:30am - 3:55pm

Cause of outage: A power and network outage in our primary AWS data centre caused an extended loss of connectivity to OpenCRM servers. We are awaiting a full incident report from AWS for more details on this.

Cause of redundancy failure: As the AWS issue affected all servers available in the primary data centre, failover to secondary servers in the same data centre was not possible. Failover was therefore attempted to our secondary (disaster recovery) data centre.
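
For illustration only, the sketch below outlines the failover order described above: primary servers first, then secondary servers in the same data centre, then the disaster recovery data centre. The host names, port, and health-check logic are hypothetical assumptions for clarity, not our actual implementation.

```python
# Minimal, hypothetical sketch of the failover order described above.
# Host names, port, and health checks are illustrative assumptions only.
import socket

# Ordered list of (label, host, port): primary servers first, then
# secondary servers in the same data centre, then the DR data centre.
FAILOVER_ORDER = [
    ("primary",           "primary.example.internal",   5432),
    ("primary-secondary", "secondary.example.internal", 5432),
    ("disaster-recovery", "dr.example.internal",        5432),
]

def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def select_target() -> str:
    """Walk the failover order and return the first reachable target.

    A data-centre-wide outage, as in this incident, takes out both the
    primary and the same-site secondary, so selection falls through
    to the disaster recovery data centre.
    """
    for label, host, port in FAILOVER_ORDER:
        if is_reachable(host, port):
            return label
    raise RuntimeError("no data centre reachable")

if __name__ == "__main__":
    print("failing over to:", select_target())
```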

Unfortunately, due to increased traffic on the secondary data centre following the outage in the primary data centre (to be confirmed by AWS), we experienced a number of difficulties in bringing the failover online in the secondary data centre. This was further compounded as we added extra redundancy to the disaster recovery servers so that we could fail over with confidence.

To protect the integrity of customer data, we could not bring customers online in the disaster recovery data centre without introducing extra redundancy to that service first.

It is important to note that at all stages of the outage, a fully redundant, up-to-date copy of all customer data remained available and online in our secondary data centre. Before bringing customers' systems back online, our priority was to protect the integrity and availability of that copy by completing further redundancy work and data checks first.

This work was nearing completion when full connectivity to the primary data centre and servers was restored. As such, customers remain connected to the primary database servers, not the redundant failovers, and no further maintenance is required to maintain our usual level of service.

What we are doing about this: As well as fully investigating with AWS both the initial outage and the subsequent issues in bringing redundant services online in the secondary data centre, we are once again reviewing and hardening our disaster recovery plan so that, in the event of a full failure of the primary data centre such as this one, we can bring customers online in the second data centre more quickly.

Affected components
  • Network Connectivity
  • Web Servers
  • Database Clusters
  • Task Servers