Incident Management

Definition

An incident is any event capable of compromising the availability, security or normal operations of the Turtl platform or the data we hold. Such incidents are managed carefully according to the incident management process to limit disruption and return to business as usual as quickly as possible.

Incident management process

Incidents move through the following well-tested process to arrive at resolution:

  1. Report
    The incident is reported via automated alerts or manual submission.
  2. Acknowledgement
    The incident is acknowledged by the Operations Team.
  3. Assignment
    An Incident Commander is assigned who is responsible for:

    1. Internal and external incident communication;
    2. Establishing and maintaining an incident timeline;
    3. Recording all key events and decisions.
  4. Diagnosis
    The incident is fully explored and steps to resolution are determined.
  5. Communication
    Progress updates are regularly communicated to all affected parties.
  6. Development and testing
    A resolution is developed and QA’d in sandbox environments where applicable.
  7. Deployment
    The resolution is deployed to production environments.
  8. Resolution notification
    Affected parties are notified of resolution.
  9. Post-mortem
    An incident post-mortem is carried out and its findings are communicated to affected parties, including lessons learned, steps taken and a request for feedback.
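
As an illustration only, the nine stages above can be modelled as a simple linear state machine with a timeline, similar to the record an Incident Commander maintains. This is a minimal sketch, not a description of Turtl's tooling: the `Stage` and `Incident` names are hypothetical, and the strictly sequential ordering is a simplifying assumption (in practice, communication may run alongside other stages).

```python
from enum import IntEnum


class Stage(IntEnum):
    """Incident stages, numbered as in the process above."""
    REPORT = 1
    ACKNOWLEDGEMENT = 2
    ASSIGNMENT = 3
    DIAGNOSIS = 4
    COMMUNICATION = 5
    DEVELOPMENT_AND_TESTING = 6
    DEPLOYMENT = 7
    RESOLUTION_NOTIFICATION = 8
    POST_MORTEM = 9


class Incident:
    """Hypothetical record that tracks an incident's stage and timeline."""

    def __init__(self, summary: str) -> None:
        self.summary = summary
        self.stage = Stage.REPORT
        # Timeline of (stage, note) entries recording key events and decisions.
        self.timeline: list[tuple[Stage, str]] = [(Stage.REPORT, summary)]

    def advance(self, note: str) -> Stage:
        """Move to the next stage and record the event on the timeline."""
        if self.stage == Stage.POST_MORTEM:
            raise ValueError("incident is already at the post-mortem stage")
        self.stage = Stage(self.stage + 1)
        self.timeline.append((self.stage, note))
        return self.stage


# Example usage with a made-up incident.
incident = Incident("Elevated error rate reported by automated alert")
incident.advance("Acknowledged by the Operations Team")
incident.advance("Incident Commander assigned")
```

Keeping the timeline as an append-only list means every stage change is recorded together with the decision that triggered it, which is the raw material for the post-mortem.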

Case studies

All incidents will, by definition, have their own unique challenges and circumstances. Below, we describe our plans for handling a range of common incidents affecting SaaS platforms and the steps taken to preempt and mitigate them.

Infrastructure failure

The Turtl infrastructure and application have been designed to be highly available and self-healing. The system detects for itself when a component enters a failure state and automatically replaces the unhealthy component with a healthy copy.

In addition, Turtl routinely runs simulated incidents to test our plans and processes for infrastructure failures which cannot be self-healed. Simulations are conducted in a controlled manner and take advantage of scheduled maintenance windows to ensure minimal disruption to customers.

In the event of a real-world infrastructure incident, Turtl’s on-call ops team receive automated telephone calls notifying them of the incident and providing first-stage diagnostic information.

In general, the affected infrastructure component can be replaced without incurring any downtime.

In very rare cases, multiple simultaneous failures may occur, requiring a full migration of Turtl’s infrastructure to an alternative datacenter. An emergency maintenance window of approximately two hours will be enacted in this case.

Data corruption and human error

Each customer’s data is backed up on a regular basis and retained as follows:

  • Bihourly backups for one day
  • Daily backups for one week
  • Weekly backups for one month
  • Monthly backups for one year
  • Yearly backups for three years
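
The retention schedule above can be expressed as a small lookup table. The following is a minimal sketch under stated assumptions, not Turtl's actual backup tooling: the tier names, `RETENTION`, `is_retained` and `expired` are hypothetical, and "one month" and "one year" are approximated as 30 and 365 days.

```python
from datetime import datetime, timedelta

# Retention window per backup tier, taken from the schedule above.
# Assumptions: "one month" = 30 days, "one year" = 365 days.
RETENTION = {
    "bihourly": timedelta(days=1),
    "daily": timedelta(weeks=1),
    "weekly": timedelta(days=30),
    "monthly": timedelta(days=365),
    "yearly": timedelta(days=3 * 365),
}


def is_retained(tier: str, created: datetime, now: datetime) -> bool:
    """Return True while a backup of the given tier is inside its retention window."""
    return now - created <= RETENTION[tier]


def expired(backups: list[tuple[str, datetime]], now: datetime) -> list[tuple[str, datetime]]:
    """List the (tier, created) backups that have aged out of their window."""
    return [(tier, created) for tier, created in backups if not is_retained(tier, created, now)]
```

For example, a daily backup taken six days ago is still retained under this sketch, while one taken eight days ago has aged out; a weekly backup taken in the same period would still be available to cover that gap.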

Should one or more customers’ data become corrupted due to application error, system fault, or human error, the affected data will be restored to an earlier, uncorrupted state.

All requests for restoration of data are initiated by an Account Manager in consultation with affected customers and the Operations Team, who will assist in determining the severity of the incident and identifying the complete list of affected customers.

On completion of this analysis, affected customers are notified of the incident and their accounts are placed in emergency maintenance to prevent further corruption.

Once damages are quantified and the point of corruption is identified, a suitable uncorrupted backup is identified and data restoration takes place. Turtl will assist customers in retrieving any information not present in the selected backup set on a case-by-case basis.

Traffic spikes

The Turtl infrastructure scales automatically to handle the traffic we receive. It is conceivable that an extreme traffic spike could occur which outpaces the speed at which automated scaling can respond, resulting in degraded performance.

In this case, the Operations Team will manually modify infrastructure topology to handle the surge in traffic without any downtime.

The Turtl infrastructure undergoes regular stress tests designed to ensure capacity for traffic spikes of up to 5,000% in any 24-hour period without human intervention.
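
To give a sense of scale, the sketch below estimates how many instances a spike of that size would require. It is illustrative only: it assumes that a 5,000% spike means peak traffic of 50 times the baseline (the figure could also be read as an increase of 5,000% over baseline), and the function name, baseline load and per-instance capacity are all hypothetical.

```python
import math


def instances_needed(baseline_rps: float, spike_percent: float, rps_per_instance: float) -> int:
    """Instances required to serve a peak equal to spike_percent % of baseline traffic."""
    peak_rps = baseline_rps * spike_percent / 100
    return math.ceil(peak_rps / rps_per_instance)


# Hypothetical numbers: 200 requests/s at baseline, 500 requests/s per instance.
print(instances_needed(200, 5000, 500))  # prints 20
```

Under these assumed numbers, a single instance covers the baseline but twenty are needed at peak, which is why scaling has to be automated rather than manual.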

Team member absence

All members of the Operations Team are trained in the management of our infrastructure, deployment process and application configuration.

At any given time, at least three members of the Operations Team are available and fully capable of responding to incidents affecting our service. The Operations Team is split across two geographic locations to mitigate the effect of localised issues such as power or network outages.

Infrastructure provider independence

The entirety of the Turtl infrastructure, from setup and resource provisioning to application deployment, is fully automated and is, to a high degree, provider agnostic.

In the very unlikely event of our current provider (Amazon Web Services) going out of business or otherwise becoming unsuitable for our purposes, we will undertake migration to a new provider.

The process of switching providers is outside the scope of this document, but plans and expertise are maintained to perform such a migration with minimal impact on service availability.