An incident is any event capable of compromising the availability, security or normal operations of the Turtl platform or the data we hold. Such incidents are managed carefully according to the incident management process to limit disruption and return to business as usual as quickly as possible.
Incidents move through the following well-tested process to arrive at resolution:
All incidents will, by definition, have their own unique challenges and circumstances. Below, we describe our plans for handling a range of common incidents affecting SaaS platforms and the steps taken to preempt and mitigate them.
The Turtl infrastructure and application have been designed to be highly available and self-healing. The system detects for itself when a component enters a failure state and automatically replaces any unhealthy component with a healthy copy.
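As a rough illustration of the detect-and-replace behaviour described above, the following minimal sketch walks a fleet of components and swaps any unhealthy one for a fresh copy. It is not Turtl's actual implementation: the `Component` type, the `launch_replacement` helper and the component names are all hypothetical, and a real system would delegate these steps to the hosting provider's health checks and provisioning tools.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Component:
    """Hypothetical stand-in for a server, container or similar unit."""
    name: str
    healthy: bool = True


_replacement_ids = count(1)


def launch_replacement(failed: Component) -> Component:
    """Hypothetical provisioning step: return a fresh, healthy copy."""
    base = failed.name.split("#")[0]
    return Component(name=f"{base}#{next(_replacement_ids)}", healthy=True)


def reconcile(fleet: list[Component]) -> list[Component]:
    """Keep healthy components and replace each unhealthy one."""
    return [c if c.healthy else launch_replacement(c) for c in fleet]


# Illustrative fleet: one component has entered a failure state.
fleet = [Component("web"), Component("worker", healthy=False), Component("db")]
healed = reconcile(fleet)
```

Healthy components are passed through untouched, so only the failed unit is replaced and the rest of the fleet keeps serving traffic.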
In addition, Turtl routinely runs simulated incidents to test our plans and processes for infrastructure failures which cannot be self-healed. Simulations are conducted in a controlled manner and take advantage of scheduled maintenance windows to ensure minimal disruption to customers.
In the event of a real-world infrastructure incident, Turtl’s on-call ops team receive automated telephone calls notifying them of the incident and providing first-stage diagnostic information.
In general, the affected infrastructure component can be replaced without incurring any downtime.
In very rare cases, multiple simultaneous failures may occur, requiring a full migration of Turtl’s infrastructure to an alternative datacenter. An emergency maintenance window of approximately two hours will be enacted in this case.
Each customer’s data is backed up on a regular basis and retained as follows:
Should one or more customers’ data become corrupted due to application error, system fault, or human error, the affected data will be restored to an earlier, uncorrupted state.
All requests for restoration of data are initiated by an Account Manager in consultation with the affected customers and the Operations Team. The Operations Team assists in determining the severity of the incident and identifying the complete list of affected customers.
On completion of this analysis, affected customers are notified of the incident and their accounts are placed in emergency maintenance to prevent further corruption.
Once the damage has been quantified and the point of corruption identified, a suitable uncorrupted backup is selected and data restoration takes place. Turtl will assist customers in retrieving any information not present in the selected backup set on a case-by-case basis.
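The backup selection step above can be sketched as choosing the most recent backup taken strictly before the point of corruption. This is an illustrative assumption about how "suitable" might be decided, not a description of Turtl's tooling; the `select_backup` function and the timestamps are hypothetical.

```python
from datetime import datetime
from typing import Optional


def select_backup(backup_times: list[datetime],
                  corrupted_at: datetime) -> Optional[datetime]:
    """Return the latest backup taken strictly before the corruption,
    or None if every available backup is already affected."""
    candidates = [t for t in backup_times if t < corrupted_at]
    return max(candidates, default=None)


# Illustrative data only: three daily backups and a corruption
# identified on the afternoon of the second day.
backups = [
    datetime(2024, 1, 1, 2, 0),
    datetime(2024, 1, 2, 2, 0),
    datetime(2024, 1, 3, 2, 0),
]
chosen = select_backup(backups, corrupted_at=datetime(2024, 1, 2, 15, 30))
```

Anything written between the chosen backup and the point of corruption is absent from the restored data, which is the information customers may need help retrieving.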
The Turtl infrastructure scales automatically to handle the traffic we receive. It is conceivable that an extreme traffic spike could occur which outpaces the speed at which automated scaling can respond, resulting in degraded performance.
In this case, the Operations Team will manually modify infrastructure topology to handle the surge in traffic without any downtime.
The Turtl infrastructure undergoes regular stress tests designed to ensure capacity for traffic spikes of up to 5,000% in any 24-hour period without human intervention.
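To give a sense of scale, the short calculation below converts a percentage spike into a peak load. It assumes the 5,000% figure means an increase over baseline traffic (that is, a peak of 51 times baseline); if it instead means 5,000% of baseline, the multiplier is 50. The baseline of 200 requests per second is a made-up figure for illustration.

```python
def peak_after_spike(baseline_rps: float, spike_percent: float) -> float:
    """Peak load if traffic rises by spike_percent over the baseline."""
    return baseline_rps * (1 + spike_percent / 100)


# Hypothetical baseline of 200 requests per second.
peak = peak_after_spike(200, 5000)  # 200 * 51 = 10200.0
```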
All members of the Operations Team are trained in the management of our infrastructure, deployment process and application configuration.
At any given time, at least three members of the Operations Team are available and fully capable of responding to incidents affecting our service. The Operations Team is split across two geographic locations to mitigate the effect of localised issues such as power or network outages.
The entirety of the Turtl infrastructure, from setup and resource provisioning to application deployment, is fully automated and is, to a high degree, provider agnostic.
In the very unlikely event of our current provider (Amazon Web Services) going out of business or otherwise becoming unsuitable for our purposes, we will undertake migration to a new provider.
The process of switching providers is outside the scope of this document, but plans and expertise are maintained to perform such a migration with minimal impact on service availability.