LivePerson’s Disaster Recovery (DR) Program is designed to align with industry best practices for governance, documentation, and disaster recovery exercises. LivePerson maintains policies governing the management and testing of the DR program.
Data center resiliency (LP Cloud)
LivePerson maintains data centers in the United States, the United Kingdom, and Australia. These facilities are equipped with N+1 or greater UPS and cooling power. The LivePerson infrastructure incorporates redundancy in all components including but not limited to network interconnects, routing, switching, firewalls, load balancers, compute clusters, and database clusters.
LivePerson performs local and offsite encrypted backups to an alternate processing site. Source code and system configuration artifacts, which are stored in the source code repository, can be used to restore the infrastructure and services at an alternate site in the event of a disaster.
Public Cloud and Multi-AZ architecture
LivePerson's public cloud initiative heavily relies on the cloud provider's features as the foundation for its recovery strategies. By utilizing Google Cloud Platform (GCP), LivePerson has established resilient and dependable infrastructure and applications, employing active-active and Multi-Availability Zones (AZ) architectures to effectively withstand disruptions and ensure swift recovery.
LivePerson uses GCP regions in the United States, Europe, and APAC. Each brand’s data resides in the region the brand selects, and no data is replicated outside that region.
Threat analysis
There are two scenarios, based on different threats, that are considered a disaster situation:
- Total Farm failure/Availability Zone failure
- Total LOB (Line of Business) failure
For each scenario, LivePerson has identified the critical components and dependencies that allow for the fastest recovery time.
Disaster recovery process
In the event of a disaster or significant disruption impacting LivePerson's premises or operations, the Crisis Action Team (comprising key LivePerson stakeholders) will lead the overarching response. This effort will be carried out in coordination with relevant internal teams and external agencies, as needed, to ensure all stakeholders are appropriately informed and engaged across the following phases:
- Phase 1: Discovery and Assessment
- Phase 2: Determination and Reporting
- Phase 3: Response Actions
- Phase 4: Recovery Completion and Validation
Failover decision
LivePerson's Disaster Recovery Plan mandates that once a data center is confirmed to be non-operational or failure is deemed imminent, senior management is engaged promptly to authorize failover, forming a critical step in our rapid response protocol.
Tactical Recovery:
For On Premises (LP Cloud)
Once one of the disaster scenarios above is identified, the situation is mitigated by executing the following steps:
- Rebuild infrastructure using the system configuration artifacts.
- Rebuild the data stores using the offsite backup of the data.
- Deploy services using repositories of released software packages.
- Validate the recovery of the services and datastores using automated and manual tests.
- Update global DNS settings to reflect the change of the primary data center role.
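The ordering of the steps above matters: DNS should only point at the rebuilt site after validation succeeds. The following is a minimal sketch of that sequencing, not LivePerson's actual tooling. The step names and the `run_recovery` function are hypothetical, and each step is represented by a caller-supplied function that returns True on success.

```python
from typing import Callable, Dict, List

# Hypothetical step names mirroring the list above, in execution order.
RECOVERY_STEPS: List[str] = [
    "rebuild_infrastructure",
    "restore_data_stores",
    "deploy_services",
    "validate_recovery",
    "update_global_dns",
]


def run_recovery(actions: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each step in order and stop at the first failure.

    Returns the names of the steps that completed successfully, so a
    failed validation never leads to a DNS update.
    """
    completed: List[str] = []
    for name in RECOVERY_STEPS:
        if not actions[name]():
            break  # Do not continue past a failed step.
        completed.append(name)
    return completed
```

In this sketch, if the function supplied for `validate_recovery` returns False, the result contains only the first three steps and the DNS change is never attempted.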
For the Public Cloud (after migration for the region)
LivePerson employs active-active Multi-Availability Zone architecture in the Public Cloud (GCP). If an AZ becomes unavailable, the horizontally scaled microservices in other AZs seamlessly take over. In addition, any bottlenecks, delays, or failures during the failover are resolved promptly through manual action so that services keep running and, where needed, are recovered within the Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
Disaster recovery tests
LivePerson’s disaster recovery plan is validated and tested at least once per year.
Test goals
- Determine if applications can support the workload in the reduced AZ footprint in a GCP region.
- Validate the operation of all microservices, datastores and infrastructure post-AZ failure (simulated or actual).
- Find and resolve any bottlenecks, delays or failures during the failover activity to the disaster recovery site.
- Validate RTO and RPO.
- Update and train applicable LivePerson personnel regarding failover procedures.
- Identify activities during the failover process that can be optimized or automated to improve SLA.
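The first goal above, checking whether the workload fits in a reduced AZ footprint, comes down to simple capacity arithmetic. The sketch below illustrates it under stated assumptions: capacity is spread evenly across AZs, load is measured in a single unit such as requests per second, and the function name and parameters are hypothetical rather than taken from LivePerson's test tooling.

```python
def survives_az_loss(peak_load: float, per_az_capacity: float,
                     az_count: int, lost_azs: int = 1) -> bool:
    """Return True if the remaining AZs can absorb the peak load.

    Assumes every AZ offers the same capacity and that load is
    redistributed evenly across the AZs that remain.
    """
    remaining = az_count - lost_azs
    if remaining <= 0:
        return False  # No AZ is left to serve traffic.
    return remaining * per_az_capacity >= peak_load
```

For example, with three AZs of 500 units each, losing one AZ leaves 1,000 units, so a peak load of 900 fits while a peak load of 1,100 does not.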
Test strategy
The disaster recovery test strategy consists of detailed evaluation, planning and documentation. The main objectives of the tests are:
- Define target time frame, scope and goals of each disaster recovery test.
- Perform dry runs and live tests and keep track of historical tests. Dry run tests are executed using automated tools based on standard procedures and templates to ensure efficiency and consistency in these simulated tests.
Monitoring
Beyond the ongoing system monitoring done in the DR sites, we implemented an end-to-end mechanism that constantly tests our services. The tests are based on our end-to-end scenarios running on our active sites. This continuous monitoring mechanism gives us greater confidence that if a major disruption occurs, we can promptly initiate our DR plan. Further, it helps us catch and proactively address any potential problems that might arise with our services or our ability to recover, especially in between our more comprehensive scheduled DR tests.
Recovery Times
LivePerson's recovery time objective (RTO) is 2 hours, and its recovery point objective (RPO) is 24 hours once a specific region is migrated to GCP.
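To make the two objectives concrete, the sketch below checks a single test run or incident against them. It uses the common definitions: RTO bounds the time from outage to restored service, and RPO bounds the age of the most recent recoverable data at the moment of the outage. The function and parameter names are hypothetical, and the timestamps would come from whatever records a test produces.

```python
from datetime import datetime, timedelta

RTO = timedelta(hours=2)   # Maximum acceptable downtime.
RPO = timedelta(hours=24)  # Maximum acceptable window of data loss.


def meets_objectives(last_backup: datetime, outage_start: datetime,
                     service_restored: datetime) -> bool:
    """Return True if both the RTO and the RPO were met."""
    downtime = service_restored - outage_start
    data_loss_window = outage_start - last_backup
    return downtime <= RTO and data_loss_window <= RPO
```

Under these definitions, a recovery that takes 90 minutes with a backup taken 6 hours before the outage meets both objectives, while a 3-hour recovery or a 30-hour-old backup does not.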
Additional information
For additional information, please refer to the DR Overview document at the LivePerson Trust Center:
https://trust.liveperson.com/d/dr-overview/45mKHb?lng=en