An incident with Elasticsearch is impacting Express Store services
Incident Report for Roadster
Postmortem

On Saturday (September 19) at 7am PDT, a maintenance operation performed by the platform vendor that hosts our Elasticsearch (ES) cluster caused one of our ES servers to be replaced. The replacement server came back up in an unhealthy state, and our engineering teams began receiving alerts and notifications about poor performance and queued jobs.

For context, Roadster uses Elasticsearch to power much of our inventory and vehicle data services and to enable snappy search across the Express Storefront. While the problems were limited to the Elasticsearch services, the impact across the platform was widespread and, at times, resulted in slow page loads, 500 errors, and generally poor performance for the majority of our customers during the incident.
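To illustrate the role ES plays here, the sketch below shows the general shape of a storefront-style inventory search against an Elasticsearch index. It is a minimal, hypothetical example: the endpoint, index name, and field names are placeholders rather than our actual schema, but it conveys why a slow or unhealthy cluster shows up directly as slow pages.

    import requests

    # Placeholder endpoint and index name -- not our actual infrastructure.
    ES_URL = "http://localhost:9200"
    INDEX = "inventory"

    def search_inventory(make, max_price, size=20):
        """Run a simple filtered full-text search against the inventory index."""
        query = {
            "size": size,
            "query": {
                "bool": {
                    "must": [{"match": {"make": make}}],                   # full-text match
                    "filter": [{"range": {"price": {"lte": max_price}}}],  # price ceiling
                }
            },
            "sort": [{"price": "asc"}],
        }
        resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=query, timeout=5)
        resp.raise_for_status()
        return [hit["_source"] for hit in resp.json()["hits"]["hits"]]

    for vehicle in search_inventory("Honda", 30000):
        print(vehicle)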

After attempting to restart the ES cluster without success, we redirected traffic to our disaster recovery (DR) instance in Oregon while simultaneously rebuilding our primary stack in the Virginia datacenter. By the end of the day Saturday we had a rebuilt cluster in Virginia and began moving live traffic back onto it; it handled the lower evening traffic levels acceptably but still showed signs of strain.

On Sunday morning we received alerts of poor performance as traffic started to ramp up. Engineering teams rebuilt the cluster with additional servers, which returned performance to normal around 10am PDT.
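For readers curious what a healthy rebuild looks like in Elasticsearch terms, recovery is typically confirmed by polling the standard _cluster/health API until the cluster reports green with the expected number of nodes and no unassigned shards. The snippet below is an illustrative sketch, not our internal tooling; the URL and node count are placeholders.

    import time
    import requests

    ES_URL = "http://localhost:9200"   # placeholder endpoint
    EXPECTED_NODES = 6                 # placeholder node count

    def wait_for_green(timeout_s=600, interval_s=10):
        """Poll _cluster/health until the cluster is green with all expected nodes."""
        deadline = time.time() + timeout_s
        while time.time() < deadline:
            health = requests.get(f"{ES_URL}/_cluster/health", timeout=5).json()
            print(health["status"], health["number_of_nodes"], health["unassigned_shards"])
            if (health["status"] == "green"
                    and health["number_of_nodes"] >= EXPECTED_NODES
                    and health["unassigned_shards"] == 0):
                return True
            time.sleep(interval_s)
        return False

    if not wait_for_green():
        raise SystemExit("cluster did not reach green within the timeout")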

The key measures we will be putting into place to prevent this from happening again include:

  1. Increasing our support tier with our ES vendor to include prioritized support during off hours
  2. Significantly increasing the performance of our existing ES cluster
  3. Increasing the rehearsal frequency of our DR strategy to ensure a more seamless failover to the Oregon datacenter
  4. Evaluating the addition of a backup ES cluster kept ready for failover (see the sketch after this list)
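As a rough illustration of what "kept ready for failover" (item 4) can mean in practice, a standby cluster is usually checked on a schedule: is it green, and does it hold the same indices with roughly the same document counts as the primary? The sketch below uses the standard _cluster/health and _cat/indices APIs; the hostnames and lag threshold are placeholders, not a description of our actual tooling.

    import requests

    PRIMARY = "http://primary-es.example.com:9200"   # placeholder hosts
    STANDBY = "http://standby-es.example.com:9200"

    def doc_counts(es_url):
        """Return {index_name: doc_count} using the _cat/indices API."""
        rows = requests.get(f"{es_url}/_cat/indices?format=json", timeout=5).json()
        return {row["index"]: int(row["docs.count"] or 0) for row in rows}

    def standby_ready(max_lag_ratio=0.01):
        """Check that the standby is green and within ~1% of the primary's doc counts."""
        health = requests.get(f"{STANDBY}/_cluster/health", timeout=5).json()
        if health["status"] != "green":
            return False
        primary, standby = doc_counts(PRIMARY), doc_counts(STANDBY)
        for index, count in primary.items():
            if count and abs(count - standby.get(index, 0)) / count > max_lag_ratio:
                return False
        return True

    print("standby ready for failover:", standby_ready())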

The entire Roadster team apologizes for the frustration this caused our dealer partners and the consumers who had to deal with poor performance and unresponsive pages for most of Saturday. We are confident that the measures we are putting in place will prevent an event like this from happening again.

Posted Sep 22, 2020 - 17:11 UTC

Resolved
This incident has been resolved. A detailed postmortem will be published in the next 48 hours.
Posted Sep 20, 2020 - 04:48 UTC
Monitoring
A permanent fix has been implemented, and we have seen a significant improvement in performance and the elimination of 500 errors. We will be monitoring performance throughout the evening to confirm that the issues are fully resolved.
Posted Sep 20, 2020 - 02:48 UTC
Update
Work continues on a permanent fix. Services remain available but with significant performance degradation. The problem continues to be with our Elasticsearch cluster, which powers many of the inventory and search capabilities of the site.

Dealerships will likely be experiencing some of the following issues:
- Slow page load speeds
- Some 500 responses
- Delayed refresh of inventory and pricing
Posted Sep 20, 2020 - 00:02 UTC
Update
We are continuing to work on an issue discovered this morning that is causing degraded performance across the Express Storefront.

The cause is a configuration issue within our Elasticsearch service, and our team is currently working on bringing up a larger instance.

Stores remain operable but may be sluggish at times and unresponsive to some requests. We will continue to provide updates as we have them.
Posted Sep 19, 2020 - 20:58 UTC
Identified
We’ve identified an issue with our Elasticsearch services that may be impacting availability of the Express Storefront.
Posted Sep 19, 2020 - 15:45 UTC
This incident affected: Americas Datacenter (Express Storefront, Dealer Admin).