On Saturday at 7am PDT, a maintenance operation performed by the platform vendor that hosts our Elasticsearch (ES) cluster caused one of our ES servers to be replaced. The replacement server came back up in an unhealthy state, and our engineering teams began receiving alerts and notifications about poor performance and queued jobs.
For context, Roadster uses Elasticsearch to power much of our inventory and vehicle data services, as well as the snappy search capability across the Express Storefront. While the problems were limited to the Elasticsearch services, the impact across the platform was widespread and, at times, resulted in slow page loads, 500 errors and generally poor performance for the majority of our customers during the incident.
After attempting to reboot the ES cluster without success, we redirected traffic to our disaster recovery instance in Oregon while simultaneously rebuilding our primary stack in the Virginia datacenter. By the end of the day Saturday we had a rebuilt cluster in Virginia that we began to reengage with live traffic. It handled the lower evening traffic levels acceptably but still showed signs of weakness.
Sunday morning we received alerts of poor performance once traffic started to ramp. Engineering teams rebuilt the cluster with additional servers, which returned performance to normal around 10am PDT.
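For readers curious what "return to normal" means in practice, the sketch below shows the kind of cluster health check used to confirm an Elasticsearch cluster has recovered after a rebuild. It is a minimal illustration only, not Roadster's actual tooling; the hostname is hypothetical, and it relies on Elasticsearch's standard _cluster/health API.

```python
# Minimal sketch of an ES recovery check (hypothetical endpoint, not our production tooling).
import sys
import requests

ES_HOST = "http://es-primary.example.internal:9200"  # hypothetical cluster endpoint


def cluster_is_healthy(timeout_s: int = 5) -> bool:
    """Return True when the cluster reports 'green' status and has no unassigned shards."""
    resp = requests.get(f"{ES_HOST}/_cluster/health", timeout=timeout_s)
    resp.raise_for_status()
    health = resp.json()
    return health["status"] == "green" and health["unassigned_shards"] == 0


if __name__ == "__main__":
    ok = cluster_is_healthy()
    print("cluster healthy" if ok else "cluster degraded")
    sys.exit(0 if ok else 1)
```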
The key measures we are putting in place to prevent this from happening again include:
The entire Roadster team apologizes for the frustration this caused our dealer partners and consumers, who had to deal with poor performance and unresponsive pages for much of Saturday. We are confident that the measures we are putting in place will prevent an event like this from happening again.