
Less Annoying CRM Apology Letter – or – How it’s done

Monday, August 24th, 2015

I’ve been a Less Annoying CRM user for about a year now and I love it. (Note: if you use this Less Annoying CRM affiliate link, you’ll get 60 days of free use instead of the normal 30.)

Recently they had some issues on their site, and the level of transparency was amazing.

If you ever have to issue an apology letter to your customers, you can’t go wrong using this as a template.

Less Annoying CRM server issues – apology and post-mortem

from: Tyler King <tyler@lessannoyingcrm.com>
to: “malcolm.b.anderson@pragmaticagility.com” <malcolm.b.anderson@pragmaticagility.com>
date: Mon, Aug 17, 2015 at 8:08 AM

You may have noticed that the Less Annoying CRM website had some problems last Friday and over the weekend. We take great pride in the speed and reliability of our site, and it’s a huge embarrassment when we let you down like this. I’m writing to apologize, and to explain what happened.

Let me start by saying that I’m sorry. The entire LACRM team is sorry. This has been a stressful couple of days for us as we worked to fix the server issues, yes, but I know that’s not even close to what our customers experienced when they weren’t able to access their crucial data when they needed it.

So what happened? First, in case you’re worried, let me assure you that no data was lost (aside from anything you tried to save while the site was down, including logged emails). The site is running smoothly and we have no indication that any of the previous problems are still present, so performance should be back to normal.

If you’re the type of person who is interested in the nitty-gritty, I’ve included below both a short and a long description of the issues, as well as the actions we took to solve them. Please let me know if you have any questions. Once again, please accept my deepest personal apology.

Thanks,
Tyler
CEO of Less Annoying CRM

—The short story—
Friday morning (US time), the site went down due to a database failure. No data was lost, but it took about one hour for us to bring up our backup server, and the site was inaccessible during that time. When the site came back up, it was clear that our failover database still wasn’t behaving correctly, and there were performance issues for the rest of the day. We believed we had fixed the issue Friday night, but unfortunately it recurred Saturday morning, so the site was slow most of the day on Saturday. The fix we attempted Saturday night worked temporarily, but the problem was back Sunday morning. On Sunday we decided to take more drastic action, which required us to put the site into “read-only” mode for the day while we built an entirely new database from scratch. The new database went live Sunday night, and we haven’t had any problems since then. We’ll continue to monitor the situation, and we’ve already taken steps to strengthen our infrastructure for the future.

—The long (more technical) story—
Our databases crashed Friday morning (US time) for unknown reasons. We keep a failover database ready at all times so that it’s easy to recover from this type of scenario. Switching to the failover database is supposed to be nearly instant, but (again, for unknown reasons) it took much longer than it typically does. We suspect that something was wrong with the hardware in question. Because of this extra delay, it took us about one hour from the start of the downtime before the site was live again.
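(The letter doesn’t say what database Less Annoying CRM runs, but the arrangement Tyler describes, a standby kept ready to take over when the primary dies, is a standard failover pattern. Here’s a minimal, hypothetical sketch of the watchdog side of it; the host names, port, thresholds, and promotion step are all invented for illustration, not LACRM’s actual setup.)

```python
# Hypothetical failover watchdog. Everything here (hosts, port,
# thresholds, the promotion step) is illustrative, not LACRM's setup.
import socket
import time

PRIMARY = ("db-primary.internal", 3306)    # assumed MySQL-style port
MAX_FAILURES = 3        # consecutive failed checks before failing over
CHECK_INTERVAL = 5      # seconds between health checks

def is_reachable(host, port, timeout=2):
    """Return True if a TCP connection to the database succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def promote_failover():
    """Placeholder: promote the standby and repoint the application.
    A real deployment would run replication-promotion commands and
    update DNS or the app's connection configuration here."""
    print("Promoting failover database and repointing the application...")

def watchdog():
    failures = 0
    while True:
        if is_reachable(*PRIMARY):
            failures = 0
        else:
            failures += 1
            if failures >= MAX_FAILURES:
                promote_failover()
                return
        time.sleep(CHECK_INTERVAL)

if __name__ == "__main__":
    watchdog()
```

Even with automation like this, promotion involves replication catch-up and cold caches, which is part of why a switch that should take seconds can stretch out when the hardware is already misbehaving.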

With the failover complete, the site continued to have intermittent database problems throughout Friday. The site was still generally accessible and there was no more extended downtime, but we noticed continued slowness and occasional brief outages that quickly self-corrected. After looking into it, we believed the problems stemmed from the same database-server issue that caused the initial failover to take so long. Friday night we switched to an entirely new database server, and the problems subsided.

Unfortunately, the problems returned Saturday morning. This time they followed a pattern. First, the site would experience major slowdowns caused by database issues. After some period of time (typically around 30 minutes) the database would calm down and the site would perform normally for about 20-30 minutes, and then the cycle would start over. The site was generally available during this time, but during the bad periods it would run very slowly (10-20+ seconds per page load).

We spent all day Saturday diagnosing the problem. We tested countless possible solutions and were eventually able to eliminate certain possibilities. For example, we confirmed that the problem wasn’t caused by user behavior (i.e., a surge of traffic we couldn’t handle). We also tried stopping all of our various services to see if that would prevent the database from choking, and nothing worked. That left us with one possibility: the hardware was faulty. Saturday night, we attempted to restart the database on a new virtual machine. After this, the site seemed to be running smoothly until Sunday morning, when the problems appeared again.
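(Diagnosis by elimination is the interesting part here. Ruling out a traffic surge usually means checking whether the slow periods line up with request-rate spikes: pages that are slow while traffic is normal point at the database or hardware instead. A toy version of that check, assuming an access log with one timestamp and response time per line:)

```python
# Toy check of the "traffic surge" theory. The log format below
# ("2015-08-15T10:31:02 0.84", i.e. timestamp and seconds) is assumed.
from collections import defaultdict
from statistics import mean

def bucket_stats(log_lines):
    """Group requests into per-minute buckets of (count, average latency)."""
    buckets = defaultdict(list)
    for line in log_lines:
        timestamp, latency = line.split()
        minute = timestamp[:16]            # truncate to YYYY-MM-DDTHH:MM
        buckets[minute].append(float(latency))
    return {m: (len(v), mean(v)) for m, v in buckets.items()}

def slow_but_quiet(stats, slow_threshold=10.0, rate_threshold=100):
    """Minutes that were slow *without* unusually high traffic,
    evidence against the surge theory and for a database/hardware fault."""
    return [m for m, (count, avg) in sorted(stats.items())
            if avg >= slow_threshold and count < rate_threshold]
```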

Leading up to Sunday, we had been trying to find a solution that wouldn’t result in major disruptions to the CRM (the site was still running, it was just slow). Sunday morning we decided to bite the bullet and completely rebuild the database on all-new hardware, along with some other changes (such as switching to solid-state drives). Unfortunately, this required us to put the site into a read-only mode so that no new data could be written while we created the new database. The site was in read-only mode for most of the day on Sunday, until the new database was ready that night. As soon as the new database was ready and tested, we turned off read-only mode and everything was back to normal.
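(An application-level read-only mode like this is often implemented as a maintenance flag that lets reads through but rejects writes. A hypothetical sketch as WSGI middleware; the flag-file path is a placeholder, and the letter doesn’t say how LACRM actually did it:)

```python
# Hypothetical read-only switch for maintenance windows. The flag-file
# path and WSGI framing are assumptions, not LACRM's actual mechanism.
import os

READ_ONLY_FLAG = "/etc/app/read_only"      # ops creates this file during maintenance
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}  # requests that cannot write data

def read_only_middleware(app):
    """While the flag file exists, reject writes but keep serving reads."""
    def wrapper(environ, start_response):
        method = environ.get("REQUEST_METHOD", "GET")
        if method not in SAFE_METHODS and os.path.exists(READ_ONLY_FLAG):
            start_response("503 Service Unavailable",
                           [("Content-Type", "text/plain"),
                            ("Retry-After", "3600")])
            return [b"The site is temporarily read-only for maintenance."]
        return app(environ, start_response)
    return wrapper
```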

We will continue to monitor things to make sure there aren’t lingering issues, but the fact that the site is running smoothly under the high load of Monday-morning traffic is a great sign. The changes we made on Sunday are already a big step toward preventing this type of thing from happening again, and our technical team will be making further changes to how the database works to make it even more reliable.
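(For the monitoring side, even a trivial external probe catches the symptom users actually feel, slow or failed page loads. A minimal sketch; the URL and threshold are placeholders, and a real deployment would feed this into an alerting system rather than printing:)

```python
# Minimal uptime/latency probe. URL and threshold are placeholders;
# in practice, wire the alert into a paging/alerting system.
import time
import urllib.request

URL = "https://www.lessannoyingcrm.com/"
SLOW_SECONDS = 5.0      # assumed alert threshold for a single page load

def probe():
    start = time.monotonic()
    try:
        with urllib.request.urlopen(URL, timeout=30) as resp:
            ok = (resp.status == 200)
    except OSError:
        ok = False
    elapsed = time.monotonic() - start
    if not ok or elapsed > SLOW_SECONDS:
        print(f"ALERT: ok={ok}, load time {elapsed:.1f}s")

if __name__ == "__main__":
    while True:
        probe()
        time.sleep(60)
```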

Once again, I’m very sorry about this problem. I know that this explanation doesn’t make up for the inconvenience we put you through, but I always want to be as transparent as possible any time there is a problem. Please let me know if you have any questions.