From Shiny UIs to Stable Servers: Our Journey of Ensuring Uptime and Resilience
When we started Glassix, new features were the top priority, above almost all else. Our first customers didn't lose much sleep over uptime SLAs; they wanted their shiny new UI and the next channel integration!
As we signed our first big clients, their priorities were different. While they still cared about functionality, they also cared about resiliency: a customer support center with 40 agents per shift couldn't afford to go offline because we had a lazy server.
Then, about a year and a half ago (conveniently around the time my daughter was born), our platform started responding slowly to various requests. And by slowly, I mean about 20 seconds on average. As an agent, it was almost impossible to work this way. This problem (which we later traced to a hidden backup job at our cloud provider) struck for about 20 long minutes every week until we found the cause.
This was a wake-up call for Glassix. Recognizing the critical balance between innovation and reliability, we began redirecting resources toward strengthening our infrastructure.
Now, apart from the usual checklist -- caching, horizontal scaling, and a CDN -- we also did the following:
Offline mode
Even with a lousy network signal, you can still open and view messages in your WhatsApp app. You can even send messages, and once you are back online, they will be sent automatically. Magic!
Same thing with Glassix. Try it now: go offline and browse to your Glassix workspace. You can view messages and images (that were cached beforehand), send messages, view canned replies, close tickets, and assign conversations. All these actions are queued and will be synced once you are back online. It is one of the resiliency features I'm proudest of. We used Workbox, which simplifies building offline-friendly web apps, making it easier to create reliable user experiences even when the network is unreliable.
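Under the hood, Workbox's background sync does the heavy lifting for us. The core idea can be sketched as a simple action queue; all names here are illustrative, not our actual code:

```typescript
// Minimal sketch of an offline action queue: actions taken while offline
// are buffered and replayed once connectivity returns.
type QueuedAction = { url: string; body: unknown };

class OfflineQueue {
  private queue: QueuedAction[] = [];

  constructor(
    private send: (action: QueuedAction) => Promise<void>,
    private isOnline: () => boolean,
  ) {}

  // Send immediately when online; otherwise buffer for later.
  async dispatch(action: QueuedAction): Promise<void> {
    if (this.isOnline()) {
      await this.send(action);
    } else {
      this.queue.push(action);
    }
  }

  // Called when connectivity returns (e.g. the browser's 'online' event).
  async flush(): Promise<void> {
    while (this.queue.length > 0) {
      const action = this.queue.shift()!;
      await this.send(action);
    }
  }

  get pending(): number {
    return this.queue.length;
  }
}
```

In a real app you would hook `flush()` to the browser's `online` event; Workbox's `workbox-background-sync` module implements this pattern (plus persistence in IndexedDB, so queued requests survive page reloads).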
Our offline mode can mitigate network issues that our customers may experience.
Pairs of parallel services
We use so many databases and services that I've lost count: MongoDB, SQL Server, Elasticsearch, Redis, Ably, static apps, serverless functions, and many more.
Services covered by an SLA with four-nines availability -- or 99.99% -- could be unavailable 52 minutes and 36 seconds per year. Three-nines availability -- 99.9% -- allows 8 hours and 46 minutes of downtime annually.
Let's say that each one of these core services is down for 2 hours in total once a year. With a dozen core services, that means we'll be down for at least 24 hours yearly, something we can't live with!
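The arithmetic behind those availability numbers is simple enough to check:

```typescript
// Downtime allowed per year at a given availability level.
const MINUTES_PER_YEAR = 365.25 * 24 * 60; // 525,960 minutes

function allowedDowntimeMinutes(availability: number): number {
  return (1 - availability) * MINUTES_PER_YEAR;
}

// Four nines (0.9999): ~52.6 minutes per year.
// Three nines (0.999): ~526 minutes, i.e. about 8 hours and 46 minutes.
```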
So, of course, every cloud service comes with a backup, but our bitter experience taught us that a backup alone is not enough.
Our solution is costly but works well: each core service has a redundant counterpart.
a. Azure Table storage runs alongside AWS DynamoDB.
b. SignalR with Ably.
c. Static apps are replicated across 5 regions and cached in the load balancer.
d. Fastly CDN with Azure CDN.
e. Redis with ElastiCache.
f. Multiple domains in case of a DNS outage: *.glassix.com and *.glassix.io
And the list goes on and on.
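The pattern behind each pair is the same: try the primary service, and on failure retry against its redundant twin. A minimal generic sketch (the helper name is illustrative):

```typescript
// Run an operation against the primary service; if it throws, retry it
// against the redundant secondary service.
async function withFallback<T>(
  primary: () => Promise<T>,
  secondary: () => Promise<T>,
): Promise<T> {
  try {
    return await primary();
  } catch {
    // Primary is down or erroring; the secondary takes over.
    return await secondary();
  }
}
```

A call site might look like `withFallback(() => azureTables.read(key), () => dynamoDb.read(key))`, where both clients are assumed to expose the same read interface.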
Chaos engineering
Chaos engineering is a disciplined approach to identifying failures before they become outages. By proactively testing how a system responds under stress, you can find and fix failures before they end up in the news. Basically, it means you start breaking stuff!
This way, we thoroughly identified all the choke points in our platform and solved them (with other services or different logic).
Do you love your Redis service? Shut it down (or even better, block all its IPs; this way, every request waits for the configured max timeout). That's how we discovered you can't log in to our app if Redis is down, because of our CSRF token.
Your CDN? Mess up its DNS records and see what happens.
We found out that without the CDN, our app crashes. So we implemented retry logic everywhere: first, we load from the primary CDN (Fastly); if that fails, we fall back to the secondary one (Azure CDN); and if that fails too, we load from the current host.
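The CDN chain above can be sketched as a sequential fallback over an ordered list of asset hosts (the hostnames and helper are illustrative, not our production code):

```typescript
// Try each asset host in order: primary CDN, secondary CDN, then the
// current origin. The first successful fetch wins; if all hosts fail,
// the last error is rethrown.
async function loadAsset(
  hosts: string[],
  path: string,
  fetchFn: (url: string) => Promise<string>,
): Promise<string> {
  let lastError: unknown;
  for (const host of hosts) {
    try {
      return await fetchFn(host + path);
    } catch (err) {
      lastError = err; // remember the failure and try the next host
    }
  }
  throw lastError;
}
```

A call might pass `["https://cdn-primary.example", "https://cdn-secondary.example", ""]` so that the empty string resolves the asset against the current host as a last resort.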
Sacrifice the few to save the many
This concept in software engineering refers to trade-offs that sacrifice the performance or resources of a few components or processes to enhance the overall reliability and stability of the system. It is often applied where resources are limited or where optimizing every component is not feasible or practical.
In our case, we've set short timeouts for non-critical HTTP and SQL requests. It means they will be the first to fail once SQL or another service is not functioning well, but by failing these non-critical requests fast, we've lightened the burden on those services.
Failing fast is critical. Consider an example: many users access a service that's down. In a banking app, say, the account service fails and overwhelms other services, such as authentication. This can exhaust resources, causing widespread system failures. Circuit breakers in microservices prevent such cascading failures by quickly halting operations to protect the system. We also implemented a circuit breaker over some of our non-critical HTTP APIs: once a server detects it is nearing exhaustion, it automatically rejects non-critical requests.
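A circuit breaker for non-critical calls can be sketched as a small state machine; the threshold and cool-down values here are illustrative, not the ones we actually use:

```typescript
// Minimal circuit breaker: after `threshold` consecutive failures the
// circuit opens and calls are rejected immediately (fail fast) until
// `coolDownMs` has passed, at which point one trial call is allowed.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(
    private threshold = 5,
    private coolDownMs = 30_000,
    private now: () => number = Date.now, // injectable clock for testing
  ) {}

  // Should the caller attempt the request, or reject it right away?
  canRequest(): boolean {
    if (this.openedAt === null) return true;
    // Half-open: allow a trial call once the cool-down has elapsed.
    return this.now() - this.openedAt >= this.coolDownMs;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null; // close the circuit again
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.failures >= this.threshold) {
      this.openedAt = this.now(); // open the circuit: fail fast from now on
    }
  }
}
```

Production-grade libraries add more states and metrics, but the fail-fast principle is the same: a rejected call costs almost nothing, while a call that hangs until timeout ties up a thread or connection on an already struggling server.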
Don't put all your eggs in one basket
Microservices can improve the overall resilience of an application. If one microservice fails, it doesn't necessarily bring down the entire application. This containment of faults is a significant advantage over monolithic architectures, where a single failure can impact the whole system.
These vast changes didn't happen overnight. We had to make them slowly, as they affect the core of our platform. We will continue to balance the pursuit of innovation with the need for reliability, introducing cutting-edge features while keeping the core platform solid and dependable.