What happened
Several of our database servers went down due to increased workload from Saturday, May 16, 1:06 pm UTC until Saturday, May 16, 8:02 pm UTC. This caused some customers to be unable to access our services. IGA Hosting services remained fully operational, and no interruption was reported for them during this period.
[Charts: Network CPU Performance and Network RAM Performance during the incident]
What was affected
Here is a summary of the services affected during this period:
Some of our development sites were affected, but all other sites appear to have functioned normally.
What we are doing
Our servers have been under increased workload since April, and we have been mitigating that by spreading requests across a group of servers, a technique known as load balancing. However, we had reconfigured the load-balancing network shortly before the incident. It appears that a CPU spike caused a RAM spike, which in turn crashed our database servers and caused this issue. The cause of the spike is still unknown, and our team is investigating it at the moment.
[Charts: Network CPU Performance and Network RAM Performance showing the spike]
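For readers unfamiliar with the term, load balancing simply spreads incoming requests across several servers so that no single machine carries the full workload. The sketch below is a minimal round-robin illustration of the idea; the server names are hypothetical and this is not our actual configuration.

```python
from itertools import cycle

# Hypothetical server pool; not our real hostnames.
SERVERS = ["db-01.internal", "db-02.internal", "db-03.internal"]

class RoundRobinBalancer:
    """Hands each incoming request to the next server in the pool."""

    def __init__(self, servers):
        self._pool = cycle(servers)

    def next_server(self):
        return next(self._pool)

balancer = RoundRobinBalancer(SERVERS)
for request_id in range(5):
    print(f"request {request_id} -> {balancer.next_server()}")
```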
Why did it take so long to fix
The timing of this incident was unfortunate. Our automated response system detected the error at 1:06 pm UTC and attempted to correct the issue, but was unsuccessful. That was 6:06 am local time for the server team responsible, and since it was a Saturday, our team was unavailable until 10 am local time.
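For context, the automated response system works like a watchdog: it periodically checks whether the databases are answering, tries an automatic recovery action when they are not, and pages the on-call team if that fails. The sketch below only illustrates that flow; the function names and interval are placeholders, not our production tooling.

```python
import time

CHECK_INTERVAL = 60  # seconds between health checks (illustrative value)

def db_is_healthy() -> bool:
    return False  # placeholder: a real check would run a query or probe the service

def try_automatic_recovery() -> None:
    pass  # placeholder: a real system would restart or fail over the service

def page_on_call(message: str) -> None:
    print(message)  # placeholder: a real system would notify the responsible team

def watchdog_loop() -> None:
    while True:
        if not db_is_healthy():
            try_automatic_recovery()
            if not db_is_healthy():
                # Automatic correction failed; this is the point reached at 1:06 pm UTC.
                page_on_call("database unresponsive, automatic recovery failed")
                return
        time.sleep(CHECK_INTERVAL)
```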
Our team picked up the issue around 10:07 am and attempted to correct it. Unaware that the databases had crashed, the team worked through the Database Disconnection Checklist, a lengthy procedure that checks the connections between our web servers and our database servers before tracing the issue to the databases themselves.
After completing the checklist and finding no issue with the connections, it was clear that the problem was within the servers. It then took some time to identify the server group causing the issue and to locate the server that was not functioning as expected. Around 12 pm local time, our team confirmed that the database servers were unresponsive and began investigating what was wrong.
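The connection portion of that checklist amounts to verifying that every web server can reach every database server over the network before suspecting the databases themselves. A minimal version of such a check might look like the sketch below; the hostnames and port are illustrative only.

```python
import socket

# Illustrative endpoints only; the real inventory is larger.
DB_SERVERS = [("db-01.internal", 5432), ("db-02.internal", 5432)]

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection from this machine to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Run from each web server: every database endpoint should be reachable.
for db_host, db_port in DB_SERVERS:
    status = "OK" if can_connect(db_host, db_port) else "UNREACHABLE"
    print(f"{db_host}:{db_port} -> {status}")
```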
After looking into the servers, we found that the databases were not running (inactive). Our team started restarting the databases at 12:50:40 pm. We confirmed the databases were operational shortly after, and the issue was confirmed fixed at 1:02 pm.
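The fix itself was simple once the inactive databases were identified: check the service state, restart it, and verify it came back. A sketch of that sequence is below, assuming a systemd-managed service; the service name `postgresql` is a stand-in, not necessarily what we run.

```python
import subprocess

SERVICE = "postgresql"  # stand-in service name

def service_state(name: str) -> str:
    """Return the systemd state of a unit, e.g. 'active' or 'inactive'."""
    result = subprocess.run(
        ["systemctl", "is-active", name],
        capture_output=True, text=True,
    )
    return result.stdout.strip()

if service_state(SERVICE) != "active":
    # The databases were found inactive, so a restart was issued.
    subprocess.run(["systemctl", "restart", SERVICE], check=True)
    print(f"{SERVICE} is now {service_state(SERVICE)}")
```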
What are we doing to prevent this
Management has decided that our current network no longer meets the demand for our services, and we are working on migrating our services to our new AMD servers. This will increase our RAM capacity by 4 times and our CPU processing power by 3 times.
Our response time was deemed unacceptable by management, and server management responsibilities have been expanded to both our Malaysia office and our United States office. This is estimated to reduce the maximum response time to less than 5 hours. We have revised our checklist to check server operations before checking connections. We have also provisioned extra load-balancing servers and backup servers to prevent service interruptions in the future.
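The checklist change is essentially a reordering: confirm that the database servers themselves are running before spending time tracing connections. A sketch with hypothetical step names is below; the real checklist is longer.

```python
def check_database_service_running() -> bool:
    return True  # placeholder: a real step would query the service status

def check_database_host_resources() -> bool:
    return True  # placeholder: a real step would inspect CPU and RAM usage

def check_web_to_database_connections() -> bool:
    return True  # placeholder: a real step would test each web-to-DB connection

# Revised order: server operations first, connection tracing last.
REVISED_CHECKLIST = [
    check_database_service_running,
    check_database_host_resources,
    check_web_to_database_connections,
]

def run_checklist() -> str:
    for step in REVISED_CHECKLIST:
        if not step():
            return f"failed at: {step.__name__}"
    return "all checks passed"

print(run_checklist())
```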
We are deeply sorry for any inconvenience caused during this period, and we are working to make sure it does not happen again.