Status Update: Service Interruption Reported. IGA Hosting services were not affected and are fully operational. Certain services were unavailable during the downtime from 1:06 PM UTC to 8:02 PM UTC (about 7 hours). The issue is fully resolved at this time.

What happened

Several of our database servers went down because of an increased workload on Saturday, May 16, from 1:06 PM UTC until 8:02 PM UTC. As a result, some customers were unable to access our services. IGA Hosting services remained fully operational, and no interruption to them was reported during this period.

Network CPU Performance

[Chart: Primary Server Network, CPU usage in percent (0% to 100%) by time (04:00 to 14:00)]

Network RAM Performance

[Chart: Primary Server Network, RAM usage in percent (0% to 100%) by time (04:00 to 14:00)]

What was affected

Here’s a list of affected services during this period:

Some of our development sites were affected, but all other sites appear to be functioning normally.

What we are doing

Our servers have been handling an increased workload since April, and we have tried to mitigate it by spreading the workload evenly across a series of servers, known as load balancing. However, we had reconfigured the load-balancing network shortly before the incident. It seems that a CPU spike caused a RAM spike, which in turn crashed our database servers and caused this issue. The cause of the spike is unknown, and our team is looking into it at the moment.
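For readers unfamiliar with load balancing, the idea can be sketched in a few lines. This is only an illustration of the simplest strategy (round-robin), not a description of our actual configuration; the server names and the `distribute` function are hypothetical.

```python
from collections import Counter
from itertools import cycle


def distribute(requests, servers):
    """Assign each request to the next server in turn (round-robin)."""
    rotation = cycle(servers)
    return [(request, next(rotation)) for request in requests]


# Hypothetical server names, used only for illustration.
servers = ["db-1", "db-2", "db-3"]

# Nine incoming requests are spread across the three servers.
assignments = distribute(range(9), servers)

# Count how many requests each server received.
load = Counter(server for _, server in assignments)
print(load)
```

With an even rotation, each server receives the same share of requests, so no single machine has to absorb the whole workload.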

Why did it take so long to fix

The timing of this error was unfortunate. Our automated response system detected the error at 1:06 PM UTC and attempted to correct the issue but was unsuccessful. At that moment it was 6:06 AM for the server team responsible, and since it was a Saturday, our team was unavailable until 10 AM local time.

Our team learned of the issue around 10:07 AM and attempted to correct it. Unaware that the database had crashed, our team worked through the Database Disconnection Checklist, a lengthy procedure that checks the connection between our web servers and our database servers before tracing the issue to the database itself.

Once our team had completed the checklist and found no issue with the connections, it became clear that the issue was in the server. It then took some time for our team to find the server network that was causing the issue and to identify the server that was not functioning as expected. Around 12 PM local time, our team confirmed that the database servers were unresponsive and began investigating what was wrong.

After looking into the server, we found that the databases were not running (Inactive). Our team then started a restart of the databases at 12:50:40 PM. We confirmed that our database was operational shortly after, and the issue was confirmed fixed at 1:02 PM.
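To make the timeline easier to follow, the short sketch below adds up where the time went. It assumes the server team's local time is UTC-7 (inferred from 1:06 PM UTC being 6:06 AM locally) and that the year is 2020; neither is stated explicitly in this post, and the phase names are our own labels.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the team's local time is UTC-7, inferred from
# 1:06 PM UTC corresponding to 6:06 AM local. The year is also assumed.
LOCAL = timezone(timedelta(hours=-7))


def local(hour, minute, second=0):
    """Build a timestamp on May 16 in the team's assumed local time."""
    return datetime(2020, 5, 16, hour, minute, second, tzinfo=LOCAL)


detected = datetime(2020, 5, 16, 13, 6, tzinfo=timezone.utc)  # 1:06 PM UTC
team_noticed = local(10, 7)          # team learned of the issue
restart_started = local(12, 50, 40)  # database restart began
resolved = local(13, 2)              # issue confirmed fixed


def timeline():
    """Return the duration of each phase of the incident."""
    return {
        "waiting_for_team": team_noticed - detected,
        "diagnosis": restart_started - team_noticed,
        "restart_and_verify": resolved - restart_started,
        "total_outage": resolved - detected,
    }


for phase, duration in timeline().items():
    print(f"{phase}: {duration}")
```

Under these assumptions, the largest share of the outage is the time before anyone on the team saw the alert, followed by the time spent on diagnosis; the restart itself is a small part of the total.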

What are we doing to prevent this

Management has decided that our current network no longer meets the demand for our services, so we are working on migrating our services to our new AMD servers. This will increase our RAM by 4 times and our CPU processing power by 3 times.

Management deemed our response time unacceptable, and server management has been expanded to both our Malaysia office and our United States office. This is estimated to reduce the maximum response time to less than 5 hours. We have revised our checklist to check server operations before checking the connections. We have also provisioned extra load-balancing servers and backup servers to prevent service interruptions in the future.
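The revised checklist order can be sketched as follows. This is a minimal illustration of the idea (check whether the database is running before checking the connections), not our actual checklist; the check functions are hypothetical stand-ins that simply return the state seen during this incident.

```python
from typing import Callable, List, Optional, Tuple

Check = Tuple[str, Callable[[], bool]]


def first_failure(checks: List[Check]) -> Optional[str]:
    """Run checks in order and return the name of the first one that fails."""
    for name, check in checks:
        if not check():
            return name
    return None


# Hypothetical stand-ins for real probes. During this incident the
# databases were inactive while the connections themselves were fine.
def database_process_running() -> bool:
    return False


def web_to_database_connection_ok() -> bool:
    return True


# Revised order: the quick server check runs before the lengthy
# connection checks.
revised_order: List[Check] = [
    ("database process running", database_process_running),
    ("web-to-database connection", web_to_database_connection_ok),
]

print(first_failure(revised_order))
```

Putting the quick check first means a crashed database is reported immediately, instead of only after the connection checks have been completed.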

We are deeply sorry for any inconvenience caused during this period, and we are working to make sure that it does not happen again.
