Ok here goes,
We have just got off the phone with AWS engineers and I will explain the issue.
We have our infrastructure set up in an Auto Scaling group that adds and removes instances based on average CPU load.
This was implemented to handle this type of issue and prevent any outage by scaling instances up and down, so the only point of failure should be AWS itself, which reports its health status here: https://status.aws.amazon.com/
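For anyone curious how scaling on average CPU works in principle, here is a rough sketch in Python. It is not our actual configuration: the function name, the 70% and 30% thresholds and the instance limits are all made-up values for illustration only.

```python
def desired_capacity(current, avg_cpu, scale_out_at=70.0, scale_in_at=30.0,
                     minimum=2, maximum=10):
    """Return the instance count to aim for, given the group's average CPU %.

    All thresholds and limits here are hypothetical example values.
    """
    if avg_cpu > scale_out_at:
        target = current + 1   # load is high: add an instance
    elif avg_cpu < scale_in_at:
        target = current - 1   # load is low: remove an instance
    else:
        target = current       # load is in the normal band: no change
    # Never go below the minimum or above the maximum group size.
    return max(minimum, min(maximum, target))
```

Note that a rule like this only looks at CPU, so on its own it would not notice an instance that has run out of memory.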
After much discussion and checking each instance, it looks like one instance received an OOM (out of memory) error due to high load. The whole system receives a very high load at this time of year, and we are committed to keeping all our users' setups running, so this is very frustrating. With one instance hitting an OOM, the Auto Scaling health check became unresponsive: it could not reach the instance, so it could not do what it should in this scenario, which is to fire up a new instance and remove the failed one.
The solution.
We have now implemented a service health check for OOM on each instance, so if any instance does receive an OOM, the group should do what it is set up to do: create a new instance and remove the failed one. The setup will also scale up under high demand, adding instances when necessary and removing them when they are no longer needed.
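To give a rough idea of what an OOM health check can look like, here is a minimal Python sketch. It is an illustration under assumptions, not our production code: it assumes the check reads the kernel log text, looks for the Linux OOM-killer message, and then uses the boto3 Auto Scaling call set_instance_health to mark the instance unhealthy. The function names and the instance ID in the usage note are hypothetical.

```python
import re
from typing import List, Tuple

# Matches the Linux kernel's OOM-killer line, for example:
# "Out of memory: Killed process 1234 (php-fpm) ..."
OOM_PATTERN = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(kernel_log: str) -> List[Tuple[int, str]]:
    """Return (pid, process name) for each OOM kill found in the log text."""
    return [(int(pid), name) for pid, name in OOM_PATTERN.findall(kernel_log)]


def report_unhealthy(instance_id: str) -> None:
    """Ask Auto Scaling to replace the instance (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the log parsing works without boto3

    boto3.client("autoscaling").set_instance_health(
        InstanceId=instance_id,
        HealthStatus="Unhealthy",
        ShouldRespectGracePeriod=False,
    )


def check_instance(kernel_log: str, instance_id: str, report=report_unhealthy) -> bool:
    """Report the instance as unhealthy if the log shows an OOM kill."""
    kills = find_oom_kills(kernel_log)
    if kills:
        report(instance_id)
    return bool(kills)
```

In a sketch like this, check_instance would be run on a schedule on each instance with that instance's own ID (for example a placeholder such as "i-0abc"), so an instance that hits an OOM flags itself for replacement instead of waiting for an external check to reach it.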
Hope this all makes sense. We are continuously monitoring the setup and checking for any problems.
We apologise for any downtime.
Best Regards
Sam
This reply was modified 7 years, 3 months ago by s3bubble.