[RESOLVED] June 29, 2019 - Extended Email Service Outage

Note: The email platforms have been restored after the recent upgrade, with the majority of issues resolved. For the remaining problem with outbound delivery delays, please follow our new post here:
July 9 - Outbound Delivery Delays

During the planned upgrade of the email infrastructure, we encountered a number of issues that resulted in a greatly extended period of downtime which continues to impact some customers.  One of the primary central storage units that stores mailbox data for customers had multiple physical faults that required components to be swapped out before the storage volume could be brought back online – a lengthy process due to the amount of data involved.

The storage volume has been restored and the data is intact – we are now going through a process of connecting and synchronizing the storage unit with the frontend mail nodes – a process that is also lengthy, but that we are trying to expedite as best we can. Inbound mail is still being queued for impacted customers and should be delivered as soon as the mailbox nodes are synced and online.

The team has been working on this issue around the clock and will continue to do so throughout the weekend until all customers have mail service restored. We anticipate having the bulk of the remaining impacted customers online by end of day and will provide updates here as work continues.


June 29, 2019:

11:45 AM CT: Our administrators continue working on the synchronization with the frontend email nodes. Thank you for your patience.

3:30 PM CT: Service to a portion of the impacted mailboxes has been restored and inbound/outbound mail should be flowing, although we are monitoring the status of these closely. However, there remain a large number of mailboxes that are still unavailable. Storage for these appears intact, but we are working through a networking issue between the backend and frontend nodes. The team believes it has zeroed in on the root cause. Our engineers are making network configuration updates and will continue to work throughout the day, and for as long as it takes, to fully restore service to all customers.

8:30 PM CT: The root cause of the networking issue affecting some of the mail server nodes has been addressed and we have restored connectivity from the central storage system to these systems. We are now restarting some of the core networking gear to propagate some configuration changes that should bring the remaining mailboxes online. We will be testing to ensure mail is flowing to all mailboxes and monitoring the flow throughout the weekend.

June 30, 2019:

7:45 AM CT: Access to 30% of the affected mailboxes has been restored. Our administrators continue to work to fully restore service to all customers.

2:45 PM CT: Access to 60% of the affected mailboxes has been restored. Our engineers are currently working on restoring the remaining services.

11:45 PM CT: All mailboxes are currently online and both inbound and outbound mail are flowing. There is a large queue of inbound messages that has accrued while the impacted mailboxes were offline. This queue of messages is currently being delivered. Because the queue is “first in / first out” – new message delivery may be delayed while the queue is being processed. We are also investigating reports of sporadic connection issues when retrieving mail and will continue to monitor the entire platform closely.  

July 1, 2019:

7:45 AM CT: Queued incoming emails are being delivered to the intended mailboxes. The size of the queue may delay receipt of new messages. Our engineers are currently investigating sporadic issues with sending email via email clients and the mail2web interface.

2:00 PM CT: We’ve brought additional servers online to help process the delivery queue and have implemented some tweaks that have accelerated the delivery rate. The sheer volume of messages will still take several hours to fully process, and we are looking into adding more capacity to further increase the backlog delivery rate.

You may notice that new messages are now delivered with relatively little delay. This is because they are being routed through separate servers that are now dedicated to handling inbound mail, so that new mail doesn’t wait at the back of the line in the delivery queue.
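
As a rough illustration of the queueing behavior described above (not a description of how the platform is actually implemented), the sketch below compares a single "first in / first out" queue with a separate lane for new mail. The function name, the message counts, and the per-tick delivery rate are all hypothetical values chosen for the example.

```python
from collections import deque


def ticks_until_delivered(backlog, new_messages, rate, dedicated_lane):
    """Return how many ticks pass before the last *new* message is delivered.

    All numbers are illustrative assumptions, not measurements from the incident.
    backlog        -- messages queued while the mailboxes were offline
    new_messages   -- messages arriving after service was restored
    rate           -- messages one lane can deliver per tick
    dedicated_lane -- if True, new mail gets its own queue and servers
    """
    if dedicated_lane:
        queue = deque(["new"] * new_messages)
    else:
        # Single FIFO queue: new mail waits behind the entire backlog.
        queue = deque(["old"] * backlog + ["new"] * new_messages)

    ticks = 0
    remaining_new = new_messages
    while remaining_new > 0:
        ticks += 1
        for _ in range(min(rate, len(queue))):
            if queue.popleft() == "new":
                remaining_new -= 1
    return ticks


if __name__ == "__main__":
    print(ticks_until_delivered(10_000, 100, rate=50, dedicated_lane=False))
    print(ticks_until_delivered(10_000, 100, rate=50, dedicated_lane=True))
```

With these made-up numbers, the shared queue needs 202 ticks to deliver the newest message, while the dedicated lane needs 2, which is why new mail can arrive promptly even while the backlog is still draining.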

We are also continuing to investigate an issue with outbound SMTP connectivity. We believe we have identified the root cause and are working on a fix. We will continue to post updates on this issue here.

9:45 PM CT: We are currently working on deploying an update that we hope will address the first of two separate outbound connection issues that are impacting separate groups of customers. We are also working on a fix for the second issue concurrently. The inbound backlog is also continuing to deliver – although still not at the pace we’d like to see. As soon as the outbound issue is resolved we are going to redirect attention to the backlog to see if we can further accelerate the delivery rate.  


July 2, 2019:

7:40 PM CT: We are currently implementing a number of fixes to the mail flow and monitoring for results. So far they seem to be effective. We will provide further updates on these fixes shortly.

July 3, 2019:

10:50 AM CT: 
SMTP connectivity on the email platform is now stable. Both inbound and outbound email delivery have improved, and the majority of email messages are delivered within a few minutes.
We are seeing sporadic cases where single messages are delayed and are working on addressing these by applying configuration changes to the email clusters.

The backlog of emails accumulated until Sunday 6/30 is still being delivered (down to about 340K messages) and we continue to work on improving the pace at which this queue of emails is being delivered to the appropriate mailboxes.

11:50 AM CT: We are also working on restoring catchall functionality for customers on Basic Email as well as addressing reports of POP3 connections not working on port 995. Please note that POP3 connections on port 110 are working and stable.

3:20 PM CT: Issues with POP3 connections on port 995 via securemail.myhosting.com have been fixed and confirmed as working. 
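
For customers who want to confirm POP3 connectivity from their own side, a minimal sketch using Python's standard `poplib` module is shown below. The helper name `pop3_greeting` is hypothetical, and using the same hostname for port 110 is an assumption; the check only reads the server greeting and sends no credentials.

```python
import poplib


def pop3_greeting(host, port, use_ssl, timeout=10.0):
    """Connect to a POP3 server and return its greeting, or None on failure.

    No login is attempted; the connection is closed right after the greeting.
    """
    try:
        if use_ssl:
            conn = poplib.POP3_SSL(host, port, timeout=timeout)
        else:
            conn = poplib.POP3(host, port, timeout=timeout)
    except (OSError, poplib.error_proto):
        # Covers refused connections, timeouts, DNS failures and TLS errors.
        return None
    try:
        return conn.getwelcome().decode("ascii", errors="replace")
    finally:
        try:
            conn.quit()
        except (OSError, poplib.error_proto):
            pass


# Example usage (hostname taken from the update above):
#   pop3_greeting("securemail.myhosting.com", 995, use_ssl=True)
#   pop3_greeting("securemail.myhosting.com", 110, use_ssl=False)
```

A non-empty greeting indicates that the port accepted the connection; `None` means the connection could not be established within the timeout.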

7:40 PM CT: No changes at this time. We are continuing to monitor the queue activity.

10:30 PM CT: No changes presently. Queue delivery is continuing and we are monitoring the activity.

July 4, 2019:

2:30 AM CT: Catchall functionality on the classic email platform (Basic Email) has been restored. POP/IMAP and SMTP connections are stable. Email delivery for both inbound and outbound messages is stable, and messages are delivered within a few minutes. The configuration changes applied yesterday, along with the resolution of the catchall issue, have improved mail delivery on the platform. We are continuing to monitor the email queue activity.

10:00 AM CT: Incoming and outgoing mail are flowing and new messages are delivered within a few minutes. The accumulated email backlog is still being delivered to the intended mailboxes. We continue to monitor the email platform.

1:00 PM CT: Currently incoming and outgoing mail is flowing, and delivery is stable. We are still working on processing the backlog of mail to deliver to the intended mailboxes. Our administrators continue to monitor the email platform. 

5:40 PM CT: Incoming and outgoing mail continue to flow and delivery remains stable. Our administrators are still working to process the backlog of queued mail and are monitoring the email platform carefully.

9:20 PM CT: Mail flow continues to be stable. The backlog of queued emails is still being processed and monitored.

July 5, 2019:

7:30 AM CT: 80% of the backlog of emails accumulated until Sunday 6/30 has been processed. POP/IMAP and SMTP connections are stable. Inbound emails are delivered within a few minutes. We are currently investigating an issue with outbound email delivery whereby a small number of emails are being delayed and customers are receiving the following error: 421 Too many concurrent SMTP connections. We are working on this problem with the utmost urgency.
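
Because 421 is a transient (4xx) SMTP reply, scripts that send mail can usually retry after a pause rather than treat the message as failed. The sketch below shows one way to do that with Python's standard `smtplib`; the function name, attempt count, and delays are illustrative assumptions, and this is a client-side workaround, not a description of the server-side fix.

```python
import smtplib
import time


def send_with_retry(send, max_attempts=5, base_delay=30.0, sleep=time.sleep):
    """Call send() and retry with exponential backoff on transient 4xx replies.

    send  -- zero-argument callable that performs one delivery attempt,
             e.g. opening smtplib.SMTP, calling send_message() and quitting
    sleep -- injectable so the waiting can be skipped in tests
    Returns the number of attempts used; re-raises on permanent errors or
    once max_attempts is exhausted.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            send()
            return attempt
        except smtplib.SMTPResponseException as exc:
            transient = 400 <= exc.smtp_code < 500  # 421 falls in this range
            if not transient or attempt == max_attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

Most desktop email clients already retry or let you resend manually, so a helper like this is mainly relevant for automated senders.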

3:00 PM CT: We have corrected an issue with TLS on port 25, which should be working now. We are still working with urgency on the issue with outbound email delivery whereby a small number of emails are being delayed and customers are receiving the following error: 421 Too many concurrent SMTP connections.

7:10 PM CT: Our administrators continue to work on the issue with outbound email delivery whereby a small number of emails are being delayed and customers are receiving the following error: 421 Too many concurrent SMTP connections.

10:00 PM CT: Work on the outbound email delivery is continuing. Email delivery is improving; however, a small number of emails are still being delayed.

July 6, 2019:

2:15 PM CT: Our administrators are still working on the issue with outbound delivery. While delivery is improving, a small number of emails are still experiencing a delay.

