An Apology

The following is about as near as BT Internet get to an apology...

It appeared in the newsgroup btinternet.announce on June 7 - eleven days after a disastrous BT Internet "mail server upgrade".


Dear Customers,

Over the past few days we have experienced several problems. We would like to apologise for the interruptions in service and offer an explanation of what's happened and what we are doing about it.

The problems have been as a result of some software architecture changes between the database and our application servers.

As you will have seen on the Service Status page, we have been experiencing mail problems. In order to understand what happened, you would need to know roughly how your mailboxes work. Your mailbox is simply a folder on our mail server. When you connect to the server to download your email using POP3, you simply log into your folder on the server. When you log in, a special file called a lock is created. This "locks" the mailbox to stop anyone else from connecting to the same mailbox at the same time as you. When you disconnect from the server, the lock is deleted. Locks are also created when the server delivers mail to your mailbox.
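For illustration only, the lock-file scheme described above can be sketched as follows. This is a minimal sketch under assumptions, not BT Internet's actual mail server code: the lock name `.lock`, the one-folder-per-mailbox layout and the function names `acquire_lock` and `release_lock` are all hypothetical.

```python
import os
import tempfile


def acquire_lock(mailbox_dir, lock_name=".lock"):
    """Try to lock a mailbox by creating its lock file.

    O_CREAT | O_EXCL makes creation atomic: if the lock file already
    exists, os.open fails instead of silently reusing it.
    Returns True if we now hold the lock, False if someone else does.
    """
    lock_path = os.path.join(mailbox_dir, lock_name)
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def release_lock(mailbox_dir, lock_name=".lock"):
    """Delete the lock file; a lock that is already gone is not an error."""
    try:
        os.remove(os.path.join(mailbox_dir, lock_name))
    except FileNotFoundError:
        pass


# Hypothetical usage: a temporary folder stands in for one mailbox.
mailbox = tempfile.mkdtemp()
first_login = acquire_lock(mailbox)    # a POP3 session takes the lock
second_login = acquire_lock(mailbox)   # a second session is refused
release_lock(mailbox)                  # the first session disconnects
```

In a scheme like this, a session or delivery process that ends without calling `release_lock` leaves the lock file behind, and every later attempt to open the mailbox is refused until something removes it.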

Our engineers have been working on the mail servers and they have been altering the processes that:

(a) Validates that the mail-box exists
(b) Opens the mail-box and inserts the mail item

The mailbox is locked during (b). The benefits of the new mail system are shorter lock time and less machine resource use. This will allow the servers to handle more email.

Where did it go wrong? During the transition, which lasted several days, several mailboxes were left in a locked status. On Wednesday (31/5/00), we identified the defect that was causing this.

How do we know? We ran a process that told us that we had 2000 locked boxes. To put this in perspective, we have around 400k customers, many with multiple boxes. Therefore, this is rather a low number. We all use e-mail, however, and understand the frustration this causes.

During Wednesday, we were able to recreate the conditions that caused the lock and we have developed a solution.

We left the problem "open" on the Status page and on the Service Status line because we were checking via an automated process for a reoccurrence of this problem. I can now say that the problem is solved. If any mailboxes are still inaccessible, please contact the helpdesk and they will be investigated individually.

Other customers noticed that mail was slow to arrive. Mail delivery between domains is normally measured in minutes: mail is queued and then transited either internally within our server farm or to external domains. Incoming mail is dealt with similarly, e.g. it is queued so that we can accept the mail item from the sending domain and then post the mail to individual customer mail boxes.

During these changes and following the discovery of the mail box lock we have for several periods held the incoming mail on input queues to
(a) Investigate the locks
(b) Un-lock the boxes.

As I said earlier delivery of mail can cause a natural lock and we wanted to scope the problem accurately.

Regarding the Emergency Planned outage, Wednesday P.M, Ok I understand that this sounds odd, however it is just a phrase that we normally use internally.

We have planned outages which are normally agreed and communicated with 7 days notice, a planned outage moves to an emergency planned outage if we do one of two things either

(a) Come inside the 7 Day time frame
(b) Move a planned outage.

In this case, we decided to move an already planned outage forward a few days. The outage allowed our software engineers and Server Support teams to make the upgrades to our applications that prepared the platform for Surftime. During this period, we have again held external mail in queues.

A number of you also reported intermittent problems accessing web pages. During last Tuesday (30/05/00), customers reported 'slow web-page downloads' from the www.btinternet.com, our monitors showed the same thing. It looked like we were 20% down on average evening web pages served. Again, we think that this is a small defect, which we resolved on Tuesday evening. This started at approximately 16:00 hrs on Tuesday afternoon.

Regards, BT Internet Support.


A response...

I posted the following in the newsgroup btinternet.support on June 8, 2000, under the title "May I be a Spectre at the Feast?"...

I'm afraid I fail to share the enthusiasm shown by others for the unsigned BTi statement about the e-mail (and other) problems that followed a supposed mail server upgrade during the run-up to the Whitsun bank holiday. This mostly purports to deal with the two persistent problems that have been reported ever since the "upgrade":

1. A customer's Mail box is "locked": he cannot access new mail.
2. Repeated delays of many hours' duration have occurred between receipt of incoming mail at BTi and its arrival in customers' mail boxes.

The explanation starts:

Over the past few days we have experienced several problems. We would like to apologise for the interruptions in service and offer an explanation of what's happened and what we are doing about it.

The problems have been as a result of some software architecture changes between the database and our application servers.

"software architecture changes between the database and our application servers"? This, as Reg Edwards would say, is gobbledegook. I *imagine* it means that someone at BTi rewrote the interfaces through which the mail, web and other servers obtain authentication data from the customer database - but I'm only guessing.

What it surely does mean is that BTi implemented new software on a live system - on which 400,000 people rely for e-mail (amongst other services) - when it had not been adequately debugged.

As you will have seen on the Service Status page, we have been experiencing mail problems. In order to understand what happened, you would need to know roughly how your mailboxes work. Your mailbox is simply a folder on our mail server. When you connect to the server to download your email using POP3, you simply log into your folder on the server. When you log in, a special file called a lock is created. This "locks" the mailbox to stop anyone else from connecting to the same mailbox at the same time as you. When you disconnect from the server, the lock is deleted. Locks are also created when the server delivers mail to your mailbox.

Our engineers have been working on the mail servers and they have been altering the processes that:

(a) Validates that the mail-box exists
(b) Opens the mail-box and inserts the mail item

The mailbox is locked during (b). The benefits of the new mail system are shorter lock time and less machine resource use. This will allow the servers to handle more email.

Now this is a nice, clear explanation. But it doesn't exactly sound like an "upgrade" to me - more like a means of squeezing more capacity out of the existing servers in order to *avoid* the need to buy and install new hardware.

Been here before, haven't we? BT implemented the transparent proxies to overcome the need to upgrade the BT Net infrastructure.

Where did it go wrong? During the transition, which lasted several days, several mailboxes were left in a locked status. On Wednesday (31/5/00), we identified the defect that was causing this.

How do we know? We ran a process that told us that we had 2000 locked boxes. To put this in perspective, we have around 400k customers, many with multiple boxes. Therefore, this is rather a low number. We all use e-mail, however, and understand the frustration this causes.

1. Damned if I understand the use of the phrase "during the transition".

2. The word "several" may be usefully imprecise, but I don't think it can stretch to 2,000 in anybody's language.

3. "How do we know," you ask: you knew because a large number of customers had been complaining of locked mail boxes, not because you "ran a process" to prove the point.

4. The process told you that 0.5% of mail boxes were (incorrectly) locked. On that basis - and given the number of customers who post to .support and .whinge - you would have expected to see about 4 people complaining of the problem in your ngs. As you know, the number is *far* larger than that: you've still got that many complaining now - tonight - after you claim to have fixed the problem.

During Wednesday, we were able to recreate the conditions that caused the lock and we have developed a solution.

I take it you mean Wednesday 31st; i.e. that it took you only (only!) about 5 days to reach that all-important first stage: the ability to repeatably demonstrate the fault - without which you cannot begin to devise a bug-fix. From the phrasing, however, I take it that you do not intend to give us the date on which you developed and implemented your solution.

You must have known on May 28th that the changes had resulted in trouble: your management evidently decided to allow customers to suffer rather than revert to the old "architecture" while a fix was being developed.

We left the problem "open" on the Status page and on the Service Status line because we were checking via an automated process for a reoccurrence of this problem. I can now say that the problem is solved. If any mailboxes are still inaccessible, please contact the helpdesk and they will be investigated individually.

Ah - in other words it's NOT exactly "solved": you mean that you do not expect any more incorrect locks to be applied, and that you are relying on customers to tell you about the existing ones - either by paying you 50p per minute for the privilege of reporting your problems or by waiting days for you to get round to dealing with it via e-mail or newsgroup.

Other customers noticed that mail was slow to arrive. Mail delivery between domains is normally measured in minutes: mail is queued and then transited either internally within our server farm or to external domains. Incoming mail is dealt with similarly, e.g. it is queued so that we can accept the mail item from the sending domain and then post the mail to individual customer mail boxes.

During these changes and following the discovery of the mail box lock we have for several periods held the incoming mail on input queues to
(a) Investigate the locks
(b) Un-lock the boxes.

As I said earlier delivery of mail can cause a natural lock and we wanted to scope the problem accurately.

So on several occasions during the past week you have taken the mail system down altogether - without bothering to tell us - so that you can scan for locks in circumstances where none should exist.

In my own experience mail has sat on one of your machines for more than seven hours before being placed in my mailbox. May I ask any of the UNIX geeks who follow this ng how long it should take to scan 400k folders for a zero-length file with a unique filespec and to delete all instances found?
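By way of illustration, the scan asked about above can be sketched as follows. This is a sketch under stated assumptions, not a description of BTi's system: it assumes one folder per mailbox directly under a single spool directory, a zero-length lock file named `.lock`, and a mail service that is offline, so that any lock found is stale. The name `remove_stale_locks` is hypothetical.

```python
import os
import tempfile


def remove_stale_locks(spool_root, lock_name=".lock"):
    """Delete zero-length lock files from every mailbox folder.

    Only safe while no delivery or POP3 session is running, because
    then no lock can be legitimate. Returns the paths removed.
    """
    removed = []
    with os.scandir(spool_root) as entries:
        for entry in entries:
            if not entry.is_dir(follow_symlinks=False):
                continue  # skip stray files in the spool directory
            lock_path = os.path.join(entry.path, lock_name)
            try:
                if os.path.getsize(lock_path) == 0:
                    os.remove(lock_path)
                    removed.append(lock_path)
            except FileNotFoundError:
                continue  # this mailbox was not locked
    return removed


# Hypothetical usage: three mailboxes, two of them left locked.
spool = tempfile.mkdtemp()
for name, locked in [("alice", True), ("bob", False), ("carol", True)]:
    folder = os.path.join(spool, name)
    os.mkdir(folder)
    if locked:
        open(os.path.join(folder, lock_name := ".lock"), "w").close()

unlocked = remove_stale_locks(spool)
```

Under these assumptions the work is one directory listing plus one file-size check per mailbox, so the cost grows linearly with the number of folders.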

Regarding the Emergency Planned outage, Wednesday P.M, Ok I understand that this sounds odd, however it is just a phrase that we normally use internally.

(So use it internally if you must. When communicating with your customers, please try to use simple, and preferably sane, English.)

We have planned outages which are normally agreed and communicated with 7 days notice, a planned outage moves to an emergency planned outage if we do one of two things either
(a) Come inside the 7 Day time frame
(b) Move a planned outage.

I've been with BTi for nearly three years (God help me). I don't recall *ever* being given seven days' notice of an outage.

In this case, we decided to move an already planned outage forward a few days. The outage allowed our software engineers and Server Support teams to make the upgrades to our applications that prepared the platform for Surftime. During this period, we have again held external mail in queues.

Wonderful. Did you decide that you might as well be hung for a sheep as a lamb? And will existing customers who wish to transfer to Surftime be able to do so rather sooner than previously advised, or did you add to our suffering solely for the benefit of new customers?

A number of you also reported intermittent problems accessing web pages. During last Tuesday (30/05/00), customers reported 'slow web-page downloads' from the www.btinternet.com, our monitors showed the same thing. It looked like we were 20% down on average evening web pages served. Again, we think that this is a small defect, which we resolved on Tuesday evening. This started at approximately 16:00 hrs on Tuesday afternoon.

"20% down" - effectively failure to deliver one page in five - is NOT a small defect. What makes me despair is that you think this was a one-off... it is not: many of your customers - as you well know - simply hang up and redial every time they get a 62.7 IP; all that happened during the period in question is that all platforms were providing that kind of performance.

**************************************

To summarise:

* you implemented inadequately tested software immediately before a bank holiday

* took five days to identify the fault that your customers were reporting

* chose to let customers suffer rather than revert to the old software

* repeatedly took the mail server down for hours at a time and without warning those who rely on it

* delayed making an announcement for more than a week after you had identified the problem

* ignored the evidence of your newsgroups and claimed that only a half per cent of customers were affected

* chose to regard the problem as "fixed" when you know perfectly well that customers are still suffering locked mailboxes - which implies either that mailboxes are even now being locked incorrectly or that you are incapable of removing all locks even when you've taken the mail service offline.

I'm not surprised that no-one was prepared to put his name to that announcement: you're overdue for some personnel changes, aren't you?

--
Regards
Peter Boulding


Support reply...

BT Internet support replied as follows...

In article <4mntjskf2d4700darc717bjgnphfkincfg@4ax.com>, Peter Boulding wrote:

<snip>

**************************************

To summarise:

* you implemented inadequately tested software immediately before a bank holiday

* took five days to identify the fault that your customers were reporting

* chose to let customers suffer rather than revert to the old software

* repeatedly took the mail server down for hours at a time and without warning those who rely on it

* delayed making an announcement for more than a week after you had identified the problem

* ignored the evidence of your newsgroups and claimed that only a half per cent of customers were affected

* chose to regard the problem as "fixed" when you know perfectly well that customers are still suffering locked mailboxes - which implies either that mailboxes are even now being locked incorrectly or that you are incapable of removing all locks even when you've taken the mail service offline.

I'm not surprised that no-one was prepared to put his name to that announcement: you're overdue for some personnel changes, aren't you?

--
Regards
Peter Boulding

Hello Peter,

I am currently in the process of forwarding all of your comments on to the Product Team for their attention. I will be looking for responses to your questions and I will let you know the outcome.

--
Regards

William
BT Internet Support


As is generally the case, no follow-up had appeared, either in the above thread or in btinternet.announce, by the time the above message expired from the BTi newsfeed.

Meanwhile it became apparent that the mail system was indeed still in trouble: various messages from angry customers - and from those who are trying to send mail to BTi customers - described how BT's mail relays were either sitting on incoming mail for hours or were repeatedly refusing to accept it.

Contributions from some members of BT Internet Support in other threads confirmed that the Locked Mailbox problem had yet to be solved - while others continued to claim that the problem was fixed.

As complaints about locked mailboxes and delayed mail began to tail off, more and more complaints concerning duplicate mails appeared in the newsgroups...


Note: The mail problem first reared its ugly head towards the end of May 2000. Click here to view the full list of acknowledged faults for May 2000.
