High Reliability in New York Telephone and New England Telephone Development Projects
Menno E. Aartsen
Member of Technical Staff
Operator Services Systems Group
Network Services Laboratory
NYNEX Science & Technology, Inc.
White Plains, NY 10604
At the Twincom Workshop in Slotje Limburg at Oosterhout on Tuesday October 25, 1994
I. NYNEX: Communications
II. Technology Engines In First Gear
A. Telco: Copper Wire And Switches
1. The Law
III. A Case Study: The Butt And The Button
A. Survivability Is Designed, Not Engineered
1. $$$: Too Much Of A Good Thing
2. Emergency Services
B. The Business Decision
IV. Fault Tolerance: A Definition
A. Zero Fault Tolerance
B. High Availability
C. The Marketing Decision
V. The Machine Factor
1. MTBF: Systems Don't Fail On Friday Night Unless You Work On Saturday
2. Things That Break
a) Disk Mirror
b) Operating Systems
Fault Tolerance or High Availability: A Business Decision
My name is Menno Aartsen, and my subject today is that of Fault Tolerance in the business of NYNEX Corporation, of White Plains, New York. NYNEX is predominantly a telecommunications corporation, and has traditionally provided local telephone service in the State of New York, and in New England, which is comprised of the States of Maine, New Hampshire, Vermont, Massachusetts, Rhode Island and Connecticut. What that means, in a nutshell, is the Northeastern United States, from the Canadian border down to the Southern tip of New York City, Staten Island.
To give you a rough idea of the telephone side of our business:
NYNEX region
Inhabitants: 16 million
NYNEX subscribers: 98% of households
Installed lines: 15 million
Average calls per year: 60 billion
Of which local calls: 36 billion
I have worked for NYNEX' Advanced Technology Research Division, NYNEX Science & Technology, Inc., for the past five years. I was educated in The Hague, The Netherlands, and professionally trained at IBM Nederland N.V. and IBM U.K. Ltd., as a mainframe Systems Engineer, moving to London in the early Seventies and from there to the United States in 1983. I am a Member of Technical Staff with the Operator Services Systems Group in the Network Services Laboratory - we develop enhanced telephony systems based on our own proprietary Fourth Generation programming language, SCL, or Services Creation Language, originated by VP Craig L. Reding and Member of Technical Staff Suzi Levas. I am currently engineering and networking our Automated Functions Node platform, which will provide NYNEX with Enhanced and Automated Operator Services and Speech Technology Implementation in the years to come, with a view to increasing revenues (that's the Enhanced bit) and cutting cost (that's the Automated portion).
NYNEX: Communications
NYNEX was formed in 1984, after the U.S. government mandated the breakup of US-wide telephone services provider AT&T, and became the holding company for New York Telephone and New England Telephone. NYNEX’ most important service packages are:
Overseas, we have major offices in the U.K. and Thailand. In England, NYNEX is now the largest Cable Television provider, while in Thailand, we are building a two million line network in Bangkok for the local authorities. We are also present in the People’s Republic of China, Indonesia, Greece and Czechoslovakia, while NYNEX is a founding member of the Fiber Optic Link Around The Globe, or FLAG, conglomerate.
Technology Engines In First Gear
New services providers, such as cable television companies, have it easy: they don’t need to maintain a decades old copper based network, and do not (or, at least, not yet) fall under the legal constraints that are part of telecommunications service provision in the United States. All of our so-called Regulated Services, such as
are regulated by two different entities: in each State, by that State’s Public Service Commission, and Federally, by the Federal Communications Commission. Over and above their requirements, our network becomes part of the country’s defense network in times of war, so a minimum standard of network service must be maintained at all times. Last, but not least, both the United States Department of Justice and the Federal Courts keep an eye on telephone company compliance with deregulation requirements.
The various Commissions have regulated our response times, permitted network downtime, and technical quality of service. They also regulate our pricing structure, and prevent telephone companies from using "regulated" revenues to subsidize "unregulated" activities. All in all, there is a limit to the amount of money we can make on telephone services, and thus, to the amount of financing available for network maintenance, improvement and expansion - apart from anything else, the investment required to provide a good measure of fault tolerance is very high, as is the cost of complying with the mass of regulations. Our Technology Engine, advanced as it may be in the laboratory, can therefore only rarely be in high gear - most of the time, we need to implement our new technologies on ancient copper wire, and are limited by the bandwidth this provides. The challenge is to bring advanced services that really need high throughput technology, such as fiber optic cabling, to copper wiring. The true challenge is inventing technologies that optimize the use of bandwidth, as opposed to increasing it - eventually, you will always run out of bandwidth, a law of nature we sometimes tend to forget.
It will be clear, from the foregoing, that affordable Fault Tolerance, as it pertains to providing telephone and ancillary services on what we call a 24x7 (all day, every day, including Christmas Day) basis, is of primary importance to our everyday operations. As an example, I'd like to cite last year's bombing of the World Trade Center, a group of office buildings with 55,000 occupants, and an equal number of telephone lines. Our switching centers underneath the World Trade Center continued to function through the blast and the ensuing fire, even though one was located less than a hundred yards from the center of the explosion, and City power failed immediately. All those trapped in the towers were able to communicate with the emergency services, and their loved ones, throughout their ordeal, and NYNEX was able to activate auxiliary cellular telephone repeaters for the rescue workers within hours of the attack.
A Case Study: The Butt And The Button
Fault Tolerance? Heard any good definitions, lately? To me, Fault Tolerance (literally, the ability to ignore errors) applies to systems, only, not to their individual components. Unwittingly, one of my colleagues provided me with the best example of what fault tolerance is, a few years ago, when he brought down a fully fault tolerant half million dollar IBM System/88 - by accidentally sitting on the on/off button.... While the System/88, also known as Stratus, is very well engineered, and a beautiful example of fault tolerance driven through to the operating system, there was always the one single on/off button, which I have seen pushed twice, over the years.
To this day, I really don't know what moved the designers of the Stratus to build a beautifully redundant system with one on/off button and one system key - every other component of the system was duplicated.... If you've ever watched one of these Doomsday movies you already know better: you can’t ever initiate global annihilation unless you have two guys (in uniform, of course) with two keys and two passwords, and two global destruction buttons. I can't think of any fault tolerant design I've seen that doesn't have at least one single point of failure, and that does include the human element.
To take as an example a telephone company: however redundant our systems are, in many cases there's only this one wire (or, in telephone parlance, the two pairs) connecting the subscriber to the switching office. Those wires get cut and the customer has no service. At that point, your fault tolerance is determined by the amount of time needed for one worker to trace the fault, and the time needed for another worker to mend it. Again: the human element.
Of course, we could fully duplicate the entire system: two sets of wire into the subscriber's residence, and two telephones there. Here comes into play the last element of the fault tolerance equation: money. The cost of staying up.
At NYNEX, we have two different classes of service:
Let me explain.
A typical service that has to stay up is 911, the telephone number that, in most areas of the US, connects anyone from any telephone to the Emergency Services. I don’t need to explain to you how important it is that that service function whatever the circumstances, so NYNEX and the local authorities (whose responsibility it is to provide 911) spend whatever is necessary to maintain around-the-clock service. That’s expensive - in the County of Westchester, where I live, forty miles North of New York City, a special County 911 tax is collected via the telephone bill to finance an overhaul of the 911 service, while every subscriber to a cellular service provider pays a small amount, every month, to help finance the handling of all those calls to 911 made from cellular telephones.
Below that tier, however, what is it that brands a service as 24x7? Who takes the decision between fault tolerance and high availability?
Fault Tolerance: A Definition
Generally, we in the telecommunications industry define fault tolerance as a systems strategy: the ability to survive a fault without dropping any calls in progress. You will appreciate that the need for fault tolerance applies predominantly to 911 types of service: an operator attempting to establish the exact location from which a caller is trying to report a fire must under no circumstances lose the connection. Equally important, in this example, is the continuous accessibility of the telephone number database: if the operator has the telephone number from which the call is placed, they can use our subscriber database to look up the installation address of that particular telephone. Lives may be at stake, and a 911 caller may be unfamiliar with the area, or too confused, distraught, or young, to provide useful information. Operator Services takes a fair number of calls from intended suicides; the ability to keep the person talking, while directing emergency services to the location, is paramount.
At a much lower cost, high availability can provide for secure and continuous service without the "no dropped calls" feature. High availability allows one, in the case of a failure, to drop calls in progress, but process the next incoming call using a failover mechanism.
U.S. telephone companies utilize a set of service and technical standards laid down in documents provided by Bell Communications Research (or Bellcore, for short), an R&D establishment jointly owned by all regional telephone companies in the United States. Bellcore publishes standards to which all so-called "Baby Bells" (as opposed to AT&T, historically called "Ma Bell") have committed, and which provide for everything from electrical switching center standards to the exact format of system alarm messages that all Central Office equipment must provide.
It isn’t all that hard, or expensive, to provide high availability equipment: you take all of the equipment, firm- and software that is necessary to run your application, multiply this by two, and add software and a communications link that will allow you to "fail over" the service from a bad component to a good component. If you want to keep it simple, you simply take down the system that has the bad component, and switch to the standby system. The failed system will simultaneously report its failure to the service engineers, and cause itself to be repaired, hopefully before the standby system has a problem of its own - Murphy’s Law applies very much to our operations....
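The failover recipe just described can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the names `Node` and `FailoverPair` are hypothetical, the "report to the service engineers" is reduced to a list of repair tickets, and real equipment would of course detect the failure itself rather than be told about it.

```python
class Node:
    """One complete copy of the system (hardware, firmware, software)."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.calls = []  # calls currently in progress on this node

    def accept(self, call_id):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        self.calls.append(call_id)


class FailoverPair:
    """Active/standby pair: high availability, not fault tolerance."""

    def __init__(self, active, standby):
        self.active = active
        self.standby = standby
        self.repair_tickets = []  # stands in for the report to service engineers

    def fail_active(self):
        """Take the whole failed system down and switch to the standby."""
        failed = self.active
        failed.healthy = False
        dropped = failed.calls      # calls in progress are lost
        failed.calls = []
        self.repair_tickets.append(failed.name)
        self.active, self.standby = self.standby, failed
        return dropped

    def handle(self, call_id):
        """Route a new incoming call to whichever node is active."""
        self.active.accept(call_id)
        return self.active.name


pair = FailoverPair(Node("A"), Node("B"))
pair.handle("call-1")
dropped = pair.fail_active()        # call-1 is dropped with system A
served_by = pair.handle("call-2")   # the next call lands on the standby
print(dropped, served_by, pair.repair_tickets)
```

The call in progress is lost, the next call is served, and a repair is requested: that is the whole difference between high availability and the "no dropped calls" requirement of fault tolerance.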
There are lots of problems with this concept....
One of the simple solutions to fault tolerance is the implementation of duplicated alarm channels, each of which reports on its own as well as the "other" system. This type of alarming provides for failover as well as sanity checks, and makes it possible to arrive at a good measure of high availability without the exaggerated cost. Add to this peripherals that can be controlled by both systems simultaneously, and you have fault tolerance - at a price. We have to realize that most fault tolerance requirements are based on the need for some kind of peripheral to achieve survival capability; it is rarely sufficient to just have the CPU survive. From database storage devices via process control to switching services, it is generally the peripheral that must maintain its links with the outside world.
True failover capability is expensive. When we started engineering one of the systems that is currently under development in our lab, we found that, while the workstations our application serves sit on a redundant backbone, this backbone is only engineered to maintain half the workstations upon failure of one of the redundant links. The decision to do without half the workstations at a location, in case of a LAN failure, was financial, of course, and at the time, we had more than enough capacity in our network to survive this. What we didn’t think about, at the time, was the introduction of new services that require the use of particular workstations - we couldn’t know, as the architecture had not been engineered - and we are now faced with the need for retrofitting full redundancy on those backbones.
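To make the idea of duplicated alarm channels concrete, here is a small Python sketch of how a monitor might combine them. It is an assumption-laden illustration, not a description of any NYNEX system: the function name `assess`, the report format and the four verdicts are all invented for the example.

```python
def assess(reports, system):
    """Combine what every alarm channel says about `system`.

    `reports` maps a reporting system's name to its view of the world:
    {subject name: True (up) or False (down)}. A reporter that is absent
    from `reports` is a silent channel.
    """
    own = reports.get(system, {}).get(system)
    others = [view[system] for name, view in reports.items()
              if name != system and system in view]
    peer = others[0] if others else None

    if peer is False and own is not True:
        return "failed"     # the peer says it is down and it does not disagree
    if own is True and peer is not False:
        return "ok"
    if own is None and peer is None:
        return "unknown"    # nobody is reporting on this system at all
    return "suspect"        # the channels disagree: a sanity check is needed


both_up = {"A": {"A": True, "B": True}, "B": {"A": True, "B": True}}
b_silent = {"A": {"A": True, "B": False}}
disagree = {"A": {"A": True, "B": False}, "B": {"A": True, "B": True}}

print(assess(both_up, "A"), assess(b_silent, "B"), assess(disagree, "B"))
```

The third case is the interesting one: system B claims to be healthy while A reports it down, which points at the link between them rather than at either machine. A single alarm channel could not tell those situations apart.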
How much of this redundancy is really necessary? It’s difficult to calculate, except to say that full disaster engineering is prohibitively expensive. Does NYNEX have fully duplicated peak hour capabilities? Certainly not: the regulators would not stand for the kind of expenditure required to maintain full functionality in a disaster situation, as that would require us to raise our consumer pricing way beyond what would be acceptable. We do have considerable survivability, and the capability to reroute telecommunications traffic away from problem areas. New regulations governing the multitude of local and long distance carriers have made that job much more difficult than it was before deregulation, though. We can’t, for instance, take long distance traffic that AT&T cannot carry because of a network problem, and route it via MCI - that’s not legal. We cannot, in fact, even advise the customer to use other carriers’ facilities in such a case.
Another factor in the equation is the increasing complexity of telecommunications systems, with many new services requiring their own hard- and software outside of the normal switching equipment. It is when engineering these new systems that one is faced with some stark decisions - when my voicemail system fails, do I fall back on operators? The availability of voicemail has done away with many operators and secretaries, and as a consequence that fallback has only very limited availability, so choices become very limited indeed. In one of our projects, we will use a digital switch with voice playback capability, solely to be able to warn the customer away in case of a systems failure - the fully redundant front end switch has a low failure probability, and is therefore likely to be up should the rest of the system be down.
Today, twenty-four hour service all too often means that you can reach a voice mail system around the clock. That is, if its high availability CPU is functioning... Many of our customers use high availability as a sales argument, however - the Visa credit card department of Marine Midland Bank keeps regular business hours, while Citibank Visa provides around the clock service - at a price. You’ll say that there can’t be a credit card emergency that couldn’t wait until the morning, right? Wrong! I’m in Europe, right now, and my business hours are shifted six hours from those of U.S. banks. That means that if a Dutch ATM eats a Marine Midland Visa card at 2am, it’d be 8pm in the US, and they’d be closed, and I’d have to wait until 3pm Dutch time to report the problem. And that, in turn, the way banks work, would mean a delay of twenty-four hours in getting a replacement card. Both for holiday travelers and for business travelers, such a delay can be very costly. Ever try to walk into an airline office saying that "the machine ate my Visa card"? I mean.... So I use Citibank, and my wife uses Marine, and she pays several percent less interest than I do, and I can get service in the middle of the night wherever I am.
So once my bank decides to operate its credit card customer service around the clock, that department’s computers and databases also need to operate on this schedule. It is unthinkable, in the United States, that cash machines would be off line part of the day, as is still common here in Europe - once one bank provides twenty-four hour cash availability, they all do. It’s a competitive marketing decision. Apart from which, once you operate your business across different time zones, the combined normal working hours of all your offices already span most of the twenty-four hour day.
Back in the Eighties, many commercial firms thought they could afford full fault tolerance for their operations - especially banks, and their international brokerage departments, felt that optimum around the clock service was worth the price. Towards the end of that decade competition became fierce, service pricing dropped, and the true cost of fault tolerance became apparent.
As a comparison, when I researched pricing for fault tolerance compared with high availability, a fully fault tolerant system that came in at half a million dollars could be compared with high availability at around $100,000, or a fifth of the price. These figures, from the late Eighties, do not include the support structure required to keep something running around the clock - local service technicians on twenty-four hour standby, an operating system debugging center half a continent away, equally available all day and all night, and parts shipment facilities that include all night courier delivery. At IBM, we sometimes used to helicopter replacement parts from Germany to Amsterdam, just so the customer’s system could be returned to service as quickly as possible. I think you can safely multiply that half a million by a factor of two to arrive at the real cost of fault tolerance.
Needless to say, you’d have to have very good regulatory or commercial reasons, in today’s economic climate, to shell out the kind of money needed to make true 24x7 operation possible. In most cases, a minimum of downtime is acceptable, and for as long as I can install operating systems upgrades, hardware components and applications without taking a system down, I am probably in good shape. Which brings me to the fallacy of predictability...
The Machine Factor
MTBF: Systems Don't Fail On Friday Night Unless You Work On Saturday
Sales people always try to tell me about the Mean Time Between Failures of their equipment, as if that means anything. You know what I mean: regression testing allows you to predict the average number of failures over a given time period. What MTBF cannot predict is if your system will go down on Christmas morning, when every United States resident of foreign extraction calls their overseas relatives to wish them Happy Christmas. Yes, it’s true, even those who don’t celebrate Christmas call home, as AT&T has its one-day-a-year special Xmas rate - you save money by spending money, or something along those lines.
You can’t predict failure. You can’t even predict a failure rate! Someone calculated that if you used RAID 5 technology as intended, and stuck twenty cheapo disk drives in a RAID box, and operated it around the clock, their combined MTBF would mean that you would have to replace, on average, one of the twenty disks every five days, which would mean you’d have to have 73 spare disks on hand to run a full year without downtime. Average is the operative word here - what is average downtime? You’re either down or you’re not, and there is no law of averages that applies when you blow the partition table on a server drive with your boss watching, as happened to me, the other day. My fault tolerance, that day, consisted of a Norton Utilities diskette. Without that diskette, my credibility would have been severely dented... With expensive disks, and a much better MTBF, that equation would come to one drive every week, which is not that much of an improvement. And we haven’t even mentioned the person responsible for replacement: as you can’t predict which disk will fail next, a certain amount of monitoring would be necessary. The annual hardware cost: 73 500 MB drives @ $500: $36,500, without sales tax, not counting the initial twenty drives or the RAID box.
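For those who want to check the arithmetic, here is a short Python sketch of the spare-disk calculation. The per-drive MTBF of 2,400 hours is not a figure from any data sheet; it is simply what "one of twenty disks every five days" implies. The Poisson model at the end is a further assumption (independent failures at a constant rate), added only to show how little an average says about any particular week. All function names are illustrative.

```python
import math

HOURS_PER_YEAR = 365 * 24


def implied_mtbf_hours(drives, days_between_failures):
    """Per-drive MTBF implied by one failure in the array every N days."""
    return drives * days_between_failures * 24


def replacements_per_year(drives, mtbf_hours):
    """Expected number of drive replacements in a year of 24x7 operation."""
    return drives * HOURS_PER_YEAR / mtbf_hours


def prob_failures(k, drives, mtbf_hours, window_hours):
    """Poisson probability of exactly k failures in a window (assumed model)."""
    expected = drives * window_hours / mtbf_hours
    return math.exp(-expected) * expected ** k / math.factorial(k)


cheap_mtbf = implied_mtbf_hours(20, 5)            # 2400 hours per drive
spares = replacements_per_year(20, cheap_mtbf)    # 73 drives a year
annual_cost = spares * 500                        # at $500 per drive
better = replacements_per_year(20, implied_mtbf_hours(20, 7))

# The average says "one failure per five days"; the spread says otherwise.
p_none = prob_failures(0, 20, cheap_mtbf, 5 * 24)
p_two_or_more = 1 - p_none - prob_failures(1, 20, cheap_mtbf, 5 * 24)

print(f"{spares:.0f} spares, ${annual_cost:,.0f} a year")
print(f"weekly-failure case: {better:.1f} replacements a year")
print(f"five-day window: P(none)={p_none:.2f}, P(two or more)={p_two_or_more:.2f}")
```

Under these assumptions the 73 spares and the $36,500 come out as stated, and the "much better" drives still need about 52 replacements a year. More to the point, in any given five-day window there is roughly a one in three chance of no failure at all and roughly a one in four chance of two or more, which is why the average is of so little operational use.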
In actual fact, 98% of all hard disks will run very happily for years on end - you won’t notice their failures because their hardware and I/O controllers have a certain amount of error correction and redundancy built in. This is amazing considering the mechanics of the Winchester disk, but then we have over one hundred years of hardware engineering experience, and not nearly that much in electronics, which are much more complicated, to boot. So mechanical solutions are safe, in my opinion, a lot safer than software or electronic solutions. I’ve seen more boards fail than hard drives.
My favourite toy is an NKK 35 GB magneto-optical jukebox, which contains 35 rewritable 5.25" platters serviced by two Pioneer drives and an IBM RS/6000 host. We have used this system for a little over two years now, and have performed 2 re-installs, 3 software upgrades, 2 operating system upgrades and 2 repairs. Only once did the mechanics fail - the eject mechanism on one of the Pioneer drives was faulty, and the drive was replaced. Even that would have been survivable if the driver software had been able to detect the error, which was adequately alarmed on the firmware level, enabling it to fail over to the secondary drive. To me, that’s adequate proof that the majority of downtime isn’t due to hardware at all... Even though the data on magneto-optical media is virtually incorruptible (you need a well-aimed laser to erase it), you can’t actually access that data unless your server is up, and the hard disk that contains the read/write cache is up and running. So my jukebox is terrific for data reliability, but doesn’t provide operational availability beyond that of the RS/6000 host. Even if I bought the High Availability version of the RS/6000, there would be no way to have both boxes talk to the optical device at the same time.
Of course, we’re here at the invitation of Twincom, which provides a software solution to hardware failures. From the foregoing, I would appear to be saying this constitutes solving a problem that doesn’t exist....
Disk Mirror
What Disk Mirror does for me is NOT providing me with a mechanism to fail over to a second disk when my primary goes down. Of course, it does, and that means I can stay up and running, but to me that is not its primary function. Disk Mirror’s primary function in my implementation on our AFN platform is that it provides me with a way to alarm my disk subsystem! It lets me know my hard disk has failed, and that I need to replace it. As far as I am concerned, the rest is (very necessary, that has to be said) icing on the cake - I need to know that one unit central to my system’s functioning is down, and this it tells me.
Disk failure, now, becomes a survivable event. I don’t need to worry about MTBF any more, or indeed about what kind of hardware I use: for as long as I have this alarm mechanism installed, I am armed.
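The point that the alarm matters as much as the failover can be shown with a toy mirror in Python. This is emphatically not Disk Mirror's actual interface, which is not described here; `MirroredVolume` and its methods are hypothetical, and plain dictionaries stand in for the two disks.

```python
class MirroredVolume:
    """Toy two-way mirror; dictionaries stand in for the physical disks."""

    def __init__(self, alarm):
        self.disks = {"primary": {}, "secondary": {}}
        self.failed = set()
        self.alarm = alarm  # callable invoked once for each disk failure

    def fail(self, name):
        """Simulate a disk failure and raise the alarm for it."""
        if name not in self.failed:
            self.failed.add(name)
            self.alarm(f"disk {name} failed: replace it")

    def _live(self):
        live = [name for name in self.disks if name not in self.failed]
        if not live:
            raise OSError("both halves of the mirror are down")
        return live

    def write(self, block, data):
        """Write the block to every disk that is still alive."""
        for name in self._live():
            self.disks[name][block] = data

    def read(self, block):
        """Read from the first surviving disk."""
        return self.disks[self._live()[0]][block]


alarms = []
volume = MirroredVolume(alarms.append)
volume.write(0, b"subscriber record")
volume.fail("primary")
data = volume.read(0)   # still served, from the secondary
print(data, alarms)
```

Without the `alarm` callback the volume would keep running on one disk and nobody would know; the second failure would then look like a sudden, total outage. With it, the first failure is a maintenance ticket.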
Unix, my group’s mainstay development operating system, is very disk-intensive. It swaps, it runs maintenance utilities, it uses the disk I/O subsystem all the time. While, as I stated earlier, these are very reliable subsystems, they can break, and when they do, you go down - hard. Eliminating failures like these is at the core of Fault Tolerance - if you eliminate errors you can control, you have a better system downtime predictability, you know better where to look for possible problems. Do forgive me if I don’t dwell on MS-DOS, here, but I had already mentioned one should avoid single points of failure...
There aren’t that many subsystems you can have this kind of control over. You can’t test memory while you’re using it, and when a CPU fails, it simply fails. Adapter cards that are in use are equally unpredictable, and so are motherboards, so if you want to guard against the failure of any of these components you have to, at a minimum, duplicate the entire system. Of primary importance is some ability to detect failures early - one of the systems we use for telco applications, Dialogic’s DTP chassis, is fitted with an alarm generating CPU that continually monitors the state of health of the components. It detects power fluctuations, fan failures and a host of other parameters it communicates to our alarm detection systems, whether or not its host computer is functioning, or even has power. In a multi-computer system, this won’t just tell me I have a failure, but also where that failure occurred - all the technician has to look for is the alarm light on the rack that contains the bad unit.
Boeing Aircraft Corporation rolled out its 757 with a triple redundancy implementation of the Flight Management system - a primary computer, then a secondary, which is on line all the time, providing reference information for the primary system, and then a backup system for both. That is about as fault tolerant as you can get, provided the third system is different in all aspects from the first two. Only that way can one prevent a multiplied design flaw from affecting all three systems.
So: if you can have this kind of control over a system, or a subsystem, implement it. Any component that can provide you with an alarm, should. The debate over how much redundancy is warranted will rage on. There are as many people who say that a disk subsystem should run at all times, as there are who say that it should be taken down whenever possible, to enhance its life expectancy. You already know my answer: since you cannot predict when a system will fail, it doesn’t matter one iota how long it could theoretically live. What matters is that you know when it fails, and that there’s something out there to take its place.
This point was again emphasized by the manufacturers of smoke detectors in the U.S., not two weeks ago. They’re advertising a new type of smoke detector, built around a new Duracell 9 volt lithium battery, which is guaranteed to last six years. Great, that, don’t you think? In the U.S., most rental housing has to be fitted with smoke detectors by law, and many fire deaths could have been prevented if only the smoke detectors had had working batteries - this in a country where the majority of residential building is based on wooden frames. According to the newscaster, the new battery meant that the problem of smoke detectors with empty batteries has now been solved!
Has it?
You know what causes a smoke detector to have an empty battery? Right, the failure of its minder to replace that battery - the smoke detector beeps for days, and flashes a red LED, when it detects its power beginning to fail. It doesn’t matter if the battery lasts a hundred years: once it’s empty and you don’t replace it, your protection goes up in smoke - sometimes literally. Moving the problem six years away does not constitute fault tolerance: if you don’t have a battery replacement mechanism now, you won’t have one in six years, either.
Maintenance, attention to detail, but most of all: a well implemented real time warning system, and the people and procedures to act on those warnings, provide the survivability we in the telecommunications industry are expected to deliver. If, like me, you are in the reliability business, don’t go for the latest technology, the six year battery, go for what you know works!