Wednesday, July 9, 2014

Double Trouble: Version 0.20 Released

Nigel and I are pleased to announce that version 0.20 of the retro-B5500 emulator was released on 29 June. All changes have been posted to the repository for our GitHub project (originally the Subversion repository of our Google Code project). The hosting site has also been updated with this release, and for those of you running your own web server, a zip file of the source can be downloaded from the GitHub repository [updated 2022-05-07].

It has been five months since the previous version, 0.19, was released. That is far longer than any of us would have liked, but the main item in this release proved to be quite a challenge, as the following discussion will detail.


Double-Precision Arithmetic

The major enhancement in this release, and one that has been a long time coming, is a full implementation of the double-precision (DP) arithmetic operators, DLA, DLS, DLM, and DLD. These were the last operators left to be implemented in the Processor component of the emulator. Since the earliest releases they have been stubbed out by their single-precision (SP) equivalents, although a preliminary (and not very good) implementation of DP Add/Subtract has been available for more than a year.

There is actually one more operator that remains incompletely implemented, Initiate for Test (IFT, octal 5111). This is a diagnostic operator, and is available only in Control State. On the B5500, this operator had the ability to inject arbitrary state into the processor registers and initiate execution in the middle of an instruction. We do not emulate the B5500 at the clock level, however, and in particular we do not support the J register, which was used as a state variable to control stepping through execution of an instruction. Thus, we can't completely implement the IFT operator.

Coming into this project, I had suspected that the arithmetic operators were going to be difficult. In fact, I tried to hand them off to Nigel, but he was smart enough to hand them back. The SP operators were indeed a challenge, but that part of the instruction set proved to be very interesting to work on. If nothing else, I finally understood how long division works.

After getting the SP operators to work, I started to look at the DP operators, thinking they would be a straightforward extension of their SP equivalents. Oh, my... reading about the DP operators in the B5281 Processor Training Manual revealed that they were much, much more complex. In fact, they were downright intimidating. To understand why, we first need to look at how arithmetic is done in the B5500.

An Overview of Single Precision


The B5500 has a 48-bit word. That word can hold a single-precision numeric operand, or eight 6-bit characters, or a variety of control words. A SP numeric operand looks like this:

[Figure: B5500 Single-precision Word Format]
The high-order bit, numbered 0, is the flag bit, which is zero for operands and one for control words. Attempting to access a word as an operand in Word Mode that has its flag bit set will cause a (typically fatal) Flag Bit interrupt. In this sense the B5500 is a tagged-word architecture, but having the tag inside the word is quite awkward for processing character data -- when the high-order character has its high-order bit set and the processor is in Word Mode, it looks like a control word. Thus, character processing is normally done in Character Mode, which is not sensitive to the flag bit -- a characteristic that has its own set of problems. This awkwardness was resolved in the B6500 and later systems by expanding the tag and moving it to a separate field outside the 48-bit data portion of the word. In addition, character operations in the B6500 were combined with word operations into a single mode of processor operation.

Bit 1 is the sign of the mantissa, with a one indicating negative values. This is a signed-magnitude representation, so both positive and negative zero values are possible, although the arithmetic operators do not produce negative-zero results.

Bit 2 is the sign of the exponent, which also has a signed-magnitude representation.

The next six bits are the magnitude of the exponent, which is a power of eight. Therefore, when normalizing or scaling a floating-point value, the mantissa is shifted in three-bit groups, or octades.

The low-order 39 bits in the word are the mantissa. This size yields a precision of approximately 11.5 decimal digits.

With the exception of the flag bit, this looks like a fairly typical floating-point representation of the time, but there are two unusual things about it. The first is that the scaling point for the mantissa is not at the high-order end of the field, but rather at the low-order end. Unlike most floating-point representations that store the mantissa as a fraction, the B5500 represents its mantissa as an integer.

This leads to the second unusual characteristic. Not only is this the format of a floating-point operand, it is also the format for an integer operand. The B5500 has what is sometimes referred to as a unified numeric format. Integers are considered to be a subset of floating-point values, distinguished by having an exponent of zero. Most of the arithmetic operators attempt to keep the result of integer operands as an integer, but will automatically switch to a floating-point representation if the result overflows the integer range. Some floating-point results are not completely normalized, but that does not detract from their use in later calculations.

The idea for this unified format came either from the Bendix G-20 or the fertile mind of Bob Barton, depending on whose version of events you choose to believe. See the 1985 B5000 Oral History transcription for the story. The details of the formats for the two machines differ quite a bit, but the connection with the G-20 is plausible, as its predecessor, the G-15, was designed by Harry Huskey, who also consulted with Electrodata/Burroughs in the 1950s.

One consequence of this form of numeric representation is that you do not need separate instructions for integer and floating-point operations. To the hardware, there is no operational difference between 1 and 1.0, so a second consequence is that integer and floating operands can be mixed arbitrarily. A third consequence is that most integer values can be stored in multiple forms. For example, the value +1 has multiple representations, each with a different exponent value. The two most common are the one normalized as an integer [octal 0000000000000001], and the one fully-normalized as a floating-point value [octal 1141000000000000, i.e., (1×8^12) × (8^-12)].
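
As a concrete illustration of the word format and of these multiple representations, here is a sketch of my own (not code from the emulator) that picks the fields out of a 48-bit word held in a JavaScript Number. JavaScript's bitwise operators handle only 32 bits, so the high-order fields must be extracted arithmetically:

    // Decompose a 48-bit B5500 word into its single-precision operand fields.
    // Bit 0 is the high-order (flag) bit; bits 9-47 are the integer mantissa.
    function decomposeSP(word) {
        var mantissa = word % 0x8000000000;           // low-order 39 bits
        var high = (word - mantissa) / 0x8000000000;  // high-order 9 bits
        return {
            flag:         (high >>> 8) & 1,           // bit 0: 1 = control word
            mantissaSign: (high >>> 7) & 1,           // bit 1: 1 = negative
            exponentSign: (high >>> 6) & 1,           // bit 2: 1 = negative
            exponent:     high & 0x3F,                // bits 3-8: power of 8
            mantissa:     mantissa                    // bits 9-47: integer
        };
    }

    // The two common representations of +1 from the text:
    console.log(decomposeSP(parseInt("0000000000000001", 8)));
        // exponent 0, mantissa 1
    console.log(decomposeSP(parseInt("1141000000000000", 8)));
        // exponent sign negative, exponent 12, mantissa 8**12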

Doing arithmetic on mixed integer and floating-point values seems as if it might be quite complex, but its implementation on the B5500 is actually simpler than you may expect. The mechanization of the arithmetic operators is quite clever, and is discussed in headache-inducing detail in the Training Manual cited above. Here is a quick overview:
  • Addition and subtraction require that the exponents be equal. If both operands are in integer form, their exponents are zero, and the mantissas can simply be added or subtracted. If the exponents are unequal, the value with the larger exponent is normalized (shifted left with a decrease in exponent, if not already fully normalized) and the value with the smaller exponent is scaled (shifted right with an increase in exponent) until the exponents match or one of the mantissas goes to zero. If adding two integers yields a value that exceeds 39 bits, the result is automatically scaled, producing a floating-point result, with consequent loss of one octade of precision. A flip-flop keeps track of octades scaled off the low-order end of the word so the result can be rounded. (A code sketch of this alignment-and-rounding step follows this list.)
  • Multiplication notes whether both operands are initially in integer form, and if so, tries to produce an integer result, automatically overflowing to floating-point as necessary. Otherwise both operands are fully normalized before being multiplied.
  • Standard division, following the rules of Algol, always produces a real (floating-point) result, even with integer operands, and thus always normalizes its operands before commencing the division.
  • Integer division always normalizes its operands, but is mechanized in such a way as to produce either an integer result or an Integer Overflow interrupt.
  • Remainder division always normalizes its operands, and curiously, always produces a result in floating-point form. 5 mod 3 yields 2.0 in fully-normalized floating-point form. 3.3 mod 2 yields 1.3, or as close to it as you can represent with a binary fraction.
Variants of the store operators can normalize operands to integer representation when the semantics of the programming language require such. Fractional values are rounded during integerization. Attempting to integerize a single-precision value whose magnitude exceeds 39 bits results in an Integer Overflow interrupt.
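
Here is a sketch of my own (simplified, and not the emulator's actual code) of the alignment-and-rounding step described in the first rule above. Operands are held as {mant, exp} pairs of an unsigned integer mantissa and a signed power-of-eight exponent; mantissa signs are omitted, which sidesteps the complemented-rounding-bit problem discussed later in this post:

    // B5500-style operand alignment for Add: normalize the larger-exponent
    // operand, scale the smaller-exponent operand, then add and round.
    var MANT_LIMIT = Math.pow(8, 13);       // mantissas are 13 octades (39 bits)

    function alignAndAdd(a, b) {
        var lastOctade = 0;                 // analog of the Q01F rounding flip-flop

        // Normalize (shift left, decrement exponent) while the larger-exponent
        // operand is not yet fully normalized:
        while (a.exp > b.exp && a.mant * 8 < MANT_LIMIT) { a.mant *= 8; a.exp--; }
        while (b.exp > a.exp && b.mant * 8 < MANT_LIMIT) { b.mant *= 8; b.exp--; }

        // Scale (shift right, increment exponent) the smaller-exponent operand,
        // remembering the last octade shifted off the low end:
        while (a.exp < b.exp && a.mant > 0) {
            lastOctade = a.mant % 8; a.mant = Math.floor(a.mant / 8); a.exp++;
        }
        while (b.exp < a.exp && b.mant > 0) {
            lastOctade = b.mant % 8; b.mant = Math.floor(b.mant / 8); b.exp++;
        }

        var mant = a.mant + b.mant + (lastOctade >= 4 ? 1 : 0);  // add and round
        var exp = Math.max(a.exp, b.exp);
        if (mant >= MANT_LIMIT) {           // 39-bit overflow: scale the result,
            mant = Math.floor(mant / 8);    // losing one octade of precision
            exp++;                          // (rounding of this step omitted)
        }
        return {mant: mant, exp: exp};
    }

Note that two integer operands (both exponents zero) fall straight through the alignment loops and are added directly, overflowing to floating-point form only when the sum exceeds 39 bits, just as described above.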

Extending to Double Precision


So much for the single-precision representation and basic arithmetic behavior on the B5500. In terms of data representation, double-precision values are a straightforward extension of the single-precision format:

[Figure: B5500 Double-precision Word Formats]
The first word of a DP value has the same representation as a SP value. The second word contains a 39-bit extension of the mantissa. The high-order nine bits of this second word are ignored by the processor. The scaling point remains at the low-order end of the first word -- the high-order mantissa is still an integer, but the low-order mantissa is effectively a fraction appended to that integer. The first word is generally stored at the lower address, but this is not required, as the processor must load and store the two words individually. Conveniently, a SP value can be converted to a DP value simply by appending a word of zeros to it.
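
In code terms, assuming a DP value is held as a pair of 48-bit words (my own representation, not the emulator's):

    // The DP mantissa is the first word's 39-bit integer with the second
    // word's 39 bits appended as a fraction:
    //     value = (highMantissa + lowMantissa / 8**13) * 8**exponent
    // Converting SP to DP therefore just appends a word of zeros:
    function spToDP(word) {
        return {first: word, second: 0};   // second word: mantissa extension
    }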

This unified numeric representation worked well enough on the B5500 that it was carried forward into the B6500 and later systems. It is still used in the modern Unisys MCP systems. The data formats and numeric behavior in the modern systems are the same, with four exceptions:
  1. The flag bit is ignored, as its function was moved to the extra tag bits present in each word on the later systems.
  2. With the exception of the flag bit, the SP word format is the same, but in the second word used for DP operands, the high-order nine bits are used as a high-order extension to the exponent. Thus the first word has the low-order exponent and high-order mantissa, while the second word has the high-order exponent and low-order mantissa.
  3. Remainder divide with integer operands yields a result in integer form. This is a welcome refinement.
  4. Mechanization of the arithmetic functions is somewhat more sophisticated. The details of this have changed over the years, but current systems have extra guard digits, and will produce sub-normal numbers instead of Exponent Underflow interrupts at the very low end of the value range.

The Trouble with Double


So what is it that makes double precision so difficult in the B5500 emulator? The answer to that lies in the registers that are available inside the processor, or rather, the lack of them.

The B5500 processor has over 20 registers, but only four of them are larger than 15 bits. The two top-of-stack registers, A and B, hold 48 bits. These also serve roles similar to that of an accumulator on other machines. An extension register, X, holds 39 bits, large enough for a mantissa field. The fourth large register, P, is also 48 bits, but holds the current program word and is not involved in arithmetic operations.

Note that there is only one extension register, X. Therefore, it isn't possible for the processor internally to hold and operate on two full DP values at the same time. Double-precision arithmetic must be done in parts. What is worse, octades of the mantissa can be shifted only between the B and X registers, so when normalization or scaling of a DP operand is necessary, the operand must be present in B and X. The mantissa field of the A register can be transferred and exchanged with the X register, but shifting between those two registers is not possible.

These limitations lead to what I think of as The Dance of Insufficient Registers. The processor must go through a complex sequence of memory loads and stores during a double-precision operation, shuffling words between registers and the memory portion of the stack. In most cases, the memory portion of the stack actually grows temporarily as the operator pushes intermediate results, although the stack ultimately shrinks by two words as the operation consumes one of the DP operands, leaving the final DP result in the A and B registers.

Complicating the situation somewhat, the processor expects the high-order word of a DP operand to be on top of the stack, meaning the low-order word is in the stack at a lower address -- exactly the opposite order in which DP values are generally stored in memory. The rationale for this appears to be that it positions the words to make The Dance somewhat more efficient, but at the cost that setting up the operands in the stack is sometimes less efficient. The processor does not have double-precision load or store operations, so as mentioned previously, each half of a DP operand must be pushed or stored individually by software.
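
So, with a hypothetical array standing in for the stack, setting up a DP operand looks like this sketch:

    // Push a DP operand so its high-order word ends up on top of the stack:
    var stack = [];
    var dpOne = {first: parseInt("0000000000000001", 8), second: 0};  // DP +1
    stack.push(dpOne.second);   // low-order word first (lower stack address)
    stack.push(dpOne.first);    // high-order word on top, where the DP operators expect it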

Thus, double precision on the B5500 is a mixed blessing. On one hand, it yields up to 78 bits of precision -- 23 decimal digits. On the other hand, you would need to really want that degree of precision, because double-precision operations were not fast. A typical add operation may require 6 or more memory references, in addition to any required for the initial stack adjustment. Lots of clock cycles were required on top of that to normalize/scale the operands, and possibly the result. In the case of Multiply and Divide, lots more cycles were required to develop the 26-octade result.

Emulating Double Precision


In general, the emulator tries to do things the way the B5500 hardware did them, but unless you are trying to do a clock-level emulation (which probably isn't practical from a performance perspective in our web-based Javascript environment), the way that you mechanize an operator in software can be -- and sometimes must be -- quite a bit different from the way it is mechanized using digital logic. Circuits like to work in parallel, but software likes to work sequentially.

Our goal in this project has been to produce a "functional" emulation, meaning that at the end of each instruction, any state that may be needed by future instructions must have been developed and stored in the registers. Any "scratch state" that has no further use need not be preserved, and need not even be developed to begin with. In Word Mode, the state of the M, N, X, Y, and Z registers and most of the Q-register flip-flops fall into this scratch-state category. In some cases, we've developed and preserved this otherwise unneeded state for potential display purposes, but we haven't been very religious about it.

Thus, while the implementation of most operators in the emulator follows the general outline of their digital-logic implementation, the low-level details are often quite different, and are usually simpler. For example, multiplication is mechanized much the same way a person would do it by hand, multiplying the multiplicand by each digit (or rather, octade) of the multiplier in sequence, shifting the partial products, and adding them to produce the result. The B5500 hardware did the individual multiplications by repeated addition of the multiplicand, but the emulator does not need to operate at that primitive a level -- it just multiplies the multiplicand by the current octade of the multiplier. In general, the SP arithmetic operators work at a somewhat higher level of abstraction than did the B5500 hardware.
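
As a sketch of that octade-at-a-time approach (simplified, and not the emulator's actual code), here is SP-style mantissa multiplication. The 26-octade product is kept as a high/low pair so no intermediate value exceeds the exact integer range of a JavaScript Number:

    // Multiply two 13-octade (39-bit) mantissas one multiplier octade at a
    // time, shifting the double-length product right after each step.
    var OCTADES = 13;

    function multiplyMantissas(multiplicand, multiplier) {
        var high = 0, low = 0;
        for (var i = 0; i < OCTADES; i++) {
            high += multiplicand * (multiplier % 8);  // one octade's partial product
            multiplier = Math.floor(multiplier / 8);
            // Shift the product right one octade; the octade shifted out of
            // "high" becomes the top octade of "low":
            low = Math.floor(low / 8) + (high % 8) * Math.pow(8, OCTADES - 1);
            high = Math.floor(high / 8);
        }
        return {high: high, low: low};     // product = high*8**13 + low
    }

    // e.g., multiplyMantissas(3, 5) yields {high: 0, low: 15}

The real logic also had to handle signs, normalization of the operands, and rounding of the truncated low-order octades, all of which this sketch ignores.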

The big lesson from the work in the emulator on DP arithmetic operators, and the thing that has caused such a long delay in the most recent release, is that we've had to mechanize these operators much closer to the low-level way the B5500 hardware works than I had originally expected. It has also turned out that doing the actual arithmetic is a relatively small part of the job. A lot more effort has had to go into implementing The Dance, normalizing and scaling the operands and results, and the Mother of All Hair-Pullers, getting rounding to work properly.

Getting the rounding right is largely a function of how you keep track of octades shifted off the low end during scaling, which you would think is not that big a deal. Well, it isn't -- unless you have negative numbers -- in which case the rounding bit in some cases must be complemented. There are two operands and only one rounding bit (Q01F), and which operand gets scaled depends on the magnitude of their exponents, but only one operand at a time can be in the B and X registers to be scaled, and either one may or may not be negative -- you get the picture. Attempting to shortcut in software the way the digital logic worked turned out to be an exercise in futility.
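
A decimal analogy (my own illustration, not the B5500's actual logic) shows why the sign forces the complement:

    // Keeping three decimal places, the same scaled-off digits that round an
    // addition up act as a borrow in a subtraction:
    var a = 1.000, b = 0.00061;
    console.log((a + b).toFixed(3));  // "1.001" -- 1.00061 rounds up
    console.log((a - b).toFixed(3));  // "0.999" -- 0.99939 rounds DOWN; treating
                                      // the scaled-off digits as a round-up here
                                      // would wrongly give 1.001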

I learned the hard way with the SP operators how important it is to get the rounding right. I had a B5500 Algol program that did orthonormalization of vectors in single precision, and results from a run of that program in 1970 that were formatted to 12 digits. I transcribed that program and ran it under the emulator, comparing its output to the 1970 listing. Alas, the results matched only to one significant digit, at most. This both astonished and annoyed me, and I spent weeks poring over the code for the program, and over the code for the arithmetic operators in the emulator, trying to find what was causing the emulator to generate such poor results. It got me nowhere.

Finally, in desperation last Fall, I wrote an Algol program to try to thoroughly test the SP arithmetic operators in the emulator, especially in terms of overflow and rounding. That program used a table of 64 "interesting" numeric bit patterns. It worked by adding, subtracting, multiplying, and dividing all 64 patterns against each other, and dumping the results to a printer file in octal. Then I converted that program to modern Algol and ran it under the modern MCP. Comparing the output of the two showed that the results for SP Multiply and Divide were agreeing nicely. There were some normalization differences, and in some cases the B5500 generates Exponent Underflow interrupts (which the MCP converts to a zero result in the stack) while the modern system generates valid numbers (due to its larger exponent range), but otherwise the values were arithmetically equal.
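
For the curious, the shape of that test was roughly as follows (a sketch, with ordinary JavaScript arithmetic standing in for the emulator's operators and only a few of the 64 patterns shown):

    // Apply each operator to every ordered pair of "interesting" values and
    // dump the results for comparison against a reference run.
    var patterns = [0, 1, -1, 0.5, 549755813887, 1e-12];  // real table had 64 entries
    var ops = {
        ADD: function (x, y) { return x + y; },
        SUB: function (x, y) { return x - y; },
        MUL: function (x, y) { return x * y; },
        DIV: function (x, y) { return y === 0 ? "DIVIDE BY ZERO" : x / y; }
    };
    patterns.forEach(function (x) {
        patterns.forEach(function (y) {
            for (var name in ops) {
                console.log(name, x, y, "=>", ops[name](x, y));
            }
        });
    });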

For SP Add/Subtract, many of the results were equivalent, but the rest differed only in the low-order bit of the mantissa -- it was a rounding difference between the two systems. Fortunately, examining a few of those differing results showed where the emulator was not handling rounding properly, mostly during scaling. Fixing those few -- seemingly obscure -- rounding problems resolved all but a very few of the differences between the emulator and the modern MCP. Upon rerunning my orthonormalization program, the results from the emulator finally agreed with the 1970 listing, to the digit. That was both quite a relief, and a real lesson on the significance of rounding.

Those with a background in numerical analysis are by now probably rolling their eyes or on the floor laughing. This certainly isn't the first floating-point implementation to suffer from bad rounding -- the original IBM 360 was notorious for its bad precision, due largely to the fact that it did not even try to round its results -- and it probably won't be the last. The IEEE 754 (ISO/IEC 60559) standard has done a lot to improve the precision of floating-point arithmetic, but that did not come along until more than 20 years after the B5500.

When I decided to start again on the DP operators earlier this year, rounding was thus very much on my mind, and I tried to apply the fixes from the SP operators to DP Add/Subtract. To test, I built a new program using those same 64 "interesting" bit patterns, but this time in pairs, to exercise the DP operators. I also converted that new program to modern Algol to generate results for comparison.

The initial results from this test were pretty disappointing. There were still rounding problems, the signs were often wrong, and it appeared that carries between the two halves of the mantissa weren't always working right. After tinkering with the original design quite a bit and getting nowhere, I decided that my high-level, software approach to mechanizing DP Add/Subtract wasn't going to work, and started to look much more closely at how the B5500 actually does arithmetic.

The Training Manual mentioned above is mostly a narrative guide to another document known as "the flows." These are state diagrams that show, on a clock-by-clock basis, how the logic levels in the processor cause changes in states of the registers and flip-flops. They are essentially a schematic representation of the logic equations for the system. We did not have access to the flows when starting the project, just the narrative description of them in the Training Manual, but they have since become available as the B5000 Processor Flow Chart document on bitsavers.org. The narrative in the Training Manual is pretty good, but it doesn't tell you everything. The flows are as close to The Truth about the B5500 as we are likely ever to get, and they have been invaluable in solving several problems with the emulator.

Thus, it was the flows that I turned to in order to fix the DP implementation. It has taken three complete rewrites of DP Add/Subtract, and some major rework on DP Divide. I couldn't reconcile my original approach to the flows, so each successive rewrite moved the implementation closer to being the state machine described in the flows. I now realize I could have saved myself a lot of trouble if I had just slavishly coded from the flows to begin with, but by more closely modeling the flows, the emulator now produces DP results that compare favorably with those from the modern MCP.

I confess that "compare favorably" is a bit of hand-waving on my part. The Add/Subtract tests match perfectly in many cases. In the remaining cases, they differ by the low-order bit. That looks like the same type of difference that got me into trouble with the SP arithmetic operators. In looking at several of these cases, however, the emulator appears to be generating the result that the flows say it should -- assuming I'm reading the flows properly, of which I'm always in doubt.

The differences in the results for Multiply and Divide are mostly in the two low-order octades. DP Multiply is known to be imprecise at this level, however. Here is what the Training Manual has to say on the subject (page 3.23-1):
Twenty seven [octal] digits of the 52 digit product are retained. The product is normalized and truncated to a 26 digit result. The least significant two digits are not considered a precise part of the result because there may be a maximum error of 1 in the twenty-fifth digit position. [emphasis mine]

That nicely describes most of the differences I am seeing in the Multiply tests. DP Divide uses DP Multiply during its final stage of developing a quotient, so we should expect to see similar imprecision for division.

Another thing to keep in mind -- and something that I need to keep reminding myself -- is that matching results with a modern MCP implementation is not the goal. The goal is for the emulator to work the way a B5500 did. The only reasons for using the modern MCP as a basis for comparison are (a) it has a similar floating-point implementation, and (b) we don't presently have any double-precision results from a real B5500 to compare against. Thus, the modern MCP is the best standard we have to compare against, but it's highly likely that it differs in some cases from what a B5500 would have generated.

Of course, it's also highly likely that the emulator isn't quite right yet, either. I won't be the least bit surprised if we find flaws in the emulator's current DP implementation, but what we have seems to be good enough to release, and it's certainly in better shape than the original SP implementation was.


Those who may be interested in seeing the results of the tests with the 64 "interesting" bit patterns can view PDF comparisons at the following links. Be forewarned, though -- this is a lot more octal than any normal person should ever want to see.

Other Significant Changes in 0.20

1. The mechanism that schedules and manages the many asynchronous activities of the emulator -- running one or two processors, doing multiple I/Os, updating the console lights, and driving SPO and datacom output at ten characters/second -- has been heavily reworked in this release and implemented consistently across all of the emulator components. How this is done and the history of its development is worthy of a blog post on its own, so I won't go into details here. Suffice it to say that you should see somewhat better I/O behavior and snappier performance overall.

2. Character translation and keyboard filtering in the Datacom terminal device have been modified in an attempt to support CANDE and the TSS MCP better.

3. Button colors and the way they are illuminated have been standardized across the B5500 Console and I/O device windows.

4. Four tape drives (MTA-MTD) are now enabled in the default system configuration.

In Other News...

The B5500 emulator itself is nearing completion -- not that it will actually ever be completed, of course -- and effort is already beginning to shift from making the emulator work to having more things for it to work with. There is lots of interesting software already available, but most of it is in the form of scanned listings. Those listings must be transcribed into machine-readable source code. That is a tedious and error-prone task. We've already had about as much luck with 40-year-old 7-track tapes as we are likely to have, so transcription is the best path to more applications for the B5500.

Fortunately, significant progress is being made towards making transcription easier and more reliable. Jim Fehlinger in New Jersey (USA) has managed to get an off-the-shelf OCR program to do a passable job of converting scanned listings from bitsavers.org. It is still a very labor-intensive process, involving lots of manual validation and correction of the OCR output, but it is producing usable source code much faster than any of us have been able to do before by simply keying the text.

One thing that we learned early on with the Mark XVI Algol and ESPOL compiler transcriptions is that a compiler is a better proofreader than most people are. I spent a full week carefully proofreading the original Algol compiler transcription, only to have my first attempt at compiling that code identify more typos than that week of very tedious effort had. A compiler isn't perfect for this -- it won't find typos in comments and literals, for example -- but it is a powerful proofing tool.

Jim has used this idea to good advantage. After OCR-ing and manually correcting several pages from a listing, he then compiles the source he has accumulated up to that point. He corrects any errors, and does additional compilation passes as necessary until there are no more errors left to be corrected. Then he goes back to OCR-ing, and the cycle continues. The process is not perfect, just a lot better than anything we've had up to now.

Thus far, Jim has managed to complete transcriptions of the following:
  • XBASIC, an interactive BASIC interpreter, developed by the Paisley College of Technology in the mid 1970s.
  • B6500/SIM, a simulator for the Burroughs B6500 that runs on the B5500. This was developed by the Burroughs B6500 engineering team in the mid/late-1960s. Its use awaits development of a variant of Algol, LONGALG, which did special array handling for the simulated B6500 memory. We do not have any materials for LONGALG, so are going to have to guess how it worked and try to patch the standard Algol compiler to replicate its behavior.
  • B6500 ESPOL, a cross-compiler for the B6500 that ran on the B5500. This was also developed by the Burroughs engineering team to create the initial B6500 MCP.
Jim is currently working on the source for the Mark I.0 B6500 MCP. He has been using the B6500 ESPOL cross-compiler to validate his scanning of the MCP. Since that ESPOL compiler is a product of his scanning process, it still had errors that a simple compile could not uncover, so he and I have had an interesting exchange over the past month. Jim uses the compiler as best he can until the compiler starts crashing or generating false syntax errors. He sends those to me, and I try to debug the compiler, sending him corrections so he can continue validating his OCR work. A remarkable number of the problems have been due to confusion between the plus-sign and the left-arrow. We have also had some really nasty bugs due to confusion between "I" and "1". We are slowly getting the compiler debugged, but the original compiler listing appears to have been of very poor quality, and there are sure to be more problems like this that we have not yet uncovered. I'm impressed that Jim has been able to convert the scan of that listing as well as he has.

Coming Attractions

The plan for the next release of the emulator is to make some improvements in the user interface, particularly in the area of system configuration control. This will probably take several weeks, so stay tuned.