
Wednesday, July 9, 2014

Double Trouble: Version 0.20 Released

Nigel and I are pleased to announce that version 0.20 of the retro-B5500 emulator was released on 29 June. All changes have been posted to the Subversion repository for our Google Code project. The hosting site has also been updated with this release, and for those of you running your own web server, a zip file of the release can be downloaded from Google Drive.

It has been five months since the previous version, 0.19, was released. That is far longer than any of us would have liked, but the main item in this release proved to be quite a challenge, as the following discussion will detail.


Double-Precision Arithmetic

The major enhancement in this release, and one that has been a long time coming, is a full implementation of the double-precision (DP) arithmetic operators, DLA, DLS, DLM, and DLD. These were the last operators left to be implemented in the Processor component of the emulator. Since the earliest releases they have been stubbed out by their single-precision (SP) equivalents, although a preliminary (and not very good) implementation of DP Add/Subtract has been available for more than a year.

There is actually one more operator that remains incompletely implemented, Initiate for Test (IFT, octal 5111). This is a diagnostic operator, and is available only in Control State. On the B5500, this operator had the ability to inject arbitrary state into the processor registers and initiate execution in the middle of an instruction. We do not emulate the B5500 at the clock level, however, and in particular we do not support the J register, which was used as a state variable to control stepping through execution of an instruction. Thus, we can't completely implement the IFT operator.

Coming into this project, I had suspected that the arithmetic operators were going to be difficult. In fact, I tried to hand them off to Nigel, but he was smart enough to hand them back. The SP operators were indeed a challenge, but that part of the instruction set proved to be very interesting to work on. If nothing else, I finally understood how long division works.

After getting the SP operators to work, I started to look at the DP operators, thinking they would be a straightforward extension of their SP equivalents. Oh, my... reading about the DP operators in the B2581 Processor Training Manual revealed that they were much, much more complex. In fact, they were downright intimidating. To understand why, we first need to look at how arithmetic is done in the B5500.

An Overview of Single Precision


The B5500 has a 48-bit word. That word can hold a single-precision numeric operand, or eight 6-bit characters, or a variety of control words. A SP numeric operand looks like this:

[Figure: B5500 Single-precision Word Format]
The high-order bit, numbered 0, is the flag bit, which is zero for operands and one for control words. Attempting to access a word as an operand in Word Mode that has its flag bit set will cause a (typically fatal) Flag Bit interrupt. In this sense the B5500 is a tagged-word architecture, but having the tag inside the word is quite awkward for processing character data -- when the high-order character has its high-order bit set and the processor is in Word Mode, it looks like a control word. Thus, character processing is normally done in Character Mode, which is not sensitive to the flag bit -- a characteristic that has its own set of problems. This awkwardness was resolved in the B6500 and later systems by expanding the tag and moving it to a separate field outside the 48-bit data portion of the word. In addition, character operations in the B6500 were combined with word operations into a single mode of processor operation.

Bit 1 is the sign of the mantissa, with a one indicating negative values. This is a signed-magnitude representation, so both positive and negative zero values are possible, although the arithmetic operators do not produce negative-zero results.

Bit 2 is the sign of the exponent, which also has a signed-magnitude representation.

The next six bits are the magnitude of the exponent, which is a power of eight. Therefore, when normalizing or scaling a floating-point value, the mantissa is shifted in three-bit groups, or octades.

The low-order 39 bits in the word are the mantissa. This size yields a precision of approximately 11.5 decimal digits.

With the exception of the flag bit, this looks like a fairly typical floating-point representation of the time, but there are two unusual things about it. The first is that the scaling point for the mantissa is not at the high-order end of the field, but rather at the low-order end. Unlike most floating-point representations that store the mantissa as a fraction, the B5500 represents its mantissa as an integer.

This leads to the second unusual characteristic. Not only is this the format of a floating-point operand, it is also the format for an integer operand. The B5500 has what is sometimes referred to as a unified numeric format. Integers are considered to be a subset of floating-point values, distinguished by having an exponent of zero. Most of the arithmetic operators attempt to keep the result of integer operands as an integer, but will automatically switch to a floating-point representation if the result overflows the integer range. Some floating-point results are not completely normalized, but that does not detract from their use in later calculations.

The idea for this unified format came either from the Bendix G-20 or the fertile mind of Bob Barton, depending on whose version of events you choose to believe. See the 1985 B5000 Oral History transcription for the story. The details of the formats for the two machines differ quite a bit, but the connection with the G-20 is plausible, as its predecessor, the G-15, was designed by Harry Huskey, who also consulted with Electrodata/Burroughs in the 1950s.

One consequence of this form of numeric representation is that you do not need separate instructions for integer and floating-point operations. To the hardware, there is no operational difference between 1 and 1.0, so a second consequence is that integer and floating operands can be mixed arbitrarily. A third consequence is that most integer values can be stored in multiple forms. For example, the value +1 has multiple representations, each with a different exponent value. The two most common are the one normalized as an integer [octal 0000000000000001], and the one fully-normalized as a floating-point value [octal 1141000000000000, i.e., (1×8^12) × (8^-12)].
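Since the field layout is fully described above, it can be sketched as a small decoder. The following JavaScript helper is hypothetical (it is not part of the emulator); it unpacks a 48-bit word, passed as a BigInt, and shows that both representations of +1 above decode to the same value.

```javascript
// Hypothetical decoder for a 48-bit B5500 single-precision word (BigInt).
// Bit 0 is the high-order bit of the word, i.e. bit 47 counting from the low end.
function decodeSP(word) {
  const flag = (word >> 47n) & 1n;                  // bit 0: flag (1 = control word)
  const mSign = (word >> 46n) & 1n ? -1 : 1;        // bit 1: mantissa sign
  const eSign = (word >> 45n) & 1n ? -1 : 1;        // bit 2: exponent sign
  const exp = Number((word >> 39n) & 0o77n);        // bits 3-8: exponent magnitude
  const mantissa = Number(word & 0o7777777777777n); // low-order 39 bits: mantissa
  if (flag) throw new Error("flag bit set: control word, not an operand");
  return mSign * mantissa * Math.pow(8, eSign * exp);
}

// Both representations of +1 from the text decode to the same value:
decodeSP(0o0000000000000001n);  // integer form → 1
decodeSP(0o1141000000000000n);  // fully-normalized floating form → 1
```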

Doing arithmetic on mixed integer and floating-point values seems as if it might be quite complex, but its implementation on the B5500 is actually simpler than you may expect. The mechanization of the arithmetic operators is quite clever, and is discussed with headache-inducing detail in the Training Manual cited above. Here is a quick overview:
  • Addition and subtraction require that the exponents be equal. If both operands are in integer form, their exponents are zero, and therefore can simply be added or subtracted. If the exponents are unequal, the value with the larger exponent is normalized (shifted left with a decrease in exponent, if not already fully normalized) and the value with the smaller exponent is scaled (shifted right with an increase in exponent) until the exponents match or one of the mantissas goes to zero. If adding two integers yields a value that exceeds 39 bits, the result is automatically scaled, producing a floating-point result, with consequent loss of one octade of precision. A flip-flop keeps track of octades scaled off the low-order end of the word so the result can be rounded.
  • Multiplication notes whether both operands are initially in integer form, and if so, tries to produce an integer result, automatically overflowing to floating-point as necessary. Otherwise both operands are fully normalized before being multiplied.
  • Standard division, following the rules of Algol, always produces a real (floating-point) result, even with integer operands, and thus always normalizes its operands before commencing the division.
  • Integer division always normalizes its operands, but is mechanized in such a way as to produce either an integer result or an Integer Overflow interrupt.
  • Remainder division always normalizes its operands, and curiously, always produces a result in floating-point form. 5 mod 3 yields 2.0 in fully-normalized floating-point form. 3.3 mod 2 yields 1.3, or as close to it as you can represent with a binary fraction.
Variants of the store operators can normalize operands to integer representation when the semantics of the programming language require such. Fractional values are rounded during integerization. Attempting to integerize a single-precision value whose magnitude exceeds 39 bits results in an Integer Overflow interrupt.
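The add/subtract alignment rules above can be sketched in JavaScript. This is an illustrative model only, not the emulator's code: operands are unsigned {m, e} pairs with value m × 8^e, negative values and interrupts are ignored, and the single rounding flip-flop (Q01F) is modeled as a plain variable.

```javascript
const MAX = Math.pow(8, 13);   // 2^39: one past the largest 39-bit mantissa

// Sketch of SP add alignment: normalize the larger-exponent operand leftward,
// scale the smaller one rightward (remembering the last octade shifted off
// for rounding), then add, overflowing to floating-point form if needed.
function spAdd(a, b) {
  let hi = a.e >= b.e ? { ...a } : { ...b };
  let lo = a.e >= b.e ? { ...b } : { ...a };
  while (hi.e > lo.e && hi.m * 8 < MAX) { hi.m *= 8; hi.e -= 1; }  // normalize
  let round = 0;
  while (hi.e > lo.e && lo.m > 0) {                                // scale
    round = lo.m % 8; lo.m = Math.floor(lo.m / 8); lo.e += 1;
  }
  let m = hi.m + lo.m, e = hi.e;
  if (m >= MAX) { round = m % 8; m = Math.floor(m / 8); e += 1; }  // overflow
  if (round >= 4) m += 1;                                          // round
  return { m, e };
}
```

For two integers the exponents are already zero and neither loop runs: spAdd({m: 5, e: 0}, {m: 3, e: 0}) yields {m: 8, e: 0}, while adding 1 to the largest 39-bit integer overflows to the floating-point form {m: 8^12, e: 1}.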

Extending to Double Precision


So much for the single-precision representation and basic arithmetic behavior on the B5500. In terms of data representation, double-precision values are a straightforward extension of the single-precision format:

[Figure: B5500 Double-precision Word Formats]
The first word of a DP value has the same representation as a SP value. The second word contains a 39-bit extension of the mantissa. The high-order nine bits of this second word are ignored by the processor. The scaling point remains at the low-order end of the first word -- the high-order mantissa is still an integer, but the low-order mantissa is effectively a fraction appended to that integer. The first word is generally stored at the lower address, but this is not required, as the processor must load and store the two words individually. Conveniently, a SP value can be converted to a DP value simply by appending a word of zeros to it.
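That scaling-point placement can be sketched with a hypothetical helper (not emulator code): the second word's 39-bit extension is a fraction appended to the first word's 39-bit integer mantissa, so a zero extension leaves the value unchanged.

```javascript
// Value of a DP pair from decoded fields: the signs and exponent come from
// the first word; loMantissa is the 39-bit extension from the second word.
function dpValue(mSign, eSign, exp, hiMantissa, loMantissa) {
  const mantissa = hiMantissa + loMantissa / Math.pow(2, 39);
  return (mSign ? -1 : 1) * mantissa * Math.pow(8, eSign ? -exp : exp);
}

dpValue(0, 0, 0, 5, 0);                // SP value 5 with a zero extension → 5
dpValue(0, 0, 0, 5, Math.pow(2, 38));  // extension adds one half → 5.5
```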

This unified numeric representation worked well enough on the B5500 that it was carried forward into the B6500 and later systems. It is still used in the modern Unisys MCP systems. The data formats and numeric behavior in the modern systems are the same, with four exceptions:
  1. The flag bit is ignored, as its function was moved to the extra tag bits present in each word on the later systems.
  2. With the exception of the flag bit, the SP word format is the same, but in the second word used for DP operands, the high-order nine bits are used as a high-order extension to the exponent. Thus the first word has the low-order exponent and high-order mantissa, while the second word has the high-order exponent and low-order mantissa.
  3. Remainder divide with integer operands yields a result in integer form. This is a welcome refinement.
  4. Mechanization of the arithmetic functions is somewhat more sophisticated. The details of this have changed over the years, but current systems have extra guard digits, and will produce sub-normal numbers instead of Exponent Underflow interrupts at the very low end of the value range.

The Trouble with Double


So what is it that makes double precision so difficult in the B5500 emulator? The answer to that lies in the registers that are available inside the processor, or rather, the lack of them.

The B5500 processor has over 20 registers, but only four of them are larger than 15 bits. The two top-of-stack registers, A and B, hold 48 bits. These also serve roles similar to that of an accumulator on other machines. An extension register, X, holds 39 bits, large enough for a mantissa field. The fourth large register, P, is also 48 bits, but holds the current program word and is not involved in arithmetic operations.

Note that there is only one extension register, X. Therefore, it isn't possible for the processor internally to hold and operate on two full DP values at the same time. Double-precision arithmetic must be done in parts. What is worse, octades of the mantissa can be shifted only between the B and X registers, so when normalization or scaling of a DP operand is necessary, the operand must be present in B and X. The mantissa field of the A register can be transferred and exchanged with the X register, but shifting between those two registers is not possible.

These limitations lead to what I think of as The Dance of Insufficient Registers. The processor must go through a complex sequence of memory loads and stores during a double-precision operation, shuffling words between registers and the memory portion of the stack. In most cases, the memory portion of the stack actually grows temporarily as the operator pushes intermediate results, although the stack ultimately shrinks by two words as the operation consumes one of the DP operands, leaving the final DP result in the A and B registers.

Complicating the situation somewhat, the processor expects the high-order word of a DP operand to be on top of the stack, meaning the low-order word is in the stack at a lower address -- exactly the opposite order in which DP values are generally stored in memory. The rationale for this appears to be that it positions the words to make The Dance somewhat more efficient, but at the cost that setting up the operands in the stack is sometimes less efficient. The processor does not have double-precision load or store operations, so as mentioned previously, each half of a DP operand must be pushed or stored individually by software.

Thus, double precision on the B5500 is a mixed blessing. On one hand, it yields up to 78 bits of precision -- 23 decimal digits. On the other hand, you would need to really want that degree of precision, because double precision operations were not fast. A typical add operation may require 6 or more memory references, in addition to any required for the initial stack adjustment. Lots of clock cycles were required on top of that to normalize/scale the operands, and possibly the result. In the case of Multiply and Divide, lots more cycles were required to develop the 26-octade result.

Emulating Double Precision


In general, the emulator tries to do things the way the B5500 hardware did them, but unless you are trying to do a clock-level emulation (which probably isn't practical from a performance perspective in our web-based JavaScript environment), the way that you mechanize an operator in software can be -- and sometimes must be -- quite a bit different from the way it is mechanized using digital logic. Circuits like to work in parallel, but software likes to work sequentially.

Our goal in this project has been to produce a "functional" emulation, meaning that at the end of each instruction, any state that may be needed by future instructions must have been developed and stored in the registers. Any "scratch state" that has no further use need not be preserved, and need not even be developed to begin with. In Word Mode, the state of the M, N, X, Y, and Z registers and most of the Q-register flip-flops fall into this scratch-state category. In some cases, we've developed and preserved this otherwise unneeded state for potential display purposes, but we haven't been very religious about it.

Thus, while the implementation of most operators in the emulator follows the general outline of their digital-logic implementation, the low-level details are often quite different, and are usually simpler. For example, multiplication is mechanized much the same way a person would do it by hand, multiplying the multiplicand by each digit (or rather, octade) of the multiplier in sequence, shifting the partial products, and adding them to produce the result. The B5500 hardware did the individual multiplications by repeated addition of the multiplicand, but the emulator does not need to operate at that primitive a level -- it just multiplies the multiplicand by the current octade of the multiplier. In general, the SP arithmetic operators work at a somewhat higher level of abstraction than did the B5500 hardware.
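That octade-at-a-time scheme can be sketched as follows. This is an illustration of the idea only, not the emulator's actual routine, and it ignores signs, normalization, and the 26-octade result length:

```javascript
// Multiply the way a person would by hand, but in base 8: take one octade
// of the multiplier at a time, form a shifted partial product, and add it in.
function octadeMultiply(multiplicand, multiplier) {
  let product = 0, shift = 1;
  while (multiplier > 0) {
    const octade = multiplier % 8;              // current low-order octade
    product += multiplicand * octade * shift;   // shifted partial product
    multiplier = Math.floor(multiplier / 8);
    shift *= 8;                                 // next octade position
  }
  return product;
}
```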

The big lesson from the work in the emulator on DP arithmetic operators, and the thing that has caused such a long delay in the most recent release, is that we've had to mechanize these operators much closer to the low-level way the B5500 hardware works than I had originally expected. It has also turned out that doing the actual arithmetic is a relatively small part of the job. A lot more effort has had to go into implementing The Dance, normalizing and scaling the operands and results, and the Mother of All Hair-Pullers, getting rounding to work properly.

Getting the rounding right is largely a function of how you keep track of octades shifted off the low end during scaling, which you would think is not that big a deal. Well, it isn't -- unless you have negative numbers -- in which case the rounding bit in some cases must be complemented. There are two operands and only one rounding bit (Q01F), and which operand gets scaled depends on the magnitude of their exponents, but only one operand at a time can be in the B and X registers to be scaled, and either one may or may not be negative -- you get the picture. Attempting to shortcut in software the way the digital logic worked turned out to be an exercise in futility.

I learned the hard way with the SP operators how important it is to get the rounding right. I had a B5500 Algol program that did orthonormalization of vectors in single precision, and results from a run of that program in 1970 that were formatted to 12 digits. I transcribed that program and ran it under the emulator, comparing its output to the 1970 listing. Alas, the results matched only to one significant digit, at most. This both astonished and annoyed me, and I spent weeks poring over the code for the program, and over the code for the arithmetic operators in the emulator, trying to find what was causing the emulator to generate such poor results. It got me nowhere.

Finally, in desperation last Fall, I wrote an Algol program to try to thoroughly test the SP arithmetic operators in the emulator, especially in terms of overflow and rounding. That program used a table of 64 "interesting" numeric bit patterns. It worked by adding, subtracting, multiplying, and dividing all 64 patterns against each other, and dumping the results to a printer file in octal. Then I converted that program to modern Algol and ran it under the modern MCP. Comparing the output of the two showed that the results for SP Multiply and Divide were agreeing nicely. There were some normalization differences, and in some cases the B5500 generates Exponent Underflow interrupts (which the MCP converts to a zero result in the stack) while the modern system generates valid numbers (due to its larger exponent range), but otherwise the values were arithmetically equal.
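The shape of that test harness is simple to sketch. The pattern table and the operation below are placeholders, not the actual 64 "interesting" values:

```javascript
// Run every ordered pair of test patterns through an operation and dump the
// results in octal, ready to be diffed against another implementation's output.
function allPairs(patterns, op) {
  const results = [];
  for (const a of patterns)
    for (const b of patterns)
      results.push(op(a, b).toString(8));   // octal, as the printer dumps were
  return results;
}

allPairs([1, 2], (a, b) => a + b);   // → ["2", "3", "3", "4"]
```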

For SP Add/Subtract, many of the results were equivalent, but the rest differed only in the low-order bit of the mantissa -- it was a rounding difference between the two systems. Fortunately, examining a few of those differing results showed where the emulator was not handling rounding properly, mostly during scaling. Fixing those few -- seemingly obscure -- rounding problems resolved all but a very few of the differences between the emulator and the modern MCP. Upon rerunning my orthonormalization program, the results from the emulator finally agreed with the 1970 listing, to the digit. That was both quite a relief, and a real lesson on the significance of rounding.

Those with a background in numerical analysis are by now probably rolling their eyes or on the floor laughing. This certainly isn't the first floating-point implementation to suffer from bad rounding -- the original IBM 360 was notorious for its bad precision, due largely to the fact that it did not even try to round its results -- and it probably won't be the last. The IEEE 754 (ISO/IEC 60559) standard has done a lot to improve the precision of floating-point arithmetic, but that did not come along until more than 20 years after the B5500.

When I decided to start again on the DP operators earlier this year, rounding was thus very much on my mind, and I tried to apply the fixes from the SP operators to DP Add/Subtract. To test, I built a new program using those same 64 "interesting" bit patterns, but this time in pairs, to exercise the DP operators. I also converted that new program to modern Algol to generate results for comparison.

The initial results from this test were pretty disappointing. There were still rounding problems, the signs were often wrong, and it appeared that carries between the two halves of the mantissa weren't always working right. After tinkering with the original design quite a bit and getting nowhere, I decided that my high-level, software approach to mechanizing DP Add/Subtract wasn't going to work, and started to look much more closely at how the B5500 actually does arithmetic.

The Training Manual mentioned above is mostly a narrative guide to another document known as "the flows." These are state diagrams that show, on a clock-by-clock basis, how the logic levels in the processor cause changes in states of the registers and flip-flops. They are essentially a schematic representation of the logic equations for the system. We did not have access to the flows when starting the project, just the narrative description of them in the Training Manual, but they have since become available as the B5000 Processor Flow Chart document on bitsavers.org. The narrative in the Training Manual is pretty good, but it doesn't tell you everything. The flows are as close to The Truth about the B5500 as we are likely ever to get, and they have been invaluable in solving several problems with the emulator.

Thus, it was the flows that I turned to in order to fix the DP implementation. It has taken three complete rewrites of DP Add/Subtract, and some major rework on DP Divide. I couldn't reconcile my original approach to the flows, so each successive rewrite moved the implementation closer to being the state machine described in the flows. I now realize I could have saved myself a lot of trouble if I had just slavishly coded from the flows to begin with, but by more closely modeling the flows, the emulator now produces DP results that compare favorably with those from the modern MCP.

I confess that "compare favorably" is a bit of hand-waving on my part. The Add/Subtract tests match perfectly in many cases. In the remaining cases, they differ by the low-order bit. That looks like the same type of difference that got me into trouble with the SP arithmetic operators. In looking at several of these cases, however, the emulator appears to be generating the result that the flows say it should -- assuming I'm reading the flows properly, of which I'm always in doubt.

The differences in the results for Multiply and Divide are mostly in the two low-order octades. DP Multiply is known to be imprecise at this level, however. Here is what the Training Manual has to say on the subject (page 3.23-1):
Twenty seven [octal] digits of the 52 digit product are retained. The product is normalized and truncated to a 26 digit result. The least significant two digits are not considered a precise part of the result because there may be a maximum error of 1 in the twenty-fifth digit position. [emphasis mine]

That nicely describes most of the differences I am seeing in the Multiply tests. DP Divide uses DP Multiply during its final stage of developing a quotient, so we should expect to see similar imprecision for division.

Another thing to keep in mind -- and something that I need to keep reminding myself -- is that matching results with a modern MCP implementation is not the goal. The goal is for the emulator to work the way a B5500 did. The only reasons for using the modern MCP as a basis for comparison are (a) it has a similar floating-point implementation, and (b) we don't presently have any double-precision results from a real B5500 to compare against. Thus, the modern MCP is the best standard we have to compare against, but it's highly likely that it differs in some cases from what a B5500 would have generated.

Of course, it's also highly likely that the emulator isn't quite right yet, either. I won't be the least bit surprised if we find flaws in the emulator's current DP implementation, but what we have seems to be good enough to release, and it's certainly in better shape than the original SP implementation was.


Those who may be interested in seeing the results of the tests with the 64 "interesting" bit patterns can view PDF comparisons at the following links. Be forewarned, though -- this is a lot more octal than any normal person should ever want to see.

Other Significant Changes in 0.20

1. The mechanism that schedules and manages the many asynchronous activities of the emulator -- running one or two processors, doing multiple I/Os, updating the console lights, and driving SPO and datacom output at ten characters/second -- has been heavily reworked in this release and implemented consistently across all of the emulator components. How this is done and the history of its development is worthy of a blog post on its own, so I won't go into details here. Suffice it to say that you should see somewhat better I/O behavior and snappier performance overall.

2. Character translation and keyboard filtering in the Datacom terminal device have been modified in an attempt to support CANDE and the TSS MCP better.

3. Button colors and the way they are illuminated have been standardized across the B5500 Console and I/O device windows.

4. Four tape drives (MTA-MTD) are now enabled in the default system configuration.

In Other News...

The B5500 emulator itself is nearing completion -- not that it will actually ever be completed, of course -- and effort is already beginning to shift from making the emulator work to having more things for it to work with. There is lots of interesting software already available, but most of it is in the form of scanned listings. Those listings must be transcribed into machine-readable source code. That is a tedious and error-prone task. We've already had about as much luck with 40-year-old 7-track tapes as we are likely to have, so transcription is the best path to more applications for the B5500.

Fortunately, significant progress is being made towards making transcription easier and more reliable. Jim Fehlinger in New Jersey (USA) has managed to get an off-the-shelf OCR program to do a passable job of converting scanned listings from bitsavers.org. It is still a very labor-intensive process, involving lots of manual validation and correction of the OCR output, but it is producing usable source code much faster than any of us have been able to do before by simply keying the text.

One thing that we learned early on with the Mark XVI Algol and ESPOL compiler transcriptions is that a compiler is a better proofreader than most people are. I spent a full week carefully proofreading the original Algol compiler transcription, only to have my first attempt at compiling that code identify more typos than that week of very tedious effort had. A compiler isn't perfect for this -- it won't find typos in comments and literals, for example -- but it is a powerful proofing tool.

Jim has used this idea to good advantage. After OCR-ing and manually correcting several pages from a listing, he then compiles the source he has accumulated up to that point. He corrects any errors, and does additional compilation passes as necessary until there are no more errors left to be corrected. Then he goes back to OCR-ing, and the cycle continues. The process is not perfect, just a lot better than anything we've had up to now.

Thus far, Jim has managed to complete transcriptions of the following:
  • XBASIC, an interactive BASIC interpreter, developed by the Paisley College of Technology in the mid 1970s.
  • B6500/SIM, a simulator for the Burroughs B6500 that runs on the B5500. This was developed by the Burroughs B6500 engineering team in the mid/late-1960s. Its use awaits development of a variant of Algol, LONGALG, which did special array handling for the simulated B6500 memory. We do not have any materials for LONGALG, so we are going to have to guess how it worked and try to patch the standard Algol compiler to replicate its behavior.
  • B6500 ESPOL, a cross-compiler for the B6500 that ran on the B5500. This was also developed by the Burroughs engineering team to create the initial B6500 MCP.
Jim is currently working on the source for the Mark I.0 B6500 MCP. He has been using the B6500 ESPOL cross-compiler to validate his scanning of the MCP. Since that ESPOL compiler is a product of his scanning process, it still had errors that a simple compile could not uncover, so he and I have had an interesting exchange over the past month. Jim uses the compiler as best he can until the compiler starts crashing or generating false syntax errors. He sends those to me, and I try to debug the compiler, sending him corrections so he can continue validating his OCR work. A remarkable number of the problems have been due to confusion between the plus-sign and the left-arrow. We have also had some really nasty bugs due to confusion between "I" and "1". We are slowly getting the compiler debugged, but the original compiler listing appears to have been of very poor quality, and there are sure to be more problems like this that we have not yet uncovered. I'm impressed that Jim has been able to convert the scan of that listing as well as he has.

Coming Attractions

The plan for the next release of the emulator is to make some improvements in the user interface, particularly in the area of system configuration control. This will probably take several weeks, so stay tuned.

Saturday, March 29, 2014

SWITCH vs. CASE, Part 2

This post continues the discussion of SWITCH vs. CASE as implemented in the Burroughs B5500 Extended Algol compiler. In Part 1, we examined the code generated for each of these constructs and analyzed their differences. In this second part, I will analyze what was wrong with the program I wrote to explore those constructs and describe how to fix it.

To briefly recap the discussion thus far, I wrote a small Algol program shortly after going to work for Burroughs in 1970. The ostensible purpose of this program was to examine the code for both the SWITCH and CASE constructs of the language to determine which was more efficient.

Today, being surrounded by such a glut of inexpensive, incredibly powerful computing devices that they literally have become hazardous waste, it is easy to forget how precious and expensive computer time was a few decades ago, and how difficult it often was to come by. The start of my career at Burroughs was blighted by assignment to a boring documentation project that offered no opportunity to program. Recreational programming was rarely an option in those days, so when the subject of SWITCH vs. CASE came up within another group in the office, I leaped at the chance to get a coding fix and help them decide which construct they should use.

Capitalizing on this opportunity, I decided to go beyond writing a program simply to analyze SWITCH and CASE, and add a few more things to see how they worked as well, including a couple of Stream Procedures that would attempt to dump the program's Program Reference Table, or PRT. In hindsight, that was more than a little foolish, as Stream Procedures, improperly used, could compromise the health of the entire system, and they were quite easy to use improperly.

I managed to demonstrate that ease on my first attempt by making a couple of really dumb mistakes in coding the Stream Procedures. I did not get the desired results, and the program aborted with an Invalid Index (array bounds violation) fault. Now, decades later, the goal is to figure out what happened and fix it.

The Dumb Mistakes

While the program and listing discussed in Part 1 were successful in illuminating how SWITCH and CASE worked, I didn't get the PRT dump I wanted. I got part of a dump before the program aborted with the Invalid Index interrupt, but it turns out that dump was not of the program's PRT.

What the program tried to do was copy the PRT to the array A in the program, and then format the words from that array as octal to a printer file. Copying the PRT was to be the responsibility of the MOVEPRT Stream Procedure. It turns out that I was on the right track, but made two serious errors.

The basic idea of the procedure is simple -- get the address of a word with a known offset within the PRT, adjust that address downward to the beginning of the PRT, then copy some number of words from that address to the destination array. The first variable declared in an Algol program is always at PRT offset 25 octal, so I chose that location as the base. The procedure has four parameters: a descriptor containing the address of the first variable, a descriptor for the destination array, and two integers, the first representing the number of words to copy divided by 64, and the second the number of words modulo 64. The reason for having two parameters is that repeat counts in B5500 Character Mode are limited to six bits -- values 0-63 -- so the div/mod parameters will be used as repeat counts for two nested loops.
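The split can be sketched in modern terms. This is Python, not B5500 code, and the names are mine, but it shows how the two six-bit repeat counts reassemble the full word count:

```python
# Sketch (Python, not B5500 code): Character Mode repeat counts are
# six-bit fields (0-63), so a word count W is passed as the pair
# (W div 64, W mod 64) and reassembled by two nested repeat loops.
def split_repeat_counts(words):
    """Return (n1, n2) with n1 * 64 + n2 == words and 0 <= n2 <= 63."""
    return divmod(words, 64)

# The transfer in the procedure body then moves n1 iterations of
# 2 x 32 words, plus n2 more words -- n1*64 + n2 words in total.
n1, n2 = split_repeat_counts(100)
total_moved = n1 * 2 * 32 + n2
```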

The mistakes in this procedure are all on line 19 of the program:
SI ~ LOC PRT25; SI ~ SI - 21;
The first statement was intended to assign to the source index (SI) the address of the word at PRT offset 25 octal (the variable I in the program). Alas, what it assigned was the address in the stack of the PRT25 parameter itself. Dumb. I was confused about the semantics of LOC -- that keyword should not be there. The second statement was intended to back down that address by 21 decimal (25 octal) words to the beginning of the PRT. Alack, what it did was back down the address by 21 characters. Dumber.

The rest of the procedure was written correctly. The destination index (DI) was assigned the address of the array A, and the appropriate number of words gets moved by the "DS~...WDS" constructs. The parentheses indicate repeat loops, which are preceded by their repeat count, limited to the range 0-63. The hardware forces word transfers to begin on a word boundary, so even though the adjustment to SI above left it pointing in the middle of a word, whole words starting on word boundaries were transferred to the array.

At this point the alert reader may have noticed that the destination array is declared in the inner block as A[0:I], and the call on MOVEPRT at line 50 in the program uses I as the number of words to transfer, but I is never assigned a value in the program. How could that ever work? The answer lies in the second control card of the deck, "?COMMON = 100". That command stores the specified value in the first scalar variable declared in the program, before execution of the program begins. In this case, that store is to the integer I at PRT offset 25. Thus, A has dimensions of [0:100] and a length of 101; 100 words will be moved by MOVEPRT.

But what got moved to the array A? Since SI was adjusted backwards by 21 characters (two words plus five characters), it is pointing into the third word below the location of the PRT25 parameter in the stack. A word-oriented transfer adjusts the address, if necessary, forward to the next word boundary, so the transfer actually began two words below the location of that parameter and continued for 100 words. What the output in the original listing shows is a piece of the program's stack, starting one word below the stack frame for the call on MOVEPRT:
  • The first word in the output (all zeroes) is whatever was at top-of-stack before MOVEPRT was called.
  • The second word (beginning with a 6) is the Mark Stack Control Word (MSCW) that starts the stack frame for the call. The primary purpose of this word is to link to the prior stack frame, which is at address 12262 octal.
  • The third word (beginning with a 5) is the parameter PRT25. This is a data descriptor pointing to the variable I in the program, at address 13325 octal.
  • The fourth word is a data descriptor for the array A. The data for this array is present in memory at address 11737 octal.
  • The fifth word (value 1) is the value of I (100) divided by 64 and truncated to an integer. This word is the parameter N1.
  • The sixth word is the value 36 (100 mod 64 = 36, or 44 octal), although that may not be very obvious, as it is in B5500 floating-point notation. The RDV syllable that implements the Algol MOD operator produces a result in floating-point format, even if it is an integer value. This word is the parameter N2. The fact that this value is not a normalized integer is another problem, as discussed below.
  • The seventh word is the Return Control Word (RCW). The primary purpose of this word is to hold the procedure return address (12163 octal) and to link back to the MSCW (at 12264 octal). 
  • Any local variables for the procedure would appear after the RCW, but this procedure has none. What we see in the rest of the output is whatever was left in the stack by prior push-pop activity.
 The corrected statements for line 19 should look like this:

SI ~ PRT25;   8(SI ~ SI - 21);

Removing the LOC keyword causes SI to load the address of the variable passed as the parameter PRT25 (i.e., I), not the address of the parameter word itself. Adding a repeat of eight around the adjustment to SI decrements the index backward by 21 words instead of 21 characters. This could also have been written "21(SI~SI-8)", but the former involves larger decrements with fewer loop iterations, so is more efficient.
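A quick sanity check of that character arithmetic (a Python sketch; the B5500 packs eight six-bit characters per 48-bit word):

```python
# Sketch: character-unit index arithmetic on the B5500 (8 six-bit
# characters per 48-bit word). Repeating "SI ~ SI - 21" eight times
# backs the index up by 168 characters, i.e. exactly 21 whole words.
CHARS_PER_WORD = 8

def back_up(chars_per_step, steps):
    total_chars = chars_per_step * steps
    return divmod(total_chars, CHARS_PER_WORD)  # (words, leftover chars)

# 8(SI ~ SI - 21) and 21(SI ~ SI - 8) land on the same word boundary:
assert back_up(21, 8) == (21, 0)
assert back_up(8, 21) == (21, 0)
```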

There is another bug concerning the MOVEPRT procedure, but it is in the call on line 50, not in the procedure itself. I discovered this as I was looking over the program's output in the original listing. As mentioned above, the value of N2 is in floating point format from that MOD operator used in the call, not in normalized integer format. The problem is that B5500 Character Mode doesn't know about floating-point numbers, and when presented with a dynamic repeat count, it simply takes the low-order six bits from the word. The low-order six bits of that floating-point word are zero, so wherever N2 is used in the Stream Procedure, the repeat will be executed zero times instead of 36 as intended. The same problem occurs with the MOD operator in the call to BINOCT on line 51.
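The effect can be illustrated with a toy model in Python. This is only a model of the six-bit truncation, not the actual B5500 word layout, and the shift amount below is illustrative rather than what the RDV syllable literally produced:

```python
# Toy model (not the real B5500 word format): a dynamic repeat count
# takes only the low-order six bits of the supplied word.
def repeat_count(word):
    return word & 0o77              # low-order six bits

# A normalized integer 36 carries its value in the low-order bits:
assert repeat_count(36) == 36

# If the same value sits higher in the mantissa (an unnormalized,
# float-format result -- the two-octade shift here is illustrative),
# the low six bits are zero and the repeat executes zero times:
assert repeat_count(36 << 6) == 0
```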

One way to fix this problem is to compute the modulo count and assign it to an integer variable, then pass that integer variable as the parameter. Another way to force integer normalization is an integer divide (Algol DIV operator) by one on the result of the MOD operator. Since the variable K was not being used for anything at that point in the program, a quick-and-dirty solution is to use that variable in an in-line assignment solely for the side effect of generating an integer result, thus:

MOVEPRT (I, A[*], I DIV 64, (K ~ I MOD 64));
BINOCT (I DIV 64, (K ~ I MOD 64), A[*], B[*]);

Perhaps the best way to deal with the limit of 63 for Character Mode repeat counts is to do the div/mod inside the Stream Procedure, thus:

STREAM PROCEDURE MOVEPRT (PRT25, A, N);
  VALUE N;
BEGIN
  LOCAL N1, N2;
  SI ~ LOC N;   SI ~ SI + 6;          % POINT TO 7TH CHAR OF N
  DI ~ LOC N1;  DI ~ DI + 7;          % POINT TO 8TH CHAR OF N1
  DS ~ CHR;                           % MOVE SIX BITS TO N1
  DI ~ LOC N2;  DI ~ DI + 7;          % POINT TO 8TH CHAR OF N2
  DS ~ CHR;                           % MOVE SIX BITS TO N2
  
  SI ~ PRT25;   8(SI ~ SI - 21);      % POINT TO PRT (NO CHANGE)
  DI ~ A;                             % POINT TO ARRAY A (NO CHANGE)
  N1(2(DS ~ 32 WDS));   DS ~ N2 WDS;  % MOVE WORDS (AN IMPROVEMENT)
END MOVEPRT;

In this approach, the N1 and N2 parameters have been replaced by a single parameter, N. It still relies on the value of N being an normalized integer, but that is easier to accomplish in the call than with the MOD operator. N1 and N2 are now declared as words local to the Stream Procedure; they will be allocated in the stack frame for that procedure. Actually, since Character Mode did not use the stack as such, these locals are passed as hidden parameters by the caller. This allocation is done automatically by the compiler. The hidden parameter words are guaranteed to have a zero value upon entry to the procedure.

The partitioning of the count N into div-64 (in N1) and mod-64 (in N2) portions works as follows:
  • The address of the parameter word for N in the stack is stored in the Source Index, SI. This is an example of proper use of the LOC keyword -- SI is loaded with the location (address) of N rather than the contents of N. That address is then advanced by six characters to point to the second-lowest order character in the word. Assuming the count will be less than 4096, that character holds the binary value of the count, div 64.
  • Similarly, the Destination Index, DI, gets the address of the word allocated for N1. This address is advanced by seven characters to point to the low-order character in the word.
  • One character is moved to the Destination String (DS). This implicitly references SI and DI, and advances both by the number of characters moved. That results in the div-64 value being stored in the low-order six bits of N1.
  • Next, DI gets the address of the word allocated for N2, and that address is advanced to the low-order character in that word.
  • One character is again moved to DS. SI was left pointing to the next (low-order) character of N by the prior move, so this moves the mod-64 value to the low-order six bits of N2.
At this point, the values of N1 and N2 can be used in the same manner as in the original procedure. There is another optimization that can be made, however. The original procedure moved the mod-64 portion of the data by means of "N2(DS~WDS)". This specifies a loop repeating N2 times, moving one word on each iteration. It is significantly more efficient to write this as "DS~N2 WDS", which moves N2 words in one operation, avoiding the loop management overhead.
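The character arithmetic behind those two single-character moves can be sketched as follows (Python; 48-bit words of eight six-bit characters, with a helper name of my own invention):

```python
# Sketch: in a 48-bit word holding a normalized integer N, character 6
# (the second-lowest) is N div 64 -- valid while N < 4096 -- and
# character 7 (the lowest) is N mod 64.
def char_of_word(word, k):
    """Six-bit character k (0 = high-order) of a 48-bit word."""
    return (word >> (6 * (7 - k))) & 0o77

N = 100
n1 = char_of_word(N, 6)   # the single-character move into N1
n2 = char_of_word(N, 7)   # the single-character move into N2
assert (n1, n2) == (N // 64, N % 64) == (1, 36)
```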

A similar technique can be used with BINOCT to pass a single parameter for the count and have it partitioned into div-64 and mod-64 values within the procedure.

Now that we have the problems with the Stream Procedures corrected, the next issue is the Invalid Index interrupt that aborted the program. The fault occurred during evaluation of one of the list elements in the WRITE statement at line 55, which is part of the FOR J loop that starts on line 52. The termination message on the listing gives us a strong clue as to what the problem is:

-INVALD INDEX CASESW /PAULROS= 3, S =   3, A =  63, 201 GEQ 201

As mentioned earlier, "S=3, A=63" refers to the decimal segment number and word offset within the segment where the interrupt occurred. The "= 3" preceding that refers to the program's mix number (equivalent to a PID in Unix/Windows). "201 GEQ 201" describes the bounds violation: the index value used was 201, but since indexing is zero-relative, the valid indices for the 201-word array run only from 0 through 200.

The problem is either the terminating value for the FOR loop, or the declared size of the array B -- take your pick. The loop is attempting to format the words from the array A as octal values. The BINOCT call at line 51 does the translation from binary to a character string representing the octal values of the words, but since we are translating from three-bit octades to six-bit characters, the destination string must have twice as much space as the source. Therefore, B should have twice the length of A, but it doesn't:

ARRAY A[0:I], B[0:2|I];

The variable I has the value 100, so A has a length of 101, and B has a length of 201. Oops.
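The size arithmetic, as a quick Python sketch:

```python
# Sketch: a 48-bit word holds 16 octal digits; stored one digit per
# six-bit character, that is 16 characters, or exactly two words.
BITS_PER_WORD = 48
OCTAL_DIGITS = BITS_PER_WORD // 3                # 16 digits per word
DEST_WORDS_PER_SOURCE_WORD = OCTAL_DIGITS // 8   # 2 words of 8 characters

I = 100
source_words = I + 1                     # A[0:I] has length I+1
needed = DEST_WORDS_PER_SOURCE_WORD * source_words
declared = 2 * I + 1                     # B[0:2|I] has length 2*I+1
assert needed == 202 and declared == 201          # one word short
```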

As the joke goes, there are only two hard things in Computer Science: cache invalidation, naming things, and off-by-one errors.

To fix this, we need either to make that length at least 202 or to terminate the loop one iteration earlier. The latter is probably the more correct solution (since we moved only I words from the PRT into A), but I chose the former. With either of those corrections, the program will complete without throwing the Invalid Index interrupt.

There is still one more bug in this program -- well, okay -- at least one more bug. If you look at the program's output on the original listing, it consists of two columns. The left column is intended to be the PRT offset in octal, with the right column containing the octal value of the word at that offset. Alas, the left column is all zeroes. The PRT offset is not being formatted properly.

The offset is formatted by the BINOCT call on line 54. One word is formatted from the address of J to two words at the address of Y. Since Z is declared immediately after Y, the second word of the formatted result (the one with the significant half of the offset) will be in Z. Alas, the WRITE statement at line 55 references Y, which coming from the high-order bits of J, contains all zeroes. The solution is simply to substitute Z for Y in the WRITE statement's list.
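A Python sketch of why Y printed as all zeroes (the helper below is hypothetical; the real conversion is BINOCT's bit-shuffling in Character Mode):

```python
# Sketch: formatting one 48-bit word as 16 octal characters fills two
# result words -- the first gets the high-order eight digits, the
# second the low-order eight. For a small value such as J, the first
# word (Y) is all zeros; the significant digits land in the second (Z).
def octal_halves(word):
    digits = "%016o" % (word & (2**48 - 1))
    return digits[:8], digits[8:]        # (Y, Z)

y, z = octal_halves(100)                 # 100 decimal = 144 octal
assert y == "00000000" and z == "00000144"
```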

In sum, the fixes to the program from 1970 required the following changes to five lines:

13: ARRAY A[0:I], B[0:2|I+1];
19:   SI ~ PRT25;   8(SI ~ SI - 21);
50:   MOVEPRT (I, A[*], I DIV 64, (K ~ I MOD 64));
51:   BINOCT (I DIV 64, (K ~ I MOD 64), A[*], B[*]);
55:       WRITE (PR, F1, Z, B[J|2], B[J|2+1]); 

That is a lot of bugs for a 55-line program, but perhaps is not so bad for a first attempt. Considering what I have been able to learn from it, not only in 1970, but more recently in the retro-B5500 emulator project, I count this as a very successful effort. That I finally got the program to work properly is just gravy.

Resources

In addition to the original listing cited in Part 1, you may be interested in the following files and documents, generated from the retro-B5500 emulator. Note that the listings from 1970 were produced by a B5500 running the Mark X system software release, probably with some local-site patches. The emulator is running the base Mark XIII software release from late 1971, so you should expect to see some slight differences in the output.

Postscript

About two weeks after my fling with the B5500 in October, 1970, I was released from the durance vile of that boring documentation project and transferred into a group at the Benchmark Facility at Burroughs Great Valley Laboratories, also in suburban Philadelphia. There, in preparation for eventually becoming the software tech at a customer site, I began learning a new machine, the Burroughs B6500. Now 43 years later, I'm still learning the B6500 -- in the guise of its modern direct descendants bearing the Unisys ClearPath logo.

That little piece of Algol discussed in these two posts was the last time I wrote a program for (or even used) a B5500 until I started working on the emulator two years ago. It's gratifying to have finally been able to get that last program right.

I have yet to code another SWITCH.

Sunday, March 23, 2014

SWITCH vs. CASE, Part 1

I wrote a small program in 1970, and finally got it to work a few months ago. Here's the story...

The story has gotten a little long in the telling, so I have divided it into two parts. The nature of the division will become clear shortly. This post represents Part 1. Part 2 will be published next week.

Background

In the Spring of 1970, I graduated from the University of Delaware with a degree in Chemical Engineering and a well-developed aversion to anything having to do with chemistry, which persists to this day. I had become increasingly interested in computers and software, though. During my senior year, the University acquired a Burroughs B5500, which really grabbed my interest. To make a long story short, upon graduation I went to work for the Burroughs Corporation near Philadelphia, Pennsylvania.

Just as I began that job, an economic recession took hold in the United States, and Burroughs was hurting a bit. To avoid having to lay me off, my management assigned me to a technical documentation project for a message-switching system. That assignment was really boring, carried on into October, and offered no opportunities to program -- let alone use -- computers. It was depressing, but not having much in the way of options, I stuck with it.

Another group in the office was just starting a project to redesign a law-enforcement information system. That system was based on the B5500 and written in Algol. I would overhear the other group's discussions, and with my job not requiring a lot of concentration, occasionally pay attention. At some point a question arose within this group as to which was more efficient, the Algol SWITCH construct or the then-new CASE construct.
  • SWITCH is a standard Algol construct, and acts somewhat like an array of labels -- you use it with a one-relative index value in a <designational expression>, generally as part of a GO TO statement to implement a multi-way branch. In its simplest form it is similar to the "computed go-to" of FORTRAN or "go-to depending on" of COBOL. 
  • The B5500 Algol CASE statement was a Burroughs extension to standard Algol, and is much like a modern case or C-style switch statement, but the cases are not labeled -- each statement in the body of the CASE is implicitly numbered starting from zero, only one of which is selected for execution based on the index value. BEGIN/END pairs could be used, of course, to create a block or compound statement as one selection in the CASE body. 
  • A significant difference between the two is that using an out-of-range index value with a SWITCH effectively made its GO TO a no-op. Using an out-of-range index with a CASE statement terminated the program.

I got caught up in this discussion, and someone (probably me, as I was desperate for an opportunity -- any opportunity -- to do some programming) suggested writing a test to try both approaches and examine the code generated by each one. In any case, I spent a little time playing hookey from what I was supposed to be doing and teamed up with one of the programmers from the other group, Rose, who had an account on the division's B5500. I probably volunteered to write the program and keypunch it. More likely, I begged to do it.

The Program

Young whippersnapper that I was, I decided to expand my charter a bit and try to find out more about the B5500 while I had the chance. I had done some Algol programming on the B5500 at Delaware, and as a student employee in the Computer Center there had even made some minor changes to the XALGOL compiler. I had only a rough idea, though, of the inner workings of the machine and its instruction set. Thus, I decided I would put in a few constructs I was interested in knowing more about, including Stream Procedures. Not only that, but in a moment of complete recklessness, I decided I would try to have the program dump its own PRT.

On the B5500, the PRT, or Program Reference Table, is an area of memory that stores the global variables plus some information the MCP uses to manage execution of the program. In an Algol program, the PRT holds the declarations of the outer block; in COBOL, the Data Division declarations. The processor's R register points to the base of this area. The other major data area for a program is the stack, which immediately precedes the PRT in memory. That arrangement allows the R register to serve as a limit register for the S (top-of-stack) address register, to detect stack overflows.

In order to access the PRT as a vector of words, it was necessary to use a Stream Procedure. The intended purpose of Stream Procedures in Burroughs Extended Algol is to provide access to the processor's Character Mode capabilities. Character Mode operations consist largely of manipulations of a source address (SI, the Source Index) and a destination address (DI, the Destination Index), plus various ways to test data and transfer (or stream) it from source to destination. Most operations can start and end in the middle of a word; data transfers can take place in units of words, characters, or bits.

Character Mode was sort of a last-minute add-on to the design of the B5000, the original product that became the B5500. It was taken from the design for an earlier machine, possibly the 4111, which was never built. It was quite powerful, and gave efficient character and bit manipulation capabilities to what otherwise was a strictly word-oriented scientific machine. The most significant characteristic of Character Mode, though, is that all of the nice bounds protection built into Word Mode was bypassed. Stream Procedures could potentially address and manipulate anything in memory. This was a capability that could be used for good or evil, and it was. You could implement sophisticated parsing and data movement operations in Character Mode via Stream Procedures. You could also crash the system with it.

Thus, my self-expanded charter for that test program was more than a little dangerous, especially since I didn't really know what I was doing. I don't think Rose knew all of what I was trying to do, either. There was exactly one B5500 for my whole division of Burroughs. It was quite busy, doing everything from payroll to (quite literally) rocket science, and having it crashed by a new-hire trying to satisfy his curiosity would not exactly have been a career-enhancing move. Oblivious to these risks, I wrote the program, keypunched it onto cards, and put it in the inter-office mail to be run at division headquarters. Here is what the card deck would have looked like, using the character-coding conventions of our B5500 emulator:

 1: ?USER=LANZA  ;COMPILE CASESW   /PAULROSE  NO 85410800  ALGOL
 2: ?COMMON = 100
 3: ?DATA
 4: $CARD LIST SINGLE PRT DEBUGN
 5: %    CASE VS. SWITCH      10/01/70              ROSE & PK
 6: BEGIN
 7: INTEGER I, J, K;
 8: FILE OUT PR 18 (2,15);
 9: BEGIN         %%% INNER BLOCK %%%
10: REAL X, Y, Z;
11: LABEL L1, L2, L3;
12: SWITCH S ~ L1, L2, L3;
13: ALPHA ARRAY A[0:I], B[0:2|I];
14: FORMAT F1 (X20,O,X5,2O);
15: 
16: STREAM PROCEDURE MOVEPRT (PRT25, A, N1, N2);
17:   VALUE N1, N2;
18: BEGIN
19:   SI ~ LOC PRT25;   SI ~ SI - 21;
20:   DI ~ A;
21:   N1(2(DS ~ 32 WDS));   N2(DS ~ WDS);
22: END MOVEPRT;
23: 
24: STREAM PROCEDURE BINOCT (N1, N2, S, D);
25:   VALUE N1, N2;
26: BEGIN
27:   SI ~ S;
28:   DI ~ D;
29:   N1(32(32(DS~ 3 RESET; 3(IF SB THEN DS ~ SET ELSE DS ~ RESET;
30:                           SKIP SB))));
31:   N2(16(DS ~ 3 RESET;   3(IF SB THEN DS ~ SET ELSE DS ~ RESET;
32:                           SKIP SB)));
33: END BINOCT;
34: 
35: L1:
36:   J ~ 3;   GO TO S[J];
37: L2:
38:   CASE J MOD 10 OF
39:     BEGIN
40:       J ~ 3;
41:       K ~ J;
42:       X ~ K +J;
43:       Y ~ X ~ SQRT(X);
44:       ;
45:       Z ~ 2|Y + 6.0;
46:       ;
47:       K ~ 5000;
48:     END CASE;
49: L3:
50:   MOVEPRT (I, A[*], I DIV 64, I MOD 64);
51:   BINOCT (I DIV 64, I MOD 64, A[*], B[*]);
52:   FOR J ~ 0 STEP 1 UNTIL I DO
53:     BEGIN
54:       BINOCT (0, 1, J, Y);
55:       WRITE (PR, F1, Y, B[J|2], B[J|2+1]);
56:     END;
57:   END INNER BLOCK;
58: END.
59: ?END
Note that in our emulator, we use the tilde (~) to represent the B5500 left-arrow for assignment, and the vertical bar (|) to represent the small-cross for multiplication.

There may have been an initial compile that had syntax errors -- I don't clearly remember -- but on 3 October 1970 this program compiled successfully and ran. The card deck above was reconstructed from a listing of that run, which you can view here. Since we were interested in how the two constructs worked at the machine level, that listing includes the generated code, enabled by the DEBUGN option on the $-card at line 4. The instruction mnemonics are described in the B5500 Reference Manual. The notations on the listing are mine from 1970.

Alas, while the program ran, it did not do what I had intended, and as you can see from the end of the listing, it aborted with an Invalid Index interrupt (i.e., an array bounds violation) at "S=3,A=63". That stands for "segment 3, word offset 63" (decimal). If you look on the listing at the four-digit numbers on the far right of the source code lines, offset 63 for segment 3 is within the code generated for sequence number 00051000 (the WRITE statement on line 55 in the card deck above). Specifically, the fault occurred at the instruction "0373 DESC 0036" (Descriptor Call on PRT offset 36 octal, the array "B"). The 0373 is an octal syllable offset from the beginning of the segment. At four syllables per word, offset 63 (decimal) times four is 252, which is 374 octal. Interrupt state is stored by the processor in a manner similar to that for a subroutine call, so the "return" address for the interrupt is one after the syllable that caused the fault. No return is possible from this type of error, however.
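The offset arithmetic checks out, as a quick Python sketch confirms:

```python
# Sketch of the offset arithmetic: four syllables per word, so decimal
# word offset 63 corresponds to syllable offset 63*4 = 252 decimal =
# 374 octal -- one syllable past the faulting DESC at 0373.
SYLLABLES_PER_WORD = 4
word_offset = 63
syllable_offset = word_offset * SYLLABLES_PER_WORD
assert syllable_offset == 252
assert oct(syllable_offset) == "0o374"
assert syllable_offset - 1 == 0o373      # the faulting DESC syllable
```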

It turns out that I was really lucky with this run, because the Stream Procedure MOVEPRT did not work properly at all. Fortunately, what it did was benign, and while it accessed memory locations it shouldn't have, at least it did not overwrite anything it shouldn't have. The invalid index did not have anything to do with the dumb mistakes I made in MOVEPRT -- that was due to an entirely different dumb mistake, as will be seen in Part 2.

The remainder of this Part 1 post will analyze how the SWITCH and CASE constructs work, and what we learned about them from this little program. Next week, Part 2 will analyze the dumb mistakes I made in writing the program, what actually happened when it ran, and what it took to make the program work properly.

SWITCH Declarations and Invocations

The bulk of the program above consists of an inner block that begins at line 9. Execution for that inner block begins at line 35. The SWITCH S is declared at line 12 to select among three labels, L1, L2, and L3, based on index values 1, 2, and 3, respectively. That SWITCH is used on line 36, where the integer variable J is set to 3 just before the statement "GO TO S[J]".

If you look at the listing, the SWITCH declaration generates some code. The way that code works will be clearer if we first examine what happens when the SWITCH is invoked. From line 36 in the program:

J ~ 3;   GO TO S[J];
    0160  LITC  0003  0014
    0161  LITC  0026  0130
    0162  ISD         4121
    0163  OPDC  0026  0132
    0164  LITC  0021  0104
    0165  ISD         4121
    0166  LITC  0035  0164
    0167  LBU         6131
    0170  NOP         0055
    0171  NOP         0055
    0172  NOP         0055

The first three syllables handle the assignment to J by (a) pushing the literal value 3 onto the stack, (b) pushing a literal value for PRT offset 26 octal (J) onto the stack, and (c) executing the Integer Store Destructive (ISD) syllable. The "destructive" indicates that both the PRT offset and value will be deleted from the stack when ISD completes.

The code for the SWITCH invocation starts at offset 0163:
  •  The Operand Call (OPDC) syllable copies the value of the switch index at PRT offset 26 octal (J) and pushes that value onto the stack. Then the offset for PRT location 21 octal is pushed, followed by another integer-store syllable. This copies the value of J into a word in the lower part of the PRT that is reserved for the MCP and compilers -- variables in the outer block of an Algol program are assigned higher PRT locations beginning at 25 octal.
  • At offset 0166, the value 35 octal is pushed onto the stack, followed by a Long Backward Unconditional (LBU) branch instruction. The "long" branches compute a destination address by taking from the stack an offset in words that is relative to the location of the branch itself, and branching to the first syllable in that destination word. The LBU is at offset 0167 octal, which is word 35 octal plus syllable 3. Going backwards 35 octal words lands us at offset 0 in the code segment, which is the start of the code for the SWITCH declaration, discussed immediately below.
  • The No Operation (NOP) syllables after the LBU have a purpose, which will be explained shortly.
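The branch arithmetic in the second bullet can be verified with a quick sketch:

```python
# Sketch of the long-branch address computation: the LBU sits at octal
# syllable offset 0167; its word address is 0167 div 4 (35 octal), and
# branching back 35 octal words lands on word 0, the start of the code
# generated for the SWITCH declaration.
lbu_syllable = 0o167
lbu_word = lbu_syllable // 4             # word 29 decimal = 35 octal
assert lbu_word == 0o35
assert lbu_word - 0o35 == 0              # destination: word 0, syllable 0
```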

Note that the code shown above is not quite what you will see on the listing. The Burroughs Algol compilers were (and still are) strictly one-pass affairs. The compiler generates code as it is reading and parsing the input source lines. This means that forward references, such as a branch to a point in the program that has not been encountered yet, cannot be resolved until later.

To deal with this, the compiler resorts to what I call "back-patching." When it encounters a forward reference, it makes an entry in its symbol table for the as-yet unresolved destination point and reserves syllables at the current place in the instruction stream for the instructions that will ultimately need to go there. It often stores linkage data in that reserved space, so that multiple references to the same unresolved address can be chained together. After the compiler encounters the destination point in the input source, it reaches back into the previously-emitted instruction stream to fix up the syllables that had been reserved earlier, overwriting those syllables with the correct opcodes and address offsets.
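The chaining idea can be reduced to a minimal sketch. This is hypothetical Python, nothing like the real compiler's data structures, but it shows how the reserved slots themselves can carry the chain until the label is defined:

```python
# Minimal back-patching sketch: forward references to an unseen label
# are chained through the reserved slots; defining the label walks the
# chain and overwrites each slot with the true target address.
def emit_forward_ref(code, chain_head):
    code.append(chain_head)          # reserved slot holds link to prior ref
    return len(code) - 1             # this slot becomes the new chain head

def define_label(code, chain_head, target):
    while chain_head is not None:
        next_ref = code[chain_head]  # follow the chain link
        code[chain_head] = target    # overwrite the reserved slot
        chain_head = next_ref

code = []
head = emit_forward_ref(code, None)      # first use of the label
code.append("other-op")
head = emit_forward_ref(code, head)      # second use, chained to the first
define_label(code, head, target=42)
assert code == [42, "other-op", 42]
```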

This behavior can be really confusing when you first look at it in a code listing, especially since the data that is initially emitted in the reserved spaces is often formatted as if it were instructions, and the fix-ups are output in the code listing intermixed with whatever else the compiler is generating at the moment. You have to pay attention to the octal offset on the left side of the lines of generated code to understand what is really happening. In the examples here, I have unraveled all of that out-of-order generation so the code will be easier to follow.

With that introduction, the code generated for the SWITCH declaration is this:

SWITCH S ~ L1, L2, L3;  
    0000 LITC 0000 0000  
    0001 OPDC 0021 0106  
    0002 GEQ       0125  
    0003 OPDC 0021 0106  
    0004 LITC 0003 0014  
    0005 GTR       0225  
    0006 LOR       0215  
    0007 OPDC 0021 0106  
    0010 DUP       2025  
    0011 ADD       0101  
    0012 BFC       0231  
    0013 LITC 0156 0670  
    0014 BFW       4231  
    0015 LITC 0031 0144  
    0016 LFU       6231  
    0017 LITC 0033 0154  
    0020 LFU       6231  
    0021 LITC 0047 0234  
    0022 LFU       6231  

The LBU syllable from the SWITCH invocation branches to offset 0000 in the SWITCH declaration.
  • The SWITCH declaration begins by pushing a zero onto the stack, followed by the value from PRT offset 21 (the copy of the value of the switch index, J). 
  • The Greater-Than-or-Equal (GEQ) syllable tests whether the second word in the stack is greater than or equal to the top word in the stack; if so it pushes a one onto the stack, otherwise it pushes a zero. In both cases, the original values are popped from the stack before the result value is pushed. On the B5500, a binary value is considered to be "true" if its low-order bit is set, so the zero and one values correspond to Algol FALSE and TRUE, respectively. All other bits in the word are ignored. Another way to remember this is that on the B5500 and its descendants, the truth is always odd.
  • At offset 0003, the value of the switch index at PRT location 21 octal is again pushed onto the stack, followed by a literal 3 and the Greater-Than (GTR) syllable. This works just like GEQ except for the difference in the relation of the two top-of-stack values being tested. 
  • We now have two Boolean values on the top of the stack, which are combined using the Logical OR (LOR) syllable. Once again, both original values are popped from the stack before the result value is pushed. What these two tests and the LOR have accomplished is to determine whether zero is greater-than-or-equal to the switch index (i.e., index<1) or the index is greater than 3. Since the SWITCH has three elements, this tests whether the index falls outside the valid range for the SWITCH.
  • With the result of the LOR remaining on top-of-stack, an OPDC 21 at offset 0007 pushes another copy of the index on top of that, followed by Duplicate (DUP) and Add (ADD) syllables. DUP simply makes a copy of the word on top-of-stack, and the ADD sums the two copies, effectively multiplying the value of the index by two. 
  • This is followed at offset 0012 by a Branch Forward Conditional (BFC) syllable. Unlike the "long" branch above which only branches to word boundaries, this is a syllable-oriented branch. The word at top of stack is the number of syllables to branch, relative to the location of the syllable after the branch. With a conditional branch, the second word in the stack holds the condition, which in this case is the result of the LOR above. The branch takes place if the condition is false (i.e., its low-order bit is zero), which is what you want for IF statements. Whether the branch occurs or not, both the branch offset and condition are popped from the stack.
  • Following the conditional branch, starting at offset 0013, are four pairs of literal-call/branch syllables. If the condition resulting from the LOR above is true (i.e., the index is out of bounds), the branch will not take place, so control simply proceeds in sequence. This will result in a branch forward of 156 octal syllables, or to offset 173 octal, which is the first instruction after the code generated for the SWITCH invocation on line 36. Thus, if J is out of bounds, the GO TO on line 36 is effectively a no-op, as the semantics of Algol require.
  • If the conditional branch is taken due to J being within bounds, one of the next three pairs of literal-call/branch syllables will be selected. As it takes two syllables to effect a branch -- one to push an offset and one to do the branch -- the value of the switch index had to be multiplied by two (DUP, ADD) to obtain the correct offset. The LFU opcode is a Long Forward Unconditional word-oriented branch. 
  • Note that the literal values for the relative offsets used by the branches will take control to the locations for labels L1, L2, and L3, respectively. The compiler emits NOPs as necessary to align the code following these labels on word boundaries, as the long branches require.
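The dispatch logic the GEQ/GTR/LOR/DUP/ADD/BFC sequence implements can be sketched as follows. This is a hypothetical Python illustration, not emulator code; the `pairs` table stands in for the two-syllable literal-call/branch pairs at offsets 0013-0022.

```python
def b5500_true(word):
    # On the B5500 a value is "true" if its low-order bit is set.
    return (word & 1) == 1

# Stand-ins for the two-syllable pairs at offsets 0013-0022: the first
# pair branches past the GO TO; the next three branch to L1, L2, L3.
pairs = ["fall-through", "L1", "L2", "L3"]

def dispatch(j):
    cond = 1 if (0 >= j or j > 3) else 0      # GEQ, GTR, LOR
    if b5500_true(cond):
        return pairs[0]                        # BFC not taken: no-op GO TO
    # DUP + ADD double the index because each table entry is two
    # syllables wide; branching forward 2*j syllables selects pair j.
    syllable_offset = j + j
    return pairs[syllable_offset // 2]

assert dispatch(2) == "L2"
assert dispatch(5) == "fall-through"
```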

So much for how the SWITCH works in this program. But wait -- it is possible to use a SWITCH in multiple places in a program. In the code above, if the switch index is out of bounds, there is a branch to a fixed location. How can that be right if you use the SWITCH more than once?

The answer to that is an example of back-patching at its finest. For the first invocation of the SWITCH, the compiler generates exactly what is shown above. If it encounters a second invocation of the SWITCH, the compiler changes its strategy and fixes up both the code for the SWITCH declaration and the first use of the SWITCH to use the new strategy. The fixed-up code for the SWITCH declaration will look like this:

SWITCH S ~ L1, L2, L3;  
    0000  LITC  0000  0000  
    0001  OPDC  0021  0106  
    0002  GEQ         0125  
    0003  OPDC  0021  0106  
    0004  LITC  0003  0014  
    0005  GTR         0225  
    0006  LOR         0215  
    0007  OPDC  0021  0106  
    0010  DUP         2025  
    0011  ADD         0101  
    0012  BFC         0231  
    0013  LITC  0017  0074  
    0014  RTS         1235
    0015  LITC  0034  0160  
    0016  RTS         1235
    0017  LITC  0035  0164  
    0020  RTS         1235
    0021  LITC  0036  0170  
    0022  RTS         1235  

and an invocation of the SWITCH will look like this:

J ~ 3;   GO TO S[J];
    0160  LITC  0003  0014
    0161  LITC  0026  0130
    0162  ISD         4121
    0163  OPDC  0026  0132
    0164  LITC  0021  0104
    0165  ISD         4121
    0166  OPDC  0032  0152
    0167  XRT         0061
    0170  LOD         2021
    0171  BFW         4231
    0172  NOP         0055

The differences from the original code are shown in red. What the compiler has done is convert the code for the SWITCH into a subroutine.
  • It is not shown here, but the compiler also allocates four additional PRT locations to hold Program Control Words (PCWs) for the entry point to the subroutine that SWITCH S has become and the locations of the labels L1, L2, and L3. Like all control words for the B5500, these will have their high-order (flag) bit set. The branch syllables are sensitive to the flag bit and can accept either an integer offset or a PCW on the top-of-stack as a branch destination. Executing an OPDC that references a PCW results in a subroutine call to that code location (OPDC is the do-it-all kid -- it will also index arrays if its top-of-stack operand is a data descriptor).
  • With the code for the SWITCH declaration reconfigured as a subroutine, testing the bounds of the index and selecting a pair of syllables to branch works the same as before, but instead of directly branching, the selected syllables return a PRT offset value as the subroutine result. RTS is the Return from Subroutine syllable, which takes the value on top-of-stack (the result of the selected LITC syllable in this case), cuts back the stack used by the subroutine, and branches to the return address, leaving the original top-of-stack value at the new top-of-stack location.
  • The code that invokes the SWITCH now does an operand call on the PCW for the SWITCH's subroutine, which will return the PRT offset for one of the label PCWs, as selected by the index value. XRT (Set Variant) extends the range of PRT addressing for the next syllable (it is not needed in this small program, but might be in programs having more than 512 words in their PRT). LOD (Load Operand) takes a PRT offset on top-of-stack and replaces it with the value at that PRT location, which will be one of the label PCWs. BFW (Branch Forward) is an unconditional syllable-oriented branch. It can accept either an integer offset or a PCW as its operand.
  • Note how the NOPs at offsets 0170-0171 have been overwritten by the extra code needed to implement the subroutine-based switch invocation code. The one-pass compiler could not know whether the SWITCH would be referenced more than once, so it initially emitted code that efficiently supported the simpler scenario.

In this example, the PCW for the SWITCH's subroutine is at PRT offset 32 octal. That subroutine will return one of 17, 34, 35, or 36 octal, depending upon the value of J. Offsets 34, 35, and 36 represent the PCWs for labels L1, L2, and L3 respectively, but offset 17, which is returned if J is out of bounds, is in the area of the PRT reserved for the MCP and compilers. What is stored there? It turns out that the word at PRT offset 17 octal always contains a zero. Thus, if the SWITCH index is out of bounds, the BFW will branch forward zero syllables, which is effectively a no-op, and control continues with the next statement in the program.
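The subroutine-based strategy, including the zero-at-offset-17 trick, can be sketched like this. The PRT offsets come from the listing above; the dict-based "PRT" and the PCW strings are purely illustrative, not a model of actual control-word formats.

```python
# Hypothetical sketch of the fixed-up SWITCH as a subroutine.
PRT = {
    0o17: 0,          # MCP-reserved cell that always holds zero
    0o34: "PCW(L1)",  # stand-ins for the label Program Control Words
    0o35: "PCW(L2)",
    0o36: "PCW(L3)",
}

def switch_subroutine(j):
    # Returns a PRT offset, as the selected LITC/RTS pair does.
    if j < 1 or j > 3:
        return 0o17                   # out of bounds
    return [0o34, 0o35, 0o36][j - 1]

def invoke(j):
    offset = switch_subroutine(j)     # OPDC on the subroutine's PCW
    word = PRT[offset]                # LOD: replace offset with PRT word
    # BFW: a PCW branches to its label; the integer 0 branches forward
    # zero syllables, i.e., falls through to the next statement.
    return word if word != 0 else "no-op"

assert invoke(3) == "PCW(L3)"
assert invoke(7) == "no-op"
```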

Switches can be even more complex than we have seen here, as the elements of a SWITCH declaration can themselves be designational expressions (e.g., invocations of some other SWITCH) which must be evaluated at run time. Investigation of how that works is left as an exercise, dear reader, to you.

CASE Statements

In contrast to SWITCH, the implementation of CASE statements is simple and straightforward. For each statement (which may be a compound statement or a block) that is immediately subordinate to the CASE statement, the compiler determines a relative branch offset. It then constructs a one-dimensional array of those offsets, indexed by the zero-relative case value. The array is stored in the object file for the program, and is pointed to by a data descriptor that is placed in the PRT. Indexing that descriptor by the CASE expression yields the offset to the appropriate statement.
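A minimal sketch of that mechanism, assuming the offsets the compiler generates for this program (it is an illustration, not compiler output; the Invalid Index interrupt is modeled as a Python IndexError):

```python
# Array of syllable offsets (decimal), indexed by the zero-relative
# case value; these are the values the compiler emits for this program.
offsets = [0, 5, 10, 17, 40, 26, 40, 35]

def case_dispatch(expr_value, base):
    # OPDC on the descriptor indexes the array; an out-of-range index
    # raises an Invalid Index interrupt (modeled here as IndexError).
    if not (0 <= expr_value < len(offsets)):
        raise IndexError("Invalid Index interrupt")
    # BFW branches relative to the syllable after the branch (base).
    return base + offsets[expr_value]

assert case_dispatch(0, 0o175) == 0o175
assert case_dispatch(2, 0o175) == 0o207
```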

The CASE statement in the program above has eight subordinate statements, and the compiler generates syllable offsets of 0, 5, 10, 17, 40, 26, 40, and 35 decimal, respectively, for them. The code to execute the CASE statement looks like this (again, with the back-patching unraveled for clarity):

CASE J MOD 10 OF
      0170  OPDC  0026  0132
      0171  LITC  0012  0050
      0172  RDV         7001
  BEGIN
      0173  OPDC  0042  0212
      0174  BFW         4231
    J ~ 3;
      0175  LITC  0003  0014
      0176  LITC  0026  0130
      0177  ISD         4121
      0200  LITC  0043  0214
      0201  BFW         4231
    K ~ J;
      0202  OPDC  0026  0132
      0203  LITC  0027  0134
      0204  ISD         4121
      0205  LITC  0036  0170
      0206  BFW         4231
    X ~ K +J;
      0207  OPDC  0027  0136
      0210  OPDC  0026  0132
      0211  ADD         0101
      0212  LITC  0032  0150
      0213  STD         0421
      0214  LITC  0027  0134
      0215  BFW         4231
    Y ~ X ~ SQRT(X);
      0216  MKS         0441
      0217  OPDC  0032  0152
      0220  OPDC  0043  0216
      0221  LITC  0032  0150
      0222  SND         1021
      0223  LITC  0033  0154
      0224  STD         0421
      0225  LITC  0016  0070
      0226  BFW         4231
    ;
    Z ~ 2|Y + 6.0;
      0227  LITC  0002  0010
      0230  OPDC  0033  0156
      0231  MUL         0401
      0232  DESC  1777  7777
      0233  ADD         0101
      0234  LITC  0034  0160
      0235  STD         0421
      0236  LITC  0005  0024
      0237  BFW         4231
    ;
    K ~ 5000;
      0240  DESC  1777  7777
      0241  LITC  0027  0134
      0242  ISD         4121
      0243  LITC  0000  0000
      0244  BFW         4231
  END CASE;

The statement begins by computing the CASE index. The OPDC pushes the value of J onto the stack, LITC pushes the value 10 decimal onto the stack, and RDV (Remainder Divide) implements the MOD operator. The heavy lifting is done by the next syllable, another OPDC for PRT offset 42 octal:
  • That PRT location contains the data descriptor for the array of syllable offsets. 
  • When OPDC detects that its operand on the top-of-stack is an unindexed data descriptor, it applies the second word in the stack as an index to that descriptor, computes the memory address of the indexed element of the array, and loads the value at that address onto top-of-stack after popping both the descriptor and index value. 
    • A data descriptor contains the length, address, and presence status of the data for the array it describes.
    • If the index value is less than zero or greater than or equal to the length of the array stored in the descriptor, OPDC will raise an Invalid Index interrupt and quit. Unless this fault is trapped by the program, the MCP will terminate the program.
    • Initially, the presence status of the descriptor is "absent" to indicate that the contents of the array it describes are not present in memory. Thus, the first time we execute this CASE statement, the OPDC will simply raise a Presence Bit interrupt and quit, allowing the hardware to branch to the appropriate interrupt vector.
    • The MCP will respond to the P-bit interrupt by allocating an area in memory of appropriate size (eight words in this case), reading the array element values from the object file on disk into that new area, fixing up the descriptor with the address of the new area, setting the presence bit [2:1] in the descriptor to indicate it now points to a real memory address, and exiting back into the program to restart the OPDC syllable, which this time will complete the index-and-load operation.
    • It is possible that the array of syllable offsets may be forced out of memory later due to pressure from other memory allocation activity, in which case the MCP simply deallocates the memory area and fixes up the descriptor for it in the program's PRT to point back to the copy of the data in the object code file. The area does not need to be written out to disk, since the system considers it to be read-only, and therefore not dirty. The next time the CASE statement is executed (if ever), the same P-bit process will bring the array back into memory from the code file.
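The presence-bit life cycle described above can be sketched as follows. This is an illustrative model, not emulator code: the descriptor starts "absent", the first indexing operation faults the data in from the (read-only) object file, and an overlay simply discards the in-memory copy.

```python
class Descriptor:
    """Toy model of a B5500 data descriptor with a presence bit."""

    def __init__(self, disk_copy):
        self.disk_copy = disk_copy    # read-only data in the code file
        self.present = False          # presence bit starts off
        self.memory = None

    def index(self, i):
        if not self.present:
            # P-bit interrupt: the MCP allocates memory, copies the
            # data in, sets the presence bit, and restarts the operation.
            self.memory = list(self.disk_copy)
            self.present = True
        return self.memory[i]

    def overlay(self):
        # Memory pressure: discard the area; the data is read-only,
        # so nothing needs to be written back to disk.
        self.memory = None
        self.present = False

d = Descriptor([0, 5, 10, 17, 40, 26, 40, 35])
assert d.index(3) == 17      # first touch faults the array in
d.overlay()
assert d.index(5) == 26      # faulted in again from the code file
```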
The final result, whatever machinations are involved, is that the syllable offset ends up on top-of-stack. The syllable after the OPDC, BFW, uses that offset to branch to the beginning of the selected subordinate statement in the CASE statement. The BFW is at syllable offset 174 octal in its code segment, so the offsets in the array identified above would yield branch locations of 175, 202, 207, 216, 245, 227, 245, and 240 octal, respectively. Recall that syllable branches are relative to the syllable after the branch.
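The branch-target arithmetic can be checked directly: adding the decimal offsets to octal base 0175 (the syllable after the BFW) reproduces the locations listed above.

```python
# Decimal offsets from the compiler's array, added to octal base 0175.
offsets = [0, 5, 10, 17, 40, 26, 40, 35]
targets = [oct(0o175 + off) for off in offsets]
# → 0o175, 0o202, 0o207, 0o216, 0o245, 0o227, 0o245, 0o240
```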

After the end of each of the subordinate statements, the compiler inserts a branch to the code for the statement following the CASE statement. These are shown in blue in the code above.

Note that two of the subordinate statements in the CASE statement are empty statements, represented by just their delimiting semicolons. For these the compiler generates no code, just an offset that branches around the CASE statement to the syllable at 0245.

SWITCH vs. CASE

So which is better, SWITCH or CASE? I have forgotten what Rose and I concluded in 1970, but looking at it afresh, my answer now is that it depends on what you are trying to do and how you are using the multi-way branch. Both constructs are very efficient in certain cases.

SWITCH must store the index value in PRT cell 21 octal and then fetch it three times. If the SWITCH is referenced more than once in a program, it must be invoked via a subroutine call, which is not an inexpensive operation. Those issues aside, the actual dispatch to a location based on the index value is very efficient. You can also do things with a SWITCH that you cannot with a CASE statement, including nesting designational expressions that have expressions for indexes, all of which will be evaluated dynamically at the time the SWITCH is invoked.

CASE is more efficient in the way that it computes the ultimate branch location, but there is significant overhead the first time it is used to handle the P-bit interrupt for its array of offsets and bring it into memory. That overhead can be repeated later, possibly multiple times, if the array is pushed out of memory. CASE is an in-line construct, so unlike SWITCH it cannot be referenced from multiple places in the program. Of course, code segments can be pushed out, too, so the code for the SWITCH is not entirely immune to overlay by memory allocation pressure.

Using an out-of-bounds index with a SWITCH results in no branch occurring at all. This can be considered a feature or a bug, depending on your point of view. Because CASE obtains its branch offset by indexing an array, an out-of-bounds index causes an Invalid Index interrupt, which unless trapped, will abort the program. The programmer may need to insert bounds tests before the CASE statement to protect against aborts. Eventually, a numbered CASE statement with provision for a default case was implemented in Algol for the B6700/7700, but that does not appear ever to have been (officially) implemented for B5500 Algol.

I matured as a programmer during the post-Dijkstra, Go-To-Considered-Harmful era, so my personal preference would be to ignore the relatively minor performance-difference issues, use the CASE statement, and eschew the SWITCH, labels, and go-to statements altogether.


That completes the analysis of CASE vs. SWITCH. Tune in next week for Part 2, where I will deconstruct my less-than-stellar 1970 programming abilities, show what went wrong in the execution of the program, and demonstrate how to fix it.

Resources

In addition to the original listing, cited earlier in this post, you may be interested in the following files and documents, generated from the retro-B5500 emulator. Note that the listings from 1970 were produced by a B5500 running the Mark X system software release, probably with some local-site patches. The emulator is running the base Mark XIII software release from late 1971, so you should expect to see some slight differences in the output.
  • CASESW-PAULROSE-DECK.card --
    The original card deck, as displayed above.
  • CASESW-PAULROSE-20131106-OUTPUT.txt --
    Printer output from the emulator for that original card deck, with a compile listing showing the generated code, incorrect PRT dump, and Invalid Index abort. The emulator had exactly the same problems with my program that a real B5500 did.