Monday, March 26, 2012

Preparing the Algol compiler source

As previously discussed, Nigel Williams and I have embarked on a project to emulate the Burroughs B5500. One of our goals is to get the Extended Algol compiler operating in that environment. Alas, we have no object code for the system at all, just scans of listings on bitsavers. Also, the compiler is written in itself, so we have a real chicken-and-egg problem here -- we need the object code for the compiler to compile the compiler to object code. Therefore, we need a way to bootstrap this software.

Significant work on the original B5000 compiler was done at Burroughs by Richard Waychoff, who in 1979 wrote quite an interesting memoir of the time surrounding the development and early deployment of that system. One of the more faithful copies of this was transcribed by Ian Joyner, who has placed it on his web site as "Stories of the B5000 and People Who Were There."

Another fascinating source of information about the development of the original B5000 and its software is the transcription of the 1985 Burroughs B5000 Conference oral history. From that conversation among the principals involved with the system, it's clear that they had the same chicken-and-egg problem. The Algol compiler was being written in Algol, so how do you get it compiled the first time? Their answer was to bootstrap it on a different machine.

Burroughs had an in-house pseudo-assembler for the B5000, called OSIL (Operating Systems Implementation Language), which ran on the Burroughs 220, an earlier vacuum-tube machine. The compiler developers hand-translated their Algol design into OSIL and initially cross-compiled it on the 220. Once they got the OSIL-generated compiler working on the B5000 well enough to compile the Algol source for the compiler, they abandoned the OSIL version.

Interestingly, the hand-coded OSIL version of the compiler took nine hours to assemble on the 220. The Algol version of the compiler took only four minutes to compile itself on the B5000, and even generated a slightly smaller object code file. OSIL was also used to develop the original drum-based MCP operating system for the B5000, but once the Algol compiler was working on the B5000, a new compiler for a subset of Algol was cloned from the Algol compiler for development of the more-advanced B5500 Disk File MCP. That subset is termed ESPOL (Executive Systems Problem-Oriented Language). Once ESPOL was working, the use of OSIL was completely abandoned.

With that rather long introduction out of the way, we plan to do something similar to bootstrap the B5500 Extended Algol compiler.

A little over a year ago, Nigel heroically transcribed by hand the source for the Algol compiler from the scan of a listing on bitsavers. He says it took him about a year, working on it piecemeal mostly during his lunch hour. The listing is dated 7 October 1977, and appears to have come from the University of California at Santa Cruz, which had at least one B5500 at the time. The listing is also identified as being for the "B5700," but that is just a marketing designation for the B5500 near the end of its life, renamed to go along with the successor B6700 and B7700 systems.

Our plan is to port the source for the compiler to a modern MCP system, get it working there as a cross-compiler that will generate B5500 object code files, and then use that cross-compiler to compile the original Algol compiler source. We can also use the cross-compiler to generate test cases for the hardware emulator, once its development gets far enough along.

We chose the modern MCP as a host for the cross compiler because it has an Algol compiler that is somewhat similar to the one for the B5500, and a number of hardware-related things have been retained from the B5500, such as word size and numeric format. There are a number of challenges in porting the B5500 compiler to a current MCP system, though:
  1. While the B5500 and modern systems use the same 48-bit word size and binary numeric format, the B5500 used 6-bit characters and the modern systems use 8-bit ones. The B6700/B7700 had 6-bit capabilities, but these were removed from the architecture starting with the  B6900/B7900 ca. 1980. The internal workings of the B5500 compiler are heavily dependent on the 6-bit coding.
  2. The B5500 operated in two modes, word and character. Word mode is still fairly compatible with modern systems, but character mode is not. Character mode was programmed in B5500 Algol using Stream Procedures, the syntax for which is little more than a low-level assembler for the character-mode instruction set. Not only will all of the Stream Procedures need to be ripped out and converted to current language facilities, but in character mode you could get at and manipulate absolute memory addresses. Alas, the B5500 compiler makes extensive use of this ability in its scanning and input file handling routines. Nothing like that is available in current systems (thank goodness), so dealing with this aspect of the compiler implementation is going to be very difficult.
  3. The bits in words were numbered differently between the two systems. On the B5500, the high-order bit in a word was numbered zero and the low-order bit 47. Starting with the B6500/B6700, that numbering was reversed. This affects the use of partial-word designators in the language, e.g., A.[47:1] accessed the low-order bit on the B5500, but on current systems that needs to be written A.[0:1].
  4. All of the I/O will need to be reworked. File declarations are similar in concept to those in the modern compiler, but the syntax is completely different. Both formatted and array-row I/O work similarly on modern systems, but we will need to deal with the 6- vs. 8-bit character size problem. The B5500 had a very strange mechanism that allowed a program to access the top physical buffer for a file and perform I/O via a RELEASE statement on the file identifier. That statement would result in a read for input files and a write for output files. This appears to have been used primarily with Stream Procedures, and yes, the compiler uses them, and yes, this is all tied up with the absolute address manipulation issue above. I/O for the compiler will also require either support in the emulator, or the presence of the MCP running within the emulator.
There are probably additional issues that will reveal themselves as we get further into the job of porting the source to the modern MCP. Coming into this I knew that 6-bit characters and Stream Procedures were going to be major problems, but the I/O and absolute address manipulation issues were unwelcome surprises. Nonetheless, there are a lot of similarities with the syntax and semantics between the two Algol dialects, and porting to the modern MCP is overall probably the least-difficult of alternatives.

I have taken on the job of developing the cross-compiler, which we are calling ALGOLXEM, the "XEM" standing for "cross-compiled on E-mode." E-mode is the term used to designate the modern MCP architecture, which is based on the B6500/B6700. Work to date for this portion of the project has been focused on getting the Algol compiler source prepared and in a form we can start working on the cross compiler. Nigel did a masterful job of transcribing the text from the scanned listings, but as anyone who has had experience with the old Burroughs drum printers knows, the quality of the printing is rather poor, and in many cases it is very difficult to identify the characters. Periods and commas have been particularly troublesome. Also, the compiler is a little over 12K lines, and you just can't transcribe that much text without making some mistakes.

The first step was to take Nigel's transcription and get it into a form we can use with the modern MCP and its compilers. Both the B5500 and modern compilers use a fixed-length record, originally based on punched cards, with text in columns 1-72 and an optional sequence number in columns 73-80. The modern compiler also supports an optional "patchmark" field in columns 81-90, typically used for revision identification. Unisys has a Windows-based editor, PWB, which supports this format, understands sequence numbers and patchmarks, can store and retrieve files in both the MCP and Windows file systems, and has an import wizard that will convert many text file formats into its canonical one. PWB was the obvious choice for both a file format and an editor. It stores Windows-based files with a ".xxx_m" extension (the "_m" signifying MCP), so for example, Algol files are stored with ".alg_m" extensions, as you will see if you look in the Subversion repository for the Google Code project.

Since the B5500 was to be an Algol machine, the designers gave it an Algol character set, or at least as much of one as you can get with only a 6-bit code. Most of the Algol characters are in the 96-character ASCII set, but five are not. The following table shows the ASCII substitutions we have settled on for these characters, and their equivalents in modern Extended Algol.

ASCIIModernDescription
~:=Left-arrow (assignment statement)
|*Multiply (see below)
!^=Not-Equal
{<=Less-than-or-Equal
}>=Greater-than-or-Equal

Note that the B5500 used a small cross (×) for multiplication, which we choose to represent as the vertical bar (|). Asterisk (*) was used for exponentiation. Modern Algol uses the asterisk for multiplication (as with most other languages) and ** for exponentiation.

Keying source from a listing is not easy, and neither is keeping track of the indentation level of the text while you are doing it. Because of the fixed length of the source record (and because we wanted to recreate the source as accurately as possible), we had to keep the indentation and line lengths the same as in the original. I converted Nigel's text to PWB format, placing any overflow due to incorrect indentation in the patchmark field (which the B5500 did not have). Then I designed a simple indentation guide and printed it on a transparency cell. Using that cell as an overlay, I made a pass through the listing, comparing the indentation there with that in the source text, and correcting indentation and repairing overflows as I went. That took the equivalent of about two days, working on it over the span of about a week.

The next step was to make a complete proof-reading pass through the transcribed text, comparing it line for line with the listing. That took about a week, working nearly full time, and resulted in a little over 200 lines of corrections. It wasn't fun, but it had to be done. I was sure I had not caught everything, and further work on the source has already demonstrated that to be true. I suspect we are going to be finding typos and omissions in that source for a long time.

The result of that proof-reading pass is the basis both for the compiler we eventually want to compile, and for porting to a cross-compiler on the modern MCP. The cross-compiler is a means to an end, not an end in itself, and the source will require a lot of modification, so we are not as concerned with maintaining fidelity with the original source. If we are successful with the emulator, then hopefully the cross-compiler will eventually go the way of OSIL.

Not needing to maintain fidelity with the original source is good, because that original source is really messy -- inconsistent indentation, unusual (to say the least) treatment of nested statements, multiple statements per line, and other styling that today looks just plain trashy. On the other hand, that compiler was about 15 years old at the time the listing was made, and had been worked on by who knows how many people. Structured programming was still a relatively new, and still controversial concept in 1977, and completely unknown in 1962. The B5000 Algol compiler was arguably one of the first large programs written in a free-form, higher-level language. The developers, talented as they were, were coming from strong assembly-language backgrounds, and you can image that given this new medium for coding, they just didn't know how to use it. Coding style takes time and experimentation to develop, so I think what we see in this source are the beginnings of that experimentation.

In any case, I find that compiler source really difficult to read, and since I was going to have to whack on it quite a bit anyway, decided to reformat it. I have an Algol pretty-printing program, NEATUP, that was originally written by Dave Brown at U.S. Customs in the mid-1970s for the B7700. I hacked (and I do mean hacked) it to better tolerate some of the B5500 syntax differences, and to replace the special Algol characters with their modern equivalents. I also modified the program to prefix string literals so the modern compiler would interpret them for 6-bit codes (e.g., 6"BCL").

NEATUP handles end-of-line (%) comments quite well, and COMMENT comments when they were present at the statement level, but not COMMENT comments embedded within statements. I spent quite a bit to time trying to get the program to handle those better. I also tried to get the program to insert lines of "%" between procedure declarations to give them some visual separation.

When reformatting source code that has sequence numbers, you almost always need to resequence those numbers, and I did so, using a generous increment of 5000. I also modified the NEATUP program to copy the original sequence numbers from the input source lines into the patchmark field on the output lines (which otherwise wasn't used) so we will be able to correlate lines from the original and XEM sources later. This has already proven invaluable in the manual fix-up discussed below, and in applying corrections discovered in the XEM source back to the original source file.

Using the hacked-up NEATUP, now called NEATUP55, I reformatted the B5500 compiler source into an initial version of the source for the ALGOLXEM cross-compiler. There were some things it was just not worth the trouble to get NEATUP55 to deal with (such as the weird way octal constants are specified in FILL statements), and a couple of sections of code made it go bananas for reasons I decided not to bother trying to figure out, so I suppressed the reformatting around those.

The output of the reformatting process needed quite a bit of manual fix-up, which I did in PWB. Some of the special Algol characters were embedded in quoted strings and comments, and the areas where formatting had been suppressed needed to be reformatted by hand. The biggest job was reversing the bit numbering for partial word designators and bit concatenation constructs. About 50 carefully-specified find-and-replace commands did the trick, or at least I hope that it did, because if I messed any of those up the bugs are going to be really hard to find.

I found quite a few additional typos from the transcription while working on the initial ALGOLXEM source, some of which caused serious problems with the source reformatting process (and which, of course, is how I found out about them). All of those were corrected, and corresponding corrections entered into the source for the B5500 compiler.

At each stage through the processes described above, the source for both the B5500 compiler and ALGOLXEM have been committed back to the Subversion repository for the project, with appropriate log notes so that we have a record of the development and can get back to the intermediate version later, if necessary. I have also committed the original version of NEATUP and the hacked-up NEATUP55.

The next step in development of the cross-compiler is to study the code and try to figure out how to deal with the major conversion issues -- 6-bit coding, Stream Procedures, I/O, and those nasty absolute memory addresses. My current feeling is that even on the modern MCP, the data internal to the cross-compiler should stay in 6-bit form and only be converted to 8-bit coding at the external interfaces. The problem with absolute memory addresses is somewhat bounded, because that technique could only be used with the PRT (Program Reference Table), items in the stack, physical file buffers, and SAVE arrays. Those were nailed down in memory and could not be moved or overlaid to disk. All other areas were subject to relocation and paging out to disk by the MCP. Trying to manipulate absolute memory addresses for those "non-save" areas would be suicidal.

This next step is going to take a while, and there will probably not be any visible results for at least a few months.

No comments:

Post a Comment