Tuesday, October 22, 2013

On-going Progress with the Emulator

Another four months has come and gone since the last blog post for this project. There is much to report. We had made a lot of progress through July, then I was pulled away by other commitments for most of August and September, and have just reengaged with the project over the past few weekends.

In the work leading up to my two-month hiatus, the emulator became substantially more stable and more capable on almost a weekly basis. We are now at Release 0.14, which was pushed out in early October. There are still problems, and more features to implement, but the emulator is now in a very usable state.

The last blog post ("It's Alive...," 3 June 2013) seemed to strike a chord. Several people are now using the emulator, others have contacted us with comments, and we have received offers of additional B5500 material. There is more on this subject below.

Significant Changes and Improvements

Shortly before the last blog post, we resolved a very nasty problem with the so-called "R+7" aspect of subroutine stack linkage. That one fix made the emulator about an order of magnitude more stable than it had been prior to that point. It enabled us to begin using the system as you would a real B5500 under the control of its MCP operating system. Since then, the following major enhancements and fixes have been implemented:

Card Reader
Initially, all that we had were the SPO and Head-per-Track disk peripheral units. This made it impossible to run anything but programs we could load from the Mark XIII system tape images using the ColdLoader utility. A high priority was to implement card input. The current driver emulates the Burroughs B129 1400-card/minute reader. Card decks are ordinary ASCII text files. You load one or more files into the reader using a standard file picker dialog, then press the reader's START button. The MCP senses the reader's change in status, and starts reading cards, just as it worked on a real B5500.
"Dummy" Line Printer
Nigel started working on the implementation for a line printer peripheral unit, and ran into problems getting a prototype to work. It turned out the problems were mine in the way that printer I/Os were being initiated and terminated in the IOUnit and CentralControl modules. In the process of fixing those, I literally threw together a very basic diagnostic printer driver out of pieces of the SPO and card reader implementations. After getting my IOUnit problems fixed, I took out the diagnostic stuff, and well, we're still using that. It works fine for simple output, but at some point will need to be replaced by something with a better user interface and more complete functionality.
Card Punch
After getting the card reader and preliminary line printer units to work, it was a straightforward task to clone a card punch peripheral driver out of those. Besides, I was beginning to work on the Card-Load-Select mode of loading the system, and needed a way to output card decks for programs like the COOL and COLD loaders.
Improved Console Display
The B5500 had a very minimalist operator console -- just a few buttons and lights. There were lots more lights on the maintenance panels in the Distribution and Display unit, but those were usually hidden from view behind the "skins" of the mainframe cabinets. With the emulator, though, it was often difficult to see from the console what was happening with the system (or whether anything was happening at all), so I have added some annunciators to the console that show the activity of the I/O Units, external interrupts, and the individual peripheral devices. These are really helpful to gauge the activity of the system, and they make a nice show, besides. The extra lights can be disabled if you are a purist and want the console to look like it did on a real B5500.
Smaller User Interface Windows
The initial design of several of the windows for the peripheral devices were just too large. They worked fine on the 23-inch monitor I typically use for development and testing, but on other systems, particularly laptops, the windows crowded each other out and made it difficult to see what was happening overall with the system. The windows for the SPO, card reader, and card punch have all been reduced in size to better accommodate smaller displays.
RTS Presence-Bit Bug
RTS is the Return from Subroutine instruction. It is typically used to exit from what Burroughs termed "accidental-entry" procedures,  but what the rest of the world refers to as "thunks." This type of subroutine is used to implement the semantics of Algol Call-by-Name parameters. It turns out that such subroutines can return data descriptors, and a requirement of the RTS instruction is to throw a Presence Bit interrupt (page fault) if the returned descriptor points to an absent memory segment.

That requirement was poorly-documented, and we weren't checking for descriptor absent status in the emulator. The result was that the emulator could use the address field of an absent descriptor as if that field contained a valid memory address. This error only showed up while running some FORTRAN programs, and it proved to be very difficult to trace the symptoms back to the cause. It took several long, frustrating days of tracing and debugging before I finally found an obscure reference to the P-bit requirement, after which the fix was obvious and simple -- as is often the case with such problems.
Floating-Point Arithmetic Bugs
One of the first things I tried once the card reader and line printer were working was an Algol number-crunching program from my student days, for which I still had listings of source and output from a B5500 run in 1970. The program does orthonomalization of vectors to compute rheological parameters for two-phase flow in a round pipe (and before you ask, no, I don't understand what that means anymore).

Getting the program to run was no problem, but the results from the emulator were not even close to those on the 1970 listing. I was getting at best one digit of agreement between the two. I have a few other programs and listings from that era, and they were showing similar problems for cases involving complex calculations. Programs doing simpler calculations showed quite good agreement, however, so that smelled like some sort of rounding or normalization problem in the emulator.

After being deviled by this for months and frustrated in a couple of attempts to find the problems, earlier this month I wrote an Algol program for the B5500 that generated a variety of numeric-word bit patterns, computed all of the combinations of those bit patterns for the add, subtract, multiply and floating-divide operators, and dumped the results in octal. I then converted that program to the modern Unisys MCP architecture (which uses the same numeric format) and generated an equivalent set of results.

Comparing the two sets of results indeed revealed a number of cases of off-by-one differences in mantissa values, and a few cases where the differences were even worse. Knowing the bit patterns that generated these differences, I was able to trace the evaluation of those specific patterns in the emulator, and found several problems with rounding and normalization. All but one of the problems were in add/subtract, which internally is the same operation with some sign manipulation. The remaining problem was due to rounding when multiplying two integers -- which by their nature should never have their product rounded.

After correcting these issues, the results from the emulator now match -- to the digit -- the results in all of my listings from 1970 that I've been able to check thus far.
Card Load Select Bug
By default, the B5500 booted from disk. A push-button switch on the operator console would cause it to load from cards, however. Loading the MCP from disk has been working for months, but attempting to load from cards would read the binary boot loader card plus the first card of the program being loaded, then hang. After previously making several runs at the problem, earlier this month I finally found the cause -- hardware load proceeds much like any other I/O, but it is initiated as a special mode of the I/O Unit. It generates a result descriptor, but not a completion interrupt.

The problem turned out to be that the emulator was not suppressing the completion interrupt. Load from disk worked, either because of the timing involved, or more likely because the MCP's KERNEL bootstrap was smart enough to ignore the extraneous interrupt. In contrast, the binary one-card loader is pretty dumb, and apparently became confused by the pending interrupt left by the hardware load mechanism. Booting the system from cards now works.
Hosting Site and Wiki
While the emulator runs entirely within a web browser, it requires a web server from which it can be loaded into the browser. Not everyone has the wherewithal to set up and operate their own web server, so we have set up a web site to support the following:
  • Host the current release of the emulator.
  • Make available the Mark XIII tape images containing the Burroughs system software. 
  • Make available releases of the emulator source code for downloading.
  • Serve as a central source for emulator utilities and information. 
You are welcome to visit and use this site at http://www.phkimpel.us/B5500/.

We have also created a number of wiki pages on the project's Google Code site (http://code.google.com/p/retro-b5500/) GitHub site describing how to set up and use the emulator and its components. There are links to these wiki pages under the main Help link on the hosting site

Browser Status and Performance

One of the goals of this project has been to have the emulator execute programs at the speed of a real B5500, or as close to that as can be practical in its browser-based environment. Throughout the development of the emulator, especially in the Processor module, we have been concerned about the emulator's potential performance, and have tried the keep the coding as lean as possible. With the emulator becoming reasonably stable and the ability to compile and run various programs through the card reader, we are starting to get a feel for its performance.

Based on timing statistics in some of my listings from the 1970s, the emulator initially appeared to be running about 50% slower than a real B5500. I do most of my development and testing on a relatively new Dell Optiplex 390 with a quad-core, 3.3GHz Intel Pentium i3-2120 processor running 64-bit Windows 7. Monitoring performance in the Windows Task Manager while the emulator was running in Mozilla Firefox showed that raw processor power was not the problem -- the emulator may have been running slower than desired, but the core1 running that Javascript thread was loafing. So why was the emulator appearing to run slow?

The problem has turned out to be two-fold. First, the emulator attempts to throttle its performance by estimating for each instruction the number of 1MHz clock cycles that instruction should take to execute. It accumulates those cycle counts and periodically compares the number of microsecond cycles it has accumulated against the amount of real time that has elapsed. If the number of microseconds accumulated is greater than the elapsed time, then the emulator is running too fast, so it pauses using the Javascript setTimeout() function to allow real time to catch up to emulated time. We were obviously accumulating more clock cycles than we should have, especially in the number of cycles that memory access consumed. Simply lowering the number of clock cycles accumulated for each read or write memory access brought the emulator to within about 15% of the timing statistics from 1970.

Second, we found that the Javascript setTimeout() and time-of-day facility (using new Date().getTime()) were not nearly as granular as we were counting on. Some research revealed that (as of June of this year) most browsers on Windows had a timing resolution of about 15 milliseconds, and that the emerging HTML5 DOM standards call for a minimum resolution of 4ms. The emulator was requesting delays below 4ms, with the result that when the throttling mechanism tried to delay, say, 3ms, the real delay might be 15ms, resulting in the appearance that the emulator was running slow. It was actually running amazingly fast, but throttling too much.

The throttling mechanism has another important role besides regulating the speed of the Processor module. Everything in the emulator runs synchronously on one Javascript thread -- the Processor, the I/O Units, the peripheral devices, the interval timer in Central Control, even the second Processor, once we get that working. If the Processor does not yield control of the thread periodically, I/Os will not be initiated and completed, external interrupts will not be serviced, and in general the system just won't work. Thus, while one approach to dealing with the Javascript timing granularity would be to increase the amount of time the Processor runs before throttling, hogging the thread has a negative impact on the system's ability to do I/O and service interrupts.

We could see this in the difference of the emulator's performance when running in Google Chrome compared to Mozilla Firefox. Earlier this summer, Chrome was ahead in implementing the 4ms HTML5 DOM standard for setTimeout(), and the effective speed was much closer to the B5500 than with Firefox. Apparently Firefox made a change to its timer granularity in version 22, and the emulator performance is now better with Firefox that it is with Chrome.

Resolving this problem in the emulator has turned out to be an adventure. The main improvement has been to change the throttling mechanism so that it could better tolerate long setTimeout() delays without interfering with access to the Javascript thread by the other components. What we needed was a way to yield control of the thread without introducing any additional delay -- if other components were scheduled to run on the thread, they would, but control could return to the Processor as soon as everyone else had their turn.

There is a proposed setImmediate() function for Javascript that does just that, but only Microsoft Internet Explorer implements it, and it does not appear that this proposal will become a standard. Some freely-available "shims" have been written to implement the behavior of setImmediate() using existing DOM features, however, and we found one by Dominic Denicola (https://github.com/NobleJS/setImmediate) that works quite well. The throttling mechanism now computes a delay time and compares it to some threshold value (currently 4ms). If the delay is greater than the threshold, it uses setTimeout(); otherwise it uses the pseudo setImmediate() implementation.

In that latter case, the Processor will resume sooner than it should (and not actually throttle the performance very much), but the throttling mechanism does its computations based on total cycles accumulated vs. total elapsed time, so at the end of the next throttling cycle, it will typically compute an even larger delay. Eventually the delay will grow to the point that it exceeds the threshold, and setTimeout() will be called to do some real throttling. This approach generates jitter in the execution of the Processor, but the delays and non-delays average out, and it happens fast enough (15ms is approximately the refresh rate on most monitors), that the jitter usually is not noticeable.

With this new approach to throttling in place, the emulator is still running a little slower than a real B5500, but only by 7-8%. Further improvements in apparent performance will probably require detailed tuning of clock accumulation in the individual instructions. That can wait. In any case, B5500 instruction timings are difficult to model, because the Processor overlapped execution and memory access whenever it could, and the crossbar memory access mechanism in Central Control could generate random delays due to conflicting access to a memory module by multiple Processors and I/O Units.

Another outstanding problem in performance tuning is that I/O times appear to be four to eight times longer in the emulator that my listings from 1970 indicate they should be. I/O time is essentially channel time -- the MCP records the time when an I/O request when is initiated against an I/O unit, and again when the I/O complete interrupt is serviced. The difference between the two times is accumulated as I/O time for the requesting job. I suspect that this may be another problem with setTimeout() granularity, or possibly the IndexedDB mechanism we are using for disk I/O is just plain slow. [Correction as of 2013-10-23] Oops -- I am completely wrong about this. The program I have been using in timing tests outputs results at several points during its execution. I had thought that the times were differential between the output cases, but it turns out they are cumulative. When I compute the differences between the cases, the emulator's I/O times are actually lower than those on the 1970 listing by almost half. The exception is the first output case, where the I/O time is substantially higher. That might be due to the overhead of the AUTOPRINT spooler printing the output of the program's compilation during the time that first case is running.

Performance of the emulator also depends somewhat on the underlying platform. Firefox and Chrome remain the two browsers we have found that support the features the emulator needs to run. Apple Safari through 6.0 does not yet support IndexedDB, although the emulator should work in Firefox on a Mac. It does not work on Microsoft Internet Explorer through IE10. We have not yet tried Opera.

I have tried the emulator on a variety of Windows systems using Firefox, and it runs everywhere I have tried. It runs well, if slightly slower than on my quad-core Optiplex 390, on a five-year old Dell D830 with a 2 GHz Pentium Core Duo T7250 under 32-bit Windows 7, and also slightly significantly slower on an eight-year old Dell Optiplex GX520 with a 3 GHz Pentium P4 under Windows XP. [Clarifications added 2013-10-26]  It even runs acceptably on a four-year old HP Mini netbook with an Atom processor, also running under XP. The big problem with the Mini is not performance, but that the screen is too small to see much more than one of the windows at a time. On laptops, the best performance has been observed when they are on external power. The performance is noticeably slower when running on battery, or on a travel charger with a lower wattage rating than the standard charger.

Other Participants

One of the gratifying things about this project is the interest that other people have shown in it. We have been somewhat surprised at the number of people who have picked up the emulator and started using it, without much apparent difficulty, and only then let us know what they were doing. A few of those people have become more intensely involved with the project:
  • Fausto Saporito of Naples, Italy has been an early user of the emulator, and has contributed a number of FORTRAN benchmarks, including Whetstone and an arctangent program that appears to do a good job of measuring floating-point loss of significance for a processor. Fausto has also single-handedly transcribed the Mark XVI FORTRAN source from the listing on bitsavers.org. The current version is in the project's Subversion repository on Google Code.
  • Tim Sirianni of Eureka, California, US, stunned us by reporting that he had the TS (timesharing) MCP running and was using the CANDE timesharing editor, sort of. We don't have datacom working yet in the emulator, which is where the "sort of" comes in. Tim found a way to use the SPO as a CANDE terminal, but it is awkward to use, and not an approach for the less-than-determined.
  • Paul Cumberworth of Adelaide, Australia has transcribed the patches for our Mark XVI ESPOL compiler source and gotten those to compile with the base source. These are also available in our Subversion respository.
Out of the blue, Ed Vandergriff of Chaska, Minnesota, US, wrote me in late August to say that he had a listing of the APL interpreter for the B5500, created by Gary Kildall (of CP/M fame) and others at the University of Washington in the early 1970s. He asked if we were interested in it. I had not been aware of the existence of this interpreter, but Nigel had been looking for a copy of it for some time without success. Ed generously sent me the listing, which is actually a first-generation photocopy taken from a line-printer listing. We have scanned it for our use, and will eventually donate the original to a museum for long-term preservation.

A copy of the scanned listing is available on our hosting site at http://www.phkimpel.us/PickUp/APL-B5500-Listing-19710111.pdf. It is about 44MB in size.

Fausto and Hans Pufal of Angouleme, France, have volunteered to transcribe the APL listing. Hans has helped us before, having previously transcribed the Mark XVI source code we used to create our ESPOLXEM cross-compiler. Fausto is starting from one end of the scanned APL listing and Hans from the other. At last report, they had only 20 pages to go until they meet in the middle, à la the Mont Blanc tunnel. Their progress to date is available in the Subversion repository. Once their transcription is complete, it will need to be proofread and corrected before it can be used. We will also need to get datacom working in the emulator.

Current Efforts

We have some known problems and a couple of high-priority features requiring attention.
  1. A proper datacom interface will be required to run the TSMCP and CANDE, as well as the APL interpreter. I am currently working on a very basic, one-terminal implementation of the B249 Data Transmission Control Unit and B487 Data Transmission Terminal Unit. Supporting external terminals in a browser environment is extremely difficult (browsers are quite determined be be clients, not servers), so this initial implementation will simply host a single terminal as a user interface to the B249/B487, somewhat similar to the way the SPO currently works. That should be adequate for most users. We hope to have this feature available soon.
  2. After datacom, the next priority is support in the emulator for magnetic tapes. We think we know how to approach this, but detailed design work has not yet begun.
  3.  The B5500 would support two processors, but our attempts to get the second processor working have thus far been a failure. This has been especially frustrating, because the differences between P1 (the control processor, which is currently working) and P2 (which could only run Normal-State user programs under control of P1) are very minor. In fact, the two processors on a real B5500 were physically identical, and either one could be designated as P1 by means of a mechanical switch. I have made three serious runs at this problem, most recently last weekend, and come up short each time. I made some progress this last time, finding a problem in the way P2 was handled by the SFI (Store for Interrupt) instruction. With that change, P2 now runs for a few seconds before somehow failing. The problem is obviously subtle, and is proving difficult to trap, even with special code inserted into the emulator to do so. Getting P2 to work is a relatively low priority, so this problem has been set aside for now.
  4. The other major deficiency in the Processor implementation at present is that the double precision arithmetic operators have never been finished. Their single-precision equivalents are currently standing in for them. One of Fausto's benchmarks requires double precision, and the compilers require double-precision in order to properly compile double-precision literals, so the priority of this issue is rising.
We look forward to hearing from everyone who is using the emulator, or a least trying to. Please let us know why you are interested in the B5500 and what you are doing with the emulator. We are anxious to hear your comments and suggestions, and we especially want to hear about any problems or bugs that you encounter. Either comment on the blog, or contact me privately at paul (dot) kimpel (at) digm (dot) com. If you have a Google account, you can also post issues on our Google Code project GitHub project site at http://code.google.com/p/retro-b5500/issues/list.




____________
1 Has anyone else noticed that in the days of the B5500, "core" meant memory, but now it means processor?