BASIC (Re)Numbering — with Commodore

Feb 27, 2020

How to thoroughly renumber Commodore BASIC programs (PET 2001, VIC-20, C64)

Renumbering Commodore BASIC — The curious wonders of Commodore BASIC renumbering.

Renumbering a is popular beginner’s sport for those who dare to venture into the mythical realms of intricacy, which the BASIC interpreter and its storage formats provide. Since it is also a useful utility, there are numerous programs for this, especially it’s a favorite example in the intructory sections of machine language guides. However, as always, there’s an easy way and a thorough one. And we’ll see soon, why this may be.

Let’s start at the beginning, a stored BASIC program in memory.

(Yes, an advert.)

BASIC Code Representation in Memory

We already had a closer look at how Commodore BASIC stores its programs in a previous episode, however, we may want to recap this here with a somewhat different focus. Importantly, Commodore BASIC stores keywords as single-byte tokens, by this saving quite an amount of memory, but there are also literals reppresented in plain ASCII (or, beyond charter code 127, in PETSCII). And, as we shall see, BASIC lines are stored as a linked list and may be all over the place, regardless of their order in sequence.

Trivia: The tokens are the same for the PET 2001, VIC-20, and the C64. However, the first ROM edition for the PET 2001 is lacking the bare GO keyword, which was only added as the very last token in revision 2 of the PET’s ROMs (see below).

Let’s have a look at a (very) simple program:

10 PRINT "NUMBER";:INPUT A
20 PRINT "YOUR NUMBER IS ";A
30 IF A>0 GOTO 10
40 END

As the attentive reader may have observed already, this tiny program is asking for a number (line 10), which is then printed onto the screen (line 20). If the number is larger than 0, we jump back to the beginning, otherwise, we finish on a quite luxurious END statement in line 40.

And this is, what this program looks in memory, in all the glory of its idle bytes. Here, on a PET, where BASIC memory starts at $0401 (on a C64, it’s $0801, and on a VIC-20 it starts at $1000, or, if there’s a memory extension of at least 8K, at $1200):

0400: 00 15 04 0A 00 99 20 22  ...... "
0408: 4E 55 4D 42 45 52 22 3B  NUMBER";
0410: 3A 85 20 41 00 2F 04 14  :. A....
0418: 00 99 20 22 59 4F 55 52  .. "YOUR
0420: 20 4E 55 4D 42 45 52 20   NUMBER
0428: 49 53 20 22 3B 41 00 3E  IS ";A..
0430: 04 1E 00 8B 20 41 B1 30  .... A>0
0438: 20 89 20 31 30 00 44 04   . 10...
0440: 28 00 80 00 00 00        ......

At first glance, we may already spot some dispersed sequences of ASCII codes, however, the line numbers, we’re particularly interested in, are hiding somewhere in those bytes. Except, for a single “10” standing out near the end of the program, which might be well the jump target of our GOTO command.

Let’s apply a bit more structure to this and find out, how this works:

‘10 PRINT "NUMBER";:INPUT A’

0401: 15 04   binary address of next line (LO-byte, HI-byte), $0415
0403: 0A 00   BASIC line number as binary (LO-byte, HI-byte), $000A = 10
0405:    99   BASIC token for PRINT (a token’s value is always ≥ $80)
0406:    20   ASCII character <blank>
0407: 22 4E 55 4D 42 45 52 22   ASCII sequence (string literal), "NUMBER"
040F:    3B   ASCII character ;
0410:    3A   ASCII character :
0411:    85   BASIC token for INPUT
0412:    20   ASCII character <blank>
0413:    41   ASCII character A (identifier)
0414:    00   END OF LINE marker (binary zero)

As we may see, almost everything is stored as ASCII text with the notable exception of the tokens, which substitute the BASIC keywords. A token has always the highest bit set and a value ≥ $80. From this follows that any normal ASCII text, outside of string literals, must not have the MSB set, hence, PETSCII special characters are not allowed in normal code.

The most interesting parts, however, are the first 4 bytes and the very last one, which provide the structural aspect to the code: The first pair (in LO-HI order) provides a link to the memory address of the next line in sequence, the second pair the line number in binary. (While this would facilitate line numbers up to 65535 in theory, the highest line number supported by Commodore BASIC is actually 63999.) A trailing zero-byte eventually ends the line of BASIC code.

‘20 PRINT "YOUR NUMBER IS ";A’

4015: 2F 04   link to next line, $042F
4017: 14 00   line number (binary), $0014 = 20
4019:    99   BASIC token PRINT
041A:    20   ASCII character <blank>
041B: 22 59 4F 55 52 20 4E 55 4D 42 45 52 20 49 53 20 22
              string literal (ASCII), "YOUR NUMBER IS "
042C:    3B   ASCII character ;
042D:    41   ASCII character A (identifier)
042E:    00   END OF LINE marker (binary zero)

This is quite the same, but we may verify that the link address of the first line does actually point to the start of this line. For the next line, as we're already familiar with the pattern, we skip a few bytes and are concerned only with the interesting part:

‘30 IF A>0 GOTO 10’

042F: 3E 04   link to next line, $043E
0431: 1E 00   line number (binary), $001E = 30
...
0439:    89   BASIC token GOTO
043A:    20   ASCII character <blank>
043B: 31 30   ASCII sequence ‘10’ (line number)
043D:    00   END OF LINE marker (binary zero)

As we can see, while line numbers are generally stored as binaries, when they occur as a jump target in BASIC text, they are stored as plain ASCII strings!

‘40 END’

043E: 44 04   link to next line, $0444
0440: 28 00   line number (binary), $0028 = 40
0442:    80   BASIC token END
0443:    00   END OF LINE marker (binary zero)

0444: 00 00   line link 0000 => end of code

And here’s our very last line. What’s interesting here, is that it is not the END keyword, which ends the code in memory, this one rather instructs the interpreter to end its execution. What really marks the end of the code, is an empty (zero) line link, which marks the end of the linked list of BASIC lines. This is essential to the housekeeping of the BASIC interpreter. Without, we may encounter some difficulties, when we try to add, insert or edit a line.

Commodore BASIC lines as linked list — Commodore BASIC lines are stored as a linked list.

Renumbering

As you probably have worked out already, a simple renumbering algorithm is as simple as following the line links, starting from the BASIC start address (here, on a PET 2001: $0401) and fixing the binary line numbers, LO-byte first, followed by the HI-byte. This is also what most of the tutorial aplications do. However, this is a bit of a naïve approach, because we would probably end up with a nonfunctional program, since the jump targets were left untouched. By leaving this to the user, our utility wouldn’t do a particularly great job.

Identifying Jump Targets

As we’ve seen above, line numbers as jump targets are stored as ASCII literals in the program text. These are quite easy to identify, since they may be preceded by just a few keywords. Namely, these are:

89 ....... GOTO
8D ....... GOSUB
A7 ....... THEN
CB A4 .... GO TO

All of them come with a few of special cases:

GOTO and GOSUB may be part of a “ON … GOTO” or “ON … GOSUB” clause, where they may be followed not only by a single line number, but by a comma separated list of line numbers.
(However, instead of checking for the ON token and parsing an arbitrary expression inbetween, we may check for a following comma, as well, by this reducing the amount of look ahead or look behind. If an unconditional jump is followed by a list of line numbers, the interpreter will take the jump on the first number and will not care for any following code. Maybe, a construct like this was intended by the user as an alternative, now defunced target, just like commenting a previous target by inserting REM, so it wouldn’t hurt to fix this, as well.)
THEN may be followed immediately by a line number, thus indicating a conditional jump, but there’s no obligation for a line number to follow this keyword. Hence, any parsing must be context sensitive.
Finally, “GO TO” is an alternative form of GOTO. The GO keyword (token $CB) is lacking in the early PET 2001, as it was introduced only with ROM 2.0. The “TO” (token $A4) is the same token as in “FROM A=0 TO 10”. This will require a bit of look ahead. Notably, it maybe following an ON, like the more conventional GOTO, so the same rules apply for this as well. Regarding PETs with ROM 1.0, we won’t have to apply any special care, since the token was added to the end of the list and will simply not occur in those programs.
As always, where there can occur a space character, there may be any number of them, as is the case for the white space which goes between “GO” and “TO”.

Line Numbers as Jump Targets

As previously observed, line numbers are stored as plain ASCCI strings. However, there’s more to them. They may not only be proceded by any number of spaces, they may also contain any number of spaces, which will be simply ignored by the interpreter.

1 0 0  =  100

Here are a few examples of the type of code, we'll have to consider:

GOTO100
GOTO 100
GOTO 1 200
GO TO100
GO TO 100
GO TO 1 200
GOSUB200
GOSUB 200
ON … GOTO  100,200, 300
ON … GO TO 100,200, 300
ON … GOSUB 100,200, 300
THEN100
THEN 100
THEN  1 0 0

And, above all, we must ignore anything following a REM statement and any string literals (in quotes, ASCII $22). Any numbers, we may find here, regardless of the PETSCII character by which they might be preceded, are to be left as-is.

Varying Extent — Code Length

As may have become obvious by these explanations, there’s no guarantee that a new line number may be fitted in the same amount of bytes as there has been previously. E.g., as we have to replace “10” by “1000”, there’s no hope to summon the extra bytes from a miraculous source. We'll have to move all consecutive bytes in memory and we'll have to relink the lines. In essence, we'll have to rebuild the very structure of the program in memory. (However, if there are a few dead bytes in memory from previous edits, we may close these gaps and save a few bytes, as we rebuild the entire linked list in sequence.)

Doing it

Time to come up with an algorithm. Here, for illustration purpose and because it’s also real code for a real emulator, we’ll do it in JavaScript. This shouldn’t only be easier to follow than machine code, at the same time, it also provides us with ample external resources. If we did the same in 6502 code, we’d probably have to move stuff around using buffers, namely the the BASIC input buffer and the cassette buffer(s).

Trivia: Amazingly, the PET 2001 had two cassette buffers, one for the internal tape drive and one for a second, external cassette port. This way, you could compute like a pro, loading both from a dedicated program tape and a data tape without DJ-ing multiple cassettes on a single drive. Which may have been actually of use, when running complex business logic on a machine with just 8K, which may require the sequential loading of partial programs or even code overlays.

Anyway, which way we ever decide to do it, we won’t succeed on a single pass. We’ll have to read all occuring line numbers first and build a reference to the new, renumbered ones in a table. This is particularly, because there may be forward references in the code and there’s no guarantee that jumps will happen only to lines, which we have encountered and renumbered already. So we’ll have to take a second pass to correlate the jump targets according to this reference table, once we’ve successfully parsed all lines in a first pass.

There are aspects, which are open to the implementation. E.g., we could insert the binary line numbers on the first pass, or we could do so on the second pass just the same. If we did it in machine language, we would want to keep at least one of the passes as simple as possible and just skim through the linked list of lines to build the reference table as quickly as possible.

Here, in the case of our external JS utility, we decide to go for a different approach, first parsing and splitting the lines in chunks, which will be either opaque blobs of bytes, for which we do not care, or numbers representing jump targets. We’ll collect them in a list, and then, in a second pass, we’re going to reassemble the code from this list, which is as trivial as rejoining the junks, now using the new numbers for any jump targets, preceded by the new line numbers in binary. And we’ll have to fix up the line link as soon as we know the extent of our newly assembled line. Finally, we’ll add two zero bytes for the end-of-code marker and are done.

(For a machine language implementation, we’d probably move chunks of code into a buffer, parse and reassemble the new code from there into memory, at the same time taking care for any code that has to be moved in order to provide space.)

Let’s dive into this!

First Pass – Parse & Split

For the purpose of this blog, we’ll assume that we have santized our input values already, which will be passed as arguments to our utility function. These are:

a memory image of our RAM, an array of bytes
the start address of BASIC in the RAM (for a PET $0401)
an integer number in the range of 0..36999 as the starting line number
a non-zero integer as the step number

Further, there will be variables for the array of lines, containing anything, we’ll parse in this first pass, and a table holding a reference of line numbers (old → new). And we’ll encapsulate each of the passes in its own function. This gives us the following basic structure:

// renumber BASIC in provided memory slice (0..max)
// processes GOTO, GOSUB, GO TO, ON GOTO|GOSUB|GO TO, THEN

function renumberBasic(mem, basicStart, startNo, step) {

   var lineNoTable = {},
       lines = [],
       maxLineNo = 36999;

   // parse and split lines to chunks of blobs and line targets
   // stores parsed chunks in array lines,
   // generates a reference of old and new line numbers in lineNoTable.
   function parseSplit() {
      ...
   }

   // reassamble BASIC code from line chunks using new line numbers
   // from lineNoTable
   function reassembleLines() {
      ...
   }

   parseSplit();
   reassembleLines();

}

As for our function parseSplit(), we’ll have a main loop, which will be also safeguarding the range, in case we’d be dealing with corrupted code. In this, we’ll read lines in sequence from memory following the line links. An empty line link will indicated that we have arrived at the end of the code and are done with the pass. Otherwise, we generate a new entry for the line, both in our array of lines and in the reference table “lineNoTable”. Each entry to the lines array will have the original line number and an array to hold the parsed chunks. After advancing the line number in preparation of a new line to parse, we read the rest of the line until we encounter a zero-byte as the EOL marker.

function parseSplit() {

    var addr = basicStart,
        lineNo = startNo,
        maxMem = mem.length,
        chunks,
        blob;

    while (addr < maxMem) {
       var lineLink = mem[addr++] + (mem[addr++]<<8);
       if (!lineLink) break;
       chunks = [];
       blob = [];
       var ln = String(mem[addr++] + (mem[addr++]<<8)),
           byte = mem[addr++],
           line = {
              'ln': ln,
              'chunks': chunks
           };
       lineNoTable[ln] = lineNo;
       lineNo += step;
       while (byte) {
          ...
          b = mem[addr++];
       }
       if (blob.length) chunks.push({'type': 'blob', 'val': blob});
       lines.push(line);
       addr = lineLink;
   }
}

The array blob holds a reference to our current working blob and is initially empty, however, it will be filled in our inner “while” loop, where we’re searching for any jump targets.

       while (byte) {
          blob.push(byte);
          switch(byte) {
             case 0x8F: // REM (read rest of line)
                while (mem[addr]) blob.push(mem[addr++]);
                break;
             case 0x22: // quote (read up to next quote)
                var c = mem[addr++];
                while (c) {
                   blob.push(c);
                   if (c === 0x22) break;
                   c = mem[addr++];
                }
                break;
             case 0x89: // GOTO
             case 0x8D: // GOSUB
                parseLineTarget(true);
                break;
             case 0xA7: // THEN
                parseLineTarget(false);
                break;
             case 0xCB: // GO (read ahead and test for TO)
                var t = addr;
                while (mem[t] == 0x20) t++; // skip any blanks
                if (mem[t] !== 0xA4) break;
                while (addr <= t) blob.push(mem[addr++]);
                parseLineTarget(true);
                break;
          }
          byte = mem[addr++];
       }

Here, we first add the byte just read to our current blob. Then, we check, if it is a token of interest in a switch statement:

If it is a REM statement, we’ll have to add any following bytes to the blob, without further processing, until the end of line.
Quite the same, we add any string literals, as started by a double quote ($22), as-is. The only substantial difference here is that we’re looking for the next quote ($22) instead of the EOL mark.
If we encounter a token for GOTO ($89) or GOSUB ($8D), we pass on to a function “parseLineTarget()” to do the actual job of scanning for and extracting the target line number. The argument “true” passed to this function indicates that this may be a list (as used for “ON … GOTO”, etc).
The same is true for THEN ($A7), but this may be not a list (as indicated by the argument false passed to the “parseLineTarget()” function).
Note that this is quite a luxury, since, regarding valid BASIC syntax, it wouldn’t matter, if we were scanning here for a list, as well.
Finally, we handle GO ($CB), which is a different kind of beast, since we’re going to handle this conditionally only. Therefore, we’re scanning ahead. If the next non-blank byte is not a token for TO ($A4), we break out of the switch statement. Only, if we did find this token, we’ll copy any of the bytes already scanned to the blob and call our function “parseLineTarget()” to do the rest.

So, the real business of identifying line numbers and splitting the source is left to a function “parseLineTarget()”, which is a function internal (and private) to “parseSplit()”, looking like this:

    // scans for a jump target,
    // generates a target entry and a new blob in current chunks,
    // any leading blanks are added to the current blob
    function parseLineTarget(maybeList) {
       while (mem[addr] === 0x20) {
          blob.push(0x20);
          addr++;
       }
       // parse ASCII to string, ignoring any blanks
       var n = '',
           c = mem[addr];
       while ((c >= 0x30 && c <= 0x39) || c === 0x20) {
          if (c !== 0x20) n += String.fromCharCode(c);
          c = mem[++addr];
       }
       // if we found a number, push chunks and open a new blob
       if (n !== '') {
          chunks.push({'type': 'blob', 'val': blob});
          chunks.push({'type': 'target', 'val': n});
          blob = [];
          // scan for any comma (ON GOTO|GOSUB|GO TO)
          if (maybeList && c === 0x2C) {
             blob.push(0x2C);
             addr++;
             parseLineTarget(true);
          }
       }
    }

While this function does much of the heavy lifting, it is also quite straight forward:

First, we consume any white space and add it to the current working blob. (For a THEN, which is not followed by a line target, this won’t hurt any way.) Since we’re looking explicitly for blanks ($22), we won’t have to worry about any EOL, as well.
The next thing should be a line number providing the jump target. This one should start with a numeric character ($30..$39). We collect any numeric characters in a string, ignoring any blanks in the process, by this also normalizing the representation.
If our string in variable n isn’t empty, we actually found a line number. In this case, we close the current working blob by adding it to the list of chunks in our current line. Then, we add a new item containing the target line number. Finally, we create a new working blob for any bytes that may follow.
By this, we’re nearly done. But there may be an entire list to be processed, as incicated by the boolean flag maybeList, which has been passed to the function as an argument. If this is true and the current character, which was the first non-numeric and non-blank character scanned by our number collection loop, happens to be a comma ($2C), we repeat the process in a tail recursion, in order to scan and parse the next target.

I guess, the algorithm lends itself rather well to 6502 machine language code, as well, if we replaced the store-in-chunks mechanism by some actual copying to current pointer positions in memory. However, on the 6502, we’d probably prefer to do this on the second pass. Using a higher level language with ample resources at hand, we prefer to divide the tasks along the lines of gathering and preparing information on the one hand and final (re)assembly on the other hand.

By this, we have our entire parse & split routine in place:

var lineNoTable = {},
    lines = [],
    maxLineNo = 36999;

// parse and split lines to chunks of blobs and line targets
// stores parsed chunks in array lines,
// generates a reference of old and new line numbers in lineNoTable.
function parseSplit() {

    var addr = basicStart,
        lineNo = startNo,
        maxMem = mem.length,
        chunks,
        blob;

    // scans for a jump target,
    // generates a target entry and a new blob in current chunks,
    // any leading blanks are added to the current blob
    function parseLineTarget(maybeList) {
       while (mem[addr] === 0x20) {
          blob.push(0x20);
          addr++;
       }
       // parse ASCII to string, ignoring any blanks
       var n = '',
           c = mem[addr];
       while ((c >= 0x30 && c <= 0x39) || c === 0x20) {
          if (c !== 0x20) n += String.fromCharCode(c);
          c = mem[++addr];
       }
       // if we found a number, push chunks and open a new blob
       if (n !== '') {
          chunks.push({'type': 'blob', 'val': blob});
          chunks.push({'type': 'target', 'val': n});
          blob = [];
          // scan for any comma (ON GOTO|GOSUB|GO TO)
          if (maybeList && c === 0x2C) {
             blob.push(0x2C);
             addr++;
             parseLineTarget(true);
          }
       }
    }

    while (addr < maxMem) {
       var lineLink = mem[addr++] + (mem[addr++]<<8);
       if (!lineLink) break;
       chunks = [];
       blob = [];
       var ln = String(mem[addr++] + (mem[addr++]<<8)),
           byte = mem[addr++],
           line = {
              'ln': ln,
              'chunks': chunks
           };
       lineNoTable[ln] = lineNo;
       lineNo += step;
       while (byte) {
          blob.push(byte);
          switch(byte) {
             case 0x8F: // REM (read rest of line)
                while (mem[addr]) blob.push(mem[addr++]);
                break;
             case 0x22: // quote (read up to next quote)
                var c = mem[addr++];
                while (c) {
                   blob.push(c);
                   if (c === 0x22) break;
                   c = mem[addr++];
                }
                break;
             case 0x89: // GOTO
             case 0x8D: // GOSUB
                parseLineTarget(true);
                break;
             case 0xA7: // THEN
                parseLineTarget(false);
                break;
             case 0xCB: // GO (read ahead and test for TO)
                var t = addr;
                while (mem[t] == 0x20) t++; // skip any blanks
                if (mem[t] !== 0xA4) break;
                while (addr <= t) blob.push(mem[addr++]);
                parseLineTarget(true);
                break;
          }
          byte = mem[addr++];
       }
       if (blob.length) chunks.push({'type': 'blob', 'val': blob});
       lines.push(line);
       addr = lineLink;
    }
}

As this function finishes, we should get a neatly arranged list of lines and chunks, like this:

‘10 …ON A GOTO 1000, 1100’ →

lines = [
   {
      "ln": 10,
      "chunks": [
         {
            "type": "blob",
            "val": […, 0x91, 0x20, 0x41, 0x20, 0x89, 0x20] // …"ON A GOTO "
         },
         {
            "type": "target",
            "val": "1000"
         },
         {
            "type": "blob",
            "val": [0x2C, 0x20]
         },
         {
            "type": "target",
            "val": "1100"
         }
      ]
   },
   ...
]

lineNoTable = {
   "10": 100,
   ...
}

Second Pass — Reassembling the BASIC Code

Reassembling the new code from this should be fairly easy. All we have to do, is joining the chunks, by this replacing old line numbers by new ones. Moreover, we have to generate the linkt to the next line and insert the current line number in binary.

The only tricky part here is the edge case of a dead target line number, for which we do not have an entry in our reference table. What should we do? — Shrug and reuse the old one.

// reassamble BASIC code from line chunks using new line numbers
// from lineNoTable
function reassembleLines() {
   var addr = basicStart;
   for (var i = 0, max = lines.length; i < max; i++) {
      var currLine = lines[i],
          currNo = lineNoTable[currLine.ln],
          linkAddr = addr;
      mem[addr++] = 0;
      mem[addr++] = 0;
      mem[addr++] = currNo & 0xFF;
      mem[addr++] = (currNo >> 8) & 0xFF;
      // transfer chunks to memory
      for (var j = 0; j < currLine.chunks.length; j++) {
         var chunk = currLine.chunks[j];
         if (chunk.type === 'blob') {
            var blob = chunk.val;
            for (var k = 0; k < blob.length; k++)
               mem[addr++] = blob[k];
         }
         else if (chunk.type === 'target') {
            var s = '';
            if (chunk.val) {
               var n = lineNoTable[chunk.val];
               s = typeof n !== 'undefined'? n.toString(10):chunk.val;
            }
            for (var k = 0; k < s.length; k++)
               mem[addr++] = s.charCodeAt(k);
         }
      }
      mem[addr++] = 0; //EOL
      // fix up the line link
      mem[linkAddr++] = addr & 0xFF;
      mem[linkAddr] = (addr >> 8) & 0xFF;
   }
   // end of code (empty line link)
   mem[addr++] = 0;
   mem[addr++] = 0;
   return addr;
}

The new BASIC code is directly injected in the memory slice (mem), which was passed to our function renumberBasic() by reference. Mind that the cursor addr is finally pointing to the first memory location immediately after the reassembled code. Since this may be handy, we pass this as a return value.

Calling the Subroutines

The only thing left is actually calling these functions. Mind that there is an edge case, where the renumbered lines exced the range of valid line numbers. (Compare the value 63999 configured in variable maxLineNo.) Therfore, after calling the function parseSplit(), we should check the highest of the newly assigned line numbers, which happens to be the last entry in lineNoTable. If it is out of range, we abort with an error massage. (On the 6502, we’d probaly call the kernals error routine and output an “ILLEGAL QUANTITY” error.) Otherwise, we call the reassembly routine and return the end address of the newly generated code. Which is also the end of our renumbering routine.

function renumberBasic(mem, basicStart, startNo, step) {
   ...

   parseSplit();
   // check, if we are still in the range of legal line numbers
   if (lines.length) {
      var topLineNo = lineNoTable[lines[lines.length - 1].ln];
      // we could throw an exception as well…
      if (topLineNo > maxLineNo) return {
         'addr': -1,
         'message': 'Out of range. Top line number is ' + topLineNo +
                    ' (' + maxLineNo + ' allowed).'
      };
   }
   var endAddr = reassembleLines();
   return {
      'addr': endAddr
   };
}

One Thing Left — Or, What Could Possibly Go Wrong?

Now, this looks great. If we call our utility routine from the emulator, we should get a nice, renumbered code. Indeed, consider this as simple as pointless test sample (purposefully including some questionable formatting):

10 REM RENUMBER TEST
20 PRINT "NUMBER",:INPUT A
30 IF A<0 THEN 6 0
40 ON A GOTO 100,200, 300
50 GO TO 400
60 GOSUB 500:PRINT "THE END."
70 END
100 PRINT "1 IS A BEGINNING.":GOTO 20
200 PRINT "2 ARE A PAIR.":GOTO 20
300 PRINT "3 ARE A GROUP.":GOTO 20
400 PRINT "MORE THAN 3 ARE A LOT.":GOTO 20
500 PRINT "LESS THAN ZERO IS NOTHING.":RETURN

Calling

renumberBasic(Emulator.getRam(), 0x0401, 100, 10);

and typing “LIST”, we get:

100 REM RENUMBER TEST
110 PRINT "NUMBER",:INPUT A
120 IF A<0 THEN 150
130 ON A GOTO 170,180, 190
140 GO TO 200
150 GOSUB 210:PRINT "THE END."
160 END
170 PRINT "1 IS A BEGINNING.":GOTO 110
180 PRINT "2 ARE A PAIR.":GOTO 110
190 PRINT "3 ARE A GROUP.":GOTO 110
200 PRINT "MORE THAN 3 ARE A LOT.":GOTO 110
210 PRINT "LESS THAN ZERO IS NOTHING.":RETURN

While not particularly intellectual pleasing or challenging, it certainly looks like what we expected. We may even RUN it.

The program even withstands a closer inspection of memory:

0400: 00 15 04 64 00 8F 20 52  ...... R
0408: 45 4E 55 4D 42 45 52 20  ENUMBER
0410: 54 45 53 54 00 29 04 6E  TEST.)..
0418: 00 99 20 22 4E 55 4D 42  .. "NUMB
0420: 45 52 22 2C 3A 85 20 41  ER",:. A
0428: 00 39 04 78 00 8B 20 41  .9.... A
0430: B3 30 20 A7 20 31 35 30  .0 . 150
0438: 00 50 04 82 00 91 20 41  .P.... A
0440: 20 89 20 31 37 30 2C 31   . 170,1
0448: 38 30 2C 20 31 39 30 00  80, 190.
0450: 5C 04 8C 00 CB 20 A4 20  \.... .
0458: 32 30 30 00 73 04 96 00  200.....
0460: 8D 20 32 31 30 3A 99 20  . 210:.
0468: 22 54 48 45 20 45 4E 44  "THE END
0470: 2E 22 00 79 04 A0 00 80  ."......
0478: 00 99 04 AA 00 99 20 22  ...... "
0480: 31 20 49 53 20 41 20 42  1 IS A B
0488: 45 47 49 4E 4E 49 4E 47  EGINNING
0490: 2E 22 3A 89 20 31 31 30  .":. 110
0498: 00 B5 04 B4 00 99 20 22  ...... "
04A0: 32 20 41 52 45 20 41 20  2 ARE A
04A8: 50 41 49 52 2E 22 3A 89  PAIR.":.
04B0: 20 31 31 30 00 D2 04 BE   110....
04B8: 00 99 20 22 33 20 41 52  .. "3 AR
04C0: 45 20 41 20 47 52 4F 55  E A GROU
04C8: 50 2E 22 3A 89 20 31 31  P.":. 11
04D0: 30 00 F7 04 C8 00 99 20  0......
04D8: 22 4D 4F 52 45 20 54 48  "MORE TH
04E0: 41 4E 20 33 20 41 52 45  AN 3 ARE
04E8: 20 41 20 4C 4F 54 2E 22   A LOT."
04F0: 3A 89 20 31 31 30 00 1C  :. 110..
04F8: 05 D2 00 99 20 22 4C 45  .... "LE
0500: 53 53 20 54 48 41 4E 20  SS THAN
0508: 5A 45 52 4F 20 49 53 20  ZERO IS
0510: 4E 4F 54 48 49 4E 47 2E  NOTHING.
0518: 22 3A 8E 00 00 00        ":....

However, have a look at what the next attempt to list the program looks like, after running the program:

Something here is certainly wrong. (PET 2001 emulation) — Certainly, something is looking wrong here.

Just look at that mess all over the screen, what is happening here? Are we missing something? Even worse, if we try to edit the program, the editor doesn’t return, as it became lost in some endless loop, apparently! Are we missing something?

Housekeeping — Memory Allocation & BASIC Pointers

Indeed, we are missing something, namely, correct values in the BASIC pointers, which control how BASIC stores and accesses its variables. As our programs grows thanks to longer line numbers in the BASIC text, it also grows into the range of memory, which is used by BASIC for its variables. As a result, the end of our program was overwritten by variables as we ran the program.

Now, what are these pointers and how do they control variable allocation?

These pointers are found in a consecutive area of bytes in the zero page and look like this:

• C64

label   loc.hex       loc.dec    comment

TXTTAB  002B-002C      43-44     Pointer: Start of BASIC Text
VARTAB  002D-002E      45-46     Pointer: Start of BASIC Variables
ARYTAB  002F-0030      47-48     Pointer: Start of BASIC Arrays
STREND  0031-0032      49-50     Pointer: End of BASIC Arrays (+1)
FRETOP  0033-0034      51-52     Pointer: Bottom of String Storage
FRESPC  0035-0036      53-54     Utility String Pointer
MEMSIZ  0037-0038      55-56     Pointer: Highest Address Used by BASIC


• PET 2001 ROM 2.0 ("new ROM") and ROM 4 (BASIC 4.0)

TXTTAB  0028-0029      40-41     Pointer: Start of BASIC Text
VARTAB  002A-002B      42-43     Pointer: Start of BASIC Variables
ARYTAB  002C-002D      44-45     Pointer: Start of BASIC Arrays
STREND  002E-002F      46-47     Pointer: End of BASIC Arrays (+1)
FRETOP  0030-0031      48-49     Pointer: Bottom of String Storage
FRESPC  0032-0033      50-51     Utility String Pointer
MEMSIZ  0034-0035      52-53     Pointer: Highest Address Used by BASIC


• PET 2001 ROM 1.0 ("old ROM")

TXTTAB  007A-007B     122-123    Pointer: Start of BASIC Text
VARTAB  007C-007D     124-125    Pointer: Start of BASIC Variables
ARYTAB  007E-007F     126-127    Pointer: Start of BASIC Arrays
STREND  0080-0081     128-129    Pointer: End of BASIC Arrays (+1)
FRETOP  0082-0083     130-131    Pointer: Bottom of String Storage
FRESPC  0084-0085     132-133    Utility String Pointer
MEMSIZ  0086-0087     134-135    Pointer: Highest Address Used by BASIC


• VIC 20 (same as C64)

TXTTAB  002B-002C      43-44     Pointer: Start of BASIC Text
VARTAB  002D-002E      45-46     Pointer: Start of BASIC Variables
ARYTAB  002F-0030      47-48     Pointer: Start of BASIC Arrays
STREND  0031-0032      49-50     Pointer: End of BASIC Arrays (+1)
FRETOP  0033-0034      51-52     Pointer: Bottom of String Storage
FRESPC  0035-0036      53-54     Utility String Pointer
MEMSIZ  0037-0038      55-56     Pointer: Highest Address Used by BASIC

And this is the meaning of these pointers:

TXTTAB: is none other than our old friend, the start of BASIC.
VARTAB: is the start of the variable space. It follows immediately after the end of the program in memory (the final empty line-link.) Simple variables are stored here. Each variable occupies 7 bytes of memory, regardless of the type. ARYTAB points to the next available space.
ARYTAB: is (also) the beginning of the space allocated for array storage. It points to the location following immediately after the last byte allocated for simple variables.
Note: Since arrays are stored directly above simple variables, adding a new simple variable requires all arrays to be moved in memory in order to provide the required space!
STREND: is the lower end of the space used for storing string literals. (Strings grow from top to bottom.) This is also the beginning of any free memory, which hasn’t been allocated yet, between the array space and any stored string literals.
FRETOP: marks the top end of the unallocated memory, below the last allocated string.
FRESPC: is a utility pointer used internally by BASIC. It is not directly involved in variable management, but used for string handling (scratch space, etc).
MEMSIZ: is the top address of accessible memory. Strings start growing down from here.

Pointers in Action

Here’s an example for what this looks like in action after running a simple BASIC program (PET 2001, ROM 2.0, 8K RAM: $0000–$1FFF):

• PET 2001 8K, ROM 2.0 ("new ROM")

10 DIM A(10)
20 FOR I=0 TO 10:A(I)=I:NEXT
30 B$=CHR$(66)

→ dump of BASIC prg

0400: 00 0D 04 0A 00 86 20 41  ...... A
0408: 28 31 30 29 00 25 04 14  (10).%..
0410: 00 81 20 49 B2 30 20 A4  .. I.0 .
0418: 20 31 30 3A 41 28 49 29   10:A(I)
0420: B2 49 3A 82 00 32 04 1E  .I:..2..
0428: 00 42 24 B2 C7 28 36 36  .B$..(66
0430: 29 00 00 00              )...

→ RUN →

0028: 01 04  ...  TXTTAB: $0401
002A: 34 04  ...  VARTAB: $0434
002C: 42 04  ...  ARYTAB: $0442
002E: 80 04  ...  STREND: $0480
0030: FF 1F  ...  FRETOP: $1FFF
0032: FF 1F  ...  FRESPC: $1FFF
0034: 00 20  ...  MEMSIZ: $2000

→ dump at $0434 (addr in VARTAB):

0434:             49 00 84 30      I..0  real I
0438: 00 00 00 42 80 01 FF 1F  ...B....  string B, len. 1 → $1FFF
0440: 00 00 41 00 3E 00 01 00  ..A.>...  array A(10)
0448: 0B 00 00 00 00 00 81 00  ........
0450: 00 00 00 82 00 00 00 00  ........
0458: 82 40 00 00 00 83 00 00  .@......
0460: 00 00 83 20 00 00 00 83  ... ....
0468: 40 00 00 00 83 60 00 00  @....`..
0470: 00 84 00 00 00 00 84 10  ........
0478: 00 00 00 84 20 00 00 00  .... ...
0480: AA                       ........  end of array storage
....
1FF8: AA AA AA AA AA AA AA 42  .......B  literal "B"
                                         top of string storage

While this may be still a bit complicated, now have a look at what this pointers look like, just after we load a very simple program without running it:

‘10 PRINT "HELLO WORLD"’

→ loaded into memory, $0401 - $0416

0400: 00 15 04 0A 00 99 20 22  ...... "
0408: 48 45 4C 4C 4F 20 57 4F  HELLO WO
0410: 52 4C 44 22 00 00 00     RLD"...

0028: 01 04  ...  TXTTAB: $0401
002A: 17 04  ...  VARTAB: $0417
002C: 17 04  ...  ARYTAB: $0417
002E: 17 04  ...  STREND: $0417
0030: 00 20  ...  FRETOP: $2000
0032: 00 20  ...  FRESPC: $2000
0034: 00 20  ...  MEMSIZ: $2000

All, there is, is

TXTTAB pointing to the standard start of BASIC,
the address following immediately after the end of program (three times),
and the top of memory (three times).

This, we easily can provide. (Mind that the address, pointers VARTAB, ARYTAB, STREND are pointing to, is the same address, which our renumbering routine returns.)
Moreover, if our program grew into the variable space in the process of renumbering, any previous data is corrupted anyway. There’s no harm in resetting the pointers to similar (and similarly sane) values as after loading a program. What’s good enough for the Kernal, can’t be totally wrong for us.

Fixing the Pointers

Now that we know this, we may easily adjust the pointers to sane values, like here in this example, where we call our renumbering utility from a function inside the emulator UI (slightly adjusted for the purpose of this write-up):

function renumber() {

   // write a 2-byte value to a pointer (lo-byte, hi-byte)
   function setPointer(ptrAddr, word) {
      Emulator.write(ptrAddr,     word & 0xFF);
      Emulator.write(ptrAddr + 1, (word >> 8) & 0xFF);
   }

   var txtTab     = Emulator.getRomVers() == 1? 0x7A:0x28, //old or new PET ROM?
       mem        = Emulator.getRam(),
       maxRAM     = mem.length,
       basicStart = mem[txtTab] | (mem[txtTab + 1] << 8),
       result     = Utils.renumber(mem, basicStart, startLineNo, step);
       endAddr    = result.addr;

   if (endAddr > 0) {
      // success, reset pointers
      setPointer(txtTab, basicStart);
      setPointer(txtTab +  2, endAddr);
      setPointer(txtTab +  4, endAddr);
      setPointer(txtTab +  6, endAddr);
      setPointer(txtTab +  8, maxRAM);
      setPointer(txtTab + 10, maxRAM);
      setPointer(txtTab + 12, maxRAM);
   }
   return result.message;

}

And this is actually all there is. We are finished. We may have even learned some on the way.
(E.g., every time BASIC encounters a new variable, it has to move any arrays in their entirety, as defined as the block of memory between the addresse in pointers ARYTAB and STREND. You may want to optimize your BASIC programs in light of this.)

And we may call our utility in the emulator (see the Utils/Export menu), where we get something like this:

BASIC renumbering. (PET 2001 emulation) — Renumbering a tiny BASIC program, before and after.

TRY IT NOW

Norbert Landsteiner,
Vienna, 2020-02-27