The Case of the Missing 4th Commodore BASIC Variable (and the 5th Byte)

Another investigation into data types in Commodore BASIC.

Title illustration

Previously, this article was written in the style of a “Damsel in Distress” detective story, told by a cynical private-eye narrator. This was part of a broader experiment to gather real-world data on how unique information presented in an edgy form would compare, in terms of algorithms and ranking, to more pleasant but dangerously hallucinated generated content.
Now that this experiment is over, we may present the content in a more comprehensive form. For reference, the previous, “edgy” version is archived here.

It’s common knowledge that Commodore BASIC features three basic types of variables: Float, Integer, and String. We had a closer look at their format and implementation already, but here is a short recap:

Identifier  Type            In-Memory Signature  In-Memory Representation
A1          Floating Point  A 1  (0x41 0x31)     5 bytes: exponent/sign, 4 bytes mantissa
I2%         Integer         I̅ 2̅  (0xC9 0xB2)     5 bytes: 2 bytes binary value, 3 zero-bytes (unused)
S3$         String          S 3̅  (0x53 0xB3)     5 bytes: length, 2-byte memory pointer, 2 zero-bytes

(Identifiers start with a letter, followed by an arbitrary number of letters and digits, but only the first two characters are significant. The effective identifier used by BASIC is just two characters: either a single letter followed by a zero-byte, or a letter and an alphanumeric character. An overbar in the signature represents a set sign-bit [0x80].)

Every simple variable occupies 7 bytes in memory (which makes them easy and fast to traverse when looking up an identifier): 2 bytes for the name, followed by a 5-byte variable body, which holds the actual value. The name also encodes the type: as letters and digits utilize only the lower 7 bits of a byte in PETSCII (or ASCII, BTW), the sign-bit of each of the two name-bytes is free to encode the type information.

In particular, a variable without a type specifier is a float, which is also the default type, with no sign-bits set. An integer (specified by a trailing “%”) is distinguished by having both sign-bits set, and a string variable (ending in “$”) has just the sign-bit of the second name-byte set:

- Commodore BASIC variables by sign-bit -

0 0   Float
1 1   Integer
0 1   String

Since the actual signature of a variable — as given by the two name-bytes of the in-memory representation — comprises both the type and the name, each type occupies its own namespace and we may use variables of the same identifying name, but of different type at the same time without fearing a collision. As in A (float), A% (integer) and A$ (string), which may all co-exist at once.
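As a minimal sketch of this encoding (written in Python, not taken from the BASIC ROM; the names `TYPES` and `decode_signature` are made up for illustration), the two name-bytes can be split into identifier and type like this:

```python
# Sign-bit patterns of the two name-bytes, as listed in the table above.
TYPES = {
    (0, 0): ("float", ""),
    (1, 1): ("integer", "%"),
    (0, 1): ("string", "$"),
}

def decode_signature(b0: int, b1: int) -> tuple[str, str]:
    """Return (identifier, type) for the two name-bytes of a variable."""
    bits = (b0 >> 7, b1 >> 7)            # bit 7 of each name-byte
    kind, suffix = TYPES.get(bits, ("unknown", "?"))
    name = chr(b0 & 0x7F)                # first character, sign-bit stripped
    if b1 & 0x7F:                        # a zero-byte means a one-letter name
        name += chr(b1 & 0x7F)
    return name + suffix, kind

print(decode_signature(0x41, 0x31))      # name-bytes of A1
print(decode_signature(0xC9, 0xB2))      # name-bytes of I2%
print(decode_signature(0x53, 0xB3))      # name-bytes of S3$
```

Any bit pattern that is not in the table is simply reported as "unknown" here, rather than guessed at.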

Notably, in Commodore BASIC, where all arithmetic is done in floating point, integer is just a storage format. Since all simple variables are stored in a 7-byte format, prioritizing search time efficiency over memory use, this only pays off for arrays, where integers are more densely packed, using just 2 bytes per data cell.
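To illustrate why a fixed record size makes the lookup simple, here is a small Python model of such a search. It is only a sketch under assumptions: a plain linear scan in 7-byte steps, not the actual ROM routine, and `find_variable` is a hypothetical helper. The table bytes are those of the three-variable sample program (A1, I2%, S3$) examined in this article.

```python
RECORD_SIZE = 7   # 2 name-bytes + 5-byte body

# Variable records of the sample program: A1, I2%, S3$ (name-bytes, then body).
TABLE = bytes.fromhex(
    "4131 8216147AE2"
    "C9B2 0102000000"
    "53B3 0324040000"
)

def find_variable(table: bytes, signature: bytes):
    """Return the offset of the record with the given two name-bytes, or None."""
    for offset in range(0, len(table), RECORD_SIZE):
        if table[offset:offset + 2] == signature:
            return offset
    return None

print(find_variable(TABLE, bytes([0xC9, 0xB2])))   # offset of I2% in the table
```

Because name and type share the same two bytes, a single two-byte comparison per record is enough to tell A, A% and A$ apart.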

We may put this assorted knowledge of Commodore BASIC variables to a quick test, just to understand this properly. For this, we’ll use a new tool available in our trusty PET 2001 emulator, namely, the utility “Disassemble Variables”, which allows us to take a closer look at variables and their values and formats as stored in memory:

screenshot of an emulated PET screen showing a BASIC program
A very basic example, featuring float, integer, and string variables.
### COMMODORE BASIC ###

 15359 BYTES FREE

READY.
10 A1 =2.345
20 I2%=258
30 S3$="BLA"

RUN

READY.
→ Utils/Export → Disassemble Variables
                         .[simple BASIC variables]

042B  41 31               A1
042D  82 16 14 7A E2      =  2.345
0432  C9 B2               I2%
0434  01 02 00 00 00      =  258
0439  53 B3               S3$
043B  03 24 04 00 00      len: 3, @ $0424

                         .[end of BASIC variables]

The disassembly aligns with any other disassembly format (e.g., for tokenized BASIC, or machine language) and provides for each variable a line with its name, followed by another line giving the “payload” or value-part along with its interpretation. (Mind that they look a bit different for arrays, where individual values are listed by subscript.)
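The interpretation of the three payloads above can be reproduced with a few lines of Python. This is a sketch, not the emulator's code: the function names are made up, and the float layout is an assumption based on the usual description of the 5-byte format (exponent in excess-128 form, mantissa with an implied leading 1, sign in the top bit of the first mantissa byte).

```python
def decode_float(body: bytes) -> float:
    """Decode a 5-byte float body: exponent byte, then 4 mantissa bytes."""
    exponent, m1, m2, m3, m4 = body
    if exponent == 0:                      # an exponent of zero means 0.0
        return 0.0
    sign = -1.0 if m1 & 0x80 else 1.0      # sign lives in the top mantissa bit
    mantissa = ((m1 | 0x80) << 24) | (m2 << 16) | (m3 << 8) | m4
    return sign * mantissa / 2**32 * 2.0 ** (exponent - 128)

def decode_integer(body: bytes) -> int:
    """Decode an integer body: signed 16-bit value, high byte first."""
    value = (body[0] << 8) | body[1]
    return value - 0x10000 if value & 0x8000 else value

def decode_string(body: bytes) -> tuple[int, int]:
    """Decode a string body into (length, pointer); the pointer is low byte first."""
    return body[0], body[1] | (body[2] << 8)

print(decode_float(bytes.fromhex("8216147AE2")))    # body of A1
print(decode_integer(bytes.fromhex("0102000000")))  # body of I2%
print(decode_string(bytes.fromhex("0324040000")))   # body of S3$
```

Note the differing byte orders: the integer value is stored high byte first, while the string pointer is a regular 6502 address with the low byte first.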

The takeaway here is that each simple variable is stored in a 7-byte format: a two-byte signature encoding name and type, followed by a 5-byte payload representing the value or some properties, like length and pointer. The type is stored in the sign-bits (bit 7) of the two name-bytes, as provided in the table:

0 0   Float    (type 0)
1 1   Integer  (type 3)
0 1   String   (type 1)

Now, if you have been around binary encodings for some time, this may give rise to a suspicion: Isn’t there room for another type (namely, a type 2)? And, while it makes some sense to encode integers as the opposite of floats, with all sign-bits set, why aren’t integers encoded as “1 0”, so that the types are encoded in sequence, as in 0, 1, 2? In other words, is there yet another type in hiding?

Another Type?

I caught a first glimpse of this supposedly unknown type when putting this to the test with an old source code by Jason Cook (Check out his new PET game!):

addr   memory

1C0A   D2 00 B4 0A 13 1C B5

This translates into a variable named “R” with the sign-bit set on the first name-byte only (PETSCII 0x52 + 0x80 = 0xD2) and the second name-byte consisting just of zero-padding. There it was, in plain sight: an example of the possible 4th type, holding an unknown 5-byte payload.

So there actually are,

- Commodore BASIC variables by sign-bit -

0 0   Float
1 1   Integer
0 1   String
1 0   – ??? –

So, what is this unknown type?

This is even more of a conundrum, as Commodore never made much of a mystery of variable formats, right from the beginning. The PET manuals clearly describe how BASIC interacts with memory and provide some examples of in-memory formats, but they only mention 3 types: floating point, integer, and string. So what may this 4th variable type be, and what mysteries are lurking behind it?

As we already know the name, the single letter “R”, and since variables are stored in the order in which they are encountered during the execution of a program, it shouldn’t be too difficult to trace this to its origins, hidden in a bunch of densely formatted BASIC statements:

150 DEFFNR(X)=INT(X*RND(U)):GOSUB8010:A1$="NLTSMR"

(STARTREK1978.PRG by Jason Cook)

It’s a DEFFN variable! It actually makes some sense that references to user defined functions should be stored as variables, in order to look them up by name.

So let’s have a closer look, using a much simpler example that lends itself a bit easier to investigations:

10 DEFFNR(X)=1+X*X
20 PRINT FNR(3)

RUN
 10

Now let’s have a look at the variable as in memory:

→ Utils/Export → Disassemble Variables

                         .[simple BASIC variables]

0420  D2 00               FNR()
0422  0C 04 29 04 31      – ??? –
0427  58 00               X
0429  00 00 00 00 00      =  0

                         .[end of BASIC variables]

And, as we’re at it, let’s inspect the tokenized program as in memory, as well:

→ Utils/Export → Disassemble Program

                         .[tokenized BASIC text]

0401  12 04               link: $0412
0403  0A 00               line# 10
0405  96                  token DEF
0406  A5                  token FN
0407  52 28 58 29         ascii «R(X)»
040B  B2                  token =
040C  31                  ascii «1»
040D  AA                  token +
040E  58                  ascii «X»
040F  AC                  token *
0410  58                  ascii «X»
0411  00                  -EOL-
0412  1E 04               link: $041E
0414  14 00               line# 20
0416  99                  token PRINT
0417  20                  ascii « »
0418  A5                  token FN
0419  52 28 33 29         ascii «R(3)»
041D  00                  -EOL-
041E  00 00               -EOP- (link = null)

                         .[end of BASIC text]

If you’re familiar with the memory layout of the PET, you may have spotted it already: the first two words of the variable body are pointers into memory, as given away by their second (high) byte of 04, pointing at addresses in the 0x0400 to 0x04FF range. On the PET, BASIC memory starts at 0x0401 and is populated by the tokenized BASIC text, followed by simple variables and then arrays, if there are any.

Let’s have a look just at the first line and the variables:

0401  12 04               link: $0412
0403  0A 00               line# 10
0405  96                  token DEF
0406  A5                  token FN
0407  52 28 58 29         ascii «R(X)»
040B  B2                  token =
040C  31                  ascii «1»
040D  AA                  token +
040E  58                  ascii «X»
040F  AC                  token *
0410  58                  ascii «X»
0411  00                  -EOL-

      (...)

0420  D2 00               FNR()
0422  0C 04               pointer to $040C (low, high)
0424  29 04               pointer to $0429 (low, high)
0426  31                  – ??? –
0427  58 00               X
0429  00 00 00 00 00      =  0

This already promises speedy and optimized execution at run-time, as the pointers refer directly to the memory locations needed. Moreover, we can see why only floating point values are allowed as an argument to user defined functions: the pointer to the argument skips past the name and type of that variable, assuming right away that it’s a float.
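Reading the two pointers out of the variable body is a matter of combining low and high bytes. A minimal Python sketch, with `decode_fn_body` being a name chosen for illustration only, applied to the body bytes of FNR() from the dump above:

```python
def decode_fn_body(body: bytes) -> tuple[int, int, int]:
    """Split a 5-byte FN body into (definition pointer, argument pointer, 5th byte)."""
    definition = body[0] | (body[1] << 8)   # low byte first, then high byte
    argument = body[2] | (body[3] << 8)
    return definition, argument, body[4]

definition, argument, last = decode_fn_body(bytes.fromhex("0C04290431"))
print(f"body @ ${definition:04X}, arg @ ${argument:04X}, 5th byte ${last:02X}")

# The argument pointer targets the body of X, 2 bytes past the start of
# its 7-byte record at $0427, i.e. past the two name-bytes.
print(argument == 0x0427 + 2)
```

The first pointer lands on the byte right after the “=” token in line 10, the second on the first byte of the body of X.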

The Mystery of the 5th Byte

So, what may the 5th byte be about? Some of this may remind us of how strings are stored, by a first byte storing the length and then a pointer to the in-memory location, at which the string starts. Is it a length of sorts? (This may seem even more plausible, as the code for executing “DEFFN” borrows some from the code for string handling.)

This was actually my first assumption, nourished by some coincidence. However, it is, of course, not a length. Execution at run-time just stops at the first colon (“:”) or the end of the line, whichever comes first, extending over a single BASIC statement. No length is required for that.
Is it related to the variable name? That was yet another coincidence in my early investigations, as can be clearly seen from the above example, where 0x31 is the ASCII code for “1”, which bears no relation to “R”. So, what is it?

Let’s expand on our little experiment:

10 DEFFNR(X)=1+X*X
20 DEFFNG(Y)=3*Y+4

Which (after RUN) provides the following variable read-out:

0425  D2 00               FNR()
0427  0C 04 2E 04 31      @ $040C, arg @ $042E, ??
042C  58 00               X
042E  00 00 00 00 00      =  0
0433  C7 00               FNG()
0435  1D 04 3C 04 33      @ $041D, arg @ $043C, ??
043A  59 00               Y
043C  00 00 00 00 00      =  0

So, the first variable has a 5th byte of 0x31 and the second variable one of 0x33. Is it some counter? (This also shows, once again, that this isn’t related to any names, since nothing in either “R”, “G”, “X”, or “Y” translates to a difference of 2.)

So let’s add another DEFFN definition to this, just to verify:

10 DEFFNR(X)=1+X*X
20 DEFFNG(Y)=3*Y+4
30 DEFFNI(T)=3*T-2


0436  D2 00               FNR()
0438  0C 04 3F 04 31      @ $040C, arg @ $043F, ??
043D  58 00               X
043F  00 00 00 00 00      =  0
0444  C7 00               FNG()
0446  1D 04 4D 04 33      @ $041D, arg @ $044D, ??
044B  59 00               Y
044D  00 00 00 00 00      =  0
0452  C9 00               FNI()
0454  2E 04 5B 04 33      @ $042E, arg @ $045B, ??
0459  54 00               T
045B  00 00 00 00 00      =  0

Hum, this is somewhat disappointing: both the second and the third FN variable have 0x33 as their last byte. So it isn’t a counter, at all. Moreover, adding some other variables to our short program or changing any of the names doesn’t show any effect on this 5th byte of the variable body by any means.

However, if we change the very first character of the function body, we finally do make a difference:

30 DEFFNI(T)=4*T-2

0452  C9 00               FNI()
0454  2E 04 5B 04 34      @ $042E, arg @ $045B, ??

Let’s make this

30 DEFFNI(T)=T-2

0450  C9 00               FNI()
0452  2E 04 59 04 54      @ $042E, arg @ $0459, ??

As the attentive reader may have observed already, 0x34 is the ASCII code for “4” and 0x54 is ASCII “T”.
It’s literally the first byte of our DEFFN function body!

Let’s check this with a token in the first position:

30 DEFFNI(T)=INT(T)

0451  C9 00               FNI()
0453  2E 04 5A 04 B5      @ $042E, arg @ $045A, ??

Yes, 0xB5 has the sign-bit set, giving away the BASIC token, and it is the BASIC token for INT, indeed:

0425  1E 00               line# 30
0427  96                  token DEF
0428  A5                  token FN
0429  49 28 54 29         ascii «I(T)»
042D  B2                  token =
042E  B5                  token INT
042F  28 54 29            ascii «(T)»
0432  00                  -EOL-

Well, this is that mystery solved.
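The same check can be expressed as a short Python sketch: follow the definition pointer into the program text and compare the byte found there with the 5th byte of the variable body. The memory image below contains only the bytes of line 30 as listed above, and `fifth_byte_matches` is a hypothetical helper name.

```python
# Tokenized line 30, DEFFNI(T)=INT(T), starting at its line number at $0425.
LINE_30_START = 0x0425
LINE_30_BYTES = bytes.fromhex("1E00 96 A5 49285429 B2 B5 285429 00")

# Sparse memory image: address -> byte
memory = {LINE_30_START + i: b for i, b in enumerate(LINE_30_BYTES)}

fn_body = bytes.fromhex("2E045A04B5")       # variable body of FNI()

def fifth_byte_matches(memory: dict, body: bytes) -> bool:
    """True if the 5th byte equals the byte at the definition pointer."""
    definition = body[0] | (body[1] << 8)   # pointer to just after the '=' token
    return memory[definition] == body[4]

print(fifth_byte_matches(memory, fn_body))
```

Here, the definition pointer is $042E, which is the address of the INT token (0xB5) in the listing.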

But what happens if we were to change this 5th byte on-the-fly? Does this 5th byte matter at all?

10 DEFFNR(X)=1+X*X
20 DEFFNG(Y)=3*Y+4
30 DEFFNI(T)=INT(T)
40 POKE 1160,32 : REM DEC 1160 = $0488
50 PRINT FNI(4.1)

RUN
 4

READY.

It doesn’t seem so. The result is still what we’d expect from the BASIC function INT. It’s also not what we’d expect if the token INT in the BASIC text had been replaced by 32 (0x20), a simple space/blank, which would result in the function body “ (T)”.

And we actually did change that last byte, as the disassembly reads now:

0482  C9 00               FNI()
0484  2E 04 8B 04 20      @ $042E, arg @ $048B, « »

Let’s have another go at this, this time replacing the “1” in the variable body of FNR by the ASCII code for “2”:

10 DEFFNR(X)=1+X*X
20 DEFFNG(Y)=3*Y+4
30 DEFFNI(T)=INT(T)
40 POKE 1130,50 : REM DEC 1130 = $046A
50 PRINT FNR(2)

RUN
 5

0464  D2 00               FNR()
0466  0C 04 6D 04 32      @ $040C, arg @ $046D, «2»

This didn’t make a difference, either.
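For reference, the POKE addresses used in these two experiments follow directly from the record layout: 2 name-bytes, then the 5-byte body, whose last byte is the one of interest. A small Python sketch of the arithmetic (`fifth_byte_address` is just an illustrative name), using the record addresses shown in the two disassemblies above:

```python
NAME_BYTES = 2   # the two name-bytes of a variable record
BODY_BYTES = 5   # the 5-byte variable body

def fifth_byte_address(record_start: int) -> int:
    """Address of the last body byte of the 7-byte record at record_start."""
    return record_start + NAME_BYTES + BODY_BYTES - 1

# FNI() at $0482 in the first experiment, FNR() at $0464 in the second.
for start in (0x0482, 0x0464):
    address = fifth_byte_address(start)
    print(f"record ${start:04X}: POKE {address} (${address:04X})")
```

Mind that the record addresses shift whenever the program text changes in length, since the variables are stored right behind it.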

As far as my own research goes, a thorough investigation into literature on the matter produced a sole source, namely “Programming the PET/CBM” by Raeto West. Here, we find FN variables actually described as a distinctive type, on p. 9, where the very last byte is described as “INITIAL OF VAR.”

Facsimile: Raeto West, Programming the PET/CBM; p. 9

As so often, when you think you have discovered something obscure or even unknown about the PET and/or Commodore BASIC, it’s already in West’s book. It’s just that it may be in a form that doesn’t give away its meaning immediately (or, sometimes, that it’s actually covered at all).

Here, the exact meaning of “INITIAL OF VAR.” may not be that clear, as it’s provided without further context. However, as we’ve established already, this is fair and correct if we’re meant to understand “the initial byte of the function body referred to by the variable” (as opposed to, e.g., “the first character of the variable identifier,” or similar). The descriptive text that goes along with this table reads as follows:

A function definition has two pointers; one to the definition in the body of the BASIC program, and one to the floating-point dependent variable. They point just after the '=' sign and to the exponent byte respectively. The final byte is garbage, generated when the definition is set up, and is not used.

Well, I guess that’s it, especially as (as already mentioned) the code makes use of some resources dedicated to string handling. However, it is still a bit strange that this 5th byte isn’t just set to 0, as with any other surplus bytes in integer or string variables.

So, while we’re done for the purpose of our disassembly, there’s still a bit of a mystery left, in how the very beginning of the function body ends up as the last byte of the variable body. But this, I guess, is yet another story.

Wait, there’s still more…

Above, we observed that there are only global variables in Commodore BASIC. This is also technically true and correct, as far as the creation and representation of the function parameter for user defined functions in memory is concerned.

However, if we look at the runtime implementation of user defined functions in the BASIC ROM, we discover that, prior to execution, the current content of the variable body (1 byte exponent and 4 bytes mantissa) is saved before the variable is accessed as a parameter/argument, and restored again when the execution of that call has finished. Since user defined functions are callable from inside user defined functions, there may be more than one value to be saved at any given time, so this cannot be achieved by a simple buffer in the zero-page. Rather, the contents of the variable body are pushed to the processor stack and eventually restored from there.

So, in effect, the function parameter is actually a local variable, shadowing any global variable going by the same name:

10 X=1
20 DEFFNR(X)=1+X*X
30 PRINT X
40 PRINT FNR(2)
50 PRINT X
RUN
 1
 5
 1

READY.
█

Moreover, even if there is no conflict, the function parameter/argument isn’t accessible from outside:

10 DEFFNR(X)=1+X*X
20 PRINT FNR(2)
30 PRINT X
RUN
 5
 0

READY.
█

So, why does this use a global variable at all? Well, as we’ve seen before, the function body of a user defined function is executed like any other BASIC statement, the single difference being that execution ends at the first colon or end-of-line, causing the BASIC interpreter to return to its previous context. This way, the function parameter/argument is accessible to normal BASIC execution inside the function body.

Still, we may note that the effort taken to make this behave like a local variable (5 pushes and 5 pulls to and from the stack, together with the reads and writes that go with this) somewhat counteracts the efficiency suggested by the argument pointer tapping directly into the variable body.
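The save-and-restore scheme can be modelled in a few lines of Python. This is a deliberately simplified sketch, not a transcription of the ROM code: a Python list stands in for the processor stack, whole values are pushed instead of 5 individual bytes, and `call_fn`, `fnr` and `fng` are names made up for this illustration (with `fng` being a hypothetical nested definition, DEFFNG(X)=FNR(X+1)+X).

```python
variables = {"X": 1.0}   # global variable table (name -> float value)
stack = []               # stands in for the 6502 processor stack

def call_fn(param, body, argument):
    """Evaluate a function body with `param` temporarily set to `argument`."""
    stack.append(variables.get(param, 0.0))   # save the current value
    variables[param] = argument               # the argument goes into the same slot
    try:
        return body()
    finally:
        variables[param] = stack.pop()        # restore the saved value

def fnr(argument):       # models DEFFNR(X)=1+X*X
    return call_fn("X", lambda: 1 + variables["X"] ** 2, argument)

def fng(argument):       # hypothetical nested case: DEFFNG(X)=FNR(X+1)+X
    return call_fn("X", lambda: fnr(variables["X"] + 1) + variables["X"], argument)

print(variables["X"], fnr(2), variables["X"])
print(fng(2), variables["X"])
```

The nested call shows why a single save buffer would not suffice: while FNR is being evaluated inside FNG, two saved values of X are on the stack at the same time.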

BTW, if you want to have a look at the new version of the PET emulator, here it is running all the latest demos: