The Case of the Missing 4th Commodore BASIC Variable (and the 5th Byte)
Another investigation into data types in Commodore BASIC.
Previously, this article was written in the style of a “Damsel in Distress” detective story, told by a cynical private-eye narrator. This was part of a broader experiment, gathering real-world data on how unique information presented in an edgy form would fare, in terms of algorithms and ranking, against more pleasant but dangerously hallucinated generated content.
Now that this experiment is over, we may present the content in a more comprehensive form. For reference, the previous, “edgy” version is archived here.
It’s common knowledge that Commodore BASIC features three basic types of variables: Float, Integer, and String. We already had a closer look at their format and implementation, but here is a short recap:
Identifier | Type | In-Memory Signature | Signature Bytes | In-Memory Representation |
---|---|---|---|---|
A1 | Floating Point | A 1 | (0x41 0x31) | 5 bytes: exponent/sign, 4 bytes mantissa |
I2% | Integer | I̅ 2̅ | (0xC9 0xB2) | 5 bytes: 2 bytes binary value, 3 zero-bytes (unused) |
S3$ | String | S 3̅ | (0x53 0xB3) | 5 bytes: length, 2-byte memory pointer, 2 zero-bytes |
(Identifiers start with a letter, followed by an arbitrary number of letters and digits, but only the first two characters are significant. The effective identifier used by BASIC is just two characters: either a single letter followed by a zero-byte, or a letter and an alphanumeric character. An overbar in the signature represents a set sign-bit [0x80].)
Every simple variable occupies 7 bytes in memory (which makes them easy and fast to traverse when looking up an identifier): 2 bytes for the name, followed by a 5-byte variable body, which holds the actual value. The name also encodes the type: as letters and digits use only the lower 7 bits of a byte in PETSCII (or ASCII, BTW), the sign-bit of each of the two name-bytes is free to encode the type information.
In particular, a variable without a type specifier is a float, which is also the default type, with no sign-bits set. An integer (specified by a trailing “%”) is distinguished by having both sign-bits set, and a string variable (ending in “$”) has just the sign-bit of the second name-byte set:
Commodore BASIC variables by sign-bit

Sign-bit of 1st name-byte | Sign-bit of 2nd name-byte | Type |
---|---|---|
0 | 0 | Float |
1 | 1 | Integer |
0 | 1 | String |
Since the actual signature of a variable (as given by the two name-bytes of the in-memory representation) comprises both the type and the name, each type occupies its own namespace, and we may use variables of the same name but of different types at the same time without fearing a collision, as in A (float), A% (integer) and A$ (string), which may all co-exist at once.
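The sign-bit scheme described above can be sketched in a few lines of Python. This is only an illustration of the encoding, not code from the BASIC ROM; the function name `decode_signature` is made up for this example, and any sign-bit combination not listed in the table is simply reported as “?”.

```python
def decode_signature(byte0, byte1):
    """Split the two name-bytes of a variable into (identifier, type).

    Bit 7 of each byte carries the type information,
    bits 0-6 carry the character (or zero for a one-letter name).
    """
    types = {(0, 0): "float", (1, 1): "integer", (0, 1): "string"}
    kind = types.get((byte0 >> 7, byte1 >> 7), "?")
    name = chr(byte0 & 0x7F)
    if byte1 & 0x7F:                      # second character is optional
        name += chr(byte1 & 0x7F)
    return name, kind


# The three signatures from the table above
for signature in [(0x41, 0x31), (0xC9, 0xB2), (0x53, 0xB3)]:
    print(decode_signature(*signature))
```

Since the type bits are part of the two bytes that are compared during lookup, `A`, `A%` and `A$` yield three different signatures (0x41 0x00, 0xC1 0x80 and 0x41 0x80), which is why they never collide.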
Notably, in Commodore BASIC, where all arithmetic is done in floating point, integer is just a storage format. Since all simple variables are stored in a 7-byte format, prioritizing search time efficiency over memory use, this only pays off for arrays, where integers are more densely packed, using just 2 bytes per data cell.
We may put this assorted knowledge of Commodore BASIC variables to a quick test, just to understand this properly. For this, we’ll use a new tool available in our trusty PET 2001 emulator, namely, the utility “Disassemble Variables”, which allows us to take a closer look at variables and their values and formats as stored in memory:
### COMMODORE BASIC ###
15359 BYTES FREE
READY.
10 A1 =2.345
20 I2%=258
30 S3$="BLA"
RUN
READY.
█
→ Utils/Export → Disassemble Variables
.[simple BASIC variables]
042B 41 31            A1
042D 82 16 14 7A E2   = 2.345
0432 C9 B2            I2%
0434 01 02 00 00 00   = 258
0439 53 B3            S3$
043B 03 24 04 00 00   len: 3, @ $0424
.[end of BASIC variables]
The disassembly aligns with any other disassembly format (e.g., for tokenized BASIC, or machine language) and provides for each variable a line with its name, followed by another line giving the “payload” or value-part along with its interpretation. (Mind that they look a bit different for arrays, where individual values are listed by subscript.)
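To illustrate why the fixed 7-byte format is easy to traverse, here is a minimal Python sketch that walks the variable area from the disassembly above in 7-byte strides, comparing only the two signature bytes. The names `VAR_START`, `VAR_AREA` and `find_variable` are made up for this example; it mimics the idea of the lookup, not the actual ROM routine.

```python
VAR_START = 0x042B                       # start of simple variables in the example
VAR_AREA = bytes.fromhex(
    "41 31 82 16 14 7A E2"               # A1
    "C9 B2 01 02 00 00 00"               # I2%
    "53 B3 03 24 04 00 00"               # S3$
)


def find_variable(area, start, signature):
    """Return the address of the 5-byte body for a 2-byte signature, or None."""
    for offset in range(0, len(area) - 6, 7):          # one entry every 7 bytes
        if area[offset:offset + 2] == signature:
            return start + offset + 2                  # skip the two name-bytes
    return None


body = find_variable(VAR_AREA, VAR_START, bytes([0xC9, 0xB2]))   # I2%
print(f"${body:04X}")
```

Because every entry has the same size, the search never has to interpret a payload in order to find the next name.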
Here,
- the float “A1” holds the signed exponent 0x82 and the mantissa 0x16 0x14 0x7A 0xE2, representing the decimal real number 2.345.
- Then, the integer variable “I2%” holds the 16-bit value 0x01 0x02, representing (in high-byte, low-byte order) the decimal value 258, padded by 3 zero-bytes.
- Finally, the string variable “S3$” holds a payload indicating a length of 3 (as the length is encoded in a single byte, a string may be 0–255 characters long), and a 2-byte pointer to where this string is found in memory (this time in low-byte, high-byte order), padded by 2 zero-bytes.
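The three payloads listed above can be decoded with a short Python sketch. The function names are made up for this example. The float decoding rests on my understanding of the 5-byte format, which the text above does not spell out: an exponent biased by 128, a mantissa normalized to the range 0.5 to 1 with an implied leading 1-bit, and the sign of the number stored in bit 7 of the first mantissa byte. Treat this as an assumption.

```python
def decode_float(body):
    """Decode a 5-byte float body (assumed layout, see the notes above)."""
    exponent = body[0]
    if exponent == 0:                        # an exponent of zero means 0.0
        return 0.0
    mantissa = bytearray(body[1:5])
    sign = -1.0 if mantissa[0] & 0x80 else 1.0
    mantissa[0] |= 0x80                      # restore the implied leading 1-bit
    fraction = int.from_bytes(mantissa, "big") / 2**32
    return sign * fraction * 2.0 ** (exponent - 128)


def decode_integer(body):
    """Signed 16-bit value, high byte first; the last 3 bytes are unused."""
    return int.from_bytes(body[0:2], "big", signed=True)


def decode_string_descriptor(body):
    """Length byte, then a pointer stored low byte first."""
    return body[0], body[1] | (body[2] << 8)


print(decode_float(bytes.fromhex("82 16 14 7A E2")))                  # close to 2.345
print(decode_integer(bytes.fromhex("01 02 00 00 00")))
print(decode_string_descriptor(bytes.fromhex("03 24 04 00 00")))
```

Note the different byte orders: the integer value is stored high byte first, while the string pointer follows the usual 6502 convention of low byte first.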
The takeaway here is that simple variables are each stored in a 7-byte format, consisting of a two-byte signature encoding name and type, followed by a 5-byte payload representing the value or some properties, like length and pointer. The type is stored in the sign-bits (bit 7) of the two name-bytes, as given in this table:

Sign-bit of 1st name-byte | Sign-bit of 2nd name-byte | Type | Type number |
---|---|---|---|
0 | 0 | Float | type 0 |
1 | 1 | Integer | type 3 |
0 | 1 | String | type 1 |
Now, if you have been around binary encodings for some time, this may give rise to a suspicion: Isn’t there room for another type (namely, a type 2)? And, while it makes some sense to encode integers as the opposite of floats, with all sign-bits set, why aren’t integers encoded as “1 0” so that the types are encoded in sequence, as in 0, 1, 2? In other words, is there yet another type in hiding?
Another Type?
I caught a first glimpse of this supposedly unknown type when putting this to the test with an old program by Jason Cook (Check out his new PET game!):
addr  memory
1C0A  D2 00 B4 0A 13 1C B5
Which translates into a variable named “R” with the sign-bit set on the first name-byte only (PETSCII 0x52 + 0x80 = 0xD2) and the second name-byte consisting just of zero-padding. There it was, in plain sight, an example of the possible 4th type, holding an unknown 5-byte payload.
So there actually are,

Commodore BASIC variables by sign-bit

Sign-bit of 1st name-byte | Sign-bit of 2nd name-byte | Type |
---|---|---|
0 | 0 | Float |
1 | 1 | Integer |
0 | 1 | String |
1 | 0 | – ??? – |
So, what is this unknown type?
This is even more of a conundrum, as Commodore never made much of a mystery of variable formats, right from the beginning. The PET manuals clearly describe how BASIC interacts with memory and provide some examples of in-memory formats, but they only mention 3 types: floating point, integer, and string. So what may this 4th variable type be, and what mysteries are lurking behind it?
As we already know the name, the single letter “R”, and since variables are stored in the order in which they are encountered during the execution of a program, it shouldn’t be too difficult to trace this to its origins, hidden in a bunch of densely formatted BASIC statements:
150 DEFFNR(X)=INT(X*RND(U)):GOSUB8010:A1$="NLTSMR"
(STARTREK1978.PRG by Jason Cook)
It’s a DEFFN variable! It actually makes some sense that references to user defined functions should be stored as variables, in order to look them up by name.
So let’s have a closer look, using a much simpler example that lends itself a bit easier to investigations:
10 DEFFNR(X)=1+X*X
20 PRINT FNR(3)
RUN
10
Now let’s have a look at the variable as in memory:
→ Utils/Export → Disassemble Variables
.[simple BASIC variables]
0420 D2 00            FNR()
0422 0C 04 29 04 31   – ??? –
0427 58 00            X
0429 00 00 00 00 00   = 0
.[end of BASIC variables]
And, as we’re at it, let’s inspect the tokenized program as in memory, as well:
→ Utils/Export → Disassemble Program
.[tokenized BASIC text]
0401 12 04         link: $0412
0403 0A 00         line# 10
0405 96            token DEF
0406 A5            token FN
0407 52 28 58 29   ascii «R(X)»
040B B2            token =
040C 31            ascii «1»
040D AA            token +
040E 58            ascii «X»
040F AC            token *
0410 58            ascii «X»
0411 00            -EOL-
0412 1E 04         link: $041E
0414 14 00         line# 20
0416 99            token PRINT
0417 20            ascii « »
0418 A5            token FN
0419 52 28 33 29   ascii «R(3)»
041D 00            -EOL-
041E 00 00         -EOP- (link = null)
.[end of BASIC text]
If you’re familiar with the memory layout of the PET, you may have spotted it already: the first two words of the variable body are pointers into memory, as given away by their second (high) byte of 04, pointing at addresses in the 0x0400–0x04FF range. On the PET, BASIC starts at 0x0401, which is populated by the tokenized BASIC text, followed by simple variables and then arrays, if there are any.
Let’s have a look just at the first line and the variables:
0401 12 04         link: $0412
0403 0A 00         line# 10
0405 96            token DEF
0406 A5            token FN
0407 52 28 58 29   ascii «R(X)»
040B B2            token =
040C 31            ascii «1»
040D AA            token +
040E 58            ascii «X»
040F AC            token *
0410 58            ascii «X»
0411 00            -EOL-
(...)
0420 D2 00            FNR()
0422 0C 04            pointer to $040C (low, high)
0424 29 04            pointer to $0429 (low, high)
0426 31               – ??? –
0427 58 00            X
0429 00 00 00 00 00   = 0
- The first pointer points directly at the function body, right after the “=” of the function definition.
- The second pointer points directly at the variable body of the argument “X”, which is actually a global variable. (Which does make some sense, as there are only global variables in BASIC.)
This already promises some speedy and optimized execution at run-time, as the pointers refer immediately to memory as needed. Moreover, we can see why only floating point values are allowed as an argument to user defined functions: the pointer to the argument skips past any notion of the name and type of that variable, assuming right away that it’s a float.
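The two pointers can be followed mechanically. The following Python sketch rebuilds a sparse memory image from the bytes shown in the listings above and reads both words in low-byte, high-byte order. The helper names `load` and `word` are made up for this example; only bytes that appear in the disassembly are loaded.

```python
# Sparse memory image (address -> byte), rebuilt from the listings above.
memory = {}


def load(address, hex_bytes):
    """Copy a run of hex-encoded bytes into the memory image."""
    for offset, value in enumerate(bytes.fromhex(hex_bytes)):
        memory[address + offset] = value


load(0x0405, "96 A5 52 28 58 29 B2 31 AA 58 AC 58 00")   # DEF FN R(X) = 1 + X * X
load(0x0420, "D2 00 0C 04 29 04 31")                     # FNR() entry
load(0x0427, "58 00 00 00 00 00 00")                     # X entry


def word(address):
    """Read a 16-bit pointer stored low byte first."""
    return memory[address] | (memory[address + 1] << 8)


FN_BODY = 0x0422                  # variable body of FNR() (entry address + 2)
body_ptr = word(FN_BODY)          # first pointer: into the program text
arg_ptr = word(FN_BODY + 2)       # second pointer: into the argument variable

print(f"function text at ${body_ptr:04X}, preceded by token ${memory[body_ptr - 1]:02X}")
print(f"argument body at ${arg_ptr:04X}, name bytes start at ${arg_ptr - 2:04X}")
```

The byte just before the first target is 0xB2, the token for “=”, and the second target lies two bytes past the name-bytes of “X”, so the name and type of the argument are never consulted through this pointer.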
The Mystery of the 5th Byte
So, what may the 5th byte be about? Some of this may remind us of how strings are stored, with a first byte storing the length, followed by a pointer to the in-memory location at which the string starts. Is it a length of sorts? (This may seem even more plausible, as the code for executing “DEFFN” borrows some from the code for string handling.)
This was actually my first assumption, nourished by some coincidence. However, it is not, of course. The execution at run-time just stops at the first colon (“:”) or the end of the line, whichever comes first, extending over a single BASIC statement. No length is required for that.
Is it related to the variable name? This was yet another coincidence in my early investigations, as can be clearly seen in the above example, where 0x31 is the ASCII code for “1”, which bears no relation to “R”. So, what is it?
Let’s expand on our little experiment:
10 DEFFNR(X)=1+X*X
20 DEFFNG(Y)=3*Y+4
Which (after RUN) provides the following variable read-out:
0425 D2 00            FNR()
0427 0C 04 2E 04 31   @ $040C, arg @ $042E, ??
042C 58 00            X
042E 00 00 00 00 00   = 0
0433 C7 00            FNG()
0435 1D 04 3C 04 33   @ $041D, arg @ $043C, ??
043A 59 00            Y
043C 00 00 00 00 00   = 0
So, the first variable has a 5th byte of 0x31 and the second variable one of 0x33. Is it some counter? (This also shows, once again, that this isn’t related to any names, since nothing in either “R”, “G”, “X”, or “Y” translates to a difference of 2.)
So let’s add another DEFFN definition to this, just to verify:
10 DEFFNR(X)=1+X*X
20 DEFFNG(Y)=3*Y+4
30 DEFFNI(T)=3*T-2
0436 D2 00            FNR()
0438 0C 04 3F 04 31   @ $040C, arg @ $043F, ??
043D 58 00            X
043F 00 00 00 00 00   = 0
0444 C7 00            FNG()
0446 1D 04 4D 04 33   @ $041D, arg @ $044D, ??
044B 59 00            Y
044D 00 00 00 00 00   = 0
0452 C9 00            FNI()
0454 2E 04 5B 04 33   @ $042E, arg @ $045B, ??
0459 54 00            T
045B 00 00 00 00 00   = 0
Hum, this is somewhat disappointing: both the second and the third FN variable have 0x33 as their last byte. So it isn’t a counter at all. Moreover, adding some other variables to our short program or changing any of the names doesn’t have any effect on this 5th byte of the variable body.
However, if we change the very first character of the function body, we finally do make a difference:
30 DEFFNI(T)=4*T-2
0452 C9 00            FNI()
0454 2E 04 5B 04 34   @ $042E, arg @ $045B, ??
Let’s make this
30 DEFFNI(T)=T-2
0450 C9 00            FNI()
0452 2E 04 59 04 54   @ $042E, arg @ $0459, ??
As the attentive may have observed already, 0x34 is the ASCII code for “4” and 0x54 is ASCII “T”.
It’s literally the first byte of our DEFFN function body!
Let’s check this with a token in the first position:
30 DEFFNI(T)=INT(T)
0451 C9 00            FNI()
0453 2E 04 5A 04 B5   @ $042E, arg @ $045A, ??
Yes, 0xB5 has the sign-bit set, giving away the BASIC token, and it is indeed the BASIC token for INT:
0425 1E 00         line# 30
0427 96            token DEF
0428 A5            token FN
0429 49 28 54 29   ascii «I(T)»
042D B2            token =
042E B5            token INT
042F 28 54 29      ascii «(T)»
0432 00            -EOL-
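This finding can be expressed as a small check: the 5th byte of the FN variable body should equal the byte found at the address given by the first pointer. The Python sketch below applies this to two of the read-outs above, using only bytes that appear in the listings; the function name `first_body_byte_matches` is made up for this example.

```python
def first_body_byte_matches(fn_body, memory):
    """Compare the 5th byte of an FN variable body with the byte its
    first pointer (low byte first) refers to in the program text."""
    body_ptr = fn_body[0] | (fn_body[1] << 8)
    return memory[body_ptr] == fn_body[4]


examples = [
    # (FN variable body, relevant byte of the tokenized program)
    (bytes.fromhex("0C 04 29 04 31"), {0x040C: 0x31}),   # FNR(X)=1+X*X
    (bytes.fromhex("2E 04 5A 04 B5"), {0x042E: 0xB5}),   # FNI(T)=INT(T)
]

for fn_body, memory in examples:
    print(first_body_byte_matches(fn_body, memory))
```

Both cases agree, whether the function body starts with a plain character or with a token.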
Well, this is that mystery solved.
But what happens if we change this 5th byte on-the-fly? Does this 5th byte matter at all?
10 DEFFNR(X)=1+X*X
20 DEFFNG(Y)=3*Y+4
30 DEFFNI(T)=INT(T)
40 POKE 1160,32 : REM DEC 1160 = $0488
50 PRINT FNI(4.1)
RUN
4
READY.
It doesn’t seem so. The result is still what we’d expect as a result of the BASIC function INT. It’s also not what we’d expect if we replaced the token INT in the BASIC text by 32 (0x20), which is a simple space/blank, resulting in the string “ (T)”.
And we actually did change that last byte, as the disassembly now reads:
0482 C9 00 FNI()
0484 2E 04 8B 04 20 @ $042E, arg @ $048B, « »
Let’s have another go at this, this time replacing the “1” in the variable body of FNR by the ASCII code for “2”:
10 DEFFNR(X)=1+X*X
20 DEFFNG(Y)=3*Y+4
30 DEFFNI(T)=INT(T)
40 POKE 1130,50 : REM DEC 1130 = $046A
50 PRINT FNR(2)
RUN
5
0464 D2 00            FNR()
0466 0C 04 6D 04 32   @ $040C, arg @ $046D, «2»
This didn’t make a difference, either.
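For reference, the POKE addresses used in the two experiments above follow directly from the 7-byte layout: two name-bytes, then the 5-byte body, of which we want the last byte. A minimal Python sketch of that arithmetic (the function name `fifth_byte_address` is made up for this example):

```python
def fifth_byte_address(entry_address):
    """Address of the last body byte of a 7-byte variable entry:
    skip the 2 name-bytes, then the first 4 of the 5 body bytes."""
    return entry_address + 2 + 4


for name, entry in [("FNI()", 0x0482), ("FNR()", 0x0464)]:
    address = fifth_byte_address(entry)
    print(f"{name}: ${address:04X} = {address}")
```

This yields 1160 for FNI() at $0482 and 1130 for FNR() at $0464, matching the values in the REM comments.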
As far as my own research goes, a thorough investigation into literature on the matter produced a sole source, namely “Programming the PET/CBM” by Raeto West.
Here, we find FN variables actually described as a distinctive type, on p. 9, where the very last byte is described as “INITIAL OF VAR.”
As so often, when you may think you have discovered something obscure or even unknown about the PET and/or Commodore BASIC, it’s already in West’s book. It’s just that it may be in a form that doesn’t give away its meaning immediately (or, sometimes, that it’s actually covered, at all.)
Here, the exact meaning of “INITIAL OF VAR.” may not be that clear, as it’s provided without further context. However, as we’ve established already, this is fair and correct, if we’re meant to understand, “the initial byte of the function body referred to by the variable.” (As opposed to, e.g., “the first character of the variable identifier,” or similar.) The descriptive text, which goes along with this table, reads as follows:
A function definition has two pointers; one to the definition in the body of the BASIC program, and one to the floating-point dependent variable. They point just after the '=' sign and to the exponent byte respectively. The final byte is garbage, generated when the definition is set up, and is not used.
Well, I guess, that’s it. Especially as (as already mentioned) the code makes use of some resources dedicated to string handling. However, it is still a bit strange that this 5th byte isn’t just set to 0, as with any other surplus bytes in integer or string variables.
So, while we’re done for the purpose of our disassembly, there’s still a bit of a mystery left, in how the very beginning of the function body ends up as the last byte of the variable body. But this, I guess, is yet another story.
Wait, there’s still more…
Above, we observed that there are only global variables in Commodore BASIC. This is also technically true and correct, as far as the creation and representation of the function parameter for user defined functions in memory is concerned.
However, if we look at the runtime implementation of user defined functions in the BASIC ROM, we discover that, prior to execution, the current content of the variable body (1 byte exponent and 4 bytes mantissa) is saved before the variable is accessed as a parameter/argument and then restored again, when the execution of that call has finished. Since user defined functions are callable from inside user defined functions, this cannot be achieved by just a simple buffer in the zero-page (meaning, there may be more than one value to be saved at any given time), rather, the contents of the variable body is pushed to the processor stack and eventually restored from this.
So, in effect, the function parameter is actually a local variable, shadowing any global variable going by the same name:
10 X=1
20 DEFFNR(X)=1+X*X
30 PRINT X
40 PRINT FNR(2)
50 PRINT X
RUN
1
5
1
READY.
█
Moreover, even if there is no conflict, the function parameter/argument isn’t accessible from outside:
10 DEFFNR(X)=1+X*X
20 PRINT FNR(2)
30 PRINT X
RUN
5
0
READY.
█
So, why does this use a global variable at all? Well, as we’ve seen before, the function body of a user defined function is executed like any other BASIC statement. The single difference is that the execution ends at the first colon or end-of-line, causing the BASIC interpreter to return to its previous context. This way, the function parameter/argument is accessible to normal BASIC execution inside the function body.
Still, we may note that the effort taken to make this behave like a local variable (5 pushes and 5 pulls to and from the stack, together with the reads and writes that go with this) somewhat counteracts the efficiency suggested by the argument pointer tapping directly into the variable body.
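The save-and-restore behavior described above can be modeled in a few lines of Python. This is a simplified simulation, not the ROM code: a dictionary stands in for the global variables, a list stands in for the processor stack, and one stack entry represents the 5 bytes of a variable body. The names `call_fn`, `fnr` and `fns` are made up for this example, and `fns` is a hypothetical second function (as if defined by DEFFNS(X)=FNR(X+1)+X) added only to show nesting.

```python
variables = {"X": 1.0}      # global BASIC variables (10 X=1)
stack = []                  # stands in for the processor stack


def call_fn(body, parameter, argument):
    """Evaluate a user defined function with a shadowed parameter."""
    stack.append(variables.get(parameter, 0.0))   # save the current value
    variables[parameter] = argument               # the argument goes into the global
    try:
        return body()                             # body reads the global as usual
    finally:
        variables[parameter] = stack.pop()        # restore the saved value


def fnr():
    """Body of FNR(X)=1+X*X."""
    return 1 + variables["X"] * variables["X"]


def fns():
    """Hypothetical nested case: FNS(X)=FNR(X+1)+X."""
    return call_fn(fnr, "X", variables["X"] + 1) + variables["X"]


print(variables["X"])            # value before the call
print(call_fn(fnr, "X", 2))      # FNR(2)
print(variables["X"])            # unchanged after the call
print(call_fn(fns, "X", 2))      # FNR(3) + 2, with two values saved at once
```

The nested call is the reason a single fixed buffer would not suffice: while the inner function runs, two saved values for “X” are pending at the same time, which a stack handles naturally.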
BTW, if you want to have a look at the new version of the PET emulator, here it is running all the latest demos:
Norbert Landsteiner,
Vienna, 2023-03-15 (revised text version 2024-11-12)