Episode 9: Quest for Speed
After last episode's midterm fun, we're back to more serious considerations regarding our little program.
Ambitious Ideas
I'm not too happy about the rendering speed of the LCD and am consequently musing about machine language. Maybe we could define an integer array and compile a little program for the display into the space reserved by it. The space required for the display data could be reserved in the same array or, even better, in an integer array of its own.
My ideas regarding the ML program (Intel 8085) went as far as follows (based on a demo program in the PC-8201A Technical Manual; NEC Corporation, 1984):
; writing up to 50 bytes to the display

OFF75   EQU 765Ch   ; addr. disable interrupts (Model 100)
ON75    EQU 743Ch   ; addr. enable interrupts (Model 100)
PORTC   EQU 0FEh
PORTD   EQU 0FFh

; address count is decimal!

LCDWRT:
00  CD 5C 76    CALL OFF75      ; disable interrupts
03  CD 21 00    CALL LCDBUSY    ; wait for the display being ready
06  3A 28 00    LDA  PGOFS      ; load page and offset
09  D3 FE       OUT  PORTC      ; send it to the display
11  21 2C 00    LXI  H,DATA     ; get data address
14  3A 2A 00    LDA  COUNT      ; get number of bytes
17  4F          MOV  C,A        ; move it to C
WRITE:
18  CD 21 00    CALL LCDBUSY    ; wait for the display
21  7E          MOV  A,M        ; get byte
22  D3 FF       OUT  PORTD      ; write it to the display
24  23          INX  H          ; increment HL registers
25  0D          DCR  C          ; decrement C
26  C2 12 00    JNZ  WRITE      ; redo, if not zero
29  CD 3C 74    CALL ON75       ; enable interrupts
32  C9          RET             ; return
LCDBUSY:
33  DB FE       IN   PORTC      ; read status
35  07          RLC             ; rotate to get busy state
36  DA 21 00    JC   LCDBUSY    ; repeat, if busy
39  C9          RET             ; return
40  PGOFS:  0,0                 ; page and pixel offset
42  COUNT:  0,0                 ; length of data
44  DATA:   0                   ; data starts here
△! Caution: Assembled by hand, don't trust me!
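Since the listing was assembled by hand, the address column can at least be cross-checked off the machine. The following is a minimal sketch in Python (not part of the original program) that sums the lengths of the 8085 instructions used above and reports the offset of each label. The table layout, the helper name label_offsets, and the DW placeholder for the two-byte slots are assumptions made for illustration only.

```python
# Instruction lengths in bytes for the 8085 mnemonics used in the listing.
# "DW" is a placeholder for the two-byte PGOFS and COUNT slots.
SIZE = {
    "CALL": 3, "LDA": 3, "LXI": 3, "JNZ": 3, "JC": 3,
    "OUT": 2, "IN": 2, "DW": 2,
    "MOV": 1, "INX": 1, "DCR": 1, "RLC": 1, "RET": 1,
}

# (label or None, mnemonic or None) in listing order.
PROGRAM = [
    ("LCDWRT", "CALL"), (None, "CALL"), (None, "LDA"), (None, "OUT"),
    (None, "LXI"), (None, "LDA"), (None, "MOV"),
    ("WRITE", "CALL"), (None, "MOV"), (None, "OUT"), (None, "INX"),
    (None, "DCR"), (None, "JNZ"), (None, "CALL"), (None, "RET"),
    ("LCDBUSY", "IN"), (None, "RLC"), (None, "JC"), (None, "RET"),
    ("PGOFS", "DW"), ("COUNT", "DW"),
    ("DATA", None),  # data area starts here, length not counted
]


def label_offsets(program):
    """Return {label: decimal offset} by accumulating instruction sizes."""
    offsets, pos = {}, 0
    for label, op in program:
        if label:
            offsets[label] = pos
        if op:
            pos += SIZE[op]
    return offsets


if __name__ == "__main__":
    for name, ofs in label_offsets(PROGRAM).items():
        print(f"{name:8s} {ofs:3d}  ({ofs:02X}h)")
```

This only verifies the offsets, not the opcode bytes themselves, so the caution above still applies.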
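In the listing, the addresses of OFF75 and ON75 are machine dependent; the operands referring to local labels (WRITE, LCDBUSY, PGOFS, COUNT, DATA) are to be converted to absolute addresses; and the contents of PGOFS, COUNT, and DATA are to be set by BASIC before each call.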
All we would have to do is define an integer array of half the length of the program (an integer is 2 bytes) plus 50 bytes for the data, as in "DIM SP%(48)", get the start address by "AD=VARPTR(SP%(0))", and fix up all the addresses. That is: OFF75 and ON75, depending on the model used, and the local addresses WRITE (AD+18), LCDBUSY (AD+33), PGOFS (AD+40), COUNT (AD+42), and DATA (AD+44).
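The fix-up step can be modelled off the machine. Below is a minimal sketch in Python, assuming the 44 bytes of the listing above as the image and a purely hypothetical base address of F000h standing in for whatever VARPTR would return; the names IMAGE, LOCAL_FIXUPS, and relocate are made up for this illustration. It adds the base address to every 16-bit operand that refers to a local label and leaves the machine-dependent OFF75/ON75 operands alone.

```python
import struct

# Image as in the listing (44 bytes, data area not included).
# 0x4F is MOV C,A and 0xDA is JC in the 8085 opcode table.
IMAGE = bytes.fromhex(
    "CD5C76" "CD2100" "3A2800" "D3FE" "212C00" "3A2A00" "4F"
    "CD2100" "7E" "D3FF" "23" "0D" "C21200" "CD3C74" "C9"
    "DBFE" "07" "DA2100" "C9"
    "0000" "0000"
)

# Offsets of the 16-bit operands that hold local addresses:
# LCDBUSY, PGOFS, DATA, COUNT, LCDBUSY, WRITE, LCDBUSY.
LOCAL_FIXUPS = [4, 7, 12, 15, 19, 27, 37]


def relocate(image, base):
    """Return a copy of image with all local operands made absolute."""
    out = bytearray(image)
    for ofs in LOCAL_FIXUPS:
        (local,) = struct.unpack_from("<H", out, ofs)  # little-endian word
        struct.pack_into("<H", out, ofs, (base + local) & 0xFFFF)
    return bytes(out)


if __name__ == "__main__":
    base = 0xF000  # hypothetical start address of the array
    print(relocate(IMAGE, base).hex(" "))
```

On the real machine, the same additions would be done once in BASIC after the array has been dimensioned, since the array's address is only known at runtime.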
In case you're wondering: PGOFS and COUNT are 2 bytes each in order to match a subscript of the integer array. Integers are stored in little-endian order, so any even offset address corresponds to an array index and can be written to directly from BASIC. Preferably, we put all the data in an array of its own, thus starting at subscript 0.
Once again, the NEC PC-8201A's BASIC (N82-BASIC) is different, as it has no implementation of VARPTR. But there's a little piece of BASIC/ML code to emulate it, so this wouldn't really pose a problem.
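The mapping between byte offsets and array subscripts can be sketched as follows. This is an illustration in Python, assuming 2-byte little-endian integers and that subscript 0 sits at the array's start address; the helper names subscript_for and integer_bytes are made up for this example.

```python
import struct


def subscript_for(offset):
    """Array subscript overlaying an even byte offset (2 bytes per integer)."""
    if offset % 2:
        raise ValueError("odd offsets straddle two array elements")
    return offset // 2


def integer_bytes(value):
    """Bytes of a signed 16-bit integer in little-endian order."""
    return struct.pack("<h", value)


if __name__ == "__main__":
    for name, ofs in (("PGOFS", 40), ("COUNT", 42), ("DATA", 44)):
        print(name, "-> subscript", subscript_for(ofs))
    # A count of 50 ends up with its low byte first,
    # which is the byte a single-byte LDA picks up.
    print(integer_bytes(50).hex(" "))
```

Because the low byte comes first, a one-byte read at the even offset sees the value assigned from BASIC, as long as that value fits into 8 bits.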
(See VARPTR.NEC by Steve Sarna, 11/12/84.)
Transferring Data
There's another, more interesting question here, regarding the method of talking to our little program. Obviously, it's nice to set the page, offset, and the number of bytes to write just by a BASIC assignment to an integer subscript. Should we do this with the up to 50 bytes of display data as well (thus having to duplicate the increment on HL, the INX H at offset 24, in order to advance by two bytes at once), or should we rather POKE the values directly into the data space?
Obviously, we would do this in a loop, and we would probably have the appropriate array subscript already at hand, while we would have to add it to the base address when using POKE. Also, subscripts would be all integer, whereas POKEing would require at least single-precision variables for the addresses.
So, what would be faster?
Surprise
Let's have two little, mostly identical test programs:
10 REM (1) Using Subscripts
100 DEFSNG A, DEFINT B-Z
110 DIM B(49)
120 PRINT TIME$
130 FOR I=0 TO 100
140 FOR J=0 TO 49:B(J)=J:NEXT
150 NEXT
160 PRINT TIME$:END

10 REM (2) Using Pokes
100 DEFSNG A, DEFINT B-Z
110 DIM B(49):A=VARPTR(B(0))
120 PRINT TIME$
130 FOR I=0 TO 100
140 FOR J=0 TO 49:POKE A+J, J:NEXT
150 NEXT
160 PRINT TIME$:END
On the Olivetti M10, program (1) runs for 18 seconds, while program (2) needs 20 seconds to finish. Bummer!
So POKE is noticeably slower, and it stays slower even without the addition to assemble the target address. (Tested.) Bummer, again!
This is really slow, especially with regard to all the display data we have to write. So, would we gain anything by first writing the display data to memory and then displaying it by our machine language routine?
Let's have another program, testing the relative performance of BASIC's OUT command, which we're using for talking to the display:
10 REM (3) Testing Performance of OUT (Olivetti M10)
100 DEFSNG A, DEFINT B-Z
110 PA=185:PB=186:PC=254:PD=255:B=4
120 CLS:PRINT "a":CALL 29558:OUT PA,2:OUT PB,0
130 FOR I=0 TO 100:OUT PC,0
140 FOR J=0 TO 49:OUT PD,B:NEXT
150 NEXT
160 CALL 28998:PRINT "b":END
We can't use TIME$ here, since we have interrupts disabled, and with them the clock ticks. So we just write "a" and "b" to the display and keep an eye on the watch to get the interval. And this, dear reader, is below 15 seconds!
In other words, assigning anything to a variable or memory location is already slower than sending the same value to the display via the processor's I/O port.
Maybe it's the data lookup that is missing in version (3)? Let's modify line 140 accordingly:
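To put the three measurements into perspective, the per-iteration cost can be worked out from the loop bounds. The following is a small back-of-the-envelope sketch in Python; it takes the stopwatch figures quoted above at face value (15 seconds being an upper bound for the OUT version) and includes the FOR/NEXT overhead in each figure. The names ITERATIONS, RUNTIME_S, and ms_per_iteration are made up for this illustration.

```python
# FOR I=0 TO 100 runs 101 times, FOR J=0 TO 49 runs 50 times.
ITERATIONS = 101 * 50

# Runtimes in seconds as quoted in the text.
RUNTIME_S = {
    "(1) subscript assignment": 18,
    "(2) POKE": 20,
    "(3) OUT (upper bound)": 15,
}


def ms_per_iteration(seconds):
    """Average time per inner-loop pass in milliseconds, loop overhead included."""
    return seconds * 1000 / ITERATIONS


if __name__ == "__main__":
    for name, seconds in RUNTIME_S.items():
        print(f"{name:26s} {ms_per_iteration(seconds):.2f} ms")
```

With 5050 passes per run, each pass costs somewhere between roughly 3 and 4 milliseconds, so the differences between the variants are well under a millisecond per byte.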
10 REM (4) Testing Performance of OUT with subscripts (Olivetti M10)
100 DEFSNG A, DEFINT B-Z
110 DIM B(49):PA=185:PB=186:PC=254:PD=255
120 CLS:PRINT "a":CALL 29558:OUT PA,2:OUT PB,0
130 FOR I=0 TO 100:OUT PC,0
140 FOR J=0 TO 49:OUT PD,B(J):NEXT
150 NEXT
160 CALL 28998:PRINT "b":END
No difference at all! Lookups of individual subscripts of the integer array B come virtually for free.
So, interrupts are turned off, maybe this is making the difference?
Let's have a look at subscripted assignments again (change in line 140, everything else the same as in v. 4):
10 REM (5) Testing Performance of subscripts w/o interrupts (Olivetti M10)
100 DEFSNG A, DEFINT B-Z
110 DIM B(49):PA=185:PB=186:PC=254:PD=255
120 CLS:PRINT "a":CALL 29558:OUT PA,2:OUT PB,0
130 FOR I=0 TO 100:OUT PC,0
140 FOR J=0 TO 49:B(J)=J:NEXT
150 NEXT
160 CALL 28998:PRINT "b":END
No difference to program (1), again!
(Mind that we're measuring time in seconds rather than milliseconds or ticks. The minor differences in runtime caused by interrupts don't show up, because they are too tiny.)
We conclude that assignments of any kind (and this includes POKE) are generally a speed killer in MS BASIC. It's interesting that a few arithmetic operations or lookups don't really make a difference, but assignments do.
Using OUT to drive the display directly from BASIC is in any case faster than any method of transferring the data to memory first. That's actually a pity, since the idea of assembling the data first in BASIC and then writing it rapidly to the display by an ML subroutine would have had some charm to it. Especially since, as it is, we spend most of the time assembling the data in BASIC with interrupts disabled, which doesn't really recommend itself as the method of choice for a game that should be doing networking, too.
Enough for today. With a nod to networking performance, the ML approach may still be a viable option.
2016-01-18, Vienna, Austria
www.masswerk.at – contact me.
— This series is part of Retrochallenge 2016/01. —