Several years ago, I was in a ROM hacking IRC room where another regular Alchemic was reverse engineering the text system of the SNES game Shadowrun. He figured it out and wrote a python script to decompress the text but had some questions about why it was designed the way it was. So we're going to walk through figuring out how the code works, with some help from his notes, and try to understand the design.
If you don't want spoilers and would rather try to reverse engineer it yourself, just read up to the end of the Trace format section and see how much you can figure out on your own.
The starting point for our reverse engineering will be an instruction-level trace of the text decompression routine which Alchemic provided in the IRC discussion.
To interpret it, we will need documentation on the SNES's processor. That unofficial documentation of the 65xx processor family is credited to Mark Ormston and is a surprisingly entertaining read for a processor manual due to its completeness and informal style that you don't expect for such a document; I actually got distracted just reading it while writing this post.
The specific 65xx variant used in the SNES is the 5A22 which based on the 65c816, so for the purposes of that document, we are looking at the 65c816.
The trace has one line for each instruction executed:
$80/FE49 9C 22 02 STZ $0222 [$7E:0222] A:C2D5 X:001D Y:0008 P:envmxdIzc $80/FE4C E2 20 SEP #$20 A:C2D5 X:001D Y:0008 P:envmxdIzc $80/FE4E A0 00 00 LDY #$0000 A:C2D5 X:001D Y:0008 P:envMxdIzc $80/FE51 C2 20 REP #$20 A:C2D5 X:001D Y:0000 P:envMxdIZc $80/FE53 A9 00 00 LDA #$0000 A:C2D5 X:001D Y:0000 P:envmxdIZc $80/FE56 AA TAX A:0000 X:001D Y:0000 P:envmxdIZc
The first column is the program counter (PC), the location in the
program code where the instruction is. After that is the
machine code of the instruction in bytes followed by the
assembly mnemonic for the instruction. For some of the
instructions, there is an address in square brackets (
the actual value for any computed addresses. At the end is the values
of the registers and flags before the instruction is executed.
This part of the trace uses a feature of the 65c816 processor which
is important to be aware of, which is the "Memory size flag" which is
listed in the flags as
m when clear (memory reads/writes are 16-bit)
M when set (memory reads/writes are 8-bit).
SEP #$20 sets the
flag putting the processor into 8-bit mode, and
REP #$20 clears it,
putting the processor into 16-bit mode.
Since this is a trace of what we expect to be a relatively simple and repetitive piece of code, the same instructions should be executed over and over again, which we will know by the PC in the first column being the same. We can search for other instances of the same instruction by searching for its PC, which is particularly useful for finding instances of a conditional branch that go the other way to better understand the branch.
Starting our analysis
That code quoted above just clears all of the registers and the memory
$7E:0222, presumably because it expects to use those for its
temporary values and doesn't want to get confused by whatever previous
values they had.
Identifying the variables
Since we don't have variable names, to get a high-level view of what
the code is doing, we will first look at the reads from and writes
to memory, and then try to figure out how they relate to each other.
Those are done with the
LDA ("LoaD Accumulator with value") and
("STore Accumulator into memory") instructions which transfer data
between memory and the accumulator (register
sed command selects out all of the loads and stores
(ignoring loads with arguments that start with
# because that just
loads a constant value):
$ sed -n '/\(LDA\|STA\)/s/[^SL]*\([SL][TD]A [^\#][^ *\).*/\1/p' texttrace.txt | sort | uniq LDA $9D8000,x LDA [$CF] STA $0222 STA $0224 STA $21A0,y STA $CF
There are loads from
[$CF] (and also stores to
$9D8000,x syntax is the "Absolute Long with X Indexing" addressing
mode, which means that it reads from an address in memory computed as a
constant 24-bit address plus an offset in the
X register, so
is some sort of data structure which is indexed by some computed value.
[$CF] syntax is the "Direct Page Indirect Long" addressing mode,
which means that an address is read from
$CF, which is "Direct Page"
addressing referring to memory location
$0000CF. In other words, it
deferences the pointer stored in temporary location
One of those two has to be the actual data. As the text could be
anywhere, and, in particular, there is likely more than 64 kibibytes of
it, most likely
LDA [$CF] loads the actual text data.
Looking further into this, if look just before the
instructions, we notice they occur after a conditional addition of 2 to
X register (by two
INX "INcrement X" instructions in a
$ grep -F -B3 'LDA $9D8000,x' texttrace.txt $80/FE5B 90 02 BCC $02 [$FE5F] A:0000 X:0000 Y:0000 P:eNvmxdIzC $80/FE5D E8 INX A:0000 X:0000 Y:0000 P:eNvmxdIzC $80/FE5E E8 INX A:0000 X:0001 Y:0000 P:envmxdIzC $80/FE5F BF 00 80 9D LDA $9D8000,x[$9D:8002] A:0000 X:0002 Y:0000 P:envmxdIzC -- ... -- $81/AED7 0E 24 02 ASL $0224 [$7E:0224] A:04EC X:04EC Y:0000 P:envmxdIzC $81/AEDA 6B RTL A:04EC X:04EC Y:0000 P:envmxdIzc $80/FE5B 90 02 BCC $02 [$FE5F] A:04EC X:04EC Y:0000 P:envmxdIzc $80/FE5F BF 00 80 9D LDA $9D8000,x[$9D:84EC] A:04EC X:04EC Y:0000 P:envmxdIzc -- ...
For now we are not worrying about what the
BCC $02 is conditioned on,
just that the lookup is a computation like that suggests that it is a
lookup table of some sort, not the text data.
Similarly looking around the uses of
$CF, we notice that after every
LDA [$CF], there's a double
INC $CF a couple instructions later,
$CF is pointing at a piece of memory being read
linearly, making it a much better candidate for being the compressed
text. Additionally, the very first line of the trace is
STA $CF. The
trace is for the text decompression function and it would make sense
for it to take as input in the accumulator the pointer to the text to
We already figured out
$CF, but that leaves
$21A0,y. The first two are single locations (using the "Absolute"
addressing mode), so they are temporary variables used in the
computation while the last one is indexed by
Y so it is an output to
some range of memory.
Since we expect an output of the decompressed text, most likely
$21A0,y is where the decompressed text gets written out to. As
evidence, we see the writes are to adjacent increasing indexes:
$ grep -F -A1 'STA $21A0,y' texttrace.txt $80/FE6F 99 A0 21 STA $21A0,y[$7E:21A0] A:6F59 X:04F8 Y:0000 P:envMxdIzC $80/FE72 C8 INY A:6F59 X:04F8 Y:0000 P:envMxdIzC -- $80/FE7A 99 A0 21 STA $21A0,y[$7E:21A1] A:596F X:04F8 Y:0001 P:envMxdIzC $80/FE7D C8 INY A:596F X:04F8 Y:0001 P:envMxdIzC -- $80/FE6F 99 A0 21 STA $21A0,y[$7E:21A2] A:2075 X:0374 Y:0002 P:envMxdIzC $80/FE72 C8 INY A:2075 X:0374 Y:0002 P:envMxdIzC -- ...
Note that the
M flag is set so this is writing only one byte at time,
which is why there is only a single
INY ("INcrement Y") instruction
Here I have to admit I simplified a bit above:
STA isn't the only
instruction that can modify memory:
$ grep -F '$0222' texttrace.txt | cut -d' ' -f8-9 | sort | uniq -c 40 DEC $0222 3 STA $0222 1 STZ $0222 $ grep -F '$0224' texttrace.txt | cut -d' ' -f8-9 | sort | uniq -c 40 ASL $0224 3 STA $0224
I included the counts with
uniq -c to highlight the
fact that most of the mutations of these values are in fact not
$0222 gets decremented (
$0224 gets bit shifted a lot (
ASL is "Arithmetic
The fact that both counts are 40 seems unlikely to be a coincidence.
Looking closer we notice that the
DEC $0222 is followed by a
("Branch on PLus") instruction and usually branches straight to the
ASL $0224 instruction but sometimes does some additional work in the
$81/AEC0 CE 22 02 DEC $0222 [$7E:0222] A:04AC X:04AC Y:0006 P:envmxdIzC $81/AEC3 10 12 BPL $12 [$AED7] A:04AC X:04AC Y:0006 P:eNvmxdIzC ... $81/AED7 0E 24 02 ASL $0224 [$7E:0224] A:04AC X:04AC Y:0006 P:envmxdIzC
BPL branches if the
n ("Negative") flag is clear. The
in the processor documentation says the effect on the
n - Was the high bit of the result set?
In other words,
n is set if the new value is negative as the high bit
is the sign bit. So
$0222 is a value that counts the times
ASL $0224 has been executed and does some extra work whenever it hits
To find out what it is counting down from, we look for
$ grep -F -B1 'STA $0222' texttrace.txt $81/AEC0 CE 22 02 DEC $0222 [$7E:0222] A:0000 X:0000 Y:0000 P:envmxdIZc $81/AEC3 10 12 BPL $12 [$AED7] A:0000 X:0000 Y:0000 P:eNvmxdIzc $81/AEC5 48 PHA A:0000 X:0000 Y:0000 P:eNvmxdIzc $81/AEC6 A9 0F 00 LDA #$000F A:0000 X:0000 Y:0000 P:eNvmxdIzc $81/AEC9 8D 22 02 STA $0222 [$7E:0222] A:000F X:0000 Y:0000 P:envmxdIzc -- ...
It always gets set to the constant
#$000F or 15 whenever it goes below
zero, so it is counting off 16 shifts, which makes sense for a 16-bit
value being stored in
$0224 being shifted off one bit at a time.
So what is that value stored in
$ grep -F -B2 'STA $0224' texttrace.txt $81/AECC A7 CF LDA [$CF] [$9D:C2D5] A:000F X:0000 Y:0000 P:envmxdIzc $81/AECE EB XBA A:B0F0 X:0000 Y:0000 P:eNvmxdIzc $81/AECF 8D 24 02 STA $0224 [$7E:0224] A:F0B0 X:0000 Y:0000 P:eNvmxdIzc -- ...
It's the value loaded from
[$CF], which we guessed above is the
pointer to the text data. So
$0224 contains the current two bytes
of compressed text data being decompressed.
XBA ("XBA eXchange B (upper 8bits of accumulator) and A (lower
8bits of accumulator)") instruction in there is the confusing part which
is why I went to analyze this in the first place: to try to understand
the code well enough to figure out why the
XBA is there as it seems
like it would be simpler to just store the data in the ROM such that the
XBA instruction is unneeded.
Analyzing the logic
Now that we have some concept of the what variables do, it should be easier to look at the code in terms of functions.
After the header code quoted above, the first instruction is a
("Jump Subroutine Long"), which is the 65c816 instruction for calling
a function. It pairs with
RTL ("ReTurn from subroutine Long") which
returns from a function called using
$ grep -Fc 'JSL $81AEC0' texttrace.txt 40
We see that number 40 again, so we expect this to be where the
ASL $0224 instructions happen. Let's look through the code
$80/FE57 22 C0 AE 81 JSL $81AEC0[$81:AEC0] A:0000 X:0000 Y:0000 P:envmxdIZc $81/AEC0 CE 22 02 DEC $0222 [$7E:0222] A:0000 X:0000 Y:0000 P:envmxdIZc $81/AEC3 10 12 BPL $12 [$AED7] A:0000 X:0000 Y:0000 P:eNvmxdIzc
This enters the function and runs the following code only if the value
$0222 on entry is zero (by decrementing it and then checking if it
is negative afterward). If
$0222 is positive entering the function,
then execution skips ahead to the
$81/AEC5 48 PHA A:0000 X:0000 Y:0000 P:eNvmxdIzc
PHA ("PusH Accumulator") stashes the accumulator on the stack so the
accumulator can be used for something else. The value in the accumulator
isn't actually used, so in this case it would be okay to clobber it,
but the compiler doesn't know that so it stashes it anyway. This is
good evidence that this code was generated by a compiler and not
$81/AEC6 A9 0F 00 LDA #$000F A:0000 X:0000 Y:0000 P:eNvmxdIzc $81/AEC9 8D 22 02 STA $0222 [$7E:0222] A:000F X:0000 Y:0000 P:envmxdIzc
0xf (15 in decimal).
$81/AECC A7 CF LDA [$CF] [$9D:C2D5] A:000F X:0000 Y:0000 P:envmxdIzc $81/AECE EB XBA A:B0F0 X:0000 Y:0000 P:eNvmxdIzc $81/AECF 8D 24 02 STA $0224 [$7E:0224] A:F0B0 X:0000 Y:0000 P:eNvmxdIzc $81/AED2 E6 CF INC $CF [$00:00CF] A:F0B0 X:0000 Y:0000 P:eNvmxdIzc $81/AED4 E6 CF INC $CF [$00:00CF] A:F0B0 X:0000 Y:0000 P:eNvmxdIzc
$0224 to the big endian two byte value pointed at by
XBA to switch the endianness) and increment
point to the next two byte value.
$81/AED6 68 PLA A:F0B0 X:0000 Y:0000 P:eNvmxdIzc
PLA ("PuLl Accumulator") restores the accumulator from the stack;
as mentioned above, this has no effect on the behavior of the program
because that value is never used.
$81/AED7 0E 24 02 ASL $0224 [$7E:0224] A:0000 X:0000 Y:0000 P:envmxdIZc $81/AEDA 6B RTL A:0000 X:0000 Y:0000 P:eNvmxdIzC
The function ends with the
ASL $0224 instruction. The relevant effect
of the shift is that the bit shifted off is stored in the
flag (and moves the next bit into position for the next run of that
JSL $81AEC0 function puts the next bit of the string of bits
$CF (the pointer to the compressed text in the accumulator
when the function is called) into the carry flag.
Now that we know what that function does, we can skip over it when
analyzing the rest of the code. After it returns, the next instruction
BCC ("Branch on Carry Clear") which reads the carry flag that it sets.
$80/FE5B 90 02 BCC $02 [$FE5F] A:0000 X:0000 Y:0000 P:eNvmxdIzC $80/FE5D E8 INX A:0000 X:0000 Y:0000 P:eNvmxdIzC $80/FE5E E8 INX A:0000 X:0001 Y:0000 P:envmxdIzC $80/FE5F BF 00 80 9D LDA $9D8000,x[$9D:8002] A:0000 X:0002 Y:0000 P:envmxdIzC
If the carry flag is clear, it skips over the
INX instructions, so the
logic is if the next bit of the compressed text is set then increment
the index into the lookup table
$9D8000 of two byte values.
Use of lookup table
The first use of the loaded value isn't actually reading the value: it's
reading the flags.
LDA sets the
n ("negative") flag if the value
loaded is negative, and the next instruction is a
BPL which branches
conditioned on that flag:
$80/FE63 10 F1 BPL $F1 [$FE56] A:04F0 X:04EC Y:0000 P:envmxdIzc
If the loaded value is positive (high bit clear), then it's stored
X register as the new offset for the next read and then it
restarts the loop with reading the next bit:
$80/FE56 AA TAX A:04F0 X:04EC Y:0000 P:envmxdIzc $80/FE57 22 C0 AE 81 JSL $81AEC0[$81:AEC0] A:04F0 X:04F0 Y:0000 P:envmxdIzc
If the loaded value is negative (high bit set), then a lot more happens:
$80/FE65 29 7F 7F AND #$7F7F A:D9EF X:04F8 Y:0000 P:eNvmxdIzc $80/FE68 EB XBA A:596F X:04F8 Y:0000 P:envmxdIzc
Clean up the loaded value by swaping the bytes and clearing the high bits of both of them.
$80/FE69 E2 20 SEP #$20 A:6F59 X:04F8 Y:0000 P:envmxdIzc $80/FE6B C9 0A CMP #$0A A:6F59 X:04F8 Y:0000 P:envMxdIzc $80/FE6D F0 11 BEQ $11 [$FE80] A:6F59 X:04F8 Y:0000 P:envMxdIzC $80/FE6F 99 A0 21 STA $21A0,y[$7E:21A0] A:6F59 X:04F8 Y:0000 P:envMxdIzC $80/FE72 C8 INY A:6F59 X:04F8 Y:0000 P:envMxdIzC
Consider just the low byte; if it's not
newline), then append that byte to the string at
is the current length of that output string.
$80/FE73 EB XBA A:6F59 X:04F8 Y:0001 P:envMxdIzC
Swap to look at the other byte.
$80/FE74 F0 DB BEQ $DB [$FE51] A:596F X:04F8 Y:0001 P:envMxdIzC
If that byte is zero, then skip back to
$FE51 which is in the
initialization code. Specifically, it sets the lookup index
to zero and puts the processor back in 16-bit mode before reading the
next bit of the input:
$80/FE51 C2 20 REP #$20 A:C2D5 X:001D Y:0000 P:envMxdIZc $80/FE53 A9 00 00 LDA #$0000 A:C2D5 X:001D Y:0000 P:envmxdIZc $80/FE56 AA TAX A:0000 X:001D Y:0000 P:envmxdIZc $80/FE57 22 C0 AE 81 JSL $81AEC0[$81:AEC0] A:0000 X:0000 Y:0000 P:envmxdIZc
If the byte is not zero, then it's handled the same as the first byte:
$80/FE76 C9 0A CMP #$0A A:596F X:04F8 Y:0001 P:envMxdIzC $80/FE78 F0 06 BEQ $06 [$FE80] A:596F X:04F8 Y:0001 P:envMxdIzC $80/FE7A 99 A0 21 STA $21A0,y[$7E:21A1] A:596F X:04F8 Y:0001 P:envMxdIzC $80/FE7D C8 INY A:596F X:04F8 Y:0001 P:envMxdIzC
Finally it jumps back to
$FE51 to continue reading the input with the
X register cleared:
$80/FE7E 80 D1 BRA $D1 [$FE51] A:596F X:04F8 Y:0002 P:envMxdIzC
Summary of use of lookup table
Entries in the lookup table consist of two 16-bit values. The bit read
from the input string is used to choose which of the values to read.
The values are either pointers to elsewhere in the lookup table or
characters to add to the output string. The second character may be null
if there's only one character to decompress. Either character may be
0x0A to mark the end of the string.
The lookup table isn't best characterized as a table; it's a
binary tree where
X holds the pointer to the current
node. The 16-bit values are tree nodes. The high bit is a flag
distinguishing inner nodes from leaf nodes. Specifically, it's a
Huffman coding tree, which is a standard form of text
Why is there big endian data?
Now that we know what every instruction in this trace is doing, we can
try to figure out if there is some reason why some of the data is in
big endian format requiring
XBA instructions to fix it because the
65c816 is a little endian processor.
Strangely, while the compressed text and leaf nodes are both in big
endian format and need
XBA instructions after their loads, the
internal pointers in the tree are in little endian format and do not.
This suggests that the compression was done offline by some process
that generated big endian data and the internal pointers in the tree
were generated by the compiler.
As far as I can tell, the
XBA instructions just waste time when the
data could have been in little endian format in the first place.
Which really just brings up another question: how did they manage to
accidentally generate big endian data? Looking at the
endianness support of different processors, most likely
the Huffman encoding was done on a Motorola 68000 which was
used as the processor for several desktop computers including the
early Apple Macintosh.