A Tricky Bit of SNES Code

I’ve been working to disassemble and comment the source to a certain SNES cartridge. I’ve decided to write all the tools from the ground up — as a learning experience. I’m making good progress and, at the current rate, I should have something reasonable in 4 or 5 months. I’m happy with that. I just hope I can keep it going long enough to produce something worth posting. Today, I wanted to explain why I’ve had a harder time than I expected writing the disassembler and show a piece of particularly tricky (for someone, like myself, who has never worked with the 6502 or 65816) code.

Writing the disassembler has been tough. I’ve never written one before, but I believe the quirks of the 65816 make it a little more of a pain than usual. Disassembling a flat binary isn’t too bad if you have a few entry points (typically interrupt vector locations); you follow the control flow graph visiting every connected instruction and mark the rest as data. Of course there are a few stumbling blocks:

Indirect or computed jumps are not so simple to deal with. For now, my method is the human method — taking a look and adding a map from indirect jump addresses to a list of possible destinations. My brain fantasizes that using abstract interpretation may give enough information to reason this out occasionally (especially with the use of a constraint solver like STP). Of course, I don’t have much confidence in the idea. I doubt I will try it, but I do continue to think about it.
Not playing nice with CALL or JSR instructions. A call instruction pushes the return address on the stack. Some pieces of code (including the one I talk about below) will either modify this stack variable or remove it from the stack and use it to compute the next jump. This makes it very difficult to follow control flow and breaks the call/return semantics we usually assume.

Those are the usual problems encountered when writing a disassembler. The 65816 adds another layer of pain: a particular opcode can have a variable sized argument depending on a runtime flag state. Huh? Let me say that again, the interpretation of an instruction can vary over two dynamic execution. Example: The bytes A9 00 60 have two very different interpretations based on the value of the M flag in the P (Processor Status) register. If M is set (8-bit accumulator/memory access):

008000:    lda #$0x00
008002:    rts

If M is unset (16-bit accumulator/memory access):

008000:    lda #$0x6000

Can you see how this might make things difficult to debug statically? My solution is to associate a flag state with each address I encounter. The instruction isn’t decoded until the state of the M and X (the X and Y registers can also be dynamically sized) are known. I assume that once one path hits the instruction the value for the flags is valid. In other words, no two paths result in conflicting flag settings. This assumption could be invalid. Take the case of a path which is not valid due to the constraints placed on external data. If that path is evaluated first, the flag settings could be invalid and so my further processing would be invalid. In practice, I don’t think this would happen outside of intentional obfuscation. In other words, my method is a simple data flow analysis without the fix-point iteration. (In case you were wondering, this is why I didn’t just add the (human generated) indirect jump destination addresses to the set of entry points — they need to be associated with a source jump to have a flags value.)

I now have my disassembler chugging along and have been working from both a reset and a VSYNC interrupt. I’ve commented about 16 pages of assembly and have a feel for the main game loop. I’m just getting to the real logic — so far it has just been bookkeeping for the GFX hardware and sound co-processor. In my analysis of the main game loop, I’ve encountered some super awesome code I had to share with someone:

0cc135: jsr $00879c
00879c: sty $05
00879e: ply
00879f: sty $02
0087a1: rep #$30
0087a3: and #$00ff
0087a6: sta $03
0087a8: asl A
0087a9: adc $03
0087ab: tay
0087ac: pla
0087ad: sta $03
0087af: iny
0087b0: lda [$02],Y
0087b2: sta $00
0087b4: iny
0087b5: lda [$02],Y
0087b7: sta $01
0087b9: sep #$30
0087bb: ldy $05
0087bd: jmp [$0000]

Below the fold is the translated version of this gem.

If you squint a bit and happen to know 65816 assembly, you can turn the above mess into:

void jump_table_follows_call(BYTE index) {
    BYTE jump_addr[3];

    jump_addr[0] = retaddr + index * 3;
    jump_addr[1] = retaddr + index * 3 + 1;
    jump_addr[2] = retaddr + index * 3 + 2;

    long_jump(jump_addr);
}

Let’s walk through the decompilation:

0cc135: jsr $00879c

A long subroutine call. This instruction will push the current address (return address = current address + 1) and jump to $00879c. Since it is a long jump, all 3 bytes of the address will be pushed,

00879c: sty $05

Save the value of Y. Normally, a ‘phy’ (Push Y) would be used, but the return address needs to be on top of the stack.

00879e: ply
00879f: sty $02

Put the first byte (LSB) of the return address into Y.

0087a1: rep #$30
0087a3: and #$00ff
0087a6: sta $03
0087a8: asl A
0087a9: adc $03

The ‘rep’ instruction sets the M and X flags. This means from here on A, X, and Y are 16-bits wide. Then, the accumulator (A) is masked to keep only the low byte. A is stored at location $03, shifted left once and then added to the value in $03. This gives us A = A * 3.

0087ab: tay
0087ac: pla
0087ad: sta $03

Y is overwritten with A. A then gets the two MSBs of the return address. Then, the two bytes of A are written to $03. At this point, the 3 bytes starting at $02 contain the return address and Y contains (the value of A at entry to the subroutine) * 3.

0087af: iny
0087b0: lda [$02],Y
0087b2: sta $00
0087b4: iny
0087b5: lda [$02],Y
0087b7: sta $01

Y is incremented once and the two bytes at (return address ($02)) + Y are written to $00 and $01. Then, Y is incremented again and the two bytes at (return address ($02)) + Y are written to $01 and $02. Note: Y is incremented before anything is read because the value on the stack on the entry to a subroutine is 1 less than the return address. I’ve been using the term return address to simplify the explanation. It has really been the “current address” at the time of the jump — in other words, return address – 1.

0087b9: sep #$30
0087bb: ldy $05
0087bd: jmp [$0000]

Finally, the flags and Y are restored and the processor jumps to the long (3-byte) address at $0000.

Hope you find that as interesting as me. I also hope to have the tools a bit more polished soon. I plan to post them here when I’m less unhappy with them. In the meantime, look at the pretty graphs:
CFG rooted at reset (small, medium). Blue arrows are subroutine calls; black arrows are branches.

TODO: Still need to write up the PathWords hack. Also, considering moving hosting to wordpress.com — I really don’t feel like keeping up with the wordpress releases. Alternatively, go to a static system.

Written by dionthegod

November 23, 2008 at 7:12 am

Posted in Uncategorized

Stupid Compiler