Friday, May 3, 2013

x86 Assembly: How do Disassemblers know how to break up instructions?

How does a x86 disassembler know where to break up the instructions?

I am looking at the 8088 instruction set. For example the move instruction has 7 variations that range from 2 to 4 bytes. The instructions themselves seem to follow no particular order. Another reason for Why is x86 ugly?.

For example:
                        76543210  76543210  76543210  76543210
reg/mem to/from reg     100010dw  ||regr/m  
imm to reg/mem          1100011w  ||000r/m  dat       dat w=1
imm to reg              1011wreg  data      dat w=1
imm to accum            1010000w  addr-low  addrhigh
accum to mem            1010001w  addr-low  addrhigh
reg/mem to seg          10001100  ||0ssr/m
seg to reg/mem          10001100  ||0ssr/m

Legend:
||=mod {NO-DISP=0,DISP-LOW,DISP-HIGH,REG}
ss=seg enum{es=0,cs,ss,ds}
reg=enum{ax=0,bx,cd,dx,bx,sp,bp,si,di (if w=1)} enum{al,bl...} (if w=0)
r/m=reg or mem (mod=3 then REG, else mem)

many instruction can overlap in the first byte:
                        76543210  76543210  76543210  76543210
push                    11111111  ||110r/m
inc                     1111111w  ||000r/m

The bitmasks appears to have arbitrary assignment. How does a disassembler break apart the instructions?

This question is a sub set of How to write a disassembler.
 ===
Looking at my 8086/8088 Users Manual Programmers reference (ISBN 1-55512-010-5), likely decades out of print...Appendix A shows the instruction decoding in opcode order 0b00000000 thru 0b11111111. Does not appear to be chaotic at all. Add, sub, and, xor, cmp, etc are all grouped in such a way that a mux can use the opcode bits directly to route the inputs and outputs, and other bits select the operation the alu performs on those bits.

For writing a disassembler you want to use this kind of table or an opcode chart for the top level sorting of instructions.

In your particular example, notice how whenever you see the first opcode as 0xFF there are three bits in the middle of the second byte that tell you the rest of the story as to which instruction is which. All 8 of those combinations (one is undefined) are represented and easily decoded from those 3 bits.
Yes, the x86 instruction set is crazy. Interesting and fun features, but considerably better instruction sets have been invented since. The only reason x86 has not gone the way of the 6502 for example is momentum, not quality.

You should look at this one too:
http://stackoverflow.com/questions/3917098/how-are-hex-sequence-translated-to-assembly-without-ambiguity

How to disassemble this and any other variable word length instruction set is by doing it in execution order. You will fail if you try to do it linearly in address order. Start with the vector table to get the entry addresses then follow those instructions in address order, making a note of and following all branches until you hit an unconditional branch or return or other instruction that terminates that string of instructions. Repeat this for every branch destination. That wont cover all of the instructions possible as the code may compute addresses while executing (not much you can do about disassembling that).

If any of this code was hand written intentionally or accidentally to trip up a disassembler you can expect to have collisions where the second or third byte of one opcode based on one execution path appears to be the first opcode of an instruction based on a different execution path. For example a clear a flag instruction followed by a conditional branch if flag is clear, followed by a byte of data, followed by a real instruction that is a branch destination. Yep, I have come across this. And it should be trapped by your disassembler, you need to put checks in to stop disassembling one or both of those execution paths when they collide. For complete disassembly expect to have to support some sort of user input to exclude addresses as opcodes, as well as for the user to manually add valid opcodes for you to follow the execution path from.

For fixed length instruction sets you can easily disassemble in address or execution order, your choice, address order from 0 to the end of memory is the easiest of course. Dont error out on undefined instructions, just mark them as such and keep going, some of those are data.

x86 is definitely the LAST variable length instruction set I would attempt to disassemble and I have written many disassemblers. No desire to ever attempt that project. Start with some fixed length ones like the pic and arm/thumb. Try the msp430 for variable word length, then maybe the 6502 (asteroids, asteroids deluxe, lunar lander, etc). Maybe a week or two worth of evenings to cover the above and get the feel for it, then attack the x86 if the desire remains. If you limit yourself strictly to the 8088/8086 it is not so bad, need to make sure your tools are generating those instructions and not getting into the 386 on up instructions.

If push vs inc is bothering you, definitely try something else like the msp430 for example first.
===
Thank you very much for the long descriptive answer. Things are starting to make a little more sense. The OP Code map reminds me of EBCDIC mlsite.net/8086/#tbl_oper (8086 Map). It does not appear so random any more. I was trying to envision it as a linear or set bit-mask to flag the instruction type. This site provides disassembler written in Python: mlsite.net/blog/?p=58

Reference: 
http://stackoverflow.com/questions/3983735/x86-assembly-how-do-disassemblers-know-how-to-break-up-instructions?rq=1

No comments: