How does a x86 disassembler know where to break up the instructions?
I am looking at the 8088 instruction set. For example the move instruction has 7 variations that range from 2 to 4 bytes. The instructions themselves seem to follow no particular order. Another reason for Why is x86 ugly?.
For example:
many instruction can overlap in the first byte:
The bitmasks appears to have arbitrary assignment. How does a disassembler break apart the instructions?
This question is a sub set of How to write a disassembler.
===
I am looking at the 8088 instruction set. For example the move instruction has 7 variations that range from 2 to 4 bytes. The instructions themselves seem to follow no particular order. Another reason for Why is x86 ugly?.
For example:
76543210 76543210 76543210 76543210
reg/mem to/from reg 100010dw ||regr/m
imm to reg/mem 1100011w ||000r/m dat dat w=1
imm to reg 1011wreg data dat w=1
imm to accum 1010000w addr-low addrhigh
accum to mem 1010001w addr-low addrhigh
reg/mem to seg 10001100 ||0ssr/m
seg to reg/mem 10001100 ||0ssr/m
Legend:
||=mod {NO-DISP=0,DISP-LOW,DISP-HIGH,REG}
ss=seg enum{es=0,cs,ss,ds}
reg=enum{ax=0,bx,cd,dx,bx,sp,bp,si,di (if w=1)} enum{al,bl...} (if w=0)
r/m=reg or mem (mod=3 then REG, else mem)
many instruction can overlap in the first byte:
76543210 76543210 76543210 76543210
push 11111111 ||110r/m
inc 1111111w ||000r/m
The bitmasks appears to have arbitrary assignment. How does a disassembler break apart the instructions?
This question is a sub set of How to write a disassembler.
===
Looking at my 8086/8088 Users Manual
Programmers reference (ISBN 1-55512-010-5), likely decades out of
print...Appendix A shows the instruction decoding in opcode order
0b00000000 thru 0b11111111. Does not appear to be chaotic at all. Add,
sub, and, xor, cmp, etc are all grouped in such a way that a mux can
use the opcode bits directly to route the inputs and outputs, and other
bits select the operation the alu performs on those bits.
For writing a disassembler you want to use this kind of table or an opcode chart for the top level sorting of instructions.
In your particular example, notice how whenever you see the first opcode as 0xFF there are three bits in the middle of the second byte that tell you the rest of the story as to which instruction is which. All 8 of those combinations (one is undefined) are represented and easily decoded from those 3 bits.
Yes, the x86 instruction set is crazy. Interesting and fun features, but considerably better instruction sets have been invented since. The only reason x86 has not gone the way of the 6502 for example is momentum, not quality.
You should look at this one too:
http://stackoverflow.com/questions/3917098/how-are-hex-sequence-translated-to-assembly-without-ambiguity
How to disassemble this and any other variable word length instruction set is by doing it in execution order. You will fail if you try to do it linearly in address order. Start with the vector table to get the entry addresses then follow those instructions in address order, making a note of and following all branches until you hit an unconditional branch or return or other instruction that terminates that string of instructions. Repeat this for every branch destination. That wont cover all of the instructions possible as the code may compute addresses while executing (not much you can do about disassembling that).
If any of this code was hand written intentionally or accidentally to trip up a disassembler you can expect to have collisions where the second or third byte of one opcode based on one execution path appears to be the first opcode of an instruction based on a different execution path. For example a clear a flag instruction followed by a conditional branch if flag is clear, followed by a byte of data, followed by a real instruction that is a branch destination. Yep, I have come across this. And it should be trapped by your disassembler, you need to put checks in to stop disassembling one or both of those execution paths when they collide. For complete disassembly expect to have to support some sort of user input to exclude addresses as opcodes, as well as for the user to manually add valid opcodes for you to follow the execution path from.
For fixed length instruction sets you can easily disassemble in address or execution order, your choice, address order from 0 to the end of memory is the easiest of course. Dont error out on undefined instructions, just mark them as such and keep going, some of those are data.
x86 is definitely the LAST variable length instruction set I would attempt to disassemble and I have written many disassemblers. No desire to ever attempt that project. Start with some fixed length ones like the pic and arm/thumb. Try the msp430 for variable word length, then maybe the 6502 (asteroids, asteroids deluxe, lunar lander, etc). Maybe a week or two worth of evenings to cover the above and get the feel for it, then attack the x86 if the desire remains. If you limit yourself strictly to the 8088/8086 it is not so bad, need to make sure your tools are generating those instructions and not getting into the 386 on up instructions.
If push vs inc is bothering you, definitely try something else like the msp430 for example first.
===
Thank you very much for the long descriptive answer. Things are starting to make a little more sense. The OP Code map reminds me of EBCDIC mlsite.net/8086/#tbl_oper (8086 Map). It does not appear so random any more. I was trying to envision it as a linear or set bit-mask to flag the instruction type. This site provides disassembler written in Python: mlsite.net/blog/?p=58
Reference:
http://stackoverflow.com/questions/3983735/x86-assembly-how-do-disassemblers-know-how-to-break-up-instructions?rq=1
For writing a disassembler you want to use this kind of table or an opcode chart for the top level sorting of instructions.
In your particular example, notice how whenever you see the first opcode as 0xFF there are three bits in the middle of the second byte that tell you the rest of the story as to which instruction is which. All 8 of those combinations (one is undefined) are represented and easily decoded from those 3 bits.
Yes, the x86 instruction set is crazy. Interesting and fun features, but considerably better instruction sets have been invented since. The only reason x86 has not gone the way of the 6502 for example is momentum, not quality.
You should look at this one too:
http://stackoverflow.com/questions/3917098/how-are-hex-sequence-translated-to-assembly-without-ambiguity
How to disassemble this and any other variable word length instruction set is by doing it in execution order. You will fail if you try to do it linearly in address order. Start with the vector table to get the entry addresses then follow those instructions in address order, making a note of and following all branches until you hit an unconditional branch or return or other instruction that terminates that string of instructions. Repeat this for every branch destination. That wont cover all of the instructions possible as the code may compute addresses while executing (not much you can do about disassembling that).
If any of this code was hand written intentionally or accidentally to trip up a disassembler you can expect to have collisions where the second or third byte of one opcode based on one execution path appears to be the first opcode of an instruction based on a different execution path. For example a clear a flag instruction followed by a conditional branch if flag is clear, followed by a byte of data, followed by a real instruction that is a branch destination. Yep, I have come across this. And it should be trapped by your disassembler, you need to put checks in to stop disassembling one or both of those execution paths when they collide. For complete disassembly expect to have to support some sort of user input to exclude addresses as opcodes, as well as for the user to manually add valid opcodes for you to follow the execution path from.
For fixed length instruction sets you can easily disassemble in address or execution order, your choice, address order from 0 to the end of memory is the easiest of course. Dont error out on undefined instructions, just mark them as such and keep going, some of those are data.
x86 is definitely the LAST variable length instruction set I would attempt to disassemble and I have written many disassemblers. No desire to ever attempt that project. Start with some fixed length ones like the pic and arm/thumb. Try the msp430 for variable word length, then maybe the 6502 (asteroids, asteroids deluxe, lunar lander, etc). Maybe a week or two worth of evenings to cover the above and get the feel for it, then attack the x86 if the desire remains. If you limit yourself strictly to the 8088/8086 it is not so bad, need to make sure your tools are generating those instructions and not getting into the 386 on up instructions.
If push vs inc is bothering you, definitely try something else like the msp430 for example first.
===
Thank you very much for the long descriptive answer. Things are starting to make a little more sense. The OP Code map reminds me of EBCDIC mlsite.net/8086/#tbl_oper (8086 Map). It does not appear so random any more. I was trying to envision it as a linear or set bit-mask to flag the instruction type. This site provides disassembler written in Python: mlsite.net/blog/?p=58
Reference:
http://stackoverflow.com/questions/3983735/x86-assembly-how-do-disassemblers-know-how-to-break-up-instructions?rq=1
No comments:
Post a Comment