I'm interested in writing an x86 disassembler as an educational project.
The only real resource I have found is Spiral Space's "How to write a disassembler". While this gives a nice high-level description of the various components of a disassembler, I'm interested in some more detailed resources. I've also taken a quick look at NASM's source code, but it is somewhat heavyweight to learn from.
I realize one of the major challenges of this project is the rather large x86 instruction set I'm going to have to handle. I'm also interested in basic structure, basic disassembler links, etc.
Can anyone point me to any detailed resources on writing an x86 disassembler?
===
Not an answer but the answer in stackoverflow.com/questions/82432/… is also a good read for those who are starting.
===
Start with some small program that has been
assembled, and which gives you both the generated code and the
instructions. Get yourself a reference with the instruction architecture,
and work through some of the generated code with the architecture
reference, by hand. You'll find that the instructions have a very
stereotypical structure of inst op op op with varying number of
operands. All you need to do is translate the hex or octal
representation of the code to match the instructions; a little playing
around will reveal it.
That process, automated, is the core of a disassembler. Ideally, you're probably going to want to construct an array of instruction structures internally (or externally, if the program is really large). You can then translate that array into the instructions in assembler format.
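As a minimal sketch of that two-pass idea in Python: the `Instruction` record, the `decode_all` helper, and the four-entry opcode table below are all hypothetical names chosen for illustration, and the table only covers a few one-byte instructions that take no operands. Real x86 needs far more than this.

```python
from dataclasses import dataclass, field
from typing import List

# A tiny table of one-byte x86 opcodes that take no operands.
# Illustration only; a real table has hundreds of entries.
SIMPLE_OPCODES = {
    0x90: "nop",
    0xC3: "ret",
    0xCC: "int3",
    0xF4: "hlt",
}

@dataclass
class Instruction:
    offset: int        # position of the first byte in the input
    raw: bytes         # the encoded bytes of this instruction
    mnemonic: str      # e.g. "nop"
    operands: List[str] = field(default_factory=list)

    def to_asm(self) -> str:
        """Second pass: render one record as a line of assembler text."""
        text = self.mnemonic
        if self.operands:
            text += " " + ", ".join(self.operands)
        return f"{self.offset:08x}  {self.raw.hex():<8}  {text}"

def decode_all(code: bytes) -> List[Instruction]:
    """First pass: build the array of instruction records."""
    out = []
    for offset, byte in enumerate(code):
        if byte in SIMPLE_OPCODES:
            out.append(Instruction(offset, bytes([byte]), SIMPLE_OPCODES[byte]))
        else:
            # Unknown bytes are emitted as data so the listing stays complete.
            out.append(Instruction(offset, bytes([byte]), "db", [f"0x{byte:02x}"]))
    return out
```

Calling `decode_all(bytes([0x90, 0xCC, 0xC3]))` and printing `to_asm()` for each record gives a listing with offset, hex bytes, and mnemonic columns. Keeping the records separate from the text output makes it easy to add later passes (labels, cross-references) without touching the decoder.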
===
You need a table of opcodes to load from.
The fundamental lookup data structure is a trie; however, a table will do well enough if you don't care much about speed.
To get the base opcode type, do a begins-with (prefix) match of the input bytes against the table.
There are a few stock ways of decoding register arguments; however, there are enough special cases to require implementing most of them individually.
Since this is educational, have a look at ndisasm.
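To make the trie idea concrete, here is a small sketch in Python. The `OpcodeTrie` class and its method names are hypothetical, and the five opcodes loaded into it are just a sample; it is not how ndisasm organizes its tables.

```python
class OpcodeTrie:
    """Byte-keyed trie: each node maps the next byte to a child node."""

    def __init__(self):
        self.children = {}
        self.mnemonic = None   # set only on nodes that complete an opcode

    def insert(self, opcode: bytes, mnemonic: str) -> None:
        node = self
        for b in opcode:
            node = node.children.setdefault(b, OpcodeTrie())
        node.mnemonic = mnemonic

    def match(self, code: bytes, pos: int = 0):
        """Return (mnemonic, length) for the longest opcode that
        code[pos:] begins with, or None if nothing matches."""
        node, best = self, None
        for i in range(pos, len(code)):
            node = node.children.get(code[i])
            if node is None:
                break
            if node.mnemonic is not None:
                best = (node.mnemonic, i - pos + 1)
        return best

# Load a few sample opcodes (one-byte and 0F-escaped two-byte forms).
trie = OpcodeTrie()
for opcode, name in [
    (b"\x90", "nop"),
    (b"\xc3", "ret"),
    (b"\x0f\x05", "syscall"),
    (b"\x0f\xa2", "cpuid"),
    (b"\x0f\x0b", "ud2"),
]:
    trie.insert(opcode, name)
```

`trie.match(code, pos)` walks one byte at a time, so multi-byte opcodes cost no more than the bytes they contain. A flat list scanned with `code.startswith(opcode, pos)` would give the same answers, just more slowly.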
===
I would recommend checking out some open source disassemblers, preferably distorm, and especially "disOps (Instructions Sets DataBase)" (Ctrl+F for it on the page).
The documentation itself is full of juicy information about opcodes and instructions.
Quote from https://code.google.com/p/distorm/wiki/x86_x64_Machine_Code
80x86 Instruction:
A 80x86 instruction is divided to a number of elements:
- Instruction prefixes, affects the behaviour of the instruction's operation.
- Mandatory prefix used as an opcode byte for SSE instructions.
- Opcode bytes, could be one or more bytes (up to 3 whole bytes).
- ModR/M byte is optional and sometimes could contain a part of the opcode itself.
- SIB byte is optional and represents complex memory indirection forms.
- Displacement is optional and it is a value of a varying size of bytes (byte, word, long) and used as an offset.
- Immediate is optional and it is used as a general number value built from a varying size of bytes (byte, word, long).
The format looks as follows:
/-------------------------------------------------------------------------------------------------------------------------------------------\
|*Prefixes | *Mandatory Prefix | *REX Prefix | Opcode Bytes | *ModR/M | *SIB | *Displacement (1,2 or 4 bytes) | *Immediate (1,2 or 4 bytes) |
\-------------------------------------------------------------------------------------------------------------------------------------------/
* means the element is optional.
The data structures and decoding phases are explained in https://code.google.com/p/distorm/wiki/diStorm_Internals
Quote:
Decoding Phases
- [Prefixes]
- [Fetch Opcode]
- [Filter Opcode]
- [Extract Operand(s)]
- [Text Formatting]
- [Hex Dump]
- [Decoded Instruction]
Each step is explained also.
The original (dead) links are kept for historical reasons:
http://ragestorm.net/distorm/vol1.html and http://ragestorm.net/distorm/vol2.html
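The quoted layout names the ModR/M and SIB bytes but does not spell out their bit fields. As a small sketch based on the standard encoding (a 2-bit field followed by two 3-bit fields in each byte), with hypothetical helper names and assuming 32-bit addressing:

```python
from typing import NamedTuple

class ModRM(NamedTuple):
    mod: int   # bits 7-6: addressing mode (3 means register-direct)
    reg: int   # bits 5-3: register number, or an opcode extension
    rm: int    # bits 2-0: register or memory operand selector

class SIB(NamedTuple):
    scale: int  # bits 7-6: the index register is multiplied by 1 << scale
    index: int  # bits 5-3: index register number
    base: int   # bits 2-0: base register number

def split_modrm(byte: int) -> ModRM:
    """Split a ModR/M byte (0..255) into its three fields."""
    return ModRM(byte >> 6, (byte >> 3) & 7, byte & 7)

def split_sib(byte: int) -> SIB:
    """Split a SIB byte (0..255) into its three fields."""
    return SIB(byte >> 6, (byte >> 3) & 7, byte & 7)

def needs_sib(m: ModRM) -> bool:
    # In 32-bit addressing, rm == 4 with a memory mode means a SIB byte follows.
    return m.mod != 3 and m.rm == 4

def disp_size(m: ModRM) -> int:
    """Displacement length in bytes for 32-bit addressing.
    Simplification: ignores the SIB special case (mod == 0, base == 5)."""
    if m.mod == 1:
        return 1
    if m.mod == 2:
        return 4
    if m.mod == 0 and m.rm == 5:
        return 4
    return 0
```

For example, in the byte sequence `8B 44 24 08` the ModR/M byte is `44`: `split_modrm(0x44)` yields a memory mode with `rm == 4`, so a SIB byte (`24`) follows, and `disp_size` reports a one-byte displacement (`08`).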
===
Take a look at section 17.2 of the 80386 Programmer's Reference Manual. A disassembler is really just a glorified finite-state machine. The steps in disassembly are:
1. Check if the current byte is an instruction prefix byte (F3, F2, or F0); if so, then you've got a REP/REPE/REPNE/LOCK prefix. Advance to the next byte.
2. Check to see if the current byte is an address size byte (67). If so, decode addresses in the rest of the instruction in 16-bit mode if currently in 32-bit mode, or decode addresses in 32-bit mode if currently in 16-bit mode.
3. Check to see if the current byte is an operand size byte (66). If so, decode immediate operands in 16-bit mode if currently in 32-bit mode, or decode immediate operands in 32-bit mode if currently in 16-bit mode.
4. Check to see if the current byte is a segment override byte (2E, 36, 3E, 26, 64, or 65). If so, use the corresponding segment register for decoding addresses instead of the default segment register.
5. The next byte is the opcode. If the opcode is 0F, then it is an extended opcode, and read the next byte as the extended opcode.
6. Depending on the particular opcode, read in and decode a Mod R/M byte, a Scale Index Base (SIB) byte, a displacement (0, 1, 2, or 4 bytes), and/or an immediate value (0, 1, 2, or 4 bytes). The sizes of these fields depend on the opcode, address size override, and operand size overrides previously decoded.
The Intel manual says "Groups 1 through 4 may be placed in any order relative to each other", so the prefixes checked in steps 1-4 may not appear in that order.
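The prefix and opcode-fetch steps can be sketched as a small loop that accepts the prefix bytes in any order. This is only an illustration in Python, assuming 16-bit or 32-bit mode (no REX prefixes); the `Prefixes` record and the `scan_prefixes` and `fetch_opcode` helpers are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

# F3 is shown as "rep"; whether it means REP or REPE depends on the opcode.
REP_LOCK = {0xF0: "lock", 0xF2: "repne", 0xF3: "rep"}
SEGMENT = {0x2E: "cs", 0x36: "ss", 0x3E: "ds", 0x26: "es", 0x64: "fs", 0x65: "gs"}

@dataclass
class Prefixes:
    rep_lock: Optional[str] = None
    segment: Optional[str] = None
    operand_size_override: bool = False   # 0x66 seen
    address_size_override: bool = False   # 0x67 seen
    length: int = 0                       # number of prefix bytes consumed

def scan_prefixes(code: bytes, pos: int = 0) -> Prefixes:
    """Consume prefix bytes starting at pos, in whatever order they appear."""
    p = Prefixes()
    while pos + p.length < len(code):
        b = code[pos + p.length]
        if b in REP_LOCK:
            p.rep_lock = REP_LOCK[b]
        elif b in SEGMENT:
            p.segment = SEGMENT[b]
        elif b == 0x66:
            p.operand_size_override = True
        elif b == 0x67:
            p.address_size_override = True
        else:
            break   # first non-prefix byte: the opcode starts here
        p.length += 1
    return p

def fetch_opcode(code: bytes, pos: int) -> bytes:
    """Return the opcode bytes at pos, following the 0F escape to a second byte."""
    if code[pos] == 0x0F:
        return code[pos:pos + 2]
    return code[pos:pos + 1]
```

A caller would run `scan_prefixes(code, pos)`, then `fetch_opcode(code, pos + prefixes.length)`, and use the recorded override flags when sizing the displacement and immediate fields in the final step.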
===
Check out the objdump sources. It's a great tool that contains many opcode tables, and its sources can provide a nice base for making your own disassembler.