8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40
A senquence like above can be segmented in various ways,each segment
can be translated to corresponding assembly instruction, but each binary
executable has its only DEFINITE assembly ,what's the mathematical
principle that avoids ambiguity?UPDATE
The answer with most votes doesn't actually answer my question at all.
===
===
A binary doesn't have a definate set, it
has a set of definate starting points(for example, you could skip up to
four bytes of prefixes and still get almost the same output). Imo,
you'd do well looking at the info on sandpile.org and have a look at the
ollydbg & bea mini disassem engine – Necrolis Oct 13 '10 at 16:46
===
===
I'm pretty sure I've heard tales of
getting very small demos which rely on running the same bytes as
different code depending where you start; certainly anything on the x86
with has an optional lock prefix on its first instruction can be two
functions depending which instruction you jump to. – Pete Kirkham Oct 13 '10 at 22:05
===
First of all you have to distinguish between RISC and CISC architectures.
In a RISC architecture you usually have instructions of the same size, so ambiguity cannot be presented. Your CPU will fetch for example 4 bytes for every instruction, and since it will have to start from somewhere (your CPU doesn't have just a sequence like the one you presented, it will have a starting point for sure) once that it has the right alignment no problem can occur.
What happens with a CISC instruction set is essentially the same: starting from the entry point of the program it will fetch instructions accordingly to your opcodes. It doesn't need to know how to matematically distinguish ambiguities since it won't happen that it just doesn't know how long is the next instruction or where the last one finished.
So asking how to separate every instruction is like asking how to separate every word in
So if the CPU could work on the previous sentence it will find the first senseful instruction "the", then "pen", "is", "on" and the "son" couldn't never be recognized anyway.
EDIT:
To be cleared, in CISC architectures, the only contraint you have to be sure not to have ambiguities is to avoid having an instruction that is a prefix of another. Let's assume a finite alphabet composed by letters a-z instead that hex numbers (just for practical purposes).
If the program counter points to
You can extend this argumentation to the whole string. You will see that, if the CPU finds itself in a state in which the next byte of the binary points to the FIRST byte of an instruction, and there are no instruction that are prefixes of other instruction, then in the next state the program counter will point to the FIRST byte of the next, correct, instruction.
In a RISC architecture you usually have instructions of the same size, so ambiguity cannot be presented. Your CPU will fetch for example 4 bytes for every instruction, and since it will have to start from somewhere (your CPU doesn't have just a sequence like the one you presented, it will have a starting point for sure) once that it has the right alignment no problem can occur.
What happens with a CISC instruction set is essentially the same: starting from the entry point of the program it will fetch instructions accordingly to your opcodes. It doesn't need to know how to matematically distinguish ambiguities since it won't happen that it just doesn't know how long is the next instruction or where the last one finished.
So asking how to separate every instruction is like asking how to separate every word in
thepenisonthetableThere's not mathematical proof but you know which letters are correct together and which ones are not meaningful. The previous sentence contains "son" but you know that it is obtained from "is on". You wouldn't be able to say so without having a meaningful phrase, but your CPU only executes meaningful programs so what's the point?
So if the CPU could work on the previous sentence it will find the first senseful instruction "the", then "pen", "is", "on" and the "son" couldn't never be recognized anyway.
EDIT:
To be cleared, in CISC architectures, the only contraint you have to be sure not to have ambiguities is to avoid having an instruction that is a prefix of another. Let's assume a finite alphabet composed by letters a-z instead that hex numbers (just for practical purposes).
If the program counter points to
abbcbcaabdeffabd
you can have that abb
is a whole instruction. In that case ab
wouldn't be a valid instruction, otherwise the CPU couldn't know where to stop, at the same time abbc
can't be an instruction too or it may create problems. Keeping it on you can have for example that ca
is the next instruction, c
couldn't and cbc
neither.You can extend this argumentation to the whole string. You will see that, if the CPU finds itself in a state in which the next byte of the binary points to the FIRST byte of an instruction, and there are no instruction that are prefixes of other instruction, then in the next state the program counter will point to the FIRST byte of the next, correct, instruction.
===
I would argue that your example is not
even 100% parse-able unless you assert that it is a complete sentence in
English. It could be 'the pen is on the table,' or 'the penis on the
table.' – Dave Oct 13 '10 at 13:51
===
===
It's like the example I provided with
the sentence. Let's change it to remove the hidden ambiguity of "pen
is". Assume "the cat is on the table". As long as every word (or
instruction) is not a prefix of another existing word (that is not true
for natural language, but it is for CPU instructions) a CPU that knows
where to start will always be able to split up instructions with no
ambiguity. – Jack Oct 13 '10 at 15:33
===
Knowing your starting point.
In other words, given a specific starting byte of an instruction, it is unambiguous where the instruction ends, thus giving you the starting byte of the next instruction and allowing you to continue. Given an arbitrary block of memory it is impossible to break it up into individual instructions without knowing where the first instruction begins.
From a more mathematical perspective, there is no valid instruction whose bytes are a prefix of another valid instruction. So if
===
In other words, given a specific starting byte of an instruction, it is unambiguous where the instruction ends, thus giving you the starting byte of the next instruction and allowing you to continue. Given an arbitrary block of memory it is impossible to break it up into individual instructions without knowing where the first instruction begins.
From a more mathematical perspective, there is no valid instruction whose bytes are a prefix of another valid instruction. So if
ab
is valid, then you know that ab cd
cannot be valid so ab
must be one instruction and cd
is the start of the next instruction.===
Nice :-) However, given certain
sequences it seems like it would be possible to determine the "starting
point" based upon the detection of invalid instructions (or the
detection of sequences which seem valid). If the above is true, would
you give a particular "name" to such an operation? – user166390 Oct 12 '10 at 16:57
===
As flippant as this is, it's true. There
are a number of exploits involving referencing assembly code at the
wrong offset to produce unexpected behavior. – Andres Oct 12 '10 at 16:59
===
What if there're these 4 instructions:
8B EC
,,, 56 8B F4 68
,,,8B EC 56 8B
, ,,F4 68
,how does CPU know whether 8B EC 56 8B F4 68
should be interpreted as 8B EC
+ 56 8B F4 68
, or 8B EC 56 8B
+ F4 68
? – justnobody Oct 13 '10 at 8:26
===
@justnobody: There are no such cases. If
8B EC
is an instruction, then 8B EC 56 8B
is not one. That's what the ISA specification of the architecture defines uniquely. – Bahbar Oct 13 '10 at 15:55
===
The sequence you listed shows exactly 1 number. In binary, it's
100010111110110001010110100010111111010001101000000000000111000001000000000000001111111100010101101111001000001001000000
.
In decimal, it's
726522768938664460674442126658667072
.
These are all just different ways of writing exactly the same value. A
particular architecture's ISA will divide the bits into fields and
assign them meaning. Most processors have easy to get manuals that
describe the meaning assigned to each of the bits in those fields.
===
But the bits can be divided in various ways,how does computer figures the final segmentation? – ollydbg Oct 12 '10 at 17:03
===
@ollydbg, Do you want to know how the
hardware physically does it, or how the programmer knows what the
hardware will do with their bits? – nmichaels Oct 12 '10 at 17:06
===
Dont confuse linearly trying to disassemble
with execution order of code. The binary is decoded in execution order,
starting with a known location. Other than intentional hacks for
various reasons there is no ambiguity.
Try writing a disassembler for a variable word length instruction set. At the end of the day it has to be done in execution order, and even there you can only disassemble some of the program as some branches can be based on addresses computed at run time. Modern compiler generated code is much better than older hand assembled code. In an old standup arcade game for example there are conditional branches preceeded by an instruction that guarantees that only one of the conditions is met (why was that in there? we will never know) and the data that follows the conditional branch resembles an opcode in such a way that you run into a collision with other opcodes.
Old dos programs trying to defeat disassemblers would have self modifying code, one instruction would modify another instruction one or two instructions ahead, if single stepped that modification would happen, but if run at full speed the instruction was already fetched in the pipeline, and the modified/broken one in memory was not used. pretty neat trick.
Anyway, the answer to your question is do not look at the bytes in linear order, look at them in execution order starting at the addresses defined by the reset and other vectors in the vector table.
===
on fixed length instruction sets you can
examine the data linearly from 0 to N, so long as you remember that not
all of those bytes are instructions (and you examine them using the
correct alignment). – dwelch Oct 12 '10 at Try writing a disassembler for a variable word length instruction set. At the end of the day it has to be done in execution order, and even there you can only disassemble some of the program as some branches can be based on addresses computed at run time. Modern compiler generated code is much better than older hand assembled code. In an old standup arcade game for example there are conditional branches preceeded by an instruction that guarantees that only one of the conditions is met (why was that in there? we will never know) and the data that follows the conditional branch resembles an opcode in such a way that you run into a collision with other opcodes.
Old dos programs trying to defeat disassemblers would have self modifying code, one instruction would modify another instruction one or two instructions ahead, if single stepped that modification would happen, but if run at full speed the instruction was already fetched in the pipeline, and the modified/broken one in memory was not used. pretty neat trick.
Anyway, the answer to your question is do not look at the bytes in linear order, look at them in execution order starting at the addresses defined by the reset and other vectors in the vector table.
===
===
It sounds like the answer to your question is
the somewhat flippant "Know your starting point", but maybe you want
something a little more verbose. Given your string:
So let's say one operation is 8B EC 56 8B (Depending on your operation length, etc), then the NEXT operation is F4 68... In this case, it's impossible for the machine to try to interpret an operation 56 8B F4 68 because no operation ended at just that point.
Now, if your start point was the 56, then you can get that group but cannot get either of the ones mentioned previously.
The layout of your memory is very specific and start points/jump points are exact and unforgiving--they are required as surely as the code itself.
===8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40
AND a starting point (Let's say the 8B is your starting point) there is only one possible interpretation of the bytes.So let's say one operation is 8B EC 56 8B (Depending on your operation length, etc), then the NEXT operation is F4 68... In this case, it's impossible for the machine to try to interpret an operation 56 8B F4 68 because no operation ended at just that point.
Now, if your start point was the 56, then you can get that group but cannot get either of the ones mentioned previously.
The layout of your memory is very specific and start points/jump points are exact and unforgiving--they are required as surely as the code itself.
If I understand your question correctly, you're trying to understand why
8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40
Could be split e.g. as
8BEC 568BF4 68007040 00FF 15BC 8240
Rathern than say,
8B EC568B F4 68007040 00FF 15BC 8240
This is entirely specified by the ISA of your architecture. That document describes exactly how instructions are uniquely constructed from a series of bytes.
For the ISA to be well formed, a single series of bytes can correspond to at most a single series of decoded instructions (might be less, if there are invalid instructions).
To get a bit more concrete, lets take the x86 example: If you want to know what each byte corresponds to, have a look here.
You'll see that, e.g. an instruction starting with 00 is an add (additional parameters are in the next byte, with a specific encoding).
You'll also see that some values are actually prefixes that modify the following instruction (0F - prefix to extend the opcode space, 26, 2E, 36, 3E, 64, 65, 66, 67, F0, F2, F3), and that some of them take different meaning based on the exact following instruction. Those are not opcodes, but they can alter the encoding of the arguments of the opcode, or introduce a completely new opcode space (e.g. SSE uses 0F).
Overall, the x86 encoding is very complex, thanks for disassemblers.
===
8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40
Could be split e.g. as
8BEC 568BF4 68007040 00FF 15BC 8240
Rathern than say,
8B EC568B F4 68007040 00FF 15BC 8240
This is entirely specified by the ISA of your architecture. That document describes exactly how instructions are uniquely constructed from a series of bytes.
For the ISA to be well formed, a single series of bytes can correspond to at most a single series of decoded instructions (might be less, if there are invalid instructions).
To get a bit more concrete, lets take the x86 example: If you want to know what each byte corresponds to, have a look here.
You'll see that, e.g. an instruction starting with 00 is an add (additional parameters are in the next byte, with a specific encoding).
You'll also see that some values are actually prefixes that modify the following instruction (0F - prefix to extend the opcode space, 26, 2E, 36, 3E, 64, 65, 66, 67, F0, F2, F3), and that some of them take different meaning based on the exact following instruction. Those are not opcodes, but they can alter the encoding of the arguments of the opcode, or introduce a completely new opcode space (e.g. SSE uses 0F).
Overall, the x86 encoding is very complex, thanks for disassemblers.
===
There might also be some clues elsewhere about
what is a valid starting address. There is always a reset vector
address, and there are usually interrupt vector addresses, which all
must be valid start points for blocks of code. More usefully, if you
come across a jump or call instruction elsewhere which references an
address in the block you are trying to decode, then that gives you
another start address.
I think I see your worry, and as far as I know its correct - if the program counter gets upset by one and that causes invalid instructions or unintended instructions to be executed, the program probably crashes. True, and also if you run into a data block and try to execute that. At least the latter can be avoided by using a Harvard architecture, where code and data are in seperate memory spaces and may be different bit widths.
===I think I see your worry, and as far as I know its correct - if the program counter gets upset by one and that causes invalid instructions or unintended instructions to be executed, the program probably crashes. True, and also if you run into a data block and try to execute that. At least the latter can be avoided by using a Harvard architecture, where code and data are in seperate memory spaces and may be different bit widths.
If you open a binary in a hex editor, copy a
portion of data and paste in a disassembler, you will very probably not
copy a complete instruction. Let's use your example.. in a Windows XP
32bits SP3 English, if I assembly
Now let's pretend that instead of assembling
you added
Now check below what a single byte did in our instructions:
As you can see it was completely modified!
The processor know where to start and what arguments to the instruction to take by the opcode instruction. In the first example our first opcode was
In our second example it start with
Imagine that inside the processor has a huge (but nano) circuit, that behave like a chain of ifs (conditions), that depending on the value of the hexcode (or opcode) it knows "where to go" or "how to behave".
===8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40
I'll get:Hex dump Command
8BEC mov ebp,esp
56 push esi
8BF4 mov esi,esp
68 00704000 push 407000
FF15 BC824000 call dword ptr ds:[4082bc]
As you can see it assembled completely different then the answer of the other guys below...Now let's pretend that instead of assembling
8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40
you added
C0
opcode at the beginning C0 8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40
Now check below what a single byte did in our instructions:
Hex dump Command
C08B EC568BF4 68 ror byte ptr ds:[ebx+f48b56ec],68
0070 40 add byte ptr ds:[eax+40],dh
00FF add bh,bh
15 BC824000 adc eax,4082bc
As you can see it was completely modified!
The processor know where to start and what arguments to the instruction to take by the opcode instruction. In the first example our first opcode was
8B
so it knows that it can be followed by another byte. If this byte is EC
so the instruction ends here, and it means mov ebp, esp
.In our second example it start with
C0
and it can be followed by another byte, meaning another instruction. Then C08B
is the instruction, and EC568BF4 68
is the argument.Imagine that inside the processor has a huge (but nano) circuit, that behave like a chain of ifs (conditions), that depending on the value of the hexcode (or opcode) it knows "where to go" or "how to behave".
Maybe you find it interesting to think about
the other direction: How would you have to design your code to be easy
to segment for others? You could require the most significant bit of the
byte starting a sequence to be zero, and those in the middle of a
sequence to be one, like UTF-8 does it. Then if you start from a random
position – assuming you know where the bytes are – it is easy to find
the next sequence. Going one step further, how would you code a pure bit
stream such that the start of a sequence is easy to find. How many bits
were wasted by such a coding?
Since you asked about the maths, I think the relevant topics are “Coding Theory”, “Variable-length codes” or “Prefix codes”.
How do you find a gene in a sequence of base pairs?
Since you asked about the maths, I think the relevant topics are “Coding Theory”, “Variable-length codes” or “Prefix codes”.
How do you find a gene in a sequence of base pairs?
Reference:
No comments:
Post a Comment