Friday, May 3, 2013

How are hex sequence translated to assembly without ambiguity

8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40 
 
A senquence like above can be segmented in various ways,each segment can be translated to corresponding assembly instruction, but each binary executable has its only DEFINITE assembly ,what's the mathematical principle that avoids ambiguity?

UPDATE
The answer with most votes doesn't actually answer my question at all.
 ===
I updated the most popular answer to hopefully make it a bit more clear. – Gabe Oct 13 '10 at 16:00
===
A binary doesn't have a definate set, it has a set of definate starting points(for example, you could skip up to four bytes of prefixes and still get almost the same output). Imo, you'd do well looking at the info on sandpile.org and have a look at the ollydbg & bea mini disassem engine – Necrolis Oct 13 '10 at 16:46
===
You don't have to accept the most upvoted answer if you don't like it. – Vilx- Oct 13 '10 at 21:16
===
I'm pretty sure I've heard tales of getting very small demos which rely on running the same bytes as different code depending where you start; certainly anything on the x86 with has an optional lock prefix on its first instruction can be two functions depending which instruction you jump to. – Pete Kirkham Oct 13 '10 at 22:05
 ===
First of all you have to distinguish between RISC and CISC architectures.

In a RISC architecture you usually have instructions of the same size, so ambiguity cannot be presented. Your CPU will fetch for example 4 bytes for every instruction, and since it will have to start from somewhere (your CPU doesn't have just a sequence like the one you presented, it will have a starting point for sure) once that it has the right alignment no problem can occur.

What happens with a CISC instruction set is essentially the same: starting from the entry point of the program it will fetch instructions accordingly to your opcodes. It doesn't need to know how to matematically distinguish ambiguities since it won't happen that it just doesn't know how long is the next instruction or where the last one finished.

So asking how to separate every instruction is like asking how to separate every word in
thepenisonthetable
There's not mathematical proof but you know which letters are correct together and which ones are not meaningful. The previous sentence contains "son" but you know that it is obtained from "is on". You wouldn't be able to say so without having a meaningful phrase, but your CPU only executes meaningful programs so what's the point?

So if the CPU could work on the previous sentence it will find the first senseful instruction "the", then "pen", "is", "on" and the "son" couldn't never be recognized anyway.

EDIT:
To be cleared, in CISC architectures, the only contraint you have to be sure not to have ambiguities is to avoid having an instruction that is a prefix of another. Let's assume a finite alphabet composed by letters a-z instead that hex numbers (just for practical purposes).

If the program counter points to
abbcbcaabdeffabd
 
you can have that abb is a whole instruction. In that case ab wouldn't be a valid instruction, otherwise the CPU couldn't know where to stop, at the same time abbc can't be an instruction too or it may create problems. Keeping it on you can have for example that ca is the next instruction, c couldn't and cbc neither.

You can extend this argumentation to the whole string. You will see that, if the CPU finds itself in a state in which the next byte of the binary points to the FIRST byte of an instruction, and there are no instruction that are prefixes of other instruction, then in the next state the program counter will point to the FIRST byte of the next, correct, instruction.
===
 I would argue that your example is not even 100% parse-able unless you assert that it is a complete sentence in English. It could be 'the pen is on the table,' or 'the penis on the table.' – Dave Oct 13 '10 at 13:51
 ===
Can you elaborate about the opcodes of CISC ? – ollydbg Oct 13 '10 at 15:24
 ===
It's like the example I provided with the sentence. Let's change it to remove the hidden ambiguity of "pen is". Assume "the cat is on the table". As long as every word (or instruction) is not a prefix of another existing word (that is not true for natural language, but it is for CPU instructions) a CPU that knows where to start will always be able to split up instructions with no ambiguity. – Jack Oct 13 '10 at 15:33
===
Knowing your starting point.

In other words, given a specific starting byte of an instruction, it is unambiguous where the instruction ends, thus giving you the starting byte of the next instruction and allowing you to continue. Given an arbitrary block of memory it is impossible to break it up into individual instructions without knowing where the first instruction begins.

From a more mathematical perspective, there is no valid instruction whose bytes are a prefix of another valid instruction. So if ab is valid, then you know that ab cd cannot be valid so ab must be one instruction and cd is the start of the next instruction.
 ===
Nice :-) However, given certain sequences it seems like it would be possible to determine the "starting point" based upon the detection of invalid instructions (or the detection of sequences which seem valid). If the above is true, would you give a particular "name" to such an operation? – user166390 Oct 12 '10 at 16:57
===
As flippant as this is, it's true. There are a number of exploits involving referencing assembly code at the wrong offset to produce unexpected behavior. – Andres Oct 12 '10 at 16:59
===
What if there're these 4 instructions: 8B EC,,, 56 8B F4 68 ,,,8B EC 56 8B, ,,F4 68 ,how does CPU know whether 8B EC 56 8B F4 68 should be interpreted as 8B EC + 56 8B F4 68 , or 8B EC 56 8B + F4 68 ? – justnobody Oct 13 '10 at 8:26
===
@justnobody: There are no such cases. If 8B EC is an instruction, then 8B EC 56 8B is not one. That's what the ISA specification of the architecture defines uniquely. – Bahbar Oct 13 '10 at 15:55
===
The sequence you listed shows exactly 1 number. In binary, it's 100010111110110001010110100010111111010001101000000000000111000001000000000000001111111100010101101111001000001001000000
 
In decimal, it's 726522768938664460674442126658667072. These are all just different ways of writing exactly the same value. A particular architecture's ISA will divide the bits into fields and assign them meaning. Most processors have easy to get manuals that describe the meaning assigned to each of the bits in those fields.
===
But the bits can be divided in various ways,how does computer figures the final segmentation? – ollydbg Oct 12 '10 at 17:03
 ===
@ollydbg, Do you want to know how the hardware physically does it, or how the programmer knows what the hardware will do with their bits? – nmichaels Oct 12 '10 at 17:06
===
Dont confuse linearly trying to disassemble with execution order of code. The binary is decoded in execution order, starting with a known location. Other than intentional hacks for various reasons there is no ambiguity.

Try writing a disassembler for a variable word length instruction set. At the end of the day it has to be done in execution order, and even there you can only disassemble some of the program as some branches can be based on addresses computed at run time. Modern compiler generated code is much better than older hand assembled code. In an old standup arcade game for example there are conditional branches preceeded by an instruction that guarantees that only one of the conditions is met (why was that in there? we will never know) and the data that follows the conditional branch resembles an opcode in such a way that you run into a collision with other opcodes.

Old dos programs trying to defeat disassemblers would have self modifying code, one instruction would modify another instruction one or two instructions ahead, if single stepped that modification would happen, but if run at full speed the instruction was already fetched in the pipeline, and the modified/broken one in memory was not used. pretty neat trick.

Anyway, the answer to your question is do not look at the bytes in linear order, look at them in execution order starting at the addresses defined by the reset and other vectors in the vector table.
 ===
 on fixed length instruction sets you can examine the data linearly from 0 to N, so long as you remember that not all of those bytes are instructions (and you examine them using the correct alignment). – dwelch Oct 12 '10 at 
===
It sounds like the answer to your question is the somewhat flippant "Know your starting point", but maybe you want something a little more verbose. Given your string:

8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40
 
AND a starting point (Let's say the 8B is your starting point) there is only one possible interpretation of the bytes.

So let's say one operation is 8B EC 56 8B (Depending on your operation length, etc), then the NEXT operation is F4 68... In this case, it's impossible for the machine to try to interpret an operation 56 8B F4 68 because no operation ended at just that point.

Now, if your start point was the 56, then you can get that group but cannot get either of the ones mentioned previously.

The layout of your memory is very specific and start points/jump points are exact and unforgiving--they are required as surely as the code itself.
 ===
If I understand your question correctly, you're trying to understand why

8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40

Could be split e.g. as

8BEC 568BF4 68007040 00FF 15BC 8240

Rathern than say,

8B EC568B F4 68007040 00FF 15BC 8240

This is entirely specified by the ISA of your architecture. That document describes exactly how instructions are uniquely constructed from a series of bytes.

For the ISA to be well formed, a single series of bytes can correspond to at most a single series of decoded instructions (might be less, if there are invalid instructions).
To get a bit more concrete, lets take the x86 example: If you want to know what each byte corresponds to, have a look here.

You'll see that, e.g. an instruction starting with 00 is an add (additional parameters are in the next byte, with a specific encoding).

You'll also see that some values are actually prefixes that modify the following instruction (0F - prefix to extend the opcode space, 26, 2E, 36, 3E, 64, 65, 66, 67, F0, F2, F3), and that some of them take different meaning based on the exact following instruction. Those are not opcodes, but they can alter the encoding of the arguments of the opcode, or introduce a completely new opcode space (e.g. SSE uses 0F).

Overall, the x86 encoding is very complex, thanks for disassemblers.
 ===
There might also be some clues elsewhere about what is a valid starting address. There is always a reset vector address, and there are usually interrupt vector addresses, which all must be valid start points for blocks of code. More usefully, if you come across a jump or call instruction elsewhere which references an address in the block you are trying to decode, then that gives you another start address.

I think I see your worry, and as far as I know its correct - if the program counter gets upset by one and that causes invalid instructions or unintended instructions to be executed, the program probably crashes. True, and also if you run into a data block and try to execute that. At least the latter can be avoided by using a Harvard architecture, where code and data are in seperate memory spaces and may be different bit widths.
 ===
If you open a binary in a hex editor, copy a portion of data and paste in a disassembler, you will very probably not copy a complete instruction. Let's use your example.. in a Windows XP 32bits SP3 English, if I assembly 8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40 I'll get:

Hex dump          Command
8BEC              mov ebp,esp
56                push esi
8BF4              mov esi,esp
68 00704000       push 407000
FF15 BC824000     call dword ptr ds:[4082bc]
 
As you can see it assembled completely different then the answer of the other guys below...

Now let's pretend that instead of assembling 8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40
you added C0 opcode at the beginning C0 8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40

Now check below what a single byte did in our instructions:
 
Hex dump             Command
C08B EC568BF4 68     ror byte ptr ds:[ebx+f48b56ec],68
0070 40              add byte ptr ds:[eax+40],dh
00FF                 add bh,bh
15 BC824000          adc eax,4082bc

As you can see it was completely modified!

The processor know where to start and what arguments to the instruction to take by the opcode instruction. In the first example our first opcode was 8B so it knows that it can be followed by another byte. If this byte is EC so the instruction ends here, and it means mov ebp, esp.

In our second example it start with C0 and it can be followed by another byte, meaning another instruction. Then C08B is the instruction, and EC568BF4 68 is the argument.

Imagine that inside the processor has a huge (but nano) circuit, that behave like a chain of ifs (conditions), that depending on the value of the hexcode (or opcode) it knows "where to go" or "how to behave".
 ===
Maybe you find it interesting to think about the other direction: How would you have to design your code to be easy to segment for others? You could require the most significant bit of the byte starting a sequence to be zero, and those in the middle of a sequence to be one, like UTF-8 does it. Then if you start from a random position – assuming you know where the bytes are – it is easy to find the next sequence. Going one step further, how would you code a pure bit stream such that the start of a sequence is easy to find. How many bits were wasted by such a coding?

Since you asked about the maths, I think the relevant topics are “Coding Theory”, “Variable-length codes” or “Prefix codes”.

How do you find a gene in a sequence of base pairs?
 
Reference:
http://stackoverflow.com/questions/3917098/how-are-hex-sequence-translated-to-assembly-without-ambiguity
 
 
 
 
 
 
 
 
 

No comments: