Tuesday, April 30, 2013

NASM is pure assembly, but MASM is high level Assembly?


My recommendation, purely from a "reverse engineering" perspective, is to understand how a compiler translates high-level concepts into assembly language instructions in the first place. Understanding how various compilers allocate registers, and how various optimizations obscure the high-level representation of constructs such as nested loops, is more important than being able to write one particular dialect of assembly input.
Your best bet is to start with the assembly language intermediate files generated from source code that you write (see this question for more information). Then you can change the source and see how it affects the intermediate files. The other place to start is with an interactive disassembler like IDA Pro.
Actually writing assembly language programs and learning the syntax of NASM, MASM, gas, or as is the easiest part, and it does not really matter which one you learn. They are very similar because the syntax of the source language is very basic. If you are planning to learn how to disassemble and understand what a program is doing, then I would completely ignore macro assemblers, since the macros disappear during translation and you will not see them when looking at disassembler output.
Diatribe on Learning Assembly
Learning an assembly language is different from learning a higher-level programming language. There are fewer syntactical constructs if you ignore macro assemblers. The problem is that every compiler chain has a slightly different representation, so you have to concentrate on concepts such as supported address modes, register restrictions, etc. These aren't part of the language per se, as they are dictated by the hardware.
The approach that I took (partially because the university forced me to) is to explore and understand the hardware itself (e.g., number of registers, size of registers, type of branch instructions supported, etc.) and slightly more academic concepts such as interrupts and using bitwise manipulation for integer math before you start to write assembly language programs. This is a much longer route, but it results in a rich understanding of assembly and how to write high-performance programs.
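As a small illustration of what "bitwise manipulation for integer math" can look like, here is a minimal sketch in C. The function names are made up for this example and are not from any particular course or library; the tricks themselves are the standard shift-and-mask identities that compilers often emit on their own.

```c
#include <stdint.h>

/* Hypothetical helper names, chosen only for this sketch. */

/* Multiply by 10 with shifts and an add: 10*x == 8*x + 2*x. */
uint32_t times_ten(uint32_t x)
{
    return (x << 3) + (x << 1);
}

/* A power of two has exactly one bit set, so clearing the lowest
   set bit with x & (x - 1) leaves zero. Zero itself is excluded. */
int is_power_of_two(uint32_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* Round x up to the next multiple of align.
   Assumes align is a power of two. */
uint32_t align_up(uint32_t x, uint32_t align)
{
    return (x + align - 1) & ~(align - 1);
}
```

Recognizing these patterns matters when reading disassembly, because an optimizing compiler may turn an ordinary multiplication or modulo in the source into a sequence of shifts, adds, and masks like the ones above.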
The interesting thing is that in the time that I spent learning assembly and compiler construction (which are intrinsically related), I actually wrote very few assembly programs. More often, I am required to write little snippets of inline assembly here and there (e.g., setting up index registers when the runtime loader didn't). I have spent an enormous amount of time dissecting crash dumps, working from a memory location, a loader map file, and assembler output listings. I can honestly say that the syntax of each assembler is dramatically different, as is what various compilers will do to muddle the intent into fast or small code.
Learning how to write assembly programs was the least worthwhile part of the education process. It was necessary to understand how source is translated into the bits and bytes that the computer executes, but it was not what I really needed in order to reverse engineer from a raw binary (disassembler -> assembly listing -> best guess of high-level intent) or a memory dump. I do more of the latter, but the requirements of the job are the same.
  1. You really have to understand what the constraints of the architecture are.
  2. You have to know the very basic syntax of the assembler in question: how address modes are indicated, how registers are indicated, and what the order of arguments is for a move.
  3. You have to know what transformations a compiler performs to go from if (a > 0) to mov.b r0,d0 ... bnz $L.
Start by learning about computer architecture (e.g., read something from Andrew Tanenbaum), then how an OS actually loads and runs a program (Levine's Linkers & Loaders), then compile simple programs in C/C++ and look at the assembly language listings.
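A minimal sketch of the kind of "simple program" worth compiling and inspecting follows. The file name and function names are hypothetical, chosen only for this example. It contains one plain branch, like the if (a > 0) case in the list above, and one nested loop, which is the sort of construct that optimizations tend to obscure.

```c
/* listing_demo.c: hypothetical file name, used only for this sketch. */

/* A simple branch. In an unoptimized listing, look for a compare
   (or test) instruction followed by a conditional jump. */
int is_positive(int a)
{
    if (a > 0)
        return 1;
    return 0;
}

/* Nested loops. Compare how the loop counters and the running total
   are kept (stack slots versus registers) at different optimization
   levels, and whether both loops are still recognizable. */
long sum_products(int rows, int cols)
{
    long total = 0;
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            total += (long)i * j;
    return total;
}
```

Assuming a GCC or Clang toolchain, gcc -S -O0 listing_demo.c writes the assembly listing to listing_demo.s; running it again with -O2 and comparing the two listings shows how much of the high-level structure survives optimization. Then change the source slightly (for example, a >= 0 instead of a > 0) and see what changes in the listing.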

But for understanding the disassembly... don't I need to learn assembly?

@questions: I think the best way to learn assembly is to look at disassembled code and figure out what it does. But that's my personal opinion.