The analysis performed by dcc is based on traditional compiler optimization techniques and graph theory. The former is capable of eliminating registers and intermediate instructions to reconstruct high-level statements; the later is capable of determining the control structures in each subroutine.
The structure of a decompiler resembles that of a compiler: a front-, middle-, and back-end which perform separate tasks. The front-end is a machine-language dependent module that reads in machine code for a particular machine and transforms it into an intermediate, machine-independent representation of the program. The middle-end (aka the Universal Decompiling Machine or UDM) is a machine and language independent module that performs the core of the decompiling analysis: data flow and control flow analysis. Finally, the back-end is high-level language dependent and generates code for the program (C in the case of dcc).
In practice, several programs are used with the decompiler to create the high-level program. These programs aid in the detection of compiler and library signatures, hence augmenting the readability of programs and eliminating compiler start-up and library routines from the decompilation analysis.
55 8B EC 83 EC 04 56 57 1E B8 94 00 50 9A 0E 00 3C 17 59 59 16 8D 46 FC 50 1E B8 B1 00 50 9A 07 00 F0 17 83 C4 08 BE 01 00 EB 3B 1E B8 B4 00 50 9A 0E 00 3C 17 59 59 16 8D 46 FE 50 1E B8 C3 00 50 9A 07 00 F0 17 83 C4 08 FF 76 FE 9A 7C 00 3B 16 59 8B F8 57 FF 76 FE 1E B8 C6 00 50 9A 0E 00 3C 17 83 C4 08 46 3B 76 FC 7E C0 33 C0 50 9A 0A 00 49 16 59 5F 5E 8B E5 5D CB 55 8B EC 56 8B 76 06 83 FE 02 7E 1E 8B C6 48 50 0E E8 EC FF 59 50 8B C6 05 FE FF 50 0E E8 E0 FF 59 8B D0 58 03 C2 EB 07 EB 05 B8 01 00 EB 00 5E 5D CB
Figure 1 - Machine Code for Fibonacci.exe
proc_1 PROC FAR 000 00053C 55 PUSH bp 001 00053D 8BEC MOV bp, sp 002 00053F 56 PUSH si 003 000540 8B7606 MOV si, [bp+6] 004 000543 83FE02 CMP si, 2 005 000546 7E1E JLE L1 006 000548 8BC6 MOV ax, si 007 00054A 48 DEC ax 008 00054B 50 PUSH ax 009 00054C 0E PUSH cs 010 00054D E8ECFF CALL near ptr proc_1 011 000550 59 POP cx 012 000551 50 PUSH ax 013 000552 8BC6 MOV ax, si 014 000554 05FEFF ADD ax, 0FFFEh 015 000557 50 PUSH ax 016 000558 0E PUSH cs 017 000559 E8E0FF CALL near ptr proc_1 018 00055C 59 POP cx 019 00055D 8BD0 MOV dx, ax 020 00055F 58 POP ax 021 000560 03C2 ADD ax, dx 023 00056B 5E L2: POP si 024 00056C 5D POP bp 025 00056D CB RETF 026 000566 B80100 L1: MOV ax, 1 027 000569 EB00 JMP L2 proc_1 ENDP main PROC FAR 000 0004C2 55 PUSH bp 001 0004C3 8BEC MOV bp, sp 002 0004C5 83EC04 SUB sp, 4 003 0004C8 56 PUSH si 004 0004C9 57 PUSH di 005 0004CA 1E PUSH ds 006 0004CB B89400 MOV ax, 94h 007 0004CE 50 PUSH ax 008 0004CF 9A0E004D01 CALL far ptr printf 009 0004D4 59 POP cx 010 0004D5 59 POP cx 011 0004D6 16 PUSH ss 012 0004D7 8D46FC LEA ax, [bp-4] 013 0004DA 50 PUSH ax 014 0004DB 1E PUSH ds 015 0004DC B8B100 MOV ax, 0B1h 016 0004DF 50 PUSH ax 017 0004E0 9A07000102 CALL far ptr scanf 018 0004E5 83C408 ADD sp, 8 019 0004E8 BE0100 MOV si, 1 021 000528 3B76FC L3: CMP si, [bp-4] 022 00052B 7EC0 JLE L4 023 00052D 33C0 XOR ax, ax 024 00052F 50 PUSH ax 025 000530 9A0A005A00 CALL far ptr exit 026 000535 59 POP cx 027 000536 5F POP di 028 000537 5E POP si 029 000538 8BE5 MOV sp, bp 030 00053A 5D POP bp 031 00053B CB RETF 032 0004ED 1E L4: PUSH ds 033 0004EE B8B400 MOV ax, 0B4h 034 0004F1 50 PUSH ax 035 0004F2 9A0E004D01 CALL far ptr printf 036 0004F7 59 POP cx 037 0004F8 59 POP cx 038 0004F9 16 PUSH ss 039 0004FA 8D46FE LEA ax, [bp-2] 040 0004FD 50 PUSH ax 041 0004FE 1E PUSH ds 042 0004FF B8C300 MOV ax, 0C3h 043 000502 50 PUSH ax 044 000503 9A07000102 CALL far ptr scanf 045 000508 83C408 ADD sp, 8 046 00050B FF76FE PUSH word ptr [bp-2] 047 00050E 9A7C004C00 CALL far ptr proc_1 048 000513 59 POP cx 049 000514 8BF8 MOV di, ax 050 000516 57 PUSH di 051 000517 FF76FE PUSH word ptr [bp-2] 052 00051A 1E PUSH ds 053 00051B B8C600 MOV ax, 0C6h 054 00051E 50 PUSH ax 055 00051F 9A0E004D01 CALL far ptr printf 056 000524 83C408 ADD sp, 8 057 000527 46 INC si 058 JMP L3 ;Synthetic inst main ENDP
Figure 2 - Code produced by the Disassembler
/* * Input file : fibo.exe * File type : EXE */ int proc_1 (int arg0) /* Takes 2 bytes of parameters. * High-level language prologue code. * C calling convention. */ { int loc1; int loc2; /* ax */ loc1 = arg0; if (loc1 > 2) { loc2 = (proc_1 ((loc1 - 1)) + proc_1 ((loc1 + 0xFFFE))); } else { loc2 = 1; } return (loc2); } void main () /* Takes no parameters. * High-level language prologue code. */ { int loc1; int loc2; int loc3; int loc4; printf ("Input number of iterations: "); scanf ("%d", &loc1); loc3 = 1; while ((loc3 <= loc1)) { printf ("Input number: "); scanf ("%d", &loc2); loc4 = proc_1 (loc2); printf ("fibonacci(%d) = %u\n", loc2, loc4); loc3 = (loc3 + 1); } /* end of while */ exit (0); }
Figure 3 - Code produced by dcc in C
#include <stdio.h> int main() { int i, numtimes, number; unsigned value, fib(); printf("Input number of iterations: "); scanf ("%d", &numtimes); for (i = 1; i <= numtimes; i++) { printf ("Input number: "); scanf ("%d", &number); value = fib(number); printf("fibonacci(%d) = %u\n", number, value); } exit(0); } unsigned fib(x) /* compute fibonacci number recursively */ int x; { if (x > 2) return (fib(x - 1) + fib(x - 2)); else return (1); }
Figure 4 - Initial C Program
Techniques for writing reverse compilers or decompilers are presented in this thesis. These techniques are based on compiler and optimization theory, and are applied to decompilation in a unique way; these techniques have never before been published.
A decompiler is composed of several phases which are grouped into modules dependent on language or machine features. The front-end is a machine dependent module that parses the binary program, analyzes the semantics of the instructions in the program, and generates an intermediate low-level representation of the program, as well as a control flow graph of each subroutine. The universal decompiling machine is a language and machine independent module that analyzes the low-level intermediate code and transforms it into a high-level representation available in any high-level language, and analyzes the structure of the control flow graph(s) and transform them into graphs that make use of high-level control structures. Finally, the back-end is a target language dependent module that generates code for the target language.
Decompilation is a process that involves the use of tools to load the binary program into memory, parse or disassemble such a program, and decompile or analyze the program to generate a high-level language program. This process benefits from compiler and library signatures to recognize particular compilers and library subroutines. Whenever a compiler signature is recognized in the binary program, all compiler start-up and library subroutines are not decompiled; in the former case, the routines are eliminated from the final target program and the entry point to the main program is used for the decompiler analysis, in the latter case the subroutines are replaced by their library name.
The presented techniques were implemented in a prototype decompiler for the Intel i80286 architecture running under the DOS operating system, dcc, which produces target C programs for source .exe or .com files. Sample decompiled programs, comparisons against the initial high-level language program, and an analysis of results is presented in Chapter 9.
Chapter 1 gives an introduction to decompilation from a compiler point of view, Chapter 2 gives an overview of the history of decompilation since its appearance in the early 1960s, Chapter 3 presents the relations between the static binary code of the source binary program and the actions performed at run-time to implement the program, Chapter 4 describes the phases of the front-end module, Chapter 5 defines data optimization techniques to analyze the intermediate code and transform it into a higher-representation, Chapter 6 defines control structure transformation techniques to analyze the structure of the control flow graph and transform it into a graph of high-level control structures, Chapter 7 describes the back-end module, Chapter 8 presents the decompilation tool programs, Chapter 9 gives an overview of the implementation of dcc and the results obtained, and Chapter 10 gives the conclusions and future work of this research.
The techniques presented in this thesis expand on earlier work described in the literature. Previous work in decompilation did not document on the interprocedural register analysis required to determine register arguments and register return values, the analysis required to eliminate stack-related instructions (i.e. push and pop), or the structuring of a generic set of control structures. Innovative work done for this research is described in Chapters 5, 6, and 8. Chapter 5, Sections 5.2 and 5.4 illustrate and describe nine different types of optimizations that transform the low-level intermediate code into a high-level representation. These optimizations take into account condition codes, subroutine calls (i.e. interprocedural analysis) and register spilling, eliminating all low-level features of the intermediate instructions (such as condition codes and registers) and introducing the high-level concept of expressions into the intermediate representation. Chapter 6, Sections 6.2 and 6.6 illustrate and describe algorithms to structure different types of loops and conditional, including multi-way branch conditionals (e.g. case statements). Previous work in this area has concentrated in the structuring of loops, few papers attempt to structure 2-way conditional branches, no work on multi-way conditional branches is described in the literature. This thesis presents a complete method for structuring all types of structures based on a predetermined, generic set of high-level control structures. A criterion for determining the generic set of control structures is given in Chapter 6, Section 6.4. Chapter 8 describes all tools used to decompile programs, the most important tool is the signature generator (Section 8.2) which is used to determine compiler and library signatures in architectures that have an operating system that do not share libraries, such as the DOS operating system.
Testing of the generality of the data flow analysis stage of the decompiler in Intel, MIPS, SPARC and PowerPC architectures.
Improvement of the structuring algorithm for control flow graphs.
Incorporation of better data-type detection.
This project will be integrated with the retargetable binary translation work in due course. No current updates are being done on this project.
Restrictions Our ftp server is restricted to valid DNS entries only. Hence, if your computer does not have a valid DNS entry, you won't get ftp access. Contact your systems administrator if this is the case; we cannot help you in this case .
Support Please note that the authors are not currently working on this project and therefore cannot support any changes required on dcc. Source code is provided "as is". Read the documentation first.
This page: http://www.it.uq.edu.au/groups/csm/dcc.html