Package mra_bi_assembler dev

Welcome to the MRA bi-assembler project. The code repository is publicly available.

To jump to the API documentation, see Libraries.

The goal of this project is to provide a framework for automatically-generating instruction assemblers and disassemblers, given a machine-readable description of an architecture's assembly syntax. The generated assemblers are intended to be invertible, hence the name bi-assembler.

The produced bi-assembler will be able to convert to and from instruction bytes and an assembly mnemonic. For example, for aarch64, one such conversion is:

add x1, x2, x3, lsl #4  <->  0x8b031041

We note that this project does not do object assembly (i.e., each instruction is (dis)assembled in isolation and it does not support symbolic addresses or labels).

Overview

This section describes the high-level process which might be followed to produce a bi-assembler for an ARM architecture. We note this project is designed to provide generic building blocks for the implementation of a bi-assembler, this overview is just one possible option.

[See overview diagram image]

In this diagram, there are 3 distinct stages. Looking in the top-to-bottom direction, the first and last stages are implemented using the parsing and unparsing functions provided by Lang. The middle stage uses the bidirectional language in Bidir. This is more structured than the parsing/unparsing, but requires more precise specification in the bidirecitonal DSL.

The right hand side of the diagram is the "frontend" for the ARM MRA, extracting information from the MRA files to produce the Lang and Bidir abstractions used in each stage. The mentioned MRA elements are found in the XML files of the MRA and can also be viewed online. This frontend is implemented by the Arm library.

Libraries

The libraries that make up this project, in suggested reading order. Each library's main page also has more information about that library.

Invertible parsers (Lang)

Bidirectional imperative programming (Bidir)

Frontend for ARM MRA (Arm)

Frontends

It is anticipated that frontends will construct the abstractions in Lang and Bidir from some architecture-specific data files. In order to be functional for assembly and disassembly, a frontend should produce approximately:

Then, the assembly parser is constructed by joining all the encoding parsers with the Lang.Common.parseable.Or operator.

This project as a whole is essentially the same as Alastair Reid's work on Bidirectional ARM Assembly Syntax Specifications. We have similar motivations and even use the same source for the instruction information (the ARM machine-readable architecture). We differ slightly in the implementation; we have a goal of being more generic and independent of the ARM specification language. This is most obvious in how we implement conversion of assembly symbols (such as x1, #100) to/from their bit encodings. To implement these translations, Alastair Reid writes snippets of ASL code for one direction (byte to string) and some manually-specified regular expressions for the inverse (string to byte). In our project, this is handled by the Bidir module. By defining a bidirectional imperative language and using this to extract parse patterns, we are able to implement a more general and automatic translation.

The project can also be compared to Ghidra's Sleigh (itself a continuation of SLED). This is a comprehensive and powerful language for defining instruction sets and assembly code. We approach the problem more abstractly, designing the Lang and Bidir modules to solve abstract problems instead of specialising to assembly code. This is approach is mostly beneficial, but requires ISA frontends to do more work when generating the DSLs for their language.

The languages definable in the Lang.Common.parseable DSL are a restriction of the star-free languages. The star-free languages are regular expressions without the Kleene star for unbounded repetition. We restrict this even further by omitting the complement operation. One consequence of this is that all our definable languages have a finite maximum size (both that the maximum length of an accepted word is finite, and that the number of distinct accepted words is finite).

This work is very much within the field of bidirectional transformations, programs which can be run "forwards" and "backwards". Bidirectional transformations are implemented by, for example, the Boomerang language (Wikipedia) using functional lenses to define the transformation. Our parsing/unparsing approach is a more ad-hoc implementation of a transformation. Our bidirectional DSL is very similar to Boomerang but more minimal and likely less flexible. We develop this project separately because Boomerang is somewhat dormant with outdated dependencies, and we do not need its full power since our language is relatively simple.

The ocamlregexkit package defines analyses and transformations for regular expressions. In particular, their Tree.re type is very similar to our Lang.Common.parseable. They also define conversions to and from NFA/DFA. We exploit our more restricted language to implement features such as Lang.Analysis.unparse, which are crucial to our application.

In Haskell, there are the casette and syntax packages which elegantly implement bidirectional parsing using better abstractions. casette uses continuation-passing style to support unbounded backtracking, and syntax uses a kind of weakened isomorphism to represent the invertible computation.

Package info

authors
  • Kait Lam <k@rina.fyi>
depends
homepage
issues
license
  • Apache-2.0
maintainers
  • Kait Lam <k@rina.fyi>
online-doc
readme-files
repo
  • git+https://github.com/katrinafyi/mra-bi-assembler.git
tags
version
  • dev