index (mra_bi_assembler.index)

Overview

This section describes the high-level process which might be followed to produce a bi-assembler for an ARM architecture. We note this project is designed to provide generic building blocks for the implementation of a bi-assembler, this overview is just one possible option.

[See overview diagram image]

In this diagram, there are 3 distinct stages. Looking in the top-to-bottom direction, the first and last stages are implemented using the parsing and unparsing functions provided by Lang. The middle stage uses the bidirectional language in Bidir. This is more structured than the parsing/unparsing, but requires more precise specification in the bidirecitonal DSL.

The right hand side of the diagram is the "frontend" for the ARM MRA, extracting information from the MRA files to produce the Lang and Bidir abstractions used in each stage. The mentioned MRA elements are found in the XML files of the MRA and can also be viewed online. This frontend is implemented by the Arm library.

Libraries

The libraries that make up this project, in suggested reading order. Each library's main page also has more information about that library.

Invertible parsers (`Lang`)

Lang This library provides a domain-specific language for defining invertible parsers, Lang.Common.parseable. Assembly parsers and decoders are unified in this framework. For example, a decoder is just a parser with byte strings as input. Abiding by some minor constraints, these parses can be made to be reversible, enabling their use in a bi-directional conversion (see Lang.Analysis.unparse).

Bidirectional imperative programming (`Bidir`)

Bidir This library implements a simple bidirectional imperative language, Bidir.Types.bidir. Programs in this language can be interpreted forwards or backwards, with each direction being the inverse of the other. This is designed to be used with fully-invertible programs and provides facilities for invertible expressions and statements.

Frontend for ARM MRA (`Arm`)

Arm This is the frontend for the ARM architectures. It accepts JSON-formatted data provided by mra-tools and converts it into the Lang.Common.parseable format. The original source of the machine-readable architecture files is from Arm's Exploration Tools.

Frontends

It is anticipated that frontends will construct the abstractions in Lang and Bidir from some architecture-specific data files. In order to be functional for assembly and disassembly, a frontend should produce approximately:

an instruction decoder: parsing instruction bytes (e.g., as a string of binary characters) to an encoding name.
for each encoding name:
- an assembly parser: parsing assembly text and extracting its fields (e.g., mov x1, x2 parses to {Xd=x1; Xn=x2}).
- a bitfield extractor: parsing a binary string and extracting its fields (e.g., 0011 parses to Xd=11).
for each kind of field used in the syntax:
- a field formatter: converting a binary encoding to its textual string (e.g., 11 parses to Xd).

Then, the assembly parser is constructed by joining all the encoding parsers with the Lang.Common.parseable.Or operator.

This project as a whole is essentially the same as Alastair Reid's work on Bidirectional ARM Assembly Syntax Specifications. We have similar motivations and even use the same source for the instruction information (the ARM machine-readable architecture). We differ slightly in the implementation; we have a goal of being more generic and independent of the ARM specification language. This is most obvious in how we implement conversion of assembly symbols (such as x1, #100) to/from their bit encodings. To implement these translations, Alastair Reid writes snippets of ASL code for one direction (byte to string) and some manually-specified regular expressions for the inverse (string to byte). In our project, this is handled by the Bidir module. By defining a bidirectional imperative language and using this to extract parse patterns, we are able to implement a more general and automatic translation.

The project can also be compared to Ghidra's Sleigh (itself a continuation of SLED). This is a comprehensive and powerful language for defining instruction sets and assembly code. We approach the problem more abstractly, designing the Lang and Bidir modules to solve abstract problems instead of specialising to assembly code. This is approach is mostly beneficial, but requires ISA frontends to do more work when generating the DSLs for their language.

The languages definable in the Lang.Common.parseable DSL are a restriction of the star-free languages. The star-free languages are regular expressions without the Kleene star for unbounded repetition. We restrict this even further by omitting the complement operation. One consequence of this is that all our definable languages have a finite maximum size (both that the maximum length of an accepted word is finite, and that the number of distinct accepted words is finite).

This work is very much within the field of bidirectional transformations, programs which can be run "forwards" and "backwards". Bidirectional transformations are implemented by, for example, the Boomerang language (Wikipedia) using functional lenses to define the transformation. Our parsing/unparsing approach is a more ad-hoc implementation of a transformation. Our bidirectional DSL is very similar to Boomerang but more minimal and likely less flexible. We develop this project separately because Boomerang is somewhat dormant with outdated dependencies, and we do not need its full power since our language is relatively simple.

The ocamlregexkit package defines analyses and transformations for regular expressions. In particular, their Tree.re type is very similar to our Lang.Common.parseable. They also define conversions to and from NFA/DFA. We exploit our more restricted language to implement features such as Lang.Analysis.unparse, which are crucial to our application.

In Haskell, there are the casette and syntax packages which elegantly implement bidirectional parsing using better abstractions. casette uses continuation-passing style to support unbounded backtracking, and syntax uses a kind of weakened isomorphism to represent the invertible computation.

Package info

authors	Kait Lam <k@rina.fyi>
depends	angstrom containers dune faraday js_of_ocaml mdx ocaml odoc ppx_deriving ppx_deriving_yojson ppx_expect ppx_inline_test re
homepage	https://github.com/katrinafyi/mra-bi-assembler
issues	https://github.com/katrinafyi/mra-bi-assembler/issues
license	Apache-2.0
maintainers	Kait Lam <k@rina.fyi>
online-doc	https://katrinafyi.github.io/mra-bi-assembler/mra_bi_assembler/
readme-files	README.md
repo	git+https://github.com/katrinafyi/mra-bi-assembler.git
tags	assembly disassembly parser
version	dev

Overview

Libraries

Invertible parsers (Lang)

Bidirectional imperative programming (Bidir)

Frontend for ARM MRA (Arm)

Frontends

Related work

Package info

Invertible parsers (`Lang`)

Bidirectional imperative programming (`Bidir`)

Frontend for ARM MRA (`Arm`)