Let’s Bap

Bap, the binary analysis platform, is a framework that disassembles binary code in a variety of formats and for a variety of architectures and lifts them into a common representation. It then supplies analysis you may perform and tools with which to build your own custom analysis

Bap is quite the beast.

To me starting out there was a lot to swallow. First I had to learn Ocaml, second I knew even less about program analysis and binary stuff than I do now.

You may also want to look at JT’s Bap 2.0 Notes

Come on down the the BAP gitter. It is a friendly place and basically the only way to use bap sometimes is to ask questions there

Installing

opam is the standard ocaml package manager. Long story short:

opam depext --install bap # installs bap and its dependencies

More details here https://github.com/BinaryAnalysisPlatform/bap#from-sources

A toy program for you to try

You really should get bap running on your machine and just try dumping out this program with the various options.

int main(){
    return 3;
}
gcc foo.c
objdump -d a.out

The bap command

After installing, if you type bap you will get a list of information

Typing bap on my system gets me. You should actually take a minute to read this. For serious. The bap command line itself is some of the best documentation of what bap has.

Usage:
  bap <COMMAND> [<OPTIONS>]

Common options:
  --version                prints the version number and exits;
  --plugin-path <DIR>      adds <DIR> to the plugins search path;
  --logdir <DIR>           creates a log file in <DIR>;
  --recipe <VAL>           extracts command line arguments from the recipe.

Commands:
  objdump                  disassembles and prints a binary using the linear sweep algorithm
  mc                       disassembles a (human readable) string of bytes
  primus-observations      prints a list of Primus observations
  primus-components        prints a list of Primus components
  primus-systems           prints a list of Primus systems
  compare                  compares several alternative versions of the binary with the base
  disassemble              disassembles and analyzes the input file
  print-recipes            prints available recipes
  recipes                  provides commands to manipulate the recipe subsystem
  list                     explores various BAP facilities
  .                        does nothing and prints nothing
  cache                    provides options to control cache size and cache garbage collector
  analyze                  analyses the knowledge base
  eval-lisp                runs the Primus lisp program
  show-lisp                shows the static semantics of Primus Lisp definitions
  primus-lisp-documentation no description provided
  dependencies             outputs program dependencies such as libraries and symbols
  specification            displays information about the binary


Plugins:
  abi                      apply abi information to a project
  analyze                  implements the analyze command
  api                      add parameters to subroutines based on known API
  arm                      the target support package that enables support for the ARM family of
  beagle                   microx powered obfuscated string solver
  bil                      provides bil optimizations
  byteweight               find function starts using Byteweight algorithm
  cache                    provide caching services
  callgraph-collator       compares projects by their callgraphs
  callsites                annotate callsites with subroutine's arguments
  constant-tracker         constant Tracking Analysis based on Primus
  cxxfilt                  provide c++filt based demangler
  demangle                 demangle subroutine names
  dependencies             analyses the binary dependencies
  disassemble              implements the disassemble command
  dump-symbols             dump symbol information as a list of blocks
  elf-loader               read ELF and DWARF formats in a pure OCaml
  flatten                  flattens (unrolls) BIR expressions into a trivial form
  frontc-parser            parse c files with FrontC
  glibc-runtime            enables ad-hoc support for glibc runtime code
  llvm                     provide loader and disassembler using LLVM library
  map-terms                map terms using BML DSL
  mc                       bAP Core Library
  mips                     provide MIPS lifter
  objdump                  this plugin provides a symbolizer based on objdump
  optimization             automatically removes dead code and propagates consts
  phoenix                  output project information in a phoenix format
  powerpc                  provide PowerPC lifter
  primus-dictionary        provides a key-value storage
  primus-exploring         evaluates all machines, prioritizing the least visited
  primus-greedy            evaluates all machines in the DFS order
  primus-limit             ensures termination by limiting Primus machines
  primus-lisp              install and load Primus lisp libraries
  primus-loader            generic program loader for Primus
  primus-mark-visited      registers the bap:mark-visited component
  primus-powerpc           powerpc support package
  primus-print             prints Primus states and observations
  primus-promiscuous       enables the promiscuous mode of execution
  primus-propagate-taint   a compatibility layer between different taint analysis frameworks
  primus-random            provides components for Primus state randomization and controls their
  primus-region            interval sets
  primus-round-robin       evaluates all machines in the BFS order
  primus-symbolic-executor enables symbolic execution in Primus
  primus-systems           loads Primus systems and registers them in the system repository
  primus-taint             a taint analysis control interface
  primus-test              primus Program Testing and Verification Kit
  primus-wandering         evaluates all machines while
  primus-x86               x86 support package
  print                    print project in various formats
  propagate-taint          propagate taints through a program
  raw                      provides a loader for raw binaries
  read-symbols             provides functions addresses and (optionally) names from a
  recipe-command           manipulates bap recipes
  relocatable              extracts symbolic information from the program relocations
  report                   reports program status
  riscv                    provide Riscv target
  run                      a pass that will run a program
  specification            prints the specification of the binary (like readelf)
  ssa                      translates a program into the SSA form
  strings                  find strings of characters
  stub-resolver            identifies functions that are stubs and redirects calls to stubs to
  systemz                  provide Systemz lifter
  taint                    taint specified terms
  thumb                    provide Thumb lifter
  trace                    manage execution traces
  trivial-condition-form   eliminates complex conditionals in branches
  warn-unused              warn about unused results of certain functions
  x86                      provide x86 lifter

I have no idea what many of these do. A majority of the commands are just ways to query for available capabilities of different kinds

Here are the highlights in my opinion.

  • Commands
    • primus-observations prints a list of Primus observations
    • primus-components prints a list of Primus components
    • primus-systems prints a list of Primus systems
    • list explores various BAP facilities. Especially helpful for finding slots in the knowledge base
    • cache provides options to control cache size and cache garbage collector
    • analyze analyses the knowledge base
    • eval-lisp runs the Primus lisp program
    • show-lisp shows the static semantics of Primus Lisp definitions
    • primus-lisp-documentation no description provided
  • Plugins
    • print
    • run
    • api
    • primus-lisp

Typing bap list on my system gives

  entities                 prints this message
  plugins                  installed BAP plugins
  commands                 bap utility commands
  recipes                  installed recipes
  features                 plugin capabilities
  passes                   disassembler passes
  formats                  data output formats
  classes                  knowledge representation classes
  theories                 installed theories
  agents                   knowledge providers
  rules                    knowledge base rules
  collators                project collators (comparators)

all of which can be further queries to bap list. Of particular interest are bap list theories and bap list classes which are important information about the knowledge base (listing the available “theories” which are like different interpretations or analysis of the code. classes lists the registered classes and their slots in the knowledge base, aka interesting fields of data you can query bap for).

bap -dbir -dbil -dknowledge -dasm -dcfg are different options to dump different representations of the code. BIR and BIL are two ba intermediatre representations. -dknowledge dumps the knowledge base which is kind of everything bap could figure out about the binary.

bap --help is an overwhelming amount of information. Typically you need to try to grep for an appropriate keyword. bap --help | grep -C 10 keyword will show a context of 10 lines around the found keyword. The keyword I use is often the name of bap plugin from the big list above.

The bap command is the same as bap disassemble. The code can be found here https://github.com/BinaryAnalysisPlatform/bap/tree/97fb7fa6a8a90faeae2b077d8ad59b8b882d7c32/plugins/disassemble

Bap equivalents of binutils

objdump let’s you see various outputs about a binary.

  • assembly -d
  • symbol -t or --syms
  • sections & segments -x

readelf has some overlap.

bap -dasm is like objdump -d

bap specification is kind of like readelf --all

bap dependencies is similar to ldd I think.

Giving Bap C info

In order for bap to recover high level function arguments you can supply a header file. If you know this plugin is called api you can find the options available by bap --help | grep -C 4 api

  • --api-path=somefolder where somefolder has a folder called C in it.
  • --api-show
  • --api-list-paths

Saving and restoring the knowledge base

You can have bap save it’s info and restore it. bap a.out -k my_knowledge_base --update will build a knowledge base. Leaving out --update is useful for read only access to the KB. This is used for example with bap analyze -k my_knowledge_base, which gives a kind of repl for exploring some pieces (but not all) of the knowledge base. Try typing help at the prompt for more info.

Recipes

Recipes are bundles of command line flags I think. Well, they are at least that. This can save copying ad pasting some huge command a bunch. https://github.com/BinaryAnalysisPlatform/bap-toolkit Has some interesting example recipes

Interesting Places to Look

  • Ivan’s gists - https://gist.github.com/ivg
  • Tests
  • Defining instruction semantics using primus lisp tutorial https://github.com/BinaryAnalysisPlatform/bap/wiki/Defining-instructions-semantics-using-Primus-Lisp-(Tutorial)
  • https://github.com/BinaryAnalysisPlatform/bap-toolkit Has examples recipes

IRs

BIL

BIL is a simple imperative language bap lifts to with unstructured control flow.

BIR

BIR is a refinement of BIL into a CFG. Each block of BIR contains BIL statements.

Core Theory

The inextensibility of the two above IRs has chaffed the developer’s of BAP. So underneath these IRs is the Core Theory language, which is quite a machine.

Looking at the “herbrand theory” https://github.com/BinaryAnalysisPlatform/bap/blob/master/plugins/core_theory/core_theory_main.ml can help to understand core theory

I discussed this some in this blog post https://www.philipzucker.com/bap-chc/

Disassembler

Loader, Image.

Bap has a experimental ghidra backend now btw.

The semantics of the disassembler can be extended via Primus Lisp.

Primus Lisp

It may seem odd to mention Primus Lisp before Primus, but it has outgrown it’s original intended use case there. It is interesting in and of itself.

Primus Lisp is the language by which you may add new instruction semantics to BAP lifters. See this tutorial by Ivan.

It is also the language you can use you script the Primus interpreter.

There is an introduction documentation to the language here

bap --help | grep -C 10 lisp to see what bap has to sayis available as options. Note --primus-lisp-add to add directories and --primus-lisp-load to actually load particular primus files. Leave off the “.lisp” suffix when you give the file to load

bap primus-lisp-documentation will make an org mode file of all registered primus lisp functions. You can add ocaml functions as new primitives to primus lisp. It’s also online here

You can run primus lisp functions by making a file demo.lisp filled with the content

(defun mymain ()
    (declare (external 'mymain))
    (msg "hello world"))

And invoke it via bap eval-lisp mymain --primus-lisp-load=demo --primus-print-obs=lisp-message,exception. Note that you have to turn on printing for primus observations to see anything. I always turn on message and exception at minimum.

loading into semantics primus lisp bap show-lisp foo --primus-lisp-load=demo -tarmv5+le

foo:
((core:eff ((set R0 1)))
 (bap:ir-graph "00000009:
                0000000a: R0 := 1")
 (bap:insn-dests (()))
 (bap:bir (%00000009))
 (bap:bil "{
             R0 := 1
           }")
 (core:value ((core:val (1)) (bap:static-value (0x1)) (bap:exp 1))))

Primus

Primus is an extensible interpreter/emulator. It can execute code lifted from binaries. You may mix and match Primus Components

The simplest way to run it is

bap /bin/true --run --run-entry-points=main --primus-print-observations=exception,pc-change

For more options bap --help | grep -C 10 run

There is a built in taint tracking functionality and symbolic executor for primus.

You can script the primus interpreter using Primus Lisp files

Primus

Primus is an extensible interpreter. You can use it for fuzzing, symbolic execution, taint tracking, and more.

The pieces of functionality are implemented by components. They can be mixed and matched.

To see the list of available components type

bap primus-components

Prepackaged combos of components are called “systems”. You can see the available systems by typing

bap primus-systems

On my system this returns

- bap:constant-tracker:
  Uses promiscuous-executor for constant tracking.

- bap:microexecutor-base:
  The base system for microexecution systems.

- bap:symbolic-executor:
  Uses symbolic execution to analyze all feasible and
  linearly independent paths.

- bap:base-taint-analyzer:
  Uses promiscuous-executor for taint analysis.
  No policy is specified

- bap:binary-executor:
  Executes a binary program.

- bap:taint-analyzer:
  Uses promiscuous-executor for taint analysis.
  The default taint propagation policy is selected
  using the --primus-taint-select-default-policy
  option (defaults to propagate-by-computation)

- bap:reflective-taint-analyzer:
  A taint analyzer that reflects taints to/from BIR terms

- bap:stubbed-executor:
  Executes a binary together with the Lisp Machine

- bap:legacy-main:
  The legacy Primus system that contains all components registered with the
  Machine.add_component function.

- bap:base-lisp-machine:
  Executes Primus Lisp program.

- bap:promiscuous-executor:
  Executes all linearly independent paths and never fails.

- bap:terminating-stubbed-executor:
  Executes a binary together with the Lisp Machine that
  is guaranteed to terminate.

- bap:multi-analyzer:
  Runs several analyses in parallel.

- bap:exact-taint-analyzer:
  Uses promiscuous-executor for taint analysis.
  Propagates taint exactly.

- bap:string-deobfuscator:
  Uses promiscuous-executor to find obfuscated strings.

To use primus in the simplest way, you can use the bap run plugin. Using the run plugin can be confusing because key command line options are supplied to other plugins.

Now try reading the basic documentation of Primus https://binaryanalysisplatform.github.io/bap/api/master/bap-primus/Bap_primus/Std/index.html

Under the hood, primus is extensible because you can register callbacks to certain events called “observations”. This is reminscent of register event listers in a browser for example.

To see the list of available observations type

bap primus-observations

Ocaml

Primus is written as a generic monad transformer. You should ignore that. Do not look in the Machine module. You actually want Analysis.t which is this monad transformer specialized to the knowledge base monad.

https://binaryanalysisplatform.github.io/bap/api/master/bap-primus/Bap_primus/Std/Primus/Analysis/index.html

Systems

There is significant descriptive documentation of the primus machine execution cycle in the System module. You should read it. https://binaryanalysisplatform.github.io/bap/api/master/bap-primus/Bap_primus/Std/Primus/System/index.html

You may write .asd files to describe systems. .asd is borrowed from the common lisp world. You can find the definition of the built in systems here https://github.com/BinaryAnalysisPlatform/bap/blob/97fb7fa6a8a90faeae2b077d8ad59b8b882d7c32/plugins/primus_systems/systems/core.asd

There is a short tutorial in the wiki here https://github.com/BinaryAnalysisPlatform/bap/wiki/Tutorial:-writing-a-symbolic-taint-analyzer on how to make a new one.

State

You can add new kinds of state. You need to register the state globally with a unique uuid (it’s odd). It looks like this to declare state

let state =
  Primus.Machine.State.declare
    ~name:"unchecked-return-value"
    ~uuid:"7390b60d-fac6-42f7-b13b-94b85bba7586"
    (fun _ -> {addrs = Set.empty (module Addr); verbose=false})

Forking

Components

To make a component, write a function in the Analysis.t monad that registers the callbacks to the events you care about. Pass this unit Analysis.t to Component.register. That’s it.

Observations

Most observations of interest are in the Primus.Interprete module https://binaryanalysisplatform.github.io/bap/api/master/bap-primus/Bap_primus/Std/Primus/Interpreter/index.html

Linker

It may surprise you that really important functionality is in the Linker module.

https://binaryanalysisplatform.github.io/bap/api/master/bap-primus/Bap_primus/Std/Primus/Linker/index.html

Ocaml Stuff

Finding Stuff

I use a combination of Merlin, Github Search.

The most interesting folders are lib and plugins

  • https://github.com/ivg/bap/blob/master/lib/bap/bap.mli the bap std library
  • https://github.com/ivg/bap/tree/master/lib/bap_types many bap types are actually defined here

Bap is not a library.

Bap is organized is an extensible kind of decentralized way that I find very confusing. I am constantly tempted to treat it as a library and shoot my foot off.

Plugins vs Commands vs Passes vs Scripts

There are at least 3 different ways you might extend bap.

Commands

Commands do something different than the default bap command at the top level.

And this is the whole idea of BAP as a framework instead of a library. There are extension points, which enable you to extend bap without having to worry about how to create a project, how to properly find the file, how to specify the architecture and other parameters. You just register a pass that takes a ready project and focus on your analysis instead of writing all this boilerplate. E.g., in the example above it is rightful to assume that you want to get the project before starting enqueing jobs, so you can register a pass that will be called once the project is ready and if the pass is selected,

You make plugins by building and installing it

bapbuild comment.plugin
bapbundle install comment.plugin

Any side effects of setting up the plugin should happen inside an Extension.declare. It consistently causes problem that bap requires certain things to happen at certain stages, and if you go off the reservation, you’ll probably eat shit.

let () =
  Bap_main.Extension.declare (fun _ctx -> dostuff)

The stuff you might do might involve declaring new slots in the knowledge base, declaring new interpetations, declaring new primus components, stuff like that.

Scripts

“Scripts” are a thing I’ve made up and am not sure are actually recommended. You can make a standalone binary using that call Bap_main.init. To make a basic file to explore some binary, first make a dune file

( executable
  (name main)
  (libraries bap bap-primus)
  (flags -linkall)
  (preprocess (pps ppx_jane))
)
open Core_kernel
open Bap.Std
include Self()

(* Must call init before everything*)
let () = match Bap_main.init () with
        | Ok _ -> ()
        | Error s -> failwith s

(* Load a file as a project *)
let myfile = "/home/philip/Documents/ocaml/a.out"
let proj = Project.create (Project.Input.file ~loader:"llvm" ~filename:myfile) |> Or_error.ok_exn

(* Getting the current knowledge base *)
let state = Toplevel.current ()

Registries

A ubiquitous programming pattern in the implementation of bap is to define global level registries for various constructs.

  • Plugins
  • Commands
  • Knowledge Base Classes
  • Knowledge Base Slots
  • Core Theory implementations
  • Primus Components
  • Primus Observations

To learn about the contents of these registries, and hence the available capaibilities, the best way is to find the appropriate way to ask the bap command line tool. I then use the github search feature on the bap repo to search for a important string in question.

Ivan has an Ascii Cinema here

Get some info from the Knowledge Base. bap list

Bap Lifecycle

We open Bap.Std open Self()

Call Bap_main.init Bap_toplevel is in Bap_types. It holds an empty knowledge base at first

Project.create calls the disassembler calls the brancher

Core theory holds both the finally tagless module definition, but also a bunch of slots.

https://github.com/BinaryAnalysisPlatform/bap/blob/ef6afa455a086fdf6413d2f32db98fa9ff1b28d8/plugins/bil/bil_ir.mli Bil.reify What is this. Why is this in plugins

Bap Project Directory Structure

It may be the case that the easiest thing to do sometimes is some source code spelunking.

  • plugins
  • src
  • lib

Github search odoc

Bap Plugins

And this is the whole idea of BAP as a framework instead of a library. There are extension points, which enable you to extend bap without having to worry about how to create a project, how to properly find the file, how to specify the architecture and other parameters. You just register a pass that takes a ready project and focus on your analysis instead of writing all this boilerplate. E.g., in the example above it is rightful to assume that you want to get the project before starting enqueing jobs, so you can register a pass that will be called once the project is ready and if the pass is selected,

https://binaryanalysisplatform.github.io/bap/api/master/bap-main/Bap_main/index.html https://gitter.im/BinaryAnalysisPlatform/bap?at=610c3e322453386b6c373696 https://en.wikipedia.org/wiki/Dependency_injection

The bap main thing

Bap api

The flags corresponding to plugins are typically namespace like --pluginname-pluginoption

In order for bap to recover high level function arguments you can supply a header file. If you know this plugin is called api you can find the options available by bap --help | grep -C 4 api

–api-path=somefolder where somefolder has a folder called C in it. –api-show –api-list-paths

Bap Disassemble

Bap disassemble is the default command if you choose no command. You can use it similar to objdump --print-symbol lets you print just a particular function

where print flags are defined -d is dump options to bap diassemble. currently I see:

  • knowledge
  • cfg
  • graph
  • asm
  • asm.adt
  • adm.decoded
  • asm.sexp
  • bil
  • bil.adt
  • bil.sexp
  • bir
  • ogre

for example bap -dogre a.out

Saving and restoring the knowledge base

-k –project –update

Bap Analyze

https://asciinema.org/a/358996

Bap Passes

Bap.Std

Bap.Std

BIL - Bap Instruction Language. http://binaryanalysisplatform.github.io/bap/api/master/bap/Bap/Std/index.html

OCaml and Registries

There is a pattern avialable in OCaml, which has mutable global state avaialable if you need it, to use a pattern where you make a global table with which to

In this form, this pattern is ubiquitous in bap

The Knowledge Base

The Knowledge Base is a key value store? Database. It is also kind of an alternative class (like object oriented classes) system It is also kind of a

The knowledge base is backed by global tables. New keys to these tables are generated

Knowledge Base

The BAP knowledge base is a centralized store of information meant to hold all the things discovered about a binary. This include properties like the architecture, addresses, and lifted code. User defined analysis may intercommunicate by storing their results into the knowledge base.

The term knowledge base may mean nothing more than database, but to me it implies that it is a combination of a database and mechanisms to automatically derive or populate more information.

bap list classes shows a list of classes and slots of those classes.

Toplevel.current () is the implicit current knowledge base. This has the program you’ve lifted in it

In spirit, each object in the KB is a record. Orindary ocaml records look like

type myrecord = {myint : int; mybool : bool}

but this is not acceptable, because the record needs to be user extensible. Hence some fairly sophisticated “extensible record” programming techniques are used. The Bap type Dict.t is an extensible record. https://github.com/ivg/bap/blob/master/lib/bap_types/bap_value.ml

Classes, Slots

A class is a description of the fields (slots) of a record. You can declare a new class with no field via KB.Class.declare A slot os a field of a record. It has a type and name.

You can add another slot to a class by calling KB.Class.property

Objects and Values

Domains

Promises

Slots are often lazily filled when something collects them. These lazy promises are registered via the KB.promise funcion.

Record.t = Dict.t + Domains

It is desirable for bap to define a notion of record merging. This is a useful concept if you want to combine information built by different pieces of code into a single record. One way of doing this is require each field implement the domain interface, so that you can merge record by merging each field individually. The common case is that some code may only fill out certain fields of the record and other code other disjoint fields. The empty fields are represented by ocaml None values. Then the merge you want is the record with all the fields filled out. This is the optional domain. Another common case is to merge a field that is a set by using the union or intersection of those sets.

Value = Record + Sorts

code implementing Knowledge base code

Discussion of Ivan’s talk

Jt’s notes

Core Theory Blithering

Dumping the concrete syntax of core theory https://gitter.im/BinaryAnalysisPlatform/bap?at=61fd525703f27047821b4c08 bap /bin/true --core-theory-herbrand -dknowledge

Always try --no-cache

To describe the operations of different machines, bap lifts into a common intermediate representation. From the common representation you can perform different binary analysis.

The main form representation in previous version of bap was an algebraic data type, BIL. This is to some degree still the case for some purposes, but now there is a different programming construct intended as the primary source. This thing is called the the Core Theory of bap. The word “theory”, strange as it sounds to some ears, refers to a common set of typed apis that analysis need to implement to receive lifted code. A “theory” in the context of logic is a set of types, functions/constants, and axioms. We don’t really have axioms expressible in ocaml, but we can certainly talk about types and functions.

Instead of having analysis receive a normal ocaml algebraic data type, instead they implement a finally tagless style type signature of core_theory. Instead of registering as a pass to receive the adt, they register their core_theory instance with a global database that is typically automatically called by other parts of bap.

There is a confusing intertwining of the concepts of the knowledge base and core_theory. This is done I think so that there is a possibility of using arbitrary data tucked away in the knowledge base, which operates as a kind of global state monad. In principle, it is possible to construct an analog of core_theory that has nothing to do with the knowledge base.

It is a curious fact that there is an equivalence between ordinary tree-like and a functional representation of how to use them. The idea being is that the only thing you can do to an algebraic data structure is take it apart and “summarize” it in some way. This taking apart operation is the analog of folding a list. This is related to the concept of a Church encoding.

More concretely, consider the following simple datatype describing arithmetic expressions.

type aexpr = Add of initadd * initadd | Lit of int

You can interpret it in different ways

let rec int_interp x = match x with
   | Add (x,y) -> (int_inter x) + (int_interp y)
   | Lit n -> n

let rec string_interp x = match x with
   | Add (x,y) -> sprintf "(%s + %s)" (string_inter x) (string_interp y)
   | Lit n -> int_of_string n

let ex1 = Add (Lit 1, Lit 2)
let ex1_int = int_interp ex1
let ex1_string = string_interp ex1

We can completely mechanically turn this type definition into a module signature definition. Every constructor becomes a function in the module signature.

module type AEXPR = sig
  type t
  val lit : int -> t
  val add : t -> t -> t
end

Every interpretation becomes a new module that implements this signature

module IntAExpr = struct
  type t = int
  let lit x = x
  let add x y = x + y
end

module StringAExpr = struct
  type t = string
  let lit x = string_of_int x
  let add x y = sprintf "(%s + %s)" x y
end

module Ex1 (S : AExpr) = struct
  let ex1 : S.t = S.add (S.lit 1) (S.lit 2)
end

module Int_Ex1 = Ex1(IntAExpr)
module String_Ex1 = Ex1(StringAExpr)

When you boil it down, a module is mostly a record. I say mostly because modules may also contain types. You can achieve a very similar effect by using records.

type 't aexpr_rec = {
  lit : int -> 't;
  add : 't -> 't -> 't
}

let int_rec : int aexpr_rec = {
  lit = fun x -> x;
  add : fun x y -> x + y
}

let string_rec : string aexpr_rec = {
  lit = string_of_int;
  add : fun x y -> sprintf "(%s + %s)" x y
}

let ex1 (s : 't aexpr_rec) : 't = s.add (s.lit 1) (s.lit 2)
let ex1_int = ex1 int_rec
let ex1_string = ex1 string_rec

To more closely match the above, we shouldn’t expose the underlying type 't. We can to this by packing it away in an existential type. Packing it away means we can never do anything with it though. This brings in the concept of keys which are a witness of the packed away type. It tends to be the case these keys are implemented using extensible variants.

type _ k = ..
type packed_aexpr = { key : 't k;  impl : 't aexpr_rec}

type _ k += Int : int k
type _ k += String : string k

let packed_int_aexpr = {key =  Int; impl = int_rec}
let packed_string_aexpr = {key =  String; impl = string_rec}

Using first class modules amounts to about the same thing?

Now we can make a registry of implementations.

Finally tagless is interesting for multiple reasons. Because you can combine multiple signatures in a module, finally tagless is open to new “constructors” in a way a concrete data type is not. Interpreting over the initial data type is just that, an interpreter. Interpreters usually have a performance hit. It can be the case that finally tagless has less indirection and exposes more optimizations to the compiler. YMMV. Finally tagless style is also interesting because it allows type safe interfaces that would require gadts to describe in initial style. This is because gadts allow you to constrain and hide the types in constructors in a way that you recover the types inside a pattern match. Since ordinary functions always allowed annotating more constrained types than their most general type, this is easy in the context of a finally tagless encoding

type 'a abexpr = 
   | Lit : 'a -> 'a abexpr
   | Add : int abexpr -> int abexpr -> int abexpr
   | Ite : bool abexpr -> 'a abexpr -> 'a abexpr -> 'a abexpr

let rec interp_val (type a) (x : a abexpr) : a = 
  match x with
  | Lit x -> x
  | Add (x,y) -> (interp_val x) + (interp_val y)

module type ABEXPR = sig
  type 'a t
  val lit : 'a -> 'a t
  val add : int t -> int t -> int t
  val ite : bool t -> 'a t -> 'a t -> 'a t 
end

The finally tagless style in bap performs a refactoring on this registry. Way back at the beginning, the module namespace is ultimately under the hood a dictionary in the compiler associating the implementations with a name. We can reify this to a first class value instead

packed_aexpr String.Map.t

Universal Types, Existentials and Packing

Extensible

Types describe how data can be treated. If you want to have different types treatable in the same way, you need to find to convert them to a format that is uniform. Ocaml itself achieves this by having a uniform tagging convention and pointer.s Maybe more deeply under the hood, everything is made up of the same stuff.

But two common options are boxing and serialization.

Sexp.t is a uniform serialization format in ocaml. It is actually a good universal type. “stringly typed” indeed. This is perhaps what Ivan was getting at in his comment about Ogre.

Instead of doing fancy typelevel tricks, a hlist can be presented as type hlist = Sexp.t list

An extensible record is Sexp.t String.Map.t which we could hide behind a safe interface?

Packing is sexp_of_t : t -> Sexp.t

And unpacking from the universal type is t_of_sexp : Sexp.t -> t option

module Dict : sig
  type t
  type 'a key
  type create_key : string -> (Sexp.t -> 'a) -> (t -> Sexp.t option) -> 'a key
  val put : t -> 'a key -> 'a -> t
  val get : t -> 'a key -> 'a option
end = struct
type 'a key = { name : string;
                t_of_sexp : Sexp.t -> 'a;
                sexp_of_t : 'a -> Sexp.t option
              }
type t = Sexp.t String.Map.t
let create_key name t_of_sexp sexp_of_t = {name; t_of_sexp; sexp_of_t}
let empty = String.Map.empty
let put d k v = String.Map.put d k.name (k.sexp_of_t v)
let get d k = (String.Map.get d k.name) >>= k.t_of_sexp
end

It’s almost silly.

Now, not everything has a Sexp.t representation. Functions typically don’t (we could in the case of finite types as a domain reify the function into a table and serialize that, but otherwise we need to serialize the text of the function or code, or closure, all of which is tough (probably also not impossible though).)

A more low level approach might be to use Marshalling https://ocaml.org/api/Marshal.html

It is interesting to consider what we lose here and what we gain by going to more convoluted representations.

A different representation can be had by using existential packing. The following type can indeed contain anything.

type some_type = Pack : 'a -> some_type

We can get pretty far in the above

let create_key name = name
let put d k v -> String.Map.put d k (Pack v)
let get d k -> let Pack x = String.Map.get d k in ? (* We have lost the connection between the original type and the typed key. *)  

We could maintain a table of all keys registered and only allow it to happen once in which case we’re possibly dignified in using an unsafe cast, but who knows. These things can be subtle.

let get d k -> let Pack x = String.Map.get d k in Obj.magic x

So we want to seal away the types so that things of different types are uniformly represented, but not forever. This leads us to a notion of a typed key, which is a kind of evidence we can pack up of the original type.

Typed Keys

Let’s say we have something that may work over ints and floats.

type ty = Int | Float
type univ = IntVal of int | FloatVal of float

let int_float_add (t : ty) (x : univ) (y : univ) : univ =
 match t, x, y with
 | Int, IntVal x, IntVal y -> IntVal (x + y)
 | Float, FloatVal x, FloatVal y -> FloatVal (x +. y)
 | _, _ , _ -> failwith "type mismatch in int_float_add"

You may have see this trick before. GADTs offer a convenient way to build value level thingies that tell us stuff about types. For example if we wanted to tell the different between floats and ints.

type _ ty = 
   | Int : int ty
   | Float : float ty

Now we can use this value as evidence that an incoming type is either an int or float. (As a rule of thumb, it is wise to use the type a syntax when dealing with gadts)

let int_float_add (type a) (t : a ty) (x : a) (y : a) : a = 
  match ty
  | Int -> x + y
  | Float -> x +. y

We’ve removed the indirection of univ and made a compile time type check so that we no longer have a failure case.

In a sense, this reflection gives us a first class notion of quantifying over types, reflecting the type level down to the value level. Separately, modules and functors already gave a way of second class quantifying over types, and first class modules make that first class.

Something related is known as the singleton technique for reflecting values into the type level.

type zero = Zero
type 'a succ = Succ of 'a

type _ snat = 
  | SZero : zero snat
  | SSucc : 'a snat -> 'a succ snat 

(* plus as a relation *)
type (_,_,_) plus =
  | PZero : (zero, 'a, 'a) plus
  | PSucc : ('a, 'b, 'c) plus -> ('a succ, 'b, 'c succ) plus

(* exists c, (a,b,c) plus *)
type ('a,'b) exists_plus = Pack : 'c snat * ('a,'b,'c) plus -> ('a,'b) exists_plus

let rec splus : type a b. a snat -> b snat -> (a,b) exists_plus =
  fun x y ->
  match x with
  | SZero -> Pack (y, PZero)
  | SSucc n -> let (Pack (z, pf)) = splus n y in
               Pack (SSucc z, PSucc pf)

Ocaml has a very very interesting addition to this story you won’t find in many other languages, using extensible variants. Extensible variants I believe have their history associated with exceptions, for which you could add new exception constructors to. This mechanism was generalized and interoperates with gadts and first class modules. Pretty crazy. Now instead of having to have a closed universe of possible type witnesses like in 'a ty above, we can leave the universe open and extensible.


Other discussions of interest

  • The Key Monad https://www.reddit.com/r/haskell/comments/574t1e/the_key_monad_type_safe_unconstrained_dynamic/ . Interesting examples- convert phoas to well typed de bruijn. Off the wall question: can we embed true hoas pattern matching?
  • https://apfelmus.nfshost.com/blog/2011/09/04-vault.html
  • The value restriction http://mlton.org/ValueRestriction https://counterexamples.org/polymorphic-references.html I wonder if there is a general theme that every counterexample is basically a technique in disguise. This may or may not work in rust as this is in some sense dependent upon aliasing.
  • https://github.com/c-cube/datalog/blob/a29910e3262f4e0490fb69575b1246c4b4a165ee/src/bottom_up/bottomUp.ml#L8 Universal type
  • https://blog.janestreet.com/more-expressive-gadt-encodings-via-first-class-modules/
  • http://okmij.org/ftp/ML/first-class-modules/index.html “simplistic gadts” uses a ref cell tunnel. http://okmij.org/ftp/ML/GADT.ml This is exactly using the value restriction counterexample to good use.

Channel passing style. subroutine style giving output pointers. (a -> b) -> (a -> unit * unit -> b) in ocaml. Can break apart. “Spooky action at a distance” or (a -> b) -> (a -> b ref -> unit) (a -> b ref -> ()) -> (a -> b)

###

type person = {name : string; age : int} type people = person list

A refactoring one might want to do. type people = {names : string list; ages : int list}

Now this isn’t an isomorphism, but it’s close. The new type is not constrained to have as many names as ages

These distinct reprentations show up all over the place.

There is a natural notion of product. We call it so because it obeys in a sense the usual laws of products (int * bool) * string <-> int * (bool * string)

The usual tuple interpretation functor for finally tagless: module Tup(S : ADDEXPR)(T : ADDEXPR) = struct

end

modules are kind of like records. In some sense, what we have done with defining multiple modules is isomorphic to

type ‘a all_expr_impls = { intaexpr : ‘a; stringaexpr : ‘a; } type all_aexpr = aexpr all_aexpr_impl

We can factor through two composed records like this.

We can perform a refactoring of this to The is the bap-style finally tagless for non extensbile records

type (‘a,’b) t = ‘a option hold a key? A tuple is a hetergenous record of sorts. but kind of not really. Only simpler in that the number of fields is small.

I guess type het_tup = {key1 : ‘a key; ‘a : key2 : ‘b key; val2 : ‘b} Is complicating the key packed record on one axis, by increasing the number of fields. If we want a runtime dynamic number of fields we go to keyed list

module FstString = type t = (string * _ option) let lit x = (int_of_string x, None) end

module SndInt = type t = (_ option * ) end module Join( : Fst)( : Snd) =

Project Structure

Binary / Program Analysis

Call graph Control flow graph Functions Blocks insns

Dataflow analysis. Backwards Forwards. Fixedpoint on graphs, topological sort. May/Must

Other tidbits

JT’s gists Ivan’s gists Choice gitter tips

Universal Values

https://discuss.ocaml.org/t/types-as-first-class-citizens-in-ocaml/2030 https://github.com/janestreet/core_kernel/blob/master/univ/src/univ.ml http://binaryanalysisplatform.github.io/bap/api/odoc/bap/Bap/Std/Value/index.html https://blog.shaynefletcher.org/2017/03/universal-type.html

locally abstract types. Using (type u) as an argument - useful for first class modules ande gadts

Storing first class modules in a hashtable is an example.

DIY typeclasses

universal value + registry of typeclass instances?

#Primus

  • https://feliam.wordpress.com/2010/10/07/the-symbolic-maze/ solving a maze with symbolic execution
  • tabulating a function
  • a supercompiler?
  • taint tracking is like a cup game? Some myterious program swaps and does claculations on the entries. Which final entry came from your input? Taint tracking is like abstract interpetation with 100% path sensitvity. Concolic-ish. You can track what was possible along your exact path. Could i tag probabilities in any meaningful way on variables? I tfeels like probabilites tag the whole state, not pieces of it. Partial
  • Partial evaluation/binding time analysis - Mark one input as “runtime” tainted and another as “compile” time tainted.

Primus is built in this extensible style.

You can find them in the plugins directory

I’m trying to understand the different primus plugins.

I suppose all of them can be mixed and matched but for some it be very strange to. Basically I would think you’d want only one Schedulers

Schedulers determine when and how forked machines are put into and take out of some kind of storage structure

Schedulers:

  • primus-promiscuous
  • primus-greedy

Forkers decide how and when to fork

  • primus-random - fuzzing?
  • primus-symbolic - symbolic execution?

primus-limit = limit the number of somethigns a machine can do before being killed. depth limited search primus-print - just print various observations when they get fired

Misc

Brumley’s old bap notes

--print-missing flag for missing instruction semantics https://github.com/BinaryAnalysisPlatform/bap/pull/1409

https://watch.ocaml.org/videos/watch/8dc2d8d3-c140-4c3d-a8fe-a6fcf6fba988 JVM and C support in the futute primus lisp - common lisp like C and python ctypes bindings

semantics - either in ocaml dsl or primus lisp dsl

(defun rTST (rn rm _ _ ) “tst rn, rm” (let ((rd (logand rn rm)) (set ZF (is-zero rd)))) (set NF ()) )

dependency injection dynamic linking

framework inital style insufficient because need to update extensible variants (GADTS)? no not abstract. Heavyweight. Not serializable higher kinded