714 lines
27 KiB
Org Mode
714 lines
27 KiB
Org Mode
\begin{comment}
|
||
* TODO Scaletta [1/5]
|
||
- [X] Abstract
|
||
- [-] Background [40%]
|
||
- [X]
|
||
- [ ] Lambda code [0%]
|
||
- [ ] Compiler optimizations
|
||
- [ ] other instructions
|
||
- [X] Pattern matching
|
||
- [ ] Symbolic execution
|
||
- [ ] Translation Validation
|
||
- [ ] Translation validation of the Pattern Matching Compiler
|
||
- [ ] Source translation
|
||
- [ ] Formal Grammar
|
||
- [ ] Compilation of source patterns
|
||
- [ ] Rest?
|
||
- [ ] Target translation
|
||
- [ ] Formal Grammar
|
||
- [ ] Symbolic execution of the lambda target
|
||
- [ ] Equivalence between source and target
|
||
- [ ] Proof of correctness
|
||
- [ ] Practical results
|
||
|
||
\end{comment}
|
||
|
||
#+TITLE: Translation Verification of the pattern matching compiler
|
||
#+AUTHOR: Francesco Mecca
|
||
#+EMAIL: me@francescomecca.eu
|
||
#+DATE:
|
||
|
||
#+LANGUAGE: en
|
||
#+LaTeX_CLASS: article
|
||
#+LaTeX_HEADER: \usepackage{algorithm}
|
||
#+LaTeX_HEADER: \usepackage{comment}
|
||
#+LaTeX_HEADER: \usepackage{algpseudocode}
|
||
#+LaTeX_HEADER: \usepackage{amsmath,amssymb,amsthm}
|
||
#+Latex_HEADER: \newtheorem{definition}{Definition}
|
||
#+LaTeX_HEADER: \usepackage{graphicx}
|
||
#+LaTeX_HEADER: \usepackage{listings}
|
||
#+LaTeX_HEADER: \usepackage{color}
|
||
#+LaTeX_HEADER: \usepackage{stmaryrd}
|
||
#+LaTeX_HEADER: \newcommand{\sem}[1]{{\llbracket{#1}\rrbracket}}
|
||
#+EXPORT_SELECT_TAGS: export
|
||
#+EXPORT_EXCLUDE_TAGS: noexport
|
||
#+OPTIONS: H:2 toc:nil \n:nil @:t ::t |:t ^:{} _:{} *:t TeX:t LaTeX:t
|
||
#+STARTUP: showall
|
||
\section{Introduction}
|
||
|
||
This dissertation presents an algorithm for the translation validation of the OCaml pattern
|
||
matching compiler. Given a source program and its compiled version the
|
||
algorithm checks wheter the two are equivalent or produce a counter
|
||
example in case of a mismatch.
|
||
|
||
Our equivalence algorithm works with decision trees. Source patterns are
|
||
converted into a decision tree using a matrix decomposition algorithm.
|
||
Target programs, described in the Lambda intermediate
|
||
representation language of the OCaml compiler, are turned into decision trees
|
||
by applying symbolic execution.
|
||
|
||
\subsection{Translation validation}
|
||
A pattern matching compiler turns a series of pattern matching clauses
|
||
into simple control flow structures such as \texttt{if, switch}, for example:
|
||
\begin{lstlisting}
|
||
match x with
|
||
| [] -> (0, None)
|
||
| x::[] -> (1, Some x)
|
||
| _::y::_ -> (2, Some y)
|
||
\end{lstlisting}
|
||
\begin{lstlisting}
|
||
(if scrutinee
|
||
(let (field_1 =a (field 1 scrutinee))
|
||
(if field_1
|
||
(let
|
||
(field_1_1 =a (field 1 field_1)
|
||
x =a (field 0 field_1))
|
||
(makeblock 0 2 (makeblock 0 x)))
|
||
(let (y =a (field 0 scrutinee))
|
||
(makeblock 0 1 (makeblock 0 y)))))
|
||
[0: 0 0a])
|
||
\end{lstlisting}
|
||
%% TODO: side by side
|
||
The code in the right is in the Lambda intermediate representation of
|
||
the OCaml compiler. The Lambda representation of a program is shown by
|
||
calling the \texttt{ocamlc} compiler with \texttt{-drawlambda} flag.
|
||
|
||
The OCaml pattern matching compiler is a critical part of the OCaml compiler
|
||
in terms of correctness because any bug would result in wrong code
|
||
production rather than triggering compilation failures.
|
||
Such bugs also are hard to catch by testing because they arise in
|
||
corner cases of complex patterns which are typically not in the
|
||
compiler test suite.
|
||
|
||
The OCaml core developers group considered evolving the pattern matching compiler, either by
|
||
using a new algorithm or by incremental refactorings of its codebase.
|
||
For this reason we want to verify that new implementations of the
|
||
compiler avoid the introduction of new bugs and that such
|
||
modifications don't result in a different behaviour than the current one.
|
||
|
||
One possible approach is to formally verify the pattern matching compiler
|
||
implementation using a machine checked proof.
|
||
Another possibility, albeit with a weaker result, is to verify that
|
||
each source program and target program pair are semantically correct.
|
||
We chose the latter technique, translation validation because is easier to adopt in
|
||
the case of a production compiler and to integrate with an existing codebase. The compiler is treated as a
|
||
blackbox and proof only depends on our equivalence algorithm.
|
||
|
||
\subsection{Our approach}
|
||
%% replace common TODO
|
||
Our algorithm translates both source and target programs into a common
|
||
representation, decision trees. Decision trees where choosen because
|
||
they model the space of possible values at a given branch of
|
||
execution.
|
||
Here is the decision tree for the source example program.
|
||
\begin{verbatim}
|
||
Node(Root)
|
||
/ \
|
||
(= []) (= ::)
|
||
/ \
|
||
Leaf Node(Root.1)
|
||
(0, None) / \
|
||
(= []) (= ::)
|
||
/ \
|
||
Leaf Leaf
|
||
(1, Some(Root.0)) (2, Some(Root.1.0))
|
||
\end{verbatim}
|
||
\texttt{(Root.0)} is called an \emph{accessor}, that represents the
|
||
access path to a value that can be reached by deconstructing the
|
||
scrutinee. In this example \texttt{Root.0} is the first subvalue of
|
||
the scrutinee.
|
||
|
||
Target decision trees have a similar shape but the tests on the
|
||
branches are related to the low level representation of values in
|
||
Lambda code. For example, cons cells \texttt{x::xs} are blocks with
|
||
tag 0.
|
||
|
||
To check the equivalence of a source and a target decision tree,
|
||
we proceed by case analysis.
|
||
If we have two terminals, such as leaves in the previous example,
|
||
we check that the two right-hand-sides are equivalent.
|
||
If we have a node $N$ and another tree $T$ we check equivalence for
|
||
each child of $N$, which is a pair of a branch condition $\pi_i$ and a
|
||
subtree $C_i$. For every child $(\pi_i, C_i)$ we reduce $T$ by killing all
|
||
the branches that are incompatible with $\pi_i$ and check that the
|
||
reduced tree is equivalent to $C_i$.
|
||
|
||
For the internship we have choosen a simple subset of the OCaml
|
||
language and implemented a prototype equivalence checker along with a
|
||
formal statement of correctness and proof sketches.
|
||
The prototype is to be included in the OCaml compiler infrastructure
|
||
and will aid the development.
|
||
|
||
\subsection{From source programs to decision trees}
|
||
Our source language supports integers, lists, tuples and all algebraic
|
||
datatypes. Patterns support wildcards, constructors and literals, or
|
||
patterns $(p_1|p_2)$ and pattern variables.
|
||
We also support \texttt{when} guards.
|
||
Decision trees have nodes of the form:
|
||
\begin{lstlisting}
|
||
type decision_tree =
|
||
| Unreachable
|
||
| Failure
|
||
| Leaf of source_expr
|
||
| Guard of source_expr * decision_tree * decision_tree
|
||
| Switch of accessor * (constructor * decision_tree) list * decision_tree
|
||
\end{lstlisting}
|
||
In the \texttt{Switch} node we have one subtree for every head constructor
|
||
that appears in the pattern matching clauses and a fallback case for
|
||
other values. The branch condition $\pi_i$ expresses that the value at the
|
||
switch accessor starts with the given constructor.
|
||
\texttt{Failure} nodes express match failures for values that are not
|
||
matched by the source clauses.
|
||
\texttt{Unreachable} is used when we statically know that no value
|
||
can flow to that subtree.
|
||
|
||
We write $\sem{t_S}_S$ for the decision tree of the source program
|
||
$t_S$, computed by a matrix decomposition algorithm (each column
|
||
decomposition step gives a \texttt{Switch} node).
|
||
It satisfies the following correctness statement:
|
||
\[
|
||
\forall t_S, \forall v_S, \quad t_S(v_S) = \sem{t_S}_S(v_S)
|
||
\]
|
||
Running any source values $v_S$ against the source program gives the
|
||
same result as running it against the decision tree.
|
||
|
||
\subsection{From target programs to decision trees}
|
||
The target programs include the following Lambda constructs:
|
||
\texttt{let, if, switch, Match\_failure, catch, exit, field} and
|
||
various comparation operations, guards. The symbolic execution engine
|
||
traverses the target program and builds an environment that maps
|
||
variables to accessors. It branches at every control flow statement
|
||
and emits a Switch node. The branch condition $\pi_i$ is expressed as
|
||
an interval set of possible values at that point.
|
||
Guards result in branching. In comparison with the source decision
|
||
trees, \texttt{Unreachable} nodes are never emitted.
|
||
|
||
We write $\sem{t_T}_T$ for the decision tree of the target program
|
||
$t_T$, satisfying the following correctness statement:
|
||
\[
|
||
\forall t_T, \forall v_T, \quad t_T(v_T) = \sem{t_T}_T(v_T)
|
||
\]
|
||
|
||
\subsection{Equivalence checking}
|
||
|
||
|
||
|
||
* Background
|
||
|
||
**
|
||
Objective Caml () is a dialect of the ML (Meta-Language) family of programming
|
||
languages.
|
||
shares many features with other dialects of ML, such as SML and Caml Light,
|
||
The main features of ML languages are the use of the Hindley-Milner type system that
|
||
provides many advantages with respect to static type systems of traditional imperative and object
|
||
oriented language such as C, C++ and Java, such as:
|
||
- Polymorphism: in certain scenarios a function can accept more than one
|
||
type for the input parameters. For example a function that computes the lenght of a
|
||
list doesn't need to inspect the type of the elements of the list and for this reason
|
||
a List.length function can accept lists of integers, lists of strings and in general
|
||
lists of any type. Such languages offer polymorphic functions through subtyping at
|
||
runtime only, while other languages such as C++ offer polymorphism through compile
|
||
time templates and function overloading.
|
||
With the Hindley-Milner type system each well typed function can have more than one
|
||
type but always has a unique best type, called the /principal type/.
|
||
For example the principal type of the List.length function is "For any /a/, function from
|
||
list of /a/ to /int/" and /a/ is called the /type parameter/.
|
||
- Strong typing: Languages such as C and C++ allow the programmer to operate on data
|
||
without considering its type, mainly through pointers. Other languages such as C#
|
||
and Go allow type erasure so at runtime the type of the data can't be queried.
|
||
In the case of programming languages using an Hindley-Milner type system the
|
||
programmer is not allowed to operate on data by ignoring or promoting its type.
|
||
- Type Inference: the principal type of a well formed term can be inferred without any
|
||
annotation or declaration.
|
||
- Algebraic data types: types that are modelled by the use of two
|
||
algebraic operations, sum and product.
|
||
A sum type is a type that can hold of many different types of
|
||
objects, but only one at a time. For example the sum type defined
|
||
as /A + B/ can hold at any moment a value of type A or a value of
|
||
type B. Sum types are also called tagged union or variants.
|
||
A product type is a type constructed as a direct product
|
||
of multiple types and contains at any moment one instance for
|
||
every type of its operands. Product types are also called tuples
|
||
or records. Algebraic data types can be recursive
|
||
in their definition and can be combined.
|
||
Moreover ML languages are functional, meaning that functions are
|
||
treated as first class citizens and variables are immutable,
|
||
although mutable statements and imperative constructs are permitted.
|
||
In addition to that features an object system, that provides
|
||
inheritance, subtyping and dynamic binding, and modules, that
|
||
provide a way to encapsulate definitions. Modules are checked
|
||
statically and can be reificated through functors.
|
||
|
||
** Lambda form compilation
|
||
\begin{comment}
|
||
https://dev.realworld.org/compiler-backend.html
|
||
\end{comment}
|
||
|
||
provides compilation in form of a byecode executable with an
|
||
optionally embeddable interpreter and a native executable that could
|
||
be statically linked to provide a single file executable.
|
||
|
||
After the typechecker has proven that the program is type safe,
|
||
the compiler lower the code to /Lambda/, an s-expression based
|
||
language that assumes that its input has already been proved safe.
|
||
On the /Lambda/ representation of the source program, the compiler
|
||
performes a series of optimization passes before translating the
|
||
lambda form to assembly code.
|
||
|
||
*** datatypes
|
||
|
||
Most native data types in , such as integers, tuples, lists,
|
||
records, can be seen as instances of the following definition
|
||
|
||
#+BEGIN_SRC
|
||
type t = Nil | One of int | Cons of int * t
|
||
#+END_SRC
|
||
that is a type /t/ with three constructors that define its complete
|
||
signature.
|
||
Every constructor has an arity. Nil, a constructor of arity 0, is
|
||
called a constant constructor.
|
||
|
||
*** Lambda form types
|
||
A lambda form target file produced by the compiler consists of a
|
||
single s-expression. Every s-expression consist of /(/, a sequence of
|
||
elements separated by a whitespace and a closing /)/.
|
||
Elements of s-expressions are:
|
||
- Atoms: sequences of ascii letters, digits or symbols
|
||
- Variables
|
||
- Strings: enclosed in double quotes and possibly escaped
|
||
- S-expressions: allowing arbitrary nesting
|
||
|
||
There are several numeric types:
|
||
- integers: that us either 31 or 63 bit two's complement arithmetic
|
||
depending on system word size, and also wrapping on overflow
|
||
- 32 bit and 64 bit integers: that use 32-bit and 64-bit two's complement arithmetic
|
||
with wrap on overflow
|
||
- big integers: offer integers with arbitrary precision
|
||
- floats: that use IEEE754 double-precision (64-bit) arithmetic with
|
||
the addition of the literals /infinity/, /neg_infinity/ and /nan/.
|
||
|
||
The are varios numeric operations defined:
|
||
|
||
- Arithmetic operations: +, -, *, /, % (modulo), neg (unary negation)
|
||
- Bitwise operations: &, |, ^, <<, >> (zero-shifting), a>> (sign extending)
|
||
- Numeric comparisons: <, >, <=, >=, ==
|
||
|
||
*** Functions
|
||
|
||
Functions are defined using the following syntax, and close over all
|
||
bindings in scope: (lambda (arg1 arg2 arg3) BODY)
|
||
and are applied using the following syntax: (apply FUNC ARG ARG ARG)
|
||
Evaluation is eager.
|
||
|
||
*** Bindings
|
||
The atom /let/ introduces a sequence of bindings:
|
||
(let BINDING BINDING BINDING ... BODY)
|
||
|
||
*** Other atoms
|
||
TODO: if, switch, stringswitch...
|
||
TODO: magari esempi
|
||
|
||
|
||
** Pattern matching
|
||
|
||
Pattern matching is a widely adopted mechanism to interact with ADT.
|
||
C family languages provide branching on predicates through the use of
|
||
if statements and switch statements.
|
||
Pattern matching on the other hands express predicates through
|
||
syntactic templates that also allow to bind on data structures of
|
||
arbitrary shapes. One common example of pattern matching is the use of regular
|
||
expressions on strings. provides pattern matching on ADT and
|
||
primitive data types.
|
||
The result of a pattern matching operation is always one of:
|
||
- this value does not match this pattern”
|
||
- this value matches this pattern, resulting the following bindings of
|
||
names to values and the jump to the expression pointed at the
|
||
pattern.
|
||
|
||
#+BEGIN_SRC
|
||
type color = | Red | Blue | Green | Black | White
|
||
|
||
match color with
|
||
| Red -> print "red"
|
||
| Blue -> print "red"
|
||
| Green -> print "red"
|
||
| _ -> print "white or black"
|
||
#+END_SRC
|
||
|
||
provides tokens to express data destructoring.
|
||
For example we can examine the content of a list with patten matching
|
||
|
||
#+BEGIN_SRC
|
||
|
||
begin match list with
|
||
| [ ] -> print "empty list"
|
||
| element1 :: [ ] -> print "one element"
|
||
| (element1 :: element2) :: [ ] -> print "two elements"
|
||
| head :: tail-> print "head followed by many elements"
|
||
#+END_SRC
|
||
|
||
Parenthesized patterns, such as the third one in the previous example,
|
||
matches the same value as the pattern without parenthesis.
|
||
|
||
The same could be done with tuples
|
||
#+BEGIN_SRC
|
||
|
||
begin match tuple with
|
||
| (Some _, Some _) -> print "Pair of optional types"
|
||
| (Some _, None) | (None, Some _) -> print "Pair of optional types, one of which is null"
|
||
| (None, None) -> print "Pair of optional types, both null"
|
||
#+END_SRC
|
||
|
||
The pattern pattern₁ | pattern₂ represents the logical "or" of the
|
||
two patterns pattern₁ and pattern₂. A value matches pattern₁ |
|
||
pattern₂ if it matches pattern₁ or pattern₂.
|
||
|
||
Pattern clauses can make the use of /guards/ to test predicates and
|
||
variables can captured (binded in scope).
|
||
|
||
#+BEGIN_SRC
|
||
|
||
begin match token_list with
|
||
| "switch"::var::"{"::rest -> ...
|
||
| "case"::":"::var::rest when is_int var -> ...
|
||
| "case"::":"::var::rest when is_string var -> ...
|
||
| "}"::[ ] -> ...
|
||
| "}"::rest -> error "syntax error: " rest
|
||
|
||
#+END_SRC
|
||
|
||
Moreover, the pattern matching compiler emits a warning when a
|
||
pattern is not exhaustive or some patterns are shadowed by precedent ones.
|
||
|
||
** Symbolic execution
|
||
|
||
** Translation validation
|
||
Translators, such as translators and code generators, are huge pieces of
|
||
software usually consisting of multiple subsystem and
|
||
constructing an actual specification of a translator implementation for
|
||
formal validation is a very long task. Moreover, different
|
||
translators implement different algorithms, so the correctness proof of
|
||
a translator cannot be generalized and reused to prove another translator.
|
||
Translation validation is an alternative to the verification of
|
||
existing translators that consists of taking the source and the target
|
||
(compiled) program and proving /a posteriori/ their semantic equivalence.
|
||
|
||
- [ ] Techniques for translation validation
|
||
- [ ] What does semantically equivalent mean
|
||
- [ ] What happens when there is no semantic equivalence
|
||
- [ ] Translation validation through symbolic execution
|
||
|
||
** Translation validation of the Pattern Matching Compiler
|
||
|
||
*** Source program
|
||
The algorithm takes as its input a source program and translates it
|
||
into an algebraic data structure called /constraint_tree/.
|
||
|
||
#+BEGIN_SRC
|
||
type constraint_tree =
|
||
| Unreachable
|
||
| Failure
|
||
| Leaf of source_expr
|
||
| Guard of source_blackbox * constraint_tree * constraint_tree
|
||
| Node of accessor * (constructor * constraint_tree) list * constraint_tree
|
||
#+END_SRC
|
||
|
||
Unreachable, Leaf of source_expr and Failure are the terminals of the three.
|
||
We distinguish
|
||
- Unreachable: statically it is known that no value can go there
|
||
- Failure: a value matching this part results in an error
|
||
- Leaf: a value matching this part results into the evaluation of a
|
||
source blackbox of code
|
||
|
||
The algorithm doesn't support type-declaration-based analysis
|
||
to know the list of constructors at a given type.
|
||
Let's consider some trivial examples:
|
||
|
||
#+BEGIN_SRC
|
||
function true -> 1
|
||
#+END_SRC
|
||
|
||
[ ] Converti a disegni
|
||
|
||
Is translated to
|
||
|Node ([(true, Leaf 1)], Failure)
|
||
while
|
||
#+BEGIN_SRC
|
||
function
|
||
true -> 1
|
||
| false -> 2
|
||
#+END_SRC
|
||
will give
|
||
|Node ([(true, Leaf 1); (false, Leaf 2)])
|
||
|
||
It is possible to produce Unreachable examples by using
|
||
refutation clauses (a "dot" in the right-hand-side)
|
||
#+BEGIN_SRC
|
||
function
|
||
true -> 1
|
||
| false -> 2
|
||
| _ -> .
|
||
#+END_SRC
|
||
that gets translated into
|
||
Node ([(true, Leaf 1); (false, Leaf 2)], Unreachable)
|
||
|
||
We trust this annotation, which is reasonable as the type-checker
|
||
verifies that it indeed holds.
|
||
|
||
Guard nodes of the tree are emitted whenever a guard is found. Guards
|
||
node contains a blackbox of code that is never evaluated and two
|
||
branches, one that is taken in case the guard evaluates to true and
|
||
the other one that contains the path taken when the guard evaluates to
|
||
true.
|
||
|
||
[ ] Finisci con Node
|
||
[ ] Spiega del fallback
|
||
[ ] rivedi compilazione per tenere in considerazione il tuo albero invece che le lambda
|
||
[ ] Specifica che stesso algoritmo usato per compilare a lambda, piu` optimizations
|
||
|
||
The source code of a pattern matching function in has the
|
||
following form:
|
||
|
||
|match variable with
|
||
|\vert pattern₁ -> expr₁
|
||
|\vert pattern₂ when guard -> expr₂
|
||
|\vert pattern₃ as var -> expr₃
|
||
|⋮
|
||
|\vert pₙ -> exprₙ
|
||
|
||
and can include any expression that is legal for the compiler,
|
||
such as "when" conditions and assignments. Patterns could or could not
|
||
be exhaustive.
|
||
|
||
Pattern matching code could also be written using the more compact form:
|
||
|function
|
||
|\vert pattern₁ -> expr₁
|
||
|\vert pattern₂ when guard -> expr₂
|
||
|\vert pattern₃ as var -> expr₃
|
||
|⋮
|
||
|\vert pₙ -> exprₙ
|
||
|
||
|
||
This BNF grammar describes formally the grammar of the source program:
|
||
|
||
#+BEGIN_SRC bnf
|
||
start ::= "match" id "with" patterns | "function" patterns
|
||
patterns ::= (pattern0|pattern1) pattern1+
|
||
;; pattern0 and pattern1 are needed to distinguish the first case in which
|
||
;; we can avoid writing the optional vertical line
|
||
pattern0 ::= clause
|
||
pattern1 ::= "|" clause
|
||
clause ::= lexpr "->" rexpr
|
||
|
||
lexpr ::= rule (ε|condition)
|
||
rexpr ::= _code ;; arbitrary code
|
||
|
||
rule ::= wildcard|variable|constructor_pattern|or_pattern ;;
|
||
|
||
;; rules
|
||
wildcard ::= "_"
|
||
variable ::= identifier
|
||
constructor_pattern ::= constructor (rule|ε) (assignment|ε)
|
||
|
||
constructor ::= int|float|char|string|bool
|
||
|unit|record|exn|objects|ref
|
||
|list|tuple|array
|
||
|variant|parameterized_variant ;; data types
|
||
|
||
or_pattern ::= wildcard|variable|constructor_pattern ("|" wildcard|variable|constructor_pattern)+
|
||
|
||
condition ::= "when" bexpr
|
||
assignment ::= "as" id
|
||
bexpr ::= _code ;; arbitrary code
|
||
#+END_SRC
|
||
|
||
\begin{comment}
|
||
Check that it is still this
|
||
\end{comment}
|
||
|
||
Patterns are of the form
|
||
| pattern | type of pattern |
|
||
|-----------------+---------------------|
|
||
| _ | wildcard |
|
||
| x | variable |
|
||
| c(p₁,p₂,...,pₙ) | constructor pattern |
|
||
| (p₁\vert p₂) | or-pattern |
|
||
|
||
During compilation by the translators expressions are compiled into
|
||
lambda code and are referred as lambda code actions lᵢ.
|
||
|
||
The entire pattern matching code is represented as a clause matrix
|
||
that associates rows of patterns (p_{i,1}, p_{i,2}, ..., p_{i,n}) to
|
||
lambda code action lⁱ
|
||
\begin{equation*}
|
||
(P → L) =
|
||
\begin{pmatrix}
|
||
p_{1,1} & p_{1,2} & \cdots & p_{1,n} & → l₁ \\
|
||
p_{2,1} & p_{2,2} & \cdots & p_{2,n} & → l₂ \\
|
||
\vdots & \vdots & \ddots & \vdots & → \vdots \\
|
||
p_{m,1} & p_{m,2} & \cdots & p_{m,n} & → lₘ
|
||
\end{pmatrix}
|
||
\end{equation*}
|
||
|
||
The pattern /p/ matches a value /v/, written as p ≼ v, when one of the
|
||
following rules apply
|
||
|
||
|--------------------+---+--------------------+-------------------------------------------|
|
||
| _ | ≼ | v | ∀v |
|
||
| x | ≼ | v | ∀v |
|
||
| (p₁ \vert\ p₂) | ≼ | v | iff p₁ ≼ v or p₂ ≼ v |
|
||
| c(p₁, p₂, ..., pₐ) | ≼ | c(v₁, v₂, ..., vₐ) | iff (p₁, p₂, ..., pₐ) ≼ (v₁, v₂, ..., vₐ) |
|
||
| (p₁, p₂, ..., pₐ) | ≼ | (v₁, v₂, ..., vₐ) | iff pᵢ ≼ vᵢ ∀i ∈ [1..a] |
|
||
|--------------------+---+--------------------+-------------------------------------------|
|
||
|
||
When a value /v/ matches pattern /p/ we say that /v/ is an /instance/ of /p/.
|
||
|
||
Considering the pattern matrix P we say that the value vector
|
||
$\vec{v}$ = (v₁, v₂, ..., vᵢ) matches the line number i in P if and only if the following two
|
||
conditions are satisfied:
|
||
- p_{i,1}, p_{i,2}, \cdots, p_{i,n} ≼ (v₁, v₂, ..., vᵢ)
|
||
- ∀j < i p_{j,1}, p_{j,2}, \cdots, p_{j,n} ⋠ (v₁, v₂, ..., vᵢ)
|
||
|
||
We can define the following three relations with respect to patterns:
|
||
- Patter p is less precise than pattern q, written p ≼ q, when all
|
||
instances of q are instances of p
|
||
- Pattern p and q are equivalent, written p ≡ q, when their instances
|
||
are the same
|
||
- Patterns p and q are compatible when they share a common instance
|
||
|
||
**** Initial state of the compilation
|
||
|
||
Given a source of the following form:
|
||
|
||
|
||
|match variable with
|
||
|\vert pattern₁ -> e₁
|
||
|\vert pattern₂ -> e₂
|
||
|⋮
|
||
|\vert pₘ -> eₘ
|
||
|
||
the initial input of the algorithm C consists of a vector of variables
|
||
$\vec{x}$ = (x₁, x₂, ..., xₙ) of size /n/ where /n/ is the arity of
|
||
the type of /x/ and a clause matrix P → L of width n and height m.
|
||
That is:
|
||
|
||
\begin{equation*}
|
||
C((\vec{x} = (x₁, x₂, ..., xₙ),
|
||
\begin{pmatrix}
|
||
p_{1,1} & p_{1,2} & \cdots & p_{1,n} → l₁ \\
|
||
p_{2,1} & p_{2,2} & \cdots & p_{2,n} → l₂ \\
|
||
\vdots & \vdots & \ddots & \vdots → \vdots \\
|
||
p_{m,1} & p_{m,2} & \cdots & p_{m,n} → lₘ)
|
||
\end{pmatrix}
|
||
\end{equation*}
|
||
|
||
The base case C₀ of the algorithm is the case in which the $\vec{x}$
|
||
is empty, that is $\vec{x}$ = (), then the result of the compilation
|
||
C₀ is l₁
|
||
\begin{equation*}
|
||
C₀((),
|
||
\begin{pmatrix}
|
||
→ l₁ \\
|
||
→ l₂ \\
|
||
→ \vdots \\
|
||
→ lₘ
|
||
\end{pmatrix})
|
||
) = l₁
|
||
\end{equation*}
|
||
|
||
When $\vec{x}$ ≠ () then the compilation advances using one of the
|
||
following four rules:
|
||
|
||
1) Variable rule: if all patterns of the first column of P are wildcard patterns or
|
||
bind the value to a variable, then
|
||
|
||
\begin{equation*}
|
||
C(\vec{x}, P → L) = C((x₂, x₃, ..., xₙ), P' → L')
|
||
\end{equation*}
|
||
where
|
||
\begin{equation*}
|
||
\begin{pmatrix}
|
||
p_{1,2} & \cdots & p_{1,n} & → & (let & y₁ & x₁) & l₁ \\
|
||
p_{2,2} & \cdots & p_{2,n} & → & (let & y₂ & x₁) & l₂ \\
|
||
\vdots & \ddots & \vdots & → & \vdots & \vdots & \vdots & \vdots \\
|
||
p_{m,2} & \cdots & p_{m,n} & → & (let & yₘ & x₁) & lₘ
|
||
\end{pmatrix}
|
||
\end{equation*}
|
||
|
||
That means in every lambda action lᵢ there is a binding of x₁ to the
|
||
variable that appears on the pattern $p_{i,1}. Bindings are omitted
|
||
for wildcard patterns and the lambda action lᵢ remains unchanged.
|
||
|
||
2) Constructor rule: if all patterns in the first column of P are
|
||
constructors patterns of the form k(q₁, q₂, ..., qₙ) we define a
|
||
new matrix, the specialized clause matrix S, by applying the
|
||
following transformation on every row /p/:
|
||
\begin{lstlisting}[mathescape,columns=fullflexible,basicstyle=\fontfamily{lmvtt}\selectfont,]
|
||
for every c ∈ Set of constructors
|
||
for i ← 1 .. m
|
||
let kᵢ ← constructor_of($p_{i,1}$)
|
||
if kᵢ = c then
|
||
p ← $q_{i,1}$, $q_{i,2}$, ..., $q_{i,n'}$, $p_{i,2}$, $p_{i,3}$, ..., $p_{i, n}$
|
||
\end{lstlisting}
|
||
Patterns of the form $q_{i,j}$ matches on the values of the
|
||
constructor and we define new fresh variables y₁, y₂, ..., yₐ so
|
||
that the lambda action lᵢ becomes
|
||
|
||
\begin{lstlisting}[mathescape,columns=fullflexible,basicstyle=\fontfamily{lmvtt}\selectfont,]
|
||
(let (y₁ (field 0 x₁))
|
||
(y₂ (field 1 x₁))
|
||
...
|
||
(yₐ (field (a-1) x₁))
|
||
lᵢ)
|
||
\end{lstlisting}
|
||
|
||
and the result of the compilation for the set of constructors
|
||
{c₁, c₂, ..., cₖ} is:
|
||
|
||
\begin{lstlisting}[mathescape,columns=fullflexible,basicstyle=\fontfamily{lmvtt}\selectfont,]
|
||
switch x₁ with
|
||
case c₁: l₁
|
||
case c₂: l₂
|
||
...
|
||
case cₖ: lₖ
|
||
default: exit
|
||
\end{lstlisting}
|
||
|
||
3) Orpat rule: there are various strategies for dealing with
|
||
or-patterns. The most naive one is to split the or-patterns.
|
||
For example a row p containing an or-pattern:
|
||
\begin{equation*}
|
||
(p_{i,1}|q_{i,1}|r_{i,1}), p_{i,2}, ..., p_{i,m} → lᵢ
|
||
\end{equation*}
|
||
results in three rows added to the clause matrix
|
||
\begin{equation*}
|
||
p_{i,1}, p_{i,2}, ..., p_{i,m} → lᵢ \\
|
||
\end{equation*}
|
||
\begin{equation*}
|
||
q_{i,1}, p_{i,2}, ..., p_{i,m} → lᵢ \\
|
||
\end{equation*}
|
||
\begin{equation*}
|
||
r_{i,1}, p_{i,2}, ..., p_{i,m} → lᵢ
|
||
\end{equation*}
|
||
4) Mixture rule:
|
||
When none of the previous rules apply the clause matrix P → L is
|
||
splitted into two clause matrices, the first P₁ → L₁ that is the
|
||
largest prefix matrix for which one of the three previous rules
|
||
apply, and P₂ → L₂ containing the remaining rows. The algorithm is
|
||
applied to both matrices.
|
||
|
||
\begin{comment}
|
||
#+BEGIN_COMMENT
|
||
CITE paper?
|
||
#+END_COMMENT’
|
||
\end{comment}
|