UniTO/tesi/tesi_unicode.org

715 lines
27 KiB
Org Mode
Raw Normal View History

2020-02-24 19:46:00 +01:00
\begin{comment}
2020-03-02 14:46:37 +01:00
* TODO Scaletta [1/5]
2020-02-17 17:31:11 +01:00
- [X] Abstract
2020-02-24 19:46:00 +01:00
- [-] Background [40%]
2020-03-03 17:18:40 +01:00
- [X]
2020-02-21 11:29:04 +01:00
- [ ] Lambda code [0%]
2020-03-02 14:46:37 +01:00
- [ ] Compiler optimizations
- [ ] other instructions
- [X] Pattern matching
2020-02-17 17:31:11 +01:00
- [ ] Symbolic execution
2020-03-02 14:46:37 +01:00
- [ ] Translation Validation
- [ ] Translation validation of the Pattern Matching Compiler
2020-02-21 11:29:04 +01:00
- [ ] Source translation
- [ ] Formal Grammar
- [ ] Compilation of source patterns
2020-03-02 14:46:37 +01:00
- [ ] Rest?
2020-02-21 11:29:04 +01:00
- [ ] Target translation
- [ ] Formal Grammar
- [ ] Symbolic execution of the lambda target
- [ ] Equivalence between source and target
2020-03-02 14:46:37 +01:00
- [ ] Proof of correctness
2020-02-24 19:46:00 +01:00
- [ ] Practical results
2020-02-21 11:29:04 +01:00
2020-02-24 19:46:00 +01:00
\end{comment}
2020-02-17 17:31:11 +01:00
2020-03-03 17:18:40 +01:00
#+TITLE: Translation Verification of the pattern matching compiler
2020-02-17 17:31:11 +01:00
#+AUTHOR: Francesco Mecca
#+EMAIL: me@francescomecca.eu
#+DATE:
#+LANGUAGE: en
#+LaTeX_CLASS: article
#+LaTeX_HEADER: \usepackage{algorithm}
2020-02-24 19:46:00 +01:00
#+LaTeX_HEADER: \usepackage{comment}
2020-02-17 17:31:11 +01:00
#+LaTeX_HEADER: \usepackage{algpseudocode}
#+LaTeX_HEADER: \usepackage{amsmath,amssymb,amsthm}
#+Latex_HEADER: \newtheorem{definition}{Definition}
#+LaTeX_HEADER: \usepackage{graphicx}
#+LaTeX_HEADER: \usepackage{listings}
#+LaTeX_HEADER: \usepackage{color}
2020-03-12 12:20:07 +01:00
#+LaTeX_HEADER: \usepackage{stmaryrd}
#+LaTeX_HEADER: \newcommand{\sem}[1]{{\llbracket{#1}\rrbracket}}
2020-02-17 17:31:11 +01:00
#+EXPORT_SELECT_TAGS: export
#+EXPORT_EXCLUDE_TAGS: noexport
#+OPTIONS: H:2 toc:nil \n:nil @:t ::t |:t ^:{} _:{} *:t TeX:t LaTeX:t
#+STARTUP: showall
2020-03-12 12:20:07 +01:00
\section{Introduction}
This dissertation presents an algorithm for the translation validation of the OCaml pattern
matching compiler. Given a source program and its compiled version the
algorithm checks wheter the two are equivalent or produce a counter
example in case of a mismatch.
Our equivalence algorithm works with decision trees. Source patterns are
converted into a decision tree using a matrix decomposition algorithm.
Target programs, described in the Lambda intermediate
representation language of the OCaml compiler, are turned into decision trees
by applying symbolic execution.
\subsection{Translation validation}
A pattern matching compiler turns a series of pattern matching clauses
into simple control flow structures such as \texttt{if, switch}, for example:
\begin{lstlisting}
match x with
| [] -> (0, None)
| x::[] -> (1, Some x)
| _::y::_ -> (2, Some y)
\end{lstlisting}
\begin{lstlisting}
(if scrutinee
(let (field_1 =a (field 1 scrutinee))
(if field_1
(let
(field_1_1 =a (field 1 field_1)
x =a (field 0 field_1))
(makeblock 0 2 (makeblock 0 x)))
(let (y =a (field 0 scrutinee))
(makeblock 0 1 (makeblock 0 y)))))
[0: 0 0a])
\end{lstlisting}
%% TODO: side by side
The code in the right is in the Lambda intermediate representation of
the OCaml compiler. The Lambda representation of a program is shown by
calling the \texttt{ocamlc} compiler with \texttt{-drawlambda} flag.
The OCaml pattern matching compiler is a critical part of the OCaml compiler
in terms of correctness because any bug would result in wrong code
production rather than triggering compilation failures.
Such bugs also are hard to catch by testing because they arise in
corner cases of complex patterns which are typically not in the
compiler test suite.
The OCaml core developers group considered evolving the pattern matching compiler, either by
using a new algorithm or by incremental refactorings of its codebase.
For this reason we want to verify that new implementations of the
compiler avoid the introduction of new bugs and that such
modifications don't result in a different behaviour than the current one.
One possible approach is to formally verify the pattern matching compiler
implementation using a machine checked proof.
Another possibility, albeit with a weaker result, is to verify that
each source program and target program pair are semantically correct.
We chose the latter technique, translation validation because is easier to adopt in
the case of a production compiler and to integrate with an existing codebase. The compiler is treated as a
blackbox and proof only depends on our equivalence algorithm.
\subsection{Our approach}
%% replace common TODO
Our algorithm translates both source and target programs into a common
representation, decision trees. Decision trees where choosen because
they model the space of possible values at a given branch of
execution.
Here is the decision tree for the source example program.
\begin{verbatim}
Node(Root)
/ \
(= []) (= ::)
/ \
Leaf Node(Root.1)
(0, None) / \
(= []) (= ::)
/ \
Leaf Leaf
(1, Some(Root.0)) (2, Some(Root.1.0))
\end{verbatim}
\texttt{(Root.0)} is called an \emph{accessor}, that represents the
access path to a value that can be reached by deconstructing the
scrutinee. In this example \texttt{Root.0} is the first subvalue of
the scrutinee.
Target decision trees have a similar shape but the tests on the
branches are related to the low level representation of values in
Lambda code. For example, cons cells \texttt{x::xs} are blocks with
tag 0.
To check the equivalence of a source and a target decision tree,
we proceed by case analysis.
If we have two terminals, such as leaves in the previous example,
we check that the two right-hand-sides are equivalent.
If we have a node $N$ and another tree $T$ we check equivalence for
each child of $N$, which is a pair of a branch condition $\pi_i$ and a
subtree $C_i$. For every child $(\pi_i, C_i)$ we reduce $T$ by killing all
the branches that are incompatible with $\pi_i$ and check that the
reduced tree is equivalent to $C_i$.
For the internship we have choosen a simple subset of the OCaml
language and implemented a prototype equivalence checker along with a
formal statement of correctness and proof sketches.
The prototype is to be included in the OCaml compiler infrastructure
and will aid the development.
\subsection{From source programs to decision trees}
Our source language supports integers, lists, tuples and all algebraic
datatypes. Patterns support wildcards, constructors and literals, or
patterns $(p_1|p_2)$ and pattern variables.
We also support \texttt{when} guards.
Decision trees have nodes of the form:
\begin{lstlisting}
type decision_tree =
| Unreachable
| Failure
| Leaf of source_expr
| Guard of source_expr * decision_tree * decision_tree
| Switch of accessor * (constructor * decision_tree) list * decision_tree
\end{lstlisting}
In the \texttt{Switch} node we have one subtree for every head constructor
that appears in the pattern matching clauses and a fallback case for
other values. The branch condition $\pi_i$ expresses that the value at the
switch accessor starts with the given constructor.
\texttt{Failure} nodes express match failures for values that are not
matched by the source clauses.
\texttt{Unreachable} is used when we statically know that no value
can flow to that subtree.
We write $\sem{t_S}_S$ for the decision tree of the source program
$t_S$, computed by a matrix decomposition algorithm (each column
decomposition step gives a \texttt{Switch} node).
It satisfies the following correctness statement:
\[
\forall t_S, \forall v_S, \quad t_S(v_S) = \sem{t_S}_S(v_S)
\]
Running any source values $v_S$ against the source program gives the
same result as running it against the decision tree.
\subsection{From target programs to decision trees}
The target programs include the following Lambda constructs:
\texttt{let, if, switch, Match\_failure, catch, exit, field} and
various comparation operations, guards. The symbolic execution engine
traverses the target program and builds an environment that maps
variables to accessors. It branches at every control flow statement
and emits a Switch node. The branch condition $\pi_i$ is expressed as
an interval set of possible values at that point.
Guards result in branching. In comparison with the source decision
trees, \texttt{Unreachable} nodes are never emitted.
We write $\sem{t_T}_T$ for the decision tree of the target program
$t_T$, satisfying the following correctness statement:
\[
\forall t_T, \forall v_T, \quad t_T(v_T) = \sem{t_T}_T(v_T)
\]
\subsection{Equivalence checking}
2020-02-17 17:31:11 +01:00
2020-02-24 14:36:26 +01:00
* Background
2020-02-21 11:29:04 +01:00
2020-03-03 17:18:40 +01:00
**
Objective Caml () is a dialect of the ML (Meta-Language) family of programming
2020-02-21 11:29:04 +01:00
languages.
2020-03-03 17:18:40 +01:00
shares many features with other dialects of ML, such as SML and Caml Light,
2020-02-21 11:29:04 +01:00
The main features of ML languages are the use of the Hindley-Milner type system that
provides many advantages with respect to static type systems of traditional imperative and object
oriented language such as C, C++ and Java, such as:
2020-03-02 14:46:37 +01:00
- Polymorphism: in certain scenarios a function can accept more than one
2020-02-24 14:36:26 +01:00
type for the input parameters. For example a function that computes the lenght of a
list doesn't need to inspect the type of the elements of the list and for this reason
2020-03-02 14:46:37 +01:00
a List.length function can accept lists of integers, lists of strings and in general
lists of any type. Such languages offer polymorphic functions through subtyping at
2020-02-24 14:36:26 +01:00
runtime only, while other languages such as C++ offer polymorphism through compile
time templates and function overloading.
With the Hindley-Milner type system each well typed function can have more than one
type but always has a unique best type, called the /principal type/.
For example the principal type of the List.length function is "For any /a/, function from
list of /a/ to /int/" and /a/ is called the /type parameter/.
2020-02-21 11:29:04 +01:00
- Strong typing: Languages such as C and C++ allow the programmer to operate on data
2020-02-24 14:36:26 +01:00
without considering its type, mainly through pointers. Other languages such as C#
and Go allow type erasure so at runtime the type of the data can't be queried.
In the case of programming languages using an Hindley-Milner type system the
programmer is not allowed to operate on data by ignoring or promoting its type.
2020-02-21 11:29:04 +01:00
- Type Inference: the principal type of a well formed term can be inferred without any
2020-02-24 14:36:26 +01:00
annotation or declaration.
2020-02-21 11:29:04 +01:00
- Algebraic data types: types that are modelled by the use of two
2020-02-24 14:36:26 +01:00
algebraic operations, sum and product.
A sum type is a type that can hold of many different types of
objects, but only one at a time. For example the sum type defined
as /A + B/ can hold at any moment a value of type A or a value of
type B. Sum types are also called tagged union or variants.
A product type is a type constructed as a direct product
of multiple types and contains at any moment one instance for
every type of its operands. Product types are also called tuples
or records. Algebraic data types can be recursive
in their definition and can be combined.
2020-02-21 11:29:04 +01:00
Moreover ML languages are functional, meaning that functions are
treated as first class citizens and variables are immutable,
although mutable statements and imperative constructs are permitted.
2020-03-03 17:18:40 +01:00
In addition to that features an object system, that provides
2020-02-21 11:29:04 +01:00
inheritance, subtyping and dynamic binding, and modules, that
provide a way to encapsulate definitions. Modules are checked
statically and can be reificated through functors.
2020-03-02 14:46:37 +01:00
** Lambda form compilation
\begin{comment}
2020-03-03 17:18:40 +01:00
https://dev.realworld.org/compiler-backend.html
2020-03-02 14:46:37 +01:00
\end{comment}
2020-03-03 17:18:40 +01:00
provides compilation in form of a byecode executable with an
2020-03-02 14:46:37 +01:00
optionally embeddable interpreter and a native executable that could
be statically linked to provide a single file executable.
2020-03-03 17:18:40 +01:00
After the typechecker has proven that the program is type safe,
the compiler lower the code to /Lambda/, an s-expression based
2020-03-02 14:46:37 +01:00
language that assumes that its input has already been proved safe.
On the /Lambda/ representation of the source program, the compiler
performes a series of optimization passes before translating the
lambda form to assembly code.
2020-03-03 17:18:40 +01:00
*** datatypes
2020-03-02 14:46:37 +01:00
2020-03-03 17:18:40 +01:00
Most native data types in , such as integers, tuples, lists,
2020-03-02 14:46:37 +01:00
records, can be seen as instances of the following definition
2020-03-03 17:18:40 +01:00
#+BEGIN_SRC
2020-03-02 14:46:37 +01:00
type t = Nil | One of int | Cons of int * t
#+END_SRC
that is a type /t/ with three constructors that define its complete
signature.
Every constructor has an arity. Nil, a constructor of arity 0, is
called a constant constructor.
*** Lambda form types
2020-03-03 17:18:40 +01:00
A lambda form target file produced by the compiler consists of a
2020-03-02 14:46:37 +01:00
single s-expression. Every s-expression consist of /(/, a sequence of
elements separated by a whitespace and a closing /)/.
Elements of s-expressions are:
- Atoms: sequences of ascii letters, digits or symbols
- Variables
- Strings: enclosed in double quotes and possibly escaped
- S-expressions: allowing arbitrary nesting
There are several numeric types:
- integers: that us either 31 or 63 bit two's complement arithmetic
depending on system word size, and also wrapping on overflow
- 32 bit and 64 bit integers: that use 32-bit and 64-bit two's complement arithmetic
with wrap on overflow
- big integers: offer integers with arbitrary precision
- floats: that use IEEE754 double-precision (64-bit) arithmetic with
the addition of the literals /infinity/, /neg_infinity/ and /nan/.
The are varios numeric operations defined:
- Arithmetic operations: +, -, *, /, % (modulo), neg (unary negation)
- Bitwise operations: &, |, ^, <<, >> (zero-shifting), a>> (sign extending)
- Numeric comparisons: <, >, <=, >=, ==
*** Functions
Functions are defined using the following syntax, and close over all
bindings in scope: (lambda (arg1 arg2 arg3) BODY)
and are applied using the following syntax: (apply FUNC ARG ARG ARG)
Evaluation is eager.
*** Bindings
The atom /let/ introduces a sequence of bindings:
(let BINDING BINDING BINDING ... BODY)
*** Other atoms
TODO: if, switch, stringswitch...
TODO: magari esempi
** Pattern matching
2020-02-21 11:29:04 +01:00
Pattern matching is a widely adopted mechanism to interact with ADT.
C family languages provide branching on predicates through the use of
if statements and switch statements.
2020-02-24 19:46:00 +01:00
Pattern matching on the other hands express predicates through
syntactic templates that also allow to bind on data structures of
arbitrary shapes. One common example of pattern matching is the use of regular
2020-03-03 17:18:40 +01:00
expressions on strings. provides pattern matching on ADT and
2020-02-21 11:29:04 +01:00
primitive data types.
2020-02-24 19:46:00 +01:00
The result of a pattern matching operation is always one of:
- this value does not match this pattern”
- this value matches this pattern, resulting the following bindings of
names to values and the jump to the expression pointed at the
pattern.
2020-02-21 11:29:04 +01:00
2020-03-03 17:18:40 +01:00
#+BEGIN_SRC
2020-02-24 19:46:00 +01:00
type color = | Red | Blue | Green | Black | White
2020-02-21 11:29:04 +01:00
2020-02-24 19:46:00 +01:00
match color with
2020-02-21 11:29:04 +01:00
| Red -> print "red"
| Blue -> print "red"
| Green -> print "red"
2020-02-24 19:46:00 +01:00
| _ -> print "white or black"
2020-02-21 11:29:04 +01:00
#+END_SRC
2020-03-03 17:18:40 +01:00
provides tokens to express data destructoring.
2020-02-24 19:46:00 +01:00
For example we can examine the content of a list with patten matching
2020-02-21 11:29:04 +01:00
2020-03-03 17:18:40 +01:00
#+BEGIN_SRC
2020-02-21 11:29:04 +01:00
begin match list with
| [ ] -> print "empty list"
| element1 :: [ ] -> print "one element"
2020-02-24 19:46:00 +01:00
| (element1 :: element2) :: [ ] -> print "two elements"
2020-02-21 11:29:04 +01:00
| head :: tail-> print "head followed by many elements"
#+END_SRC
2020-02-24 19:46:00 +01:00
Parenthesized patterns, such as the third one in the previous example,
matches the same value as the pattern without parenthesis.
The same could be done with tuples
2020-03-03 17:18:40 +01:00
#+BEGIN_SRC
2020-02-21 11:29:04 +01:00
begin match tuple with
| (Some _, Some _) -> print "Pair of optional types"
2020-02-24 19:46:00 +01:00
| (Some _, None) | (None, Some _) -> print "Pair of optional types, one of which is null"
2020-02-21 11:29:04 +01:00
| (None, None) -> print "Pair of optional types, both null"
#+END_SRC
2020-02-24 19:46:00 +01:00
The pattern pattern₁ | pattern₂ represents the logical "or" of the
two patterns pattern₁ and pattern₂. A value matches pattern₁ |
pattern₂ if it matches pattern₁ or pattern₂.
2020-02-21 11:29:04 +01:00
Pattern clauses can make the use of /guards/ to test predicates and
2020-02-24 19:46:00 +01:00
variables can captured (binded in scope).
2020-02-21 11:29:04 +01:00
2020-03-03 17:18:40 +01:00
#+BEGIN_SRC
2020-02-21 11:29:04 +01:00
begin match token_list with
2020-02-24 19:46:00 +01:00
| "switch"::var::"{"::rest -> ...
| "case"::":"::var::rest when is_int var -> ...
| "case"::":"::var::rest when is_string var -> ...
| "}"::[ ] -> ...
2020-02-21 11:29:04 +01:00
| "}"::rest -> error "syntax error: " rest
#+END_SRC
2020-03-03 17:18:40 +01:00
Moreover, the pattern matching compiler emits a warning when a
2020-02-24 19:46:00 +01:00
pattern is not exhaustive or some patterns are shadowed by precedent ones.
2020-02-21 11:29:04 +01:00
2020-03-03 17:18:40 +01:00
** Symbolic execution
** Translation validation
Translators, such as translators and code generators, are huge pieces of
software usually consisting of multiple subsystem and
constructing an actual specification of a translator implementation for
formal validation is a very long task. Moreover, different
translators implement different algorithms, so the correctness proof of
a translator cannot be generalized and reused to prove another translator.
Translation validation is an alternative to the verification of
existing translators that consists of taking the source and the target
(compiled) program and proving /a posteriori/ their semantic equivalence.
- [ ] Techniques for translation validation
- [ ] What does semantically equivalent mean
- [ ] What happens when there is no semantic equivalence
- [ ] Translation validation through symbolic execution
** Translation validation of the Pattern Matching Compiler
*** Source program
The algorithm takes as its input a source program and translates it
into an algebraic data structure called /constraint_tree/.
#+BEGIN_SRC
type constraint_tree =
| Unreachable
| Failure
| Leaf of source_expr
| Guard of source_blackbox * constraint_tree * constraint_tree
| Node of accessor * (constructor * constraint_tree) list * constraint_tree
#+END_SRC
Unreachable, Leaf of source_expr and Failure are the terminals of the three.
We distinguish
- Unreachable: statically it is known that no value can go there
- Failure: a value matching this part results in an error
- Leaf: a value matching this part results into the evaluation of a
source blackbox of code
The algorithm doesn't support type-declaration-based analysis
to know the list of constructors at a given type.
Let's consider some trivial examples:
#+BEGIN_SRC
function true -> 1
#+END_SRC
[ ] Converti a disegni
Is translated to
|Node ([(true, Leaf 1)], Failure)
while
#+BEGIN_SRC
function
true -> 1
| false -> 2
#+END_SRC
will give
|Node ([(true, Leaf 1); (false, Leaf 2)])
It is possible to produce Unreachable examples by using
refutation clauses (a "dot" in the right-hand-side)
#+BEGIN_SRC
function
true -> 1
| false -> 2
| _ -> .
#+END_SRC
that gets translated into
Node ([(true, Leaf 1); (false, Leaf 2)], Unreachable)
We trust this annotation, which is reasonable as the type-checker
verifies that it indeed holds.
Guard nodes of the tree are emitted whenever a guard is found. Guards
node contains a blackbox of code that is never evaluated and two
branches, one that is taken in case the guard evaluates to true and
the other one that contains the path taken when the guard evaluates to
true.
[ ] Finisci con Node
[ ] Spiega del fallback
[ ] rivedi compilazione per tenere in considerazione il tuo albero invece che le lambda
[ ] Specifica che stesso algoritmo usato per compilare a lambda, piu` optimizations
The source code of a pattern matching function in has the
following form:
2020-02-21 11:29:04 +01:00
2020-02-24 19:46:00 +01:00
|match variable with
|\vert pattern₁ -> expr₁
|\vert pattern₂ when guard -> expr₂
|\vert pattern₃ as var -> expr₃
|⋮
|\vert pₙ -> exprₙ
2020-02-21 11:29:04 +01:00
2020-03-03 17:18:40 +01:00
and can include any expression that is legal for the compiler,
such as "when" conditions and assignments. Patterns could or could not
be exhaustive.
2020-02-21 11:29:04 +01:00
2020-03-03 17:18:40 +01:00
Pattern matching code could also be written using the more compact form:
|function
|\vert pattern₁ -> expr₁
|\vert pattern₂ when guard -> expr₂
|\vert pattern₃ as var -> expr₃
|⋮
|\vert pₙ -> exprₙ
2020-02-24 19:46:00 +01:00
2020-02-21 11:29:04 +01:00
2020-03-03 17:18:40 +01:00
This BNF grammar describes formally the grammar of the source program:
#+BEGIN_SRC bnf
start ::= "match" id "with" patterns | "function" patterns
patterns ::= (pattern0|pattern1) pattern1+
;; pattern0 and pattern1 are needed to distinguish the first case in which
;; we can avoid writing the optional vertical line
pattern0 ::= clause
pattern1 ::= "|" clause
clause ::= lexpr "->" rexpr
lexpr ::= rule (ε|condition)
rexpr ::= _code ;; arbitrary code
rule ::= wildcard|variable|constructor_pattern|or_pattern ;;
;; rules
wildcard ::= "_"
variable ::= identifier
constructor_pattern ::= constructor (rule|ε) (assignment|ε)
constructor ::= int|float|char|string|bool
|unit|record|exn|objects|ref
|list|tuple|array
|variant|parameterized_variant ;; data types
or_pattern ::= wildcard|variable|constructor_pattern ("|" wildcard|variable|constructor_pattern)+
condition ::= "when" bexpr
assignment ::= "as" id
bexpr ::= _code ;; arbitrary code
2020-02-24 19:46:00 +01:00
#+END_SRC
2020-02-21 11:29:04 +01:00
2020-03-02 14:46:37 +01:00
\begin{comment}
2020-03-03 17:18:40 +01:00
Check that it is still this
\end{comment}
2020-02-21 11:29:04 +01:00
2020-03-03 17:18:40 +01:00
Patterns are of the form
2020-02-24 19:46:00 +01:00
| pattern | type of pattern |
2020-02-24 14:36:26 +01:00
|-----------------+---------------------|
| _ | wildcard |
| x | variable |
| c(p₁,p₂,...,pₙ) | constructor pattern |
| (p₁\vert p₂) | or-pattern |
2020-02-21 11:29:04 +01:00
2020-03-03 17:18:40 +01:00
During compilation by the translators expressions are compiled into
lambda code and are referred as lambda code actions lᵢ.
2020-02-21 11:29:04 +01:00
2020-03-03 17:18:40 +01:00
The entire pattern matching code is represented as a clause matrix
2020-02-21 11:29:04 +01:00
that associates rows of patterns (p_{i,1}, p_{i,2}, ..., p_{i,n}) to
lambda code action lⁱ
\begin{equation*}
(P → L) =
\begin{pmatrix}
2020-02-24 14:36:26 +01:00
p_{1,1} & p_{1,2} & \cdots & p_{1,n} & → l₁ \\
p_{2,1} & p_{2,2} & \cdots & p_{2,n} & → l₂ \\
\vdots & \vdots & \ddots & \vdots & → \vdots \\
p_{m,1} & p_{m,2} & \cdots & p_{m,n} & → lₘ
2020-02-21 11:29:04 +01:00
\end{pmatrix}
\end{equation*}
2020-02-24 14:36:26 +01:00
The pattern /p/ matches a value /v/, written as p ≼ v, when one of the
following rules apply
2020-02-21 11:29:04 +01:00
2020-02-24 14:36:26 +01:00
|--------------------+---+--------------------+-------------------------------------------|
| _ | ≼ | v | ∀v |
| x | ≼ | v | ∀v |
2020-02-21 11:29:04 +01:00
| (p₁ \vert\ p₂) | ≼ | v | iff p₁ ≼ v or p₂ ≼ v |
| c(p₁, p₂, ..., pₐ) | ≼ | c(v₁, v₂, ..., vₐ) | iff (p₁, p₂, ..., pₐ) ≼ (v₁, v₂, ..., vₐ) |
| (p₁, p₂, ..., pₐ) | ≼ | (v₁, v₂, ..., vₐ) | iff pᵢ ≼ vᵢ ∀i ∈ [1..a] |
2020-02-24 14:36:26 +01:00
|--------------------+---+--------------------+-------------------------------------------|
2020-02-24 19:46:00 +01:00
When a value /v/ matches pattern /p/ we say that /v/ is an /instance/ of /p/.
2020-02-21 11:29:04 +01:00
2020-02-24 19:46:00 +01:00
Considering the pattern matrix P we say that the value vector
2020-02-24 14:36:26 +01:00
$\vec{v}$ = (v₁, v₂, ..., vᵢ) matches the line number i in P if and only if the following two
2020-02-21 11:29:04 +01:00
conditions are satisfied:
2020-02-24 14:36:26 +01:00
- p_{i,1}, p_{i,2}, \cdots, p_{i,n} ≼ (v₁, v₂, ..., vᵢ)
- ∀j < i p_{j,1}, p_{j,2}, \cdots, p_{j,n} ⋠ (v₁, v₂, ..., vᵢ)
2020-02-21 11:29:04 +01:00
We can define the following three relations with respect to patterns:
2020-02-24 14:36:26 +01:00
- Patter p is less precise than pattern q, written p ≼ q, when all
2020-02-21 11:29:04 +01:00
instances of q are instances of p
- Pattern p and q are equivalent, written p ≡ q, when their instances
are the same
- Patterns p and q are compatible when they share a common instance
2020-02-24 19:46:00 +01:00
**** Initial state of the compilation
2020-02-21 11:29:04 +01:00
Given a source of the following form:
2020-02-24 19:46:00 +01:00
|match variable with
|\vert pattern₁ -> e₁
|\vert pattern₂ -> e₂
|⋮
|\vert pₘ -> eₘ
the initial input of the algorithm C consists of a vector of variables
$\vec{x}$ = (x₁, x₂, ..., xₙ) of size /n/ where /n/ is the arity of
the type of /x/ and a clause matrix P → L of width n and height m.
That is:
2020-02-21 11:29:04 +01:00
\begin{equation*}
2020-02-24 19:46:00 +01:00
C((\vec{x} = (x₁, x₂, ..., xₙ),
2020-02-21 11:29:04 +01:00
\begin{pmatrix}
p_{1,1} & p_{1,2} & \cdots & p_{1,n} → l₁ \\
p_{2,1} & p_{2,2} & \cdots & p_{2,n} → l₂ \\
2020-02-24 19:46:00 +01:00
\vdots & \vdots & \ddots & \vdots → \vdots \\
p_{m,1} & p_{m,2} & \cdots & p_{m,n} → lₘ)
2020-02-21 11:29:04 +01:00
\end{pmatrix}
\end{equation*}
2020-02-24 19:46:00 +01:00
The base case C₀ of the algorithm is the case in which the $\vec{x}$
is empty, that is $\vec{x}$ = (), then the result of the compilation
C₀ is l₁
\begin{equation*}
C₀((),
\begin{pmatrix}
→ l₁ \\
→ l₂ \\
→ \vdots \\
→ lₘ
\end{pmatrix})
) = l₁
\end{equation*}
When $\vec{x}$ ≠ () then the compilation advances using one of the
following four rules:
1) Variable rule: if all patterns of the first column of P are wildcard patterns or
bind the value to a variable, then
\begin{equation*}
C(\vec{x}, P → L) = C((x₂, x₃, ..., xₙ), P' → L')
\end{equation*}
where
\begin{equation*}
\begin{pmatrix}
p_{1,2} & \cdots & p_{1,n} & → & (let & y₁ & x₁) & l₁ \\
p_{2,2} & \cdots & p_{2,n} & → & (let & y₂ & x₁) & l₂ \\
\vdots & \ddots & \vdots & → & \vdots & \vdots & \vdots & \vdots \\
p_{m,2} & \cdots & p_{m,n} & → & (let & yₘ & x₁) & lₘ
\end{pmatrix}
\end{equation*}
That means in every lambda action lᵢ there is a binding of x₁ to the
variable that appears on the pattern $p_{i,1}. Bindings are omitted
for wildcard patterns and the lambda action lᵢ remains unchanged.
2) Constructor rule: if all patterns in the first column of P are
constructors patterns of the form k(q₁, q₂, ..., qₙ) we define a
new matrix, the specialized clause matrix S, by applying the
following transformation on every row /p/:
\begin{lstlisting}[mathescape,columns=fullflexible,basicstyle=\fontfamily{lmvtt}\selectfont,]
for every c ∈ Set of constructors
for i ← 1 .. m
let kᵢ ← constructor_of($p_{i,1}$)
if kᵢ = c then
p ← $q_{i,1}$, $q_{i,2}$, ..., $q_{i,n'}$, $p_{i,2}$, $p_{i,3}$, ..., $p_{i, n}$
\end{lstlisting}
Patterns of the form $q_{i,j}$ matches on the values of the
constructor and we define new fresh variables y₁, y₂, ..., yₐ so
that the lambda action lᵢ becomes
\begin{lstlisting}[mathescape,columns=fullflexible,basicstyle=\fontfamily{lmvtt}\selectfont,]
(let (y₁ (field 0 x₁))
(y₂ (field 1 x₁))
...
(yₐ (field (a-1) x₁))
lᵢ)
\end{lstlisting}
and the result of the compilation for the set of constructors
{c₁, c₂, ..., cₖ} is:
\begin{lstlisting}[mathescape,columns=fullflexible,basicstyle=\fontfamily{lmvtt}\selectfont,]
switch x₁ with
case c₁: l₁
case c₂: l₂
...
case cₖ: lₖ
default: exit
\end{lstlisting}
3) Orpat rule: there are various strategies for dealing with
or-patterns. The most naive one is to split the or-patterns.
For example a row p containing an or-pattern:
\begin{equation*}
(p_{i,1}|q_{i,1}|r_{i,1}), p_{i,2}, ..., p_{i,m} → lᵢ
\end{equation*}
results in three rows added to the clause matrix
\begin{equation*}
p_{i,1}, p_{i,2}, ..., p_{i,m} → lᵢ \\
\end{equation*}
\begin{equation*}
q_{i,1}, p_{i,2}, ..., p_{i,m} → lᵢ \\
\end{equation*}
\begin{equation*}
r_{i,1}, p_{i,2}, ..., p_{i,m} → lᵢ
\end{equation*}
4) Mixture rule:
When none of the previous rules apply the clause matrix P → L is
splitted into two clause matrices, the first P₁ → L₁ that is the
largest prefix matrix for which one of the three previous rules
apply, and P₂ → L₂ containing the remaining rows. The algorithm is
applied to both matrices.
\begin{comment}
#+BEGIN_COMMENT
CITE paper?
#+END_COMMENT
\end{comment}