2020-02-24 19:46:00 +01:00
|
|
|
|
\begin{comment}
|
2020-03-30 21:23:55 +02:00
|
|
|
|
* TODO Scaletta [1/6]
|
2020-03-12 19:12:23 +01:00
|
|
|
|
- [X] Introduction
|
2020-03-30 21:23:55 +02:00
|
|
|
|
- [-] Background [60%]
|
|
|
|
|
- [X] Low level representation
|
|
|
|
|
- [X] Lambda code [0%]
|
2020-03-02 14:46:37 +01:00
|
|
|
|
- [X] Pattern matching
|
2020-02-17 17:31:11 +01:00
|
|
|
|
- [ ] Symbolic execution
|
2020-03-02 14:46:37 +01:00
|
|
|
|
- [ ] Translation Validation
|
|
|
|
|
- [ ] Translation validation of the Pattern Matching Compiler
|
2020-02-21 11:29:04 +01:00
|
|
|
|
- [ ] Source translation
|
|
|
|
|
- [ ] Formal Grammar
|
|
|
|
|
- [ ] Compilation of source patterns
|
2020-03-02 14:46:37 +01:00
|
|
|
|
- [ ] Rest?
|
2020-02-21 11:29:04 +01:00
|
|
|
|
- [ ] Target translation
|
|
|
|
|
- [ ] Formal Grammar
|
2020-03-12 19:12:23 +01:00
|
|
|
|
- [ ] Symbolic execution of the Lambda target
|
2020-02-21 11:29:04 +01:00
|
|
|
|
- [ ] Equivalence between source and target
|
2020-03-12 12:21:08 +01:00
|
|
|
|
- [ ] Statement of correctness
|
2020-03-02 14:46:37 +01:00
|
|
|
|
- [ ] Proof of correctness
|
2020-02-24 19:46:00 +01:00
|
|
|
|
- [ ] Practical results
|
2020-03-29 22:54:33 +02:00
|
|
|
|
|
|
|
|
|
Magari prima pattern matching poi compilatore?
|
2020-02-21 11:29:04 +01:00
|
|
|
|
|
2020-02-24 19:46:00 +01:00
|
|
|
|
\end{comment}
|
2020-02-17 17:31:11 +01:00
|
|
|
|
|
2020-03-03 17:18:40 +01:00
|
|
|
|
#+TITLE: Translation Verification of the pattern matching compiler
|
2020-02-17 17:31:11 +01:00
|
|
|
|
#+AUTHOR: Francesco Mecca
|
|
|
|
|
#+EMAIL: me@francescomecca.eu
|
|
|
|
|
#+DATE:
|
|
|
|
|
|
|
|
|
|
#+LANGUAGE: en
|
|
|
|
|
#+LaTeX_CLASS: article
|
|
|
|
|
#+LaTeX_HEADER: \usepackage{algorithm}
|
2020-02-24 19:46:00 +01:00
|
|
|
|
#+LaTeX_HEADER: \usepackage{comment}
|
2020-02-17 17:31:11 +01:00
|
|
|
|
#+LaTeX_HEADER: \usepackage{algpseudocode}
|
|
|
|
|
#+LaTeX_HEADER: \usepackage{amsmath,amssymb,amsthm}
|
2020-04-03 15:53:54 +02:00
|
|
|
|
#+LaTeX_HEADER: \newtheorem{definition}{Definition}
|
|
|
|
|
#+LaTeX_HEADER: \usepackage{mathpartir}
|
2020-02-17 17:31:11 +01:00
|
|
|
|
#+LaTeX_HEADER: \usepackage{graphicx}
|
|
|
|
|
#+LaTeX_HEADER: \usepackage{listings}
|
|
|
|
|
#+LaTeX_HEADER: \usepackage{color}
|
2020-03-12 12:20:07 +01:00
|
|
|
|
#+LaTeX_HEADER: \usepackage{stmaryrd}
|
|
|
|
|
#+LaTeX_HEADER: \newcommand{\sem}[1]{{\llbracket{#1}\rrbracket}}
|
2020-04-03 15:53:54 +02:00
|
|
|
|
#+LaTeX_HEADER: \newcommand{\Equiv}[3]{\mathsf{equiv}(#1, #2, #3)} % \equiv is already taken
|
|
|
|
|
#+LaTeX_HEADER: \newcommand{\covers}[2]{#1 \mathrel{\mathsf{covers}} #2}
|
|
|
|
|
#+LaTeX_HEADER: \newcommand{\Yes}{\mathsf{Yes}}
|
|
|
|
|
#+LaTeX_HEADER: \newcommand{\No}[2]{\mathsf{No}(#1, #2)}
|
|
|
|
|
|
2020-02-17 17:31:11 +01:00
|
|
|
|
#+EXPORT_SELECT_TAGS: export
|
|
|
|
|
#+EXPORT_EXCLUDE_TAGS: noexport
|
|
|
|
|
#+OPTIONS: H:2 toc:nil \n:nil @:t ::t |:t ^:{} _:{} *:t TeX:t LaTeX:t
|
|
|
|
|
#+STARTUP: showall
|
2020-03-12 12:20:07 +01:00
|
|
|
|
\section{Introduction}
|
|
|
|
|
|
|
|
|
|
This dissertation presents an algorithm for the translation validation of the OCaml pattern
|
|
|
|
|
matching compiler. Given a source program and its compiled version the
|
2020-03-12 19:37:38 +01:00
|
|
|
|
algorithm checks whether the two are equivalent or produce a counter
|
2020-03-12 12:20:07 +01:00
|
|
|
|
example in case of a mismatch.
|
2020-03-12 19:37:38 +01:00
|
|
|
|
For the prototype of this algorithm we have chosen a subset of the OCaml
|
|
|
|
|
language and implemented a prototype equivalence checker along with a
|
|
|
|
|
formal statement of correctness and its proof.
|
|
|
|
|
The prototype is to be included in the OCaml compiler infrastructure
|
|
|
|
|
and will aid the development.
|
2020-03-12 12:20:07 +01:00
|
|
|
|
|
|
|
|
|
Our equivalence algorithm works with decision trees. Source patterns are
|
|
|
|
|
converted into a decision tree using a matrix decomposition algorithm.
|
|
|
|
|
Target programs, described in the Lambda intermediate
|
|
|
|
|
representation language of the OCaml compiler, are turned into decision trees
|
|
|
|
|
by applying symbolic execution.
|
|
|
|
|
|
2020-03-12 19:37:38 +01:00
|
|
|
|
\begin{comment}
|
2020-03-12 12:20:07 +01:00
|
|
|
|
\subsection{Translation validation}
|
2020-03-12 19:37:38 +01:00
|
|
|
|
\end{comment}
|
2020-03-12 12:20:07 +01:00
|
|
|
|
A pattern matching compiler turns a series of pattern matching clauses
|
|
|
|
|
into simple control flow structures such as \texttt{if, switch}, for example:
|
|
|
|
|
\begin{lstlisting}
|
|
|
|
|
match x with
|
|
|
|
|
| [] -> (0, None)
|
|
|
|
|
| x::[] -> (1, Some x)
|
|
|
|
|
| _::y::_ -> (2, Some y)
|
|
|
|
|
\end{lstlisting}
|
|
|
|
|
\begin{lstlisting}
|
|
|
|
|
(if scrutinee
|
|
|
|
|
(let (field_1 =a (field 1 scrutinee))
|
|
|
|
|
(if field_1
|
|
|
|
|
(let
|
|
|
|
|
(field_1_1 =a (field 1 field_1)
|
|
|
|
|
x =a (field 0 field_1))
|
|
|
|
|
(makeblock 0 2 (makeblock 0 x)))
|
|
|
|
|
(let (y =a (field 0 scrutinee))
|
|
|
|
|
(makeblock 0 1 (makeblock 0 y)))))
|
|
|
|
|
[0: 0 0a])
|
|
|
|
|
\end{lstlisting}
|
2020-03-12 19:37:38 +01:00
|
|
|
|
\begin{comment}
|
2020-03-12 12:20:07 +01:00
|
|
|
|
%% TODO: side by side
|
2020-03-12 19:37:38 +01:00
|
|
|
|
\end{comment}
|
2020-04-03 15:53:54 +02:00
|
|
|
|
The code on the right is in the Lambda intermediate representation of
|
2020-03-12 12:20:07 +01:00
|
|
|
|
the OCaml compiler. The Lambda representation of a program is shown by
|
|
|
|
|
calling the \texttt{ocamlc} compiler with \texttt{-drawlambda} flag.
|
|
|
|
|
|
|
|
|
|
The OCaml pattern matching compiler is a critical part of the OCaml compiler
|
2020-04-03 15:53:54 +02:00
|
|
|
|
in terms of correctness because bugs typically would result in wrong code
|
2020-03-12 12:20:07 +01:00
|
|
|
|
production rather than triggering compilation failures.
|
|
|
|
|
Such bugs also are hard to catch by testing because they arise in
|
|
|
|
|
corner cases of complex patterns which are typically not in the
|
2020-04-03 15:53:54 +02:00
|
|
|
|
compiler test suite or most user programs.
|
2020-03-12 12:20:07 +01:00
|
|
|
|
|
|
|
|
|
The OCaml core developers group considered evolving the pattern matching compiler, either by
|
2020-03-12 19:37:38 +01:00
|
|
|
|
using a new algorithm or by incremental refactoring of its code base.
|
2020-03-12 12:20:07 +01:00
|
|
|
|
For this reason we want to verify that new implementations of the
|
|
|
|
|
compiler avoid the introduction of new bugs and that such
|
2020-03-12 19:37:38 +01:00
|
|
|
|
modifications don't result in a different behavior than the current one.
|
2020-03-12 12:20:07 +01:00
|
|
|
|
|
|
|
|
|
One possible approach is to formally verify the pattern matching compiler
|
|
|
|
|
implementation using a machine checked proof.
|
|
|
|
|
Another possibility, albeit with a weaker result, is to verify that
|
|
|
|
|
each source program and target program pair are semantically correct.
|
|
|
|
|
We chose the latter technique, translation validation because is easier to adopt in
|
2020-03-12 19:37:38 +01:00
|
|
|
|
the case of a production compiler and to integrate with an existing code base. The compiler is treated as a
|
|
|
|
|
black-box and proof only depends on our equivalence algorithm.
|
2020-03-12 12:20:07 +01:00
|
|
|
|
|
|
|
|
|
\subsection{Our approach}
|
|
|
|
|
%% replace common TODO
|
|
|
|
|
Our algorithm translates both source and target programs into a common
|
2020-03-12 19:37:38 +01:00
|
|
|
|
representation, decision trees. Decision trees where chosen because
|
2020-03-12 12:20:07 +01:00
|
|
|
|
they model the space of possible values at a given branch of
|
|
|
|
|
execution.
|
2020-04-03 15:53:54 +02:00
|
|
|
|
Here are the decision trees for the source and target example program.
|
|
|
|
|
|
|
|
|
|
\begin{minipage}{0.5\linewidth}
|
2020-03-12 12:20:07 +01:00
|
|
|
|
\begin{verbatim}
|
2020-04-03 15:53:54 +02:00
|
|
|
|
Switch(Root)
|
2020-03-12 12:20:07 +01:00
|
|
|
|
/ \
|
|
|
|
|
(= []) (= ::)
|
|
|
|
|
/ \
|
2020-04-03 15:53:54 +02:00
|
|
|
|
Leaf Switch(Root.1)
|
2020-03-12 12:20:07 +01:00
|
|
|
|
(0, None) / \
|
|
|
|
|
(= []) (= ::)
|
|
|
|
|
/ \
|
|
|
|
|
Leaf Leaf
|
2020-04-03 15:53:54 +02:00
|
|
|
|
[x = Root.0] [y = Root.1.0]
|
|
|
|
|
(1, Some x) (2, Some y)
|
|
|
|
|
\end{verbatim}
|
|
|
|
|
\end{minipage}
|
|
|
|
|
\hfill
|
|
|
|
|
\begin{minipage}{0.5\linewidth}
|
|
|
|
|
\begin{verbatim}
|
|
|
|
|
Switch(Root)
|
|
|
|
|
/ \
|
|
|
|
|
(= int 0) (!= int 0)
|
|
|
|
|
/ \
|
|
|
|
|
Leaf Switch(Root.1)
|
|
|
|
|
(makeblock 0 / \
|
|
|
|
|
0 0a) / \
|
|
|
|
|
(= int 0) (!= int 0)
|
|
|
|
|
/ \
|
|
|
|
|
Leaf Leaf
|
|
|
|
|
[x = Root.0] [y = Root.1.0]
|
|
|
|
|
(makeblock 0 (makeblock 0
|
|
|
|
|
1 (makeblock 0 x)) 2 (makeblock 0 y))
|
2020-03-12 12:20:07 +01:00
|
|
|
|
\end{verbatim}
|
2020-04-03 15:53:54 +02:00
|
|
|
|
\end{minipage}
|
|
|
|
|
|
2020-03-12 12:20:07 +01:00
|
|
|
|
\texttt{(Root.0)} is called an \emph{accessor}, that represents the
|
|
|
|
|
access path to a value that can be reached by deconstructing the
|
|
|
|
|
scrutinee. In this example \texttt{Root.0} is the first subvalue of
|
|
|
|
|
the scrutinee.
|
|
|
|
|
|
|
|
|
|
Target decision trees have a similar shape but the tests on the
|
|
|
|
|
branches are related to the low level representation of values in
|
2020-04-03 15:53:54 +02:00
|
|
|
|
Lambda code. For example, cons cells \texttt{x::xs} or tuples
|
|
|
|
|
\texttt{(x,y)} are blocks with tag 0.
|
2020-03-12 12:20:07 +01:00
|
|
|
|
|
|
|
|
|
To check the equivalence of a source and a target decision tree,
|
|
|
|
|
we proceed by case analysis.
|
|
|
|
|
If we have two terminals, such as leaves in the previous example,
|
|
|
|
|
we check that the two right-hand-sides are equivalent.
|
|
|
|
|
If we have a node $N$ and another tree $T$ we check equivalence for
|
|
|
|
|
each child of $N$, which is a pair of a branch condition $\pi_i$ and a
|
|
|
|
|
subtree $C_i$. For every child $(\pi_i, C_i)$ we reduce $T$ by killing all
|
|
|
|
|
the branches that are incompatible with $\pi_i$ and check that the
|
|
|
|
|
reduced tree is equivalent to $C_i$.
|
|
|
|
|
|
|
|
|
|
\subsection{From source programs to decision trees}
|
|
|
|
|
Our source language supports integers, lists, tuples and all algebraic
|
2020-04-03 15:53:54 +02:00
|
|
|
|
datatypes. Patterns support wildcards, constructors and literals,
|
|
|
|
|
or-patterns $(p_1 | p_2)$ and pattern variables. We also support
|
|
|
|
|
\texttt{when} guards, which are interesting as they introduce the
|
|
|
|
|
evaluation of expressions during matching. Decision trees have nodes
|
|
|
|
|
of the form:
|
2020-03-12 12:20:07 +01:00
|
|
|
|
\begin{lstlisting}
|
|
|
|
|
type decision_tree =
|
|
|
|
|
| Unreachable
|
|
|
|
|
| Failure
|
|
|
|
|
| Leaf of source_expr
|
|
|
|
|
| Guard of source_expr * decision_tree * decision_tree
|
|
|
|
|
| Switch of accessor * (constructor * decision_tree) list * decision_tree
|
|
|
|
|
\end{lstlisting}
|
|
|
|
|
In the \texttt{Switch} node we have one subtree for every head constructor
|
|
|
|
|
that appears in the pattern matching clauses and a fallback case for
|
|
|
|
|
other values. The branch condition $\pi_i$ expresses that the value at the
|
|
|
|
|
switch accessor starts with the given constructor.
|
|
|
|
|
\texttt{Failure} nodes express match failures for values that are not
|
|
|
|
|
matched by the source clauses.
|
|
|
|
|
\texttt{Unreachable} is used when we statically know that no value
|
|
|
|
|
can flow to that subtree.
|
|
|
|
|
|
|
|
|
|
We write $\sem{t_S}_S$ for the decision tree of the source program
|
|
|
|
|
$t_S$, computed by a matrix decomposition algorithm (each column
|
|
|
|
|
decomposition step gives a \texttt{Switch} node).
|
|
|
|
|
It satisfies the following correctness statement:
|
|
|
|
|
\[
|
|
|
|
|
\forall t_S, \forall v_S, \quad t_S(v_S) = \sem{t_S}_S(v_S)
|
|
|
|
|
\]
|
2020-04-03 15:53:54 +02:00
|
|
|
|
Running any source value $v_S$ against the source program gives the
|
2020-03-12 12:20:07 +01:00
|
|
|
|
same result as running it against the decision tree.
|
|
|
|
|
|
|
|
|
|
\subsection{From target programs to decision trees}
|
|
|
|
|
The target programs include the following Lambda constructs:
|
|
|
|
|
\texttt{let, if, switch, Match\_failure, catch, exit, field} and
|
2020-03-12 19:37:38 +01:00
|
|
|
|
various comparison operations, guards. The symbolic execution engine
|
2020-03-12 12:20:07 +01:00
|
|
|
|
traverses the target program and builds an environment that maps
|
|
|
|
|
variables to accessors. It branches at every control flow statement
|
2020-04-03 15:53:54 +02:00
|
|
|
|
and emits a \texttt{Switch} node. The branch condition $\pi_i$ is expressed as
|
2020-03-12 12:20:07 +01:00
|
|
|
|
an interval set of possible values at that point.
|
2020-04-03 15:53:54 +02:00
|
|
|
|
In comparison with the source decision trees, \texttt{Unreachable}
|
|
|
|
|
nodes are never emitted.
|
2020-03-12 12:20:07 +01:00
|
|
|
|
Guards result in branching. In comparison with the source decision
|
|
|
|
|
trees, \texttt{Unreachable} nodes are never emitted.
|
|
|
|
|
|
|
|
|
|
We write $\sem{t_T}_T$ for the decision tree of the target program
|
|
|
|
|
$t_T$, satisfying the following correctness statement:
|
|
|
|
|
\[
|
|
|
|
|
\forall t_T, \forall v_T, \quad t_T(v_T) = \sem{t_T}_T(v_T)
|
|
|
|
|
\]
|
|
|
|
|
|
|
|
|
|
\subsection{Equivalence checking}
|
2020-04-03 15:53:54 +02:00
|
|
|
|
The equivalence checking algorithm takes as input a domain of
|
|
|
|
|
possible values \emph{S} and a pair of source and target decision trees and
|
|
|
|
|
in case the two trees are not equivalent it returns a counter example.
|
|
|
|
|
The algorithm respects the following correctness statement:
|
|
|
|
|
\begin{comment}
|
|
|
|
|
% TODO: we have to define what \covers mean for readers to understand the specifications
|
|
|
|
|
% (or we use a simplifying lie by hiding \covers in the statements).
|
|
|
|
|
\end{comment}
|
|
|
|
|
|
|
|
|
|
\begin{align*}
|
|
|
|
|
\Equiv S {C_S} {C_T} = \Yes \;\land\; \covers {C_T} S
|
|
|
|
|
& \implies
|
|
|
|
|
\forall v_S \approx v_T \in S,\; C_S(v_S) = C_T(v_T)
|
|
|
|
|
\\
|
|
|
|
|
\Equiv S {C_S} {C_T} = \No {v_S} {v_T} \;\land\; \covers {C_T} S
|
|
|
|
|
& \implies
|
|
|
|
|
v_S \approx v_T \in S \;\land\; C_S(v_S) \neq C_T(v_T)
|
|
|
|
|
\end{align*}
|
|
|
|
|
|
|
|
|
|
The algorithm proceeds by case analysis. Inference rules are shown.
|
|
|
|
|
If $S$ is empty the results is $\Yes$.
|
|
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
|
------------------------
|
|
|
|
|
equiv \emptyset Cs Ct gs
|
|
|
|
|
\end{verbatim}
|
|
|
|
|
|
|
|
|
|
If the two decision trees are both terminal nodes the algorithm checks
|
|
|
|
|
for content equality.
|
|
|
|
|
\begin{verbatim}
|
|
|
|
|
--------------------------
|
|
|
|
|
equiv S Failure Failure []
|
|
|
|
|
|
|
|
|
|
equiv_BB BBs BBt
|
|
|
|
|
-------------------------------
|
|
|
|
|
equiv S (Leaf BBs) (Leaf BBt) []
|
|
|
|
|
\end{verbatim}
|
|
|
|
|
|
|
|
|
|
If the source decision tree (left hand side) is a terminal while the
|
|
|
|
|
target decistion tree (right hand side) is not, the algorithm proceeds
|
|
|
|
|
by \emph{explosion} of the right hand side. Explosion means that every
|
|
|
|
|
child of the right hand side is tested for equality against the left
|
|
|
|
|
hand side.
|
|
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
|
(equiv S Cs Ci gs)^i
|
|
|
|
|
equiv S Cs Cf gs
|
|
|
|
|
-----------------------------------------
|
|
|
|
|
equiv S Cs (Node(a, (Domi,Ci)^i, Cf)) gs
|
|
|
|
|
\end{verbatim}
|
2020-03-12 12:20:07 +01:00
|
|
|
|
|
2020-02-17 17:31:11 +01:00
|
|
|
|
|
2020-04-03 15:53:54 +02:00
|
|
|
|
\begin{comment}
|
|
|
|
|
% TODO: [Gabriel] in practice the $dom_S$ are constructors and the
|
|
|
|
|
% $dom_T$ are domains. Do we want to hide this fact, or be explicit
|
|
|
|
|
% about it? (Maybe we should introduce explicitly/clearly the idea of
|
|
|
|
|
% target domains at some point).
|
|
|
|
|
\end{comment}
|
|
|
|
|
|
|
|
|
|
When the left hand side is not a terminal, the algorithm explodes the
|
|
|
|
|
left hand side while trimming every right hand side subtree. Trimming
|
|
|
|
|
a left hand side tree on an interval set $dom_S$ computed from the right hand
|
|
|
|
|
side tree constructor means mapping every branch condition $dom_T$ (interval set of
|
|
|
|
|
possible values) on the left to the intersection of $dom_T$ and $dom_S$ when
|
|
|
|
|
the accessors on both side are equal, and removing the branches that
|
|
|
|
|
result in an empty intersection. If the accessors are different,
|
|
|
|
|
\emph{$dom_T$} is left unchanged.
|
|
|
|
|
|
|
|
|
|
\begin{verbatim}
|
|
|
|
|
equiv S Ci (trim Ct a=Ki) gs
|
|
|
|
|
equiv S Cf (trim Ct (a\notin(K_i)^i) gs
|
|
|
|
|
-------------------------------------
|
|
|
|
|
equiv S (Node(a, (Ki,Ci)^i, Cf) Ct gs
|
|
|
|
|
\end{verbatim}
|
|
|
|
|
|
|
|
|
|
The equivalence checking algorithm deals with guards by storing a
|
|
|
|
|
queue. A guard blackbox is pushed to the queue whenever the algorithm
|
|
|
|
|
encounters a Guard node on the right, while it pops a blackbox from
|
|
|
|
|
the queue whenever a Guard node appears on the left hand side.
|
|
|
|
|
The algorithm stops with failure if the popped blackbox and the and
|
|
|
|
|
blackbox on the left hand Guard node are different, otherwise in
|
|
|
|
|
continues by exploding to two subtrees, one in which the guard
|
|
|
|
|
condition evaluates to true, the other when it evaluates to false.
|
|
|
|
|
Termination of the algorithm is successful only when the guards queue
|
|
|
|
|
is empty.
|
|
|
|
|
\begin{verbatim}
|
|
|
|
|
equiv S Ctrue Ct (gs++[condition])
|
|
|
|
|
equiv S Cfalse Ct (gs++[condition])
|
|
|
|
|
--------------------------------------------
|
|
|
|
|
equiv S (Guard condition Ctrue Cfalse) Ct gs
|
|
|
|
|
|
|
|
|
|
equiv S Cs Ctrue gs
|
|
|
|
|
equiv S Cs Cfalse gs
|
|
|
|
|
--------------------------------------------
|
|
|
|
|
equiv S Cs (Guard condition Ctrue Cfalse) ([condition]++gs)
|
|
|
|
|
\end{verbatim}
|
|
|
|
|
|
|
|
|
|
\begin{comment}
|
|
|
|
|
TODO: replace inference rules with good latex
|
|
|
|
|
\end{comment}
|
|
|
|
|
|
2020-02-24 14:36:26 +01:00
|
|
|
|
* Background
|
2020-03-12 19:12:23 +01:00
|
|
|
|
|
|
|
|
|
** OCaml
|
|
|
|
|
Objective Caml (OCaml) is a dialect of the ML (Meta-Language)
|
|
|
|
|
family of programming that features with other dialects of ML, such
|
|
|
|
|
as SML and Caml Light.
|
2020-02-21 11:29:04 +01:00
|
|
|
|
The main features of ML languages are the use of the Hindley-Milner type system that
|
|
|
|
|
provides many advantages with respect to static type systems of traditional imperative and object
|
|
|
|
|
oriented language such as C, C++ and Java, such as:
|
2020-03-02 14:46:37 +01:00
|
|
|
|
- Polymorphism: in certain scenarios a function can accept more than one
|
2020-03-12 19:37:38 +01:00
|
|
|
|
type for the input parameters. For example a function that computes the length of a
|
2020-02-24 14:36:26 +01:00
|
|
|
|
list doesn't need to inspect the type of the elements of the list and for this reason
|
2020-03-02 14:46:37 +01:00
|
|
|
|
a List.length function can accept lists of integers, lists of strings and in general
|
|
|
|
|
lists of any type. Such languages offer polymorphic functions through subtyping at
|
2020-02-24 14:36:26 +01:00
|
|
|
|
runtime only, while other languages such as C++ offer polymorphism through compile
|
|
|
|
|
time templates and function overloading.
|
|
|
|
|
With the Hindley-Milner type system each well typed function can have more than one
|
|
|
|
|
type but always has a unique best type, called the /principal type/.
|
|
|
|
|
For example the principal type of the List.length function is "For any /a/, function from
|
|
|
|
|
list of /a/ to /int/" and /a/ is called the /type parameter/.
|
2020-02-21 11:29:04 +01:00
|
|
|
|
- Strong typing: Languages such as C and C++ allow the programmer to operate on data
|
2020-02-24 14:36:26 +01:00
|
|
|
|
without considering its type, mainly through pointers. Other languages such as C#
|
|
|
|
|
and Go allow type erasure so at runtime the type of the data can't be queried.
|
|
|
|
|
In the case of programming languages using an Hindley-Milner type system the
|
|
|
|
|
programmer is not allowed to operate on data by ignoring or promoting its type.
|
2020-02-21 11:29:04 +01:00
|
|
|
|
- Type Inference: the principal type of a well formed term can be inferred without any
|
2020-02-24 14:36:26 +01:00
|
|
|
|
annotation or declaration.
|
2020-03-12 19:37:38 +01:00
|
|
|
|
- Algebraic data types: types that are modeled by the use of two
|
2020-02-24 14:36:26 +01:00
|
|
|
|
algebraic operations, sum and product.
|
|
|
|
|
A sum type is a type that can hold of many different types of
|
|
|
|
|
objects, but only one at a time. For example the sum type defined
|
|
|
|
|
as /A + B/ can hold at any moment a value of type A or a value of
|
|
|
|
|
type B. Sum types are also called tagged union or variants.
|
|
|
|
|
A product type is a type constructed as a direct product
|
|
|
|
|
of multiple types and contains at any moment one instance for
|
|
|
|
|
every type of its operands. Product types are also called tuples
|
|
|
|
|
or records. Algebraic data types can be recursive
|
|
|
|
|
in their definition and can be combined.
|
2020-02-21 11:29:04 +01:00
|
|
|
|
Moreover ML languages are functional, meaning that functions are
|
|
|
|
|
treated as first class citizens and variables are immutable,
|
|
|
|
|
although mutable statements and imperative constructs are permitted.
|
2020-03-03 17:18:40 +01:00
|
|
|
|
In addition to that features an object system, that provides
|
2020-02-21 11:29:04 +01:00
|
|
|
|
inheritance, subtyping and dynamic binding, and modules, that
|
|
|
|
|
provide a way to encapsulate definitions. Modules are checked
|
2020-03-12 19:37:38 +01:00
|
|
|
|
statically and can be reifycated through functors.
|
2020-02-21 11:29:04 +01:00
|
|
|
|
|
2020-03-29 22:54:33 +02:00
|
|
|
|
** Compiling OCaml code
|
2020-03-02 14:46:37 +01:00
|
|
|
|
|
2020-03-29 22:54:33 +02:00
|
|
|
|
The OCaml compiler provides compilation of source files in form of a bytecode executable with an
|
|
|
|
|
optionally embeddable interpreter or as a native executable that could
|
2020-03-02 14:46:37 +01:00
|
|
|
|
be statically linked to provide a single file executable.
|
2020-03-29 22:54:33 +02:00
|
|
|
|
Every source file is treated as a separate /compilation unit/ that is
|
|
|
|
|
advanced through different states.
|
|
|
|
|
The first stage of compilation is the parsing of the input code that
|
|
|
|
|
is trasformed into an untyped syntax tree. Code with syntax errors is
|
|
|
|
|
rejected at this stage.
|
|
|
|
|
After that the AST is processed by the type checker that performs
|
|
|
|
|
three steps at once:
|
|
|
|
|
- type inference, using the classical /Algorithm W/
|
|
|
|
|
- perform subtyping and gathers type information from the module system
|
|
|
|
|
- ensures that the code obeys the rule of the OCaml type system
|
|
|
|
|
At this stage, incorrectly typed code is rejected. In case of success,
|
|
|
|
|
the untyped AST in trasformed into a /Typed Tree/.
|
2020-03-03 17:18:40 +01:00
|
|
|
|
After the typechecker has proven that the program is type safe,
|
|
|
|
|
the compiler lower the code to /Lambda/, an s-expression based
|
2020-03-02 14:46:37 +01:00
|
|
|
|
language that assumes that its input has already been proved safe.
|
2020-03-29 22:54:33 +02:00
|
|
|
|
After the Lambda pass, the Lambda code is either translated into
|
|
|
|
|
bytecode or goes through a series of optimization steps performed by
|
|
|
|
|
the /Flambda/ optimizer before being translated into assembly.
|
|
|
|
|
\begin{comment}
|
|
|
|
|
TODO: Talk about flambda passes?
|
|
|
|
|
\end{comment}
|
2020-03-02 14:46:37 +01:00
|
|
|
|
|
2020-03-29 22:54:33 +02:00
|
|
|
|
This is an overview of the different compiler steps.
|
|
|
|
|
[[./files/ocamlcompilation.png]]
|
2020-03-02 14:46:37 +01:00
|
|
|
|
|
2020-03-29 22:54:33 +02:00
|
|
|
|
** Memory representation of OCaml values
|
|
|
|
|
An usual OCaml source program contains few to none explicit type
|
|
|
|
|
signatures.
|
|
|
|
|
This is possible because of type inference that allows to annotate the
|
|
|
|
|
AST with type informations. However, since the OCaml typechecker guarantes that a program is well typed
|
|
|
|
|
before being transformed into Lambda code, values at runtime contains
|
|
|
|
|
only a minimal subset of type informations needed to distinguish
|
|
|
|
|
polymorphic values.
|
|
|
|
|
For runtime values, OCaml uses a uniform memory representation in
|
|
|
|
|
which every variable is stored as a value in a contiguous block of
|
|
|
|
|
memory.
|
|
|
|
|
Every value is a single word that is either a concrete integer or a
|
|
|
|
|
pointer to another block of memory, that is called /cell/ or /box/.
|
|
|
|
|
We can abstract the type of OCaml runtime values as the following:
|
2020-03-03 17:18:40 +01:00
|
|
|
|
#+BEGIN_SRC
|
2020-03-29 22:54:33 +02:00
|
|
|
|
type t = Constant | Cell of int * t
|
2020-03-02 14:46:37 +01:00
|
|
|
|
#+END_SRC
|
2020-03-29 22:54:33 +02:00
|
|
|
|
where a one bit tag is used to distinguish between Constant or Cell.
|
|
|
|
|
In particular this bit of metadata is stored as the lowest bit of a
|
|
|
|
|
memory block.
|
|
|
|
|
|
|
|
|
|
Given that all the OCaml target architectures guarantee that all
|
|
|
|
|
pointers are divisible by four and that means that two lowest bits are
|
|
|
|
|
always 00 storing this bit of metadata at the lowest bit allows an
|
|
|
|
|
optimization. Constant values in OCaml, such as integers, empty lists,
|
|
|
|
|
Unit values and constructors of arity zero (/constant/ constructors)
|
|
|
|
|
are unboxed at runtime while pointers are recognized by the lowest bit
|
|
|
|
|
set to 0.
|
|
|
|
|
|
|
|
|
|
|
2020-03-02 14:46:37 +01:00
|
|
|
|
|
2020-03-29 22:54:33 +02:00
|
|
|
|
** Lambda form compilation
|
|
|
|
|
\begin{comment}
|
|
|
|
|
https://dev.realworld.org/compiler-backend.html
|
|
|
|
|
CITE: realworldocaml
|
|
|
|
|
\end{comment}
|
2020-03-30 21:23:55 +02:00
|
|
|
|
A Lambda code target file is produced by the compiler and consists of a
|
2020-03-02 14:46:37 +01:00
|
|
|
|
single s-expression. Every s-expression consist of /(/, a sequence of
|
|
|
|
|
elements separated by a whitespace and a closing /)/.
|
|
|
|
|
Elements of s-expressions are:
|
|
|
|
|
- Atoms: sequences of ascii letters, digits or symbols
|
|
|
|
|
- Variables
|
|
|
|
|
- Strings: enclosed in double quotes and possibly escaped
|
|
|
|
|
- S-expressions: allowing arbitrary nesting
|
|
|
|
|
|
2020-03-30 21:23:55 +02:00
|
|
|
|
The Lambda form is a key stage where the compiler discards type
|
|
|
|
|
informations and maps the original source code to the runtime memory
|
|
|
|
|
model described.
|
|
|
|
|
In this stage of the compiler pipeline pattern match statements are
|
|
|
|
|
analyzed and compiled into an automata.
|
|
|
|
|
\begin{comment}
|
|
|
|
|
evidenzia centralita` rispetto alla tesi
|
|
|
|
|
\end{comment}
|
|
|
|
|
#+BEGIN_SRC
|
|
|
|
|
type t = | Foo | Bar | Baz | Fred
|
|
|
|
|
|
|
|
|
|
let test = function
|
|
|
|
|
| Foo -> "foo"
|
|
|
|
|
| Bar -> "bar"
|
|
|
|
|
| Baz -> "baz"
|
|
|
|
|
| Fred -> "fred"
|
|
|
|
|
#+END_SRC
|
|
|
|
|
The Lambda output for this code can be obtained by running the
|
|
|
|
|
compiler with the /-dlambda/ flag:
|
|
|
|
|
#+BEGIN_SRC
|
|
|
|
|
(setglobal Prova!
|
|
|
|
|
(let
|
|
|
|
|
(test/85 =
|
|
|
|
|
(function param/86
|
|
|
|
|
(switch* param/86
|
|
|
|
|
case int 0: "foo"
|
|
|
|
|
case int 1: "bar"
|
|
|
|
|
case int 2: "baz"
|
|
|
|
|
case int 3: "fred")))
|
|
|
|
|
(makeblock 0 test/85)))
|
|
|
|
|
#+END_SRC
|
|
|
|
|
As outlined by the example, the /makeblock/ directive is responsible
|
|
|
|
|
for allocating low level OCaml values and every constant constructor
|
|
|
|
|
ot the algebraic type /t/ is stored in memory as an integer.
|
|
|
|
|
The /setglobal/ directives declares a new binding in the global scope:
|
|
|
|
|
Every concept of modules is lost at this stage of compilation.
|
|
|
|
|
The pattern matching compiler uses a jump table to map every pattern
|
|
|
|
|
matching clauses to its target expression. Values are addressed by a
|
|
|
|
|
unique name.
|
|
|
|
|
#+BEGIN_SRC
|
|
|
|
|
type t = | English of p | French of q
|
|
|
|
|
type p = | Foo | Bar
|
|
|
|
|
type q = | Tata| Titi
|
|
|
|
|
type t = | English of p | French of q
|
|
|
|
|
|
|
|
|
|
let test = function
|
|
|
|
|
| English Foo -> "foo"
|
|
|
|
|
| English Bar -> "bar"
|
|
|
|
|
| French Tata -> "baz"
|
|
|
|
|
| French Titi -> "fred"
|
|
|
|
|
#+END_SRC
|
|
|
|
|
In the case of types with a smaller number of variants, the pattern
|
|
|
|
|
matching compiler may avoid the overhead of computing a jump table.
|
|
|
|
|
This example also highlights the fact that non constant constructor
|
|
|
|
|
are mapped to cons cell that are accessed using the /tag/ directive.
|
|
|
|
|
#+BEGIN_SRC
|
|
|
|
|
(setglobal Prova!
|
|
|
|
|
(let
|
|
|
|
|
(test/89 =
|
|
|
|
|
(function param/90
|
|
|
|
|
(switch* param/90
|
|
|
|
|
case tag 0: (if (!= (field 0 param/90) 0) "bar" "foo")
|
|
|
|
|
case tag 1: (if (!= (field 0 param/90) 0) "fred" "baz"))))
|
|
|
|
|
(makeblock 0 test/89)))
|
|
|
|
|
#+END_SRC
|
|
|
|
|
In the Lambda language are several numeric types:
|
2020-03-02 14:46:37 +01:00
|
|
|
|
- integers: that us either 31 or 63 bit two's complement arithmetic
|
|
|
|
|
depending on system word size, and also wrapping on overflow
|
|
|
|
|
- 32 bit and 64 bit integers: that use 32-bit and 64-bit two's complement arithmetic
|
|
|
|
|
with wrap on overflow
|
|
|
|
|
- big integers: offer integers with arbitrary precision
|
|
|
|
|
- floats: that use IEEE754 double-precision (64-bit) arithmetic with
|
|
|
|
|
the addition of the literals /infinity/, /neg_infinity/ and /nan/.
|
|
|
|
|
|
2020-03-12 19:37:38 +01:00
|
|
|
|
The are various numeric operations defined:
|
2020-03-02 14:46:37 +01:00
|
|
|
|
|
|
|
|
|
- Arithmetic operations: +, -, *, /, % (modulo), neg (unary negation)
|
|
|
|
|
- Bitwise operations: &, |, ^, <<, >> (zero-shifting), a>> (sign extending)
|
|
|
|
|
- Numeric comparisons: <, >, <=, >=, ==
|
|
|
|
|
|
|
|
|
|
*** Functions
|
|
|
|
|
|
|
|
|
|
Functions are defined using the following syntax, and close over all
|
|
|
|
|
bindings in scope: (lambda (arg1 arg2 arg3) BODY)
|
|
|
|
|
and are applied using the following syntax: (apply FUNC ARG ARG ARG)
|
|
|
|
|
Evaluation is eager.
|
|
|
|
|
|
2020-03-30 21:23:55 +02:00
|
|
|
|
*** Other atoms
|
|
|
|
|
The atom /let/ introduces a sequence of bindings at a smaller scope
|
|
|
|
|
than the global one:
|
2020-03-02 14:46:37 +01:00
|
|
|
|
(let BINDING BINDING BINDING ... BODY)
|
|
|
|
|
|
2020-03-30 21:23:55 +02:00
|
|
|
|
The Lambda form supports many other directives such as /strinswitch/
|
|
|
|
|
that is constructs aspecialized jump tables for string, integer range
|
|
|
|
|
comparisons and so on.
|
|
|
|
|
These construct are explicitely undocumented because the Lambda code
|
|
|
|
|
intermediate language can change across compiler releases.
|
|
|
|
|
\begin{comment}
|
|
|
|
|
Spiega che la sintassi che supporti e` quella nella BNF
|
|
|
|
|
\end{comment}
|
|
|
|
|
|
2020-03-02 14:46:37 +01:00
|
|
|
|
|
|
|
|
|
|
|
|
|
|
** Pattern matching
|
2020-02-21 11:29:04 +01:00
|
|
|
|
|
|
|
|
|
Pattern matching is a widely adopted mechanism to interact with ADT.
|
|
|
|
|
C family languages provide branching on predicates through the use of
|
|
|
|
|
if statements and switch statements.
|
2020-02-24 19:46:00 +01:00
|
|
|
|
Pattern matching on the other hands express predicates through
|
|
|
|
|
syntactic templates that also allow to bind on data structures of
|
|
|
|
|
arbitrary shapes. One common example of pattern matching is the use of regular
|
2020-03-03 17:18:40 +01:00
|
|
|
|
expressions on strings. provides pattern matching on ADT and
|
2020-02-21 11:29:04 +01:00
|
|
|
|
primitive data types.
|
2020-02-24 19:46:00 +01:00
|
|
|
|
The result of a pattern matching operation is always one of:
|
|
|
|
|
- this value does not match this pattern”
|
|
|
|
|
- this value matches this pattern, resulting the following bindings of
|
|
|
|
|
names to values and the jump to the expression pointed at the
|
|
|
|
|
pattern.
|
2020-02-21 11:29:04 +01:00
|
|
|
|
|
2020-03-03 17:18:40 +01:00
|
|
|
|
#+BEGIN_SRC
|
2020-02-24 19:46:00 +01:00
|
|
|
|
type color = | Red | Blue | Green | Black | White
|
2020-02-21 11:29:04 +01:00
|
|
|
|
|
2020-02-24 19:46:00 +01:00
|
|
|
|
match color with
|
2020-02-21 11:29:04 +01:00
|
|
|
|
| Red -> print "red"
|
|
|
|
|
| Blue -> print "red"
|
|
|
|
|
| Green -> print "red"
|
2020-02-24 19:46:00 +01:00
|
|
|
|
| _ -> print "white or black"
|
2020-02-21 11:29:04 +01:00
|
|
|
|
#+END_SRC
|
|
|
|
|
|
2020-03-03 17:18:40 +01:00
|
|
|
|
provides tokens to express data destructoring.
|
2020-03-12 19:37:38 +01:00
|
|
|
|
For example we can examine the content of a list with pattern matching
|
2020-02-21 11:29:04 +01:00
|
|
|
|
|
2020-03-03 17:18:40 +01:00
|
|
|
|
#+BEGIN_SRC
|
2020-02-21 11:29:04 +01:00
|
|
|
|
|
|
|
|
|
begin match list with
|
|
|
|
|
| [ ] -> print "empty list"
|
|
|
|
|
| element1 :: [ ] -> print "one element"
|
2020-02-24 19:46:00 +01:00
|
|
|
|
| (element1 :: element2) :: [ ] -> print "two elements"
|
2020-02-21 11:29:04 +01:00
|
|
|
|
| head :: tail-> print "head followed by many elements"
|
|
|
|
|
#+END_SRC
|
|
|
|
|
|
2020-02-24 19:46:00 +01:00
|
|
|
|
Parenthesized patterns, such as the third one in the previous example,
|
|
|
|
|
matches the same value as the pattern without parenthesis.
|
|
|
|
|
|
|
|
|
|
The same could be done with tuples
|
2020-03-03 17:18:40 +01:00
|
|
|
|
#+BEGIN_SRC
|
2020-02-21 11:29:04 +01:00
|
|
|
|
|
|
|
|
|
begin match tuple with
|
|
|
|
|
| (Some _, Some _) -> print "Pair of optional types"
|
2020-02-24 19:46:00 +01:00
|
|
|
|
| (Some _, None) | (None, Some _) -> print "Pair of optional types, one of which is null"
|
2020-02-21 11:29:04 +01:00
|
|
|
|
| (None, None) -> print "Pair of optional types, both null"
|
|
|
|
|
#+END_SRC
|
|
|
|
|
|
2020-02-24 19:46:00 +01:00
|
|
|
|
The pattern pattern₁ | pattern₂ represents the logical "or" of the
|
|
|
|
|
two patterns pattern₁ and pattern₂. A value matches pattern₁ |
|
|
|
|
|
pattern₂ if it matches pattern₁ or pattern₂.
|
|
|
|
|
|
2020-02-21 11:29:04 +01:00
|
|
|
|
Pattern clauses can make the use of /guards/ to test predicates and
|
2020-02-24 19:46:00 +01:00
|
|
|
|
variables can captured (binded in scope).
|
2020-02-21 11:29:04 +01:00
|
|
|
|
|
2020-03-03 17:18:40 +01:00
|
|
|
|
#+BEGIN_SRC
|
2020-02-21 11:29:04 +01:00
|
|
|
|
|
|
|
|
|
begin match token_list with
|
2020-02-24 19:46:00 +01:00
|
|
|
|
| "switch"::var::"{"::rest -> ...
|
|
|
|
|
| "case"::":"::var::rest when is_int var -> ...
|
|
|
|
|
| "case"::":"::var::rest when is_string var -> ...
|
|
|
|
|
| "}"::[ ] -> ...
|
2020-02-21 11:29:04 +01:00
|
|
|
|
| "}"::rest -> error "syntax error: " rest
|
|
|
|
|
|
|
|
|
|
#+END_SRC
|
|
|
|
|
|
2020-03-03 17:18:40 +01:00
|
|
|
|
Moreover, the pattern matching compiler emits a warning when a
|
2020-02-24 19:46:00 +01:00
|
|
|
|
pattern is not exhaustive or some patterns are shadowed by precedent ones.
|
2020-02-21 11:29:04 +01:00
|
|
|
|
|
2020-03-03 17:18:40 +01:00
|
|
|
|
** Symbolic execution
|
2020-04-03 15:53:54 +02:00
|
|
|
|
|
|
|
|
|
Symbolic execution is a widely used techniques in the field of
|
|
|
|
|
computer security.
|
2020-04-03 18:12:32 +02:00
|
|
|
|
It allows to analyze different execution paths of a program
|
|
|
|
|
simultanously while tracking which inputs trigger the execution of
|
|
|
|
|
different parts of the program.
|
|
|
|
|
Inputs are modelled symbolically rather than taking "concrete" values.
|
|
|
|
|
A symbolic execution engine keeps track of expressions and variables
|
|
|
|
|
in terms of these symbolic symbols and attaches logical constraints to every
|
|
|
|
|
branch that is being followed.
|
|
|
|
|
Symbolic execution engines are used to track bugs by modelling the
|
|
|
|
|
domain of all possible inputs of a program, detecting infeasible
|
|
|
|
|
paths, dead code and proving that two code segments are equivalent.
|
|
|
|
|
|
|
|
|
|
Let's take as example this signedness bug that was found in the
|
|
|
|
|
FreeBSD kernel and allowed, when calling the getpeername function, to
|
|
|
|
|
read portions of kernel memory.
|
|
|
|
|
#+BEGIN_SRC
|
|
|
|
|
int compat;
|
|
|
|
|
{
|
|
|
|
|
struct file *fp;
|
|
|
|
|
register struct socket *so;
|
|
|
|
|
struct sockaddr *sa;
|
|
|
|
|
int len, error;
|
|
|
|
|
|
|
|
|
|
...
|
|
|
|
|
|
|
|
|
|
len = MIN(len, sa->sa_len); /* [1] */
|
|
|
|
|
error = copyout(sa, (caddr_t)uap->asa, (u_int)len);
|
|
|
|
|
if (error)
|
|
|
|
|
goto bad;
|
|
|
|
|
|
|
|
|
|
...
|
|
|
|
|
|
|
|
|
|
bad:
|
|
|
|
|
if (sa)
|
|
|
|
|
FREE(sa, M_SONAME);
|
|
|
|
|
fdrop(fp, p);
|
|
|
|
|
return (error);
|
|
|
|
|
}
|
|
|
|
|
#+END_SRC
|
|
|
|
|
|
|
|
|
|
The tree of the execution when the function is evaluated considering
|
|
|
|
|
/int len/ our symbolic variable α, sa->sa_len as symbolic variable β
|
|
|
|
|
and π as the set of constraints on a symbolic variable:
|
|
|
|
|
|
|
|
|
|
#+BEGIN_SRC
|
|
|
|
|
[1] compat (...) { π_{α}: -∞ < α < ∞ }
|
|
|
|
|
|
|
|
|
|
|
[2] min (σ₁, σ₂) { π_{σ}: -∞ < (σ₁,σ₂) < ∞ ; π_{α}: -∞ < α < β ; π_{β}: ...}
|
|
|
|
|
|
|
|
|
|
|
[3] cast(u_int) (...) { π_{σ}: 0 ≤ (σ) < ∞ ; π_{α}: -∞ < α < β ; π_{β}: ...}
|
|
|
|
|
|
|
|
|
|
|
... // rest of the execution
|
|
|
|
|
#+END_SRC
|
|
|
|
|
We can see that at step 3 the set of possible values of the scrutinee
|
|
|
|
|
α is bigger than the set of possible values of the input σ to the
|
|
|
|
|
/cast/ directive, that is: π_{α} ⊈ π_{σ}. For this reason the /cast/ may fail when α is /len/
|
|
|
|
|
negative number, outside the domain π_{σ}. In C this would trigger undefined behaviour (signed
|
|
|
|
|
overflow) that made the exploitation possible.
|
|
|
|
|
|
|
|
|
|
Every step of evaluation can be modelled as the following transition:
|
|
|
|
|
#+BEGIN_SRC
|
|
|
|
|
(π_{σ}, (πᵢ)ⁱ) → (π'_{σ}, (π'ᵢ)ⁱ)
|
|
|
|
|
#+END_SRC
|
|
|
|
|
if we express the π constraints as logical formulas we can model the
|
|
|
|
|
execution of the program in terms of Hoare Logic.
|
|
|
|
|
State of the computation is a Hoare triple {P}C{Q} where P and Q are
|
|
|
|
|
respectively the /precondition/ and the /postcondition/ that
|
|
|
|
|
constitute the assertions of the program. C is the directive being
|
|
|
|
|
executed.
|
|
|
|
|
The language of the assertions P is:
|
|
|
|
|
#+BEGIN_SRC
|
|
|
|
|
P ::= true | false | a < b | P₁ ∧ P₂ | P₁ ∨ P₂ | ~P
|
|
|
|
|
#+END_SRC
|
|
|
|
|
where a and b are numbers.
|
|
|
|
|
In the Hoare rules assertions could also take the form of
|
|
|
|
|
#+BEGIN_SRC
|
|
|
|
|
P ::= ∀i. P | ∃i. P | P₁ ⇒ P₂
|
|
|
|
|
#+END_SRC
|
|
|
|
|
where i is a logical variable, but assertions of these kinds increases
|
|
|
|
|
the complexity of the symbolic engine.
|
|
|
|
|
Execution follows the rules of Hoare logic:
|
|
|
|
|
- Empty statement :
|
|
|
|
|
\begin{verbatim}
|
|
|
|
|
————————————
|
|
|
|
|
{P}/skip/{P}
|
|
|
|
|
\end{verbatim}
|
|
|
|
|
- Assignment statement : The truthness of P[a/x] is equivalent to the
|
|
|
|
|
truth of {P} after the assignment.
|
|
|
|
|
\begin{verbatim}
|
|
|
|
|
————————————
|
|
|
|
|
{P[a/x]}x:=a{P}
|
|
|
|
|
\end{verbatim}
|
|
|
|
|
|
|
|
|
|
- Composition : c₁ and c₂ are directives that are executed in order;
|
|
|
|
|
{Q} is the /midcondition/.
|
|
|
|
|
\begin{verbatim}
|
|
|
|
|
{P}c₁{R}, {R}c₂{Q}
|
|
|
|
|
——————————————————
|
|
|
|
|
{P}c₁;c₂{Q}
|
|
|
|
|
\end{verbatim}
|
|
|
|
|
|
|
|
|
|
- Conditional :
|
|
|
|
|
\begin{verbatim}
|
|
|
|
|
{P∧b}c₁{Q}, {P∧~b}c₂{Q}
|
|
|
|
|
————————————————————————
|
|
|
|
|
{P}if b then c₁ else c₂{Q}
|
|
|
|
|
\end{verbatim}
|
|
|
|
|
|
|
|
|
|
- Loop : {P} is the loop invariant. After the loop is finished /P/
|
|
|
|
|
holds and ~b caused the loop to end.
|
|
|
|
|
\begin{verbatim}
|
|
|
|
|
{P∧b}c{P}
|
|
|
|
|
————————————————————————
|
|
|
|
|
{P}while b do c{P∧~b}
|
|
|
|
|
\end{verbatim}
|
|
|
|
|
|
|
|
|
|
Even if the semantics of symbolic execution engines are well defined,
|
|
|
|
|
the user may run into different complications when applying such
|
|
|
|
|
analysis to non trivial codebases.
|
|
|
|
|
For example, depending on the domain, loop termination is not
|
|
|
|
|
guaranteed. Even when termination is guaranteed, looping causes
|
|
|
|
|
exponential branching that may lead to path explosion or state
|
|
|
|
|
explosion.
|
|
|
|
|
Reasoning about all possible executions of a program is not always
|
|
|
|
|
feasible and in case of explosion usually symbolic execution engines
|
|
|
|
|
implement heuristics to reduce the size of the search space.
|
2020-03-03 17:18:40 +01:00
|
|
|
|
|
|
|
|
|
** Translation validation
|
|
|
|
|
Translators, such as translators and code generators, are huge pieces of
|
|
|
|
|
software usually consisting of multiple subsystem and
|
|
|
|
|
constructing an actual specification of a translator implementation for
|
|
|
|
|
formal validation is a very long task. Moreover, different
|
|
|
|
|
translators implement different algorithms, so the correctness proof of
|
|
|
|
|
a translator cannot be generalized and reused to prove another translator.
|
|
|
|
|
Translation validation is an alternative to the verification of
|
|
|
|
|
existing translators that consists of taking the source and the target
|
|
|
|
|
(compiled) program and proving /a posteriori/ their semantic equivalence.
|
|
|
|
|
|
|
|
|
|
- [ ] Techniques for translation validation
|
|
|
|
|
- [ ] What does semantically equivalent mean
|
|
|
|
|
- [ ] What happens when there is no semantic equivalence
|
|
|
|
|
- [ ] Translation validation through symbolic execution
|
|
|
|
|
|
2020-03-12 19:12:23 +01:00
|
|
|
|
* Translation validation of the Pattern Matching Compiler
|
2020-03-03 17:18:40 +01:00
|
|
|
|
|
2020-03-12 19:12:23 +01:00
|
|
|
|
** Source program
|
2020-03-03 17:18:40 +01:00
|
|
|
|
The algorithm takes as its input a source program and translates it
|
2020-03-12 19:12:23 +01:00
|
|
|
|
into an algebraic data structure called /decision_tree/.
|
2020-03-03 17:18:40 +01:00
|
|
|
|
|
|
|
|
|
#+BEGIN_SRC
|
2020-03-12 19:12:23 +01:00
|
|
|
|
type decision_tree =
|
2020-03-03 17:18:40 +01:00
|
|
|
|
| Unreachable
|
|
|
|
|
| Failure
|
|
|
|
|
| Leaf of source_expr
|
2020-03-12 19:12:23 +01:00
|
|
|
|
| Guard of source_blackbox * decision_tree * decision_tree
|
|
|
|
|
| Node of accessor * (constructor * decision_tree) list * decision_tree
|
2020-03-03 17:18:40 +01:00
|
|
|
|
#+END_SRC
|
|
|
|
|
|
|
|
|
|
Unreachable, Leaf of source_expr and Failure are the terminals of the three.
|
|
|
|
|
We distinguish
|
|
|
|
|
- Unreachable: statically it is known that no value can go there
|
|
|
|
|
- Failure: a value matching this part results in an error
|
|
|
|
|
- Leaf: a value matching this part results into the evaluation of a
|
2020-03-12 19:37:38 +01:00
|
|
|
|
source black box of code
|
2020-03-03 17:18:40 +01:00
|
|
|
|
|
|
|
|
|
The algorithm doesn't support type-declaration-based analysis
|
|
|
|
|
to know the list of constructors at a given type.
|
|
|
|
|
Let's consider some trivial examples:
|
|
|
|
|
|
|
|
|
|
#+BEGIN_SRC
|
|
|
|
|
function true -> 1
|
|
|
|
|
#+END_SRC
|
|
|
|
|
|
|
|
|
|
[ ] Converti a disegni
|
|
|
|
|
|
|
|
|
|
Is translated to
|
|
|
|
|
|Node ([(true, Leaf 1)], Failure)
|
|
|
|
|
while
|
|
|
|
|
#+BEGIN_SRC
|
|
|
|
|
function
|
|
|
|
|
true -> 1
|
|
|
|
|
| false -> 2
|
|
|
|
|
#+END_SRC
|
|
|
|
|
will give
|
|
|
|
|
|Node ([(true, Leaf 1); (false, Leaf 2)])
|
|
|
|
|
|
|
|
|
|
It is possible to produce Unreachable examples by using
|
|
|
|
|
refutation clauses (a "dot" in the right-hand-side)
|
|
|
|
|
#+BEGIN_SRC
|
|
|
|
|
function
|
|
|
|
|
true -> 1
|
|
|
|
|
| false -> 2
|
|
|
|
|
| _ -> .
|
|
|
|
|
#+END_SRC
|
|
|
|
|
that gets translated into
|
|
|
|
|
Node ([(true, Leaf 1); (false, Leaf 2)], Unreachable)
|
|
|
|
|
|
|
|
|
|
We trust this annotation, which is reasonable as the type-checker
|
|
|
|
|
verifies that it indeed holds.
|
|
|
|
|
|
|
|
|
|
Guard nodes of the tree are emitted whenever a guard is found. Guards
|
|
|
|
|
node contains a blackbox of code that is never evaluated and two
|
|
|
|
|
branches, one that is taken in case the guard evaluates to true and
|
|
|
|
|
the other one that contains the path taken when the guard evaluates to
|
|
|
|
|
true.
|
|
|
|
|
|
|
|
|
|
[ ] Finisci con Node
|
|
|
|
|
[ ] Spiega del fallback
|
2020-03-12 19:12:23 +01:00
|
|
|
|
[ ] rivedi compilazione per tenere in considerazione il tuo albero invece che le Lambda
|
2020-03-03 17:18:40 +01:00
|
|
|
|
[ ] Specifica che stesso algoritmo usato per compilare a lambda, piu` optimizations
|
|
|
|
|
|
2020-03-12 19:12:23 +01:00
|
|
|
|
The source code of a pattern matching function has the
|
2020-03-03 17:18:40 +01:00
|
|
|
|
following form:
|
2020-02-21 11:29:04 +01:00
|
|
|
|
|
2020-02-24 19:46:00 +01:00
|
|
|
|
|match variable with
|
|
|
|
|
|\vert pattern₁ -> expr₁
|
|
|
|
|
|\vert pattern₂ when guard -> expr₂
|
|
|
|
|
|\vert pattern₃ as var -> expr₃
|
|
|
|
|
|⋮
|
|
|
|
|
|\vert pₙ -> exprₙ
|
2020-02-21 11:29:04 +01:00
|
|
|
|
|
2020-03-12 19:12:23 +01:00
|
|
|
|
and can include any expression that is legal for the OCaml compiler,
|
|
|
|
|
such as /when/ guards and assignments. Patterns could or could not
|
2020-03-03 17:18:40 +01:00
|
|
|
|
be exhaustive.
|
2020-02-21 11:29:04 +01:00
|
|
|
|
|
2020-03-03 17:18:40 +01:00
|
|
|
|
Pattern matching code could also be written using the more compact form:
|
|
|
|
|
|function
|
|
|
|
|
|\vert pattern₁ -> expr₁
|
|
|
|
|
|\vert pattern₂ when guard -> expr₂
|
|
|
|
|
|\vert pattern₃ as var -> expr₃
|
|
|
|
|
|⋮
|
|
|
|
|
|\vert pₙ -> exprₙ
|
2020-02-24 19:46:00 +01:00
|
|
|
|
|
2020-02-21 11:29:04 +01:00
|
|
|
|
|
2020-03-03 17:18:40 +01:00
|
|
|
|
This BNF grammar describes formally the grammar of the source program:
|
|
|
|
|
|
|
|
|
|
#+BEGIN_SRC bnf
|
|
|
|
|
start ::= "match" id "with" patterns | "function" patterns
|
|
|
|
|
patterns ::= (pattern0|pattern1) pattern1+
|
|
|
|
|
;; pattern0 and pattern1 are needed to distinguish the first case in which
|
|
|
|
|
;; we can avoid writing the optional vertical line
|
|
|
|
|
pattern0 ::= clause
|
|
|
|
|
pattern1 ::= "|" clause
|
|
|
|
|
clause ::= lexpr "->" rexpr
|
|
|
|
|
|
|
|
|
|
lexpr ::= rule (ε|condition)
|
|
|
|
|
rexpr ::= _code ;; arbitrary code
|
|
|
|
|
|
|
|
|
|
rule ::= wildcard|variable|constructor_pattern|or_pattern ;;
|
|
|
|
|
|
|
|
|
|
;; rules
|
|
|
|
|
wildcard ::= "_"
|
|
|
|
|
variable ::= identifier
|
|
|
|
|
constructor_pattern ::= constructor (rule|ε) (assignment|ε)
|
|
|
|
|
|
|
|
|
|
constructor ::= int|float|char|string|bool
|
|
|
|
|
|unit|record|exn|objects|ref
|
|
|
|
|
|list|tuple|array
|
|
|
|
|
|variant|parameterized_variant ;; data types
|
|
|
|
|
|
2020-03-12 19:12:23 +01:00
|
|
|
|
or_pattern ::= rule ("|" wildcard|variable|constructor_pattern)+
|
2020-03-03 17:18:40 +01:00
|
|
|
|
|
|
|
|
|
condition ::= "when" bexpr
|
|
|
|
|
assignment ::= "as" id
|
|
|
|
|
bexpr ::= _code ;; arbitrary code
|
2020-02-24 19:46:00 +01:00
|
|
|
|
#+END_SRC
|
2020-02-21 11:29:04 +01:00
|
|
|
|
|
2020-03-02 14:46:37 +01:00
|
|
|
|
\begin{comment}
|
2020-03-12 19:12:23 +01:00
|
|
|
|
Check that it is still coherent to this bnf
|
2020-03-03 17:18:40 +01:00
|
|
|
|
\end{comment}
|
2020-02-21 11:29:04 +01:00
|
|
|
|
|
2020-03-03 17:18:40 +01:00
|
|
|
|
Patterns are of the form
|
2020-02-24 19:46:00 +01:00
|
|
|
|
| pattern | type of pattern |
|
2020-02-24 14:36:26 +01:00
|
|
|
|
|-----------------+---------------------|
|
|
|
|
|
| _ | wildcard |
|
|
|
|
|
| x | variable |
|
|
|
|
|
| c(p₁,p₂,...,pₙ) | constructor pattern |
|
|
|
|
|
| (p₁\vert p₂) | or-pattern |
|
2020-02-21 11:29:04 +01:00
|
|
|
|
|
2020-03-03 17:18:40 +01:00
|
|
|
|
During compilation by the translators expressions are compiled into
|
2020-03-12 19:12:23 +01:00
|
|
|
|
Lambda code and are referred as lambda code actions lᵢ.
|
2020-02-21 11:29:04 +01:00
|
|
|
|
|
2020-03-03 17:18:40 +01:00
|
|
|
|
The entire pattern matching code is represented as a clause matrix
|
2020-02-21 11:29:04 +01:00
|
|
|
|
that associates rows of patterns (p_{i,1}, p_{i,2}, ..., p_{i,n}) to
|
|
|
|
|
lambda code action lⁱ
|
|
|
|
|
\begin{equation*}
|
|
|
|
|
(P → L) =
|
|
|
|
|
\begin{pmatrix}
|
2020-02-24 14:36:26 +01:00
|
|
|
|
p_{1,1} & p_{1,2} & \cdots & p_{1,n} & → l₁ \\
|
|
|
|
|
p_{2,1} & p_{2,2} & \cdots & p_{2,n} & → l₂ \\
|
|
|
|
|
\vdots & \vdots & \ddots & \vdots & → \vdots \\
|
|
|
|
|
p_{m,1} & p_{m,2} & \cdots & p_{m,n} & → lₘ
|
2020-02-21 11:29:04 +01:00
|
|
|
|
\end{pmatrix}
|
|
|
|
|
\end{equation*}
|
|
|
|
|
|
2020-02-24 14:36:26 +01:00
|
|
|
|
The pattern /p/ matches a value /v/, written as p ≼ v, when one of the
|
|
|
|
|
following rules apply
|
2020-02-21 11:29:04 +01:00
|
|
|
|
|
2020-02-24 14:36:26 +01:00
|
|
|
|
|--------------------+---+--------------------+-------------------------------------------|
|
|
|
|
|
| _ | ≼ | v | ∀v |
|
|
|
|
|
| x | ≼ | v | ∀v |
|
2020-02-21 11:29:04 +01:00
|
|
|
|
| (p₁ \vert\ p₂) | ≼ | v | iff p₁ ≼ v or p₂ ≼ v |
|
|
|
|
|
| c(p₁, p₂, ..., pₐ) | ≼ | c(v₁, v₂, ..., vₐ) | iff (p₁, p₂, ..., pₐ) ≼ (v₁, v₂, ..., vₐ) |
|
|
|
|
|
| (p₁, p₂, ..., pₐ) | ≼ | (v₁, v₂, ..., vₐ) | iff pᵢ ≼ vᵢ ∀i ∈ [1..a] |
|
2020-02-24 14:36:26 +01:00
|
|
|
|
|--------------------+---+--------------------+-------------------------------------------|
|
|
|
|
|
|
2020-02-24 19:46:00 +01:00
|
|
|
|
When a value /v/ matches pattern /p/ we say that /v/ is an /instance/ of /p/.
|
2020-02-21 11:29:04 +01:00
|
|
|
|
|
2020-02-24 19:46:00 +01:00
|
|
|
|
Considering the pattern matrix P we say that the value vector
|
2020-02-24 14:36:26 +01:00
|
|
|
|
$\vec{v}$ = (v₁, v₂, ..., vᵢ) matches the line number i in P if and only if the following two
|
2020-02-21 11:29:04 +01:00
|
|
|
|
conditions are satisfied:
|
2020-02-24 14:36:26 +01:00
|
|
|
|
- p_{i,1}, p_{i,2}, \cdots, p_{i,n} ≼ (v₁, v₂, ..., vᵢ)
|
|
|
|
|
- ∀j < i p_{j,1}, p_{j,2}, \cdots, p_{j,n} ⋠ (v₁, v₂, ..., vᵢ)
|
2020-02-21 11:29:04 +01:00
|
|
|
|
|
|
|
|
|
We can define the following three relations with respect to patterns:
|
2020-02-24 14:36:26 +01:00
|
|
|
|
- Patter p is less precise than pattern q, written p ≼ q, when all
|
2020-02-21 11:29:04 +01:00
|
|
|
|
instances of q are instances of p
|
|
|
|
|
- Pattern p and q are equivalent, written p ≡ q, when their instances
|
|
|
|
|
are the same
|
|
|
|
|
- Patterns p and q are compatible when they share a common instance
|
|
|
|
|
|
2020-03-12 19:12:23 +01:00
|
|
|
|
\subsubsection{Parsing of the source program}
|
2020-02-21 11:29:04 +01:00
|
|
|
|
|
2020-03-12 19:12:23 +01:00
|
|
|
|
The source program of the following general form is parsed using a parser
|
|
|
|
|
generated by Menhir, a LR(1) parser generator for the OCaml programming language.
|
|
|
|
|
Menhir compiles LR(1) a grammar specification, in this case the OCaml pattern matching
|
|
|
|
|
grammar, down to OCaml code.
|
2020-02-21 11:29:04 +01:00
|
|
|
|
|
2020-02-24 19:46:00 +01:00
|
|
|
|
|match variable with
|
|
|
|
|
|\vert pattern₁ -> e₁
|
|
|
|
|
|\vert pattern₂ -> e₂
|
|
|
|
|
|⋮
|
|
|
|
|
|\vert pₘ -> eₘ
|
|
|
|
|
|
2020-03-12 19:12:23 +01:00
|
|
|
|
The result of parsing, when successful, results in a list of clauses
|
|
|
|
|
and a list of type declarations.
|
|
|
|
|
Every clause consists of three objects: a left-hand-side that is the
|
|
|
|
|
kind of pattern expressed, an option guard and a right-hand-side expression.
|
|
|
|
|
Patterns are encoded in the following way:
|
|
|
|
|
| pattern | type |
|
|
|
|
|
|-----------------+-------------|
|
|
|
|
|
| _ | Wildcard |
|
|
|
|
|
| p₁ as x | Assignment |
|
|
|
|
|
| c(p₁,p₂,...,pₙ) | Constructor |
|
|
|
|
|
| (p₁\vert p₂) | Orpat |
|
|
|
|
|
|
|
|
|
|
Guards and right-hand-sides are treated as a blackbox of OCaml code.
|
|
|
|
|
A sound approach for treating these blackbox would be to inspect the
|
|
|
|
|
OCaml compiler during translation to Lambda code and extract the
|
|
|
|
|
blackboxes compiled in their Lambda representation.
|
|
|
|
|
This would allow to test for equality with the respective blackbox at
|
|
|
|
|
the target level.
|
|
|
|
|
Given that this level of introspection is currently not possibile, we
|
|
|
|
|
decided to restrict the structure of blackboxes to the following (valid) OCaml
|
|
|
|
|
code:
|
|
|
|
|
|
|
|
|
|
#+BEGIN_SRC
|
|
|
|
|
external guard : 'a -> 'b = "guard"
|
|
|
|
|
external observe : 'a -> 'b = "observe"
|
|
|
|
|
#+END_SRC
|
|
|
|
|
|
|
|
|
|
We assume these two external functions /guard/ and /observe/ with a valid
|
|
|
|
|
type that lets the user pass any number of arguments to them.
|
|
|
|
|
All the guards are of the form \texttt{guard <arg> <arg> <arg>}, where the
|
|
|
|
|
<arg> are expressed using the OCaml pattern matching language.
|
|
|
|
|
Similarly, all the right-hand-side expressions are of the form
|
|
|
|
|
\texttt{observe <arg> <arg> ...} with the same constraints on arguments.
|
|
|
|
|
|
|
|
|
|
#+BEGIN_SRC
|
|
|
|
|
type t = Z | S of t
|
|
|
|
|
|
|
|
|
|
let _ = function
|
|
|
|
|
| Z -> observe 0
|
|
|
|
|
| S Z -> observe 1
|
|
|
|
|
| S x when guard x -> observe 2
|
|
|
|
|
| S (S x) as y when guard x y -> observe 3
|
|
|
|
|
| S _ -> observe 4
|
|
|
|
|
#+END_SRC
|
|
|
|
|
|
|
|
|
|
Once parsed, the type declarations and the list of clauses are encoded in the form of a matrix
|
|
|
|
|
that is later evaluated using a matrix decomposition algorithm.
|
|
|
|
|
|
|
|
|
|
\subsubsection{Matrix decomposition of pattern clauses}
|
|
|
|
|
|
|
|
|
|
The initial input of the decomposition algorithm C consists of a vector of variables
|
2020-02-24 19:46:00 +01:00
|
|
|
|
$\vec{x}$ = (x₁, x₂, ..., xₙ) of size /n/ where /n/ is the arity of
|
|
|
|
|
the type of /x/ and a clause matrix P → L of width n and height m.
|
|
|
|
|
That is:
|
2020-02-21 11:29:04 +01:00
|
|
|
|
|
|
|
|
|
\begin{equation*}
|
2020-02-24 19:46:00 +01:00
|
|
|
|
C((\vec{x} = (x₁, x₂, ..., xₙ),
|
2020-02-21 11:29:04 +01:00
|
|
|
|
\begin{pmatrix}
|
|
|
|
|
p_{1,1} & p_{1,2} & \cdots & p_{1,n} → l₁ \\
|
|
|
|
|
p_{2,1} & p_{2,2} & \cdots & p_{2,n} → l₂ \\
|
2020-02-24 19:46:00 +01:00
|
|
|
|
\vdots & \vdots & \ddots & \vdots → \vdots \\
|
|
|
|
|
p_{m,1} & p_{m,2} & \cdots & p_{m,n} → lₘ)
|
2020-02-21 11:29:04 +01:00
|
|
|
|
\end{pmatrix}
|
|
|
|
|
\end{equation*}
|
2020-02-24 19:46:00 +01:00
|
|
|
|
|
|
|
|
|
The base case C₀ of the algorithm is the case in which the $\vec{x}$
|
|
|
|
|
is empty, that is $\vec{x}$ = (), then the result of the compilation
|
|
|
|
|
C₀ is l₁
|
|
|
|
|
\begin{equation*}
|
|
|
|
|
C₀((),
|
|
|
|
|
\begin{pmatrix}
|
|
|
|
|
→ l₁ \\
|
|
|
|
|
→ l₂ \\
|
|
|
|
|
→ \vdots \\
|
|
|
|
|
→ lₘ
|
|
|
|
|
\end{pmatrix})
|
|
|
|
|
) = l₁
|
|
|
|
|
\end{equation*}
|
|
|
|
|
|
|
|
|
|
When $\vec{x}$ ≠ () then the compilation advances using one of the
|
|
|
|
|
following four rules:
|
|
|
|
|
|
|
|
|
|
1) Variable rule: if all patterns of the first column of P are wildcard patterns or
|
|
|
|
|
bind the value to a variable, then
|
|
|
|
|
|
|
|
|
|
\begin{equation*}
|
|
|
|
|
C(\vec{x}, P → L) = C((x₂, x₃, ..., xₙ), P' → L')
|
|
|
|
|
\end{equation*}
|
|
|
|
|
where
|
|
|
|
|
\begin{equation*}
|
|
|
|
|
\begin{pmatrix}
|
|
|
|
|
p_{1,2} & \cdots & p_{1,n} & → & (let & y₁ & x₁) & l₁ \\
|
|
|
|
|
p_{2,2} & \cdots & p_{2,n} & → & (let & y₂ & x₁) & l₂ \\
|
|
|
|
|
\vdots & \ddots & \vdots & → & \vdots & \vdots & \vdots & \vdots \\
|
|
|
|
|
p_{m,2} & \cdots & p_{m,n} & → & (let & yₘ & x₁) & lₘ
|
|
|
|
|
\end{pmatrix}
|
|
|
|
|
\end{equation*}
|
|
|
|
|
|
|
|
|
|
That means in every lambda action lᵢ there is a binding of x₁ to the
|
|
|
|
|
variable that appears on the pattern $p_{i,1}. Bindings are omitted
|
|
|
|
|
for wildcard patterns and the lambda action lᵢ remains unchanged.
|
|
|
|
|
|
|
|
|
|
2) Constructor rule: if all patterns in the first column of P are
|
|
|
|
|
constructors patterns of the form k(q₁, q₂, ..., qₙ) we define a
|
|
|
|
|
new matrix, the specialized clause matrix S, by applying the
|
|
|
|
|
following transformation on every row /p/:
|
|
|
|
|
\begin{lstlisting}[mathescape,columns=fullflexible,basicstyle=\fontfamily{lmvtt}\selectfont,]
|
|
|
|
|
for every c ∈ Set of constructors
|
|
|
|
|
for i ← 1 .. m
|
|
|
|
|
let kᵢ ← constructor_of($p_{i,1}$)
|
|
|
|
|
if kᵢ = c then
|
|
|
|
|
p ← $q_{i,1}$, $q_{i,2}$, ..., $q_{i,n'}$, $p_{i,2}$, $p_{i,3}$, ..., $p_{i, n}$
|
|
|
|
|
\end{lstlisting}
|
|
|
|
|
Patterns of the form $q_{i,j}$ matches on the values of the
|
|
|
|
|
constructor and we define new fresh variables y₁, y₂, ..., yₐ so
|
|
|
|
|
that the lambda action lᵢ becomes
|
|
|
|
|
|
|
|
|
|
\begin{lstlisting}[mathescape,columns=fullflexible,basicstyle=\fontfamily{lmvtt}\selectfont,]
|
|
|
|
|
(let (y₁ (field 0 x₁))
|
|
|
|
|
(y₂ (field 1 x₁))
|
|
|
|
|
...
|
|
|
|
|
(yₐ (field (a-1) x₁))
|
|
|
|
|
lᵢ)
|
|
|
|
|
\end{lstlisting}
|
|
|
|
|
|
|
|
|
|
and the result of the compilation for the set of constructors
|
|
|
|
|
{c₁, c₂, ..., cₖ} is:
|
|
|
|
|
|
|
|
|
|
\begin{lstlisting}[mathescape,columns=fullflexible,basicstyle=\fontfamily{lmvtt}\selectfont,]
|
|
|
|
|
switch x₁ with
|
|
|
|
|
case c₁: l₁
|
|
|
|
|
case c₂: l₂
|
|
|
|
|
...
|
|
|
|
|
case cₖ: lₖ
|
|
|
|
|
default: exit
|
|
|
|
|
\end{lstlisting}
|
|
|
|
|
|
|
|
|
|
3) Orpat rule: there are various strategies for dealing with
|
|
|
|
|
or-patterns. The most naive one is to split the or-patterns.
|
|
|
|
|
For example a row p containing an or-pattern:
|
|
|
|
|
\begin{equation*}
|
|
|
|
|
(p_{i,1}|q_{i,1}|r_{i,1}), p_{i,2}, ..., p_{i,m} → lᵢ
|
|
|
|
|
\end{equation*}
|
|
|
|
|
results in three rows added to the clause matrix
|
|
|
|
|
\begin{equation*}
|
|
|
|
|
p_{i,1}, p_{i,2}, ..., p_{i,m} → lᵢ \\
|
|
|
|
|
\end{equation*}
|
|
|
|
|
\begin{equation*}
|
|
|
|
|
q_{i,1}, p_{i,2}, ..., p_{i,m} → lᵢ \\
|
|
|
|
|
\end{equation*}
|
|
|
|
|
\begin{equation*}
|
|
|
|
|
r_{i,1}, p_{i,2}, ..., p_{i,m} → lᵢ
|
|
|
|
|
\end{equation*}
|
|
|
|
|
4) Mixture rule:
|
|
|
|
|
When none of the previous rules apply the clause matrix P → L is
|
2020-03-12 19:37:38 +01:00
|
|
|
|
split into two clause matrices, the first P₁ → L₁ that is the
|
2020-02-24 19:46:00 +01:00
|
|
|
|
largest prefix matrix for which one of the three previous rules
|
|
|
|
|
apply, and P₂ → L₂ containing the remaining rows. The algorithm is
|
|
|
|
|
applied to both matrices.
|
2020-03-24 15:52:52 +01:00
|
|
|
|
|
|
|
|
|
* Correctness of the algorithm
|
|
|
|
|
Running a program tₛ or its translation 〚tₛ〛 against an input vₛ
|
|
|
|
|
produces as a result a result /r/ in the following way:
|
|
|
|
|
| ( 〚tₛ〛ₛ(vₛ) = Cₛ(vₛ) ) → r
|
|
|
|
|
| tₛ(vₛ) → r
|
|
|
|
|
Likewise
|
|
|
|
|
| ( 〚tₜ〛ₜ(vₜ) = Cₜ(vₜ) ) → r
|
|
|
|
|
| tₜ(vₜ) → r
|
|
|
|
|
where result r ::= guard list * (Match blackbox | NoMatch | Absurd)
|
|
|
|
|
and guard ::= blackbox.
|
|
|
|
|
|
|
|
|
|
Having defined equivalence between two inputs of which one is
|
|
|
|
|
expressed in the source language and the other in the target language
|
|
|
|
|
vₛ ≃ vₜ (TODO define, this talks about the representation of source values in the target)
|
|
|
|
|
|
|
|
|
|
we can define the equivalence between a couple of programs or a couple
|
2020-04-01 17:16:57 +02:00
|
|
|
|
of decision trees
|
2020-03-24 15:52:52 +01:00
|
|
|
|
| tₛ ≃ tₜ := ∀vₛ≃vₜ, tₛ(vₛ) = tₜ(vₜ)
|
|
|
|
|
| Cₛ ≃ Cₜ := ∀vₛ≃vₜ, Cₛ(vₛ) = Cₜ(vₜ)
|
|
|
|
|
|
|
|
|
|
The proposed equivalence algorithm that works on a couple of
|
2020-04-01 17:16:57 +02:00
|
|
|
|
decision trees is returns either /Yes/ or /No(vₛ, vₜ)/ where vₛ and
|
2020-03-24 15:52:52 +01:00
|
|
|
|
vₜ are a couple of possible counter examples for which the constraint
|
|
|
|
|
trees produce a different result.
|
|
|
|
|
|
|
|
|
|
** Statements
|
2020-04-01 17:16:57 +02:00
|
|
|
|
Theorem. We say that a translation of a source program to a decision tree
|
2020-03-24 15:52:52 +01:00
|
|
|
|
is correct when for every possible input, the source program and its
|
2020-04-01 17:16:57 +02:00
|
|
|
|
respective decision tree produces the same result
|
2020-03-24 15:52:52 +01:00
|
|
|
|
|
|
|
|
|
| ∀vₛ, tₛ(vₛ) = 〚tₛ〛ₛ(vₛ)
|
|
|
|
|
|
2020-03-29 21:24:56 +02:00
|
|
|
|
|
2020-03-24 15:52:52 +01:00
|
|
|
|
Likewise, for the target language:
|
|
|
|
|
|
|
|
|
|
| ∀vₜ, tₜ(vₜ) = 〚tₜ〛ₜ(vₜ)
|
|
|
|
|
|
|
|
|
|
Definition: in the presence of guards we can say that two results are
|
|
|
|
|
equivalent modulo the guards queue, written /r₁ ≃gs r₂/, when:
|
|
|
|
|
| (gs₁, r₁) ≃gs (gs₂, r₂) ⇔ (gs₁, r₁) = (gs₂ ++ gs, r₂)
|
|
|
|
|
|
|
|
|
|
Definition: we say that Cₜ covers the input space /S/, written
|
|
|
|
|
/covers(Cₜ, S) when every value vₛ∈S is a valid input to the
|
2020-04-01 17:16:57 +02:00
|
|
|
|
decision tree Cₜ. (TODO: rephrase)
|
2020-03-24 15:52:52 +01:00
|
|
|
|
|
2020-04-01 17:16:57 +02:00
|
|
|
|
Theorem: Given an input space /S/ and a couple of decision trees, where
|
|
|
|
|
the target decision tree Cₜ covers the input space /S/, we say that
|
|
|
|
|
the two decision trees are equivalent when:
|
2020-03-24 15:52:52 +01:00
|
|
|
|
|
2020-04-02 14:14:39 +02:00
|
|
|
|
| equiv(S, Cₛ, Cₜ, gs) = Yes ∧ covers(Cₜ, S) → ∀vₛ≃vₜ ∈ S, Cₛ(vₛ) ≃gs Cₜ(vₜ)
|
2020-03-24 15:52:52 +01:00
|
|
|
|
|
2020-04-01 17:16:57 +02:00
|
|
|
|
Similarly we say that a couple of decision trees in the presence of
|
2020-03-24 15:52:52 +01:00
|
|
|
|
an input space /S/ are /not/ equivalent when:
|
|
|
|
|
|
2020-04-02 14:14:39 +02:00
|
|
|
|
| equiv(S, Cₛ, Cₜ, gs) = No(vₛ,vₜ) ∧ covers(Cₜ, S) → vₛ≃vₜ ∈ S ∧ Cₛ(vₛ) ≠gs Cₜ(vₜ)
|
2020-03-24 15:52:52 +01:00
|
|
|
|
|
|
|
|
|
Corollary: For a full input space /S/, that is the universe of the
|
|
|
|
|
target program we say:
|
|
|
|
|
|
2020-04-02 14:14:39 +02:00
|
|
|
|
| equiv(S, 〚tₛ〛ₛ, 〚tₜ〛ₜ, ∅) = Yes ⇔ tₛ ≃ tₜ
|
2020-03-29 21:24:56 +02:00
|
|
|
|
|
|
|
|
|
|
2020-04-01 17:16:57 +02:00
|
|
|
|
*** Proof of the correctness of the translation from source programs to source decision trees
|
2020-03-29 21:24:56 +02:00
|
|
|
|
|
|
|
|
|
We define a source term tₛ as a collection of patterns pointing to blackboxes
|
|
|
|
|
| tₛ ::= (p → bb)^{i∈I}
|
|
|
|
|
|
|
|
|
|
A pattern is defined as either a constructor pattern, an or pattern or
|
|
|
|
|
a constant pattern
|
|
|
|
|
| p ::= | K(pᵢ)ⁱ, i ∈ I | (p|q) | n ∈ ℕ
|
|
|
|
|
|
2020-04-01 17:16:57 +02:00
|
|
|
|
A decision tree is defined as either a Leaf, a Failure terminal or
|
2020-03-29 21:24:56 +02:00
|
|
|
|
an intermediate node with different children sharing the same accessor /a/
|
|
|
|
|
and an optional fallback.
|
|
|
|
|
Failure is emitted only when the patterns don't cover the whole set of
|
|
|
|
|
possible input values /S/. The fallback is not needed when the user
|
|
|
|
|
doesn't use a wildcard pattern.
|
|
|
|
|
|
|
|
|
|
| Cₛ ::= Leaf bb | Node(a, (Kᵢ → Cᵢ)^{i∈S} , C?)
|
|
|
|
|
| a ::= Here | n.a
|
|
|
|
|
| vₛ ::= K(vᵢ)^{i∈I} | n ∈ ℕ
|
|
|
|
|
|
|
|
|
|
\begin{comment}
|
|
|
|
|
Are K and Here clear here?
|
|
|
|
|
\end{comment}
|
|
|
|
|
|
|
|
|
|
We define the decomposition matrix /mₛ/ as
|
|
|
|
|
| SMatrix mₛ := (aⱼ)^{j∈J}, ((pᵢⱼ)^{j∈J} → bbᵢ)^{i∈I}
|
2020-04-01 00:37:54 +02:00
|
|
|
|
\begin{comment}
|
|
|
|
|
Correggi prendendo in considerazione l'accessor
|
|
|
|
|
\end{comment}
|
|
|
|
|
|
2020-04-01 17:16:57 +02:00
|
|
|
|
We define the decision tree of source programs
|
2020-04-01 00:37:54 +02:00
|
|
|
|
[|tₛ|]
|
2020-04-01 17:16:57 +02:00
|
|
|
|
in terms of the decision tree of pattern matrices
|
2020-04-01 00:37:54 +02:00
|
|
|
|
[|mₛ|]
|
|
|
|
|
by the following:
|
|
|
|
|
〚((pᵢ → bbᵢ)^{i∈I}〛 := 〚(Root), (pᵢ → bbᵢ)^{i∈I} 〛 # i ∈ I
|
|
|
|
|
|
2020-04-01 17:16:57 +02:00
|
|
|
|
decision tree computed from pattern matrices respect the following invariant:
|
2020-04-01 00:37:54 +02:00
|
|
|
|
| ∀v (vᵢ)^{i∈I} = v(aᵢ)^{i∈I} → 〚m〛(v) = m(vᵢ)^{i∈I} for m = ((aᵢ)^{i∈I}, (rᵢ)^{i∈I})
|
|
|
|
|
where
|
|
|
|
|
| v(Here) = v
|
|
|
|
|
| K(vᵢ)ⁱ(k.a) = vₖ(a) if k ∈ [0;n[
|
|
|
|
|
\begin{comment}
|
|
|
|
|
TODO: EXPLAIN
|
|
|
|
|
\end{comment}
|
|
|
|
|
|
|
|
|
|
We proceed to show the correctness of the invariant by a case
|
|
|
|
|
analysys.
|
|
|
|
|
|
|
|
|
|
Base cases:
|
|
|
|
|
1. [| ∅, (∅ → bbᵢ)ⁱ |] := Leaf bbᵢ where i := min(I), that is a
|
2020-04-01 17:16:57 +02:00
|
|
|
|
decision tree [|ms|] defined by an empty accessor and empty
|
2020-04-01 00:37:54 +02:00
|
|
|
|
patterns pointing to blackboxes /bbᵢ/. This respects the invariant
|
|
|
|
|
because a decomposition matrix in the case of empty rows returns
|
|
|
|
|
the first expression and we known that (Leaf bb)(v) := Match bb
|
|
|
|
|
2. [| (aⱼ)ʲ, ∅ |] := Failure
|
|
|
|
|
|
|
|
|
|
Regarding non base cases:
|
|
|
|
|
We define
|
|
|
|
|
| let Idx(k) = [0; arity(k)[
|
|
|
|
|
| let First(∅) = ⊥
|
|
|
|
|
| let First((aⱼ)ʲ) = a_{min(j)} ## where j ∈ J ≠ ∅
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
** Proof of equivalence checking
|
|
|
|
|
|
|
|
|
|
The equivalence checking algorithm takes as parameters an input space
|
2020-04-01 17:16:57 +02:00
|
|
|
|
/S/, a source decision tree /Cₛ/ and a target decision tree /Cₜ/:
|
2020-04-01 00:37:54 +02:00
|
|
|
|
| equiv(S, Cₛ, Cₜ) → Yes | No(vₛ, vₜ)
|
|
|
|
|
|
|
|
|
|
When the algorithm returns Yes and the input space is covered by /Cₛ/
|
2020-04-01 17:16:57 +02:00
|
|
|
|
we can say that the couple of decision trees are the same for
|
2020-04-01 00:37:54 +02:00
|
|
|
|
every couple of source value /vₛ/ and target value /vₜ/ that are equivalent.
|
|
|
|
|
\begin{comment}
|
|
|
|
|
Define "covered"
|
2020-04-02 14:14:39 +02:00
|
|
|
|
Is it correct to say the same? How to correctly distinguish in words ≃ and = ?
|
2020-04-01 00:37:54 +02:00
|
|
|
|
\end{comment}
|
2020-04-02 14:14:39 +02:00
|
|
|
|
| equiv(S, Cₛ, Cₜ) = Yes and cover(Cₜ, S) → ∀ vₛ ≃ vₜ∈S ∧ Cₛ(vₛ) = Cₜ(vₜ)
|
2020-04-01 00:37:54 +02:00
|
|
|
|
In the case where the algorithm returns No we have at least a couple
|
2020-04-01 17:16:57 +02:00
|
|
|
|
of counter example values vₛ and vₜ for which the two decision trees
|
2020-04-01 00:37:54 +02:00
|
|
|
|
outputs a different result.
|
2020-04-02 14:14:39 +02:00
|
|
|
|
| equiv(S, Cₛ, Cₜ) = No(vₛ,vₜ) and cover(Cₜ, S) → ∀ vₛ ≃ vₜ∈S ∧ Cₛ(vₛ) ≠ Cₜ(vₜ)
|
2020-04-01 00:37:54 +02:00
|
|
|
|
|
|
|
|
|
We define the following
|
|
|
|
|
| Forall(Yes) = Yes
|
|
|
|
|
| Forall(Yes::l) = Forall(l)
|
|
|
|
|
| Forall(No(vₛ,vₜ)::_) = No(vₛ,vₜ)
|
|
|
|
|
There exists and are injective:
|
|
|
|
|
| int(k)∈ℕ (ar(k) = 0)
|
|
|
|
|
| tag(k)∈ℕ (ar(k) > 0)
|
|
|
|
|
| π(k) = {n|int(k) = n} x {n|tag{k} = n}
|
|
|
|
|
where k is a constructor.
|
|
|
|
|
|
|
|
|
|
\begin{comment}
|
|
|
|
|
TODO: explain:
|
|
|
|
|
∀v∈a→π, C_{/a→π}(v) = C(v)
|
|
|
|
|
\end{comment}
|
|
|
|
|
|
|
|
|
|
We proceed by case analysis:
|
|
|
|
|
1. equiv(∅, Cₛ, Cₜ) := Yes
|
|
|
|
|
\begin{comment}
|
|
|
|
|
Devo spiegarlo?
|
|
|
|
|
\end{comment}
|
|
|
|
|
In the other subcases S is always non-empty.
|
|
|
|
|
2. equiv(S, Failure, Failure) := Yes
|
|
|
|
|
the statement holds because of equality between Failure nodes in
|
|
|
|
|
the case of every possible value /v/.
|
2020-04-01 17:16:57 +02:00
|
|
|
|
3. The result of the subcase where we have a source decision tree
|
2020-04-01 00:37:54 +02:00
|
|
|
|
/Cₛ/ that is either a Leaf terminal or a Failure terminal and a
|
2020-04-01 17:16:57 +02:00
|
|
|
|
target decision tree defined by an accessor /a/ and a positive
|
2020-04-01 00:37:54 +02:00
|
|
|
|
number of couples constraint πᵢ and children nodes Cₜᵢ. The output
|
|
|
|
|
the output of the algorithm is:
|
|
|
|
|
| equiv(S, (Leaf bbₛ|Failure) as Cₛ, Node(a, (πᵢ → Cₜᵢ)ⁱ)) := Forall(equiv( S∩a→π(kᵢ)), Cₛ, Cₜᵢ)ⁱ)
|
|
|
|
|
The statement holds because defined let Sᵢ := S∩(a→πᵢ)
|
|
|
|
|
either the algorithm is true for every sub-input space Sᵢ and
|
|
|
|
|
subtree Cₜᵢ
|
|
|
|
|
| equiv(Sᵢ, Cₛ, Cₜᵢ) = Yes ∀i
|
|
|
|
|
or we have a counter example vₛ, vₜ for which
|
2020-04-02 14:14:39 +02:00
|
|
|
|
| vₛ≃vₜ∈Sₖ ∧ cₛ(vₛ) ≠ Cₜₖ(vₜ)
|
2020-04-01 00:37:54 +02:00
|
|
|
|
then because
|
|
|
|
|
| vₜ∈(a→πₖ) ⇒ Cₜ(vₜ) = Cₜₖ(vₜ)
|
|
|
|
|
then
|
2020-04-02 14:14:39 +02:00
|
|
|
|
| vₛ≃vₜ∈S ∧ Cₛ(vₛ)≠Cₜ(vₜ)
|
2020-04-01 00:37:54 +02:00
|
|
|
|
and the result of the algorithm is
|
|
|
|
|
| equiv(Sᵢ, Cₛ, Cₜᵢ) = No(vₛ, vₜ) for some minimal k∈I
|