% Created 2020-03-02 Mon 14:31
% Intended LaTeX compiler: pdflatex
\documentclass[11pt]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{graphicx}
\usepackage{grffile}
\usepackage{longtable}
\usepackage{wrapfig}
\usepackage{rotating}
\usepackage[normalem]{ulem}
\usepackage{amsmath}
\usepackage{textcomp}
\usepackage{amssymb}
\usepackage{capt-of}
\usepackage{hyperref}
\usepackage{algorithm}
\usepackage{comment}
\usepackage{algpseudocode}
\usepackage{amsmath,amssymb,amsthm}
\newtheorem{definition}{Definition}
\usepackage{graphicx}
\usepackage{listings}
\usepackage{color}
\author{Francesco Mecca}
\date{}
\title{Translation Verification of the OCaml pattern matching compiler}
\hypersetup{
 pdfauthor={Francesco Mecca},
 pdftitle={Translation Verification of the OCaml pattern matching compiler},
 pdfkeywords={},
 pdfsubject={},
 pdfcreator={Emacs 26.3 (Org mode 9.1.9)}, 
 pdflang={English}}
\begin{document}

\maketitle
\begin{comment}
\section{{\bfseries\sffamily TODO} Scaletta [1/5]}
\label{sec:org62901c2}
\begin{itemize}
\item[{$\boxtimes$}] Abstract
\item[{$\boxminus$}] Background [40\%]
\begin{itemize}
\item[{$\boxtimes$}] Ocaml
\item[{$\square$}] Lambda code [0\%]
\begin{itemize}
\item[{$\square$}] Compiler optimizations
\item[{$\square$}] other instructions
\end{itemize}
\item[{$\boxtimes$}] Pattern matching
\item[{$\square$}] Symbolic execution
\item[{$\square$}] Translation Validation
\end{itemize}
\item[{$\square$}] Translation validation of the Pattern Matching Compiler
\begin{itemize}
\item[{$\square$}] Source translation
\begin{itemize}
\item[{$\square$}] Formal Grammar
\item[{$\square$}] Compilation of source patterns
\end{itemize}
\item[{$\square$}] Target translation
\begin{itemize}
\item[{$\square$}] Formal Grammar
\item[{$\square$}] Symbolic execution of the lambda target
\end{itemize}
\item[{$\square$}] Equivalence between source and target
\end{itemize}
\item[{$\square$}] Proof of correctness
\item[{$\square$}] Practical results
\end{itemize}

\end{comment}

\begin{abstract}

This dissertation presents an algorithm for the translation validation of the OCaml
pattern matching compiler. Given the source representation of the target program and the
target program compiled in untyped lambda form, the algoritmhm is capable of modelling
the source program in terms of symbolic constraints on it's branches and apply symbolic
execution on the untyped lambda representation in order to validate wheter the compilation
produced a valid result.
In this context a valid result means that for every input in the domain of the source
program the untyped lambda translation produces the same output as the source program.
The input of the program is modelled in terms of symbolic constraints closely related to
the runtime representation of OCaml objects and the output consists of OCaml code
blackboxes that are not evaluated in the context of the verification.

\end{abstract}

\section{Background}
\label{sec:orgbf4de70}

\subsection{OCaml}
\label{sec:orga9a97c9}
Objective Caml (OCaml) is a dialect of the ML (Meta-Language) family of programming
languages.
OCaml shares many features with other dialects of ML, such as SML and Caml Light,
The main features of ML languages are the use of the Hindley-Milner type system that
provides many advantages with respect to static type systems of traditional imperative and object
oriented language such as C, C++ and Java, such as:
\begin{itemize}
\item Polymorphism: in certain scenarios a function can accept more than one
type for the input parameters. For example a function that computes the lenght of a
list doesn't need to inspect the type of the elements of the list and for this reason
a List.length function can accept lists of integers, lists of strings and in general
lists of any type. Such languages offer polymorphic functions through subtyping at
runtime only, while other languages such as C++ offer polymorphism through compile
time templates and function overloading.
With the Hindley-Milner type system each well typed function can have more than one
type but always has a unique best type, called the \emph{principal type}.
For example the principal type of the List.length function is "For any \emph{a}, function from
list of \emph{a} to \emph{int}" and \emph{a} is called the \emph{type parameter}.
\item Strong typing: Languages such as C and C++ allow the programmer to operate on data
without considering its type, mainly through pointers. Other languages such as C\#
and Go allow type erasure so at runtime the type of the data can't be queried.
In the case of programming languages using an Hindley-Milner type system the
programmer is not allowed to operate on data by ignoring or promoting its type.
\item Type Inference: the principal type of a well formed term can be inferred without any
annotation or declaration.
\item Algebraic data types: types that are modelled by the use of two
algebraic operations, sum and product.
A sum type is a type that can hold of many different types of
objects, but only one at a time. For example the sum type defined
as \emph{A + B} can hold at any moment a value of type A or a value of
type B. Sum types are also called tagged union or variants.
A product type is a type constructed as a direct product
of multiple types and contains at any moment one instance for
every type of its operands. Product types are also called tuples
or records. Algebraic data types can be recursive
in their definition and can be combined.
\end{itemize}
Moreover ML languages are functional, meaning that functions are
treated as first class citizens and variables are immutable,
although mutable statements and imperative constructs are permitted.
In addition to that OCaml features an object system, that provides
inheritance, subtyping and dynamic binding, and modules, that
provide a way to encapsulate definitions. Modules are checked
statically and can be reificated through functors.

\subsection{Lambda form compilation}
\label{sec:org2d10d35}
\begin{comment}
https://dev.realworldocaml.org/compiler-backend.html
\end{comment}

OCaml provides compilation in form of a byecode executable with an
optionally embeddable interpreter and a native executable that could
be statically linked to provide a single file executable.

After the OCaml typechecker has proven that the program is type safe,
the OCaml compiler lower the code to \emph{Lambda}, an s-expression based
language that assumes that its input has already been proved safe.
On the \emph{Lambda} representation of the source program, the compiler
performes a series of optimization passes before translating the
lambda form to assembly code.

\begin{enumerate}
\item OCaml datatypes
\label{sec:orgd605d09}

Most native data types in OCaml, such as integers, tuples, lists,
records, can be seen as instances of the following definition

\begin{verbatim}
type t = Nil | One of int | Cons of int * t 
\end{verbatim}
that is a type \emph{t} with three constructors that define its complete
signature.
Every constructor has an arity. Nil, a constructor of arity 0, is
called a constant constructor. 

\item Lambda form types
\label{sec:org1ee20a6}
A lambda form target file produced by the ocaml compiler consists of a
single s-expression. Every s-expression consist of \emph{(}, a sequence of
elements separated by a whitespace and a closing \emph{)}.
Elements of s-expressions are:
\begin{itemize}
\item Atoms: sequences of ascii letters, digits or symbols
\item Variables
\item Strings: enclosed in double quotes and possibly escaped
\item S-expressions: allowing arbitrary nesting
\end{itemize}

There are several numeric types:
\begin{itemize}
\item integers: that us either 31 or 63 bit two's complement arithmetic
depending on system word size, and also wrapping on overflow
\item 32 bit and 64 bit integers: that use 32-bit and 64-bit two's complement arithmetic
with wrap on overflow
\item big integers: offer integers with arbitrary precision
\item floats: that use IEEE754 double-precision (64-bit) arithmetic with
the addition of the literals \emph{infinity}, \emph{neg\_infinity} and \emph{nan}.
\end{itemize}

The are varios numeric operations defined:

\begin{itemize}
\item Arithmetic operations: +, -, *, /, \% (modulo), neg (unary negation)
\item Bitwise operations: \&, |, \^{}, <<, >> (zero-shifting), a>> (sign extending)
\item Numeric comparisons: <, >, <=, >=, ==
\end{itemize}

\item Functions
\label{sec:org914d5eb}

Functions are defined using the following syntax, and close over all
bindings in scope: (lambda (arg1 arg2 arg3) BODY)
and are applied using the following syntax: (apply FUNC ARG ARG ARG)
Evaluation is eager.

\item Bindings
\label{sec:org055206b}
The atom \emph{let} introduces a sequence of bindings:
(let BINDING BINDING BINDING \ldots{} BODY)

\item Other atoms
\label{sec:org0f92182}
TODO: if, switch, stringswitch\ldots{}
TODO: magari esempi
\end{enumerate}


\subsection{Pattern matching}
\label{sec:org9876fb9}

Pattern matching is a widely adopted mechanism to interact with ADT.
C family languages provide branching on predicates through the use of
if statements and switch statements.
Pattern matching on the other hands express predicates through
syntactic templates that also allow to bind on data structures of
arbitrary shapes. One common example of pattern matching is the use of regular
expressions on strings. OCaml provides pattern matching on ADT and
primitive data types.
The result of a pattern matching operation is always one of:
\begin{itemize}
\item this value does not match this pattern”
\item this value matches this pattern, resulting the following bindings of
names to values and the jump to the expression pointed at the
pattern.
\end{itemize}

\begin{verbatim}
type color = | Red | Blue | Green | Black | White

match color with
| Red -> print "red"
| Blue -> print "red"
| Green -> print "red"
| _ -> print "white or black"
\end{verbatim}

OCaml provides tokens to express data destructoring.
For example we can examine the content of a list with patten matching

\begin{verbatim}

begin match list with
| [ ] -> print "empty list"
| element1 :: [ ] -> print "one element"
| (element1 :: element2) :: [ ] -> print "two elements"
| head :: tail-> print "head followed by many elements"
\end{verbatim}

Parenthesized patterns, such as the third one in the previous example,
matches the same value as the pattern without parenthesis.

The same could be done with tuples
\begin{verbatim}

begin match tuple with
| (Some _, Some _) -> print "Pair of optional types"
| (Some _, None) | (None, Some _) -> print "Pair of optional types, one of which is null"
| (None, None) -> print "Pair of optional types, both null"
\end{verbatim}

The pattern pattern₁ |  pattern₂ represents the logical "or" of the
two patterns pattern₁ and pattern₂. A value matches pattern₁ |
pattern₂ if it matches pattern₁ or pattern₂.

Pattern clauses can make the use of \emph{guards} to test predicates and
variables can captured (binded in scope).

\begin{verbatim}

begin match token_list with
| "switch"::var::"{"::rest -> ...
| "case"::":"::var::rest when is_int var -> ...
| "case"::":"::var::rest when is_string var -> ...
| "}"::[ ] -> ...
| "}"::rest -> error "syntax error: " rest

\end{verbatim}

Moreover, the OCaml pattern matching compiler emits a warning when a
pattern is not exhaustive or some patterns are shadowed by precedent ones.

In general pattern matching on primitive and algebraic data types takes the
following form.

\begin{center}
\begin{tabular}{l}
match variable with\\
\(\vert{}\) pattern₁ -> expr₁\\
\(\vert{}\) pattern₂ when guard -> expr₂\\
\(\vert{}\) pattern₃ as var -> expr₃\\
⋮\\
\(\vert{}\) pₙ -> exprₙ\\
\end{tabular}
\end{center}

It can be described more formally through a BNF grammar.

\begin{verbatim}

pattern ::= value-name  
| _   ;; wildcard pattern
| constant  ;; matches a constant value
| pattern as  value-name  ;; binds to value-name
| ( pattern ) ;; parenthesized pattern
| pattern |  pattern ;; or-pattern 
| constr  pattern ;; variant pattern
| [ pattern  { ; pattern }  [ ; ] ] ;; list patterns 
| pattern ::  pattern   ;; lists patterns using cons operator (::)
| [| pattern  { ; pattern }  [ ; ] |] ;; array pattern 
| char-literal ..  char-literal   ;; match on a range of characters
| { field  [: typexpr]  [= pattern] { ; field  [: typexpr]  [= pattern] } \
  [; _ ] [ ; ] } ;; patterns that match on records

\end{verbatim}

\begin{comment}
\begin{enumerate}
\item 1.2.1 Pattern matching compilation to lambda code
\label{sec:org4d27bd9}

During compilation, patterns are in the form
\begin{center}
\begin{tabular}{ll}
pattern & type of pattern\\
\hline
\_ & wildcard\\
x & variable\\
c(p₁,p₂,\ldots{},pₙ) & constructor pattern\\
(p₁\(\vert{}\) p₂) & or-pattern\\
\end{tabular}
\end{center}

Expressions are compiled into lambda code and are referred as lambda
code actions.

The entire pattern matching code can be represented as a clause matrix
that associates rows of patterns (p\(_{\text{i,1}}\), p\(_{\text{i,2}}\), \ldots{}, p\(_{\text{i,n}}\)) to
lambda code action lⁱ
\begin{equation*}
(P → L) =
\begin{pmatrix}
p_{1,1} & p_{1,2} & \cdots & p_{1,n} & → l₁ \\
p_{2,1} & p_{2,2} & \cdots & p_{2,n} & →  l₂ \\
\vdots & \vdots & \ddots & \vdots & → \vdots \\
p_{m,1} & p_{m,2} & \cdots & p_{m,n} & → lₘ
\end{pmatrix}
\end{equation*}

The pattern \emph{p} matches a value \emph{v}, written as p ≼ v, when one of the
following rules apply

\begin{center}
\begin{tabular}{llll}
\hline
\_ & ≼ & v & ∀v\\
x & ≼ & v & ∀v\\
(p₁ \(\vert{}\)$\backslash$ p₂) & ≼ & v & iff p₁ ≼ v or p₂ ≼ v\\
c(p₁, p₂, \ldots{}, pₐ) & ≼ & c(v₁, v₂, \ldots{}, vₐ) & iff (p₁, p₂, \ldots{}, pₐ) ≼ (v₁, v₂, \ldots{}, vₐ)\\
(p₁, p₂, \ldots{}, pₐ) & ≼ & (v₁, v₂, \ldots{}, vₐ) & iff pᵢ ≼ vᵢ ∀i ∈ [1..a]\\
\hline
\end{tabular}
\end{center}

When a value \emph{v} matches pattern \emph{p} we say that \emph{v} is an \emph{instance} of \emph{p}.

Considering the pattern matrix P we say that the value vector
\(\vec{v}\) = (v₁, v₂, \ldots{}, vᵢ) matches the line number i in P  if and only if the following two
conditions are satisfied:
\begin{itemize}
\item p\(_{\text{i,1}}\), p\(_{\text{i,2}}\), \(\cdots{}\), p\(_{\text{i,n}}\)  ≼ (v₁, v₂, \ldots{}, vᵢ)
\item ∀j < i p\(_{\text{j,1}}\), p\(_{\text{j,2}}\), \(\cdots{}\), p\(_{\text{j,n}}\) ⋠ (v₁, v₂, \ldots{}, vᵢ)
\end{itemize}

We can define the following three relations with respect to patterns:
\begin{itemize}
\item Patter p is less precise than pattern q, written p ≼ q, when all
instances of q are instances of p
\item Pattern p and q are equivalent, written p ≡ q, when their instances
are the same
\item Patterns p and q are compatible when they share a common instance
\end{itemize}

\begin{enumerate}
\item Initial state of the compilation
\label{sec:orge9c6bc4}

Given a source of the following form:


\begin{center}
\begin{tabular}{l}
match variable with\\
\(\vert{}\) pattern₁ -> e₁\\
\(\vert{}\) pattern₂ -> e₂\\
⋮\\
\(\vert{}\) pₘ -> eₘ\\
\end{tabular}
\end{center}

the initial input of the algorithm C consists of a vector of variables
\(\vec{x}\) = (x₁, x₂, \ldots{}, xₙ)  of size \emph{n} where \emph{n} is the arity of
the type of \emph{x} and a clause matrix P → L of width n and height m.
That is:

\begin{equation*}
C((\vec{x} = (x₁, x₂, ..., xₙ),
\begin{pmatrix}
p_{1,1} & p_{1,2} & \cdots & p_{1,n} → l₁ \\
p_{2,1} & p_{2,2} & \cdots & p_{2,n} → l₂ \\
\vdots & \vdots & \ddots & \vdots →  \vdots \\
p_{m,1} & p_{m,2} & \cdots & p_{m,n} → lₘ)
\end{pmatrix}
\end{equation*}

The base case C₀ of the algorithm is the case in which the \(\vec{x}\)
is empty, that is \(\vec{x}\) = (), then the result of the compilation
C₀ is l₁
\begin{equation*}
C₀((),
\begin{pmatrix}
→ l₁ \\
→ l₂ \\
→  \vdots \\
→ lₘ
\end{pmatrix})
) = l₁
\end{equation*}

When \(\vec{x}\) ≠ () then the compilation advances using one of the
following four rules:

\begin{enumerate}
\item Variable rule: if all patterns of the first column of P are wildcard patterns or
bind the value to a variable, then 

\begin{equation*}
C(\vec{x}, P → L) = C((x₂, x₃, ..., xₙ), P' → L')
\end{equation*}
where
\begin{equation*}
\begin{pmatrix}
p_{1,2} & \cdots & p_{1,n} & → & (let & y₁ & x₁) & l₁ \\
p_{2,2} & \cdots & p_{2,n} & → & (let & y₂ & x₁) & l₂ \\
\vdots & \ddots & \vdots & →  & \vdots & \vdots & \vdots & \vdots \\
p_{m,2} & \cdots & p_{m,n} & → & (let & yₘ & x₁) & lₘ
\end{pmatrix}
\end{equation*}

That means in every lambda action lᵢ there is a binding of x₁ to the
variable that appears on the pattern \$p\(_{\text{i,1}}\). Bindings are omitted
for wildcard patterns and the lambda action lᵢ remains unchanged.

\item Constructor rule: if all patterns in the first column of P are
constructors patterns of the form k(q₁, q₂, \ldots{}, qₙ) we define a
new matrix, the specialized clause matrix S, by applying the
following transformation on every row \emph{p}:
\begin{lstlisting}[mathescape,columns=fullflexible,basicstyle=\fontfamily{lmvtt}\selectfont,]
for every c ∈ Set of constructors 
    for i ← 1 .. m 
        let kᵢ ← constructor_of($p_{i,1}$) 
        if kᵢ = c then 
            p ← $q_{i,1}$, $q_{i,2}$, ..., $q_{i,n'}$,  $p_{i,2}$, $p_{i,3}$, ..., $p_{i, n}$
\end{lstlisting}
Patterns of the form \(q_{i,j}\) matches on the values of the
constructor and we define new fresh variables y₁, y₂, \ldots{}, yₐ so
that the lambda action lᵢ becomes
\end{enumerate}

\begin{lstlisting}[mathescape,columns=fullflexible,basicstyle=\fontfamily{lmvtt}\selectfont,]
    (let (y₁ (field 0 x₁))
         (y₂ (field 1 x₁))
         ...
         (yₐ (field (a-1) x₁))
         lᵢ)
\end{lstlisting}

and the result of the compilation for the set of constructors
\{c₁, c₂, \ldots{}, cₖ\} is:

\begin{lstlisting}[mathescape,columns=fullflexible,basicstyle=\fontfamily{lmvtt}\selectfont,]
    switch x₁ with
    case c₁: l₁
    case c₂: l₂
    ...
    case cₖ: lₖ
    default: exit
\end{lstlisting}

\begin{enumerate}
\item Orpat rule: there are various strategies for dealing with
or-patterns. The most naive one is to split the or-patterns.
For example a row p containing an or-pattern:
\begin{equation*}
(p_{i,1}|q_{i,1}|r_{i,1}), p_{i,2}, ..., p_{i,m} → lᵢ
\end{equation*}
results in three rows added to the clause matrix
\begin{equation*}
p_{i,1}, p_{i,2}, ..., p_{i,m} → lᵢ \\
\end{equation*}
\begin{equation*}
q_{i,1}, p_{i,2}, ..., p_{i,m} → lᵢ \\
\end{equation*}
\begin{equation*}
r_{i,1}, p_{i,2}, ..., p_{i,m} → lᵢ 
\end{equation*}
\item Mixture rule:
When none of the previous rules apply the clause matrix P → L is
splitted into two clause matrices, the first P₁ → L₁ that is the
largest prefix matrix for which one of the three previous rules
apply, and P₂ → L₂ containing the remaining rows. The algorithm is
applied to both matrices.
\end{enumerate}

\end{comment}

\begin{comment}
#+BEGIN_COMMENT
CITE paper?
#+END_COMMENT’ 
\end{comment}
\end{enumerate}
\end{enumerate}

\subsection{Symbolic execution}
\label{sec:orgdb60b84}

\subsection{Translation validation}
\label{sec:org096d047}
Translators, such as translators and code generators, are huge pieces of
software usually consisting of multiple subsystem and
constructing an actual specification of a translator implementation for
formal validation is a very long task. Moreover, different
translators implement different algorithms, so the correctness proof of
a translator cannot be generalized and reused to prove another translator.
Translation validation is an alternative to the verification of
existing translators that consists of taking the source and the target
(compiled) program and proving \emph{a posteriori} their semantic equivalence.

\begin{itemize}
\item[{$\square$}] Techniques for translation validation
\item[{$\square$}] What does semantically equivalent mean
\item[{$\square$}] What happens when there is no semantic equivalence
\item[{$\square$}] Translation validation through symbolic execution
\end{itemize}
\end{document}