aa

2020-07-01 19:04:07 +02:00 · 2020-07-01 19:04:07 +02:00 · 009ac7e338
commit 009ac7e338
parent 3498912e21
9 changed files with 240 additions and 4 deletions
--- a/anno3/apprendimento_automatico/ltximg/org-ltximg_1f0aebebe540ea3c260f67e2a3efea40555b5304.png
+++ b/anno3/apprendimento_automatico/ltximg/org-ltximg_1f0aebebe540ea3c260f67e2a3efea40555b5304.png
--- a/anno3/apprendimento_automatico/ltximg/org-ltximg_2e111ee3f04bed792b496c613a0dba3edf7f77c5.png
+++ b/anno3/apprendimento_automatico/ltximg/org-ltximg_2e111ee3f04bed792b496c613a0dba3edf7f77c5.png
--- a/anno3/apprendimento_automatico/ltximg/org-ltximg_30d47ec25a0e2d20c996460948266164368f6aa1.png
+++ b/anno3/apprendimento_automatico/ltximg/org-ltximg_30d47ec25a0e2d20c996460948266164368f6aa1.png
--- a/anno3/apprendimento_automatico/ltximg/org-ltximg_53d15e1ac5806269a4b64985d4443ab7792d83aa.png
+++ b/anno3/apprendimento_automatico/ltximg/org-ltximg_53d15e1ac5806269a4b64985d4443ab7792d83aa.png
--- a/anno3/apprendimento_automatico/ltximg/org-ltximg_7dc62399a4819c1004baba085ef0df7703e51d49.png
+++ b/anno3/apprendimento_automatico/ltximg/org-ltximg_7dc62399a4819c1004baba085ef0df7703e51d49.png
--- a/anno3/apprendimento_automatico/ltximg/org-ltximg_d2ea2cc0b571cf8184fee44e817611bff7ac2753.png
+++ b/anno3/apprendimento_automatico/ltximg/org-ltximg_d2ea2cc0b571cf8184fee44e817611bff7ac2753.png
--- a/anno3/apprendimento_automatico/ltximg/org-ltximg_e3ca34c47825568fd9558fd9d1db191e9a55e61e.png
+++ b/anno3/apprendimento_automatico/ltximg/org-ltximg_e3ca34c47825568fd9558fd9d1db191e9a55e61e.png
--- a/anno3/apprendimento_automatico/preparazione.org
+++ b/anno3/apprendimento_automatico/preparazione.org
@ -113,4 +113,236 @@ Given an unlabelled dataset D, find:
 - Confidence: support(a∪b)/support(a)
 ** Models
 *** Linear Models
+**** Best fitting line
+Cx + D = y
+X w = y in matrix form, w = (C D)ᵀ
+If X is square and full rank: w = X⁻¹·y, but in general X is not
+invertible.
+| Error: ‖e‖₂ = ‖y-p‖₂ = (∑ᵢ(yᵢ-pᵢ)²)^{1/2}
+We can frame this as the problem of minimizing the norm of e,
+where p = X·$\hat{w}$. The whole problem is:
+| $minimize_{\hat{w}}\Vert X \hat{w} - y \Vert_2^2$
+The solution is obtained by requiring e to be orthogonal to C(X), the
+column space of X, i.e. Xᵀ·e=0; hence:
+| Xᵀ·e = 0; e = y-X·ŵ
+| Xᵀ(y-X·ŵ) = 0
+| Xᵀy = XᵀXŵ
+| ŵ = (XᵀX)⁻¹Xᵀy
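A minimal sketch of the normal equations above for the line Cx + D = y, assuming X has rows (xᵢ, 1) so that XᵀX is a 2×2 matrix that can be inverted by hand. The function name `fit_line` and the toy data are illustrative assumptions, not part of the notes.

```python
def fit_line(xs, ys):
    """Solve the 2x2 normal equations (X^T X) w = X^T y for w = (C, D)."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # X has rows (x_i, 1), so X^T X = [[sxx, sx], [sx, n]] and X^T y = (sxy, sy).
    det = sxx * n - sx * sx
    if det == 0:
        raise ValueError("X^T X is singular: all x values coincide")
    # Cramer's rule on the 2x2 system.
    c = (n * sxy - sx * sy) / det
    d = (sxx * sy - sx * sxy) / det
    return c, d


# Toy data lying exactly on y = 2x + 1 (hypothetical example).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
c, d = fit_line(xs, ys)

# Residual e = y - X w; orthogonality to C(X) means X^T e = 0.
residuals = [y - (c * x + d) for x, y in zip(xs, ys)]
xt_e = (sum(x * e for x, e in zip(xs, residuals)), sum(residuals))
print(c, d, xt_e)
```

With more than two features the same equations would normally be solved with a linear algebra library rather than by hand.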
+**** Regularization
+Avoid overfitting by applying constraints to the weight vector.
+Typically the weights are required to be small on average: ~shrinkage~.
+The regularized version of LSE:
+| w* = argmin_w (y-X·w)ᵀ(y-X·w) + λ‖w‖₂²
+Solution:
+| ŵ  = (XᵀX + λI)⁻¹Xᵀy
+This is called ~ridge regression~; it amounts to adding λ to the diagonal of
+XᵀX, which improves the numerical stability of the inversion.
+When sparse solutions are wanted one can use ~lasso~
+(least absolute shrinkage and selection operator),
+which replaces ‖w‖₂² with ‖w‖₁=∑|wᵢ|:
+| w* = argmin_w (y-X·w)ᵀ(y-X·w) + λ‖w‖₁
+Minimizing the norm amounts to assuming that X is affected by an error
+D and minimizing the effect of that error:
+| (X+D)w = Xw + Dw
+It also means introducing a bias and thereby reducing the effect of the
+variance of the error. Plain LSE amplifies small variations in the data:
+it is an unstable regressor.
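A small sketch of the shrinkage effect, assuming a single feature and no intercept, so that the ridge solution ŵ = (XᵀX + λI)⁻¹Xᵀy reduces to a scalar division. The name `ridge_1d` and the data are hypothetical.

```python
def ridge_1d(xs, ys, lam):
    """Ridge solution for one feature without intercept: sum(x*y) / (sum(x*x) + lam)."""
    sxx = sum(x * x for x in xs)              # X^T X (a scalar here)
    sxy = sum(x * y for x, y in zip(xs, ys))  # X^T y
    return sxy / (sxx + lam)


# Toy data on the line y = 2x (hypothetical).
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

# lam = 0 is plain least squares; larger lam shrinks the weight towards 0.
weights = [ridge_1d(xs, ys, lam) for lam in (0.0, 1.0, 14.0)]
print(weights)
```

Here XᵀX = 14, so λ = 14 doubles the denominator and halves the unregularized weight.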
+**** LSE for classification
+| ĉ(x) = 1 if xᵀŵ - t > 0
+| ĉ(x) = 0 if xᵀŵ - t = 0
+| ĉ(x) = -1 if xᵀŵ - t < 0
+That is, the positive class is encoded as 1 and the negative class as -1;
+t is the intercept (decision threshold).
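The three-way decision rule above can be sketched as a sign function on the score xᵀŵ - t. The weights and threshold below are made-up values for illustration, not fitted ones.

```python
def classify(x, w, t):
    """Return 1, 0 or -1 according to the sign of x.w - t."""
    score = sum(xi * wi for xi, wi in zip(x, w)) - t
    # (score > 0) - (score < 0) is the integer sign of the score.
    return (score > 0) - (score < 0)


w, t = (1.0, 1.0), 2.0  # hypothetical weight vector and threshold
labels = [classify(x, w, t) for x in [(2.0, 1.0), (1.0, 1.0), (0.0, 0.0)]]
print(labels)
```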
+** SVM
+Hyperplane:
+| y = ax + b 
+| y -ax -b = 0
+| wᵀx = 0
+- w = (-b -a 1)ᵀ *x* = (1 x y)ᵀ
+- Functional margins: solutions that make no errors
+- Geometric margins: solutions that maximize the distance between the
+  closest points of opposite classes
+*** Functional margin
+Value of the hyperplane at the point xᵢ:
+| f(xᵢ) = w·xᵢ-t
+we can use f(xᵢ)>0 to discriminate between the positive and the negative class
+- Functional margin:
+  | μ(xᵢ) = yᵢ(w·xᵢ-t) = yᵢf(xᵢ)
+  if the example is correctly classified: μ(xᵢ) > 0
+*** Support Vectors
+We can require every instance in the dataset to satisfy:
+| yᵢ(w·xᵢ-t) ≥ 1
+Instances lying on the margin (called ~support vectors~):
+| yᵢ(w·xᵢ-t) = 1
+Geometric margin:
+(x₊-x₋)·$\frac{w}{\Vert{w}\Vert}$
+*** TODO (w₀,w₁) orthogonal
+*** Optimization:
+Margin size:
+| μ = (x₊-x₋)·w/‖w‖
+| x₊·w-t = 1 -> x₊·w = 1+t
+| -(x₋·w-t) = 1 -> x₋·w = t-1
+| $\mu = \frac{1+t-(t-1)}{\Vert{w}\Vert} = \frac{2}{\Vert{w}\Vert}$
+μ must be maximized, which means minimizing ‖w‖:
+| $minimize_{w,t} \frac{1}{2}\Vert{w}\Vert^{2}$
+| subject to yᵢ(w·xᵢ-t)≥1; 1≤i≤n
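A numeric check of the margin formulas, assuming a hand-picked hyperplane (w, t) and four toy points. Nothing here is optimized: it only verifies the constraints yᵢ(w·xᵢ-t) ≥ 1, finds the points where equality holds, and compares 2/‖w‖ with the projection (x₊-x₋)·w/‖w‖.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def functional_margin(w, t, x, y):
    """mu(x) = y * (w.x - t); positive when x is correctly classified."""
    return y * (dot(w, x) - t)


w, t = (1.0, 1.0), 3.0  # hypothetical separating hyperplane
data = [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((1.0, 1.0), -1), ((0.0, 0.0), -1)]

margins = [functional_margin(w, t, x, y) for x, y in data]
# Support vectors are the instances whose functional margin is exactly 1.
support = [x for (x, _), m in zip(data, margins) if abs(m - 1.0) < 1e-12]

norm_w = math.sqrt(dot(w, w))
width = 2.0 / norm_w
# Same quantity computed from one support vector per class.
x_pos, x_neg = support[0], support[1]
projected = dot([p - n for p, n in zip(x_pos, x_neg)], w) / norm_w
print(margins, support, width, projected)
```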
+General form of a constrained optimization problem:
+minimizeₓ: f₀(x)
+subject to: fᵢ(x) ≤ 0     i = 1, ..., m
+            gᵢ(x) = 0     i = 1, ..., p
+Lagrange dual formulation:
+| g(α, υ) = infₓ Λ(x,α,υ) = infₓ(f₀(x) + ∑₁ᵐαᵢfᵢ(x) + ∑₁ᵖυᵢgᵢ(x))
+Duality: a systematic way of deriving non-trivial bounds on an
+optimization problem.
+In convex problems the bound is usually ~tight~, and maximizing
+the bound leads to the same solution as minimizing the original
+function: ~strong duality~.
+The KKT conditions need to hold for strong duality.
+TODO: see the proof in the slides
+
+** Kernels
+A trick used to adapt linear algorithms to non-linear
+hypotheses.
+Idea: a linear decision surface in a transformed space can
+correspond to a non-linear surface in the original space.
+Example:
+| ϕ(x) = (x₁², sqrt(2)x₁x₂, x₂², c)
+| ĉ(x) = sign(w·x-t)
+| ĉ(x) = sign(K(w,x)-t) = sign(ϕ(w)·ϕ(x)-t)
+
+A kernel function is a function K: V×V→R for which there exists a mapping ϕ:V→F,
+with F a Hilbert space, such that:
+K(x,y) = <ϕ(x), ϕ(y)>
+That is, a kernel function computes the inner product of x and y after
+mapping them to a new Hilbert space (possibly
+high-dimensional).
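A quick check of the definition on the example map ϕ(x) = (x₁², √2·x₁x₂, x₂², c): the explicit inner product ϕ(x)·ϕ(y) should equal (x·y)² + c², which can be computed without ever building the feature vectors. The constant c = 1 and the two test points are arbitrary choices.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def phi(x, c=1.0):
    """Explicit feature map for 2-dimensional inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2, c)


def kernel(x, y, c=1.0):
    """Same inner product computed in the original space."""
    return dot(x, y) ** 2 + c * c


x, y = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(y))  # inner product in the feature space
implicit = kernel(x, y)         # kernel evaluation in the input space
print(explicit, implicit)
```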
+
+Kernels give an intuition of similarity (proportionally)
+**** TODO Mercer condition 
+**** Inner product
+generalization of the dot product to other spaces.
+| Symmetric: <x,y> = <y,x>
+| linear in the first argument: <ax+by,z> = a<x,z> + b<y,z>
+| positive definite: <x,x>≥0; <x,x> = 0 ⇔ x = 0
+Kernels are convenient because:
+- linear classifiers can work on non-linear problems
+- they act as a similarity function in a high-dimensional space without
+  computing the feature vectors
+- composition: new kernels can be built from old ones
+
+**** Important kernels
+Polynomial:
+K(x,y) = (x·y)ᵈ or K(x,y) = (x·y+1)ᵈ
+- d = 1 → identity
+- d = 2 → quadratic
+- feature space dimension exponential in d
+
+Gaussian Kernel:
+$K(x,y) = \exp(-\frac{\Vert{x-y}\Vert^2}{2\sigma^2})$
+σ is chosen by cross-validation on a separate, independent set;
+the feature space has infinite dimensionality.
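Both kernels fit in a few lines of plain Python. This sketch assumes the common 2σ² denominator for the Gaussian kernel and exposes the offset of the polynomial kernel as a parameter `c` (0 or 1 in the two variants listed); the function names are illustrative.

```python
import math


def polynomial_kernel(x, y, d=2, c=0.0):
    """(x.y + c)^d; c=0 and c=1 give the two variants in the notes."""
    return (sum(xi * yi for xi, yi in zip(x, y)) + c) ** d


def gaussian_kernel(x, y, sigma=1.0):
    """exp(-||x - y||^2 / (2 sigma^2)); equals 1 when x == y."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


a, b = (1.0, 2.0), (3.0, 4.0)
poly_plain = polynomial_kernel(a, b, d=2)          # (1*3 + 2*4)^2
poly_offset = polynomial_kernel(a, b, d=2, c=1.0)  # (1*3 + 2*4 + 1)^2
same = gaussian_kernel(a, a)
near = gaussian_kernel((0.0, 0.0), (1.0, 1.0))
wide = gaussian_kernel((0.0, 0.0), (1.0, 1.0), sigma=10.0)
print(poly_plain, poly_offset, same, near, wide)
```

A larger σ makes distant points look more similar, which is why σ is typically tuned on held-out data.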
+
 * Meo
+** Concept learning
+Basic assumption: any hypothesis that approximates the target function
+well on the training examples will also approximate it well
+on unseen examples.
+In addition, D is consistent and noise-free, and there exists a hypothesis h that
+describes the target concept c.
+A hypothesis h is a conjunction of constraints on the attributes.
+The number of hypotheses is exponentially large in the number of
+features:
+| {function codomain}^{number of distinct instances}
+- More general hypothesis:
+  let hⱼ, hₖ be two boolean functions (hypotheses) defined on X.
+  hⱼ is said to be at least as general as hₖ, written hⱼ≥hₖ, iff
+  | ∀x∈X: hₖ(x) = 1 → hⱼ(x) = 1
+  The relation ≥ induces a partial order (reflexive, transitive, antisymmetric).
+- Version Space:
+  The version space is the set of hypotheses consistent with the dataset.
+*** Find-S algorithm
+#+BEGIN_SRC
+h ← most specific hyp. in H
+foreach positive example x∈D:
+    foreach aⱼ in h:    (attribute constraint)
+        if x ⊧ aⱼ:
+            continue
+        else:
+            replace aⱼ in h with the next more general constraint satisfied by x
+output h
+#+END_SRC
+Advantages:
+- Hyp. space defined through conjunction of constraints
+- will output most specific hyp. that is consistent
+- will be consistent with negative examples as well
+Disadvantages:
+- there is no way to tell whether the learner has converged to the target
+  concept (it does not know whether its output is the only valid hypothesis)
+- it cannot tell whether the training data is consistent: it ignores negative examples
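A runnable sketch of Find-S for conjunctive hypotheses over discrete attributes. The encoding is an assumption made for the example: `None` means "no value accepted" (the most specific constraint) and `"?"` means "any value accepted". The weather-style data is invented.

```python
def find_s(examples):
    """examples: list of (attribute tuple, bool label). Returns a conjunctive hypothesis."""
    n = len(examples[0][0])
    h = [None] * n  # most specific hypothesis: rejects everything
    for x, label in examples:
        if not label:
            continue  # Find-S ignores negative examples
        for j, value in enumerate(x):
            if h[j] is None:
                h[j] = value  # first positive example: copy its value
            elif h[j] != value:
                h[j] = "?"    # conflicting values: generalize to "any"
    return tuple(h)


# Toy training set (hypothetical attributes: sky, temperature, humidity).
examples = [
    (("sunny", "warm", "high"), True),
    (("sunny", "warm", "normal"), True),
    (("rainy", "cold", "high"), False),
]
h = find_s(examples)
print(h)
```

The negative example is never looked at, which illustrates the second disadvantage listed above.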
+*** Version Space
+We define the Version Space as:
+| VS_{H,D} = {h∈H|Consistent(h,D)}
+| Consistent(h,D) = ∀<x,c(x)>∈D: h(x) = c(x)
+General and specific boundary of VS: the sets G and S of its maximally general/specific members
+| VS_{H,D} = {h∈H| ∃s∈S, ∃g∈G: g≥h≥s}
+**** List then Eliminate
+#+BEGIN_SRC
+Version Space ← list of every hyp. in H
+foreach <x,c(x)> in D:
+    foreach h in Version Space:
+        if h(x) ≠ c(x) : remove h from VS
+output VS
+#+END_SRC
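List-Then-Eliminate is only practical for tiny hypothesis spaces, which makes it easy to sketch. The version below enumerates every conjunctive hypothesis over two invented attributes (with `"?"` meaning "any value") and keeps the consistent ones. As a simplifying assumption it leaves out the empty hypothesis that rejects every instance.

```python
from itertools import product


def matches(h, x):
    """True when instance x satisfies every constraint of hypothesis h."""
    return all(c == "?" or c == v for c, v in zip(h, x))


def list_then_eliminate(domains, examples):
    # Every attribute can be constrained to one of its values or left free.
    hypotheses = list(product(*[list(d) + ["?"] for d in domains]))
    # Keep only hypotheses that agree with the label of every example.
    return [h for h in hypotheses
            if all(matches(h, x) == label for x, label in examples)]


domains = [("sunny", "rainy"), ("warm", "cold")]  # hypothetical attributes
examples = [(("sunny", "warm"), True), (("rainy", "cold"), False)]
version_space = list_then_eliminate(domains, examples)
print(version_space)
```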
+**** Candidate Elimination
+#+BEGIN_SRC
+G ← set of maximally general hyp. in H
+S ← set of maximally specific hyp. in H
+foreach d=<x,c(x)> ∈ D:
+    if d is ⊕:
+        remove from G any hyp. inconsistent with d
+        foreach hyp. s in S inconsistent with d:
+            remove s from S
+            add to S all minimal generalizations h of s such that:
+                - h is consistent with d
+                - some member of G is more general than h
+            (S summarizes all members consistent with the positive examples)
+            remove from S any hyp. more general than another hyp. in S
+    if d is ⊖:
+        remove from S any hyp. inconsistent with d
+        foreach hyp. g in G inconsistent with d:
+            remove g from G
+            add to G all minimal specializations h of g such that:
+                - h is consistent with d
+                - some member of S is more specific than h
+            (G summarizes all members consistent with the negative examples)
+            remove from G any hyp. less general than another hyp. in G
+#+END_SRC
+- converges to the same VS whatever the initial order of D
+- can converge to different VSs if the training set does not contain
+  enough examples
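A compact sketch of Candidate Elimination for conjunctive hypotheses, under the same illustrative encoding as a typical textbook treatment: `None` accepts no value, `"?"` accepts any value. Negative examples specialize G by fixing one free attribute to a value different from the example's. The attribute domains and data are invented, and the code favours readability over efficiency.

```python
def matches(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))


def more_general_or_equal(h1, h2):
    """h1 >= h2: every instance matched by h2 is also matched by h1."""
    if None in h2:
        return True  # h2 matches nothing
    return all(a == "?" or a == b for a, b in zip(h1, h2))


def candidate_elimination(domains, examples):
    n = len(domains)
    S = {tuple([None] * n)}  # maximally specific boundary
    G = {tuple(["?"] * n)}   # maximally general boundary
    for x, label in examples:
        if label:
            G = {g for g in G if matches(g, x)}
            new_S = set()
            for s in S:
                if matches(s, x):
                    new_S.add(s)
                    continue
                # Minimal generalization of s that covers x.
                h = tuple(v if c is None else (c if c == v else "?")
                          for c, v in zip(s, x))
                if any(more_general_or_equal(g, h) for g in G):
                    new_S.add(h)
            # Drop members more general than another member of S.
            S = {s for s in new_S
                 if not any(s != o and more_general_or_equal(s, o) for o in new_S)}
        else:
            S = {s for s in S if not matches(s, x)}
            new_G = set()
            for g in G:
                if not matches(g, x):
                    new_G.add(g)
                    continue
                # Minimal specializations: fix one free attribute to exclude x.
                for j, c in enumerate(g):
                    if c != "?":
                        continue
                    for v in domains[j]:
                        h = g[:j] + (v,) + g[j + 1:]
                        if v != x[j] and any(more_general_or_equal(h, s) for s in S):
                            new_G.add(h)
            # Drop members less general than another member of G.
            G = {g for g in new_G
                 if not any(g != o and more_general_or_equal(o, g) for o in new_G)}
    return S, G


domains = [("sunny", "rainy"), ("warm", "cold")]  # hypothetical attributes
examples = [(("sunny", "warm"), True), (("rainy", "cold"), False)]
S, G = candidate_elimination(domains, examples)
print(S, G)
```

On this toy data the boundaries enclose exactly the hypotheses h with g ≥ h ≥ s for some s in S and g in G.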
+**** Inductive Leap
+We assume that H contains the target concept c, i.e. that c can be
+described by a conjunction of literals.
+Unbiased learner: H expresses every learnable concept, i.e.
+Powerset(X).
+S and G become the two sets of ⊕ and ⊖ examples (with logical connectives, see slides).
+This is futile, because a learner that makes no a priori assumptions
+about the identity of the target concept has no basis for classifying
+unseen instances.
+- Inductive bias:
+  | ∀xᵢ∈X: (B ∧ D_c ∧ xᵢ) ⊧ L(xᵢ,D_c)
+  L(xᵢ, D_c) is the classification assigned by the concept learning
+  algorithm L after training on D_c
+  It allows an inductive system to be turned into a deductive one
+** TODO Path Through hyp. space
+Check what is expected to be known
+** TODO Trees
+** Rules
+Ordered rules are a chain of /if-then-else/.
+#+BEGIN_SRC
+1. Keep growing the rule antecedent by literal conjunction (high purity)
+2. Select the label as the rule consequent
+3. Delete the instance segment from the data, restart from 1
+#+END_SRC
+In trees, purity is measured over the children; in rule learning, purity is
+measured on a single child, the one where the literal is true. The purity
+measures used for trees can be reused, with no need to average.
--- a/todo.org
+++ b/todo.org
@ -1,4 +1,4 @@
-* Machine Learning [2/4]
+* Machine Learning [3/5]
 - [X] Write to her about exam dates
 - [X] Request exam dates
 - [ ] Slides [0/5]
@ -22,10 +22,14 @@
    + [ ] Sum of squared error
    + [ ] Silhouettes
    + [ ] Review kernelization
-- [-] Exercises [1/3]
+- [ ] Esposito [0/3]
+  + [ ] (w_0,w_1) orthogonal to the hyperplane
+  + [ ] proof of Lagrangian duality
+  + [ ] Mercer condition
+- [X] Exercises [3/3]
  - [X] ex1: why min_impurity decrease
-  - [ ] ask Galla`, Marco and Naz what all the exercises are
-  - [ ] linear models.zip?
+  - [X] ask Galla`, Marco and Naz what all the exercises are
+  - [X] linear models.zip?

 * Thesis [18/33]
 - [X] Review Gabriel's inference rules and adjust them with mine