aa

2020-07-01 19:04:07 +02:00 · 2020-07-01 19:04:07 +02:00 · 009ac7e338
commit 009ac7e338
parent 3498912e21
9 changed files with 240 additions and 4 deletions
--- a/anno3/apprendimento_automatico/ltximg/org-ltximg_1f0aebebe540ea3c260f67e2a3efea40555b5304.png
+++ b/anno3/apprendimento_automatico/ltximg/org-ltximg_1f0aebebe540ea3c260f67e2a3efea40555b5304.png
--- a/anno3/apprendimento_automatico/ltximg/org-ltximg_2e111ee3f04bed792b496c613a0dba3edf7f77c5.png
+++ b/anno3/apprendimento_automatico/ltximg/org-ltximg_2e111ee3f04bed792b496c613a0dba3edf7f77c5.png
--- a/anno3/apprendimento_automatico/ltximg/org-ltximg_30d47ec25a0e2d20c996460948266164368f6aa1.png
+++ b/anno3/apprendimento_automatico/ltximg/org-ltximg_30d47ec25a0e2d20c996460948266164368f6aa1.png
--- a/anno3/apprendimento_automatico/ltximg/org-ltximg_53d15e1ac5806269a4b64985d4443ab7792d83aa.png
+++ b/anno3/apprendimento_automatico/ltximg/org-ltximg_53d15e1ac5806269a4b64985d4443ab7792d83aa.png
--- a/anno3/apprendimento_automatico/ltximg/org-ltximg_7dc62399a4819c1004baba085ef0df7703e51d49.png
+++ b/anno3/apprendimento_automatico/ltximg/org-ltximg_7dc62399a4819c1004baba085ef0df7703e51d49.png
--- a/anno3/apprendimento_automatico/ltximg/org-ltximg_d2ea2cc0b571cf8184fee44e817611bff7ac2753.png
+++ b/anno3/apprendimento_automatico/ltximg/org-ltximg_d2ea2cc0b571cf8184fee44e817611bff7ac2753.png
--- a/anno3/apprendimento_automatico/ltximg/org-ltximg_e3ca34c47825568fd9558fd9d1db191e9a55e61e.png
+++ b/anno3/apprendimento_automatico/ltximg/org-ltximg_e3ca34c47825568fd9558fd9d1db191e9a55e61e.png
--- a/anno3/apprendimento_automatico/preparazione.org
+++ b/anno3/apprendimento_automatico/preparazione.org
@ -113,4 +113,236 @@ Given an unlabelled dataset D, find:
 - Confidence: support(a∪b)/support(a)
 ** Models
 *** Linear Models
 **** Best fitting line
 Cx + D = y
 X w = y in matrix form, w = (C D)ᵀ
 If X is square and full rank: w = X⁻¹·y, but in general X is not
 invertible.
 | Error: ‖e‖₂ = ‖y-p‖₂ = (∑ᵢ(yᵢ-pᵢ)²)^{1/2}
 We can frame this as the problem of minimizing the norm of e, where
 p = X·$\hat{w}$. The whole problem is:
 | $minimize_{\hat{w}}\Vert X \hat{w} - y \Vert_2^2$
 The solution is obtained by requiring e to be orthogonal to C(X), the
 column space of X, i.e. Xᵀ·e=0; hence:
 | Xᵀ·e = 0; e = y-X·ŵ
 | Xᵀ(y-X·ŵ) = 0
 | Xᵀy = XᵀXŵ
 | ŵ = (XᵀX)⁻¹Xᵀy
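 The normal equations above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the toy data and the variable names (=x=, =X=, =w_hat=) are illustrative and not part of the notes.

```python
import numpy as np

# Hypothetical toy data: points lying exactly on the line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Design matrix X with a column of ones, so that w = (C, D)^T.
X = np.column_stack([x, np.ones_like(x)])

# Normal equations: solve (X^T X) w = X^T y rather than inverting X^T X.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The residual e = y - X w_hat is orthogonal to the columns of X.
e = y - X @ w_hat
```

 Solving the linear system is usually preferred over forming (XᵀX)⁻¹ explicitly, although both correspond to the same formula.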
 **** Regularization
 Avoid overfitting by applying constraints to the weight vector.
 Generally the weights are kept small on average: ~shrinkage~.
 The regularized version of LSE:
 | w* = argmin_w (y-X·w)ᵀ(y-X·w) + λ‖w‖₂²
 Solution:
 | ŵ  = (XᵀX + λI)⁻¹Xᵀy
 This is called ~ridge regression~; it amounts to adding λ to the
 diagonal of XᵀX, which improves the numerical stability of the inversion.
 ~Lasso~ (least absolute shrinkage and selection operator) can be used
 instead when sparse solutions are wanted;
 it replaces ‖w‖₂² with ‖w‖₁=∑|wᵢ|:
 | w* = argmin_w (y-X·w)ᵀ(y-X·w) + λ‖w‖₁
 Minimizing the norm amounts to assuming that X is affected by an error
 D and minimizing the effect of that error:
 | (X+D)w = Xw + Dw
 It also means introducing a bias and thereby reducing the effect of the
 variance of the error. Plain LSE amplifies small variations in the data:
 it is an unstable regressor.
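 A small sketch of the closed-form ridge solution, assuming NumPy; the helper name =ridge_fit= and the synthetic data are hypothetical. It only illustrates the shrinkage effect: a larger λ yields a weight vector with a smaller norm.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution w = (X^T X + lam * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Hypothetical synthetic data: a noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)

w_ols = ridge_fit(X, y, lam=0.0)     # lam = 0 recovers plain least squares
w_ridge = ridge_fit(X, y, lam=10.0)  # a larger lam shrinks the weights
```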
 **** LSE for classification
 | ĉ(x) = 1 if xᵀŵ - t > 0 
 | ĉ(x) = 0 if xᵀŵ - t = 0 
 | ĉ(x) = -1 if xᵀŵ - t < 0 
 That is, the positive class is represented as 1 and the negative class as -1;
 t represents the intercept.
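 A minimal sketch of this classifier, assuming NumPy. The 1-D toy data and the names =Xb=, =w_full= and =classify= are illustrative; the intercept is learned by appending a constant column, which is one common choice and not necessarily the one used in the course.

```python
import numpy as np

# Hypothetical 1-D toy data: negatives on the left, positives on the right.
X = np.array([[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# Append a constant column so the intercept is learned together with w;
# the threshold t is then minus the weight of that column.
Xb = np.column_stack([X, np.ones(len(X))])
w_full = np.linalg.lstsq(Xb, y, rcond=None)[0]
w, t = w_full[:-1], -w_full[-1]

def classify(x):
    """Return 1, 0 or -1 according to the sign of x^T w - t."""
    return int(np.sign(x @ w - t))
```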
 ** SVM
 Hyperplane:
 | y = ax + b 
 | y -ax -b = 0
 | wᵀx = 0
 - w = (-b -a 1)ᵀ *x* = (1 x y)ᵀ
 - Functional margins: solutions that make no errors
 - Geometric margins: solutions that maximize the distance between the
  closest points of opposite classes
 *** Functional margin
 Value of the hyperplane at the point xᵢ:
 | f(xᵢ) = w·xᵢ-t
 we can use f(xᵢ)>0 to discriminate between the positive and the negative class
 - Functional margin:
  | μ(xᵢ) = yᵢ(w·xᵢ-t) = yᵢf(xᵢ)
  if the example is correctly classified: μ(xᵢ) > 0
 *** Support Vectors
 We can require every instance in the dataset to satisfy:
 | yᵢ(w·xᵢ-t) ≥ 1
 Instances on the margin, closest to the decision boundary (called ~support vectors~):
 | yᵢ(w·xᵢ-t) = 1
 Geometric margin:
 (x₊-x₋)·$\frac{w}{\Vert{w}\Vert}$
 *** TODO (w₀,w₁) orthogonal
 *** Optimization:
 Margin size:
 | μ = (x₊-x₋)·w/‖w‖
 | x₊·w-t = 1 -> x₊·w = 1+t
 | -(x₋·w-t) = 1 -> x₋·w = t-1
 | $\mu = \frac{1+t-(t-1)}{\Vert{w}\Vert} = \frac{2}{\Vert{w}\Vert}$
 μ must be maximized, which means minimizing ‖w‖:
 | $minimize_{w,t} \frac{1}{2}\Vert{w}\Vert^{2}$
 | subject to yᵢ(w·xᵢ-t)≥1; 1≤i≤n
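 The margin size 2/‖w‖ derived above can be verified on a concrete case. This is a sketch with a hypothetical hyperplane and two hypothetical support vectors chosen so that w·x₊-t = 1 and w·x₋-t = -1; the helper =dot= is illustrative.

```python
import math

def dot(a, b):
    """Dot product of two equal-length tuples."""
    return sum(ai * bi for ai, bi in zip(a, b))

w, t = (1.0, 1.0), 0.0   # hypothetical hyperplane w.x - t = 0
x_pos = (0.5, 0.5)       # support vector with w.x - t = 1
x_neg = (-0.5, -0.5)     # support vector with w.x - t = -1

norm_w = math.sqrt(dot(w, w))
diff = tuple(p - n for p, n in zip(x_pos, x_neg))

# Geometric margin: projection of (x+ - x-) onto the unit normal w/||w||.
margin = dot(diff, w) / norm_w
```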
 General form of a constrained optimization problem:
 minimizeₓ: f₀(x)
 subject to: fᵢ(x) ≤ 0     i = 1, ..., m
             gᵢ(x) = 0     i = 1, ..., p
 Lagrange dual formulation:
 | g(α, υ) = infₓ Λ(x,α,υ) = infₓ(f₀(x) + ∑₁ᵐαᵢfᵢ(x) + ∑₁ᵖυᵢgᵢ(x))
 Duality: a systematic way to build non-trivial bounds for an
 optimization problem.
 In convex problems the bound is usually ~tight~, and maximizing
 the bound leads to the same solution as minimizing the original
 function: ~strong duality~.
 The KKT conditions need to hold for strong duality.
 TODO: See the proof in the slides
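 As an illustration of the dual formulation, here is a sketch of the standard derivation applied to the hard-margin SVM problem stated earlier. It is not the proof from the slides; the Lagrangian is written Λ, and the multipliers αᵢ ≥ 0 are attached to the constraints 1 - yᵢ(w·xᵢ-t) ≤ 0.

```latex
\begin{aligned}
\Lambda(w,t,\alpha) &= \tfrac{1}{2}\Vert w\Vert^{2}
  - \sum_{i=1}^{n}\alpha_i\,\bigl(y_i(w\cdot x_i - t) - 1\bigr),
  \qquad \alpha_i \ge 0\\
\frac{\partial\Lambda}{\partial w} = 0 &\;\Rightarrow\; w = \sum_{i=1}^{n}\alpha_i y_i x_i\\
\frac{\partial\Lambda}{\partial t} = 0 &\;\Rightarrow\; \sum_{i=1}^{n}\alpha_i y_i = 0\\
g(\alpha) &= \sum_{i=1}^{n}\alpha_i
  - \tfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j\,(x_i\cdot x_j)
\end{aligned}
```

 The dual problem maximizes g(α) subject to αᵢ ≥ 0 and ∑ᵢαᵢyᵢ = 0. The data appears only through the dot products xᵢ·xⱼ.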
 ** Kernels
 Trick used to adapt linear algorithms to non-linear hypotheses.
 Idea: a linear decision surface in a transformed space can
 correspond to a non-linear surface in the original space.
 Example:
 | ϕ(x) = (x₁², sqrt(2)x₁x₂, x₂², c)
 | ĉ(x) = sign(w·x-t)
 | ĉ(x) = sign(K(w,x)-t) = sign(ϕ(w)·ϕ(x)-t)
 A kernel function is a function K: V×V→R for which there exists a
 mapping ϕ:V→F, with F a Hilbert space, such that:
 K(x,y) = <ϕ(x), ϕ(y)>
 That is, a kernel function computes the inner product of x and y after
 mapping them into a new Hilbert space (possibly high-dimensional).
 Kernels give a notion of (proportional) similarity between their arguments.
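 For the example mapping ϕ(x) = (x₁², sqrt(2)x₁x₂, x₂², c), the inner product in the transformed space equals (x·y)² + c², which can be computed without building ϕ. A short sketch that checks this identity; the function names and the choice c = 1 are illustrative assumptions.

```python
import math

def phi(x, c=1.0):
    """Explicit feature map (x1^2, sqrt(2)*x1*x2, x2^2, c) for 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2, c)

def dot(a, b):
    """Dot product of two equal-length tuples."""
    return sum(ai * bi for ai, bi in zip(a, b))

def K(x, y, c=1.0):
    """Same quantity computed in the original space: (x.y)^2 + c^2."""
    return dot(x, y) ** 2 + c * c

x, y = (1.0, 2.0), (3.0, -1.0)
```

 Here =dot(phi(x), phi(y))= and =K(x, y)= agree, but =K= never materializes the 4-dimensional vectors.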
 **** TODO Mercer condition 
 **** Inner product
 generalization of the dot product to other spaces.
 | Symmetric: <x,y> = <y,x>
 | linear in the first argument: <ax+by,z> = a<x,z> + b<y,z>
 | positive definite: <x,x>≥0; <x,x> = 0 ⇔ x = 0
 Kernels are convenient because:
 - linear classifiers can work on non-linear problems
 - similarity function in a high-dim. space without computing the feature
  vectors
 - composition: new kernels from old ones
 **** Important kernels
 Polynomial:
 K(x,y) = (x·y)ᵈ or K(x,y) = (x·y+1)ᵈ
 - d = 1 → identity
 - d = 2 → quadratic
 - the feature space is exponential in d
 Gaussian kernel:
 $K(x,y) = \exp(-\frac{\Vert{x-y}\Vert^2}{2\sigma^2})$
 σ is chosen by cross-validation on a separate, independent set;
 the feature space is infinite-dimensional.
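 A minimal sketch of the Gaussian kernel, assuming the common parameterization with 2σ² in the denominator; the function name and default σ are illustrative.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """exp(-||x - y||^2 / (2 * sigma^2)) for equal-length sequences."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))
```

 The value is 1 for identical points and decays towards 0 as the points move apart, which matches the reading of a kernel as a similarity measure.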
 * Meo
 ** Concept learning
 Basic assumption: any hypothesis that approximates the target function
 well on the training examples will also approximate it well on
 unseen examples.
 Moreover, D is consistent and noise-free, and there exists a hypothesis h
 that describes the target concept c.
 A hypothesis h is a conjunction of constraints on the attributes.
 The number of hypotheses is exponentially large in the number of
 features:
 | |codomain of the function|^{number of distinct instances}
 - More general hypothesis:
  let hⱼ, hₖ be two boolean functions (hypotheses) defined over X.
  hⱼ is said to be at least as general as hₖ, written hⱼ≥hₖ, iff
  | ∀x∈X: hₖ(x) = 1 → hⱼ(x) = 1
  The relation ≥ induces a partial order (reflexive, transitive, antisymmetric).
 - Version Space:
  The version space is the set of hypotheses consistent with the dataset.
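 The generality relation can be checked directly from its definition by enumerating X. A sketch, assuming hypotheses are tuples of attribute values where ="?"= accepts any value; this encoding and the two-attribute instance space are illustrative choices.

```python
from itertools import product

def matches(h, x):
    """True if instance x satisfies every attribute constraint of h."""
    return all(hj == "?" or hj == xj for hj, xj in zip(h, x))

def more_general_or_equal(hj, hk, X):
    """hj >= hk iff every instance accepted by hk is also accepted by hj."""
    return all(matches(hj, x) for x in X if matches(hk, x))

# Hypothetical instance space with two attributes.
X = list(product(("sunny", "rainy"), ("warm", "cold")))
```

 With this encoding =("sunny", "?")= is more general than =("sunny", "warm")=, while =("sunny", "?")= and =("?", "warm")= are not comparable, which is why the order is only partial.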
 *** Find-S algorithm
 #+BEGIN_SRC
 h ← most specific hyp. in H
 foreach positive example x∈D:
    foreach aⱼ in h:    (attribute constraint)
        if x⊧aⱼ:
            continue
        else:
            replace aⱼ in h with the next more general constraint satisfied by x
 output h
 #+END_SRC
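 The pseudocode above can be sketched in Python. The encoding is an assumption: =None= stands for the most specific hypothesis, ="?"= for "any value", and the toy dataset is hypothetical, in the style of the classic EnjoySport example.

```python
def find_s(examples):
    """examples: list of (attribute_tuple, label) pairs, label True for positive."""
    h = None  # None stands for the most specific hypothesis (accepts nothing)
    for x, label in examples:
        if not label:
            continue  # Find-S ignores negative examples
        if h is None:
            h = list(x)  # first positive example: copy its attribute values
        else:
            # Generalize every constraint not satisfied by x to "?" (any value).
            h = [hj if hj == xj else "?" for hj, xj in zip(h, x)]
    return h

# Hypothetical toy dataset.
D = [
    (("sunny", "warm", "normal", "strong"), True),
    (("sunny", "warm", "high", "strong"), True),
    (("rainy", "cold", "high", "strong"), False),
    (("sunny", "warm", "high", "weak"), True),
]
h = find_s(D)
```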
 Advantages:
 - Hyp. space defined through conjunction of constraints
 - will output the most specific hyp. that is consistent with the positive examples
 - will be consistent with negative examples as well, provided c∈H and the data is correct
 Disadvantages:
 - there is no way to know whether the learner has converged to the target
  concept (it cannot tell whether it found the only valid hypothesis)
 - it cannot tell whether the training data is consistent: it ignores negative examples
 *** Version Space
 We define the Version Space as:
 | VSₕ_D = {h∈H|Consistent(h,D)}
 | Consistent(h,D) = ∀<x,c(x)>∈D: h(x) = c(x)
 General and specific boundary of VS: set of maximally g/s members
 | VSₕ_D = {h∈H| ∃s∈S, ∃g∈G: g≥h≥s}
 **** List then Eliminate
 #+BEGIN_SRC
 Version Space ← list of every hyp. in H
 foreach <x,c(x)> in D:
    foreach h in Version Space:
        if h(x) ≠ c(x) : remove h from VS
 output VS
 #+END_SRC
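 A brute-force sketch of List-then-Eliminate for small attribute domains. It assumes conjunctive hypotheses encoded as tuples with ="?"= for "any value" and, for brevity, leaves out the empty hypothesis that rejects every instance; the function names and the data are hypothetical.

```python
from itertools import product

def matches(h, x):
    """True if instance x satisfies every attribute constraint of h."""
    return all(hj == "?" or hj == xj for hj, xj in zip(h, x))

def list_then_eliminate(domains, examples):
    # Enumerate every conjunctive hypothesis: each attribute is a value or "?".
    version_space = list(product(*[list(d) + ["?"] for d in domains]))
    for x, label in examples:
        # Keep only the hypotheses that classify this example correctly.
        version_space = [h for h in version_space if matches(h, x) == label]
    return version_space

domains = [("sunny", "rainy"), ("warm", "cold")]
D = [(("sunny", "warm"), True), (("rainy", "cold"), False)]
vs = list_then_eliminate(domains, D)
```

 Enumerating H is only feasible for tiny hypothesis spaces, which is the motivation for representing the version space through its S and G boundaries instead.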
 **** Candidate Elimination
 #+BEGIN_SRC
 G ← max. general hyp.
 S ← max. specific hyp.
 foreach d=<x,c(x)> ∈ D:
    if d is ⊕:
        remove from G any hyp. inconsistent with d
        foreach hyp. s in S inconsistent with d:
            remove s from S
            add to S all minimal generalizations h of s such that:
                - h is consistent with d
                - some member of G is more general than h
            (S is a summary of all members cons. with positive examples)
            remove from S any hyp. more general than another hyp. in S
    if d is ⊖:
        remove from S any hyp. inconsistent with d
        foreach hyp. g in G inconsistent with d:
            remove g from G
            add to G all minimal specializations h of g such that:
                - h is consistent with d
                - some member of S is more specific than h
            (G is a summary of all members cons. with negative examples)
            remove from G any hyp. less general than another hyp. in G
 #+END_SRC
 - converges to the same VS regardless of the order of the examples in D
 - can converge to different VSs if the training set does not contain
  enough examples
 **** Inductive Leap
 We assume that H contains the target concept c, i.e. that c can be
 described by a conjunction of literals.
 Unbiased learner: H expresses every learnable concept, i.e.
 Powerset(X).
 S and G are then just the two sets of ⊕ and ⊖ examples (with logical connectives, see slides).
 This is futile, because a learner that makes no a priori assumptions
 about the identity of the target concept has no basis for classifying
 unseen instances.
 - Inductive bias:
  | ∀xᵢ∈X: (B ∧ D_c ∧ xᵢ) ⊧ L(xᵢ,D_c)
  L(xᵢ, D_c) is the classification assigned by the concept learning
  algorithm L after training on D_c
  It makes it possible to turn an inductive system into a deductive one
 ** TODO Path Through hyp. space
 Check what the lecturer wants to know
 ** TODO Trees
 ** Rules
 Ordered rules are a chain of /if-then-else/.
 #+BEGIN_SRC
 1. Keep growing the rule antecedent by literal conjunction (high purity)
 2. Select the label as the rule consequent
 3. Delete the instance segment from the data, restart from 1
 #+END_SRC
 In trees, purity is measured over the children; in rule learning,
 purity concerns a single child: the one where the literal is true. The
 purity measures used for trees can be reused, but without averaging.
--- a/todo.org
+++ b/todo.org
@ -1,4 +1,4 @@
-* Machine Learning [2/4]
+* Machine Learning [3/5]
 - [X] Write to her about the exam dates
 - [X] Request exam dates
 - [ ] Slides [0/5]
@ -22,10 +22,14 @@
    + [ ] Sum of squared error
    + [ ] Silhouettes
    + [ ] Review kernelization
- [-] Exercises [1/3]
+- [ ] Esposito [0/3]
  + [ ] (w_0,w_1) orthogonal to the hyperplane
  + [ ] proof of Lagrangian duality
  + [ ] Mercer condition
 - [X] Exercises [3/3]
  - [X] ex1: why min_impurity decrease
-  - [ ] ask Galla`, Marco and Naz for the full list of exercises
+  - [X] ask Galla`, Marco and Naz for the full list of exercises
-  - [ ] linear models.zip?
+  - [X] linear models.zip?
 * Thesis [18/33]
 - [X] Review Gabriel's inference rules and adjust them with mine