UniTO/anno3/apprendimento_automatico/esercizi/marco/coverage_plots-checkpoint.ipynb
2020-06-17 20:01:41 +02:00

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Coverage plots"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import numpy as np\n",
"from matplotlib import pyplot as plt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us consider the following function which applies a linear model to the given data. \n",
"Specifically, given a \"model\" vector containing the model coefficients $(a,b)$ and a $n \\times 2$ \"data\" matrix containing the data points to be classified, the function outputs a vector $\\mathbf{z}$, $|\\mathbf{z}| = n$ of booleans where $z_i$ is `True` if $a \\cdot x_{i,1} + b \\cdot x_{i,2} \\geq 0$, it is `False` otherwise."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def apply_linear_model(model, data):\n",
" return np.dot(data, np.transpose(model)) > 0"
]
},
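{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, here is a minimal sketch that applies the function to three hand-picked points (hypothetical values, not part of the generated data) with example coefficients $(4,-1)$. It assumes the cell defining `apply_linear_model` has been run. The third point lies exactly on the line $4x - y = 0$, so the strict comparison used in the code labels it `False`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical hand-picked points, chosen only for illustration\n",
"points = np.array([[1, 0], [0, 1], [1, 4]])\n",
"\n",
"# 4*1 - 0 = 4 > 0; 4*0 - 1 = -1 < 0; 4*1 - 4 = 0, which is not > 0\n",
"print(apply_linear_model([4., -1.], points))  # [ True False False]"
]
},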
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us define `data` by generating $1000$ points drawn uniformly from $\\mathcal{X} = [-100,100]^2$."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ -8, -49],\n",
" [-39, 7],\n",
" [ 48, 95],\n",
" ...,\n",
" [ -2, 7],\n",
" [ 35, 72],\n",
" [ 28, -5]])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = np.random.random_integers(-100,100,[1000,2])\n",
"data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"and let target_labels be the labeling output by applying `apply_linear_model` with our target model: $4x -y > 0$"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"target_model = [4.,-1.]\n",
"target_labels = apply_linear_model(target_model, data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By using matplotlib.pyplot module it is easy to plot the generated points onto a 2D plot:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.collections.PathCollection at 0x1f28b3e7788>"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"colors = ['r' if l else 'b' for l in target_labels]\n",
"plt.scatter(data[:,0], data[:,1], color=colors)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally let us now generate at random 100 linear models with coefficients in $[-5,5]$:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[-2.47471947, -3.5581708 ],\n",
" [ 1.59686449, 1.71893864],\n",
" [ 3.03294605, 4.57263288],\n",
" [ 4.3603124 , -1.19540694],\n",
" [ 1.21976183, 4.09720458],\n",
" [ 0.42857121, 3.20268614],\n",
" [-2.15226242, 2.96225011],\n",
" [ 0.65773544, -2.54676899],\n",
" [-2.79365386, -0.78924628],\n",
" [ 2.90156232, -3.31703275],\n",
" [-0.9849533 , 0.62170036],\n",
" [-4.0396815 , 4.36277095],\n",
" [ 1.34248188, 1.77758129],\n",
" [ 0.38419206, 0.17314725],\n",
" [-2.04665134, 2.10337995],\n",
" [-2.50975771, 3.65315789],\n",
" [ 2.004511 , 2.73918509],\n",
" [ 4.15805913, -4.96182686],\n",
" [ 4.51026319, -1.78429829],\n",
" [-3.31973604, 3.43442154],\n",
" [ 3.52497128, 0.85718807],\n",
" [-3.06163513, -0.86712587],\n",
" [-4.18156322, -2.21571818],\n",
" [-4.59703113, 4.4801163 ],\n",
" [-2.65368229, 4.37023623],\n",
" [ 3.90990454, -2.1420455 ],\n",
" [-3.24752094, -0.40639067],\n",
" [ 0.10755654, 0.48649316],\n",
" [ 2.26731088, 4.73364989],\n",
" [-4.22292673, -0.21284274],\n",
" [-4.66592874, -2.79620572],\n",
" [ 4.63716865, 0.87182744],\n",
" [-4.32406367, 1.10060443],\n",
" [-0.45847 , 0.70180339],\n",
" [ 3.22176576, -2.5364163 ],\n",
" [ 3.80797501, 2.35293627],\n",
" [ 3.36332162, -3.79299501],\n",
" [ 3.99625756, 2.36135165],\n",
" [ 1.20216525, -1.23827528],\n",
" [-3.09694201, 3.9600678 ],\n",
" [-0.64611333, 2.09501923],\n",
" [ 0.99744202, 1.49993523],\n",
" [-3.36391051, -3.90944487],\n",
" [-3.58672509, -4.1088498 ],\n",
" [ 3.46090243, 0.02661214],\n",
" [-1.49631605, -2.28424324],\n",
" [ 1.1089388 , -1.73806817],\n",
" [ 3.30150146, -3.13759682],\n",
" [-4.51293209, 4.08479726],\n",
" [-4.09529163, 4.28334043],\n",
" [-0.7227784 , 0.85683098],\n",
" [-3.54236195, -4.37842609],\n",
" [-1.67857772, 1.18420411],\n",
" [-2.06131565, -3.81118901],\n",
" [-0.94505145, -0.79410051],\n",
" [-1.58100698, -4.40226088],\n",
" [ 3.49623546, 0.98568917],\n",
" [-4.7875311 , 2.46132599],\n",
" [-0.90714606, -4.03370503],\n",
" [-4.04974727, 1.89697029],\n",
" [ 2.3912763 , 4.43535836],\n",
" [ 1.91805621, 3.10706978],\n",
" [ 2.7870542 , -4.76785357],\n",
" [-4.83230806, 0.68706866],\n",
" [ 4.21091682, 2.69235722],\n",
" [ 4.92125435, 1.67552945],\n",
" [-4.17809823, -3.0655279 ],\n",
" [ 1.34522792, -2.11218453],\n",
" [-2.82712946, -3.84431909],\n",
" [ 4.32983019, -0.67660343],\n",
" [ 3.69650316, -2.09533608],\n",
" [-2.46459767, -2.78730998],\n",
" [-0.12911643, 3.03464722],\n",
" [-0.54414414, -4.24446833],\n",
" [ 0.70841166, 0.82220448],\n",
" [-1.21624127, 2.67030582],\n",
" [-4.4511487 , -0.18157221],\n",
" [ 0.54850624, 3.80806515],\n",
" [ 0.41580003, 2.39770318],\n",
" [ 0.78040198, -2.27920522],\n",
" [-0.98993749, -4.66406869],\n",
" [ 2.67850165, 1.2013196 ],\n",
" [-0.85139301, -3.08916589],\n",
" [ 2.00142468, -3.62142984],\n",
" [-0.08136816, 1.76822154],\n",
" [-4.92951601, 0.11860089],\n",
" [-2.36011692, 2.25618495],\n",
" [ 1.60982063, -0.44192244],\n",
" [-2.54853258, -2.17737341],\n",
" [-1.31205757, 2.17528846],\n",
" [ 4.9863995 , -3.99442219],\n",
" [-1.87206871, -2.53218008],\n",
" [ 2.35107436, -4.08841325],\n",
" [ 3.5602568 , -2.39084033],\n",
" [-1.67264783, -2.78819786],\n",
" [ 2.14307079, -1.80908234],\n",
" [-2.47515458, 2.07939336],\n",
" [ 0.34640981, 0.10794752],\n",
" [ 1.0289358 , -1.10048266],\n",
" [-4.92276006, 0.74592667]])"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"models = (np.random.rand(100,2) - 0.5) * 10\n",
"models"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Write a function that, taken two list of labellings build the corresponding confusion matrix [[1](#hint1)];\n",
"1. For each model in `models` plot the [FP,TP] pairs on a scatter plot;\n",
"1. Just looking at the plot: which is the best model in the pool?\n",
"1. Find the model with the best accuracy [[2](#hint2)] and compare it with the target model, is it close? Is it the model you would have picked up visually from the scatter plot?\n",
"1. If everything is ok, you should have found a pretty good model for our data. It fits the data quite well and it is quite close to the target model. Did you expect this? If so, why? If not so, why not?\n",
"\n",
"<a name=\"hint1\">Hint 1:</a> it may be helpful to have a way to map TRUE to 0, FALSE to 1 and to use these values as indices in the confusion matrix. \n",
"\n",
"<a name=\"hint2\">Hint 2:</a> one way to proceed is to build a function `accuracy`, use the `map` function to calculate the accuracies of all the models, and then apply the `numpy.argmax` to retrieve the index of the best model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Es. 1"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"def build_confusion_matrix(labels1, labels2):\n",
" confusion_matrix = np.zeros((2,2))\n",
" for i in range(len(labels1)):\n",
" confusion_matrix[1 - labels1[i], 1 - labels2[i]] += 1\n",
" return confusion_matrix"
]
},
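{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the layout of the matrix explicit, here is a small hand-checkable sketch with made-up labelings (the names `actual` and `predicted` are illustrative assumptions). It assumes the cell defining `build_confusion_matrix` has been run. With the index convention from the hint (`True` to 0, `False` to 1) and the actual labels passed first, entry `[0,0]` counts true positives, `[0,1]` false negatives, `[1,0]` false positives and `[1,1]` true negatives."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Made-up labelings, used only to illustrate the layout of the matrix\n",
"actual = [True, True, True, False]\n",
"predicted = [True, False, False, True]\n",
"\n",
"# Expected counts: TP = 1, FN = 2, FP = 1, TN = 0\n",
"print(build_confusion_matrix(actual, predicted))"
]
},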
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[202. 281.]\n",
" [310. 207.]]\n"
]
}
],
"source": [
"print(build_confusion_matrix(apply_linear_model(target_model, data), apply_linear_model(models[0], data)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Es. 2\n"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.collections.PathCollection at 0x1f28b4e2508>"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"fp = []\n",
"tp = []\n",
"\n",
"for model in models:\n",
" confusion = build_confusion_matrix(target_labels, apply_linear_model(model, data))\n",
" fp.append(confusion[1,0])\n",
" tp.append(confusion[0,0])\n",
"plt.scatter(fp, tp)"
]
},
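{
"cell_type": "markdown",
"metadata": {},
"source": [
"In a coverage plot the axes are the absolute counts FP and TP, so every model falls inside a box of width $Neg$ and height $Pos$ (the numbers of negative and positive examples). Because $TN = Neg - FP$, accuracy equals $(TP - FP + Neg)/(Pos + Neg)$ and is therefore constant along lines of slope 1. The following sketch, which assumes `fp`, `tp` and `target_labels` from the cells above, redraws the scatter plot with axis labels and a few such lines; the spacing of 200 between lines is an arbitrary choice."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from matplotlib import pyplot as plt\n",
"\n",
"pos = int(np.sum(target_labels))   # number of positive examples\n",
"neg = len(target_labels) - pos     # number of negative examples\n",
"\n",
"plt.scatter(fp, tp)\n",
"\n",
"# Each dotted line satisfies TP - FP = offset, i.e. constant accuracy\n",
"xs = np.array([0, neg])\n",
"for offset in range(-neg, pos + 1, 200):\n",
"    plt.plot(xs, xs + offset, color='gray', linestyle=':', linewidth=0.8)\n",
"\n",
"plt.xlim(0, neg)\n",
"plt.ylim(0, pos)\n",
"plt.xlabel('FP')\n",
"plt.ylabel('TP')\n",
"plt.show()"
]
},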
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Es. 3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Il modello migliore è quello in alto a sinistra (max TP/FP)"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 4.3603124 , -1.19540694])"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"models[np.argmax([t / f for t, f in zip(tp, fp)])]"
]
},
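{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ratio TP/FP is undefined for a model with no false positives, and in general it does not rank models by accuracy. A possible alternative, sketched below under the assumption that `tp`, `fp` and `models` come from the cells above, is to rank by the difference TP - FP: since the number of negatives is fixed, maximizing TP - FP is equivalent to maximizing TP + TN, and hence accuracy. The name `best_by_difference` is introduced here only for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Element-wise difference TP - FP for every model in the pool\n",
"difference = np.asarray(tp) - np.asarray(fp)\n",
"\n",
"# Index of the model lying on the highest line of slope 1\n",
"best_by_difference = np.argmax(difference)\n",
"print(models[best_by_difference])"
]
},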
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Es. 4\n"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"def accuracy(tp, tn, total):\n",
" return (tp + tn) / total"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"model: [ 4.3603124 -1.19540694] accuracy: 0.995\n"
]
}
],
"source": [
"accuracies = []\n",
"\n",
"for model in models:\n",
" confusion = build_confusion_matrix(target_labels, apply_linear_model(model, data))\n",
" accuracies.append(accuracy(confusion[0,0], confusion[1,1], 1000))\n",
"\n",
"print(\"model: \", models[np.argmax(accuracies)], \" accuracy: \", accuracies[np.argmax(accuracies)])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Il modello è lo stesso predetto dalla plot"
]
},
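{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to quantify how close the selected model is to the target is to compare directions: multiplying $(a,b)$ by a positive constant does not change the sign of $a x + b y$, so only the direction of the coefficient vector affects the labeling. The sketch below assumes `models`, `accuracies` and `target_model` from the cells above; `angle_between` is a hypothetical helper introduced here for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def angle_between(u, v):\n",
"    # Angle in degrees between two 2D coefficient vectors\n",
"    u = np.asarray(u, dtype=float)\n",
"    v = np.asarray(v, dtype=float)\n",
"    cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))\n",
"    # Clip to guard against rounding slightly outside [-1, 1]\n",
"    return np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0)))\n",
"\n",
"best_model = models[np.argmax(accuracies)]\n",
"print(angle_between(best_model, target_model))"
]
},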
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Es. 5"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Mi aspettavo di trovare un modello con un'accuracy alta ma non come quella trovata (0.995), perchè su 100 modelli, con due variabili comprese tra 5 e -5, generati con una funzione random uniforme mi aspetto dei valori vicini a quelli target."
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.5"
}
},
"nbformat": 4,
"nbformat_minor": 1
}