UniTO/anno3/apprendimento_automatico/esercizi/1/.ipynb_checkpoints/coverage_plots-checkpoint.ipynb
Francesco Mecca 84096d29f6 esercizi
2020-07-03 19:08:23 +02:00

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Coverage plots"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import numpy as np\n",
"from matplotlib import pyplot as plt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us consider the following function which applies a linear model to the given data. \n",
"Specifically, given a \"model\" vector containing the model coefficients $(a,b)$ and an $n \\times 2$ \"data\" matrix containing the data points to be classified, the function outputs a boolean vector $\\mathbf{z}$ with $|\\mathbf{z}| = n$, where $z_i$ is `True` if $a \\cdot x_{i,1} + b \\cdot x_{i,2} > 0$ and `False` otherwise."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"def apply_linear_model(model, data):\n",
"    return np.dot(data, np.transpose(model)) > 0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us define `data` by generating $1000$ points drawn uniformly from $\\mathcal{X} = [-100,100]^2$."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[-60, -58],\n",
" [-54, 99],\n",
" [ 95, 99],\n",
" ...,\n",
" [ -3, -80],\n",
" [ 45, -64],\n",
" [ 14, 59]])"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = np.random.randint(-100, 100 + 1, (1000, 2))\n",
"data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let `target_labels` be the labeling obtained by applying `apply_linear_model` with our target model: $4x - y > 0$."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"target_model = [4.,-1.]\n",
"target_labels = apply_linear_model(target_model, data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the `matplotlib.pyplot` module, it is easy to draw the generated points on a 2D plot:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.collections.PathCollection at 0x7f068e473b90>"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"colors = ['r' if l else 'b' for l in target_labels]\n",
"plt.scatter(data[:,0], data[:,1], color=colors)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, let us generate 100 random linear models with coefficients in $[-5,5]$:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 0.9118242 , -0.90045672],\n",
" [-3.94382217, -3.98527811],\n",
" [-1.18697017, 0.80016719],\n",
" [-1.85839332, 4.98952722],\n",
" [ 1.07261938, -0.90803895],\n",
" [-4.53758697, -4.30257501],\n",
" [-1.17461754, -2.39901897],\n",
" [ 2.34774061, -4.72969268],\n",
" [ 2.47747169, 2.77914928],\n",
" [-1.11617222, -4.16037991],\n",
" [ 1.4716821 , 0.88162217],\n",
" [ 1.78297491, -1.44883212],\n",
" [ 1.58199674, 4.6052321 ],\n",
" [ 3.99068967, -4.28056417],\n",
" [-0.16221614, -2.24741286],\n",
" [ 2.32933961, 0.28596881],\n",
" [-4.70438519, 2.44675435],\n",
" [-3.29578251, -3.86079991],\n",
" [-2.15187954, -3.38596945],\n",
" [-1.31823703, 1.6351425 ],\n",
" [-3.13072307, 1.32378202],\n",
" [ 2.76175045, -1.78814111],\n",
" [ 2.73156012, -2.64332344],\n",
" [ 1.64996223, 1.83230706],\n",
" [ 3.31936639, 0.2228602 ],\n",
" [ 2.26694906, 0.63193375],\n",
" [-2.51414076, 4.51560257],\n",
" [ 0.27215252, -1.67282386],\n",
" [-3.64167257, -0.14559744],\n",
" [-4.30424639, -3.37589157],\n",
" [ 3.01745299, 3.51058308],\n",
" [ 1.35223951, -3.06364559],\n",
" [ 4.94253085, -1.60716143],\n",
" [-4.36074161, -3.10693624],\n",
" [ 3.10628154, 3.49373291],\n",
" [-2.74311538, -4.29366027],\n",
" [-1.89979198, 1.41734176],\n",
" [ 1.9159884 , 1.23531441],\n",
" [-2.15457615, 1.1728522 ],\n",
" [ 4.60642972, 3.51823611],\n",
" [ 1.59513489, 0.56356173],\n",
" [-0.32910123, 1.31288732],\n",
" [ 1.36686363, 0.96076635],\n",
" [-3.9091037 , 0.96514774],\n",
" [ 4.37669631, -0.8778982 ],\n",
" [-3.13000071, -2.59206421],\n",
" [ 0.85730862, 3.96159211],\n",
" [ 2.91165311, 2.24727293],\n",
" [ 2.16991404, -3.35593884],\n",
" [-0.38522275, -1.67180888],\n",
" [-1.91436601, 3.62229527],\n",
" [ 1.31583377, -1.93048586],\n",
" [ 0.52322948, 0.91378549],\n",
" [ 0.69736315, 3.05799437],\n",
" [-2.33259618, 4.23093531],\n",
" [-0.01882034, -3.16737335],\n",
" [-1.85567722, -0.16700837],\n",
" [ 4.74309296, 3.57241682],\n",
" [ 0.96709141, -1.3653478 ],\n",
" [-2.98210548, -0.11106027],\n",
" [-3.86461267, 3.62193573],\n",
" [ 2.83976749, 2.94566098],\n",
" [ 3.76245288, -2.64933837],\n",
" [ 4.58809654, 1.23109222],\n",
" [ 4.84968707, -2.75644381],\n",
" [-1.54471238, 4.83523772],\n",
" [ 1.89738986, 3.61006974],\n",
" [ 1.89077461, 3.96448192],\n",
" [ 0.58264712, -3.48158676],\n",
" [-3.70699049, -1.55128007],\n",
" [-1.74431095, -1.26414456],\n",
" [ 4.95881191, -3.89363783],\n",
" [-3.49425476, 4.69333757],\n",
" [-1.13661494, -4.86289907],\n",
" [-0.80047881, -2.36304971],\n",
" [-2.22814782, -1.71573374],\n",
" [ 1.93181752, -2.84184699],\n",
" [-2.01459345, 3.04690045],\n",
" [ 1.77370361, 2.63596514],\n",
" [-1.62391354, 3.9170375 ],\n",
" [-1.16831826, -0.35730506],\n",
" [ 2.81017534, 4.68734215],\n",
" [ 3.50859446, 3.53556171],\n",
" [-3.00404934, -0.31632676],\n",
" [-3.19738369, -0.50324866],\n",
" [-1.14409139, 3.06816086],\n",
" [-4.67354814, 1.12585223],\n",
" [ 2.62801894, -2.11531302],\n",
" [-3.26599429, -2.09618265],\n",
" [-1.77991357, -3.54630238],\n",
" [ 1.83623843, -2.97438757],\n",
" [-1.90333658, 0.66363691],\n",
" [ 2.61705961, 0.10912733],\n",
" [ 1.76458691, 1.21896092],\n",
" [-2.5188483 , 0.77614823],\n",
" [ 1.75016557, 2.0592426 ],\n",
" [ 4.82096292, -0.0816393 ],\n",
" [-2.66030292, -2.54501908],\n",
" [ 1.90560799, -1.66171268],\n",
" [-1.30050042, -1.94071811]])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"models = (np.random.rand(100,2) - 0.5) * 10\n",
"models"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Write a function that, given two lists of labellings, builds the corresponding confusion matrix [[1](#hint1)];\n",
"1. For each model in `models`, plot the [FP,TP] pairs on a scatter plot;\n",
"1. Just by looking at the plot: which is the best model in the pool?\n",
"1. Find the model with the best accuracy [[2](#hint2)] and compare it with the target model: is it close? Is it the model you would have picked visually from the scatter plot?\n",
"1. If everything is ok, you should have found a pretty good model for our data: it fits the data quite well and is quite close to the target model. Did you expect this? If so, why? If not, why not?\n",
"\n",
"<a name=\"hint1\">Hint 1:</a> it may be helpful to have a way to map `True` to 0 and `False` to 1, and to use these values as indices in the confusion matrix. \n",
"\n",
"<a name=\"hint2\">Hint 2:</a> one way to proceed is to build a function `accuracy`, use the `map` function to calculate the accuracies of all the models, and then apply `numpy.argmax` to retrieve the index of the best model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise 1\n",
"Write a function that, given two lists of labellings, builds the confusion matrix."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[416. 99.]\n",
" [ 77. 408.]]\n"
]
}
],
"source": [
"def confusion_matrix(labels1, labels2):\n",
"    assert len(labels1) == len(labels2), \"Label arrays must be of same length\"\n",
"    # Rows index labels1, columns index labels2; True maps to index 0, False to index 1\n",
"    cm = np.zeros((2, 2))\n",
"    for l1, l2 in zip(labels1, labels2):\n",
"        cm[int(not l1), int(not l2)] += 1\n",
"    return cm\n",
"\n",
"print(confusion_matrix(apply_linear_model(target_model, data), apply_linear_model(models[0], data)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise 2\n",
"For each model in `models`, plot the [FP,TP] pairs on a scatter plot.\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.collections.PathCollection at 0x7f065f9cecd0>"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"confusion_matrices = [confusion_matrix(target_labels, apply_linear_model(model, data)) for model in models]\n",
"\n",
"# With True mapped to index 0: cm[1,0] counts false positives, cm[0,0] counts true positives\n",
"fp, tp = [cm[1,0] for cm in confusion_matrices], [cm[0,0] for cm in confusion_matrices]\n",
"plt.scatter(fp, tp)"
]
},
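{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise 3\n",
"By looking at the plot, which is the best model?\n",
"\n",
"Answer: the best model is the one closest to the top-left corner, where TP is high and FP is low. In a coverage plot, models with equal accuracy lie on lines of slope 1, so the most accurate model is the one that maximizes $TP - FP$; the TP/FP ratio is used below as a simple proxy for this visual choice."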
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise 4\n",
"Find the model with the best accuracy and compare it with the target model: is it close? Is it the model you would have picked visually from the scatter plot?"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Plot best: [ 4.37669631 -0.8778982 ], Accuracy best: [ 4.37669631 -0.8778982 ], with accuracy: 0.986\n"
]
}
],
"source": [
"def accuracy(confusion_matrix):\n",
"    # Fraction of examples on the diagonal, i.e. where the two labellings agree\n",
"    return confusion_matrix.trace() / confusion_matrix.sum()\n",
"\n",
"models_acc = list(map(lambda m: (m[0], accuracy(m[1])), zip(models, confusion_matrices)))\n",
"models_acc = sorted(models_acc, key=lambda ma: ma[1], reverse=True)\n",
"\n",
"plot_best = models[np.argmax([t / f for t, f in zip(tp, fp)])]\n",
"accuracy_best = models_acc[0]\n",
"\n",
"print(f'Plot best: {plot_best}, Accuracy best: {accuracy_best[0]}, with accuracy: {accuracy_best[1]}')"
]
},
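{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise 5\n",
"Yes, because the models were generated with coefficients drawn uniformly from $[-5,5]$. The decision boundary $ax + by = 0$ depends only on the direction of $(a,b)$, and the target $(4,-1)$ lies within the sampled range, so a pool of 100 random models is likely to contain one whose direction is close to the target's."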
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.7"
}
},
"nbformat": 4,
"nbformat_minor": 1
}