411 lines
13 KiB
Text
411 lines
13 KiB
Text
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Experimenting with least squares and its variants"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"%matplotlib inline\n",
|
|
"\n",
|
|
"from sklearn import datasets\n",
|
|
"from scipy.optimize import fmin_bfgs\n",
|
|
"import numpy as np\n",
|
|
"from numpy.linalg import norm\n",
|
|
"from numpy.linalg import inv\n",
|
|
"from numpy import transpose, identity\n",
|
|
"from numpy import zeros"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Data preparation"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"boston = datasets.load_boston()\n",
|
|
"data = np.array(boston.data)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"The boston dataset is one of the standard regression problems used to experiment with learning algorithms. Below you can find the dataset description"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
".. _boston_dataset:\n",
|
|
"\n",
|
|
"Boston house prices dataset\n",
|
|
"---------------------------\n",
|
|
"\n",
|
|
"**Data Set Characteristics:** \n",
|
|
"\n",
|
|
" :Number of Instances: 506 \n",
|
|
"\n",
|
|
" :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.\n",
|
|
"\n",
|
|
" :Attribute Information (in order):\n",
|
|
" - CRIM per capita crime rate by town\n",
|
|
" - ZN proportion of residential land zoned for lots over 25,000 sq.ft.\n",
|
|
" - INDUS proportion of non-retail business acres per town\n",
|
|
" - CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)\n",
|
|
" - NOX nitric oxides concentration (parts per 10 million)\n",
|
|
" - RM average number of rooms per dwelling\n",
|
|
" - AGE proportion of owner-occupied units built prior to 1940\n",
|
|
" - DIS weighted distances to five Boston employment centres\n",
|
|
" - RAD index of accessibility to radial highways\n",
|
|
" - TAX full-value property-tax rate per $10,000\n",
|
|
" - PTRATIO pupil-teacher ratio by town\n",
|
|
" - B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town\n",
|
|
" - LSTAT % lower status of the population\n",
|
|
" - MEDV Median value of owner-occupied homes in $1000's\n",
|
|
"\n",
|
|
" :Missing Attribute Values: None\n",
|
|
"\n",
|
|
" :Creator: Harrison, D. and Rubinfeld, D.L.\n",
|
|
"\n",
|
|
"This is a copy of UCI ML housing dataset.\n",
|
|
"https://archive.ics.uci.edu/ml/machine-learning-databases/housing/\n",
|
|
"\n",
|
|
"\n",
|
|
"This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.\n",
|
|
"\n",
|
|
"The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic\n",
|
|
"prices and the demand for clean air', J. Environ. Economics & Management,\n",
|
|
"vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics\n",
|
|
"...', Wiley, 1980. N.B. Various transformations are used in the table on\n",
|
|
"pages 244-261 of the latter.\n",
|
|
"\n",
|
|
"The Boston house-price data has been used in many machine learning papers that address regression\n",
|
|
"problems. \n",
|
|
" \n",
|
|
".. topic:: References\n",
|
|
"\n",
|
|
" - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.\n",
|
|
" - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.\n",
|
|
"\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"print(boston.DESCR)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"First step to apply the formulae we learnt during the lectures is to rewrite the dataset in homogeneous coordinates (i.e., we append a column of 1 to the matrix containing the examples):"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"t = np.ones(len(data)).reshape(len(data),1)\n",
|
|
"data = np.append(data, t, 1)\n",
|
|
"target = np.array(boston.target)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We now divide the data into a training set $X$ and a test set $X_\\textrm{test}$."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"X,y = data[0:400,:], target[0:400]\n",
|
|
"X_test, y_test = data[400:,:], target[400:]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Exercise"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"1. Calculate the least square solution (to the regression problem outlined above) and evaluate its performances on the training set and on the test set.\n",
|
|
"1. Calculate the ridge regression solution (set lambda to 0.01) and evaluate its performances on the training set and on test set.\n",
|
|
"1. Calculate the lasso regression solution and evaluate its performances on the training set and on the test set."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Notes"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"- Here it follows a list of functions you may want to use (the required packages are already imported at the beginning of this notebook) along with a very brief explanation of their purpose (`help(nomefun)` will provide you more information about function `nomefun`):\n",
|
|
" - `transpose`: matrix transposition (e.g., `transpose(X)`)\n",
|
|
" - `dot`: matrix multiplication (e.g., `X.dot(X2)`) \n",
|
|
" - `inv`: matrix inversion (e.g., `inv(X)`)\n",
|
|
"- to solve the lasso problem you will need to perform a numerical minimization of the associated loss function (as you know, a closed form solution does not exist). There are many numerical optimization algorithms available in the scipy package. My suggestion is to use `fmin_bfgs`. Here it follows an example of how to use it:\n",
|
|
" ```python\n",
|
|
" def f(w):\n",
|
|
" return w[0]**2 + w[1]**2 + w[0] + w[1]\n",
|
|
" \n",
|
|
" w = fmin_bfgs(f, [0,0])\n",
|
|
" ```\n",
|
|
" note that the function may (and should) reference your data variables (i.e., $X$ and $y$).\n",
|
|
"- to evaluate the performances of your solutions use the $S$ statistic:\n",
|
|
" $$\n",
|
|
" S = \\sqrt{ \\frac{1}{n} \\sum_{i=1}^n (y_i' - y_i)^2 }\n",
|
|
" $$\n",
|
|
" where $y'_i$ is your model prediction for the i-th example, and $n$ is the number of examples."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"least squares: $(X^T X)^{-1}X^T y $"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 6,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"least_squares = lambda x,y: inv(x.T.dot(x)).dot(x.T.dot(y))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 7,
|
|
"metadata": {
|
|
"scrolled": true
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"array([-1.91246374e-01, 4.42289967e-02, 5.52207977e-02, 1.71631351e+00,\n",
|
|
" -1.49957220e+01, 4.88773025e+00, 2.60921031e-03, -1.29480799e+00,\n",
|
|
" 4.84787214e-01, -1.54006673e-02, -8.08795026e-01, -1.29230427e-03,\n",
|
|
" -5.17953791e-01, 2.86725996e+01])"
|
|
]
|
|
},
|
|
"execution_count": 7,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"least_squares(X, y)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Ridge regression: ŵ = (XᵀX + λI)⁻¹Xᵀy"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 8,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"def ridge_regression(x, y, lmb):\n",
|
|
" I = identity(len(X[0]))\n",
|
|
" return inv(x.T.dot(x) + lmb * I).dot(x.T).dot(y)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 9,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"array([-1.92905668e-01, 4.45989360e-02, 4.80153773e-02, 1.70985336e+00,\n",
|
|
" -1.21920175e+01, 5.08501051e+00, 8.60369052e-04, -1.23267578e+00,\n",
|
|
" 4.67418151e-01, -1.51800832e-02, -7.48061272e-01, 7.58288257e-04,\n",
|
|
" -5.09848508e-01, 2.39216289e+01])"
|
|
]
|
|
},
|
|
"execution_count": 9,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"ridge_regression(X, y, 0.1)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Lasso: w* = argmin_w (y-X·w)ᵀ(y-X·w) + λ‖w‖₁"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 10,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"def lasso(w, x, y, lmb):\n",
|
|
" return (y - x.dot(w)).T.dot(y-x.dot(w)) + lmb * sum(w)\n",
|
|
"lasso_regression = lambda x, y, lmb: fmin_bfgs(lasso, zeros(len(x[0])), args= (x,y,lmb))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 11,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Warning: Desired error not necessarily achieved due to precision loss.\n",
|
|
" Current function value: 8923.891671\n",
|
|
" Iterations: 18\n",
|
|
" Function evaluations: 1047\n",
|
|
" Gradient evaluations: 69\n"
|
|
]
|
|
},
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"array([-1.91323083e-01, 4.42279444e-02, 5.53631382e-02, 1.71490995e+00,\n",
|
|
" -1.50058027e+01, 4.89083693e+00, 2.62969748e-03, -1.29453598e+00,\n",
|
|
" 4.84604103e-01, -1.53869118e-02, -8.08349218e-01, -1.26958995e-03,\n",
|
|
" -5.17749228e-01, 2.86320548e+01])"
|
|
]
|
|
},
|
|
"execution_count": 11,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"lasso_regression(X,y,0.1)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 12,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"def S(actual, predicted):\n",
|
|
" from math import sqrt\n",
|
|
" return sqrt(sum((predicted[i] - actual[i])**2 for i in range(len(actual))) / len(actual))\n",
|
|
"predicted = lambda x, w: x.dot(w)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 21,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Warning: Desired error not necessarily achieved due to precision loss.\n",
|
|
" Current function value: 8922.270593\n",
|
|
" Iterations: 18\n",
|
|
" Function evaluations: 555\n",
|
|
" Gradient evaluations: 37\n",
|
|
"Least squares training set s statistics: 4.722840838326382\n",
|
|
"Least squares test set s statistics: 6.155792280412581\n",
|
|
"Ridge regression training set s statistics: 4.734160907532518\n",
|
|
"Ridge regression test set s statistics: 5.98737876633626\n",
|
|
"Lasso regression training set s statistics: 4.722840845344215\n",
|
|
"Lasso regression test set s statistics: 6.155671334655423\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"w_least = least_squares(X, y)\n",
|
|
"w_ridge = ridge_regression(X, y, 0.01)\n",
|
|
"w_ridge = ridge_regression(X, y, 0.2)\n",
|
|
"w_lasso = lasso_regression(X, y, 0.01)\n",
|
|
"print(\"Least squares training set s statistics:\", S(y, predicted(X, w_least)))\n",
|
|
"print(\"Least squares test set s statistics:\", S(y_test, predicted(X_test, w_least)))\n",
|
|
"print(\"Ridge regression training set s statistics:\", S(y, predicted(X, w_ridge)))\n",
|
|
"print(\"Ridge regression test set s statistics:\", S(y_test, predicted(X_test, w_ridge)))\n",
|
|
"print(\"Lasso regression training set s statistics:\", S(y, predicted(X, w_lasso)))\n",
|
|
"print(\"Lasso regression test set s statistics:\", S(y_test, predicted(X_test, w_lasso)))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
}
|
|
],
|
|
"metadata": {
|
|
"anaconda-cloud": {},
|
|
"kernelspec": {
|
|
"display_name": "Python 3",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.7.7"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 1
|
|
}
|