258 lines
8.7 KiB
Text
258 lines
8.7 KiB
Text
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Experimenting with least squares and its variants"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"%matplotlib inline\n",
|
|
"\n",
|
|
"from sklearn import datasets\n",
|
|
"from scipy.optimize import fmin_bfgs\n",
|
|
"import numpy as np\n",
|
|
"from numpy.linalg import norm\n",
|
|
"from numpy.linalg import inv"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Data preparation"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"array([[ 6.32000000e-03, 1.80000000e+01, 2.31000000e+00, ...,\n",
|
|
" 1.53000000e+01, 3.96900000e+02, 4.98000000e+00],\n",
|
|
" [ 2.73100000e-02, 0.00000000e+00, 7.07000000e+00, ...,\n",
|
|
" 1.78000000e+01, 3.96900000e+02, 9.14000000e+00],\n",
|
|
" [ 2.72900000e-02, 0.00000000e+00, 7.07000000e+00, ...,\n",
|
|
" 1.78000000e+01, 3.92830000e+02, 4.03000000e+00],\n",
|
|
" ..., \n",
|
|
" [ 6.07600000e-02, 0.00000000e+00, 1.19300000e+01, ...,\n",
|
|
" 2.10000000e+01, 3.96900000e+02, 5.64000000e+00],\n",
|
|
" [ 1.09590000e-01, 0.00000000e+00, 1.19300000e+01, ...,\n",
|
|
" 2.10000000e+01, 3.93450000e+02, 6.48000000e+00],\n",
|
|
" [ 4.74100000e-02, 0.00000000e+00, 1.19300000e+01, ...,\n",
|
|
" 2.10000000e+01, 3.96900000e+02, 7.88000000e+00]])"
|
|
]
|
|
},
|
|
"execution_count": 3,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"boston = datasets.load_boston()\n",
|
|
"data = np.array(boston.data)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"The boston dataset is one of the standard regression problems used to experiment with learning algorithms. Below you can find the dataset description"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Boston House Prices dataset\n",
|
|
"\n",
|
|
"Notes\n",
|
|
"------\n",
|
|
"Data Set Characteristics: \n",
|
|
"\n",
|
|
" :Number of Instances: 506 \n",
|
|
"\n",
|
|
" :Number of Attributes: 13 numeric/categorical predictive\n",
|
|
" \n",
|
|
" :Median Value (attribute 14) is usually the target\n",
|
|
"\n",
|
|
" :Attribute Information (in order):\n",
|
|
" - CRIM per capita crime rate by town\n",
|
|
" - ZN proportion of residential land zoned for lots over 25,000 sq.ft.\n",
|
|
" - INDUS proportion of non-retail business acres per town\n",
|
|
" - CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)\n",
|
|
" - NOX nitric oxides concentration (parts per 10 million)\n",
|
|
" - RM average number of rooms per dwelling\n",
|
|
" - AGE proportion of owner-occupied units built prior to 1940\n",
|
|
" - DIS weighted distances to five Boston employment centres\n",
|
|
" - RAD index of accessibility to radial highways\n",
|
|
" - TAX full-value property-tax rate per $10,000\n",
|
|
" - PTRATIO pupil-teacher ratio by town\n",
|
|
" - B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town\n",
|
|
" - LSTAT % lower status of the population\n",
|
|
" - MEDV Median value of owner-occupied homes in $1000's\n",
|
|
"\n",
|
|
" :Missing Attribute Values: None\n",
|
|
"\n",
|
|
" :Creator: Harrison, D. and Rubinfeld, D.L.\n",
|
|
"\n",
|
|
"This is a copy of UCI ML housing dataset.\n",
|
|
"http://archive.ics.uci.edu/ml/datasets/Housing\n",
|
|
"\n",
|
|
"\n",
|
|
"This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.\n",
|
|
"\n",
|
|
"The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic\n",
|
|
"prices and the demand for clean air', J. Environ. Economics & Management,\n",
|
|
"vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics\n",
|
|
"...', Wiley, 1980. N.B. Various transformations are used in the table on\n",
|
|
"pages 244-261 of the latter.\n",
|
|
"\n",
|
|
"The Boston house-price data has been used in many machine learning papers that address regression\n",
|
|
"problems. \n",
|
|
" \n",
|
|
"**References**\n",
|
|
"\n",
|
|
" - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.\n",
|
|
" - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.\n",
|
|
" - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)\n",
|
|
"\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"print(boston.DESCR)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"First step to apply the formulae we learnt during the lectures is to rewrite the dataset in homogeneous coordinates (i.e., we append a column of 1 to the matrix containing the examples):"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"metadata": {
|
|
"collapsed": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"t = np.ones(len(data)).reshape(len(data),1)\n",
|
|
"data = np.append(data, t, 1)\n",
|
|
"target = np.array(boston.target)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We now divide the data into a training set $X$ and a test set $X_\\textrm{test}$."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 6,
|
|
"metadata": {
|
|
"collapsed": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"X,y = data[0:400,:], target[0:400]\n",
|
|
"X_test, y_test = data[400:,:], target[400:]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Exercise"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"1. Calculate the least square solution (to the regression problem outlined above) and evaluate its performances on the training set and on the test set.\n",
|
|
"1. Calculate the ridge regression solution (set lambda to 0.01) and evaluate its performances on the training set and on test set.\n",
|
|
"1. Calculate the lasso regression solution and evaluate its performances on the training set and on the test set."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Notes"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"- Here it follows a list of functions you may want to use (the required packages are already imported at the beginning of this notebook) along with a very brief explanation of their purpose (`help(nomefun)` will provide you more information about function `nomefun`):\n",
|
|
" - `transpose`: matrix transposition (e.g., `transpose(X)`)\n",
|
|
" - `dot`: matrix multiplication (e.g., `X.dot(X2)`) \n",
|
|
" - `inv`: matrix inversion (e.g., `inv(X)`)\n",
|
|
"- to solve the lasso problem you will need to perform a numerical minimization of the associated loss function (as you know, a closed form solution does not exist). There are many numerical optimization algorithms available in the scipy package. My suggestion is to use `fmin_bfgs`. Here it follows an example of how to use it:\n",
|
|
" ```python\n",
|
|
" def f(w):\n",
|
|
" return w[0]**2 + w[1]**2 + w[0] + w[1]\n",
|
|
" \n",
|
|
" w = fmin_bfgs(f, [0,0])\n",
|
|
" ```\n",
|
|
" note that the function may (and should) reference your data variables (i.e., $X$ and $y$).\n",
|
|
"- to evaluate the performances of your solutions use the $S$ statistic:\n",
|
|
" $$\n",
|
|
" S = \\sqrt{ \\frac{1}{n} \\sum_{i=1}^n (y_i' - y_i)^2 }\n",
|
|
" $$\n",
|
|
" where $y'_i$ is your model prediction for the i-th example, and $n$ is the number of examples."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": true
|
|
},
|
|
"outputs": [],
|
|
"source": []
|
|
}
|
|
],
|
|
"metadata": {
|
|
"anaconda-cloud": {},
|
|
"kernelspec": {
|
|
"display_name": "Python 3",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.7.5"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 1
|
|
}
|