@junjy007
Created August 8, 2014 03:45
Paper reading notes.
{
"metadata": {
"name": "",
"signature": "sha256:fcc43d173e5c9dd3fc170ce0b88a1c8ade3969d292e97d35b30aaa9643642754"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "code",
"collapsed": false,
"input": [
"import numpy as np\n",
"import scipy\n",
"from sklearn import datasets"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Back propagation\n",
"Now let's make a neural net. Each unit $i$ has *total* input $x_i$, and output $y_i$. The nonlinear activation is $a_i=\\frac{1}{1+e^{-x_i}}$.\n",
"\n",
"The activation can be understood as follows:\n",
"$$\n",
"y = \\frac{e^x}{1+e^x}\n",
"$$\n",
"So it is $e^x$ competing with $e^0==1$. When $x=0$, they draw, then the output is $1/2$. When $x > 0$, $e^x$ takes most portion of the sum, and the quotient becomes 1. When $x< 0$, 1 takes most of the sum. The quotient becomes 0. So the activation is a \"hidden\" softmax. A unit is competing with its dual (which saying the status should be 0) to be activate."
]
},
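{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick numerical check of the \"hidden softmax\" view above (a small sketch added here, assuming only the `numpy` import at the top of the notebook): the logistic output $\\frac{1}{1+e^{-x}}$ equals the first component of a two-way softmax over $(x, 0)$.\n",
"```python\n",
"x = np.linspace(-5, 5, 11)\n",
"sigmoid = 1.0/(1.0 + np.exp(-x))                 # logistic activation\n",
"softmax0 = np.exp(x)/(np.exp(x) + np.exp(0.0))   # e^x competing with e^0\n",
"assert np.allclose(sigmoid, softmax0)\n",
"```"
]
},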
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The derivative w.r.t. $x$: \n",
"$$\n",
"\\frac{d}{dx} e^x\\times (1+e^x)^{-1} = \n",
"= e^x (1+e^x)^{-1} + e^x \\times (-1) \\times (1+e^x)^{-2} e^x\n",
"$$\n",
"The first part is $y$ itself (thanks the factor $e^x$), the second part is $-y^2$. (Maybe you have a better way of explaining it.) "
]
},
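{
"cell_type": "markdown",
"metadata": {},
"source": [
"To double-check that the derivative is $y(1-y)$, here is a small finite-difference sketch (not from the original notes; it assumes only `np` from above).\n",
"```python\n",
"h = 1e-6\n",
"x = np.linspace(-3, 3, 7)\n",
"y = 1.0/(1.0 + np.exp(-x))\n",
"numeric = (1.0/(1.0 + np.exp(-(x + h))) - y)/h   # finite-difference slope\n",
"assert np.allclose(numeric, y*(1.0 - y), atol=1e-5)\n",
"```"
]
},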
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each layer represents a number of neurons, say, $q$, they all accepts the same inputs, say, $p$ of them, and generate one output. \n",
"\n",
"Each function accepts data of $n$ samples, each sample consisting of the $p$ inputs of the neurons. So the layer process $n \\times p$ data matrix into $n \\times q$ output.\n",
"\n",
"In back-propagation, the layer gets information of how the output of each of its neurons affects the final error of each sample, as a $n \\times q$ matrix. It attributes the influence back to its inputs."
]
},
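{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the shapes concrete, a tiny sketch with made-up sizes (hypothetical $n=5$, $p=3$, $q=4$; not part of the class below): the forward pass maps an $n \\times p$ matrix to $n \\times q$, and back-propagation maps an $n \\times q$ error matrix back to $n \\times p$.\n",
"```python\n",
"n, p, q = 5, 3, 4                                # samples, inputs, neurons\n",
"W = np.zeros((p, q)); b = np.zeros(q)            # layer parameters\n",
"X = np.ones((n, p))                              # n samples of p inputs\n",
"out = 1.0/(1.0 + np.exp(-(np.dot(X, W) + b)))    # forward: n x q outputs\n",
"de_out = np.ones((n, q))                         # error signal per neuron, per sample\n",
"de_in = np.dot(de_out*out*(1 - out), W.T)        # attributed back to the inputs: n x p\n",
"assert out.shape == (n, q) and de_in.shape == (n, p)\n",
"```"
]
},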
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Let's take some neurons to make a layer of a network.\n",
"class NLayer: \n",
" def __init__(self, n_in, n_neurons, act=None, dact=None):\n",
" rng = np.random.RandomState(0)\n",
" wr = 1/n_in\n",
" self.W = rng.uniform(-wr, +wr, (n_in, n_neurons))\n",
" self.b = rng.uniform(-.5, +.5, (n_neurons,))\n",
" if act is None:\n",
" #self.act = scipy.special.expit\n",
" self.act = lambda x: 1/(1+np.exp(x))\n",
" self.dact = lambda x: self.act(x)*(1-self.act(x))\n",
" else:\n",
" self.act = act\n",
" self.dact = dact\n",
" \n",
" def forward(self, inputs):\n",
" # inputs: [samples x p-input variables]\n",
" self.inputs = inputs\n",
" self.tot_inputs = np.dot(inputs, self.W) + self.b[np.newaxis,:]\n",
" self.outputs = self.act(self.tot_inputs)\n",
" self.d_outputs_tot_inputs = self.dact(self.outputs)\n",
" \n",
" def backprop(self, de):\n",
" # de: how the output of each neuron influences the error\n",
" n = de.shape[0]\n",
" self.delta = de\n",
" self.de_tot_input = self.d_outputs_tot_inputs*de\n",
" self.de_inputs = np.dot(self.de_tot_input, self.W.T)\n",
" self.dW = np.dot(self.inputs.T, self.de_tot_input) / n\n",
" self.db = np.sum(self.de_tot_input, axis=0) / n\n",
" \n",
" def apply_grad(self, learning_rate):\n",
" self.W -= self.dW*learning_rate\n",
" self.b -= self.db*learning_rate\n",
" "
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 3
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"Fine! Now let's build two-layer network to predict house prices!"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"hpdat = datasets.load_boston()\n",
"from sklearn.cross_validation import train_test_split\n",
"trn_x, tst_x, trn_y, tst_y = train_test_split(hpdat.data, hpdat.target[:,np.newaxis], test_size=0.2)\n",
"trn_x, vld_x, trn_y, vld_y = train_test_split(trn_x, trn_y, test_size=0.2)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 26
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"t = np.arange(0, np.pi, 0.05)\n",
"y1 = np.cos(t)\n",
"y2 = np.sin(t)\n",
"trn_x = t[:,np.newaxis]\n",
"trn_y = np.asarray([y1, y2]).T\n",
"print trn_x.shape\n",
"print trn_y.shape\n"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"(63, 1)\n",
"(63, 2)\n"
]
}
],
"prompt_number": 8
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"n_hidden = 20\n",
"nn0 = NLayer(n_in=trn_x.shape[1], n_neurons=n_hidden)\n",
"lin_act = lambda x:x\n",
"lin_dact = lambda x:np.ones_like(x)\n",
"nn1 = NLayer(n_in=n_hidden, n_neurons=trn_y.shape[1], act=lin_act, dact=lin_dact)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 28
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"Train the network by backprop"
]
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"i = 0\n",
"lrate = 0.05\n",
"while True:\n",
" nn0.forward(trn_x)\n",
" nn1.forward(nn0.outputs)\n",
" trn_err = np.mean(0.5*(nn1.outputs - trn_y)**2)\n",
" de_nn1_out = nn1.outputs - trn_y\n",
" nn1.backprop(de_nn1_out)\n",
" nn0.backprop(nn1.de_inputs)\n",
" \n",
" nn0.apply_grad(lrate)\n",
" nn1.apply_grad(lrate)\n",
" #nn0.forward(vld_x)\n",
" #nn1.forward(nn0.outputs)\n",
" #vld_err = np.mean(0.5*(nn1.outputs - vld_y)**2)\n",
" vld_err = 0\n",
" \n",
" print \"epoch %d: trn err %.3f vld err %.3f\" % (i, trn_err, vld_err)\n",
" \n",
" i = i+1\n",
" if i>200:\n",
" break"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"epoch 0: trn err 0.277 vld err 0.000\n",
"epoch 1: trn err 0.215 vld err 0.000\n",
"epoch 2: trn err 0.181 vld err 0.000\n",
"epoch 3: trn err 0.162 vld err 0.000\n",
"epoch 4: trn err 0.152 vld err 0.000\n",
"epoch 5: trn err 0.145 vld err 0.000\n",
"epoch 6: trn err 0.141 vld err 0.000\n",
"epoch 7: trn err 0.138 vld err 0.000\n",
"epoch 8: trn err 0.136 vld err 0.000\n",
"epoch 9: trn err 0.134 vld err 0.000\n",
"epoch 10: trn err 0.132 vld err 0.000\n",
"epoch 11: trn err 0.130 vld err 0.000\n",
"epoch 12: trn err 0.129 vld err 0.000\n",
"epoch 13: trn err 0.127 vld err 0.000\n",
"epoch 14: trn err 0.126 vld err 0.000\n",
"epoch 15: trn err 0.124 vld err 0.000\n",
"epoch 16: trn err 0.123 vld err 0.000\n",
"epoch 17: trn err 0.121 vld err 0.000\n",
"epoch 18: trn err 0.120 vld err 0.000\n",
"epoch 19: trn err 0.118 vld err 0.000\n",
"epoch 20: trn err 0.117 vld err 0.000\n",
"epoch 21: trn err 0.116 vld err 0.000\n",
"epoch 22: trn err 0.114 vld err 0.000\n",
"epoch 23: trn err 0.113 vld err 0.000\n",
"epoch 24: trn err 0.112 vld err 0.000\n",
"epoch 25: trn err 0.111 vld err 0.000\n",
"epoch 26: trn err 0.109 vld err 0.000\n",
"epoch 27: trn err 0.108 vld err 0.000\n",
"epoch 28: trn err 0.107 vld err 0.000\n",
"epoch 29: trn err 0.106 vld err 0.000\n",
"epoch 30: trn err 0.104 vld err 0.000\n",
"epoch 31: trn err 0.103 vld err 0.000\n",
"epoch 32: trn err 0.102 vld err 0.000\n",
"epoch 33: trn err 0.101 vld err 0.000\n",
"epoch 34: trn err 0.100 vld err 0.000\n",
"epoch 35: trn err 0.099 vld err 0.000\n",
"epoch 36: trn err 0.098 vld err 0.000\n",
"epoch 37: trn err 0.097 vld err 0.000\n",
"epoch 38: trn err 0.096 vld err 0.000\n",
"epoch 39: trn err 0.095 vld err 0.000\n",
"epoch 40: trn err 0.094 vld err 0.000\n",
"epoch 41: trn err 0.093 vld err 0.000\n",
"epoch 42: trn err 0.092 vld err 0.000\n",
"epoch 43: trn err 0.091 vld err 0.000\n",
"epoch 44: trn err 0.090 vld err 0.000\n",
"epoch 45: trn err 0.089 vld err 0.000\n",
"epoch 46: trn err 0.088 vld err 0.000\n",
"epoch 47: trn err 0.087 vld err 0.000\n",
"epoch 48: trn err 0.087 vld err 0.000\n",
"epoch 49: trn err 0.086 vld err 0.000\n",
"epoch 50: trn err 0.085 vld err 0.000\n",
"epoch 51: trn err 0.084 vld err 0.000\n",
"epoch 52: trn err 0.083 vld err 0.000\n",
"epoch 53: trn err 0.082 vld err 0.000\n",
"epoch 54: trn err 0.082 vld err 0.000\n",
"epoch 55: trn err 0.081 vld err 0.000\n",
"epoch 56: trn err 0.080 vld err 0.000\n",
"epoch 57: trn err 0.079 vld err 0.000\n",
"epoch 58: trn err 0.079 vld err 0.000\n",
"epoch 59: trn err 0.078 vld err 0.000\n",
"epoch 60: trn err 0.077 vld err 0.000\n",
"epoch 61: trn err 0.077 vld err 0.000\n",
"epoch 62: trn err 0.076 vld err 0.000\n",
"epoch 63: trn err 0.075 vld err 0.000\n",
"epoch 64: trn err 0.075 vld err 0.000\n",
"epoch 65: trn err 0.074 vld err 0.000\n",
"epoch 66: trn err 0.073 vld err 0.000\n",
"epoch 67: trn err 0.073 vld err 0.000\n",
"epoch 68: trn err 0.072 vld err 0.000\n",
"epoch 69: trn err 0.071 vld err 0.000\n",
"epoch 70: trn err 0.071 vld err 0.000\n",
"epoch 71: trn err 0.070 vld err 0.000\n",
"epoch 72: trn err 0.070 vld err 0.000\n",
"epoch 73: trn err 0.069 vld err 0.000\n",
"epoch 74: trn err 0.069 vld err 0.000\n",
"epoch 75: trn err 0.068 vld err 0.000\n",
"epoch 76: trn err 0.067 vld err 0.000\n",
"epoch 77: trn err 0.067 vld err 0.000\n",
"epoch 78: trn err 0.066 vld err 0.000\n",
"epoch 79: trn err 0.066 vld err 0.000\n",
"epoch 80: trn err 0.065 vld err 0.000\n",
"epoch 81: trn err 0.065 vld err 0.000\n",
"epoch 82: trn err 0.064 vld err 0.000\n",
"epoch 83: trn err 0.064 vld err 0.000\n",
"epoch 84: trn err 0.063 vld err 0.000\n",
"epoch 85: trn err 0.063 vld err 0.000\n",
"epoch 86: trn err 0.063 vld err 0.000\n",
"epoch 87: trn err 0.062 vld err 0.000\n",
"epoch 88: trn err 0.062 vld err 0.000\n",
"epoch 89: trn err 0.061 vld err 0.000\n",
"epoch 90: trn err 0.061 vld err 0.000\n",
"epoch 91: trn err 0.060 vld err 0.000\n",
"epoch 92: trn err 0.060 vld err 0.000\n",
"epoch 93: trn err 0.060 vld err 0.000\n",
"epoch 94: trn err 0.059 vld err 0.000\n",
"epoch 95: trn err 0.059 vld err 0.000\n",
"epoch 96: trn err 0.058 vld err 0.000\n",
"epoch 97: trn err 0.058 vld err 0.000\n",
"epoch 98: trn err 0.058 vld err 0.000\n",
"epoch 99: trn err 0.057 vld err 0.000\n",
"epoch 100: trn err 0.057 vld err 0.000\n",
"epoch 101: trn err 0.057 vld err 0.000\n",
"epoch 102: trn err 0.056 vld err 0.000\n",
"epoch 103: trn err 0.056 vld err 0.000"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"epoch 104: trn err 0.056 vld err 0.000\n",
"epoch 105: trn err 0.055 vld err 0.000\n",
"epoch 106: trn err 0.055 vld err 0.000\n",
"epoch 107: trn err 0.055 vld err 0.000\n",
"epoch 108: trn err 0.054 vld err 0.000\n",
"epoch 109: trn err 0.054 vld err 0.000\n",
"epoch 110: trn err 0.054 vld err 0.000\n",
"epoch 111: trn err 0.053 vld err 0.000\n",
"epoch 112: trn err 0.053 vld err 0.000\n",
"epoch 113: trn err 0.053 vld err 0.000\n",
"epoch 114: trn err 0.052 vld err 0.000\n",
"epoch 115: trn err 0.052 vld err 0.000\n",
"epoch 116: trn err 0.052 vld err 0.000\n",
"epoch 117: trn err 0.052 vld err 0.000\n",
"epoch 118: trn err 0.051 vld err 0.000\n",
"epoch 119: trn err 0.051 vld err 0.000\n",
"epoch 120: trn err 0.051 vld err 0.000\n",
"epoch 121: trn err 0.050 vld err 0.000\n",
"epoch 122: trn err 0.050 vld err 0.000\n",
"epoch 123: trn err 0.050 vld err 0.000\n",
"epoch 124: trn err 0.050 vld err 0.000\n",
"epoch 125: trn err 0.049 vld err 0.000\n",
"epoch 126: trn err 0.049 vld err 0.000\n",
"epoch 127: trn err 0.049 vld err 0.000\n",
"epoch 128: trn err 0.049 vld err 0.000\n",
"epoch 129: trn err 0.049 vld err 0.000\n",
"epoch 130: trn err 0.048 vld err 0.000\n",
"epoch 131: trn err 0.048 vld err 0.000\n",
"epoch 132: trn err 0.048 vld err 0.000\n",
"epoch 133: trn err 0.048 vld err 0.000\n",
"epoch 134: trn err 0.047 vld err 0.000\n",
"epoch 135: trn err 0.047 vld err 0.000\n",
"epoch 136: trn err 0.047 vld err 0.000\n",
"epoch 137: trn err 0.047 vld err 0.000\n",
"epoch 138: trn err 0.047 vld err 0.000\n",
"epoch 139: trn err 0.046 vld err 0.000\n",
"epoch 140: trn err 0.046 vld err 0.000\n",
"epoch 141: trn err 0.046 vld err 0.000\n",
"epoch 142: trn err 0.046 vld err 0.000\n",
"epoch 143: trn err 0.046 vld err 0.000\n",
"epoch 144: trn err 0.045 vld err 0.000\n",
"epoch 145: trn err 0.045 vld err 0.000\n",
"epoch 146: trn err 0.045 vld err 0.000\n",
"epoch 147: trn err 0.045 vld err 0.000\n",
"epoch 148: trn err 0.045 vld err 0.000\n",
"epoch 149: trn err 0.045 vld err 0.000\n",
"epoch 150: trn err 0.044 vld err 0.000\n",
"epoch 151: trn err 0.044 vld err 0.000\n",
"epoch 152: trn err 0.044 vld err 0.000\n",
"epoch 153: trn err 0.044 vld err 0.000\n",
"epoch 154: trn err 0.044 vld err 0.000\n",
"epoch 155: trn err 0.044 vld err 0.000\n",
"epoch 156: trn err 0.043 vld err 0.000\n",
"epoch 157: trn err 0.043 vld err 0.000\n",
"epoch 158: trn err 0.043 vld err 0.000\n",
"epoch 159: trn err 0.043 vld err 0.000\n",
"epoch 160: trn err 0.043 vld err 0.000\n",
"epoch 161: trn err 0.043 vld err 0.000\n",
"epoch 162: trn err 0.043 vld err 0.000\n",
"epoch 163: trn err 0.042 vld err 0.000\n",
"epoch 164: trn err 0.042 vld err 0.000\n",
"epoch 165: trn err 0.042 vld err 0.000\n",
"epoch 166: trn err 0.042 vld err 0.000\n",
"epoch 167: trn err 0.042 vld err 0.000\n",
"epoch 168: trn err 0.042 vld err 0.000\n",
"epoch 169: trn err 0.042 vld err 0.000\n",
"epoch 170: trn err 0.042 vld err 0.000\n",
"epoch 171: trn err 0.041 vld err 0.000\n",
"epoch 172: trn err 0.041 vld err 0.000\n",
"epoch 173: trn err 0.041 vld err 0.000\n",
"epoch 174: trn err 0.041 vld err 0.000\n",
"epoch 175: trn err 0.041 vld err 0.000\n",
"epoch 176: trn err 0.041 vld err 0.000\n",
"epoch 177: trn err 0.041 vld err 0.000\n",
"epoch 178: trn err 0.041 vld err 0.000\n",
"epoch 179: trn err 0.041 vld err 0.000\n",
"epoch 180: trn err 0.040 vld err 0.000\n",
"epoch 181: trn err 0.040 vld err 0.000\n",
"epoch 182: trn err 0.040 vld err 0.000\n",
"epoch 183: trn err 0.040 vld err 0.000\n",
"epoch 184: trn err 0.040 vld err 0.000\n",
"epoch 185: trn err 0.040 vld err 0.000\n",
"epoch 186: trn err 0.040 vld err 0.000\n",
"epoch 187: trn err 0.040 vld err 0.000\n",
"epoch 188: trn err 0.040 vld err 0.000\n",
"epoch 189: trn err 0.040 vld err 0.000\n",
"epoch 190: trn err 0.039 vld err 0.000\n",
"epoch 191: trn err 0.039 vld err 0.000\n",
"epoch 192: trn err 0.039 vld err 0.000\n",
"epoch 193: trn err 0.039 vld err 0.000\n",
"epoch 194: trn err 0.039 vld err 0.000\n",
"epoch 195: trn err 0.039 vld err 0.000\n",
"epoch 196: trn err 0.039 vld err 0.000\n",
"epoch 197: trn err 0.039 vld err 0.000\n",
"epoch 198: trn err 0.039 vld err 0.000\n",
"epoch 199: trn err 0.039 vld err 0.000\n",
"epoch 200: trn err 0.039 vld err 0.000\n"
]
}
],
"prompt_number": 29
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"t2 = np.arange(0,np.pi,0.01)\n",
"nn0.forward(t2[:,np.newaxis])\n",
"nn1.forward(nn0.outputs)\n",
"pred2 = nn1.outputs"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 30
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pylab as pl\n",
"pl.plot(t2, pred2[:,0])\n",
"pl.plot(trn_x[:,0], trn_y[:,0])\n",
"pl.show()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 31
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"*TODO* Cross validation to selection a good model."
]
},
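{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible shape for that TODO (only a rough sketch, not a tested implementation; `fit_and_score` is a hypothetical helper, and it assumes the Boston split `trn_x`/`vld_x` is in effect rather than the sin/cos toy data): train a fresh network per candidate hidden-layer size and keep the one with the lowest validation error.\n",
"```python\n",
"def fit_and_score(n_hidden, n_epochs=200, lrate=0.05):\n",
"    # hypothetical helper: train a two-layer net and return validation MSE\n",
"    l0 = NLayer(n_in=trn_x.shape[1], n_neurons=n_hidden)\n",
"    l1 = NLayer(n_in=n_hidden, n_neurons=trn_y.shape[1], act=lin_act, dact=lin_dact)\n",
"    for _ in range(n_epochs):\n",
"        l0.forward(trn_x); l1.forward(l0.outputs)\n",
"        l1.backprop(l1.outputs - trn_y); l0.backprop(l1.de_inputs)\n",
"        l0.apply_grad(lrate); l1.apply_grad(lrate)\n",
"    l0.forward(vld_x); l1.forward(l0.outputs)\n",
"    return np.mean(0.5*(l1.outputs - vld_y)**2)\n",
"\n",
"best_n_hidden = min([5, 10, 20, 40], key=fit_and_score)   # lowest validation error wins\n",
"```"
]
},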
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Sparse Coding"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$X$ is a $k\\times m$ data matrix, where $m$ is the number of samples, $k$ being the number of features. Let the dictionary be $B$ of $k\\times n$ and the coefficient matrix $S$ $of n\\times m$: $n$ coefficients per sample of the $m$ samples. The problem being:\n",
"$$\n",
"\\min_{B,S}\t\\sigma^{-2}\\|X-BS\\|_{F}^{2}+\\beta\\sum_{i,j}|s_{ij}| \\\\\n",
"s.t.\t\\sum_{i}B_{ij}^{2}\\leq c,\\ j=1\\dots n\n",
"$$\n",
" "
]
},
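{
"cell_type": "markdown",
"metadata": {},
"source": [
"To keep the dimensions straight, a small sketch with random data (purely illustrative; `sigma`, `beta` and `c` stand for $\\sigma$, $\\beta$ and $c$ above): evaluate the objective and check the column-norm constraint on the dictionary.\n",
"```python\n",
"k, m, n = 8, 50, 16                              # features, samples, dictionary atoms\n",
"rng = np.random.RandomState(0)\n",
"X = rng.randn(k, m); B = rng.randn(k, n); S = rng.randn(n, m)\n",
"sigma, beta, c = 1.0, 0.1, 1.0\n",
"obj = np.sum((X - np.dot(B, S))**2)/sigma**2 + beta*np.sum(np.abs(S))\n",
"feasible = np.all(np.sum(B**2, axis=0) <= c)     # one norm constraint per column of B\n",
"```"
]
},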
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Fixing $B$, the objective w.r.t. $s$ is $$ \\xi^{T}\\xi-\\xi^{T}Bs-s^{T}B^{T}\\xi+s^{T}B^{T}Bs+\\lambda\\sum_{j}|s_{j}| $$. So the gradient to $s$ is $2B^{T}(Bs-\\xi)$ (a bit familiar of the backprop computation of attributing error-derivatives back to the inputs? -- all summing over some features -- the inner-dimension of the matrix multiplication), and the penalty gradient, a vector of $\\{\\pm\\lambda,0\\}$, depending on the current value of $s$. The gradients can be computed as follows."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for s, x in S, X:\n",
" ds0 = np.dot(B.T, np.dot(B,s)-x)\n",
" ds1 = lam * np.sign(x)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now it is intuitive that ds0 looks like a perceptron error, and ds1 being the penalty of not being 0 -- note it is a constant penalty as long as you are not zero -- so the method gives you sparse solutions! \n",
"\n",
"But the problem is that if you push your s along ds0+ds1, ds1 could jump rigorously, and hamper you from convergence. Naturally, as an antidose to the jumping behaviour, we want to keep the momentum of the optimisation w.r.t. penalties. E.g. for a few steps, just keep the previous direction (the current choice of $\\pm \\lambda, 0$), despite a coefficient may cross 0 during the process.\n",
"\n",
"Why not simply using a larger step size? We want to gradient of the actual error [ds0] to guide our search. So the idea is not to play the trick on every one, but only on those coefficients that are strongly affecting the prediction performance (changing them greatly help reduce the errors). How strong? The error gradient should be greater than $\\lambda$. Then you can push them for a few steps assuming they are penalised in the same way as they are currently.\n",
"\n",
"Consider that the error dictates some $s_j$ to be smaller, i.e. the $\\partial E/\\partial s_j > 0$. If **(A)** $\\partial E$ is not strong enough, say $ \\lt +\\lambda$, then $s_j$ is dominated by the penalty. \n",
"\n",
"If the error gradient is strong, exceeding the penalty gradient. Then consider **(B)** if currently $s_j$ is already \"over the line\", $s_j \\lt 0$, the penalty $\\partial P/\\partial s_j = -\\lambda$ pushes $s_j$ to be larger (closer to 0), but will be overwhelmed by the demand of error-reducing, so $s_j$ will become smaller. <del> Moreover, the solution must be some $s_j^* \\lt s_j$, otherwise, if you fix all other $s_{k\\neq j}$ and examine $s_j$, you would find convexity has been violated (by the \"otherwise\").</del> The above argument is not accurate according to our discussion. The condition I proposed may not be enough to preserve the sign of a coefficient.\n",
"\n",
"The complex condition is **(C)** when currently $s_j>0$, so now the penalty is \"helping\" us, it drives $s_j$ to where it would put error down. However, as long as we cross the zero-line, the penalty will turn against us. The hope is that initially our error-gradient is strong, so it might overcome the changed penalty, but this is no guarantee. So you check every time some coefficient changing its sign, pick and choose the best one. \n",
"\n",
"Anyway, the active set is made of coefficients contributing significantly to the (reduction of) error.\n",
"```python\n",
" # choose an element to join the active set\n",
" ina = (th==0).nonzero()[0] # inactive index\n",
" g = np.dot(B.T, np.dot(B,s) - x) # gradient to error\n",
" ing = np.abs(g[ina])\n",
" a_i = np.argmax(ing) # leading contributor\n",
" if ing[ai]<=lam: # none-tobe activated, over!\n",
" return s, fn_err(s), g, th\n",
" th[ina[a_i]] = -sign(g[ina[a_i]]) # predict sign as if descended\n",
"``` "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The line search inspected\n",
"Let s0 be the active coefficients (the current value of coefficients in the current active set). So the logic goes we will **(1)** solve a least square problem with the coefficents being \"dragged-to-zero-according-to-current-value\"; and **(2)** check the sign changes along the way. \n",
"\n",
"### solution exists?\n",
"Note in **(1)** the solution must exist:\n",
"$$\n",
"L^s = 0.5*\\| x - B s\\|^2 + \\sum_j (\\pm) s_j\n",
"$$\n",
"The norm is a convex function w.r.t. $s$. With those signs fixed, the second term is linear (also convex) w.r.t. $s$ as well. So a global minimum is guaranteed -- you won't shoot to nowhere.\n",
"\n",
"So we solves the equalibrium of the gradients as\n",
"$$\n",
"B^T (Bs - x) + \\lambda \\vec{\\theta} = 0\n",
"$$\n",
"```python\n",
" a = th.nonzero()[0] # active set \n",
" s1 = np.linalg.solve(np.dot(B[:,a].T, B[:,a]), \n",
" (np.dot(B[:,a].T, x) - lam*th[a])\n",
" )\n",
"```\n",
"\n",
"But a question arises: *is the problem always well-defined?* Since $B$ is over-complete, is it possible the active set has more than enough coefficients to manipulate, which renders the problem to be underdetermined? If this is impossible, is there a proof of it?\n",
"\n",
"### are we making progress?\n",
"For step **(2)**, let us imagine a movement from the current coefficients to the least square solution -- *despite that we can analytically get the solution and just \"jump\" to it*. Keep in mind along the path, we are minimising $L^s$, but the true objective should be \n",
"$$\n",
"L = \\| x - B s\\|^2 + \\sum_j | s_j |\n",
"$$\n",
"In the first part of the journey, right before the first sign change of any $s_j$. $L^s$ and $L$ are the same and we are minimising the true goal. After a sign has been changed, our surrogate $L^s$ starts deviating from $L$. Say $s_{k0}$ changes its sign before any one else, at some $t_0 \\in (0,1]$, where $t=0$ means the origional coefficient $s_0$ and $t=1$ means the coefficient minimising $L^s$. Then we can be assured that between $(0,t_0)$ we are reducing $L$. \n",
"\n",
"### where to chop the line for a check?\n",
"After that, before another $s_{k1}$ changes its sign, we are minimising\n",
"\\begin{align}\n",
"L^s &= \\| x - B s\\|^2 + \\sum_j p_j s_j \\\\\n",
"& = \\| x - B s\\|^2 - |s_{k0}| + \\sum_{j \\neq k_0} | s_j | \n",
"& = L - 2|s_{k0}|\n",
"\\end{align}\n",
"So the previous path becomes suboptimal. However, as long as $|s_{k0}|$ is small, hopfully, the path is still a _decending_ one for $L$. Along this section, say, from $t_0$ to $t_1$, where another $s_{k1}$ changes sign, our surrogate _steadily_ deviates from the true $L$. We have no guarantee the path remains descending for $L$, but can just say, the rate of our surrogate getting worse is constant, _until $s_{k1}$ joins trouble makers' club_!\n",
"```python\n",
"flipping_at = -s0/(s1-s0) # t0, t1, t ...\n",
"```\n",
"For each t in flipping between $(0,1]$, the coefficient is $(1-t)s_0+ts_1$.\n",
"```python\n",
"fn_n2=lambda x:numpy.sum(x**x)\n",
"\n",
"interm_costs = [(t,(1-t)*s0+t*s1) for t in flipping_at if t>0.0 and t<=1.0]\n",
"best_t, best_L = max(interm_costs, \n",
" key=lambda item:\n",
" fn_n2(x-np.dot(B[:,a],item[1]))*0.5 \n",
" +np.sum(np.abs(item[1])))\n",
"```\n"
]
},
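{
"cell_type": "markdown",
"metadata": {},
"source": [
"A tiny check of the flip-point formula above (a sketch with made-up `s0`/`s1`): at $t_j = -s_{0j}/(s_{1j}-s_{0j})$ the $j$-th interpolated coefficient is exactly zero.\n",
"```python\n",
"s0 = np.array([ 0.5, -0.2, 0.3])\n",
"s1 = np.array([-0.1,  0.4, 0.6])\n",
"flipping_at = -s0/(s1 - s0)                      # where each coefficient crosses zero\n",
"for j, t in enumerate(flipping_at):\n",
"    if 0.0 < t <= 1.0:\n",
"        assert np.isclose((1 - t)*s0[j] + t*s1[j], 0.0)\n",
"```"
]
},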
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"The following code has bugs in it and not working, but it makes an illustration and please help improve it."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def solve_coeff(x, B, lam):\n",
" s = np.zeros(B.shape[1])\n",
" th = np.zeros(B.shape[1], dtype=int) # active flags\n",
" fn_err = lambda s_: np.sum((x-np.dot(B,s_))**2)*0.5 \\\n",
" + np.sum(np.abs(s_))*lam\n",
" fn_n2=lambda x:numpy.sum(x**x)\n",
" new_acti = True\n",
" it = 0\n",
" max_it = 1\n",
" while True:\n",
" if new_acti:\n",
" ina = (th==0).nonzero()[0]\n",
" g = np.dot(B.T, np.dot(B,s) - x)\n",
" ing = np.abs(g[ina])\n",
" a_i = np.argmax(ing)\n",
" if ing[a_i]<=lam:\n",
" return s, fn_err(s), g, th\n",
" th[ina[a_i]] = -np.sign(g[ina[a_i]])\n",
" new_acti = False\n",
" \n",
" a = th.nonzero()[0]\n",
" s0 = s[a]\n",
" s1 = np.linalg.solve(np.dot(B[:,a].T, B[:,a]), \n",
" (np.dot(B[:,a].T, x) - lam*th[a])\n",
" )\n",
" \n",
" print s0\n",
" print s1\n",
" \n",
" flipping_at = -s0/(s1-s0)\n",
" interm_costs = [(t,(1-t)*s0+t*s1) for t in flipping_at \n",
" if t>0.0 and t<=1.0]\n",
" if len(interm_costs) == 0:\n",
" print \"No flipping\"\n",
" s[a]=s1\n",
" new_acti=True\n",
" continue\n",
" \n",
" best_t, best_sa = max(interm_costs,\n",
" key=lambda item:\n",
" fn_n2(x-np.dot(B[:,a],item[1]))*0.5 \n",
" +np.sum(np.abs(item[1]))\n",
" )\n",
" s[a] = best_sa\n",
" th[a[np.isclose(s[a],0)]] = 0\n",
" a2 = th.nonzero()[0]\n",
" if len(a2)==0:\n",
" print \"No active left\"\n",
" break\n",
" else:\n",
" g = np.dot(B.T, np.dot(B, s)-x) + lam*np.sign(s)\n",
" if np.abs(g[a2]).sum() < 1e-9:\n",
" print \"active optimal, try to add new\"\n",
" new_acti = True\n",
" \n",
" it += 1\n",
" if it > max_it:\n",
" print \"max iter reached\"\n",
" break\n",
" print it, fn_err(s), s\n",
" return s, fn_err(s), g, th\n",
" \n",
" "
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 36
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"B = np.random.uniform(-1,1,(3, 10))\n",
"x = np.random.uniform(-1,1,(3,))\n",
"lam = 0.01\n",
"s, e, g, th = solve_coeff(x,B,lam)\n",
"print s, np.dot(B,s)-x\n"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[ 0.]\n",
"[-0.7568033]\n",
"No flipping\n",
"[ 0. -0.7568033]\n",
"[ 0.74835549 -1.06740888]\n",
"No flipping\n",
"[ 0. 0.74835549 -1.06740888]\n",
"[-0.16148946 0.69244504 -1.1069943 ]\n",
"No flipping\n",
"[-0.16148946 0.69244504 0. -1.1069943 ]\n",
"[ -2.34359196e+12 7.87085011e+12 -1.13198683e+13 -2.74272065e+12]\n",
"No flipping\n",
"[ -2.34359196e+12 7.87085011e+12 0.00000000e+00 -1.13198683e+13\n",
" -2.74272065e+12]\n",
"[ -2.49625358e+12 7.81476711e+12 2.63142024e+12 -1.00892193e+13\n",
" -3.91827174e+12]\n",
"No flipping\n",
"[ -2.49625358e+12 0.00000000e+00 0.00000000e+00 0.00000000e+00\n",
" 7.81476711e+12 2.63142024e+12 0.00000000e+00 -1.00892193e+13\n",
" 0.00000000e+00 -3.91827174e+12] [ 0.00661006 0.00196886 -0.01683293]\n"
]
}
],
"prompt_number": 37
}
],
"metadata": {}
}
]
}