Last active
July 7, 2023 08:33
NMF sample
{ | |
"cells": [ | |
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction\n",
"\n",
"Following [this article](http://www.kecl.ntt.co.jp/icl/signal/sawada/mypaper/829-833_9_02.pdf), I wrote a program for the Euclidean-distance version of NMF (non-negative matrix factorization).\n",
"Once the environment is set up, you should be able to run it by opening this ipynb file in [jupyter](http://jupyter.org/).\n",
"\n",
"I wrote this program for study purposes, so some parts are rough or questionable (see the Supplement section).\n",
"Apologies for any mistakes.\n",
"\n",
"## Code\n",
"\n",
"- Variable names correspond to the equations in the [original article](http://www.kecl.ntt.co.jp/icl/signal/sawada/mypaper/829-833_9_02.pdf).\n",
"\n",
"### 0. Loading libraries"
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"import math\n", | |
"import numpy as np\n", | |
"import pandas as pd\n", | |
"from pandas import Series, DataFrame" | |
] | |
}, | |
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Settings and input data\n",
"\n",
"- `base_count` is the number of bases used in the matrix factorization.\n",
"- Two sample inputs are provided. Input data 2 represents a matrix of user ratings of items, in which users 0 to 2 share similar tastes and users 3 to 4 share similar tastes."
] | |
}, | |
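Input data 2 marks unrated items with `np.nan`, and the update code later skips those entries. As a minimal sketch of how observed and missing entries can be told apart, the following builds a boolean mask with NumPy. The names `ratings` and `observed` are illustrative and do not appear in the notebook.

```python
import numpy as np

# Input data 2 from the notebook: rows are users, columns are items,
# and np.nan marks a rating that is missing.
ratings = np.array([
    [5., 4., 1., np.nan],
    [4., 4., 1., np.nan],
    [5., 4., np.nan, 1.],
    [1., np.nan, 5., 5.],
    [1., 1., np.nan, 5.],
])

# True where a rating was observed, False where it is missing.
observed = ~np.isnan(ratings)

# Number of observed ratings per user (row) and in total.
observed_per_user = observed.sum(axis=1)
observed_total = int(observed.sum())
```

Only the entries where `observed` is True would enter the sums of the update rule and the error.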
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>0</th>\n", | |
" <th>1</th>\n", | |
" <th>2</th>\n", | |
" <th>3</th>\n", | |
" <th>4</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1</td>\n", | |
" <td>2</td>\n", | |
" <td>3</td>\n", | |
" <td>4</td>\n", | |
" <td>5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>5</td>\n", | |
" <td>4</td>\n", | |
" <td>3</td>\n", | |
" <td>2</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" 0 1 2 3 4\n", | |
"0 1 2 3 4 5\n", | |
"1 5 4 3 2 1\n", | |
"2 2 2 2 2 2" | |
] | |
}, | |
"execution_count": 2, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
],
"source": [
"# Settings\n",
"base_count = 2\n",
"\n",
"# Input data 1: no missing values\n",
"x = DataFrame([[1., 2., 3., 4., 5.], [5., 4., 3., 2., 1.], [2., 2., 2., 2., 2.]])\n",
"# Input data 2: with missing values\n",
"# x = DataFrame([[5., 4., 1., np.nan], [4., 4., 1., np.nan], [5., 4., np.nan, 1.], [1., np.nan, 5., 5.], [1., 1., np.nan, 5.]])\n",
"x"
] | |
}, | |
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Initialization\n",
"\n",
"- Initialize the factor matrices (the T matrix and the V matrix).\n",
"- For now, each element is simply set to a random value between 0 and 1.\n",
"- (The product of the T and V matrices is printed as a check.)"
] | |
}, | |
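Because the initial values are random, every run starts from a different point and the intermediate outputs will not match the ones shown here. A small sketch of a seeded initialization, assuming plain NumPy arrays instead of DataFrames; the helper `init_factors` and its `seed` parameter are hypothetical names, not part of the notebook.

```python
import numpy as np

def init_factors(n_rows, n_cols, base_count, seed=0):
    """Return random non-negative factors T (n_rows x base_count) and V (base_count x n_cols)."""
    rng = np.random.RandomState(seed)  # fixed seed: the same values on every run
    t = rng.rand(n_rows, base_count)   # uniform in [0, 1)
    v = rng.rand(base_count, n_cols)
    return t, v

# Shapes for input data 1 (3 x 5) with base_count = 2.
t0, v0 = init_factors(3, 5, base_count=2)
product_shape = t0.dot(v0).shape  # same shape as the input matrix
```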
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>0</th>\n", | |
" <th>1</th>\n", | |
" <th>2</th>\n", | |
" <th>3</th>\n", | |
" <th>4</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>0.151917</td>\n", | |
" <td>0.425408</td>\n", | |
" <td>0.658315</td>\n", | |
" <td>0.879016</td>\n", | |
" <td>0.230995</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>0.240535</td>\n", | |
" <td>0.195024</td>\n", | |
" <td>0.821954</td>\n", | |
" <td>0.869349</td>\n", | |
" <td>0.306594</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>0.098839</td>\n", | |
" <td>0.469053</td>\n", | |
" <td>0.516858</td>\n", | |
" <td>0.781811</td>\n", | |
" <td>0.174055</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" 0 1 2 3 4\n", | |
"0 0.151917 0.425408 0.658315 0.879016 0.230995\n", | |
"1 0.240535 0.195024 0.821954 0.869349 0.306594\n", | |
"2 0.098839 0.469053 0.516858 0.781811 0.174055" | |
] | |
}, | |
"execution_count": 3, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"t = DataFrame(np.random.rand(x.shape[0], base_count))\n", | |
"v = DataFrame(np.random.rand(base_count, x.shape[1]))\n", | |
"DataFrame.dot(t, v)" | |
] | |
}, | |
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. Updating the factor matrices\n",
"\n",
"- Each time you run the update code below, the output (the product of the updated T and V matrices) should move closer to the input data matrix.\n",
"- Running the error-function code shows the squared error at that point. You should also see the error converge as you rerun the update code.\n"
] | |
}, | |
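Read off the loops in the update code, the rules it applies appear to be the following, where every sum runs only over the observed (non-missing) entries of $x$. The index names follow the loop variables `i`, `j`, `k`; this is a transcription of the code, not a quotation from the referenced article.

$$
\hat{x}_{ij} = \sum_{k} t_{ik} v_{kj}, \qquad
t_{ik} \leftarrow t_{ik} \frac{\sum_{j} x_{ij} v_{kj}}{\sum_{j} \hat{x}_{ij} v_{kj}}, \qquad
v_{kj} \leftarrow v_{kj} \frac{\sum_{i} x_{ij} t_{ik}}{\sum_{i} \hat{x}_{ij} t_{ik}}
$$

The value printed by the error-function code is then

$$
D = \sum_{i,j} \left( x_{ij} - \hat{x}_{ij} \right)^2 .
$$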
{ | |
"cell_type": "code", | |
"execution_count": 48, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>0</th>\n", | |
" <th>1</th>\n", | |
" <th>2</th>\n", | |
" <th>3</th>\n", | |
" <th>4</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1.121085</td>\n", | |
" <td>1.970961</td>\n", | |
" <td>2.997647</td>\n", | |
" <td>4.001589</td>\n", | |
" <td>5.001515</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>4.963758</td>\n", | |
" <td>4.019716</td>\n", | |
" <td>3.006991</td>\n", | |
" <td>2.003058</td>\n", | |
" <td>1.005344</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>2.021834</td>\n", | |
" <td>1.990703</td>\n", | |
" <td>1.995529</td>\n", | |
" <td>1.995715</td>\n", | |
" <td>1.996632</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" 0 1 2 3 4\n", | |
"0 1.121085 1.970961 2.997647 4.001589 5.001515\n", | |
"1 4.963758 4.019716 3.006991 2.003058 1.005344\n", | |
"2 2.021834 1.990703 1.995529 1.995715 1.996632" | |
] | |
}, | |
"execution_count": 48, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
],
"source": [
"# Update the T matrix\n",
"for i in range(t.shape[0]):  # rows\n",
"    for k in range(t.shape[1]):  # columns\n",
"        denominator = 0.\n",
"        numerator = 0.\n",
"        for j in range(x.shape[1]):\n",
"            if not math.isnan(x[j][i]):  # skip missing entries\n",
"                x_hat = np.dot(t.loc[i], v[j])  # .loc replaces the removed .ix indexer\n",
"                denominator += x_hat * v[j][k]\n",
"                numerator += x[j][i] * v[j][k]\n",
"        t.loc[i, k] = t[k][i] * numerator / denominator  # write immediately\n",
"\n",
"# Update the V matrix\n",
"for k in range(v.shape[0]):  # rows\n",
"    for j in range(v.shape[1]):  # columns\n",
"        denominator = 0.\n",
"        numerator = 0.\n",
"        for i in range(x.shape[0]):\n",
"            if not math.isnan(x[j][i]):  # skip missing entries\n",
"                x_hat = np.dot(t.loc[i], v[j])\n",
"                denominator += x_hat * t[k][i]\n",
"                numerator += x[j][i] * t[k][i]\n",
"        v.loc[k, j] = v[j][k] * numerator / denominator  # write immediately\n",
"\n",
"DataFrame.dot(t, v)"
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 49, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.017916968411981343" | |
] | |
}, | |
"execution_count": 49, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
],
"source": [
"# Check the value of the error function (missing entries count as zero error)\n",
"error = (x - DataFrame.dot(t, v)).fillna(0)\n",
"(error * error).sum(axis=0).sum()"
] | |
}, | |
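Rerunning the update cell by hand and then the error cell can also be done in a loop. The sketch below is a self-contained NumPy version of the same element-by-element update, assuming plain arrays instead of DataFrames and a fixed seed; `squared_error`, `update_elementwise`, and `errors` are illustrative names. It records the squared error after every pass so that the convergence can be inspected.

```python
import numpy as np

def squared_error(x, t, v):
    """Sum of squared errors over the observed (non-NaN) entries of x."""
    diff = x - t.dot(v)
    return float(np.nansum(diff ** 2))

def update_elementwise(x, t, v):
    """One in-place pass that overwrites each element as soon as it is computed."""
    observed = ~np.isnan(x)
    n, m = x.shape
    for i in range(n):                      # T matrix
        for k in range(t.shape[1]):
            num = den = 0.0
            for j in range(m):
                if observed[i, j]:
                    x_hat = t[i].dot(v[:, j])
                    den += x_hat * v[k, j]
                    num += x[i, j] * v[k, j]
            t[i, k] *= num / den
    for k in range(v.shape[0]):             # V matrix
        for j in range(m):
            num = den = 0.0
            for i in range(n):
                if observed[i, j]:
                    x_hat = t[i].dot(v[:, j])
                    den += x_hat * t[i, k]
                    num += x[i, j] * t[i, k]
            v[k, j] *= num / den

# Input data 1 and a seeded random start (the seed is an assumption).
x = np.array([[1., 2., 3., 4., 5.], [5., 4., 3., 2., 1.], [2., 2., 2., 2., 2.]])
rng = np.random.RandomState(0)
t = rng.rand(x.shape[0], 2)
v = rng.rand(2, x.shape[1])

errors = [squared_error(x, t, v)]           # error before any update
for _ in range(50):
    update_elementwise(x, t, v)
    errors.append(squared_error(x, t, v))
```

Plotting or printing `errors` shows how quickly the error falls for a given starting point.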
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Supplement"
] | |
}, | |
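{
"cell_type": "markdown",
"metadata": {},
"source": [
"I was unsure which of the following three timings to use for updating the matrices.\n",
"\n",
"1. Update each element as soon as its new value is computed (this is what the program in the previous section does)\n",
"2. Compute new values for all elements of both T and V -> update T and V -> (repeat)\n",
"3. Compute new values for all elements of T -> update T -> compute new values for all elements of V -> update V -> (repeat)\n",
"\n",
"I started with option 2, but it did not converge. In the end I used option 1, which converged. Option 3 also appears to converge. (All of this holds only as far as I tested with the sample input data.)\n",
"\n",
"I am not fully sure whether option 1 or option 3 is the correct one.\n",
"\n",
"Note: for a moment I thought option 1 might be mathematically equivalent to option 3. However, updating $t_{0,1}$, for example, requires values such as $\\hat{x}_{0,0}$, which in turn depend on other elements of the T matrix such as $t_{0,0}$. With element-by-element updates, the computation of each new value is affected by the updates made before it, so the two are probably not equivalent.\n",
"\n",
"### Update program for option 2\n",
"\n",
"I do not expect this to converge."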
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
},
"outputs": [],
"source": [
"x_hat = DataFrame.dot(t, v)\n",
"t_upd = t.copy()  # copy so that writes to t_upd leave t untouched\n",
"v_upd = v.copy()\n",
"\n",
"# Compute the new values for the T matrix\n",
"for i in range(t.shape[0]):  # rows\n",
"    for k in range(t.shape[1]):  # columns\n",
"        denominator = 0.\n",
"        numerator = 0.\n",
"        for j in range(x.shape[1]):\n",
"            if not math.isnan(x[j][i]):  # skip missing entries\n",
"                denominator += x_hat[j][i] * v[j][k]\n",
"                numerator += x[j][i] * v[j][k]\n",
"        t_upd.loc[i, k] = t[k][i] * numerator / denominator\n",
"\n",
"# Compute the new values for the V matrix\n",
"for k in range(v.shape[0]):  # rows\n",
"    for j in range(v.shape[1]):  # columns\n",
"        denominator = 0.\n",
"        numerator = 0.\n",
"        for i in range(x.shape[0]):\n",
"            if not math.isnan(x[j][i]):  # skip missing entries\n",
"                denominator += x_hat[j][i] * t[k][i]\n",
"                numerator += x[j][i] * t[k][i]\n",
"        v_upd.loc[k, j] = v[j][k] * numerator / denominator\n",
"\n",
"# Apply both updates at once\n",
"t = t_upd\n",
"v = v_upd\n",
"\n",
"DataFrame.dot(t, v)"
] | |
}, | |
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Update program for option 3\n",
"\n",
"I expect this to converge."
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
},
"outputs": [],
"source": [
"# Update the T matrix\n",
"x_hat = DataFrame.dot(t, v)\n",
"t_upd = t.copy()  # copy so that writes to t_upd leave t untouched\n",
"for i in range(t.shape[0]):  # rows\n",
"    for k in range(t.shape[1]):  # columns\n",
"        denominator = 0.\n",
"        numerator = 0.\n",
"        for j in range(x.shape[1]):\n",
"            if not math.isnan(x[j][i]):  # skip missing entries\n",
"                denominator += x_hat[j][i] * v[j][k]\n",
"                numerator += x[j][i] * v[j][k]\n",
"        t_upd.loc[i, k] = t[k][i] * numerator / denominator\n",
"t = t_upd\n",
"\n",
"# Update the V matrix, using x_hat recomputed from the updated T\n",
"x_hat = DataFrame.dot(t, v)\n",
"v_upd = v.copy()\n",
"for k in range(v.shape[0]):  # rows\n",
"    for j in range(v.shape[1]):  # columns\n",
"        denominator = 0.\n",
"        numerator = 0.\n",
"        for i in range(x.shape[0]):\n",
"            if not math.isnan(x[j][i]):  # skip missing entries\n",
"                denominator += x_hat[j][i] * t[k][i]\n",
"                numerator += x[j][i] * t[k][i]\n",
"        v_upd.loc[k, j] = v[j][k] * numerator / denominator\n",
"v = v_upd\n",
"\n",
"DataFrame.dot(t, v)\n"
] | |
}, | |
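Option 3 updates all of T with V held fixed and then all of V with the new T, which is, as far as I know, the form in which the multiplicative update rules for NMF are usually stated. Because each half-step uses a single fixed $\hat{x}$, it can be written with matrix products instead of loops. The following is a sketch under that reading, assuming NumPy arrays and a boolean mask for the missing entries; `update_alternating` is a hypothetical helper, not code from the referenced article.

```python
import numpy as np

def update_alternating(x, t, v):
    """One pass of option 3: update all of T, then all of V with the new T."""
    observed = ~np.isnan(x)
    x0 = np.where(observed, x, 0.0)            # missing entries contribute nothing

    x_hat = np.where(observed, t.dot(v), 0.0)
    t = t * x0.dot(v.T) / x_hat.dot(v.T)

    x_hat = np.where(observed, t.dot(v), 0.0)  # recomputed from the updated T
    v = v * t.T.dot(x0) / t.T.dot(x_hat)
    return t, v

# Input data 2 (with missing values) and a seeded random start.
x = np.array([
    [5., 4., 1., np.nan],
    [4., 4., 1., np.nan],
    [5., 4., np.nan, 1.],
    [1., np.nan, 5., 5.],
    [1., 1., np.nan, 5.],
])
rng = np.random.RandomState(0)
t = rng.rand(x.shape[0], 2)
v = rng.rand(2, x.shape[1])

errors = []
for _ in range(100):
    t, v = update_alternating(x, t, v)
    # Squared error over the observed entries only.
    errors.append(float(np.nansum((x - t.dot(v)) ** 2)))
```

After the loop, `t.dot(v)` also holds values at the positions that were `np.nan` in `x`, which is what makes this usable as a rating predictor in the scenario described for input data 2.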
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 2", | |
"language": "python", | |
"name": "python2" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 2 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython2", | |
"version": "2.7.10" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 0 | |
} |