@kazimiersmith
Created February 19, 2024 23:29
tree_nb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# <center> Tree-based Methods and Support Vector Machine </center>"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"from sklearn import preprocessing\n",
"from sklearn.tree import DecisionTreeRegressor, plot_tree\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.model_selection import train_test_split, cross_val_score\n",
"from itertools import product\n",
"from sklearn.metrics import accuracy_score, mean_squared_error"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tree-based methods"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this exercise, you will train several tree-based regression models. The dataset is in \"UsedCars.csv\"; use the `price` column as the target variable and the remaining columns as features."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data Preprocessing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load the data and one-hot encode categorical columns. Split the data into training and test sets."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"UsedCars = pd.read_csv(\"UsedCars.csv\")\n",
"\n",
"# One-hot encode the categorical (object-dtype) columns\n",
"UsedCars = pd.get_dummies(UsedCars, columns=UsedCars.select_dtypes(include=['object']).columns, drop_first=True)\n",
"\n",
"X_train_UsedCars, X_test_UsedCars, y_train_UsedCars, y_test_UsedCars \\\n",
" = train_test_split(UsedCars.drop(columns='price'), UsedCars.price, test_size=0.2, random_state=10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fit a simple tree"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use `DecisionTreeRegressor` to fit a simple regression tree on the training set with `max_depth = 3`. Visualize the tree with `plot_tree`. What are the two most important features (i.e. columns of X) selected by this tree for splitting?"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Text(0.5, 0.875, 'X[1] <= 2007.5\\nsquared_error = 333305140.254\\nsamples = 16050\\nvalue = 30689.784'),\n",
" Text(0.25, 0.625, 'X[1] <= 2006.5\\nsquared_error = 96343187.319\\nsamples = 8371\\nvalue = 16962.349'),\n",
" Text(0.125, 0.375, 'X[1] <= 2002.5\\nsquared_error = 24405215.916\\nsamples = 5458\\nvalue = 10988.437'),\n",
" Text(0.0625, 0.125, 'squared_error = 10711830.676\\nsamples = 2679\\nvalue = 8069.068'),\n",
" Text(0.1875, 0.125, 'squared_error = 21469429.499\\nsamples = 2779\\nvalue = 13802.756'),\n",
" Text(0.375, 0.375, 'X[0] <= 70527.5\\nsquared_error = 38978032.29\\nsamples = 2913\\nvalue = 28155.487'),\n",
" Text(0.3125, 0.125, 'squared_error = 39544608.116\\nsamples = 1250\\nvalue = 32009.03'),\n",
" Text(0.4375, 0.125, 'squared_error = 19000375.501\\nsamples = 1663\\nvalue = 25258.957'),\n",
" Text(0.75, 0.625, 'X[0] <= 42138.5\\nsquared_error = 162260922.09\\nsamples = 7679\\nvalue = 45654.278'),\n",
" Text(0.625, 0.375, 'X[1] <= 2011.5\\nsquared_error = 105934860.965\\nsamples = 3914\\nvalue = 54037.27'),\n",
" Text(0.5625, 0.125, 'squared_error = 87916801.383\\nsamples = 1784\\nvalue = 48231.098'),\n",
" Text(0.6875, 0.125, 'squared_error = 69141703.913\\nsamples = 2130\\nvalue = 58900.28'),\n",
" Text(0.875, 0.375, 'X[1] <= 2009.5\\nsquared_error = 71813552.873\\nsamples = 3765\\nvalue = 36939.529'),\n",
" Text(0.8125, 0.125, 'squared_error = 46019370.82\\nsamples = 2243\\nvalue = 32847.739'),\n",
" Text(0.9375, 0.125, 'squared_error = 48790148.191\\nsamples = 1522\\nvalue = 42969.676')]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"#TODO\n",
"regressor = DecisionTreeRegressor(max_depth = 3)\n",
"tree = regressor.fit(X_train_UsedCars, y_train_UsedCars)\n",
"plot_tree(tree, fontsize = 8)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two most important features are \"year\" (X[1]) and \"mileage\" (X[0]): every split in the depth-3 tree uses one of them, and the root split is on \"year\"."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fit larger regression trees and find the optimal parameters by Cross-Validation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this section, we focus on two parameters in the regression tree model: \n",
"\n",
"* `max_depth`: the maximum depth. The possible values are 1,2,...,30.\n",
"* `min_samples_leaf`: the minimum number of samples required in a leaf node. The possible values are 3 or 30.\n",
"\n",
"Leave the other parameters at their defaults. Use Cross-Validation to find the parameters that minimize the Root Mean Squared Error (RMSE) in CV. To do this, use the function `cross_val_score` with `scoring='neg_root_mean_squared_error'` to obtain the negative RMSE in CV. Note that `cross_val_score` returns the negative RMSE of each Cross-Validation fold, so we need to average over the folds. \n",
"\n",
"Plot the CV RMSE (y-axis) as a function of `max_depth` (x-axis) for each `min_samples_leaf`, so there should be two curves (one for `min_samples_leaf` = 3 and the other for `min_samples_leaf` = 30) on the same plot. Add a legend to identify the curves. \n",
"\n",
"With which parameters does the regression tree achieve the lowest CV RMSE? Report this RMSE. "
]
},
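{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the selection step described above, written in plain Python so that it runs without the dataset. The helper names (`mean_cv_rmse`, `best_setting`) and the per-fold numbers are hypothetical, made up for illustration; they are not results from UsedCars.csv. The sketch averages the per-fold negative RMSEs, flips the sign, and picks the setting with the lowest mean CV RMSE."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from statistics import mean\n",
"\n",
"def mean_cv_rmse(neg_rmse_per_fold):\n",
"    # Average the per-fold negative RMSEs, then flip the sign to get a positive RMSE\n",
"    return -mean(neg_rmse_per_fold)\n",
"\n",
"def best_setting(fold_scores):\n",
"    # Map each setting to its mean CV RMSE and return the setting with the lowest value\n",
"    summary = {setting: mean_cv_rmse(scores) for setting, scores in fold_scores.items()}\n",
"    best = min(summary, key=summary.get)\n",
"    return best, summary[best]\n",
"\n",
"# Hypothetical per-fold scores keyed by (max_depth, min_samples_leaf); not taken from UsedCars.csv\n",
"example_scores = {\n",
"    (2, 3): [-9000.0, -9200.0, -8800.0],\n",
"    (8, 3): [-4500.0, -4600.0, -4400.0],\n",
"    (8, 30): [-4700.0, -4900.0, -4800.0],\n",
"}\n",
"best_params, best_rmse = best_setting(example_scores)\n",
"print(best_params, best_rmse)"
]
},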
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"max_depth = range(1, 31)\n",
"RMSE_30 = [np.mean(cross_val_score(DecisionTreeRegressor(max_depth = m, min_samples_leaf = 30),\n",
" X_train_UsedCars,\n",
" y_train_UsedCars,\n",
" scoring = 'neg_root_mean_squared_error'))\n",
" for m in max_depth]\n",
"RMSE_3 = [np.mean(cross_val_score(DecisionTreeRegressor(max_depth = m, min_samples_leaf = 3),\n",
" X_train_UsedCars,\n",
" y_train_UsedCars,\n",
" scoring = 'neg_root_mean_squared_error'))\n",
" for m in max_depth]"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
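],
"source": [
"# RMSE_30 and RMSE_3 hold mean *negative* RMSEs, so negate them to plot the RMSE itself\n",
"plt.plot(max_depth, -np.array(RMSE_30), color = 'red', label = 'min_samples_leaf = 30')\n",
"plt.plot(max_depth, -np.array(RMSE_3), color = 'blue', label = 'min_samples_leaf = 3')\n",
"plt.xlabel('max_depth')\n",
"plt.ylabel('CV RMSE')\n",
"plt.legend()\n",
"plt.show()"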
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"7"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
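],
"source": [
"# RMSE_3 stores negative RMSEs, so the maximum marks the lowest RMSE; index i corresponds to max_depth = i + 1\n",
"np.argmax(RMSE_3)"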
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"-4523.4274428918"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.max(RMSE_3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Setting `min_samples_leaf = 3` and `max_depth = 8` (index 7 above) achieves the lowest CV RMSE, approximately 4523.43."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, evaluate the optimal tree on the test set. Report the test RMSE."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4267.127203919875"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
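],
"source": [
"# Refit the tree with the CV-selected parameters and compute the test RMSE\n",
"regressor = DecisionTreeRegressor(max_depth = 8, min_samples_leaf = 3)\n",
"tree = regressor.fit(X_train_UsedCars, y_train_UsedCars)\n",
"y_pred_UsedCars = tree.predict(X_test_UsedCars)\n",
"# Take the square root of the MSE directly; the `squared` argument is deprecated in newer scikit-learn releases\n",
"np.sqrt(mean_squared_error(y_test_UsedCars, y_pred_UsedCars))"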
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fit Random Forest\n",
"\n",
"Fit random forests on the training set, using the optimal `min_samples_leaf` and `max_depth` selected above. We use Cross-Validation again to find the optimal `n_estimators` (the number of trees in the forest) in [100, 200, 300, ..., 600]. Similar to before, record the CV RMSEs of each model and draw a plot of CV RMSE versus `n_estimators`. Cross-Validation for Random Forest may take a much longer time. Be patient and wait.\n",
"\n",
"What is the optimal value for `n_estimators`? Report the test RMSE of this optimal model.\n",
"\n",
"How does the random forest perform as compared to the decision tree?"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"#TODO\n",
"num_estimators = range(100, 700, 100)\n",
"RMSE = [np.mean(cross_val_score(RandomForestRegressor(n_estimators = n, max_depth = 8, min_samples_leaf = 3),\n",
" X_train_UsedCars,\n",
" y_train_UsedCars,\n",
" scoring = 'neg_root_mean_squared_error'))\n",
" for n in num_estimators]"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
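],
"source": [
"# RMSE holds mean *negative* RMSEs, so negate them to plot the RMSE itself\n",
"plt.plot(num_estimators, -np.array(RMSE), color = 'red', label = 'CV RMSE')\n",
"plt.xlabel('n_estimators')\n",
"plt.ylabel('CV RMSE')\n",
"plt.legend()\n",
"plt.show()"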
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
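],
"source": [
"# RMSE stores negative RMSEs, so the maximum marks the lowest RMSE; index i corresponds to n_estimators = 100 * (i + 1)\n",
"np.argmax(RMSE)"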
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"-4297.726913286067"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.max(RMSE)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Setting `n_estimators = 400` (index 3 above) achieves the lowest CV RMSE, approximately 4297.73."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4124.667841213942"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
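],
"source": [
"# Refit the forest with the CV-selected number of trees and compute the test RMSE\n",
"regressor = RandomForestRegressor(n_estimators = 400, max_depth = 8, min_samples_leaf = 3)\n",
"forest = regressor.fit(X_train_UsedCars, y_train_UsedCars)\n",
"y_pred_UsedCars = forest.predict(X_test_UsedCars)\n",
"# Take the square root of the MSE directly; the `squared` argument is deprecated in newer scikit-learn releases\n",
"np.sqrt(mean_squared_error(y_test_UsedCars, y_pred_UsedCars))"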
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The random forest achieves a somewhat lower test RMSE than the single decision tree (about 4124.67 versus 4267.13), so it performs slightly better on this dataset."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Support Vector Machine"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this exercise, we simulate a dataset containing two circles with the same center and different sizes and then split it into training and test sets. "
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.collections.PathCollection at 0x2a98e9f26e0>"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"from sklearn.svm import SVC\n",
"from sklearn.datasets import make_circles\n",
"\n",
"X_circle, y_circle = make_circles(500, factor=.6, noise=.1, random_state=0)\n",
"\n",
"X_train_circle, X_test_circle, y_train_circle, y_test_circle \\\n",
" = train_test_split(X_circle, y_circle, test_size=0.2, random_state=10)\n",
"\n",
"plt.scatter(X_train_circle[:, 0], X_train_circle[:, 1], c=y_train_circle, s=5, cmap=plt.cm.RdYlBu)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we are going to fit Support Vector Machines to this dataset and use Cross-Validation to find the optimal parameters. In this problem, we focus on the following two parameters: \n",
"* C: Regularization parameter. The possible C values are 20 positive numbers evenly distributed on \\[$10^{-4}$, $10^4$\\]. \n",
"* kernel: The kernel type to be used in the algorithm. Possible options are 'linear', 'rbf' and 'poly'.\n",
"\n",
"Leave the other parameters at their defaults. For each pair of parameters, use `cross_val_score` to obtain the CV score. Note that `cross_val_score` returns the CV score of each Cross-Validation fold, so we need to average over the folds. Plot the CV scores as a function of C for each kernel, so there should be three curves on the same figure, one per kernel. Remember to add a legend. "
]
},
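{
"cell_type": "markdown",
"metadata": {},
"source": [
"The grid of C values can be read in more than one way. The sketch below assumes the 20 values are evenly spaced in the exponent (a log-spaced grid), which is also what `10 ** np.linspace(-4, 4, 20)` produces. The names `n_values`, `exponents` and `C_grid` are illustrative only. Under this reading, consecutive values differ by a constant factor rather than a constant step."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Assumption: 'evenly distributed' means evenly spaced exponents between -4 and 4\n",
"n_values = 20\n",
"exponents = [-4 + 8 * i / (n_values - 1) for i in range(n_values)]\n",
"C_grid = [10 ** e for e in exponents]\n",
"\n",
"# Ratio between neighbouring grid values; constant for a log-spaced grid\n",
"ratios = [C_grid[i + 1] / C_grid[i] for i in range(n_values - 1)]\n",
"print(C_grid[0], C_grid[-1], ratios[0])"
]
},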
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"kernels = ['linear', 'rbf', 'poly']\n",
"\n",
"Cs = 10 ** np.linspace(-4, 4, 20)\n",
"\n",
"cv_scores = [[np.mean(cross_val_score(SVC(C = c, kernel = k),\n",
" X_train_circle,\n",
" y_train_circle,\n",
" scoring = 'neg_root_mean_squared_error'))\n",
" for c in Cs] for k in kernels]"
]
},
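{
"cell_type": "markdown",
"metadata": {},
"source": [
"A note on the scoring choice in the cell above: with 0/1 class labels, the squared error is 1 for a misclassified point and 0 otherwise, so the RMSE of a fold equals the square root of its misclassification rate. The plain-Python sketch below checks this on a hypothetical fold (the labels and the helper names `rmse` and `accuracy` are made up for illustration). Because the square root is applied per fold before averaging, the mean negative RMSE is not an exact function of the mean accuracy; passing `scoring='accuracy'` to `cross_val_score` would report accuracy directly."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"def rmse(y_true, y_pred):\n",
"    # Root mean squared error between two equal-length label lists\n",
"    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))\n",
"\n",
"def accuracy(y_true, y_pred):\n",
"    # Fraction of labels predicted correctly\n",
"    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)\n",
"\n",
"# Hypothetical fold: 100 labels, 2 of them misclassified\n",
"y_true = [0] * 50 + [1] * 50\n",
"y_pred = [1] * 2 + [0] * 48 + [1] * 50\n",
"fold_rmse = rmse(y_true, y_pred)\n",
"fold_acc = accuracy(y_true, y_pred)\n",
"\n",
"# For 0/1 labels, RMSE squared equals the misclassification rate (1 - accuracy)\n",
"print(fold_rmse ** 2, 1 - fold_acc)"
]
},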
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.plot(Cs, cv_scores[0], color = 'red', label = 'Linear kernel')\n",
"plt.plot(Cs, cv_scores[1], color = 'blue', label = 'RBF kernel')\n",
"plt.plot(Cs, cv_scores[2], color = 'green', label = 'Poly kernel')\n",
"plt.xscale('log')  # the C values span eight orders of magnitude\n",
"plt.xlabel('C')\n",
"plt.ylabel('Mean CV score')\n",
"plt.legend()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Based on the CV scores, which parameters are the best? Use the best model to predict on the test set and report the test accuracy."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"78.47599703514607"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"best_C = Cs[np.argmax(cv_scores[1])]\n",
"best_C"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The plot shows that the RBF kernel is best. For that kernel, C of approximately 78.48 gives the best CV score (here, the lowest CV RMSE)."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.98"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"machine = SVC(C = best_C, kernel = 'rbf')\n",
"machine_fit = machine.fit(X_train_circle, y_train_circle)\n",
"y_pred_circle = machine_fit.predict(X_test_circle)\n",
"# The question asks for test accuracy; with 0/1 labels the earlier RMSE of 0.1414 corresponds to 1 - 0.1414**2 = 0.98\n",
"accuracy_score(y_test_circle, y_pred_circle)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the following function to show the decision boundary of the best SVC model."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"def plot_decision_boundary(model, X, y):\n",
" x_min, x_max = X[:, 0].min() - 0.1, X[:, 0].max() + 0.1\n",
" y_min, y_max = X[:, 1].min() - 0.1, X[:, 1].max() + 0.1\n",
" xx, yy = np.meshgrid(np.linspace(x_min, x_max, 101), np.linspace(y_min, y_max, 101))\n",
" cmap = plt.cm.Spectral\n",
" \n",
" X_test = np.c_[xx.ravel(), yy.ravel()]\n",
" y_pred = model.predict(X_test).reshape(xx.shape)\n",
" # y_pred = np.argmax(y_pred, axis=1).reshape(xx.shape)\n",
" plt.contourf(xx, yy, y_pred, cmap=plt.cm.Spectral, alpha=0.5, \n",
" levels=[-1, 0, 1], linestyles=['--', '-', '--']) \n",
" \n",
" plt.scatter(X[:, 0], X[:, 1], c=y, s=5, cmap=plt.cm.RdYlBu)\n",
" plt.xlim(xx.min(), xx.max())\n",
" plt.ylim(yy.min(), yy.max())"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plot_decision_boundary(machine_fit, X_train_circle, y_train_circle)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}