MATHEMATICS FOR MACHINE LEARNING
Marc Peter Deisenroth
A. Aldo Faisal
Cheng Soon Ong
Contents
Foreword

Part I: Mathematical Foundations

1 Introduction and Motivation
  1.1 Finding Words for Intuitions
  1.2 Two Ways to Read This Book
  1.3 Exercises and Feedback

2 Linear Algebra
  2.1 Systems of Linear Equations
  2.2 Matrices
  2.3 Solving Systems of Linear Equations
  2.4 Vector Spaces
  2.5 Linear Independence
  2.6 Basis and Rank
  2.7 Linear Mappings
  2.8 Affine Spaces
  2.9 Further Reading
  Exercises

3 Analytic Geometry
  3.1 Norms
  3.2 Inner Products
  3.3 Lengths and Distances
  3.4 Angles and Orthogonality
  3.5 Orthonormal Basis
  3.6 Orthogonal Complement
  3.7 Inner Product of Functions
  3.8 Orthogonal Projections
  3.9 Rotations
  3.10 Further Reading
  Exercises

4 Matrix Decompositions
  4.1 Determinant and Trace
  4.2 Eigenvalues and Eigenvectors
  4.3 Cholesky Decomposition
  4.4 Eigendecomposition and Diagonalization
  4.5 Singular Value Decomposition
  4.6 Matrix Approximation
  4.7 Matrix Phylogeny
  4.8 Further Reading
  Exercises

5 Vector Calculus
  5.1 Differentiation of Univariate Functions
  5.2 Partial Differentiation and Gradients
  5.3 Gradients of Vector-Valued Functions
  5.4 Gradients of Matrices
  5.5 Useful Identities for Computing Gradients
  5.6 Backpropagation and Automatic Differentiation
  5.7 Higher-Order Derivatives
  5.8 Linearization and Multivariate Taylor Series
  5.9 Further Reading
  Exercises

6 Probability and Distributions
  6.1 Construction of a Probability Space
  6.2 Discrete and Continuous Probabilities
  6.3 Sum Rule, Product Rule, and Bayes' Theorem
  6.4 Summary Statistics and Independence
  6.5 Gaussian Distribution
  6.6 Conjugacy and the Exponential Family
  6.7 Change of Variables/Inverse Transform
  6.8 Further Reading
  Exercises

7 Continuous Optimization
  7.1 Optimization Using Gradient Descent
  7.2 Constrained Optimization and Lagrange Multipliers
  7.3 Convex Optimization
  7.4 Further Reading
  Exercises

Part II: Central Machine Learning Problems

8 When Models Meet Data
  8.1 Data, Models, and Learning
  8.2 Empirical Risk Minimization
  8.3 Parameter Estimation
  8.4 Probabilistic Modeling and Inference
  8.5 Directed Graphical Models
  8.6 Model Selection

9 Linear Regression
  9.1 Problem Formulation
  9.2 Parameter Estimation
  9.3 Bayesian Linear Regression
  9.4 Maximum Likelihood as Orthogonal Projection
  9.5 Further Reading

10 Dimensionality Reduction with Principal Component Analysis
  10.1 Problem Setting
  10.2 Maximum Variance Perspective
  10.3 Projection Perspective
  10.4 Eigenvector Computation and Low-Rank Approximations
  10.5 PCA in High Dimensions
  10.6 Key Steps of PCA in Practice
  10.7 Latent Variable Perspective
  10.8 Further Reading

11 Density Estimation with Gaussian Mixture Models
  11.1 Gaussian Mixture Model
  11.2 Parameter Learning via Maximum Likelihood
  11.3 EM Algorithm
  11.4 Latent-Variable Perspective
  11.5 Further Reading

12 Classification with Support Vector Machines
  12.1 Separating Hyperplanes
  12.2 Primal Support Vector Machine
  12.3 Dual Support Vector Machine
  12.4 Kernels
  12.5 Numerical Solution
  12.6 Further Reading

References
Index

This material is published by Cambridge University Press as Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong (2020). This version is free to view and download for personal use only. Not for re-distribution, re-sale, or use in derivative works. © M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2023. https://mml-book.com.
Foreword
Machine learning is the latest in a long line of attempts to distill human knowledge and reasoning into a form that is suitable for constructing machines and engineering automated systems. As machine learning becomes more ubiquitous and its software packages become easier to use, it is natural and desirable that the low-level technical details are abstracted away and hidden from the practitioner. However, this brings with it the danger that a practitioner becomes unaware of the design decisions and, hence, the limits of machine learning algorithms.
The enthusiastic practitioner who is interested to learn more about the magic behind successful machine learning algorithms currently faces a daunting set of pre-requisite knowledge:
Programming languages and data analysis tools
Large-scale computation and the associated frameworks
Mathematics and statistics and how machine learning builds on it
At universities, introductory courses on machine learning tend to spend early parts of the course covering some of these pre-requisites. For historical reasons, courses in machine learning tend to be taught in the computer science department, where students are often trained in the first two areas of knowledge, but not so much in mathematics and statistics.
Current machine learning textbooks primarily focus on machine learning algorithms and methodologies and assume that the reader is competent in mathematics and statistics. Therefore, these books only spend one or two chapters on background mathematics, either at the beginning of the book or as appendices. We have found many people who want to delve into the foundations of basic machine learning methods who struggle with the mathematical knowledge required to read a machine learning textbook. Having taught undergraduate and graduate courses at universities, we find that the gap between high school mathematics and the mathematics level required to read a standard machine learning textbook is too big for many people.
This book brings the mathematical foundations of basic machine learning concepts to the fore and collects the information in a single place so that this skills gap is narrowed or even closed.
Why Another Book on Machine Learning?
Machine learning builds upon the language of mathematics to express concepts that seem intuitively obvious but that are surprisingly difficult to formalize. Once formalized properly, we can gain insights into the task we want to solve. One common complaint of students of mathematics around the globe is that the topics covered seem to have little relevance to practical problems. We believe that machine learning is an obvious and direct motivation for people to learn mathematics.
This book is intended to be a guidebook to the vast mathematical literature that forms the foundations of modern machine learning. ("Math is linked in the popular mind with phobia and anxiety. You'd think we're discussing spiders." Strogatz, 2014, page 281.) We motivate the need for mathematical concepts by directly pointing out their usefulness in the context of fundamental machine learning problems. In the interest of keeping the book short, many details and more advanced concepts have been left out. Equipped with the basic concepts presented here, and how they fit into the larger context of machine learning, the reader can find numerous resources for further study, which we provide at the end of the respective chapters. For readers with a mathematical background, this book provides a brief but precisely stated glimpse of machine learning. In contrast to other books that focus on methods and models of machine learning (MacKay, 2003; Bishop, 2006; Alpaydin, 2010; Barber, 2012; Murphy, 2012; Shalev-Shwartz and Ben-David, 2014; Rogers and Girolami, 2016) or programmatic aspects of machine learning (Müller and Guido, 2016; Raschka and Mirjalili, 2017; Chollet and Allaire, 2018), we provide only four representative examples of machine learning algorithms. Instead, we focus on the mathematical concepts behind the models themselves. We hope that readers will be able to gain a deeper understanding of the basic questions in machine learning and connect practical questions arising from the use of machine learning with fundamental choices in the mathematical model.
We do not aim to write a classical machine learning book. Instead, our intention is to provide the mathematical background, applied to four central machine learning problems, to make it easier to read other machine learning textbooks.
Who Is the Target Audience?
As applications of machine learning become widespread in society, we believe that everybody should have some understanding of its underlying principles. This book is written in an academic mathematical style, which enables us to be precise about the concepts behind machine learning. We encourage readers unfamiliar with this seemingly terse style to persevere and to keep the goals of each topic in mind. We sprinkle comments and remarks throughout the text, in the hope that they provide useful guidance with respect to the big picture.
The book assumes the reader to have mathematical knowledge commonly covered in high school mathematics and physics. For example, the reader should have seen derivatives and integrals before, and geometric vectors in two or three dimensions. Starting from there, we generalize these concepts. Therefore, the target audience of the book includes undergraduate university students, evening learners and learners participating in online machine learning courses.
In analogy to music, there are three types of interaction that people have with machine learning:
Astute Listener The democratization of machine learning by the provision of open-source software, online tutorials and cloud-based tools allows users to not worry about the specifics of pipelines. Users can focus on extracting insights from data using off-the-shelf tools. This enables non-tech-savvy domain experts to benefit from machine learning. This is similar to listening to music; the user is able to choose and discern between different types of machine learning, and benefits from it. More experienced users are like music critics, asking important questions about the application of machine learning in society such as ethics, fairness, and privacy of the individual. We hope that this book provides a foundation for thinking about the certification and risk management of machine learning systems, and allows them to use their domain expertise to build better machine learning systems.
Experienced Artist Skilled practitioners of machine learning can plug and play different tools and libraries into an analysis pipeline. The stereotypical practitioner would be a data scientist or engineer who understands machine learning interfaces and their use cases, and is able to perform wonderful feats of prediction from data. This is similar to a virtuoso playing music, where highly skilled practitioners can bring existing instruments to life and bring enjoyment to their audience. Using the mathematics presented here as a primer, practitioners would be able to understand the benefits and limits of their favorite method, and to extend and generalize existing machine learning algorithms. We hope that this book provides the impetus for more rigorous and principled development of machine learning methods.
Fledgling Composer As machine learning is applied to new domains, developers of machine learning need to develop new methods and extend existing algorithms. They are often researchers who need to understand the mathematical basis of machine learning and uncover relationships between different tasks. This is similar to composers of music who, within the rules and structure of musical theory, create new and amazing pieces. We hope this book provides a high-level overview of other technical books for people who want to become composers of machine learning. There is a great need in society for new researchers who are able to propose and explore novel approaches for attacking the many challenges of learning from data.
Acknowledgments
We are grateful to many people who looked at early drafts of the book and suffered through painful expositions of concepts. We tried to implement their ideas that we did not vehemently disagree with. We would like to especially acknowledge Christfried Webers for his careful reading of many parts of the book, and his detailed suggestions on structure and presentation. Many friends and colleagues have also been kind enough to provide their time and energy on different versions of each chapter. We have been lucky to benefit from the generosity of the online community, who have suggested improvements via https://github.com, which greatly improved the book.
The following people have found bugs, proposed clarifications and suggested relevant literature, either via https://github.com or personal communication. Their names are sorted alphabetically.
Abdul-Ganiy Usman
Adam Gaier
Adele Jackson
Aditya Menon
Alasdair Tran
Aleksandar Krnjaic
Alexander Makrigiorgos
Alfredo Canziani
Ali Shafti
Amr Khalifa
Andrew Tanggara
Angus Gruen
Antal A. Buss
Antoine Toisoul Le Cann
Areg Sarvazyan
Artem Artemev
Artyom Stepanov
Bill Kromydas
Bob Williamson
Boon Ping Lim
Chao Qu
Cheng Li
Chris Sherlock
Christopher Gray
Daniel McNamara
Daniel Wood
Darren Siegel
David Johnston
Dawei Chen
Ellen Broad
Fengkuangtian Zhu
Fiona Condon
Georgios Theodorou
He Xin
Irene Raissa Kameni
Jakub Nabaglo
James Hensman
Jamie Liu
Jean Kaddour
Jean-Paul Ebejer
Jerry Qiang
Jitesh Sindhare
John Lloyd
Jonas Ngnawe
Jon Martin
Justin Hsi
Kai Arulkumaran
Kamil Dreczkowski
Lily Wang
Lionel Tondji Ngoupeyou
Lydia Knüfing
Mahmoud Aslan
Mark Hartenstein
Mark van der Wilk
Markus Hegland
Martin Hewing
Matthew Alger
Matthew Lee
Maximus McCann
Mengyan Zhang
Michael Bennett
Michael Pedersen
Minjeong Shin
Mohammad Malekzadeh
Naveen Kumar
Nico Montali
Oscar Armas
Patrick Henriksen
Patrick Wieschollek
Pattarawat Chormai
Paul Kelly
Petros Christodoulou
Piotr Januszewski
Pranav Subramani
Quyu Kong
Ragib Zaman
Rui Zhang
Ryan-Rhys Griffiths
Salomon Kabongo
Samuel Ogunmola
Sandeep Mavadia
Sarvesh Nikumbh
Sebastian Raschka
Senanayak Sesh Kumar Karri
Seung-Heon Baek
Shahbaz Chaudhary
Shakir Mohamed
Shawn Berry
Sheikh Abdul Raheem Ali
Sheng Xue
Sridhar Thiagarajan
Syed Nouman Hasany
Szymon Brych
Thomas Bühler
Timur Sharapov
Tom Melamed
Vincent Adam
Vincent Dutordoir
Vu Minh
Wasim Aftab
Wen Zhi
Wojciech Stokowiec
Xiaonan Chong
Xiaowei Zhang
Yazhou Hao
Yicheng Luo
Young Lee
Yu Lu
Yun Cheng
Yuxiao Huang
Zac Cranko
Zijian Cao
Zoe Nolan

Contributors through GitHub, whose real names were not listed on their GitHub profile, are:

SamDataMad
bumptiousmonkey
idoamihai
deepakiim
insad
HorizonP
cs-maillist
kudo23
empet
victorBigand
17SKYE
jessjing1995
We are also very grateful to Parameswaran Raman and the many anonymous reviewers, organized by Cambridge University Press, who read one or more chapters of earlier versions of the manuscript, and provided constructive criticism that led to considerable improvements. A special mention goes to Dinesh Singh Negi, our LaTeX support, for detailed and prompt advice about LaTeX-related issues. Last but not least, we are very grateful to our editor Lauren Cowles, who has been patiently guiding us through the gestation process of this book.
Table of Symbols

Symbol               Typical meaning
a, b, c, α, β, γ     Scalars are lowercase
x, y, z              Vectors are bold lowercase
A, B, C              Matrices are bold uppercase
x⊤, A⊤               Transpose of a vector or matrix
A−1                  Inverse of a matrix
⟨x, y⟩               Inner product of x and y
x⊤y                  Dot product of x and y
B = (b1, b2, b3)     (Ordered) tuple
B = [b1, b2, b3]     Matrix of column vectors stacked horizontally
B = {b1, b2, b3}     Set of vectors (unordered)
Z, N                 Integers and natural numbers, respectively
R, C                 Real and complex numbers, respectively
Rn                   n-dimensional vector space of real numbers
∀x                   Universal quantifier: for all x
∃x                   Existential quantifier: there exists x
a := b               a is defined as b
a =: b               b is defined as a
a ∝ b                a is proportional to b, i.e., a = constant · b
g ◦ f                Function composition: "g after f"
⇐⇒                   If and only if
=⇒                   Implies
A, C                 Sets
a ∈ A                a is an element of set A
∅                    Empty set
A\B                  A without B: the set of elements in A but not in B
D                    Number of dimensions; indexed by d = 1, . . . , D
N                    Number of data points; indexed by n = 1, . . . , N
Im                   Identity matrix of size m × m
0m,n                 Matrix of zeros of size m × n
1m,n                 Matrix of ones of size m × n
ei                   Standard/canonical vector (where i is the component that is 1)
dim                  Dimensionality of vector space
rk(A)                Rank of matrix A
Im(Φ)                Image of linear mapping Φ
ker(Φ)               Kernel (null space) of a linear mapping Φ
span[b1]             Span (generating set) of b1
tr(A)                Trace of A
det(A)               Determinant of A
| · |                Absolute value or determinant (depending on context)
∥·∥                  Norm; Euclidean, unless specified
λ                    Eigenvalue or Lagrange multiplier
Eλ                   Eigenspace corresponding to eigenvalue λ
x ⊥ y                Vectors x and y are orthogonal
V                    Vector space
V⊥                   Orthogonal complement of vector space V
Σ_{n=1}^{N} xn       Sum of the xn: x1 + . . . + xN
Π_{n=1}^{N} xn       Product of the xn: x1 · . . . · xN
θ                    Parameter vector
∂f/∂x                Partial derivative of f with respect to x
df/dx                Total derivative of f with respect to x
∇                    Gradient
f∗ = min_x f(x)      The smallest function value of f
x∗ ∈ arg min_x f(x)  The value x∗ that minimizes f (note: arg min returns a set of values)
L                    Lagrangian
L                    Negative log-likelihood
(n choose k)         Binomial coefficient, n choose k
VX[x]                Variance of x with respect to the random variable X
EX[x]                Expectation of x with respect to the random variable X
CovX,Y[x, y]         Covariance between x and y
X ⊥⊥ Y | Z           X is conditionally independent of Y given Z
X ∼ p                Random variable X is distributed according to p
N(µ, Σ)              Gaussian distribution with mean µ and covariance Σ
Ber(µ)               Bernoulli distribution with parameter µ
Bin(N, µ)            Binomial distribution with parameters N, µ
Beta(α, β)           Beta distribution with parameters α, β
Table of Abbreviations and Acronyms
Acronym Meaning
e.g. Exempli gratia (Latin: for example)
GMM Gaussian mixture model
i.e. Id est (Latin: this means)
i.i.d. Independent, identically distributed
MAP Maximum a posteriori
MLE Maximum likelihood estimation/estimator
ONB Orthonormal basis
PCA Principal component analysis
PPCA Probabilistic principal component analysis
REF Row-echelon form
SPD Symmetric, positive definite
SVM Support vector machine
Part I
Mathematical Foundations
1 Introduction and Motivation
Machine learning is about designing algorithms that automatically extract valuable information from data. The emphasis here is on "automatic", i.e., machine learning is concerned with general-purpose methodologies that can be applied to many datasets, while producing something that is meaningful. There are three concepts that are at the core of machine learning: data, a model, and learning.
Since machine learning is inherently data driven, data is at the core of machine learning. The goal of machine learning is to design general-purpose methodologies to extract valuable patterns from data, ideally without much domain-specific expertise. For example, given a large corpus of documents (e.g., books in many libraries), machine learning methods can be used to automatically find relevant topics that are shared across documents (Hoffman et al., 2010). To achieve this goal, we design models that are typically related to the process that generates data, similar to the dataset we are given. For example, in a regression setting, the model would describe a function that maps inputs to real-valued outputs. To paraphrase Mitchell (1997): A model is said to learn from data if its performance on a given task improves after the data is taken into account. The goal is to find good models that generalize well to yet unseen data, which we may care about in the future. Learning can be understood as a way to automatically find patterns and structure in data by optimizing the parameters of the model.
While machine learning has seen many success stories, and software is readily available to design and train rich and flexible machine learning systems, we believe that the mathematical foundations of machine learning are important in order to understand fundamental principles upon which more complicated machine learning systems are built. Understanding these principles can facilitate creating new machine learning solutions, understanding and debugging existing approaches, and learning about the inherent assumptions and limitations of the methodologies we are working with.
1.1 Finding Words for Intuitions
A challenge we face regularly in machine learning is that concepts and words are slippery, and a particular component of the machine learning system can be abstracted to different mathematical concepts. For example, the word "algorithm" is used in at least two different senses in the context of machine learning. In the first sense, we use the phrase "machine learning algorithm" to mean a system that makes predictions based on input data. We refer to these algorithms as predictors. In the second sense, we use the exact same phrase "machine learning algorithm" to mean a system that adapts some internal parameters of the predictor so that it performs well on future unseen input data. Here we refer to this adaptation as training a system.
This book will not resolve the issue of ambiguity, but we want to highlight upfront that, depending on the context, the same expressions can mean different things. However, we attempt to make the context sufficiently clear to reduce the level of ambiguity.
The first part of this book introduces the mathematical concepts and foundations needed to talk about the three main components of a machine learning system: data, models, and learning. We will briefly outline these components here, and we will revisit them in Chapter 8 once we have discussed the necessary mathematical concepts.
While not all data is numerical, it is often useful to consider data in a number format. In this book, we assume that data has already been appropriately converted into a numerical representation suitable for reading into a computer program. Therefore, we think of data as vectors. As another illustration of how subtle words are, there are (at least) three different ways to think about vectors: a vector as an array of numbers (a computer science view), a vector as an arrow with a direction and magnitude (a physics view), and a vector as an object that obeys addition and scaling (a mathematical view).
A model is typically used to describe a process for generating data, similar to the dataset at hand. Therefore, good models can also be thought of as simplified versions of the real (unknown) data-generating process, capturing aspects that are relevant for modeling the data and extracting hidden patterns from it. A good model can then be used to predict what would happen in the real world without performing real-world experiments.
We now come to the crux of the matter, the learning component of machine learning. Assume we are given a dataset and a suitable model. Training the model means to use the data available to optimize some parameters of the model with respect to a utility function that evaluates how well the model predicts the training data. Most training methods can be thought of as an approach analogous to climbing a hill to reach its peak. In this analogy, the peak of the hill corresponds to a maximum of some desired performance measure. However, in practice, we are interested in the model performing well on unseen data. Performing well on data that we have already seen (training data) may only mean that we found a good way to memorize the data. However, this may not generalize well to unseen data, and, in practical applications, we often need to expose our machine learning system to situations that it has not encountered before.
Let us summarize the main concepts of machine learning that we cover in this book:
We represent data as vectors.
We choose an appropriate model, either using the probabilistic or optimization view.
We learn from available data by using numerical optimization methods with the aim that the model performs well on data not used for training.
1.2 Two Ways to Read This Book
We can consider two strategies for understanding the mathematics for machine learning:
Bottom-up: Building up the concepts from foundational to more advanced. This is often the preferred approach in more technical fields, such as mathematics. This strategy has the advantage that the reader at all times is able to rely on their previously learned concepts. Unfortunately, for a practitioner many of the foundational concepts are not particularly interesting by themselves, and the lack of motivation means that most foundational definitions are quickly forgotten.
Top-down: Drilling down from practical needs to more basic requirements. This goal-driven approach has the advantage that the readers know at all times why they need to work on a particular concept, and there is a clear path of required knowledge. The downside of this strategy is that the knowledge is built on potentially shaky foundations, and the readers have to remember a set of words that they do not have any way of understanding.
We decided to write this book in a modular way to separate foundational (mathematical) concepts from applications so that this book can be read in both ways. The book is split into two parts, where Part I lays the mathematical foundations and Part II applies the concepts from Part I to a set of fundamental machine learning problems, which form four pillars of machine learning as illustrated in Figure 1.1: regression, dimensionality reduction, density estimation, and classification. Chapters in Part I mostly build upon the previous ones, but it is possible to skip a chapter and work backward if necessary. Chapters in Part II are only loosely coupled and can be read in any order. There are many pointers forward and backward between the two parts of the book to link mathematical concepts with machine learning algorithms.

Figure 1.1 The foundations and four pillars of machine learning. Pillars: Regression, Dimensionality Reduction, Density Estimation, Classification. Foundations: Vector Calculus, Probability & Distributions, Optimization, Linear Algebra, Analytic Geometry, Matrix Decomposition.
Of course there are more than two ways to read this book. Most readers learn using a combination of top-down and bottom-up approaches, sometimes building up basic mathematical skills before attempting more complex concepts, but also choosing topics based on applications of machine learning.
Part I Is about Mathematics
The four pillars of machine learning we cover in this book (see Figure 1.1) require a solid mathematical foundation, which is laid out in Part I.
We represent numerical data as vectors and represent a table of such data as a matrix. The study of vectors and matrices is called linear algebra, which we introduce in Chapter 2. The collection of vectors as a matrix is also described there.
Given two vectors representing two objects in the real world, we want to make statements about their similarity. The idea is that vectors that are similar should be predicted to have similar outputs by our machine learning algorithm (our predictor). To formalize the idea of similarity between vectors, we need to introduce operations that take two vectors as input and return a numerical value representing their similarity. The construction of similarity and distances is central to analytic geometry and is discussed in Chapter 3.
In Chapter 4, we introduce some fundamental concepts about matrices and matrix decomposition. Some operations on matrices are extremely useful in machine learning, and they allow for an intuitive interpretation of the data and more efficient learning.
We often consider data to be noisy observations of some true underlying signal. We hope that by applying machine learning we can identify the signal from the noise. This requires us to have a language for quantifying what "noise" means. We often would also like to have predictors that allow us to express some sort of uncertainty, e.g., to quantify the confidence we have about the value of the prediction at a particular test data point. Quantification of uncertainty is the realm of probability theory and is covered in Chapter 6.
To train machine learning models, we typically find parameters that maximize some performance measure. Many optimization techniques require the concept of a gradient, which tells us the direction in which to search for a solution. Chapter 5 is about vector calculus and details the concept of gradients, which we subsequently use in Chapter 7, where we talk about optimization to find maxima/minima of functions.
Part II Is about Machine Learning
The second part of the book introduces the four pillars of machine learning as shown in Figure 1.1. We illustrate how the mathematical concepts introduced in the first part of the book are the foundation for each pillar. Broadly speaking, chapters are ordered by difficulty (in ascending order).
In Chapter 8, we restate the three components of machine learning (data, models, and parameter estimation) in a mathematical fashion. In addition, we provide some guidelines for building experimental set-ups that guard against overly optimistic evaluations of machine learning systems. Recall that the goal is to build a predictor that performs well on unseen data.
In Chapter 9, we will have a close look at linear regression, where our objective is to find functions that map inputs x ∈ R^D to corresponding observed function values y ∈ R, which we can interpret as the labels of their respective inputs. We will discuss classical model fitting (parameter estimation) via maximum likelihood and maximum a posteriori estimation, as well as Bayesian linear regression, where we integrate the parameters out instead of optimizing them.
Chapter 10 focuses on dimensionality reduction, the second pillar in Figure 1.1, using principal component analysis. The key objective of dimensionality reduction is to find a compact, lower-dimensional representation of high-dimensional data x ∈ R^D, which is often easier to analyze than the original data. Unlike regression, dimensionality reduction is only concerned with modeling the data; there are no labels associated with a data point x.
In Chapter 11, we will move to our third pillar: density estimation. The objective of density estimation is to find a probability distribution that describes a given dataset. We will focus on Gaussian mixture models for this purpose, and we will discuss an iterative scheme to find the parameters of this model. As in dimensionality reduction, there are no labels associated with the data points x ∈ R^D. However, we do not seek a low-dimensional representation of the data. Instead, we are interested in a density model that describes the data.
Chapter 12 concludes the book with an in-depth discussion of the fourth pillar: classification. We will discuss classification in the context of support vector machines. Similar to regression (Chapter 9), we have inputs x and corresponding labels y. However, unlike regression, where the labels were real-valued, the labels in classification are integers, which requires special care.
1.3 Exercises and Feedback
We provide some exercises in Part I, which can be done mostly by pen and paper. For Part II, we provide programming tutorials (Jupyter notebooks) to explore some properties of the machine learning algorithms we discuss in this book.
We appreciate that Cambridge University Press strongly supports our aim to democratize education and learning by making this book freely available for download at
https://mml-book.com
where tutorials, errata, and additional materials can be found. Mistakes can be reported and feedback provided using the preceding URL.
2 Linear Algebra
When formalizing intuitive concepts, a common approach is to construct a set of objects (symbols) and a set of rules to manipulate these objects. This is known as an algebra. Linear algebra is the study of vectors and certain rules to manipulate vectors. The vectors many of us know from school are called "geometric vectors", which are usually denoted by a small arrow above the letter, e.g., →x and →y. In this book, we discuss more general concepts of vectors and use a bold letter to represent them, e.g., x and y.
In general, vectors are special objects that can be added together and multiplied by scalars to produce another object of the same kind. From an abstract mathematical viewpoint, any object that satisfies these two properties can be considered a vector. Here are some examples of such vector objects:
1. Geometric vectors. This example of a vector may be familiar from high school mathematics and physics. Geometric vectors (see Figure 2.1(a)) are directed segments, which can be drawn (at least in two dimensions). Two geometric vectors →x, →y can be added, such that →x + →y = →z is another geometric vector. Furthermore, multiplication by a scalar λ→x, λ ∈ R, is also a geometric vector. In fact, it is the original vector scaled by λ. Therefore, geometric vectors are instances of the vector concepts introduced previously. Interpreting vectors as geometric vectors enables us to use our intuitions about direction and magnitude to reason about mathematical operations.
2. Polynomials are also vectors; see Figure 2.1(b): Two polynomials can be added together, which results in another polynomial; and they can be multiplied by a scalar λ ∈ R, and the result is a polynomial as well. Therefore, polynomials are (rather unusual) instances of vectors. Note that polynomials are very different from geometric vectors. While geometric vectors are concrete "drawings", polynomials are abstract concepts. However, they are both vectors in the sense previously described.

Figure 2.1 Different types of vectors. Vectors can be surprising objects, including (a) geometric vectors and (b) polynomials.
3. Audio signals are vectors. Audio signals are represented as a series of numbers. We can add audio signals together, and their sum is a new audio signal. If we scale an audio signal, we also obtain an audio signal. Therefore, audio signals are a type of vector, too.
4. Elements of R^n (tuples of n real numbers) are vectors. R^n is more abstract than polynomials, and it is the concept we focus on in this book. For instance,

a = [1, 2, 3]⊤ ∈ R^3   (2.1)

is an example of a triplet of numbers. Adding two vectors a, b ∈ R^n component-wise results in another vector: a + b = c ∈ R^n. Moreover, multiplying a ∈ R^n by λ ∈ R results in a scaled vector λa ∈ R^n. Considering vectors as elements of R^n has an additional benefit that it loosely corresponds to arrays of real numbers on a computer. Many programming languages support array operations, which allow for convenient implementation of algorithms that involve vector operations. (Be careful to check whether array operations actually perform vector operations when implementing on a computer; see the code sketch below.)

Linear algebra focuses on the similarities between these vector concepts. We can add them together and multiply them by scalars. We will largely focus on vectors in R^n since most algorithms in linear algebra are formulated in R^n. We will see in Chapter 8 that we often consider data to be represented as vectors in R^n. In this book, we will focus on finite-dimensional vector spaces, in which case there is a 1:1 correspondence between any kind of vector and R^n. When it is convenient, we will use intuitions about geometric vectors and consider array-based algorithms. (Margin pointers in the original: Pavel Grinfeld's series on linear algebra: http://tinyurl.com/nahclwm; Gilbert Strang's course on linear algebra: http://tinyurl.com/29p5q8j; 3Blue1Brown series on linear algebra: https://tinyurl.com/h5g4kps.)

One major idea in mathematics is the idea of "closure". This is the question: What is the set of all things that can result from my proposed operations? In the case of vectors: What is the set of vectors that can result by starting with a small set of vectors, and adding them to each other and scaling them? This results in a vector space (Section 2.4). The concept of a vector space and its properties underlie much of machine learning. The concepts introduced in this chapter are summarized in Figure 2.2.
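A small sketch (assuming NumPy is available) of vectors in R^n as arrays: addition and scalar multiplication again produce arrays of the same size, matching the closure idea above.

import numpy as np

a = np.array([1.0, 2.0, 3.0])  # a in R^3, cf. (2.1)
b = np.array([4.0, 5.0, 6.0])
print(a + b)    # component-wise sum: [5. 7. 9.], again a vector in R^3
print(2.0 * a)  # scaled vector: [2. 4. 6.], again a vector in R^3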
This chapter is mostly based on the lecture notes and books by Drumm and Weil (2001), Strang (2003), Hogben (2013), Liesen and Mehrmann (2015), as well as Pavel Grinfeld's Linear Algebra series. Other excellent resources are Gilbert Strang's Linear Algebra course at MIT and the Linear Algebra Series by 3Blue1Brown.

Figure 2.2 A mind map of the concepts introduced in this chapter (vector, matrix, vector space, group, linear independence, basis, system of linear equations, Gaussian elimination, linear/affine mapping, matrix inverse), along with where they are used in other parts of the book (Chapter 3: Analytic geometry; Chapter 5: Vector calculus; Chapter 10: Dimensionality reduction; Chapter 12: Classification).
Linear algebra plays an important role in machine learning and general mathematics. The concepts introduced in this chapter are further expanded to include the idea of geometry in Chapter 3. In Chapter 5, we will discuss vector calculus, where a principled knowledge of matrix operations is essential. In Chapter 10, we will use projections (to be introduced in Section 3.8) for dimensionality reduction with principal component analysis (PCA). In Chapter 9, we will discuss linear regression, where linear algebra plays a central role for solving least-squares problems.
2.1 Systems of Linear Equations
Systems of linear equations play a central part in linear algebra. Many problems can be formulated as systems of linear equations, and linear algebra gives us the tools for solving them.
Example 2.1
A company produces products N1, . . . , Nn for which resources R1, . . . , Rm are required. To produce a unit of product Nj , aij units of resource Ri are needed, where i = 1, . . . , m and j = 1, . . . , n.
The objective is to find an optimal production plan, i.e., a plan of how many units xj of product Nj should be produced if a total of bi units of resource Ri are available and (ideally) no resources are left over. If we produce x1, . . . , xn units of the corresponding products, we need a total of

ai1 x1 + · · · + ain xn   (2.2)

many units of resource Ri. An optimal production plan (x1, . . . , xn) ∈ R^n, therefore, has to satisfy the following system of equations:

a11 x1 + · · · + a1n xn = b1
            ⋮
am1 x1 + · · · + amn xn = bm ,
   (2.3)

where aij ∈ R and bi ∈ R.

Equation (2.3) is the general form of a system of linear equations, and x1, . . . , xn are the unknowns of this system. Every n-tuple (x1, . . . , xn) ∈ R^n that satisfies (2.3) is a solution of the linear equation system.
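To make the production-plan example concrete, here is a minimal sketch (assuming NumPy); the resource requirements A and availabilities b below are made-up numbers, not from the book.

import numpy as np

# Hypothetical setup: 2 resources, 2 products.
A = np.array([[2.0, 1.0],   # units of R1 needed per unit of N1, N2
              [1.0, 3.0]])  # units of R2 needed per unit of N1, N2
b = np.array([50.0, 60.0])  # available units of R1, R2

x = np.linalg.solve(A, b)   # production plan that uses up all resources
print(x)                    # [18. 14.]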
Example 2.2
The system of linear equations

x1 + x2 + x3 = 3    (1)
x1 − x2 + 2x3 = 2   (2)
2x1 + 3x3 = 1       (3)
   (2.4)

has no solution: Adding the first two equations yields 2x1 + 3x3 = 5, which contradicts the third equation (3).

Let us have a look at the system of linear equations

x1 + x2 + x3 = 3    (1)
x1 − x2 + 2x3 = 2   (2)
x2 + x3 = 2         (3)
   (2.5)

From the first and third equation, it follows that x1 = 1. From (1)+(2), we get 2x1 + 3x3 = 5, i.e., x3 = 1. From (3), we then get that x2 = 1. Therefore, (1, 1, 1) is the only possible and unique solution (verify that (1, 1, 1) is a solution by plugging in).

As a third example, we consider

x1 + x2 + x3 = 3    (1)
x1 − x2 + 2x3 = 2   (2)
2x1 + 3x3 = 5       (3)
   (2.6)

Since (1)+(2) = (3), we can omit the third equation (redundancy). From (1) and (2), we get 2x1 = 5 − 3x3 and 2x2 = 1 + x3. We define x3 = a ∈ R as a free variable, such that any triplet

(5/2 − (3/2)a, 1/2 + (1/2)a, a),  a ∈ R,   (2.7)

is a solution of the system of linear equations, i.e., we obtain a solution set that contains infinitely many solutions.

Figure 2.3 The solution space of a system of two linear equations with two variables (here: 4x1 + 4x2 = 5 and 2x1 − 4x2 = 1) can be geometrically interpreted as the intersection of two lines. Every linear equation represents a line.
In general, for a real-valued system of linear equations we obtain either no, exactly one, or infinitely many solutions. Linear regression (Chapter 9) solves a version of Example 2.1 when we cannot solve the system of linear equations.
Remark (Geometric Interpretation of Systems of Linear Equations). In a system of linear equations with two variables x1, x2, each linear equation defines a line on the x1x2-plane. Since a solution to a system of linear equations must satisfy all equations simultaneously, the solution set is the intersection of these lines. This intersection set can be a line (if the linear equations describe the same line), a point, or empty (when the lines are parallel). An illustration is given in Figure 2.3 for the system

4x1 + 4x2 = 5
2x1 − 4x2 = 1 ,
   (2.8)

where the solution space is the point (x1, x2) = (1, 1/4). Similarly, for three variables, each linear equation determines a plane in three-dimensional space. When we intersect these planes, i.e., satisfy all linear equations at the same time, we can obtain a solution set that is a plane, a line, a point or empty (when the planes have no common intersection). ♢
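A quick numerical check of (2.8), assuming NumPy:

import numpy as np

A = np.array([[4.0, 4.0],
              [2.0, -4.0]])
b = np.array([5.0, 1.0])
print(np.linalg.solve(A, b))  # [1.   0.25], i.e., (x1, x2) = (1, 1/4)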
For a systematic approach to solving systems of linear equations, we will introduce a useful compact notation. We collect the coefficients aij into vectors and collect the vectors into matrices. In other words, we write the system from (2.3) in the following form:
[a11; … ; am1] x1 + [a12; … ; am2] x2 + · · · + [a1n; … ; amn] xn = [b1; … ; bm]   (2.9)

⇐⇒

[a11 · · · a1n; … ; am1 · · · amn] [x1; … ; xn] = [b1; … ; bm] .   (2.10)

(Here [·; ·] denotes a matrix written row by row, so [a11; … ; am1] is a column vector.)
In the following, we will have a close look at these matrices and define computation rules. We will return to solving linear equations in Section 2.3.
2.2 Matrices
Matrices play a central role in linear algebra. They can be used to compactly represent systems of linear equations, but they also represent linear functions (linear mappings) as we will see later in Section 2.7. Before we discuss some of these interesting topics, let us first define what a matrix is and what kind of operations we can do with matrices. We will see more properties of matrices in Chapter 4.
Definition 2.1 (Matrix). With m, n ∈ N, a real-valued (m, n) matrix A is an m·n-tuple of elements aij, i = 1, . . . , m, j = 1, . . . , n, which is ordered according to a rectangular scheme consisting of m rows and n columns:

A = [a11 a12 · · · a1n; a21 a22 · · · a2n; … ; am1 am2 · · · amn] ,  aij ∈ R .   (2.11)
By convention, (1, n)-matrices are called rows and (m, 1)-matrices are called columns. These special matrices are also called row/column vectors.

R^{m×n} is the set of all real-valued (m, n)-matrices. A ∈ R^{m×n} can be equivalently represented as a ∈ R^{mn} by stacking all n columns of the matrix into a long vector; see Figure 2.4.

Figure 2.4 By stacking its columns, a matrix A ∈ R^{4×2} can be re-shaped into a long vector a ∈ R^8.
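A short sketch (assuming NumPy) of the re-shaping in Figure 2.4; order="F" flattens column by column, i.e., it stacks the columns:

import numpy as np

A = np.arange(8).reshape(4, 2)  # A in R^{4x2}
a = A.flatten(order="F")        # stack the 2 columns into one long vector
print(a)                        # [0 2 4 6 1 3 5 7], a in R^8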
2.2.1 Matrix Addition and Multiplication
The sum of two matrices A ∈ R^{m×n}, B ∈ R^{m×n} is defined as the element-wise sum, i.e.,

A + B := [a11 + b11 · · · a1n + b1n; … ; am1 + bm1 · · · amn + bmn] ∈ R^{m×n} .   (2.12)

For matrices A ∈ R^{m×n}, B ∈ R^{n×k}, the elements cij of the product C = AB ∈ R^{m×k} are computed as

cij = Σ_{l=1}^{n} ail blj ,   i = 1, . . . , m,   j = 1, . . . , k.   (2.13)

(Note the size of the matrices. In NumPy, C = np.einsum('il,lj', A, B) computes this product.)
This means, to compute element cij we multiply the elements of the ith row of A with the jth column of B and sum them up. (There are n columns in A and n rows in B so that we can compute ail blj for l = 1, . . . , n.) Later in Section 3.2, we will call this the dot product of the corresponding row and column. (Commonly, the dot product between two vectors a, b is denoted by a⊤b or ⟨a, b⟩.) In cases where we need to be explicit that we are performing multiplication, we use the notation A · B to denote multiplication (explicitly showing "·").

Remark. Matrices can only be multiplied if their "neighboring" dimensions match. For instance, an n × k-matrix A can be multiplied with a k × m-matrix B, but only from the left side:

A (n × k) · B (k × m) = C (n × m)   (2.14)

The product BA is not defined if m ≠ n since the neighboring dimensions do not match. ♢
Remark. Matrix multiplication is not defined as an element-wise operation on matrix elements, i.e., cij ≠ aij bij (even if the size of A, B was chosen appropriately). This kind of element-wise multiplication often appears in programming languages when we multiply (multi-dimensional) arrays with each other, and is called a Hadamard product. ♢
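A sketch (assuming NumPy) contrasting the Hadamard product with the matrix product in (2.13):

import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])
print(A * B)  # Hadamard product: c_ij = a_ij * b_ij
print(A @ B)  # matrix product as in (2.13); generally a different result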
Example 2.3

For A = [1 2 3; 3 2 1] ∈ R^{2×3} and B = [0 2; 1 −1; 0 1] ∈ R^{3×2}, we obtain

AB = [1 2 3; 3 2 1] [0 2; 1 −1; 0 1] = [2 3; 2 5] ∈ R^{2×2} ,   (2.15)

BA = [0 2; 1 −1; 0 1] [1 2 3; 3 2 1] = [6 4 2; −2 0 2; 3 2 1] ∈ R^{3×3} .   (2.16)
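Reproducing Example 2.3 (assuming NumPy): the two products differ, and do not even have the same shape.

import numpy as np

A = np.array([[1, 2, 3], [3, 2, 1]])
B = np.array([[0, 2], [1, -1], [0, 1]])
print(A @ B)  # [[2 3] [2 5]], a 2x2 matrix, cf. (2.15)
print(B @ A)  # [[ 6  4  2] [-2  0  2] [ 3  2  1]], a 3x3 matrix, cf. (2.16)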
From this example, we can already see that matrix multiplication is not commutative, i.e., AB ≠ BA; see also Figure 2.5 for an illustration.

Figure 2.5 Even if both matrix multiplications AB and BA are defined, the dimensions of the results can be different.

Definition 2.2 (Identity Matrix). In R^{n×n}, we define the identity matrix

In := [1 0 · · · 0; 0 1 · · · 0; … ; 0 0 · · · 1] ∈ R^{n×n}   (2.17)
as the n × n-matrix containing 1 on the diagonal and 0 everywhere else.
Now that we defined matrix multiplication, matrix addition and the identity matrix, let us have a look at some properties of matrices:
Associativity:

∀A ∈ R^{m×n}, B ∈ R^{n×p}, C ∈ R^{p×q} : (AB)C = A(BC)   (2.18)

Distributivity:

∀A, B ∈ R^{m×n}, C, D ∈ R^{n×p} : (A + B)C = AC + BC   (2.19a)
A(C + D) = AC + AD   (2.19b)

Multiplication with the identity matrix:

∀A ∈ R^{m×n} : Im A = A In = A   (2.20)

Note that Im ≠ In for m ≠ n.
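A numerical spot-check (assuming NumPy) of associativity (2.18) and the identity property (2.20) on randomly drawn matrices; this does not prove the identities, it merely illustrates them.

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))
C = rng.normal(size=(4, 5))
print(np.allclose((A @ B) @ C, A @ (B @ C)))      # True, cf. (2.18)
print(np.allclose(np.eye(2) @ A, A @ np.eye(3)))  # True, cf. (2.20)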
2.2.2 Inverse and Transpose
Definition 2.3 (Inverse). Consider a square matrix A ∈ R^{n×n}, i.e., a matrix with the same number of columns and rows. Let matrix B ∈ R^{n×n} have the property that AB = In = BA. B is called the inverse of A and denoted by A^{−1}.

Unfortunately, not every matrix A possesses an inverse A^{−1}. If this inverse does exist, A is called regular/invertible/nonsingular; otherwise singular/noninvertible. When the matrix inverse exists, it is unique. In Section 2.3, we will discuss a general way to compute the inverse of a matrix by solving a system of linear equations.
Remark (Existence of the Inverse of a 2 × 2-matrix). Consider a matrix

A := [a11 a12; a21 a22] ∈ R^{2×2} .   (2.21)

If we multiply A with

A′ := [a22 −a12; −a21 a11] ,   (2.22)

we obtain

AA′ = [a11a22 − a12a21  0; 0  a11a22 − a12a21] = (a11a22 − a12a21) I .   (2.23)

Therefore,

A^{−1} = 1/(a11a22 − a12a21) [a22 −a12; −a21 a11]   (2.24)

if and only if a11a22 − a12a21 ≠ 0. In Section 4.1, we will see that a11a22 − a12a21 is the determinant of a 2×2-matrix. Furthermore, we can generally use the determinant to check whether a matrix is invertible. ♢
Example 2.4 (Inverse Matrix)

The matrices

A = [1 2 1; 4 4 5; 6 7 7] ,   B = [−7 −7 6; 2 1 −1; 4 5 −4]   (2.25)

are inverse to each other since AB = I = BA.
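Verifying Example 2.4 numerically (assuming NumPy):

import numpy as np

A = np.array([[1, 2, 1], [4, 4, 5], [6, 7, 7]])
B = np.array([[-7, -7, 6], [2, 1, -1], [4, 5, -4]])
print(np.allclose(A @ B, np.eye(3)))     # True: AB = I
print(np.allclose(np.linalg.inv(A), B))  # True: B is the inverse of A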
Definition 2.4 (Transpose). For A ∈ R^{m×n}, the matrix B ∈ R^{n×m} with bij = aji is called the transpose of A. We write B = A⊤.

In general, A⊤ can be obtained by writing the columns of A as the rows of A⊤. The following are important properties of inverses and transposes:

AA^{−1} = I = A^{−1}A   (2.26)
(AB)^{−1} = B^{−1}A^{−1}   (2.27)
(A + B)^{−1} ≠ A^{−1} + B^{−1}   (2.28)
(A⊤)⊤ = A   (2.29)
(AB)⊤ = B⊤A⊤   (2.30)
(A + B)⊤ = A⊤ + B⊤   (2.31)

The scalar case of (2.28) is 1/(2+4) = 1/6 ≠ 1/2 + 1/4. The main diagonal (sometimes called "principal diagonal", "primary diagonal", "leading diagonal", or "major diagonal") of a matrix A is the collection of entries Aij where i = j.

Definition 2.5 (Symmetric Matrix). A matrix A ∈ R^{n×n} is symmetric if A = A⊤.

Note that only (n, n)-matrices can be symmetric. Generally, we call (n, n)-matrices also square matrices because they possess the same number of rows and columns. Moreover, if A is invertible, then so is A⊤, and (A^{−1})⊤ = (A⊤)^{−1} =: A^{−⊤}.
Remark (Sum and Product of Symmetric Matrices). The sum of symmetric matrices A, B ∈ R^{n×n} is always symmetric. However, although their product is always defined, it is generally not symmetric:

[1 0; 0 0] [1 1; 1 1] = [1 1; 0 0] .   (2.32) ♢
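A sketch (assuming NumPy) checking the transpose property (2.30) and reproducing (2.32), where the product of two symmetric matrices is not symmetric:

import numpy as np

A = np.array([[1, 0], [0, 0]])  # symmetric
B = np.array([[1, 1], [1, 1]])  # symmetric
print(np.allclose((A @ B).T, B.T @ A.T))  # True, cf. (2.30)
print(A @ B)                              # [[1 1] [0 0]]: not symmetric, cf. (2.32)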
2.2.3 Multiplication by a Scalar
Let us look at what happens to matrices when they are multiplied by a scalar λ ∈ R. Let A ∈ R^{m×n} and λ ∈ R. Then λA = K with Kij = λ aij. Practically, λ scales each element of A. For λ, ψ ∈ R, the following holds:
Associativity:
(λψ)C = λ(ψC) ,   C ∈ R^{m×n}
λ(BC) = (λB)C = B(λC) = (BC)λ ,   B ∈ R^{m×n}, C ∈ R^{n×k}.
Note that this allows us to move scalar values around.
(λC)⊤ = C⊤λ⊤ = C⊤λ = λC⊤ , since λ = λ⊤ for all λ ∈ R.

Distributivity:
(λ + ψ)C = λC + ψC ,   C ∈ R^{m×n}
λ(B + C) = λB + λC ,   B, C ∈ R^{m×n}
Example 2.5 (Distributivity)

If we define

C := [1 2; 3 4] ,   (2.33)

then for any λ, ψ ∈ R we obtain

(λ + ψ)C = [(λ + ψ)1  (λ + ψ)2; (λ + ψ)3  (λ + ψ)4] = [λ + ψ  2λ + 2ψ; 3λ + 3ψ  4λ + 4ψ]   (2.34a)
         = [λ  2λ; 3λ  4λ] + [ψ  2ψ; 3ψ  4ψ] = λC + ψC .   (2.34b)
2.2.4 Compact Representations of Systems of Linear Equations

If we consider the system of linear equations

2x1 + 3x2 + 5x3 = 1
4x1 − 2x2 − 7x3 = 8
9x1 + 5x2 − 3x3 = 2
   (2.35)

and use the rules for matrix multiplication, we can write this equation system in a more compact form as

[2 3 5; 4 −2 −7; 9 5 −3] [x1; x2; x3] = [1; 8; 2] .   (2.36)

Note that x1 scales the first column, x2 the second one, and x3 the third one.

Generally, a system of linear equations can be compactly represented in matrix form as Ax = b; see (2.3). The product Ax is a (linear) combination of the columns of A, as shown in the sketch below. We will discuss linear combinations in more detail in Section 2.5.
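A sketch (assuming NumPy) of the column view of Ax from (2.36): the product equals the linear combination x1 c1 + x2 c2 + x3 c3 of the columns of A. The vector x here is an arbitrary, made-up choice.

import numpy as np

A = np.array([[2.0, 3.0, 5.0],
              [4.0, -2.0, -7.0],
              [9.0, 5.0, -3.0]])
x = np.array([1.0, -1.0, 2.0])
lhs = A @ x
rhs = x[0] * A[:, 0] + x[1] * A[:, 1] + x[2] * A[:, 2]
print(np.allclose(lhs, rhs))  # True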
2.3 Solving Systems of Linear Equations
In (2.3), we introduced the general form of a system of linear equations, i.e.,

a11 x1 + · · · + a1n xn = b1
            ⋮
am1 x1 + · · · + amn xn = bm ,
   (2.37)

where aij ∈ R and bi ∈ R are known constants and xj are unknowns, i = 1, . . . , m, j = 1, . . . , n. Thus far, we saw that matrices can be used as a compact way of formulating systems of linear equations so that we can write Ax = b; see (2.10). Moreover, we defined basic matrix operations, such as addition and multiplication of matrices. In the following, we will focus on solving systems of linear equations and provide an algorithm for finding the inverse of a matrix.
2.3.1 Particular and General Solution
Before discussing how to solve systems of linear equations in general, let us have a look at an example. Consider the system of equations

[1 0 8 −4; 0 1 2 12] [x1; x2; x3; x4] = [42; 8] .   (2.38)

The system has two equations and four unknowns. Therefore, in general we would expect infinitely many solutions. This system of equations is in a particularly easy form, where the first two columns consist of a 1 and a 0. Remember that we want to find scalars x1, . . . , x4 such that Σ_{i=1}^{4} xi ci = b, where we define ci to be the ith column of the matrix and b the right-hand side of (2.38). A solution to the problem in (2.38) can be found immediately by taking 42 times the first column and 8 times the second column so that

b = [42; 8] = 42 [1; 0] + 8 [0; 1] .   (2.39)

Therefore, a solution is [42, 8, 0, 0]⊤. This solution is called a particular solution or special solution. However, this is not the only solution of this system of linear equations. To capture all the other solutions, we need to be creative in generating 0 in a non-trivial way using the columns of the matrix: Adding 0 to our special solution does not change the special solution. To do so, we express the third column using the first two columns (which are of this very simple form)

[8; 2] = 8 [1; 0] + 2 [0; 1]   (2.40)
so that 0 = 8c1 + 2c2 − 1c3 + 0c4 and (x1, x2, x3, x4) = (8, 2, −1, 0). In fact, any scaling of this solution by λ1 ∈ R produces the 0 vector, i.e.,
[1 0 8 −4; 0 1 2 12] ( λ1 [8; 2; −1; 0] ) = λ1(8c1 + 2c2 − c3) = 0 . (2.41)
Following the same line of reasoning, we express the fourth column of the matrix in (2.38) using the first two columns and generate another set of non-trivial versions of 0 as
[1 0 8 −4; 0 1 2 12] ( λ2 [−4; 12; 0; −1] ) = λ2(−4c1 + 12c2 − c4) = 0 (2.42)
for any λ2 ∈ R. Putting everything together, we obtain all solutions of the equation system in (2.38), which is called the general solution, as the set

{ x ∈ R4 : x = [42; 8; 0; 0] + λ1 [8; 2; −1; 0] + λ2 [−4; 12; 0; −1] , λ1, λ2 ∈ R } . (2.43)
Remark. The general approach we followed consisted of the following three steps:
1. Find a particular solution to Ax = b.
2. Find all solutions to Ax = 0.
3. Combine the solutions from steps 1 and 2 to the general solution.

Neither the general nor the particular solution is unique. ♢

The system of linear equations in the preceding example was easy to solve because the matrix in (2.38) has this particularly convenient form, which allowed us to find the particular and the general solution by inspection. However, general equation systems are not of this simple form. Fortunately, there exists a constructive algorithmic way of transforming any system of linear equations into this particularly simple form: Gaussian elimination. Key to Gaussian elimination are elementary transformations of systems of linear equations, which transform the equation system into a simple form. Then, we can apply the three steps to the simple form that we just discussed in the context of the example in (2.38).
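The three-step recipe can also be carried out numerically. The following sketch (ours, assuming NumPy and SciPy are available) computes a particular solution of (2.38) with a least-squares solve and a basis of the homogeneous solutions with scipy.linalg.null_space; any combination of the two solves the system:

import numpy as np
from scipy.linalg import null_space

# System (2.38): two equations, four unknowns
A = np.array([[1., 0., 8., -4.],
              [0., 1., 2., 12.]])
b = np.array([42., 8.])

# Step 1: a particular solution of Ax = b
x_p = np.linalg.lstsq(A, b, rcond=None)[0]

# Step 2: a basis of the solutions of Ax = 0 (columns span the null space)
N = null_space(A)

# Step 3: every x_p + N @ lam is a solution of Ax = b
lam = np.array([0.5, -2.0])  # arbitrary coefficients
assert np.allclose(A @ (x_p + N @ lam), b)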
2.3.2 Elementary Transformations
Key to solving a system of linear equations are elementary transformations that keep the solution set the same, but that transform the equation system into a simpler form:
Exchange of two equations (rows in the matrix representing the system of equations)
Multiplication of an equation (row) with a constant λ ∈ R\{0}
Addition of two equations (rows)
Example 2.6
For a ∈ R, we seek all solutions of the following system of equations:

−2x1 + 4x2 − 2x3 − x4 + 4x5 = −3
4x1 − 8x2 + 3x3 − 3x4 + x5 = 2
x1 − 2x2 + x3 − x4 + x5 = 0
x1 − 2x2 − 3x4 + 4x5 = a . (2.44)
We start by converting this system of equations into the compact matrix notation Ax = b. We no longer mention the variables x explicitly and
build the augmented matrix (in the form [A | b])
[−2 4 −2 −1 4 | −3]   (swap with R3)
[ 4 −8 3 −3 1 |  2]
[ 1 −2 1 −1 1 |  0]   (swap with R1)
[ 1 −2 0 −3 4 |  a]

where we used the vertical line to separate the left-hand side from the right-hand side in (2.44), and ⇝ indicates a transformation of the augmented matrix using elementary transformations. (The augmented matrix [A | b] compactly represents the system of linear equations Ax = b.) Swapping Rows 1 and 3 leads to

[ 1 −2 1 −1 1 |  0]
[ 4 −8 3 −3 1 |  2]   −4R1
[−2 4 −2 −1 4 | −3]   +2R1
[ 1 −2 0 −3 4 |  a]   −R1
When we now apply the indicated transformations (e.g., subtract Row 1 four times from Row 2), we obtain
⇝
[1 −2 1 −1 1 |  0]
[0 0 −1 1 −3 |  2]
[0 0 0 −3 6 | −3]
[0 0 −1 −2 3 |  a]   −R2 − R3

⇝
[1 −2 1 −1 1 |  0]
[0 0 −1 1 −3 |  2]   ·(−1)
[0 0 0 −3 6 | −3]   ·(−1/3)
[0 0 0 0 0 | a + 1]

⇝
[1 −2 1 −1 1 |  0]
[0 0 1 −1 3 | −2]
[0 0 0 1 −2 |  1]
[0 0 0 0 0 | a + 1]
This (augmented) matrix is in a convenient form, the row-echelon form (REF). Reverting this compact notation back into the explicit notation with the variables we seek, we obtain
x1 − 2x2 + x3 − x4 + x5 = 0
x3 − x4 + 3x5 = −2
x4 − 2x5 = 1
0 = a + 1 . (2.45)
Only for a = −1 can this system be solved. A particular solution is

[x1; x2; x3; x4; x5] = [2; 0; −1; 1; 0] . (2.46)
The general solution, which captures the set of all possible solutions, is

{ x ∈ R5 : x = [2; 0; −1; 1; 0] + λ1 [2; 1; 0; 0; 0] + λ2 [2; 0; −1; 2; 1] , λ1, λ2 ∈ R } . (2.47)
In the following, we will detail a constructive way to obtain a particular and general solution of a system of linear equations.
Remark (Pivots and Staircase Structure). The leading coefficient of a row (the first nonzero number from the left) is called the pivot and is always strictly to the right of the pivot of the row above it. Therefore, any equation system in row-echelon form always has a “staircase” structure. ♢
Definition 2.6 (Row-Echelon Form). A matrix is in row-echelon form if

All rows that contain only zeros are at the bottom of the matrix; correspondingly, all rows that contain at least one nonzero element are on top of rows that contain only zeros.
Looking at nonzero rows only, the first nonzero number from the left (also called the pivot or the leading coefficient) is always strictly to the right of the pivot of the row above it.

In other texts, it is sometimes required that the pivot is 1.
Remark (Basic and Free Variables). The variables corresponding to the pivots in the row-echelon form are called basic variables and the other variables are free variables. For example, in (2.45), x1, x3, x4 are basic variables, whereas x2, x5 are free variables. ♢

Remark (Obtaining a Particular Solution). The row-echelon form makes
our lives easier when we need to determine a particular solution. To do this, we express the right-hand side of the equation system using the pivot columns, such that b = ∑_{i=1}^{P} λi pi, where pi, i = 1, . . . , P, are the pivot columns. The λi are determined most easily if we start with the rightmost pivot column and work our way to the left.
In the previous example, we would try to find λ1, λ2, λ3 so that
λ1 [1; 0; 0; 0] + λ2 [1; 1; 0; 0] + λ3 [−1; −1; 1; 0] = [0; −2; 1; 0] . (2.48)
From here, we find relatively directly that λ3 = 1, λ2 = −1, λ1 = 2. When we put everything together, we must not forget the non-pivot columns for which we set the coefficients implicitly to 0. Therefore, we get the particular solution x = [2, 0, −1, 1, 0]⊤. ♢
Remark (Reduced Row-Echelon Form). An equation system is in reduced row-echelon form (also: row-reduced echelon form or row canonical form) if
It is in row-echelon form.
Every pivot is 1.
The pivot is the only nonzero entry in its column.
The reduced row-echelon form will play an important role later in Section 2.3.3 because it allows us to determine the general solution of a system of linear equations in a straightforward way. ♢

Remark (Gaussian Elimination). Gaussian elimination is an algorithm that performs elementary transformations to bring a system of linear equations into reduced row-echelon form. ♢
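For illustration, a minimal and numerically naive sketch of Gaussian elimination to reduced row-echelon form follows; this is our own toy implementation, and library routines with proper pivoting should be preferred in practice:

import numpy as np

def rref(M, tol=1e-12):
    """Reduce a matrix to reduced row-echelon form with elementary transformations."""
    A = np.array(M, dtype=float)
    rows, cols = A.shape
    r = 0
    for c in range(cols):
        pivot_rows = [i for i in range(r, rows) if abs(A[i, c]) > tol]
        if not pivot_rows:
            continue                                   # no pivot in this column
        A[[r, pivot_rows[0]]] = A[[pivot_rows[0], r]]  # exchange rows
        A[r] /= A[r, c]                                # scale so the pivot is 1
        for i in range(rows):                          # clear the rest of the column
            if i != r:
                A[i] -= A[i, c] * A[r]
        r += 1
        if r == rows:
            break
    return A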
Example 2.7 (Reduced Row Echelon Form)
Verify that the following matrix is in reduced row-echelon form (the pivots sit in the first, third, and fourth columns):

A = [1 3 0 0 3; 0 0 1 0 9; 0 0 0 1 −4] . (2.49)
The key idea for finding the solutions of Ax = 0 is to look at the non-pivot columns, which we will need to express as a (linear) combination of the pivot columns. The reduced row-echelon form makes this relatively straightforward, and we express the non-pivot columns in terms of sums and multiples of the pivot columns that are on their left: The second column is 3 times the first column (we can ignore the pivot columns on the right of the second column). Therefore, to obtain 0, we need to subtract
the second column from three times the first column. Now, we look at the fifth column, which is our second non-pivot column. The fifth column can be expressed as 3 times the first pivot column, 9 times the second pivot column, and −4 times the third pivot column. We need to keep track of the indices of the pivot columns and translate this into 3 times the first column, 0 times the second column (which is a non-pivot column), 9 times the third column (which is our second pivot column), and −4 times the fourth column (which is the third pivot column). Then we need to subtract the fifth column to obtain 0. In the end, we are still solving a homogeneous equation system.
To summarize, all solutions of Ax = 0, x ∈ R5, are given by

{ x ∈ R5 : x = λ1 [3; −1; 0; 0; 0] + λ2 [3; 0; 9; −4; −1] , λ1, λ2 ∈ R } . (2.50)
2.3.3 The Minus-1 Trick
In the following, we introduce a practical trick for reading out the solutions x of a homogeneous system of linear equations Ax = 0, where A ∈ Rk×n, x ∈ Rn.
To start, we assume that A is in reduced row-echelon form without any rows that just contain zeros, i.e.,
A = [0 · · · 0 1 ∗ · · · ∗ 0 ∗ · · · ∗ 0 ∗ · · · ∗ ;
     0 · · · 0 0 0 · · · 0 1 ∗ · · · ∗ 0 ∗ · · · ∗ ;
     · · · ;
     0 · · · 0 0 0 · · · 0 0 0 · · · 0 1 ∗ · · · ∗ ] , (2.51)
where ∗ can be an arbitrary real number, with the constraints that the first nonzero entry per row must be 1 and all other entries in the corresponding column must be 0. The columns j1, . . . , jk with the pivots are the standard unit vectors e1, . . . , ek ∈ Rk. We extend this matrix to an n × n-matrix A˜ by adding n − k rows of the form
0 · · · 0 −1 0 · · · 0 (2.52)
so that the diagonal of the augmented matrix A˜ contains either 1 or −1. Then, the columns of A˜ that contain the −1 as pivots are solutions of
the homogeneous equation system Ax = 0. To be more precise, these columns form a basis (Section 2.6.1) of the solution space of Ax = 0, which we will later call the kernel or null space (see Section 2.7.3).
Example 2.8 (Minus-1 Trick)
Let us revisit the matrix in (2.49), which is already in reduced row-echelon form:

A = [1 3 0 0 3; 0 0 1 0 9; 0 0 0 1 −4] . (2.53)
We now augment this matrix to a 5 × 5 matrix by adding rows of the form (2.52) at the places where the pivots on the diagonal are missing and obtain

A˜ = [1 3 0 0 3; 0 −1 0 0 0; 0 0 1 0 9; 0 0 0 1 −4; 0 0 0 0 −1] . (2.54)
From this form, we can immediately read out the solutions of Ax = 0 by taking the columns of A˜ that contain −1 on the diagonal:

{ x ∈ R5 : x = λ1 [3; −1; 0; 0; 0] + λ2 [3; 0; 9; −4; −1] , λ1, λ2 ∈ R } , (2.55)
which is identical to the solution in (2.50) that we obtained by “insight”.
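The minus-1 trick is easy to check numerically. A short sketch (ours, assuming NumPy) verifies that the two augmented columns of A˜ in (2.54) indeed solve Ax = 0:

import numpy as np

A = np.array([[1., 3., 0., 0., 3.],
              [0., 0., 1., 0., 9.],
              [0., 0., 0., 1., -4.]])

# Augmented 5 x 5 matrix from (2.54), with -1 entries on the diagonal
A_tilde = np.array([[1., 3., 0., 0., 3.],
                    [0., -1., 0., 0., 0.],
                    [0., 0., 1., 0., 9.],
                    [0., 0., 0., 1., -4.],
                    [0., 0., 0., 0., -1.]])

# The columns of A_tilde whose diagonal entry is -1 solve Ax = 0
for j in (1, 4):
    assert np.allclose(A @ A_tilde[:, j], 0)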
Calculating the Inverse
To compute the inverse A−1 of A ∈ Rn×n, we need to find a matrix X that satisfies AX = In. Then, X = A−1. We can write this down as a set of simultaneous linear equations AX = In, where we solve for X = [x1 | · · · | xn]. We use the augmented matrix notation for a compact representation of this set of systems of linear equations and obtain
[A | In] ⇝ · · · ⇝ [In | A−1] . (2.56)
This means that if we bring the augmented equation system into reduced row-echelon form, we can read out the inverse on the right-hand side of the equation system. Hence, determining the inverse of a matrix is equivalent to solving systems of linear equations.
Example 2.9 (Calculating an Inverse Matrix by Gaussian Elimination)
To determine the inverse of

A = [1 0 2 0; 1 1 0 0; 1 2 0 1; 1 1 1 1] , (2.57)
we write down the augmented matrix

[1 0 2 0 | 1 0 0 0]
[1 1 0 0 | 0 1 0 0]
[1 2 0 1 | 0 0 1 0]
[1 1 1 1 | 0 0 0 1]

and use Gaussian elimination to bring it into reduced row-echelon form

[1 0 0 0 | −1 2 −2 2]
[0 1 0 0 | 1 −1 2 −2]
[0 0 1 0 | 1 −1 1 −1]
[0 0 0 1 | −1 0 −1 2] ,
such that the desired inverse is given as its right-hand side:

A−1 = [−1 2 −2 2; 1 −1 2 −2; 1 −1 1 −1; −1 0 −1 2] . (2.58)
We can verify that (2.58) is indeed the inverse by performing the multiplication AA−1 and observing that we recover I4.
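This check is a one-liner in code. The sketch below (ours, assuming NumPy) compares a library inverse with the hand-computed result in (2.58):

import numpy as np

A = np.array([[1., 0., 2., 0.],
              [1., 1., 0., 0.],
              [1., 2., 0., 1.],
              [1., 1., 1., 1.]])

A_inv = np.linalg.inv(A)
assert np.allclose(A @ A_inv, np.eye(4))        # we recover I4

A_inv_book = np.array([[-1., 2., -2., 2.],
                       [1., -1., 2., -2.],
                       [1., -1., 1., -1.],
                       [-1., 0., -1., 2.]])     # the inverse from (2.58)
assert np.allclose(A_inv, A_inv_book)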
2.3.4 Algorithms for Solving a System of Linear Equations
In the following, we briefly discuss approaches to solving a system of linear equations of the form Ax = b. We make the assumption that a solution exists. Should there be no solution, we need to resort to approximate solutions, which we do not cover in this chapter. One way to solve the approximate problem is using the approach of linear regression, which we discuss in detail in Chapter 9.
In special cases, we may be able to determine the inverse A−1, such that the solution of Ax = b is given as x = A−1b. However, this is only possible if A is a square matrix and invertible, which is often not the case. Otherwise, under mild assumptions (i.e., A needs to have linearly independent columns) we can use the transformation
Ax = b ⇐⇒ A⊤Ax = A⊤b ⇐⇒ x = (A⊤A)−1A⊤b (2.59)
and use the Moore-Penrose pseudo-inverse (A⊤A)−1A⊤ to determine the solution (2.59) that solves Ax = b, which also corresponds to the minimum norm least-squares solution. A disadvantage of this approach is that it requires many computations for the matrix-matrix product and for computing the inverse of A⊤A. Moreover, for reasons of numerical precision it is generally not recommended to compute the inverse or pseudo-inverse. In the following, we therefore briefly discuss alternative approaches to solving systems of linear equations.
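The following small sketch (ours; the matrix is a made-up example with linearly independent columns) contrasts the formula (2.59) with the numerically preferred least-squares solve:

import numpy as np

A = np.array([[1., 0.],
              [1., 1.],
              [1., 2.]])          # tall matrix, linearly independent columns
b = np.array([1., 2., 2.])

# Formula (2.59): x = (A^T A)^{-1} A^T b
x_normal = np.linalg.inv(A.T @ A) @ A.T @ b

# Preferred in practice: least-squares solver, no explicit (pseudo-)inverse
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]
assert np.allclose(x_normal, x_lstsq)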
Gaussian elimination plays an important role when computing determinants (Section 4.1), checking whether a set of vectors is linearly independent (Section 2.5), computing the inverse of a matrix (Section 2.2.2), computing the rank of a matrix (Section 2.6.2), and determining a basis of a vector space (Section 2.6.1). Gaussian elimination is an intuitive and constructive way to solve a system of linear equations with thousands of variables. However, for systems with millions of variables, it is impractical as the required number of arithmetic operations scales cubically in the number of simultaneous equations.
In practice, systems of many linear equations are solved indirectly, by either stationary iterative methods, such as the Richardson method, the Jacobi method, the Gauß-Seidel method, and the successive over-relaxation method, or Krylov subspace methods, such as conjugate gradients, generalized minimal residual, or biconjugate gradients. We refer to the books by Stoer and Bulirsch (2002), Strang (2003), and Liesen and Mehrmann (2015) for further details.
Let x∗ be a solution of Ax = b. The key idea of these iterative methods is to set up an iteration of the form
x(k+1) = Cx(k) + d (2.60)
for suitable C and d that reduces the residual error ∥x(k+1)−x∗∥ in every iteration and converges to x∗. We will introduce norms ∥ · ∥, which allow us to compute similarities between vectors, in Section 3.1.
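As a concrete instance of (2.60), here is a sketch of the Jacobi method (ours; the example matrix is strictly diagonally dominant, a standard sufficient condition for this iteration to converge):

import numpy as np

def jacobi(A, b, iters=100):
    """Iterate x(k+1) = C x(k) + d with C = -D^{-1}(A - D), d = D^{-1} b,
    where D is the diagonal part of A (the Jacobi splitting)."""
    D = np.diag(np.diag(A))
    C = -np.linalg.solve(D, A - D)
    d = np.linalg.solve(D, b)
    x = np.zeros_like(b)
    for _ in range(iters):
        x = C @ x + d            # the iteration (2.60)
    return x

A = np.array([[4., 1., 0.],
              [1., 5., 2.],
              [0., 2., 6.]])
b = np.array([1., 2., 3.])
assert np.allclose(jacobi(A, b), np.linalg.solve(A, b))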
2.4 Vector Spaces
Thus far, we have looked at systems of linear equations and how to solve them (Section 2.3). We saw that systems of linear equations can be compactly represented using matrix-vector notation (2.10). In the following, we will have a closer look at vector spaces, i.e., a structured space in which vectors live.
In the beginning of this chapter, we informally characterized vectors as objects that can be added together and multiplied by a scalar, and they remain objects of the same type. Now, we are ready to formalize this, and we will start by introducing the concept of a group, which is a set of elements and an operation defined on these elements that keeps some structure of the set intact.
2.4.1 Groups
Groups play an important role in computer science. Besides providing a fundamental framework for operations on sets, they are heavily used in cryptography, coding theory, and graphics.
Definition 2.7 (Group). Consider a set G and an operation ⊗ : G × G → G defined on G. Then G := (G, ⊗) is called a group if the following hold:
1. Closure of G under ⊗: ∀x, y ∈ G : x ⊗ y ∈ G
2. Associativity: ∀x, y, z ∈ G : (x ⊗ y) ⊗ z = x ⊗ (y ⊗ z)
3. Neutral element: ∃e ∈ G ∀x ∈ G : x ⊗ e = x and e ⊗ x = x
4. Inverse element: ∀x ∈ G ∃y ∈ G : x ⊗ y = e and y ⊗ x = e, where e is the neutral element. We often write x−1 to denote the inverse element of x.
Remark. The inverse element is defined with respect to the operation ⊗ and does not necessarily mean 1/x. ♢

If additionally ∀x, y ∈ G : x ⊗ y = y ⊗ x, then G = (G, ⊗) is an Abelian group (commutative).
Example 2.10 (Groups)
Let us have a look at some examples of sets with associated operations and see whether they are groups:
(Z, +) is an Abelian group.
(N0, +) is not a group, where N0 := N ∪ {0}: Although (N0, +) possesses a neutral element (0), the inverse elements are missing.
(Z, ·) is not a group: Although (Z, ·) contains a neutral element (1), the inverse elements for any z ∈ Z, z ̸= ±1, are missing.
(R, ·) is not a group since 0 does not possess an inverse element.
(R\{0}, ·) is Abelian.
(Rn, +),(Zn, +), n ∈ N are Abelian if + is defined componentwise, i.e., (x1, · · · , xn) + (y1, · · · , yn) = (x1 + y1, · · · , xn + yn). (2.61)
Then, (x1, · · · , xn)−1:= (−x1, · · · , −xn) is the inverse element and e = (0, · · · , 0) is the neutral element.
(Rm×n, +), the set of m × n-matrices is Abelian (with componentwise addition as defined in (2.61)).
Let us have a closer look at (Rn×n, ·), i.e., the set of n×n-matrices with matrix multiplication as defined in (2.13).
– Closure and associativity follow directly from the definition of matrix multiplication.
– Neutral element: The identity matrix In is the neutral element with respect to matrix multiplication “·” in (Rn×n, ·).
– Inverse element: If the inverse exists (A is regular), then A−1 is the inverse element of A ∈ Rn×n, and in exactly this case (Rn×n, ·) is a group, called the general linear group.
Definition 2.8 (General Linear Group). The set of regular (invertible) matrices A ∈ Rn×n is a group with respect to matrix multiplication as defined in (2.13) and is called the general linear group GL(n, R). However, since matrix multiplication is not commutative, the group is not Abelian.
2.4.2 Vector Spaces
When we discussed groups, we looked at sets G and inner operations on G, i.e., mappings G × G → G that only operate on elements in G. In the following, we will consider sets that in addition to an inner operation + also contain an outer operation ·, the multiplication of a vector x ∈ G by a scalar λ ∈ R. We can think of the inner operation as a form of addition, and the outer operation as a form of scaling. Note that the inner/outer operations have nothing to do with inner/outer products.
Definition 2.9 (Vector Space). A real-valued vector space V = (V, +, ·) is a set V with two operations
+ : V × V → V (2.62)
· : R × V → V (2.63)
where
1. (V, +) is an Abelian group
2. Distributivity:
1. ∀λ ∈ R, x, y ∈ V : λ · (x + y) = λ · x + λ · y
2. ∀λ, ψ ∈ R, x ∈ V : (λ + ψ) · x = λ · x + ψ · x
3. Associativity (outer operation): ∀λ, ψ ∈ R, x ∈ V : λ · (ψ · x) = (λψ) · x
4. Neutral element with respect to the outer operation: ∀x ∈ V : 1 · x = x
The elements x ∈ V are called vectors. The neutral element of (V, +) is the zero vector 0 = [0, . . . , 0]⊤, and the inner operation + is called vector addition. The elements λ ∈ R are called scalars and the outer operation · is a multiplication by scalars. Note that a scalar product is something different, and we will get to this in Section 3.2.
Remark. A “vector multiplication” ab, a, b ∈ Rn, is not defined. Theoretically, we could define an element-wise multiplication, such that c = ab with cj = aj bj . This “array multiplication” is common to many programming languages but makes mathematically limited sense using the standard rules for matrix multiplication: By treating vectors as n × 1 matrices
(which we usually do), we can use the matrix multiplication as defined in (2.13). However, then the dimensions of the vectors do not match. Only the following multiplications for vectors are defined: ab⊤ ∈ Rn×n (outer product), a⊤b ∈ R (inner/scalar/dot product). ♢
Example 2.11 (Vector Spaces)
Let us have a look at some important examples:
V = Rn, n ∈ N is a vector space with operations defined as follows:
– Addition: x+y = (x1, . . . , xn)+(y1, . . . , yn) = (x1+y1, . . . , xn+yn) for all x, y ∈ Rn
– Multiplication by scalars: λx = λ(x1, . . . , xn) = (λx1, . . . , λxn) for all λ ∈ R, x ∈ Rn
V = Rm×n, m, n ∈ N is a vector space with

– Addition: A + B = [a11 + b11 · · · a1n + b1n; ... ; am1 + bm1 · · · amn + bmn] defined elementwise for all A, B ∈ V
– Multiplication by scalars: λA = [λa11 · · · λa1n; ... ; λam1 · · · λamn] as defined in Section 2.2. Remember that Rm×n is equivalent to Rmn.

V = C, with the standard definition of addition of complex numbers.
Remark. In the following, we will denote a vector space (V, +, ·) by V when + and · are the standard vector addition and scalar multiplication. Moreover, we will use the notation x ∈ V for vectors in V to simplify notation. ♢
Remark. The vector spaces Rn, Rn×1, R1×n are only different in the way we write vectors. In the following, we will not make a distinction between Rn and Rn×1, which allows us to write n-tuples as column vectors

x = [x1; ... ; xn] . (2.64)

This simplifies the notation regarding vector space operations. However, we do distinguish between Rn×1 and R1×n (the row vectors) to avoid confusion with matrix multiplication. By default, we write x to denote a column vector, and a row vector is denoted by x⊤, the transpose of x. ♢
2.4.3 Vector Subspaces
In the following, we will introduce vector subspaces. Intuitively, they are sets contained in the original vector space with the property that when we perform vector space operations on elements within this subspace, we will never leave it. In this sense, they are “closed”. Vector subspaces are a key idea in machine learning. For example, Chapter 10 demonstrates how to use vector subspaces for dimensionality reduction.
Definition 2.10 (Vector Subspace). Let V = (V, +, ·) be a vector space and U ⊆ V, U ≠ ∅. Then U = (U, +, ·) is called a vector subspace of V (or linear subspace) if U is a vector space with the vector space operations + and · restricted to U × U and R × U. We write U ⊆ V to denote a subspace U of V .
If U ⊆ V and V is a vector space, then U naturally inherits many properties directly from V because they hold for all x ∈ V, and in particular for all x ∈ U ⊆ V. This includes the Abelian group properties, the distributivity, the associativity and the neutral element. To determine whether (U, +, ·) is a subspace of V we still do need to show
1. U ̸= ∅, in particular: 0 ∈ U
2. Closure of U:
a. With respect to the outer operation: ∀λ ∈ R ∀x ∈ U : λx ∈ U.
b. With respect to the inner operation: ∀x, y ∈ U : x + y ∈ U.
Example 2.12 (Vector Subspaces)
Let us have a look at some examples:
For every vector space V , the trivial subspaces are V itself and {0}. Only example D in Figure 2.6 is a subspace of R2 (with the usual inner/outer operations). In A and C, the closure property is violated; B does not contain 0.
The solution set of a homogeneous system of linear equations Ax = 0 with n unknowns x = [x1, . . . , xn]⊤ is a subspace of Rn.
The solution of an inhomogeneous system of linear equations Ax = b, b ̸= 0 is not a subspace of Rn.
The intersection of arbitrarily many subspaces is a subspace itself.
Figure 2.6 Not all subsets of R2 are subspaces. In A and C, the closure property is violated; B does not contain 0. Only D is a subspace.
Remark. Every subspace U ⊆ (Rn, +, ·) is the solution space of a homogeneous system of linear equations Ax = 0 for x ∈ Rn. ♢
2.5 Linear Independence
In the following, we will have a close look at what we can do with vectors (elements of the vector space). In particular, we can add vectors together and multiply them with scalars. The closure property guarantees that we end up with another vector in the same vector space. It is possible to find a set of vectors with which we can represent every vector in the vector space by adding them together and scaling them. This set of vectors is a basis, and we will discuss them in Section 2.6.1. Before we get there, we will need to introduce the concepts of linear combinations and linear independence.
Definition 2.11 (Linear Combination). Consider a vector space V and a finite number of vectors x1, . . . , xk ∈ V . Then, every v ∈ V of the form

v = λ1x1 + · · · + λkxk = ∑_{i=1}^{k} λi xi ∈ V (2.65)

with λ1, . . . , λk ∈ R is a linear combination of the vectors x1, . . . , xk.

The 0-vector can always be written as the linear combination of k vectors x1, . . . , xk because 0 = ∑_{i=1}^{k} 0xi is always true. In the following, we are interested in non-trivial linear combinations of a set of vectors to represent 0, i.e., linear combinations of vectors x1, . . . , xk, where not all coefficients λi in (2.65) are 0.
Definition 2.12 (Linear (In)dependence). Let us consider a vector space V with k ∈ N and x1, . . . , xk ∈ V . If there is a non-trivial linear combination, such that 0 = ∑_{i=1}^{k} λi xi with at least one λi ≠ 0, the vectors x1, . . . , xk are linearly dependent. If only the trivial solution exists, i.e., λ1 = . . . = λk = 0, the vectors x1, . . . , xk are linearly independent.

Linear independence is one of the most important concepts in linear algebra. Intuitively, a set of linearly independent vectors consists of vectors that have no redundancy, i.e., if we remove any of those vectors from the set, we will lose something. Throughout the next sections, we will formalize this intuition more.
Example 2.13 (Linearly Dependent Vectors)
A geographic example may help to clarify the concept of linear independence. A person in Nairobi (Kenya) describing where Kigali (Rwanda) is might say, “You can get to Kigali by first going 506 km Northwest to Kampala (Uganda) and then 374 km Southwest.” This is sufficient information to describe the location of Kigali because the geographic coordinate system may be considered a two-dimensional vector space (ignoring altitude and the Earth’s curved surface). The person may add, “It is about 751 km West of here.” Although this last statement is true, it is not necessary to find Kigali given the previous information (see Figure 2.7 for an illustration). In this example, the “506 km Northwest” vector (blue) and the “374 km Southwest” vector (purple) are linearly independent. This means the Southwest vector cannot be described in terms of the Northwest vector, and vice versa. However, the third “751 km West” vector (black) is a linear combination of the other two vectors, and it makes the set of vectors linearly dependent: “751 km West” and “374 km Southwest” can be linearly combined to obtain “506 km Northwest”.
Figure 2.7 Geographic example (with crude approximations to cardinal directions) of linearly dependent vectors in a two-dimensional space (plane).
Remark. The following properties are useful to find out whether vectors are linearly independent:
k vectors are either linearly dependent or linearly independent. There is no third option.
If at least one of the vectors x1, . . . , xk is 0 then they are linearly dependent. The same holds if two vectors are identical.
The vectors {x1, . . . , xk : xi ≠ 0, i = 1, . . . , k}, k ⩾ 2, are linearly dependent if and only if (at least) one of them is a linear combination of the others. In particular, if one vector is a multiple of another vector, i.e., xi = λxj , λ ∈ R, then the set {x1, . . . , xk : xi ≠ 0, i = 1, . . . , k} is linearly dependent.
A practical way of checking whether vectors x1, . . . , xk ∈ V are linearly independent is to use Gaussian elimination: Write all vectors as columns of a matrix A and perform Gaussian elimination until the matrix is in row-echelon form (the reduced row-echelon form is unnecessary here):
– The pivot columns indicate the vectors, which are linearly independent of the vectors on their left. Note that there is an ordering of vectors when the matrix is built.
– The non-pivot columns can be expressed as linear combinations of the pivot columns on their left. For instance, the row-echelon form

[1 3 0; 0 0 2] (2.66)

tells us that the first and third columns are pivot columns. The second column is a non-pivot column because it is three times the first column.
All column vectors are linearly independent if and only if all columns are pivot columns. If there is at least one non-pivot column, the columns (and, therefore, the corresponding vectors) are linearly dependent.
Example 2.14
Consider R4 with

x1 = [1; 2; −3; 4] , x2 = [1; 1; 0; 2] , x3 = [−1; −2; 1; 1] . (2.67)
To check whether they are linearly dependent, we follow the general approach and solve

λ1x1 + λ2x2 + λ3x3 = λ1 [1; 2; −3; 4] + λ2 [1; 1; 0; 2] + λ3 [−1; −2; 1; 1] = 0 (2.68)
for λ1, . . . , λ3. We write the vectors xi, i = 1, 2, 3, as the columns of a matrix and apply elementary row operations until we identify the pivot columns:

[1 1 −1; 2 1 −2; −3 0 1; 4 2 1] ⇝ · · · ⇝ [1 1 −1; 0 1 0; 0 0 1; 0 0 0] . (2.69)
Here, every column of the matrix is a pivot column. Therefore, there is no non-trivial solution, and we require λ1 = 0, λ2 = 0, λ3 = 0 to solve the equation system. Hence, the vectors x1, x2, x3 are linearly independent.
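The same conclusion follows from a rank computation, as in this short sketch (ours, assuming NumPy): the vectors are linearly independent exactly when the matrix of stacked columns has full column rank.

import numpy as np

x1 = np.array([1., 2., -3., 4.])
x2 = np.array([1., 1., 0., 2.])
x3 = np.array([-1., -2., 1., 1.])

A = np.column_stack([x1, x2, x3])
print(np.linalg.matrix_rank(A) == A.shape[1])   # True: linearly independent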
Remark. Consider a vector space V with k linearly independent vectors b1, . . . , bk and m linear combinations

x1 = ∑_{i=1}^{k} λi1 bi ,
...
xm = ∑_{i=1}^{k} λim bi . (2.70)

Defining B = [b1, . . . , bk] as the matrix whose columns are the linearly independent vectors b1, . . . , bk, we can write

xj = Bλj , λj = [λ1j ; ... ; λkj ] , j = 1, . . . , m , (2.71)

in a more compact form.
We want to test whether x1, . . . , xm are linearly independent. For this purpose, we follow the general approach of testing when ∑_{j=1}^{m} ψj xj = 0. With (2.71), we obtain

∑_{j=1}^{m} ψj xj = ∑_{j=1}^{m} ψj Bλj = B ∑_{j=1}^{m} ψj λj . (2.72)

This means that {x1, . . . , xm} are linearly independent if and only if the column vectors {λ1, . . . , λm} are linearly independent.
Remark. In a vector space V , m linear combinations of k vectors x1, . . . , xk are linearly dependent if m > k. ♢
Example 2.15
Consider a set of linearly independent vectors b1, b2, b3, b4 ∈ Rn and

x1 = b1 − 2b2 + b3 − b4
x2 = −4b1 − 2b2 + 4b4
x3 = 2b1 + 3b2 − b3 − 3b4
x4 = 17b1 − 10b2 + 11b3 + b4 . (2.73)
Are the vectors x1, . . . , x4 ∈ Rn linearly independent? To answer this question, we investigate whether the column vectors

{ [1; −2; 1; −1] , [−4; −2; 0; 4] , [2; 3; −1; −3] , [17; −10; 11; 1] } (2.74)
are linearly independent. The reduced row-echelon form of the corresponding linear equation system with coefficient matrix

A = [1 −4 2 17; −2 −2 3 −10; 1 0 −1 11; −1 4 −3 1] (2.75)

is given as

[1 0 0 −7; 0 1 0 −15; 0 0 1 −18; 0 0 0 0] . (2.76)
We see that the corresponding linear equation system is non-trivially solvable: The last column is not a pivot column, and x4 = −7x1 − 15x2 − 18x3. Therefore, x1, . . . , x4 are linearly dependent as x4 can be expressed as a linear combination of x1, . . . , x3.
2.6 Basis and Rank
In a vector space V , we are particularly interested in sets of vectors A that possess the property that any vector v ∈ V can be obtained by a linear combination of vectors in A. These vectors are special vectors, and in the following, we will characterize them.
2.6.1 Generating Set and Basis
Definition 2.13 (Generating Set and Span). Consider a vector space V = (V, +, ·) and set of vectors A = {x1, . . . , xk} ⊆ V. If every vector v ∈ V can be expressed as a linear combination of x1, . . . , xk, A is called a generating set of V . The set of all linear combinations of vectors in A is called the span of A. If A spans the vector space V , we write V = span[A] or V = span[x1, . . . , xk].
Generating sets are sets of vectors that span vector (sub)spaces, i.e., every vector can be represented as a linear combination of the vectors in the generating set. Now, we will be more specific and characterize the smallest generating set that spans a vector (sub)space.
Definition 2.14 (Basis). Consider a vector space V = (V, +, ·) and A ⊆ V. A generating set A of V is called minimal if there exists no smaller set A˜ ⊊ A ⊆ V that spans V . Every linearly independent generating set of V is minimal and is called a basis of V .
Let V = (V, +, ·) be a vector space and B ⊆ V, B ≠ ∅. Then, the following statements are equivalent:

B is a basis of V .
B is a minimal generating set.
B is a maximal linearly independent set of vectors in V , i.e., adding any other vector to this set will make it linearly dependent.
Every vector x ∈ V is a linear combination of vectors from B, and every linear combination is unique, i.e., with

x = ∑_{i=1}^{k} λi bi = ∑_{i=1}^{k} ψi bi (2.77)

and λi, ψi ∈ R, bi ∈ B it follows that λi = ψi, i = 1, . . . , k.

A basis is thus a minimal generating set and a maximal linearly independent set of vectors.
Example 2.16
In R3, the canonical/standard basis is

B = { [1; 0; 0] , [0; 1; 0] , [0; 0; 1] } . (2.78)
Different bases in R3 are

B1 = { [1; 0; 0] , [1; 1; 0] , [1; 1; 1] } , B2 = { [0.5; 0.8; 0.4] , [1.8; 0.3; 0.3] , [−2.2; −1.3; 3.5] } . (2.79)
The set

A = { [1; 2; 3; 4] , [2; −1; 0; 2] , [1; 1; 0; −4] } (2.80)

is linearly independent, but not a generating set (and no basis) of R4: For instance, the vector [1, 0, 0, 0]⊤ cannot be obtained by a linear combination of elements in A.
Remark. Every vector space V possesses a basis B. The preceding examples show that there can be many bases of a vector space V , i.e., there is no unique basis. However, all bases possess the same number of elements, the basis vectors. ♢
We only consider finite-dimensional vector spaces V . In this case, the dimension of V is the number of basis vectors of V , and we write dim(V ). If U ⊆ V is a subspace of V , then dim(U) ⩽ dim(V ), and dim(U) = dim(V ) if and only if U = V . Intuitively, the dimension of a vector space can be thought of as the number of independent directions in this vector space; the dimension of a vector space corresponds to the number of its basis vectors.

Remark. The dimension of a vector space is not necessarily the number of elements in a vector. For instance, the vector space V = span[[0; 1]] is one-dimensional, although the basis vector possesses two elements. ♢

Remark. A basis of a subspace U = span[x1, . . . , xm] ⊆ Rn can be found by executing the following steps:
1. Write the spanning vectors as columns of a matrix A.
2. Determine the row-echelon form of A.
3. The spanning vectors associated with the pivot columns are a basis of U.
Example 2.17 (Determining a Basis)
For a vector subspace U ⊆ R5, spanned by the vectors

x1 = [1; 2; −1; −1; −1] , x2 = [2; −1; 1; 2; −2] , x3 = [3; −4; 3; 5; −3] , x4 = [−1; 8; −5; −6; 1] ∈ R5, (2.81)
we are interested in finding out which vectors x1, . . . , x4 are a basis for U. For this, we need to check whether x1, . . . , x4 are linearly independent. Therefore, we need to solve
∑_{i=1}^{4} λi xi = 0 , (2.82)
which leads to a homogeneous system of equations with matrix

[x1, x2, x3, x4] = [1 2 3 −1; 2 −1 −4 8; −1 1 3 −5; −1 2 5 −6; −1 −2 −3 1] . (2.83)
With the basic transformation rules for systems of linear equations, we obtain the row-echelon form

[1 2 3 −1; 2 −1 −4 8; −1 1 3 −5; −1 2 5 −6; −1 −2 −3 1] ⇝ · · · ⇝ [1 2 3 −1; 0 1 2 −2; 0 0 0 1; 0 0 0 0; 0 0 0 0] .
Since the pivot columns indicate which set of vectors is linearly independent, we see from the row-echelon form that x1, x2, x4 are linearly independent (because the system of linear equations λ1x1 + λ2x2 + λ4x4 = 0 can only be solved with λ1 = λ2 = λ4 = 0). Therefore, {x1, x2, x4} is a basis of U.
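The pivot columns can also be read off with exact arithmetic, e.g., with SymPy (a sketch we add; Matrix.rref returns the reduced row-echelon form together with the pivot column indices):

from sympy import Matrix

# Spanning vectors x1, ..., x4 from (2.81), stacked as columns
A = Matrix([[ 1,  2,  3, -1],
            [ 2, -1, -4,  8],
            [-1,  1,  3, -5],
            [-1,  2,  5, -6],
            [-1, -2, -3,  1]])

_, pivots = A.rref()
print(pivots)    # (0, 1, 3): x1, x2, x4 form a basis of U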
2.6.2 Rank
The number of linearly independent columns of a matrix A ∈ Rm×n equals the number of linearly independent rows and is called the rank of A; it is denoted by rk(A).
Remark. The rank of a matrix has some important properties:
rk(A) = rk(A⊤), i.e., the column rank equals the row rank.
The columns of A ∈ Rm×n span a subspace U ⊆ Rm with dim(U) = rk(A). Later we will call this subspace the image or range. A basis of U can be found by applying Gaussian elimination to A to identify the pivot columns.
The rows of A ∈ Rm×n span a subspace W ⊆ Rn with dim(W) = rk(A). A basis of W can be found by applying Gaussian elimination to A⊤.
For all A ∈ Rn×n it holds that A is regular (invertible) if and only if rk(A) = n.
For all A ∈ Rm×n and all b ∈ Rm it holds that the linear equation system Ax = b can be solved if and only if rk(A) = rk(A|b), where A|b denotes the augmented system.
For A ∈ Rm×n the subspace of solutions for Ax = 0 possesses dimension n − rk(A). Later, we will call this subspace the kernel or the null space.
A matrix A ∈ Rm×n has full rank if its rank equals the largest possible rank for a matrix of the same dimensions. This means that the rank of a full-rank matrix is the lesser of the number of rows and columns, i.e., rk(A) = min(m, n). A matrix is said to be rank deficient if it does not have full rank.
Example 2.18 (Rank)
A = [1 0 1; 0 1 1; 0 0 0] .

A has two linearly independent rows/columns so that rk(A) = 2.
A = [1 2 1; −2 −3 1; 3 5 0] .
We use Gaussian elimination to determine the rank:

[1 2 1; −2 −3 1; 3 5 0] ⇝ · · · ⇝ [1 2 1; 0 1 3; 0 0 0] . (2.84)
Here, we see that the number of linearly independent rows and columns is 2, such that rk(A) = 2.
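Numerically, the rank is usually computed from a singular value decomposition, as NumPy does internally; a quick sketch (ours) reproduces both results of Example 2.18:

import numpy as np

A1 = np.array([[1., 0., 1.],
               [0., 1., 1.],
               [0., 0., 0.]])
A2 = np.array([[ 1.,  2., 1.],
               [-2., -3., 1.],
               [ 3.,  5., 0.]])

print(np.linalg.matrix_rank(A1), np.linalg.matrix_rank(A2))  # 2 2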
2.7 Linear Mappings
In the following, we will study mappings on vector spaces that preserve their structure, which will allow us to define the concept of a coordinate. In the beginning of the chapter, we said that vectors are objects that can be added together and multiplied by a scalar, and the resulting object is still a vector. We wish to preserve this property when applying the mapping: Consider two real vector spaces V, W. A mapping Φ : V → W preserves the structure of the vector space if
Φ(x + y) = Φ(x) + Φ(y) (2.85)
Φ(λx) = λΦ(x) (2.86)
for all x, y ∈ V and λ ∈ R. We can summarize this in the following definition:
Definition 2.15 (Linear Mapping). For vector spaces V, W, a mapping Φ : V → W is called a linear mapping (or vector space homomorphism/linear transformation) if

∀x, y ∈ V ∀λ, ψ ∈ R : Φ(λx + ψy) = λΦ(x) + ψΦ(y) . (2.87)
It turns out that we can represent linear mappings as matrices (Section 2.7.1). Recall that we can also collect a set of vectors as columns of a matrix. When working with matrices, we have to keep in mind what the matrix represents: a linear mapping or a collection of vectors. We will see more about linear mappings in Chapter 4. Before we continue, we will briefly introduce special mappings.
Definition 2.16 (Injective, Surjective, Bijective). Consider a mapping Φ : V → W, where V, W can be arbitrary sets. Then Φ is called
Injective if ∀x, y ∈ V : Φ(x) = Φ(y) =⇒ x = y.
Surjective if Φ(V) = W.
Bijective if it is injective and surjective.
If Φ is surjective, then every element in W can be “reached” from V using Φ. A bijective Φ can be “undone”, i.e., there exists a mapping Ψ : W → V so that Ψ ◦ Φ(x) = x. This mapping Ψ is then called the inverse of Φ and normally denoted by Φ−1.
With these definitions, we introduce the following special cases of linear mappings between vector spaces V and W:

Isomorphism: Φ : V → W linear and bijective
Endomorphism: Φ : V → V linear
Automorphism: Φ : V → V linear and bijective

We define idV : V → V , x 7→ x as the identity mapping or identity automorphism in V .
Example 2.19 (Homomorphism)
The mapping Φ : R2 → C, Φ(x) = x1 + ix2, is a homomorphism:

Φ([x1; x2] + [y1; y2]) = (x1 + y1) + i(x2 + y2) = x1 + ix2 + y1 + iy2 = Φ([x1; x2]) + Φ([y1; y2])
Φ(λ[x1; x2]) = λx1 + λix2 = λ(x1 + ix2) = λΦ([x1; x2]) .
(2.88)
This also justifies why complex numbers can be represented as tuples in R2: There is a bijective linear mapping that converts the elementwise addi tion of tuples in R2into the set of complex numbers with the correspond ing addition. Note that we only showed linearity, but not the bijection.
Theorem 2.17 (Theorem 3.59 in Axler (2015)). Finite-dimensional vector spaces V and W are isomorphic if and only if dim(V ) = dim(W).
Theorem 2.17 states that there exists a linear, bijective mapping between two vector spaces of the same dimension. Intuitively, this means that vector spaces of the same dimension are kind of the same thing, as they can be transformed into each other without incurring any loss.
Theorem 2.17 also gives us the justification to treat Rm×n (the vector space of m × n-matrices) and Rmn (the vector space of vectors of length mn) the same, as their dimensions are mn, and there exists a linear, bijective mapping that transforms one into the other.
Remark. Consider vector spaces V, W, X. Then:
For linear mappings Φ : V → W and Ψ : W → X, the mapping Ψ ◦ Φ : V → X is also linear.
If Φ : V → W is an isomorphism, then Φ−1 : W → V is an isomorphism, too.
If Φ : V → W, Ψ : V → W are linear, then Φ + Ψ and λΦ, λ ∈ R, are linear, too.

Figure 2.8 Two different coordinate systems defined by two sets of basis vectors. A vector x has different coordinate representations depending on which coordinate system is chosen.
2.7.1 Matrix Representation of Linear Mappings
Any n-dimensional vector space is isomorphic to Rn (Theorem 2.17). We consider a basis {b1, . . . , bn} of an n-dimensional vector space V . In the following, the order of the basis vectors will be important. Therefore, we write
B = (b1, . . . , bn) (2.89)
and call this n-tuple an ordered basis of V .
Remark (Notation). We are at the point where notation gets a bit tricky. Therefore, we summarize some parts here. B = (b1, . . . , bn) is an ordered basis, B = {b1, . . . , bn} is an (unordered) basis, and B = [b1, . . . , bn] is a matrix whose columns are the vectors b1, . . . , bn. ♢
Definition 2.18 (Coordinates). Consider a vector space V and an ordered basis B = (b1, . . . , bn) of V . For any x ∈ V we obtain a unique representation (linear combination)

x = α1b1 + . . . + αnbn (2.90)

of x with respect to B. Then α1, . . . , αn are the coordinates of x with respect to B, and the vector

α = [α1; ... ; αn] ∈ Rn (2.91)

is the coordinate vector/coordinate representation of x with respect to the ordered basis B.
A basis effectively defines a coordinate system. We are familiar with the Cartesian coordinate system in two dimensions, which is spanned by the canonical basis vectors e1, e2. In this coordinate system, a vector x ∈ R2 has a representation that tells us how to linearly combine e1 and e2 to obtain x. However, any basis of R2 defines a valid coordinate system, and the same vector x from before may have a different coordinate representation in the (b1, b2) basis. In Figure 2.8, the coordinates of x with respect to the standard basis (e1, e2) are [2, 2]⊤. However, with respect to the basis (b1, b2) the same vector x is represented as [1.09, 0.72]⊤, i.e., x = 1.09b1 + 0.72b2. In the following sections, we will discover how to obtain this representation.
Example 2.20
Let us have a look at a geometric vector x ∈ R2 with coordinates [2, 3]⊤ with respect to the standard basis (e1, e2) of R2. This means, we can write x = 2e1 + 3e2. However, we do not have to choose the standard basis to represent this vector. If we use the basis vectors b1 = [1, −1]⊤, b2 = [1, 1]⊤ we will obtain the coordinates (1/2)[−1, 5]⊤ to represent the same vector with respect to (b1, b2); see Figure 2.9, which shows the different coordinate representations of the same vector x depending on the choice of basis: x = 2e1 + 3e2 = −(1/2)b1 + (5/2)b2.
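Finding the coordinates of x with respect to a basis (b1, b2) amounts to solving the linear system Bα = x, where the columns of B are the basis vectors, as in this sketch (ours, assuming NumPy):

import numpy as np

x = np.array([2., 3.])              # coordinates w.r.t. the standard basis
B = np.column_stack([[1., -1.],     # b1
                     [1., 1.]])     # b2

alpha = np.linalg.solve(B, x)       # coordinates w.r.t. (b1, b2)
print(alpha)                        # [-0.5  2.5], i.e., (1/2)[-1, 5]^T
assert np.allclose(B @ alpha, x)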
Remark. For an n-dimensional vector space V and an ordered basis B of V , the mapping Φ : Rn → V , Φ(ei) = bi, i = 1, . . . , n, is linear (and because of Theorem 2.17 an isomorphism), where (e1, . . . , en) is the standard basis of Rn.
Now we are ready to make an explicit connection between matrices and linear mappings between finite-dimensional vector spaces.
Definition 2.19 (Transformation Matrix). Consider vector spaces V, W with corresponding (ordered) bases B = (b1, . . . , bn) and C = (c1, . . . , cm). Moreover, we consider a linear mapping Φ : V → W. For j ∈ {1, . . . , n},
Φ(bj ) = α1jc1 + · · · + αmjcm = ∑_{i=1}^{m} αij ci (2.92)
is the unique representation of Φ(bj ) with respect to C. Then, we call the m × n-matrix AΦ, whose elements are given by
AΦ(i, j) = αij , (2.93)
the transformation matrix of Φ (with respect to the ordered bases B of V and C of W).
The coordinates of Φ(bj ) with respect to the ordered basis C of W are the j-th column of AΦ. Consider (finite-dimensional) vector spaces V, W with ordered bases B, C and a linear mapping Φ : V → W with
transformation matrix AΦ. If xˆ is the coordinate vector of x ∈ V with respect to B and yˆ the coordinate vector of y = Φ(x) ∈ W with respect to C, then
yˆ = AΦxˆ . (2.94)
This means that the transformation matrix can be used to map coordinates with respect to an ordered basis in V to coordinates with respect to an ordered basis in W.
Example 2.21 (Transformation Matrix)
Consider a homomorphism Φ : V → W and ordered bases B = (b1, . . . , b3) of V and C = (c1, . . . , c4) of W. With
Φ(b1) = c1 − c2 + 3c3 − c4
Φ(b2) = 2c1 + c2 + 7c3 + 2c4
Φ(b3) = 3c2 + c3 + 4c4 (2.95)
the transformation matrix AΦ with respect to B and C satisfies Φ(bk) = ∑_{i=1}^{4} αik ci for k = 1, . . . , 3 and is given as

AΦ = [α1, α2, α3] = [1 2 0; −1 1 3; 3 7 1; −1 2 4] , (2.96)
where the αj , j = 1, 2, 3, are the coordinate vectors of Φ(bj ) with respect to C.
Example 2.22 (Linear Transformations of Vectors)
Figure 2.10 Three examples of linear transformations of the vectors shown as dots in (a): (a) original data; (b) rotation by 45°; (c) stretching of the horizontal coordinates by 2; (d) combination of reflection, rotation and stretching.
We consider three linear transformations of a set of vectors in R2 with the transformation matrices
A1 = [cos(π/4) −sin(π/4); sin(π/4) cos(π/4)] , A2 = [2 0; 0 1] , A3 = (1/2)[3 −1; 1 −1] . (2.97)
Figure 2.10 gives three examples of linear transformations of a set of vectors. Figure 2.10(a) shows 400 vectors in R2, each of which is represented by a dot at the corresponding (x1, x2)-coordinates. The vectors are arranged in a square. When we use matrix A1 in (2.97) to linearly transform each of these vectors, we obtain the rotated square in Figure 2.10(b). If we apply the linear mapping represented by A2, we obtain the rectangle in Figure 2.10(c) where each x1-coordinate is stretched by 2. Figure 2.10(d) shows the original square from Figure 2.10(a) when linearly transformed using A3, which is a combination of a reflection, a rotation, and a stretch.
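The transformations of Figure 2.10 are easy to reproduce: stack the points as columns of a 2 × N matrix and multiply from the left, as in this sketch (ours, assuming NumPy):

import numpy as np

# 400 points arranged in a square, one point per column
grid = np.linspace(-1., 1., 20)
X = np.array([[x1, x2] for x1 in grid for x2 in grid]).T   # shape (2, 400)

A1 = np.array([[np.cos(np.pi/4), -np.sin(np.pi/4)],
               [np.sin(np.pi/4),  np.cos(np.pi/4)]])       # rotation by 45 degrees
A2 = np.array([[2., 0.],
               [0., 1.]])                                  # stretch x1 by 2
A3 = 0.5 * np.array([[3., -1.],
                     [1., -1.]])                           # general linear mapping

rotated, stretched, general = A1 @ X, A2 @ X, A3 @ X       # transform all points at once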
2.7.2 Basis Change
In the following, we will have a closer look at how transformation matrices of a linear mapping Φ : V → W change if we change the bases in V and W. Consider two ordered bases
B = (b1, . . . , bn), B˜ = (˜b1, . . . , ˜bn) (2.98)

of V and two ordered bases

C = (c1, . . . , cm), C˜ = (c˜1, . . . , c˜m) (2.99)

of W. Moreover, AΦ ∈ Rm×n is the transformation matrix of the linear mapping Φ : V → W with respect to the bases B and C, and A˜Φ ∈ Rm×n is the corresponding transformation mapping with respect to B˜ and C˜. In the following, we will investigate how A and A˜ are related, i.e., how/whether we can transform AΦ into A˜Φ if we choose to perform a basis change from B, C to B˜, C˜.
Remark. We effectively get different coordinate representations of the identity mapping idV . In the context of Figure 2.9, this would mean to map coordinates with respect to (e1, e2) onto coordinates with respect to (b1, b2) without changing the vector x. By changing the basis and correspondingly the representation of vectors, the transformation matrix with respect to this new basis can have a particularly simple form that allows for straightforward computation. ♢
Example 2.23 (Basis Change)
Consider a transformation matrix

A = [2 1; 1 2] (2.100)

with respect to the canonical basis in R2. If we define a new basis

B = ([1; 1], [1; −1]) (2.101)

we obtain a diagonal transformation matrix

A˜ = [3 0; 0 1] (2.102)
with respect to B, which is easier to work with than A.
In the following, we will look at mappings that transform coordinate vectors with respect to one basis into coordinate vectors with respect to a different basis. We will state our main result first and then provide an explanation.
Theorem 2.20 (Basis Change). For a linear mapping Φ : V → W, ordered bases

B = (b1, . . . , bn), B˜ = (˜b1, . . . , ˜bn) (2.103)

of V and

C = (c1, . . . , cm), C˜ = (c˜1, . . . , c˜m) (2.104)

of W, and a transformation matrix AΦ of Φ with respect to B and C, the corresponding transformation matrix A˜Φ with respect to the bases B˜ and C˜ is given as

A˜Φ = T−1AΦS . (2.105)

Here, S ∈ Rn×n is the transformation matrix of idV that maps coordinates with respect to B˜ onto coordinates with respect to B, and T ∈ Rm×m is the transformation matrix of idW that maps coordinates with respect to C˜ onto coordinates with respect to C.
Proof Following Drumm and Weil (2001), we can write the vectors of the new basis B˜ of V as a linear combination of the basis vectors of B, such that
˜bj = s1jb1 + · · · + snjbn = ∑_{i=1}^{n} sij bi , j = 1, . . . , n . (2.106)
Similarly, we write the new basis vectors C˜ of W as a linear combination of the basis vectors of C, which yields
c˜k = t1kc1 + · · · + tmkcm = ∑_{l=1}^{m} tlk cl , k = 1, . . . , m . (2.107)
We define S = ((sij )) ∈ Rn×n as the transformation matrix that maps coordinates with respect to B˜ onto coordinates with respect to B and T = ((tlk)) ∈ Rm×m as the transformation matrix that maps coordinates with respect to C˜ onto coordinates with respect to C. In particular, the jth column of S is the coordinate representation of ˜bj with respect to B and
the kth column of T is the coordinate representation of c˜k with respect to C. Note that both S and T are regular.
We are going to look at Φ(˜bj ) from two perspectives. First, applying the mapping Φ, we get that for all j = 1, . . . , n
Φ(˜bj ) = ∑_{k=1}^{m} a˜kj c˜k = ∑_{k=1}^{m} a˜kj ∑_{l=1}^{m} tlk cl = ∑_{l=1}^{m} ( ∑_{k=1}^{m} tlk a˜kj ) cl , (2.108)

where we used (2.107) to express the new basis vectors c˜k ∈ W as linear combinations of the basis vectors cl ∈ W and then swapped the order of summation.
Alternatively, when we express the ˜bj ∈ V as linear combinations of the bi ∈ V via (2.106), we arrive at

Φ(˜bj ) = Φ( ∑_{i=1}^{n} sij bi ) = ∑_{i=1}^{n} sij Φ(bi) = ∑_{i=1}^{n} sij ∑_{l=1}^{m} ali cl (2.109a)

= ∑_{l=1}^{m} ( ∑_{i=1}^{n} ali sij ) cl , j = 1, . . . , n , (2.109b)
where we exploited the linearity of Φ. Comparing (2.108) and (2.109b), it follows for all j = 1, . . . , n and l = 1, . . . , m that
∑_{k=1}^{m} tlk a˜kj = ∑_{i=1}^{n} ali sij (2.110)

and, therefore,

T A˜Φ = AΦS ∈ Rm×n , (2.111)
such that
A˜Φ = T−1AΦS , (2.112)
which proves Theorem 2.20.
Theorem 2.20 tells us that with a basis change in V (B is replaced with B˜) and W (C is replaced with C˜), the transformation matrix AΦ of a linear mapping Φ : V → W is replaced by an equivalent matrix A˜Φ with

A˜Φ = T−1AΦS . (2.113)
Figure 2.11 illustrates this relation: Consider a homomorphism Φ : V → W and ordered bases B, B˜ of V and C, C˜ of W. The mapping ΦCB is an instantiation of Φ and maps basis vectors of B onto linear combinations of basis vectors of C. Assume that we know the transformation matrix AΦ of ΦCB with respect to the ordered bases B, C. When we perform a basis change from B to B˜ in V and from C to C˜ in W, we can determine the
corresponding transformation matrix A˜Φ as follows: First, we find the matrix representation of the linear mapping ΨBB˜ : V → V that maps coordinates with respect to the new basis B˜ onto the (unique) coordinates with respect to the “old” basis B (in V ). Then, we use the transformation matrix AΦ of ΦCB : V → W to map these coordinates onto the coordinates with respect to C in W. Finally, we use a linear mapping ΞCC˜ : W → W to map the coordinates with respect to C onto coordinates with respect to C˜. Therefore, we can express the linear mapping ΦC˜B˜ as a composition of linear mappings that involve the “old” basis:
ΦC˜B˜ = ΞC˜C ◦ ΦCB ◦ ΨBB˜ = Ξ−1CC˜ ◦ ΦCB ◦ ΨBB˜ . (2.114)

Figure 2.11 For a homomorphism Φ : V → W and ordered bases B, B˜ of V and C, C˜ of W, we can express the mapping ΦC˜B˜ with respect to the bases B˜, C˜ equivalently as a composition of the homomorphisms ΦC˜B˜ = ΞC˜C ◦ ΦCB ◦ ΨBB˜ with respect to the bases in the subscripts.
Concretely, we use ΨBB˜ = idV and ΞCC˜ = idW , i.e., the identity mappings that map vectors onto themselves, but with respect to a different basis.
Definition 2.21 (Equivalence). Two matrices A, A˜ ∈ Rm×n are equivalent if there exist regular matrices S ∈ Rn×n and T ∈ Rm×m, such that A˜ = T−1AS.
Definition 2.22 (Similarity). Two matrices A, A˜ ∈ Rn×n are similar if there exists a regular matrix S ∈ Rn×n with A˜ = S−1AS.
Remark. Similar matrices are always equivalent. However, equivalent matrices are not necessarily similar. ♢
Remark. Consider vector spaces V, W, X. From the remark that follows Theorem 2.17, we already know that for linear mappings Φ : V → W and Ψ : W → X the mapping Ψ ◦ Φ : V → X is also linear. With transformation matrices AΦ and AΨ of the corresponding mappings, the overall transformation matrix is AΨ◦Φ = AΨAΦ. ♢
In light of this remark, we can look at basis changes from the perspective of composing linear mappings:

A_Φ is the transformation matrix of a linear mapping Φ_CB : V → W with respect to the bases B, C.
Ã_Φ is the transformation matrix of the linear mapping Φ_C̃B̃ : V → W with respect to the bases B̃, C̃.
S is the transformation matrix of a linear mapping Ψ_BB̃ : V → V (automorphism) that represents B̃ in terms of B. Normally, Ψ = id_V is the identity mapping in V.
T is the transformation matrix of a linear mapping Ξ_CC̃ : W → W (automorphism) that represents C̃ in terms of C. Normally, Ξ = id_W is the identity mapping in W.

If we (informally) write down the transformations just in terms of bases, then A_Φ : B → C, Ã_Φ : B̃ → C̃, S : B̃ → B, T : C̃ → C, and T⁻¹ : C → C̃, and

B̃ → C̃ = B̃ → B → C → C̃   (2.115)
Ã_Φ = T⁻¹ A_Φ S .   (2.116)

Note that the execution order in (2.116) is from right to left because vectors are multiplied at the right-hand side so that x ↦ Sx ↦ A_Φ(Sx) ↦ T⁻¹(A_Φ(Sx)) = Ã_Φ x.
Example 2.24 (Basis Change)
Consider a linear mapping Φ : R³ → R⁴ whose transformation matrix is

A_Φ = [[1, 2, 0], [−1, 1, 3], [3, 7, 1], [−1, 2, 4]]   (2.117)

with respect to the standard bases

B = ( [1, 0, 0]⊤, [0, 1, 0]⊤, [0, 0, 1]⊤ ) ,  C = ( [1, 0, 0, 0]⊤, [0, 1, 0, 0]⊤, [0, 0, 1, 0]⊤, [0, 0, 0, 1]⊤ ) .   (2.118)

We seek the transformation matrix Ã_Φ of Φ with respect to the new bases

B̃ = ( [1, 1, 0]⊤, [0, 1, 1]⊤, [1, 0, 1]⊤ ) ,  C̃ = ( [1, 1, 0, 0]⊤, [1, 0, 1, 0]⊤, [0, 1, 1, 0]⊤, [1, 0, 0, 1]⊤ ) .   (2.119)

Then,

S = [[1, 0, 1], [1, 1, 0], [0, 1, 1]] ,  T = [[1, 1, 0, 1], [1, 0, 1, 0], [0, 1, 1, 0], [0, 0, 0, 1]] ,   (2.120)

where the ith column of S is the coordinate representation of b̃_i in terms of the basis vectors of B. Since B is the standard basis, the coordinate representation is straightforward to find. For a general basis B, we would need to solve a linear equation system to find the λ_i such that Σ_{i=1}^3 λ_i b_i = b̃_j, j = 1, . . . , 3. Similarly, the jth column of T is the coordinate representation of c̃_j in terms of the basis vectors of C. Therefore, we obtain

Ã_Φ = T⁻¹ A_Φ S = (1/2) [[1, 1, −1, −1], [1, −1, 1, −1], [−1, 1, 1, 1], [0, 0, 0, 2]] · [[3, 2, 1], [0, 4, 2], [10, 8, 4], [1, 6, 3]]   (2.121a)
= [[−4, −4, −2], [6, 0, 0], [4, 8, 4], [1, 6, 3]] .   (2.121b)
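We can sanity-check Example 2.24 numerically. The following is a small Python/NumPy sketch (our own illustration, not part of the worked example; the variable names are ours):

import numpy as np

# Transformation matrix of Phi w.r.t. the standard bases B, C; see (2.117)
A_phi = np.array([[1, 2, 0],
                  [-1, 1, 3],
                  [3, 7, 1],
                  [-1, 2, 4]])

# Columns of S: coordinates of the new basis vectors of B~ in terms of B; see (2.120)
S = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1]])

# Columns of T: coordinates of the new basis vectors of C~ in terms of C; see (2.120)
T = np.array([[1, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 1]])

# Basis change according to (2.112): A~_phi = T^{-1} A_phi S
A_phi_tilde = np.linalg.inv(T) @ A_phi @ S
print(A_phi_tilde)
# [[-4. -4. -2.]
#  [ 6.  0.  0.]
#  [ 4.  8.  4.]
#  [ 1.  6.  3.]]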
In Chapter 4, we will be able to exploit the concept of a basis change to find a basis with respect to which the transformation matrix of an endomorphism has a particularly simple (diagonal) form. In Chapter 10, we will look at a data compression problem and find a convenient basis onto which we can project the data while minimizing the compression loss.
2.7.3 Image and Kernel
The image and kernel of a linear mapping are vector subspaces with certain important properties. In the following, we will characterize them more carefully.
Definition 2.23 (Image and Kernel). For Φ : V → W, we define the kernel/null space

ker(Φ) := Φ⁻¹(0_W) = {v ∈ V : Φ(v) = 0_W}   (2.122)

and the image/range

Im(Φ) := Φ(V) = {w ∈ W | ∃v ∈ V : Φ(v) = w} .   (2.123)

We also call V and W the domain and codomain of Φ, respectively. Intuitively, the kernel is the set of vectors v ∈ V that Φ maps onto the neutral element 0_W ∈ W. The image is the set of vectors w ∈ W that can be “reached” by Φ from any vector in V. An illustration is given in Figure 2.12.
Remark. Consider a linear mapping Φ : V → W, where V, W are vector spaces.
It always holds that Φ(0V ) = 0W and, therefore, 0V ∈ ker(Φ). In particular, the null space is never empty.
Im(Φ) ⊆ W is a subspace of W, and ker(Φ) ⊆ V is a subspace of V .
Φ is injective (one-to-one) if and only if ker(Φ) = {0}.

Figure 2.12 Kernel and image of a linear mapping Φ : V → W.
Remark (Null Space and Column Space). Let us consider A ∈ R^{m×n} and a linear mapping Φ : Rⁿ → Rᵐ, x ↦ Ax.
For A = [a_1, . . . , a_n], where a_i are the columns of A, we obtain

Im(Φ) = {Ax : x ∈ Rⁿ} = { Σ_{i=1}^n x_i a_i : x_1, . . . , x_n ∈ R }   (2.124a)
= span[a_1, . . . , a_n] ⊆ Rᵐ ,   (2.124b)

i.e., the image is the span of the columns of A, also called the column space. Therefore, the column space (image) is a subspace of Rᵐ, where m is the “height” of the matrix.
rk(A) = dim(Im(Φ)).
The kernel/null space ker(Φ) is the general solution to the homogeneous system of linear equations Ax = 0 and captures all possible linear combinations of the elements in Rⁿ that produce 0 ∈ Rᵐ.
The kernel is a subspace of Rⁿ, where n is the “width” of the matrix. The kernel focuses on the relationship among the columns, and we can use it to determine whether/how we can express a column as a linear combination of other columns.
Example 2.25 (Image and Kernel of a Linear Mapping)
The mapping
Φ : R⁴ → R² ,

[x₁, x₂, x₃, x₄]⊤ ↦ [[1, 2, −1, 0], [1, 0, 0, 1]] [x₁, x₂, x₃, x₄]⊤ = [x₁ + 2x₂ − x₃, x₁ + x₄]⊤   (2.125a)
= x₁ [1, 1]⊤ + x₂ [2, 0]⊤ + x₃ [−1, 0]⊤ + x₄ [0, 1]⊤   (2.125b)

is linear. To determine Im(Φ), we can take the span of the columns of the transformation matrix and obtain

Im(Φ) = span[ [1, 1]⊤, [2, 0]⊤, [−1, 0]⊤, [0, 1]⊤ ] .   (2.126)

To compute the kernel (null space) of Φ, we need to solve Ax = 0, i.e., we need to solve a homogeneous equation system. To do this, we use Gaussian elimination to transform A into reduced row-echelon form:

[[1, 2, −1, 0], [1, 0, 0, 1]] ⇝ · · · ⇝ [[1, 0, 0, 1], [0, 1, −1/2, −1/2]] .   (2.127)
This matrix is in reduced row-echelon form, and we can use the Minus-1 Trick to compute a basis of the kernel (see Section 2.3.3). Alternatively, we can express the non-pivot columns (columns 3 and 4) as linear combinations of the pivot columns (columns 1 and 2). The third column a₃ is equivalent to −1/2 times the second column a₂. Therefore, 0 = a₃ + (1/2)a₂. In the same way, we see that a₄ = a₁ − (1/2)a₂ and, therefore, 0 = a₁ − (1/2)a₂ − a₄. Overall, this gives us the kernel (null space) as

ker(Φ) = span[ [0, 1/2, 1, 0]⊤, [−1, 1/2, 0, 1]⊤ ] .   (2.128)
Theorem 2.24 (Rank-Nullity Theorem). For vector spaces V, W and a linear mapping Φ : V → W it holds that

dim(ker(Φ)) + dim(Im(Φ)) = dim(V ) .   (2.129)

The rank-nullity theorem is also referred to as the fundamental theorem of linear mappings (Axler, 2015, theorem 3.22). The following are direct consequences of Theorem 2.24:
If dim(Im(Φ)) < dim(V ), then ker(Φ) is non-trivial, i.e., the kernel contains more than 0V and dim(ker(Φ)) ⩾ 1.
If AΦ is the transformation matrix of Φ with respect to an ordered basis and dim(Im(Φ)) < dim(V ), then the system of linear equations AΦx = 0 has infinitely many solutions.
If dim(V ) = dim(W), then the following three-way equivalence holds: – Φ is injective
– Φ is surjective
– Φ is bijective
since Im(Φ) ⊆ W.
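As a quick cross-check of Example 2.25 and of the rank-nullity theorem, here is a short Python sketch (ours, not part of the text) using SymPy's exact arithmetic:

from sympy import Matrix

# Transformation matrix of Phi : R^4 -> R^2 from Example 2.25
A = Matrix([[1, 2, -1, 0],
            [1, 0, 0, 1]])

# Basis of the kernel; matches span[[0, 1/2, 1, 0]^T, [-1, 1/2, 0, 1]^T] in (2.128)
print(A.nullspace())

# Basis of the image (column space) and rk(A) = dim(Im(Phi))
print(A.columnspace(), A.rank())

# Rank-nullity theorem (2.129): dim(ker(Phi)) + dim(Im(Phi)) = dim(V) = 4
assert len(A.nullspace()) + A.rank() == A.cols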
2.8 Affine Spaces
In the following, we will have a closer look at spaces that are offset from the origin, i.e., spaces that are no longer vector subspaces. Moreover, we will briefly discuss properties of mappings between these affine spaces, which resemble linear mappings.
Remark. In the machine learning literature, the distinction between linear and affine is sometimes not clear so that we can find references to affine spaces/mappings as linear spaces/mappings. ♢
2.8.1 Affine Subspaces
Definition 2.25 (Affine Subspace). Let V be a vector space, x0 ∈ V and U ⊆ V a subspace. Then the subset
L = x0 + U := {x0 + u : u ∈ U} (2.130a)
= {v ∈ V |∃u ∈ U : v = x0 + u} ⊆ V (2.130b)
is called an affine subspace or linear manifold of V. U is called the direction or direction space, and x₀ is called the support point. In Chapter 12, we refer to such a subspace as a hyperplane.
Note that the definition of an affine subspace excludes 0 if x₀ ∉ U. Therefore, an affine subspace is not a (linear) subspace (vector subspace) of V for x₀ ∉ U.
Examples of affine subspaces are points, lines, and planes in R3, which do not (necessarily) go through the origin.
Remark. Consider two affine subspaces L = x₀ + U and L̃ = x̃₀ + Ũ of a vector space V. Then, L ⊆ L̃ if and only if U ⊆ Ũ and x₀ − x̃₀ ∈ Ũ.
Affine subspaces are often described by parameters: Consider a k-dimensional affine space L = x₀ + U of V. If (b₁, . . . , b_k) is an ordered basis of U, then every element x ∈ L can be uniquely described as

x = x₀ + λ₁b₁ + . . . + λ_k b_k ,   (2.131)

where λ₁, . . . , λ_k ∈ R. This representation is called the parametric equation of L with directional vectors b₁, . . . , b_k and parameters λ₁, . . . , λ_k. ♢
Example 2.26 (Affine Subspaces)
One-dimensional affine subspaces are called lines and can be written as y = x₀ + λb₁, where λ ∈ R and U = span[b₁] ⊆ Rⁿ is a one-dimensional subspace of Rⁿ. This means that a line is defined by a support point x₀ and a vector b₁ that defines the direction. See Figure 2.13 for an illustration.
Two-dimensional affine subspaces of Rⁿ are called planes. The parametric equation for planes is y = x₀ + λ₁b₁ + λ₂b₂, where λ₁, λ₂ ∈ R and U = span[b₁, b₂] ⊆ Rⁿ. This means that a plane is defined by a support point x₀ and two linearly independent vectors b₁, b₂ that span the direction space.

In Rⁿ, the (n − 1)-dimensional affine subspaces are called hyperplanes, and the corresponding parametric equation is y = x₀ + Σ_{i=1}^{n−1} λᵢbᵢ, where b₁, . . . , b_{n−1} form a basis of an (n − 1)-dimensional subspace U of Rⁿ. This means that a hyperplane is defined by a support point x₀ and (n − 1) linearly independent vectors b₁, . . . , b_{n−1} that span the direction space. In R², a line is also a hyperplane. In R³, a plane is also a hyperplane.
Figure 2.13 Lines are affine subspaces. Vectors y on a line x₀ + λb₁ lie in an affine subspace L with support point x₀ and direction b₁.
Remark (Inhomogeneous systems of linear equations and affine subspaces). For A ∈ R^{m×n} and x ∈ Rᵐ, the solution of the system of linear equations Aλ = x is either the empty set or an affine subspace of Rⁿ of dimension n − rk(A). In particular, the solution of the linear equation λ₁b₁ + . . . + λₙbₙ = x, where (λ₁, . . . , λₙ) ≠ (0, . . . , 0), is a hyperplane in Rⁿ.
In Rⁿ, every k-dimensional affine subspace is the solution of an inhomogeneous system of linear equations Ax = b, where A ∈ R^{m×n}, b ∈ Rᵐ and rk(A) = n − k. Recall that for homogeneous equation systems Ax = 0 the solution was a vector subspace, which we can also think of as a special affine space with support point x₀ = 0. ♢
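To illustrate the remark, the following sketch (our own toy example, not from the text) recovers the affine structure of a solution set numerically: a particular solution serves as the support point, and the null space of A provides the direction space.

import numpy as np
from numpy.linalg import lstsq, matrix_rank

# Toy system (ours): one equation in R^3, so the solution set is an
# affine subspace of dimension n - rk(A) = 3 - 1 = 2 (a plane).
A = np.array([[1.0, 2.0, -1.0]])
b = np.array([3.0])

# A particular solution x0 plays the role of the support point
x0 = lstsq(A, b, rcond=None)[0]

# An orthonormal basis of ker(A) from the SVD spans the direction space U
_, _, Vt = np.linalg.svd(A)
U = Vt[matrix_rank(A):]

# Every x0 + lambda_1 u1 + lambda_2 u2 solves Ax = b
lam = np.array([0.7, -1.3])
x = x0 + lam @ U
print(np.allclose(A @ x, b))  # True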
2.8.2 Affine Mappings
Similar to linear mappings between vector spaces, which we discussed in Section 2.7, we can define affine mappings between two affine spaces. Linear and affine mappings are closely related. Therefore, many properties that we already know from linear mappings, e.g., that the composition of linear mappings is a linear mapping, also hold for affine mappings.
Definition 2.26 (Affine Mapping). For two vector spaces V, W, a linear mapping Φ : V → W, and a ∈ W, the mapping

ϕ : V → W ,   (2.132)
x ↦ a + Φ(x) ,   (2.133)

is an affine mapping from V to W. The vector a is called the translation vector of ϕ.
Every affine mapping ϕ : V → W is also the composition of a linear mapping Φ : V → W and a translation τ : W → W in W, such that ϕ = τ ◦ Φ. The mappings Φ and τ are uniquely determined.
The composition ϕ′ ◦ ϕ of affine mappings ϕ : V → W, ϕ′: W → X is affine.
Affine mappings keep the geometric structure invariant. They also preserve the dimension and parallelism.
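A minimal sketch (ours) of Definition 2.26 in code: an affine mapping is a linear part followed by a translation.

import numpy as np

# Linear mapping Phi : R^2 -> R^3 given by a matrix, and a translation vector a
Phi = np.array([[1.0, 0.0],
                [0.0, 2.0],
                [1.0, 1.0]])
a = np.array([1.0, -1.0, 0.5])

def phi(x):
    # Affine mapping x -> a + Phi(x), i.e., phi = tau o Phi with translation tau
    return a + Phi @ x

print(phi(np.array([2.0, 3.0])))  # the linear part is applied first, then the shift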
2.9 Further Reading
There are many resources for learning linear algebra, including the textbooks by Strang (2003), Golan (2007), Axler (2015), and Liesen and Mehrmann (2015). There are also several online resources that we mentioned in the introduction to this chapter. We only covered Gaussian elimination here, but there are many other approaches for solving systems of linear equations, and we refer to numerical linear algebra textbooks by Stoer and Bulirsch (2002), Golub and Van Loan (2012), and Horn and Johnson (2013) for an in-depth discussion.
In this book, we distinguish between the topics of linear algebra (e.g., vectors, matrices, linear independence, basis) and topics related to the geometry of a vector space. In Chapter 3, we will introduce the inner product, which induces a norm. These concepts allow us to define angles, lengths, and distances, which we will use for orthogonal projections. Projections turn out to be key in many machine learning algorithms, such as linear regression and principal component analysis, both of which we will cover in Chapters 9 and 10, respectively.
Exercises
2.1 We consider (R\{−1}, ⋆), where
a ⋆ b := ab + a + b, a, b ∈ R\{−1} (2.134)
a. Show that (R\{−1}, ⋆) is an Abelian group.
b. Solve
3 ⋆ x ⋆ x = 15
in the Abelian group (R\{−1}, ⋆), where ⋆ is defined in (2.134).

2.2 Let n be in N\{0}. Let k, x be in Z. We define the congruence class k̄ of the integer k as the set

k̄ = {x ∈ Z | x − k = 0 (mod n)}
  = {x ∈ Z | ∃a ∈ Z : (x − k = n · a)} .

We now define Z/nZ (sometimes written Zₙ) as the set of all congruence classes modulo n. Euclidean division implies that this set is a finite set containing n elements:

Zₙ = { 0̄, 1̄, . . . , \overline{n − 1} }

For all ā, b̄ ∈ Zₙ, we define

ā ⊕ b̄ := \overline{a + b}
a. Show that (Zₙ, ⊕) is a group. Is it Abelian?
b. We now define another operation ⊗ for all ā and b̄ in Zₙ as

ā ⊗ b̄ := \overline{a × b} ,   (2.135)

where a × b represents the usual multiplication in Z.
Let n = 5. Draw the times table of the elements of Z₅\{0̄} under ⊗, i.e., calculate the products ā ⊗ b̄ for all ā and b̄ in Z₅\{0̄}.
Hence, show that Z₅\{0̄} is closed under ⊗ and possesses a neutral element for ⊗. Display the inverse of all elements in Z₅\{0̄} under ⊗. Conclude that (Z₅\{0̄}, ⊗) is an Abelian group.
c. Show that (Z₈\{0̄}, ⊗) is not a group.
d. We recall that the Bézout theorem states that two integers a and b are relatively prime (i.e., gcd(a, b) = 1) if and only if there exist two integers u and v such that au + bv = 1. Show that (Zₙ\{0̄}, ⊗) is a group if and only if n ∈ N\{0} is prime.
2.3 Consider the set G of 3 × 3 matrices defined as follows:

G = { [[1, x, z], [0, 1, y], [0, 0, 1]] ∈ R^{3×3} | x, y, z ∈ R }

We define · as the standard matrix multiplication.
Is (G, ·) a group? If yes, is it Abelian? Justify your answer.
2.4 Compute the following matrix products, if possible:
a. [[1, 2], [4, 5], [7, 8]] · [[1, 1, 0], [0, 1, 1], [1, 0, 1]]
b. [[1, 2, 3], [4, 5, 6], [7, 8, 9]] · [[1, 1, 0], [0, 1, 1], [1, 0, 1]]
c. [[1, 1, 0], [0, 1, 1], [1, 0, 1]] · [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
d. [[1, 2, 1, 2], [4, 1, −1, −4]] · [[0, 3], [1, −1], [2, 1], [5, 2]]
e. [[0, 3], [1, −1], [2, 1], [5, 2]] · [[1, 2, 1, 2], [4, 1, −1, −4]]
2.5 Find the set S of all solutions in x of the following inhomogeneous linear systems Ax = b, where A and b are defined as follows:

a. A = [[1, 1, −1, −1], [2, 5, −7, −5], [2, −1, 1, 3], [5, 2, −4, 2]] ,  b = [1, −2, 4, 6]⊤
b. A = [[1, −1, 0, 0, 1], [1, 1, 0, −3, 0], [2, −1, 0, 1, −1], [−1, 2, 0, −2, −1]] ,  b = [3, 6, 5, −1]⊤
2.6 Using Gaussian elimination, find all solutions of the inhomogeneous equation system Ax = b with

A = [[0, 1, 0, 0, 1, 0], [0, 0, 0, 1, 1, 0], [0, 1, 0, 0, 0, 1]] ,  b = [2, −1, 1]⊤ .
2.7 Find all solutions in x = [x₁, x₂, x₃]⊤ ∈ R³ of the equation system Ax = 12x, where

A = [[6, 4, 3], [6, 0, 9], [0, 8, 0]]

and Σ_{i=1}^3 xᵢ = 1.
2.8 Determine the inverses of the following matrices if possible:

a. A = [[2, 3, 4], [3, 4, 5], [4, 5, 6]]
b. A = [[1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 1], [1, 1, 1, 0]]
2.9 Which of the following sets are subspaces of R³?
a. A = {(λ, λ + µ³, λ − µ³) | λ, µ ∈ R}
b. B = {(λ², −λ², 0) | λ ∈ R}
c. Let γ be in R.
C = {(ξ1, ξ2, ξ3) ∈ R3| ξ1 − 2ξ2 + 3ξ3 = γ}
d. D = {(ξ1, ξ2, ξ3) ∈ R3| ξ2 ∈ Z}
2.10 Are the following sets of vectors linearly independent?

a. x₁ = [2, −1, 3]⊤ ,  x₂ = [1, 1, −2]⊤ ,  x₃ = [3, −3, 8]⊤
b. x₁ = [1, 2, 1, 0, 0]⊤ ,  x₂ = [1, 1, 0, 1, 1]⊤ ,  x₃ = [1, 0, 0, 1, 1]⊤
2.11 Write

y = [1, −2, 5]⊤

as a linear combination of

x₁ = [1, 1, 1]⊤ ,  x₂ = [1, 2, 3]⊤ ,  x₃ = [2, −1, 1]⊤ .
2.12 Consider two subspaces of R⁴:

U₁ = span[ [1, 1, −3, 1]⊤, [2, −1, 0, −1]⊤, [−1, 1, −1, 1]⊤ ] ,
U₂ = span[ [−1, −2, 2, 1]⊤, [2, −2, 0, 0]⊤, [−3, 6, −2, −1]⊤ ] .

Determine a basis of U₁ ∩ U₂.
2.13 Consider two subspaces U₁ and U₂, where U₁ is the solution space of the homogeneous equation system A₁x = 0 and U₂ is the solution space of the homogeneous equation system A₂x = 0 with

A₁ = [[1, 0, 1], [1, −2, −1], [2, 1, 3], [1, 0, 1]] ,  A₂ = [[3, −3, 0], [1, 2, 3], [7, −5, 2], [3, −1, 2]] .
a. Determine the dimension of U1, U2.
b. Determine bases of U1 and U2.
c. Determine a basis of U1 ∩ U2.
2.14 Consider two subspaces U₁ and U₂, where U₁ is spanned by the columns of A₁ and U₂ is spanned by the columns of A₂ with

A₁ = [[1, 0, 1], [1, −2, −1], [2, 1, 3], [1, 0, 1]] ,  A₂ = [[3, −3, 0], [1, 2, 3], [7, −5, 2], [3, −1, 2]] .
a. Determine the dimension of U1, U2
b. Determine bases of U1 and U2
c. Determine a basis of U1 ∩ U2
2.15 Let F = {(x, y, z) ∈ R³ | x + y − z = 0} and G = {(a − b, a + b, a − 3b) | a, b ∈ R}.
a. Show that F and G are subspaces of R³.
b. Calculate F ∩ G without resorting to any basis vector.
c. Find one basis for F and one for G, calculate F ∩ G using the basis vectors previously found and check your result with the previous question.

2.16 Are the following mappings linear?
a. Let a, b ∈ R.

Φ : L¹([a, b]) → R ,  f ↦ Φ(f) = ∫_a^b f(x) dx ,

where L¹([a, b]) denotes the set of integrable functions on [a, b].

b. Φ : C¹ → C⁰ ,  f ↦ Φ(f) = f′ ,

where for k ⩾ 1, Cᵏ denotes the set of k times continuously differentiable functions, and C⁰ denotes the set of continuous functions.

c. Φ : R → R ,  x ↦ Φ(x) = cos(x)

d. Φ : R³ → R² ,  x ↦ [[1, 2, 3], [1, 4, 3]] x

e. Let θ be in [0, 2π[ and

Φ : R² → R² ,  x ↦ [[cos(θ), sin(θ)], [−sin(θ), cos(θ)]] x

2.17 Consider the linear mapping
Φ : R³ → R⁴ ,

Φ([x₁, x₂, x₃]⊤) = [3x₁ + 2x₂ + x₃, x₁ + x₂ + x₃, x₁ − 3x₂, 2x₁ + 3x₂ + x₃]⊤ .

Find the transformation matrix A_Φ.
Determine rk(A_Φ).
Compute the kernel and image of Φ. What are dim(ker(Φ)) and dim(Im(Φ))?

2.18 Let E be a vector space. Let f and g be two automorphisms on E such that f ◦ g = id_E (i.e., f ◦ g is the identity mapping id_E). Show that ker(f) = ker(g ◦ f), Im(g) = Im(g ◦ f) and that ker(f) ∩ Im(g) = {0_E}.
2.19 Consider an endomorphism Φ : R³ → R³ whose transformation matrix (with respect to the standard basis in R³) is

A_Φ = [[1, 1, 0], [1, −1, 0], [1, 1, 1]] .

a. Determine ker(Φ) and Im(Φ).
b. Determine the transformation matrix Ã_Φ with respect to the basis

B = ( [1, 1, 1]⊤, [1, 2, 1]⊤, [1, 0, 0]⊤ ) ,

i.e., perform a basis change toward the new basis B.
2.20 Let us consider b₁, b₂, b′₁, b′₂, four vectors of R² expressed in the standard basis of R² as

b₁ = [2, 1]⊤ ,  b₂ = [−1, −1]⊤ ,  b′₁ = [2, −2]⊤ ,  b′₂ = [1, 1]⊤

and let us define two ordered bases B = (b₁, b₂) and B′ = (b′₁, b′₂) of R².
a. Show that B and B′ are two bases of R² and draw those basis vectors.
b. Compute the matrix P₁ that performs a basis change from B′ to B.
c. We consider c₁, c₂, c₃, three vectors of R³ defined in the standard basis of R³ as

c₁ = [1, 2, −1]⊤ ,  c₂ = [0, −1, 2]⊤ ,  c₃ = [1, 0, −1]⊤

and we define C = (c₁, c₂, c₃).
(i) Show that C is a basis of R³, e.g., by using determinants (see Section 4.1).
(ii) Let us call C′ = (c′₁, c′₂, c′₃) the standard basis of R³. Determine the matrix P₂ that performs the basis change from C to C′.
d. We consider a homomorphism Φ : R² → R³, such that

Φ(b₁ + b₂) = c₂ + c₃
Φ(b₁ − b₂) = 2c₁ − c₂ + 3c₃ ,

where B = (b₁, b₂) and C = (c₁, c₂, c₃) are ordered bases of R² and R³, respectively.
Determine the transformation matrix A_Φ of Φ with respect to the ordered bases B and C.
e. Determine A′, the transformation matrix of Φ with respect to the bases B′ and C′.
f. Let us consider the vector x ∈ R² whose coordinates in B′ are [2, 3]⊤. In other words, x = 2b′₁ + 3b′₂.
(i) Calculate the coordinates of x in B.
(ii) Based on that, compute the coordinates of Φ(x) expressed in C.
(iii) Then, write Φ(x) in terms of c′₁, c′₂, c′₃.
(iv) Use the representation of x in B′ and the matrix A′ to find this result directly.
3 Analytic Geometry

In Chapter 2, we studied vectors, vector spaces, and linear mappings at a general but abstract level. In this chapter, we will add some geometric interpretation and intuition to all of these concepts. In particular, we will look at geometric vectors and compute their lengths and distances or angles between two vectors. To be able to do this, we equip the vector space with an inner product that induces the geometry of the vector space. Inner products and their corresponding norms and metrics capture the intuitive notions of similarity and distances, which we use to develop the support vector machine in Chapter 12. We will then use the concepts of lengths and angles between vectors to discuss orthogonal projections, which will play a central role when we discuss principal component analysis in Chapter 10 and regression via maximum likelihood estimation in Chapter 9. Figure 3.1 gives an overview of how concepts in this chapter are related and how they are connected to other chapters of the book.

Figure 3.1 A mind map of the concepts introduced in this chapter, along with when they are used in other parts of the book: the inner product induces norms, lengths, angles, rotations, and orthogonal projections, which feed into matrix decompositions (Chapter 4), regression (Chapter 9), dimensionality reduction (Chapter 10), and classification (Chapter 12).
3.1 Norms

Figure 3.3 For different norms, the red lines indicate the set of vectors with norm 1. Left: Manhattan norm; right: Euclidean distance.
When we think of geometric vectors, i.e., directed line segments that start at the origin, then intuitively the length of a vector is the distance of the “end” of this directed line segment from the origin. In the following, we will discuss the notion of the length of vectors using the concept of a norm.
Definition 3.1 (Norm). A norm on a vector space V is a function

∥ · ∥ : V → R ,   (3.1)
x ↦ ∥x∥ ,   (3.2)

which assigns each vector x its length ∥x∥ ∈ R, such that for all λ ∈ R and x, y ∈ V the following hold:

Absolutely homogeneous: ∥λx∥ = |λ| ∥x∥
Triangle inequality: ∥x + y∥ ⩽ ∥x∥ + ∥y∥
Positive definite: ∥x∥ ⩾ 0 and ∥x∥ = 0 ⟺ x = 0
In geometric terms, the triangle inequality states that for any triangle, the sum of the lengths of any two sides must be greater than or equal to the length of the remaining side; see Figure 3.2 for an illustration. Definition 3.1 is in terms of a general vector space V (Section 2.4), but in this book we will only consider a finite-dimensional vector space Rⁿ. Recall that for a vector x ∈ Rⁿ we denote the elements of the vector using a subscript, that is, xᵢ is the ith element of the vector x.

Figure 3.2 Triangle inequality: c ⩽ a + b.

Example 3.1 (Manhattan Norm)
The Manhattan norm on Rⁿ is defined for x ∈ Rⁿ as

∥x∥₁ := Σ_{i=1}^n |xᵢ| ,   (3.3)

where | · | is the absolute value. The left panel of Figure 3.3 shows all vectors x ∈ R² with ∥x∥₁ = 1. The Manhattan norm is also called ℓ₁ norm.
Example 3.2 (Euclidean Norm)
The Euclidean norm of x ∈ Rⁿ is defined as

∥x∥₂ := √( Σ_{i=1}^n xᵢ² ) = √(x⊤x)   (3.4)

and computes the Euclidean distance of x from the origin. The right panel of Figure 3.3 shows all vectors x ∈ R² with ∥x∥₂ = 1. The Euclidean norm is also called ℓ₂ norm.
Remark. Throughout this book, we will use the Euclidean norm (3.4) by default if not stated otherwise. ♢
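Both norms are available directly in NumPy; a quick sketch (ours, not part of the text):

import numpy as np

x = np.array([1.0, -2.0, 2.0])

# Manhattan (l1) norm, (3.3): sum of absolute values
print(np.linalg.norm(x, ord=1))  # 5.0

# Euclidean (l2) norm, (3.4): sqrt(x^T x); NumPy's default
print(np.linalg.norm(x))         # 3.0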
3.2 Inner Products
Inner products allow for the introduction of intuitive geometrical con cepts, such as the length of a vector and the angle or distance between two vectors. A major purpose of inner products is to determine whether vectors are orthogonal to each other.
3.2.1 Dot Product
We may already be familiar with a particular type of inner product, the scalar product/dot product in Rⁿ, which is given by

x⊤y = Σ_{i=1}^n xᵢ yᵢ .   (3.5)

We will refer to this particular inner product as the dot product in this book. However, inner products are more general concepts with specific properties, which we will now introduce.
3.2.2 General Inner Products
Recall the linear mapping from Section 2.7, where we can rearrange the mapping with respect to addition and multiplication with a scalar. A bilinear mapping Ω is a mapping with two arguments, and it is linear in each argument, i.e., when we look at a vector space V then it holds for all x, y, z ∈ V, λ, ψ ∈ R that

Ω(λx + ψy, z) = λΩ(x, z) + ψΩ(y, z) ,   (3.6)
Ω(x, λy + ψz) = λΩ(x, y) + ψΩ(x, z) .   (3.7)

Here, (3.6) asserts that Ω is linear in the first argument, and (3.7) asserts that Ω is linear in the second argument (see also (2.87)).
Definition 3.2. Let V be a vector space and Ω : V × V → R be a bilinear mapping that takes two vectors and maps them onto a real number. Then

Ω is called symmetric if Ω(x, y) = Ω(y, x) for all x, y ∈ V, i.e., the order of the arguments does not matter.
Ω is called positive definite if

∀x ∈ V \{0} : Ω(x, x) > 0 ,  Ω(0, 0) = 0 .   (3.8)

Definition 3.3. Let V be a vector space and Ω : V × V → R be a bilinear mapping that takes two vectors and maps them onto a real number. Then

A positive definite, symmetric bilinear mapping Ω : V × V → R is called an inner product on V. We typically write ⟨x, y⟩ instead of Ω(x, y).
The pair (V, ⟨·, ·⟩) is called an inner product space or (real) vector space with inner product. If we use the dot product defined in (3.5), we call (V, ⟨·, ·⟩) a Euclidean vector space.

We will refer to these spaces as inner product spaces in this book.
Example 3.3 (Inner Product That Is Not the Dot Product)
Consider V = R². If we define

⟨x, y⟩ := x₁y₁ − (x₁y₂ + x₂y₁) + 2x₂y₂ ,   (3.9)

then ⟨·, ·⟩ is an inner product but different from the dot product. The proof will be an exercise.
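To get a feel for (3.9), the following sketch (ours, not from the text) spot-checks symmetry and positive definiteness numerically; note that ⟨x, x⟩ = (x₁ − x₂)² + x₂² > 0 for x ≠ 0.

import numpy as np

def inner(x, y):
    # The inner product from (3.9): <x, y> = x1*y1 - (x1*y2 + x2*y1) + 2*x2*y2
    return x[0]*y[0] - (x[0]*y[1] + x[1]*y[0]) + 2*x[1]*y[1]

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    assert np.isclose(inner(x, y), inner(y, x))  # symmetric
    assert inner(x, x) > 0                       # positive definite (x != 0 a.s.)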
3.2.3 Symmetric, Positive Definite Matrices
Symmetric, positive definite matrices play an important role in machine learning, and they are defined via the inner product. In Section 4.3, we will return to symmetric, positive definite matrices in the context of matrix decompositions. The idea of symmetric positive semidefinite matrices is key in the definition of kernels (Section 12.4).
Consider an n-dimensional vector space V with an inner product ⟨·, ·⟩ : V × V → R (see Definition 3.3) and an ordered basis B = (b₁, . . . , bₙ) of V. Recall from Section 2.6.1 that any vectors x, y ∈ V can be written as linear combinations of the basis vectors so that x = Σ_{i=1}^n ψᵢbᵢ ∈ V and y = Σ_{j=1}^n λⱼbⱼ ∈ V for suitable ψᵢ, λⱼ ∈ R. Due to the bilinearity of the inner product, it holds for all x, y ∈ V that

⟨x, y⟩ = ⟨ Σ_{i=1}^n ψᵢbᵢ , Σ_{j=1}^n λⱼbⱼ ⟩ = Σ_{i=1}^n Σ_{j=1}^n ψᵢ ⟨bᵢ, bⱼ⟩ λⱼ = x̂⊤Aŷ ,   (3.10)

where Aᵢⱼ := ⟨bᵢ, bⱼ⟩ and x̂, ŷ are the coordinates of x and y with respect to the basis B. This implies that the inner product ⟨·, ·⟩ is uniquely determined through A. The symmetry of the inner product also means that A
is symmetric. Furthermore, the positive definiteness of the inner product implies that
∀x ∈ V \{0} : x⊤Ax > 0 . (3.11)
Definition 3.4 (Symmetric, Positive Definite Matrix). A symmetric matrix A ∈ R^{n×n} that satisfies (3.11) is called symmetric, positive definite, or just positive definite. If only ⩾ holds in (3.11), then A is called symmetric, positive semidefinite.
Example 3.4 (Symmetric, Positive Definite Matrices)
Consider the matrices

A₁ = [[9, 6], [6, 5]] ,  A₂ = [[9, 6], [6, 3]] .   (3.12)

A₁ is positive definite because it is symmetric and

x⊤A₁x = [x₁, x₂] [[9, 6], [6, 5]] [x₁, x₂]⊤   (3.13a)
= 9x₁² + 12x₁x₂ + 5x₂² = (3x₁ + 2x₂)² + x₂² > 0   (3.13b)

for all x ∈ V \{0}. In contrast, A₂ is symmetric but not positive definite because x⊤A₂x = 9x₁² + 12x₁x₂ + 3x₂² = (3x₁ + 2x₂)² − x₂² can be less than 0, e.g., for x = [2, −3]⊤.
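Numerically, a common way to test positive definiteness is to check that all eigenvalues of the symmetric matrix are strictly positive; a sketch (ours, not from the text) for A₁ and A₂ from (3.12):

import numpy as np

A1 = np.array([[9.0, 6.0], [6.0, 5.0]])
A2 = np.array([[9.0, 6.0], [6.0, 3.0]])

# A symmetric matrix is positive definite iff all eigenvalues are > 0
print(np.linalg.eigvalsh(A1))  # both eigenvalues positive -> positive definite
print(np.linalg.eigvalsh(A2))  # one negative eigenvalue   -> not positive definite

# The witness from the example: x^T A2 x < 0 for x = [2, -3]^T
x = np.array([2.0, -3.0])
print(x @ A2 @ x)  # -9.0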
If A ∈ R^{n×n} is symmetric, positive definite, then

⟨x, y⟩ = x̂⊤Aŷ   (3.14)

defines an inner product with respect to an ordered basis B, where x̂ and ŷ are the coordinate representations of x, y ∈ V with respect to B.
Theorem 3.5. For a real-valued, finite-dimensional vector space V and an ordered basis B of V, it holds that ⟨·, ·⟩ : V × V → R is an inner product if and only if there exists a symmetric, positive definite matrix A ∈ R^{n×n} with

⟨x, y⟩ = x̂⊤Aŷ .   (3.15)
The following properties hold if A ∈ R^{n×n} is symmetric and positive definite:

The null space (kernel) of A consists only of 0 because x⊤Ax > 0 for all x ≠ 0. This implies that Ax ≠ 0 if x ≠ 0.
The diagonal elements aᵢᵢ of A are positive because aᵢᵢ = eᵢ⊤Aeᵢ > 0, where eᵢ is the ith vector of the standard basis in Rⁿ.
Machine Learning for Humans

Vishal Maini
Samer Sabri
Table of Contents
Part 1: Introduction. The big picture of artificial intelligence and machine learning—past, present, and future.
Part 2.1: Supervised Learning. Learning with an answer key. Introducing linear regression, loss functions, overfitting, and gradient descent.
Part 2.2: Supervised Learning II. Two methods of classification: logistic regression and support vector machines (SVMs).
Part 2.3: Supervised Learning III. Non-parametric learners: k-nearest neighbors, decision trees, random forests. Introducing cross-validation, hyperparameter tuning, and ensemble models.
Part 3: Unsupervised Learning. Clustering: k-means, hierarchical. Dimensionality reduction: principal components analysis (PCA), singular value decomposition (SVD).
Part 4: Neural Networks & Deep Learning. Why, where, and how deep learning works. Drawing inspiration from the brain. Convolutional neural networks (CNNs), recurrent neural networks (RNNs). Real-world applications.
Part 5: Reinforcement Learning. Exploration and exploitation. Markov decision processes. Q-learning, policy learning, and deep reinforcement learning. The value learning problem.
Appendix: The Best Machine Learning Resources. A curated list of resources for creating your machine learning curriculum.
Published August 19, 2017. Edited by Sachin Maini.
For inquiries regarding this book, please contact [email protected]
Simple, plain-English explanations accompanied by math, code, and real-world examples
Part 1: Introduction.
Who should read this?
Technical people who want to get up to speed on machine learning quickly
Non-technical people who want a primer on machine learning and are willing to engage with technical concepts
Anyone who is curious about how machines think
This guide is intended to be accessible to anyone. Basic concepts in probability, statistics, programming, linear algebra, and calculus will be discussed, but it isn’t necessary to have prior knowledge of them to gain value from this series.
If you're more interested in figuring out which courses to take, textbooks to read, projects to attempt, etc., take a look at our top picks in the Appendix: The Best Machine Learning Resources.
Why machine learning matters
Artificial intelligence will shape our future more powerfully than any other innovation this century. Anyone who does not understand it will soon find themselves feeling left behind, waking up in a world full of technology that feels more and more like magic.
The rate of acceleration is already astounding. After a couple of AI winters and periods of false hope over the past four decades, rapid advances in data storage and computer processing power have dramatically changed the game in recent years.
In 2015, Google trained a conversational agent (AI) that could not only convincingly interact with humans as a tech support helpdesk, but also discuss morality, express opinions, and answer general facts-based questions.
(Vinyals & Le, 2017)
The same year, DeepMind developed an agent that surpassed human-level performance at 49 Atari games, receiving only the pixels and game score as inputs. Soon after, in 2016, DeepMind obsoleted their own achievement by releasing a new state-of-the-art gameplay method called A3C.
Meanwhile, AlphaGo defeated one of the best human players at Go—an extraordinary achievement in a game dominated by humans for two decades after machines first conquered chess. Many masters could not fathom how it would be possible for a machine to grasp the full nuance and complexity of this ancient Chinese war strategy game, with its 10^170 possible board positions (there are only 10^80 atoms in the universe).
Professional Go player Lee Sedol reviewing his match with AlphaGo after defeat.
Photo via The Atlantic.
In March 2017, OpenAI created agents that invented their own language to cooperate and more effectively achieve their goal. Soon after, Facebook reportedly succeeded in training agents to negotiate and even lie.
Just a few days ago (as of this writing), on August 11, 2017, OpenAI reached yet another incredible milestone by defeating the world’s top professionals in 1v1 matches of the online multiplayer game Dota 2.
See the full match at The International 2017, with Dendi (human) vs. OpenAI (bot), on YouTube.
Much of our day-to-day technology is powered by artificial intelligence. Point your camera at the menu during your next trip to Taiwan and the restaurant’s selections will magically appear in English via the Google Translate app.
Google Translate overlaying English translations on a drink menu in real time using convolutional neural networks.
Today AI is used to design evidence-based treatment plans for cancer patients, instantly analyze results from medical tests to escalate to the appropriate specialist immediately, and conduct scientific research for drug discovery.
A bold proclamation by London-based BenevolentAI (screenshot from About Us page, August 2017).
Law enforcement uses visual recognition and natural language processing to process footage from body cameras. The Mars rover Curiosity even utilizes AI to autonomously select inspection-worthy soil and rock samples with high accuracy.
In everyday life, it’s increasingly commonplace to discover machines in roles traditionally occupied by humans. Really, don’t be surprised if a little housekeeping delivery bot shows up instead of a human next time you call the hotel desk to send up some toothpaste.
In this series, we’ll explore the core machine learning concepts behind these technologies. By the end, you should be able to describe how they work at a conceptual level and be equipped with the tools to start building similar applications yourself.
The semantic tree: artificial intelligence and machine learning
One bit of advice: it is important to view knowledge as sort of a semantic tree—make sure you understand the fundamental principles, ie the trunk and big branches, before you get into the leaves/details or there is nothing for them to hang on to. —Elon Musk, Reddit AMA
Machine learning is one of many subfields of artificial intelligence, concerning the ways that computers learn from experience to improve their ability to think, plan, decide, and act.
Artificial intelligence is the study of agents that perceive the world around them, form plans, and make decisions to achieve their goals. Its foundations include mathematics, logic, philosophy, probability, linguistics, neuroscience, and decision theory. Many fields fall under the umbrella of AI, such as computer vision, robotics, machine learning, and natural language processing.
Machine learning is a subfield of artificial intelligence. Its goal is to enable computers to learn on their own. A machine’s learning algorithm enables it to identify patterns in observed data, build models that explain the world, and predict things without having explicit pre-programmed rules and models.
The AI effect: what actually qualifies as “artificial intelligence”?
The exact standard for technology that qualifies as “AI” is a bit fuzzy, and interpretations change over time. The AI label tends to describe machines doing tasks traditionally in the domain of humans. Interestingly, once computers figure out how to do one of these tasks, humans have a tendency to say it wasn’t really intelligence. This is known as the AI effect.
For example, when IBM’s Deep Blue defeated world chess champion Garry Kasparov in 1997, people complained that it was using "brute force" methods and it wasn’t “real” intelligence at all. As Pamela McCorduck wrote, “It’s part of the history of the field of artificial intelligence that every time somebody figured out how to make a computer do something — play good checkers, solve simple but relatively informal problems — there was chorus of critics to say, ‘that’s not thinking’”(McCorduck, 2004).
Perhaps there is a certain je ne sais quoi inherent to what people will reliably accept as “artificial intelligence”:
"AI is whatever hasn't been done yet." - Douglas Hofstadter
So does a calculator count as AI? Maybe by some interpretation. What about a self-driving car? Today, yes. In the future, perhaps not. Your cool new chatbot startup that automates a flow chart? Sure… why not.
Strong AI will change our world forever; to understand how, studying machine learning is a good place to start
The technologies discussed above are examples of artificial narrow intelligence (ANI), which can effectively perform a narrowly defined task.
Meanwhile, we’re continuing to make foundational advances towards human-level artificial general intelligence (AGI), also known as strong AI. The definition of an AGI is an artificial intelligence that can successfully perform any intellectual task that a human being can, including learning, planning and decision-making under uncertainty, communicating in natural language, making jokes, manipulating people, trading stocks, or… reprogramming itself.
And this last one is a big deal. If we create an AI that can improve itself, it would unlock a cycle of recursive self-improvement that could lead to an intelligence explosion over some unknown time period, ranging from many decades to a single day.
Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines; there would then unquestionably be an 'intelligence explosion,' and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make, provided that the machine is docile enough to tell us how to keep it under control. —I.J. Good, 1965
You may have heard this point referred to as the singularity. The term is borrowed from the gravitational singularity that occurs at the center of a black hole, an infinitely dense one-dimensional point where the laws of physics as we understand them start to break down.
We have zero visibility into what happens beyond the event horizon of a black hole because no light can escape. Similarly, after we unlock AI’s ability to recursively improve itself, it’s impossible to predict what will happen, just as mice who intentionally designed a human might have trouble predicting what the human would do to their world. Would it keep helping them get more cheese, as they originally intended? (Image via WIRED)
A recent report by the Future of Humanity Institute surveyed a panel of AI researchers on timelines for AGI, and found that “researchers believe there is a 50% chance of AI outperforming humans in all tasks in 45 years” (Grace et al, 2017). We’ve personally spoken with a number of sane and reasonable AI practitioners who predict much longer timelines (the upper limit being “never”), and others whose timelines are alarmingly short—as little as a few years.
Image from Kurzweil’s The Singularity Is Near, published in 2005. Now, in 2017, only a couple of these posters could justifiably remain on the wall.
The advent of greater-than-human-level artificial superintelligence (ASI) could be one of the best or worst things to happen to our species. It carries with it the immense challenge of specifying what AIs will want in a way that is friendly to humans.
While it’s impossible to say what the future holds, one thing is certain: 2017 is a good time to start understanding how machines think. To go beyond the abstractions of armchair philosophy and intelligently shape our roadmaps and policies with respect to AI, we must engage with the details of how machines see the world—what they “want”, their potential biases and failure modes, their temperamental quirks—just as we study psychology and neuroscience to understand how humans learn, decide, act, and feel.
There are complex, high-stakes questions about AI that will require our careful attention in the coming years.
How can we combat AI’s propensity to further entrench systemic biases evident in existing data sets? What should we make of fundamental disagreements among the world’s most powerful technologists about the potential risks and benefits of artificial intelligence? What are the most promising technical approaches to teaching AI systems to behave themselves? What will happen to humans' sense of purpose in a world without work?
Machine learning is at the core of our journey towards artificial general intelligence, and in the meantime, it will change every industry and have a massive impact on our day-to-day lives. That’s why we believe it’s worth understanding machine learning, at least at a conceptual level—and we designed this series to be the best place to start.
How to read this series
You don’t necessarily need to read the series cover-to-cover to get value out of it. Here are three suggestions on how to approach it, depending on your interests and how much time you have:
1. T-shaped approach. Read from beginning to end. Summarize each section in your own words as you go (see: Feynman technique); this encourages active reading & stronger retention. Go deeper into areas that are most relevant to your interests or work. We’ll include resources for further exploration at the end of each section.
2. Focused approach. Jump straight to the sections you’re most curious about and focus your mental energy there.
3. 80/20 approach. Skim everything in one go, make a few notes on interesting high-level concepts, and call it a night.
About the authors
“Ok, we have to be done with gradient descent by the time we finish this ale.”
@ The Boozy Cow in Edinburgh
Vishal most recently led growth at Upstart, a lending platform that utilizes machine learning to price credit, automate the borrowing process, and acquire users. He spends his time thinking about startups, applied cognitive science, moral philosophy, and the ethics of artificial intelligence. (Contact: [email protected])
Samer is a Master’s student in Computer Science and Engineering at UCSD and co founder of Conigo Labs. Prior to grad school, he founded TableScribe, a business intelligence tool for SMBs, and spent two years advising Fortune 100 companies at McKinsey. Samer previously studied Computer Science and Ethics, Politics, and Economics at Yale. (Contact: [email protected])
Most of this series was written during a 10-day trip to the United Kingdom in a frantic blur of trains, planes, cafes, pubs and wherever else we could find a dry place to sit. Our aim was to solidify our own understanding of artificial intelligence, machine learning, and how the methods therein fit together—and hopefully create something worth sharing in the process.
And now, without further ado, let’s dive into machine learning with Part 2.1: Supervised Learning!
Part 2.1: Supervised Learning
The two tasks of supervised learning: regression and classification. Linear regression, loss functions, and gradient descent.
How much money will we make by spending more dollars on digital advertising? Will this loan applicant pay back the loan or not? What’s going to happen to the stock market tomorrow?
In supervised learning problems, we start with a data set containing training examples with associated correct labels. For example, when learning to classify handwritten digits, a supervised learning algorithm takes thousands of pictures of handwritten digits along with labels containing the correct number each image represents. The algorithm will then learn the relationship between the images and their associated numbers, and apply that learned relationship to classify completely new images (without labels) that the machine hasn’t seen before. This is how you’re able to deposit a check by taking a picture with your phone!
To illustrate how supervised learning works, let’s examine the problem of predicting annual income based on the number of years of higher education someone has completed. Expressed more formally, we’d like to build a model that approximates the relationship f between the number of years of higher education X and corresponding annual income Y.
X (input) = years of higher education
Y (output) = annual income
f = function describing the relationship between X and Y
ε (epsilon) = random error term (positive or negative) with mean zero
Regarding epsilon:
(1) ε represents irreducible error in the model, which is a theoretical limit around the performance of your algorithm due to inherent noise in the phenomena you are trying to explain. For example, imagine building a model to predict the outcome of a coin flip.
(2) Incidentally, mathematician Paul Erdös referred to children as "epsilons" because in calculus (but not in stats!) ε denotes an arbitrarily small positive quantity. Fitting, no?

One method for predicting income would be to create a rigid rules-based model for how income and education are related. For example: "I'd estimate that for every additional year of higher education, annual income increases by $5,000."
income = ($5,000 * years_of_education) + baseline_income
This approach is an example of engineering a solution (vs. learning a solution, as with the linear regression method described below).
You could come up with a more complex model by including some rules about degree type, years of work experience, school tiers, etc. For example: “If they completed a Bachelor’s degree or higher, give the income estimate a 1.5x multiplier.”
But this kind of explicit rules-based programming doesn’t work well with complex data. Imagine trying to design an image classification algorithm made of if-then statements describing the combinations of pixel brightnesses that should be labeled “cat” or “not cat”.
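For concreteness, the rules-based approach might look like the sketch below (our toy code, not from the original; the baseline income is a made-up number, while the $5,000 increment and 1.5x multiplier are the hypothetical rules from the text):

def estimate_income(years_of_education, has_bachelors, baseline_income=30000):
    # Hand-engineered rules, not learned from data
    income = baseline_income + 5000 * years_of_education
    if has_bachelors:
        income *= 1.5  # hypothetical degree multiplier
    return income

print(estimate_income(4, has_bachelors=True))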
Supervised machine learning solves this problem by getting the computer to do the work for you. By identifying patterns in the data, the machine is able to form heuristics. The primary difference between this and human learning is that machine learning runs on computer hardware and is best understood through the lens of computer science and statistics, whereas human pattern-matching happens in a biological brain (while accomplishing the same goals).
In supervised learning, the machine attempts to learn the relationship between income and education from scratch, by running labeled training data through a learning algorithm. This learned function can be used to estimate the income of people whose income Y is unknown, as long as we have years of education X as inputs. In other words, we can apply our model to the unlabeled test data to estimate Y.
The goal of supervised learning is to predict Y as accurately as possible when given new examples where X is known and Y is unknown. In what follows we’ll explore several of the most common approaches to doing so.
The two tasks of supervised learning: regression and classification
Regression:
Predict a continuous numerical value. How much will that house sell for?
Classification:
Assign a label. Is this a picture of a cat or a dog?
The rest of this section will focus on regression. In Part 2.2 we’ll dive deeper into classification methods.
Regression: predicting a continuous value
Regression predicts a continuous target variable Y. It allows you to estimate a value, such as housing prices or human lifespan, based on input data X.
Here, target variable means the unknown variable we care about predicting, and continuous means there aren’t gaps (discontinuities) in the value that Y can take on.
A person’s weight and height are continuous values. Discrete variables, on the other hand, can only take on a finite number of values—for example, the number of kids somebody has is a discrete variable.
Predicting income is a classic regression problem. Your input data X includes all relevant information about individuals in the data set that can be used to predict income, such as years of education, years of work experience, job title, or zip code. These attributes are called features, which can be numerical (e.g. years of work experience) or categorical (e.g. job title or field of study).
You’ll want as many training observations as possible relating these features to the target output Y, so that your model can learn the relationship f between X and Y.
The data is split into a training data set and a test data set. The training set has labels, so your model can learn from these labeled examples. The test set does not have labels, i.e. you don’t yet know the value you’re trying to predict. It’s important that your model can generalize to situations it hasn’t encountered before so that it can perform well on the test data.
Regression
Y = f(X) + ε, where X = (x1, x2…xn)
Training: machine learns f from labeled training data
Test: machine predicts Y from unlabeled testing data
Note that X can be a tensor with any number of dimensions. A 1D tensor is a vector (1 row, many columns), a 2D tensor is a matrix (many rows, many columns), and then you can have tensors with 3, 4, 5 or more dimensions (e.g. a 3D tensor with rows, columns, and depth). For a review of these terms, see the first few pages of this linear algebra review.
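In NumPy terms (a quick sketch of ours):

import numpy as np

vector = np.array([1, 2, 3])                # 1D tensor
matrix = np.array([[1, 2], [3, 4]])         # 2D tensor
cube = np.zeros((2, 3, 4))                  # 3D tensor: rows, columns, depth
print(vector.ndim, matrix.ndim, cube.ndim)  # 1 2 3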
In our trivially simple 2D example, this could take the form of a .csv file where each row contains a person’s education level and income. Add more columns with more features and you’ll have a more complex, but possibly more accurate, model.
So how do we solve these problems?
How do we build models that make accurate, useful predictions in the real world? We do so by using supervised learning algorithms.
Now let’s get to the fun part: getting to know the algorithms. We’ll explore some of the ways to approach regression and classification and illustrate key machine learning concepts throughout.
Linear regression (ordinary least squares)
“Draw the line. Yes, this counts as machine learning.”
First, we’ll focus on solving the income prediction problem with linear regression, since linear models don’t work well with image recognition tasks (this is the domain of deep learning, which we’ll explore later).
We have our data set X, and corresponding target values Y. The goal of ordinary least squares (OLS) regression is to learn a linear model that we can use to predict a new y given a previously unseen x with as little error as possible. We want to guess how much income someone earns based on how many years of education they received.
X_train = [4, 5, 0, 2, …, 6]  # years of post-secondary education
Y_train = [80, 91.5, 42, 55, …, 100]  # corresponding annual incomes, in thousands of dollars
Linear regression is a parametric method, which means it makes an assumption about the form of the function relating X and Y (we'll cover examples of non-parametric methods later). Our model will be a function that predicts ŷ given a specific x:

ŷ = β0 + β1x
In this case, we make the explicit assumption that there is a linear relationship between X and Y—that is, for each one-unit increase in X, we see a constant increase (or decrease) in Y.
β0 is the y-intercept and β1 is the slope of our line, i.e. how much income increases (or decreases) with one additional year of education.
Our goal is to learn the model parameters (in this case, β0 and β1) that minimize error in the model’s predictions.
To find the best parameters:
1. Define a cost function, or loss function, that measures how inaccurate our model's predictions are.
2. Find the parameters that minimize loss, i.e. make our model as accurate as possible.
Graphically, in two dimensions, this results in a line of best fit. In three dimensions, we would draw a plane, and so on with higher-dimensional hyperplanes.
A note on dimensionality: our example is two-dimensional for simplicity, but you'll typically have more features (x's) and coefficients (betas) in your model, e.g. when adding more relevant variables to improve the accuracy of your model predictions. The same principles generalize to higher dimensions, though things get much harder to visualize beyond three dimensions.
Mathematically, we look at the difference between each real data point (y) and our model's prediction (ŷ). Square these differences to avoid negative numbers and penalize larger differences, and then add them up and take the average. This is a measure of how well our data fits the line. In symbols:

Cost = (1/(2n)) * Σ (ŷᵢ − yᵢ)²

n = # of observations. Using 2*n instead of n makes the math work out more cleanly when taking the derivative to minimize loss, though some stats people say this is blasphemy. When you start having opinions on this kind of stuff, you'll know you are all the way in the rabbit hole.
For a simple problem like this, we can compute a closed form solution using calculus to find the optimal beta parameters that minimize our loss function. But as a cost function grows in complexity, finding a closed form solution with calculus is no longer feasible. This is the motivation for an iterative approach called gradient descent, which allows us to minimize a complex loss function.
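As a sketch (ours, using only the visible values from the toy data above; the elided observations are simply omitted), here is the loss and a closed-form least-squares fit in NumPy:

import numpy as np

X = np.array([4.0, 5.0, 0.0, 2.0, 6.0])        # years of post-secondary education
Y = np.array([80.0, 91.5, 42.0, 55.0, 100.0])  # income, in thousands of dollars

def loss(beta0, beta1):
    # Mean squared error with the 1/(2n) convention described above
    y_hat = beta0 + beta1 * X
    return np.sum((Y - y_hat) ** 2) / (2 * len(X))

# Closed-form ordinary least squares: fit a degree-1 polynomial to the data
beta1, beta0 = np.polyfit(X, Y, deg=1)
print(beta0, beta1, loss(beta0, beta1))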
Gradient descent: learn the parameters
“Put on a blindfold, take a step downhill. You’ve found the bottom when you have nowhere to go but up.”
Gradient descent will come up over and over again, especially in neural networks. Machine learning libraries like scikit-learn and TensorFlow use it in the background everywhere, so it’s worth understanding the details.
The goal of gradient descent is to find the minimum of our model’s loss function by iteratively getting a better and better approximation of it.
Imagine yourself walking through a valley with a blindfold on. Your goal is to find the bottom of the valley. How would you do it?
A reasonable approach would be to touch the ground around you and move in whichever direction the ground is sloping down most steeply. Take a step and repeat the same process continually until the ground is flat. Then you know you’ve reached the bottom of a valley; if you move in any direction from where you are, you’ll end up at the same elevation or further uphill.
Going back to mathematics, the ground becomes our loss function, and the elevation at the bottom of the valley is the minimum of that function.
Let’s take a look at the loss function we saw in regression:
We see that this is really a function of two variables: β0 and β1. All the rest of the variables are determined, since X, Y, and n are given during training. We want to try to minimize this function.
The loss is a function z = f(β0, β1). To begin gradient descent, you make an initial guess for the parameters β0 and β1, and then improve that guess iteratively.
Next, you find the partial derivatives of the loss function with respect to each beta parameter: [dz/dβ0, dz/dβ1]. A partial derivative indicates how much total loss is increased or decreased if you increase β0 or β1 by a very small amount.
Put another way, how much would increasing your estimate of annual income assuming zero higher education (β0) increase the loss (i.e. inaccuracy) of your model? You want to go in the opposite direction so that you end up walking downhill and minimizing loss.
Similarly, if you increase your estimate of how much each incremental year of education affects income (β1), how much does this increase loss (z)? If the partial derivative dz/dβ1 is a negative number, then increasing β1 is good because it will reduce total loss. If it’s a positive number, you want to decrease β1. If it’s zero, don’t change β1 because it means you’ve reached an optimum.
Keep doing that until you reach the bottom, i.e. until the algorithm has converged and loss is minimized. There are lots of tricks and exceptional cases beyond the scope of this series, but generally, this is how you find the optimal parameters for your parametric model.
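To make that loop concrete, here’s a minimal gradient descent sketch in Python for the two-parameter loss above. The learning rate and step count are arbitrary illustrative choices:

import numpy as np

def gradient_descent(x, y, lr=0.01, steps=5000):
    """Minimize Cost(b0, b1) = (1/(2n)) * sum((y - (b0 + b1*x))**2)."""
    b0, b1 = 0.0, 0.0  # initial guess for the parameters
    n = len(x)
    for _ in range(steps):
        y_hat = b0 + b1 * x
        d_b0 = -(y - y_hat).sum() / n        # partial derivative dz/db0
        d_b1 = -((y - y_hat) * x).sum() / n  # partial derivative dz/db1
        b0 -= lr * d_b0  # step in the opposite direction of the gradient
        b1 -= lr * d_b1  # (i.e., walk downhill)
    return b0, b1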
Overfitting
Overfitting: “Sherlock, your explanation of what just happened is too specific to the situation.” Regularization: “Don’t overcomplicate things, Sherlock. I’ll punch you for every extra word.” Hyperparameter (λ): “Here’s the strength with which I will punch you for every extra word.”
A common problem in machine learning is overfitting: learning a function that perfectly explains the training data that the model learned from, but doesn’t generalize well to unseen test data. Overfitting happens when a model overlearns from the training data to the point that it starts picking up idiosyncrasies that aren’t representative of patterns in the real world. This becomes especially problematic as you make your model increasingly complex. Underfitting is a related issue where your model is not complex enough to capture the underlying trend in the data.
Bias-Variance Tradeoff
Bias is the amount of error introduced by approximating real-world phenomena with a simplified model.
Variance is how much your model's test error changes based on variation in the training data. It reflects the model's sensitivity to the idiosyncrasies of the data set it was trained on.
As a model increases in complexity and becomes more wiggly (flexible), its bias decreases (it does a good job of explaining the training data), but variance increases (it doesn't generalize as well). Ultimately, in order to have a good model, you need one with low bias and low variance.
Source: Coursera’s ML course, taught by Andrew Ng
Remember that the only thing we care about is how the model performs on test data. You want to predict which emails will be marked as spam before they’re marked, not just build a model that is 100% accurate at reclassifying the emails it used to build itself in the first place. Hindsight is 20/20—the real question is whether the lessons learned will help in the future.
The model on the right has zero loss for the training data because it perfectly fits every data point. But the lesson doesn’t generalize. It would do a horrible job at explaining a new data point that isn’t yet on the line.
Two ways to combat overfitting:
1. Use more training data. The more you have, the harder it is to overfit the data by learning too much from any single training example.
2. Use regularization. Add in a penalty in the loss function for building a model that assigns too much explanatory power to any one feature or allows too many features to be taken into account.
Such a regularized cost function looks like this (shown here with an L2 penalty, one common choice):

Cost(β0, β1) = (1/(2n)) · Σᵢ (yᵢ − ŷᵢ)² + λ · Σⱼ βⱼ²

The first piece of the sum is our normal cost function. The second piece is a regularization term that adds a penalty for large beta coefficients that give too much explanatory power to any specific feature. With these two elements in place, the cost function now balances between two priorities: explaining the training data and preventing that explanation from becoming overly specific.
The lambda coefficient of the regularization term in the cost function is a hyperparameter: a general setting of your model that can be increased or decreased (i.e. tuned) in order to improve performance. A higher lambda value will more harshly penalize large beta coefficients that could lead to potential overfitting. To decide the best value of lambda, you’d use a method called cross-validation, which involves holding out a portion of the training data during training and then seeing how well your model explains the held-out portion. We’ll go over this in more depth in Part 2.3.
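As a concrete sketch, scikit-learn’s Ridge implements exactly this kind of L2-regularized linear regression, with the alpha argument playing the role of lambda and RidgeCV picking its value by cross-validation; the data below is made up:

import numpy as np
from sklearn.linear_model import Ridge, RidgeCV

X = np.array([[4.0], [5.0], [0.0], [2.0], [6.0]])  # years of education
y = np.array([80.0, 91.5, 42.0, 55.0, 100.0])      # income, in $k

# alpha is the regularization strength (lambda); larger alpha shrinks the betas more
model = Ridge(alpha=1.0).fit(X, y)

# RidgeCV tries several alphas and keeps the one with the best held-out error
best = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]).fit(X, y)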
Woo! We made it.
Here’s what we covered in this section:
• How supervised machine learning enables computers to learn from labeled training data without being explicitly programmed
• The tasks of supervised learning: regression and classification
• Linear regression, a bread-and-butter parametric algorithm
• Learning parameters with gradient descent
• Overfitting and regularization
In the next section—Part 2.2: Supervised Learning II—we’ll talk about two foundational methods of classification: logistic regression and support vector machines.
Practice materials & further reading
2.1a—Linear regression
For a more thorough treatment of linear regression, read chapters 1–3 of An Introduction to Statistical Learning. The book is available for free online and is an excellent resource for understanding machine learning concepts with accompanying exercises.
For more practice:
• Play with the Boston Housing dataset. You can either use software with nice GUIs like Minitab and Excel or do it the hard (but more rewarding) way with Python or R.
• Try your hand at a Kaggle challenge, e.g. housing price prediction, and see how others approached the problem after attempting it yourself.
2.1b—Implementing gradient descent
To actually implement gradient descent in Python, check out this tutorial. And here is a more mathematically rigorous description of the same concepts.
In practice, you’ll rarely need to implement gradient descent from scratch, but understanding how it works behind the scenes will allow you to use it more effectively and understand why things break when they do.
Part 2.2: Supervised Learning II
Classification with logistic regression and support vector machines (SVMs).
Classification: predicting a label
Is this email spam or not? Is that borrower going to repay their loan? Will those users click on the ad or not? Who is that person in your Facebook picture?
Classification predicts a discrete target label Y. Classification is the problem of assigning new observations to the class to which they most likely belong, based on a classification model built from labeled training data.
The accuracy of your classifications will depend on the effectiveness of the algorithm you choose, how you apply it, and how much useful training data you have.
Logistic regression: 0 or 1?
Logistic regression is a method of classification: the model outputs the probability of a categorical target variable Y belonging to a certain class.
A good example of classification is determining whether a loan application is fraudulent.
Ultimately, the lender wants to know whether they should give the borrower a loan or not, and they have some tolerance for risk that the application is in fact fraudulent. In this case, the goal of logistic regression is to calculate the probability (between 0% and 100%) that the application is fraud. With these probabilities, we can set some threshold above which we’re willing to lend to the borrower, and below which we deny their loan application or flag the application for further review.
Though logistic regression is often used for binary classification where there are two classes, keep in mind that classification can be performed with any number of categories (e.g. when assigning handwritten digits a label between 0 and 9, or using facial recognition to detect which friends are in a Facebook picture).
Can I just use ordinary least squares?
Nope. If you trained a linear regression model on a bunch of examples where Y = 0 or 1, you might end up predicting some probabilities that are less than 0 or greater than 1, which doesn’t make sense. Instead, we’ll use a logistic regression model (or logit model) which was designed for assigning a probability between 0% and 100% that Y belongs to a certain class.
How does the math work?
Note: the math in this section is interesting but might be on the more technical side. Feel free to skim through it if you’re more interested in the high-level concepts.

The logit model is a modification of linear regression that makes sure to output a probability between 0 and 1 by applying the sigmoid function, which, when graphed, looks like the characteristic S-shaped curve that you’ll see a bit later.
The sigmoid function squashes values between 0 and 1:

S(z) = 1 / (1 + e^(−z))
Recall the original form of our simple linear regression model, which we’ll now call g(x) since we’re going to use it within a compound function:

g(x) = β0 + β1·x + ϵ
Now, to solve this issue of getting model outputs less than 0 or greater than 1, we’re going to define a new function F(g(x)) that transforms g(x) by squashing the output of linear regression to a value in the [0,1] range. Can you think of a function that does this?
Are you thinking of the sigmoid function? Bam! Presto! You’re correct.
So we plug g(x) into the sigmoid function above, resulting in a function of our original function (yes, things are getting meta) that outputs a probability between 0 and 1:

P(Y=1) = F(g(x)) = 1 / (1 + e^(−(β0 + β1·x)))
In other words, we’re calculating the probability that the training example belongs to a certain class: P(Y=1).
Here we’ve isolated p, the probability that Y=1, on the left side of the equation. If we want to solve for a nice clean β0 + β1x + ϵ on the right side so we can straightforwardly interpret the beta coefficients we’re going to learn, we’d instead end up with the log odds ratio, or logit, on the left side—hence the name “logit model”:

ln( p / (1 − p) ) = β0 + β1·x + ϵ
The log-odds ratio is simply the natural log of the odds ratio, p/(1-p), which crops up in everyday conversations:
“Yo, what do you think are the odds that Tyrion Lannister dies in this season of Game of Thrones?”
“Hmm. It’s definitely 2x more likely to happen than not. 2-to-1 odds. Sure, he might seem too important to be killed, but we all saw what they did to Ned Stark…”
Note that in the logit model, β1 now represents the rate of change in the log-odds ratio as X changes. In other words, it’s the “slope of log-odds”, not the “slope of the probability”.
Log-odds might be slightly unintuitive but it’s worth understanding since it will come up again when you’re interpreting the output of neural networks performing classification tasks.
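As a minimal sketch of what the trained model actually computes (the beta values below are made up, standing in for learned parameters):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

beta0, beta1 = -4.0, 1.5   # hypothetical learned parameters
x = 2.0                    # a new observation

p = sigmoid(beta0 + beta1 * x)   # P(Y=1), guaranteed to land in (0, 1)
label = int(p > 0.5)             # apply a 50% decision threshold

# The log-odds recover the linear part: log(p / (1 - p)) == beta0 + beta1 * x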
Using the output of a logistic regression model to make decisions
The output of the logistic regression model from above looks like an S-curve showing P(Y=1) based on the value of X:
Source: Wikipedia
To predict the Y label—spam/not spam, cancer/not cancer, fraud/not fraud, etc.—you have to set a probability cutoff, or threshold, for a positive result. For example: “If our model thinks the probability of this email being spam is higher than 70%, label it spam. Otherwise, don’t.”
The threshold depends on your tolerance for false positives vs. false negatives. If you’re diagnosing cancer, you’d have a very low tolerance for false negatives, because even if there’s a very small chance the patient has cancer, you’d want to run further tests to make sure. So you’d set a very low threshold for a positive result.
In the case of fraudulent loan applications, on the other hand, the tolerance for false positives might be higher, particularly for smaller loans, since further vetting is costly and a small loan may not be worth the additional operational costs and friction for non-fraudulent applicants who are flagged for further processing.
Minimizing loss with logistic regression
As in the case of linear regression, we use gradient descent to learn the beta parameters that minimize loss.
In logistic regression, the cost function is basically a measure of how often you predicted 1 when the true answer was 0, or vice versa. Below is a regularized cost function just like the one we went over for linear regression, in its standard cross-entropy form:

Cost(β) = −(1/n) · Σᵢ [ yᵢ·ln(ŷᵢ) + (1 − yᵢ)·ln(1 − ŷᵢ) ] + λ · Σⱼ βⱼ²
Don’t panic when you see a long equation like this! Break it into chunks and think about what’s going on in each part conceptually. Then the specifics will start to make sense.
The first chunk is the data loss, i.e. how much discrepancy there is between the model’s predictions and reality. The second chunk is the regularization loss, i.e. how much we penalize the model for having large parameters that heavily weight certain features (remember, this prevents overfitting).
We’ll minimize this cost function with gradient descent, as above, and voilà! we’ve built a logistic regression model to make class predictions as accurately as possible.
Support vector machines (SVMs)
“We’re in a room full of marbles again. Why are we always in a room full of marbles? I could’ve sworn we already lost them.”
SVM is the last parametric model we’ll cover. It typically solves the same problem as logistic regression—classification with two classes—and yields similar performance. It’s worth understanding because the algorithm is geometrically motivated, rather than driven by probabilistic thinking.
A few examples of the problems SVMs can solve:
• Is this an image of a cat or a dog?
• Is this review positive or negative?
• Are the dots in the 2D plane red or blue?
We’ll use the third example to illustrate how SVMs work. Problems like these are called toy problems because they’re not real—but nothing is real, so it’s fine.
In this example, we have points in a 2D space that are either red or blue, and we’d like to cleanly separate the two.
The training set is plotted in the graph above. We would like to classify new, unclassified points in this plane. To do this, SVMs use a separating line (or, in more than two dimensions, a multi-dimensional hyperplane) to split the space into a red zone and a blue zone.
How, specifically, do we choose where to draw the line?
Below are two examples of such a line:
These charts were made with Microsoft Paint, which was deprecated a few weeks ago after 32 wonderful years. RIP Paint :(
Hopefully, you share the intuition that the first line is superior. The distance to the nearest point on either side of the line is called the margin, and SVM tries to maximize the margin. You can think about it like a safety space: the bigger that space, the less likely that noisy points get misclassified.
Based on this short explanation, a few big questions come up.
1. How does the math behind this work?
We want to find the optimal hyperplane (a line, in our 2D example). This hyperplane needs to (1) separate the data cleanly, with blue points on one side of the line and red points on the other side, and (2) maximize the margin. This is an optimization problem. The solution has to respect constraint (1) while maximizing the margin as is required in (2).
The human version of solving this problem would be to take a ruler and keep trying different lines separating all the points until you get the one that maximizes the margin.
It turns out there’s a clean mathematical way to do this maximization, but the specifics are beyond our scope. To explore it further, here’s a video lecture that shows how it works using Lagrangian Optimization.
The solution hyperplane you end up with is defined by its position relative to certain x_i’s, which are called the support vectors; they’re usually the points closest to the hyperplane.
2. What happens if you can’t separate the data cleanly?
There are two methods for dealing with this problem.
2.1. Soften the definition of “separate”.
We allow a few mistakes, meaning we allow some blue points in the red zone or some red points in the blue zone. We do that by adding a cost C for misclassified examples in our loss function. Basically, we say it’s acceptable but costly to misclassify a point.
2.2. Throw the data into higher dimensions.
We can create nonlinear classifiers by increasing the number of dimensions, i.e. include x², x³, even cos(x), etc. Suddenly, you have boundaries that can look more squiggly when we bring them back to the lower dimensional representation.
Intuitively, this is like having red and blue marbles lying on the ground such that they can’t be cleanly separated by a line—but if you could make all the red marbles levitate off the ground in just the right way, you could draw a plane separating them. Then you let them fall back to the ground knowing where the blues stop and reds begin.
A nonseparable dataset in a two-dimensional space R², and the same dataset mapped onto three dimensions with the third dimension being x²+y² (source: http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html)
The decision boundary is shown in green, first in the three-dimensional space (left), then back in the two-dimensional space (right). Same source as previous image.
In summary, SVMs are used for classification with two classes. They attempt to find a plane that separates the two classes cleanly. When this isn’t possible, we either soften the definition of “separate,” or we throw the data into higher dimensions so that we can cleanly separate the data.
Success!
In this section we covered:
• The classification task of supervised learning
• Two foundational classification methods: logistic regression and support vector machines (SVMs)
• Recurring concepts: the sigmoid function, log-odds (“logit”), and false positives vs. false negatives
In Part 2.3: Supervised Learning III, we’ll go into non-parametric supervised learning, where the ideas behind the algorithms are very intuitive and performance is excellent for certain kinds of problems, but the models can be harder to interpret.
Practice materials & further reading
2.2a — Logistic regression
Data School has an excellent in-depth guide to logistic regression. We’ll also continue to refer you to An Introduction to Statistical Learning. See Chapter 4 on logistic regression, and Chapter 9 on support vector machines.
To implement logistic regression, we recommend working on this problem set. You have to register on the site to work through it, unfortunately. C’est la vie.
2.2b — Down the SVM rabbit hole
To dig into the math behind SVMs, watch Prof. Patrick Winston’s lecture from MIT 6.034: Artificial Intelligence. And check out this tutorial to work through a Python implementation.
Part 2.3: Supervised Learning III
Non-parametric models: k-nearest neighbors, decision trees, and random forests. Introducing cross-validation, hyperparameter tuning, and ensemble models.
Non-parametric learners.
Things are about to get a little… wiggly.
In contrast to the methods we’ve covered so far—linear regression, logistic regression, and SVMs where the form of the model was pre-defined— non-parametric learners do not have a model structure specified a priori. We don’t speculate about the form of the function f that we’re trying to learn before training the model, as we did previously with linear regression. Instead, the model structure is purely determined from the data.
These models are more flexible to the shape of the training data, but this sometimes comes at the cost of interpretability. This will make more sense soon. Let’s jump in.
k-nearest neighbors (k-NN)
“You are the average of your k closest friends.”
k-NN seems almost too simple to be a machine learning algorithm. The idea is to label a test data point x by finding the mean (or mode) of the k closest data points’ labels.
Take a look at the image below. Let’s say you want to figure out whether Mysterious Green Circle is a Red Triangle or a Blue Square. What do you do?
You could try to come up with a fancy equation that looks at where Green Circle lies on the coordinate plane below and makes a prediction accordingly. Or, you could just look at its three nearest neighbors, and guess that Green Circle is probably a Red Triangle. You could also expand the circle further and look at the five nearest neighbors, and make a prediction that way (3/5 of its five nearest neighbors are Blue Squares, so we’d guess that Mysterious Green Circle is a Blue Square when k=5).
k-NN illustration with k=1, 3, and 5. To classify the Mysterious Green Circle (x) above, look at its single nearest neighbor, a “Red Triangle”. So, we’d guess that ŷ = “Red Triangle”. With k=3, look at the 3 nearest neighbors: the mode of these is again “Red Triangle” so ŷ= “Red Triangle”. With k=5, we take the mode of the 5 nearest neighbors instead. Now, notice that ŷ becomes “Blue Square”. Image from Wikipedia.
That’s it. That’s k-nearest neighbors. You look at the k closest data points and take the average of their values if variables are continuous (like housing prices), or the mode if they’re categorical (like cat vs. dog).
If you wanted to guess unknown house prices, you could just take the average of some number of geographically nearby houses, and you’d end up with some pretty nice guesses. These might even outperform a parametric regression model built by some economist that estimates model coefficients for # of beds/baths, nearby schools, distance to public transport, etc.
How to use k-NN to predict housing prices:
1) Store the training data, a matrix X of features like zip code, neighborhood, # of bedrooms, square feet, distance from public transport, etc., and a matrix Y of corresponding sale prices.
2) Sort the houses in your training data set by similarity to the house in question, based on the features in X. We’ll define “similarity” below.
3) Take the mean of the k closest houses. That is your guess at the sale price (i.e. ŷ).
The fact that k-NN doesn’t require a pre-defined parametric function f(X) relating Y to X makes it well-suited for situations where the relationship is too complex to be expressed with a simple linear model.
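A bare-bones sketch of those three steps in Python, using Euclidean distance (defined in the next section) as the similarity measure. The feature matrix here is hypothetical, and real features like zip code would need to be encoded as numbers first:

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict by averaging the labels of the k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)  # distance to every house
    nearest = np.argsort(dists)[:k]                  # indices of the k closest
    return y_train[nearest].mean()                   # mean sale price = y_hat

# Hypothetical data: [bedrooms, square feet / 1000]
X_train = np.array([[3, 1.5], [2, 1.0], [4, 2.2], [3, 1.7]])
y_train = np.array([350, 250, 500, 400])  # sale prices, in $k

print(knn_predict(X_train, y_train, np.array([3, 1.6]), k=3))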
Distance metrics: defining and calculating “nearness”
How do you calculate distance from the data point in question when finding the “nearest neighbors”? How do you mathematically determine which of the Blue Squares and Red Triangles in the example above are closest to Green Circle, especially if you can’t just draw a nice 2D graph and eyeball it?
The most straightforward measure is Euclidean distance (a straight line, “as the crow flies”). Another is Manhattan distance, like walking city blocks. You could imagine that Manhattan distance is more useful in a model involving fare calculation for Uber drivers, for example.
Green line = Euclidean distance. Blue line = Manhattan distance. Source: Wikipedia
Remember the Pythagorean theorem for finding the length of the hypotenuse of a right triangle?

a² + b² = c²

c = length of hypotenuse (green line above). a and b = lengths of the other sides, at a right angle (red lines above).
Solving in terms of c, we find the length of the hypotenuse by taking the square root of the sum of squared lengths of a and b, i.e. c = √(a² + b²), where a and b are orthogonal sides of the triangle (i.e. they are at a 90-degree angle from one another, going in perpendicular directions in space).
This idea of finding the length of the hypotenuse given vectors in two orthogonal directions generalizes to many dimensions, and this is how we derive the formula for Euclidean distance d(p,q) between points p and q in n-dimensional space:

d(p, q) = √( (p₁ − q₁)² + (p₂ − q₂)² + … + (pₙ − qₙ)² )

With this formula, you can calculate the nearness of all the training data points to the data point you’re trying to label, and take the mean/mode of the k nearest neighbors to make your prediction.
Typically you won’t need to calculate any distance metrics by hand—a quick Google search reveals pre-built functions in NumPy or SciPy that will do this for you, e.g. euclidean_dist = numpy.linalg.norm(p-q)—but it’s fun to see how geometry concepts from eighth grade end up being helpful for building ML models today!
Choosing k: tuning hyperparameters with cross-validation
To decide which value of k to use, you can test different k-NN models using different values of k with cross-validation:
1. Split your training data into segments, and train your model on all but one of the segments; use the held-out segment as the “test” data.
2. See how your model performs by comparing your model’s predictions (ŷ) to the actual values of the test data (y).
3. Pick whichever value of k yields the lowest error, on average, across all iterations.
Cross-validation illustrated. The number of splits and iterations can be varied.
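With scikit-learn, the whole procedure is a short loop. This sketch assumes a classification task and uses a built-in dataset as a placeholder:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # placeholder dataset

# Estimate each candidate k's error with 5-fold cross-validation
# and keep the k with the best average held-out score.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in [1, 3, 5, 7, 9]}
best_k = max(scores, key=scores.get)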
Higher k prevents overfitting
Higher values of k help address overfitting, but if the value of k is too high your model will be very biased and inflexible. To take an extreme example: if k = N (the total number of data points), the model would just dumbly blanket-classify all the test data as the mean or mode of the training data.
If the single most common animal in a data set of animals is a Scottish Fold kitten, k-NN with k set to N (the # of training observations) would then predict that every other animal in the world is also a Scottish Fold kitten. Which, in Vishal’s opinion, would be awesome. Samer disagrees.
Completely gratuitous Scottish Fold .gif. We’ll call it a study break. 
Where to use k-NN in the real world
Some examples of where you can use k-NN:
• Classification: fraud detection. The model can update virtually instantly with new training examples since you’re just storing more data points, which allows quick adaptation to new methods of fraud.
• Regression: predicting housing prices. In housing price prediction, literally being a “near neighbor” is actually a good indicator of being similar in price. k-NN is useful in domains where physical proximity matters.
• Imputing missing training data. If one of the columns in your .csv has lots of missing values, you can impute the data by taking the mean or mode. k-NN could give you a somewhat more accurate guess at each missing value.
Decision trees, random forests
Making a good decision tree is like playing a game of “20 questions”. The decision tree on the right describes survival patterns on the Titanic.
The first split at the root of a decision tree should be like the first question you should ask in 20 questions: you want to separate the data as cleanly as possible, thereby maximizing information gain from that split.
If your friend says “I’m thinking of a noun, ask me up to 20 yes/no questions to guess what it is” and your first question is “is it a potato?”, then you’re a dumbass, because they’re going to say no and you gained almost no information. Unless you happen to know your friend thinks about potatoes all the time, or is thinking about one right now.
Then you did a great job.
Instead, a question like “is it an object?” might make more sense.
This is kind of like how hospitals triage patients or approach differential diagnoses. They ask a few questions up front and check some basic vitals to determine if you’re going to die imminently or something. They don’t start by doing a biopsy to check if you have pancreatic cancer as soon as you walk in the door.
There are ways to quantify information gain so that you can essentially evaluate every possible split of the training data and maximize information gain for every split. This way you can predict every label or value as efficiently as possible.
Now, let’s look at a particular data set and talk about how we choose splits.
The Titanic dataset
Kaggle has a Titanic dataset that is used for a lot of machine learning intros. When the Titanic sank, 1,502 out of 2,224 passengers and crew were killed. Even though there was some luck involved, women, children, and the upper class were more likely to survive. If you look back at the decision tree above, you’ll see that it somewhat reflects this variability across gender, age, and class.
Choosing splits in a decision tree
Entropy is the amount of disorder in a set (in practice, impurity is measured with criteria like the Gini index or cross-entropy). If the values are really mixed, there’s lots of entropy; if you can cleanly split values, there’s no entropy. For every split at a parent node, you want the child nodes to be as pure as possible—minimize entropy. For example, in the Titanic data, gender is a big determinant of survival, so it makes sense for this feature to be used in the first split as it’s the one that leads to the most information gain.
Let’s take a look at our Titanic variables:
Source: Kaggle
We build a tree by picking one of these variables and splitting the dataset according to it.
The first split separates our dataset into men and women. Then, the women branch gets split again by age (the split that minimizes entropy). Similarly, the men branch gets split by class. By following the tree for a new passenger, you can use the tree to make a guess at whether they died.
The Titanic example is solving a classification problem (“survive” or “die”). If we were using decision trees for regression—say, to predict housing prices—we would create splits on the most important features that determine housing prices. How many square feet: more than or less than ___? How many bedrooms & bathrooms: more than or less than ___?
Then, during testing, you would run a specific house through all the splits and take the average of all the housing prices in the final leaf node (bottom-most node) where the house ends up as your prediction for the sale price.
There are a few hyperparameters you can tune with decision tree models, including max_depth and max_leaf_nodes. See the scikit-learn module on decision trees for advice on defining these parameters.
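For instance, a sketch using scikit-learn with hypothetical Titanic-style features encoded as numbers:

from sklearn.tree import DecisionTreeClassifier

# Hypothetical encoded features: [sex (0=male, 1=female), age, class]
X = [[0, 22, 3], [1, 38, 1], [1, 26, 3], [0, 35, 3], [1, 4, 2]]
y = [0, 1, 1, 0, 1]  # survived?

# max_depth and max_leaf_nodes cap the tree's complexity to limit overfitting
tree = DecisionTreeClassifier(max_depth=3, max_leaf_nodes=8).fit(X, y)
print(tree.predict([[1, 30, 1]]))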
Decision trees are effective because they are easy to read, powerful even with messy data, and computationally cheap to deploy once trained. Decision trees are also good for handling mixed data (numerical or categorical).
That said, decision trees are computationally expensive to train, carry a big risk of overfitting, and tend to find local optima because they can’t go back after they have made a split. To address these weaknesses, we turn to a method that illustrates the power of combining many decision trees into one model.
Random forest: an ensemble of decision trees
A model comprised of many models is called an ensemble model, and this is usually a winning strategy.
A single decision tree can make a lot of wrong calls because it has very black-and-white judgments. A random forest is a meta-estimator that aggregates many decision trees, with some helpful modifications:
1. The number of features that can be split on at each node is limited to some percentage of the total (this is a hyperparameter you can choose—see scikit-learn documentation for details). This ensures that the ensemble model does not rely too heavily on any individual feature, and makes fair use of all potentially predictive features.
2. Each tree draws a random sample from the original data set when generating its splits, adding a further element of randomness that prevents overfitting.
These modifications also prevent the trees from being too highly correlated. Without #1 and #2 above, every tree would be identical, since recursive binary splitting is deterministic.
To illustrate, see these nine decision tree classifiers below.
Source: http://xenon.stanford.edu/~jianzh/ml/
These decision tree classifiers can be aggregated into a random forest ensemble which combines their input. Think of the horizontal and vertical axes of each decision tree output as features x1 and x2. At certain values of each feature, the decision tree outputs a classification of “blue”, “green”, “red”, etc.
Source: http://xenon.stanford.edu/~jianzh/ml/
These results are aggregated, through modal votes or averaging, into a single ensemble model that ends up outperforming any individual decision tree’s output.
Random forests are an excellent starting point for the modeling process, since they tend to have strong performance with a high tolerance for less-cleaned data and can be useful for figuring out which features actually matter among many features.
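A sketch reusing the hypothetical features from the decision tree example above; n_estimators, max_features, and bootstrap map onto the modifications described earlier:

from sklearn.ensemble import RandomForestClassifier

X = [[0, 22, 3], [1, 38, 1], [1, 26, 3], [0, 35, 3], [1, 4, 2]]
y = [0, 1, 1, 0, 1]

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees to aggregate
    max_features="sqrt",  # features considered at each split (modification #1)
    bootstrap=True,       # each tree sees a random sample of the data (#2)
).fit(X, y)

print(forest.predict([[1, 30, 1]]))
print(forest.feature_importances_)  # which features actually matter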
There are many other clever ensemble models that combine decision trees and yield excellent performance—check out XGBoost (Extreme Gradient Boosting) as an example.
And with that, we conclude our study of supervised learning!
Nice work. In this section we’ve covered:
• Two non-parametric supervised learning algorithms: k-NN and decision trees
• Measures of distance and information gain
• Random forests, which are an example of an ensemble model
• Cross-validation and hyperparameter tuning
Hopefully, you now have some solid intuitions for how we learn f given a training data set and use this to make predictions with the test data.
Next, we’ll talk about how to approach problems where we don’t have any labeled training data to work with, in Part 3: Unsupervised Learning.
Practice materials & further reading
2.3a — Implementing k-NN
Try this walkthrough for implementing k-NN from scratch in Python. You may also want to take a look at the scikit-learn documentation to get a sense of how pre-built implementations work.
2.3b—Decision trees
Try the decision trees lab in Chapter 8 of An Introduction to Statistical Learning. You can also play with the Titanic dataset, and check out this tutorial which covers the same concepts as above with accompanying code. Here is the scikit-learn implementation of random forest for out-of-the-box use on data sets.
Part 3: Unsupervised Learning
Clustering and dimensionality reduction: k-means clustering, hierarchical clustering, principal component analysis (PCA), singular value decomposition (SVD)
How do you find the underlying structure of a dataset? How do you summarize it and group it most usefully? How do you effectively represent data in a compressed format? These are the goals of unsupervised learning, which is called “unsupervised” because you start with unlabeled data (there’s no Y).
The two unsupervised learning tasks we will explore are clustering the data into groups by similarity and reducing dimensionality to compress the data while maintaining its structure and usefulness.
Examples of where unsupervised learning methods might be useful:
- An advertising platform segments the U.S. population into smaller groups with similar demographics and purchasing habits so that advertisers can reach their target market with relevant ads.
- Airbnb groups its housing listings into neighborhoods so that users can navigate listings more easily.
- A data science team reduces the number of dimensions in a large data set to simplify modeling and reduce file size.
In contrast to supervised learning, it’s not always easy to come up with metrics for how well an unsupervised learning algorithm is doing. “Performance” is often subjective and domain-specific.
Clustering
An interesting example of clustering in the real world is marketing data provider Acxiom’s life stage clustering system, Personicx. This service segments U.S. households into 70 distinct clusters within 21 life stage groups that are used by advertisers when targeting Facebook ads, display ads, direct mail campaigns, etc.
A selection of Personicx demographic clusters
Their white paper reveals that they used centroid clustering and principal components analysis, both of which are techniques covered in this section.
You can imagine how having access to these clusters is extremely useful for advertisers who want to (1) understand their existing customer base and (2) use their ad spend effectively by targeting potential new customers with relevant demographics, interests, and lifestyles.
You can actually find out which cluster you personally would belong to by answering a few simple questions in Acxiom’s “What’s My Cluster?” tool.
Let’s walk through a couple of clustering methods to develop intuition for how this task can be performed.
k-means clustering
“And k rings were given to the race of Centroids, who above all else, desire power.”
The goal of clustering is to create groups of data points such that points in different clusters are dissimilar while points within a cluster are similar.
With k-means clustering, we want to cluster our data points into k groups. A larger k creates smaller groups with more granularity; a lower k means larger groups and less granularity.
The output of the algorithm would be a set of “labels” assigning each data point to one of the k groups. In k-means clustering, the way these groups are defined is by creating a centroid for each group. The centroids are like the heart of the cluster: they “capture” the points closest to them and add them to the cluster.
Think of these as the people who show up at a party and soon become the centers of attention because they’re so magnetic. If there’s just one of them, everyone will gather around; if there are lots, many smaller centers of activity will form.
Here are the steps to k-means clustering:
1. Define the k centroids. Initialize these at random (there are also fancier algorithms for initializing the centroids that end up converging more effectively).
2. Find the closest centroid & update cluster assignments. Assign each data point to one of the k clusters. Each data point is assigned to the nearest centroid’s cluster. Here, the measure of “nearness” is a hyperparameter — often Euclidean distance.
3. Move the centroids to the center of their clusters. The new position of each centroid is calculated as the average position of all the points in its cluster.
Keep repeating steps 2 and 3 until the centroids stop moving much between iterations (i.e., until the algorithm converges).
That, in short, is how k-means clustering works! Check out this visualization of the algorithm—read it like a comic book. Each point in the plane is colored according to the centroid that it is closest to at each moment. You’ll notice that the centroids (the larger blue, red, and green circles) start randomly and then quickly adjust to capture their respective clusters.
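Here’s a minimal NumPy sketch of those steps, assuming Euclidean distance and glossing over edge cases like a centroid losing all of its points:

import numpy as np

def kmeans(X, k, steps=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 1: random init
    for _ in range(steps):
        # Step 2: assign each point to the cluster of its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the average position of its points
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):  # converged: centroids stopped moving
            break
        centroids = new
    return labels, centroids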
Another real-life application of k-means clustering is classifying handwritten digits. Suppose we have images of the digits as a long vector of pixel brightnesses. Let’s say the images are black and white and are 64x64 pixels. Each pixel represents a dimension. So the world these images live in has 64x64=4,096 dimensions. In this 4,096-dimensional world, k-means clustering allows us to group the images that are close together and assume they represent the same digit, which can achieve pretty good results for digit recognition.
Hierarchical clustering
“Let’s make a million options become seven options. Or five. Or twenty? Meh, we can decide later.”
Hierarchical clustering is similar to regular clustering, except that you’re aiming to build a hierarchy of clusters. This can be useful when you want flexibility in how many clusters you ultimately want. For example, imagine grouping items on an online marketplace like Etsy or Amazon. On the homepage you’d want a few broad categories of items for simple navigation, but as you go into more specific shopping categories you’d want increasing levels of granularity, i.e. more distinct clusters of items.
In terms of outputs from the algorithm, in addition to cluster assignments you also build a nice tree that tells you about the hierarchies between the clusters. You can then pick the number of clusters you want from this tree.
Here are the steps for hierarchical clustering:
1. Start with N clusters, one for each data point.
2. Merge the two clusters that are closest to each other. Now you have N-1 clusters.
3. Recompute the distances between the clusters. There are several ways to do this (see this tutorial for more details). One of them (called average-linkage clustering) is to consider the distance between two clusters to be the average distance between all their respective members.
4. Repeat steps 2 and 3 until you get one cluster of N data points. You get a tree (also known as a dendrogram) like the one below.
5. Pick a number of clusters and draw a horizontal line in the dendrogram. For example, if you want k=2 clusters, you should draw a horizontal line around “distance=20000.” You’ll get one cluster with data points 8, 9, 11, 16 and one cluster with the rest of the data points. In general, the number of clusters you get is the number of intersection points of your horizontal line with the vertical lines in the dendrogram.
Source: Solver.com. For more detail on hierarchical clustering, you can check this video out.
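SciPy implements this whole procedure, including average-linkage distances and dendrogram cutting. A sketch on placeholder data:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)  # placeholder data: 20 points in 2D

# Steps 1-4: merge clusters bottom-up using average-linkage distances
Z = linkage(X, method="average")

# Step 5: cut the dendrogram to get a chosen number of clusters, e.g. k=2
labels = fcluster(Z, t=2, criterion="maxclust")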
Dimensionality reduction
“It is not the daily increase, but the daily decrease. Hack away at the unessential.”—Bruce Lee
Dimensionality reduction looks a lot like compression. This is about trying to reduce the complexity of the data while keeping as much of the relevant structure as possible. If you take a simple 128 x 128 x 3 pixel image (length x width x RGB channels), that’s 49,152 dimensions of data. If you’re able to reduce the dimensionality of the space in which these images live without destroying too much of the meaningful content in the images, then you’ve done a good job at dimensionality reduction.
We’ll take a look at two common techniques in practice: principal component analysis and singular value decomposition.
Principal component analysis (PCA)
First, a little linear algebra refresher—let’s talk about spaces and bases.
You’re familiar with the coordinate plane with origin O(0,0) and basis vectors i(1,0) and j(0,1). It turns out you can choose a completely different basis and still have all the math work out. For example, you can keep O as the origin and choose the basis vectors i’=(2,1) and j’=(1,2). If you have the patience for it, you’ll convince yourself that the point labeled (2,2) in the i’, j’ coordinate system is labeled (6, 6) in the i, j system.
Plotted using Mathisfun’s “Interactive Cartesian Coordinates”
This means we can change the basis of a space. Now imagine a much higher-dimensional space. Like, 50K dimensions. You can select a basis for that space, and then select only the 200 most significant vectors of that basis. These basis vectors are called principal components, and the subset you select constitutes a new space that is smaller in dimensionality than the original space but maintains as much of the complexity of the data as possible.
To select the most significant principal components, we look at how much of the data’s variance they capture and order them by that metric.
Another way of thinking about this is that PCA remaps the space in which our data exists to make it more compressible. The transformed dimension is smaller than the original dimension.
By making use of the first several dimensions of the remapped space only, we can start gaining an understanding of the dataset’s organization. This is the promise of dimensionality reduction: reduce complexity (dimensionality in this case) while maintaining structure (variance). Here’s a fun paper Samer wrote on using PCA (and diffusion mapping, another technique) to try to make sense of the Wikileaks cable release.
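As a sketch, scikit-learn’s PCA does the basis selection for you on placeholder data, ordering components by the variance they capture:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(500, 50)  # placeholder: 500 data points in 50 dimensions

pca = PCA(n_components=10)      # keep only the 10 most significant components
X_small = pca.fit_transform(X)  # the data remapped into the smaller basis

# Fraction of the data's variance each principal component captures
print(pca.explained_variance_ratio_)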
Singular value decomposition (SVD)
Let’s represent our data as a big m x n matrix A. SVD is a computation that allows us to decompose that big matrix into a product of 3 smaller matrices (U = m x r, a diagonal matrix Σ = r x r, and V = r x n, where r is a small number).
Here’s a more visual illustration of that product to start with:
The values in the r*r diagonal matrix Σ are called singular values. What’s cool about them is that these singular values can be used to compress the original matrix. If you drop the smallest 20% of singular values and the associated columns in matrices U and V, you save quite a bit of space and still get a decent representation of the underlying matrix.
To examine what that means more precisely, let’s work with this image of a dog:
We’ll use the code written in Andrew Gibiansky’s post on SVD. First, we show that if we rank the singular values (the values of the matrix Σ) by magnitude, the first 50 singular values contain 85% of the magnitude of the whole matrix Σ.
We can use this fact to discard the next 250 values of sigma (i.e., set them to 0) and just keep a “rank 50” version of the image of the dog. Here, we create a rank 200, 100, 50, 30, 20, 10, and 3 dog. Obviously, the picture is smaller, but let’s agree that the rank 30 dog is still good. Now let’s see how much compression we achieve with this dog. The original image matrix is 305*275 = 83,875 values. The rank 30 dog is 305*30+30+30*275=17,430—almost 5 times fewer values with very little loss in image quality. The reason for the calculation above is that we also discard the parts of the matrix U and V that get multiplied by zeros when the operation UΣ’V is carried out (where Σ’ is the modified version of Σ that only has the first 30 values in it).
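With NumPy, the rank-k compression described above takes only a few lines. This sketch uses a random matrix as a stand-in for the greyscale dog image:

import numpy as np

img = np.random.rand(305, 275)  # stand-in for the 305 x 275 dog image

U, s, Vt = np.linalg.svd(img, full_matrices=False)

k = 30  # keep only the 30 largest singular values
rank_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # the "rank 30 dog"

# Storage drops from 305*275 = 83,875 values to 305*30 + 30 + 30*275 = 17,430
print(img.size, U[:, :k].size + k + Vt[:k, :].size)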
Unsupervised learning is often used to preprocess the data. Usually, that means compressing it in some meaning-preserving way like with PCA or SVD before feeding it to a deep neural net or another supervised learning algorithm.
Onwards!
Now that you’ve finished this section, you’ve earned an awful, horrible, never-to be-mentioned-again joke about unsupervised learning. Here goes…
Person-in-joke-#1: Y would u ever need to use unsupervised tho?
Person-in-joke-#2: Y? there’s no Y.
Next up… Part 4: Neural Networks & Deep Learning!
Practice materials & further reading
3a—k-means clustering
Play around with this clustering visualization to build intuition for how the algorithm works. Then, take a look at this implementation of k-means clustering for handwritten digits and the associated tutorial.
3b—SVD
For a good reference on SVD, go no further than Andrew Gibiansky’s post.
Part 4: Neural Networks & Deep Learning
Where, why, and how deep neural networks work. Drawing inspiration from the brain. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Real-world applications.
With deep learning, we’re still learning a function f to map input X to output Y with minimal loss on the test data, just as we’ve been doing all along. Recall our initial “problem statement” from Part 2.1 on supervised learning:
Y = f(X) + ϵ
Training: machine learns f from labeled training data
Testing: machine predicts Y from unlabeled testing data
The real world is messy, so sometimes f is complicated. In natural language problems large vocabulary sizes mean lots of features. Vision problems involve lots of visual information about pixels. Playing games requires making a decision based on complex scenarios with many possible futures. The learning techniques we’ve covered so far do well when the data we’re working with is not insanely complex, but it’s not clear how they’d generalize to scenarios like these.
Deep learning is really good at learning f, particularly in situations where the data is complex. In fact, artificial neural networks are known as universal function approximators because they’re able to approximate any function, no matter how wiggly, with just a single hidden layer.
Let’s look at the problem of image classification. We take an image as an input, and output a class (e.g., dog, cat, car).
Graphically, a deep neural network solving image classification looks something like this:
Image from Jeff Clune’s 1-hour Deep Learning Overview on YouTube
But really, this is a giant mathematical equation with millions of terms and lots of parameters. The input X is, say, a greyscale image represented by a w-by-h matrix of pixel brightnesses. The output Y is a vector of class probabilities. This means we have as an output the probability of each class being the correct label. If this neural net is working well, the highest probability should be for the correct class. And the layers in the middle are just doing a bunch of matrix multiplication (summing activations times weights), with a non-linear transformation (activation function) after every hidden layer to enable the network to learn a non-linear function.
Incredibly, you can use gradient descent in the exact same way that we did with linear regression in Part 2.1 to train these parameters in a way that minimizes loss. So with a lot of examples and a lot of gradient descent, the model can learn how to classify images of animals correctly. And that, in a nutshell’s nutshell, is “deep learning”.
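To demystify the “matrix multiplication plus non-linear transformations” description, here’s a sketch of a single forward pass through a tiny two-layer network; the weights are random stand-ins for parameters that gradient descent would learn:

import numpy as np

def relu(z):
    return np.maximum(0, z)          # a common non-linear activation function

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()               # class probabilities that sum to 1

x = np.random.rand(784)              # input: a flattened 28x28 greyscale image
W1, b1 = 0.01 * np.random.randn(128, 784), np.zeros(128)  # hidden layer
W2, b2 = 0.01 * np.random.randn(10, 128), np.zeros(10)    # output layer

h = relu(W1 @ x + b1)                # matrix multiply, then non-linearity
probs = softmax(W2 @ h + b2)         # output: a vector of 10 class probabilities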
Where deep learning does well, and some history
Artificial neural networks have actually been around for a long time. Their application has been historically referred to as cybernetics (1940s-1960s), connectionism (1980s-1990s), and then came into vogue as deep learning circa 2006 when neural networks started getting, well, “deeper” (Goodfellow et al., 2016). But only recently have we really started to scratch the surface of their full potential.
As described by Andrej Karpathy (Director of AI at Tesla, whom we tend to think of as the Shaman of Deep Learning), there are generally “four separate factors that hold back AI:
1. Compute (the obvious one: Moore’s Law, GPUs, ASICs),
2. Data (in a nice form, not just out there somewhere on the internet—e.g. ImageNet),
3. Algorithms (research and ideas, e.g. backprop, CNN, LSTM), and
4. Infrastructure (software under you—Linux, TCP/IP, Git, ROS, PR2, AWS, AMT, TensorFlow, etc.)” (Karpathy, 2016).
In the past decade or so, the full potential of deep learning is finally being unlocked by advances in (1) and (2), which in turn has led to further breakthroughs in (3) and (4)—and so the cycle continues, with exponentially more humans rallying to the frontlines of deep learning research along the way (just think about what you’re doing right now!)
Illustration by NVIDIA, a leading maker of graphics processing units (GPUs), which were originally built for gaming but turned out to be well-suited to the type of parallel computing required by deep neural networks
In the rest of this section, we’ll provide some background from biology and statistics to explain what happens inside neural nets, and then talk through some amazing applications of deep learning. Finally, we’ll link to a few resources so you can apply deep learning yourself, even sitting on the couch in your pajamas with a laptop, to quickly achieve greater-than-human-level performance on certain types of problems.
Drawing inspiration from the brain (or is it just statistics?)—what happens inside neural nets
Neurons, feature learning, and layers of abstraction
As you read these words you aren’t examining every letter of every word, or every pixel making up each letter, to derive the meaning of the words. You’re abstracting away from the details and grouping things into higher-level concepts: words, phrases, sentences, paragraphs.
Tihs abiilty to exaimne hgiher-lveel fteaures is waht aollws yuo to unedrtsand waht is hpapening in tihs snetecne wthiout too mcuh troulbe (or myabe yuo sned too mnay dnruk txets).
The same thing happens in vision, not just in humans but in animals’ visual systems generally.
Brains are made up of neurons which “fire” by emitting electrical signals to other neurons after being sufficiently “activated”. These neurons are malleable in terms of how much a signal from other neurons will add to the activation level of the neuron (vaguely speaking, the weights connecting neurons to each other end up being trained to make the neural connections more useful, just like the parameters in a linear regression can be trained to improve the mapping from input to output).
Side-by-side illustrations of biological and artificial neurons, via Stanford’s CS231n. This analogy can’t be taken too literally—biological neurons can do things that artificial neurons can’t, and vice versa—but it’s useful to understand the biological inspiration. See Wikipedia’s description of biological vs. artificial neurons for more detail.
Our biological networks are arranged in a hierarchical manner, so that certain neurons end up detecting not extremely specific features of the world around us, but rather more abstract features, i.e. patterns or groupings of more low-level features. For example, the fusiform face area in the human visual system is specialized for facial recognition.
Top: Illustration of learning increasingly abstract features, via NVIDIA. Bottom: diagram of how an artificial neural network takes raw pixel inputs, develops intermediate “neurons” to detect higher-level features (e.g. presence of a nose), and combines the outputs of these to create a final output. Illustration from Neural Networks and Deep Learning (Nielsen, 2017).
This hierarchical structure exhibited by biological neural networks was discovered in the 1950s when researchers David Hubel and Torsten Wiesel were studying neurons in the visual cortex of cats. They were unable to observe neural activation after exposing the cat to a variety of stimuli: dark spots, light spots, hand-waving, and even pictures of women in magazines. But in their frustration, as they removed a slide from the projector at a diagonal angle, they noticed some neural activity! It turned out that diagonal edges at a very particular angle were causing certain neurons to be activated.
Background via Knowing Neurons
This makes sense evolutionarily since natural environments are generally noisy and random (imagine a grassy plain or a rocky terrain). So when a feline in the wild perceives an “edge”, i.e. a line that contrasts from its background, this might indicate that an object or creature is in the visual field. When a certain combination of edge neurons is activated, those activations will combine to yield a yet more abstract activation, and so on, until the final abstraction is a useful concept, like “bird” or “wolf”.
The idea behind a deep neural network is to mimic a similar structure with layers of artificial neurons.
Why linear models don’t work
To draw from Stanford’s excellent deep learning course, CS231n: Convolutional Neural Networks for Visual Recognition, imagine that we want to train a neural network to classify images with the correct one of the following labels: ["plane", "car", "bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck"].
One approach could be to construct a “template”, or average image, of each class of image using the training examples, and then use a nearest-neighbors algorithm during
testing to measure the distance of each unclassified image’s pixel values, in aggregate, to each template. This approach involves no layers of abstraction. It’s a linear model that combines all the different orientations of each type of image into one averaged blur.
For instance, it would take all the cars—regardless of whether they’re facing left, right, center, and regardless of their color—and average them. The template then ends up looking rather vague and blurry.
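As a rough sketch of this template approach (our own code with made-up array shapes, not taken from CS231n), we can average each class’s training images into a single template and then label a new image by whichever template is closest in aggregate pixel distance:

```python
import numpy as np

def fit_templates(images, labels, num_classes):
    # images: (N, D) flattened pixel values; labels: (N,) integer class ids.
    # Each class template is simply the mean of that class's training images.
    return np.stack([images[labels == c].mean(axis=0)
                     for c in range(num_classes)])

def predict(templates, image):
    # Classify by the smallest aggregate pixel distance to a template.
    dists = np.linalg.norm(templates - image, axis=1)
    return int(np.argmin(dists))

# Tiny synthetic example: six 4-pixel "images" in two classes
rng = np.random.default_rng(0)
images = rng.random((6, 4))
labels = np.array([0, 0, 0, 1, 1, 1])
templates = fit_templates(images, labels, num_classes=2)
print(predict(templates, images[0]))
```

Averaging over every pose and color in the mean(axis=0) step is exactly what produces the blurry templates described above.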
Example drawn from Stanford’s CS231n: Convolutional Neural Networks for Visual Recognition, Lecture 2.
Notice that the horse template above appears to have two heads. This doesn’t really help us: we want to be able to detect a right-facing horse or a left-facing horse separately, and then, if either one of those features is detected, say that we’re looking at a horse. This flexibility is provided by deep neural nets, as we will see in the next section.
Deep neural networks approach the image classification problem using layers of abstraction
To repeat what we explained earlier in this section: the input layer will take raw pixel brightnesses of an image. The final layer will be an output vector of class probabilities (i.e. the probability of the image being a “cat”, “car”, “horse”, etc.).
But instead of learning a simple linear model relating input to output, we’ll construct intermediate hidden layers of the network that learn increasingly abstract features, so that we don’t lose all the nuance in the complex data.
Source: Analytics Vidhya
Just as we described animal brains detecting abstract features, the artificial neurons in the hidden layers will learn to detect abstract concepts—whichever concepts are ultimately most useful for capturing the most information and minimizing loss in the accuracy of the network’s output (this is an instance of unsupervised learning happening within the network).
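Here is a minimal sketch of the forward pass just described (our own illustration with made-up layer sizes and random, untrained weights): raw pixels in, one hidden layer of candidate feature detectors, and a softmax over ten class labels out.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.random(32 * 32 * 3)           # flattened raw pixel brightnesses
W1 = rng.normal(size=(100, x.size))   # hidden layer: 100 feature detectors
b1 = np.zeros(100)
W2 = rng.normal(size=(10, 100))       # output layer: 10 class scores
b2 = np.zeros(10)

h = relu(W1 @ x + b1)         # hidden activations (learned features, once trained)
probs = softmax(W2 @ h + b2)  # output vector of class probabilities
print(probs.sum())            # probabilities sum to 1
```

Training would repeatedly adjust W1, b1, W2, and b2 to reduce the loss on labeled examples; nothing in the code pins down what the 100 hidden units must represent, which is why they are free to become edge or nose detectors.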
This comes at the cost of model interpretability, since as you add in more hidden layers the neurons start representing more and more abstract and ultimately unintelligible features—to the point that you may hear deep learning referred to as “black box optimization”, where you basically are just trying stuff somewhat at random and seeing what comes out, without really understanding what’s happening inside.
Linear regression is interpretable because you decided which features to include in the model. Deep neural networks are harder to interpret because the features are learned and aren’t explained anywhere in English. It’s all in the machine’s imagination.
Some extensions and further concepts worth noting
• Deep learning software packages. You’ll rarely need to implement all the parts of neural networks from scratch because of existing libraries and tools that make deep learning implementations easier. There are many of these: TensorFlow, Caffe, Torch, Theano, and more.
• Convolutional neural networks (CNNs). CNNs are designed specifically for taking images as input, and are effective for computer vision tasks. They are also instrumental in deep reinforcement learning. CNNs are specifically inspired by the way animal visual cortices work, and they’re the focus of the deep learning course we’ve been referencing throughout this article, Stanford’s CS231n.
• Recurrent neural networks (RNNs). RNNs have a sense of built-in memory and are well-suited for language problems. They’re also important in reinforcement learning since they enable the agent to keep track of where things are and what happened historically even when those elements aren’t all visible at once. Christopher Olah wrote an excellent walkthrough of RNNs and LSTMs in the context of language problems.
• Deep reinforcement learning. This is one of the most exciting areas of deep learning research, at the heart of recent achievements like OpenAI defeating professional Dota 2 players and DeepMind’s AlphaGo surpassing humans in the game of Go. We’ll dive deeper in Part 5, but essentially the goal is to apply all of the techniques in this post to the problem of teaching an agent to maximize reward. This can be applied in any context that can be gamified—from actual games like Counter Strike or Pacman, to self-driving cars, to trading stocks, to (ultimately) real life and the real world.
Deep learning applications
Deep learning is reshaping the world in virtually every domain. Here are a few examples of the incredible things that deep learning can do…
• Facebook trained a neural network augmented by short-term memory to intelligently answer questions about the plot of Lord of the Rings.
Research from FAIR (Facebook AI Research) applying deep neural networks augmented by separate short-term memory to intelligently answer questions about the LOTR storyline. This is the definition of epic.
• Self-driving cars rely on deep learning for visual tasks like understanding road signs, detecting lanes, and recognizing obstacles.
Source: Business Insider
• Deep learning can be used for fun stuff like art generation. A tool called neural style can impressively mimic an artist’s style and use it to remix another image.
The style of Van Gogh’s Starry Night applied to a picture of Stanford’s campus, via Justin Johnson’s neural style implementation: https://github.com/jcjohnson/neural-style
Other noteworthy examples include:
• Predicting molecule bioactivity for drug discovery
• Face and object recognition for photo and video tagging
• Powering Google search results
• Natural language understanding and generation, e.g. Google Translate
• The Mars explorer robot Curiosity, which autonomously selects inspection-worthy soil targets based on visual examination
…and many, many, more.
Now go do it!
We haven’t gone into as much detail here on how neural networks are set up in practice because it’s much easier to understand the details by implementing them yourself. Here are some amazing hands-on resources for getting started.
• Play around with the architecture of neural networks to see how different configurations affect network performance with Google’s Neural Network Playground.
• Get up and running quickly with this tutorial by Google: TensorFlow and deep learning, without a PhD. Classify handwritten digits at >99% accuracy, get familiar with TensorFlow, and learn deep learning concepts within 3 hours.
• Then, work through at least the first few lectures of Stanford’s CS231n and the first assignment of building a two-layer neural network from scratch to really solidify the concepts covered in this article.
Further resources
Deep learning is an expansive subject area. Accordingly, we’ve also compiled some of the best resources we’ve encountered on the topic, in case you’d like to go… deeper.
• Deeplearning.ai, Andrew Ng’s new deep learning course with a comprehensive syllabus on the subject
• CS231n: Convolutional Neural Networks for Visual Recognition, Stanford’s deep learning course. One of the best treatments we’ve seen, with excellent lectures and illustrative problem sets
• Deep Learning & Neural Networks—accessible but rigorous
• Deep Learning Book—foundational, more mathematical
• Fast.ai—less theoretical, much more applied and black-boxy
• See Greg Brockman (CTO of OpenAI)’s answer to the question “What are the best ways to pick up Deep Learning skills as an engineer?” on Quora
Next up: time to play some games!
Last, but most certainly not least, is Part 5: Reinforcement Learning.
Deep Learning
Ian Goodfellow
Yoshua Bengio
Aaron Courville
Contents
Website vii
Acknowledgments viii
Notation xi
1 Introduction 1
1.1 Who Should Read This Book? 8
1.2 Historical Trends in Deep Learning 11
I Applied Math and Machine Learning Basics 29
2 Linear Algebra 31
2.1 Scalars, Vectors, Matrices and Tensors 31
2.2 Multiplying Matrices and Vectors 34
2.3 Identity and Inverse Matrices 36
2.4 Linear Dependence and Span 37
2.5 Norms 39
2.6 Special Kinds of Matrices and Vectors 40
2.7 Eigendecomposition 42
2.8 Singular Value Decomposition 44
2.9 The Moore-Penrose Pseudoinverse 45
2.10 The Trace Operator 46
2.11 The Determinant 47
2.12 Example: Principal Components Analysis 48
3 Probability and Information Theory 53
3.1 Why Probability? 54
3.2 Random Variables 56
3.3 Probability Distributions 56
3.4 Marginal Probability 58
3.5 Conditional Probability 59
3.6 The Chain Rule of Conditional Probabilities 59
3.7 Independence and Conditional Independence 60
3.8 Expectation, Variance and Covariance 60
3.9 Common Probability Distributions 62
3.10 Useful Properties of Common Functions 67
3.11 Bayes’ Rule 70
3.12 Technical Details of Continuous Variables 71
3.13 Information Theory 73
3.14 Structured Probabilistic Models 75
4 Numerical Computation 80
4.1 Overflow and Underflow 80
4.2 Poor Conditioning 82
4.3 Gradient-Based Optimization 82
4.4 Constrained Optimization 93
4.5 Example: Linear Least Squares 96
5 Machine Learning Basics 98
5.1 Learning Algorithms 99
5.2 Capacity, Overfitting and Underfitting 110
5.3 Hyperparameters and Validation Sets 120
5.4 Estimators, Bias and Variance 122
5.5 Maximum Likelihood Estimation 131
5.6 Bayesian Statistics 135
5.7 Supervised Learning Algorithms 140
5.8 Unsupervised Learning Algorithms 146
5.9 Stochastic Gradient Descent 151
5.10 Building a Machine Learning Algorithm 153
5.11 Challenges Motivating Deep Learning 155
II Deep Networks: Modern Practices 166
6 Deep Feedforward Networks 168
6.1 Example: Learning XOR 171
6.2 Gradient-Based Learning 177
6.3 Hidden Units 191
6.4 Architecture Design 197
6.5 Back-Propagation and Other Differentiation Algorithms 204
6.6 Historical Notes 224
7 Regularization for Deep Learning 228
7.1 Parameter Norm Penalties 230
7.2 Norm Penalties as Constrained Optimization 237
7.3 Regularization and Under-Constrained Problems 239
7.4 Dataset Augmentation 240
7.5 Noise Robustness 242
7.6 Semi-Supervised Learning 243
7.7 Multi-Task Learning 244
7.8 Early Stopping 246
7.9 Parameter Tying and Parameter Sharing 253
7.10 Sparse Representations 254
7.11 Bagging and Other Ensemble Methods 256
7.12 Dropout 258
7.13 Adversarial Training 268
7.14 Tangent Distance, Tangent Prop, and Manifold Tangent Classifier 270
8 Optimization for Training Deep Models 274
8.1 How Learning Differs from Pure Optimization 275
8.2 Challenges in Neural Network Optimization 282
8.3 Basic Algorithms 294
8.4 Parameter Initialization Strategies 301
8.5 Algorithms with Adaptive Learning Rates 306
8.6 Approximate Second-Order Methods 310
8.7 Optimization Strategies and Meta-Algorithms 317
9 Convolutional Networks 330
9.1 The Convolution Operation 331
9.2 Motivation 335
9.3 Pooling 339
9.4 Convolution and Pooling as an Infinitely Strong Prior 345
9.5 Variants of the Basic Convolution Function 347
9.6 Structured Outputs 358
9.7 Data Types 360
9.8 Efficient Convolution Algorithms 362
9.9 Random or Unsupervised Features 363
9.10 The Neuroscientific Basis for Convolutional Networks 364
9.11 Convolutional Networks and the History of Deep Learning 371
10 Sequence Modeling: Recurrent and Recursive Nets 373
10.1 Unfolding Computational Graphs 375
10.2 Recurrent Neural Networks 378
10.3 Bidirectional RNNs 394
10.4 Encoder-Decoder Sequence-to-Sequence Architectures 396
10.5 Deep Recurrent Networks 398
10.6 Recursive Neural Networks 400
10.7 The Challenge of Long-Term Dependencies 401
10.8 Echo State Networks 404
10.9 Leaky Units and Other Strategies for Multiple Time Scales 406
10.10 The Long Short-Term Memory and Other Gated RNNs 408
10.11 Optimization for Long-Term Dependencies 413
10.12 Explicit Memory 416
11 Practical Methodology 421
11.1 Performance Metrics 422
11.2 Default Baseline Models 425
11.3 Determining Whether to Gather More Data 426
11.4 Selecting Hyperparameters 427
11.5 Debugging Strategies 436
11.6 Example: Multi-Digit Number Recognition 440
12 Applications 443
12.1 Large-Scale Deep Learning 443
12.2 Computer Vision 452
12.3 Speech Recognition 458
12.4 Natural Language Processing 461
12.5 Other Applications 478
III Deep Learning Research 486
13 Linear Factor Models 489
13.1 Probabilistic PCA and Factor Analysis 490
13.2 Independent Component Analysis (ICA) 491
13.3 Slow Feature Analysis 493
13.4 Sparse Coding 496
13.5 Manifold Interpretation of PCA 499
14 Autoencoders 502
14.1 Undercomplete Autoencoders 503
14.2 Regularized Autoencoders 504
14.3 Representational Power, Layer Size and Depth 508
14.4 Stochastic Encoders and Decoders 509
14.5 Denoising Autoencoders 510
14.6 Learning Manifolds with Autoencoders 515
14.7 Contractive Autoencoders 521
14.8 Predictive Sparse Decomposition 523
14.9 Applications of Autoencoders 524
15 Representation Learning 526
15.1 Greedy Layer-Wise Unsupervised Pretraining 528
15.2 Transfer Learning and Domain Adaptation 536
15.3 Semi-Supervised Disentangling of Causal Factors 541
15.4 Distributed Representation 546
15.5 Exponential Gains from Depth 553
15.6 Providing Clues to Discover Underlying Causes 554
16 Structured Probabilistic Models for Deep Learning 558
16.1 The Challenge of Unstructured Modeling 559
16.2 Using Graphs to Describe Model Structure 563
16.3 Sampling from Graphical Models 580
16.4 Advantages of Structured Modeling 582
16.5 Learning about Dependencies 582
16.6 Inference and Approximate Inference 584
16.7 The Deep Learning Approach to Structured Probabilistic Models 585
17 Monte Carlo Methods 590
17.1 Sampling and Monte Carlo Methods 590
17.2 Importance Sampling 592
17.3 Markov Chain Monte Carlo Methods 595
17.4 Gibbs Sampling 599
17.5 The Challenge of Mixing between Separated Modes 599
18 Confronting the Partition Function 605
18.1 The Log-Likelihood Gradient 606
18.2 Stochastic Maximum Likelihood and Contrastive Divergence 607
18.3 Pseudolikelihood 615
18.4 Score Matching and Ratio Matching 617
18.5 Denoising Score Matching 619
18.6 Noise-Contrastive Estimation 620
18.7 Estimating the Partition Function 623
19 Approximate Inference 631
19.1 Inference as Optimization 633
19.2 Expectation Maximization 634
19.3 MAP Inference and Sparse Coding 635
19.4 Variational Inference and Learning 638
19.5 Learned Approximate Inference 651
20 Deep Generative Models 654
20.1 Boltzmann Machines 654
20.2 Restricted Boltzmann Machines 656
20.3 Deep Belief Networks 660
20.4 Deep Boltzmann Machines 663
20.5 Boltzmann Machines for Real-Valued Data 676
20.6 Convolutional Boltzmann Machines 683
20.7 Boltzmann Machines for Structured or Sequential Outputs 685
20.8 Other Boltzmann Machines 686
20.9 Back-Propagation through Random Operations 687
20.10 Directed Generative Nets 692
20.11 Drawing Samples from Autoencoders 711
20.12 Generative Stochastic Networks 714
20.13 Other Generation Schemes 716
20.14 Evaluating Generative Models 717
20.15 Conclusion 720
Bibliography 721
Index 777
Website
www.deeplearningbook.org
This book is accompanied by the above website. The website provides a variety of supplementary material, including exercises, lecture slides, corrections of mistakes, and other resources that should be useful to both readers and instructors.
Acknowledgments
This book would not have been possible without the contributions of many people. We would like to thank those who commented on our proposal for the book and helped plan its contents and organization: Guillaume Alain, Kyunghyun Cho, Çağlar Gülçehre, David Krueger, Hugo Larochelle, Razvan Pascanu and Thomas Rohée.
We would like to thank the people who offered feedback on the content of the book itself. Some offered feedback on many chapters: Martín Abadi, Guillaume Alain, Ion Androutsopoulos, Fred Bertsch, Olexa Bilaniuk, Ufuk Can Biçici, Matko Bošnjak, John Boersma, Greg Brockman, Alexandre de Brébisson, Pierre Luc Carrier, Sarath Chandar, Pawel Chilinski, Mark Daoust, Oleg Dashevskii, Laurent
Dinh, Stephan Dreseitl, Jim Fan, Miao Fan, Meire Fortunato, Frédéric Francis, Nando de Freitas, Çağlar Gülçehre, Jurgen Van Gael, Javier Alonso García, Jonathan Hunt, Gopi Jeyaram, Chingiz Kabytayev, Lukasz Kaiser, Varun Kanade, Asifullah Khan, Akiel Khan, John King, Diederik P. Kingma, Yann LeCun, Rudolf Mathey, Matías Mattamala, Abhinav Maurya, Kevin Murphy, Oleg Mürk, Roman
Novak, Augustus Q. Odena, Simon Pavlik, Karl Pichotta, Eddie Pierce, Kari Pulli, Roussel Rahman, Tapani Raiko, Anurag Ranjan, Johannes Roith, Mihaela Rosca, Halis Sak, César Salgado, Grigory Sapunov, Yoshinori Sasaki, Mike Schuster, Julian Serban, Nir Shabat, Ken Shirriff, Andre Simpelo, Scott Stanley, David
Sussillo, Ilya Sutskever, Carles Gelada Sáez, Graham Taylor, Valentin Tolmer, Massimiliano Tomassoli, An Tran, Shubhendu Trivedi, Alexey Umnov, Vincent Vanhoucke, Marco Visentini-Scarzanella, Martin Vita, David Warde-Farley, Dustin Webb, Kelvin Xu, Wei Xue, Ke Yang, Li Yao, Zygmunt Zając and Ozan Çağlayan.
We would also like to thank those who provided us with useful feedback on individual chapters:
• Notation: Zhang Yuanhang.
• Chapter 1, Introduction: Yusuf Akgul, Sebastien Bratieres, Samira Ebrahimi, Charlie Gorichanaz, Brendan Loudermilk, Eric Morris, Cosmin Pârvulescu and Alfredo Solano.
• Chapter 2, Linear Algebra: Amjad Almahairi, Nikola Banić, Kevin Bennett, Philippe Castonguay, Oscar Chang, Eric Fosler-Lussier, Andrey Khalyavin, Sergey Oreshkov, István Petrás, Dennis Prangle, Thomas Rohée, Gitanjali Gulve Sehgal, Colby Toland, Alessandro Vitale and Bob Welland.
• Chapter 3, Probability and Information Theory: John Philip Anderson, Kai Arulkumaran, Vincent Dumoulin, Rui Fa, Stephan Gouws, Artem Oboturov, Antti Rasmus, Alexey Surkov and Volker Tresp.
• Chapter 4, Numerical Computation: Tran Lam An, Ian Fischer and Hu Yuhuang.
• Chapter 5, Machine Learning Basics: Dzmitry Bahdanau, Justin Domingue, Nikhil Garg, Makoto Otsuka, Bob Pepin, Philip Popien, Emmanuel Rayner, Peter Shepard, Kee-Bong Song, Zheng Sun and Andy Wu.
• Chapter 6, Deep Feedforward Networks: Uriel Berdugo, Fabrizio Bottarel, Elizabeth Burl, Ishan Durugkar, Jeff Hlywa, Jong Wook Kim, David Krueger and Aditya Kumar Praharaj.
• Chapter 7, Regularization for Deep Learning: Morten Kolbæk, Kshitij Lauria, Inkyu Lee, Sunil Mohan, Hai Phong Phan and Joshua Salisbury.
• Chapter 8, Optimization for Training Deep Models: Marcel Ackermann, Peter Armitage, Rowel Atienza, Andrew Brock, Tegan Maharaj, James Martens, Kashif Rasul, Klaus Strobl and Nicholas Turner.
• Chapter 9, Convolutional Networks: Martín Arjovsky, Eugene Brevdo, Konstantin Divilov, Eric Jensen, Mehdi Mirza, Alex Paino, Marjorie Sayer, Ryan Stout and Wentao Wu.
• Chapter 10, Sequence Modeling: Recurrent and Recursive Nets: Gökçen Eraslan, Steven Hickson, Razvan Pascanu, Lorenzo von Ritter, Rui Rodrigues, Dmitriy Serdyuk, Dongyu Shi and Kaiyu Yang.
• Chapter 11, Practical Methodology: Daniel Beckstein.
• Chapter 12, Applications: George Dahl, Vladimir Nekrasov and Ribana Roscher.
• Chapter 13, Linear Factor Models: Jayanth Koushik.
• Chapter 15, Representation Learning: Kunal Ghosh.
• Chapter 16, Structured Probabilistic Models for Deep Learning: Minh Lê and Anton Varfolom.
• Chapter 18, Confronting the Partition Function: Sam Bowman.
• Chapter 19, Approximate Inference: Yujia Bao.
• Chapter 20, Deep Generative Models: Nicolas Chapados, Daniel Galvez, Wenming Ma, Fady Medhat, Shakir Mohamed and Grégoire Montavon.
• Bibliography: Lukas Michelbacher and Leslie N. Smith.
We also want to thank those who allowed us to reproduce images, figures or data from their publications. We indicate their contributions in the figure captions throughout the text.
We would like to thank Lu Wang for writing pdf2htmlEX, which we used to make the web version of the book, and for offering support to improve the quality of the resulting HTML.
We would like to thank Ian’s wife Daniela Flori Goodfellow for patiently supporting Ian during the writing of the book as well as for help with proofreading. We would like to thank the Google Brain team for providing an intellectual environment where Ian could devote a tremendous amount of time to writing this book and receive feedback and guidance from colleagues. We would especially like to thank Ian’s former manager, Greg Corrado, and his current manager, Samy Bengio, for their support of this project. Finally, we would like to thank Geoffrey Hinton for encouragement when writing was difficult.
Notation
This section provides a concise reference describing the notation used throughout this book. If you are unfamiliar with any of the corresponding mathematical concepts, we describe most of these ideas in chapters 2–4.
Numbers and Arrays
a A scalar (integer or real)
a A vector
A A matrix
A A tensor
In Identity matrix with n rows and n columns
I Identity matrix with dimensionality implied by context
e(i) Standard basis vector [0, . . . , 0, 1, 0, . . . , 0] with a 1 at position i
diag(a) A square, diagonal matrix with diagonal entries given by a
a A scalar random variable
a A vector-valued random variable
A A matrix-valued random variable
Sets and Graphs
A A set
R The set of real numbers
{0, 1} The set containing 0 and 1
{0, 1, . . . , n} The set of all integers between 0 and n
[a, b] The real interval including a and b
(a, b] The real interval excluding a but including b
A\B Set subtraction, i.e., the set containing the elements of A that are not in B
G A graph
PaG(xi) The parents of xi in G
Indexing
ai Element i of vector a, with indexing starting at 1
a−i All elements of vector a except for element i
Ai,j Element i, j of matrix A
Ai,: Row i of matrix A
A:,i Column i of matrix A
Ai,j,k Element (i, j, k) of a 3-D tensor A
A:,:,i 2-D slice of a 3-D tensor
ai Element i of the random vector a
Linear Algebra Operations
A Transpose of matrix A
A+ Moore-Penrose pseudoinverse of A
A  B Element-wise (Hadamard) product of A and B det(A) Determinant of A
Calculus
dy/dx Derivative of y with respect to x
∂y/∂x Partial derivative of y with respect to x
∇x y Gradient of y with respect to x
∇X y Matrix derivatives of y with respect to X
∇X y Tensor containing derivatives of y with respect to X
∂f/∂x Jacobian matrix J ∈ Rm×n of f : Rn → Rm
∇²x f(x) or H(f)(x) The Hessian matrix of f at input point x
∫ f(x)dx Definite integral over the entire domain of x
∫S f(x)dx Definite integral with respect to x over the set S
Probability and Information Theory
a⊥b The random variables a and b are independent a⊥b | c They are conditionally independent given c P(a) A probability distribution over a discrete variable
p(a) A probability distribution over a continuous vari able, or over a variable whose type has not been
specified
a ∼ P Random variable a has distribution P Ex∼P[f(x)] or Ef(x) Expectation of f(x) with respect to P(x) Var(f(x)) Variance of f(x) under P(x)
Cov(f(x), g(x)) Covariance of f(x) and g(x) under P(x) H(x) Shannon entropy of the random variable x DKL(PQ) Kullback-Leibler divergence of P and Q
N (x; µ, Σ) Gaussian distribution over x with mean µ and covariance Σ
Functions
f : A → B The function f with domain A and range B
f ◦ g Composition of the functions f and g
f(x; θ) A function of x parametrized by θ. (Sometimes we write f(x) and omit the argument θ to lighten notation)
log x Natural logarithm of x
σ(x) Logistic sigmoid, 1/(1 + exp(−x))
ζ(x) Softplus, log(1 + exp(x))
||x||p Lp norm of x
||x|| L2 norm of x
x+ Positive part of x, i.e., max(0, x)
1condition is 1 if the condition is true, 0 otherwise
Sometimes we use a function f whose argument is a scalar but apply it to a vector, matrix, or tensor: f(x), f(X), or f(X). This denotes the application of f to the array element-wise. For example, if C = σ(X), then Ci,j,k = σ(Xi,j,k) for all valid values of i, j and k.
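In NumPy-style pseudocode (a sketch of our own, not code from this book), the element-wise convention looks like this:

```python
import numpy as np

def sigmoid(x):
    # The logistic sigmoid; NumPy applies it element-wise to arrays
    return 1.0 / (1.0 + np.exp(-x))

X = np.array([[[0.0, 1.0],
               [-2.0, 3.0]]])  # a small 3-D tensor
C = sigmoid(X)                 # C[i, j, k] == sigmoid(X[i, j, k]) for all i, j, k
print(C.shape)                 # (1, 2, 2)
```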
Datasets and Distributions
pdata The data generating distribution
p̂data The empirical distribution defined by the training set
X A set of training examples
x(i) The i-th example (input) from a dataset
y(i) or y(i) The target associated with x(i) for supervised learning
X The m × n matrix with input example x(i) in row Xi,:
Chapter 1
Introduction
Inventors have long dreamed of creating machines that think. This desire dates back to at least the time of ancient Greece. The mythical figures Pygmalion, Daedalus, and Hephaestus may all be interpreted as legendary inventors, and Galatea, Talos, and Pandora may all be regarded as artificial life (Ovid and Martin, 2004; Sparkes, 1996; Tandy, 1997).
When programmable computers were first conceived, people wondered whether such machines might become intelligent, over a hundred years before one was built (Lovelace, 1842). Today, artificial intelligence (AI) is a thriving field with many practical applications and active research topics. We look to intelligent software to automate routine labor, understand speech or images, make diagnoses in medicine and support basic scientific research.
In the early days of artificial intelligence, the field rapidly tackled and solved problems that are intellectually difficult for human beings but relatively straightforward for computers—problems that can be described by a list of formal, mathematical rules. The true challenge to artificial intelligence proved to be solving the tasks that are easy for people to perform but hard for people to describe formally—problems that we solve intuitively, that feel automatic, like recognizing spoken words or faces in images.
This book is about a solution to these more intuitive problems. This solution is to allow computers to learn from experience and understand the world in terms of a hierarchy of concepts, with each concept defined in terms of its relation to simpler concepts. By gathering knowledge from experience, this approach avoids the need for human operators to formally specify all of the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones. If we draw a graph showing how these
concepts are built on top of each other, the graph is deep, with many layers. For this reason, we call this approach to AI deep learning.
Many of the early successes of AI took place in relatively sterile and formal environments and did not require computers to have much knowledge about the world. For example, IBM’s Deep Blue chess-playing system defeated world champion Garry Kasparov in 1997 (Hsu, 2002). Chess is of course a very simple world, containing only sixty-four locations and thirty-two pieces that can move in only rigidly circumscribed ways. Devising a successful chess strategy is a tremendous accomplishment, but the challenge is not due to the difficulty of describing the set of chess pieces and allowable moves to the computer. Chess can be completely described by a very brief list of completely formal rules, easily provided ahead of time by the programmer.
Ironically, abstract and formal tasks that are among the most difficult mental undertakings for a human being are among the easiest for a computer. Computers have long been able to defeat even the best human chess player, but are only recently matching some of the abilities of average human beings to recognize objects or speech. A person’s everyday life requires an immense amount of knowledge about the world. Much of this knowledge is subjective and intuitive, and therefore difficult to articulate in a formal way. Computers need to capture this same knowledge in order to behave in an intelligent way. One of the key challenges in artificial intelligence is how to get this informal knowledge into a computer.
Several artificial intelligence projects have sought to hard-code knowledge about the world in formal languages. A computer can reason about statements in these formal languages automatically using logical inference rules. This is known as the knowledge base approach to artificial intelligence. None of these projects has led
to a major success. One of the most famous such projects is Cyc (Lenat and Guha, 1989). Cyc is an inference engine and a database of statements in a language called CycL. These statements are entered by a staff of human supervisors. It is an unwieldy process. People struggle to devise formal rules with enough complexity to accurately describe the world. For example, Cyc failed to understand a story about a person named Fred shaving in the morning (Linde, 1992). Its inference engine detected an inconsistency in the story: it knew that people do not have electrical parts, but because Fred was holding an electric razor, it believed the entity “FredWhileShaving” contained electrical parts. It therefore asked whether Fred was still a person while he was shaving.
The difficulties faced by systems relying on hard-coded knowledge suggest that AI systems need the ability to acquire their own knowledge, by extracting patterns from raw data. This capability is known as machine learning. The
introduction of machine learning allowed computers to tackle problems involving knowledge of the real world and make decisions that appear subjective. A simple machine learning algorithm called logistic regression can determine whether to recommend cesarean delivery (Mor-Yosef et al., 1990). A simple machine learning
algorithm called naive Bayes can separate legitimate e-mail from spam e-mail. The performance of these simple machine learning algorithms depends heavily on the representation of the data they are given. For example, when logistic regression is used to recommend cesarean delivery, the AI system does not examine the patient directly. Instead, the doctor tells the system several pieces of relevant information, such as the presence or absence of a uterine scar. Each piece of information included in the representation of the patient is known as a feature. Logistic regression learns how each of these features of the patient correlates with various outcomes. However, it cannot influence the way that the features are defined in any way. If logistic regression was given an MRI scan of the patient, rather than the doctor’s formalized report, it would not be able to make useful predictions. Individual pixels in an MRI scan have negligible correlation with any complications that might occur during delivery.
This dependence on representations is a general phenomenon that appears throughout computer science and even daily life. In computer science, operations such as searching a collection of data can proceed exponentially faster if the collection is structured and indexed intelligently. People can easily perform arithmetic on Arabic numerals, but find arithmetic on Roman numerals much more time-consuming. It is not surprising that the choice of representation has an enormous effect on the performance of machine learning algorithms. For a simple visual example, see figure 1.1.
Many artificial intelligence tasks can be solved by designing the right set of features to extract for that task, then providing these features to a simple machine learning algorithm. For example, a useful feature for speaker identification from sound is an estimate of the size of the speaker’s vocal tract. It therefore gives a strong clue as to whether the speaker is a man, woman, or child.
However, for many tasks, it is difficult to know what features should be extracted. For example, suppose that we would like to write a program to detect cars in photographs. We know that cars have wheels, so we might like to use the presence of a wheel as a feature. Unfortunately, it is difficult to describe exactly what a wheel looks like in terms of pixel values. A wheel has a simple geometric shape but its image may be complicated by shadows falling on the wheel, the sun glaring off the metal parts of the wheel, the fender of the car or an object in the foreground obscuring part of the wheel, and so on.
Figure 1.1: Example of different representations: suppose we want to separate two categories of data by drawing a line between them in a scatterplot. In the plot on the left, we represent some data using Cartesian coordinates, and the task is impossible. In the plot on the right, we represent the data with polar coordinates and the task becomes simple to solve with a vertical line. Figure produced in collaboration with David Warde-Farley.
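A hedged sketch of the same phenomenon on synthetic data of our own making: two concentric rings of points cannot be separated by a line in Cartesian coordinates, but after converting each point to polar coordinates the radius alone separates them.

```python
import numpy as np

rng = np.random.default_rng(0)

def ring(radius, n=100):
    # Points scattered around a circle of the given radius
    angles = rng.uniform(0, 2 * np.pi, n)
    r = radius + rng.normal(0, 0.1, n)
    return np.column_stack([r * np.cos(angles), r * np.sin(angles)])

def to_polar(xy):
    # Representation change: (x, y) -> (radius, angle)
    r = np.hypot(xy[:, 0], xy[:, 1])
    theta = np.arctan2(xy[:, 1], xy[:, 0])
    return np.column_stack([r, theta])

inner, outer = ring(1.0), ring(3.0)
# In polar coordinates a single vertical line (r = 2) separates the two classes
print(to_polar(inner)[:, 0].max() < 2.0 < to_polar(outer)[:, 0].min())  # True
```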
One solution to this problem is to use machine learning to discover not only the mapping from representation to output but also the representation itself. This approach is known as representation learning. Learned representations often result in much better performance than can be obtained with hand-designed representations. They also allow AI systems to rapidly adapt to new tasks, with minimal human intervention. A representation learning algorithm can discover a good set of features for a simple task in minutes, or a complex task in hours to months. Manually designing features for a complex task requires a great deal of human time and effort; it can take decades for an entire community of researchers.
The quintessential example of a representation learning algorithm is the autoencoder. An autoencoder is the combination of an encoder function that converts the input data into a different representation, and a decoder function that converts the new representation back into the original format. Autoencoders are trained to preserve as much information as possible when an input is run through the encoder and then the decoder, but are also trained to make the new representation have various nice properties. Different kinds of autoencoders aim to achieve different kinds of properties.
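A minimal sketch of this encoder/decoder structure, assuming a purely linear model with made-up dimensions (real autoencoders use learned, usually nonlinear, layers):

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 8, 3                      # input dimension and smaller code dimension
W_enc = rng.normal(size=(K, D))  # encoder weights (would be learned)
W_dec = rng.normal(size=(D, K))  # decoder weights (would be learned)

def encoder(x):
    return W_enc @ x             # convert input into a different representation

def decoder(code):
    return W_dec @ code          # convert the representation back to input format

x = rng.random(D)
x_hat = decoder(encoder(x))
# Training would adjust W_enc and W_dec to make this reconstruction error small
print(np.mean((x - x_hat) ** 2))
```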
When designing features or algorithms for learning features, our goal is usually to separate the factors of variation that explain the observed data. In this context, we use the word “factors” simply to refer to separate sources of influence; the factors are usually not combined by multiplication. Such factors are often not
quantities that are directly observed. Instead, they may exist either as unobserved objects or unobserved forces in the physical world that affect observable quantities. They may also exist as constructs in the human mind that provide useful simplifying explanations or inferred causes of the observed data. They can be thought of as concepts or abstractions that help us make sense of the rich variability in the data. When analyzing a speech recording, the factors of variation include the speaker’s age, their sex, their accent and the words that they are speaking. When analyzing an image of a car, the factors of variation include the position of the car, its color, and the angle and brightness of the sun.
A major source of difficulty in many real-world artificial intelligence applications is that many of the factors of variation influence every single piece of data we are able to observe. The individual pixels in an image of a red car might be very close to black at night. The shape of the car’s silhouette depends on the viewing angle.
Most applications require us to disentangle the factors of variation and discard the ones that we do not care about.
Of course, it can be very difficult to extract such high-level, abstract features from raw data. Many of these factors of variation, such as a speaker’s accent, can be identified only using sophisticated, nearly human-level understanding of the data. When it is nearly as difficult to obtain a representation as to solve the original problem, representation learning does not, at first glance, seem to help us.
Deep learning solves this central problem in representation learning by introducing representations that are expressed in terms of other, simpler representations. Deep learning allows the computer to build complex concepts out of simpler concepts. Figure 1.2 shows how a deep learning system can represent the concept of an image of a person by combining simpler concepts, such as corners and contours, which are in turn defined in terms of edges.
The quintessential example of a deep learning model is the feedforward deep network or multilayer perceptron (MLP). A multilayer perceptron is just a mathematical function mapping some set of input values to output values. The function is formed by composing many simpler functions. We can think of each application of a different mathematical function as providing a new representation of the input.
The idea of learning the right representation for the data provides one perspective on deep learning. Another perspective on deep learning is that depth allows the computer to learn a multi-step computer program. Each layer of the representation can be thought of as the state of the computer’s memory after executing another set of instructions in parallel. Networks with greater depth can execute more instructions in sequence. Sequential instructions offer great power because later
Figure 1.2: Illustration of a deep learning model. It is difficult for a computer to understand the meaning of raw sensory input data, such as this image represented as a collection of pixel values. The function mapping from a set of pixels to an object identity is very complicated. Learning or evaluating this mapping seems insurmountable if tackled directly.
Deep learning resolves this difficulty by breaking the desired complicated mapping into a series of nested simple mappings, each described by a different layer of the model. The input is presented at the visible layer, so named because it contains the variables that
we are able to observe. Then a series of hidden layers extracts increasingly abstract features from the image. These layers are called “hidden” because their values are not given in the data; instead the model must determine which concepts are useful for explaining the relationships in the observed data. The images here are visualizations of the kind of feature represented by each hidden unit. Given the pixels, the first layer can easily identify edges, by comparing the brightness of neighboring pixels. Given the first hidden layer’s description of the edges, the second hidden layer can easily search for corners and extended contours, which are recognizable as collections of edges. Given the second hidden layer’s description of the image in terms of corners and contours, the third hidden layer can detect entire parts of specific objects, by finding specific collections of contours and corners. Finally, this description of the image in terms of the object parts it contains can be used to recognize the objects present in the image. Images reproduced with permission from Zeiler and Fergus (2014).
Figure 1.3: Illustration of computational graphs mapping an input to an output where each node performs an operation. Depth is the length of the longest path from input to output but depends on the definition of what constitutes a possible computational step. The computation depicted in these graphs is the output of a logistic regression model, σ(wT x), where σ is the logistic sigmoid function. If we use addition, multiplication and logistic sigmoids as the elements of our computer language, then this model has depth three. If we view logistic regression as an element itself, then this model has depth one.
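The caption’s point can be made executable (a sketch of our own, not the book’s code): the same model σ(w⊤x) takes three steps if multiplication, addition, and the sigmoid are the primitives, but one step if logistic regression itself is a primitive.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -0.3])
x = np.array([1.0, 2.0])

# Language 1: multiplication, addition, sigmoid as primitives -> depth three
products = w * x         # step 1: element-wise multiplication
score = products.sum()   # step 2: addition
output = sigmoid(score)  # step 3: logistic sigmoid

# Language 2: logistic regression as a single primitive -> depth one
def logistic_regression(w, x):
    return sigmoid(w @ x)

assert np.isclose(output, logistic_regression(w, x))
```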
instructions can refer back to the results of earlier instructions. According to this view of deep learning, not all of the information in a layer’s activations necessarily encodes factors of variation that explain the input. The representation also stores
state information that helps to execute a program that can make sense of the input. This state information could be analogous to a counter or pointer in a traditional computer program. It has nothing to do with the content of the input specifically, but it helps the model to organize its processing.
There are two main ways of measuring the depth of a model. The first view is based on the number of sequential instructions that must be executed to evaluate the architecture. We can think of this as the length of the longest path through a flow chart that describes how to compute each of the model’s outputs given its inputs. Just as two equivalent computer programs will have different lengths depending on which language the program is written in, the same function may be drawn as a flowchart with different depths depending on which functions we allow to be used as individual steps in the flowchart. Figure 1.3 illustrates how this choice of language can give two different measurements for the same architecture.
Another approach, used by deep probabilistic models, regards the depth of a model as being not the depth of the computational graph but the depth of the graph describing how concepts are related to each other. In this case, the depth
of the flowchart of the computations needed to compute the representation of each concept may be much deeper than the graph of the concepts themselves. This is because the system’s understanding of the simpler concepts can be refined given information about the more complex concepts. For example, an AI system observing an image of a face with one eye in shadow may initially only see one eye. After detecting that a face is present, it can then infer that a second eye is probably present as well. In this case, the graph of concepts only includes two layers—a layer for eyes and a layer for faces—but the graph of computations includes 2n layers if we refine our estimate of each concept given the other n times.
Because it is not always clear which of these two views—the depth of the computational graph, or the depth of the probabilistic modeling graph—is most relevant, and because different people choose different sets of smallest elements from which to construct their graphs, there is no single correct value for the depth of an architecture, just as there is no single correct value for the length of
a computer program. Nor is there a consensus about how much depth a model requires to qualify as “deep.” However, deep learning can safely be regarded as the study of models that either involve a greater amount of composition of learned functions or learned concepts than traditional machine learning does.
To summarize, deep learning, the subject of this book, is an approach to AI. Specifically, it is a type of machine learning, a technique that allows computer systems to improve with experience and data. According to the authors of this book, machine learning is the only viable approach to building AI systems that can operate in complicated, real-world environments. Deep learning is a particular kind of machine learning that achieves great power and flexibility by learning to represent the world as a nested hierarchy of concepts, with each concept defined in relation to simpler concepts, and more abstract representations computed in terms of less abstract ones. Figure 1.4 illustrates the relationship between these different AI disciplines. Figure 1.5 gives a high-level schematic of how each works.
1.1 Who Should Read This Book?
This book can be useful for a variety of readers, but we wrote it with two main target audiences in mind. One of these target audiences is university students (undergraduate or graduate) learning about machine learning, including those who are beginning a career in deep learning and artificial intelligence research. The other target audience is software engineers who do not have a machine learning
or statistics background, but want to rapidly acquire one and begin using deep learning in their product or platform. Deep learning has already proven useful in
Figure 1.4: A Venn diagram showing how deep learning is a kind of representation learning, which is in turn a kind of machine learning, which is used for many but not all approaches to AI. Each section of the Venn diagram includes an example of an AI technology.
[Figure 1.5 diagram: four flowcharts from input to output. Rule-based systems: input → hand-designed program → output. Classic machine learning: input → hand-designed features → mapping from features → output. Representation learning: input → features → mapping from features → output. Deep learning: input → simple features → additional layers of more abstract features → mapping from features → output.]
Figure 1.5: Flowcharts showing how the different parts of an AI system relate to each other within different AI disciplines. Shaded boxes indicate components that are able to learn from data.
many software disciplines including computer vision, speech and audio processing, natural language processing, robotics, bioinformatics and chemistry, video games, search engines, online advertising and finance.
This book has been organized into three parts in order to best accommodate a variety of readers. Part I introduces basic mathematical tools and machine learning concepts. Part II describes the most established deep learning algorithms that are essentially solved technologies. Part III describes more speculative ideas that are widely believed to be important for future research in deep learning.
Readers should feel free to skip parts that are not relevant given their interests or background. Readers familiar with linear algebra, probability, and fundamental machine learning concepts can skip part I, for example, while readers who just want to implement a working system need not read beyond part II. To help choose which chapters to read, figure 1.6 provides a flowchart showing the high-level organization of the book.
We do assume that all readers come from a computer science background. We assume familiarity with programming, a basic understanding of computational performance issues, complexity theory, introductory level calculus and some of the terminology of graph theory.
1.2 Historical Trends in Deep Learning
It is easiest to understand deep learning with some historical context. Rather than providing a detailed history of deep learning, we identify a few key trends:
• Deep learning has had a long and rich history, but has gone by many names reflecting different philosophical viewpoints, and has waxed and waned in popularity.
• Deep learning has become more useful as the amount of available training data has increased.
• Deep learning models have grown in size over time as computer infrastructure (both hardware and software) for deep learning has improved.
• Deep learning has solved increasingly complicated applications with increasing accuracy over time.
Figure 1.6: The high-level organization of the book. An arrow from one chapter to another indicates that the former chapter is prerequisite material for understanding the latter.
1.2.1 The Many Names and Changing Fortunes of Neural Networks
We expect that many readers of this book have heard of deep learning as an exciting new technology, and are surprised to see a mention of “history” in a book about an emerging field. In fact, deep learning dates back to the 1940s. Deep learning only appears to be new because it was relatively unpopular for several years preceding its current popularity, because it has gone through many different names, and because it has only recently come to be called “deep learning.” The field has been rebranded many times, reflecting the influence of different researchers and different perspectives.
A comprehensive history of deep learning is beyond the scope of this textbook. However, some basic context is useful for understanding deep learning. Broadly speaking, there have been three waves of development of deep learning: deep learning known as cybernetics in the 1940s–1960s, deep learning known as connectionism in the 1980s–1990s, and the current resurgence under the name deep learning beginning in 2006. This is quantitatively illustrated in figure 1.7.
Some of the earliest learning algorithms we recognize today were intended to be computational models of biological learning, i.e. models of how learning happens or could happen in the brain. As a result, one of the names that deep learning has gone by is artificial neural networks (ANNs). The corresponding perspective on deep learning models is that they are engineered systems inspired by the biological brain (whether the human brain or the brain of another animal). While the kinds of neural networks used for machine learning have sometimes been used to understand brain function (Hinton and Shallice, 1991), they are generally not designed to be realistic models of biological function. The neural perspective on deep learning is motivated by two main ideas. One idea is that the brain provides a proof by example that intelligent behavior is possible, and a conceptually straightforward path to building intelligence is to reverse engineer the computational principles behind the brain and duplicate its functionality. Another perspective is that it would be deeply interesting to understand the brain and the principles that underlie human intelligence, so machine learning models that shed light on these basic scientific questions are useful apart from their ability to solve engineering applications.
The modern term “deep learning” goes beyond the neuroscientific perspective on the current breed of machine learning models. It appeals to a more general principle of learning multiple levels of composition, which can be applied in machine learning frameworks that are not necessarily neurally inspired.
[Figure 1.7 plot: frequency of word or phrase in Google Books versus year, 1940–2000, for the phrases “cybernetics” and “connectionism + neural networks”.]

Figure 1.7: The figure shows two of the three historical waves of artificial neural nets research, as measured by the frequency of the phrases “cybernetics” and “connectionism” or “neural networks” according to Google Books (the third wave is too recent to appear). The first wave started with cybernetics in the 1940s–1960s, with the development of theories of biological learning (McCulloch and Pitts, 1943; Hebb, 1949) and implementations of the first models such as the perceptron (Rosenblatt, 1958) allowing the training of a single neuron. The second wave started with the connectionist approach of the 1980–1995 period, with back-propagation (Rumelhart et al., 1986a) to train a neural network with one or two hidden layers. The current and third wave, deep learning, started around 2006 (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al., 2007a), and is just now appearing in book form as of 2016. The other two waves similarly appeared in book form much later than the corresponding scientific activity occurred.
The earliest predecessors of modern deep learning were simple linear models motivated from a neuroscientific perspective. These models were designed to take a set of n input values x1, . . . , xn and associate them with an output y. These models would learn a set of weights w1, . . . , wn and compute their output f(x, w) = x1w1 + · · · + xnwn. This first wave of neural networks research was known as cybernetics, as illustrated in figure 1.7.
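To make this concrete, here is a minimal NumPy sketch of such a linear model (our own illustration, not from the original text; the input and weight values are arbitrary):

```python
import numpy as np

# A linear model computes f(x, w) = x1*w1 + ... + xn*wn, i.e. a dot product.
x = np.array([0.5, -1.2, 3.0])   # example inputs (arbitrary values)
w = np.array([0.1, 0.4, -0.2])   # example learned weights (arbitrary values)

f = np.dot(x, w)                 # x1*w1 + x2*w2 + x3*w3
print(f)                         # approximately -1.03
```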
The McCulloch-Pitts Neuron (McCulloch and Pitts, 1943) was an early model of brain function. This linear model could recognize two different categories of inputs by testing whether f(x, w) is positive or negative. Of course, for the model to correspond to the desired definition of the categories, the weights needed to be set correctly. These weights could be set by the human operator. In the 1950s, the perceptron (Rosenblatt, 1958, 1962) became the first model that could learn the weights defining the categories given examples of inputs from each category. The adaptive linear element (ADALINE), which dates from about the same time, simply returned the value of f(x) itself to predict a real number (Widrow and Hoff, 1960), and could also learn to predict these numbers from data.
These simple learning algorithms greatly affected the modern landscape of machine learning. The training algorithm used to adapt the weights of the ADALINE was a special case of an algorithm called stochastic gradient descent. Slightly modified versions of the stochastic gradient descent algorithm remain the dominant training algorithms for deep learning models today.
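As an illustration, the following sketch (our own construction; the function name, learning rate, and data are assumptions, not from the original text) implements an ADALINE-style stochastic gradient descent update on squared error for a linear model:

```python
import numpy as np

def sgd_adaline(X, y, lr=0.01, epochs=50, seed=0):
    """Stochastic gradient descent on squared error for a linear model,
    in the spirit of the ADALINE / least-mean-squares update."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):   # visit examples in random order
            err = y[i] - X[i] @ w           # prediction error on one example
            w += lr * err * X[i]            # gradient step on this example
    return w

# Recover w from data generated by a known linear rule (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
w = sgd_adaline(X, X @ true_w)
print(np.round(w, 3))   # approximately [ 2.  -1.   0.5]
```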
Models based on the f(x, w) used by the perceptron and ADALINE are called linear models. These models remain some of the most widely used machine learning models, though in many cases they are trained in different ways than the original models were trained.
Linear models have many limitations. Most famously, they cannot learn the XOR function, where f ([0, 1], w) = 1 and f([1, 0], w) = 1 but f([1, 1], w) = 0 and f ([0, 0], w) = 0. Critics who observed these flaws in linear models caused a backlash against biologically inspired learning in general (Minsky and Papert, 1969). This was the first major dip in the popularity of neural networks.
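The XOR failure is easy to verify numerically. In this sketch (ours, not from the text), the best least-squares linear fit to the four XOR examples, even with a bias term, predicts 0.5 for every input:

```python
import numpy as np

# XOR inputs with a bias column; targets y.
X = np.array([[1, 0, 0],
              [1, 0, 1],
              [1, 1, 0],
              [1, 1, 1]], dtype=float)   # first column is the bias feature
y = np.array([0., 1., 1., 0.])

# Best linear fit in the least-squares sense.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(w, 3))   # [0.5 0.  0. ] -- the weights vanish, only the bias remains
print(X @ w)            # [0.5 0.5 0.5 0.5] -- the model outputs 0.5 everywhere
```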
Today, neuroscience is regarded as an important source of inspiration for deep learning researchers, but it is no longer the predominant guide for the field. The main reason for the diminished role of neuroscience in deep learning research today is that we simply do not have enough information about the brain to use it as a guide. To obtain a deep understanding of the actual algorithms used by the brain, we would need to be able to monitor the activity of (at the very least) thousands of interconnected neurons simultaneously. Because we are not able to do this, we are far from understanding even some of the most simple and well-studied parts of the brain (Olshausen and Field, 2005).
Neuroscience has given us a reason to hope that a single deep learning algorithm can solve many different tasks. Neuroscientists have found that ferrets can learn to “see” with the auditory processing region of their brain if their brains are rewired to send visual signals to that area (Von Melchner et al., 2000). This suggests that much of the mammalian brain might use a single algorithm to solve most of the different tasks that the brain solves. Before this hypothesis, machine learning research was more fragmented, with different communities of researchers studying natural language processing, vision, motion planning and speech recognition. Today, these application communities are still separate, but it is common for deep learning research groups to study many or even all of these application areas simultaneously.
We are able to draw some rough guidelines from neuroscience. The basic idea of having many computational units that become intelligent only via their interactions with each other is inspired by the brain. The Neocognitron (Fukushima, 1980) introduced a powerful model architecture for processing images that was inspired by the structure of the mammalian visual system and later became the basis for the modern convolutional network (LeCun et al., 1998b), as we will see in section 9.10. Most neural networks today are based on a model neuron called the rectified linear unit. The original Cognitron (Fukushima, 1975) introduced a more complicated version that was highly inspired by our knowledge of brain function. The simplified modern version was developed incorporating ideas from many viewpoints, with Nair and Hinton (2010) and Glorot et al. (2011a) citing neuroscience as an influence, and Jarrett et al. (2009) citing more engineering oriented influences. While neuroscience is an important source of inspiration, it need not be taken as a rigid guide. We know that actual neurons compute very different functions than modern rectified linear units, but greater neural realism has not yet led to an improvement in machine learning performance. Also, while neuroscience has successfully inspired several neural network architectures, we do not yet know enough about biological learning for neuroscience to offer much guidance for the learning algorithms we use to train these architectures.
Media accounts often emphasize the similarity of deep learning to the brain. While it is true that deep learning researchers are more likely to cite the brain as an influence than researchers working in other machine learning fields such as kernel machines or Bayesian statistics, one should not view deep learning as an attempt to simulate the brain. Modern deep learning draws inspiration from many fields, especially applied math fundamentals like linear algebra, probability, information theory, and numerical optimization. While some deep learning researchers cite neuroscience as an important source of inspiration, others are not concerned with neuroscience at all.
It is worth noting that the effort to understand how the brain works on an algorithmic level is alive and well. This endeavor is primarily known as “computational neuroscience” and is a separate field of study from deep learning. It is common for researchers to move back and forth between both fields. The field of deep learning is primarily concerned with how to build computer systems that are able to successfully solve tasks requiring intelligence, while the field of computational neuroscience is primarily concerned with building more accurate models of how the brain actually works.
In the 1980s, the second wave of neural network research emerged in great part via a movement called connectionism or parallel distributed processing (Rumelhart et al., 1986c; McClelland et al., 1995). Connectionism arose in the context of cognitive science. Cognitive science is an interdisciplinary approach to understanding the mind, combining multiple different levels of analysis. During the early 1980s, most cognitive scientists studied models of symbolic reasoning. Despite their popularity, symbolic models were difficult to explain in terms of how the brain could actually implement them using neurons. The connectionists began to study models of cognition that could actually be grounded in neural implementations (Touretzky and Minton, 1985), reviving many ideas dating back to the work of psychologist Donald Hebb in the 1940s (Hebb, 1949).
The central idea in connectionism is that a large number of simple computational units can achieve intelligent behavior when networked together. This insight applies equally to neurons in biological nervous systems and to hidden units in computational models.
Several key concepts arose during the connectionism movement of the 1980s that remain central to today’s deep learning.
One of these concepts is that of distributed representation (Hinton et al., 1986). This is the idea that each input to a system should be represented by many features, and each feature should be involved in the representation of many possible inputs. For example, suppose we have a vision system that can recognize cars, trucks, and birds and these objects can each be red, green, or blue. One way of representing these inputs would be to have a separate neuron or hidden unit that activates for each of the nine possible combinations: red truck, red car, red bird, green truck, and so on. This requires nine different neurons, and each neuron must independently learn the concept of color and object identity. One way to improve on this situation is to use a distributed representation, with three neurons describing the color and three neurons describing the object identity. This requires only six neurons total instead of nine, and the neuron describing redness is able to learn about redness from images of cars, trucks and birds, not only from images of one specific category of objects. The concept of distributed representation is central to this book, and will be described in greater detail in chapter 15.
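A toy sketch of the counting argument above (our own construction; the encodings are illustrative only):

```python
import numpy as np

colors  = ["red", "green", "blue"]
objects = ["car", "truck", "bird"]

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

# Non-distributed code: one unit per (color, object) pair -- 9 units.
def joint_code(color, obj):
    return one_hot(colors.index(color) * 3 + objects.index(obj), 9)

# Distributed code: 3 color units plus 3 object units -- 6 units,
# and the "red" unit is shared across red cars, red trucks, and red birds.
def distributed_code(color, obj):
    return np.concatenate([one_hot(colors.index(color), 3),
                           one_hot(objects.index(obj), 3)])

print(joint_code("red", "truck"))        # 9-dimensional, a single 1
print(distributed_code("red", "truck"))  # 6-dimensional, two 1s
```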
Another major accomplishment of the connectionist movement was the successful use of back-propagation to train deep neural networks with internal representations and the popularization of the back-propagation algorithm (Rumelhart et al., 1986a; LeCun, 1987). This algorithm has waxed and waned in popularity but as of this writing is currently the dominant approach to training deep models.
During the 1990s, researchers made important advances in modeling sequences with neural networks. Hochreiter (1991) and Bengio et al. (1994) identified some of the fundamental mathematical difficulties in modeling long sequences, described in section 10.7. Hochreiter and Schmidhuber (1997) introduced the long short-term memory or LSTM network to resolve some of these difficulties. Today, the LSTM is widely used for many sequence modeling tasks, including many natural language processing tasks at Google.
The second wave of neural networks research lasted until the mid-1990s. Ventures based on neural networks and other AI technologies began to make unrealistically ambitious claims while seeking investments. When AI research did not fulfill these unreasonable expectations, investors were disappointed. Simultaneously, other fields of machine learning made advances. Kernel machines (Boser et al., 1992; Cortes and Vapnik, 1995; Schölkopf et al., 1999) and graphical models (Jordan, 1998) both achieved good results on many important tasks. These two factors led to a decline in the popularity of neural networks that lasted until 2007.
During this time, neural networks continued to obtain impressive performance on some tasks (LeCun et al., 1998b; Bengio et al., 2001). The Canadian Institute for Advanced Research (CIFAR) helped to keep neural networks research alive via its Neural Computation and Adaptive Perception (NCAP) research initiative.
This program united machine learning research groups led by Geoffrey Hinton at University of Toronto, Yoshua Bengio at University of Montreal, and Yann LeCun at New York University. The CIFAR NCAP research initiative had a multi-disciplinary nature that also included neuroscientists and experts in human and computer vision.
At this point in time, deep networks were generally believed to be very difficult to train. We now know that algorithms that have existed since the 1980s work quite well, but this was not apparent circa 2006. The issue is perhaps simply that these algorithms were too computationally costly to allow much experimentation with the hardware available at the time.
The third wave of neural networks research began with a breakthrough in 2006. Geoffrey Hinton showed that a kind of neural network called a deep belief network could be efficiently trained using a strategy called greedy layer-wise pretraining (Hinton et al., 2006), which will be described in more detail in section 15.1. The other CIFAR-affiliated research groups quickly showed that the same strategy could be used to train many other kinds of deep networks (Bengio et al., 2007; Ranzato et al., 2007a) and systematically helped to improve generalization on test examples. This wave of neural networks research popularized the use of the term “deep learning” to emphasize that researchers were now able to train deeper neural networks than had been possible before, and to focus attention on the theoretical importance of depth (Bengio and LeCun, 2007; Delalleau and Bengio, 2011; Pascanu et al., 2014a; Montufar et al., 2014). At this time, deep neural networks outperformed competing AI systems based on other machine learning technologies as well as hand-designed functionality. This third wave of popularity of neural networks continues to the time of this writing, though the focus of deep learning research has changed dramatically within the time of this wave. The third wave began with a focus on new unsupervised learning techniques and the ability of deep models to generalize well from small datasets, but today there is more interest in much older supervised learning algorithms and the ability of deep models to leverage large labeled datasets.
1.2.2 Increasing Dataset Sizes
One may wonder why deep learning has only recently become recognized as a crucial technology though the first experiments with artificial neural networks were conducted in the 1950s. Deep learning has been successfully used in commercial applications since the 1990s, but was often regarded as being more of an art than a technology and something that only an expert could use, until recently. It is true that some skill is required to get good performance from a deep learning algorithm. Fortunately, the amount of skill required reduces as the amount of training data increases. The learning algorithms reaching human performance on complex tasks today are nearly identical to the learning algorithms that struggled to solve toy problems in the 1980s, though the models we train with these algorithms have undergone changes that simplify the training of very deep architectures. The most important new development is that today we can provide these algorithms with the resources they need to succeed. Figure 1.8 shows how the size of benchmark datasets has increased remarkably over time. This trend is driven by the increasing digitization of society. As more and more of our activities take place on computers, more and more of what we do is recorded. As our computers are increasingly networked together, it becomes easier to centralize these records and curate them into a dataset appropriate for machine learning applications.

The age of “Big Data” has made machine learning much easier because the key burden of statistical estimation—generalizing well to new data after observing only a small amount of data—has been considerably lightened. As of 2016, a rough rule of thumb is that a supervised deep learning algorithm will generally achieve acceptable performance with around 5,000 labeled examples per category, and will match or exceed human performance when trained with a dataset containing at least 10 million labeled examples. Working successfully with datasets smaller than this is an important research area, focusing in particular on how we can take advantage of large quantities of unlabeled examples, with unsupervised or semi-supervised learning.
1.2.3 Increasing Model Sizes
Another key reason that neural networks are wildly successful today after enjoying comparatively little success since the 1980s is that we have the computational resources to run much larger models today. One of the main insights of connectionism is that animals become intelligent when many of their neurons work together. An individual neuron or small collection of neurons is not particularly useful.
Biological neurons are not especially densely connected. As seen in figure 1.10, our machine learning models have had a number of connections per neuron that was within an order of magnitude of even mammalian brains for decades.
In terms of the total number of neurons, neural networks have been astonishingly small until quite recently, as shown in figure 1.11. Since the introduction of hidden units, artificial neural networks have doubled in size roughly every 2.4 years. This growth is driven by faster computers with larger memory and by the availability of larger datasets. Larger networks are able to achieve higher accuracy on more complex tasks. This trend looks set to continue for decades. Unless new technologies allow faster scaling, artificial neural networks will not have the same number of neurons as the human brain until at least the 2050s. Biological neurons may represent more complicated functions than current artificial neurons, so biological neural networks may be even larger than this plot portrays.
In retrospect, it is not particularly surprising that neural networks with fewer neurons than a leech were unable to solve sophisticated artificial intelligence problems. Even today’s networks, which we consider quite large from a computational systems point of view, are smaller than the nervous system of even relatively primitive vertebrate animals like frogs.
The increase in model size over time, due to the availability of faster CPUs, the advent of general purpose GPUs (described in section 12.1.2), faster network connectivity and better software infrastructure for distributed computing, is one of the most important trends in the history of deep learning. This trend is generally expected to continue well into the future.
[Figure 1.8 plot: dataset size in number of examples (logarithmic scale) versus year, 1900–2015, with datasets including Iris, T vs. G vs. F, Rotated T vs. C, Criminals, MNIST, CIFAR-10, Public SVHN, ImageNet, ILSVRC 2014, ImageNet10k, Sports-1M, Canadian Hansard, and WMT.]
Figure 1.8: Dataset sizes have increased greatly over time. In the early 1900s, statisticians studied datasets using hundreds or thousands of manually compiled measurements (Garson, 1900; Gosset, 1908; Anderson, 1935; Fisher, 1936). In the 1950s through 1980s, the pioneers of biologically inspired machine learning often worked with small, synthetic datasets, such as low-resolution bitmaps of letters, that were designed to incur low computational cost and demonstrate that neural networks were able to learn specific kinds of functions (Widrow and Hoff, 1960; Rumelhart et al., 1986b). In the 1980s and 1990s, machine learning became more statistical in nature and began to leverage larger datasets containing tens of thousands of examples such as the MNIST dataset (shown in figure 1.9) of scans of handwritten numbers (LeCun et al., 1998b). In the first decade of the 2000s, more sophisticated datasets of this same size, such as the CIFAR-10 dataset (Krizhevsky and Hinton, 2009) continued to be produced. Toward the end of that decade and throughout the first half of the 2010s, significantly larger datasets, containing hundreds of thousands to tens of millions of examples, completely changed what was possible with deep learning. These datasets included the public Street View House Numbers dataset (Netzer et al., 2011), various versions of the ImageNet dataset (Deng et al., 2009, 2010a; Russakovsky et al., 2014a), and the Sports-1M dataset (Karpathy et al., 2014). At the top of the graph, we see that datasets of translated sentences, such as IBM’s dataset constructed from the Canadian Hansard (Brown et al., 1990) and the WMT 2014 English to French dataset (Schwenk, 2014) are typically far ahead of other dataset sizes.
Figure 1.9: Example inputs from the MNIST dataset. The “NIST” stands for National Institute of Standards and Technology, the agency that originally collected this data. The “M” stands for “modified,” since the data has been preprocessed for easier use with machine learning algorithms. The MNIST dataset consists of scans of handwritten digits and associated labels describing which digit 0–9 is contained in each image. This simple classification problem is one of the simplest and most widely used tests in deep learning research. It remains popular despite being quite easy for modern techniques to solve. Geoffrey Hinton has described it as “the drosophila of machine learning,” meaning that it allows machine learning researchers to study their algorithms in controlled laboratory conditions, much as biologists often study fruit flies.
1.2.4 Increasing Accuracy, Complexity and Real-World Impact
Since the 1980s, deep learning has consistently improved in its ability to provide accurate recognition or prediction. Moreover, deep learning has consistently been applied with success to broader and broader sets of applications.
The earliest deep models were used to recognize individual objects in tightly cropped, extremely small images (Rumelhart et al., 1986a). Since then there has been a gradual increase in the size of images neural networks could process. Modern object recognition networks process rich high-resolution photographs and do not have a requirement that the photo be cropped near the object to be recognized (Krizhevsky et al., 2012). Similarly, the earliest networks could only recognize two kinds of objects (or in some cases, the absence or presence of a single kind of object), while these modern networks typically recognize at least 1,000 different categories of objects. The largest contest in object recognition is the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) held each year. A dramatic moment in the meteoric rise of deep learning came when a convolutional network won this challenge for the first time and by a wide margin, bringing down the state-of-the-art top-5 error rate from 26.1% to 15.3% (Krizhevsky et al., 2012), meaning that the convolutional network produces a ranked list of possible categories for each image and the correct category appeared in the first five entries of this list for all but 15.3% of the test examples. Since then, these competitions are consistently won by deep convolutional nets, and as of this writing, advances in deep learning have brought the latest top-5 error rate in this contest down to 3.6%, as shown in figure 1.12.
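The top-5 error described above is straightforward to compute. The following sketch (our own; the scores and labels are toy values, not from the original text) shows one way to do it with NumPy:

```python
import numpy as np

def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true label is absent from the k
    highest-scoring categories."""
    top_k = np.argsort(scores, axis=1)[:, -k:]       # indices of the k best scores
    hits = (top_k == labels[:, None]).any(axis=1)    # true label among them?
    return 1.0 - hits.mean()

# Toy check: 3 examples, 10 categories, random scores.
rng = np.random.default_rng(0)
scores = rng.normal(size=(3, 10))
labels = np.array([2, 7, 7])
print(top_k_error(scores, labels, k=5))
```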
Deep learning has also had a dramatic impact on speech recognition. After improving throughout the 1990s, the error rates for speech recognition stagnated starting in about 2000. The introduction of deep learning (Dahl et al., 2010; Deng et al., 2010b; Seide et al., 2011; Hinton et al., 2012a) to speech recognition resulted in a sudden drop of error rates, with some error rates cut in half. We will explore this history in more detail in section 12.3.
Deep networks have also had spectacular successes for pedestrian detection and image segmentation (Sermanet et al., 2013; Farabet et al., 2013; Couprie et al., 2013) and yielded superhuman performance in traffic sign classification (Ciresan et al., 2012).
[Figure 1.10 plot: connections per neuron (logarithmic scale) versus year, 1950–2015, with networks 1–10 labeled and reference levels for fruit fly, mouse, cat, and human.]
Figure 1.10: Initially, the number of connections between neurons in artificial neural networks was limited by hardware capabilities. Today, the number of connections between neurons is mostly a design consideration. Some artificial neural networks have nearly as many connections per neuron as a cat, and it is quite common for other neural networks to have as many connections per neuron as smaller mammals like mice. Even the human brain does not have an exorbitant amount of connections per neuron. Biological neural network sizes from Wikipedia (2015).
1. Adaptive linear element (Widrow and Hoff, 1960)
2. Neocognitron (Fukushima, 1980)
3. GPU-accelerated convolutional network (Chellapilla et al., 2006)
4. Deep Boltzmann machine (Salakhutdinov and Hinton, 2009a)
5. Unsupervised convolutional network (Jarrett et al., 2009)
6. GPU-accelerated multilayer perceptron (Ciresan et al., 2010)
7. Distributed autoencoder (Le et al., 2012)
8. Multi-GPU convolutional network (Krizhevsky et al., 2012)
9. COTS HPC unsupervised convolutional network (Coates et al., 2013)
10. GoogLeNet (Szegedy et al., 2014a)
At the same time that the scale and accuracy of deep networks has increased, so has the complexity of the tasks that they can solve. Goodfellow et al. (2014d) showed that neural networks could learn to output an entire sequence of characters transcribed from an image, rather than just identifying a single object. Previously, it was widely believed that this kind of learning required labeling of the individual elements of the sequence (Gülçehre and Bengio, 2013). Recurrent neural networks, such as the LSTM sequence model mentioned above, are now used to model relationships between sequences and other sequences rather than just fixed inputs. This sequence-to-sequence learning seems to be on the cusp of revolutionizing another application: machine translation (Sutskever et al., 2014; Bahdanau et al., 2015).
This trend of increasing complexity has been pushed to its logical conclusion with the introduction of neural Turing machines (Graves et al., 2014a) that learn to read from memory cells and write arbitrary content to memory cells. Such neural networks can learn simple programs from examples of desired behavior. For example, they can learn to sort lists of numbers given examples of scrambled and sorted sequences. This self-programming technology is in its infancy, but in the future could in principle be applied to nearly any task.
Another crowning achievement of deep learning is its extension to the domain of reinforcement learning. In the context of reinforcement learning, an autonomous agent must learn to perform a task by trial and error, without any guidance from the human operator. DeepMind demonstrated that a reinforcement learning system based on deep learning is capable of learning to play Atari video games, reaching human-level performance on many tasks (Mnih et al., 2015). Deep learning has also significantly improved the performance of reinforcement learning for robotics (Finn et al., 2015).
Many of these applications of deep learning are highly profitable. Deep learning is now used by many top technology companies including Google, Microsoft, Facebook, IBM, Baidu, Apple, Adobe, Netflix, NVIDIA and NEC.
Advances in deep learning have also depended heavily on advances in software infrastructure. Software libraries such as Theano (Bergstra et al., 2010; Bastien et al., 2012), PyLearn2 (Goodfellow et al., 2013c), Torch (Collobert et al., 2011b), DistBelief (Dean et al., 2012), Caffe (Jia, 2013), MXNet (Chen et al., 2015), and TensorFlow (Abadi et al., 2015) have all supported important research projects or commercial products.
Deep learning has also made contributions back to other sciences. Modern convolutional networks for object recognition provide a model of visual processing that neuroscientists can study (DiCarlo, 2013). Deep learning also provides useful tools for processing massive amounts of data and making useful predictions in scientific fields. It has been successfully used to predict how molecules will interact in order to help pharmaceutical companies design new drugs (Dahl et al., 2014), to search for subatomic particles (Baldi et al., 2014), and to automatically parse microscope images used to construct a 3-D map of the human brain (Knowles-Barley et al., 2014). We expect deep learning to appear in more and more scientific fields in the future.
In summary, deep learning is an approach to machine learning that has drawn heavily on our knowledge of the human brain, statistics and applied math as it developed over the past several decades. In recent years, it has seen tremendous growth in its popularity and usefulness, due in large part to more powerful computers, larger datasets and techniques to train deeper networks. The years ahead are full of challenges and opportunities to improve deep learning even further and bring it to new frontiers.
[Figure 1.11 plot: number of neurons (logarithmic scale) versus year, 1950–2056, with networks 1–20 labeled and reference levels for sponge, roundworm, leech, ant, bee, frog, octopus, and human.]
Figure 1.11: Since the introduction of hidden units, artificial neural networks have doubled in size roughly every 2.4 years. Biological neural network sizes from Wikipedia (2015).
1. Perceptron (Rosenblatt, 1958, 1962)
2. Adaptive linear element (Widrow and Hoff, 1960)
3. Neocognitron (Fukushima, 1980)
4. Early back-propagation network (Rumelhart et al., 1986b)
5. Recurrent neural network for speech recognition (Robinson and Fallside, 1991)
6. Multilayer perceptron for speech recognition (Bengio et al., 1991)
7. Mean field sigmoid belief network (Saul et al., 1996)
8. LeNet-5 (LeCun et al., 1998b)
9. Echo state network (Jaeger and Haas, 2004)
10. Deep belief network (Hinton et al., 2006)
11. GPU-accelerated convolutional network (Chellapilla et al., 2006)
12. Deep Boltzmann machine (Salakhutdinov and Hinton, 2009a)
13. GPU-accelerated deep belief network (Raina et al., 2009)
14. Unsupervised convolutional network (Jarrett et al., 2009)
15. GPU-accelerated multilayer perceptron (Ciresan et al., 2010)
16. OMP-1 network (Coates and Ng, 2011)
17. Distributed autoencoder (Le et al., 2012)
18. Multi-GPU convolutional network (Krizhevsky et al., 2012)
19. COTS HPC unsupervised convolutional network (Coates et al., 2013)
20. GoogLeNet (Szegedy et al., 2014a)
[Figure 1.12 plot: ILSVRC classification error rate (0.00–0.30) versus year, 2010–2015.]
Figure 1.12: Since deep networks reached the scale necessary to compete in the ImageNet Large Scale Visual Recognition Challenge, they have consistently won the competition every year, and yielded lower and lower error rates each time. Data from Russakovsky et al. (2014b) and He et al. (2015).
Part I
Applied Math and Machine Learning Basics
This part of the book introduces the basic mathematical concepts needed to understand deep learning. We begin with general ideas from applied math that allow us to define functions of many variables, find the highest and lowest points on these functions and quantify degrees of belief.
Next, we describe the fundamental goals of machine learning. We describe how to accomplish these goals by specifying a model that represents certain beliefs, designing a cost function that measures how well those beliefs correspond with reality and using a training algorithm to minimize that cost function.
This elementary framework is the basis for a broad variety of machine learning algorithms, including approaches to machine learning that are not deep. In the subsequent parts of the book, we develop deep learning algorithms within this framework.
Chapter 2
Linear Algebra
Linear algebra is a branch of mathematics that is widely used throughout science and engineering. However, because linear algebra is a form of continuous rather than discrete mathematics, many computer scientists have little experience with it. A good understanding of linear algebra is essential for understanding and working with many machine learning algorithms, especially deep learning algorithms. We therefore precede our introduction to deep learning with a focused presentation of the key linear algebra prerequisites.
If you are already familiar with linear algebra, feel free to skip this chapter. If you have previous experience with these concepts but need a detailed reference sheet to review key formulas, we recommend The Matrix Cookbook (Petersen and Pedersen, 2006). If you have no exposure at all to linear algebra, this chapter will teach you enough to read this book, but we highly recommend that you also consult another resource focused exclusively on teaching linear algebra, such as Shilov (1977). This chapter will completely omit many important linear algebra topics that are not essential for understanding deep learning.
2.1 Scalars, Vectors, Matrices and Tensors
The study of linear algebra involves several types of mathematical objects:
• Scalars: A scalar is just a single number, in contrast to most of the other objects studied in linear algebra, which are usually arrays of multiple numbers. We write scalars in italics. We usually give scalars lower-case variable names. When we introduce them, we specify what kind of number they are. For example, we might say “Let s ∈ R be the slope of the line,” while defining a real-valued scalar, or “Let n ∈ N be the number of units,” while defining a natural number scalar.
• Vectors: A vector is an array of numbers. The numbers are arranged in order. We can identify each individual number by its index in that ordering. Typically we give vectors lower case names written in bold typeface, such as x. The elements of the vector are identified by writing its name in italic typeface, with a subscript. The first element of x is x1, the second element is x2 and so on. We also need to say what kind of numbers are stored in the vector. If each element is in R, and the vector has n elements, then the vector lies in the set formed by taking the Cartesian product of R n times, denoted as Rn. When we need to explicitly identify the elements of a vector, we write them as a column enclosed in square brackets:
$$ x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}. \tag{2.1} $$
We can think of vectors as identifying points in space, with each element giving the coordinate along a different axis.
Sometimes we need to index a set of elements of a vector. In this case, we define a set containing the indices and write the set as a subscript. For example, to access x1, x3 and x6, we define the set S = {1, 3, 6} and write xS. We use the − sign to index the complement of a set. For example x−1 is the vector containing all elements of x except for x1, and x−S is the vector containing all of the elements of x except for x1, x3 and x6.
• Matrices: A matrix is a 2-D array of numbers, so each element is identified by two indices instead of just one. We usually give matrices upper-case variable names with bold typeface, such as A. If a real-valued matrix A has a height of m and a width of n, then we say that A ∈ Rm×n. We usually identify the elements of a matrix using its name in italic but not bold font, and the indices are listed with separating commas. For example, A1,1 is the upper left entry of A and Am,n is the bottom right entry. We can identify all of the numbers with vertical coordinate i by writing a “:” for the horizontal coordinate. For example, Ai,: denotes the horizontal cross section of A with vertical coordinate i. This is known as the i-th row of A. Likewise, A:,i is the i-th column of A. When we need to explicitly identify the elements of a matrix, we write them as an array enclosed in square brackets:

$$ \begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \end{bmatrix}. \tag{2.2} $$

$$ A = \begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \\ A_{3,1} & A_{3,2} \end{bmatrix} \quad\Rightarrow\quad A^T = \begin{bmatrix} A_{1,1} & A_{2,1} & A_{3,1} \\ A_{1,2} & A_{2,2} & A_{3,2} \end{bmatrix} $$

Figure 2.1: The transpose of the matrix can be thought of as a mirror image across the main diagonal.
Sometimes we may need to index matrix-valued expressions that are not just a single letter. In this case, we use subscripts after the expression, but do not convert anything to lower case. For example, f(A)i,j gives element (i, j) of the matrix computed by applying the function f to A.
• Tensors: In some cases we will need an array with more than two axes. In the general case, an array of numbers arranged on a regular grid with a variable number of axes is known as a tensor. We denote a tensor named “A” with this typeface: A. We identify the element of A at coordinates (i, j, k) by writing Ai,j,k.
One important operation on matrices is the transpose. The transpose of a matrix is the mirror image of the matrix across a diagonal line, called the main diagonal, running down and to the right, starting from its upper left corner. See figure 2.1 for a graphical depiction of this operation. We denote the transpose of a matrix A as A^T, and it is defined such that

(A^T)i,j = Aj,i. (2.3)
Vectors can be thought of as matrices that contain only one column. The transpose of a vector is therefore a matrix with only one row. Sometimes we define a vector by writing out its elements in the text inline as a row matrix, then using the transpose operator to turn it into a standard column vector, e.g., x = [x1, x2, x3]^T.
A scalar can be thought of as a matrix with only a single entry. From this, we can see that a scalar is its own transpose: a = a^T.
We can add matrices to each other, as long as they have the same shape, just by adding their corresponding elements: C = A + B where Ci,j = Ai,j + Bi,j. We can also add a scalar to a matrix or multiply a matrix by a scalar, just by performing that operation on each element of a matrix: D = a · B + c where Di,j = a · Bi,j + c.
In the context of deep learning, we also use some less conventional notation. We allow the addition of a matrix and a vector, yielding another matrix: C = A + b, where Ci,j = Ai,j + bj. In other words, the vector b is added to each row of the matrix. This shorthand eliminates the need to define a matrix with b copied into each row before doing the addition. This implicit copying of b to many locations is called broadcasting.
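NumPy implements this broadcasting convention, which we can use to illustrate the shorthand (a minimal sketch, ours, with arbitrary values):

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])
b = np.array([10., 20., 30.])

C = A + b   # b is implicitly copied into each row: C[i, j] = A[i, j] + b[j]
print(C)
# [[11. 22. 33.]
#  [14. 25. 36.]]
```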
2.2 Multiplying Matrices and Vectors
One of the most important operations involving matrices is multiplication of two matrices. The matrix product of matrices A and B is a third matrix C. In order for this product to be defined, A must have the same number of columns as B has rows. If A is of shape m × n and B is of shape n × p, then C is of shape m × p. We can write the matrix product just by placing two or more matrices together, e.g.
C = AB. (2.4)
The product operation is defined by

$$ C_{i,j} = \sum_k A_{i,k} B_{k,j}. \tag{2.5} $$
Note that the standard product of two matrices is not just a matrix containing the product of the individual elements. Such an operation exists and is called the element-wise product or Hadamard product, and is denoted as A ⊙ B.
The dot product between two vectors x and y of the same dimensionality is the matrix product x^T y. We can think of the matrix product C = AB as computing Ci,j as the dot product between row i of A and column j of B.
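A small sketch (ours, not from the text) checking equation 2.5 and the row-column dot product view against NumPy's built-in matrix product:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))

C = A @ B                                    # matrix product; shape (2, 4)

# Equation 2.5: C[i, j] = sum_k A[i, k] * B[k, j],
# i.e. the dot product of row i of A with column j of B.
C_loop = np.array([[A[i, :] @ B[:, j] for j in range(4)]
                   for i in range(2)])
print(np.allclose(C, C_loop))                # True

# The element-wise (Hadamard) product is different and needs equal shapes:
print((A * A).shape)                         # (2, 3)
```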
Matrix product operations have many useful properties that make mathematical analysis of matrices more convenient. For example, matrix multiplication is distributive:
A(B + C) = AB + AC. (2.6)
It is also associative:
A(BC) = (AB)C. (2.7)
Matrix multiplication is not commutative (the condition AB = BA does not always hold), unlike scalar multiplication. However, the dot product between two vectors is commutative:
xy = yx. (2.8)
The transpose of a matrix product has a simple form:
(AB)^T = B^T A^T. (2.9)
This allows us to demonstrate equation 2.8, by exploiting the fact that the value of such a product is a scalar and therefore equal to its own transpose: x^T y = (x^T y)^T = y^T x. (2.10)
Since the focus of this textbook is not linear algebra, we do not attempt to develop a comprehensive list of useful properties of the matrix product here, but the reader should be aware that many more exist.
We now know enough linear algebra notation to write down a system of linear equations:
Ax = b (2.11)
where A ∈ Rm×n is a known matrix, b ∈ Rm is a known vector, and x ∈ Rn is a vector of unknown variables we would like to solve for. Each element xi of x is one of these unknown variables. Each row of A and each element of b provide another constraint. We can rewrite equation 2.11 as:
A1,:x = b1 (2.12)
A2,:x = b2 (2.13)
. . . (2.14)
Am,:x = bm (2.15)
or, even more explicitly, as:
A1,1x1 + A1,2x2 + · · · + A1,nxn = b1 (2.16)
1 0 0
0 1 0
0 0 1
Figure 2.2: Example identity matrix: This is I3.
A2,1x1 + A2,2x2 + · · · + A2,nxn = b2 (2.17)
. . . (2.18)
Am,1x1 + Am,2x2 + · · · + Am,nxn = bm. (2.19)
Matrix-vector product notation provides a more compact representation for equations of this form.
2.3 Identity and Inverse Matrices
Linear algebra offers a powerful tool called matrix inversion that allows us to analytically solve equation 2.11 for many values of A.
To describe matrix inversion, we first need to define the concept of an identity matrix. An identity matrix is a matrix that does not change any vector when we multiply that vector by that matrix. We denote the identity matrix that preserves n-dimensional vectors as In. Formally, In ∈ Rn×n, and
∀x ∈ Rn, Inx = x. (2.20)
The structure of the identity matrix is simple: all of the entries along the main diagonal are 1, while all of the other entries are zero. See figure 2.2 for an example. The matrix inverse of A is denoted as A−1, and it is defined as the matrix such that
A−1A = In. (2.21)
We can now solve equation 2.11 by the following steps:
Ax = b (2.22)
A−1Ax = A−1b (2.23)
Inx = A−1b (2.24)
x = A−1b. (2.25)
Of course, this process depends on it being possible to find A−1. We discuss the conditions for the existence of A−1 in the following section. When A−1 exists, several different algorithms exist for finding it in closed form. In theory, the same inverse matrix can then be used to solve the equation many times for different values of b. However, A−1 is primarily useful as a theoretical tool, and should not actually be used in practice for most software applications. Because A−1 can be represented with only limited precision on a digital computer, algorithms that make use of the value of b can usually obtain more accurate estimates of x.
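The following sketch (our own illustration, with an arbitrary random system) contrasts the two approaches in NumPy; np.linalg.solve factors A and uses b directly rather than forming A−1 explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
b = rng.normal(size=4)

x_inv   = np.linalg.inv(A) @ b       # via the explicit inverse (theoretical tool)
x_solve = np.linalg.solve(A, b)      # factorization-based, uses b directly

print(np.allclose(A @ x_inv, b))     # True
print(np.allclose(A @ x_solve, b))   # True
```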
2.4 Linear Dependence and Span
In order for A−1 to exist, equation 2.11 must have exactly one solution for every value of b. However, it is also possible for the system of equations to have no solutions or infinitely many solutions for some values of b. It is not possible to have more than one but less than infinitely many solutions for a particular b; if both x and y are solutions then
z = αx + (1 − α)y (2.26)
is also a solution for any real α.
To analyze how many solutions the equation has, we can think of the columns of A as specifying different directions we can travel from the origin (the point specified by the vector of all zeros), and determine how many ways there are of reaching b. In this view, each element of x specifies how far we should travel in each of these directions, with xi specifying how far to move in the direction of column i:

$$ Ax = \sum_i x_i A_{:,i}. \tag{2.27} $$
In general, this kind of operation is called a linear combination. Formally, a linear combination of some set of vectors {v(1), . . . , v(n)} is given by multiplying each vector v(i) by a corresponding scalar coefficient and adding the results:

$$ \sum_i c_i v^{(i)}. \tag{2.28} $$
The span of a set of vectors is the set of all points obtainable by linear combination of the original vectors.
Determining whether Ax = b has a solution thus amounts to testing whether b is in the span of the columns of A. This particular span is known as the column space or the range of A.
In order for the system Ax = b to have a solution for all values of b ∈ Rm, we therefore require that the column space of A be all of Rm. If any point in Rm is excluded from the column space, that point is a potential value of b that has no solution. The requirement that the column space of A be all of Rm implies immediately that A must have at least m columns, i.e., n ≥ m. Otherwise, the dimensionality of the column space would be less than m. For example, consider a 3 × 2 matrix. The target b is 3-D, but x is only 2-D, so modifying the value of x at best allows us to trace out a 2-D plane within R3. The equation has a solution if and only if b lies on that plane.
Having n ≥ m is only a necessary condition for every point to have a solution. It is not a sufficient condition, because it is possible for some of the columns to be redundant. Consider a 2 × 2 matrix where both of the columns are identical. This has the same column space as a 2 × 1 matrix containing only one copy of the replicated column. In other words, the column space is still just a line, and fails to encompass all of R2, even though there are two columns.
Formally, this kind of redundancy is known as linear dependence. A set of vectors is linearly independent if no vector in the set is a linear combination of the other vectors. If we add a vector to a set that is a linear combination of the other vectors in the set, the new vector does not add any points to the set’s span. This means that for the column space of the matrix to encompass all of Rm, the matrix must contain at least one set of m linearly independent columns. This condition is both necessary and sufficient for equation 2.11 to have a solution for every value of b. Note that the requirement is for a set to have exactly m linearly independent columns, not at least m. No set of m-dimensional vectors can have more than m mutually linearly independent columns, but a matrix with more than m columns may have more than one such set.
In order for the matrix to have an inverse, we additionally need to ensure that equation 2.11 has at most one solution for each value of b. To do so, we need to ensure that the matrix has at most m columns. Otherwise there is more than one way of parametrizing each solution.
Together, this means that the matrix must be square, that is, we require that m = n and that all of the columns must be linearly independent. A square matrix with linearly dependent columns is known as singular.
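One concrete way to test whether b lies in the span of the columns of A is to check that appending b to A does not increase the matrix rank. A sketch (our own, using NumPy's rank routine; the example matrix is arbitrary):

```python
import numpy as np

A = np.array([[1., 2.],
              [2., 4.],
              [0., 1.]])           # 3x2: the columns span at most a plane in R^3

def in_column_space(A, b):
    """b is in the span of A's columns iff appending b leaves the rank unchanged."""
    return np.linalg.matrix_rank(np.column_stack([A, b])) == np.linalg.matrix_rank(A)

print(in_column_space(A, A @ np.array([1., 3.])))   # True: b built from the columns
print(in_column_space(A, np.array([1., 0., 0.])))   # False: b lies off the plane
```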
If A is not square or is square but singular, it can still be possible to solve the equation. However, we cannot use the method of matrix inversion to find the solution.
So far we have discussed matrix inverses as being multiplied on the left. It is also possible to define an inverse that is multiplied on the right:
AA−1 = I. (2.29)
For square matrices, the left inverse and right inverse are equal.
2.5 Norms
Sometimes we need to measure the size of a vector. In machine learning, we usually measure the size of vectors using a function called a norm. Formally, the Lp norm is given by

$$ \|x\|_p = \left( \sum_i |x_i|^p \right)^{1/p} \tag{2.30} $$

for p ∈ R, p ≥ 1.
Norms, including the Lp norm, are functions mapping vectors to non-negative values. On an intuitive level, the norm of a vector x measures the distance from the origin to the point x. More rigorously, a norm is any function f that satisfies the following properties:
• f (x) = 0 ⇒ x = 0
• f (x + y) ≤ f(x) + f (y) (the triangle inequality)
• ∀α ∈ R, f (αx) = |α|f (x)
The L2 norm, with p = 2, is known as the Euclidean norm. It is simply the Euclidean distance from the origin to the point identified by x. The L2 norm is used so frequently in machine learning that it is often denoted simply as ||x||, with the subscript 2 omitted. It is also common to measure the size of a vector using the squared L2 norm, which can be calculated simply as x^T x.
The squared L2 norm is more convenient to work with mathematically and computationally than the L2 norm itself. For example, the derivatives of the squared L2 norm with respect to each element of x each depend only on the corresponding element of x, while all of the derivatives of the L2 norm depend on the entire vector. In many contexts, the squared L2 norm may be undesirable because it increases very slowly near the origin. In several machine learning applications, it is important to discriminate between elements that are exactly zero and elements that are small but nonzero. In these cases, we turn to a function that grows at the same rate in all locations, but retains mathematical simplicity: the L1 norm. The L1 norm may be simplified to

$$ \|x\|_1 = \sum_i |x_i|. \tag{2.31} $$
The L1 norm is commonly used in machine learning when the difference between zero and nonzero elements is very important. Every time an element of x moves away from 0 by ε, the L1 norm increases by ε.
We sometimes measure the size of the vector by counting its number of nonzero elements. Some authors refer to this function as the “L0 norm,” but this is incorrect terminology. The number of non-zero entries in a vector is not a norm, because scaling the vector by α does not change the number of nonzero entries. The L1 norm is often used as a substitute for the number of nonzero entries.
One other norm that commonly arises in machine learning is the L∞ norm, also known as the max norm. This norm simplifies to the absolute value of the element with the largest magnitude in the vector,

$$ \|x\|_\infty = \max_i |x_i|. \tag{2.32} $$
Sometimes we may also wish to measure the size of a matrix. In the context of deep learning, the most common way to do this is with the otherwise obscure Frobenius norm:

$$ \|A\|_F = \sqrt{\sum_{i,j} A_{i,j}^2}, \tag{2.33} $$
which is analogous to the L2 norm of a vector.
The dot product of two vectors can be rewritten in terms of norms. Specifically, x^T y = ||x||2 ||y||2 cos θ (2.34)
where θ is the angle between x and y.
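The norms of this section, and equation 2.34, can be checked numerically (a minimal sketch, ours, with arbitrary example vectors):

```python
import numpy as np

x = np.array([3., -4., 0.])
y = np.array([1., 1., 0.])

print(np.linalg.norm(x, 1))          # L1 norm: 7.0
print(np.linalg.norm(x, 2))          # L2 (Euclidean) norm: 5.0
print(x @ x)                         # squared L2 norm: 25.0
print(np.linalg.norm(x, np.inf))     # max norm: 4.0

A = np.array([[1., 2.],
              [3., 4.]])
print(np.linalg.norm(A, 'fro'))      # Frobenius norm: sqrt(30), about 5.477

# Equation 2.34 rearranged: the cosine of the angle between x and y.
cos_theta = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(cos_theta)                     # about -0.141
```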
2.6 Special Kinds of Matrices and Vectors
Some special kinds of matrices and vectors are particularly useful. Diagonal matrices consist mostly of zeros and have non-zero entries only along the main diagonal. Formally, a matrix D is diagonal if and only if Di,j = 0 for all i ≠ j. We have already seen one example of a diagonal matrix: the identity matrix, where all of the diagonal entries are 1. We write diag(v) to denote a square diagonal matrix whose diagonal entries are given by the entries of the vector v. Diagonal matrices are of interest in part because multiplying by a diagonal matrix is very computationally efficient. To compute diag(v)x, we only need to scale each element xi by vi. In other words, diag(v)x = v ⊙ x. Inverting a square diagonal matrix is also efficient. The inverse exists only if every diagonal entry is nonzero, and in that case, diag(v)−1 = diag([1/v1, . . . , 1/vn]). In many cases, we may derive some very general machine learning algorithm in terms of arbitrary matrices, but obtain a less expensive (and less descriptive) algorithm by restricting some matrices to be diagonal.
Not all diagonal matrices need be square. It is possible to construct a rectangular diagonal matrix. Non-square diagonal matrices do not have inverses but it is still possible to multiply by them cheaply. For a non-square diagonal matrix D, the product Dx will involve scaling each element of x, and either concatenating some zeros to the result if D is taller than it is wide, or discarding some of the last elements of the vector if D is wider than it is tall.
A symmetric matrix is any matrix that is equal to its own transpose: A = A^T. (2.35)
Symmetric matrices often arise when the entries are generated by some function of two arguments that does not depend on the order of the arguments. For example, if A is a matrix of distance measurements, with Ai,j giving the distance from point i to point j, then Ai,j = Aj,i because distance functions are symmetric.

A unit vector is a vector with unit norm:
||x||2 = 1. (2.36)
A vector x and a vector y are orthogonal to each other if x^T y = 0. If both vectors have nonzero norm, this means that they are at a 90 degree angle to each other. In Rn, at most n vectors may be mutually orthogonal with nonzero norm. If the vectors are not only orthogonal but also have unit norm, we call them orthonormal.
An orthogonal matrix is a square matrix whose rows are mutually orthonormal and whose columns are mutually orthonormal:

A^T A = AA^T = I. (2.37)
This implies that
A−1 = A^T, (2.38)
so orthogonal matrices are of interest because their inverse is very cheap to compute. Pay careful attention to the definition of orthogonal matrices. Counterintuitively, their rows are not merely orthogonal but fully orthonormal. There is no special term for a matrix whose rows or columns are orthogonal but not orthonormal.
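A rotation matrix is a convenient concrete example of an orthogonal matrix; the sketch below (ours) checks both properties:

```python
import numpy as np

theta = 0.3                          # a 2-D rotation matrix is orthogonal
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.allclose(Q.T @ Q, np.eye(2)))      # rows and columns are orthonormal
print(np.allclose(np.linalg.inv(Q), Q.T))   # the inverse is just the transpose
```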
2.7 Eigendecomposition
Many mathematical objects can be understood better by breaking them into constituent parts, or finding some properties of them that are universal, not caused by the way we choose to represent them.
For example, integers can be decomposed into prime factors. The way we represent the number 12 will change depending on whether we write it in base ten or in binary, but it will always be true that 12 = 2×2 ×3. From this representation we can conclude useful properties, such as that 12 is not divisible by 5, or that any integer multiple of 12 will be divisible by 3.
Much as we can discover something about the true nature of an integer by decomposing it into prime factors, we can also decompose matrices in ways that show us information about their functional properties that is not obvious from the representation of the matrix as an array of elements.
One of the most widely used kinds of matrix decomposition is called eigendecomposition, in which we decompose a matrix into a set of eigenvectors and eigenvalues.
An eigenvector of a square matrix A is a non-zero vector v such that multiplication by A alters only the scale of v:
Av = λv. (2.39)
The scalar λ is known as the eigenvalue corresponding to this eigenvector. (One can also find a left eigenvector such that v⊤A = λv⊤, but we are usually concerned with right eigenvectors.)
If v is an eigenvector of A, then so is any rescaled vector sv for s ∈ R, s ≠ 0. Moreover, sv still has the same eigenvalue. For this reason, we usually only look for unit eigenvectors.
Suppose that a matrix A has n linearly independent eigenvectors, {v(1), . . . , v(n)}, with corresponding eigenvalues {λ1, . . . , λn}. We may concatenate all of the
Figure 2.3: An example of the effect of eigenvectors and eigenvalues. Here, we have a matrix A with two orthonormal eigenvectors, v(1) with eigenvalue λ1 and v(2) with eigenvalue λ2. (Left) We plot the set of all unit vectors u ∈ R2 as a unit circle. (Right) We plot the set of all points Au. By observing the way that A distorts the unit circle, we can see that it scales space in direction v(i) by λi.
eigenvectors to form a matrix V with one eigenvector per column: V = [v(1), . . . , v(n)]. Likewise, we can concatenate the eigenvalues to form a vector λ = [λ1 , . . . , λn]. The eigendecomposition of A is then given by
A = V diag(λ)V−1. (2.40)
We have seen that constructing matrices with specific eigenvalues and eigenvec- tors allows us to stretch space in desired directions. However, we often want to decompose matrices into their eigenvalues and eigenvectors. Doing so can help us to analyze certain properties of the matrix, much as decomposing an integer into its prime factors can help us understand the behavior of that integer.
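A minimal NumPy sketch of equation 2.40 (the matrix A here is an illustrative assumption):

import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigenvalues, V = np.linalg.eig(A)  # columns of V are eigenvectors of A

# Check Av = λv for each eigenpair:
for lam, v in zip(eigenvalues, V.T):
    assert np.allclose(A @ v, lam * v)

# Reconstruct A = V diag(λ) V⁻¹:
assert np.allclose(A, V @ np.diag(eigenvalues) @ np.linalg.inv(V))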
Not every matrix can be decomposed into eigenvalues and eigenvectors. In some
cases, the decomposition exists, but may involve complex rather than real numbers. Fortunately, in this book, we usually need to decompose only a specific class of matrices that have a simple decomposition. Specifically, every real symmetric matrix can be decomposed into an expression using only real-valued eigenvectors and eigenvalues:
A = QΛQ⊤, (2.41)
where Q is an orthogonal matrix composed of eigenvectors of A, and Λ is a diagonal matrix. The eigenvalue Λi,i is associated with the eigenvector in column i of Q, denoted as Q:,i. Because Q is an orthogonal matrix, we can think of A as scaling space by λi in direction v(i). See figure 2.3 for an example.
While any real symmetric matrix A is guaranteed to have an eigendecomposition, the eigendecomposition may not be unique. If any two or more eigenvectors share the same eigenvalue, then any set of orthogonal vectors lying in their span are also eigenvectors with that eigenvalue, and we could equivalently choose a Q using those eigenvectors instead. By convention, we usually sort the entries of Λ in descending order. Under this convention, the eigendecomposition is unique only if all of the eigenvalues are unique.
The eigendecomposition of a matrix tells us many useful facts about the matrix. The matrix is singular if and only if any of the eigenvalues are zero. The eigendecomposition of a real symmetric matrix can also be used to optimize quadratic expressions of the form f(x) = x⊤Ax subject to ||x||₂ = 1. Whenever x is equal to an eigenvector of A, f takes on the value of the corresponding eigenvalue. The maximum value of f within the constraint region is the maximum eigenvalue and its minimum value within the constraint region is the minimum eigenvalue.
A matrix whose eigenvalues are all positive is called positive definite. A matrix whose eigenvalues are all positive or zero-valued is called positive semidefinite. Likewise, if all eigenvalues are negative, the matrix is negative definite, and if all eigenvalues are negative or zero-valued, it is negative semidefinite. Positive semidefinite matrices are interesting because they guarantee that ∀x, x⊤Ax ≥ 0. Positive definite matrices additionally guarantee that x⊤Ax = 0 ⇒ x = 0.
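A short NumPy sketch of these definitions (the example matrix is an assumption):

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues = np.linalg.eigvalsh(A)  # eigvalsh is for symmetric matrices
if np.all(eigenvalues > 0):
    print("positive definite")
elif np.all(eigenvalues >= 0):
    print("positive semidefinite")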
2.8 Singular Value Decomposition
In section 2.7, we saw how to decompose a matrix into eigenvectors and eigenvalues. The singular value decomposition (SVD) provides another way to factorize a matrix, into singular vectors and singular values. The SVD allows us to discover some of the same kind of information as the eigendecomposition. However,
the SVD is more generally applicable. Every real matrix has a singular value decomposition, but the same is not true of the eigenvalue decomposition. For example, if a matrix is not square, the eigendecomposition is not defined, and we must use a singular value decomposition instead.
Recall that the eigendecomposition involves analyzing a matrix A to discover a matrix V of eigenvectors and a vector of eigenvalues λ such that we can rewrite A as
A = V diag(λ)V−1. (2.42)
The singular value decomposition is similar, except this time we will write A as a product of three matrices:
A = UDV⊤. (2.43)
Suppose that A is an m ×n matrix. Then U is defined to be an m ×m matrix, D to be an m × n matrix, and V to be an n × n matrix.
Each of these matrices is defined to have a special structure. The matrices U and V are both defined to be orthogonal matrices. The matrix D is defined to be a diagonal matrix. Note that D is not necessarily square.
The elements along the diagonal of D are known as the singular values of the matrix A. The columns of U are known as the left-singular vectors. The columns of V are known as the right-singular vectors.
We can actually interpret the singular value decomposition of A in terms of the eigendecomposition of functions of A. The left-singular vectors of A are the eigenvectors of AA⊤. The right-singular vectors of A are the eigenvectors of A⊤A. The non-zero singular values of A are the square roots of the eigenvalues of A⊤A. The same is true for AA⊤.
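A minimal NumPy sketch of these relationships (the matrix is an illustrative assumption):

import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0]])  # a 2 x 3 (non-square) matrix

U, s, Vt = np.linalg.svd(A)      # s holds the singular values

# Reconstruct A = U D Vᵀ by padding s into the rectangular diagonal D:
D = np.zeros(A.shape)
D[:len(s), :len(s)] = np.diag(s)
assert np.allclose(A, U @ D @ Vt)

# The non-zero singular values are square roots of eigenvalues of AᵀA:
eigvals_desc = np.linalg.eigvalsh(A.T @ A)[::-1]
assert np.allclose(s, np.sqrt(eigvals_desc[:len(s)]))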
Perhaps the most useful feature of the SVD is that we can use it to partially generalize matrix inversion to non-square matrices, as we will see in the next section.
2.9 The Moore-Penrose Pseudoinverse
Matrix inversion is not defined for matrices that are not square. Suppose we want to make a left-inverse B of a matrix A, so that we can solve a linear equation
Ax = y (2.44)
by left-multiplying each side to obtain
x = By. (2.45)
Depending on the structure of the problem, it may not be possible to design a unique mapping from A to B.
If A is taller than it is wide, then it is possible for this equation to have no solution. If A is wider than it is tall, then there could be multiple possible solutions.
The Moore-Penrose pseudoinverse allows us to make some headway in these cases. The pseudoinverse of A is defined as a matrix
A⁺ = lim α→0 (A⊤A + αI)⁻¹A⊤. (2.46)
Practical algorithms for computing the pseudoinverse are not based on this definition, but rather the formula
A⁺ = VD⁺U⊤, (2.47)
where U, D and V are the matrices of the singular value decomposition of A, and the pseudoinverse D⁺ of a diagonal matrix D is obtained by taking the reciprocal of its non-zero elements and then taking the transpose of the resulting matrix.
When A has more columns than rows, then solving a linear equation using the pseudoinverse provides one of the many possible solutions. Specifically, it provides the solution x = A+ y with minimal Euclidean norm ||x||2 among all possible solutions.
When A has more rows than columns, it is possible for there to be no solution. In this case, using the pseudoinverse gives us the x for which Ax is as close as possible to y in terms of Euclidean norm ||Ax − y||2.
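A small NumPy sketch of both cases (np.linalg.pinv computes the pseudoinverse via the SVD; the matrices are illustrative assumptions):

import numpy as np

# Wide matrix: Ax = y has many solutions; pinv picks the minimum-norm one.
A_wide = np.array([[1.0, 2.0, 3.0]])
y = np.array([6.0])
x = np.linalg.pinv(A_wide) @ y
assert np.allclose(A_wide @ x, y)

# Tall matrix: Ax = y may have no solution; pinv gives the least-squares x.
A_tall = np.array([[1.0], [1.0]])
y_tall = np.array([0.0, 2.0])
x_ls = np.linalg.pinv(A_tall) @ y_tall  # x_ls ≈ 1.0, minimizing ||Ax − y||₂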
2.10 The Trace Operator
The trace operator gives the sum of all of the diagonal entries of a matrix:

Tr(A) = Σi Ai,i. (2.48)
The trace operator is useful for a variety of reasons. Some operations that are difficult to specify without resorting to summation notation can be specified using
matrix products and the trace operator. For example, the trace operator provides an alternative way of writing the Frobenius norm of a matrix:

||A||F = √(Tr(AA⊤)). (2.49)
Writing an expression in terms of the trace operator opens up opportunities to manipulate the expression using many useful identities. For example, the trace operator is invariant to the transpose operator:
Tr(A) = Tr(A). (2.50)
The trace of a square matrix composed of many factors is also invariant to moving the last factor into the first position, if the shapes of the corresponding matrices allow the resulting product to be defined:
Tr(ABC) = Tr(CAB) = Tr(BCA) (2.51)
or more generally,
Tr(n
n−1
F(i)) = Tr(F(n)
F(i) ). (2.52)
i=1
i=1
This invariance to cyclic permutation holds even if the resulting product has a different shape. For example, for A ∈ Rm×n and B ∈ Rn×m, we have
Tr(AB) = Tr(BA) (2.53)
even though AB ∈ Rm×m and BA ∈ Rn×n.
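A quick NumPy check of equation 2.53 (random matrices as an illustrative assumption):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 2))  # A ∈ R^{4×2}
B = rng.standard_normal((2, 4))  # B ∈ R^{2×4}

# AB is 4×4 and BA is 2×2, yet their traces agree:
assert np.isclose(np.trace(A @ B), np.trace(B @ A))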
Another useful fact to keep in mind is that a scalar is its own trace: a = Tr(a).
2.11 The Determinant
The determinant of a square matrix, denoted det(A), is a function mapping matrices to real scalars. The determinant is equal to the product of all the eigenvalues of the matrix. The absolute value of the determinant can be thought of as a measure of how much multiplication by the matrix expands or contracts space. If the determinant is 0, then space is contracted completely along at least
one dimension, causing it to lose all of its volume. If the determinant is 1, then the transformation preserves volume.
2.12 Example: Principal Components Analysis
One simple machine learning algorithm, principal components analysis (PCA), can be derived using only knowledge of basic linear algebra.
Suppose we have a collection of m points {x(1), . . . , x(m)} in Rn. Suppose we would like to apply lossy compression to these points. Lossy compression means storing the points in a way that requires less memory but may lose some precision. We would like to lose as little precision as possible.
One way we can encode these points is to represent a lower-dimensional version of them. For each point x(i) ∈ Rn we will find a corresponding code vector c(i) ∈ Rl. If l is smaller than n, it will take less memory to store the code points than the original data. We will want to find some encoding function that produces the code for an input, f(x) = c, and a decoding function that produces the reconstructed input given its code, x ≈ g(f (x)).
PCA is defined by our choice of the decoding function. Specifically, to make the decoder very simple, we choose to use matrix multiplication to map the code back into Rn. Let g(c) = Dc, where D ∈ Rn×l is the matrix defining the decoding.
Computing the optimal code for this decoder could be a difficult problem. To keep the encoding problem easy, PCA constrains the columns of D to be orthogonal to each other. (Note that D is still not technically “an orthogonal matrix” unless l = n.)
With the problem as described so far, many solutions are possible, because we can increase the scale of D:,i if we decrease ci proportionally for all points. To give the problem a unique solution, we constrain all of the columns of D to have unit norm.
In order to turn this basic idea into an algorithm we can implement, the first thing we need to do is figure out how to generate the optimal code point c∗for each input point x. One way to do this is to minimize the distance between the input point x and its reconstruction, g(c∗). We can measure this distance using a
norm. In the principal components algorithm, we use the L2 norm:

c∗ = arg min_c ||x − g(c)||₂. (2.54)
We can switch to the squared L2 norm instead of the L2 norm itself, because both are minimized by the same value of c. Both are minimized by the same value of c because the L2 norm is non-negative and the squaring operation is
monotonically increasing for non-negative arguments.
c∗ = arg min_c ||x − g(c)||₂². (2.55)
The function being minimized simplifies to
(x − g(c))(x − g(c)) (2.56)
(by the definition of the L2 norm, equation 2.30)
= xx − xg(c) − g(c)x + g(c)g(c) (2.57)
(by the distributive property)
= xx − 2xg(c) + g(c)g(c) (2.58)
(because the scalar g(c) x is equal to the transpose of itself). We can now change the function being minimized again, to omit the first term, since this term does not depend on c:
c∗ = arg min_c −2x⊤g(c) + g(c)⊤g(c). (2.59)
To make further progress, we must substitute in the definition of g(c):

c∗ = arg min_c −2x⊤Dc + c⊤D⊤Dc (2.60)

= arg min_c −2x⊤Dc + c⊤I_l c (2.61)

(by the orthogonality and unit norm constraints on D)

= arg min_c −2x⊤Dc + c⊤c. (2.62)
We can solve this optimization problem using vector calculus (see section 4.3 if you do not know how to do this):
∇_c(−2x⊤Dc + c⊤c) = 0 (2.63)

−2D⊤x + 2c = 0 (2.64)

c = D⊤x. (2.65)
This makes the algorithm efficient: we can optimally encode x just using a matrix-vector operation. To encode a vector, we apply the encoder function
f(x) = D⊤x. (2.66)
Using a further matrix multiplication, we can also define the PCA reconstruction operation:
r(x) = g(f(x)) = DD⊤x. (2.67)
Next, we need to choose the encoding matrix D. To do so, we revisit the idea of minimizing the L2 distance between inputs and reconstructions. Since we will use the same matrix D to decode all of the points, we can no longer consider the points in isolation. Instead, we must minimize the Frobenius norm of the matrix of errors computed over all dimensions and all points:
D∗ = arg min_D √(Σ_{i,j} (x(i)_j − r(x(i))_j)²) subject to D⊤D = I_l. (2.68)
To derive the algorithm for finding D∗, we will start by considering the case where l = 1. In this case, D is just a single vector, d. Substituting equation 2.67 into equation 2.68 and simplifying D into d, the problem reduces to
d∗ = arg min_d Σi ||x(i) − dd⊤x(i)||₂² subject to ||d||₂ = 1. (2.69)
The above formulation is the most direct way of performing the substitution, but is not the most stylistically pleasing way to write the equation. It places the scalar value d⊤x(i) on the right of the vector d. It is more conventional to write scalar coefficients on the left of the vectors they operate on. We therefore usually write such a formula as
d∗ = arg min_d Σi ||x(i) − d⊤x(i)d||₂² subject to ||d||₂ = 1, (2.70)
or, exploiting the fact that a scalar is its own transpose, as
d∗ = arg min_d Σi ||x(i) − x(i)⊤dd||₂² subject to ||d||₂ = 1. (2.71)
The reader should aim to become familiar with such cosmetic rearrangements.
At this point, it can be helpful to rewrite the problem in terms of a single design matrix of examples, rather than as a sum over separate example vectors. This will allow us to use more compact notation. Let X ∈ Rm×n be the matrix defined by stacking all of the vectors describing the points, such that Xi,: = x(i). We can now rewrite the problem as
||X − Xdd||2F subject to dd = 1. (2.72)
d∗ = arg min
d
Disregarding the constraint for the moment, we can simplify the Frobenius norm portion as follows:
||X − Xdd||2F (2.73)
arg min
d

= arg min
X − Xdd X − Xdd(2.74)
Tr
d
(by equation 2.49)
Tr(XX − XXdd − ddXX + ddXXdd) (2.75)
= arg min
d
Tr(XX) − Tr(XXdd) − Tr(ddXX) + Tr(ddXXdd) = arg min
d
(2.76)
− Tr(XXdd) − Tr(ddXX) + Tr(ddXXdd ) (2.77)
= arg min
d
(because terms not involving d do not affect the arg min)
−2 Tr(XXdd) + Tr(ddXXdd ) (2.78)
= arg min
d
(because we can cycle the order of the matrices inside a trace, equation 2.52) −2 Tr(XXdd) + Tr(XXdddd ) (2.79)
= arg min
d
(using the same property again)
At this point, we re-introduce the constraint:
−2 Tr(XXdd ) + Tr(XXdddd) subject to dd = 1 (2.80)
arg min
d
−2 Tr(XXdd) + Tr(XXdd) subject to dd = 1 (2.81)
= arg min
d
(due to the constraint)
− Tr(XXdd ) subject to dd = 1 (2.82)
= arg min
d
Tr(XXdd ) subject to dd = 1 (2.83)
= arg max
d
Tr(dXXd) subject to dd = 1 (2.84)
= arg max
d
This optimization problem may be solved using eigendecomposition. Specifically, the optimal d is given by the eigenvector of X⊤X corresponding to the largest eigenvalue.
This derivation is specific to the case of l = 1 and recovers only the first principal component. More generally, when we wish to recover a basis of principal components, the matrix D is given by the l eigenvectors corresponding to the largest eigenvalues. This may be shown using proof by induction. We recommend writing this proof as an exercise.
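A minimal NumPy sketch of the resulting algorithm (random data as an illustrative assumption; in practice the data is usually centered first, a step the derivation above omits):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))  # m = 100 points in R^5, one per row
l = 2                              # code dimension

eigenvalues, eigenvectors = np.linalg.eigh(X.T @ X)  # ascending order
D = eigenvectors[:, ::-1][:, :l]   # the l eigenvectors with largest eigenvalues

codes = X @ D                      # encode: c = Dᵀx for each row
reconstructions = codes @ D.T      # decode: DDᵀx

assert np.allclose(D.T @ D, np.eye(l))  # columns of D are orthonormal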
Linear algebra is one of the fundamental mathematical disciplines that is necessary to understand deep learning. Another key area of mathematics that is ubiquitous in machine learning is probability theory, presented next.
Chapter 3
Probability and Information Theory
In this chapter, we describe probability theory and information theory.
Probability theory is a mathematical framework for representing uncertain statements. It provides a means of quantifying uncertainty and axioms for deriving new uncertain statements. In artificial intelligence applications, we use probability theory in two major ways. First, the laws of probability tell us how AI systems
should reason, so we design our algorithms to compute or approximate various expressions derived using probability theory. Second, we can use probability and statistics to theoretically analyze the behavior of proposed AI systems.
Probability theory is a fundamental tool of many disciplines of science and engineering. We provide this chapter to ensure that readers whose background is primarily in software engineering with limited exposure to probability theory can understand the material in this book.
While probability theory allows us to make uncertain statements and reason in the presence of uncertainty, information theory allows us to quantify the amount of uncertainty in a probability distribution.
If you are already familiar with probability theory and information theory, you may wish to skip all of this chapter except for section 3.14, which describes the graphs we use to describe structured probabilistic models for machine learning. If you have absolutely no prior experience with these subjects, this chapter should be sufficient to successfully carry out deep learning research projects, but we do suggest that you consult an additional resource, such as Jaynes (2003).
3.1 Why Probability?
Many branches of computer science deal mostly with entities that are entirely deterministic and certain. A programmer can usually safely assume that a CPU will execute each machine instruction flawlessly. Errors in hardware do occur, but are rare enough that most software applications do not need to be designed to account for them. Given that many computer scientists and software engineers work in a relatively clean and certain environment, it can be surprising that machine learning makes heavy use of probability theory.
This is because machine learning must always deal with uncertain quantities, and sometimes may also need to deal with stochastic (non-deterministic) quantities. Uncertainty and stochasticity can arise from many sources. Researchers have made compelling arguments for quantifying uncertainty using probability since at least the 1980s. Many of the arguments presented here are summarized from or inspired by Pearl (1988).
Nearly all activities require some ability to reason in the presence of uncertainty. In fact, beyond mathematical statements that are true by definition, it is difficult to think of any proposition that is absolutely true or any event that is absolutely guaranteed to occur.
There are three possible sources of uncertainty:
1. Inherent stochasticity in the system being modeled. For example, most interpretations of quantum mechanics describe the dynamics of subatomic particles as being probabilistic. We can also create theoretical scenarios that we postulate to have random dynamics, such as a hypothetical card game where we assume that the cards are truly shuffled into a random order.
2. Incomplete observability. Even deterministic systems can appear stochastic when we cannot observe all of the variables that drive the behavior of the system. For example, in the Monty Hall problem, a game show contestant is asked to choose between three doors and wins a prize held behind the chosen door. Two doors lead to a goat while a third leads to a car. The outcome given the contestant’s choice is deterministic, but from the contestant’s point of view, the outcome is uncertain.
3. Incomplete modeling. When we use a model that must discard some of the information we have observed, the discarded information results in uncertainty in the model’s predictions. For example, suppose we build a robot that can exactly observe the location of every object around it. If the
robot discretizes space when predicting the future location of these objects, then the discretization makes the robot immediately become uncertain about the precise position of objects: each object could be anywhere within the discrete cell that it was observed to occupy.
In many cases, it is more practical to use a simple but uncertain rule rather than a complex but certain one, even if the true rule is deterministic and our modeling system has the fidelity to accommodate a complex rule. For example, the simple rule “Most birds fly” is cheap to develop and is broadly useful, while a rule of the form, “Birds fly, except for very young birds that have not yet learned to fly, sick or injured birds that have lost the ability to fly, flightless species of birds including the cassowary, ostrich and kiwi. . .” is expensive to develop, maintain and communicate, and after all of this effort is still very brittle and prone to failure.
While it should be clear that we need a means of representing and reasoning about uncertainty, it is not immediately obvious that probability theory can provide all of the tools we want for artificial intelligence applications. Probability theory was originally developed to analyze the frequencies of events. It is easy to see how probability theory can be used to study events like drawing a certain hand of cards in a game of poker. These kinds of events are often repeatable. When we say that an outcome has a probability p of occurring, it means that if we repeated the experiment (e.g., draw a hand of cards) infinitely many times, then proportion p of the repetitions would result in that outcome. This kind of reasoning does not seem immediately applicable to propositions that are not repeatable. If a doctor analyzes a patient and says that the patient has a 40% chance of having the flu, this means something very different—we cannot make infinitely many replicas of the patient, nor is there any reason to believe that different replicas of the patient would present with the same symptoms yet have varying underlying conditions. In the case of the doctor diagnosing the patient, we use probability to represent a degree of belief, with 1 indicating absolute certainty that the patient has the flu and 0 indicating absolute certainty that the patient does not have the flu. The former kind of probability, related directly to the rates at which events occur, is known as frequentist probability, while the latter, related to qualitative levels of certainty, is known as Bayesian probability.
If we list several properties that we expect common sense reasoning about uncertainty to have, then the only way to satisfy those properties is to treat Bayesian probabilities as behaving exactly the same as frequentist probabilities. For example, if we want to compute the probability that a player will win a poker game given that she has a certain set of cards, we use exactly the same formulas as when we compute the probability that a patient has a disease given that she
has certain symptoms. For more details about why a small set of common sense assumptions implies that the same axioms must control both kinds of probability, see Ramsey (1926).
Probability can be seen as the extension of logic to deal with uncertainty. Logic provides a set of formal rules for determining what propositions are implied to be true or false given the assumption that some other set of propositions is true or false. Probability theory provides a set of formal rules for determining the likelihood of a proposition being true given the likelihood of other propositions.
3.2 Random Variables
A random variable is a variable that can take on different values randomly. We typically denote the random variable itself with a lower case letter in plain typeface, and the values it can take on with lower case script letters. For example, x1 and x2
are both possible values that the random variable x can take on. For vector-valued variables, we would write the random variable as x and one of its values as x. On its own, a random variable is just a description of the states that are possible; it must be coupled with a probability distribution that specifies how likely each of these states are.
Random variables may be discrete or continuous. A discrete random variable is one that has a finite or countably infinite number of states. Note that these states are not necessarily the integers; they can also just be named states that are not considered to have any numerical value. A continuous random variable is associated with a real value.
3.3 Probability Distributions
A probability distribution is a description of how likely a random variable or set of random variables is to take on each of its possible states. The way we describe probability distributions depends on whether the variables are discrete or continuous.
3.3.1 Discrete Variables and Probability Mass Functions
A probability distribution over discrete variables may be described using a probability mass function (PMF). We typically denote probability mass functions with a capital P. Often we associate each random variable with a different probability
mass function and the reader must infer which probability mass function to use based on the identity of the random variable, rather than the name of the function; P(x) is usually not the same as P(y).
The probability mass function maps from a state of a random variable to the probability of that random variable taking on that state. The probability that x = x is denoted as P(x), with a probability of 1 indicating that x = x is certain and a probability of 0 indicating that x = x is impossible. Sometimes to disambiguate which PMF to use, we write the name of the random variable explicitly: P (x = x). Sometimes we define a variable first, then use ∼ notation to specify which distribution it follows later: x ∼ P(x).
Probability mass functions can act on many variables at the same time. Such a probability distribution over many variables is known as a joint probability distribution. P(x = x, y = y) denotes the probability that x = x and y = y simultaneously. We may also write P(x, y) for brevity.
To be a probability mass function on a random variable x, a function P must satisfy the following properties:
• The domain of P must be the set of all possible states of x.
• ∀x ∈ x, 0 ≤ P(x) ≤ 1. An impossible event has probability 0 and no state can be less probable than that. Likewise, an event that is guaranteed to happen has probability 1, and no state can have a greater chance of occurring.
• Σ_{x∈x} P(x) = 1. We refer to this property as being normalized. Without this property, we could obtain probabilities greater than one by computing the probability of one of many events occurring.
For example, consider a single discrete random variable x with k different states. We can place a uniform distribution on x—that is, make each of its states equally likely—by setting its probability mass function to
P(x = xi) = 1/k (3.1)

for all i. We can see that this fits the requirements for a probability mass function. The value 1/k is positive because k is a positive integer. We also see that

Σi P(x = xi) = Σi 1/k = k/k = 1, (3.2)
so the distribution is properly normalized.
3.3.2 Continuous Variables and Probability Density Functions
When working with continuous random variables, we describe probability distributions using a probability density function (PDF) rather than a probability mass function. To be a probability density function, a function p must satisfy the following properties:
• The domain of p must be the set of all possible states of x.
• ∀x ∈ x, p(x) ≥ 0. Note that we do not require p(x) ≤ 1.
• ∫ p(x)dx = 1.
A probability density function p(x) does not give the probability of a specific state directly; instead, the probability of landing inside an infinitesimal region with volume δx is given by p(x)δx.
We can integrate the density function to find the actual probability mass of a set of points. Specifically, the probability that x lies in some set S is given by the integral of p(x) over that set. In the univariate example, the probability that x lies in the interval [a, b] is given by ∫_{[a,b]} p(x)dx.
For an example of a probability density function corresponding to a specific probability density over a continuous random variable, consider a uniform distribution on an interval of the real numbers. We can do this with a function u(x; a, b), where a and b are the endpoints of the interval, with b > a. The “;” notation means “parametrized by”; we consider x to be the argument of the function, while a and b are parameters that define the function. To ensure that there is no probability mass outside the interval, we say u(x; a, b) = 0 for all x ∉ [a, b]. Within [a, b], u(x; a, b) = 1/(b − a). We can see that this is nonnegative everywhere. Additionally, it integrates to 1. We often denote that x follows the uniform distribution on [a, b] by writing x ∼ U(a, b).
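A minimal NumPy sketch of this density (the interval endpoints are illustrative assumptions), with a numerical check that it integrates to 1:

import numpy as np

def uniform_pdf(x, a, b):
    # 1/(b − a) inside [a, b], zero outside
    return np.where((x >= a) & (x <= b), 1.0 / (b - a), 0.0)

xs = np.linspace(-1.0, 3.0, 100001)
density = uniform_pdf(xs, a=0.0, b=2.0)
print(np.trapz(density, xs))  # ≈ 1.0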
3.4 Marginal Probability
Sometimes we know the probability distribution over a set of variables and we want to know the probability distribution over just a subset of them. The probability distribution over the subset is known as the marginal probability distribution.
For example, suppose we have discrete random variables x and y, and we know P(x, y). We can find P(x) with the sum rule:
∀x ∈ x, P(x = x) = Σy P(x = x, y = y). (3.3)
The name “marginal probability” comes from the process of computing marginal probabilities on paper. When the values of P(x, y) are written in a grid with different values of x in rows and different values of y in columns, it is natural to sum across a row of the grid, then write P(x) in the margin of the paper just to the right of the row.
For continuous variables, we need to use integration instead of summation:

p(x) = ∫ p(x, y)dy. (3.4)
3.5 Conditional Probability
In many cases, we are interested in the probability of some event, given that some other event has happened. This is called a conditional probability. We denote the conditional probability that y = y given x = x as P(y = y | x = x). This conditional probability can be computed with the formula
P(y = y | x = x) = P(y = y, x = x) / P(x = x). (3.5)
The conditional probability is only defined when P(x = x) > 0. We cannot compute the conditional probability conditioned on an event that never happens. It is important not to confuse conditional probability with computing what would happen if some action were undertaken. The conditional probability that a person is from Germany given that they speak German is quite high, but if a randomly selected person is taught to speak German, their country of origin does not change. Computing the consequences of an action is called making an intervention query. Intervention queries are the domain of causal modeling, which we do not explore in this book.
3.6 The Chain Rule of Conditional Probabilities
Any joint probability distribution over many random variables may be decomposed into conditional distributions over only one variable:
P(x(1), . . . , x(n)) = P(x(1)) ∏_{i=2}^{n} P(x(i) | x(1), . . . , x(i−1)). (3.6)
This observation is known as the chain rule or product rule of probability. It follows immediately from the definition of conditional probability in equation 3.5.
For example, applying the definition twice, we get
P(a, b, c) = P(a | b, c)P(b, c)
P(b, c) = P(b | c)P(c)
P(a, b, c) = P(a | b, c)P(b | c)P(c).
3.7 Independence and Conditional Independence
Two random variables x and y are independent if their probability distribution can be expressed as a product of two factors, one involving only x and one involving only y:
∀x ∈ x, y ∈ y, p(x = x, y = y) = p(x = x)p(y = y). (3.7)
Two random variables x and y are conditionally independent given a random variable z if the conditional probability distribution over x and y factorizes in this way for every value of z:
∀x ∈ x, y ∈ y, z ∈ z, p(x = x, y = y | z = z) = p(x = x | z = z)p(y = y | z = z). (3.8)
We can denote independence and conditional independence with compact notation: x⊥y means that x and y are independent, while x⊥y | z means that x and y are conditionally independent given z.
3.8 Expectation, Variance and Covariance
The expectation or expected value of some function f(x) with respect to a probability distribution P(x) is the average or mean value that f takes on when x is drawn from P. For discrete variables this can be computed with a summation:

Ex∼P[f(x)] = Σx P(x)f(x), (3.9)
while for continuous variables, it is computed with an integral:

Ex∼p[f(x)] = ∫ p(x)f(x)dx. (3.10)
When the identity of the distribution is clear from the context, we may simply write the name of the random variable that the expectation is over, as in Ex[f(x)]. If it is clear which random variable the expectation is over, we may omit the subscript entirely, as in E[f (x)]. By default, we can assume that E[·] averages over the values of all the random variables inside the brackets. Likewise, when there is no ambiguity, we may omit the square brackets.
Expectations are linear, for example,
Ex[αf(x) + βg(x)] = αEx[f(x)] + βEx[g(x)], (3.11)
when α and β are not dependent on x.
The variance gives a measure of how much the values of a function of a random variable x vary as we sample different values of x from its probability distribution:

Var(f(x)) = E[(f(x) − E[f(x)])²]. (3.12)
When the variance is low, the values of f (x) cluster near their expected value. The square root of the variance is known as the standard deviation. The covariance gives some sense of how much two values are linearly related to each other, as well as the scale of these variables:
Cov(f(x), g(y)) = E [(f(x) − E [f(x)])(g(y) − E [g(y)])] . (3.13)
High absolute values of the covariance mean that the values change very much and are both far from their respective means at the same time. If the sign of the covariance is positive, then both variables tend to take on relatively high values simultaneously. If the sign of the covariance is negative, then one variable tends to take on a relatively high value at the times that the other takes on a relatively low value and vice versa. Other measures such as correlation normalize the contribution of each variable in order to measure only how much the variables are related, rather than also being affected by the scale of the separate variables.
The notions of covariance and dependence are related, but are in fact distinct concepts. They are related because two variables that are independent have zero covariance, and two variables that have non-zero covariance are dependent. However, independence is a distinct property from covariance. For two variables to have zero covariance, there must be no linear dependence between them. Independence is a stronger requirement than zero covariance, because independence also excludes nonlinear relationships. It is possible for two variables to be dependent but have zero covariance. For example, suppose we first sample a real number x from a uniform distribution over the interval [−1, 1]. We next sample a random variable
s. With probability 1/2, we choose the value of s to be 1. Otherwise, we choose the value of s to be −1. We can then generate a random variable y by assigning y = sx. Clearly, x and y are not independent, because x completely determines the magnitude of y. However, Cov(x, y) = 0.
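A short NumPy simulation of this example (the sample size is an illustrative assumption):

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.uniform(-1.0, 1.0, size=n)
s = rng.choice([-1.0, 1.0], size=n)  # sign chosen with probability 1/2 each
y = s * x

print(np.cov(x, y)[0, 1])  # ≈ 0: no linear relationship
print(np.corrcoef(np.abs(x), np.abs(y))[0, 1])  # = 1: clearly dependent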
The covariance matrix of a random vector x ∈ Rn is an n × n matrix, such that
Cov(x)i,j = Cov(xi, xj). (3.14)
The diagonal elements of the covariance give the variance:
Cov(xi, xi) = Var(xi). (3.15)
3.9 Common Probability Distributions
Several simple probability distributions are useful in many contexts in machine learning.
3.9.1 Bernoulli Distribution
The Bernoulli distribution is a distribution over a single binary random variable. It is controlled by a single parameter φ ∈ [0, 1], which gives the probability of the random variable being equal to 1. It has the following properties:
P(x = 1) = φ (3.16)
P(x = 0) = 1 − φ (3.17)
P(x = x) = φ^x (1 − φ)^(1−x) (3.18)
Ex[x] = φ (3.19)
Varx(x) = φ(1 − φ) (3.20)
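A small NumPy sketch checking equations 3.19 and 3.20 by sampling (the value of φ is an illustrative assumption):

import numpy as np

phi = 0.3
rng = np.random.default_rng(0)
samples = rng.binomial(n=1, p=phi, size=1_000_000)  # Bernoulli = binomial with n = 1

print(samples.mean())  # ≈ φ = 0.3
print(samples.var())   # ≈ φ(1 − φ) = 0.21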
3.9.2 Multinoulli Distribution
The multinoulli or categorical distribution is a distribution over a single discrete variable with k different states, where k is finite.¹ The multinoulli distribution is

¹ “Multinoulli” is a term that was recently coined by Gustavo Lacerda and popularized by Murphy (2012). The multinoulli distribution is a special case of the multinomial distribution. A multinomial distribution is the distribution over vectors in {0, . . . , n}^k representing how many times each of the k categories is visited when n samples are drawn from a multinoulli distribution. Many texts use the term “multinomial” to refer to multinoulli distributions without clarifying that they refer only to the n = 1 case.
62
CHAPTER 3. PROBABILITY AND INFORMATION THEORY
parametrized by a vector p ∈ [0, 1]^{k−1}, where pi gives the probability of the i-th state. The final, k-th state’s probability is given by 1 − 1⊤p. Note that we must constrain 1⊤p ≤ 1. Multinoulli distributions are often used to refer to distributions over categories of objects, so we do not usually assume that state 1 has numerical value 1, etc. For this reason, we do not usually need to compute the expectation or variance of multinoulli-distributed random variables.
The Bernoulli and multinoulli distributions are sufficient to describe any distribution over their domain. They are able to describe any distribution over their domain not so much because they are particularly powerful but rather because their domain is simple; they model discrete variables for which it is feasible to enumerate all of the states. When dealing with continuous variables, there are
uncountably many states, so any distribution described by a small number of parameters must impose strict limits on the distribution.
3.9.3 Gaussian Distribution
The most commonly used distribution over real numbers is the normal distribution, also known as the Gaussian distribution:
1
N (x;µ, σ2) =
−12σ2(x − µ)2. (3.21)
2πσ2 exp
See figure 3.1 for a plot of the density function.
The two parameters µ ∈ R and σ ∈ (0,∞) control the normal distribution. The parameter µ gives the coordinate of the central peak. This is also the mean of the distribution: E[x] = µ. The standard deviation of the distribution is given by σ, and the variance by σ2.
When we evaluate the PDF, we need to square and invert σ. When we need to frequently evaluate the PDF with different parameter values, a more efficient way of parametrizing the distribution is to use a parameter β ∈ (0,∞) to control the precision or inverse variance of the distribution:
N(x; µ, β⁻¹) = √(β/(2π)) exp(−(1/2)β(x − µ)²). (3.22)
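A minimal NumPy sketch evaluating equation 3.21 directly and checking it against samples (the parameter values are illustrative assumptions):

import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / np.sqrt(2 * np.pi * sigma ** 2)

print(normal_pdf(0.0, mu=0.0, sigma=1.0))  # ≈ 0.3989, the peak of the standard normal

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=1_000_000)
print(samples.mean(), samples.std())  # ≈ 0.0 and 1.0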
Normal distributions are a sensible choice for many applications. In the absence of prior knowledge about what form a distribution over the real numbers should take, the normal distribution is a good default choice for two major reasons.
Figure 3.1: The normal distribution: The normal distribution N(x; µ, σ²) exhibits a classic “bell curve” shape, with the x coordinate of its central peak given by µ, and the width of its peak controlled by σ. The maximum is at x = µ and the inflection points are at x = µ ± σ. In this example, we depict the standard normal distribution, with µ = 0 and σ = 1.
First, many distributions we wish to model are truly close to being normal distributions. The central limit theorem shows that the sum of many independent random variables is approximately normally distributed. This means that in practice, many complicated systems can be modeled successfully as normally distributed noise, even if the system can be decomposed into parts with more structured behavior.
Second, out of all possible probability distributions with the same variance, the normal distribution encodes the maximum amount of uncertainty over the real numbers. We can thus think of the normal distribution as being the one that inserts the least amount of prior knowledge into a model. Fully developing and justifying this idea requires more mathematical tools, and is postponed to section 19.4.2.
The normal distribution generalizes to Rn, in which case it is known as the multivariate normal distribution. It may be parametrized with a positive definite symmetric matrix Σ:

N(x; µ, Σ) = √(1/((2π)ⁿ det(Σ))) exp(−(1/2)(x − µ)⊤Σ⁻¹(x − µ)). (3.23)
The parameter µ still gives the mean of the distribution, though now it is vector-valued. The parameter Σ gives the covariance matrix of the distribution. As in the univariate case, when we wish to evaluate the PDF several times for many different values of the parameters, the covariance is not a computationally
efficient way to parametrize the distribution, since we need to invert Σ to evaluate the PDF. We can instead use a precision matrix β:
N(x; µ, β⁻¹) = √(det(β)/(2π)ⁿ) exp(−(1/2)(x − µ)⊤β(x − µ)). (3.24)
We often fix the covariance matrix to be a diagonal matrix. An even simpler version is the isotropic Gaussian distribution, whose covariance matrix is a scalar times the identity matrix.
3.9.4 Exponential and Laplace Distributions
In the context of deep learning, we often want to have a probability distribution with a sharp point at x = 0. To accomplish this, we can use the exponential distribution:
p(x; λ) = λ 1_{x≥0} exp(−λx). (3.25)

The exponential distribution uses the indicator function 1_{x≥0} to assign probability zero to all negative values of x.
A closely related probability distribution that allows us to place a sharp peak of probability mass at an arbitrary point µ is the Laplace distribution:

Laplace(x; µ, γ) = (1/(2γ)) exp(−|x − µ|/γ). (3.26)
3.9.5 The Dirac Distribution and Empirical Distribution
In some cases, we wish to specify that all of the mass in a probability distribution clusters around a single point. This can be accomplished by defining a PDF using the Dirac delta function, δ(x):
p(x) = δ(x − µ). (3.27)
The Dirac delta function is defined such that it is zero-valued everywhere except 0, yet integrates to 1. The Dirac delta function is not an ordinary function that associates each value x with a real-valued output; instead it is a different kind of mathematical object called a generalized function.
Approaching (Almost) Any Machine Learning Problem

Abhishek Thakur
It would not have been possible for me to write this book without the support of my family and friends. I would also like to thank the reviewers who selflessly devoted their time in reviewing this book (names in alphabetical order).
Aakash Nain
Aditya Soni
Andreas Müller
Andrey Lukyanenko
Ayon Roy
Bojan Tunguz
Gilberto Titericz Jr.
Konrad Banachewicz
Luca Massaron
Nabajeet Barman
Parul Pandey
Ram Ramrakhya
Sanyam Bhutani
Sudalai Rajkumar
Tanishq Abraham
Walter Reade
Yuval Reina
I hope I did not miss anyone.
Before you start, there are a few things that you must be aware of while going through this book.
This is not a traditional book.
The book expects you to have basic knowledge of machine learning and deep learning.
Important terms are bold.
Variable names and function/class names are italic.
═════════════════════════════════════════════════════════════════════════
All the code is between these two lines
═════════════════════════════════════════════════════════════════════════

Most of the time, the output is provided right after the code blocks. Figures are numbered locally; for example, figure 1 is the first figure.
Code is very important in this book and there is a lot of it. You must go through the code carefully and implement it on your own if you want to understand what’s going on.
Comments in Python begin with a hash (#). All the code in this book is explained line-by-line only using comments. Thus, these comments must not be ignored.
Bash commands start with $ or ❯.
If you find a pirated copy of this book (print or e-book or pdf), contact me directly with the details so that I can take necessary actions.
If you didn’t code, you didn’t learn.
Table of Contents
Setting up your working environment .....................................................5 Supervised vs unsupervised learning.......................................................7 Cross-validation....................................................................................14 Evaluation metrics................................................................................30 Arranging machine learning projects ...................................................73 Approaching categorical variables........................................................85 Feature engineering ...........................................................................142 Feature selection ................................................................................155 Hyperparameter optimization .............................................................167 Approaching image classification & segmentation..............................185 Approaching text classification/regression..........................................225 Approaching ensembling and stacking................................................272 Approaching reproducible code & model serving................................283
Setting up your working environment
Before we begin with coding, it’s essential to get everything set up on your machine. Throughout this book, we will be using Ubuntu 18.04 and Python 3.7.6. If you are a Windows user, you can install Ubuntu in multiple ways: on a virtual machine, for example VirtualBox, which is free software provided by Oracle, or alongside Windows as a dual-boot system. I prefer dual boot as it is native. If you are not an Ubuntu user, you might face problems with some of the bash scripts in this book. To circumvent that, you can install Ubuntu in a VM or use the Linux shell on Windows.
Setting up Python on any machine is quite easy with Anaconda. I particularly like Miniconda, which is a minimal installer for conda. It is available for Linux, OSX and Windows. Since Python 2 support ended at the end of 2019, we will be using the Python 3 distribution. You should keep in mind that Miniconda does not come with all the packages that regular Anaconda does. We will, thus, be installing packages as we go. Installing Miniconda is quite easy.
The first thing that you need to do is download Miniconda3 to your system.
$ cd ~/Downloads
$ wget https://repo.anaconda.com/miniconda/...
where the URL after wget command is the URL from miniconda3 webpage. For 64-bit Linux systems, the URL at the time of writing this book was:
https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

Once you have downloaded miniconda3, you can run the following command:

$ sh Miniconda3-latest-Linux-x86_64.sh
Next, please read and follow the instructions on your screen. If you installed everything correctly, you should be able to start the conda environment by typing conda init in the terminal. We will create a conda environment that we will be using throughout this book. To create a conda environment, you can type:
$ conda create -n environment_name python=3.7.6
This command will create a conda environment named environment_name which can be activated using:
$ conda activate environment_name
And we are all set-up with the environment. Now it’s time to install some packages that we would be using. A package can be installed in two different ways when you are in a conda environment. You can either install the package from conda repository or the official PyPi repository.
$ conda/pip install package_name
Note: It might be possible that some packages are not available in the conda repo. Thus, installing using pip would be the most preferred way in this book. I have already created a list of packages used while writing this book which is saved in the environment.yml. You can find it in extra material available in my GitHub repository. You can create the environment using the following command:
$ conda env create -f environment.yml
This command will create an environment called ml. To activate this environment and start using it, you should run:
$ conda activate ml
And we are all set and ready to do some applied machine learning! Always remember to be in the “ml” environment when coding along with this book. Let’s start with our real first chapter now.
Supervised vs unsupervised learning
When dealing with machine learning problems, there are generally two types of data (and machine learning models):
• Supervised data: always has one or multiple targets associated with it. • Unsupervised data: does not have any target variable.
A supervised problem is considerably easier to tackle than an unsupervised one. A problem in which we are required to predict a value is known as a supervised problem. For example, predicting house prices given historical house prices, with features like presence of a hospital, school or supermarket, distance to nearest public transport, etc., is a supervised problem. Similarly, when we are provided with images of cats and dogs, and we know beforehand which ones are cats and which ones are dogs, and the task is to create a model which predicts whether a provided image is of a cat or a dog, the problem is considered to be supervised.
Figure 1: A supervised dataset.
As we see in figure 1, every row of the data is associated with a target or label. The columns are different features and rows represent different data points which are usually called samples. The example shows ten samples with ten features and a target variable which can be either a number or a category. If the target is categorical, the problem becomes a classification problem. And if the target is a real
number, the problem is defined as a regression problem. Thus, supervised problems can be divided into two sub-classes:
• Classification: predicting a category, e.g. dog or cat.
• Regression: predicting a value, e.g. house prices.
It must be noted that sometimes we might use regression in a classification setting depending on the metric used for evaluation. But we will come to that later.
Another type of machine learning problem is the unsupervised type. Unsupervised datasets do not have a target associated with them and in general, are more challenging to deal with when compared to supervised problems.
Let’s say you work in a financial firm which deals with credit card transactions. There is a lot of data that comes in every second. The only problem is that it is difficult to find humans who will mark each and every transaction either as a valid or genuine transaction or a fraud. When we do not have any information about a transaction being fraud or genuine, the problem becomes an unsupervised problem. To tackle these kinds of problems we have to think about how many clusters the data can be divided into. Clustering is one of the approaches that you can use for problems like this, but it must be noted that there are several other approaches available that can be applied to unsupervised problems. For a fraud detection problem, we can say that data can be divided into two classes (fraud or genuine).
When we know the number of clusters, we can use a clustering algorithm for unsupervised problems. In figure 2, the data is assumed to have two classes, dark colour represents fraud, and light colour represents genuine transactions. These classes, however, are not known to us before the clustering approach. After a clustering algorithm is applied, we should be able to distinguish between the two assumed targets. To make sense of unsupervised problems, we can also use numerous decomposition techniques such as Principal Component Analysis (PCA), t-distributed Stochastic Neighbour Embedding (t-SNE) etc.
Supervised problems are easier to tackle in the sense that they can be evaluated easily. We will read more about evaluation techniques in the following chapters. However, it is challenging to assess the results of unsupervised algorithms and a lot of human interference or heuristics are required. In this book, we will majorly be focusing on supervised data and models, but it does not mean that we will be ignoring the unsupervised data problems.
Figure 2: An unsupervised dataset.
Most of the time, when people start with data science or machine learning, they begin with very well-known datasets, for example, Titanic dataset, or Iris dataset which are supervised problems. In the Titanic dataset, you have to predict the survival of people aboard Titanic based on factors like their ticket class, gender, age, etc. Similarly, in the iris dataset, you have to predict the species of flower based on factors like sepal width, petal length, sepal length and petal width.
Unsupervised datasets may include datasets for customer segmentation. For example, you have data for the customers visiting your e-commerce website or the data for customers visiting a store or a mall, and you would like to segment them or cluster them in different categories. Another example of unsupervised datasets may include things like credit card fraud detection or just clustering several images.
Most of the time, it’s also possible to convert a supervised dataset to an unsupervised one to see what it looks like when plotted.
For example, let’s take a look at the dataset in figure 3. Figure 3 shows the MNIST dataset, which is a very popular dataset of handwritten digits, and it is a supervised problem in which you are given the images of the numbers and the correct label associated with them. You have to build a model that can identify which digit it is when provided only with the image.
This dataset can easily be converted to an unsupervised setting for basic visualization.
Figure 3: MNIST dataset¹
If we do a t-Distributed Stochastic Neighbour Embedding (t-SNE) decomposition of this dataset, we can see that we can separate the images to some extent just by using two components computed from the image pixels. This is shown in figure 4.
Figure 4: t-SNE visualization of the MNIST dataset. 3000 images were used.
Let’s take a look at how this was done. First and foremost is importing all the required libraries.
¹ Image source: By Josef Steppan - Own work, CC BY-SA 4.0,
https://commons.wikimedia.org/w/index.php?curid=64810040
═════════════════════════════════════════════════════════════════════════
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import datasets
from sklearn import manifold
%matplotlib inline
═════════════════════════════════════════════════════════════════════════
We use matplotlib and seaborn for plotting, numpy to handle the numerical arrays, pandas to create dataframes from the numerical arrays and scikit-learn (sklearn) to get the data and perform t-SNE.
After the imports, we need to either download the data and read it separately or use sklearn’s built-in function that provides us with the MNIST dataset.
═════════════════════════════════════════════════════════════════════════
data = datasets.fetch_openml(
'mnist_784',
version=1,
return_X_y=True
)
pixel_values, targets = data
targets = targets.astype(int)
═════════════════════════════════════════════════════════════════════════
In this part of the code, we have fetched the data using sklearn datasets, and we have an array of pixel values and another array of targets. Since the targets are of string type, we convert them to integers.
pixel_values is a 2-dimensional array of shape 70000x784. There are 70000 different images, each of size 28x28 pixels. Flattening 28x28 gives 784 data points.
We can visualize the samples in this dataset by reshaping them to their original shape and then plotting them using matplotlib.
═════════════════════════════════════════════════════════════════════════
single_image = pixel_values[1, :].reshape(28, 28)
plt.imshow(single_image, cmap='gray')
═════════════════════════════════════════════════════════════════════════
This code will plot an image like the following:
Figure 5: Plotting a single image from MNIST dataset.
The most important step comes after we have grabbed the data.
═════════════════════════════════════════════════════════════════════════
tsne = manifold.TSNE(n_components=2, random_state=42)
transformed_data = tsne.fit_transform(pixel_values[:3000, :])
═════════════════════════════════════════════════════════════════════════
This step creates the t-SNE transformation of the data. We use only two components as we can visualize them well in a two-dimensional setting. The transformed_data, in this case, is an array of shape 3000x2 (3000 rows and 2 columns). Data like this can be converted to a pandas dataframe by calling pd.DataFrame on the array.
═════════════════════════════════════════════════════════════════════════
tsne_df = pd.DataFrame(
np.column_stack((transformed_data, targets[:3000])),
columns=["x", "y", "targets"]
)
tsne_df.loc[:, "targets"] = tsne_df.targets.astype(int)
═════════════════════════════════════════════════════════════════════════
Here we are creating a pandas dataframe from a numpy array. There are three columns: x, y and targets. x and y are the two components from t-SNE decomposition and targets is the actual number. This gives us a dataframe which looks like the one shown in figure 6.
Figure 6: First 10 rows of pandas dataframe with t-SNE components and targets.
And finally, we can plot it using seaborn and matplotlib.
═════════════════════════════════════════════════════════════════════════
grid = sns.FacetGrid(tsne_df, hue="targets", size=8)
grid.map(plt.scatter, "x", "y").add_legend()
═════════════════════════════════════════════════════════════════════════
This is one way of visualizing unsupervised datasets. We can also do k-means clustering on the same dataset and see how it performs in an unsupervised setting. One question that arises all the time is how to find the optimal number of clusters in k-means clustering. Well, there is no right answer. You have to find the number by cross-validation. Cross-validation will be discussed later in this book. Please note that the above code was run in a jupyter notebook.
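As an aside, here is a minimal sketch of the k-means idea on the same data, continuing from the variables defined above. The choice of n_clusters=10 is an assumption made purely for illustration, since MNIST happens to have ten digits; in a truly unsupervised setting you would have to tune this number.
═════════════════════════════════════════════════════════════════════════
from sklearn import cluster

# fit k-means on the raw pixel values of the same 3000 images;
# n_clusters=10 is an assumption used here for illustration, since
# MNIST has ten digits; in a truly unsupervised setting you would
# have to tune this number
kmeans = cluster.KMeans(n_clusters=10, random_state=42)
cluster_labels = kmeans.fit_predict(pixel_values[:3000, :])
═════════════════════════════════════════════════════════════════════════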
In this book, we will use jupyter for simple things like the example above and for plotting. For most of the stuff in this book, we will be using python scripts. You can choose what you want to use since the results are going to be the same.
MNIST is a supervised classification problem, and we converted it to an unsupervised problem only to check whether it gives any kind of good results, and it is apparent that we do get a reasonable separation with t-SNE decomposition. The results would be even better if we used classification algorithms. What are they, and how do we use them? Let’s look at them in the next chapters.
Cross-validation
We did not build any models in the previous chapter. The reason for that is simple. Before creating any kind of machine learning model, we must know what cross-validation is and how to choose the best cross-validation depending on your dataset.
So, what is cross-validation, and why should we care about it?
We can find multiple definitions as to what cross-validation is. Mine is a one-liner: cross-validation is a step in the process of building a machine learning model which helps us ensure that our models fit the data accurately and also ensures that we do not overfit. But this leads to another term: overfitting.
To explain overfitting, I think it’s best if we look at a dataset. There is a red wine quality dataset2 which is quite famous. This dataset has 11 different attributes that decide the quality of red wine.
These attributes include:
• fixed acidity
• volatile acidity
• citric acid
• residual sugar
• chlorides
• free sulfur dioxide
• total sulfur dioxide
• density
• pH
• sulphates
• alcohol
Based on these different attributes, we are required to predict the quality of red wine which is a value between 0 and 10.
2 P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis; Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
Let’s see what this data looks like.
═════════════════════════════════════════════════════════════════════════
import pandas as pd
df = pd.read_csv("winequality-red.csv")
═════════════════════════════════════════════════════════════════════════
This dataset looks something like this:
Figure 1: A snapshot of the red wine quality dataset.
We can treat this problem either as a classification problem or as a regression problem since wine quality is nothing but a real number between 0 and 10. For simplicity, let’s choose classification. This dataset, however, consists of only six distinct quality values. We will thus map all quality values to the range 0 to 5.
═════════════════════════════════════════════════════════════════════════
# a mapping dictionary that maps the quality values from 0 to 5
quality_mapping = {
3: 0,
4: 1,
5: 2,
6: 3,
7: 4,
8: 5
}
# you can use the map function of pandas with
# any dictionary to convert the values in a given
# column to values in the dictionary
df.loc[:, "quality"] = df.quality.map(quality_mapping)
═════════════════════════════════════════════════════════════════════════
When we look at this data and consider it a classification problem, a lot of algorithms come to mind that we could apply; we could even use neural networks. But it would be a bit of a stretch to dive into neural networks from the beginning. So, let’s start with something simple that we can also visualize: decision trees.
Before we begin to understand what overfitting is, let’s divide the data into two parts. This dataset has 1599 samples. We keep 1000 samples for training and 599 as a separate set.
Splitting can be done easily by the following chunk of code:
═════════════════════════════════════════════════════════════════════════
# use sample with frac=1 to shuffle the dataframe
# we reset the indices since they change after
# shuffling the dataframe
df = df.sample(frac=1).reset_index(drop=True)
# top 1000 rows are selected
# for training
df_train = df.head(1000)
# bottom 599 values are selected
# for testing/validation
df_test = df.tail(599)
═════════════════════════════════════════════════════════════════════════
We will now train a decision tree model on the training set. For the decision tree model, I am going to use scikit-learn.
═════════════════════════════════════════════════════════════════════════
# import from scikit-learn
from sklearn import tree
from sklearn import metrics
# initialize decision tree classifier class
# with a max_depth of 3
clf = tree.DecisionTreeClassifier(max_depth=3)
# choose the columns you want to train on
# these are the features for the model
cols = ['fixed acidity',
'volatile acidity',
'citric acid',
'residual sugar',
'chlorides',
'free sulfur dioxide',
'total sulfur dioxide',
'density',
'pH',
'sulphates',
'alcohol']
# train the model on the provided features
# and mapped quality from before
clf.fit(df_train[cols], df_train.quality)
═════════════════════════════════════════════════════════════════════════
Note that I have used a max_depth of 3 for the decision tree classifier. I have left all other parameters of this model at their default values.
Now, we test the accuracy of this model on the training set and the test set:
═════════════════════════════════════════════════════════════════════════
# generate predictions on the training set
train_predictions = clf.predict(df_train[cols])
# generate predictions on the test set
test_predictions = clf.predict(df_test[cols])
# calculate the accuracy of predictions on
# training data set
train_accuracy = metrics.accuracy_score(
df_train.quality, train_predictions
)
# calculate the accuracy of predictions on
# test data set
test_accuracy = metrics.accuracy_score(
df_test.quality, test_predictions
)
═════════════════════════════════════════════════════════════════════════
The training and test accuracies are found to be 58.9% and 54.25%. Now we increase the max_depth to 7 and repeat the process. This gives a training accuracy of 76.6% and a test accuracy of 57.3%. Here, we have used accuracy, mainly because it is the most straightforward metric. It might not be the best metric for this problem. What if we calculate these accuracies for different values of max_depth and make a plot?
═════════════════════════════════════════════════════════════════════════
# NOTE: this code is written in a jupyter notebook
# import scikit-learn tree and metrics
from sklearn import tree
from sklearn import metrics
# import matplotlib and seaborn
# for plotting
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
# this is our global size of label text
# on the plots
matplotlib.rc('xtick', labelsize=20)
matplotlib.rc('ytick', labelsize=20)
# This line ensures that the plot is displayed
# inside the notebook
%matplotlib inline
# initialize lists to store accuracies
# for training and test data
# we start with 50% accuracy
train_accuracies = [0.5]
test_accuracies = [0.5]
# iterate over a few depth values
for depth in range(1, 25):
# init the model
clf = tree.DecisionTreeClassifier(max_depth=depth)
# columns/features for training
# note that, this can be done outside
# the loop
cols = [
'fixed acidity',
'volatile acidity',
'citric acid',
'residual sugar',
'chlorides',
'free sulfur dioxide',
'total sulfur dioxide',
'density',
'pH',
'sulphates',
'alcohol'
]
# fit the model on given features
clf.fit(df_train[cols], df_train.quality)
# create training & test predictions
train_predictions = clf.predict(df_train[cols])
test_predictions = clf.predict(df_test[cols])
# calculate training & test accuracies
train_accuracy = metrics.accuracy_score(
df_train.quality, train_predictions
)
test_accuracy = metrics.accuracy_score(
df_test.quality, test_predictions
)
# append accuracies
train_accuracies.append(train_accuracy)
test_accuracies.append(test_accuracy)
# create two plots using matplotlib
# and seaborn
plt.figure(figsize=(10, 5))
sns.set_style("whitegrid")
plt.plot(train_accuracies, label="train accuracy")
plt.plot(test_accuracies, label="test accuracy")
plt.legend(loc="upper left", prop={'size': 15})
plt.xticks(range(0, 26, 5))
plt.xlabel("max_depth", size=20)
plt.ylabel("accuracy", size=20)
plt.show()
═════════════════════════════════════════════════════════════════════════
This generates a plot, as shown in figure 2.
We see that the best score for test data is obtained when max_depth has a value of 14. As we keep increasing the value of this parameter, test accuracy remains the same or gets worse, but the training accuracy keeps increasing. It means that our simple decision tree model keeps learning about the training data better and better with an increase in max_depth, but the performance on test data does not improve at all.
This is called overfitting.
The model fits perfectly on the training set and performs poorly when it comes to the test set. This means that the model will learn the training data well but will not generalize on unseen samples. In the dataset above, one can build a model with very high max_depth which will have outstanding results on training data, but that kind of model is not useful as it will not provide a similar result on the real-world samples or live data.
Figure 2: Training and test accuracies for different values of max_depth.
One might argue that this approach isn’t overfitting as the accuracy of the test set more or less remains the same. Another definition of overfitting would be when the test loss increases as we keep improving training loss. This is very common when it comes to neural networks.
Whenever we train a neural network, we must monitor the loss during training for both the training and test sets. If we have a very large network for a dataset which is quite small (i.e. very few samples), we will observe that the loss for both training and test sets decreases as we keep training. However, at some point, test loss will reach its minimum, and after that, it will start increasing even though training loss decreases further. We must stop training where the validation loss reaches its minimum value.
This is the most common explanation of overfitting.
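To make the pattern concrete, here is a minimal sketch of such early stopping; model, train_one_epoch and evaluate_loss are hypothetical placeholders for your own training code, not functions from any library.
═════════════════════════════════════════════════════════════════════════
# hypothetical early-stopping loop; model, train_one_epoch and
# evaluate_loss are placeholders for your own training code
best_loss = float("inf")
patience = 5
bad_epochs = 0
for epoch in range(100):
    train_one_epoch(model, train_data)
    val_loss = evaluate_loss(model, valid_data)
    if val_loss < best_loss:
        # validation loss improved: remember it, reset the counter
        best_loss = val_loss
        bad_epochs = 0
    else:
        # no improvement: stop after `patience` bad epochs in a row
        bad_epochs += 1
        if bad_epochs >= patience:
            break
═════════════════════════════════════════════════════════════════════════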
Occam’s razor in simple words states that one should not try to complicate things that can be solved in a much simpler manner. In other words, the simplest solutions are the most generalizable solutions. In general, whenever your model does not obey Occam’s razor, it is probably overfitting.
Figure 3: Most general definition of overfitting.
Now we can go back to cross-validation.
While explaining overfitting, I decided to divide the data into two parts. I trained the model on one part and checked its performance on the other part. Well, this is also a kind of cross-validation, commonly known as a hold-out set. We use this kind of (cross-)validation when we have a large amount of data and model inference is a time-consuming process.
There are many different ways one can do cross-validation, and it is the most critical step when it comes to building a good machine learning model which is generalizable when it comes to unseen data. Choosing the right cross-validation depends on the dataset you are dealing with, and one’s choice of cross-validation on one dataset may or may not apply to other datasets. However, there are a few types of cross-validation techniques which are the most popular and widely used.
These include:
• k-fold cross-validation
• stratified k-fold cross-validation
• hold-out based validation
• leave-one-out cross-validation
• group k-fold cross-validation
Cross-validation is dividing training data into a few parts. We train the model on some of these parts and test on the remaining parts. Take a look at figure 4.
Figure 4: Splitting a dataset into training and validation sets
Figures 4 and 5 show that when you get a dataset to build machine learning models, you separate it into two different sets: training and validation. Many people also split it into a third set and call it a test set. We will, however, be using only two sets. As you can see, we divide the samples and the targets associated with them. We can divide the data into k different sets which are exclusive of each other. This is known as k-fold cross-validation.
Figure 5: K-fold cross-validation
We can split any data into k equal parts using KFold from scikit-learn. Each sample is assigned a value from 0 to k-1 when using k-fold cross-validation.
═════════════════════════════════════════════════════════════════════════
# import pandas and model_selection module of scikit-learn
import pandas as pd
from sklearn import model_selection
if __name__ == "__main__":
# Training data is in a CSV file called train.csv
df = pd.read_csv("train.csv")
# we create a new column called kfold and fill it with -1
df["kfold"] = -1
# the next step is to randomize the rows of the data
df = df.sample(frac=1).reset_index(drop=True)
# initiate the kfold class from model_selection module
kf = model_selection.KFold(n_splits=5)
# fill the new kfold column
for fold, (trn_, val_) in enumerate(kf.split(X=df)):
df.loc[val_, 'kfold'] = fold
# save the new csv with kfold column
df.to_csv("train_folds.csv", index=False)
═════════════════════════════════════════════════════════════════════════
You can use this process with almost all kinds of datasets. For example, when you have images, you can create a CSV with image id, image location and image label and use the process above.
The next important type of cross-validation is stratified k-fold. If you have a skewed dataset for binary classification with 90% positive samples and only 10% negative samples, you don't want to use random k-fold cross-validation. Using simple k-fold cross-validation for a dataset like this can result in folds with all negative samples. In these cases, we prefer using stratified k-fold cross-validation. Stratified k-fold cross-validation keeps the ratio of labels in each fold constant. So, in each fold, you will have the same 90% positive and 10% negative samples. Thus, whatever metric you choose to evaluate, it will give similar results across all folds.
It’s easy to modify the code for creating k-fold cross-validation to create stratified k-folds. We are only changing from model_selection.KFold to model_selection.StratifiedKFold and in the kf.split(...) function, we specify the target column on which we want to stratify. We assume that our CSV dataset has a column called “target” and it is a classification problem!
═════════════════════════════════════════════════════════════════════════
# import pandas and model_selection module of scikit-learn
import pandas as pd
from sklearn import model_selection
if __name__ == "__main__":
# Training data is in a csv file called train.csv
df = pd.read_csv("train.csv")
# we create a new column called kfold and fill it with -1
df["kfold"] = -1
# the next step is to randomize the rows of the data
df = df.sample(frac=1).reset_index(drop=True)
# fetch targets
y = df.target.values
# initiate the kfold class from model_selection module
kf = model_selection.StratifiedKFold(n_splits=5)
# fill the new kfold column
for f, (t_, v_) in enumerate(kf.split(X=df, y=y)):
df.loc[v_, 'kfold'] = f
# save the new csv with kfold column
df.to_csv("train_folds.csv", index=False)
═════════════════════════════════════════════════════════════════════════
For the wine dataset, let’s look at the distribution of labels.
═════════════════════════════════════════════════════════════════════════
b = sns.countplot(x='quality', data=df)
b.set_xlabel("quality", fontsize=20)
b.set_ylabel("count", fontsize=20)
═════════════════════════════════════════════════════════════════════════
Note that we continue with the code above, so the target values have already been converted. Looking at figure 6, we can say that the quality is very skewed. Some classes have a lot of samples, and some don’t have that many. If we do a simple k-fold, we won’t have an equal distribution of targets in every fold. Thus, we choose stratified k-fold in this case.
Figure 6: Distribution of “quality” in wine dataset
The rule is simple. If it’s a standard classification problem, choose stratified k-fold blindly.
But what should we do if we have a large amount of data? Suppose we have 1 million samples. A 5-fold cross-validation would mean training on 800k samples and validating on 200k. Depending on which algorithm we choose, training and even validation can be very expensive for a dataset of this size. In these cases, we can opt for a hold-out based validation.
The process for creating the hold-out remains the same as for stratified k-fold. For a dataset which has 1 million samples, we can create ten folds instead of 5 and keep one of those folds as the hold-out. This means we will have 100k samples in the hold-out, and we will always calculate loss, accuracy and other metrics on this set and train on 900k samples.
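One simple way to carve out such a hold-out is a stratified split with scikit-learn’s train_test_split; a sketch, assuming a dataframe df with a target column as in the earlier snippets.
═════════════════════════════════════════════════════════════════════════
from sklearn import model_selection

# keep 10% of the data as a stratified hold-out set
# and train on the remaining 90%
df_train, df_holdout = model_selection.train_test_split(
    df,
    test_size=0.1,
    stratify=df.target.values,
    random_state=42
)
═════════════════════════════════════════════════════════════════════════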
Hold-out is also used very frequently with time-series data. Let’s assume that the problem we are given is predicting the sales of a store for 2020, and we are provided all the data from 2015-2019. In this case, you can select all the data for 2019 as a hold-out and train your model on all the data from 2015 to 2018.
Figure 7: Example of a time-series data
In the example presented in figure 7, let’s say our job is to predict the sales from time step 31 to 40. We can then keep 21 to 30 as hold-out and train our model from step 0 to step 20. You should note that when you are predicting from 31 to 40, you should include the data from 21 to 30 in your model; otherwise, performance will be sub-par.
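A sketch of such a time-based split for the yearly example above, assuming a hypothetical dataframe df with a year column:
═════════════════════════════════════════════════════════════════════════
# hypothetical sales dataframe with a "year" column:
# train on 2015-2018 and hold out 2019 for validation
df_train = df[df.year <= 2018].reset_index(drop=True)
df_holdout = df[df.year == 2019].reset_index(drop=True)
═════════════════════════════════════════════════════════════════════════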
In many cases, we have to deal with small datasets, and creating big validation sets means losing a lot of data for the model to learn from. In those cases, we can opt for a type of k-fold cross-validation where k = N, where N is the number of samples in the dataset. This means that in every fold of training, we will be training on all data samples except one. This is the leave-one-out cross-validation mentioned earlier, and the number of folds is the same as the number of samples in the dataset.
One should note that this type of cross-validation can be costly in terms of the time it takes if the model is not fast enough, but since it’s only preferable to use this cross-validation for small datasets, it doesn’t matter much.
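scikit-learn implements this directly as LeaveOneOut; a minimal sketch that reuses the kfold-column pattern from before, assuming a dataframe df as in the earlier snippets:
═════════════════════════════════════════════════════════════════════════
from sklearn import model_selection

# leave-one-out: every fold holds out exactly one sample
loo = model_selection.LeaveOneOut()
for fold, (trn_, val_) in enumerate(loo.split(X=df)):
    df.loc[val_, "kfold"] = fold
═════════════════════════════════════════════════════════════════════════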
Now we can move to regression. The good thing about regression problems is that we can use all the cross-validation techniques mentioned above, except for stratified k-fold. That is, we cannot use stratified k-fold directly, but there are ways to change the problem a bit so that we can use stratified k-fold for regression problems. Mostly, simple k-fold cross-validation works for any regression problem. However, if you see that the distribution of targets is not consistent, you can use stratified k-fold.
To use stratified k-fold for a regression problem, we first have to divide the target into bins, and then we can use stratified k-fold in the same way as for classification problems. There are several choices for selecting the appropriate number of bins. If you have a lot of samples (> 10k, > 100k), then you don’t need to care much about the number of bins: just divide the data into 10 or 20 bins. If you do not have a lot of samples, you can use a simple rule like Sturges’ rule to calculate the appropriate number of bins.
Sturges’ rule:
Number of Bins = 1 + log2(N)
Where N is the number of samples you have in your dataset. For example, for N = 1000 samples this gives 1 + log2(1000) ≈ 11 bins. This function is plotted in Figure 8.
Figure 8: Plotting samples vs the number of bins by Sturges’ rule
Let’s make a sample regression dataset and try to apply stratified k-fold as shown in the following python snippet.
═════════════════════════════════════════════════════════════════════════
# stratified-kfold for regression
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn import model_selection
def create_folds(data):
# we create a new column called kfold and fill it with -1
data["kfold"] = -1
# the next step is to randomize the rows of the data
data = data.sample(frac=1).reset_index(drop=True)
# calculate the number of bins by Sturges' rule
# I take the floor of the value, you can also
# just round it
num_bins = int(np.floor(1 + np.log2(len(data))))
# bin targets
data.loc[:, "bins"] = pd.cut(
data["target"], bins=num_bins, labels=False
)
# initiate the kfold class from model_selection module
kf = model_selection.StratifiedKFold(n_splits=5)
# fill the new kfold column
# note that, instead of targets, we use bins!
for f, (t_, v_) in enumerate(kf.split(X=data, y=data.bins.values)):
    data.loc[v_, 'kfold'] = f
# drop the bins column
data = data.drop("bins", axis=1)
# return dataframe with folds
return data
if __name__ == "__main__":
# we create a sample dataset with 15000 samples
# and 100 features and 1 target
X, y = datasets.make_regression(
n_samples=15000, n_features=100, n_targets=1
)
# create a dataframe out of our numpy arrays
df = pd.DataFrame(
X,
columns=[f"f_{i}" for i in range(X.shape[1])]
)
df.loc[:, "target"] = y
# create folds
df = create_folds(df)
═════════════════════════════════════════════════════════════════════════
Cross-validation is the first and most essential step when it comes to building machine learning models. If you want to do feature engineering, split your data first. If you are going to build models, split your data first. If you have a good cross-validation scheme in which validation data is representative of training and real-world data, you will be able to build a good machine learning model which is highly generalizable.
The types of cross-validation presented in this chapter can be applied to almost any machine learning problem. Still, you must keep in mind that cross-validation also depends a lot on the data and you might need to adopt new forms of cross-validation depending on your problem and data.
For example, let’s say we have a problem in which we would like to build a model to detect skin cancer from skin images of patients. Our task is to build a binary classifier which takes an input image and predicts the probability for it being benign or malignant.
In these kinds of datasets, you might have multiple images for the same patient in the training dataset. So, to build a good cross-validation system here, you must have stratified k-folds, but you must also make sure that patients in training data do not appear in validation data. Fortunately, scikit-learn offers a type of cross-validation known as GroupKFold. Here the patients can be considered as groups. But unfortunately, there is no way to combine GroupKFold with StratifiedKFold in scikit-learn. So you need to do that yourself. I’ll leave it as an exercise for the reader.
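For the group part on its own, here is a sketch; the target and patient_id column names are assumptions made for illustration:
═════════════════════════════════════════════════════════════════════════
from sklearn import model_selection

# group k-fold: images of one patient never appear in both
# training and validation of the same fold
kf = model_selection.GroupKFold(n_splits=5)
for fold, (trn_, val_) in enumerate(
    kf.split(X=df, y=df.target.values, groups=df.patient_id.values)
):
    df.loc[val_, "kfold"] = fold
═════════════════════════════════════════════════════════════════════════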
Evaluation metrics
When it comes to machine learning problems, you will encounter a lot of different types of metrics in the real world. Sometimes, people even end up creating metrics that suit the business problem. It’s out of the scope of this book to introduce and explain each and every type of metric. Instead, we will see some of the most common metrics that you can use when starting with your very first few projects.
At the start of the book, we introduced supervised and unsupervised learning. Although there are some kinds of metrics that you can use for unsupervised learning, we will only focus on supervised. The reason is that supervised problems are far more abundant than unsupervised ones, and evaluation of unsupervised methods is quite subjective.
If we talk about classification problems, the most common metrics used are:
- Accuracy
- Precision (P)
- Recall (R)
- F1 score (F1)
- Area under the ROC (Receiver Operating Characteristic) curve, or simply AUC
- Log loss
- Precision at k (P@k)
- Average precision at k (AP@k)
- Mean average precision at k (MAP@k)
When it comes to regression, the most commonly used evaluation metrics are:
- Mean absolute error (MAE)
- Mean squared error (MSE)
- Root mean squared error (RMSE)
- Root mean squared logarithmic error (RMSLE)
- Mean percentage error (MPE)
- Mean absolute percentage error (MAPE)
- R2
Knowing how the aforementioned metrics work is not the only thing we have to understand. We must also know when to use which metric, and that depends on what kind of data and targets you have. I think it’s more about the targets and less about the data.
To learn more about these metrics, let’s start with a simple problem. Suppose we have a binary classification problem, i.e. a problem in which there are only two targets. Let’s suppose it’s a problem of classifying chest x-ray images. There are chest x-ray images with no problem, and some of the chest x-ray images have a collapsed lung, which is also known as pneumothorax. So, our task is to build a classifier that, given a chest x-ray image, can detect whether it has pneumothorax.
Figure 1: A lung image showing pneumothorax. Image is taken from SIIM-ACR Pneumothorax Segmentation Competition3
We also assume that we have an equal number of pneumothorax and non-pneumothorax chest x-ray images; let’s say 100 each. Thus, we have 100 positive samples and 100 negative samples, with a total of 200 images.
The first step is to divide the data described above into two equal sets of 100 images each, i.e. training and validation set. In both the sets, we have 50 positive and 50 negative samples.
3 https://www.kaggle.com/c/siim-acr-pneumothorax-segmentation
When we have an equal number of positive and negative samples in a binary classification problem, we generally use accuracy, precision, recall and f1.
Accuracy: It is one of the most straightforward metrics used in machine learning. It defines how accurate your model is. For the problem described above, if you build a model that classifies 90 images accurately, your accuracy is 90% or 0.90. If only 83 images are classified correctly, the accuracy of your model is 83% or 0.83. Simple.
Python code for calculating accuracy is also quite simple.
═════════════════════════════════════════════════════════════════════════
def accuracy(y_true, y_pred):
"""
Function to calculate accuracy
:param y_true: list of true values
:param y_pred: list of predicted values
:return: accuracy score
"""
# initialize a simple counter for correct predictions
correct_counter = 0
# loop over all elements of y_true
# and y_pred "together"
for yt, yp in zip(y_true, y_pred):
if yt == yp:
# if prediction is equal to truth, increase the counter
correct_counter += 1
# return accuracy
# which is correct predictions over the number of samples
return correct_counter / len(y_true)
═════════════════════════════════════════════════════════════════════════
We can also calculate accuracy using scikit-learn.
═════════════════════════════════════════════════════════════════════════
In [X]: from sklearn import metrics
...: l1 = [0,1,1,1,0,0,0,1]
...: l2 = [0,1,0,1,0,1,0,0]
...: metrics.accuracy_score(l1, l2)
Out[X]: 0.625
═════════════════════════════════════════════════════════════════════════
Now, let’s say we change the dataset a bit such that there are 180 chest x-ray images which do not have pneumothorax and only 20 with pneumothorax. Even in this case, we will create the training and validation sets with the same ratio of positive to negative (pneumothorax to non-pneumothorax) targets. In each set, we have 90 non-pneumothorax and 10 pneumothorax images. If you say that all images in the validation set are non-pneumothorax, what would your accuracy be? Let’s see; you classified 90% of the images correctly. So, your accuracy is 90%.
But look at it one more time.
You didn’t even build a model and got an accuracy of 90%. That seems kind of useless. If we look carefully, we will see that the dataset is skewed, i.e., the number of samples in one class outnumbers the number of samples in the other class by a lot. In these kinds of cases, it is not advisable to use accuracy as an evaluation metric as it is not representative of the data. So, you might get high accuracy, but your model will probably not perform that well when it comes to real-world samples, and you won’t be able to explain to your managers why.
In these cases, it’s better to look at other metrics such as precision.
Before learning about precision, we need to know a few terms. Here we have assumed that chest x-ray images with pneumothorax are positive class (1) and without pneumothorax are negative class (0).
True positive (TP): Given an image, if your model predicts the image has pneumothorax, and the actual target for that image has pneumothorax, it is considered a true positive.
True negative (TN): Given an image, if your model predicts that the image does not have pneumothorax and the actual target says that it is a non-pneumothorax image, it is considered a true negative.
In simple words, if your model correctly predicts positive class, it is true positive, and if your model accurately predicts negative class, it is a true negative.
False positive (FP): Given an image, if your model predicts pneumothorax and the actual target for that image is non-pneumothorax, it is a false positive.
False negative (FN): Given an image, if your model predicts non-pneumothorax and the actual target for that image is pneumothorax, it is a false negative.
In simple words, if your model incorrectly (or falsely) predicts positive class, it is a false positive. If your model incorrectly (or falsely) predicts negative class, it is a false negative.
Let’s look at implementations of these, one at a time.
═════════════════════════════════════════════════════════════════════════
def true_positive(y_true, y_pred):
"""
Function to calculate True Positives
:param y_true: list of true values
:param y_pred: list of predicted values
:return: number of true positives
"""
# initialize
tp = 0
for yt, yp in zip(y_true, y_pred):
if yt == 1 and yp == 1:
tp += 1
return tp
def true_negative(y_true, y_pred):
"""
Function to calculate True Negatives
:param y_true: list of true values
:param y_pred: list of predicted values
:return: number of true negatives
"""
# initialize
tn = 0
for yt, yp in zip(y_true, y_pred):
if yt == 0 and yp == 0:
tn += 1
return tn
def false_positive(y_true, y_pred):
"""
Function to calculate False Positives
:param y_true: list of true values
:param y_pred: list of predicted values
:return: number of false positives
"""
# initialize
fp = 0
for yt, yp in zip(y_true, y_pred):
if yt == 0 and yp == 1:
fp += 1
return fp
def false_negative(y_true, y_pred):
"""
Function to calculate False Negatives
:param y_true: list of true values
:param y_pred: list of predicted values
:return: number of false negatives
"""
# initialize
fn = 0
for yt, yp in zip(y_true, y_pred):
if yt == 1 and yp == 0:
fn += 1
return fn
═════════════════════════════════════════════════════════════════════════
The way I have implemented these here is quite simple and works only for binary classification. Let’s check these functions.
═════════════════════════════════════════════════════════════════════════
In [X]: l1 = [0,1,1,1,0,0,0,1]
...: l2 = [0,1,0,1,0,1,0,0]
In [X]: true_positive(l1, l2)
Out[X]: 2
In [X]: false_positive(l1, l2)
Out[X]: 1
In [X]: false_negative(l1, l2)
Out[X]: 2
In [X]: true_negative(l1, l2)
Out[X]: 3
═════════════════════════════════════════════════════════════════════════
If we have to define accuracy using the terms described above, we can write:
Accuracy Score = (TP + TN) / (TP + TN + FP + FN)
We can now quickly implement accuracy score using TP, TN, FP and FN in python. Let’s call it accuracy_v2.
═════════════════════════════════════════════════════════════════════════
def accuracy_v2(y_true, y_pred):
"""
Function to calculate accuracy using tp/tn/fp/fn
:param y_true: list of true values
:param y_pred: list of predicted values
:return: accuracy score
"""
tp = true_positive(y_true, y_pred)
fp = false_positive(y_true, y_pred)
fn = false_negative(y_true, y_pred)
tn = true_negative(y_true, y_pred)
accuracy_score = (tp + tn) / (tp + tn + fp + fn)
return accuracy_score
═════════════════════════════════════════════════════════════════════════
We can quickly check the correctness of this function by comparing it to our previous implementation and scikit-learn version.
═════════════════════════════════════════════════════════════════════════
In [X]: l1 = [0,1,1,1,0,0,0,1]
...: l2 = [0,1,0,1,0,1,0,0]
In [X]: accuracy(l1, l2)
Out[X]: 0.625
In [X]: accuracy_v2(l1, l2)
Out[X]: 0.625
In [X]: metrics.accuracy_score(l1, l2)
Out[X]: 0.625
═════════════════════════════════════════════════════════════════════════
Please note that in this code, metrics.accuracy_score comes from scikit-learn.
Great. All values match. This means we have not made any mistakes in the implementation.
Now, we can move to other important metrics.
The first one is precision. Precision is defined as:
Precision = TP / (TP + FP)
Let’s say we make a new model on the new skewed dataset and our model correctly identified 80 non-pneumothorax out of 90 and 8 pneumothorax out of 10. Thus, we identify 88 images out of 100 successfully. The accuracy is, therefore, 0.88 or 88%.
But, out of these 100 samples, 10 non-pneumothorax images are misclassified as having pneumothorax and 2 pneumothorax are misclassified as not having pneumothorax.
Thus, we have:
- TP : 8
- TN: 80
- FP: 10
- FN: 2
So, our precision is 8 / (8 + 10) = 0.444. This means our model is correct 44.4% of the time when it is trying to identify positive samples (pneumothorax).
Now, since we have implemented TP, TN, FP and FN, we can easily implement precision in python.
═════════════════════════════════════════════════════════════════════════
def precision(y_true, y_pred):
"""
Function to calculate precision
:param y_true: list of true values
:param y_pred: list of predicted values
:return: precision score
"""
tp = true_positive(y_true, y_pred)
fp = false_positive(y_true, y_pred)
precision = tp / (tp + fp)
return precision
═════════════════════════════════════════════════════════════════════════
Let’s try this implementation of precision.
═════════════════════════════════════════════════════════════════════════
In [X]: l1 = [0,1,1,1,0,0,0,1]
...: l2 = [0,1,0,1,0,1,0,0]
In [X]: precision(l1, l2)
Out[X]: 0.6666666666666666
═════════════════════════════════════════════════════════════════════════
This seems fine.
Next, we come to recall. Recall is defined as:
Recall = TP / (TP + FN)
In the above case recall is 8 / (8 + 2) = 0.80. This means our model identified 80% of positive samples correctly.
═════════════════════════════════════════════════════════════════════════
def recall(y_true, y_pred):
"""
Function to calculate recall
:param y_true: list of true values
:param y_pred: list of predicted values
:return: recall score
"""
tp = true_positive(y_true, y_pred)
fn = false_negative(y_true, y_pred)
recall = tp / (tp + fn)
return recall
═════════════════════════════════════════════════════════════════════════
In the case of our two small lists, we should have a recall of 0.5. Let’s check.
═════════════════════════════════════════════════════════════════════════
In [X]: l1 = [0,1,1,1,0,0,0,1]
...: l2 = [0,1,0,1,0,1,0,0]
In [X]: recall(l1, l2)
Out[X]: 0.5
═════════════════════════════════════════════════════════════════════════
And that matches our calculated value!
For a “good” model, our precision and recall values should be high. We see that in the above example, the recall value is quite high. However, precision is very low! Our model produces quite a lot of false positives but few false negatives. Fewer false negatives are good in this type of problem because you don’t want to say that patients do not have pneumothorax when they do. That is going to be more harmful. But we do have a lot of false positives, and that’s not good either.
Most models predict a probability, and when we make predictions, we usually choose a threshold of 0.5. This threshold is not always ideal, and depending on it, your values of precision and recall can change drastically. If, for every threshold we choose, we calculate the precision and recall values, we can create a plot between these sets of values. This plot or curve is known as the precision-recall curve.
Before looking into the precision-recall curve, let’s assume two lists.
═════════════════════════════════════════════════════════════════════════
In [X]: y_true = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
...: 1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
In [X]: y_pred = [0.02638412, 0.11114267, 0.31620708,
...: 0.0490937, 0.0191491, 0.17554844,
...: 0.15952202, 0.03819563, 0.11639273,
...: 0.079377, 0.08584789, 0.39095342,
...: 0.27259048, 0.03447096, 0.04644807,
...: 0.03543574, 0.18521942, 0.05934905,
...: 0.61977213, 0.33056815]
═════════════════════════════════════════════════════════════════════════
So, y_true is our targets, and y_pred is the probability values for a sample being assigned a value of 1. So, now, we look at probabilities in prediction instead of the predicted value (which is most of the time calculated with a threshold at 0.5).
═════════════════════════════════════════════════════════════════════════
precisions = []
recalls = []
# how we assumed these thresholds is a long story
thresholds = [0.0490937 , 0.05934905, 0.079377,
0.08584789, 0.11114267, 0.11639273,
0.15952202, 0.17554844, 0.18521942,
0.27259048, 0.31620708, 0.33056815,
0.39095342, 0.61977213]
# for every threshold, calculate predictions in binary
# and append calculated precisions and recalls
# to their respective lists
for i in thresholds:
temp_prediction = [1 if x >= i else 0 for x in y_pred]
p = precision(y_true, temp_prediction)
r = recall(y_true, temp_prediction)
precisions.append(p)
recalls.append(r)
═════════════════════════════════════════════════════════════════════════
Now, we can plot these values of precisions and recalls.
═════════════════════════════════════════════════════════════════════════
plt.figure(figsize=(7, 7))
plt.plot(recalls, precisions)
plt.xlabel('Recall', fontsize=15)
plt.ylabel('Precision', fontsize=15)
═════════════════════════════════════════════════════════════════════════
Figure 2 shows the precision-recall curve we get this way.
Figure 2: precision-recall curve
This precision-recall curve looks very different from what you might have seen on the internet. That is because we had only 20 samples, and only 3 of them were positive. But there is nothing to worry about. It is the same old precision-recall curve.
You will notice that it’s challenging to choose a threshold that gives both good precision and recall values. If the threshold is too high, you have a smaller number of true positives and a high number of false negatives. This decreases your recall; however, your precision score will be high. If you reduce the threshold too much, false positives will increase a lot, and precision will be low.
Both precision and recall range from 0 to 1 and a value closer to 1 is better.
F1 score is a metric that combines both precision and recall. It is defined as the harmonic mean of precision and recall. If we denote precision using P and recall using R, we can represent the F1 score as:
F1 = 2PR / (P + R)
A little bit of mathematics, substituting P = TP / (TP + FP) and R = TP / (TP + FN) into the formula above, will lead you to the following equation of F1 based on TP, FP and FN:
F1 = 2TP / (2TP + FP + FN)
A Python implementation is simple because we have already implemented these.
═════════════════════════════════════════════════════════════════════════
def f1(y_true, y_pred):
"""
Function to calculate f1 score
:param y_true: list of true values
:param y_pred: list of predicted values
:return: f1 score
"""
p = precision(y_true, y_pred)
r = recall(y_true, y_pred)
score = 2 * p * r / (p + r)
return score
═════════════════════════════════════════════════════════════════════════
Let’s see the results of this and compare it with scikit-learn.
═════════════════════════════════════════════════════════════════════════
In [X]: y_true = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
...: 1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
In [X]: y_pred = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
...: 1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
In [X]: f1(y_true, y_pred)
Out[X]: 0.5714285714285715
═════════════════════════════════════════════════════════════════════════
And from scikit-learn for the same lists, we get:
═════════════════════════════════════════════════════════════════════════
In [X]: from sklearn import metrics
In [X]: metrics.f1_score(y_true, y_pred)
Out[X]: 0.5714285714285715
═════════════════════════════════════════════════════════════════════════
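Using the identity above, the same score can also be computed directly from the counts. A small alternative sketch, built on the true_positive, false_positive and false_negative functions defined earlier:
═════════════════════════════════════════════════════════════════════════
def f1_v2(y_true, y_pred):
    """
    Function to calculate f1 score using tp/fp/fn
    :param y_true: list of true values
    :param y_pred: list of predicted values
    :return: f1 score
    """
    tp = true_positive(y_true, y_pred)
    fp = false_positive(y_true, y_pred)
    fn = false_negative(y_true, y_pred)
    return 2 * tp / (2 * tp + fp + fn)
═════════════════════════════════════════════════════════════════════════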
Instead of looking at precision and recall individually, you can also just look at F1 score. Same as for precision, recall and accuracy, F1 score also ranges from 0 to 1, and a perfect prediction model has an F1 of 1. When dealing with datasets that have skewed targets, we should look at F1 (or precision and recall) instead of accuracy.
Then there are other crucial terms that we should know about. The first one is TPR or True Positive Rate, which is the same as recall.
TPR = TP / (TP + FN)
Even though it is the same as recall, we will make a python function for it so that we can use it under this name later.
═════════════════════════════════════════════════════════════════════════
def tpr(y_true, y_pred):
"""
Function to calculate tpr
:param y_true: list of true values
:param y_pred: list of predicted values
:return: tpr/recall
"""
return recall(y_true, y_pred)
═════════════════════════════════════════════════════════════════════════
TPR or recall is also known as sensitivity.
And FPR or False Positive Rate, which is defined as:
FPR = FP / (TN + FP)
═════════════════════════════════════════════════════════════════════════
def fpr(y_true, y_pred):
"""
Function to calculate fpr
:param y_true: list of true values
:param y_pred: list of predicted values
:return: fpr
"""
fp = false_positive(y_true, y_pred)
tn = true_negative(y_true, y_pred)
return fp / (tn + fp)
═════════════════════════════════════════════════════════════════════════
And 1 - FPR is known as specificity or True Negative Rate or TNR.
These are a lot of terms, but the most important ones among them are TPR and FPR.
Let’s assume that we have only 15 samples and that their target values are binary:
Actual targets: [0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1]
We train a model like random forest, and we can get the probability of a sample being positive.
Predicted probabilities for 1: [0.1, 0.3, 0.2, 0.6, 0.8, 0.05, 0.9, 0.5, 0.3, 0.66, 0.3, 0.2, 0.85, 0.15, 0.99]
For a typical threshold of >= 0.5, we can evaluate all the above values of precision, recall/TPR, F1 and FPR. But we can do the same if we choose the value of the threshold to be 0.4 or 0.6. In fact, we can choose any value between 0 and 1 and calculate all the metrics described above.
Let’s calculate only two values, though: TPR and FPR.
═════════════════════════════════════════════════════════════════════════
# empty lists to store tpr
# and fpr values
tpr_list = []
fpr_list = []
# actual targets
y_true = [0, 0, 0, 0, 1, 0, 1,
0, 0, 1, 0, 1, 0, 0, 1]
# predicted probabilities of a sample being 1
y_pred = [0.1, 0.3, 0.2, 0.6, 0.8, 0.05,
0.9, 0.5, 0.3, 0.66, 0.3, 0.2,
0.85, 0.15, 0.99]
# handmade thresholds
thresholds = [0, 0.1, 0.2, 0.3, 0.4, 0.5,
0.6, 0.7, 0.8, 0.85, 0.9, 0.99, 1.0]
# loop over all thresholds
for thresh in thresholds:
# calculate predictions for a given threshold
temp_pred = [1 if x >= thresh else 0 for x in y_pred]
# calculate tpr
temp_tpr = tpr(y_true, temp_pred)
# calculate fpr
temp_fpr = fpr(y_true, temp_pred)
# append tpr and fpr to lists
tpr_list.append(temp_tpr)
fpr_list.append(temp_fpr)
═════════════════════════════════════════════════════════════════════════
We can thus get a tpr and fpr value for each threshold.
Figure 3: Table for threshold, TPR and FPR values
If we plot the table as shown in figure 3, i.e. if we have TPR on the y-axis and FPR on the x-axis, we will get a curve as shown in figure 4.
═════════════════════════════════════════════════════════════════════════
plt.figure(figsize=(7, 7))
plt.fill_between(fpr_list, tpr_list, alpha=0.4)
plt.plot(fpr_list, tpr_list, lw=3)
plt.xlim(0, 1.0)
plt.ylim(0, 1.0)
plt.xlabel('FPR', fontsize=15)
plt.ylabel('TPR', fontsize=15)
plt.show()
═════════════════════════════════════════════════════════════════════════
Figure 4: Receiver operating characteristic (ROC) curve
This curve is also known as the Receiver Operating Characteristic (ROC). And if we calculate the area under this ROC curve, we are calculating another metric which is used very often when you have a dataset which has skewed binary targets.
This metric is known as the Area Under ROC Curve or Area Under Curve or just simply AUC. There are many ways to calculate the area under the ROC curve. For this particular purpose, we will stick to the fantastic implementation by scikit-learn.
═════════════════════════════════════════════════════════════════════════
In [X]: from sklearn import metrics
In [X]: y_true = [0, 0, 0, 0, 1, 0, 1,
...: 0, 0, 1, 0, 1, 0, 0, 1]
In [X]: y_pred = [0.1, 0.3, 0.2, 0.6, 0.8, 0.05,
...: 0.9, 0.5, 0.3, 0.66, 0.3, 0.2,
...: 0.85, 0.15, 0.99]
In [X]: metrics.roc_auc_score(y_true, y_pred)
Out[X]: 0.8300000000000001
═════════════════════════════════════════════════════════════════════════
AUC values range from 0 to 1.
- AUC = 1 implies you have a perfect model. Most of the time, it means that you made some mistake with validation and should revisit data processing and validation pipeline of yours. If you didn’t make any mistakes, then congratulations, you have the best model one can have for the dataset you built it on.
- AUC = 0 implies that your model is very bad (or very good!). Try inverting the probabilities for the predictions, for example, if your probability for the positive class is p, try substituting it with 1-p. This kind of AUC may also mean that there is some problem with your validation or data processing.
- AUC = 0.5 implies that your predictions are random. So, for any binary classification problem, if I predict all targets as 0.5, I will get an AUC of 0.5.
AUC values between 0 and 0.5 imply that your model is worse than random. Most of the time, it’s because you inverted the classes. If you try to invert your predictions, your AUC might become more than 0.5. AUC values closer to 1 are considered good.
But what does AUC say about our model?
Suppose you get an AUC of 0.85 when you build a model to detect pneumothorax from chest x-ray images. This means that if you select a random image from your dataset with pneumothorax (positive sample) and another random image without pneumothorax (negative sample), then the pneumothorax image will rank higher than a non-pneumothorax image with a probability of 0.85.
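This ranking interpretation is easy to check by brute force over all positive-negative pairs; a small sketch using the y_true and y_pred lists from above:
═════════════════════════════════════════════════════════════════════════
# AUC as a ranking probability: the fraction of positive-negative
# pairs in which the positive sample gets the higher score;
# ties count as half a correct ordering
pos = [p for t, p in zip(y_true, y_pred) if t == 1]
neg = [p for t, p in zip(y_true, y_pred) if t == 0]
auc = sum(
    1.0 if p > n else 0.5 if p == n else 0.0
    for p in pos for n in neg
) / (len(pos) * len(neg))
# auc now matches metrics.roc_auc_score(y_true, y_pred)
═════════════════════════════════════════════════════════════════════════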
After calculating probabilities and AUC, you would want to make predictions on the test set. Depending on the problem and use-case, you might want to either have probabilities or actual classes. If you want to have probabilities, it’s effortless. You already have them. If you want to have classes, you need to select a threshold. In the case of binary classification, you can do something like the following.
Prediction = Probability >= Threshold
This means that prediction is a new list which contains only binary values. An item in prediction is 1 if the probability is greater than or equal to the given threshold, else the value is 0.
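In python, using the probability list from before (the threshold of 0.5 here is just an example value):
═════════════════════════════════════════════════════════════════════════
# convert probabilities to hard class predictions;
# the threshold of 0.5 here is just an example value
threshold = 0.5
prediction = [1 if p >= threshold else 0 for p in y_pred]
═════════════════════════════════════════════════════════════════════════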
And guess what, you can use the ROC curve to choose this threshold! The ROC curve will tell you how the threshold impacts false positive rate and true positive rate and thus, in turn, false positives and true positives. You should choose the threshold that is best suited for your problem and datasets.
For example, if you don’t want to have too many false positives, you should have a high threshold value. This will, however, also give you a lot more false negatives. Observe the trade-off and select the best threshold. Let’s see how these thresholds impact true positive and false positive values.
═════════════════════════════════════════════════════════════════════════
# empty lists to store true positive
# and false positive values
tp_list = []
fp_list = []
# actual targets
y_true = [0, 0, 0, 0, 1, 0, 1,
0, 0, 1, 0, 1, 0, 0, 1]
# predicted probabilities of a sample being 1
y_pred = [0.1, 0.3, 0.2, 0.6, 0.8, 0.05,
0.9, 0.5, 0.3, 0.66, 0.3, 0.2,
0.85, 0.15, 0.99]
# some handmade thresholds
thresholds = [0, 0.1, 0.2, 0.3, 0.4, 0.5,
0.6, 0.7, 0.8, 0.85, 0.9, 0.99, 1.0]
# loop over all thresholds
for thresh in thresholds:
# calculate predictions for a given threshold
temp_pred = [1 if x >= thresh else 0 for x in y_pred]
# calculate tp
temp_tp = true_positive(y_true, temp_pred)
# calculate fp
temp_fp = false_positive(y_true, temp_pred)
# append tp and fp to lists
tp_list.append(temp_tp)
fp_list.append(temp_fp)
═════════════════════════════════════════════════════════════════════════
Using this, we can create a table, as shown in Figure 5.
Figure 5: TP and FP values for different thresholds
Most of the time, the top-left point on the ROC curve should give you a quite good threshold, as shown in figure 6.
Comparing the table and the ROC curve, we see that a threshold of around 0.6 is quite good, where we neither lose a lot of true positives nor have a lot of false positives.
Figure 6: Select the best threshold from the leftmost top point in the ROC curve
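One common way to formalize this top-left choice, though not the only one, is Youden’s J statistic, TPR - FPR, maximized over the thresholds we already evaluated. A sketch using the tpr_list, fpr_list and thresholds from above:
═════════════════════════════════════════════════════════════════════════
# pick the threshold that maximizes TPR - FPR (Youden's J),
# one common heuristic for the "top-left" point of the curve
j_scores = [t - f for t, f in zip(tpr_list, fpr_list)]
best_threshold = thresholds[j_scores.index(max(j_scores))]
═════════════════════════════════════════════════════════════════════════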
AUC is a widely used metric for skewed binary classification tasks in the industry, and a metric everyone should know about. Once you understand the idea behind AUC, as explained in the paragraphs above, it is also easy to explain it to non-technical people who would probably be assessing your models in the industry.
Another important metric you should learn after learning AUC is log loss. In case of a binary classification problem, we define log loss as:
Log Loss = - 1.0 * ( target * log(prediction) + (1 - target) * log(1 - prediction) )
Where target is either 0 or 1 and prediction is a probability of a sample belonging to class 1.
For multiple samples in the dataset, the log loss over all samples is simply the average of all individual log losses. One thing to remember is that log loss penalizes incorrect or far-off predictions quite heavily, i.e. log loss punishes you for being very sure and very wrong.
═════════════════════════════════════════════════════════════════════════
import numpy as np
def log_loss(y_true, y_proba):
"""
Function to calculate log loss
:param y_true: list of true values
:param y_proba: list of probabilities for 1
:return: overall log loss
"""
# define an epsilon value
# this can also be an input
# this value is used to clip probabilities
epsilon = 1e-15
# initialize empty list to store
# individual losses
loss = []
# loop over all true and predicted probability values
for yt, yp in zip(y_true, y_proba):
# adjust probability
# 0 gets converted to 1e-15
# 1 gets converted to 1-1e-15
# Why? Think about it!
yp = np.clip(yp, epsilon, 1 - epsilon)
# calculate loss for one sample
temp_loss = - 1.0 * (
yt * np.log(yp)
+ (1 - yt) * np.log(1 - yp)
)
# add to loss list
loss.append(temp_loss)
# return mean loss over all samples
return np.mean(loss)
═════════════════════════════════════════════════════════════════════════ Let’s test our implementation:
═════════════════════════════════════════════════════════════════════════
In [X]: y_true = [0, 0, 0, 0, 1, 0, 1,
   ...:           0, 0, 1, 0, 1, 0, 0, 1]

In [X]: y_proba = [0.1, 0.3, 0.2, 0.6, 0.8, 0.05,
   ...:            0.9, 0.5, 0.3, 0.66, 0.3, 0.2,
   ...:            0.85, 0.15, 0.99]

In [X]: log_loss(y_true, y_proba)
Out[X]: 0.49882711861432294
═════════════════════════════════════════════════════════════════════════
We can compare this with scikit-learn:
═════════════════════════════════════════════════════════════════════════
In [X]: from sklearn import metrics

In [X]: metrics.log_loss(y_true, y_proba)
Out[X]: 0.49882711861432294
═════════════════════════════════════════════════════════════════════════
Thus, our implementation is correct. Implementing log loss is easy; interpreting it may seem a bit difficult. You must remember that log loss penalizes wrong predictions a lot more than other metrics do.
For example, if you are 51% sure about a sample belonging to class 1, log loss would be:
- 1.0 * ( 1 * log(0.51) + (1 - 1) * log(1 - 0.51) ) = 0.67

And if you are 49% sure about a sample belonging to class 0, log loss would be:

- 1.0 * ( 0 * log(0.49) + (1 - 0) * log(1 - 0.49) ) = 0.67
So, even though we can choose a cut-off at 0.5 and get perfect predictions, we will still have a very high log loss. So, when dealing with log loss, you need to be very careful; any non-confident prediction will have a very high log loss.
Most of the metrics that we discussed until now can be converted to a multi-class version. The idea is quite simple. Let’s take precision and recall. We can calculate precision and recall for each class in a multi-class classification problem.
There are three different ways to calculate this, which can get confusing from time to time. Let’s assume we are interested in precision first. We know that precision depends on true positives and false positives.
- Macro averaged precision: calculate precision for all classes individually and then average them
- Micro averaged precision: calculate class-wise true positive and false positive and then use that to calculate overall precision
- Weighted precision: same as macro, but in this case, it is a weighted average depending on the number of items in each class
This seems complicated but is easy to understand with python implementations. Let’s see how macro-averaged precision is implemented.
═════════════════════════════════════════════════════════════════════════
import numpy as np

def macro_precision(y_true, y_pred):
    """
    Function to calculate macro averaged precision
    :param y_true: list of true values
    :param y_pred: list of predicted values
    :return: macro precision score
    """
    # find the number of classes by taking
    # length of unique values in true list
    num_classes = len(np.unique(y_true))
    # initialize precision to 0
    precision = 0
    # loop over all classes
    for class_ in range(num_classes):
        # all classes except current are considered negative
        temp_true = [1 if p == class_ else 0 for p in y_true]
        temp_pred = [1 if p == class_ else 0 for p in y_pred]
        # calculate true positive for current class
        tp = true_positive(temp_true, temp_pred)
        # calculate false positive for current class
        fp = false_positive(temp_true, temp_pred)
        # calculate precision for current class
        temp_precision = tp / (tp + fp)
        # keep adding precision for all classes
        precision += temp_precision
    # calculate and return average precision over all classes
    precision /= num_classes
    return precision
═════════════════════════════════════════════════════════════════════════
You will notice that it wasn’t so difficult. Similarly, we have micro-averaged precision score.
═════════════════════════════════════════════════════════════════════════
import numpy as np

def micro_precision(y_true, y_pred):
    """
    Function to calculate micro averaged precision
    :param y_true: list of true values
    :param y_pred: list of predicted values
    :return: micro precision score
    """
    # find the number of classes by taking
    # length of unique values in true list
    num_classes = len(np.unique(y_true))
    # initialize tp and fp to 0
    tp = 0
    fp = 0
    # loop over all classes
    for class_ in range(num_classes):
        # all classes except current are considered negative
        temp_true = [1 if p == class_ else 0 for p in y_true]
        temp_pred = [1 if p == class_ else 0 for p in y_pred]
        # calculate true positive for current class
        # and update overall tp
        tp += true_positive(temp_true, temp_pred)
        # calculate false positive for current class
        # and update overall fp
        fp += false_positive(temp_true, temp_pred)
    # calculate and return overall precision
    precision = tp / (tp + fp)
    return precision
═════════════════════════════════════════════════════════════════════════
This isn’t difficult, either. Then what is? Nothing. Machine learning is easy. Now, let’s look at the implementation of weighted precision.
═════════════════════════════════════════════════════════════════════════
from collections import Counter
import numpy as np

def weighted_precision(y_true, y_pred):
    """
    Function to calculate weighted averaged precision
    :param y_true: list of true values
    :param y_pred: list of predicted values
    :return: weighted precision score
    """
    # find the number of classes by taking
    # length of unique values in true list
    num_classes = len(np.unique(y_true))
    # create class:sample count dictionary
    # it looks something like this:
    # {0: 20, 1: 15, 2: 21}
    class_counts = Counter(y_true)
    # initialize precision to 0
    precision = 0
    # loop over all classes
    for class_ in range(num_classes):
        # all classes except current are considered negative
        temp_true = [1 if p == class_ else 0 for p in y_true]
        temp_pred = [1 if p == class_ else 0 for p in y_pred]
        # calculate tp and fp for class
        tp = true_positive(temp_true, temp_pred)
        fp = false_positive(temp_true, temp_pred)
        # calculate precision of class
        temp_precision = tp / (tp + fp)
        # multiply precision with count of samples in class
        weighted_precision = class_counts[class_] * temp_precision
        # add to overall precision
        precision += weighted_precision
    # calculate overall precision by dividing by
    # total number of samples
    overall_precision = precision / len(y_true)
    return overall_precision
═════════════════════════════════════════════════════════════════════════
Let’s compare our implementations with scikit-learn to check whether we implemented them right.
═════════════════════════════════════════════════════════════════════════
In [X]: from sklearn import metrics

In [X]: y_true = [0, 1, 2, 0, 1, 2, 0, 2, 2]

In [X]: y_pred = [0, 2, 1, 0, 2, 1, 0, 0, 2]

In [X]: macro_precision(y_true, y_pred)
Out[X]: 0.3611111111111111

In [X]: metrics.precision_score(y_true, y_pred, average="macro")
Out[X]: 0.3611111111111111

In [X]: micro_precision(y_true, y_pred)
Out[X]: 0.4444444444444444

In [X]: metrics.precision_score(y_true, y_pred, average="micro")
Out[X]: 0.4444444444444444

In [X]: weighted_precision(y_true, y_pred)
Out[X]: 0.39814814814814814

In [X]: metrics.precision_score(y_true, y_pred, average="weighted")
Out[X]: 0.39814814814814814
═════════════════════════════════════════════════════════════════════════
It seems like we implemented everything correctly. Please note that the implementations shown here may not be the most efficient, but they are the easiest to understand.
Similarly, we can implement the recall metric for multi-class. Precision and recall depend on true positive, false positive and false negative while F1 depends on precision and recall.
Implementation of recall is left as an exercise for the reader, and one version of F1 for multi-class, i.e. the weighted average, is implemented below.
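To get you started on that exercise, here is a minimal sketch of macro-averaged recall. It assumes the true_positive and false_negative helper functions used throughout this chapter and simply mirrors macro_precision above; the micro and weighted versions follow the same pattern.

═════════════════════════════════════════════════════════════════════════
import numpy as np

def macro_recall(y_true, y_pred):
    """
    A sketch: macro averaged recall
    :param y_true: list of true values
    :param y_pred: list of predicted values
    :return: macro recall score
    """
    # find the number of classes by taking
    # length of unique values in true list
    num_classes = len(np.unique(y_true))
    # initialize recall to 0
    recall = 0
    # loop over all classes
    for class_ in range(num_classes):
        # all classes except current are considered negative
        temp_true = [1 if p == class_ else 0 for p in y_true]
        temp_pred = [1 if p == class_ else 0 for p in y_pred]
        # recall for current class is tp / (tp + fn)
        tp = true_positive(temp_true, temp_pred)
        fn = false_negative(temp_true, temp_pred)
        recall += tp / (tp + fn)
    # average recall over all classes
    return recall / num_classes
═════════════════════════════════════════════════════════════════════════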
═════════════════════════════════════════════════════════════════════════
from collections import Counter
import numpy as np

def weighted_f1(y_true, y_pred):
    """
    Function to calculate weighted f1 score
    :param y_true: list of true values
    :param y_pred: list of predicted values
    :return: weighted f1 score
    """
    # find the number of classes by taking
    # length of unique values in true list
    num_classes = len(np.unique(y_true))
    # create class:sample count dictionary
    # it looks something like this:
    # {0: 20, 1: 15, 2: 21}
    class_counts = Counter(y_true)
    # initialize f1 to 0
    f1 = 0
    # loop over all classes
    for class_ in range(num_classes):
        # all classes except current are considered negative
        temp_true = [1 if p == class_ else 0 for p in y_true]
        temp_pred = [1 if p == class_ else 0 for p in y_pred]
        # calculate precision and recall for class
        p = precision(temp_true, temp_pred)
        r = recall(temp_true, temp_pred)
        # calculate f1 of class
        if p + r != 0:
            temp_f1 = 2 * p * r / (p + r)
        else:
            temp_f1 = 0
        # multiply f1 with count of samples in class
        weighted_f1 = class_counts[class_] * temp_f1
        # add to overall f1
        f1 += weighted_f1
    # calculate overall F1 by dividing by
    # total number of samples
    overall_f1 = f1 / len(y_true)
    return overall_f1
═════════════════════════════════════════════════════════════════════════
Note that there are a few lines of code above which are new. And that’s why you should read the code carefully.
═════════════════════════════════════════════════════════════════════════
In [X]: from sklearn import metrics

In [X]: y_true = [0, 1, 2, 0, 1, 2, 0, 2, 2]

In [X]: y_pred = [0, 2, 1, 0, 2, 1, 0, 0, 2]

In [X]: weighted_f1(y_true, y_pred)
Out[X]: 0.41269841269841273

In [X]: metrics.f1_score(y_true, y_pred, average="weighted")
Out[X]: 0.41269841269841273
═════════════════════════════════════════════════════════════════════════
Thus, we have precision, recall and F1 implemented for multi-class problems. You can similarly convert AUC and log loss to multi-class formats too. This type of conversion is known as one-vs-all. I’m not going to implement them here as the implementation is quite similar to what we have already discussed.
In binary or multi-class classification, it is also quite popular to take a look at the confusion matrix. Don’t be confused; it’s quite easy. A confusion matrix is nothing but a table of TP, FP, TN and FN. Using the confusion matrix, you can quickly see how many samples were misclassified and how many were classified correctly.
One might argue that the confusion matrix should be covered quite early in this chapter, but I chose not to do it. If you understand TP, FP, TN, FN, precision, recall and AUC, it becomes quite easy to understand and interpret the confusion matrix. Let’s see what the confusion matrix looks like for a binary classification problem in figure 7.
We see that the confusion matrix is made up of TP, FP, FN and TN. These are the only values we need to calculate precision, recall, F1 score and AUC. Sometimes, people also prefer to call FP a Type-I error and FN a Type-II error.
Figure 7: Confusion matrix for a binary classification task
We can also expand the binary confusion matrix to a multi-class confusion matrix. What would that look like? If we have N classes, it will be a matrix of size NxN. For every class, we calculate the total number of samples that went to the class in question and to other classes. This is best understood by an example.
Suppose we have the following actual classes:
[0, 1, 2, 0, 1, 2, 0, 2, 2]
And our predictions are:
[0, 2, 1, 0, 2, 1, 0, 0, 2]
Then our confusion matrix will look as shown in figure 8.
What does figure 8 tell us?
Let’s look at class 0. We see that there are 3 instances of class 0 in the actual target, and all 3 of them are predicted correctly. However, the model also predicts class 0 for 1 instance whose actual class is 2. Ideally, for class 0 in the actual label, predicted labels 1 and 2 shouldn’t have any instances. Let’s see class 2. In the actual labels, this count adds up to 4, while in the predictions it adds up to 3. Only 1 instance has a perfect prediction for class 2; 2 instances go to class 1, and 1 goes to class 0.
A perfect confusion matrix should only be filled diagonally from left to right.
Figure 8: Confusion matrix for a multi-class problem
The confusion matrix gives an easy way to calculate the different metrics that we have discussed before. Scikit-learn offers an easy and straightforward way to generate a confusion matrix. Please note that the confusion matrix I have shown in figure 8 is a transpose of scikit-learn’s confusion matrix, and the original version can be plotted with the following code.
═════════════════════════════════════════════════════════════════════════
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics

# some targets
y_true = [0, 1, 2, 0, 1, 2, 0, 2, 2]
# some predictions
y_pred = [0, 2, 1, 0, 2, 1, 0, 0, 2]

# get confusion matrix from sklearn
cm = metrics.confusion_matrix(y_true, y_pred)

# plot using matplotlib and seaborn
plt.figure(figsize=(10, 10))
cmap = sns.cubehelix_palette(50, hue=0.05, rot=0, light=0.9, dark=0,
                             as_cmap=True)
sns.set(font_scale=2.5)
sns.heatmap(cm, annot=True, cmap=cmap, cbar=False)
plt.ylabel('Actual Labels', fontsize=20)
plt.xlabel('Predicted Labels', fontsize=20)
═════════════════════════════════════════════════════════════════════════
So, until now, we have tackled metrics for binary and multi-class classification. Then comes another type of classification problem called multi-label classification. In multi-label classification, each sample can have one or more classes associated with it. One simple example of this type of problem would be a task in which you are asked to predict different objects in a given image.
Figure 9: Different objects in an image4
Figure 9 shows an example image from a well-known dataset. Note that this dataset’s objective is something different but let’s not go there. Let’s assume that the aim is only to predict if an object is present in an image or not. For figure 9, we have a chair, flower-pot, window, but we don’t have other objects such as computer, bed, tv, etc. So, one image can have multiple targets associated with it. This type of problem is the multi-label classification problem.
The metrics for this type of classification problem are a bit different. Some suitable and most common metrics are:
- Precision at k (P@k)
- Average precision at k (AP@k)
- Mean average precision at k (MAP@k)
- Log loss

4 https://www.flickr.com/photos/krakluski/2950388100 License: CC BY 2.0
Let’s start with precision at k or P@k. One must not confuse this precision with the precision discussed earlier. If you have a list of original classes for a given sample and a list of predicted classes for the same, precision is defined as the number of hits in the predicted list considering only the top-k predictions, divided by k.
If that’s confusing, it will become apparent with python code.
═════════════════════════════════════════════════════════════════════════
def pk(y_true, y_pred, k):
    """
    This function calculates precision at k
    for a single sample
    :param y_true: list of values, actual classes
    :param y_pred: list of values, predicted classes
    :param k: the value for k
    :return: precision at a given value k
    """
    # if k is 0, return 0. we should never have this
    # as k is always >= 1
    if k == 0:
        return 0
    # we are interested only in top-k predictions
    y_pred = y_pred[:k]
    # convert predictions to set
    pred_set = set(y_pred)
    # convert actual values to set
    true_set = set(y_true)
    # find common values
    common_values = pred_set.intersection(true_set)
    # return length of common values over k
    return len(common_values) / len(y_pred[:k])
═════════════════════════════════════════════════════════════════════════
With code, everything becomes much easier to understand.
Now, we have average precision at k or AP@k. AP@k is calculated using P@k. For example, if we have to calculate AP@3, we calculate P@1, P@2 and P@3 and then divide the sum by 3.
Let’s see its implementation.
═════════════════════════════════════════════════════════════════════════
def apk(y_true, y_pred, k):
    """
    This function calculates average precision at k
    for a single sample
    :param y_true: list of values, actual classes
    :param y_pred: list of values, predicted classes
    :return: average precision at a given value k
    """
    # initialize p@k list of values
    pk_values = []
    # loop over all k. from 1 to k + 1
    for i in range(1, k + 1):
        # calculate p@i and append to list
        pk_values.append(pk(y_true, y_pred, i))
    # if we have no values in the list, return 0
    if len(pk_values) == 0:
        return 0
    # else, we return the sum of list over length of list
    return sum(pk_values) / len(pk_values)
═════════════════════════════════════════════════════════════════════════
These two functions can be used to calculate average precision at k (AP@k) for two given lists; let’s see how.
═════════════════════════════════════════════════════════════════════════
In [X]: y_true = [
   ...:     [1, 2, 3],
   ...:     [0, 2],
   ...:     [1],
   ...:     [2, 3],
   ...:     [1, 0],
   ...:     []
   ...: ]

In [X]: y_pred = [
   ...:     [0, 1, 2],
   ...:     [1],
   ...:     [0, 2, 3],
   ...:     [2, 3, 4, 0],
   ...:     [0, 1, 2],
   ...:     [0]
   ...: ]

In [X]: for i in range(len(y_true)):
   ...:     for j in range(1, 4):
   ...:         print(
   ...:             f"""
   ...:             y_true={y_true[i]},
   ...:             y_pred={y_pred[i]},
   ...:             AP@{j}={apk(y_true[i], y_pred[i], k=j)}
   ...:             """
   ...:         )
   ...:

y_true=[1, 2, 3],
y_pred=[0, 1, 2],
AP@1=0.0

y_true=[1, 2, 3],
y_pred=[0, 1, 2],
AP@2=0.25

y_true=[1, 2, 3],
y_pred=[0, 1, 2],
AP@3=0.38888888888888884
.
.
═════════════════════════════════════════════════════════════════════════
Please note that I have omitted many values from the output, but you get the point. So, this is how we can calculate AP@k which is per sample. In machine learning, we are interested in all samples, and that’s why we have mean average precision at k or MAP@k. MAP@k is just an average of AP@k and can be calculated easily by the following python code.
═════════════════════════════════════════════════════════════════════════
def mapk(y_true, y_pred, k):
    """
    This function calculates mean avg precision at k
    over all samples
    :param y_true: list of lists of values, actual classes
    :param y_pred: list of lists of values, predicted classes
    :return: mean avg precision at a given value k
    """
    # initialize empty list for apk values
    apk_values = []
    # loop over all samples
    for i in range(len(y_true)):
        # store apk values for every sample
        apk_values.append(
            apk(y_true[i], y_pred[i], k=k)
        )
    # return mean of apk values list
    return sum(apk_values) / len(apk_values)
═════════════════════════════════════════════════════════════════════════
Now, we can calculate MAP@k for k=1, 2, 3 and 4 for the same list of lists.
═════════════════════════════════════════════════════════════════════════
In [X]: y_true = [
   ...:     [1, 2, 3],
   ...:     [0, 2],
   ...:     [1],
   ...:     [2, 3],
   ...:     [1, 0],
   ...:     []
   ...: ]

In [X]: y_pred = [
   ...:     [0, 1, 2],
   ...:     [1],
   ...:     [0, 2, 3],
   ...:     [2, 3, 4, 0],
   ...:     [0, 1, 2],
   ...:     [0]
   ...: ]

In [X]: mapk(y_true, y_pred, k=1)
Out[X]: 0.3333333333333333

In [X]: mapk(y_true, y_pred, k=2)
Out[X]: 0.375

In [X]: mapk(y_true, y_pred, k=3)
Out[X]: 0.3611111111111111

In [X]: mapk(y_true, y_pred, k=4)
Out[X]: 0.34722222222222215
═════════════════════════════════════════════════════════════════════════
P@k, AP@k and MAP@k all range from 0 to 1, with 1 being the best.
Please note that sometimes you might see different implementations of P@k and AP@k on the internet. For example, let’s take a look at one of these implementations.
═════════════════════════════════════════════════════════════════════════
# taken from:
# https://github.com/benhamner/Metrics/blob/
# master/Python/ml_metrics/average_precision.py
import numpy as np

def apk(actual, predicted, k=10):
    """
    Computes the average precision at k.
    This function computes the AP at k between two lists of items.
    Parameters
    ----------
    actual : list
        A list of elements to be predicted (order doesn't matter)
    predicted : list
        A list of predicted elements (order does matter)
    k : int, optional
        The maximum number of predicted elements
    Returns
    -------
    score : double
        The average precision at k over the input lists
    """
    if len(predicted) > k:
        predicted = predicted[:k]
    score = 0.0
    num_hits = 0.0
    for i, p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i + 1.0)
    if not actual:
        return 0.0
    return score / min(len(actual), k)
═════════════════════════════════════════════════════════════════════════
This is another version of AP@k, in which order matters and the predictions are weighted. It will give slightly different results from the implementation I have presented.
Now, we come to log loss for multi-label classification. This is quite easy. You can convert the targets to binary format and then use log loss for each column. In the end, you can take the average of the log losses of all columns. This is also known as mean column-wise log loss. Of course, there are other ways to implement this, and you should explore them as you come across them.
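As a minimal sketch (assuming y_true is a binary indicator matrix of shape (n_samples, n_labels) and y_proba holds the corresponding predicted probabilities; mean_columnwise_log_loss is an illustrative name), it might look like this. Note that labels=[0, 1] is passed so that scikit-learn’s log_loss works even if a column contains a single class.

═════════════════════════════════════════════════════════════════════════
import numpy as np
from sklearn import metrics

def mean_columnwise_log_loss(y_true, y_proba):
    # a sketch: y_true is a binary matrix (n_samples, n_labels),
    # y_proba is a matrix of predicted probabilities, same shape
    y_true = np.asarray(y_true)
    y_proba = np.asarray(y_proba)
    # calculate log loss for every label column
    column_losses = [
        metrics.log_loss(y_true[:, j], y_proba[:, j], labels=[0, 1])
        for j in range(y_true.shape[1])
    ]
    # average over all columns
    return np.mean(column_losses)
═════════════════════════════════════════════════════════════════════════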
We have now reached a stage where we can say that we know all the binary, multi-class and multi-label classification metrics, and now we can move to regression metrics.
The most common metric in regression is error. Error is simple and very easy to understand.
Error = True Value - Predicted Value

Absolute error is just the absolute value of the above.

Absolute Error = Abs ( True Value - Predicted Value )

Then we have mean absolute error (MAE). It’s just the mean of all absolute errors.
═════════════════════════════════════════════════════════════════════════
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """
    This function calculates mae
    :param y_true: list of real numbers, true values
    :param y_pred: list of real numbers, predicted values
    :return: mean absolute error
    """
    # initialize error at 0
    error = 0
    # loop over all samples in the true and predicted list
    for yt, yp in zip(y_true, y_pred):
        # calculate absolute error
        # and add to error
        error += np.abs(yt - yp)
    # return mean error
    return error / len(y_true)
═════════════════════════════════════════════════════════════════════════
Similarly, we have squared error and mean squared error (MSE).

Squared Error = ( True Value - Predicted Value )^2

And mean squared error (MSE) can be implemented as follows.
═════════════════════════════════════════════════════════════════════════
def mean_squared_error(y_true, y_pred):
    """
    This function calculates mse
    :param y_true: list of real numbers, true values
    :param y_pred: list of real numbers, predicted values
    :return: mean squared error
    """
    # initialize error at 0
    error = 0
    # loop over all samples in the true and predicted list
    for yt, yp in zip(y_true, y_pred):
        # calculate squared error
        # and add to error
        error += (yt - yp) ** 2
    # return mean error
    return error / len(y_true)
═════════════════════════════════════════════════════════════════════════
MSE and RMSE (root mean squared error) are the most popular metrics used in evaluating regression models.
RMSE = SQRT ( MSE )
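A minimal numpy sketch (root_mean_squared_error is an illustrative name; lists or numpy arrays of real numbers are assumed):

═════════════════════════════════════════════════════════════════════════
import numpy as np

def root_mean_squared_error(y_true, y_pred):
    # RMSE is just the square root of MSE
    return np.sqrt(np.mean((np.array(y_true) - np.array(y_pred)) ** 2))
═════════════════════════════════════════════════════════════════════════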
Another type of error in the same class is squared logarithmic error. Some people call it SLE, and when we take the mean of this error across all samples, it is known as MSLE (mean squared logarithmic error) and implemented as follows.
═════════════════════════════════════════════════════════════════════════
import numpy as np

def mean_squared_log_error(y_true, y_pred):
    """
    This function calculates msle
    :param y_true: list of real numbers, true values
    :param y_pred: list of real numbers, predicted values
    :return: mean squared logarithmic error
    """
    # initialize error at 0
    error = 0
    # loop over all samples in true and predicted list
    for yt, yp in zip(y_true, y_pred):
        # calculate squared log error
        # and add to error
        error += (np.log(1 + yt) - np.log(1 + yp)) ** 2
    # return mean error
    return error / len(y_true)
═════════════════════════════════════════════════════════════════════════
Root mean squared logarithmic error is just the square root of this. It is also known as RMSLE.
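And a similar minimal numpy sketch for RMSLE (again with an illustrative name and lists or arrays assumed):

═════════════════════════════════════════════════════════════════════════
import numpy as np

def root_mean_squared_log_error(y_true, y_pred):
    # RMSLE is just the square root of MSLE
    squared_log_errors = (np.log1p(y_true) - np.log1p(y_pred)) ** 2
    return np.sqrt(np.mean(squared_log_errors))
═════════════════════════════════════════════════════════════════════════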
Then we have the percentage error:

Percentage Error = ( ( True Value - Predicted Value ) / True Value ) * 100

The same can be converted to mean percentage error for all samples.
═════════════════════════════════════════════════════════════════════════
def mean_percentage_error(y_true, y_pred):
    """
    This function calculates mpe
    :param y_true: list of real numbers, true values
    :param y_pred: list of real numbers, predicted values
    :return: mean percentage error
    """
    # initialize error at 0
    error = 0
    # loop over all samples in true and predicted list
    for yt, yp in zip(y_true, y_pred):
        # calculate percentage error
        # and add to error
        error += (yt - yp) / yt
    # return mean percentage error
    return error / len(y_true)
═════════════════════════════════════════════════════════════════════════
And an absolute version of the same (and a more common version) is known as mean absolute percentage error or MAPE.
═════════════════════════════════════════════════════════════════════════
import numpy as np

def mean_abs_percentage_error(y_true, y_pred):
    """
    This function calculates MAPE
    :param y_true: list of real numbers, true values
    :param y_pred: list of real numbers, predicted values
    :return: mean absolute percentage error
    """
    # initialize error at 0
    error = 0
    # loop over all samples in true and predicted list
    for yt, yp in zip(y_true, y_pred):
        # calculate absolute percentage error
        # and add to error
        error += np.abs(yt - yp) / yt
    # return mean absolute percentage error
    return error / len(y_true)
═════════════════════════════════════════════════════════════════════════
The best thing about regression is that only a few popular metrics apply to almost every regression problem, and they are much easier to understand than classification metrics.
Let’s talk about another regression metric known as R2 (R-squared), also known as the coefficient of determination.
In simple words, R-squared says how well your model fits the data. R-squared closer to 1.0 says that the model fits the data quite well, whereas closer to 0 means that the model isn’t that good. R-squared can also be negative when the model just makes absurd predictions.
The formula for R-squared is shown in figure 10, but as always, a python implementation makes things clearer.
Figure 10: Formula for R-squared
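For reference, the formula in figure 10 is the standard coefficient of determination, which the implementation below follows:

R2 = 1 - ( Sum of ( True Value - Predicted Value )^2 ) / ( Sum of ( True Value - Mean of True Values )^2 )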
═════════════════════════════════════════════════════════════════════════
import numpy as np

def r2(y_true, y_pred):
    """
    This function calculates r-squared score
    :param y_true: list of real numbers, true values
    :param y_pred: list of real numbers, predicted values
    :return: r2 score
    """
    # calculate the mean value of true values
    mean_true_value = np.mean(y_true)
    # initialize numerator with 0
    numerator = 0
    # initialize denominator with 0
    denominator = 0
    # loop over all true and predicted values
    for yt, yp in zip(y_true, y_pred):
        # update numerator
        numerator += (yt - yp) ** 2
        # update denominator
        denominator += (yt - mean_true_value) ** 2
    # calculate the ratio
    ratio = numerator / denominator
    # return 1 - ratio
    return 1 - ratio
═════════════════════════════════════════════════════════════════════════
There are many more evaluation metrics, and this list is never-ending. I could write a book that is only about different evaluation metrics. Maybe I will. For now, these evaluation metrics will fit almost every problem you want to attempt. Please note that I have implemented these metrics in the most straightforward manner, which means they are not efficient enough. You can make most of them very efficient by properly using numpy. For example, take a look at the implementation of mean absolute error without any loops.
═════════════════════════════════════════════════════════════════════════
import numpy as np

def mae_np(y_true, y_pred):
    # assumes y_true and y_pred are numpy arrays
    return np.mean(np.abs(y_true - y_pred))
═════════════════════════════════════════════════════════════════════════
I could have implemented all the metrics this way, but to learn, it’s better to look at the low-level implementations. Once you learn the low-level implementation in pure python, without using a lot of numpy, you can easily convert it to numpy and make it much faster.
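For instance, here are similar hedged numpy sketches of MSE and MAPE (mse_np and mape_np are illustrative names; numpy arrays are assumed as inputs):

═════════════════════════════════════════════════════════════════════════
import numpy as np

def mse_np(y_true, y_pred):
    # assumes y_true and y_pred are numpy arrays
    return np.mean((y_true - y_pred) ** 2)

def mape_np(y_true, y_pred):
    # assumes y_true and y_pred are numpy arrays with no zeros in y_true
    return np.mean(np.abs(y_true - y_pred) / y_true)
═════════════════════════════════════════════════════════════════════════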
Then, there are some advanced metrics.
One of them, which is quite widely used, is quadratic weighted kappa, also known as QWK. It is Cohen’s kappa with quadratic weights. QWK measures the “agreement” between two “ratings”. The ratings can be any numbers from 0 to N, and the predictions are in the same range. An agreement can be defined as how close these ratings are to each other. So, it’s suitable for a classification problem with N different categories/classes. If the agreement is high, the score is closer to 1.0. In the case of low agreement, the score is close to 0. Cohen’s kappa has a good implementation in scikit-learn, and a detailed discussion of this metric is beyond the scope of this book.
═════════════════════════════════════════════════════════════════════════
In [X]: from sklearn import metrics

In [X]: y_true = [1, 2, 3, 1, 2, 3, 1, 2, 3]

In [X]: y_pred = [2, 1, 3, 1, 2, 3, 3, 1, 2]

In [X]: metrics.cohen_kappa_score(y_true, y_pred, weights="quadratic")
Out[X]: 0.33333333333333337

In [X]: metrics.accuracy_score(y_true, y_pred)
Out[X]: 0.4444444444444444
═════════════════════════════════════════════════════════════════════════
You can see that even though accuracy is high, QWK is lower. A QWK greater than 0.85 is considered to be very good!
An important metric is the Matthews Correlation Coefficient (MCC). MCC ranges from -1 to 1: 1 is a perfect prediction, -1 is a perfectly wrong prediction, and 0 is a random prediction. The formula for MCC is quite simple.
MCC = ( TP * TN - FP * FN ) / [ (TP + FP) * (FN + TN) * (FP + TN) * (TP + FN) ] ^ 0.5
We see that MCC takes into consideration TP, FP, TN and FN and thus can be used for problems where classes are skewed. You can quickly implement it in python by using what we have already implemented.
═════════════════════════════════════════════════════════════════════════
def mcc(y_true, y_pred):
    """
    This function calculates Matthews Correlation Coefficient
    for binary classification.
    :param y_true: list of true values
    :param y_pred: list of predicted values
    :return: mcc score
    """
    tp = true_positive(y_true, y_pred)
    tn = true_negative(y_true, y_pred)
    fp = false_positive(y_true, y_pred)
    fn = false_negative(y_true, y_pred)

    numerator = (tp * tn) - (fp * fn)
    denominator = (
        (tp + fp) *
        (fn + tn) *
        (fp + tn) *
        (tp + fn)
    )
    denominator = denominator ** 0.5
    return numerator / denominator
═════════════════════════════════════════════════════════════════════════
These are the metrics that can help you get started and will apply to almost every machine learning problem.
One thing to keep in mind is that to evaluate unsupervised methods, for example, some kind of clustering, it’s better to create or manually label a test set and keep it separate from everything going on in your modelling part. When you are done with clustering, you can evaluate performance on the test set simply by using any of the supervised learning metrics.
Once we understand what metric to use for a given problem, we can start looking more deeply into our models for improvements.
Arranging machine learning projects
Finally, we are at a stage where we can start building our very first machine learning models.
Or are we?
Before we start, we must take care of a few things. Please remember that we will work in an IDE/text editor rather than jupyter notebooks. You can also work in jupyter notebooks, and it’s totally up to you. However, I will be using jupyter only for things like data exploration and for plotting charts and graphs. We will build the classification framework in such a way that most problems will become plug n’ play. You will be able to train a model without making too many changes to the code, and when you improve your models, you will be able to track them using git.
Let’s look at the structure of the files first of all. For any project that you are doing, create a new folder. For this example, I am calling the project “project”.
The inside of the project folder should look something like the following.

.
├── input
│ ├── train.csv
│ └── test.csv
├── src
│ ├── create_folds.py
│ ├── train.py
│ ├── inference.py
│ ├── models.py
│ ├── config.py
│ └── model_dispatcher.py
├── models
│ ├── model_rf.bin
│ └── model_et.bin
├── notebooks
│ ├── exploration.ipynb
│ └── check_data.ipynb
├── README.md
└── LICENSE
Let’s see what these folders and files are about.
input/: This folder consists of all the input files and data for your machine learning project. If you are working on NLP projects, you can keep your embeddings here. If you are working on image projects, all images go to a subfolder inside this folder.
src/: We will keep all the python scripts associated with the project here. If I talk about a python script, i.e. any *.py file, it is stored in the src folder.
models/: This folder keeps all the trained models.
notebooks/: All jupyter notebooks (i.e. any *.ipynb file) are stored in the notebooks folder.
README.md: This is a markdown file where you can describe your project and write instructions on how to train the model or to serve this in a production environment.
LICENSE: This is a simple text file that consists of a license for the project, such as MIT, Apache, etc. Going into details of the licenses is beyond the scope of this book.
Let’s assume you are building a model to classify the MNIST dataset (a dataset that has been used in almost every machine learning book). If you remember, we touched on the MNIST dataset in the cross-validation chapter too, so I am not going to explain what this dataset looks like. There are many different formats of the MNIST dataset available online, but we will be using the CSV format of the dataset.
In this format of the dataset, each row of the CSV consists of the label of the image and 784 pixel values ranging from 0 to 255. The dataset consists of 60000 images in this format.
We can use pandas to read this data format easily.
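As a minimal sketch (assuming the file is called mnist_train.csv and lives in the input/ folder, as noted below):

═════════════════════════════════════════════════════════════════════════
import pandas as pd

# read the CSV version of MNIST from the input/ folder
df = pd.read_csv("../input/mnist_train.csv")
# peek at the label counts (compare with figure 2)
print(df.label.value_counts())
═════════════════════════════════════════════════════════════════════════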
Please note that even though Figure 1 shows all pixel values as zeros, it is not the case.
Figure 1: MNIST dataset in CSV format
Let’s take a look at the counts of the label column in this dataset.
Figure 2: Counts of label in MNIST dataset
We don’t need much more exploration for this dataset. We already know what we have, and there is no need to make plots on different pixel values. From figure 2, it is quite clear that the distribution of labels is quite good and even. We can thus use accuracy/F1 as metrics. This is the first step when approaching a machine learning problem: decide the metric!
Now, we can code a little bit. We need to create the src/ folder and some python scripts.
Please note that the training CSV file is located in the input/ folder and is called mnist_train.csv.
What should these files look like for such a project?
The first script that one should create is create_folds.py.
This will create a new file in the input/ folder called mnist_train_folds.csv, and it’s the same as mnist_train.csv. The only differences are that this CSV is shuffled and has a new column called kfold.
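The script itself is not shown in this section, so here is only a minimal sketch of what create_folds.py might look like under the assumptions above (five folds, file names as described), following a k-fold pattern like the one from the cross-validation chapter:

═════════════════════════════════════════════════════════════════════════
# src/create_folds.py (a minimal sketch)
import pandas as pd
from sklearn import model_selection

if __name__ == "__main__":
    # read the training data
    df = pd.read_csv("../input/mnist_train.csv")
    # create a new column called kfold and fill it with -1
    df["kfold"] = -1
    # randomize the rows of the data
    df = df.sample(frac=1).reset_index(drop=True)
    # initiate the kfold class from the model_selection module
    kf = model_selection.KFold(n_splits=5)
    # fill the new kfold column
    for fold, (trn_, val_) in enumerate(kf.split(X=df)):
        df.loc[val_, "kfold"] = fold
    # save the new csv with the kfold column
    df.to_csv("../input/mnist_train_folds.csv", index=False)
═════════════════════════════════════════════════════════════════════════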
Once we have decided what kind of evaluation metric we want to use and have created the folds, we are good to go with creating a basic model. This is done in train.py.
═════════════════════════════════════════════════════════════════════════
# src/train.py
import joblib
import pandas as pd
from sklearn import metrics
from sklearn import tree

def run(fold):
    # read the training data with folds
    df = pd.read_csv("../input/mnist_train_folds.csv")
    # training data is where kfold is not equal to provided fold
    # also, note that we reset the index
    df_train = df[df.kfold != fold].reset_index(drop=True)
    # validation data is where kfold is equal to provided fold
    df_valid = df[df.kfold == fold].reset_index(drop=True)
    # drop the label column from dataframe and convert it to
    # a numpy array by using .values.
    # target is label column in the dataframe
    x_train = df_train.drop("label", axis=1).values
    y_train = df_train.label.values
    # similarly, for validation, we have
    x_valid = df_valid.drop("label", axis=1).values
    y_valid = df_valid.label.values
    # initialize simple decision tree classifier from sklearn
    clf = tree.DecisionTreeClassifier()
    # fit the model on training data
    clf.fit(x_train, y_train)
    # create predictions for validation samples
    preds = clf.predict(x_valid)
    # calculate & print accuracy
    accuracy = metrics.accuracy_score(y_valid, preds)
    print(f"Fold={fold}, Accuracy={accuracy}")
    # save the model
    joblib.dump(clf, f"../models/dt_{fold}.bin")

if __name__ == "__main__":
    run(fold=0)
    run(fold=1)
    run(fold=2)
    run(fold=3)
    run(fold=4)
═════════════════════════════════════════════════════════════════════════
You can run this script by calling python train.py in the console.
═════════════════════════════════════════════════════════════════════════
❯ python train.py
Fold=0, Accuracy=0.8680833333333333
Fold=1, Accuracy=0.8685
Fold=2, Accuracy=0.8674166666666666
Fold=3, Accuracy=0.8703333333333333
Fold=4, Accuracy=0.8699166666666667
═════════════════════════════════════════════════════════════════════════
When you look at the training script, you will see that there are still a few more things that are hardcoded, for example, the fold numbers, the training file and the output folder.
We can thus create a config file with all this information: config.py.
═════════════════════════════════════════════════════════════════════════
# config.py
TRAINING_FILE = "../input/mnist_train_folds.csv"
MODEL_OUTPUT = "../models/"
═════════════════════════════════════════════════════════════════════════
And we make some changes to our training script too. The training file utilizes the config file now, making it easier to change the data or the model output.
═════════════════════════════════════════════════════════════════════════
# train.py
import os

import config
import joblib
import pandas as pd
from sklearn import metrics
from sklearn import tree

def run(fold):
    # read the training data with folds
    df = pd.read_csv(config.TRAINING_FILE)
    # training data is where kfold is not equal to provided fold
    # also, note that we reset the index
    df_train = df[df.kfold != fold].reset_index(drop=True)
    # validation data is where kfold is equal to provided fold
    df_valid = df[df.kfold == fold].reset_index(drop=True)
    # drop the label column from dataframe and convert it to
    # a numpy array by using .values.
    # target is label column in the dataframe
    x_train = df_train.drop("label", axis=1).values
    y_train = df_train.label.values
    # similarly, for validation, we have
    x_valid = df_valid.drop("label", axis=1).values
    y_valid = df_valid.label.values
    # initialize simple decision tree classifier from sklearn
    clf = tree.DecisionTreeClassifier()
    # fit the model on training data
    clf.fit(x_train, y_train)
    # create predictions for validation samples
    preds = clf.predict(x_valid)
    # calculate & print accuracy
    accuracy = metrics.accuracy_score(y_valid, preds)
    print(f"Fold={fold}, Accuracy={accuracy}")
    # save the model
    joblib.dump(
        clf,
        os.path.join(config.MODEL_OUTPUT, f"dt_{fold}.bin")
    )

if __name__ == "__main__":
    run(fold=0)
    run(fold=1)
    run(fold=2)
    run(fold=3)
    run(fold=4)
═════════════════════════════════════════════════════════════════════════
Please note that I am not showing the difference between this training script and the one before. Please take a careful look at both of them and find the differences yourself. There aren’t many of them.
There is still one more thing related to the training script that can be improved. As you can see, we call the run function multiple times for every fold. Sometimes it’s not advisable to run multiple folds in the same script as the memory consumption may keep increasing, and your program may crash. To take care of this problem, we can pass arguments to the training script. I like doing it using argparse.
═════════════════════════════════════════════════════════════════════════
# train.py
import argparse
.
.
.
if __name__ == "__main__":
    # initialize ArgumentParser class of argparse
    parser = argparse.ArgumentParser()
    # add the different arguments you need and their type
    # currently, we only need fold
    parser.add_argument(
        "--fold",
        type=int
    )
    # read the arguments from the command line
    args = parser.parse_args()
    # run the fold specified by command line arguments
    run(fold=args.fold)
═════════════════════════════════════════════════════════════════════════
Now, we can run the python script again, but only for a given fold.
═════════════════════════════════════════════════════════════════════════
❯ python train.py --fold 0
Fold=0, Accuracy=0.8656666666666667
═════════════════════════════════════════════════════════════════════════
If you look carefully, our fold 0 score was a bit different before. This is because of the randomness in the model. We will come to handling randomness in later chapters.
Now, if you want, you can create a shell script with different commands for different folds and run them all together, as shown below.
═════════════════════════════════════════════════════════════════════════
#!/bin/sh
python train.py --fold 0
python train.py --fold 1
python train.py --fold 2
python train.py --fold 3
python train.py --fold 4
═════════════════════════════════════════════════════════════════════════
And you can run this by the following command.
═════════════════════════════════════════════════════════════════════════
❯ sh run.sh
Fold=0, Accuracy=0.8675
Fold=1, Accuracy=0.8693333333333333
Fold=2, Accuracy=0.8683333333333333
Fold=3, Accuracy=0.8704166666666666
Fold=4, Accuracy=0.8685
═════════════════════════════════════════════════════════════════════════
We have made quite some progress now, but if we look at our training script, we are still limited by a few things, for example, the model. The model is hardcoded in the training script, and the only way to change it is to modify the script. So, we will create a new python script called model_dispatcher.py, which, as the name suggests, will dispatch our models to our training script.
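The excerpt ends here before showing the dispatcher itself, so the following is only a hedged sketch of what model_dispatcher.py might look like: a plain dictionary mapping model names to scikit-learn estimators. The names decision_tree_gini and decision_tree_entropy are illustrative, not prescribed by the text.

═════════════════════════════════════════════════════════════════════════
# src/model_dispatcher.py (a minimal sketch)
from sklearn import tree

# map a model name to an instantiated scikit-learn model;
# train.py can then look a model up by name,
# e.g. models["decision_tree_gini"]
models = {
    "decision_tree_gini": tree.DecisionTreeClassifier(criterion="gini"),
    "decision_tree_entropy": tree.DecisionTreeClassifier(criterion="entropy"),
}
═════════════════════════════════════════════════════════════════════════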
DATA SCIENCE AND MACHINE LEARNING: MATHEMATICAL AND STATISTICAL METHODS

Dirk P. Kroese
Zdravko I. Botev
Thomas Taimre
Radislav Vaisman

30th October 2023
To my wife and daughters: Lesley, Elise, and Jessica — DPK
To Sarah, Sofia, and my parents — ZIB
To my grandparents: Arno, Harry, Juta, and Maila — TT
To Valerie — RV
CONTENTS
Preface xiii Notation xvii
1 Importing, Summarizing, and Visualizing Data 1 1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Structuring Features According to Type . . . . . . . . . . . . . . . . . . 3 1.3 Summary Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.4 Summary Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.5 Visualizing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5.1 Plotting Qualitative Variables . . . . . . . . . . . . . . . . . . . . 9 1.5.2 Plotting Quantitative Variables . . . . . . . . . . . . . . . . . . . 9 1.5.3 Data Visualization in a Bivariate Setting . . . . . . . . . . . . . . 12 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2 Statistical Learning 19 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.2 Supervised and Unsupervised Learning . . . . . . . . . . . . . . . . . . . 20 2.3 Training and Test Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.4 Tradeoffs in Statistical Learning . . . . . . . . . . . . . . . . . . . . . . 31 2.5 Estimating Risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.5.1 In-Sample Risk . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 2.5.2 Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . . . . 37 2.6 Modeling Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 2.7 Multivariate Normal Models . . . . . . . . . . . . . . . . . . . . . . . . 44 2.8 Normal Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 2.9 Bayesian Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3 Monte Carlo Methods 67 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 3.2 Monte Carlo Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.2.1 Generating Random Numbers . . . . . . . . . . . . . . . . . . . 68 3.2.2 Simulating Random Variables . . . . . . . . . . . . . . . . . . . 69 3.2.3 Simulating Random Vectors and Processes . . . . . . . . . . . . . 74 3.2.4 Resampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 3.2.5 Markov Chain Monte Carlo . . . . . . . . . . . . . . . . . . . . . 78 3.3 Monte Carlo Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
vii
viii Contents
3.3.1 Crude Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.3.2 Bootstrap Method . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.3.3 Variance Reduction . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.4 Monte Carlo for Optimization . . . . . . . . . . . . . . . . . . . . . . . . 96 3.4.1 Simulated Annealing . . . . . . . . . . . . . . . . . . . . . . . . 96
3.4.2 Cross-Entropy Method . . . . . . . . . . . . . . . . . . . . . . . 100
3.4.3 Splitting for Optimization . . . . . . . . . . . . . . . . . . . . . . 103
3.4.4 Noisy Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 105
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
4 Unsupervised Learning 121 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 4.2 Risk and Loss in Unsupervised Learning . . . . . . . . . . . . . . . . . . 122 4.3 Expectation–Maximization (EM) Algorithm . . . . . . . . . . . . . . . . 128 4.4 Empirical Distribution and Density Estimation . . . . . . . . . . . . . . . 131 4.5 Clustering via Mixture Models . . . . . . . . . . . . . . . . . . . . . . . 135 4.5.1 Mixture Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
4.5.2 EM Algorithm for Mixture Models . . . . . . . . . . . . . . . . . 137
4.6 Clustering via Vector Quantization . . . . . . . . . . . . . . . . . . . . . 142 4.6.1 K-Means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
4.6.2 Clustering via Continuous Multiextremal Optimization . . . . . . 146
4.7 Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 4.8 Principal Component Analysis (PCA) . . . . . . . . . . . . . . . . . . . 153 4.8.1 Motivation: Principal Axes of an Ellipsoid . . . . . . . . . . . . . 153
4.8.2 PCA and Singular Value Decomposition (SVD) . . . . . . . . . . 155
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
5 Regression 167 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167 5.2 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169 5.3 Analysis via Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . 171 5.3.1 Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . . 171
5.3.2 Model Selection and Prediction . . . . . . . . . . . . . . . . . . . 172
5.3.3 Cross-Validation and Predictive Residual Sum of Squares . . . . . 173
5.3.4 In-Sample Risk and Akaike Information Criterion . . . . . . . . . 175
5.3.5 Categorical Features . . . . . . . . . . . . . . . . . . . . . . . . 177
5.3.6 Nested Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
5.3.7 Coefficient of Determination . . . . . . . . . . . . . . . . . . . . 181
5.4 Inference for Normal Linear Models . . . . . . . . . . . . . . . . . . . . 182 5.4.1 Comparing Two Normal Linear Models . . . . . . . . . . . . . . 183
5.4.2 Confidence and Prediction Intervals . . . . . . . . . . . . . . . . 186
5.5 Nonlinear Regression Models . . . . . . . . . . . . . . . . . . . . . . . . 188 5.6 Linear Models in Python . . . . . . . . . . . . . . . . . . . . . . . . . . 191 5.6.1 Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
5.6.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
5.6.3 Analysis of Variance (ANOVA) . . . . . . . . . . . . . . . . . . 195
Contents ix
5.6.4 Confidence and Prediction Intervals . . . . . . . . . . . . . . . . 198
5.6.5 Model Validation . . . . . . . . . . . . . . . . . . . . . . . . . . 198
5.6.6 Variable Selection . . . . . . . . . . . . . . . . . . . . . . . . . . 199
5.7 Generalized Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . 204 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
6 Regularization and Kernel Methods 215 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 6.2 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216 6.3 Reproducing Kernel Hilbert Spaces . . . . . . . . . . . . . . . . . . . . . 222 6.4 Construction of Reproducing Kernels . . . . . . . . . . . . . . . . . . . . 224 6.4.1 Reproducing Kernels via Feature Mapping . . . . . . . . . . . . . 224
6.4.2 Kernels from Characteristic Functions . . . . . . . . . . . . . . . 225
6.4.3 Reproducing Kernels Using Orthonormal Features . . . . . . . . 227
6.4.4 Kernels from Kernels . . . . . . . . . . . . . . . . . . . . . . . . 229
6.5 Representer Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230 6.6 Smoothing Cubic Splines . . . . . . . . . . . . . . . . . . . . . . . . . . 235 6.7 Gaussian Process Regression . . . . . . . . . . . . . . . . . . . . . . . . 238 6.8 Kernel PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
7 Classification 251 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251 7.2 Classification Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253 7.3 Classification via Bayes’ Rule . . . . . . . . . . . . . . . . . . . . . . . 257 7.4 Linear and Quadratic Discriminant Analysis . . . . . . . . . . . . . . . . 259 7.5 Logistic Regression and Softmax Classification . . . . . . . . . . . . . . 266 7.6 K-Nearest Neighbors Classification . . . . . . . . . . . . . . . . . . . . . 268 7.7 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . 269 7.8 Classification with Scikit-Learn . . . . . . . . . . . . . . . . . . . . . . . 277 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
8 Decision Trees and Ensemble Methods 287 8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287 8.2 Top-Down Construction of Decision Trees . . . . . . . . . . . . . . . . . 289 8.2.1 Regional Prediction Functions . . . . . . . . . . . . . . . . . . . 290
8.2.2 Splitting Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
8.2.3 Termination Criterion . . . . . . . . . . . . . . . . . . . . . . . . 292
8.2.4 Basic Implementation . . . . . . . . . . . . . . . . . . . . . . . . 294
8.3 Additional Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . 298 8.3.1 Binary Versus Non-Binary Trees . . . . . . . . . . . . . . . . . . 298
8.3.2 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . 298
8.3.3 Alternative Splitting Rules . . . . . . . . . . . . . . . . . . . . . 298
8.3.4 Categorical Variables . . . . . . . . . . . . . . . . . . . . . . . . 299
8.3.5 Missing Values . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
8.4 Controlling the Tree Shape . . . . . . . . . . . . . . . . . . . . . . . . . 300 8.4.1 Cost-Complexity Pruning . . . . . . . . . . . . . . . . . . . . . . 303
x Contents
8.4.2 Advantages and Limitations of Decision Trees . . . . . . . . . . . 304
8.5 Bootstrap Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . 305 8.6 Random Forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309 8.7 Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
9 Deep Learning 323 9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323 9.2 Feed-Forward Neural Networks . . . . . . . . . . . . . . . . . . . . . . . 326 9.3 Back-Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330 9.4 Methods for Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334 9.4.1 Steepest Descent . . . . . . . . . . . . . . . . . . . . . . . . . . 334
9.4.2 Levenberg–Marquardt Method . . . . . . . . . . . . . . . . . . . 335
9.4.3 Limited-Memory BFGS Method . . . . . . . . . . . . . . . . . . 336
9.4.4 Adaptive Gradient Methods . . . . . . . . . . . . . . . . . . . . . 338
9.5 Examples in Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340 9.5.1 Simple Polynomial Regression . . . . . . . . . . . . . . . . . . . 340
9.5.2 Image Classification . . . . . . . . . . . . . . . . . . . . . . . . 344
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
A Linear Algebra and Functional Analysis 355 A.1 Vector Spaces, Bases, and Matrices . . . . . . . . . . . . . . . . . . . . . 355 A.2 Inner Product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360 A.3 Complex Vectors and Matrices . . . . . . . . . . . . . . . . . . . . . . . 361 A.4 Orthogonal Projections . . . . . . . . . . . . . . . . . . . . . . . . . . . 362 A.5 Eigenvalues and Eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . 363 A.5.1 Left- and Right-Eigenvectors . . . . . . . . . . . . . . . . . . . . 364
A.6 Matrix Decompositions . . . . . . . . . . . . . . . . . . . . . . . . . . . 368 A.6.1 (P)LU Decomposition . . . . . . . . . . . . . . . . . . . . . . . 368
A.6.2 Woodbury Identity . . . . . . . . . . . . . . . . . . . . . . . . . 370
A.6.3 Cholesky Decomposition . . . . . . . . . . . . . . . . . . . . . . 373
A.6.4 QR Decomposition and the Gram–Schmidt Procedure . . . . . . . 375
A.6.5 Singular Value Decomposition . . . . . . . . . . . . . . . . . . . 376
A.6.6 Solving Structured Matrix Equations . . . . . . . . . . . . . . . . 379
A.7 Functional Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384 A.8 Fourier Transforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390 A.8.1 Discrete Fourier Transform . . . . . . . . . . . . . . . . . . . . . 392
A.8.2 Fast Fourier Transform . . . . . . . . . . . . . . . . . . . . . . . 394
B Multivariate Differentiation and Optimization 397 B.1 Multivariate Differentiation . . . . . . . . . . . . . . . . . . . . . . . . . 397 B.1.1 Taylor Expansion . . . . . . . . . . . . . . . . . . . . . . . . . . 400
B.1.2 Chain Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
B.2 Optimization Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402 B.2.1 Convexity and Optimization . . . . . . . . . . . . . . . . . . . . 403
B.2.2 Lagrangian Method . . . . . . . . . . . . . . . . . . . . . . . . . 406
B.2.3 Duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
B.3 Numerical Root-Finding and Minimization . . . . . . . . . . . . . . . . . 408
B.3.1 Newton-Like Methods . . . . . . . . . . . . . . . . . . . . . . . 409
B.3.2 Quasi-Newton Methods . . . . . . . . . . . . . . . . . . . . . . . 411
B.3.3 Normal Approximation Method . . . . . . . . . . . . . . . . . . 413
B.3.4 Nonlinear Least Squares . . . . . . . . . . . . . . . . . . . . . . 414
B.4 Constrained Minimization via Penalty Functions . . . . . . . . . . . . . . 415
C Probability and Statistics 421
C.1 Random Experiments and Probability Spaces . . . . . . . . . . . . . . . 421
C.2 Random Variables and Probability Distributions . . . . . . . . . . . . . . 422
C.3 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
C.4 Joint Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
C.5 Conditioning and Independence . . . . . . . . . . . . . . . . . . . . . . . 428
C.5.1 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . 428
C.5.2 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 428
C.5.3 Expectation and Covariance . . . . . . . . . . . . . . . . . . . . 429
C.5.4 Conditional Density and Conditional Expectation . . . . . . . . . 431
C.6 Functions of Random Variables . . . . . . . . . . . . . . . . . . . . . . . 431
C.7 Multivariate Normal Distribution . . . . . . . . . . . . . . . . . . . . . . 434
C.8 Convergence of Random Variables . . . . . . . . . . . . . . . . . . . . . 439
C.9 Law of Large Numbers and Central Limit Theorem . . . . . . . . . . . . 445
C.10 Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
C.11 Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453
C.12 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
C.12.1 Method of Moments . . . . . . . . . . . . . . . . . . . . . . . . 455
C.12.2 Maximum Likelihood Method . . . . . . . . . . . . . . . . . . . 456
C.13 Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457 C.14 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458
D Python Primer 463
D.1 Getting Started . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
D.2 Python Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
D.3 Types and Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466
D.4 Functions and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 468
D.5 Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469
D.6 Flow Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
D.7 Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 472
D.8 Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
D.9 Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475
D.10 NumPy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478
D.10.1 Creating and Shaping Arrays . . . . . . . . . . . . . . . . . . . . 478
D.10.2 Slicing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 480
D.10.3 Array Operations . . . . . . . . . . . . . . . . . . . . . . . . . . 480
D.10.4 Random Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . 482
D.11 Matplotlib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 483
D.11.1 Creating a Basic Plot . . . . . . . . . . . . . . . . . . . . . . . . 483
D.12 Pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485
D.12.1 Series and DataFrame . . . . . . . . . . . . . . . . . . . . . . . . 485
D.12.2 Manipulating Data Frames . . . . . . . . . . . . . . . . . . . . . 487
D.12.3 Extracting Information . . . . . . . . . . . . . . . . . . . . . . . 488
D.12.4 Plotting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 490
D.13 Scikit-learn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 490
D.13.1 Partitioning the Data . . . . . . . . . . . . . . . . . . . . . . . . 491
D.13.2 Standardization . . . . . . . . . . . . . . . . . . . . . . . . . . . 491
D.13.3 Fitting and Prediction . . . . . . . . . . . . . . . . . . . . . . . . 492
D.13.4 Testing the Model . . . . . . . . . . . . . . . . . . . . . . . . . . 492
D.14 System Calls, URL Access, and Speed-Up . . . . . . . . . . . . . . . . . 493
Bibliography 495
Index 503
PREFACE
In our present world of automation, cloud computing, algorithms, artificial intelligence, and big data, few topics are as relevant as data science and machine learning. Their recent popularity lies not only in their applicability to real-life questions, but also in their natural blending of many different disciplines, including mathematics, statistics, computer science, engineering, science, and finance.
To someone starting to learn these topics, the multitude of computational techniques and mathematical ideas may seem overwhelming. Some may be satisfied with only learning how to use off-the-shelf recipes to apply to practical situations. But what if the assumptions of the black-box recipe are violated? Can we still trust the results? How should the algorithm be adapted? To be able to truly understand data science and machine learning it is important to appreciate the underlying mathematics and statistics, as well as the resulting algorithms.
The purpose of this book is to provide an accessible, yet comprehensive, account of data science and machine learning. It is intended for anyone interested in gaining a better understanding of the mathematics and statistics that underpin the rich variety of ideas and machine learning algorithms in data science. Our viewpoint is that computer languages come and go, but the underlying key ideas and algorithms will remain forever and will form the basis for future developments.
Before we turn to a description of the topics in this book, we would like to say a few words about its philosophy. This book resulted from various courses in data science and machine learning at the Universities of Queensland and New South Wales, Australia. When we taught these courses, we noticed that students were eager to learn not only how to apply algorithms but also to understand how these algorithms actually work. However, many existing textbooks assumed either too much background knowledge (e.g., measure theory and functional analysis) or too little (everything is a black box), and the information overload from often disjointed and contradictory internet sources made it more difficult for students to gradually build up their knowledge and understanding. We therefore wanted to write a book about data science and machine learning that can be read as a linear story, with a substantial “backstory” in the appendices. The main narrative starts very simply and builds up gradually to quite an advanced level. The backstory contains all the necessary
background, as well as additional information, from linear algebra and functional analysis (Appendix A), multivariate differentiation and optimization (Appendix B), and probability and statistics (Appendix C). Moreover, to make the abstract ideas come alive, we believe it is important that the reader sees actual implementations of the algorithms, directly translated from the theory. After some deliberation we have chosen Python as our programming language. It is freely available and has been adopted as the programming language of choice for many practitioners in data science and machine learning. It has many useful packages for data manipulation (often ported from R) and has been designed to be easy to program. A gentle introduction to Python is given in Appendix D.
To keep the book manageable in size we had to be selective in our choice of topics. Important ideas and connections between various concepts are highlighted via keywords and page references (indicated by a ☞) in the margin. Key definitions and theorems are highlighted in boxes. Whenever feasible we provide proofs of theorems. Finally, we place great importance on notation. It is often the case that once a consistent and concise system of notation is in place, seemingly difficult ideas suddenly become obvious. We use different fonts to distinguish between different types of objects. Vectors are denoted by letters in boldface italics, x, X, and matrices by uppercase letters in boldface roman font, A, K. We also distinguish between random vectors and their values by using upper and lower case letters, e.g., X (random vector) and x (its value or outcome). Sets are usually denoted by calligraphic letters G, H. The symbols for probability and expectation are P and E, respectively. Distributions are indicated by sans serif font, as in Bin and Gamma; exceptions are the ubiquitous notations N and U for the normal and uniform distributions. A summary of the most important symbols and abbreviations is given on Pages xvii–xxi.
Data science provides the language and techniques necessary for understanding and dealing with data. It involves the design, collection, analysis, and interpretation of numerical data, with the aim of extracting patterns and other useful information. Machine learning, which is closely related to data science, deals with the design of algorithms and computer resources to learn from data. The organization of the book follows roughly the typical steps in a data science project: Gathering data to gain information about a research question; cleaning, summarization, and visualization of the data; modeling and analysis of the data; translating decisions about the model into decisions and predictions about the research question. As this is a mathematics and statistics oriented book, most emphasis will be on modeling and analysis.
We start in Chapter 1 with the reading, structuring, summarization, and visualization of data using the data manipulation package pandas in Python. Although the material covered in this chapter requires no mathematical knowledge, it forms an obvious starting point for data science: to better understand the nature of the available data. In Chapter 2, we introduce the main ingredients of statistical learning. We distinguish between supervised and unsupervised learning techniques, and discuss how we can assess the predictive performance of (un)supervised learning methods. An important part of statistical learning is the modeling of data. We introduce various useful models in data science including linear, multivariate Gaussian, and Bayesian models. Many algorithms in machine learning and data science make use of Monte Carlo techniques, which is the topic of Chapter 3. Monte Carlo can be used for simulation, estimation, and optimization. Chapter 4 is concerned with unsupervised learning, where we discuss techniques such as density estimation, clustering, and principal component analysis. We then turn our attention to supervised learning
in Chapter 5, and explain the ideas behind a broad class of regression models. Therein, we also describe how Python's statsmodels package can be used to define and analyze linear models. Chapter 6 builds upon the previous regression chapter by developing the powerful concepts of kernel methods and regularization, which allow the fundamental ideas of Chapter 5 to be expanded in an elegant way, using the theory of reproducing kernel Hilbert spaces. In Chapter 7, we proceed with the classification task, which also belongs to the supervised learning framework, and consider various methods for classification, including Bayes classification, linear and quadratic discriminant analysis, K-nearest neighbors, and support vector machines. In Chapter 8 we consider versatile methods for regression and classification that make use of tree structures. Finally, in Chapter 9, we consider the workings of neural networks and deep learning, and show that these learning algorithms have a simple mathematical interpretation. An extensive range of exercises is provided at the end of each chapter.
Python code and data sets for each chapter can be downloaded from the GitHub site: https://github.com/DSML-book
Acknowledgments
Some of the Python code for Chapters 1 and 5 was adapted from [73]. We thank Benoit Liquet for making this available, and Lauren Jones for translating the R code into Python. We thank all who through their comments, feedback, and suggestions have contributed to this book, including Qibin Duan, Luke Taylor, Rémi Mouzayek, Harry Goodman, Bryce Stansfield, Ryan Tongs, Dillon Steyl, Bill Rudd, Nan Ye, Christian Hirsch, Chris van der Heide, Sarat Moka, Aapeli Vuorinen, Joshua Ross, Giang Nguyen, and the anonymous referees. David Grubbs deserves a special accolade for his professionalism and attention to detail in his role as Editor for this book.
The book was test-run during the 2019 Summer School of the Australian Mathematical Sciences Institute. More than 80 bright upper-undergraduate (Honours) students used the book for the course Mathematical Methods for Machine Learning, taught by Zdravko Botev. We are grateful for the valuable feedback that they provided.
Our special thanks go out to Robert Salomone, Liam Berry, Robin Carrick, and Sam Daley, who commented in great detail on earlier versions of the entire book and wrote and improved our Python code. Their enthusiasm, perceptiveness, and kind assistance have been invaluable.
Of course, none of this work would have been possible without the loving support, patience, and encouragement from our families, and we thank them with all our hearts. This book was financially supported by the Australian Research Council Centre of Excellence for Mathematical & Statistical Frontiers, under grant number CE140100049.
Dirk Kroese, Zdravko Botev,
Thomas Taimre, and Radislav Vaisman
Brisbane and Sydney
NOTATION
We could, of course, use any notation we want; do not laugh at notations; invent them, they are powerful. In fact, mathematics is, to a large extent, invention of better notations.
Richard P. Feynman
We have tried to use a notation system that is, in order of importance, simple, descriptive, consistent, and compatible with historical choices. Achieving all of these goals all of the time would be impossible, but we hope that our notation helps to quickly recognize the type or “flavor” of certain mathematical objects (vectors, matrices, random vectors, probability measures, etc.) and clarify intricate ideas.
We make use of various typographical aids, and it will be beneficial for the reader to be aware of some of these.
• Boldface font is used to indicate composite objects, such as column vectors x = [x1, . . . , xn]⊤ and matrices X = [xij]. Note also the difference between the upright bold font for matrices and the slanted bold font for vectors.
• Random variables are generally specified with upper case roman letters X, Y, Z and their outcomes with lower case letters x, y, z. Random vectors are thus denoted in upper case slanted bold font: X = [X1, . . . , Xn]⊤.
• Sets of vectors are generally written in calligraphic font, such as X, but the set of real numbers uses the common blackboard bold font R. Expectation and probability also use the latter font.
• Probability distributions use a sans serif font, such as Bin and Gamma. Exceptions to this rule are the “standard” notations N and U for the normal and uniform distributions.
• We often omit brackets when it is clear what the argument is of a function or operator. For example, we prefer EX^2 to E[X^2].
• We employ color to emphasize that certain words refer to a dataset, function, or package in Python. All code is written in typewriter font. To be compatible with past notation choices, we introduced a special blue symbol X for the model (design) matrix of a linear model.
• Important notation such as T, g, g∗ is often defined in a mnemonic way, such as T for “training”, g for “guess”, g∗ for the “star” (that is, optimal) guess, and ℓ for “loss”.
• We will occasionally use a Bayesian notation convention in which the same symbol is used to denote different (conditional) probability densities. In particular, instead of writing fX(x) and fX | Y (x | y) for the probability density function (pdf) of X and the conditional pdf of X given Y, we simply write f(x) and f(x | y). This particular style of notation can be of great descriptive value, despite its apparent ambiguity.
General font/notation rules
x scalar
x vector
X random vector
X matrix
X set
x̂ estimate or approximation
x∗ optimal
x̄ average
Common mathematical symbols
∀ for all
∃ there exists
∝ is proportional to
⊥ is perpendicular to
∼ is distributed as
iid∼, ∼iid are independent and identically distributed as
approx.∼ is approximately distributed as
∇f gradient of f
∇2f Hessian of f
f ∈ C^p f has continuous derivatives of order p
≈ is approximately
≃ is asymptotically
≪ is much smaller than
⊕ direct sum
⊙ elementwise product
∩ intersection
∪ union
:=, =: is defined as
a.s.−→ converges almost surely to
d−→ converges in distribution to
P−→ converges in probability to
Lp−→ converges in Lp-norm to
‖ · ‖ Euclidean norm
⌈x⌉ smallest integer larger than x
⌊x⌋ largest integer smaller than x
x+ max{x, 0}
Matrix/vector notation
A⊤, x⊤ transpose of matrix A or vector x
A−1 inverse of matrix A
A+ pseudo-inverse of matrix A
A−⊤ inverse of matrix A⊤ or transpose of A−1
A ≻ 0 matrix A is positive definite
A ⪰ 0 matrix A is positive semidefinite
dim(x) dimension of vector x
det(A) determinant of matrix A
|A| absolute value of the determinant of matrix A
tr(A) trace of matrix A
Reserved letters and words
C set of complex numbers
d differential symbol
E expectation
e the number 2.71828 . . .
f probability density (discrete or continuous)
g prediction function
1{A} or 1A indicator function of set A
i the square root of −1
ℓ risk: expected loss
Loss loss function
ln (natural) logarithm
N set of natural numbers {0, 1, . . .}
O big-O order symbol: f(x) = O(g(x)) if |f(x)| ≤ αg(x) for some constant α as x → a
o little-o order symbol: f(x) = o(g(x)) if f(x)/g(x) → 0 as x → a
P probability measure
π the number 3.14159 . . .
R set of real numbers (one-dimensional Euclidean space)
Rn n-dimensional Euclidean space
R+ positive real line: [0, ∞)
τ deterministic training set
T random training set
X model (design) matrix
Z set of integers {. . . , −1, 0, 1, . . .}
Probability distributions
Ber Bernoulli
Beta beta
Bin binomial
Exp exponential
Geom geometric
Gamma gamma
F Fisher–Snedecor F
N normal or Gaussian
Pareto Pareto
Poi Poisson
t Student’s t
U uniform
Abbreviations and acronyms
cdf cumulative distribution function
CMC crude Monte Carlo
CE cross-entropy
EM expectation–maximization
GP Gaussian process
KDE Kernel density estimate/estimator
KL Kullback–Leibler
KKT Karush–Kuhn–Tucker
iid independent and identically distributed
MAP maximum a posteriori
MCMC Markov chain Monte Carlo
MLE maximum likelihood estimator/estimate
OOB out-of-bag
PCA principal component analysis
pdf probability density function (discrete or continuous)
SVD singular value decomposition
CHAPTER 1
IMPORTING, SUMMARIZING, AND
VISUALIZING DATA
This chapter describes where to find useful data sets, how to load them into Python, and how to (re)structure the data. We also discuss various ways in which the data can be summarized via tables and figures. Which type of plots and numerical summaries are appropriate depends on the type of the variable(s) in play. Readers unfamiliar with Python are advised to read Appendix D first.
1.1 Introduction
Data comes in many shapes and forms, but can generally be thought of as being the result of some random experiment — an experiment whose outcome cannot be determined in advance, but whose workings are still subject to analysis. Data from a random experiment are often stored in a table or spreadsheet. A statistical convention is to denote variables — often called features — as columns and the individual items (or units) as rows. It is useful
to think of three types of columns in such a spreadsheet:
1. The first column is usually an identifier or index column, where each unit/row is given a unique name or ID.
2. Certain columns (features) can correspond to the design of the experiment, specify ing, for example, to which experimental group the unit belongs. Often the entries in these columns are deterministic; that is, they stay the same if the experiment were to be repeated.
3. Other columns represent the observed measurements of the experiment. Usually, these measurements exhibit variability; that is, they would change if the experiment were to be repeated.
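To make this concrete, here is a minimal sketch (with made-up data; the column and value names are our own) of such a spreadsheet as a pandas DataFrame:

import pandas as pd

# A toy spreadsheet: an identifier column, a design column ('group'),
# and an observed measurement column ('response').
df = pd.DataFrame({
    'unit': ['u1', 'u2', 'u3', 'u4'],                           # 1. identifiers
    'group': ['control', 'control', 'treatment', 'treatment'],  # 2. design
    'response': [5.1, 4.8, 6.3, 6.0]})                          # 3. measurements
print(df)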
There are many data sets available from the Internet and in software packages. A well-known repository of data sets is the Machine Learning Repository maintained by the University of California at Irvine (UCI), found at https://archive.ics.uci.edu/.
These data sets are typically stored in a CSV (comma separated values) format, which can be easily read into Python. For example, to access the abalone data set from this web site with Python, download the file to your working directory, import the pandas package via
import pandas as pd
and read in the data as follows:
abalone = pd.read_csv('abalone.data', header=None)
It is important to add header=None, as this tells pandas that the first line of the CSV file does not contain the names of the features, which it assumes by default. The data set was originally used to predict the age of abalone from physical measurements, such as shell weight and diameter.
Another useful repository of over 1000 data sets from various packages in the R programming language, collected by Vincent Arel-Bundock, can be found at:
https://vincentarelbundock.github.io/Rdatasets/datasets.html.
For example, to read Fisher's famous iris data set from R's datasets package into Python, type:
urlprefix = 'https://vincentarelbundock.github.io/Rdatasets/csv/'
dataname = 'datasets/iris.csv'
iris = pd.read_csv(urlprefix + dataname)
The iris data set contains four physical measurements (sepal/petal length/width) on 50 specimens (each) of 3 species of iris: setosa, versicolor, and virginica. Note that in this case the headers are included. The output of read_csv is a DataFrame object, which is pandas's implementation of a spreadsheet; see Section D.12.1 (☞ 485). The DataFrame method head gives the first few rows of the DataFrame, including the feature names. The number of rows can be passed as an argument and is 5 by default. For the iris DataFrame, we have:
iris.head()
Unnamed: 0 Sepal.Length ... Petal.Width Species 0 1 5.1 ... 0.2 setosa 1 2 4.9 ... 0.2 setosa 2 3 4.7 ... 0.2 setosa 3 4 4.6 ... 0.2 setosa 4 5 5.0 ... 0.2 setosa
[5 rows x 6 columns]
The names of the features can be obtained via the columns attribute of the DataFrame object, as in iris.columns. Note that the first column is a duplicate index column, whose name (assigned by pandas) is 'Unnamed: 0'. We can drop this column and reassign the iris object as follows:
iris = iris.drop('Unnamed: 0',1)
The data for each feature (corresponding to its specific name) can be accessed by using Python's slicing notation []. For example, the object iris['Sepal.Length'] contains the 150 sepal lengths.
The first three rows of the abalone data set from the UCI repository can be found as follows:
abalone.head(3)
0 1 2 3 4 5 6 7 8 0 M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.150 15 1 M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.070 7 2 F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.210 9
Here, the missing headers have been assigned according to the order of the natural numbers. The names should correspond to Sex, Length, Diameter, Height, Whole weight, Shucked weight, Viscera weight, Shell weight, and Rings, as described in the file with the name abalone.names on the UCI website. We can manually add the names of the features to the DataFrame by reassigning the columns attribute, as in:
abalone.columns = ['Sex', 'Length', 'Diameter', 'Height',
                   'Whole weight', 'Shucked weight', 'Viscera weight',
                   'Shell weight', 'Rings']
1.2 Structuring Features According to Type
We can generally classify features as either quantitative or qualitative. Quantitative features possess “numerical quantity”, such as height, age, number of births, etc., and can either be continuous or discrete. Continuous quantitative features take values in a continuous range of possible values, such as height, voltage, or crop yield; such features capture the idea that measurements can always be made more precisely. Discrete quantitative features have a countable number of possibilities, such as a count.
In contrast, qualitative features do not have a numerical meaning, but their possible values can be divided into a fixed number of categories, such as {M, F} for gender or {blue, black, brown, green} for eye color. For this reason such features are also called categorical. A simple rule of thumb is: if it does not make sense to average the data, it is categorical. For example, it does not make sense to average eye colors. Of course it is still possible to represent categorical data with numbers, such as 1 = blue, 2 = black, 3 = brown, but such numbers carry no quantitative meaning. Categorical features are often called factors.
When manipulating, summarizing, and displaying data, it is important to correctly specify the type of the variables (features). We illustrate this using the nutrition_elderly data set from [73], which contains the results of a study involving nutritional measurements of thirteen features (columns) for 226 elderly individuals (rows). The data set can be obtained from:
http://www.biostatisticien.eu/springeR/nutrition_elderly.xls.
Excel files can be read directly into pandas via the read_excel method:
xls = 'http://www.biostatisticien.eu/springeR/nutrition_elderly.xls'
nutri = pd.read_excel(xls)
This creates a DataFrame object nutri. The first three rows are as follows:
pd.set_option('display.max_columns', 8)  # to fit display
nutri.head(3)
gender situation tea ... cooked_fruit_veg chocol fat 0 2 1 0 ... 4 5 6 1 2 1 1 ... 5 1 4 2 2 1 0 ... 2 5 4
[3 rows x 13 columns]
You can check the type (or structure) of the variables via the info method of nutri.
nutri.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226 entries, 0 to 225
Data columns (total 13 columns):
gender              226 non-null int64
situation           226 non-null int64
tea                 226 non-null int64
coffee              226 non-null int64
height              226 non-null int64
weight              226 non-null int64
age                 226 non-null int64
meat                226 non-null int64
fish                226 non-null int64
raw_fruit           226 non-null int64
cooked_fruit_veg    226 non-null int64
chocol              226 non-null int64
fat                 226 non-null int64
dtypes: int64(13)
memory usage: 23.0 KB
All 13 features in nutri are (at the moment) interpreted by Python as quantitative variables, indeed as integers, simply because they have been entered as whole numbers. The meaning of these numbers becomes clear when we consider the description of the features, given in Table 1.2. Table 1.1 shows how the variable types should be classified.
Table 1.1: The feature types for the data frame nutri.
Qualitative              gender, situation, fat
                         meat, fish, raw_fruit, cooked_fruit_veg, chocol
Discrete quantitative    tea, coffee
Continuous quantitative  height, weight, age
Note that the categories of the qualitative features in the second row of Table 1.1, meat, . . . , chocol, have a natural order. Such qualitative features are sometimes called ordinal, in contrast to qualitative features without order, which are called nominal. We will not make such a distinction in this book.
Table 1.2: Description of the variables in the nutritional study [73].

Feature            Description                      Unit or Coding
gender             Gender                           1=Male; 2=Female
situation          Family status                    1=Single; 2=Living with spouse;
                                                    3=Living with family;
                                                    4=Living with someone else
tea                Daily consumption of tea         Number of cups
coffee             Daily consumption of coffee      Number of cups
height             Height                           cm
weight             Weight (actually: mass)          kg
age                Age at date of interview         Years
meat               Consumption of meat              0=Never; 1=Less than once a week;
                                                    2=Once a week; 3=2–3 times a week;
                                                    4=4–6 times a week; 5=Every day
fish               Consumption of fish              As in meat
raw_fruit          Consumption of raw fruits        As in meat
cooked_fruit_veg   Consumption of cooked            As in meat
                   fruits and vegetables
chocol             Consumption of chocolate         As in meat
fat                Type of fat used for cooking     1=Butter; 2=Margarine; 3=Peanut oil;
                                                    4=Sunflower oil; 5=Olive oil;
                                                    6=Mix of vegetable oils (e.g., Isio4);
                                                    7=Colza oil; 8=Duck or goose fat
We can modify the Python value and type for each categorical feature, using the replace and astype methods. For categorical features, such as gender, we can replace the value 1 with 'Male' and 2 with 'Female', and change the type to 'category' as follows.
DICT = {1: 'Male', 2: 'Female'}  # dictionary specifies replacement
nutri['gender'] = nutri['gender'].replace(DICT).astype('category')
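The same pattern works for the other qualitative features. For instance, a sketch for 'situation', using the coding of Table 1.2 (the short labels 'Single', 'Couple', etc. are our own choice, matching the output shown in Section 1.3):

SITUATION_DICT = {1: 'Single', 2: 'Couple', 3: 'Family', 4: 'Other'}
nutri['situation'] = nutri['situation'].replace(SITUATION_DICT).astype('category')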
The structure of the other categorical-type features can be changed in a similar way. Continuous features such as height should have type float:
nutri['height'] = nutri['height'].astype(float)
We can repeat this for the other variables (see Exercise 2) and save this modified data frame as a CSV file, by using the pandas method to_csv.
nutri.to_csv('nutri.csv', index=False)
1.3 Summary Tables
It is often useful to summarize a large spreadsheet of data in a more condensed form. A table of counts or a table of frequencies makes it easier to gain insight into the underlying distribution of a variable, especially if the data are qualitative. Such tables can be obtained with the methods describe and value_counts.
As a first example, we load the nutri DataFrame, which we restructured and saved (see previous section) as 'nutri.csv', and then construct a summary for the feature (column) 'fat'.
nutri = pd.read_csv('nutri.csv')
nutri['fat'].describe()
count 226
unique 8
top sunflower
freq 68
Name: fat, dtype: object
We see that there are 8 different types of fat used and that sunflower has the highest count, with 68 out of 226 individuals using this type of cooking fat. The method value_counts gives the counts for the different fat types.
nutri['fat'].value_counts()
sunflower 68
peanut 48
olive 40
margarine 27
Isio4 23
butter 15
duck 4
colza 1
Name: fat, dtype: int64
Column labels are also attributes of a DataFrame, and nutri.fat, for example, is exactly the same object as nutri['fat'].
It is also possible to use crosstab to cross-tabulate between two or more variables, giving a contingency table:
pd.crosstab(nutri.gender, nutri.situation)
situation Couple Family Single
gender
Female 56 7 78
Male 63 2 20
We see, for example, that the proportion of single men is substantially smaller than the proportion of single women in the data set of elderly people. To add row and column totals to a table, use margins=True.
pd.crosstab(nutri.gender, nutri.situation, margins=True)
situation Couple Family Single All
gender
Female 56 7 78 141
Male 63 2 20 85
All 119 9 98 226
1.4 Summary Statistics
In the following, x = [x1, . . . , xn]⊤ is a column vector of n numbers. For our nutri data, the vector x could, for example, correspond to the heights of the n = 226 individuals.
The sample mean of x, denoted by $\bar{x}$, is simply the average of the data values:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$
Using the mean method in Python for the nutri data, we have, for instance:
nutri['height'].mean()
163.96017699115043
The p-sample quantile (0 < p < 1) of x is a value x such that at least a fraction p of the data is less than or equal to x and at least a fraction 1 − p of the data is greater than or equal to x. The sample median is the sample 0.5-quantile. The p-sample quantile is also called the 100 × p percentile. The 25, 50, and 75 sample percentiles are called the first, second, and third quartiles of the data. For the nutri data they are obtained as follows.
nutri['height'].quantile(q=[0.25, 0.5, 0.75])
0.25 157.0
0.50 163.0
0.75 170.0
The sample mean and median give information about the location of the data, while the distance between sample quantiles (say the 0.1 and 0.9 quantiles) gives some indication of the dispersion (spread) of the data. Other measures for dispersion are the sample range, $\max_i x_i - \min_i x_i$, the sample variance
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad (1.1)$$
and the sample standard deviation $s = \sqrt{s^2}$. For the nutri data, the range (in cm) is (☞ 455):
nutri['height'].max() - nutri['height'].min()
48.0
The variance (in cm2) is:
round(nutri['height'].var(), 2)  # round to two decimal places
81.06
And the standard deviation can be found via:
round(nutri['height'].std(), 2)
9.0
We already encountered the describe method in the previous section for summarizing qualitative features, via the most frequent count and the number of unique elements. When applied to a quantitative feature, it returns instead the minimum, maximum, mean, and the three quartiles. For example, the 'height' feature in the nutri data has the following summary statistics.
nutri['height'].describe()
count 226.000000
mean 163.960177
std 9.003368
min 140.000000
25%    157.000000
50%    163.000000
75%    170.000000
max 188.000000
Name: height, dtype: float64
1.5 Visualizing Data
In this section we describe various methods for visualizing data. The main point we would like to make is that the way in which variables are visualized should always be adapted to the variable types; for example, qualitative data should be plotted differently from quantitative data.
For the rest of this section, it is assumed that matplotlib.pyplot, pandas, and numpy, have been imported in the Python code as follows.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
1.5.1 Plotting Qualitative Variables
Suppose we wish to display graphically how many elderly people are living by themselves, as a couple, with family, or other. Recall that the data are given in the situation column of our nutri data. Assuming that we already restructured the data, as in Section 1.2 (☞ 3), we can make a barplot of the number of people in each category via the plt.bar function of the standard matplotlib plotting library. The inputs are the x-axis positions, heights, and widths of each bar respectively.
width = 0.35 # the width of the bars
x = [0, 0.8, 1.6] # the bar positions on x-axis
situation_counts = nutri['situation'].value_counts()
plt.bar(x, situation_counts, width, edgecolor='black')
plt.xticks(x, situation_counts.index)
plt.show()
Figure 1.1: Barplot for the qualitative variable 'situation'.
1.5.2 Plotting Quantitative Variables
We now present a few useful methods for visualizing quantitative data, again using the nutri data set. We will first focus on continuous features (e.g., 'age') and then add some specific graphs related to discrete features (e.g., 'tea'). The aim is to describe the variability present in a single feature. This typically involves a central tendency, where observations tend to gather around, with fewer observations further away. The main aspects of the distribution are the location (or center) of the variability, the spread of the variability (how far the values extend from the center), and the shape of the variability; e.g., whether or not values are spread symmetrically on either side of the center.
1.5.2.1 Boxplot
A boxplot can be viewed as a graphical representation of the five-number summary of the data consisting of the minimum, maximum, and the first, second, and third quartiles. Figure 1.2 gives a boxplot for the 'age' feature of the nutri data.
plt.boxplot(nutri['age'], widths=width, vert=False)
plt.xlabel('age')
plt.show()
The widths parameter determines the width of the boxplot, which is by default plotted vertically. Setting vert=False plots the boxplot horizontally, as in Figure 1.2.
Figure 1.2: Boxplot for 'age'.
The box is drawn from the first quartile (Q1) to the third quartile (Q3). The vertical line inside the box signifies the location of the median. So-called “whiskers” extend to either side of the box. The size of the box is called the interquartile range: IQR = Q3 − Q1. The left whisker extends to the largest of (a) the minimum of the data and (b) Q1 − 1.5 IQR. Similarly, the right whisker extends to the smallest of (a) the maximum of the data and (b) Q3 + 1.5 IQR. Any data point outside the whiskers is indicated by a small hollow dot, indicating a suspicious or deviant point (outlier). Note that a boxplot may also be used for discrete quantitative features.
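A quick way to see where the whiskers will end up is to compute these quantities directly; a minimal sketch for the 'age' feature (the variable names are our own):

q1, q3 = nutri['age'].quantile([0.25, 0.75])
iqr = q3 - q1  # interquartile range
# whisker endpoints, clipped at the most extreme data points
left = max(nutri['age'].min(), q1 - 1.5 * iqr)
right = min(nutri['age'].max(), q3 + 1.5 * iqr)
print(q1, q3, iqr, left, right)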
1.5.2.2 Histogram
A histogram is a common graphical representation of the distribution of a quantitative feature. We start by breaking the range of the values into a number of bins or classes. We tally the counts of the values falling in each bin and then make the plot by drawing rectangles whose bases are the bin intervals and whose heights are the counts. In Python we can use the function plt.hist. For example, Figure 1.3 shows a histogram of the 226 ages in nutri, constructed via the following Python code.
weights = np.ones_like(nutri.age) / nutri.age.count()
plt.hist(nutri.age, bins=9, weights=weights, facecolor='cyan',
         edgecolor='black', linewidth=1)
plt.xlabel('age')
plt.ylabel('Proportion of Total')
plt.show()
Here 9 bins were used. Rather than using raw counts (the default), the vertical axis here gives the proportion in each class, defined by count/total. This is achieved by choosing the “weights” parameter to be equal to the vector with entries 1/226, of length 226. Various plotting parameters have also been changed.
Figure 1.3: Histogram of 'age'.
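As an aside, matplotlib can also normalize a histogram itself: passing density=True to plt.hist rescales the bar heights so that the total area of the histogram equals 1. Note that this gives a density rather than a proportion per bin, so it coincides with the weights approach above only when the bins have width 1:

plt.hist(nutri.age, bins=9, density=True)  # area under the histogram is 1
plt.show()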
Histograms can also be used for discrete features, although it may be necessary to explicitly specify the bins and placement of the ticks on the axes.
1.5.2.3 Empirical Cumulative Distribution Function
The empirical cumulative distribution function, denoted by Fn, is a step function which jumps an amount k/n at observation values, where k is the number of tied observations at that value. For observations x1, . . . , xn, Fn(x) is the fraction of observations less than or equal to x, i.e.,
$$F_n(x) = \frac{\text{number of } x_i \le x}{n} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{x_i \le x\}, \qquad (1.2)$$
where $\mathbf{1}$ denotes the indicator function; that is, $\mathbf{1}\{x_i \le x\}$ is equal to 1 when $x_i \le x$ and 0 otherwise. To produce a plot of the empirical cumulative distribution function we can use the plt.step function. The result for the age data is shown in Figure 1.4. The empirical cumulative distribution function for a discrete quantitative variable is obtained in the same way.
x = np.sort(nutri.age)
y = np.linspace(0, 1, len(nutri.age))
plt.xlabel('age')
plt.ylabel('Fn(x)')
plt.step(x, y)
plt.xlim(x.min(), x.max())
plt.show()
Figure 1.4: Plot of the empirical distribution function for the continuous quantitative fea ture 'age'.
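Note that np.linspace(0, 1, n) starts at 0 rather than at 1/n, so the plotted curve is only an approximation to definition (1.2). A slightly more faithful sketch uses jumps of exactly 1/n:

n = len(nutri.age)
y = np.arange(1, n + 1) / n  # Fn jumps by 1/n at each sorted observation
plt.step(np.sort(nutri.age), y)
plt.show()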
1.5.3 Data Visualization in a Bivariate Setting
In this section, we present a few useful visual aids to explore relationships between two features. The graphical representation will depend on the type of the two features.
1.5.3.1 Two-way Plots for Two Categorical Variables
Comparing barplots for two categorical variables involves introducing subplots to the figure. Figure 1.5 visualizes the contingency table of Section 1.3, which cross-tabulates the family status (situation) with the gender of the elderly people. It simply shows two barplots next to each other in the same figure.
Figure 1.5: Barplot for two categorical variables.
The figure was made using the seaborn package, which was specifically designed to simplify statistical visualization tasks.
import seaborn as sns
sns.countplot(x='situation', hue='gender', data=nutri,
              hue_order=['Male', 'Female'], palette=['SkyBlue', 'Pink'],
              saturation=1, edgecolor='black')
plt.legend(loc='upper center')
plt.xlabel('')
plt.ylabel('Counts')
plt.show()
1.5.3.2 Plots for Two Quantitative Variables
We can visualize patterns between two quantitative features using a scatterplot. This can be done with plt.scatter. The following code produces a scatterplot of 'weight' against 'height' for the nutri data.
plt.scatter(nutri.height, nutri.weight, s=12, marker='o')
plt.xlabel('height')
plt.ylabel('weight')
plt.show()
Figure 1.6: Scatterplot of 'weight' against 'height'.
The next Python code illustrates that it is possible to produce highly sophisticated scatterplots, such as in Figure 1.7. The figure shows the birth weights (mass) of babies whose mothers smoked (blue triangles) or not (red circles). In addition, straight lines were fitted to the two groups, suggesting that birth weight decreases with age when the mother smokes, but increases when the mother does not smoke! The question is whether these trends are statistically significant or due to chance. We will revisit this data set later on in the book (☞ 199).
urlprefix = 'https://vincentarelbundock.github.io/Rdatasets/csv/'
dataname = 'MASS/birthwt.csv'
bwt = pd.read_csv(urlprefix + dataname)
bwt = bwt.drop('Unnamed: 0', 1)  # drop unnamed column
styles = {0: ['o', 'red'], 1: ['^', 'blue']}
for k in styles:
    grp = bwt[bwt.smoke == k]
    m, b = np.polyfit(grp.age, grp.bwt, 1)  # fit a straight line
    plt.scatter(grp.age, grp.bwt, c=styles[k][1], s=15, linewidth=0,
                marker=styles[k][0])
    plt.plot(grp.age, m * grp.age + b, '-', color=styles[k][1])
plt.xlabel('age')
plt.ylabel('birth weight (g)')
plt.legend(['non-smokers', 'smokers'], prop={'size': 8}, loc=(0.5, 0.8))
plt.show()
Figure 1.7: Birth weight against age for smoking and non-smoking mothers.
1.5.3.3 Plots for One Qualitative and One Quantitative Variable
In this setting, it is interesting to draw boxplots of the quantitative feature for each level of the categorical feature. Assuming the variables are structured correctly, the function plt.boxplot can be used to produce Figure 1.8, using the following code:
males = nutri[nutri.gender == 'Male']
females = nutri[nutri.gender == 'Female']
plt.boxplot([males.coffee, females.coffee], notch=True, widths=(0.5, 0.5))
plt.xlabel('gender')
plt.ylabel('coffee')
plt.xticks([1, 2], ['Male', 'Female'])
plt.show()
Figure 1.8: Boxplots of a quantitative feature 'coffee' as a function of the levels of a categorical feature 'gender'. Note that we used a different, “notched”, style boxplot this time.
Further Reading
The focus in this book is on the mathematical and statistical analysis of data, and for the rest of the book we assume that the data is available in a suitable form for analysis. However, a large part of practical data science involves the cleaning of data; that is, putting it into a form that is amenable to analysis with standard software packages. Standard Python modules such as numpy and pandas can be used to reformat rows, rename columns, remove faulty outliers, merge rows, and so on. McKinney, the creator of pandas, gives many practical case studies in [84]. Effective data visualization techniques are beautifully illustrated in [65].
Exercises
Before you attempt these exercises, make sure you have up-to-date versions of the relevant Python packages, specifically matplotlib, pandas, and seaborn. An easy way to ensure this is to update packages via the Anaconda Navigator, as explained in Appendix D.
1. Visit the UCI Repository https://archive.ics.uci.edu/. Read the description of the data and download the Mushroom data set agaricus-lepiota.data. Using pandas, read the data into a DataFrame called mushroom, via read_csv.
(a) How many features are in this data set?
(b) What are the initial names and types of the features?
(c) Rename the first feature (index 0) to 'edibility' and the sixth feature (index 5) to 'odor' [Hint: the column names in pandas are immutable; so individual columns cannot be modified directly. However it is possible to assign the entire column names list via mushroom.columns = newcols. ]
(d) The 6th column lists the various odors of the mushrooms, encoded as 'a', 'c', . . . . Replace these with the names 'almond', 'creosote', etc. (categories corresponding to each letter can be found on the website). Also replace the 'edibility' categories 'e' and 'p' with 'edible' and 'poisonous'.
(e) Make a contingency table cross-tabulating 'edibility' and 'odor'.
(f) Which mushroom odors should be avoided, when gathering mushrooms for consumption?
(g) What proportion of odorless mushroom samples were safe to eat?
2. Change the type and value of variables in the nutri data set according to Table 1.2 and save the data as a CSV file. The modified data should have eight categorical features, three floats, and two integer features.
3. It frequently happens that a table with data needs to be restructured before the data can be analyzed using standard statistical software. As an example, consider the test scores in Table 1.3 of 5 students before and after specialized tuition.
Table 1.3: Student scores.
Student Before After
1 75 85
2 30 50
3 100 100
4 50 52
5 60 65
This is not in the standard format described in Section 1.1. In particular, the student scores are divided over two columns, whereas the standard format requires that they are collected in one column, e.g., labelled 'Score'. Reformat (by hand) the table in standard format, using three features:
• 'Score', taking continuous values,
• 'Time', taking values 'Before' and 'After',
• 'Student', taking values from 1 to 5.
Useful methods for reshaping tables in pandas are melt, stack, and unstack.
4. Create a similar barplot as in Figure 1.5, but now plot the corresponding proportions of males and females in each of the three situation categories. That is, the heights of the bars should sum up to 1 for both barplots with the same 'gender' value. [Hint: seaborn does not have this functionality built in; instead you need to first create a contingency table and use matplotlib.pyplot to produce the figure.]
5. The iris data set, mentioned in Section 1.1 (☞ 2), contains various features, including 'Petal.Length' and 'Sepal.Length', of three species of iris: setosa, versicolor, and virginica.
(a) Load the data set into a pandas DataFrame object.
(b) Using matplotlib.pyplot, produce boxplots of 'Petal.Length' for each the three species, in one figure.
(c) Make a histogram with 20 bins for 'Petal.Length'.
(d) Produce a similar scatterplot for 'Sepal.Length' against 'Petal.Length' to that of the left plot in Figure 1.9. Note that the points should be colored according to the 'Species' feature as per the legend in the right plot of the figure.
(e) Using the kdeplot method of the seaborn package, reproduce the right plot of Figure 1.9, where kernel density plots for 'Petal.Length' are given (☞ 131).
Figure 1.9: Left: scatterplot of 'Sepal.Length' against 'Petal.Length'. Right: kernel density estimates of 'Petal.Length' for the three species of iris.
6. Import the data set EuStockMarkets from the same website as the iris data set above. The data set contains the daily closing prices of four European stock indices during the 1990s, for 260 working days per year.
(a) Create a vector of times (working days) for the stock prices, between 1991.496 and 1998.646 with increments of 1/260.
(b) Reproduce Figure 1.10. [Hint: Use a dictionary to map column names (stock indices) to colors.]
Figure 1.10: Closing stock indices for various European stock markets.
7. Consider the KASANDR data set from the UCI Machine Learning Repository, which can be downloaded from
https://archive.ics.uci.edu/ml/machine-learning-databases/00385/de.tar.bz2.
This archive file has a size of 900Mb, so it may take a while to download. Uncompressing the file (e.g., via 7-Zip) yields a directory de containing two large CSV files: test_de.csv and train_de.csv, with sizes 372Mb and 3Gb, respectively. Such large data files can still be processed efficiently in pandas, provided there is enough memory. The files contain records of user information from Kelkoo web logs in Germany as well as meta-data on users, offers, and merchants. The data sets have 7 attributes and 1919561 and 15844717 rows, respectively. The data sets are anonymized via hex strings.
(a) Load train_de.csv into a pandas DataFrame object de, using read_csv('train_de.csv', delimiter = '\t').
If not enough memory is available, load test_de.csv instead. Note that entries are separated here by tabs, not commas. Time how long it takes for the file to load, using the time package. (It took 38 seconds for train_de.csv to load on one of our computers.)
(b) How many unique users and merchants are in this data set?
8. Visualizing data involving more than two features requires careful design, which is often more of an art than a science.
(a) Go to Vincent Arel-Bundock's website (URL given in Section 1.1) and read the Orange data set into a pandas DataFrame object called orange. Remove its first (unnamed) column.
(b) The data set contains the circumferences of 5 orange trees at various stages in their development. Find the names of the features.
(c) In Python, import seaborn and visualize the growth curves (circumference against age) of the trees, using the regplot and FacetGrid methods.
CHAPTER 2
STATISTICAL LEARNING
The purpose of this chapter is to introduce the reader to some common concepts and themes in statistical learning. We discuss the difference between supervised and unsupervised learning, and how we can assess the predictive performance of supervised learning. We also examine the central role that the linear and Gaussian properties play in the modeling of data. We conclude with a section on Bayesian learning. The required probability and statistics background is given in Appendix C.
2.1 Introduction
Although structuring and visualizing data are important aspects of data science, the main challenge lies in the mathematical analysis of the data. When the goal is to interpret the model and quantify the uncertainty in the data, this analysis is usually referred to as statistical learning. In contrast, when the emphasis is on making predictions using large-scale data, then it is common to speak about machine learning or data mining.
There are two major goals for modeling data: 1) to accurately predict some future quantity of interest, given some observed data, and 2) to discover unusual or interesting patterns in the data. To achieve these goals, one must rely on knowledge from three important pillars of the mathematical sciences.
Function approximation. Building a mathematical model for data usually means understanding how one data variable depends on another data variable. The most natural way to represent the relationship between variables is via a mathematical function or map. We usually assume that this mathematical function is not completely known, but can be approximated well given enough computing power and data. Thus, data scientists have to understand how best to approximate and represent functions using the least amount of computer processing and memory.
Optimization. Given a class of mathematical models, we wish to find the best possible model in that class. This requires some kind of efficient search or optimization procedure. The optimization step can be viewed as a process of fitting or calibrating a function to observed data. This step usually requires knowledge of optimization algorithms and efficient computer coding or programming.
Probability and Statistics. In general, the data used to fit the model is viewed as a realization of a random process or numerical vector, whose probability law determines the accuracy with which we can predict future observations. Thus, in order to quantify the uncertainty inherent in making predictions about the future, and the sources of error in the model, data scientists need a firm grasp of probability theory and statistical inference.
2.2 Supervised and Unsupervised Learning
Given an input or feature vector x, one of the main goals of machine learning is to predict an output or response variable y. For example, x could be a digitized signature and y a binary variable that indicates whether the signature is genuine or false. Another example is where x represents the weight and smoking habits of an expecting mother and y the birth weight of the baby. The data science attempt at this prediction is encoded in a mathematical function g, called the prediction function, which takes as an input x and outputs a guess g(x) for y (denoted by $\hat{y}$, for example). In a sense, g encompasses all the information about the relationship between the variables x and y, excluding the effects of chance and randomness in nature.
In regression problems, the response variable y can take any real value. In contrast, when y can only lie in a finite set, say y ∈ {0, . . . , c − 1}, then predicting y is conceptually the same as classifying the input x into one of c categories, and so prediction becomes a classification problem.
We can measure the accuracy of a prediction $\hat{y}$ with respect to a given response y by using some loss function $\mathrm{Loss}(y, \hat{y})$. In a regression setting the usual choice is the squared-error loss $(y - \hat{y})^2$. In the case of classification, the zero–one (also written 0–1) loss function $\mathrm{Loss}(y, \hat{y}) = \mathbf{1}\{y \ne \hat{y}\}$ is often used, which incurs a loss of 1 whenever the predicted class $\hat{y}$ is not equal to the class y. Later on in this book, we will encounter various other useful loss functions, such as the cross-entropy and hinge loss functions (see, e.g., Chapter 7).
The word error is often used as a measure of distance between a “true” object y and some approximation $\hat{y}$ thereof. If y is real-valued, the absolute error $|y - \hat{y}|$ and the squared error $(y - \hat{y})^2$ are both well-established error concepts, as are the norm $\|y - \hat{y}\|$ and squared norm $\|y - \hat{y}\|^2$ for vectors. The squared error $(y - \hat{y})^2$ is just one example of a loss function.
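As a quick illustration, here is a minimal sketch of the two loss functions just mentioned (the function names are our own):

def squared_error_loss(y, y_hat):
    # regression loss: (y - y_hat)**2
    return (y - y_hat)**2

def zero_one_loss(y, y_hat):
    # classification loss: 1 if the predicted class is wrong, else 0
    return int(y != y_hat)

print(squared_error_loss(3.0, 2.5))        # 0.25
print(zero_one_loss('spam', 'not spam'))   # 1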
It is unlikely that any mathematical function g will be able to make accurate predictions for all possible pairs (x, y) one may encounter in Nature. One reason for this is that, even with the same input x, the output y may be different, depending on chance circumstances or randomness. For this reason, we adopt a probabilistic approach and assume that each pair (x, y) is the outcome of a random pair (X, Y) that has some joint probability density f(x, y). We then assess the predictive performance via the expected loss, usually called the risk, for g:
$$\ell(g) = \mathbb{E}\,\mathrm{Loss}(Y, g(X)). \qquad (2.1)$$
For example, in the classification case with zero–one loss function the risk is equal to the probability of incorrect classification: $\ell(g) = \mathbb{P}[Y \ne g(X)]$. In this context, the prediction
function g is called a classifier. Given the distribution of (X, Y) and any loss function, we can in principle find the best possible $g^* := \operatorname{argmin}_g \mathbb{E}\,\mathrm{Loss}(Y, g(X))$ that yields the smallest risk $\ell^* := \ell(g^*)$. We will see in Chapter 7 (☞ 251) that in the classification case with $y \in \{0, \ldots, c-1\}$ and $\ell(g) = \mathbb{P}[Y \ne g(X)]$, we have
$$g^*(x) = \operatorname*{argmax}_{y \in \{0,\ldots,c-1\}} f(y \mid x),$$
where f(y | x) = P[Y = y | X = x] is the conditional probability of Y = y given X = x. As already mentioned, for regression the most widely-used loss function is the squared error loss. In this setting, the optimal prediction function g∗is often called the regression function. The following theorem specifies its exact form.regression
function
Theorem 2.1: Optimal Prediction Function for Squared-Error Loss
For the squared-error loss Loss(y, ŷ) = (y − ŷ)², the optimal prediction function g∗ is equal to the conditional expectation of Y given X = x:

g^*(x) = \mathbb{E}[Y \mid X = x].
Proof: Let g∗(x) = E[Y | X = x]. For any function g, the squared-error risk satisfies

\mathbb{E}(Y - g(X))^2 = \mathbb{E}[(Y - g^*(X) + g^*(X) - g(X))^2]
= \mathbb{E}(Y - g^*(X))^2 + 2\,\mathbb{E}[(Y - g^*(X))(g^*(X) - g(X))] + \mathbb{E}(g^*(X) - g(X))^2
\geq \mathbb{E}(Y - g^*(X))^2 + 2\,\mathbb{E}[(Y - g^*(X))(g^*(X) - g(X))]
= \mathbb{E}(Y - g^*(X))^2 + 2\,\mathbb{E}\{(g^*(X) - g(X))\,\mathbb{E}[Y - g^*(X) \mid X]\}.

In the last equation we used the tower property (☞ 431). By the definition of the conditional expectation, we have E[Y − g∗(X) | X] = 0. It follows that E(Y − g(X))² ⩾ E(Y − g∗(X))², showing that g∗ yields the smallest squared-error risk.
One consequence of Theorem 2.1 is that, conditional on X = x, the (random) response Y can be written as
Y = g∗(x) + ε(x), (2.2)
where ε(x) can be viewed as the random deviation of the response from its conditional mean at x. This random deviation satisfies E ε(x) = 0. Further, the conditional variance of the response Y at x can be written as Var ε(x) = v²(x) for some unknown positive function v. Note that, in general, the probability distribution of ε(x) is unspecified.
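A quick Monte Carlo check of Theorem 2.1 (a sketch under an assumed toy model that is not part of the text: X ~ U(0, 1) and Y = X² + ε with ε ~ N(0, 1), so that g∗(x) = x² and the irreducible risk is 1) confirms that the conditional mean beats any other predictor, such as g(x) = x:

import numpy as np
np.random.seed(1)
N = 10**6
X = np.random.rand(N)
Y = X**2 + np.random.randn(N)         # model with g*(x) = x^2
risk_gstar = np.mean((Y - X**2)**2)   # approx 1, the irreducible risk
risk_glin = np.mean((Y - X)**2)       # strictly larger (approx 1.033)
print(risk_gstar, risk_glin)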
Since the optimal prediction function g∗ depends on the typically unknown joint distribution of (X, Y), it is not available in practice. Instead, all that we have available is a finite number of (usually) independent realizations from the joint density f(x, y). We denote this sample by T = {(X_1, Y_1), . . . , (X_n, Y_n)} and call it the training set (T is a mnemonic for training) with n examples. It will be important to distinguish between a random training set T and its (deterministic) outcome {(x_1, y_1), . . . , (x_n, y_n)}. We will use the notation τ for the latter. We will also add the subscript n in τ_n when we wish to emphasize the size of the training set.
Our goal is thus to “learn” the unknown g∗ using the n examples in the training set T. Let us denote by g_T the best (by some criterion) approximation for g∗ that we can construct from T. Note that g_T is a random function. A particular outcome is denoted by g_τ. It is often useful to think of a teacher–learner metaphor, whereby the function g_T is a learner who learns the unknown functional relationship g∗ : x ↦ y from the training data T. We can imagine a “teacher” who provides n examples of the true relationship between the output Y_i and the input X_i for i = 1, . . . , n, and thus “trains” the learner g_T to predict the output of a new input X, for which the correct output Y is not provided by the teacher (is unknown).
The above setting is called supervised learning, because one tries to learn the functional relationship between the feature vector x and response y in the presence of a teacher who provides n examples. It is common to speak of “explaining” or predicting y on the basis of x, where x is a vector of explanatory variables.
An example of supervised learning is email spam detection. The goal is to train the learner gT to accurately predict whether any future email, as represented by the feature vector x, is spam or not. The training data consists of the feature vectors of a number of different email examples as well as the corresponding labels (spam or not spam). For instance, a feature vector could consist of the number of times sales-pitch words like “free”, “sale”, or “miss out” occur within a given email.
As seen from the above discussion, most questions of interest in supervised learning can be answered if we know the conditional pdf f(y | x), because we can then in principle work out the function value g∗(x).
In contrast, unsupervised learning makes no distinction between response and explanatory variables, and the objective is simply to learn the structure of the unknown distribution of the data. In other words, we need to learn f(x). In this case the guess g(x) is an approximation of f(x) and the risk is of the form

\ell(g) = \mathbb{E}\,\mathrm{Loss}(f(X), g(X)).
An example of unsupervised learning is when we wish to analyze the purchasing behaviors of the customers of a grocery shop that has a total of, say, a hundred items on sale. A feature vector here could be a binary vector x ∈ {0, 1}¹⁰⁰ representing the items bought by a customer on a visit to the shop (a 1 in the k-th position if a customer bought item k ∈ {1, . . . , 100} and a 0 otherwise). Based on a training set τ = {x_1, . . . , x_n}, we wish to find any interesting or unusual purchasing patterns. In general, it is difficult to know if an unsupervised learner is doing a good job, because there is no teacher to provide examples of accurate predictions.
The main methodologies for unsupervised learning include clustering, principal component analysis, and kernel density estimation, which will be discussed in Chapter 4 (☞ 121). In the next three sections we will focus on supervised learning. The main supervised learning methodologies are regression and classification, to be discussed in detail in Chapters 5 and 7 (☞ 167). More advanced supervised learning techniques, including reproducing kernel Hilbert spaces, tree methods, and deep learning, will be discussed in Chapters 6, 8, and 9 (☞ 251).
2.3 Training and Test Loss
Given an arbitrary prediction function g, it is typically not possible to compute its risk ℓ(g) in (2.1). However, using the training sample T, we can approximate ℓ(g) via the empirical (sample average) risk
\ell_T(g) = \frac{1}{n} \sum_{i=1}^n \mathrm{Loss}(Y_i, g(X_i)), \qquad (2.3)
which we call the training loss. The training loss is thus an unbiased estimator of the risk (the expected loss) for a prediction function g, based on the training data.
To approximate the optimal prediction function g∗ (the minimizer of the risk ℓ(g)) we first select a suitable collection of approximating functions G and then take our learner to be the function in G that minimizes the training loss; that is,

g^{\mathcal{G}}_T = \operatorname{argmin}_{g \in \mathcal{G}} \ell_T(g). \qquad (2.4)
For example, the simplest and most useful G is the set of linear functions of x; that is, the set of all functions g : x ↦ β⊤x for some real-valued vector β.
We suppress the superscript G when it is clear which function class is used. Note that minimizing the training loss over all possible functions g (rather than over all g ∈ G) does not lead to a meaningful optimization problem, as any function g for which g(X_i) = Y_i for all i gives minimal training loss. In particular, for a squared-error loss, the training loss will be 0. Unfortunately, such functions have a poor ability to predict new (that is, independent from T) pairs of data. This poor generalization performance is called overfitting.
By choosing for g a function that predicts the training data exactly (and is, for example, 0 otherwise), the squared-error training loss is zero. Minimizing the training loss is not the ultimate goal!
The prediction accuracy for new pairs of data is measured by the generalization risk of the learner. For a fixed training set τ it is defined as

\ell(g^{\mathcal{G}}_\tau) = \mathbb{E}\,\mathrm{Loss}(Y, g^{\mathcal{G}}_\tau(X)), \qquad (2.5)
where (X, Y) is distributed according to f(x, y). In the discrete case the generalization risk is therefore ℓ(g^G_τ) = Σ_{x,y} Loss(y, g^G_τ(x)) f(x, y) (replace the sum with an integral for the continuous case). The situation is illustrated in Figure 2.1, where the distribution of (X, Y) is indicated by the red dots. The training set (points in the shaded regions) determines a fixed prediction function, shown as a straight line. Three possible outcomes of (X, Y) are shown (black dots). The amount of loss for each point is shown as the length of the dashed lines. The generalization risk is the average loss over all possible pairs (x, y), weighted by the corresponding f(x, y).
Figure 2.1: The generalization risk for a fixed training set is the weighted-average loss over all possible pairs (x, y).
For a random training set T, the generalization risk is thus a random variable that depends on T (and G). If we average the generalization risk over all possible instances of T, we obtain the expected generalization risk:

\mathbb{E}\,\ell(g^{\mathcal{G}}_T) = \mathbb{E}\,\mathrm{Loss}(Y, g^{\mathcal{G}}_T(X)), \qquad (2.6)
where (X, Y) in the expectation above is independent of T. In the discrete case, we have E ℓ(g^G_T) = Σ_{x,y,x_1,y_1,...,x_n,y_n} Loss(y, g^G_τ(x)) f(x, y) f(x_1, y_1) · · · f(x_n, y_n). Figure 2.2 gives an illustration.
Figure 2.2: The expected generalization risk is the weighted-average loss over all possible pairs (x, y) and over all training sets.
For any outcome τ of the training data, we can estimate the generalization risk without bias by taking the sample average

\ell_{T'}(g^{\mathcal{G}}_\tau) := \frac{1}{n'} \sum_{i=1}^{n'} \mathrm{Loss}(Y'_i, g^{\mathcal{G}}_\tau(X'_i)), \qquad (2.7)

where {(X'_1, Y'_1), . . . , (X'_{n'}, Y'_{n'})} =: T' is a so-called test sample. The test sample is completely separate from T, but is drawn in the same way as T; that is, via independent draws from f(x, y), for some sample size n'. We call the estimator (2.7) the test loss. For a random training set T we can define ℓ_{T'}(g^G_T) similarly. It is then crucial to assume that T is independent of T'. Table 2.1 summarizes the main definitions and notation for supervised learning.
Table 2.1: Summary of definitions for supervised learning.

x             Fixed explanatory (feature) vector.
X             Random explanatory (feature) vector.
y             Fixed (real-valued) response.
Y             Random response.
f(x, y)       Joint pdf of X and Y, evaluated at (x, y).
f(y | x)      Conditional pdf of Y given X = x, evaluated at y.
τ or τ_n      Fixed training data {(x_i, y_i), i = 1, . . . , n}.
T or T_n      Random training data {(X_i, Y_i), i = 1, . . . , n}.
X             Matrix of explanatory variables, with n rows x_i⊤, i = 1, . . . , n and dim(x) feature columns; one of the features may be the constant 1.
y             Vector of response variables (y_1, . . . , y_n)⊤.
g             Prediction (guess) function.
Loss(y, ŷ)    Loss incurred when predicting response y with ŷ.
ℓ(g)          Risk for prediction function g; that is, E Loss(Y, g(X)).
g∗            Optimal prediction function; that is, argmin_g ℓ(g).
g^G           Optimal prediction function in function class G; that is, argmin_{g∈G} ℓ(g).
ℓ_τ(g)        Training loss for prediction function g; that is, the sample average estimate of ℓ(g) based on a fixed training sample τ.
ℓ_T(g)        The same as ℓ_τ(g), but now for a random training sample T.
g^G_τ or g_τ  The learner: argmin_{g∈G} ℓ_τ(g). That is, the optimal prediction function based on a fixed training set τ and function class G. We suppress the superscript G if the function class is implicit.
g^G_T or g_T  The learner, where we have replaced τ with a random training set T.
To compare the predictive performance of various learners in the function class G, as measured by the test loss, we can use the same fixed training set τ and test set τ' for all learners. When there is an abundance of data, the “overall” data set is usually (randomly) divided into a training and test set, as depicted in Figure 2.3. We then use the training data to construct various learners g^{G_1}_τ, g^{G_2}_τ, . . ., and use the test data to select the best (with the smallest test loss) among these learners. In this context the test set is called the validation set. Once the best learner has been chosen, a third “test” set can be used to assess the predictive performance of the best learner. The training, validation, and test sets can again be obtained from the overall data set via a random allocation. When the overall data set is of modest size, it is customary to perform the validation phase (model selection) on the training set only, using cross-validation. This is the topic of Section 2.5.2 (☞ 37).
Figure 2.3: Statistical learning algorithms often require the data to be divided into training and test data. If the latter is used for model selection, a third set is needed for testing the performance of the selected model.
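A random three-way allocation of the kind depicted in Figure 2.3 could be sketched as follows (the 60/20/20 proportions and the helper name split_data are our own illustrative choices):

import numpy as np

def split_data(X, y, fracs=(0.6, 0.2, 0.2), seed=0):
    # randomly allocate rows to training, validation, and test sets
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_train = int(fracs[0] * len(y))
    n_val = int(fracs[1] * len(y))
    train, val, test = np.split(idx, [n_train, n_train + n_val])
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])

(Xt, yt), (Xv, yv), (Xs, ys) = split_data(np.arange(20).reshape(10, 2), np.arange(10))
print(len(yt), len(yv), len(ys))  # 6 2 2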
We next consider a concrete example that illustrates the concepts introduced so far.
Example 2.1 (Polynomial Regression) In what follows, it will appear that we have arbitrarily replaced the symbols x, g, G with u, h, H, respectively. The reason for this switch of notation will become clear at the end of the example.
The data (depicted as dots) in Figure 2.4 are n = 100 points (u_i, y_i), i = 1, . . . , n drawn from iid random points (U_i, Y_i), i = 1, . . . , n, where the {U_i} are uniformly distributed on the interval (0, 1) and, given U_i = u_i, the random variable Y_i has a normal distribution with expectation 10 − 140u_i + 400u_i² − 250u_i³ and variance ℓ∗ = 25. This is an example of a polynomial regression model. Using a squared-error loss, the optimal prediction function h∗(u) = E[Y | U = u] is thus

h^*(u) = 10 - 140u + 400u^2 - 250u^3,
which is depicted by the dashed curve in Figure 2.4.
Figure 2.4: Training data and the optimal polynomial prediction function h∗.
To obtain a good estimate of h∗(u) based on the training set τ = {(u_i, y_i), i = 1, . . . , n}, we minimize the outcome of the training loss (2.3):

\ell_\tau(h) = \frac{1}{n} \sum_{i=1}^n (y_i - h(u_i))^2, \qquad (2.8)
over a suitable set H of candidate functions. Let us take the set H_p of polynomial functions in u of order p − 1:

h(u) := \beta_1 + \beta_2 u + \beta_3 u^2 + \cdots + \beta_p u^{p-1} \qquad (2.9)
for p = 1, 2, . . . and parameter vector β = [β_1, β_2, . . . , β_p]⊤. This function class contains the best possible h∗(u) = E[Y | U = u] for p ⩾ 4. Note that optimization over H_p is a parametric optimization problem, in that we need to find the best β. Optimization of (2.8) over H_p is not straightforward, unless we notice that (2.9) is a linear function in β. In particular, if we map each feature u to a feature vector x = [1, u, u², . . . , u^{p−1}]⊤, then the right-hand side of (2.9) can be written as the function

g(x) = x^\top \beta,

which is linear in x (as well as in β). The optimal h∗(u) in H_p for p ⩾ 4 then corresponds to the function g∗(x) = x⊤β∗ in the set G_p of linear functions from ℝᵖ to ℝ, where β∗ = [10, −140, 400, −250, 0, . . . , 0]⊤. Thus, instead of working with the set H_p of polynomial functions we may prefer to work with the set G_p of linear functions. This brings us to a very important idea in statistical learning:
Expand the feature space to obtain a linear prediction function.
Let us now reformulate the learning problem in terms of the new explanatory (feature) variables x_i = [1, u_i, u_i², . . . , u_i^{p−1}]⊤, i = 1, . . . , n. It will be convenient to arrange these feature vectors into a matrix X with rows x_1⊤, . . . , x_n⊤:

X = \begin{bmatrix} 1 & u_1 & u_1^2 & \cdots & u_1^{p-1} \\ 1 & u_2 & u_2^2 & \cdots & u_2^{p-1} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & u_n & u_n^2 & \cdots & u_n^{p-1} \end{bmatrix}. \qquad (2.10)
Collecting the responses {y_i} into a column vector y, the training loss (2.3) can now be written compactly as

\frac{1}{n} \|y - X\beta\|^2. \qquad (2.11)
To find the optimal learner (2.4) in the class Gp we need to find the minimizer of (2.11):
bβ = argmin β
ky − Xβk2, (2.12)
which is called the ordinary least-squares solution. As is illustrated in Figure 2.5, to find β̂, we choose Xβ̂ to be equal to the orthogonal projection of y onto the linear space spanned by the columns of the matrix X; that is, Xβ̂ = Py, where P is the projection matrix.
Figure 2.5: Xβ̂ is the orthogonal projection of y onto the linear space spanned by the columns of the matrix X.
According to Theorem A.4 (☞ 362), the projection matrix is given by

P = X X^+, \qquad (2.13)

where the p × n matrix X⁺ in (2.13) is the pseudo-inverse of X (☞ 360). If X happens to be of full column rank (so that none of the columns can be expressed as a linear combination of the other columns), then X⁺ = (X⊤X)⁻¹X⊤ (☞ 356).
In any case, from Xβ̂ = Py and PX = X, we can see that β̂ satisfies the normal equations:

X^\top X \beta = X^\top P y = (PX)^\top y = X^\top y. \qquad (2.14)

This is a set of linear equations, which can be solved very quickly and whose solution can be written explicitly as

\hat{\beta} = X^+ y. \qquad (2.15)
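In NumPy, (2.15) and (2.14) can be verified directly (a small sketch on random data, not part of the original example; np.linalg.pinv computes the pseudo-inverse X⁺):

import numpy as np
rng = np.random.default_rng(0)
X = rng.random((5, 3))
y = rng.random((5, 1))
b1 = np.linalg.pinv(X) @ y                   # betahat = X^+ y, as in (2.15)
b2 = np.linalg.lstsq(X, y, rcond=None)[0]    # least-squares solver
b3 = np.linalg.solve(X.T @ X, X.T @ y)       # normal equations (2.14)
print(np.allclose(b1, b2), np.allclose(b1, b3))  # True True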
Figure 2.6 shows the trained learners for various values of p:

h^{\mathcal{H}_p}_\tau(u) = g^{\mathcal{G}_p}_\tau(x) = x^\top \hat{\beta}.
Figure 2.6: Training data with fitted curves for p = 2, 4, and 16. The true cubic polynomial curve for p = 4 is also plotted (dashed line).
We see that for p = 16 the fitted curve lies closer to the data points, but is further away from the dashed true polynomial curve, indicating that we overfit. The choice p = 4 (the true cubic polynomial) is much better than p = 16, or indeed p = 2 (a straight line). Each function class G_p gives a different learner g^{G_p}_τ, p = 1, 2, . . .. To assess which is better, we should not simply take the one that gives the smallest training loss. We can always get a zero training loss by taking p = n, because for any set of n points there exists a polynomial of degree n − 1 that interpolates all points!
Instead, we assess the predictive performance of the learners using the test loss (2.7), computed from a test data set. If we collect all n' test feature vectors in a matrix X' and the corresponding test responses in a vector y', then, similar to (2.11), the test loss can be written compactly as

\ell_{\tau'}(g^{\mathcal{G}_p}_\tau) = \frac{1}{n'} \|y' - X'\hat{\beta}\|^2,

where β̂ is given by (2.15), using the training data.
Figure 2.7 shows a plot of the test loss against the number of parameters in the vector β; that is, p. The graph has a characteristic “bath-tub” shape and is at its lowest for p = 4, correctly identifying the polynomial order 3 for the true model. Note that the test loss, as an estimate for the generalization risk (2.5), becomes numerically unreliable after p = 16 (the graph goes down, where it should go up). The reader may check that the graph for the training loss exhibits a similar numerical instability for large p, and in fact fails to numerically decrease to 0 for large p, contrary to what it should do in theory. The numerical problems arise from the fact that for large p the columns of the (Vandermonde) matrix X are of vastly different magnitudes, and so floating point errors quickly become very large.
Finally, observe that the lower bound for the test loss is here around 21, which corresponds to an estimate of the minimal (squared-error) risk ℓ∗ = 25.
Figure 2.7: Test loss as a function of the number of parameters p of the model. The following script shows how the training data were generated and plotted in Python:
polyreg1.py
import numpy as np
from numpy.random import rand, randn
from numpy.linalg import norm, solve
import matplotlib.pyplot as plt

def generate_data(beta, sig, n):
    u = np.random.rand(n, 1)
    y = (u ** np.arange(0, 4)) @ beta + sig * np.random.randn(n, 1)
    return u, y

np.random.seed(12)
beta = np.array([[10, -140, 400, -250]]).T
n = 100
sig = 5
u, y = generate_data(beta, sig, n)

xx = np.arange(np.min(u), np.max(u) + 5e-3, 5e-3)
yy = np.polyval(np.flip(beta), xx)
plt.plot(u, y, '.', markersize=8)
plt.plot(xx, yy, '--', linewidth=3)
plt.xlabel(r'$u$')
plt.ylabel(r'$h^*(u)$')
plt.legend(['data points', 'true'])
plt.show()
The following code, which imports the code above, fits polynomial models with p = 1, . . . , 18 parameters to the training data and plots a selection of fitted curves, as shown in Figure 2.6.
polyreg2.py
from polyreg1 import *

max_p = 18
p_range = np.arange(1, max_p + 1, 1)
X = np.ones((n, 1))
betahat, trainloss = {}, {}
for p in p_range:  # p is the number of parameters
    if p > 1:
        X = np.hstack((X, u**(p-1)))  # add column to matrix
    betahat[p] = solve(X.T @ X, X.T @ y)
    trainloss[p] = (norm(y - X @ betahat[p])**2 / n)

p = [2, 4, 16]  # select three curves
# replot the points and true line and store in the list "plots"
plots = [plt.plot(u, y, 'k.', markersize=8)[0],
         plt.plot(xx, yy, 'k--', linewidth=3)[0]]
# add the three curves
for i in p:
    yy = np.polyval(np.flip(betahat[i]), xx)
    plots.append(plt.plot(xx, yy)[0])

plt.xlabel(r'$u$')
plt.ylabel(r'$h^{\mathcal{H}_p}_{\tau}(u)$')
plt.legend(plots, ('data points', 'true', '$p=2$, underfit',
                   '$p=4$, correct', '$p=16$, overfit'))
plt.savefig('polyfitpy.pdf', format='pdf')
plt.show()
The last code snippet, which imports the previous code, generates the test data and plots the graph of the test loss, as shown in Figure 2.7.
polyreg3.py
from polyreg2 import *

# generate test data
u_test, y_test = generate_data(beta, sig, n)

MSE = []
X_test = np.ones((n, 1))
for p in p_range:
    if p > 1:
        X_test = np.hstack((X_test, u_test**(p-1)))
    y_hat = X_test @ betahat[p]  # predictions
    MSE.append(np.sum((y_test - y_hat)**2 / n))

plt.plot(p_range, MSE, 'b', p_range, MSE, 'bo')
plt.xticks(ticks=p_range)
plt.xlabel('Number of parameters $p$')
plt.ylabel('Test loss')
2.4 Tradeoffs in Statistical Learning
The art of machine learning in the supervised case is to make the generalization risk (2.5) or expected generalization risk (2.6) as small as possible, while using as few computational resources as possible. In pursuing this goal, a suitable class G of prediction functions has to be chosen. This choice is driven by various factors, such as
• the complexity of the class (e.g., is it rich enough to adequately approximate, or even contain, the optimal prediction function g∗?),
• the ease of training the learner via the optimization program (2.4),
• how accurately the training loss (2.3) estimates the risk (2.1) within class G,
• the feature types (categorical, continuous, etc.).
As a result, the choice of a suitable function class G usually involves a tradeoff between conflicting factors. For example, a learner from a simple class G can be trained very quickly, but may not approximate g∗ very well, whereas a learner from a rich class G that contains g∗ may require a lot of computing resources to train.
To better understand the relation between model complexity, computational simplicity, and estimation accuracy, it is useful to decompose the generalization risk into several parts, so that the tradeoffs between these parts can be studied. We will consider two such decompositions: the approximation–estimation tradeoff and the bias–variance tradeoff.

We can decompose the generalization risk (2.5) into the following three components:

\ell(g^{\mathcal{G}}_\tau) = \underbrace{\ell^*}_{\text{irreducible risk}} + \underbrace{\ell(g^{\mathcal{G}}) - \ell^*}_{\text{approximation error}} + \underbrace{\ell(g^{\mathcal{G}}_\tau) - \ell(g^{\mathcal{G}})}_{\text{statistical error}}, \qquad (2.16)

where ℓ∗ := ℓ(g∗) is the irreducible risk and g^G := argmin_{g∈G} ℓ(g) is the best learner within class G. No learner can predict a new response with a smaller risk than ℓ∗.

The second component is the approximation error; it measures the difference between the irreducible risk and the best possible risk that can be obtained by selecting the best prediction function in the selected class of functions G. Determining a suitable class G and minimizing ℓ(g) over this class is purely a problem of numerical and functional analysis, as the training data τ are not present. For a fixed G that does not contain the optimal g∗, the approximation error cannot be made arbitrarily small and may be the dominant component in the generalization risk. The only way to reduce the approximation error is by expanding the class G to include a larger set of possible functions.

The third component is the statistical (estimation) error. It depends on the training set τ and, in particular, on how well the learner g^G_τ estimates the best possible prediction function, g^G, within class G. For any sensible estimator this error should decay to zero (in probability or expectation, ☞ 439) as the training size tends to infinity.

The approximation–estimation tradeoff pits two competing demands against each other. The first is that the class G has to be simple enough so that the statistical error is not too large. The second is that the class G has to be rich enough to ensure a small approximation error. Thus, there is a tradeoff between the approximation and estimation errors.
For the special case of the squared-error loss, the generalization risk is equal to ℓ(g^G_τ) = E(Y − g^G_τ(X))²; that is, the expected squared error¹ between the predicted value g^G_τ(X) and the response Y. Recall that in this case the optimal prediction function is given by g∗(x) = E[Y | X = x]. The decomposition (2.16) can now be interpreted as follows.

1. The first component, ℓ∗ = E(Y − g∗(X))², is the irreducible error, as no prediction function will yield a smaller expected squared error.

2. The second component, the approximation error ℓ(g^G) − ℓ(g∗), is equal to E(g^G(X) − g∗(X))². We leave the proof (which is similar to that of Theorem 2.1) as an exercise; see Exercise 2. Thus, the approximation error (defined as a risk difference) can here be interpreted as the expected squared error between the optimal predicted value and the optimal predicted value within the class G.

3. For the third component, the statistical error ℓ(g^G_τ) − ℓ(g^G), there is no direct interpretation as an expected squared error unless G is the class of linear functions; that is, g(x) = x⊤β for some vector β. In this case we can write (see Exercise 3) the statistical error as ℓ(g^G_τ) − ℓ(g^G) = E(g^G_τ(X) − g^G(X))².

¹Colloquially called the mean squared error.
Thus, when using a squared-error loss, the generalization risk for a linear class G can be decomposed as:

\ell(g^{\mathcal{G}}_\tau) = \mathbb{E}(g^{\mathcal{G}}_\tau(X) - Y)^2 = \ell^* + \underbrace{\mathbb{E}(g^{\mathcal{G}}(X) - g^*(X))^2}_{\text{approximation error}} + \underbrace{\mathbb{E}(g^{\mathcal{G}}_\tau(X) - g^{\mathcal{G}}(X))^2}_{\text{statistical error}}. \qquad (2.17)
Note that in this decomposition the statistical error is the only term that depends on the training set.
Example 2.2 (Polynomial Regression (cont.)) We continue Example 2.1. Here G = G_p is the class of linear functions of x = [1, u, u², . . . , u^{p−1}]⊤, and g∗(x) = x⊤β∗. Conditional on X = x we have that Y = g∗(x) + ε(x), with ε(x) ~ N(0, ℓ∗), where ℓ∗ = E(Y − g∗(X))² = 25 is the irreducible error. We wish to understand how the approximation and statistical errors behave as we change the complexity parameter p.

First, we consider the approximation error. Any function g ∈ G_p can be written as

g(x) = h(u) = \beta_1 + \beta_2 u + \cdots + \beta_p u^{p-1} = [1, u, \ldots, u^{p-1}]\,\beta,

and so g(X) is distributed as [1, U, . . . , U^{p−1}]β, where U ~ U(0, 1). Similarly, g∗(X) is distributed as [1, U, U², U³]β∗. It follows that an expression for the approximation error is

\int_0^1 \big( [1, u, \ldots, u^{p-1}]\,\beta - [1, u, u^2, u^3]\,\beta^* \big)^2 \, du.

To minimize this error, we set the gradient with respect to β to zero and obtain the p linear equations (☞ 397)

\int_0^1 \big( [1, u, \ldots, u^{p-1}]\,\beta - [1, u, u^2, u^3]\,\beta^* \big) \, du = 0,
\int_0^1 \big( [1, u, \ldots, u^{p-1}]\,\beta - [1, u, u^2, u^3]\,\beta^* \big)\, u \, du = 0,
\vdots
\int_0^1 \big( [1, u, \ldots, u^{p-1}]\,\beta - [1, u, u^2, u^3]\,\beta^* \big)\, u^{p-1} \, du = 0.

Let

H_p = \int_0^1 [1, u, \ldots, u^{p-1}]^\top [1, u, \ldots, u^{p-1}] \, du

be the p × p Hilbert matrix, which has (i, j)-th entry given by ∫₀¹ u^{i+j−2} du = 1/(i + j − 1). Then, the above system of linear equations can be written as H_p β = H̃ β∗, where H̃ is the p × 4 upper-left sub-block of H_{p̃} and p̃ = max{p, 4}. The solution, which we denote by β_p, is:

\beta_p = \begin{cases} 65/6, & p = 1, \\ [-20/3,\ 35]^\top, & p = 2, \\ [-5/2,\ 10,\ 25]^\top, & p = 3, \\ [10,\ -140,\ 400,\ -250,\ 0, \ldots, 0]^\top, & p \geq 4. \end{cases} \qquad (2.18)

Hence, the approximation error E(g^{G_p}(X) − g∗(X))² is given by

\int_0^1 \big( [1, u, \ldots, u^{p-1}]\,\beta_p - [1, u, u^2, u^3]\,\beta^* \big)^2 \, du = \begin{cases} 32225/252 \approx 127.9, & p = 1, \\ 1625/63 \approx 25.8, & p = 2, \\ 625/28 \approx 22.3, & p = 3, \\ 0, & p \geq 4. \end{cases} \qquad (2.19)
Notice how the approximation error becomes smaller as p increases. In this particular example the approximation error is in fact zero for p ⩾ 4. In general, as the class of approximating functions G becomes more complex, the approximation error goes down.
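The solution (2.18) can be reproduced numerically by solving H_p β = H̃ β∗ (a sketch; we build the Hilbert matrix entries 1/(i + j − 1) directly and, for simplicity, only cover p = 1, . . . , 4):

import numpy as np

beta_star = np.array([10., -140., 400., -250.])
for p in [1, 2, 3, 4]:
    i, j = np.indices((p, max(p, 4))) + 1   # 1-based row/column indices
    H = 1.0 / (i + j - 1)                   # p x 4 block of the Hilbert matrix
    Hp, Ht = H[:, :p], H[:, :4]             # H_p and the p x 4 sub-block H-tilde
    beta_p = np.linalg.solve(Hp, Ht @ beta_star)
    print(p, beta_p)   # matches (2.18): 65/6; [-20/3, 35]; [-5/2, 10, 25]; beta*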
Next, we illustrate the typical behavior of the statistical error. Since g_τ(x) = x⊤β̂, the statistical error can be written as

\int_0^1 \big( [1, u, \ldots, u^{p-1}] (\hat{\beta} - \beta_p) \big)^2 \, du = (\hat{\beta} - \beta_p)^\top H_p (\hat{\beta} - \beta_p). \qquad (2.20)
Figure 2.8 illustrates the decomposition (2.17) of the generalization risk for the same training set that was used to compute the test loss in Figure 2.7. Recall that the test loss gives an estimate of the generalization risk, using independent test data. Comparing the two figures, we see that in this case the two match closely. The global minimum of the statistical error is approximately 0.28, with minimizer p = 4. Since the approximation error is monotonically decreasing to zero, p = 4 is also the global minimizer of the generalization risk.
Figure 2.8: The generalization risk for a particular training set is the sum of the irreducible error, the approximation error, and the statistical error. The approximation error decreases to zero as p increases, whereas the statistical error has a tendency to increase after p = 4.
Note that the statistical error depends on the estimate β̂, which in its turn depends on the training set τ. We can obtain a better understanding of the statistical error by considering its expected behavior; that is, averaged over many training sets. This is explored in Exercise 11.
Using again a squared-error loss, a second decomposition (for general G) starts from

\ell(g^{\mathcal{G}}_\tau) = \ell^* + \ell(g^{\mathcal{G}}_\tau) - \ell(g^*),

where the statistical error and approximation error are combined. Using similar reasoning as in the proof of Theorem 2.1, we have

\ell(g^{\mathcal{G}}_\tau) = \mathbb{E}(g^{\mathcal{G}}_\tau(X) - Y)^2 = \ell^* + \mathbb{E}\big(g^{\mathcal{G}}_\tau(X) - g^*(X)\big)^2 = \ell^* + \mathbb{E}\, D^2(X, \tau),

where D(x, τ) := g^G_τ(x) − g∗(x). Now consider the random variable D(x, T) for a random training set T. The expectation of its square is:

\mathbb{E}\big(g^{\mathcal{G}}_T(x) - g^*(x)\big)^2 = \mathbb{E}\, D^2(x, T) = (\mathbb{E}\, D(x, T))^2 + \operatorname{Var} D(x, T) = \underbrace{(\mathbb{E}\, g^{\mathcal{G}}_T(x) - g^*(x))^2}_{\text{pointwise squared bias}} + \underbrace{\operatorname{Var} g^{\mathcal{G}}_T(x)}_{\text{pointwise variance}}. \qquad (2.21)

If we view the learner g^G_T(x) as a function of a random training set, then the pointwise squared bias term is a measure for how close g^G_T(x) is on average to the true g∗(x), whereas the pointwise variance term measures the deviation of g^G_T(x) from its expected value E g^G_T(x). The squared bias can be reduced by making the class of functions G more complex. However, decreasing the bias by increasing the complexity often leads to an increase in the variance term. We are thus seeking learners that provide an optimal balance between the bias and variance, as expressed via a minimal generalization risk. This is called the bias–variance tradeoff.
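The pointwise squared bias and variance in (2.21) can be approximated by simulation: repeatedly draw training sets from the model of Example 2.1, refit the learner, and record its prediction at a fixed point (a sketch; the evaluation point u = 0.5, the number R = 1000 of repetitions, and p = 4 are our own illustrative choices). For p = 4 the class contains g∗, so the squared bias should be negligible:

import numpy as np
rng = np.random.default_rng(3)
beta_star = np.array([10., -140., 400., -250.])
n, p, u0, R = 100, 4, 0.5, 1000
g_star = np.polyval(np.flip(beta_star), u0)   # true g*(x) at u = 0.5
preds = np.zeros(R)
for r in range(R):                            # R independent training sets
    u = rng.random((n, 1))
    y = (u ** np.arange(4)) @ beta_star[:, None] + 5 * rng.standard_normal((n, 1))
    X = u ** np.arange(p)                     # features [1, u, ..., u^{p-1}]
    betahat = np.linalg.solve(X.T @ X, X.T @ y)
    preds[r] = np.polyval(np.flip(betahat.ravel()), u0)
bias2 = (preds.mean() - g_star)**2            # pointwise squared bias
var = preds.var()                             # pointwise variance
print(bias2, var)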
Note that the expected generalization risk (2.6) can be written as ℓ∗ + E D²(X, T), where X and T are independent. It therefore decomposes as

\mathbb{E}\,\ell(g^{\mathcal{G}}_T) = \ell^* + \underbrace{\mathbb{E}\big(\mathbb{E}[g^{\mathcal{G}}_T(X) \mid X] - g^*(X)\big)^2}_{\text{expected squared bias}} + \underbrace{\mathbb{E}\big[\operatorname{Var}[g^{\mathcal{G}}_T(X) \mid X]\big]}_{\text{expected variance}}. \qquad (2.22)

2.5 Estimating Risk
The most straightforward way to quantify the generalization risk (2.5) is to estimate it via the test loss (2.7). However, the generalization risk depends inherently on the training set, and so different training sets may yield significantly different estimates. Moreover, when there is a limited amount of data available, reserving a substantial proportion of the data for testing rather than training may be uneconomical. In this section we consider different methods for estimating risk measures which aim to circumvent these difficulties.
2.5.1 In-Sample Risk
We mentioned that, due to the phenomenon of overfitting, the training loss of the learner, ℓ_τ(g_τ) (for simplicity, here we omit the superscript G from g^G_τ), is not a good estimate of the generalization risk ℓ(g_τ) of the learner. One reason for this is that we use the same data for both training the model and assessing its risk. How should we then estimate the generalization risk or expected generalization risk?
To simplify the analysis, suppose that we wish to estimate the average accuracy of the predictions of the learner g_τ at the n feature vectors x_1, . . . , x_n (these are part of the training set τ). In other words, we wish to estimate the in-sample risk of the learner g_τ:

\ell_{\mathrm{in}}(g_\tau) = \frac{1}{n} \sum_{i=1}^n \mathbb{E}\, \mathrm{Loss}(Y'_i, g_\tau(x_i)), \qquad (2.23)

where each response Y'_i is drawn from f(y | x_i), independently. Even in this simplified setting, the training loss of the learner will be a poor estimate of the in-sample risk. Instead, the proper way to assess the prediction accuracy of the learner at the feature vectors x_1, . . . , x_n is to draw new response values Y'_i ~ f(y | x_i), i = 1, . . . , n, that are independent of the responses y_1, . . . , y_n in the training data, and then estimate the in-sample risk of g_τ via

\frac{1}{n} \sum_{i=1}^n \mathrm{Loss}(Y'_i, g_\tau(x_i)).
For a fixed training set τ, we can compare the training loss of the learner with the in-sample risk. Their difference,

\mathrm{op}_\tau = \ell_{\mathrm{in}}(g_\tau) - \ell_\tau(g_\tau),

is called the optimism (of the training loss), because it measures how much the training loss underestimates (is optimistic about) the unknown in-sample risk. Mathematically, it is simpler to work with the expected optimism:

\mathbb{E}[\mathrm{op}_T \mid X_1 = x_1, \ldots, X_n = x_n] =: \mathbb{E}_X\, \mathrm{op}_T,

where the expectation is taken over a random training set T, conditional on X_i = x_i, i = 1, . . . , n. For ease of notation, we have abbreviated the expected optimism to E_X op_T, where E_X denotes the expectation operator conditional on X_i = x_i, i = 1, . . . , n. As in Example 2.1, the feature vectors are stored as the rows of an n × p matrix X. It turns out that the expected optimism for various loss functions can be expressed in terms of the (conditional) covariance between the observed and predicted response.
Theorem 2.2: Expected Optimism
For the squared-error loss and the 0–1 loss with 0–1 response, the expected optimism is

\mathbb{E}_X\, \mathrm{op}_T = \frac{2}{n} \sum_{i=1}^n \operatorname{Cov}_X(g_T(x_i), Y_i). \qquad (2.24)
Proof: In what follows, all expectations are taken conditional on X_1 = x_1, . . . , X_n = x_n. Let Y_i be the response for x_i and let Ŷ_i = g_T(x_i) be the predicted value. Note that the latter depends on Y_1, . . . , Y_n. Also, let Y'_i be an independent copy of Y_i for the same x_i, as in (2.23). In particular, Y'_i has the same distribution as Y_i and is statistically independent of all {Y_j}, including Y_i, and therefore is also independent of Ŷ_i. We have

\mathbb{E}_X\, \mathrm{op}_T = \frac{1}{n} \sum_{i=1}^n \mathbb{E}_X\big[(Y'_i - \hat{Y}_i)^2 - (Y_i - \hat{Y}_i)^2\big] = \frac{2}{n} \sum_{i=1}^n \mathbb{E}_X\big[(Y_i - Y'_i)\hat{Y}_i\big]
= \frac{2}{n} \sum_{i=1}^n \big( \mathbb{E}_X[Y_i \hat{Y}_i] - \mathbb{E}_X Y_i\, \mathbb{E}_X \hat{Y}_i \big) = \frac{2}{n} \sum_{i=1}^n \operatorname{Cov}_X(\hat{Y}_i, Y_i).
The proof for the 0–1 loss with 0–1 response is left as Exercise 4.
In summary, the expected optimism indicates how much, on average, the training loss deviates from the expected in-sample risk. Since the covariance of independent random variables is zero, the expected optimism is zero if the learner gT is statistically independent from the responses Y1, . . . , Yn.
Example 2.3 (Polynomial Regression (cont.)) We continue Example 2.2, where the components of the response vector Y = [Y_1, . . . , Y_n]⊤ are independent and normally distributed with variance ℓ∗ = 25 (the irreducible error) and expectations E_X Y_i = g∗(x_i) = x_i⊤β∗, i = 1, . . . , n. Using the formula (2.15) for the least-squares estimator β̂, the expected optimism (2.24) is

\frac{2}{n} \sum_{i=1}^n \operatorname{Cov}_X(x_i^\top \hat{\beta}, Y_i) = \frac{2}{n} \operatorname{tr}\big(\operatorname{Cov}_X(X\hat{\beta}, Y)\big) = \frac{2}{n} \operatorname{tr}\big(\operatorname{Cov}_X(XX^+ Y, Y)\big) = \frac{2}{n} \operatorname{tr}\big(XX^+ \operatorname{Cov}_X(Y, Y)\big) = \frac{2\ell^* \operatorname{tr}(XX^+)}{n} = \frac{2\ell^* p}{n}.
In the last equation we used the cyclic property of the trace (Theorem A.1, ☞ 357): tr(XX⁺) = tr(X⁺X) = tr(I_p), assuming that rank(X) = p. Therefore, an estimate for the in-sample risk (2.23) is

\hat{\ell}_{\mathrm{in}}(g_\tau) = \ell_\tau(g_\tau) + 2\ell^* p / n, \qquad (2.25)

where we have assumed that the irreducible risk ℓ∗ is known. Figure 2.9 shows that this estimate is very close to the test loss from Figure 2.7. Hence, instead of computing the test loss to assess the best model complexity p, we could simply have minimized the training loss plus the correction term 2ℓ∗p/n. In practice, ℓ∗ also has to be estimated somehow.
Figure 2.9: In-sample risk estimate ℓ̂_in(g_τ) as a function of the number of parameters p of the model. The test loss is superimposed as a blue dashed curve.
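Re-using the objects computed in polyreg2.py, the estimate (2.25) could be evaluated as follows (a sketch, assuming as in this example that ℓ∗ = 25 is known; note that importing polyreg2 also re-runs its plotting code):

from polyreg2 import *

ell_star = 25  # irreducible risk, known here by construction
in_sample = [trainloss[p] + 2 * ell_star * p / n for p in p_range]
best_p = p_range[int(np.argmin(in_sample))]
print(best_p)  # should identify p = 4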
2.5.2 Cross-Validation
In general, for complex function classes G, it is very difficult to derive simple formulas for the approximation and statistical errors, let alone for the generalization risk or expected generalization risk. As we saw, when there is an abundance of data, the easiest way to assess the generalization risk for a given training set τ is to obtain a test set τ' and evaluate the test loss (2.7) (☞ 24). When a sufficiently large test set is not available but computational resources are cheap, one can instead gain direct knowledge of the expected generalization risk via a computationally intensive method called cross-validation.
The idea is to make multiple identical copies of the data set, and to partition each copy into different training and test sets, as illustrated in Figure 2.10. Here, there are four copies of the data set (consisting of response and explanatory variables). Each copy is divided into a test set (colored blue) and training set (colored pink). For each of these sets, we estimate the model parameters using only training data and then predict the responses for the test set. The average loss between the predicted and observed responses is then a measure for the predictive power of the model.
Figure 2.10: An illustration of four-fold cross-validation, representing four copies of the same data set. The data in each copy is partitioned into a training set (pink) and a test set (blue). The darker columns represent the response variable and the lighter ones the explanatory variables.
In particular, suppose we partition a data set T of size n into K folds C_1, . . . , C_K of sizes n_1, . . . , n_K (hence, n_1 + · · · + n_K = n). Typically n_k ≈ n/K, k = 1, . . . , K.

Let ℓ_{C_k} be the test loss when using C_k as test data and all remaining data, denoted T_{−k}, as training data. Each ℓ_{C_k} is an unbiased estimator of the generalization risk for training set T_{−k}; that is, of ℓ(g_{T_{−k}}).
The K-fold cross-validation loss is the weighted average of these risk estimators:

\mathrm{CV}_K = \sum_{k=1}^K \frac{n_k}{n} \ell_{C_k}(g_{T_{-k}}) = \frac{1}{n} \sum_{k=1}^K \sum_{i \in C_k} \mathrm{Loss}(g_{T_{-k}}(x_i), y_i) = \frac{1}{n} \sum_{i=1}^n \mathrm{Loss}(g_{T_{-\kappa(i)}}(x_i), y_i),
where the function κ : {1, . . . , n} → {1, . . . , K} indicates to which of the K folds each of the n observations belongs. As the average is taken over varying training sets {T_{−k}}, it estimates the expected generalization risk E ℓ(g_T), rather than the generalization risk ℓ(g_τ) for the particular training set τ.
Example 2.4 (Polynomial Regression (cont.)) For the polynomial regression example, we can calculate a K-fold cross-validation loss with a nonrandom partitioning of the training set using the following code, which imports the previous code for the polynomial regression example. We omit the full plotting code.
polyregCV.py
from polyreg3 import *

K_vals = [5, 10, 100]  # number of folds
cv = np.zeros((len(K_vals), max_p))  # cv loss
X = np.ones((n, 1))
for p in p_range:
    if p > 1:
        X = np.hstack((X, u**(p-1)))
    j = 0
    for K in K_vals:
        loss = []
        for k in range(1, K+1):
            # integer indices of test samples
            test_ind = ((n/K)*(k-1) + np.arange(1, n/K+1) - 1).astype('int')
            train_ind = np.setdiff1d(np.arange(n), test_ind)
            X_train, y_train = X[train_ind, :], y[train_ind, :]
            X_test, y_test = X[test_ind, :], y[test_ind]
            # fit model and evaluate test loss
            betahat = solve(X_train.T @ X_train, X_train.T @ y_train)
            loss.append(norm(y_test - X_test @ betahat) ** 2)
        cv[j, p-1] = sum(loss)/n
        j += 1

# basic plotting
plt.plot(p_range, cv[0, :], 'k-.')
plt.plot(p_range, cv[1, :], 'r')
plt.plot(p_range, cv[2, :], 'b--')
plt.show()
Figure 2.11: K-fold cross-validation for the polynomial regression example.
Figure 2.11 shows the cross-validation loss for K ∈ {5, 10, 100}. The case K = 100 corresponds to leave-one-out cross-validation, which can be computed more efficiently using the formula in Theorem 5.1 (☞ 174).
2.6 Modeling Data
The first step in any data analysis is to model the data in one form or another. For example, in an unsupervised learning setting with data represented by a vector x = [x_1, . . . , x_p]⊤, a very general model is to assume that x is the outcome of a random vector X = [X_1, . . . , X_p]⊤ with some unknown pdf f. The model can then be refined by assuming a specific form of f.
When given a sequence of such data vectors x_1, . . . , x_n, one of the simplest models is to assume that the corresponding random vectors X_1, . . . , X_n are independent and identically distributed (iid) (☞ 429). We write

X_1, \ldots, X_n \overset{\text{iid}}{\sim} f \quad \text{or} \quad X_1, \ldots, X_n \overset{\text{iid}}{\sim} \mathrm{Dist},

to indicate that the random vectors form an iid sample from a sampling pdf f or sampling distribution Dist. This model formalizes the notion that the knowledge about one variable does not provide extra information about another variable. The main theoretical use of independent data models is that the joint density of the random vectors X_1, . . . , X_n is simply the product of the marginal ones; see Theorem C.1 (☞ 429). Specifically,

f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) = f(x_1) \cdots f(x_n).
In most models of this kind, our approximation or model for the sampling distribution is specified up to a small number of parameters. That is, g(x) is of the form g(x | β), which is known up to some parameter vector β. Examples for the one-dimensional case (p = 1) include the N(µ, σ²), Bin(n, p), and Exp(λ) distributions. See Tables C.1 and C.2 (☞ 425) for other common sampling distributions.

Typically, the parameters are unknown and must be estimated from the data. In a nonparametric setting the whole sampling distribution would be unknown. To visualize the underlying sampling distribution from outcomes x_1, . . . , x_n one can use graphical representations such as histograms, density plots, and empirical cumulative distribution functions, as discussed in Chapter 1 (☞ 11).
If the order in which the data were collected (or their labeling) is not informative or relevant, then the joint pdf of X_1, . . . , X_n satisfies the symmetry

f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) = f_{X_{\pi_1}, \ldots, X_{\pi_n}}(x_{\pi_1}, \ldots, x_{\pi_n}) \qquad (2.26)

for any permutation π_1, . . . , π_n of the integers 1, . . . , n. We say that the infinite sequence X_1, X_2, . . . is exchangeable if this permutational invariance (2.26) holds for any finite subset of the sequence. As we shall see in Section 2.9 on Bayesian learning, it is common to assume that the random vectors X_1, . . . , X_n are a subset of an exchangeable sequence and thus satisfy (2.26). Note that while iid random variables are exchangeable, the converse is not necessarily true. Thus, the assumption of an exchangeable sequence of random vectors is weaker than the assumption of iid random vectors.
Figure 2.12 illustrates the modeling tradeoffs. The keywords within the triangle represent various modeling paradigms. A few keywords have been highlighted, symbolizing their importance in modeling. The specific meaning of the keywords does not concern us here, but the point is that there are many models to choose from, depending on what assumptions are made about the data.
Figure 2.12: Illustration of the modeling dilemma. Complex models are more generally applicable, but may be difficult to analyze. Simple models may be highly tractable, but may not describe the data accurately. The triangular shape signifies that there are a great many specific models but not so many generic ones.
On the one hand, models that make few assumptions are more widely applicable, but at the same time may not be very mathematically tractable or provide insight into the nature of the data. On the other hand, very specific models may be easy to handle and interpret, but may not match the data very well. This tradeoff between the tractability and applicability of the model is very similar to the approximation–estimation tradeoff described in Section 2.4.
In the typical unsupervised setting we have a training set τ = {x_1, . . . , x_n} that is viewed as the outcome of n iid random variables X_1, . . . , X_n from some unknown pdf f. The objective is then to learn or estimate f from the finite training data. To put the learning in a similar framework as for the supervised learning discussed in the preceding Sections 2.3–2.5, we begin by specifying a class of probability density functions G_p := {g(· | θ), θ ∈ Θ}, where θ is a parameter in some subset Θ of ℝᵖ. We now seek the best g in G_p to minimize some risk. Note that G_p may not necessarily contain the true f, even for very large p.

We stress that our notation g(x) has a different meaning in the supervised and unsupervised case. In the supervised case, g is interpreted as a prediction function for a response y; in the unsupervised setting, g is an approximation of a density f.
For each x we measure the discrepancy between the true model f(x) and the hypothesized model g(x | θ) using the loss function

\mathrm{Loss}(f(x), g(x \mid \theta)) = \ln \frac{f(x)}{g(x \mid \theta)} = \ln f(x) - \ln g(x \mid \theta).
The expected value of this loss (that is, the risk) is thus

\ell(g) = \mathbb{E} \ln \frac{f(X)}{g(X \mid \theta)} = \int f(x) \ln \frac{f(x)}{g(x \mid \theta)} \, dx. \qquad (2.27)
The integral in (2.27) provides a fundamental way to measure the distance between two
densities and is called the Kullback–Leibler (KL) divergence2 Kullback–
between f and g(· | θ). Note
Leibler
divergence
that the KL divergence is not symmetric in f and g(· | θ). Moreover, it is always greater than or equal to 0 (see Exercise 15) and equal to 0 when f = g(· | θ). Using similar notation as for the supervised learning setting in Table 2.1, define gGp as the global minimizer of the risk in the class Gp; that is, gGp = argming∈Gp`(g). If we define
\theta^* = \operatorname{argmin}_\theta \mathbb{E}\, \mathrm{Loss}(f(X), g(X \mid \theta)) = \operatorname{argmin}_\theta \int \big[ \ln f(x) - \ln g(x \mid \theta) \big] f(x) \, dx
= \operatorname{argmax}_\theta \int f(x) \ln g(x \mid \theta) \, dx = \operatorname{argmax}_\theta \mathbb{E} \ln g(X \mid \theta),
then g^{G_p} = g(· | θ∗) and learning g^{G_p} is equivalent to learning (or estimating) θ∗. To learn θ∗ from a training set τ = {x_1, . . . , x_n} we then minimize the training loss,

\frac{1}{n} \sum_{i=1}^n \mathrm{Loss}(f(x_i), g(x_i \mid \theta)) = -\frac{1}{n} \sum_{i=1}^n \ln g(x_i \mid \theta) + \frac{1}{n} \sum_{i=1}^n \ln f(x_i),

giving:

\hat{\theta}_n := \operatorname{argmax}_\theta \frac{1}{n} \sum_{i=1}^n \ln g(x_i \mid \theta). \qquad (2.28)
As the logarithm is an increasing function, this is equivalent to

\hat{\theta}_n := \operatorname{argmax}_\theta \prod_{i=1}^n g(x_i \mid \theta),

where ∏ᵢ₌₁ⁿ g(x_i | θ) is the likelihood of the data; that is, the joint density of the {X_i} evaluated at the points {x_i}. We have therefore recovered the classical maximum likelihood estimate of θ∗ (☞ 456).
When the risk ℓ(g(· | θ)) is convex in θ over a convex set Θ, we can find the maximum likelihood estimator by setting the gradient of the training loss to zero; that is, we solve

-\frac{1}{n} \sum_{i=1}^n S(x_i \mid \theta) = 0,

where S(x | θ) := ∂ ln g(x | θ)/∂θ is the gradient of ln g(x | θ) with respect to θ and is often called the score.
²Sometimes called the cross-entropy distance.

Example 2.5 (Exponential Model) Suppose we have the training data τ_n = {x_1, . . . , x_n}, which is modeled as a realization of n positive iid random variables: X_1, . . . , X_n ~iid f(x). We select the class of approximating functions G to be the parametric class {g : g(x | θ) = θ exp(−xθ), x > 0, θ > 0}. In other words, we look for the best g^G within the family of exponential distributions with unknown parameter θ > 0. The likelihood of the data is

\prod_{i=1}^n g(x_i \mid \theta) = \prod_{i=1}^n \theta \exp(-\theta x_i) = \exp(-\theta n \bar{x}_n + n \ln \theta),

and the score is S(x | θ) = −x + θ⁻¹. Thus, maximizing the likelihood with respect to θ is the same as maximizing −θn x̄_n + n ln θ, or solving −Σᵢ₌₁ⁿ S(x_i | θ)/n = x̄_n − θ⁻¹ = 0. In other words, the solution to (2.28) is the maximum likelihood estimate θ̂_n = 1/x̄_n.
In a supervised setting, where the data is represented by a vector x of explanatory variables and a response y, the general model is that (x, y) is an outcome of (X, Y) ~ f for some unknown f. And for a training sequence (x_1, y_1), . . . , (x_n, y_n) the default model assumption is that (X_1, Y_1), . . . , (X_n, Y_n) ~iid f. As explained in Section 2.2, the analysis primarily involves the conditional pdf f(y | x) and in particular (when using the squared-error loss) the conditional expectation g∗(x) = E[Y | X = x]. The resulting representation (2.2) allows us to then write the response at X = x as a function of the feature x plus an error term: Y = g∗(x) + ε(x).

This leads to the simplest and most important model for supervised learning, where we choose a linear class G of prediction or guess functions and assume that it is rich enough to contain the true g∗. If we further assume that, conditional on X = x, the error term ε does not depend on x, that is, E ε = 0 and Var ε = σ², then we obtain the following model.
Definition 2.1: Linear Model

In a linear model the response Y depends on a p-dimensional explanatory variable x = [x_1, . . . , x_p]⊤ via the linear relationship

Y = x^\top \beta + \varepsilon, \qquad (2.29)

where E ε = 0 and Var ε = σ².
Note that (2.29) is a model for a single pair (x, Y). The model for the training set {(x_i, Y_i)} is simply that each Y_i satisfies (2.29) (with x = x_i) and that the {Y_i} are independent. Gathering all responses in the vector Y = [Y_1, . . . , Y_n]⊤, we can write

Y = X\beta + \varepsilon, \qquad (2.30)

where ε = [ε_1, . . . , ε_n]⊤ is a vector of iid copies of ε and X is the so-called model matrix, with rows x_1⊤, . . . , x_n⊤. Linear models are fundamental building blocks of statistical learning algorithms. For this reason, a large part of Chapter 5 (☞ 167) is devoted to linear regression models.
Example 2.6 (Polynomial Regression (cont.)) For our running Example 2.1 (☞ 26), we see that the data is described by a linear model of the form (2.30), with model matrix X given in (2.10).
Before we discuss a few other models in the following sections, we would like to emphasize a number of points about modeling.

• Any model for data is likely to be wrong. For example, real data (as opposed to computer-generated data) are often assumed to come from a normal distribution, which is never exactly true. However, an important advantage of using a normal distribution is that it has many nice mathematical properties, as we will see in Section 2.7.

• Most data models depend on a number of unknown parameters, which need to be estimated from the observed data.

• Any model for real-life data needs to be checked for suitability. An important criterion is that data simulated from the model should resemble the observed data, at least for a certain choice of model parameters.
Here are some guidelines for choosing a model. Think of the data as a spreadsheet or data frame, as in Chapter 1, where rows represent the data units and the columns the data features (variables, groups).
• First establish the type of the features (quantitative, qualitative, discrete, continuous, etc.).
• Assess whether the data can be assumed to be independent across rows or columns.
• Decide on the level of generality of the model. For example, should we use a simple model with a few unknown parameters or a more generic model that has a large number of parameters? Simple specific models are easier to fit to the data (low estimation error) than more general models, but the fit itself may not be accurate (high approximation error). The tradeoffs discussed in Section 2.4 play an important role here.

• Decide on using a classical (frequentist) or Bayesian model. Section 2.9 (☞ 47) gives a short introduction to Bayesian learning.
2.7 Multivariate Normal Models
A standard model for numerical observations x_1, . . . , x_n (forming, e.g., a column in a spreadsheet or data frame) is that they are the outcomes of iid normal random variables

X_1, \ldots, X_n \overset{\text{iid}}{\sim} N(\mu, \sigma^2).
It is helpful to view a normally distributed random variable as a simple transformation of a standard normal random variable. To wit, if Z has a standard normal distribution, then X = µ + σZ has a N(µ, σ²) distribution. The generalization to n dimensions is discussed in Appendix C.7 (☞ 434). We summarize the main points. Let Z_1, . . . , Z_n ~iid N(0, 1). The pdf of Z = [Z_1, . . . , Z_n]⊤ (that is, the joint pdf of Z_1, . . . , Z_n) is given by

f_Z(z) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} z_i^2} = (2\pi)^{-n/2} e^{-\frac{1}{2} z^\top z}, \quad z \in \mathbb{R}^n. \qquad (2.31)
We write Z ~ N(0, I_n) and say that Z has a standard normal distribution in ℝⁿ. Let

X = \mu + BZ \qquad (2.32)

for some m × n matrix B and m-dimensional vector µ. Then X has expectation vector µ and covariance matrix Σ = BB⊤; see (C.20) and (C.21) (☞ 432). This leads to the following definition.
Definition 2.2: Multivariate Normal Distribution

An m-dimensional random vector X that can be written in the form (2.32) for some m-dimensional vector µ and m × n matrix B, with Z ~ N(0, I_n), is said to have a multivariate normal or multivariate Gaussian distribution with mean vector µ and covariance matrix Σ = BB⊤. We write X ~ N(µ, Σ).

The m-dimensional density of a multivariate normal distribution has a very similar form to the density of the one-dimensional normal distribution and is given in the next theorem. We leave the proof as an exercise; see Exercise 5 (☞ 59).
Theorem 2.3: Density of a Multivariate Random Vector

Let X ~ N(µ, Σ), where the m × m covariance matrix Σ is invertible. Then X has pdf

f_X(x) = \frac{1}{\sqrt{(2\pi)^m |\Sigma|}} \, e^{-\frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)}, \quad x \in \mathbb{R}^m. \qquad (2.33)
Figure 2.13 shows the pdfs of two bivariate (that is, two-dimensional) normal distributions. In both cases the mean vector is µ = [0, 0]⊤ and the variances (the diagonal elements of Σ) are 1. The correlation coefficients (or, equivalently here, the covariances) are respectively ϱ = 0 and ϱ = 0.8.
Figure 2.13: Pdfs of bivariate normal distributions with means zero, variances 1, and correlation coefficients 0 (left) and 0.8 (right).
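Samples from such a bivariate normal distribution can be generated directly from Definition 2.2 by taking X = µ + BZ with BB⊤ = Σ (a sketch using a Cholesky factor for B, with ϱ = 0.8 as in the right panel):

import numpy as np
rng = np.random.default_rng(7)
mu = np.zeros(2)
rho = 0.8
Sigma = np.array([[1.0, rho], [rho, 1.0]])
B = np.linalg.cholesky(Sigma)          # B B^T = Sigma
Z = rng.standard_normal((2, 1000))     # standard normal samples in R^2
X = mu[:, None] + B @ Z                # X ~ N(mu, Sigma), as in (2.32)
print(np.cov(X))                       # sample covariance close to Sigma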
The main reason why the multivariate normal distribution plays an important role in data science and machine learning is that it satisfies the following properties, the details and proofs of which can be found in Appendix C.7 (☞ 434):
1. Affine combinations are normal.
2. Marginal distributions are normal.
3. Conditional distributions are normal.
2.8 Normal Linear Models
Normal linear models combine the simplicity of the linear model with the tractability of the Gaussian distribution. They are the principal model for traditional statistics, and include the classic linear regression and analysis of variance models.
Definition 2.3: Normal Linear Model

In a normal linear model the response Y depends on a p-dimensional explanatory variable x = [x_1, . . . , x_p]⊤ via the linear relationship

Y = x^\top \beta + \varepsilon, \qquad (2.34)

where ε ~ N(0, σ²).
Thus, a normal linear model is a linear model (in the sense of Definition 2.1) with normal error terms. Similar to (2.30), the corresponding normal linear model for the whole training set {(x_i, Y_i)} has the form

Y = X\beta + \varepsilon, \qquad (2.35)

where X is the model matrix comprised of rows x_1⊤, . . . , x_n⊤ and ε ~ N(0, σ²I_n). Consequently, Y can be written as Y = Xβ + σZ, where Z ~ N(0, I_n), so that Y ~ N(Xβ, σ²I_n). It follows from (2.33) (☞ 45) that its joint density is given by

g(y \mid \beta, \sigma^2, X) = (2\pi\sigma^2)^{-n/2} \, e^{-\frac{1}{2\sigma^2} \|y - X\beta\|^2}. \qquad (2.36)
Estimation of the parameter β can be performed via the least-squares method, as discussed in Example 2.1. An estimate can also be obtained via the maximum likelihood method. This simply means finding the parameters σ² and β that maximize the likelihood of the outcome y, given by the right-hand side of (2.36). It is clear that for every value of σ² the likelihood is maximal when ‖y − Xβ‖² is minimal. As a consequence, the maximum likelihood estimate for β is the same as the least-squares estimate (2.15). We leave it as an exercise (see Exercise 18, ☞ 63) to show that the maximum likelihood estimate of σ² is equal to

\widehat{\sigma^2} = \frac{\|y - X\hat{\beta}\|^2}{n}, \qquad (2.37)

where β̂ is the maximum likelihood estimate (least-squares estimate in this case) of β.
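In code, both maximum likelihood estimates take only a few lines (a sketch re-using the training data from polyreg1.py with p = 4; β̂ via the pseudo-inverse (2.15) and σ̂² via (2.37); note that importing polyreg1 also re-runs its plotting code):

from polyreg1 import *

p = 4
X = u ** np.arange(p)                      # model matrix (2.10)
betahat = np.linalg.pinv(X) @ y            # MLE of beta = least squares (2.15)
sigma2hat = norm(y - X @ betahat)**2 / n   # MLE of sigma^2, see (2.37)
print(betahat.ravel(), sigma2hat)          # sigma2hat should be near 25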
2.9 Bayesian Learning
In Bayesian unsupervised learning, we seek to approximate the unknown joint density f(x_1, . . . , x_n) of the training data T_n = {X_1, . . . , X_n} via a joint pdf of the form

\int \Big( \prod_{i=1}^n g(x_i \mid \theta) \Big) w(\theta) \, d\theta, \qquad (2.38)
where g(· | θ) belongs to a family of parametric densities Gp := {g(· | θ), θ ∈ Θ} (viewed as a family of pdfs conditional on a parameter θ in some set Θ ⊂ Rp) and w(θ) is a pdf that belongs to a (possibly different) family of densities Wp. Note how the joint pdf (2.38) satisfies the permutational invariance (2.26) and can thus be useful as a model for training data which is part of an exchangeable sequence of random variables.
Following standard practice in a Bayesian context, instead of writing f_X(x) and f_{X|Y}(x | y) for the pdf of X and the conditional pdf of X given Y, one simply writes f(x) and f(x | y). If Y is a different random variable, its pdf (at y) is thus denoted by f(y). Thus, we will use the same symbol g for different (conditional) approximating probability densities and f for the different (conditional) true and unknown probability densities. Using Bayesian notation, we can write g(τ | θ) = ∏_{i=1}^n g(x_i | θ), and thus the approximating joint pdf (2.38) can be written as ∫ g(τ | θ) w(θ) dθ and the true unknown joint pdf as f(τ) = f(x_1, . . . , x_n).
Once G_p and W_p are specified, selecting an approximating function g(x) of the form

g(x) = ∫ g(x | θ) w(θ) dθ

is equivalent to selecting a suitable w from W_p. Similar to (2.27), we can use the Kullback–Leibler risk to measure the discrepancy between the proposed approximation (2.38) and the true f(τ):

ℓ(g) = E ln [ f(T) / ∫ g(T | θ) w(θ) dθ ] = ∫ f(τ) ln [ f(τ) / ∫ g(τ | θ) w(θ) dθ ] dτ. (2.39)
The main difference with (2.27) is that since the training data is not necessarily iid (it may be exchangeable, for example), the expectation must be with respect to the joint density of T, not with respect to the marginal f(x) (as in the iid case).
Minimizing the training loss is equivalent to maximizing the likelihood of the training data τ; that is, solving the optimization problem

max_{w ∈ W_p} ∫ g(τ | θ) w(θ) dθ,

where the maximization is over an appropriate class W_p of density functions that is believed to result in the smallest Kullback–Leibler risk.
Suppose that we have a rough guess, denoted w_0(θ), for the best w ∈ W_p that minimizes the Kullback–Leibler risk. We can always increase the resulting likelihood L_0 := ∫ g(τ | θ) w_0(θ) dθ by instead using the density w_1(θ) := w_0(θ) g(τ | θ) / L_0, giving a likelihood L_1 := ∫ g(τ | θ) w_1(θ) dθ. To see this, write L_0 and L_1 as expectations with respect to w_0. In particular, we can write

L_0 = E_{w_0} g(τ | θ) and L_1 = E_{w_1} g(τ | θ) = E_{w_0} g²(τ | θ) / L_0.

It follows that

L_1 − L_0 = (1/L_0) E_{w_0}[g²(τ | θ) − L_0²] = (1/L_0) Var_{w_0}[g(τ | θ)] ≥ 0. (2.40)
We may thus expect to obtain better predictions using w_1 instead of w_0, because w_1 has taken into account the observed data τ and increased the likelihood of the model. In fact, if we iterate this process (see Exercise 20) and create a sequence of densities w_1, w_2, . . . such that w_t(θ) ∝ w_{t−1}(θ) g(τ | θ), then w_t(θ) concentrates more and more of its probability mass at the maximum likelihood estimator θ̂ (see (2.28)) and in the limit equals a (degenerate) point-mass pdf at θ̂. In other words, in the limit we recover the maximum likelihood method: g_τ(x) = g(x | θ̂). Thus, unless the class of densities W_p is restricted to be non-degenerate, maximizing the likelihood as much as possible leads to a degenerate choice for w(θ).
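This concentration of the iterates w_t at the MLE is easy to see numerically. Here is a minimal sketch, assuming a Bernoulli likelihood with illustrative values n = 100 and s = 30 and a discretized θ (none of these specifics are from the text):

```python
import numpy as np

# Bernoulli likelihood with s "successes" out of n, on a grid of theta values.
n, s = 100, 30
theta = np.linspace(0.001, 0.999, 999)
loglik = s * np.log(theta) + (n - s) * np.log(1 - theta)

# Iterate w_t(theta) ∝ w_{t-1}(theta) g(tau | theta), starting from a flat w_0;
# work in log space to avoid numerical underflow.
logw = np.zeros_like(theta)
for t in range(50):
    logw += loglik
    logw -= logw.max()          # stabilize before exponentiating

w = np.exp(logw)
w /= w.sum()                    # normalize on the grid

# The mass piles up ever more tightly around the MLE s/n = 0.3.
print(theta[np.argmax(w)], (w > 1e-6).sum())
```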
In many situations, the maximum likelihood estimate g(τ | θ̂) is either not an appropriate approximation to f(τ) (see Example 2.9), or simply fails to exist (see Exercise 10 in Chapter 4). In such cases, given an initial non-degenerate guess w_0(θ) = g(θ), one can obtain a more appropriate and non-degenerate approximation to f(τ) by taking w(θ) = w_1(θ) ∝ g(τ | θ) g(θ) in (2.38), giving the following Bayesian learner of f(x):

g_τ(x) := ∫ g(x | θ) [ g(τ | θ) g(θ) / ∫ g(τ | ϑ) g(ϑ) dϑ ] dθ, (2.41)

where ∫ g(τ | ϑ) g(ϑ) dϑ = g(τ). Using Bayes' formula for probability densities,

g(θ | τ) = g(τ | θ) g(θ) / g(τ), (2.42)

we can write w_1(θ) = g(θ | τ). With this notation, we have the following definitions.
Definition 2.4: Prior, Likelihood, and Posterior

Let τ and G_p := {g(· | θ), θ ∈ Θ} be the training set and family of approximating functions.

• A pdf g(θ) that reflects our a priori beliefs about θ is called the prior pdf.
• The conditional pdf g(τ | θ) is called the likelihood.
• Inference about θ is given by the posterior pdf g(θ | τ), which is proportional to the product of the prior and the likelihood:

g(θ | τ) ∝ g(τ | θ) g(θ).
Remark 2.1 (Early Stopping) The Bayes iteration above is an example of an "early stopping" heuristic for maximum likelihood optimization, where we exit after only one step. As observed above, if we keep iterating, we obtain the maximum likelihood estimate (MLE). In a sense, the Bayes rule provides a regularization of the MLE. Regularization is discussed in more detail in Chapter 6; see also Example 2.9. The early stopping rule is also of benefit in regularization; see Exercise 20 in Chapter 6.
On the one hand, the initial guess g(θ) conveys the a priori (prior to training the Bayesian learner) information about the optimal density in W_p that minimizes the Kullback–Leibler risk. Using this prior g(θ), the Bayesian approximation to f(x) is the prior predictive density:

g(x) = ∫ g(x | θ) g(θ) dθ.

On the other hand, the posterior pdf conveys improved knowledge about this optimal density in W_p after training with τ. Using the posterior g(θ | τ), the Bayesian learner of f(x) is the posterior predictive density:

g_τ(x) = g(x | τ) = ∫ g(x | θ) g(θ | τ) dθ,

where we have assumed that g(x | θ, τ) = g(x | θ); that is, the likelihood depends on τ only through the parameter θ.
The choice of the prior is typically governed by two considerations:

1. the prior should be simple enough to facilitate the computation or simulation of the posterior pdf;
2. the prior should be general enough to model ignorance of the parameter of interest.

Priors that do not convey much knowledge of the parameter are said to be uninformative. The uniform or flat prior in Example 2.9 (to follow) is frequently used.
For the purpose of analytical and numerical computations, we can view θ as a random vector with prior density g(θ), which after training is updated to the posterior density g(θ | τ). This point of view allows us to write, for example, g(x | τ) ∝ ∫ g(x | θ) g(τ | θ) g(θ) dθ, thus ignoring any constants that do not depend on the argument of the densities.
Example 2.7 (Normal Model) Suppose that the training data T = {X_1, . . . , X_n} is modeled using the likelihood g(x | θ) that is the pdf of

X | θ ∼ N(µ, σ²),

where θ := [µ, σ²]^⊤. Next, we need to specify the prior distribution of θ to complete the model. We can specify prior distributions for µ and σ² separately and then take their product to obtain the prior for the vector θ (assuming independence). A possible prior distribution for µ is

µ ∼ N(ν, φ²). (2.43)
It is typical to refer to any parameters of the prior density as hyperparameters of the Bayesian model. Instead of directly giving a prior for σ² (or σ), it turns out to be convenient to give the following prior distribution to 1/σ²:

1/σ² ∼ Gamma(α, β). (2.44)

The smaller α and β are, the less informative is the prior. Under this prior, σ² is said to have an inverse gamma³ distribution. If 1/Z ∼ Gamma(α, β), then the pdf of Z is proportional to exp(−β/z)/z^{α+1} (Exercise 19). The Bayesian posterior is then given by:
g(µ, σ² | τ) ∝ g(µ) × g(σ²) × g(τ | µ, σ²)

∝ exp{−(µ − ν)²/(2φ²)} × [ exp{−β/σ²} / (σ²)^{α+1} ] × [ exp{−∑_i (x_i − µ)²/(2σ²)} / (σ²)^{n/2} ]

∝ (σ²)^{−n/2−α−1} exp{ −(µ − ν)²/(2φ²) − β/σ² − [(µ − x̄_n)² + S_n²]/(2σ²/n) },

where S_n² := (1/n) ∑_i x_i² − x̄_n² = (1/n) ∑_i (x_i − x̄_n)² is the (scaled) sample variance. All inference about (µ, σ²) is then represented by the posterior pdf. To facilitate computations it is helpful to find out if the posterior belongs to a recognizable family of distributions. For example, the conditional pdf of µ given σ² and τ is

g(µ | σ², τ) ∝ exp{ −(µ − ν)²/(2φ²) − (µ − x̄_n)²/(2σ²/n) },
which after simplification can be recognized as the pdf of

(µ | σ², τ) ∼ N( γ_n x̄_n + (1 − γ_n)ν, γ_n σ²/n ), (2.45)

where we have defined the weight parameter γ_n := (n/σ²) / (1/φ² + n/σ²). We can then see that the posterior mean E[µ | σ², τ] = γ_n x̄_n + (1 − γ_n)ν is a weighted linear combination of the prior mean ν and the sample average x̄_n. Further, as n → ∞, the weight γ_n → 1 and thus the posterior mean approaches the maximum likelihood estimate x̄_n.
It is sometimes possible to use a prior g(θ) that is not a bona fide probability density, in the sense that ∫ g(θ) dθ = ∞, as long as the resulting posterior g(θ | τ) ∝ g(τ | θ) g(θ) is a proper pdf. Such a prior is called an improper prior.
Example 2.8 (Normal Model (cont.)) An example of an improper prior is obtained from (2.43) when we let φ → ∞ (the larger φ is, the more uninformative is the prior). Then g(µ) ∝ 1 is a flat prior, but ∫ g(µ) dµ = ∞, making it an improper prior. Nevertheless, the posterior is a proper density, and in particular the conditional posterior of (µ | σ², τ) simplifies to

(µ | σ², τ) ∼ N( x̄_n, σ²/n ),
³ Reciprocal gamma distribution would have been a better name.
because the weight parameter γ_n goes to 1 as φ → ∞. The improper prior g(µ) ∝ 1 also allows us to simplify the posterior marginal for σ²:

g(σ² | τ) = ∫ g(µ, σ² | τ) dµ ∝ (σ²)^{−(n−1)/2−α−1} exp{ −(β + nS_n²/2)/σ² },

which we recognize as the density corresponding to

(1/σ²) | τ ∼ Gamma( α + (n − 1)/2, β + (n/2) S_n² ).
In addition to g(µ) ∝ 1, we can also use an improper prior for σ². If we take the limit α → 0 and β → 0 in (2.44), then we obtain the improper prior g(σ²) ∝ 1/σ² (or, equivalently, g(1/σ²) ∝ 1/σ²). In this case, the posterior marginal density for σ² implies that

nS_n²/σ² | τ ∼ χ²_{n−1},

and the posterior marginal density for µ implies that

(µ − x̄_n) / (S_n/√(n − 1)) | τ ∼ t_{n−1}. (2.46)
In general, deriving a simple formula for the posterior density of θ is either impossible or too tedious. Instead, the Monte Carlo methods in Chapter 3 can be used to simulate (approximately) from the posterior for the purposes of inference and prediction.
One way in which a distributional result such as (2.46) can be useful is in the construction of a 95% credible interval I for the parameter µ; that is, an interval I such that the probability P[µ ∈ I | τ] is equal to 0.95. For example, the symmetric 95% credible interval is

I = [ x̄_n − γ S_n/√(n − 1), x̄_n + γ S_n/√(n − 1) ],

where γ is the 0.975-quantile of the t_{n−1} distribution. Note that the credible interval is not a random object and that the parameter µ is interpreted as a random variable with a distribution. This is unlike the case of classical confidence intervals, where the parameter is nonrandom, but the interval is (the outcome of) a random object.
As a generalization of the 95% Bayesian credible interval we can define a (1 − α) credible region, which is any set R satisfying

P[θ ∈ R | τ] = ∫_{θ∈R} g(θ | τ) dθ ≥ 1 − α. (2.47)
Example 2.9 (Bayesian Regularization of Maximum Likelihood) Consider modeling the number of deaths during birth in a maternity ward. Suppose that the hospital data consists of τ = {x_1, . . . , x_n}, with x_i = 1 if the i-th baby has died during birth and x_i = 0 otherwise, for i = 1, . . . , n. A possible Bayesian model for the data is θ ∼ U(0, 1) (uniform prior) with (X_1, . . . , X_n | θ) ∼iid Ber(θ). The likelihood is therefore

g(τ | θ) = ∏_{i=1}^n θ^{x_i} (1 − θ)^{1−x_i} = θ^s (1 − θ)^{n−s},

where s = x_1 + · · · + x_n is the total number of deaths. Since g(θ) = 1, the posterior pdf is

g(θ | τ) ∝ θ^s (1 − θ)^{n−s}, θ ∈ [0, 1],

which is the pdf of the Beta(s + 1, n − s + 1) distribution. The normalization constant is (n + 1) \binom{n}{s}. The posterior pdf is shown in Figure 2.14 for (s, n) = (0, 100).
Figure 2.14: Posterior pdf for θ, with n = 100 and s = 0.
It is not difficult to see that the maximum a posteriori (MAP) estimate of θ (the mode or maximizer of the posterior density) is

argmax_θ g(θ | τ) = s/n,
which agrees with the maximum likelihood estimate. Figure 2.14 also shows that the left one-sided 95% credible interval for θ is [0, 0.0292], where 0.0292 is the 0.95 quantile (rounded) of the Beta(1, 101) distribution.

Observe that when (s, n) = (0, 100) the maximum likelihood estimate θ̂ = 0 infers that deaths at birth are not possible. We know that this inference is wrong: the probability of death can never be zero; it is simply (and fortunately) too small to be inferred accurately from a sample size of n = 100. In contrast to the maximum likelihood estimate, the posterior mean E[θ | τ] = (s + 1)/(n + 2) is not zero for (s, n) = (0, 100) and provides the more reasonable point estimate of 0.0098 for the probability of death.
In addition, while computing a Bayesian credible interval poses no conceptual difficulties, it is not simple to derive a confidence interval for the maximum likelihood estimate θ̂, because the likelihood as a function of θ is not differentiable at θ = 0. As a result of this lack of smoothness, the usual confidence intervals based on the normal approximation cannot be used.
We now return to the unsupervised learning setting of Section 2.6, but consider it from a Bayesian perspective. Recall from (2.39) that the Kullback–Leibler risk for an approximating function g is

ℓ(g) = ∫ f(τ′_n) [ln f(τ′_n) − ln g(τ′_n)] dτ′_n,

where τ′_n denotes the test data. Since ∫ f(τ′_n) ln f(τ′_n) dτ′_n plays no role in minimizing the risk, we consider instead the cross-entropy risk, defined as

ℓ(g) = −∫ f(τ′_n) ln g(τ′_n) dτ′_n.
Note that the smallest possible cross-entropy risk is ℓ*_n = −∫ f(τ′_n) ln f(τ′_n) dτ′_n. The expected generalization risk of the Bayesian learner can then be decomposed as

E ℓ(g_{T_n}) = ℓ*_n + ∫ f(τ′_n) ln [ f(τ′_n) / E g(τ′_n | T_n) ] dτ′_n ("bias" component)
+ E ∫ f(τ′_n) ln [ E g(τ′_n | T_n) / g(τ′_n | T_n) ] dτ′_n ("variance" component),

where g_{T_n}(τ′_n) = g(τ′_n | T_n) = ∫ g(τ′_n | θ) g(θ | T_n) dθ is the posterior predictive density after observing T_n.
Assuming that the sets T_n and T′_n are comprised of 2n iid random variables with density f, we can show (Exercise 23) that the expected generalization risk simplifies to

E ℓ(g_{T_n}) = E ln g(T_n) − E ln g(T_{2n}), (2.48)

where g(τ_n) and g(τ_{2n}) are the prior predictive densities of τ_n and τ_{2n}, respectively. Let θ_n = argmax_θ g(θ | T_n) be the MAP estimator of θ* := argmax_θ E ln g(X | θ). Assuming that θ_n converges to θ* (with probability one) and that (1/n) E ln g(T_n | θ_n) = E ln g(X | θ*) + O(1/n), we can use the following large-sample approximation of the expected generalization risk.
Theorem 2.4: Approximating the Bayesian Cross-Entropy Risk

For n → ∞, the expected cross-entropy generalization risk satisfies

E ℓ(g_{T_n}) ≃ −E ln g(T_n) − (p/2) ln n, (2.49)

where (with p the dimension of the parameter vector θ and θ_n the MAP estimator)

E ln g(T_n) ≃ E ln g(T_n | θ_n) − (p/2) ln n. (2.50)
Proof: To show (2.50), we apply Theorem C.21 to ln ∫ e^{−n r_n(θ)} g(θ) dθ, where

r_n(θ) := −(1/n) ln g(T_n | θ) = −(1/n) ∑_{i=1}^n ln g(X_i | θ) →a.s. −E ln g(X | θ) =: r(θ) < ∞.

This gives (with probability one)

ln ∫ g(T_n | θ) g(θ) dθ ≃ −n r(θ*) − (p/2) ln n.

Taking expectations on both sides and using n r(θ*) = n E[r_n(θ_n)] + O(1), we deduce (2.50). To demonstrate (2.49), we derive the asymptotic approximation of E ln g(T_{2n}) by repeating the argument for (2.50), but replacing n with 2n where necessary. Thus, we obtain

E ln g(T_{2n}) ≃ −2n r(θ*) − (p/2) ln(2n).

Then, (2.49) follows from the identity (2.48).
The results of Theorem 2.4 have two major implications for model selection and assessment. First, (2.49) suggests that −ln g(T_n) can be used as a crude (leading-order) asymptotic approximation to the expected generalization risk for large n and fixed p. In this context, the prior predictive density g(T_n) is usually called the model evidence or marginal likelihood for the class G_p. Since the integral ∫ g(T_n | θ) g(θ) dθ is rarely available in closed form, the exact computation of the model evidence is typically not feasible and may require Monte Carlo estimation methods.
Second, when the model evidence is difficult to compute via Monte Carlo methods or otherwise, (2.50) suggests that we can use the following large-sample approximation:

−2 E ln g(T_n) ≃ −2 ln g(T_n | θ_n) + p ln n. (2.51)

The asymptotic approximation on the right-hand side of (2.51) is called the Bayesian information criterion (BIC). We prefer the class G_p with the smallest BIC. The BIC is typically used when the model evidence is difficult to compute and n is sufficiently larger than p. For fixed p, and as n becomes larger and larger, the BIC becomes a more and more accurate estimator of −2 E ln g(T_n). Note that the BIC approximation is valid even when the true density f ∉ G_p. The BIC provides an alternative to the Akaike information criterion (AIC) for model selection. However, while the BIC approximation does not assume that the true model f belongs to the parametric class under consideration, the AIC assumes that f ∈ G_p. Thus, the AIC is merely a heuristic approximation based on the asymptotic approximations in Theorem 4.1.
Although the above Bayesian theory has been presented in an unsupervised learning setting, it can be readily extended to the supervised case. We only need to relabel the training set T_n. In particular, when (as is typical for regression models) the training responses Y_1, . . . , Y_n are considered as random variables but the corresponding feature vectors x_1, . . . , x_n are viewed as being fixed, then T_n is the collection of random responses {Y_1, . . . , Y_n}. Alternatively, we can simply identify T_n with the response vector Y = [Y_1, . . . , Y_n]^⊤. We will adopt this notation in the next example.
Example 2.10 (Polynomial Regression (cont.)) Consider Example 2.2 once again, but now in a Bayesian framework, where the prior knowledge on (σ², β) is specified by g(σ²) = 1/σ² and β | σ² ∼ N(0, σ²D), where D is a (matrix) hyperparameter. Let Σ̄ := (X^⊤X + D^{−1})^{−1}. Then the posterior can be written as:

g(β, σ² | y) = [ exp{−‖y − Xβ‖²/(2σ²)} / (2πσ²)^{n/2} ] × [ exp{−β^⊤D^{−1}β/(2σ²)} / ((2πσ²)^{p/2} |D|^{1/2}) ] × (1/σ²) / g(y)

= [ (σ²)^{−(n+p)/2−1} / ((2π)^{(n+p)/2} |D|^{1/2}) ] exp{ −‖Σ̄^{−1/2}(β − β̄)‖²/(2σ²) − (n + p + 2) σ̄²/(2σ²) } / g(y),

where β̄ := Σ̄X^⊤y and σ̄² := y^⊤(I − XΣ̄X^⊤)y/(n + p + 2) are the MAP estimates of β and σ², and g(y) is the model evidence for G_p:

g(y) = ∬ g(β, σ², y) dβ dσ²

= [ |Σ̄|^{1/2} / ((2π)^{n/2} |D|^{1/2}) ] ∫_0^∞ exp{−(n + p + 2) σ̄²/(2σ²)} (σ²)^{−n/2−1} dσ²

= |Σ̄|^{1/2} Γ(n/2) / ( |D|^{1/2} (π(n + p + 2) σ̄²)^{n/2} ).
Therefore, based on (2.49), we have

2 E ℓ(g_{T_n}) ≃ −2 ln g(y) = n ln[π(n + p + 2) σ̄²] − 2 ln Γ(n/2) + ln |D| − ln |Σ̄|.

On the other hand, minus the log-likelihood of Y can be written as

−ln g(y | β, σ²) = ‖y − Xβ‖²/(2σ²) + (n/2) ln(2πσ²)

= ‖Σ̄^{−1/2}(β − β̄)‖²/(2σ²) + (n + p + 2) σ̄²/(2σ²) + (n/2) ln(2πσ²).

Therefore, the BIC approximation (2.51) is

−2 ln g(y | β̄, σ̄²) + (p + 1) ln n = n[ln(2πσ̄²) + 1] + (p + 1) ln n + (p + 2), (2.52)
where the extra ln n term in (p + 1) ln n is due to the inclusion of σ² in θ = (σ², β). Figure 2.15 shows the model evidence and its BIC approximation, where we used the hyperparameter D = 10⁴ × I_p for the prior density of β. We can see that both approximations exhibit a pronounced minimum at p = 4, thus identifying the true polynomial regression model. Compare the overall qualitative shape of the cross-entropy risk estimate with the shape of the squared-error risk estimate in Figure 2.11.
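Here is a minimal sketch of the closed-form model evidence derived above, applied to toy data generated from the cubic β = [10, −140, 400, −250]^⊤ used earlier in the chapter (the sample size and noise level here are illustrative, not the chapter's exact settings):

```python
import numpy as np
from scipy.special import gammaln

def log_evidence(X, y, D):
    # ln g(y) for the normal linear model with prior g(sigma^2) = 1/sigma^2 and
    # beta | sigma^2 ~ N(0, sigma^2 D), using the closed form from Example 2.10.
    n, p = X.shape
    Sigma = np.linalg.inv(X.T @ X + np.linalg.inv(D))      # Sigma-bar
    beta_map = Sigma @ X.T @ y                             # beta-bar
    sigma2_map = y @ (y - X @ beta_map) / (n + p + 2)      # sigma-bar^2
    _, logdet_Sigma = np.linalg.slogdet(Sigma)
    _, logdet_D = np.linalg.slogdet(D)
    return (0.5 * logdet_Sigma - 0.5 * logdet_D + gammaln(n / 2)
            - (n / 2) * np.log(np.pi * (n + p + 2) * sigma2_map))

# Toy data from the cubic used earlier in the chapter (noise level illustrative).
rng = np.random.default_rng(5)
u = rng.uniform(size=100)
y = 10 - 140 * u + 400 * u**2 - 250 * u**3 + 5 * rng.standard_normal(100)

for p in range(1, 9):
    X = np.vander(u, N=p, increasing=True)   # columns 1, u, ..., u^{p-1}
    print(p, log_evidence(X, y, 1e4 * np.eye(p)))
```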
Figure 2.15: The BIC and marginal likelihood used for model selection.
It is possible to give the model complexity parameter p a Bayesian treatment, in which we define a prior density on the set of all models under consideration. For example, let g(p), p = 1, . . . , m, be a prior density on m candidate models. Treating the model complexity index p as an additional parameter to θ ∈ R^p, and applying Bayes' formula, the posterior for (θ, p) can be written as:

g(θ, p | τ) = g(θ | p, τ) × g(p | τ) = [ g(τ | θ, p) g(θ | p) / g(τ | p) ] × [ g(τ | p) g(p) / g(τ) ],

where the first factor is the posterior of θ given model p, and the second factor is the posterior of model p.
The model evidence for a fixed p is now interpreted as the prior predictive density of τ, conditional on the model p:

g(τ | p) = ∫ g(τ | θ, p) g(θ | p) dθ,
and the quantity g(τ) = ∑_{p=1}^m g(τ | p) g(p) is interpreted as the marginal likelihood of all the m candidate models. Finally, a simple method for model selection is to pick the index p̂ with the largest posterior probability:

p̂ = argmax_p g(p | τ) = argmax_p g(τ | p) g(p).
Example 2.11 (Polynomial Regression (cont.)) Let us revisit Example 2.10 by giving the parameter p = 1, . . . , m, with m = 10, a Bayesian treatment. Recall that we used the notation τ = y in that example. We assume that the prior g(p) = 1/m is flat and uninformative, so that the posterior is given by

g(p | y) ∝ g(y | p) = |Σ̄|^{1/2} Γ(n/2) / ( |D|^{1/2} (π(n + p + 2) σ̄²)^{n/2} ),
where all quantities in g(y | p) are computed using the first p columns of X. Figure 2.16 shows the resulting posterior density g(p | y). The figure also shows the posterior density ĝ(y | p) / ∑_{p=1}^{10} ĝ(y | p), where

ĝ(y | p) := exp( −{ n[ln(2πσ̄²) + 1] + (p + 1) ln n + (p + 2) } / 2 )

is derived from the BIC approximation (2.52). In both cases, there is a clear maximum at p = 4, suggesting that a third-degree polynomial is the most appropriate model for the data.
Figure 2.16: Posterior probabilities for each polynomial model of degree p − 1.
Suppose that we wish to compare two models, say model p = 1 and model p = 2. Instead of computing the posterior g(p | τ) explicitly, we can compare the posterior odds ratio:

g(p = 1 | τ) / g(p = 2 | τ) = [ g(p = 1) / g(p = 2) ] × [ g(τ | p = 1) / g(τ | p = 2) ],

where the second factor is the Bayes factor B_{1|2}. This gives rise to the Bayes factor B_{i|j}, whose value signifies the strength of the evidence in favor of model i over model j. In particular, B_{i|j} > 1 means that the evidence in favor of model i is larger.
Example 2.12 (Savage–Dickey Ratio) Suppose that we have two models. Model p = 2 has a likelihood g(τ | µ, ν, p = 2), depending on two parameters. Model p = 1 has the same functional form for the likelihood, but now ν is fixed to some (known) ν_0; that is, g(τ | µ, p = 1) = g(τ | µ, ν = ν_0, p = 2). We also assume that the prior information on µ
for model 1 is the same as that for model 2, conditioned on ν = ν_0. That is, we assume g(µ | p = 1) = g(µ | ν = ν_0, p = 2). As model 2 contains model 1 as a special case, the latter is said to be nested inside model 2. We can formally write (see also Exercise 26):

g(τ | p = 1) = ∫ g(τ | µ, p = 1) g(µ | p = 1) dµ

= ∫ g(τ | µ, ν = ν_0, p = 2) g(µ | ν = ν_0, p = 2) dµ

= g(τ | ν = ν_0, p = 2) = g(τ, ν = ν_0 | p = 2) / g(ν = ν_0 | p = 2).
Hence, the Bayes factor simplifies to

B_{1|2} = g(τ | p = 1) / g(τ | p = 2) = [ g(τ, ν = ν_0 | p = 2) / g(ν = ν_0 | p = 2) ] / g(τ | p = 2) = g(ν = ν_0 | τ, p = 2) / g(ν = ν_0 | p = 2).

In other words, B_{1|2} is the ratio of the posterior density to the prior density of ν, evaluated at ν = ν_0, both under the unrestricted model p = 2. This ratio of posterior to prior densities is called the Savage–Dickey density ratio.
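A minimal numerical sketch of the Savage–Dickey ratio for a simplified single-parameter analogue of this setup (a conjugate normal prior, so the posterior is available in closed form; all values are illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

# Single-parameter analogue: under model 2, nu ~ N(0, 1) a priori and
# X_i | nu ~ N(nu, 1); model 1 fixes nu = nu0. The conjugate prior gives a
# closed-form posterior, so B_{1|2} = g(nu0 | tau) / g(nu0) is direct to compute.
nu0 = 0.0
x = rng.normal(loc=0.2, scale=1.0, size=20)   # illustrative data
n, xbar = len(x), x.mean()

post_var = 1.0 / (1.0 + n)        # posterior variance of nu
post_mean = n * xbar * post_var   # posterior mean of nu

B12 = norm.pdf(nu0, loc=post_mean, scale=np.sqrt(post_var)) / norm.pdf(nu0, 0.0, 1.0)
print(B12)   # values > 1 favor the restricted model nu = nu0
```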
Whether to use a classical (frequentist) or Bayesian model is largely a question of convenience. Classical inference is useful because it comes with a huge repository of ready-to-use results and requires no (subjective) prior information on the parameters. Bayesian models are useful because the whole theory is based on the elegant Bayes' formula, and uncertainty in the inference can be quantified much more naturally (e.g., via credible intervals rather than confidence intervals). A usual practice is to "Bayesify" a classical model, simply by adding some prior information on the parameters.
Further Reading
A popular textbook on statistical learning is [55]. Accessible treatments of mathematical statistics can be found, for example, in [69], [74], and [124]. More advanced treatments are given in [10], [25], and [78]. A good overview of modern-day statistical inference is given in [36]. Classical references on pattern classification and machine learning are [12] and [35]. For advanced learning theory including information theory and Rademacher complexity, we refer to [28] and [109]. An applied reference for Bayesian inference is [46]. For a survey of numerical techniques relevant to computational statistics, see [90].
Exercises
1. Suppose that the loss function is the piecewise linear function

Loss(y, ŷ) = α (ŷ − y)₊ + β (y − ŷ)₊, α, β > 0,

where c₊ is equal to c if c > 0, and zero otherwise. Show that the minimizer of the risk ℓ(g) = E Loss(Y, g(X)) satisfies

P[Y < g*(x) | X = x] = β / (α + β).

In other words, g*(x) is the β/(α + β) quantile of Y, conditional on X = x.
2. Show that, for the squared-error loss, the approximation error ℓ(g^G) − ℓ(g*) in (2.16) is equal to E(g^G(X) − g*(X))². [Hint: expand ℓ(g^G) = E(Y − g*(X) + g*(X) − g^G(X))².]
3. Suppose G is the class of linear functions. A linear function evaluated at a feature x can be described as g(x) = β^⊤x for some parameter vector β of appropriate dimension. Denote g^G(x) = x^⊤β^G and g^G_τ(x) = x^⊤β̂. Show that

E( g^G_τ(X) − g*(X) )² = E( X^⊤β̂ − X^⊤β^G )² + E( X^⊤β^G − g*(X) )².

Hence, deduce that the statistical error in (2.16) is ℓ(g^G_τ) − ℓ(g^G) = E( g^G_τ(X) − g^G(X) )².

4. Show that formula (2.24) holds for the 0–1 loss with 0–1 response.
5. Let X be an n-dimensional normal random vector with mean vector µ and covariance matrix Σ, where the determinant of Σ is non-zero. Show that X has joint probability density

f_X(x) = (1/√((2π)^n |Σ|)) e^{−(1/2)(x−µ)^⊤Σ^{−1}(x−µ)}, x ∈ R^n.
6. Let β̂ = A⁺y. Using the defining properties of the pseudo-inverse, show that for any β ∈ R^p,

‖Aβ̂ − y‖ ≤ ‖Aβ − y‖.
7. Suppose that in the polynomial regression Example 2.1 we select the linear class of functions G_p with p ≥ 4. Then g* ∈ G_p and the approximation error is zero, because g^{G_p}(x) = g*(x) = x^⊤β, where β = [10, −140, 400, −250, 0, . . . , 0]^⊤ ∈ R^p. Use the tower property to show that the learner g_τ(x) = x^⊤β̂ with β̂ = X⁺y, assuming rank(X) ≥ 4, is unbiased:

E g_T(x) = g*(x).
8. (Exercise 7 continued.) Observe that the learner g_T can be written as a linear combination of the response variable: g_T(x) = x^⊤X⁺Y. Prove that for any learner of the form x^⊤AY, where A ∈ R^{p×n} is some matrix satisfying E_X[x^⊤AY] = g*(x), we have

Var_X[x^⊤X⁺Y] ≤ Var_X[x^⊤AY],

where the equality is achieved for A = X⁺. This is called the Gauss–Markov inequality. Hence, using the Gauss–Markov inequality, deduce that for the unconditional variance:

Var g_T(x) ≤ Var[x^⊤AY].

Deduce that A = X⁺ also minimizes the expected generalization risk.
9. Consider again the polynomial regression Example 2.1. Use the fact that E_X β̂ = X⁺h*(u), where h*(u) = E[Y | U = u] and h*(u) = [h*(u_1), . . . , h*(u_n)]^⊤, to show that the expected in-sample risk is:

E_X ℓ_in(g_T) = ℓ* + ( ‖h*(u)‖² − ‖XX⁺h*(u)‖² )/n + ℓ* p/n.

Also, use Theorem C.2 to show that the expected statistical error is:

E_X (β̂ − β)^⊤ H_p (β̂ − β) = ℓ* tr( X⁺(X⁺)^⊤ H_p ) + ( X⁺h*(u) − β )^⊤ H_p ( X⁺h*(u) − β ).
10. Consider the setting of the polynomial regression in Example 2.2. Use Theorem C.19 to prove that

√n (β̂_n − β_p) →d N( 0, ℓ* H_p^{−1} + H_p^{−1} M_p H_p^{−1} ), (2.53)

where M_p := E[ XX^⊤ (g*(X) − g^{G_p}(X))² ] is the matrix with (i, j)-th entry

∫_0^1 u^{i+j−2} ( h_{H_p}(u) − h*(u) )² du,

and H_p^{−1} is the p × p inverse Hilbert matrix, with (i, j)-th entry

(−1)^{i+j} (i + j − 1) \binom{p + i − 1}{p − j} \binom{p + j − 1}{p − i} \binom{i + j − 2}{i − 1}².

Observe that M_p = 0 for p ≥ 4, so that the matrix term M_p is due to choosing a restrictive class G_p that does not contain the true prediction function.
11. In Example 2.2 we saw that the statistical error can be expressed (see (2.20)) as

∫_0^1 ( [1, . . . , u^{p−1}] (β̂ − β_p) )² du = (β̂ − β_p)^⊤ H_p (β̂ − β_p).

By Exercise 10, the random vector Z_n := √n (β̂_n − β_p) has asymptotically a multivariate normal distribution with mean vector 0 and covariance matrix V := ℓ* H_p^{−1} + H_p^{−1} M_p H_p^{−1}. Use Theorem C.2 to show that the expected statistical error is asymptotically

E (β̂ − β_p)^⊤ H_p (β̂ − β_p) ≃ ℓ* p/n + tr( M_p H_p^{−1} )/n, n → ∞. (2.54)

Plot this large-sample approximation of the expected statistical error and compare it with the outcome of the statistical error.

We note a subtle technical detail: In general, convergence in distribution does not imply convergence in L_p-norm (see Example C.6), and so here we have implicitly assumed that ‖Z_n‖ →d Dist. ⇒ ‖Z_n‖_{L_2} → constant := lim_{n↑∞} E‖Z_n‖.
12. Consider again Example 2.2. The result in (2.53) suggests that E β̂ → β_p as n → ∞, where β_p is the solution in the class G_p given in (2.18). Thus, the large-sample approximation of the pointwise bias of the learner g^{G_p}_T(x) = x^⊤β̂ at x = [1, . . . , u^{p−1}]^⊤ is

E g^{G_p}_T(x) − g*(x) ≃ [1, . . . , u^{p−1}] β_p − [1, u, u², u³] β*, n → ∞.

Use Python to reproduce Figure 2.17, which shows the (large-sample) pointwise squared bias of the learner for p ∈ {1, 2, 3}. Note how the bias is larger near the endpoints u = 0 and u = 1. Explain why the areas under the curves correspond to the approximation errors.
Figure 2.17: The large-sample pointwise squared bias of the learner for p = 1, 2, 3. The bias is zero for p ≥ 4.
13. For our running Example 2.2 we can use (2.53) to derive a large-sample approximation of the pointwise variance of the learner g_T(x) = x^⊤β̂_n. In particular, show that for large n

Var g_T(x) ≃ ℓ* x^⊤H_p^{−1}x / n + x^⊤H_p^{−1}M_pH_p^{−1}x / n, n → ∞. (2.55)
Figure 2.18 shows this (large-sample) variance of the learner for different values of the predictor u and model index p. Observe that the variance ultimately increases in p and that it is smaller at u = 1/2 than closer to the endpoints u = 0 or u = 1. Since the bias is also larger near the endpoints, we deduce that the pointwise mean squared error (2.21) is larger near the endpoints of the interval [0, 1] than near its middle. In other words, the error is much smaller in the center of the data cloud than near its periphery.

Figure 2.18: The pointwise variance of the learner for various pairs of p and u.
14. Let h be a convex function and let X be a random variable. Use the subgradient definition of convexity to prove Jensen's inequality:

E h(X) ≥ h(EX). (2.56)
15. Using Jensen's inequality, show that the Kullback–Leibler divergence between probability densities f and g is always positive; that is,

E ln( f(X)/g(X) ) ≥ 0,

where X ∼ f.
16. The purpose of this exercise is to prove the following Vapnik–Chervonenkis bound: for any finite class G (containing only a finite number |G| of possible functions) and a general bounded loss function, l ≤ Loss ≤ u, the expected statistical error is bounded from above according to:

E ℓ(g^G_{T_n}) − ℓ(g^G) ≤ (u − l) √(2 ln(2|G|)) / √n. (2.57)

Note how this bound conveniently does not depend on the distribution of the training set T_n (which is typically unknown), but only on the complexity (i.e., cardinality) of the class G. We can break up the proof of (2.57) into the following four parts:
(a) For a general function class G, training set T, risk function ℓ, and training loss ℓ_T, we have, by definition, ℓ(g^G) ≤ ℓ(g) and ℓ_T(g^G_T) ≤ ℓ_T(g) for all g ∈ G. Show that

ℓ(g^G_T) − ℓ(g^G) ≤ sup_{g∈G} |ℓ_T(g) − ℓ(g)| + ℓ_T(g^G) − ℓ(g^G),

where we used the notation sup (supremum) for the least upper bound. Since E ℓ_T(g) = E ℓ(g), we obtain, after taking expectations on both sides of the inequality above:

E ℓ(g^G_T) − ℓ(g^G) ≤ E sup_{g∈G} |ℓ_T(g) − ℓ(g)|.
(b) If X is a zero-mean random variable taking values in the interval [l, u], then the following Hoeffding's inequality states that the moment generating function satisfies

E e^{tX} ≤ exp( t²(u − l)²/8 ), t ∈ R. (2.58)

Prove this result by using the fact that the line segment joining the points (l, exp(tl)) and (u, exp(tu)) bounds the convex function x ↦ exp(tx) for x ∈ [l, u]; that is,

e^{tx} ≤ e^{tl} (u − x)/(u − l) + e^{tu} (x − l)/(u − l), x ∈ [l, u].
(c) Let Z_1, . . . , Z_n be (possibly dependent and non-identically distributed) zero-mean random variables with moment generating functions that satisfy E exp(tZ_k) ≤ exp(t²η²/2) for all k and some parameter η. Use Jensen's inequality (2.56) to prove that for any