Since a NN with no activation function is just a stack of matrix multiplications, its outputs are linear combinations of the inputs. Adding something like ReLU introduces a non-linearity, but the result is still piecewise linear: it can never increase the polynomial degree of its argument.
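To make the first point concrete, here is a minimal sketch (NumPy, purely for illustration) showing that stacking linear layers without an activation collapses into a single linear map:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation: y = W2 @ (W1 @ x)
W1 = rng.standard_normal((8, 4))
W2 = rng.standard_normal((3, 8))
x = rng.standard_normal(4)

two_layer = W2 @ (W1 @ x)
collapsed = (W2 @ W1) @ x   # one equivalent linear layer

print(np.allclose(two_layer, collapsed))  # True: the composition is still linear
```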
For example, below is the output of a network trained to predict f(x) = x^2 on the interval [-1, 1]. The test data lie in [-2, 2]: as expected, the network fits the interval it has seen almost perfectly, but fails to generalise outside it.
The same holds for many architectures, including CNNs, RNNs, and others. Is this a significant limitation in practice? What kind of dataset would demonstrate this limitation (not a toy one like the example I used below)?
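For concreteness, a minimal sketch along the lines of the toy experiment above (a small ReLU MLP in PyTorch; the exact architecture and hyperparameters are illustrative and don't matter much):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small ReLU MLP trained to fit f(x) = x^2 on [-1, 1]
model = nn.Sequential(
    nn.Linear(1, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x_train = torch.linspace(-1, 1, 256).unsqueeze(1)
y_train = x_train ** 2

for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(x_train), y_train)
    loss.backward()
    opt.step()

# Evaluate on the wider interval [-2, 2]: inside [-1, 1] the fit is good,
# outside it the piecewise-linear extrapolation diverges from x^2
x_test = torch.linspace(-2, 2, 9).unsqueeze(1)
with torch.no_grad():
    preds = model(x_test)

for xi, yi in zip(x_test.squeeze().tolist(), preds.squeeze().tolist()):
    print(f"x = {xi:+.2f}   predicted {yi:+.3f}   true {xi**2:+.3f}")
```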