While I was working at SpaceX on the developer tools team, there was a command line tool everyone used that was taking longer and longer to start over time.
There was a sort of shrug, "it's just Python being slow" attitude.
Another engineer discovered it was actually Python doing a lot of one thing: reading YAML files using the default loader.
Noticing there was a "CSafeLoader", he switched to that, and the tool went from taking 90-150s just reading the config files to 30-40s.
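For reference, the change is essentially a one-line Loader swap, assuming PyYAML was built with the libyaml bindings (the file name here is just illustrative):

import yaml

with open("config.yaml") as f:  # hypothetical config file, just for illustration
    # pure-Python loader:
    # data = yaml.load(f, Loader=yaml.SafeLoader)
    # libyaml-backed C loader - same semantics, much faster:
    data = yaml.load(f, Loader=yaml.CSafeLoader)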
But over the next year, as the configs grew and got more complex, we were back up to 3 minutes for it to start.
This wasn't on Windows; it was on Linux, on workstation-spec machines, but on Python 3.5 or 3.6, I think.
The fact is that when you call a function in Python, it's just one op-code as far as the Python model is concerned. The C code behind that op-code is anything but trivial or modern-CPU friendly.
https://github.com/python/cpython/blob/3a77980002845c22e5b294ca47a12d62bf5baf53/Python/specialize.c
The old _Py_Call method has gone away, and they're writing specializations for different method types now, but even these optimized flavors still do a lot of work.
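On 3.11+ you can actually watch the specializing interpreter at work on your own code with the built-in dis module. A rough sketch (the specialized instruction names differ between versions):

import dis

def greet(name):
    return name.upper()

# warm the function up so the adaptive interpreter specializes its bytecode
for _ in range(1000):
    greet("world")

# adaptive=True (Python 3.11+) shows the specialized instructions in use,
# e.g. CALL_* / LOAD_ATTR_* variants instead of the generic op-codes
dis.dis(greet, adaptive=True)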
Long ago, the C code had to create all the stack-frame objects we see in exceptions before it could enter the method
(* too lazy to verify that they've actually deferred that until an exception is raised)
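You can get a rough feel for the raw per-call overhead with the stdlib timeit module. This is just a sketch, and I'm not claiming any particular numbers; compare the two results on your own machine:

import timeit

# an empty statement vs. calling a function that does nothing:
# the difference is roughly the cost of Python's call machinery
baseline = timeit.timeit("pass", number=10_000_000)
call = timeit.timeit("f()", setup="def f(): pass", number=10_000_000)
print(f"pass: {baseline:.3f}s, f(): {call:.3f}s for 10M iterations")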
Ultimately, Python was intended as a scripting language: a wrapper that lets you use high-level idioms to wrangle more performant code written in compiled languages, or do mundane tasks that need to be easy to put together quickly with good accuracy and reliability, without the complications of a low-level language, at least not visible to the script's author.
Those complications don't go away, and because it's a language they have to be handled in a generalized way, so at the very simplest level, overhead that is only theoretical for a C programmer is encoded into most operations in Python.
For example, "LOAD_FAST" is, in theory, just
PyObject* value = GETLOCALS(i);
if (value == NULL) {
do_backtrace_stuff();
goto error;
}
Py_INCREF(value); // bump the ref count, which is going to take 8-30ns, equivalent to a couple hundred cpu cycles.
PUSH(value); // onto python's own heap-based stack
DISPATCH(); // branch to the top of the virtual machine loop.
The thing is, that NULL-pointer check is there for correctness, but you pay for it every time you do something like:
i = 1
and unfortunately conditional branches are the nemesis of modern CPU performance:
https://www.youtube.com/watch?v=wGSSUSeaLgA
So in a program as large as CPython, with all the code that ends up loaded dynamically at runtime and the way it spreads out in memory, there's actually a good chance that you're going to be eating expensive branch misses from that check, too.
At a nanosecond scale, the difference is huge. From a practical, experiential perspective, it's not massively more than if you were doing something similar yourself, in C.
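To see what even "i = 1" compiles down to, the built-in dis module will show you. A trivial sketch (the exact op-codes vary by version):

import dis

# "i = 1" is only a couple of op-codes (LOAD_CONST and STORE_NAME here),
# but each one runs C code of the kind sketched above: NULL checks,
# refcount bumps, a push onto the value stack, and a dispatch.
dis.dis(compile("i = 1", "<example>", "exec"))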
Where Python loses on basic instructions like this, it gains elsewhere in more sophisticated operations: list/set/dict comprehensions, the pervasive use of lazy evaluation (iterators/generators), and so on.
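For example, a list comprehension avoids the per-iteration lookup and call of list.append that an explicit loop pays for, and a generator only ever computes the items you actually consume. A small illustrative sketch:

# explicit loop: a name lookup and method call of squares.append per iteration
squares = []
for n in range(10_000):
    squares.append(n * n)

# list comprehension: same result, tighter bytecode, no .append lookup/call
squares = [n * n for n in range(10_000)]

# generator expression: lazy - nothing is computed until you iterate,
# and here we only ever pay for the five items we actually take
lazy_squares = (n * n for n in range(10_000))
first_five = [next(lazy_squares) for _ in range(5)]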
And then sometimes it just plain dunks itself.
Try this in Jupyter/IPython:
%%timeit x, y = "hello", "world"
y = x + ", " + y
or assign to x.
Now, instead, assign to "z".
A simple assignment shouldn't take 10us when the expression being assigned only takes 50ns to evaluate.
y = f"{x}, {y}"
takes 11us (11,000ns) in Python 3.10.6, 12.1us in Python 3.12.3, and 12.2us in Python 3.13.0 or 12.4us using python-jit and setting PYTHON_JIT=1...
It seems to take ~184ns in the 3.13-without-gil experiment. Or ~75x faster.
I get this is a dumb case, but it tracks with how some os.path operations are so insanely slow (y = prefix + "/" + y; boosh, 10us).
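If you want to check that kind of thing on your own machine, the stdlib timeit module makes it easy. The prefix and file name below are made up, and I'm not claiming any particular numbers:

import timeit

setup = 'import os; prefix, name = "/tmp/some/dir", "file.txt"'

for label, stmt in [
    ("concat      ", 'p = prefix + "/" + name'),
    ("f-string    ", 'p = f"{prefix}/{name}"'),
    ("os.path.join", 'p = os.path.join(prefix, name)'),
]:
    t = timeit.timeit(stmt, setup=setup, number=1_000_000)
    print(f"{label}: {t:.3f}s per 1M iterations")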
Also, people don't use Python's built-in "dis" module nearly often enough to educate themselves on what Python actually does. Take this loop:
for i in range(1000):
    for j in range(1000):
        filename = os.path.abspath(f"./file{i}.{j}.txt")
The first line has to look up 'range' in locals, then globals, then builtins to start the loop.
But the second line is executed 1000 times, so that's 1000x hashing the word 'range' and doing a hash-table lookup in locals, then globals, before checking builtins.
Meanwhile, in the final line, we have to do a million locals-then-globals hashes and lookups of "os", then a million lookups of "path" in the "os" module's dictionary, and then a million lookups of "abspath" in the "path" module's dictionary.
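You can see those lookups spelled out by disassembling a small function that does the same thing. The exact op-code names vary by version, but expect a load of "os" followed by attribute loads of "path" and "abspath" on every single call:

import dis
import os

def build_name(i, j):
    return os.path.abspath(f"./file{i}.{j}.txt")

# expect something like LOAD_GLOBAL (os), LOAD_ATTR (path),
# LOAD_ATTR (abspath) before the call - repeated on every invocation
dis.dis(build_name)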
In [14]: %%timeit
...: for i in range(100_000):
...: for j in range(100):
...: x = os.path.abspath
...:
270 ms ± 10.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I'm not even calling the function there. What if we hoist the names out of the loops?
In [15]: %%timeit
...: r, ap = range, os.path.abspath
...: for i in r(100_000):
...: for j in r(100):
...: x = ap
...:
108 ms ± 724 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
That's a lot of milliseconds for such a trivial loop.
For contrast, I made a worst-case C++ bench (x is volatile, which prevents the compiler from optimizing it out).
#include <algorithm>
#include <array>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <numeric>

static constexpr uint64_t NumRuns = 5000;
static constexpr uint64_t NumIters = 10000000;

int main() {
    namespace time = std::chrono;
    std::array<uint64_t, NumRuns> timings {};
    const auto benchStart = time::high_resolution_clock::now();
    volatile int x = 0;
    for (auto& val : timings) {
        const auto start = time::high_resolution_clock::now();
        for (int i = 0; i < NumIters; i++) {
            x = 1;
        }
        const auto end = time::high_resolution_clock::now();
        val = time::duration_cast<time::nanoseconds>(end - start).count();
        if (x == 0) break;
    }
    const auto benchEnd = time::high_resolution_clock::now();
    const auto benchTime = time::duration_cast<time::milliseconds>(benchEnd - benchStart).count();

    std::sort(timings.begin(), timings.end());
    auto begin = timings.begin() + 3, end = timings.end() - 3;  // skip the best/worst 3 results
    auto sum = std::accumulate(begin, end, 0ULL);
    auto avg = (sum / std::distance(begin, end));

    std::cout << "avg for " << NumIters << " loop: " << avg / 1000 << "us over " << std::distance(begin, end) << " iterations.\n";
    std::cout << "avg assignment: " << avg / NumIters << "ns.\n";
    std::cout << "total loop of " << NumRuns << "x" << NumIters << ": " << benchTime << "ms.\n";
}
Repeating the entire test takes 11s, and it's 50x faster than the Python variant (2216us = 2.2ms, vs Python's 108ms for each 10 million assignments).
oliver@osmith-pc:/mnt/c/Users/oliver.smith/src/bench$ g++ -O3 -Os -march=native -mtune=native -o test test.cpp
oliver@osmith-pc:/mnt/c/Users/oliver.smith/src/bench$ ./test
avg for 10000000 loop: 2224us over 4994 iterations.
avg assignment: 0ns.
total loop of 5000x10000000: 11141ms.
(2224us over 4994 iterations, meaning it timed the big loop 5000 times, rejected the best/worst 6 cases, then averaged out their accumulated time.)
This is absolutely not an apples-to-apples comparison - Python is doing a lot under the hood that makes testing a single-assignment loop moronic, and I'm forcing an artificial ruleset on C++ that prevents the CPU from doing some optimizations.
But it is a fair comparison for underscoring that Python has overheads, and if you set about writing significant programs in it, and then worry about millisecond, microsecond or nanosecond performance, you're barking up the wrong tree.