While I was working at SpaceX on the developer tools team, there was a command line tool everyone used that was taking longer and longer to start over time.
There was a sort of shrug, "it's just Python being slow" attitude.
Another engineer discovered it was actually Python doing a lot of one thing: reading YAML files using the default loader.
Noticing there was a "CSafeLoader", he switched to that, and the tool went from taking 90-150s just reading the config files to 30-40s.
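For reference, the change is essentially a one-line Loader swap, assuming PyYAML was built with the libyaml bindings (the file name here is just illustrative):

import yaml

with open("config.yaml") as f:  # hypothetical config file, just for illustration
    # pure-Python loader:
    # data = yaml.load(f, Loader=yaml.SafeLoader)
    # libyaml-backed C loader - same semantics, much faster:
    data = yaml.load(f, Loader=yaml.CSafeLoader)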
But over the next year, as the configs grew and got more complex, we were back up to 3 minutes for it to start.
This wasn't on Windows; it was on Linux, on workstation-spec machines, but on Python 3.5 or 3.6, I think.
The fact is that when you call a function in Python, it's just one op-code as far as the Python model is concerned. The C code behind that op-code is anything but trivial or modern-CPU friendly.
https://github.com/python/cpython/blob/3a77980002845c22e5b294ca47a12d62bf5baf53/Python/specialize.c
The old _Py_Call method has gone away, and they're writing specializations for different method types now, but even these optimized flavors still do a lot of work.
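On 3.11+ you can actually watch the specializing interpreter at work on your own code with the built-in dis module. A rough sketch (the specialized instruction names differ between versions):

import dis

def greet(name):
    return name.upper()

# warm the function up so the adaptive interpreter specializes its bytecode
for _ in range(1000):
    greet("world")

# adaptive=True (Python 3.11+) shows the specialized instructions in use,
# e.g. CALL_* / LOAD_ATTR_* variants instead of the generic op-codes
dis.dis(greet, adaptive=True)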
Long ago, the C code had to create all the stack-frame objects we see in exceptions before it could enter the method
(* too lazy to verify that they've actually deferred that until an exception is raised)
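You can get a rough feel for the raw per-call overhead with the stdlib timeit module. This is just a sketch, and I'm not claiming any particular numbers; compare the two results on your own machine:

import timeit

# an empty statement vs. calling a function that does nothing:
# the difference is roughly the cost of Python's call machinery
baseline = timeit.timeit("pass", number=10_000_000)
call = timeit.timeit("f()", setup="def f(): pass", number=10_000_000)
print(f"pass: {baseline:.3f}s, f(): {call:.3f}s for 10M iterations")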
Ultimately, Python was intended as a scripting language: a wrapper that lets you use high-level idioms to wrangle more performant code written in compiled languages, or do mundane tasks that need to be easy to put together quickly with good accuracy and reliability, without the complications of a low-level language, at least not visible to the script's author.
Those complications don't go away, and because it's a language they have to be handled in a generalized way, so at the very simplest level, overhead that is only theoretical for a C programmer is encoded into most operations in Python.
For example, "LOAD_FAST" is, in theory, just
PyObject* value = GETLOCALS(i);
if (value == NULL) {
do_backtrace_stuff();
goto error;
}
Py_INCREF(value); // bump the ref count, which is going to take 8-30ns, equivalent to a couple hundred cpu cycles.
PUSH(value); // onto python's own heap-based stack
DISPATCH(); // branch to the top of the virtual machine loop.
The thing is, that NULL-pointer check is there for correctness, but you pay for it every time you do something like:
i = 1
and unfortunately conditional branches are the nemesis of modern CPU performance:
https://www.youtube.com/watch?v=wGSSUSeaLgA
So in a program as large as CPython, with all the code that ends up loaded dynamically at runtime and the way it spreads out in memory, there's actually a good chance that you're going to be eating expensive branch misses from that check, too.
At a nanosecond scale, the difference is huge. From a practical, experiential perspective, it's not massively more than if you were doing something similar yourself, in C.
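To see what even "i = 1" compiles down to, the built-in dis module will show you. A trivial sketch (the exact op-codes vary by version):

import dis

# "i = 1" is only a couple of op-codes (LOAD_CONST and STORE_NAME here),
# but each one runs C code of the kind sketched above: NULL checks,
# refcount bumps, a push onto the value stack, and a dispatch.
dis.dis(compile("i = 1", "<example>", "exec"))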
Where Python loses on basic instructions like this, it gains elsewhere in more sophisticated operations: list/set/dict comprehensions, the pervasive use of lazy evaluation (iterators/generators), and so on.
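For example, a list comprehension avoids the per-iteration lookup and call of list.append that an explicit loop pays for, and a generator only ever computes the items you actually consume. A small illustrative sketch:

# explicit loop: a name lookup and method call of squares.append per iteration
squares = []
for n in range(10_000):
    squares.append(n * n)

# list comprehension: same result, tighter bytecode, no .append lookup/call
squares = [n * n for n in range(10_000)]

# generator expression: lazy - nothing is computed until you iterate,
# and here we only ever pay for the five items we actually take
lazy_squares = (n * n for n in range(10_000))
first_five = [next(lazy_squares) for _ in range(5)]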
And then sometimes it just plain dunks itself.
Try this in Jupyter/IPython:
%%timeit x, y = "hello", "world"
y = x + ", " + y
or assign to x.
Now, instead, assign to "z".
A simple assignment shouldn't take 10us when the expression being assigned only takes 50ns to evaluate.
y = f"{x}, {y}"
takes 11us (11,000ns) in Python 3.10.6, 12.1us in Python 3.12.3, and 12.2us in Python 3.13.0 or 12.4us using python-jit and setting PYTHON_JIT=1...
It seems to take ~184ns in the 3.13-without-gil experiment. Or ~75x faster.
I get this is a dumb case, but it tracks with how some os.path operations are so insanely slow (y = prefix + "/" + y; boosh, 10us).
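If you want to check that kind of thing on your own machine, the stdlib timeit module makes it easy. The prefix and file name below are made up, and I'm not claiming any particular numbers:

import timeit

setup = 'import os; prefix, name = "/tmp/some/dir", "file.txt"'

for label, stmt in [
    ("concat      ", 'p = prefix + "/" + name'),
    ("f-string    ", 'p = f"{prefix}/{name}"'),
    ("os.path.join", 'p = os.path.join(prefix, name)'),
]:
    t = timeit.timeit(stmt, setup=setup, number=1_000_000)
    print(f"{label}: {t:.3f}s per 1M iterations")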
Also, people don't use Python's built-in "dis" module nearly often enough to educate themselves on what Python actually does. Take this loop:
for i in range(1000):
    for j in range(1000):
        filename = os.path.abspath(f"./file{i}.{j}.txt")
The first line has to look up 'range' in locals, then globals, then builtins to start the loop.
But the second line is executed 1000 times, so that's 1000x hashing the word 'range' and doing a hash-table lookup in locals, then globals, before checking builtins.
Meanwhile, in the final line, we have to do a million locals-then-globals hashes and lookups of "os", then a million lookups of "path" in the "os" module's dictionary, and then a million lookups of "abspath" in the "path" module's dictionary.
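You can see those lookups spelled out by disassembling a small function that does the same thing. The exact op-code names vary by version, but expect a load of "os" followed by attribute loads of "path" and "abspath" on every single call:

import dis
import os

def build_name(i, j):
    return os.path.abspath(f"./file{i}.{j}.txt")

# expect something like LOAD_GLOBAL (os), LOAD_ATTR (path),
# LOAD_ATTR (abspath) before the call - repeated on every invocation
dis.dis(build_name)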
In [14]: %%timeit
...: for i in range(100_000):
...: for j in range(100):
...: x = os.path.abspath
...:
270 ms ± 10.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I'm not even calling the function there. What if we hoist the names out of the loops?
In [15]: %%timeit
...: r, ap = range, os.path.abspath
...: for i in r(100_000):
...: for j in r(100):
...: x = ap
...:
108 ms ± 724 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
That's a lot of milliseconds for such a trivial loop.
For contrast, I made a worst-case C++ bench (x is volatile, which prevents the compiler from optimizing it out).
#include <algorithm>
#include <array>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <numeric>

static constexpr uint64_t NumRuns = 5000;
static constexpr uint64_t NumIters = 10000000;

int main() {
    namespace time = std::chrono;
    std::array<uint64_t, NumRuns> timings {};
    const auto benchStart = time::high_resolution_clock::now();
    volatile int x = 0;
    for (auto& val : timings) {
        const auto start = time::high_resolution_clock::now();
        for (int i = 0; i < NumIters; i++) {
            x = 1;
        }
        const auto end = time::high_resolution_clock::now();
        val = time::duration_cast<time::nanoseconds>(end - start).count();
        if (x == 0) break;
    }
    const auto benchEnd = time::high_resolution_clock::now();
    const auto benchTime = time::duration_cast<time::milliseconds>(benchEnd - benchStart).count();

    std::sort(timings.begin(), timings.end());
    auto begin = timings.begin() + 3, end = timings.end() - 3;  // skip the best/worst 3 results
    auto sum = std::accumulate(begin, end, 0ULL);
    auto avg = (sum / std::distance(begin, end));

    std::cout << "avg for " << NumIters << " loop: " << avg / 1000 << "us over " << std::distance(begin, end) << " iterations.\n";
    std::cout << "avg assignment: " << avg / NumIters << "ns.\n";
    std::cout << "total loop of " << NumRuns << "x" << NumIters << ": " << benchTime << "ms.\n";
}
Repeating the entire test takes 11s, and it's 50x faster than the Python variant (2216us = 2.2ms, vs Python's 108ms for each 10 million assignments).
oliver@osmith-pc:/mnt/c/Users/oliver.smith/src/bench$ g++ -O3 -Os -march=native -mtune=native -o test test.cpp
oliver@osmith-pc:/mnt/c/Users/oliver.smith/src/bench$ ./test
avg for 10000000 loop: 2224us over 4994 iterations.
avg assignment: 0ns.
total loop of 5000x10000000: 11141ms.
(2224us over 4994 iterations, meaning it timed the big loop 5000 times, rejected the best/worst 6 cases, then averaged out their accumulated time.)
This is absolutely not an apples-to-apples comparison - Python is doing a lot under the hood that makes testing a single-assignment loop moronic, and I'm forcing an artificial ruleset on C++ that prevents the CPU from doing some optimizations.
But it is a fair comparison for underscoring that Python has overheads, and if you set about writing significant programs in it, and then worry about millisecond, microsecond or nanosecond performance, you're barking up the wrong tree.