Faster int<->float conversion: set the exponent bits so that the float's ULP is exactly 1, then offset x by a power of 2 so the int gets stuffed into the bottom of the mantissa bits:
- See Sec 3.4 of "Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production"
- Code is basically the following (written here as a CUDA device function, assuming `half` and the bit-reinterpret intrinsics from `cuda_fp16.h`):
```c
#include <stdint.h>
#include <cuda_fp16.h>

__device__ half int_to_f16(int8_t i) {
    uint8_t unsigned_val = (uint8_t)(i + 128);              // bias signed [-128, 127] to unsigned [0, 255]
    uint16_t combined_bits = 0x6400 | unsigned_val;         // 0x6400 is 1024.0 in FP16, where the ULP is exactly 1
    half combined_float = __ushort_as_half(combined_bits);  // bit-for-bit reinterpret (avoids the aliasing-UB pointer cast)
    return __hsub(combined_float, __float2half(1152.0f));   // subtract 1024 + 128 to recover the original value
}
```