Last active
April 10, 2025 22:54
sbchapin/fc82290b798c930eaa41eff87abd1cae
hex function accepting decimal(38,0) in pure Spark SQL - supporting hex conversion for whole numbers between 0 and 99,999,999,999,999,999,999,999,999,999,999,999,999
with
-- Establish some test cases with expected outputs.
-- Note that this is the extent of the testing.
-- Correctness is only guaranteed for the cases below; no claim is made for the full range of numbers.
test_cases as (
  select
    num,
    description,
    expected_output
  from values
    (0, "zero", "0"),
    (1, "one", "1"),
    (16, "2^4", "10"),
    (4294967296, "2^32", "100000000"),
    (4294967296+1, "2^32 + 1", "100000001"),
    (4294967296-1, "2^32 - 1", "FFFFFFFF"),
    (18446744073709551616, "2^64", "10000000000000000"),
    (18446744073709551616+1, "2^64 + 1", "10000000000000001"),
    (18446744073709551616-1, "2^64 - 1", "FFFFFFFFFFFFFFFF"),
    (79228162514264337593543950336, "2^96", "1000000000000000000000000"),
    (79228162514264337593543950336+1, "2^96 + 1", "1000000000000000000000001"),
    (79228162514264337593543950336-1, "2^96 - 1", "FFFFFFFFFFFFFFFFFFFFFFFF"),
    (21267647932558653966460912964485513216, "2^124", "10000000000000000000000000000000"),
    (21267647932558653966460912964485513216+1, "2^124 + 1", "10000000000000000000000000000001"),
    (21267647932558653966460912964485513216-1, "2^124 - 1", "FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF"),
    (99999999999999999999999999999999999999, "biggest possible decimal", "4B3B4CA85A86C47A098A223FFFFFFFFF"),
    (53081669891584275206450103577647629806, "arbitrary case exercising unsignedness", "27EF286294C24971D69112A68A29C5EE")
  AS data(num, description, expected_output)
),
-- The implementation, and the application of that implementation against the test data:
test_results as (
  select
    num,
    description,
    expected_output,
    -- The below is a single expression that converts very large whole numbers in decimal(38,0) form to hexadecimal form.
    -- If you need to support whole numbers between 9,223,372,036,854,775,807 and 99,999,999,999,999,999,999,999,999,999,999,999,999 use this.
    -- You should normally reach for the native hex() function, but hex() unfortunately only supports bigints (up to 9,223,372,036,854,775,807).
    --
    -- Considerations:
    -- - Zero is supported (returning 0)
    -- - Leading zeroes are omitted (as they would be normally using hex())
    -- - Negative numbers are NOT SUPPORTED (but you could support them if you're clever)
    --
    -- Methodology:
    -- - Four separate four-byte (32 bit) "registers" are created by bit-shifting, then concatenated from most to least significant
    -- - `4294967296` is 2^32 - used to "bit shift" 32 places via integer division AND also used to mask out 4 bytes via modulo
    -- - `18446744073709551616` is 2^64 - used to "bit shift" 64 places via integer division
    -- - `79228162514264337593543950336` is 2^96 - used to "bit shift" 96 places via integer division
    -- - `div` is used because it is integer division and does not round the way `/` does (try `SELECT 18446744073709551615 / 18446744073709551616;` to see what I mean)
    -- - `pmod` is used because it always returns a non-negative remainder, so the negative values that `div` shifting can produce are read as if they were unsigned, unlike `%`, which keeps the sign
    -- - A key component is to consider the integer sign - try `SELECT 39614081257132168796771975168 div 4294967296;` which is (2^95 >> 32) to see how `div` produces signed negative ints (but `pmod` extracts them as if they were unsigned)
    case
      when num = 0
        then '0'
      when num < 9223372036854775807
        then hex(num)
      else
        ltrim(
          '0',
          lpad(hex(pmod(num::decimal(38,0) div 79228162514264337593543950336, 4294967296)), 8, '0') || -- register3 (4 bytes) - bits 96 to 127: FFFFFFFF000000000000000000000000
          lpad(hex(pmod(num::decimal(38,0) div 18446744073709551616, 4294967296)), 8, '0') || -- register2 (4 bytes) - bits 64 to 95: 00000000FFFFFFFF0000000000000000
          lpad(hex(pmod(num::decimal(38,0) div 4294967296, 4294967296)), 8, '0') || -- register1 (4 bytes) - bits 32 to 63: 0000000000000000FFFFFFFF00000000
          lpad(hex(pmod(num::decimal(38,0), 4294967296)), 8, '0') -- register0 (4 bytes) - bits 0 to 31: 000000000000000000000000FFFFFFFF
        )
    end as actual_output
  from test_cases
)
-- The results, with a comparison:
select *, actual_output = expected_output
from test_results
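
For anyone wanting to sanity-check expected values outside Spark, here is a rough Python sketch of the same four-register idea. It is not part of the gist: `wrap_int64` and `decimal_to_hex` are hypothetical helper names, and the 64-bit wraparound is an assumption that models what the comments above describe for `div` (a bigint result that can come out negative). Python's `%` with a positive modulus is never negative, so it stands in for `pmod`.

```python
def wrap_int64(value: int) -> int:
    """Reinterpret the low 64 bits of value as a signed 64-bit integer (assumed model of the bigint result of `div`)."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >= (1 << 63) else value


def decimal_to_hex(num: int) -> str:
    """Hex string for a whole number in [0, 10**38 - 1], built from four 32-bit registers."""
    if not 0 <= num < 10**38:
        raise ValueError("expected a whole number between 0 and 10**38 - 1")
    registers = []
    for shift in (96, 64, 32, 0):
        shifted = wrap_int64(num // (1 << shift))  # "bit shift" by integer division, possibly wrapping negative
        registers.append(shifted % (1 << 32))      # non-negative remainder, like pmod: keeps the low 32 bits
    digits = "".join(f"{register:08X}" for register in registers).lstrip("0")
    return digits or "0"  # zero would otherwise trim down to an empty string


# Cross-check against Python's own arbitrary-precision hex formatting.
for n in (0, 16, 2**32 - 1, 2**64, 2**96 + 1, 2**124 - 1, 10**38 - 1):
    assert decimal_to_hex(n) == format(n, "X")
```

The wraparound does no harm because 2^32 divides 2^64: wrapping changes the value by a multiple of 2^64, which leaves the remainder modulo 2^32 untouched.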
This was actually such a life saver wow, thank you
WHY?!?!
Because...
Why not just hex()?