@sbchapin
Last active April 10, 2025 22:54
hex function accepting decimal(38,0) in pure Spark SQL - supporting hex conversion for whole numbers between 0 and 99,999,999,999,999,999,999,999,999,999,999,999,999
with
-- Establish some test cases with expected outputs.
-- Note that this is the extent of the testing.
-- No claims are made about correctness across the full range of numbers; only the following tests are guaranteed to pass.
test_cases as (
select
num,
description,
expected_output
from values
(0, "zero", "0"),
(1, "one", "1"),
(16, "2^4", "10"),
(4294967296, "2^32", "100000000"),
(4294967296+1, "2^32 + 1", "100000001"),
(4294967296-1, "2^32 - 1", "FFFFFFFF"),
(18446744073709551616, "2^64", "10000000000000000"),
(18446744073709551616+1, "2^64 + 1", "10000000000000001"),
(18446744073709551616-1, "2^64 - 1", "FFFFFFFFFFFFFFFF"),
(79228162514264337593543950336, "2^96", "1000000000000000000000000"),
(79228162514264337593543950336+1, "2^96 + 1", "1000000000000000000000001"),
(79228162514264337593543950336-1, "2^96 - 1", "FFFFFFFFFFFFFFFFFFFFFFFF"),
(21267647932558653966460912964485513216, "2^124", "10000000000000000000000000000000"),
(21267647932558653966460912964485513216+1, "2^124 + 1", "10000000000000000000000000000001"),
(21267647932558653966460912964485513216-1, "2^124 - 1", "FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF"),
(99999999999999999999999999999999999999, "biggest possible decimal", "4B3B4CA85A86C47A098A223FFFFFFFFF"),
(53081669891584275206450103577647629806, "arbitrary case exercising unsignedness", "27EF286294C24971D69112A68A29C5EE")
AS data(num, description, expected_output)
),
-- The implementation, and the application of that implementation against the test data:
test_results as (
select
num,
description,
expected_output,
-- The below is a single-line function that allows the conversion of very large whole numbers in decimal(38,0) form to hexadecimal form.
-- If you need to support whole numbers between 9,223,372,036,854,775,807 and 99,999,999,999,999,999,999,999,999,999,999,999,999 use this.
-- You should normally reach for the native hex() function, but hex() unfortunately only supports bigints (up to 9,223,372,036,854,775,807).
--
-- Considerations:
-- - Zero is supported (returning 0)
-- - Leading zeroes are omitted (as they would be normally using hex())
-- - Negative numbers are NOT SUPPORTED (but you could if you're clever)
--
-- Methodology:
-- - Four separate four-byte (32 bit) "registers" are created by bit-shifting
-- - `4294967296` is 2^32 - used to "bit shift" 32 places via integer division AND also used to mask out 4 bytes via modulo
-- - `18446744073709551616` is 2^64 - used to "bit shift" 64 places via integer division
-- - `79228162514264337593543950336` is 2^96 - used to "bit shift" 96 places via integer division
-- - `div` is used because it does not suffer from floating point rounding error like `/` does (try `SELECT 18446744073709551615 / 18446744073709551616;` to see what I mean)
-- - `pmod` is used because it always returns a non-negative remainder, so the low 32 bits come out as if the `div` result were unsigned even when it is a negative bigint; `%` would return a negative remainder instead
-- - A key component is the integer sign: try `SELECT 39614081257132168796771975168 div 4294967296;` (2^95 shifted right by 32 bits, i.e. 2^63) to see how `div` produces signed negative ints, which `pmod` then extracts as if they were unsigned
case
when num = 0
then '0'
when num < 9223372036854775807
then hex(num)
else
ltrim(
'0',
lpad(hex(pmod(num::decimal(38,0) div 79228162514264337593543950336, 4294967296)), 8, '0') || -- register3 (4 bytes) - tracking bits 96 to 127: FFFFFFFF000000000000000000000000
lpad(hex(pmod(num::decimal(38,0) div 18446744073709551616, 4294967296)), 8, '0') || -- register2 (4 bytes) - tracking bits 64 to 95: 00000000FFFFFFFF0000000000000000
lpad(hex(pmod(num::decimal(38,0) div 4294967296, 4294967296)), 8, '0') || -- register1 (4 bytes) - tracking bits 32 to 63: 0000000000000000FFFFFFFF00000000
lpad(hex(pmod(num::decimal(38,0), 4294967296)), 8, '0') -- register0 (4 bytes) - tracking bits 0 to 31: 000000000000000000000000FFFFFFFF
)
end as actual_output
from test_cases
)
-- The results, with a comparison:
select *, actual_output = expected_output
from test_results
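
The test cases above are hand-picked, so extra rows can help if you want more coverage. Below is a minimal sketch in Python (not part of the gist) that formats additional rows for the `test_cases` VALUES list, using Python's arbitrary-precision integers as the oracle for the expected hex string. The names `values_row` and `random_rows` are hypothetical helpers made up for this illustration.

```python
import random

MAX_DECIMAL_38 = 10**38 - 1  # largest whole number a decimal(38,0) can hold


def values_row(num, description):
    """Format one row for the test_cases VALUES list.

    The expected hex string comes from Python's own integer formatting,
    which is exact for arbitrarily large integers.
    """
    if not 0 <= num <= MAX_DECIMAL_38:
        raise ValueError("num must be a non-negative value that fits decimal(38,0)")
    return f'({num}, "{description}", "{num:X}")'


def random_rows(count, seed=0):
    """Return `count` rows with pseudo-random values across the decimal(38,0) range."""
    rng = random.Random(seed)  # seeded so the generated rows are reproducible
    return [
        values_row(rng.randint(0, MAX_DECIMAL_38), f"random #{i}")
        for i in range(count)
    ]


if __name__ == "__main__":
    # Paste the printed rows into the VALUES list of test_cases.
    print(",\n".join(random_rows(5)))
```

The output uses double-quoted strings to match the existing rows; keep the last pasted row free of a trailing comma.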
@sbchapin

sbchapin commented Mar 5, 2025

WHY?!?!

Because...

  1. I unfortunately need to hex very large whole numbers, and...
  2. Python UDFs can be pretty dang slow if you have a LOT of numbers, and...
  3. bit shifting is notoriously fast

Why not just hex()?

-- successful:
select hex(9223372036854775807); -- max bigint
-- casting failure!
select hex(9223372036854775808); -- one more than max bigint, making this number a decimal
-- successful:
select ltrim('0',
    lpad(hex(pmod(9223372036854775808::decimal(38,0) div 79228162514264337593543950336, 4294967296)), 8, '0') ||
    lpad(hex(pmod(9223372036854775808::decimal(38,0) div 18446744073709551616, 4294967296)), 8, '0') ||      
    lpad(hex(pmod(9223372036854775808::decimal(38,0) div 4294967296, 4294967296)), 8, '0')  ||
    lpad(hex(pmod(9223372036854775808::decimal(38,0), 4294967296)), 8, '0')
); -- the crazy function from the above gist
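
If the arithmetic in that expression is hard to follow, here is a rough model of it in plain Python (outside Spark). It is only a sketch: `wrap_to_bigint` ASSUMES that `div` keeps the low 64 bits of the quotient as a signed bigint, which is how the gist's comments describe the behaviour, and the function names are made up for this illustration.

```python
MASK32 = 4294967296  # 2^32, the same constant the SQL uses to shift and to mask


def wrap_to_bigint(value):
    """Reinterpret the low 64 bits of value as a signed 64-bit integer.

    ASSUMPTION: this models `div` overflowing into the negative bigint range.
    """
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >= (1 << 63) else value


def register(num, shift_bits):
    """Return one 32-bit "register" of num as 8 zero-padded hex digits."""
    quotient = wrap_to_bigint(num // (1 << shift_bits))  # models `div` by 2^shift_bits
    low32 = quotient % MASK32  # Python's % is non-negative for a positive divisor, like pmod
    return f"{low32:08X}"


def hex_decimal38(num):
    """Model of the gist's expression for non-negative whole numbers below 10^38."""
    if num == 0:
        return "0"
    joined = "".join(register(num, bits) for bits in (96, 64, 32, 0))
    return joined.lstrip("0")  # same job as ltrim('0', ...)


if __name__ == "__main__":
    print(wrap_to_bigint(2**95 // 2**32))      # negative: 2^63 does not fit a signed bigint
    print(hex_decimal38(9223372036854775808))  # one more than max bigint
```

The wrap-around is harmless here because 2^64 is a multiple of 2^32, so discarding the high bits never changes the low 32 bits that each register keeps.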

@laurenkwng

This was actually such a life saver wow, thank you
