@kif
Created May 31, 2018 13:47
Bitshuffle/LZ4 and precision reduction
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Reducing the precision of floats to improve the compression rate ...\n",
"\n",
"This notebook evaluates the compression rate of the bitshuffle/LZ4 scheme when coupled with precision reduction, i.e. a reduction of the number of mantissa bits of the floating point values.\n",
"\n",
"In this document we will work with IEEE754 floating point values, mainly:\n",
"* Single precision float, stored in 32 bits (4 bytes) and containing 23+1 bits of mantissa for a precision of 1.2e-7\n",
"* Double precision float, stored in 64 bits (8 bytes) and containing 52+1 bits of mantissa for a precision of 2.2e-16\n",
"\n",
"## Rationale:\n",
"* Our detectors often work with ADCs of 12-20 bits of precision.\n",
"* For the sake of convenience, most scientists perform their calculations in double precision floating point, which offers 53 bits of mantissa (stored in 64 bits).\n",
"* This represents a large increase in the size of the data. Moreover, floating point data are hardly compressible.\n",
"\n",
"We focus only on LZ4 compression because, on the one hand, it is already used in the Eiger detector and, on the other hand, it provides high-speed compression and decompression, making it suitable for on-the-fly compression.\n",
"\n",
"This work is an extension of https://arxiv.org/pdf/1503.00638.pdf to floating point data, to characterize different preprocessors.\n",
"\n",
"## Double precision, single precision and 32 bits data in 64 bits containers\n",
"\n",
"The easiest way to evaluate the effect of the mantissa-size on the compression is to compare double precision data and single precision data stored in 64 bits containers.\n"
]
},
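{
"cell_type": "markdown",
"metadata": {},
"source": [
"The precision figures quoted above follow from the mantissa length: with 23 (respectively 52) stored bits, the gap between 1.0 and the next representable value is 2^-23 (respectively 2^-52). The snippet below is a minimal sketch using only the standard library; the helper name `to_float32` is an assumption of this example and not part of the notebook.\n",
"\n",
"```python\n",
"import struct\n",
"import sys\n",
"\n",
"\n",
"def to_float32(x):\n",
"    # Round a Python float (a double) to single precision and back.\n",
"    return struct.unpack('f', struct.pack('f', x))[0]\n",
"\n",
"\n",
"eps32 = 2.0 ** -23  # gap after 1.0 with 23 stored mantissa bits\n",
"eps64 = 2.0 ** -52  # gap after 1.0 with 52 stored mantissa bits\n",
"\n",
"print('float32 epsilon: %.2e' % eps32)\n",
"print('float64 epsilon: %.2e' % eps64)\n",
"\n",
"# 1 + eps32 is representable in single precision, 1 + eps32/4 is rounded back to 1.0\n",
"print(to_float32(1.0 + eps32) > 1.0, to_float32(1.0 + eps32 / 4) == 1.0)\n",
"\n",
"# The interpreter reports the same epsilon and 52+1 mantissa bits for doubles\n",
"print(eps64 == sys.float_info.epsilon, sys.float_info.mant_dig)\n",
"```\n"
]
},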
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy\n",
"import bitshuffle\n",
"import lz4\n",
"from lz4 import block, frame"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"shape = 1024,1024\n",
"random_64 = numpy.random.random(shape)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total size in float64: 8388608\n",
"Total size of buffer: 8388608 100.0\n",
"Total size in LZ4 : 8389139 100.01%\n",
"Total size in bitshuffle/LZ4 : 7287095 86.87%\n"
]
}
],
"source": [
"print(\"Total size in float64:\", random_64.nbytes)\n",
"ref = len(random_64.tobytes())\n",
"print(\"Total size of buffer:\", ref, 100.*ref/ref)\n",
"lz4_64 = len(frame.compress(random_64.tobytes()))\n",
"print(\"Total size in LZ4 : %i %.2f%%\"%(lz4_64, 100.*lz4_64/ref))\n",
"blz4_64 = len(frame.compress(bitshuffle.bitshuffle(random_64).tobytes()))\n",
"print(\"Total size in bitshuffle/LZ4 : %i %.2f%%\"%(blz4_64, 100.*blz4_64/ref))"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Maximum error when considering 32bits floats 2.9802308953996715e-08\n"
]
}
],
"source": [
"# Now work with 32-bit data and store them in 64-bit containers ...\n",
"random_32 = random_64.astype(\"float32\")\n",
"random_64_32 = random_32.astype(\"float64\")\n",
"maximum_error = abs(random_64_32-random_64).max()\n",
"print(\"Maximum error when considering 32bits floats\", maximum_error)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total size in float32: 4194304\n",
"Total size in float64: 8388608\n",
"Total size of LZ4_32: 8388608 100.01%\n",
"Total size in LZ4 64 truncated to 32: 6099766 72.71% 145.43%\n",
"Total size in bitshuffle/LZ4_32 : 3460057 82.49% \n",
"Total size in bitshuffle/LZ4_64_32 : 3482568 41.52% 83.03%\n"
]
}
],
"source": [
"print(\"Total size in float32:\", random_32.nbytes)\n",
"ref32 = len(random_32.tobytes())\n",
"print(\"Total size in float64:\", random_64_32.nbytes)\n",
"lz4_32 = len(frame.compress(random_32.tobytes()))\n",
"print(\"Total size of LZ4_32: %i %.2f%%\"%(lz4_32, 100.*lz4_32/ref32))\n",
"lz4_64_32 = len(frame.compress(random_64_32.tobytes()))\n",
"print(\"Total size in LZ4 64 truncated to 32: %i %.2f%% %.2f%%\"%(lz4_64_32, 100.*lz4_64_32/ref, 100.*lz4_64_32/ref32))\n",
"blz4_32 = len(frame.compress(bitshuffle.bitshuffle(random_32).tobytes()))\n",
"print(\"Total size in bitshuffle/LZ4_32 : %i %.2f%% \"%\n",
" (blz4_32, 100.*blz4_32/ref32))\n",
"blz4_64_32 = len(frame.compress(bitshuffle.bitshuffle(random_64_32).tobytes()))\n",
"print(\"Total size in bitshuffle/LZ4_64_32 : %i %.2f%% %.2f%%\"%\n",
" ( blz4_64_32, 100.*blz4_64_32/ref, 100.*blz4_64_32/ref32))\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"## Analysis\n",
"\n",
"* Floating point data compresses very badly, if at all (here we used white noise, which is the worst case!).\n",
"* The bitshuffle preprocessor helps by rearranging the floating point data so that exponent bits and mantissa bits are compressed together. All data being in the range 0-1, the exponents take only a few distinct values, but the mantissa contains white noise. So the best compression rate one can expect (output size / input size) should be about 81% for double (52/64) and about 72% for single (23/32). There is no compression without preprocessing.\n",
"* Storing limited-precision floats in larger containers compresses to the same size as when working with smaller containers (when using bitshuffle/LZ4).\n",
"\n",
"So it looks interesting to perform all calculations in double precision and to limit the size of the mantissa just prior to saving, as suggested in the publication, even when working with floating point data.\n",
"\n",
"## Reducing the precision of the mantissa\n",
"\n",
"In computers, floating point numbers are represented according to the IEEE754 specification (https://en.wikipedia.org/wiki/IEEE_754), so the mantissa is stored in the least significant bits of the structure, and the leading \"1\" is omitted. The idea of the publication was to round the value so that the n least significant bits are set to zero to improve compression.\n",
"\n",
"We will treat this mantissa as an integer and use the same strategy that Kiyo Masui published in:\n",
"https://gist.github.com/kiyo-masui/b61c7fa4f11fca453bdd"
]
},
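{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the bit layout described above concrete, here is a small sketch that splits a double into its sign, exponent and stored mantissa fields. It relies only on the standard `struct` module; the function name `split_double` is hypothetical and introduced for illustration only.\n",
"\n",
"```python\n",
"import struct\n",
"\n",
"\n",
"def split_double(x):\n",
"    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer\n",
"    bits = struct.unpack('<Q', struct.pack('<d', x))[0]\n",
"    sign = bits >> 63                   # 1 bit\n",
"    exponent = (bits >> 52) & 0x7FF     # 11 bits, biased by 1023\n",
"    mantissa = bits & ((1 << 52) - 1)   # 52 least significant bits, leading 1 not stored\n",
"    return sign, exponent, mantissa\n",
"\n",
"\n",
"for value in (1.0, 1.5, -0.75):\n",
"    sign, exponent, mantissa = split_double(value)\n",
"    print(value, sign, exponent - 1023, format(mantissa, '052b'))\n",
"```\n",
"\n",
"For 1.5 (binary 1.1) only the top mantissa bit is set, which shows that the leading 1 is implicit and that zeroing the low mantissa bits removes the least significant digits first.\n"
]
},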
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0b100000 0b11111\n",
"12345 0b11000000111001\n",
"False 0b110010\n",
"12352 0b11000001000000\n"
]
}
],
"source": [
"# let n be the minimum number of bits set to 0\n",
"n = 5\n",
"gran = (1<<n)\n",
"bitmask = gran - 1\n",
"print(bin(gran), bin(bitmask))\n",
"val = 12345\n",
"print(val, bin(val))\n",
"tie = ((val & bitmask) << 1) == gran\n",
"print(tie, bin(((val & bitmask) << 1)))\n",
"val_t = (val - (gran >> 1)) | bitmask\n",
"val_t += 1\n",
"val_t -= (gran >> 1) == 0\n",
"val_t -= val_t & (tie * gran)\n",
"print(val_t, bin(val_t))"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"def round_int(val, nbits=5):\n",
"    \"Round an integer so that the last nbits are set to zero\"\n",
"    gran = (1<<nbits)\n",
"    bitmask = gran - 1\n",
"    tie = ((val & bitmask) << 1) == gran\n",
"    val_t = (val - (gran >> 1)) | bitmask\n",
"    val_t += 1\n",
"    val_t -= (gran >> 1) == 0\n",
"    val_t -= val_t & (tie * gran)\n",
"    return val_t\n"
]
},
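{
"cell_type": "markdown",
"metadata": {},
"source": [
"The bit manipulation in `round_int` is compact but hard to read. As we understand it, it rounds to the nearest multiple of `2**nbits` and sends exact ties to the even multiple. The sketch below spells this out with `divmod` and compares both versions over a range of values; `round_int_reference` is a name introduced here for illustration, and `round_int` is repeated only to keep the snippet self-contained.\n",
"\n",
"```python\n",
"def round_int(val, nbits=5):\n",
"    # Bit-twiddling version, copied from the notebook for comparison\n",
"    gran = 1 << nbits\n",
"    bitmask = gran - 1\n",
"    tie = ((val & bitmask) << 1) == gran\n",
"    val_t = (val - (gran >> 1)) | bitmask\n",
"    val_t += 1\n",
"    val_t -= (gran >> 1) == 0\n",
"    val_t -= val_t & (tie * gran)\n",
"    return val_t\n",
"\n",
"\n",
"def round_int_reference(val, nbits=5):\n",
"    # Nearest multiple of 2**nbits; exact ties go to the even multiple\n",
"    gran = 1 << nbits\n",
"    quotient, remainder = divmod(val, gran)  # floor division, also for negatives\n",
"    if 2 * remainder > gran or (2 * remainder == gran and quotient & 1):\n",
"        quotient += 1\n",
"    return quotient * gran\n",
"\n",
"\n",
"print(round_int_reference(12345, 5))\n",
"mismatches = [(v, n) for n in range(9) for v in range(-2000, 2000)\n",
"              if round_int(v, n) != round_int_reference(v, n)]\n",
"print(len(mismatches))\n",
"```\n"
]
},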
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"#test round_int\n",
"n = 8\n",
"gran = (1<<n)\n",
"bitmask = gran - 1\n",
"for i in range(-65000, 65000):\n",
"    j = round_int(i, n)\n",
"    delta = abs(j-i)\n",
"    if (delta>gran/2) or (j&bitmask):\n",
"        print(i,j, delta, (j&bitmask) == 0)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"def round_float(val, nbits=5):\n",
"    \"Round the floating point value with nbits remaining in the mantissa\"\n",
"    mlen = None\n",
"    try:\n",
"        if val.dtype==numpy.float32:\n",
"            mlen = 23\n",
"        elif val.dtype == numpy.float64:\n",
"            mlen = 52\n",
"    except AttributeError:\n",
"        # plain Python floats have no dtype\n",
"        pass\n",
"    if mlen is None:\n",
"        mlen = 52\n",
"        val = numpy.float64(val)\n",
"    intval = int(val.view(int))\n",
"    mask = 1<<(mlen)\n",
"    bitmask = mask - 1\n",
"    bigint = intval & bitmask\n",
"\n",
"    bigint |= mask #set the bit at mantissa-length\n",
"    #print(bin(bigint))\n",
"    rndint = round_int(bigint, mlen-nbits)\n",
"    #print(bin(rndint))\n",
"    if rndint & (1<<(mlen+1)):\n",
"        #print(val, bin(bigint), bin(rndint))\n",
"        #Handle the rare case where offset of exponent is needed\n",
"        bitmask = (1<<63) - mask\n",
"        expo = (intval & bitmask)\n",
"        expo += 1<<mlen\n",
"        rndint = rndint>>1 #drop the right-handside bit\n",
"        expo|= intval & (1<<63) #copy the sign which is left-hand bit\n",
"        intval = expo\n",
"        #print(val, bin(intval))\n",
"    rndint &= ~mask #clear the bit at mantissa-length\n",
"    bitmask = (1<<64) - (1<<mlen)\n",
"    rndint |= intval & bitmask\n",
"    return numpy.uint64(rndint).view(\"float64\")\n",
"    #return rndint"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3.141592653589793\n",
"Remaining Mantissa size | Value | Measured error\n",
"52 3.141592653589793 0.0\n",
"51 3.141592653589793 0.0\n",
"50 3.141592653589793 0.0\n",
"49 3.141592653589793 0.0\n",
"48 3.1415926535897967 -3.552713678800501e-15\n",
"47 3.1415926535897967 -3.552713678800501e-15\n",
"46 3.1415926535897825 1.0658141036401503e-14\n",
"45 3.1415926535897825 1.0658141036401503e-14\n",
"44 3.1415926535897825 1.0658141036401503e-14\n",
"43 3.141592653589896 -1.0302869668521453e-13\n",
"42 3.1415926535896688 1.2434497875801753e-13\n",
"41 3.1415926535901235 -3.304023721284466e-13\n",
"40 3.1415926535901235 -3.304023721284466e-13\n",
"39 3.1415926535883045 1.4885870314174099e-12\n",
"38 3.1415926535919425 -2.149391775674303e-12\n",
"37 3.1415926535846666 5.126565838509123e-12\n",
"36 3.1415926535846666 5.126565838509123e-12\n",
"35 3.1415926535846666 5.126565838509123e-12\n",
"34 3.1415926535846666 5.126565838509123e-12\n",
"33 3.141592653701082 -1.1128875598842569e-10\n",
"32 3.1415926534682512 1.2154188766544394e-10\n",
"31 3.1415926534682512 1.2154188766544394e-10\n",
"30 3.1415926534682512 1.2154188766544394e-10\n",
"29 3.1415926553308964 -1.741103261565513e-09\n",
"28 3.141592651605606 1.984187036896401e-09\n",
"27 3.141592651605606 1.984187036896401e-09\n",
"26 3.141592651605606 1.984187036896401e-09\n",
"25 3.1415926814079285 -2.781813535079891e-08\n",
"24 3.1415926218032837 3.1786509424591713e-08\n",
"23 3.1415927410125732 -8.742278012618954e-08\n",
"22 3.141592502593994 1.5099579897537296e-07\n",
"21 3.1415929794311523 -3.2584135922775204e-07\n",
"20 3.141592025756836 6.27832957178498e-07\n",
"19 3.1415939331054688 -1.279515675634002e-06\n",
"18 3.1415939331054688 -1.279515675634002e-06\n",
"17 3.1415863037109375 6.349878855615998e-06\n",
"16 3.1416015625 -8.908910206884002e-06\n",
"15 3.1416015625 -8.908910206884002e-06\n",
"14 3.1416015625 -8.908910206884002e-06\n",
"13 3.1416015625 -8.908910206884002e-06\n",
"12 3.1416015625 -8.908910206884002e-06\n",
"11 3.1416015625 -8.908910206884002e-06\n",
"10 3.140625 0.000967653589793116\n",
"9 3.140625 0.000967653589793116\n",
"8 3.140625 0.000967653589793116\n",
"7 3.140625 0.000967653589793116\n",
"6 3.15625 -0.014657346410206884\n",
"5 3.125 0.016592653589793116\n",
"4 3.125 0.016592653589793116\n",
"3 3.25 -0.10840734641020688\n",
"2 3.0 0.14159265358979312\n",
"1 3.0 0.14159265358979312\n"
]
}
],
"source": [
"pi = numpy.pi\n",
"print(pi)\n",
"print(\"Remaining Mantissa size | Value | Measured error\")\n",
"for i in range(52, 0, -1):\n",
"    print( i, round_float(pi, i),pi - round_float(pi, i))"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3.75 µs ± 90.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n"
]
}
],
"source": [
"%timeit round_float(pi, 20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Analysis:\n",
"We are now able to round floating point values to a given number of bits in the mantissa. Let's see how it behaves when compressing the data."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"def round_array(ary, precision):\n",
"    res = numpy.empty_like(ary)\n",
"    flat_ary = ary.ravel()\n",
"    flat_res = res.ravel()\n",
"    for i in range(ary.size):\n",
"        flat_res[i] = round_float(flat_ary[i], precision)\n",
"    return res"
]
},
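{
"cell_type": "markdown",
"metadata": {},
"source": [
"The rare exponent adjustment handled explicitly in `round_float` can also be obtained for free: if half of the dropped range is added to the whole 64-bit pattern, a mantissa overflow carries into the exponent field. The sketch below illustrates this for a single finite double with the standard `struct` module. It is not the notebook's implementation: the name `round_double` is hypothetical, and exact ties round away from zero here instead of to even. The same two integer operations could presumably be applied to a `uint64` view of a whole array.\n",
"\n",
"```python\n",
"import math\n",
"import struct\n",
"\n",
"\n",
"def round_double(x, nbits):\n",
"    # Keep nbits stored mantissa bits of a finite double (assumption: no NaN/inf input)\n",
"    drop = 52 - nbits\n",
"    if drop <= 0:\n",
"        return x\n",
"    bits = struct.unpack('<Q', struct.pack('<d', x))[0]\n",
"    bits += 1 << (drop - 1)      # add half of the dropped range; a carry moves into the exponent\n",
"    bits &= ~((1 << drop) - 1)   # zero the dropped bits\n",
"    return struct.unpack('<d', struct.pack('<Q', bits))[0]\n",
"\n",
"\n",
"for nbits in (10, 6, 3):\n",
"    print(nbits, round_double(math.pi, nbits))\n",
"\n",
"# Mantissa overflow: 1.9999 rounds up to the next power of two\n",
"print(round_double(1.9999, 4), round_double(-1.9999, 4))\n",
"```\n"
]
},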
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 3.94 s, sys: 0 ns, total: 3.94 s\n",
"Wall time: 3.94 s\n"
]
},
{
"data": {
"text/plain": [
"7.957555701794154e-08"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%time abs(round_array(random_64, 20)- random_64).mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Make it usable\n",
"\n",
"Let's port this to Cython to make it a bit faster."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"%load_ext Cython"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"%%cython\n",
"\n",
"import numpy\n",
"cdef union float_and_int:\n",
"    long long int integer\n",
"    double floating\n",
"\n",
"cdef inline _round_int(val, nbits=5):\n",
"    \"Round an integer so that the last nbits are set to zero\"\n",
"#    cdef: \n",
"#        working_t gran, bitmask, tie, val_t\n",
"    gran = (1<<nbits)\n",
"    bitmask = gran - 1\n",
"    tie = ((val & bitmask) << 1) == gran\n",
"    val_t = (val - (gran >> 1)) | bitmask\n",
"    val_t += 1\n",
"    val_t -= (gran >> 1) == 0\n",
"    val_t -= val_t & (tie * gran)\n",
"    return val_t\n",
"\n",
"def cround_int(val, nbits=5):\n",
"    \"Round an integer so that the last nbits are set to zero\"\n",
"    return _round_int(val, nbits)\n",
"\n",
"cdef inline double _round_float(double val, int nbits=5):\n",
"    cdef:\n",
"        float_and_int inp, out\n",
"        int mlen = 52\n",
"    inp.floating = val\n",
"    intval = int(inp.integer)\n",
"    mask = 1<<(mlen)\n",
"    bitmask = mask - 1\n",
"    bigint = intval & bitmask\n",
"\n",
"    bigint |= mask #set the bit at mantissa-length\n",
"    rndint = _round_int(bigint, mlen-nbits)\n",
"    if rndint & (1<<(mlen+1)):\n",
"        #Handle the rare case where offset of exponent is needed\n",
"        bitmask = (1<<63) - mask\n",
"        expo = (intval & bitmask)\n",
"        expo += 1<<mlen\n",
"        rndint = rndint>>1 #drop the right-handside bit\n",
"        expo|= intval & (1<<63) #copy the sign which is left-hand bit\n",
"        intval = expo\n",
"    rndint &= ~mask #clear the bit at mantissa-length\n",
"    bitmask = (1<<64) - (1<<mlen)\n",
"    rndint |= intval & bitmask\n",
"    out.integer = rndint\n",
"    return out.floating\n",
"\n",
"\n",
"def cround_float(val, nbits=5):\n",
"    \"Round the floating point value with nbits remaining in the mantissa\"\n",
"    return _round_float(val, nbits)\n",
"\n",
"\n",
"def cround_array(ary, precision):\n",
"    cdef:\n",
"        int i\n",
"        double[:] flat_ary, flat_res\n",
"    dary = numpy.ascontiguousarray(ary, dtype=numpy.float64)\n",
"    res = numpy.zeros(ary.shape, dtype=numpy.float64)\n",
"    flat_ary = dary.ravel()\n",
"    flat_res = res.ravel()\n",
"    for i in range(ary.size):\n",
"        flat_res[i] = _round_float(flat_ary[i], precision)\n",
"    return res"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"265 ns ± 8.39 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)\n"
]
}
],
"source": [
"#test cround_int\n",
"n = 8\n",
"gran = (1<<n)\n",
"bitmask = gran - 1\n",
"for i in range(0, 65000):\n",
"    j = cround_int(i, n)\n",
"    delta = abs(j-i)\n",
"    if (delta>gran/2) or (j&bitmask):\n",
"        print(i,j, delta, (j&bitmask) == 0)\n",
"\n",
"%timeit cround_int(i, n)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3.141592653589793\n",
"Remaining Mantissa size | Value | Measured error\n",
"52 3.141592653589793 0.0\n",
"51 3.141592653589793 0.0\n",
"50 3.141592653589793 0.0\n",
"49 3.141592653589793 0.0\n",
"48 3.1415926535897967 -3.552713678800501e-15\n",
"47 3.1415926535897967 -3.552713678800501e-15\n",
"46 3.1415926535897825 1.0658141036401503e-14\n",
"45 3.1415926535897825 1.0658141036401503e-14\n",
"44 3.1415926535897825 1.0658141036401503e-14\n",
"43 3.141592653589896 -1.0302869668521453e-13\n",
"42 3.1415926535896688 1.2434497875801753e-13\n",
"41 3.1415926535901235 -3.304023721284466e-13\n",
"40 3.1415926535901235 -3.304023721284466e-13\n",
"39 3.1415926535883045 1.4885870314174099e-12\n",
"38 3.1415926535919425 -2.149391775674303e-12\n",
"37 3.1415926535846666 5.126565838509123e-12\n",
"36 3.1415926535846666 5.126565838509123e-12\n",
"35 3.1415926535846666 5.126565838509123e-12\n",
"34 3.1415926535846666 5.126565838509123e-12\n",
"33 3.141592653701082 -1.1128875598842569e-10\n",
"32 3.1415926534682512 1.2154188766544394e-10\n",
"31 3.1415926534682512 1.2154188766544394e-10\n",
"30 3.1415926534682512 1.2154188766544394e-10\n",
"29 3.1415926553308964 -1.741103261565513e-09\n",
"28 3.141592651605606 1.984187036896401e-09\n",
"27 3.141592651605606 1.984187036896401e-09\n",
"26 3.141592651605606 1.984187036896401e-09\n",
"25 3.1415926814079285 -2.781813535079891e-08\n",
"24 3.1415926218032837 3.1786509424591713e-08\n",
"23 3.1415927410125732 -8.742278012618954e-08\n",
"22 3.141592502593994 1.5099579897537296e-07\n",
"21 3.1415929794311523 -3.2584135922775204e-07\n",
"20 3.141592025756836 6.27832957178498e-07\n",
"19 3.1415939331054688 -1.279515675634002e-06\n",
"18 3.1415939331054688 -1.279515675634002e-06\n",
"17 3.1415863037109375 6.349878855615998e-06\n",
"16 3.1416015625 -8.908910206884002e-06\n",
"15 3.1416015625 -8.908910206884002e-06\n",
"14 3.1416015625 -8.908910206884002e-06\n",
"13 3.1416015625 -8.908910206884002e-06\n",
"12 3.1416015625 -8.908910206884002e-06\n",
"11 3.1416015625 -8.908910206884002e-06\n",
"10 3.140625 0.000967653589793116\n",
"9 3.140625 0.000967653589793116\n",
"8 3.140625 0.000967653589793116\n",
"7 3.140625 0.000967653589793116\n",
"6 3.15625 -0.014657346410206884\n",
"5 3.125 0.016592653589793116\n",
"4 3.125 0.016592653589793116\n",
"3 3.25 -0.10840734641020688\n",
"2 3.0 0.14159265358979312\n",
"1 3.0 0.14159265358979312\n",
"661 ns ± 20.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)\n"
]
}
],
"source": [
"pi = numpy.pi\n",
"print(pi)\n",
"print(\"Remaining Mantissa size | Value | Measured error\")\n",
"for i in range(52, 0, -1):\n",
"    print( i, cround_float(pi, i),pi - cround_float(pi, i))\n",
"%timeit cround_float(pi, 20)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 576 ms, sys: 12 ms, total: 588 ms\n",
"Wall time: 582 ms\n"
]
},
{
"data": {
"text/plain": [
"2.9802308953996715e-08"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%time abs(cround_array(random_64, 23)- random_64).max()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Compressed size as a function of the mantissa length\n",
"\n",
"We now have the tools to benchmark the compressed size as a function of the mantissa size of the float64."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"sizes = [len(frame.compress(bitshuffle.bitshuffle(cround_array(random_64, i)).tobytes())) for i in range(53)]"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[5.3472161293029785, 7.01749324798584, 8.63410234451294, 10.21111011505127, 11.792588233947754, 13.352382183074951, 14.932501316070557, 16.48489236831665, 18.059325218200684, 19.612765312194824, 21.192705631256104, 22.74419069290161, 24.32166337966919, 25.873327255249023, 27.442800998687744, 29.002833366394043, 30.56471347808838, 32.130515575408936, 33.69598388671875, 35.25642156600952, 36.82820796966553, 38.3844256401062, 39.947545528411865, 41.515445709228516, 43.084120750427246, 44.64414119720459, 46.21344804763794, 47.77116775512695, 49.33271408081055, 50.906574726104736, 52.45786905288696, 54.038870334625244, 55.59808015823364, 57.15765953063965, 58.72153043746948, 60.29253005981445, 61.835408210754395, 63.41524124145508, 64.97175693511963, 66.55751466751099, 68.1148886680603, 69.67103481292725, 71.21807336807251, 72.80173301696777, 74.35799837112427, 75.9243369102478, 77.46745347976685, 79.04058694839478, 80.62218427658081, 82.20208883285522, 83.7560772895813, 85.3075385093689, 86.86894178390503]\n"
]
}
],
"source": [
"compression_rate = [100*i/ref for i in sizes]\n",
"print(compression_rate)\n",
"theoretical = [(64-(52-i))*100/64 for i in range(53)]"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"from matplotlib.pyplot import subplots"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/mntdirect/_scisoft/users/jupyter/jupy35/lib/python3.5/site-packages/matplotlib/figure.py:459: UserWarning: matplotlib is currently using a non-GUI backend, so cannot show the figure\n",
" \"matplotlib is currently using a non-GUI backend, \"\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"fig, ax = subplots()\n",
"ax.plot(compression_rate, label=\"measured\")\n",
"ax.plot(theoretical, label=\"worst case\")\n",
"ax.set_xlabel(\"Number of bits in the mantissa\")\n",
"ax.set_ylabel(\"Compression ratio\")\n",
"ax.legend()\n",
"fig.show()"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/mntdirect/_scisoft/users/jupyter/jupy35/lib/python3.5/site-packages/matplotlib/figure.py:459: UserWarning: matplotlib is currently using a non-GUI backend, so cannot show the figure\n",
" \"matplotlib is currently using a non-GUI backend, \"\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"error = [abs(cround_array(random_64, i) - random_64 ).max() for i in range(53)]\n",
"\n",
"fig, ax = subplots()\n",
"ax.plot(error, label=\"error\")\n",
"ax.set_xlabel(\"Number of bits in the mantissa\")\n",
"ax.set_ylabel(\"max error\")\n",
"ax.legend()\n",
"fig.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusions:\n",
"\n",
"Bitshuffle is needed to make lossless algorithms like LZ4 efficient on floating point data.\n",
"Zeroing out the last bits of the mantissa helps the bitshuffle/LZ4 algorithm to compress the data better. \n",
"A demonstrator is proposed and works on float64. \n",
"The compression ratio varies linearly with the size of the mantissa used. The precision of the data drops accordingly."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}