Last active
September 4, 2023 13:39
-
-
Save iconara/4969c247e8abb69600cdbe6f4b20f50b to your computer and use it in GitHub Desktop.
Athena metadata file parser
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#!/usr/bin/env ruby | |
# This code parses the .csv.metadata files written by Athena and produces a | |
# structure similar to what you get from the GetQueryResults API call. | |
# | |
# I have reverse engineered the format and I'm not sure about all the details, | |
# but it seems to correspond to the GetQueryResults API call well. Some things, | |
# like nullability, the difference between name and label, and the schema_name | |
# and table_name fields, I haven't been able to figure out because they seem | |
# not to be used, or never takes any other values. | |
# | |
# The format looks something like this: | |
# * A header starting with 0x0a followed by a Pascal string (one byte for the | |
# size followed by the bytes of the string). The string seems to include the | |
# date and time of the execution, and a random string. I'm not sure if this | |
# corresponds to the query execution ID somehow. | |
# * One or more structures corresponding to the columns of the result. The | |
# structure starts with 0x22 followed by the size of the structure as one byte. | |
# Next comes a field ID as one byte, and the field contents. The field contents | |
# are determined by the field ID. The name, label, and type are Pascal strings, | |
# the precision is a varint, the scale, nullability, and case insensitivity | |
# are just single bytes. | |
# * I have not seen the schema name or table name fields in use, and haven't | |
# been able to provoke their use either, I can only guess that those would be | |
# fields 0x12 and 0x1a and have values that are Pascal strings. | |
# * I haven't been able to make Athena produce different values for name and | |
# label, so I can't actually tell which field is which, I have made a guess | |
# that name comes first, but it could be the other way around. | |
# * I haven't been able to make Athena set any other value for nullable than | |
# "UNKNOWN", corresponding to 0x03, so I can't tell what "NO_NULLS" and | |
# "NULLABLE" would correspond to. My guess is 0x01 and 0x02 respectively. | |
class Parser | |
FIELDS = { | |
0x0a => :catalog, | |
0x12 => :schema_name, # guess, never used | |
0x1a => :table_name, # guess, never used | |
0x22 => :name, # guess, never different from label | |
0x2a => :label, # guess, never different from name | |
0x32 => :type, | |
0x38 => :precision, | |
0x40 => :scale, | |
0x48 => :nullable, | |
0x50 => :case_sensitive, | |
} | |
NULLABILITY = { | |
0x01 => :no_nulls, # guess, I have never actually seen this value | |
0x02 => :nullable, # guess, I have never actually seen this value | |
0x03 => :unknown, | |
} | |
def initialize(buffer) | |
@buffer = buffer | |
@offset = 0 | |
end | |
def parse | |
name = read_name | |
columns = [] | |
while @offset < @buffer.bytesize | |
columns << read_column | |
end | |
{:name => name, :columns => columns} | |
end | |
private def expect(n) | |
actual = @buffer[@offset].ord | |
unless actual == n | |
raise format('Expected 0x%02x at position %d, but was 0x%02x', n, @offset, actual) | |
end | |
end | |
private def read_name | |
expect(10) | |
@offset += 1 | |
read_string | |
end | |
private def read_column | |
expect(34) | |
@offset += 1 | |
size = read_byte | |
end_offset = @offset + size | |
fields = [] | |
while @offset < end_offset | |
field = read_field | |
fields << field | |
end | |
fields.to_h | |
end | |
private def read_field | |
id = read_byte | |
field = FIELDS[id] | |
case field | |
when :catalog, :name, :label, :type | |
str = read_string | |
[field, str] | |
when :nullable | |
raw_value = read_byte | |
value = NULLABILITY[raw_value] || raw_value | |
[field, value] | |
when :scale | |
[field, read_byte] | |
when :case_sensitive | |
[field, read_byte == 1] | |
when :precision | |
value = read_varint | |
[field, value] | |
else | |
raise format('Unsupported field id: 0x%x', id) | |
end | |
end | |
private def read_byte | |
value = @buffer[@offset, 1].ord | |
@offset += 1 | |
value | |
end | |
private def read_string | |
size = read_byte | |
str = @buffer[@offset, size] | |
@offset += size | |
str | |
end | |
private def read_varint | |
b = read_byte | |
n = b & 0b01111111 | |
i = 1 | |
until b & 0b10000000 == 0 | |
b = read_byte | |
n |= ((b & 0b01111111) << (i * 7)) | |
i += 1 | |
end | |
n | |
end | |
end | |
if File.identical?(__FILE__, $0) | |
require 'json' | |
puts(JSON.pretty_generate(Parser.new(ARGF.read.force_encoding('ascii-8bit')).parse)) | |
end |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment