Little-Endian Snafu
03 Sep 2022 CommentsYou learned about this in college classes. You thought working SWE jobs in 2022 you’d never have to deal with this. But it comes back to trick you at your worst.
I was tripped by endian-ness when implementing inverted list listno-offset
/LO
as a fixed-width byte string key in RocksDB.
// When offsets list id + offset are encoded in an uint64
// we call this LO = list-offset
inline uint64_t lo_build(uint64_t list_id, uint64_t offset) {
return list_id << 32 | offset;
}
inline uint64_t lo_listno(uint64_t lo) {
return lo >> 32;
}
inline uint64_t lo_offset(uint64_t lo) {
return lo & 0xffffffff;
}
The original implementation used rather smart bitwise shifting and masking to accommodate two uint32
integers in uint64
allocations into one uint64
integer. It worked out beautifully on its own. I proceeded to set up my custom comparator in RocksDB reinterpreting the uint64
integer LO
as a byte string, and that worked out great. Then I discovered prefix seek and slapped on a NewFixedPrefixTransform(4)
(4 bytes for the list_id
/listno
). This was when the iterator seeks broke down. One should expect all elements prefixed by the same listno
to be returned with the fixed prefix, but the results suggested otherwise. It somehow seemed to be treating the offset
as the prefix.
I spent hours isolating the problem with trial and error. Then at dinner it dawned on me… Little Endian1 2!
When memory bytes are printed sequentially from left to right (e.g. in a hex dump), little-endian representation of integers has the significance increasing from left to right. In other words, it appears backwards when visualized, which can be counter-intuitive.
A (most?) modern x86_64
or aarch64
machine adopts a little-endian system. Here, a naive slice of the first 4 bytes in that memory blob would give you the offset
, not the listno
!
Realizing that, the fix was to switch the list_id
and offset
in lo_build
, or, build LO
with string concatenation or memcpy
in the expected order. And the problem was solved.