How Data Types Look in a Hex Editor

Brain Squirrels

2017-05-04

When you have an unknown file format open in a hex editor, you don’t actually have a guide for what data type you are looking at. It’s up to you to guess whether something represents an integer, string, float, etc. Here’s a quick guide to what to look for and which pitfalls to avoid.

Endianness

Endianness refers to the order of bytes in a number. Big endian is when the most significant byte is on the left and the least significant on the right. This is what you intuitively expect. The hex number 0x12345678 will show up in your hex editor as 12 34 56 78.

Little endian puts the least significant byte first, and the most significant last. So 0x12345678 will show up as 78 56 34 12. Notice that the digits within a byte didn’t reverse order, but the pairs of digits (bytes) did.
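To see both byte orders side by side, here is a small sketch using Python's struct module. The helper name hex_bytes is my own, made up for this illustration, and just prints bytes the way a hex editor would.

```python
import struct

def hex_bytes(data):
    """Format bytes the way a hex editor shows them (hypothetical helper)."""
    return " ".join(f"{b:02x}" for b in data)

value = 0x12345678

big = struct.pack(">I", value)     # ">" = big endian, "I" = unsigned 32-bit
little = struct.pack("<I", value)  # "<" = little endian

print(hex_bytes(big))     # 12 34 56 78
print(hex_bytes(little))  # 78 56 34 12

# Reading little endian bytes with the wrong byte order gives a different number.
wrong = struct.unpack(">I", little)[0]
print(hex(wrong))  # 0x78563412
```

If a value you decode looks absurdly large or random, trying the other byte order is a cheap first check.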

Little endian is what is used by the x86 architecture. I use little endian in my examples below, because that’s what you most commonly see in files for PCs (but not always!). You’re most likely to find big endian in networking or a small handful of embedded architectures.

Data Types

Depending on your programming background, this will be mostly review:

int32 - a 32 bit integer (4 bytes). This is the most common way of storing a whole number. Values range from −2,147,483,648 to 2,147,483,647 signed. If it is unsigned, meaning it doesn’t support negative numbers, then it can go from 0 to 4,294,967,295. This is the type you get when you type “int” in most implementations of C++. Also known as a “double word” or “DWORD”.

int16 - 16 bit (2 byte) int. Goes from -32,768 to 32,767 (signed) or 0 to 65,535 (unsigned). This is much less common than an int32. It is known as a “word” or a “short.”

int8 - A one byte value going either from -128 to 127 (signed) or 0 to 255 (unsigned). Rarely used.
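The same four bytes can be read as any of the integer types above, which is exactly why you have to guess. This sketch assumes little endian and uses an example byte sequence I picked for illustration.

```python
import struct

# Four bytes as they might appear in a hex editor (made-up example data).
raw = bytes.fromhex("ff ff ff 7f")

as_int32 = struct.unpack("<i", raw)[0]   # one signed 32-bit integer
as_uint32 = struct.unpack("<I", raw)[0]  # one unsigned 32-bit integer
as_int16s = struct.unpack("<2h", raw)    # two signed 16-bit integers
as_uint16s = struct.unpack("<2H", raw)   # two unsigned 16-bit integers
as_int8s = struct.unpack("<4b", raw)     # four signed bytes
as_uint8s = struct.unpack("<4B", raw)    # four unsigned bytes

print(as_int32)    # 2147483647
print(as_uint32)   # 2147483647
print(as_int16s)   # (-1, 32767)
print(as_uint16s)  # (65535, 32767)
print(as_int8s)    # (-1, -1, -1, 127)
print(as_uint8s)   # (255, 255, 255, 127)
```

Signed and unsigned only disagree when the top bit of the value is set, so a run of ff bytes is a hint that you may be looking at a small negative number.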

char - One byte (same memory footprint as an int8), but representing an ASCII character.

w_char - A “wide” (multibyte) character, written wchar_t in C and C++. On Windows it is typically 16 bits.

string - A series of characters. It is usually null-terminated, meaning there is a zero at the end of the string to signal the end was reached. If for some reason it isn’t null-terminated, the length has to be known some other way, possibly by storing an integer length somewhere else.

wide string - A string of w_chars. If you look at a wide string in your hex editor with the right pane set to ASCII, the string “hello, world” will look like “h.e.l.l.o.,. .w.o.r.l.d.” This is because ASCII characters keep the same values, but each one is now followed by a zero byte as well. Also, the null terminator will be two 0 bytes instead of one.
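Here is a rough sketch of how the two kinds of string differ on disk. I'm assuming the wide string is UTF-16 little endian, as is typical on Windows, and ascii_pane is a made-up helper that mimics the right pane of a hex editor.

```python
text = "hello, world"

narrow = text.encode("ascii") + b"\x00"        # one byte per character, 1-byte terminator
wide = text.encode("utf-16-le") + b"\x00\x00"  # two bytes per character, 2-byte terminator

def ascii_pane(data):
    """Printable ASCII as-is, everything else as '.' (hypothetical helper)."""
    return "".join(chr(b) if 32 <= b < 127 else "." for b in data)

print(ascii_pane(narrow))  # hello, world.
print(ascii_pane(wide))    # h.e.l.l.o.,. .w.o.r.l.d...

# Reading them back: stop at the first null terminator.
narrow_text = narrow.split(b"\x00", 1)[0].decode("ascii")
wide_text = wide.decode("utf-16-le").split("\x00", 1)[0]
print(narrow_text == wide_text == text)  # True
```

Note that the wide string is decoded before looking for the terminator. Searching the raw bytes for two zeros can match across a character boundary.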

float - Floating point numbers are 4 bytes long, represent real numbers, and convert to decimal according to a funky formula involving exponents. Floats are extremely precise around zero (you can closely represent .00000001), and they can also handle massive numbers like 100,000,000,000,000,000,000. The trick is that the spacing from one number to the next gets bigger as the numbers themselves get bigger. There is no such thing as an unsigned float.

double - An 8 byte float. It is more precise and can handle even larger values. Most files don’t need the precision of doubles, so these are rare in files.
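It helps to know what a few common floats look like as bytes, and to see the growing spacing for yourself. This is a sketch assuming the usual IEEE 754 layout and little endian byte order. The helpers hex_bytes, to_float32, and next_float32 are names I made up for the example.

```python
import struct

def hex_bytes(data):
    """Format bytes the way a hex editor shows them (hypothetical helper)."""
    return " ".join(f"{b:02x}" for b in data)

def to_float32(x):
    """Round a Python float (a double) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def next_float32(x):
    """Next 32-bit float above a positive x, found by bumping its bit pattern by one."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits + 1))[0]

print(hex_bytes(struct.pack("<f", 1.0)))   # 00 00 80 3f
print(hex_bytes(struct.pack("<f", -1.0)))  # 00 00 80 bf
print(hex_bytes(struct.pack("<d", 1.0)))   # 00 00 00 00 00 00 f0 3f

# Spacing between neighbouring floats grows with the magnitude of the number.
gap_near_one = next_float32(1.0) - 1.0
gap_near_1e20 = next_float32(1e20) - to_float32(1e20)
print(gap_near_one == 2.0 ** -23)  # True
print(gap_near_1e20 == 2.0 ** 43)  # True
```

The gap near 1e20 is 2 to the 43rd power, which is in the trillions, so a float that large can't tell nearby whole numbers apart. In a hex editor, little endian floats in everyday ranges tend to end in a byte like 3f, 40, 41, bf, or c0, which is a handy tell.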

Timestamps - date/time values can be stored in a number of formats. I leave you to figure these out on your own. Google “UNIX timestamps” and “Windows timestamps” to see the most common ways of storing time in binary.
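As a starting point, here is a sketch that decodes the two formats I'd try first. It assumes a UNIX timestamp stored as an unsigned 32-bit count of seconds since 1970-01-01 UTC, and a Windows FILETIME stored as a 64-bit count of 100-nanosecond ticks since 1601-01-01 UTC, both little endian. Real files may use other widths or epochs, and the function names are my own.

```python
import struct
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)

def unix_to_datetime(raw4):
    """Decode 4 little endian bytes as seconds since 1970-01-01 UTC."""
    seconds = struct.unpack("<I", raw4)[0]
    return UNIX_EPOCH + timedelta(seconds=seconds)

def filetime_to_datetime(raw8):
    """Decode 8 little endian bytes as 100 ns ticks since 1601-01-01 UTC."""
    ticks = struct.unpack("<Q", raw8)[0]
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)

# The same moment in both formats (example values chosen for illustration).
unix_raw = bytes.fromhex("00 6f 0a 59")
filetime_raw = struct.pack("<Q", 131383296000000000)

print(unix_to_datetime(unix_raw))          # 2017-05-04 00:00:00+00:00
print(filetime_to_datetime(filetime_raw))  # 2017-05-04 00:00:00+00:00
```

If a decoded date lands somewhere plausible, such as around when the file was created, your guess is probably right. A date in 1970 or thousands of years away usually means it isn't a timestamp, or not that kind.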

Hashes

The last thing to cover is hashes. A hash is a way of taking a bunch of data and reducing it to a single number. You lose a lot of information this way, but you gain the ability to make quick comparisons.

The most common use of hashes in games is for quick string comparisons. Say you have a list of game objects that all have names, and you want to find an object by its name. String comparisons take many CPU cycles, so looking for a string in this list would be very slow. If all your strings were turned into 32-bit numbers by a hashing algorithm, however, then you could search using quick integer comparisons instead.
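To make this concrete, here is a sketch using 32-bit FNV-1a, a simple and widely used string hash. I picked it only as an illustration. Any given game may use a completely different algorithm, and the object names below are invented.

```python
import struct

def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR in each byte, then multiply by the FNV prime."""
    h = 0x811C9DC5  # FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF  # FNV prime, truncated to 32 bits
    return h

# Hypothetical object names, hashed once up front.
names = ["player", "enemy_01", "door_main"]
by_hash = {fnv1a_32(name.encode("ascii")): name for name in names}

# A lookup now compares integers instead of strings.
wanted = fnv1a_32(b"door_main")
print(by_hash[wanted])  # door_main

# What a stored hash looks like in a hex editor: four unremarkable bytes.
print(struct.pack("<I", fnv1a_32(b"a")).hex())  # 2c290ce4
```

This is why hashed names are hard to spot in a file: they look like random int32 values, and the original string is usually nowhere to be found.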

The other use is for checksumming save games. A checksum is the result of hashing a file. If you know what the contents of a file hashed to before, you can hash it again to make sure it hasn’t been corrupted or tampered with. Games may store checksums of save game files in the saves themselves, so that players can’t cheat the save games without hacking the checksum system.
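Here is a sketch of how such a check might work, using CRC32 from Python's zlib module. The save layout is entirely made up for this example: the payload followed by its CRC32 as a little endian uint32. Real games choose their own algorithm and placement.

```python
import struct
import zlib

def seal(payload: bytes) -> bytes:
    """Append the payload's CRC32 as 4 little endian bytes (hypothetical layout)."""
    return payload + struct.pack("<I", zlib.crc32(payload))

def is_intact(save: bytes) -> bool:
    """Recompute the CRC32 of the payload and compare it to the stored one."""
    payload = save[:-4]
    stored = struct.unpack("<I", save[-4:])[0]
    return zlib.crc32(payload) == stored

save = seal(b"gold=100;level=3")
print(is_intact(save))  # True

# Editing the payload without updating the checksum gets caught.
tampered = save.replace(b"gold=100", b"gold=999")
print(is_intact(tampered))  # False

# Recomputing the checksum after the edit makes the save valid again.
print(is_intact(seal(tampered[:-4])))  # True
```

When editing a save breaks it for no obvious reason, look for four bytes that change every time the game writes the file. That is often the checksum.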

Conclusion

Reverse engineering is about trial and error; you guess what bytes represent and see if that makes sense. Understanding what data types will look like in a hex editor is critical for solving the puzzle. The list above should be a decent foundation. Now let’s go find these in the wild.