Clownacy
[Another cross-post from the blog. I figured that it is relevant here, being about creating a compressor for a compression format used by various Mega Drive games.]
Long ago, I made a compressor for Kosinski, which is a compression format seemingly created by Sega and used in various Mega Drive games, including Sonic the Hedgehog. There are two other compression formats used by those games as well: Enigma and Nemesis. Having recently become unhappy with the portability and code quality of Flamewing's mdcomp compression library, I decided that I would make my own Nemesis compressor. If only I knew what I was getting myself into...
I began by reading the trusty Sega Retro page on the format, which provided a detailed specification of the format and an overview of the algorithms used to encode data in it. With that information alone, I was able to produce a working decompressor.
It was interesting to see just how simple Nemesis was: it is a nybble-based run-length encoding scheme with each run placed into a kind of dictionary that substitutes it with a variable-length code. The most common runs get the shortest codes, and the rarest runs get the longest codes. In effect, the data has two layers of compression applied to it. The process of substituting data for variable-length codes is known as 'Huffman encoding', though this is a misnomer, as I will explain later.
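To illustrate the idea (this is not the real Nemesis bitstream layout, and the dictionary entries and bits below are invented for the example), decoding such a scheme boils down to matching prefix-free codes against a table of nybble-runs:

```c
#include <stdio.h>

/* One dictionary entry: a prefix-free code of 'code_length' bits stands
   for 'run_length' copies of 'nybble'. These entries are made up and do
   not come from any real Nemesis file. */
typedef struct
{
    unsigned int code, code_length;
    unsigned int nybble, run_length;
} DictionaryEntry;

static const DictionaryEntry dictionary[] =
{
    {0x0 /* 0   */, 1, 0x3, 4},
    {0x2 /* 10  */, 2, 0x7, 1},
    {0x6 /* 110 */, 3, 0xF, 2}
};

int main(void)
{
    /* An example bit stream: 10 0 110 -> one 7, four 3s, two Fs. */
    static const int bits[] = {1, 0, 0, 1, 1, 0};
    const size_t total_bits = sizeof(bits) / sizeof(bits[0]);

    unsigned int code = 0, code_length = 0;
    size_t i, entry;

    for (i = 0; i < total_bits; ++i)
    {
        /* Accumulate bits until they match a code in the dictionary. */
        code = code << 1 | bits[i];
        ++code_length;

        for (entry = 0; entry < sizeof(dictionary) / sizeof(dictionary[0]); ++entry)
        {
            if (dictionary[entry].code_length == code_length && dictionary[entry].code == code)
            {
                unsigned int run;

                for (run = 0; run < dictionary[entry].run_length; ++run)
                    printf("%X", dictionary[entry].nybble);

                code = 0;
                code_length = 0;
                break;
            }
        }
    }

    putchar('\n');

    return 0;
}
```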
With the decompressor done, it was time to move onto making the compressor. I had heard from Vladikcomper that the Nemesis data in the Sonic games was encoded with "Shannon coding", so I figured that I would begin with making a compressor which used that.
Upon looking into it, I learned that Shannon coding was the first of its kind, dating all the way back to the 1940s. It derives the variable-length codes from the frequency of each symbol (nybble-run) within the data. It was pretty easy to implement, but it soon became obvious that this was not the algorithm that Sega used in their Nemesis compressor: the lengths of the variable-length codes differed unpredictably, and, rather than being directly derived from the symbol's frequency, Sega's codes were clearly generated in a structured pattern. Undeterred, I continued working on my compressor using Shannon coding. In the end, I had it working and successfully producing valid Nemesis-encoded data. Unfortunately, the use of Shannon coding proved to be especially unwise, as the data produced by my compressor was considerably larger than the data produced by Sega's compressor.
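As a rough illustration of that derivation (a minimal sketch with made-up frequencies, not the clownnemesis code), the core of Shannon coding is simply that a symbol of probability p receives a code of ceil(-log2(p)) bits:

```c
#include <math.h>
#include <stdio.h>

#define TOTAL_SYMBOLS 4

int main(void)
{
    /* Hypothetical frequencies of four nybble-runs. */
    static const unsigned int frequency[TOTAL_SYMBOLS] = {40, 30, 20, 10};
    unsigned int total = 0;
    size_t i;

    for (i = 0; i < TOTAL_SYMBOLS; ++i)
        total += frequency[i];

    /* Shannon coding: a symbol of probability p gets ceil(-log2(p)) bits. */
    for (i = 0; i < TOTAL_SYMBOLS; ++i)
    {
        const double probability = (double)frequency[i] / total;
        const unsigned int code_length = (unsigned int)ceil(log(1.0 / probability) / log(2.0));

        printf("Symbol %u: %u bits\n", (unsigned int)i, code_length);
    }

    return 0;
}
```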
Whilst reading about Shannon coding, I learned of a similar algorithm known as Fano coding, which is frequently confused with Shannon coding. This would explain why Vladikcomper claimed that Shannon coding was used by Sega's Nemesis compressor. So, I began modifying my compressor to use Fano coding instead.
Fano coding works completely differently to Shannon coding, producing its variable-length codes by repeatedly performing a binary split on a list of the symbols sorted by their frequency within the data. A single recursive function was all that was needed to implement this coding scheme. This matched the output of Sega's compressor almost exactly (unfortunately, some sorting quirks keep it from being binary-exact), meaning that my compressor was able to match its compression ratio, being neither more nor less efficient!
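For reference, a minimal sketch of textbook Fano coding looks something like the following. The frequencies are made up, the split heuristic is the textbook closest-to-equal-halves one rather than Sega's exact rules, and the Nemesis reserved code is not handled here:

```c
#include <stdio.h>

#define TOTAL_SYMBOLS 6

/* Made-up symbol frequencies, sorted from most to least common. */
static const unsigned int frequency[TOTAL_SYMBOLS] = {35, 25, 15, 12, 8, 5};
static unsigned int code[TOTAL_SYMBOLS];
static unsigned int code_length[TOTAL_SYMBOLS];

static void Fano(const size_t first, const size_t last)
{
    unsigned int total, left, best_difference;
    size_t i, split;

    if (first >= last)
        return;

    total = 0;
    for (i = first; i <= last; ++i)
        total += frequency[i];

    /* Find the split which divides the range into two halves whose total
       frequencies are as close to equal as possible. */
    left = 0;
    best_difference = total;
    split = first;
    for (i = first; i < last; ++i)
    {
        unsigned int difference;

        left += frequency[i];
        difference = left > total - left ? left - (total - left) : (total - left) - left;

        if (difference < best_difference)
        {
            best_difference = difference;
            split = i;
        }
    }

    /* Append a 0 bit to the codes of the first half and a 1 bit to the
       codes of the second half, then recurse into each half. */
    for (i = first; i <= last; ++i)
    {
        code[i] = code[i] << 1 | (i > split ? 1 : 0);
        ++code_length[i];
    }

    Fano(first, split);
    Fano(split + 1, last);
}

int main(void)
{
    size_t i;

    Fano(0, TOTAL_SYMBOLS - 1);

    for (i = 0; i < TOTAL_SYMBOLS; ++i)
        printf("Symbol %u: %u bits, code %X\n", (unsigned int)i, code_length[i], code[i]);

    return 0;
}
```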
While it was nice to have a Nemesis compressor with a decent compression ratio, I wanted it to be able to compress data better than Sega's compressor. After all, that is what I did with my Kosinski compressor all those years ago. For this, I knew that there was only one solution: Huffman coding.
Huffman coding is the perfection of what Shannon coding pioneered: while Shannon and Fano generate codes that are 'good', they are not perfect; in contrast, Huffman always produces codes that are optimal. The trade-off is that Huffman is a lot more complex to implement, requiring the construction and parsing of an entire binary tree using priority queues. While this would be trivial with C++'s standard library, my compressor is written in ANSI C (C89), so I had to get very creative in order to implement it as simply, but also as efficiently, as possible. In the end, I was able to implement it using a single statically-allocated array. However, something was wrong: the generated data was larger than the output of Fano coding. How could Huffman, an optimal code generator, be less efficient than Fano?
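As an illustration of the array-based idea (a generic sketch, not the actual clownnemesis implementation), Huffman code lengths can be computed with nothing more than statically-sized node arrays and a linear scan for the two cheapest live nodes on each merge:

```c
#include <stdio.h>

#define TOTAL_SYMBOLS 6
#define TOTAL_NODES (TOTAL_SYMBOLS * 2 - 1)

int main(void)
{
    /* Made-up symbol frequencies. */
    static const unsigned int frequency[TOTAL_SYMBOLS] = {35, 25, 15, 12, 8, 5};

    static unsigned int weight[TOTAL_NODES];
    static int parent[TOTAL_NODES];
    static int alive[TOTAL_NODES];
    size_t total_nodes, i;

    /* The first TOTAL_SYMBOLS nodes are the leaves; the rest are created
       by merging. */
    for (i = 0; i < TOTAL_SYMBOLS; ++i)
    {
        weight[i] = frequency[i];
        parent[i] = -1;
        alive[i] = 1;
    }

    total_nodes = TOTAL_SYMBOLS;

    /* Repeatedly merge the two cheapest live nodes into a new internal
       node until only the root remains. */
    while (total_nodes < TOTAL_NODES)
    {
        size_t cheapest = (size_t)-1, second_cheapest = (size_t)-1;

        for (i = 0; i < total_nodes; ++i)
        {
            if (!alive[i])
                continue;

            if (cheapest == (size_t)-1 || weight[i] < weight[cheapest])
            {
                second_cheapest = cheapest;
                cheapest = i;
            }
            else if (second_cheapest == (size_t)-1 || weight[i] < weight[second_cheapest])
            {
                second_cheapest = i;
            }
        }

        weight[total_nodes] = weight[cheapest] + weight[second_cheapest];
        parent[total_nodes] = -1;
        alive[total_nodes] = 1;
        parent[cheapest] = parent[second_cheapest] = (int)total_nodes;
        alive[cheapest] = alive[second_cheapest] = 0;
        ++total_nodes;
    }

    /* A leaf's code length is simply its depth in the tree, found by
       following the parent links. */
    for (i = 0; i < TOTAL_SYMBOLS; ++i)
    {
        unsigned int length = 0;
        int node = (int)i;

        while (parent[node] != -1)
        {
            node = parent[node];
            ++length;
        }

        printf("Symbol %u: %u bits\n", (unsigned int)i, length);
    }

    return 0;
}
```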
The first shortcoming that I noticed was a quirk of the Nemesis format: there is a specific code that is reserved, 111111. Variable-length codes are not allowed to use this code nor a prefix of it (1, 11, 111, etc.). With Fano coding, codes that conflicted with the reserved code were rare, whereas with Shannon and Huffman they were common. Conflicting codes had to either be rejected (causing all instances of the symbol they represent to instead be inlined into the bit stream) or moved to another, longer code.
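Under that rule, a candidate code conflicts exactly when it is made entirely of 1 bits and is no longer than six bits. A minimal sketch of such a check (illustrative only, not the clownnemesis code):

```c
#include <stdio.h>

/* A code conflicts with the reserved code if it is 111111 itself or one
   of its prefixes, i.e. it consists entirely of 1 bits and is at most
   six bits long. */
static int ConflictsWithReservedCode(const unsigned int code, const unsigned int length)
{
    const unsigned int all_ones = (1u << length) - 1;

    return length <= 6 && code == all_ones;
}

int main(void)
{
    printf("%d\n", ConflictsWithReservedCode(0x3F, 6)); /* 111111 - conflicts */
    printf("%d\n", ConflictsWithReservedCode(0x07, 3)); /* 111    - conflicts */
    printf("%d\n", ConflictsWithReservedCode(0x37, 6)); /* 110111 - allowed   */

    return 0;
}
```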
The second shortcoming was that codes have a maximum length: Nemesis only supports codes of up to 8 bits, while Huffman coding was generating its optimal codes on the assumption that they could be of any length. The fact that these overly-long codes had to be rejected meant that the output of Huffman coding was no longer optimal! Granted, Shannon coding and Fano coding are susceptible to this as well, but somehow it was just worse with Huffman. However, Huffman has a solution to this: there exists a variant of Huffman coding that is specifically intended for codes with a limited maximum length, producing codes that are optimal within that constraint.
Performing Huffman coding in this manner requires what is known as the package-merge algorithm, which can be implemented as a slightly modified version of the usual Huffman algorithm with an extra queue. Rather than produce a whole tree, this algorithm produces a series of them, and, rather than produce the codes directly, parsing these trees produces the codes' lengths. A simple algorithm can be used to generate 'canonical Huffman' codes from these lengths.
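That last step is the standard canonical-code construction: sort the symbols by code length, then hand out consecutive code values, shifting left whenever the length increases. A minimal sketch with made-up lengths (not the clownnemesis code):

```c
#include <stdio.h>

#define TOTAL_SYMBOLS 6

int main(void)
{
    /* Made-up code lengths, already sorted from shortest to longest and
       assumed to satisfy the Kraft inequality. */
    static const unsigned int code_length[TOTAL_SYMBOLS] = {2, 2, 2, 3, 4, 4};

    unsigned int code = 0;
    unsigned int previous_length = code_length[0];
    size_t i;

    for (i = 0; i < TOTAL_SYMBOLS; ++i)
    {
        /* Each time the length increases, shift the running code left to
           make room for the extra bits. */
        code <<= code_length[i] - previous_length;
        previous_length = code_length[i];

        printf("Symbol %u: length %u, code ", (unsigned int)i, code_length[i]);

        /* Print the code in binary, most-significant bit first. */
        {
            unsigned int bit;

            for (bit = code_length[i]; bit-- != 0; )
                putchar((code >> bit & 1) ? '1' : '0');
        }

        putchar('\n');

        ++code;
    }

    return 0;
}
```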
Unbelievably, despite implementing this algorithm, the generated data was still too large. I was able to find one file that compressed slightly better with my compressor than Sega's, but the vast majority compressed to a considerably larger size. Examining the data, I was able to discern a pattern: while Huffman coding allowed many more symbols to be assigned a code, the most common symbols would be assigned longer codes than they would be with Fano coding. The space saved by assigning codes to more symbols was counteracted by the most common symbols having larger codes. This was unavoidable, as Huffman is a 'greedy' algorithm, the same kind of algorithm that keeps standard LZSS compressors from being optimal: Huffman is only ideal when you do not consider that encoding fewer symbols may be for the better.
I considered hackishly adding a heuristic, but could not stand the thought of it. Then I considered brute-forcing: I could repeat the tree-generation process over and over again, using fewer and fewer symbols until there was only one left, all the while keeping track of which iteration would require the fewest bits to output. This proved to work incredibly well in practice: the brute-forcing does not scale with the size of the input data, but rather with the number of symbols and the maximum code length, keeping the performance impact low; but, more importantly, this allowed Huffman to finally outdo Fano! It is not a huge amount: just a dozen bytes or so for small files, or a few hundred bytes for larger ones, but that is about on par with how my perfect Kosinski compressor compares to standard Kosinski compressors!
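Structurally, that brute-force step looks something like the sketch below. The cost function here is a deliberately fake stand-in (the real compressor would rerun the length-limited code generation for each candidate symbol count), and INLINE_BITS_PER_RUN is a placeholder value rather than the format's real inline cost:

```c
#include <limits.h>
#include <stdio.h>

#define TOTAL_SYMBOLS 6
#define INLINE_BITS_PER_RUN 13 /* Placeholder value, not the real format's cost. */

/* Made-up symbol frequencies, sorted from most to least common. */
static const unsigned int frequency[TOTAL_SYMBOLS] = {35, 25, 15, 12, 8, 5};

/* Stand-in for the code-generation step: pretend that coding the most
   common 'symbols_to_code' symbols costs this many bits in total. A real
   implementation would derive this from the generated code lengths plus
   the size of the code table. */
static unsigned long CostOfCoding(const size_t symbols_to_code)
{
    unsigned long bits = 0;
    size_t i;

    for (i = 0; i < TOTAL_SYMBOLS; ++i)
    {
        if (i < symbols_to_code)
            bits += frequency[i] * 4ul; /* Pretend every code is 4 bits. */
        else
            bits += frequency[i] * (unsigned long)INLINE_BITS_PER_RUN;
    }

    return bits;
}

int main(void)
{
    unsigned long best_bits = ULONG_MAX;
    size_t best_symbols_to_code = 0, symbols_to_code;

    /* Try coding progressively fewer of the most common symbols, keeping
       whichever attempt produces the smallest output. */
    for (symbols_to_code = TOTAL_SYMBOLS; symbols_to_code >= 1; --symbols_to_code)
    {
        const unsigned long bits = CostOfCoding(symbols_to_code);

        if (bits < best_bits)
        {
            best_bits = bits;
            best_symbols_to_code = symbols_to_code;
        }
    }

    printf("Best: code %u symbols for %lu bits\n", (unsigned int)best_symbols_to_code, best_bits);

    return 0;
}
```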
It has been a wild ride, but I finally have a Nemesis compressor that outperforms the one used by Sega. It still produces larger data than Flamewing's compressor, but that is to be expected considering the crazy black magic which that thing does. I am glad to finally see the completion of this project: I have been working on it non-stop for three days. So many headaches and so many hours spent comprehending obscure, poorly-documented algorithms. If only I was good at reading academic papers.
With this and clownlzss, I hope to replace all of Flamewing's compressors in ClownMapEd. Then I can stop getting a bunch of compiler warnings about the compressors treating the return value of 'std::istream::tellg()' as an integer. Ew.
The source code to my Nemesis compressor and decompressor can be found here: https://github.com/Clownacy/clownnemesis