The simplest compression algorithms: RLE and LZ77. Coding methods

Basic classification of reversible compression algorithms

There are many different practical lossless compression methods, and their effectiveness varies for different types of data and different data volumes. However, these methods are based on three main classical compression schemes:

Run-length encoding (RLE);

Statistical compression methods;

Dictionary (heuristic) compression methods.

Compression by run-length encoding: the RLE algorithm

The most well-known simple approach to reversible compression is Run Length Encoding (RLE). The essence of the methods in this approach is to replace chains, or runs, of repeating bytes (or byte sequences) with one encoding byte and a counter of the number of repetitions. The problem with all such methods is determining how the decompression algorithm can distinguish an encoded run from other, unencoded byte sequences in the resulting byte stream. The problem is usually solved by placing marks at the beginning of encoded chains. Such marks can be, for example, characteristic bit values in the first byte of an encoded run, the value of the first byte of an encoded run, and so on. These methods are usually quite effective for compressing raster graphic images (BMP, PCX, TIF, GIF), because such images contain many long runs of repeating byte sequences. The disadvantage of the RLE method is the rather low compression ratio, or even an increase in size, for files with few runs and, even worse, with few repeating bytes within the runs.

The RLE algorithm is based on the idea of identifying repeated data sequences and replacing them with a simpler structure that specifies the data value and a repetition factor. For example, let the following sequence of data subject to compression be given: 1 1 1 1 2 2 3 4 4 4. The RLE algorithm proposes replacing it with the structure 1 4 2 2 3 1 4 3, where the first number of each pair is the data value and the second is its repetition factor. If 1 byte is allocated to store each data element of the input sequence, the entire sequence occupies 10 bytes of memory, while the output sequence (the compressed version) occupies 8 bytes. Clearly, the RLE algorithm gives a better compression effect when the repeated runs are longer. If, in the example above, the input sequence is 1 1 1 1 1 1 3 4 4 4, then the compression ratio is 60%. For this reason, the RLE algorithm is most effective when compressing graphic data (especially monochrome images).
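For illustration, here is a minimal sketch of this pair-based scheme in Python (not part of the original text; the function names are arbitrary):

def rle_encode(data):
    """Replace each run of equal values with a (value, count) pair."""
    pairs = []
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1
        pairs.append((data[i], j - i))   # value and its repetition factor
        i = j
    return pairs

def rle_decode(pairs):
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out

print(rle_encode([1, 1, 1, 1, 2, 2, 3, 4, 4, 4]))
# [(1, 4), (2, 2), (3, 1), (4, 3)] -> 8 values instead of 10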


Heuristic (dictionary) compression algorithms (LZ, LZW) search for strings in a file that repeat and build a dictionary of phrases that have already been encountered. Typically, such algorithms have a number of specific parameters (buffer size, maximum phrase length, etc.) whose selection depends on the experience of the implementer; these parameters are chosen to achieve the best trade-off between the algorithm's running time, the compression ratio, and the class of files that compress well.

Statistical algorithms (Huffman, Shannon-Fano, arithmetic coding) use a statistical model of the data, and the quality of compression depends directly on how good this model is. Taking the statistical properties of the data into account in the simplest case means using variable-length codes: frequently occurring characters are encoded with short codes, rarely occurring ones with long codes. Such algorithms therefore need to know the probabilities of the characters' appearance, which in the simplest case can be estimated from the frequency of the characters in the input data. Typically these probabilities are not known in advance. The purpose of coding is to convert the input stream into a bit stream of minimum length, which is achieved by reducing the redundancy of the input stream by taking symbol frequencies into account.

The length of the code representing a character from the stream alphabet should be proportional to the amount of information it carries, and the length of the characters' codes in bits need not be a multiple of 8 and may even be variable. If the probability distribution of the frequencies of occurrence of the symbols of the input stream's alphabet is known, an optimal coding model can be constructed. However, because of the huge number of different file formats the task is much more complicated, since the frequency distribution of the data symbols is not known in advance. In this case, two approaches are used.

The first consists of scanning the input stream and building the encoding from the collected statistics. This requires two passes over the file: one for scanning and collecting statistical information, the second for encoding. This somewhat limits the applicability of such algorithms, since it rules out the single-pass on-the-fly encoding used in telecommunication systems, where the volume of data is sometimes unknown in advance and retransmission or re-parsing can take an unreasonably long time. In this case, the statistical scheme used for encoding is written to the output stream. This method is known as static Huffman coding: the symbol table is transmitted together with the compressed file. Such algorithms compress most files quite well, but they require additional transmission of the symbol frequency table and two passes over the source file, one to collect statistics and one to compress.

The second is the adaptive coding method. Its principle is to change the encoding scheme depending on the nature of the input stream. Adaptive algorithms begin to work with a fixed initial table of symbol frequencies (usually all symbols are initially considered equally probable), and during operation this table changes depending on the symbols encountered in the file. Advantages: a single-pass algorithm, no need to transfer the symbol frequency table, and effective compression of a wide class of files. This approach does not require storing information about the encoding explicitly. Adaptive encoding can give a higher compression ratio than static encoding, since changes in the frequencies of the input stream are taken into account more fully. This method is known as dynamic Huffman coding.

Taking this into account, statistical algorithms can be divided into three classes:

1. Non-adaptive - they use fixed, predetermined symbol probabilities. The symbol probability table is not transmitted with the file because it is known in advance. Disadvantage: a narrow range of use, since an acceptable compression ratio is achieved only for data that matches the chosen frequency table.

2. Semi-adaptive - a table of symbol frequencies is built for each file and the file is compressed based on it. The symbol table is transmitted along with the compressed file. Such algorithms compress most files quite well, but they require additional transmission of the symbol frequency table and two passes over the source file, one to collect statistics and one to compress.

3. Adaptive - they begin to work with a fixed initial table of symbol frequencies (usually all symbols are initially equally probable) and during operation this table changes depending on the symbols that appear in the file. Advantages: single-pass algorithm, does not require transfer of the symbol frequency table, compresses a wide class of files quite effectively.

The Huffman algorithm is based on the idea of coding by bit groups. First, a frequency analysis of the input data sequence is carried out, that is, the frequency of occurrence of each character found in it is determined. After that, the characters are sorted by decreasing frequency of occurrence.

The basic idea is this: the more frequently a character occurs, the fewer bits are used to encode it. The encoding result is entered into a dictionary required for decoding. Let's look at a simple example illustrating how the Huffman algorithm works.

In static Huffman coding, input symbols (which may themselves be bit chains of various lengths) are assigned variable-length bit chains as their codes. The code length of each symbol is taken proportional to the binary logarithm of its frequency, taken with the opposite sign, and the set of all distinct symbols encountered constitutes the alphabet of the stream. This encoding is a prefix encoding, which makes decoding the result stream easy: with a prefix encoding, the code of any character is not a prefix of the code of any other character, so the decoding is unambiguous.
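As an illustration (not from the original text), here is a compact sketch of static, two-pass Huffman coding in Python; the function name and the use of the standard heapq module are implementation choices, not the only possible ones.

import heapq
from collections import Counter

def huffman_codes(text):
    """Build a prefix code: frequent symbols get short bit strings."""
    freq = Counter(text)
    # Each heap entry: (frequency, tie-breaker, {symbol: code-so-far})
    heap = [(f, i, {ch: ""}) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                          # degenerate one-symbol alphabet
        return {next(iter(freq)): "0"}
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Prepend 0 to codes in the left subtree and 1 in the right one
        merged = {ch: "0" + c for ch, c in left.items()}
        merged.update({ch: "1" + c for ch, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

codes = huffman_codes("abracadabra")
encoded = "".join(codes[ch] for ch in "abracadabra")   # frequent 'a' gets the shortest code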

Run-length encoding (RLE) or Repeat Coding is a simple data compression algorithm that operates on data runs, that is, sequences in which the same character appears several times in a row. When encoding, a string of identical characters that make up a series is replaced by a string that contains the repeating character itself and the number of its repetitions.

Characteristics of the RLE algorithm:

Compression ratios: first variant: 32, 2, 0.5; second variant: 64, 3, 128/129 (best, average, and worst ratios). Image class: the algorithm is aimed at images with a small number of colors: business and scientific graphics. Symmetry: about one.

Characteristics: the positive aspects of the algorithm are, perhaps, only that it requires no additional memory during packing and unpacking and that it works quickly. An interesting feature of run-length encoding is that the compression ratio for some images can be significantly increased simply by changing the order of the colors in the image palette.

First version of the algorithm

This algorithm is extremely simple to implement. Group encoding, from the English Run Length Encoding (RLE), is one of the oldest and simplest graphics archiving algorithms. The image in it (as in several algorithms described below) is unrolled into a chain of bytes along the raster lines. The compression in RLE occurs because the source image contains chains of identical bytes. Replacing them with pairs <repetition counter, value> reduces data redundancy.

The algorithm is designed for business graphics: images with large areas of repeating color. The situation when the file grows instead of shrinking is not rare for this simple algorithm. It is easy to produce by applying run-length encoding to processed color photographs. To make the output twice the size of the input, the algorithm must be applied to an image in which the values of all pixels are greater than binary 11000000 and no two adjacent pixels are equal.
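The text does not spell out the packet format of this first variant, but the reference to binary 11000000 suggests the PCX-style convention: a byte whose two high bits are set is a repetition counter (its low six bits give the run length, at most 63), so pixel values of 0xC0 and above must always be written as a run, even of length 1. A sketch under that assumption (the function name is illustrative):

def rle_pcx_encode(pixels):
    """First variant: <counter, value> pairs, counter byte has top two bits set."""
    out = bytearray()
    i = 0
    while i < len(pixels):
        run = 1
        while i + run < len(pixels) and pixels[i + run] == pixels[i] and run < 63:
            run += 1
        if run > 1 or pixels[i] >= 0xC0:
            out.append(0xC0 | run)      # counter byte: 11xxxxxx
            out.append(pixels[i])
        else:
            out.append(pixels[i])       # literal byte (< 0xC0)
        i += run
    return bytes(out)

# Worst case: no repeated neighbours and every value >= 0xC0 -> the output doubles.
print(len(rle_pcx_encode(bytes([0xC1, 0xC2, 0xC3, 0xC4]))))   # 8 bytes from 4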

Second version of the algorithm

The second version of this algorithm has a larger maximum archiving ratio and increases the size of the original file less.

The LZW compression algorithm

The name of the algorithm is formed from the first letters of the surnames of its developers: Lempel, Ziv and Welch.

The LZW algorithm is based on the idea of extending the alphabet, which allows additional characters to be used to represent strings of regular characters. Using, for example, 9-bit codes instead of 8-bit ASCII codes gives 256 additional characters. The compressor's work comes down to building a table of strings and their corresponding codes. The compression algorithm boils down to the following: the program reads the next character and appends it to the current string. If the string is already in the table, reading continues; if not, the string is added to the string table. The more repeated strings there are, the more strongly the data will be compressed.
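A minimal sketch of the compressor just described, assuming the table is initialized with all 256 single-byte strings and the output is left as a list of integer codes (a real implementation would also pack the codes into 9-bit or wider fields):

def lzw_compress(data: bytes):
    """Emit integer codes; the string table is built on the fly."""
    table = {bytes([i]): i for i in range(256)}   # initial alphabet
    next_code = 256
    current = b""
    codes = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate                # keep extending the phrase
        else:
            codes.append(table[current])       # output the code of the known phrase
            table[candidate] = next_code       # remember the new phrase
            next_code += 1
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes

print(lzw_compress(b"ABABABA"))   # repeated substrings such as "AB" get single codes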

Returning to the example with the telephone number, we can say, using a very simplified analogy, that by compressing the record 233 34 44 with the LZW method we introduce new strings 333 and 444 and, expressing them with additional characters, reduce the length of the record.

Characteristics of the LZW algorithm:

Compression ratios: approximately 1000, 4, 5/7 (best, average, and worst ratios). Compression of 1000 times is achieved only on single-color images whose size is a multiple of roughly 7 MB.

Image class: LZW is focused on 8-bit images built on a computer. It compresses by exploiting identical substrings in the stream.

Symmetry: almost symmetrical, provided that the search for a string in the table is implemented optimally.

Characteristics: the situation when the algorithm enlarges the image is extremely rare. LZW is universal: it is its variants that are used in conventional archivers.

The compression process looks quite simple. We read the characters of the input stream sequentially and check whether such a string exists in the string table we have created. If the string is there, we read the next character; if it is not, we write the code of the previously found string to the output stream, enter the new string into the table and start the search again. A peculiarity of LZW is that the string table does not need to be saved to the file for decompression: the algorithm is designed so that the table of strings can be restored using only the stream of codes.

A simple coding method (RUN-LENGTH CODING) is successfully used by popular archivers. This method is based on counting the length of a run of repeated characters and writing the count together with the repeating character itself into the packed file instead of the sequence of those characters. For example, take a string like "". As can be seen from the example, a run of 5 letters "A" was packed into the two characters "5" and "A", while the sequences "DD", "EE", "L" were not packed at all, since there is no gain from replacing these characters with a "length"+"letter" pair.

When packing a file using this method, a difficulty arises: how will the unpacker distinguish control information in the packed file from ordinary unpacked data? For example, in the case discussed above, how will the unpacker tell the control character "5" from an unpacked character with the same code? There are two options for solving this problem:

(I). Find a character that occurs less frequently than the others, or does not occur at all, in the file being packed, and use it as a control character. For convenience in the following explanation, we will call it a prefix.

Now the sequence "AAAAA" will be converted by the packer into: prefix (8 bits), "A" (8 bits), count (8 bits), i.e. 5 bytes are replaced by three. If a byte with the prefix code is encountered in the source file during packing, two bytes with the prefix code are written to the resulting file; by this feature the unpacker determines where the prefix is an ordinary symbol and where it is control information. Modifications of this method are possible, for example:

1. Encode the count not with eight bits but with three, four, and so on, which increases the packing percentage but limits the maximum length of a run that can be packed: with three bits, lengths from 3 to 10 can be encoded (000 = 3, ..., 111 = 10), and if a run is longer than 10 characters, it is packed ten characters at a time.

2. A second modification is possible, in which the number of repeated characters is again encoded with three bits, but the run length is limited to 265. The question is how to encode 256 different lengths in 3 bits. It turns out to be possible, but only for the range from 3 to 9; if the run is longer than 9 characters, then "111" (the binary code equal to 7) is written in the three bits, and the true length is placed in the next byte (8 bits are allocated for lengths from 10 to 265 characters).
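Before the worked examples, here is a minimal sketch of option (I) in its basic form (8-bit count; runs of three or more bytes become the triple prefix, character, count; a literal occurrence of the prefix character is written as two prefix bytes). The function name and the threshold of three characters are illustrative choices, and a run of the prefix character itself is not handled.

def rle_prefix_encode(data: bytes, prefix: int) -> bytes:
    """Option (I): runs of >= 3 become <prefix, character, count>; count fits one byte."""
    out = bytearray([prefix])                  # first byte tells the unpacker which byte is the prefix
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        if run >= 3:
            out += bytes([prefix, data[i], run])
        else:
            for _ in range(run):               # short runs are cheaper to copy than to encode
                if data[i] == prefix:
                    out += bytes([prefix, prefix])   # escaped literal prefix character
                else:
                    out.append(data[i])
        i += run
    return bytes(out)

packed = rle_prefix_encode(b"AAAAABBBBBBBBBBCCAAAAAAAAAAAAAAA", ord("D"))
print(len(packed))   # 12 bytes, as in worked example 1a below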

Let's give some examples:

We have the sequences:

a) "AAAAABBBBBBBBBBCCAAAAAAAAAAAAAAA"

b) "CBBBBBBBBBBEEEEEEEEEEA"

Let's pack them:

1. Using the RLE method (run-length encoding, packing runs of repeating bytes).

a) Taking a prefix equal to "D", we get the packed sequence:

"D", "D", "A", 5, "D", "B", 10, "C", "C", "D", "A", 15

Compression percentage = 12 bytes/32 bytes = 37.5%

In the packed file, the first byte is the prefix byte so that the unpacker knows which byte is the prefix.

b) Let's take a prefix equal to "A", although in fact a packer would not do this: this sequence uses only a few distinct characters, so an archiver would take an unused character as the prefix.

Packed sequence:
"A", "C", "A", "B", 10, "A", "E", 10, "A", "A"

Compression percentage = 10 bytes/22 bytes = 45.5%

2. Now let's pack the same sequences according to modification 1 of the RLE method, with the same prefix values, in order to compare the efficiency.

a) Packed sequence:

"D" (8 bits), "D" (8 bits), "A" (8 bits), 2 (3 bits), "D" (8 bits), "B" (8 bits), 7 (3 bits),
"D" (8 bits), "A" (8 bits), 7 (3 bits), "D" (8 bits), "A" (8 bits), 2 (3 bits)

Compression percentage = 84 bits/(32*8) bits = 32.8%

"A" , "C" , "A" , "B" , 7 , "A" , "E" , 7 , "A" , "A"
8 bits 8 bits 8 bits 8 bits 4 bits 8 bits 8 bits 3 bits 8 bits 8 bits

Compression percentage = 70 bits/(22*8) bits = 39.8%

3. Now let's check the effectiveness of the latest modification:

a) Packed sequence:

"D" (8 bits), "D" (8 bits), "A" (8 bits), 2 (3 bits), "D" (8 bits), "B" (8 bits), 7 (3 bits), 0 (8 bits),
"D" (8 bits), "A" (8 bits), 7 (3 bits), 5 (8 bits)

Compression percentage = 81 bits/(32*8) bits = 31.6%

b) Packed sequence:

"A" (8 bits), "C" (8 bits), "A" (8 bits), "B" (8 bits), 7 (3 bits), 0 (8 bits), "A" (8 bits), "E" (8 bits),
7 (3 bits), 0 (8 bits), "A" (8 bits), "A" (8 bits)

Compression percentage = 86 bits/(22*8) bits = 48.9%

From these examples it is clear that, depending on the contents of the data being packed, one or the other algorithm is more effective; therefore, to choose the more effective algorithm, you need to test them on different types of files.

(II). The second option for recording control information for the unpacker is as follows: if a single character is found in the text, then one control bit equal to one and this character itself are written to the output file. If a sequence of repeated bytes is encountered, for example “AAAAAA”, then the packed information will look like this: 0 (1 bit), byte A (8 bits), quantity (3-8 bits);
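A sketch of option (II) at the level of a bit string: 1 marks a single character, 0 marks a run; the count here is written in eight bits, as in example 1 below. A real packer would additionally pack these bits into whole bytes; the function name is illustrative.

def rle_bitflag_encode(data: bytes) -> str:
    """Option (II): '1'+char bits for a single character, '0'+char bits+8-bit count for a run."""
    bits = []
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        char_bits = format(data[i], "08b")
        if run == 1:
            bits.append("1" + char_bits)                        # 9 bits per single character
        else:
            bits.append("0" + char_bits + format(run, "08b"))   # 17 bits per run
        i += run
    return "".join(bits)

encoded = rle_bitflag_encode(b"CBBBBBBBBBBEEEEEEEEEEA")
print(len(encoded))   # 52 bits, as in example 1b below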

Let's give an example for clarity. To do this, let's take sequences from previous examples.

1) The number of repeating bytes is encoded into 8 bits.

a) 0 , "A" , 5 , 0 , "B" , 10 , 1 , "C" , 1 , "C" , 0 , "A" , 15
1b. 8b. 8b. 1b. 8b. 8b. 1b. 8b. 1b. 8b. 1b. 8b. 8b.

Compression percentage = 71 bit/256 bit = 27.7%

b) 1 , "C" , 0 , "B" , 10 , 0 , "E" , 10 , 1 , "A"
1b. 8b. 1b. 8b. 8b. 1b. 8b. 8b. 1b. 8b.

Compression percentage = 52 bits/176 bits = 29.5%

2) Now let's take 3 bits to encode the number of repeating bytes; they can encode values from 2 to 8 (stored as 0 to 6). If the length of the repeating sequence is greater, these three bits contain "111" (decimal 7), and the true length is placed in the next byte (lengths from 9 to 264).

a) 0 , "A" , 3 , 0 , "B" , 7 , 1 , 0 , "C" , 0 , 0 , "A" , 7 , 6
1b. 8b. 3b. 1b. 8b. 3b. 8b. 1b. 8b. 3b. 1b. 8b. 3b. 8b.

Compression percentage = 64 bits/256 bits = 20.3%

b) 1 , "C" , 0 , "B" , 7 , 1 , 0 , "E" , 7 , 1 , 1, "A"
1b. 8b. 1b. 8b. 3b. 8b. 1b. 8b. 3b. 8b. 1b. 8b.

Compression percentage = 58 bit/176 bit = 33%

If the count is encoded in 4 bits (2..16), then:

1 (1 bit), "C" (8 bits), 0 (1 bit), "B" (8 bits), 8 (4 bits), 0 (1 bit), "E" (8 bits), 8 (4 bits), 1 (1 bit), "A" (8 bits)

Compression percentage = 44 bits/176 bits = 25%

As can be seen from the examples, the second packing option is more efficient, since it compresses runs starting from two characters, and such runs are usually the vast majority; the first option packs only runs of three characters or more.

1) Using the RLE algorithm, encode the character sequence BBBBBBACCCABBBBBB. Write the result as hexadecimal codes (each character is encoded as a byte, which is represented by two hexadecimal digits). Check the result using the RLE program.

2) Decode the sequence packed using the RLE algorithm (hexadecimal codes are given): 01 4D 8E 41 01 4D 8E 41. To determine characters by their hexadecimal code, use an ASCII table (for example, the character "&" has the hexadecimal code 26).

3) Determine the number of bytes in the original and decompressed sequences and calculate the compression ratio. Check the result obtained in the previous item using the RLE program. Suggest two verification methods.

4) Construct sequences that are compressed by the RLE algorithm by exactly 2 times, 4 times, and 5 times. Check your answers using the RLE program.

5) Think of three sequences that cannot be compressed using the RLE algorithm.

6) Using the RLE program, apply RLE compression to the following files and find the compression ratio for each of them.

7) Explain the results obtained in the previous item: why can't pictures in JPEG format be compressed further? Why are the RLE compression ratios so different for two BMP images of the same size? Hint: open these pictures in any viewing program.

8) Estimate the maximum achievable compression ratio using the version of the RLE algorithm discussed in the textbook. In what case can it be achieved?

9) Estimate the worst-case compression ratio using the RLE algorithm. Describe this worst case.

When compressing using this RLE algorithm, the number of repetitions of the first character is recorded first, then the first character itself, then the number of repetitions of the second character, and so on. For example, a file consisting of 100 characters "A" (code 192) followed by 100 characters "B" (code 193) is encoded in just 4 bytes:

01100100 11000000 01100100 11000001
(100, "A" (code 192), 100, "B" (code 193))

Thus, we compressed the file 50 times because it had redundancy: chains of identical characters. This is lossless compression, because, knowing the packing algorithm, the original data can be recovered from the code. Obviously, this approach doubles the amount of data when there are no adjacent identical characters in the file.

To improve the RLE results even in this worst case, the algorithm was modified as follows. The packed sequence contains control bytes; each control byte is followed by one or more data bytes. If the most significant bit of the control byte is 1, then the data byte following it must be repeated during unpacking as many times as is written in the remaining 7 bits of the control byte. If the most significant bit of the control byte is 0, then the next several data bytes must be taken without change; how many exactly is written in the remaining 7 bits of the control byte. For example, the control byte 10000111 means that the next byte must be repeated 7 times, and the control byte 00000100 means that the next 4 bytes must be taken without change. So the sequence

10001111 11000000 00000010 11000001 11000010
(repeat 15, "A" (code 192), copy 2, "B" (code 193), "V" (code 194))

is unpacked into 17 characters: AAAAAAAAAAAAAAABV.

The RLE algorithm has been successfully used to compress drawings in which large areas are filled with the same color, and some audio data. Nowadays, more advanced but more complex methods are used instead; we will look at one of them (Huffman coding) below. The RLE algorithm is used, for example, at one of the stages of encoding pictures in JPEG format. RLE compression is also available in BMP format (for images with a palette of 16 or 256 colors).

The best way to understand how the algorithm works is to practice using it. From the author's website (http://kpolyakov.narod.ru/prog/compress.htm) you can download a free simulator program designed for studying the RLE algorithm. On the left side of the program window there is a text editor. When you click the button, the entered text is compressed using the RLE algorithm, and the compressed data is displayed as hexadecimal codes on the right side of the window. The window on the right side is also an editor, so the codes can be changed and the reverse operation (unpacking, decompression) performed by clicking the corresponding button. The buttons at the top of the window allow you to compress and restore files on disk. Keep in mind that the program uses its own data storage format.

Test questions

1) Estimate the maximum achievable compression ratio using the considered version of the RLE algorithm. In what case can it be achieved?
2) Estimate the worst-case compression ratio using the RLE algorithm. Describe this worst case.
3) Come up with three sequences that cannot be compressed using the RLE algorithm.
4) Construct sequences that are compressed by the RLE algorithm by exactly 2 times, 4 times, and 5 times.
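As an illustration of the modified scheme described above (a control byte whose most significant bit distinguishes repeats from literal runs), here is a minimal unpacking sketch in Python; it reflects only the format described in the text, not the actual code of the simulator program.

def rle_ctrl_decode(packed: bytes) -> bytes:
    """Control byte: MSB=1 -> repeat the next byte (low 7 bits) times, MSB=0 -> copy that many bytes."""
    out = bytearray()
    i = 0
    while i < len(packed):
        ctrl = packed[i]
        if ctrl & 0x80:                        # 1xxxxxxx: a run
            count = ctrl & 0x7F
            out += bytes([packed[i + 1]]) * count
            i += 2
        else:                                  # 0xxxxxxx: a literal block
            count = ctrl
            out += packed[i + 1:i + 1 + count]
            i += 1 + count
    return bytes(out)

# The packed example from the text: 10001111 11000000 00000010 11000001 11000010
packed = bytes([0b10001111, 192, 0b00000010, 193, 194])
print(len(rle_ctrl_decode(packed)))   # 17 characters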
Practice

1) Using the RLE algorithm, encode the sequence of characters BBBBBBACCCABBBBBB. Write the result as hexadecimal codes (each character is encoded as a byte, which is represented by two hexadecimal digits). Check the result using the RLE program.

2) Decode the sequence packed using the RLE algorithm (hexadecimal codes are given): 01 4D 8E 41 01 4D 8E 41. To identify characters by their hexadecimal codes, use an ASCII table. Determine the number of bytes in the original and decompressed sequences and calculate the compression ratio. Check the result using the RLE program. Suggest two verification methods.

3) Using the RLE program, apply RLE compression to the following files and find the compression ratio for each of them: grad_vert.bmp, grad_horz.bmp, grad_diag.jpg. (These and other files used in the workshop tasks are located on the supplement disk to this issue of the journal.) Explain the results obtained: why are the compression ratios using the RLE algorithm so different for two BMP images of the same size? Why can't pictures saved in JPEG format be compressed further?

Prefix codes

Think of Morse code, which uses a variable-length code to reduce message length: frequently occurring letters (A, E, M, H, T) are coded with short sequences, and rarely occurring ones with longer sequences. Such a code can be represented by a structure called a tree. The figure shows an incomplete Morse code tree, built only for characters whose codes consist of one or two symbols (dots and dashes). The tree consists of nodes (the root and circles with alphabet symbols) and directed edges connecting them; the arrows indicate the direction of movement. The top node, which no arrow enters, is called the root of the tree. From the root and from every intermediate node (except the final nodes, the leaves), two arrows emerge: the left one is marked with a dot and the right one with a dash. To find the code of a symbol, follow the arrows from the root of the tree to the desired leaf, writing down the labels of the arrows along the way. There are no cycles (closed paths) in the tree, so the code for each character is determined in a unique way. Using this tree, you can construct the following codes: E = ".", I = "..", A = ".-", T = "-", N = "-.", M = "--".

This is a variable-length code, in which characters have codes of different lengths, so the problem of dividing a sequence into separate codewords always arises. In Morse code it is solved with a separator character, a pause. However, you do not have to introduce an additional character if the Fano condition is satisfied: no codeword is the beginning of another codeword. This allows a message to be decoded unambiguously in real time as the characters are received. A prefix code is a code in which no codeword is the beginning of another codeword (the Fano condition).

To use this idea in computer data processing, an algorithm for constructing a prefix code had to be developed. This problem was first solved, independently of each other, by the American mathematicians and engineers Claude Shannon (in 1948) and Robert Fano (in 1949). They exploited the redundancy of messages, which consists in the fact that characters in a text have different frequencies of occurrence. In this case the source file must be read twice: on the first pass the frequency of occurrence of each character is determined, then a code is built taking these data into account, and on the second pass the text characters are replaced by their codes. The coding algorithm proposed by Shannon and Fano is called the Shannon-Fano code.

Example 3. Let the text consist only of the letters O, E, N, T and a space.
It is known how many times each appeared in the text: the space 179 times, O 89, E 72, N 53 and T 50 times. Following the Shannon-Fano method, we divide the characters into two groups so that the total number of occurrences of the characters in the first group is approximately equal to the total number of occurrences in the second group. In our case the best option is to put the space and the letter T into the first group (sum 179 + 50 = 229) and the remaining characters into the second (sum 89 + 72 + 53 = 214). The characters of the first group receive codes starting with 0, the rest codes starting with 1. There are only two characters in the first group; one of them, the space, gets 0 as the second digit of its code (full code 00), and the other gets 1 (the code of T is 01).
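A sketch following the grouping rule used in this example: symbols are split into two groups whose total counts are as close as possible (found here by exhaustive search, which is feasible only for tiny alphabets), after which each group is processed recursively. The identifiers are illustrative, and the underscore stands for the space character.

from itertools import combinations

def split_balanced(freqs):
    """Split symbols into two groups with totals as close as possible (tiny alphabets only)."""
    total = sum(f for _, f in freqs)
    best = None
    for r in range(1, len(freqs)):
        for group in combinations(freqs, r):
            diff = abs(2 * sum(f for _, f in group) - total)
            if best is None or diff < best[0]:
                best = (diff, {s for s, _ in group})
    return best[1]

def shannon_fano(freqs):
    """freqs: list of (symbol, count). Returns {symbol: bit string}."""
    if len(freqs) == 1:
        return {freqs[0][0]: ""}
    first = split_balanced(freqs)
    g1 = [(s, f) for s, f in freqs if s in first]
    g2 = [(s, f) for s, f in freqs if s not in first]
    codes = {s: "0" + c for s, c in shannon_fano(g1).items()}
    codes.update({s: "1" + c for s, c in shannon_fano(g2).items()})
    return codes

print(shannon_fano([("_", 179), ("O", 89), ("E", 72), ("N", 53), ("T", 50)]))
# The space ("_") gets code 00 and T gets 01, as in the worked example above.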


