Question? Leave a message!




DATA COMPRESSION

DATA COMPRESSION
ROBERT SEDGEWICK KEVIN WAYNE Algorithms 5.5 DATA COMPRESSION introduction ‣ runlength coding ‣ Huffman compression ‣ Algorithms FOUR TH EDITION LZW compression ‣ ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.edu5.5 DATA COMPRESSION introduction ‣ runlength coding ‣ Huffman compression ‣ Algorithms LZW compression ‣ ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.eduData compression Compression reduces the size of a file: To save space when storing it. To save time when transmitting it. Most files have lots of redundancy. Who needs compression Moore's law: transistors on a chip doubles every 18–24 months. Parkinson's law: data expands to fill space available. Text, images, sound, video, … “ Everyday, we create 2.5 quintillion bytes of data—so much that 90 of the data in the world today has been created in the last two years alone. ” — IBM report on big data (2011) Basic concepts ancient (1950s), best technology recently developed. 3Applications Generic file compression. Files: GZIP, BZIP, 7z. Archivers: PKZIP. File systems: NTFS, HFS+, ZFS. Multimedia. Images: GIF, JPEG. Sound: MP3. Video: MPEG, DivX™, HDTV. Communication. ITUT T4 Group 3 Fax. V.42bis modem. Skype. Databases. Google, Facebook, .... 4Lossless compression and expansion Message. Binary data B we want to compress. Compress. Generates a "compressed" representation C (B). Expand. Reconstructs original bitstream B. uses fewer bits (you hope) Compress Expand bitstream B compressed version C(B) original bitstream B 0110110101... 0110110101... 1101011111... Basic model for data compression Compression ratio. Bits in C (B) / bits in B. Ex. 50–75 or better compression ratio for natural language. 5Food for thought Data compression has been omnipresent since antiquity: Number systems. 1 2 X 1 ⇡ = Natural languages. 2 n 6 n=1 Mathematical notation. has played a central role in communications technology, brai l l e Grade 2 Braille. Morse code. but rather a I like like every Telephone system. and is part of modern life. MP3. MPEG. Q. What role will it play in the future 6Data representation: genomic code Genome. String over the alphabet A, C, T, G . Goal. Encode an Ncharacter genome: ATAGATGCATAG... Standard ASCII encoding. Twobit encoding. 8 bits per char. 2 bits per char. 8 N bits. 2 N bits. char hex binary char binary A 41 01000001 A 00 C 43 01000011 C 01 T 54 01010100 T 10 G 47 01000111 G 11 k Fixedlength code. kbit code supports alphabet of size 2 . Amazing but true. Some genomic databases in 1990s used ASCII. 7 664 CHAPTER 6 Strings Binary input and output. Most systems nowadays, including Java, base their I/O on 664 CHAPTER 6 Strings 8bit bytestreams, so we might decide to read and write bytestreams to match I/O for mats with the internal representations of primitive types, encoding an 8bit char with 1 byte, a 16bit short with 2 bytes, a 32bit int with 4 bytes, and so forth. Since bit streams are the primary abstraction for data compression, we go a bit further to allow Binary input and output. Most systems nowadays, including Java, base their I/O on clients to read and write individual bits, intermixed with data of various types (primi 8bit bytestreams, so we might decide to read and write bytestreams to match I/O for Reading and writing binary data tive types and String). The goal is to minimize the necessity for type conversion in mats with the internal representations of primitive types, encoding an 8bit char with client programs and also to take care of operatingsystem conventions for representing 1 byte, a 16bit short with 2 bytes, a 32bit int with 4 bytes, and so forth. Since bit data.We use the following API for reading a bitstream from standard input: streams are the primary abstraction for data compression, we go a bit further to allow Binary standard input and standard output. Libraries to read and write bits public class BinaryStdIn clients to read and write individual bits, intermixed with data of various types (primi tive types and String). The goal is to minimize the necessity for type conversion in from standard input and to standard output. boolean readBoolean() read 1 bit of data and return as a boolean value client programs and also to take care of operatingsystem conventions for representing char readChar() read 8 bits of data and return as a char value data.We use the following API for reading a bitstream from standard input: char readChar(int r) read r bits of data and return as a char value public class BinaryStdIn similar methods for byte (8 bits); short (16 bits); int (32 bits); long and double (64 bits) boolean readBoolean() read 1 bit of data and return as a boolean value boolean isEmpty() is the bitstream empty char readChar() read 8 bits of data and return as a char value close() void close the bitstream char readChar(int r) read r bits of data and return as a char value API for static methods that read from a bitstream on standard input similar methods for byte (8 bits); short (16 bits); int (32 bits); long and double (64 bits) boolean isEmpty() is the bitstream empty A key feature of the abstraction is that, in marked constrast to StdIn, the data on stan dard input is not necessarily aligned on byte boundaries. If the input stream is a single close() void close the bitstream byte, a client could read it 1 bit at a time with 8 calls to readBoolean(). The close() API for static methods that read from a bitstream on standard input method is not essential, but, for clean termination, clients should call close() to in dicate that no more bits are to be read. As with StdIn/StdOut, we use the following A key feature of the abstraction is that, in marked constrast to StdIn, the data on stan complementary API for writing bitstreams to standard output: dard input is not necessarily aligned on byte boundaries. If the input stream is a single public class BinaryStdOut byte, a client could read it 1 bit at a time with 8 calls to readBoolean(). The close() method is not essential, but, for clean termination, clients should call close() to in void write(boolean b) write the speciefi d bit dicate that no more bits are to be read. As with StdIn/StdOut, we use the following void write(char c) write the speciefi d 8bit char complementary API for writing bitstreams to standard output: void write(char c, int r) write the r least signicfi ant bits of the speciefi d char public class BinaryStdOut similar methods for byte (8 bits); short (16 bits); int (32 bits); long and double (64 bits) void write(boolean b) write the speciefi d bit close() void close the bitstream void write(char c) write the speciefi d 8bit char API for static methods that write to a bitstream on standard output void write(char c, int r) write the r least signicfi ant bits of the speciefi d char 8 similar methods for byte (8 bits); short (16 bits); int (32 bits); long and double (64 bits) close() void close the bitstream API for static methods that write to a bitstream on standard outputWriting binary data Date representation. Three different ways to represent 12/31/1999. A character stream (StdOut) A character stream (StdOut) A character stream (StdOut) StdOut.print(month + "/" + day + "/" + year); StdOut.print(month + "/" + day + "/" + year); StdOut.print(month + "/" + day + "/" + year); 00110001001100100010111100110111001100010010111100110001001110010011100100111001 00110001001100100010111100110111001100010010111100110001001110010011100100111001 1 2 / 3 1 / 1 9 9 9 00110001001100100010111100110111001100010010111100110001001110010011100100111001 80 bits 1 2 / 3 1 / 1 9 9 9 1 2 / 3 1 / 1 9 9 9 80 bits Three ints (BinaryStdOut) 80 bits Three ints (BinaryStdOut) Three ints (BinaryStdOut) BinaryStdOut.write(month); BinaryStdOut.write(month); BinaryStdOut.write(day); BinaryStdOut.write(month); BinaryStdOut.write(day); BinaryStdOut.write(year); BinaryStdOut.write(day); BinaryStdOut.write(year); BinaryStdOut.write(year); 000000000000000000000000000011000000000000000000000000000001111100000000000000000000011111001111 000000000000000000000000000011000000000000000000000000000001111100000000000000000000011111001111 12 31 1999 000000000000000000000000000011000000000000000000000000000001111100000000000000000000011111001111 96 bits 12 31 1999 96 bits 12 31 1999 96 bits Two chars and a short (BinaryStdOut) A 4bit fi eld, a 5bit fi eld, and a 12bit fi eld (BinaryStdOut) Two chars and a short (BinaryStdOut) A 4bit fi eld, a 5bit fi eld, and a 12bit fi eld (BinaryStdOut) Two chars and a short (BinaryStdOut) A 4bit fi eld, a 5bit fi eld, and a 12bit fi eld (BinaryStdOut) BinaryStdOut.write((char) month); BinaryStdOut.write(month, 4); BinaryStdOut.write((char) month); BinaryStdOut.write(month, 4); BinaryStdOut.write((char) day); BinaryStdOut.write(day, 5); BinaryStdOut.write((char) month); BinaryStdOut.write(month, 4); BinaryStdOut.write((char) day); BinaryStdOut.write(day, 5); BinaryStdOut.write((short) year); BinaryStdOut.write(year, 12); BinaryStdOut.write((char) day); BinaryStdOut.write(day, 5); BinaryStdOut.write((short) year); BinaryStdOut.write(year, 12); BinaryStdOut.write((short) year); BinaryStdOut.write(year, 12); 00001100000111110000011111001111 110011111011111001111000 00001100000111110000011111001111 110011111011111001111000 00001100000111110000011111001111 110011111011111001111 12 31 000 1999 12 31 1999 32 bits 21 bits ( + 3 bits for byte alignment at close) 12 31 1999 12 31 1999 12 31 1999 12 31 1999 32 bits 21 bits ( + 3 bits for byte alignment at close) 32 bits 21 bits ( + 3 bits for byte alignment at close) Four ways to put a date onto standard output Four ways to put a date onto standard output Four ways to put a date onto standard output 9 628 CHAPTER 5 Strings to open a file with an edi public class BinaryDump tor or view it in the manner you view text files (or just public static void bits(String args) run a program that uses int width = Integer.parseInt(args0); BinaryStdOut), you are int cnt; likely to see gibberish, de for (cnt = 0; BinaryStdIn.isEmpty(); cnt++) pending on the system you use. BinaryStdIn allows if (cnt width == 0) StdOut.println(); us to avoid such system de if (BinaryStdIn.readBoolean()) pendencies by writing our StdOut.print("1"); own programs to convert else StdOut.print("0"); bitstreams such that we can StdOut.println(cnt + " bits"); see them with our standard tools. For example, the pro gram BinaryDump at left is a BinaryStdIn client that Printing a bitstream on standard (character) output prints out the bits from standard input, encoded with the characters 0 and 1. This program is useful for debug ging when working with small inputs. We use a slightly more complicated version that just prints the count when the width argument is 0 (see Exercise 5.5.X). The similar Binary dumps client HexDump groups the data into 8bit bytes and prints each as two hexadecimal digits that each represent 4 bits. The client PictureDump displays the bits in a Picture. You can download HexDump and PictureDump from the booksite. Typically, we use pip Q. How to examine the contents of a bitstream ing and redirection at the commandline level when working with binary files: we can pipe the output of an encoder to BinaryDump, HexDump, or PictureDump, or redirect it to a file. Standard character stream Bitstream represented with hex digits more abra.txt java HexDump 4 abra.txt ABRACADABRA 41 42 52 41 43 41 44 41 42 52 41 21 Bitstream represented as 0 and 1 characters 12 bytes java BinaryDump 16 abra.txt 0100000101000010 Bitstream represented as pixels in a Picture 0101001001000001 java PictureDump 16 6 abra.txt 0100001101000001 0100010001000001 16by6 pixel 0100001001010010 window, magnified 6.5 Data Compression 667 0100000100100001 96 bits 96 bits Four ways to look at a bitstream ASCII encoding. When you HexDump a bit 0 1 2 3 4 5 6 7 8 9 A B C D E F stream that contains ASCIIencoded charac NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI 0 ters, the table at right is useful for reference. DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US 1 Given a 2digit hex number, use the first hex SP 2 “ ‘ ( ) + , . / digit as a row index and the second hex digit 3 0 1 2 3 4 5 6 7 8 9 : ; = as a column reference to find the character 4 A B C D E F G H I J K L M N O that it encodes. For example, 31 encodes the 5 P Q R S T U V W X Y Z \ digit 1, 4A encodes the letter J, and so forth. 6 ` a b c d e f g h i j k l m n o This table is for 7bit ASCII, so the first hex DEL 7 p q r s t u v w x y z digit must be 7 or less. Hex numbers starting Hexadecimal to ASCII conversion table with 0 and 1 (and the numbers 20 and 7F) 10 correspond to nonprinting control charac ters. Many of the control characters are left over from the days when physical devices like typewriters were controlled by ASCII input; the table highlights a few that you might see in dumps. For example SP is the space character, NUL is the null character, LF is linefeed, and CR is carriagereturn. In summary, working with data compression requires us to reorient our thinking about standard input and standard output to include binary encoding of data. BinaryStdIn and BinaryStdOut provide the methods that we need. They provide a way for you to make a clear distinction in your client programs between writing out information in tended for file storage and data transmission (that will be read by programs) and print ing information (that is likely to be read by humans). Universal data compression US Patent 5,533,051 on "Methods for Data Compression", which is capable of compression all files. Slashdot reports of the Zero Space Tuner™ and BinaryAccelerator™. “ ZeoSync has announced a breakthrough in data compression that allows for 100:1 lossless compression of random data. If this is true, our bandwidth problems just got a lot smaller.… ” 11Universal data compression Proposition. No algorithm can compress every bitstring. U Pf 1. by contradiction U Suppose you have a universal data compression algorithm U that can compress every bitstream. U Given bitstring B , compress it to get smaller bitstring B . 0 1 . . Compress B to get a smaller bitstring B . 1 2 . Continue until reaching bitstring of size 0. Implication: all bitstrings can be compressed to 0 bits U Pf 2. by counting U Suppose your algorithm that can compress all 1,000bit strings. 1000 2 possible bitstrings with 1,000 bits. U 998 999 Only 1 + 2 + 4 + … + 2 + 2 can be encoded with ≤ 999 bits. 499 Similarly, only 1 in 2 bitstrings can be encoded with ≤ 500 bits Universal data compression 12Undecidability java RandomBits java PictureDump 2000 500 1000000 bits A difficult le t fi o compress: one million (pseudo) random bits public class RandomBits public static void main(String args) int x = 11111; for (int i = 0; i 1000000; i++) x = x 314159 + 218281; BinaryStdOut.write(x 0); BinaryStdOut.close(); 13Rdenudcany in Enlgsih lnagugae Q. How mcuh rdenudcany is in the Enlgsih lnagugae “ ... randomising letters in the middle of words has little or no effect on the ability of skilled readers to understand the text. This is easy to denmtrasote. In a pubiltacion of New Scnieitst you could ramdinose all the letetrs, keipeng the first two and last two the same, and reibadailty would hadrly be aftcfeed. My ansaylis did not come to much beucase the thoery at the time was for shape and senqeuce retigcionon. Saberi's work sugsegts we may have some pofrweul palrlael prsooscers at work. The resaon for this is suerly that idnetiyfing coentnt by paarllel prseocsing speeds up regnicoiton. We only need the first and last two letetrs to spot chganes in meniang. ” — Graham Rawlinson A. Quite a bit. 145.5 DATA COMPRESSION introduction ‣ runlength coding ‣ Huffman compression ‣ Algorithms LZW compression ‣ ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.eduRunlength encoding Simple type of redundancy in a bitstream. Long runs of repeated bits. 0000000000000001111111000000011111111111 40 bits Representation. 4bit counts to represent alternating runs of 0s and 1s: 15 0s, then 7 1s, then 7 0s, then 11 1s. 16 bits (instead of 40) 1111011101111011 15 7 7 11 Q. How many bits to store the counts A. We'll use 8 (but 4 in the example above). Q. What to do when run length exceeds max count A. If longer than 255, intersperse runs of length 0. Applications. JPEG, ITUT T4 Group 3 Fax, ... 16Runlength encoding: Java implementation public class RunLength maximum runlength count private final static int R = 256; private final static int lgR = 8; number of bits per count public static void compress() / see textbook / public static void expand() boolean bit = false; while (BinaryStdIn.isEmpty()) int run = BinaryStdIn.readInt(lgR); read 8bit count from standard input for (int i = 0; i run; i++) BinaryStdOut.write(bit); write 1 bit to standard output bit = bit; BinaryStdOut.close(); pad 0s for byte alignment 175.5 DATA COMPRESSION introduction ‣ runlength coding ‣ Huffman compression ‣ Algorithms LZW compression ‣ ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.edu David HuffmanVariablelength codes Use different number of bits to encode different chars. Ex. Morse code: • • • − − − • • • Issue. Ambiguity. SOS V7 IAMIE EEWNI In practice. Use a medium gap to separate codewords. codeword for S is a prefix of codeword for V 19Variablelength codes Q. How do we avoid ambiguity A. Ensure that no codeword is a prefix of another. Trie representation Codeword table key value 00 11 101 Ex 1. Fixedlength code. A 0 A 00 11 B 1111 Ex 2. Append special stop char to each codeword. C 110 00 11 00 11 D 100 Ex 3. General prefixfree code. D R C 1110 00 11 R B Compressed bitstring 011111110011001000111111100101 30 bits A B RA CA DA B RA Trie representation Codeword table Trie representation Codeword table key value key value 00 11 101 00 11 101 A 0 A 11 A 00 11 B 1111 00 11 00 11 B 00 C 110 C 010 B A 00 11 00 11 D 100 00 11 00 11 D 100 D R C 1110 R 011 C R D 00 11 R B Compressed bitstring Compressed bitstring 11000111101011100110001111101 29 bits 30 bits 011111110011001000111111100101 A B R A C A D A B R A A B RA CA DA B RA Two prefi xfree codes 20 Trie representation Codeword table key value 00 11 101 A 11 00 11 00 11 B 00 C 010 B A 00 11 00 11 D 100 R 011 C D R Compressed bitstring 11000111101011100110001111101 29 bits A B R A C A D A B R A Two prefi xfree codesPrefixfree codes: trie representation Q. How to represent the prefixfree code A. A binary trie T Trie r rie repr epresen esenta tation tion C Code odew wor ord table d table Chars in leaves. ke key y v value alue 0000 1111 101 101 Codeword is path from root to leaf. AA 00 AA 0000 1111 BB 1111 1111 CC 110 110 0000 1111 0000 1111 DD 100 100 DD RR CC 1110 1110 0000 1111 RR BB C Compr ompressed bitstring essed bitstring 011111110011001000111111100101 011111110011001000111111100101 30 b 30 bits its A B RA CA DA B RA A B RA CA DA B RA T Trie r rie repr epresen esenta tation tion C Code odew wor ord table d table T Trie r rie repr epresen esenta tation tion C Code odew wor ord table d table ke key y v value alue ke key y v value alue 0000 1111 101 101 0000 1111 101 101 AA 00 AA 11 11 AA 0000 1111 BB 1111 1111 0000 1111 0000 1111 BB 00 00 CC 110 110 CC 010 010 BB AA 0000 1111 0000 1111 DD 100 100 0000 1111 0000 1111 DD 100 100 DD RR CC 1110 1110 RR 011 011 CC RR DD 0000 1111 RR BB C Compr ompressed bitstring essed bitstring C Compr ompressed bitstring essed bitstring 11000111101011100110001111101 11000111101011100110001111101 29 b 29 bits its 30 b 30 bits its 011111110011001000111111100101 011111110011001000111111100101 A B R A C A D A B R A A B R A C A D A B R A A B RA CA DA B RA A B RA CA DA B RA T Tw wo pr o pre efifi x xfr free c ee codes odes 21 T Trie r rie repr epresen esenta tation tion C Code odew wor ord table d table ke key y v value alue 0000 1111 101 101 AA 11 11 0000 1111 0000 1111 BB 00 00 CC 010 010 BB AA 0000 1111 0000 1111 DD 100 100 RR 011 011 CC DD RR C Compr ompressed bitstring essed bitstring 11000111101011100110001111101 11000111101011100110001111101 29 b 29 bits its A B R A C A D A B R A A B R A C A D A B R A T Tw wo pr o pre efifi x xfr free c ee codes odesPrefixfree codes: compression and expansion Compression. Method 1: start at leaf; follow path up to the root; print bits in reverse. Trie representation Codeword table Method 2: create ST of keyvalue pairs. key value 00 11 101 A 0 A 00 11 B 1111 Expansion. C 110 00 11 00 11 D 100 Start at root. D R C 1110 00 11 Go left if bit is 0; go right if 1. R B Compressed bitstring If leaf node, print char and return to root. 011111110011001000111111100101 30 bits A B RA CA DA B RA Trie representation Codeword table Trie representation Codeword table key value key value 00 11 101 00 11 101 A 0 A 11 A 00 11 B 1111 00 11 00 11 B 00 C 110 C 010 B A 00 11 00 11 D 100 00 11 00 11 D 100 D R C 1110 R 011 C R D 00 11 R B Compressed bitstring Compressed bitstring 11000111101011100110001111101 29 bits 30 bits 011111110011001000111111100101 A B R A C A D A B R A A B RA CA DA B RA Two prefi xfree codes 22 Trie representation Codeword table key value 00 11 101 A 11 00 11 00 11 B 00 C 010 B A 00 11 00 11 D 100 R 011 C D R Compressed bitstring 11000111101011100110001111101 29 bits A B R A C A D A B R A Two prefi xfree codesHuffman coding overview Dynamic model. Use a custom prefixfree code for each message. Compression. Read message. Built best prefixfree code for message. How Write prefixfree code (as a trie) to file. Compress message using prefixfree code. Expansion. Read prefixfree code (as a trie) from file. Read compressed message and expand using trie. 23Huffman trie node data type private static class Node implements ComparableNode private final char ch; // used only for leaf nodes private final int freq; // used only for compress private final Node left, right; public Node(char ch, int freq, Node left, Node right) this.ch = ch; initializing constructor this.freq = freq; this.left = left; this.right = right; public boolean isLeaf() is Node a leaf return left == null right == null; compare Nodes by frequency public int compareTo(Node that) (stay tuned) return this.freq that.freq; 24Prefixfree codes: expansion public void expand() read in encoding trie Node root = readTrie(); read in number of chars int N = BinaryStdIn.readInt(); for (int i = 0; i N; i++) Node x = root; th expand codeword for i char while (x.isLeaf()) if (BinaryStdIn.readBoolean()) x = x.left; else x = x.right; BinaryStdOut.write(x.ch, 8); BinaryStdOut.close(); Running time. Linear in input size N. 25Prefixfree codes: how to transmit Q. How to write the trie A. Write preorder traversal of trie; mark leaf and internal nodes with a bit. preorder private static void writeTrie(Node x) 11 traversal if (x.isLeaf()) 22 A 33 55 BinaryStdOut.write(true); BinaryStdOut.write(x.ch, 8); 44 D R B return; C BinaryStdOut.write(false); leaves writeTrie(x.left); A D C R B writeTrie(x.right); 01010000010010100010001000010101010000110101010010101000010 11 2233 44 55 internal nodes Using preorder traversal to encode a trie as a bitstream Note. If message is long, overhead of transmitting trie is small. 26 Prefixfree codes: how to transmit Q. How to read in the trie A. Reconstruct from preorder traversal of trie. preorder private static Node readTrie() 11 traversal if (BinaryStdIn.readBoolean()) 22 A 33 55 char c = BinaryStdIn.readChar(8); return new Node(c, 0, null, null); 44 D R B Node x = readTrie(); C Node y = readTrie(); leaves return new Node('\0', 0, x, y); A D C R B 01010000010010100010001000010101010000110101010010101000010 arbitrary value (value not used with internal nodes) 11 2233 44 55 internal nodes Using preorder traversal to encode a trie as a bitstream 27 ShannonFano codes Q. How to find best prefixfree code ShannonFano algorithm: Partition symbols S into two subsets S and S of (roughly) equal freq. 0 1 Codewords for symbols in S start with 0; for symbols in S start with 1. 0 1 Recur in S and S . 0 1 char freq encoding char freq encoding A 5 0... B 2 1... C 1 0... D 1 1... R 2 1... S = codewords starting with 0 0 1 1... S = codewords starting with 1 1 Problem 1. How to divide up symbols Problem 2. Not optimal 28Huffman algorithm demo char freq encoding Count frequency for each character in input. A 5 B 2 C 1 D 1 R 2 1 input A B R A C A D A B R A Huffman algorithm demo char freq encoding A 5 0 B 2 1 1 1 C 1 1 0 1 1 D 1 1 0 0 R 2 1 1 0 1 1 0 1 0 0 1 A 0 1 0 1 0 1 D R B 0 1 CHuffman codes Q. How to find best prefixfree code Huffman algorithm: Count frequency freqi for each char i in input. Start with one node corresponding to each char i (with weight freqi). Repeat until single trie formed: – select two tries with min weight freqi and freqj – merge into single trie with weight freqi + freqj Applications: 31Constructing a Huffman encoding trie: Java implementation private static Node buildTrie(int freq) MinPQNode pq = new MinPQNode(); for (char i = 0; i R; i++) initialize PQ with singleton tries if (freqi 0) pq.insert(new Node(i, freqi, null, null)); while (pq.size() 1) merge two smallest tries Node x = pq.delMin(); Node y = pq.delMin(); Node parent = new Node('\0', x.freq + y.freq, x, y); pq.insert(parent); not used for total frequency two subtries return pq.delMin(); internal nodes 32Huffman encoding summary Proposition. Huffman 1950s Huffman algorithm produces an optimal prefixfree code. no prefixfree code Pf. See textbook. uses fewer bits Implementation. Pass 1: tabulate char frequencies and build trie. Pass 2: encode file by traversing trie or lookup table. Running time. Using a binary heap ⇒ N + R log R . input alphabet size size Q. Can we do better stay tuned 335.5 DATA COMPRESSION introduction ‣ runlength coding ‣ Huffman compression ‣ Algorithms LZW compression ‣ ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.edu Abraham Lempel Jacob ZivStatistical methods Static model. Same model for all texts. Fast. Not optimal: different texts have different statistical properties. Ex: ASCII, Morse code. Dynamic model. Generate model based on text. Preliminary pass needed to generate model. Must transmit the model. Ex: Huffman code. Adaptive model. Progressively learn and update model as you read text. More accurate modeling produces better compression. Decoding must start from beginning. Ex: LZW. 35LZW compression demo input AA B R A C A D A B R A B R A B R A B R A C A D A B R A B R A B R A matches A B R A C A D A B R A B R A B R A value 41 42 52 41 43 41 44 81 83 82 88 41 80 LZW compression for A B R A C A D A B R A B R A B R A key value key value key value ⋮ ⋮ AB 81 DA 87 A 41 BR 82 ABR 88 B 42 RA 83 RAB 89 C 43 AC 84 BRA 8A D 44 CA 85 ABRA 8B ⋮ ⋮ AD 86 codeword table 36LempelZivWelch compression LZW compression. Create ST associating Wbit codewords with string keys. Initialize ST with codewords for singlechar keys. Find longest string s in ST that is a prefix of unscanned part of input. Write the Wbit codeword associated with s. longest prefix match Add s + c to ST, where c is next char in the input. Q. How to represent LZW compression code table A. A trie to support longest prefix match. 41 42 43 44 52 A B R C D B 81 84 86 R 82 85 87 83 C D A A A R 88 8A B 89 A 8B A 37LZW expansion demo value 41 42 52 41 43 41 44 81 83 82 88 41 80 output A B R A C A D A B R A B R A B R A LZW expansion for 41 42 52 41 43 41 44 81 83 82 88 41 80 key value key value key value ⋮ ⋮ 81 AB 87 DA 41 A 82 BR 88 ABR 42 B 83 RA 89 RAB 43 C 84 AC 8A BRA 44 D 85 CA 8B ABRA ⋮ ⋮ 86 AD codeword table 38LZW expansion LZW expansion. key value ⋮ ⋮ Create ST associating string values with Wbit keys. 65 A Initialize ST to contain singlechar values. 66 B Read a Wbit key. 67 C Find associated string value in ST and write it out. 68 D Update ST. ⋮ ⋮ 129 AB 130 BR Q. How to represent LZW expansion code table 131 RA W A. An array of size 2 . 132 AC 133 CA 134 AD 135 DA 136 ABR 137 RAB 138 BRA 139 ABRA ⋮ ⋮ 39LZW tricky case: compression input A B A B A B A A B A B A B A matches A B A B A B A value 41 42 81 83 80 LZW compression for ABABABA key value key value ⋮ ⋮ AB 81 A 41 BA 82 B 42 ABA 83 C 43 D 44 ⋮ ⋮ codeword table 40LZW tricky case: expansion value 41 42 81 83 80 need to know which key has value 83 output A B A B A B A before it is in ST LZW expansion for 41 42 81 83 80 key value key value ⋮ ⋮ 81 AB 41 A 82 BA 42 B 83 ABA 43 C 44 D ⋮ ⋮ codeword table 41LZW implementation details How big to make ST How long is message Whole message similar model many other variations What to do when ST fills up Throw away and start over. GIF Throw away when not effective. Unix compress many other variations Why not put longer substrings in ST many variations have been developed 42LZW in the real world LempelZiv and friends. LZ77. LZ77 not patented widely used in open source LZW patent 4,558,302 expired in U.S. on June 20, 2003 LZ78. LZW. Deflate / zlib = LZ77 variant + Huffman. 43LZW in the real world LempelZiv and friends. LZ77. LZ78. LZW. Deflate / zlib = LZ77 variant + Huffman. Unix compress, GIF, TIFF, V.42bis modem: LZW. zip, 7zip, gzip, jar, png, pdf: deflate / zlib. iPhone, Sony Playstation 3, Apache HTTP server: deflate / zlib. 44Lossless data compression benchmarks year scheme bits / char 1967 ASCII 7.00 1950 Huffman 4.70 1977 LZ77 3.94 1984 LZMW 3.32 1987 LZH 3.30 1987 movetofront 3.24 1987 LZB 3.18 1987 gzip 2.71 1988 PPMC 2.48 1994 SAKDC 2.47 1994 PPM 2.34 next programming assignment 1995 BurrowsWheeler 2.29 1997 BOA 1.99 1999 RK 1.89 data compression using Calgary corpus 45Data compression summary Lossless compression. Represent fixedlength symbols with variablelength codes. Huffman Represent variablelength symbols with fixedlength codes. LZW Lossy compression. not covered in this course JPEG, MPEG, MP3, … FFT, wavelets, fractals, … n X Theoretical limits on compression. Shannon entropy: H(X)= p(x )lgp(x ) i i i Practical compression. Use extra knowledge whenever possible. 46