AUGUST MEETING


Text of Talk Outline


for

HOW TO MEASURE INFORMATION


by

Markus Metzler




How to Measure Information?

During the last few decades huge technical changes have taken place.
We can buy many new technical items, such as radios, TVs, computers, or mobile phones, that were not yet available at the beginning of the century. We can use the Internet to send email within minutes to people at the other end of the world. Or we can look at pictures taken by satellites 'close' (still some 10,000 miles) to other planets. The signal that brings the 'pictures' from these satellites is about 1,000,000 times weaker than the noise that is received with it.
How can these photos be reproduced?

All these new things have in common that they are related to information. So people began to talk about the 'information age'. As we live at the beginning of the information age, it might be interesting to understand a little more about the aspects and concepts of information theory.

History
Hartley recognized in 1928 that information depends upon the probabilities of the different outcomes of events.
Claude E. Shannon, an American electrical engineer, founded information theory with his 1948 publication 'A Mathematical Theory of Communication'.

Information
First, two definitions to clarify the discussion:
An event is something that happens. Example: tomorrow's weather.
An outcome is the result of an event. Example: tomorrow it is raining, or tomorrow it is sunny.
An event has several different outcomes, but no outcome belongs to several events (in terms of information theory).

Information has different values for different people. It might have an emotional or subjective value. If one owns some stock of company X and none of company Y, it is usually not very interesting to know the value of stock Y. Likewise, we are not very interested in the weather in Zaire (Africa).

Number of different possible outcomes of an event
It isn't very hard to guess the correct number out of five, but it is very difficult to guess correctly a number out of 20,000,000.

Dependency of outcomes between each other
In the English language the sequence 'ty' appears much more often than the sequence 'yt'.

Probability of a certain outcome
Some letters are much more likely than others.
(P(t) = 0.105 means the probability of the letter 't' in the English language is 10.5%.)

English language:
P(e) = 0.131
P(t) = 0.105
P(a) = 0.086
...
P(z) = 0.00077

In Spanish or French the probabilities are different. The average probability for a particular letter is 1/26 = 0.0385.

Uncertainty
H(U) describes the uncertainty of the event U. For example, U might be the toss of a coin, whose two possible outcomes are head and tail. An outcome is denoted by u, so u is either head (u1) or tail (u2).

The mathematical formula to calculate the uncertainty of an event U with two possible outcomes is:

H(U) = - p1*log2(p1) - p2*log2(p2)
where
p1, p2 stand for the probabilities of the two different outcomes u1 and u2.
log2(.) is the logarithm to the base 2. The base-2 logarithm is rarely found on calculators, but it can be computed as log2(x) = log(x) / log(2).
The symbol H for uncertainty comes from thermodynamics, where it denotes entropy; the same letter is used simply because it is the same formula.
Of course the equation
p2 = 1 - p1

holds because we said that only two different outcomes are possible and the sum of the probabilities is always 1.
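As a small illustration, this formula can be typed into a computer. The following is a minimal Python sketch (the function name is chosen just for this outline):

import math

def binary_uncertainty(p1):
    # Uncertainty H(U) in bits for an event with two outcomes.
    # p1 is the probability of the first outcome, p2 = 1 - p1.
    # A term with probability 0 contributes nothing to the sum.
    h = 0.0
    for p in (p1, 1.0 - p1):
        if p > 0:
            h -= p * math.log2(p)
    return h

print(binary_uncertainty(0.5))   # fair coin: 1.0 bit
print(binary_uncertainty(0.8))   # coin that prefers one side: about 0.72 bit
print(binary_uncertainty(1.0))   # a certain outcome: 0.0 bit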

The following graph shows the value of the uncertainty as a function of the probability of one of the two possible outcomes of the event U. For example, tossing a 'good' coin, which gives head or tail each with a probability of 50%, we find an uncertainty of 1 bit. But when we use a 'bad' coin, which prefers 'tail' with a probability of 80%, we find an uncertainty of only about 0.7 bit. We are certain that yesterday it was not raining cats: the probability of raining cats is zero, and so is the uncertainty. Each coin toss is independent of the previous ones (the coin has no memory).

When the probability of occurrence is very low or very high, the uncertainty H(U) is very low, whereas the uncertainty is highest when the probability is approximately 0.5. It is very unlikely that it snows in the summer (in California), so our uncertainty is very small.

The measure of information is symmetric about the vertical axis through the probability 0.5. This follows from changing the sentence 'It is very unlikely that it snows in the summer.' into the sentence 'It is very likely that it does not snow in the summer.' When there are two possible outcomes, we get the same amount of information whether we are told that it is this outcome or that it is not the other outcome.

The general formula for M possible outcomes is:

H(U) = -( p_1*log2(p_1) + p_2*log2(p_2) + ... + p_M*log2(p_M) )

and

p_1 + p_2 + ... + p_M = 1

holds, where p_m is the probability of the m-th outcome and m = 1, 2, 3, ..., M.
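The general formula also translates into a short Python sketch (again, the function name is just for this outline):

import math

def uncertainty(probabilities):
    # H(U) in bits for an event with M possible outcomes.
    # The probabilities must sum to 1; outcomes with probability 0 are skipped.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(uncertainty([0.5, 0.5]))      # the fair coin again: 1.0 bit
print(uncertainty([1/6] * 6))       # a fair die with six outcomes: about 2.58 bits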

Knowing the outcome of an event X can reduce the uncertainty of another event Y. For example, knowing the season (event X) reduces our uncertainty about the temperature outside (event Y): in the summer it is hotter than in the winter. This relationship is often called dependency: Y is dependent on or independent of X. In information theory one often writes H(Y|X), which means the uncertainty of the outcome Y given the outcome X, or in other words the uncertainty of the outcome Y when the outcome X is known.

Mutual Information
The formula for mutual information is:

I(Y;X) = H(Y) - H(Y|X)

where
I(Y;X) means the mutual information that the outcome of the event X gives about the outcome of another event Y.
Due to the rules of statistics, Y gives the same amount of information about X as X does about Y. Therefore we talk about mutual information.

We see that information reduces the uncertainty!

We watch the weather forecast X to reduce our uncertainty about tomorrow's weather Y. The weather forecast is not always right, but it is right very often. Because tomorrow's weather and the numbers of the lottery are independent, we do not draw conclusions about the weather from the lottery numbers. In other words, the lottery does not reduce our uncertainty about the weather.
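To make the formula concrete, here is a minimal Python sketch. The joint probabilities of forecast and actual weather are invented purely for illustration; only the relationship I(Y;X) = H(Y) - H(Y|X) is taken from the text.

import math

def uncertainty(probabilities):
    # H in bits; outcomes with probability 0 are skipped.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Hypothetical joint probabilities P(X = forecast, Y = actual weather),
# made up only to illustrate the formula.
joint = {
    ('forecast rain', 'rain'): 0.35,
    ('forecast rain', 'sun'):  0.05,
    ('forecast sun',  'rain'): 0.10,
    ('forecast sun',  'sun'):  0.50,
}

# Marginal probabilities of Y (the weather) and of X (the forecast).
p_y, p_x = {}, {}
for (x, y), p in joint.items():
    p_y[y] = p_y.get(y, 0.0) + p
    p_x[x] = p_x.get(x, 0.0) + p

h_y = uncertainty(p_y.values())

# H(Y|X): the uncertainty of Y for each forecast, averaged over the forecasts.
h_y_given_x = 0.0
for x, px in p_x.items():
    conditional = [p / px for (x2, y), p in joint.items() if x2 == x]
    h_y_given_x += px * uncertainty(conditional)

print(h_y, h_y_given_x, h_y - h_y_given_x)   # H(Y), H(Y|X), I(Y;X)

With these made-up numbers the forecast removes roughly 0.4 bit of the roughly 1 bit of uncertainty about tomorrow's weather.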

If all 26 letters of the alphabet had the same probability and the letters were independent of each other, the information in one letter would be 4.7 bit. But with the given probabilities the average information per letter is only 4.15 bit.
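The 4.7 bit figure is simply the base-2 logarithm of 26, which a one-line Python check confirms:

import math

print(math.log2(26))   # about 4.70 bits per letter if all 26 letters were equally likely
# With the real, unequal letter probabilities the average is lower (the 4.15 bit quoted above);
# computing it requires the full letter-frequency table, which is not reproduced here.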

Coding

Coding is just finding another representation, usually without loss of information. That means we are able to reconstruct the original message.

Huffman Coding
(Huffman codes are prefix-free codes: no codeword is the beginning of another one.)
We have a little alphabet that we want to write using only 0s and 1s without losing information. Further, because we don't like writing very much, we want to keep the total number of 0s and 1s (digits) for the whole text small.

Given the following alphabet and probabilities:
P(a) = 0.1
P(b) = 0.005
P(c) = 0.2
P(d) = 0.3
P(e) = 0.005
P(f) = 0.1
P(g) = 0.01
P(h) = 0.03
P(i) = 0.25
___________
P(all) = 1.0

We could say that we write each letter with four digits:
a = 0000
b = 0001
c = 0010
d = 0011
e = 0100
f = 0101
g = 0110
h = 0111
i = 1000

Then our average would be four digits per letter, and we could reconstruct the original message. A sequence of digits is called a codeword, so all our codewords have length four.
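A minimal Python sketch of this fixed-length scheme (the dictionary simply copies the table above):

# Fixed-length code: every letter becomes exactly four digits.
FIXED = {
    'a': '0000', 'b': '0001', 'c': '0010', 'd': '0011', 'e': '0100',
    'f': '0101', 'g': '0110', 'h': '0111', 'i': '1000',
}

def encode_fixed(text):
    return ''.join(FIXED[letter] for letter in text)

def decode_fixed(digits):
    # Cut the string into groups of four digits and look each group up.
    reverse = {codeword: letter for letter, codeword in FIXED.items()}
    return ''.join(reverse[digits[k:k + 4]] for k in range(0, len(digits), 4))

print(encode_fixed('dice'))                  # '0011100000100100'
print(decode_fixed('0011100000100100'))      # 'dice'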

But we can do better using the probabilities. We assign the less likely letters longer codewords and the more likely letters shorter codewords. That is what the Huffman coding scheme does.

Algorithm (a small Python sketch follows the steps below):
1. Make a list of the letters and their probabilities. Consider each letter with its probability as an item.
2. Link the two items with the smallest probabilities. Add the probabilities of the two items for the link. Consider the whole link (with the two items) as a new item.
3. If the probability of the new item is not 1, go to step 2.
4. The item with probability 1 is called the root.
5. Starting at the root, step through the tree towards a letter. When the branch goes upwards, write a 0; otherwise write a 1.
6. You have just written the new representation (codeword) for that letter.
7. Repeat steps 5 and 6 until all letters are done.
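Here is a minimal Python sketch of these steps, using the standard heapq module to pick the two smallest items. Ties between equal probabilities can be broken either way, so the exact codewords may differ from the table below, but the average codeword length comes out the same:

import heapq
from itertools import count

def huffman_code(probabilities):
    # Build a Huffman code for a dictionary {letter: probability}.
    tiebreak = count()   # keeps heap items comparable when probabilities are equal
    heap = [(p, next(tiebreak), {letter: ''}) for letter, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Step 2: link the two items with the smallest probabilities.
        p1, _, codes1 = heapq.heappop(heap)
        p2, _, codes2 = heapq.heappop(heap)
        # Steps 5 and 6: the upper branch contributes a 0, the lower branch a 1.
        merged = {letter: '0' + cw for letter, cw in codes1.items()}
        merged.update({letter: '1' + cw for letter, cw in codes2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]    # the root item holds a codeword for every letter

probabilities = {'a': 0.1, 'b': 0.005, 'c': 0.2, 'd': 0.3, 'e': 0.005,
                 'f': 0.1, 'g': 0.01, 'h': 0.03, 'i': 0.25}
code = huffman_code(probabilities)
average = sum(probabilities[letter] * len(cw) for letter, cw in code.items())
print(code)
print(round(average, 2))   # 2.48 digits per codeword on average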

So we get the following representations:
a = 000
b = 0011000
c = 01
d = 10
e = 0011001
f = 0010
g = 001101
h = 00111
i = 11

The average length of a codeword is now the sum over all letters of the probability times the length of the codeword:
(0.1 * 3) + (0.005 * 7) + (0.2 * 2) + (0.3 * 2) + (0.005 * 7) + (0.1 * 4) + (0.01 * 6) + (0.03 * 5) + (0.25 * 2) = 2.48 digits per codeword
which is only about 62% of the four digits per letter needed by the fixed-length code introduced first.

But can we be sure that we can decode our text? Yes, we can. What does '11001110000011000001100100101001' mean?

11 is an i
00111 is an h
000 is an a
0011000 is a b
0011001 is an e
0010 is an f
10 is a d
01 is a c

So the message was 'ihabefdc'.

Because no codeword is the beginning of another one, we can step through the tree uniquely and decode the message without losing any information.
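A minimal Python sketch of this decoding procedure, using the codeword table from above: read the digits one by one and emit a letter as soon as the digits collected so far form a codeword.

CODE = {'a': '000', 'b': '0011000', 'c': '01', 'd': '10', 'e': '0011001',
        'f': '0010', 'g': '001101', 'h': '00111', 'i': '11'}

def decode(digits, code):
    reverse = {codeword: letter for letter, codeword in code.items()}
    message, current = [], ''
    for digit in digits:
        current += digit
        if current in reverse:   # a complete codeword; the prefix property makes this unambiguous
            message.append(reverse[current])
            current = ''
    return ''.join(message)

print(decode('11001110000011000001100100101001', CODE))   # 'ihabefdc'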

Internet Sites
http://www.dartmouth.edu/artsci/engs/engs4/lecture4.html
http://canyon.ucsd.edu/infoville/schoolhouse/class_html/randy.html
http://www.inf.fu-berlin.de/~weisshuh/infwiss/papers/wersig/infthe.html

Finally, I hope that I have reduced (and not increased!) your uncertainty about information. Thank you for your attention.

August 1996
Markus Metzler
