The Story of Information Theory: from Morse to Shannon to ENTROPY

This is the story of how Claude Shannon founded the field of Information Theory and proved that entropy is the true measure of information. But Shannon's groundbreaking 1948 paper "A Mathematical Theory of Communication" has its foundations in earlier work, from Morse, Vail, Hartley and Nyquist.

*https://www.youtube.com/watch?v=rmBFaNgg4wk
**https://300.ya.ru/v_UyTU8Y13

Timecodes

00:00:01 The outbreak of war and scientific research

  • In 1939, Europe descended into total war.
  • The United States followed a "wait and see" strategy.
  • American scientists and engineers were working on ideas that could change the course of history.

00:00:25 Claude Shannon and his ideas

  • At Bell Labs, Claude Shannon spent his days looking for ways to encrypt speech and keep it secret.
  • At night he worked on an idea that would revolutionize the way we think about information.
  • The foundations of Shannon's theory were laid by earlier work, including that of Nyquist and Hartley.

00:01:11 Military innovations of the 18th century

  • In 1794, two armies clashed near the town of Fleurus.
  • The French army used a balloon for aerial reconnaissance and an optical semaphore for rapid communication.
  • The optical semaphore could signal information almost instantaneously across vast distances.

00:03:01 Claude Chappe's system

  • Claude Chappe invented a semaphore with 92 positions, each a distinct configuration of the beam and side arms.
  • The system could encode messages into 8,464 unique combinations (pairs of symbols, 92 × 92).
  • A code book was used to send important commands such as attack or retreat.

00:04:46 Ørsted's discovery

  • In 1820, Hans Christian Ørsted discovered that an electric current could deflect a magnetic needle.
  • Ørsted's discovery laid the groundwork for a new era of communication.

00:06:20 Samuel Morse and the telegraph

  • Samuel Morse returned to the United States in 1832 and began work on a telegraph system.
  • Together with Alfred Vail, he built a machine that impressed pulses onto electrical conductors.
  • The pulses encoded letters and numbers as dots and dashes.

00:09:03 Morse and Vail's demonstration

  • On May 24, 1844, Morse and Vail demonstrated the telegraph in the chambers of the United States Supreme Court.
  • The system could carry messages over long distances at almost instantaneous speed.

00:10:08 Automating telegraphy

  • By 1876, Morse's system dominated world telegraphy, but it was manual and labor-intensive.
  • Émile Baudot invented the Baudot code, which automated telegraph communication.
  • The Baudot code assigned every letter and digit a fixed five-bit code.

00:12:05 Ralph Hartley's theory of information

  • In 1927, Ralph Hartley presented a paper on quantifying information.
  • Hartley defined information as a group of physical symbols conveying an agreed meaning.
  • He argued that the amount of information depends on the total number of possible messages we select from.
  • Hartley's idea was simple but profound: the quantity of information can be measured with mathematical precision.

00:13:51 Symbol alphabets and message encoding

  • A symbol alphabet of three with two consecutive selections gives 9 possible messages.
  • Increasing the number of selections increases the number of possible messages.
  • Chappe's semaphore used 92 primary symbols, giving 92^2 possible ways to encode a message.
  • Baudot's teleprinter has 32 codes per symbol, and the number of possibilities grows as 32^n for n selections.

00:14:50 The problem of quantifying information

  • Hartley ran into the problem that the number of possible sequences grows exponentially.
  • Instead of the raw number of possible sequences, Hartley proposed using its logarithm.
  • This makes information grow linearly rather than exponentially with the number of selections (see the sketch below).

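As a quick illustration of Hartley's measure (my own sketch, not from the video): with an alphabet of S symbols and n consecutive selections there are S^n possible messages, so the information is log2(S^n) = n·log2(S) bits.

```python
import math

def hartley_information(alphabet_size: int, selections: int) -> float:
    """Hartley's measure: the logarithm of the number of possible messages.

    With S symbols and n consecutive selections there are S**n possible
    messages, so the measure n * log2(S) grows linearly with n rather
    than exponentially.
    """
    return selections * math.log2(alphabet_size)

print(hartley_information(92, 2))   # Chappe's semaphore: ~13.05 bits (log2 of 8,464)
print(hartley_information(32, 10))  # Baudot's teleprinter, 10 selections: 50 bits
```
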
00:16:03 The link between discrete symbols and analog signals

  • Nyquist made the connection between discrete symbols and analog signals by studying the pulses of the telegraph system.
  • He showed that as a signal travels through a channel, higher frequencies are stripped away, depending on the channel's bandwidth.
  • The bandwidth of a channel limits how many signals can be transmitted per second.

00:17:12 Claude Shannon's contribution to information theory

  • In 1937, Claude Shannon submitted a master's thesis that became the foundation of digital electronics.
  • Shannon applied Boole's mathematical algebra to electrical circuits, laying the groundwork for digital logic and computers.
  • In 1948, Shannon proposed the ideas that revolutionized how we think about information.

00:18:16 The structure of a communication system according to Shannon

  • Shannon identified five components of a communication system: the information source, the transmitter, the channel, the receiver, and the destination.
  • The information source generates a sequence of messages, and the channel carries the signal.
  • Shannon stressed that the channel does not itself convey information; it is only a means of transporting information from the source.

00:20:35 Channel capacity

  • Shannon showed how to calculate the capacity of an arbitrary channel.
  • The capacity depends on how many sequences can be constructed from the available symbols.
  • Example with the Baudot teleprinter: the capacity is 5 bits per symbol (see the sketch below).

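A minimal sketch of the capacity calculation for a simple fixed-alphabet channel like Baudot's (my own illustration; the symbol rate below is an assumed example number):

```python
import math

def capacity_per_symbol(alphabet_size: int) -> float:
    """Capacity of a channel whose symbols are all freely usable: log2 of the alphabet size."""
    return math.log2(alphabet_size)

def capacity_per_second(alphabet_size: int, symbols_per_second: float) -> float:
    """Over time, capacity is bits per symbol times symbols per second."""
    return capacity_per_symbol(alphabet_size) * symbols_per_second

print(capacity_per_symbol(32))      # Baudot teleprinter: 5.0 bits per symbol
print(capacity_per_second(32, 10))  # at an assumed 10 symbols/s: 50.0 bits per second
```
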
00:21:34 The statistical nature of the information source

  • Shannon observed that an information source is a statistical process, not a random generator of symbols.
  • Each newly generated symbol depends on the previous symbols, which makes some sequences more likely than others.
  • Hartley's approach counted all possible sequences, but Shannon introduced a new perspective that accounts for how likely different sequences are.

00:22:33 Modeling the information source as a Markov process

  • Shannon modeled the information source as a Markov process, in which transitions between states happen randomly but according to fixed probabilistic rules (see the sketch below).
  • Any information source has this kind of statistical behavior, containing patterns, repetition, and redundancy.
  • Shannon's idea became a fundamental pillar of information theory, data compression, cryptography, machine learning, and other fields.

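A minimal sketch of an information source modeled as a Markov process (my own toy example, not Shannon's): the next symbol is drawn at random, but its probabilities depend on the current state, producing the patterns and redundancy described above.

```python
import random

# Hypothetical two-state source: the transition probabilities are fixed,
# but each individual transition is random.
TRANSITIONS = {
    "A": {"A": 0.8, "B": 0.2},  # after an A, another A is likely
    "B": {"A": 0.4, "B": 0.6},
}

def generate(length: int, state: str = "A") -> str:
    out = []
    for _ in range(length):
        out.append(state)
        states, probs = zip(*TRANSITIONS[state].items())
        state = random.choices(states, weights=probs)[0]
    return "".join(out)

print(generate(40))  # e.g. "AAAAABBBAAAABBBB..." — runs and repetition, not pure noise
```
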
00:23:44 Information entropy

  • Shannon introduced information entropy as a measure of uncertainty resolved.
  • Entropy is maximized when the outcome probabilities are equal and drops as the probabilities become skewed.
  • Entropy is the weighted sum of the surprise of each possible outcome, weighted by how likely that outcome is (see the sketch below).

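A small sketch of entropy as the probability-weighted average of surprise (my own illustration, using the four-color source from the coding example later in the video and a biased coin):

```python
import math

def surprise(p: float) -> float:
    """Surprise (self-information) of an outcome with probability p, in bits."""
    return -math.log2(p)

def entropy(probs: list[float]) -> float:
    """Weighted sum of each outcome's surprise, weighted by its probability."""
    return sum(p * surprise(p) for p in probs if p > 0)

print(entropy([0.5, 0.25, 0.125, 0.125]))  # four-color source: 1.75 bits per symbol
print(entropy([0.5, 0.5]))                 # fair coin: 1.0 bit per flip
print(entropy([0.8, 0.2]))                 # biased coin: ~0.72 bits — entropy drops
```
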
00:26:40 Why entropy matters for representing information

  • Entropy places a fundamental lower bound on how efficiently information can be represented.
  • This makes it possible to use the channel as efficiently as possible when transmitting information.

00:27:03 Introduction to coding

  • Consider an information source with four colors: red 50%, blue 25%, white and black 12.5% each.
  • A code must be assigned to each color so that it can be sent over a communication channel.
  • The question: which code should each color get to give the most efficient coding scheme?

00:27:38 Coding methods

  • A fixed-length code requires 2 bits per symbol.
  • A variable-length code lets the code length match the surprise of each symbol.
  • The optimal coding scheme assigns shorter code words to more frequent symbols (see the sketch below).

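A sketch of the fixed-length versus variable-length comparison (my own illustration; the specific code words below are assumptions, chosen so that each length matches the symbol's surprise, Huffman-style):

```python
# Four-color source and a hypothetical prefix-free code whose word lengths
# match each symbol's surprise (1, 2, 3 and 3 bits).
SOURCE = {"red": 0.5, "blue": 0.25, "white": 0.125, "black": 0.125}
VARIABLE_CODE = {"red": "0", "blue": "10", "white": "110", "black": "111"}

fixed_bits = 2  # a fixed-length code for four symbols always needs 2 bits
variable_bits = sum(p * len(VARIABLE_CODE[s]) for s, p in SOURCE.items())

print(fixed_bits)     # 2 bits per symbol
print(variable_bits)  # 1.75 bits per symbol, equal to the entropy of this source
```
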
00:28:36 The job of the coder

  • The coder must reshape the output of the source into a bitstream that is statistically close to the most efficient possible use of the channel.
  • Shannon showed that an information source can be compressed down to its entropy, but going below the entropy means losing information.

00:29:35 More complex information sources

  • The example of the English language shows that the statistical structure of a language only appears when large chunks of text are analyzed.
  • Letters are not independent, and some combinations are far more likely than others (see the sketch below).

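A rough sketch (my own, with an assumed text sample) of a zeroth-order entropy estimate for text: counting single-letter frequencies already gives a value below log2(26), and longer-range structure such as grammar and context would push the true entropy lower still.

```python
import math
from collections import Counter

def per_letter_entropy(text: str) -> float:
    """Entropy of the single-letter frequency distribution of `text`, in bits."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

sample = "the statistical structure of the language only shows up in larger chunks of text"
print(math.log2(26))               # ~4.70 bits if all 26 letters were equally likely
print(per_letter_entropy(sample))  # lower, because the letters are not equally frequent
```
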
00:30:13 Shannon's proof

  • Shannon proved that any source can be compressed down to its entropy by modeling long sequences.
  • With a large number of selections, the proportion of each color converges to its expected value (see the simulation sketch below).

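A minimal simulation sketch (my own) of the convergence described above: the more symbols the source emits, the closer the observed fraction of each color gets to its true probability.

```python
import random
from collections import Counter

COLORS = ["red", "blue", "white", "black"]
PROBS = [0.5, 0.25, 0.125, 0.125]

def observed_fractions(num_samples: int) -> dict:
    draws = random.choices(COLORS, weights=PROBS, k=num_samples)
    counts = Counter(draws)
    return {color: counts[color] / num_samples for color in COLORS}

for n in (10, 100, 10_000, 1_000_000):
    print(n, observed_fractions(n))  # fractions approach 0.5 / 0.25 / 0.125 / 0.125
```
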
00:32:16 Entropy and compression

  • In the limit, as the number of selections approaches infinity, the information source boils down to its entropy.
  • With fewer than H bits per symbol, there is no guarantee that every probable sequence can be encoded.

00:33:45 The effect of noise on information transmission

  • Noise in the communication channel creates uncertainty that affects the transmission of information.
  • Shannon introduced conditional entropy to account for the information missing because of noise.

00:35:11 The coin example

  • The coin-flip example shows that, because of noise, less information is conveyed than expected.
  • Conditional entropy captures the information that is missing because of the noise (see the sketch below).

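A sketch of the numbers in the coin example (my own calculation, treating the "10% lie" as a binary symmetric channel with crossover probability 0.1):

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy of a two-outcome distribution with probabilities p and 1 - p, in bits."""
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

source_entropy = binary_entropy(0.5)       # 1.0 bit per flip
conditional_entropy = binary_entropy(0.1)  # ~0.47 bits of ambiguity added by the noise
rate = source_entropy - conditional_entropy

print(conditional_entropy)  # ~0.469 bits per flip
print(rate)                 # ~0.53 bits of information actually conveyed per flip
```
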
00:36:06 The party example

  • At a noisy party, a message carrying 10 bits of information may be narrowed down to only 7 bits because of the noise.
  • The uncertainty introduced by noise limits how much information can be conveyed.

00:36:45 Communicating in the presence of noise

  • Over noisy channels, the receiver can never be certain exactly which message was transmitted.
  • The system must be designed properly to reach the maximum capacity while still avoiding errors.

00:37:48 Shannon's claim about data transmission

  • Shannon claimed that with the right coding, data can be transmitted at the channel capacity with practically no errors.
  • An information source with entropy H generates a set of probable sequences whose number grows exponentially.
  • Mapping these sequences onto the channel also grows the number of possible output sequences.

00:38:45 The noisy-channel problem

  • In a noisy channel there is more than one plausible input that could have produced a given output.
  • Shannon proposed reducing the source rate to R, where R is less than the channel capacity.
  • This reduces the number of sequences the source generates compared with the number of available channel output sequences.

00:39:36 Coding efficiency

  • Even with a small reduction in the source rate, the ratio between the numbers of sequences becomes vanishingly small.
  • This leaves more than enough channel sequences to encode the messages.
  • With a long enough block length, the probability of a received sequence being interpretable as more than one message drops to nearly zero (see the sketch below).

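A small sketch (my own, with assumed example numbers) of why even a slight rate reduction helps: the source produces roughly 2^(R·t) probable sequences while the channel offers roughly 2^(C·t), so their ratio 2^(-(C-R)·t) shrinks toward zero as the block length grows.

```python
# Assumed example: channel capacity C = 1.0 bit per use, source rate R = 0.9 bits per use.
C, R = 1.0, 0.9

for t in (10, 100, 1_000, 10_000):
    ratio = 2.0 ** (-(C - R) * t)  # source sequences / channel sequences
    print(t, ratio)                # becomes vanishingly small as t grows
```
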
00:40:21 The Hartley–Shannon theorem

  • Shannon extended the concept of entropy to continuous signals and provided a way to calculate the capacity of continuous channels.
  • The Hartley–Shannon theorem unites the ideas of Nyquist, Hartley, and Shannon, defining the ultimate speed limit for reliable communication.
  • The equation gives the capacity of a channel constrained by power and corrupted by noise (see the sketch below).

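For reference: as I understand it, the closing result is the familiar Shannon–Hartley capacity formula, C = B·log2(1 + S/N). A tiny sketch with assumed example numbers for a band-limited, power-constrained channel:

```python
import math

def shannon_hartley_capacity(bandwidth_hz: float, snr_linear: float) -> float:
    """C = B * log2(1 + S/N): the ultimate rate for reliable communication."""
    return bandwidth_hz * math.log2(1 + snr_linear)

# Assumed example: a 3,000 Hz telephone-style channel with a 30 dB SNR (S/N = 1000)
print(shannon_hartley_capacity(3_000, 1_000))  # ~29,900 bits per second
```
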
In this video

0:02
1939, Europe has descended into total war. Across the Atlantic, the strategy
0:10
in the United States is to wait and see. But behind the scenes, American
0:16
scientists and engineers were working on ideas that would not only help win the war, but they would change the course of
0:24
history. At Bell Labs, a young engineer fresh out of MIT, Claude Shannon, spent
0:31
his days finding ways to encrypt speech to keep it hidden from the enemy. But at
0:36
night, he was working on his own idea, one that would go far beyond the war
0:43
effort. It was an idea that would revolutionize the very way we think
0:48
about information. But Shannon’s theory had its foundations in the work of earlier pioneers
0:55
including his colleagues at Bell Labs, Nyquist and Hartley and even further in
1:01
the intellectual revolutions of the Renaissance and industrial age.
1:11
[Music]
1:49
It is the summer of 1794. Two mighty armies are about to
1:54
clash near the town of Fleurus. On one side, the Austro-Dutch
2:00
coalition stand for monarchy and the preservation of the aristocracy. On the other side, the
2:07
revolutionary French army see themselves as liberators, spreading democracy and
2:13
liberty. The technology of warfare saw two major innovations that day. The first, the
2:20
invading French army used a balloon for aerial reconnaissance, a first in
2:26
battle. And as the battle raged, the French deployed the second. It would
2:32
normally take about 24 hours for a message to be sent from this distance
2:38
back to Paris, but the French had a trick up their sleeves. This was the
2:43
first military application of a new type of long-range communication system called
2:49
the optical semaphore. The invention was a shape-shifting structure which could
2:55
signal information almost instantaneously across vast distances.
3:01
As the musket smoke cleared, Europe had entered a new era of warfare, one where
3:07
rapid communication shaped the outcome of battle. The key innovation was
3:12
invented by Claude Chappe. His semaphore could be positioned into 92
3:18
distinct positions. Each a separate configuration of the beam and sidearms.
3:25
These would be called symbols. And into each symbol could be encoded one of 92
3:31
possible messages, words, sentences, anything the coder desired. But Chappe’s
3:38
system went one step further. Instead of coding into the 92 primary symbols, he
3:46
coded his message into a set of two consecutive symbols. This now gave
3:52
92 × 92, or 8,464, unique
3:59
combinations. This larger set would be known as secondary symbols.
4:05
The system used a code book. The first primary symbol told the receiver the
4:10
page and the second the index on the page. To the French military, important
4:17
commands like attack or retreat would be coded to a particular secondary symbol
4:23
as would every other word or sentence in the code book. Chappe’s system was a
4:29
revelation and soon his semaphore chains had spread across France. Chappe had
4:34
started something that would not be fully resolved for 150
4:46
years. In 1820, a discovery was made that would fundamentally transform
4:52
communication. In his laboratory in Copenhagen, Hans
4:57
Christian Ørsted observed that an electric current could deflect a magnetic needle,
5:02
revealing the connection between electricity and magnetism. Remarkably,
5:08
Ørsted found this effect was repeatable. He used a primitive battery,
5:13
a conductor, and a small compass. When he closed the circuit, the needle
5:19
deflected from its normal north south position. When he turned the current off, the magnetic needle returned to its
5:27
usual position. Ørsted’s findings would send shock waves around Europe. And this
5:34
breakthrough laid the groundwork for a revolution. Unlike Claude Chappe’s semaphore, a slow, clunky system of
5:40
visual symbols that could be easily intercepted, electricity offered a new possibility, sending information as
5:47
pulses through hidden conductors, traveling far beyond the reach of the human eye. Ørsted’s discovery ushered in a
5:56
new age of communication.
6:20
Samuel Morse had been in Europe since 1829. Now in 1832, it was time to go
6:26
home. On the packet ship Sully, Morse entered a discussion with Charles Thomas Jackson, a physician and scientist
6:34
well-versed in the emerging field of electromagnetism. Jackson noted to Morse that electricity travels through wires at
6:41
near instantaneous speeds. Morse, an artist by trade, immediately recognized
6:46
the possibilities. Jackson also remarked that a new device capable of detecting
6:51
electric currents had been invented in Europe. The idea was planted in Morse’s mind, and when he returned to the United
6:58
States, he immediately got to
7:06
work. January the 11th, 1838, a bitterly cold chill filled the
7:13
air in Morristown’s Speedwell Iron Works. It would be here that the first
7:18
public demonstration of Morse’s new telegraph system would be held. By now,
7:24
Morse had help in the form of business partner Alfred Vail. Whilst Morse was an artist and a visionary, Vail was an
7:31
engineer, a machinist, a practical problem solver. Between them, they
7:36
invented a machine which could impress pulses onto electrical conductors. These pulses would travel for tens or hundreds
7:44
of miles. Vail assigned a sequence of dots, single pulses, and dashes, double
7:49
pulses, to encode each letter and number. But Morse and Vail were not the
7:54
only show in town. They had competition. In Great Britain, Cooke and Wheatstone
7:59
had their own invention. A telegraph system of five magnetic needles and six
8:04
wires. Pulses were sent in parallel. The operator at one end would press the keys
8:10
and within moments, the needles at the other end would swing into position,
8:16
spelling out the message letter by letter. But in its simplest form, Morse’s system only needed a single
8:21
wire. It was cheaper and easier to manufacture, but it could only send one pulse at a time. But out of this
8:28
limitation, Vail hit on something that would later be fundamental to information theory. Vail studied the
8:35
statistics of the English language. In particular, the relative frequency of each letter. He noted that letters like
8:41
E were far more frequent than others. For example, the letter zed. From this, he assigned shorter codes to more
8:48
frequent letters and longer codes to less frequent letters. Overall, this meant fewer pulses on average were
8:55
required to send messages. Morse and Vail had no mathematical basis for doing this. It was pure intuition, but it
9:03
worked. On May 24th, 1844, Morse and Vail conducted another
9:09
public demonstration, except this time it would rock the foundations of
9:14
America. Professor Morse, seated amidst a hushed gathering of distinguished national
9:21
leaders in the chambers of the United States Supreme Court in Washington, tapped out a message on a device of cogs
9:27
and coiled wires. 40 miles away in Baltimore, Alfred Vail received the
9:33
electric signals and sent the message back. For thousands of years, messages have been limited by the speed at which
9:39
messengers could travel and the distance at which eyes could see signals. Neither Alexander the Great nor Benjamin
9:45
Franklin knew anything faster than a galloping horse. Now instant long-range
9:51
communication had become a reality.
10:07
[Music] By 1876, Morse’s system had dominated world telegraphy. But it was a manual,
10:15
labor-intensive enterprise. Thousands of skilled operators were necessary to process the sequence of clicks into
10:22
words and numbers. Émile Baudot was a French engineer who saw an opportunity
10:27
to automate telegraphy. He invented an encoding scheme called Baudot code.
10:32
Rather than using a variable length code, he assigned each letter or number to a fixed 5-bit code capable of sending
10:41
one of 32 possibilities. Looking back through the lens of information theory,
10:46
we realize that Baudot’s code is less efficient because it fails to consider the redundancy in the information
10:53
source. No matter how infrequent, every letter or number is given the same
10:58
five-bit code. But what Baudot gave us is a way to automate. His teleprinter
11:04
seamlessly impressed the necessary five bits onto the line just with the click of a keyboard. The receiver also
11:11
automatically read the signal back. This was the seeds of the modern digital age.
11:26

The globe is now covered in a patchwork of lines and cables. Morse and
11:32
Baudot had provided near instantaneous global connectivity. But there was something missing. There was as yet no
11:40
theory to quantify information, still no way to put bounds on the limits of
11:45
information capacity, and still no grasp of how noise or bandwidth shaped what a
11:52
wire or wave could carry. The analytical framework of information theory was
11:57
ready to be built, and two men in the late 1920s would lay the foundations.
12:05
In September 1927, Ralph Hartley, an employee of Bell Labs, made his way to
12:11
the International Congress of Telegraphy and Telephony in Benito Mussolini’s Italy.
12:16
On the shores of Lake Como, Hartley presented a paper which claimed to quantify information. It would go down
12:23
as one of the seminal works of information theory. Hartley claimed to have an objective measure of information
12:31
removed from what he called the elastic manner in which information had previously been defined. He defined the
12:38
communication problem: a group of physical symbols which convey a certain
12:43
meaning as assigned by the designer of the system. This might be the visual semaphores of Chappe or the dots and
12:50
dashes of Morse code. But now Hartley considered what happens if we send
12:56
symbols consecutively. The total number of symbol sequences Hartley said becomes
13:02
greater the more symbols we send. Every time we select an additional symbol we
13:08
select one message from an ever growing array of possible messages. For example,
13:15
apples are red. These three words represent one precise specific message
13:23
from an exponentially growing set of possible messages. In the same way, if
13:28
we send consecutive symbols, each new selection grows the total number of
13:33
possible messages exponentially. Hartley’s idea was simple but profound.
13:40
The quantity of information is bound to the total possible messages we select
13:45
from. And this said Hartley can be quantified with mathematical precision.
13:52
Let’s say we have a symbol alphabet of three and two consecutive
13:57
selections. There are 3 to the power of two or nine possible messages. But with
14:04
three selections, there are now 27. Four selections would give 81 possible ways
14:12
to encode a message. Consider Chappe’s semaphore. He had 92 primary symbols to
14:18
select from. But recall that he used two consecutive symbols. This gives 92 × 92
14:24
possible ways to encode a message. Each new consecutive selection in Chappe’s
14:30
system gave a factor of 92 increase. In Baudot’s teleprinter, there are 32
14:36
ways to encode a message. But if n selections are made, the number of possibilities increases by 32 to the
14:43
power of n. Each new selection broadens the total number of possible sequences
14:48
we can use to assign messages to. And when we select one message from a larger set, the information content, said
14:55
Hartley, is greater. But Hartley immediately hit a problem. Imagine
15:00
Chappe decided to use a code book with 10 consecutive messages instead of two.
15:07
Now there would be 92 to the power of 10 possible ways to encode a message. This
15:13
is greater than the number of stars in the observable universe. If this is our
15:19
measure of information, it scales up very, very, very quickly. In his attempt
15:25
to quantify information, Hartley had observed a monster that grows exponentially out of control. For this
15:32
reason, Hartley made a suggestion. Rather than use the raw number of possible sequences, it is far more
15:39
sensible to quantify information as its logarithm. This means as the number of selections goes up, information
15:46
increases linearly, not
15:52
[Music]
16:02
exponentially. Hartley talked about sending messages comprised of sequences
16:07
of symbols. But to send a symbol, we need a signal. Something which
16:14
represents the symbol we’re trying to send. Nyquist was the first to make the connection between discrete symbols and
16:21
analog signals. He looked at the pulses of a telegraph system, sequences of ones
16:27
and zeros. But he went further. He decomposed these pulses into their
16:33
sinusoidal components. And he realized something fundamental. As a signal
16:39
travels through a channel, higher frequencies get stripped away. The extent to which this happens depends on
16:46
the bandwidth of a channel. The problem is that your signal becomes distorted.
16:52
Nyquist realized that there is a connection between how many signaling events or symbols can be transmitted per
16:59
second and the bandwidth of the channel. And this places a fundamental limit on
17:04
how many signals can be sent per
17:11
second. In 1937, a 21-year-old from a small town in Michigan submitted his
17:18
master’s dissertation at MIT. Many now regard this thesis as the birth of
17:24
digital electronics and the greatest master’s thesis in all of engineering
17:30
history. Claude Shannon had showed how the mathematical algebra of bool, the logic
17:36
of ones and zeros, true and false, could be applied to electrical circuits, and
17:42
it was a blueprint for digital logic, the foundation of computers and all
17:48
modern digital systems. Shannon went on to receive his PhD degree in theoretical genetics, and
17:55
later took up a position at Bell Labs. It was here that he would spend his days
18:00
contributing to wartime cryptography, but at night he was working on a side
18:08
[Music]
18:15
project. In 1948, Claude Shannon proposed a set of ideas that would
18:21
revolutionize the way we think about information. But his starting point was indeed the observation of Hartley. A
18:28
measure of information is the number of messages in the set we select from all choices being equally likely. And
18:35
Shannon also agreed with Hartley that the logarithm was the most natural way to measure information. He proposed that
18:42
the base 2 is the most natural measurement of information. This is what we know as a bit. Then Shannon defined
18:49
the problem. A communication system is made up of an information source which
18:55
produces a sequence of messages. A transmitter which operates on the messages to produce the signal like the
19:02
pulses of a telegraph. The medium or the channel which is used to transmit the signal. A receiver which performs the
19:09
inverse operation of the transmitter to reconstruct the original sequence of messages. And finally the destination
19:16
the intended recipient of the message. Hartley had provided the first real
19:22
quantitative measure of information. He said that the amount of information transfer is the log of the total number
19:30
of available sequences from which that message is selected using the base 2. This means Chappe’s semaphore has a rate
19:37
of 6 1/2 bits per symbol and Baudot’s teleprinter can send five bits per
19:42
symbol. But Shannon realized that the channel by itself does not convey information. A position of Chappe’s
19:48
semaphore or a particular sequence of ones and zeros in Baudot’s teleprinter does not in itself convey a message. It
19:56
simply provides a way to transport information from the information source. This might seem
20:02
counterintuitive. Look at how Chappe and Baudot allocated their messages to channel signals. Every message was
20:09
allocated to every channel signal. But Shannon showed something remarkable. The information source itself has a rate,
20:16
something innate, an essence of the source rooted in its own behavior. By separating the information
20:24
source and the channel, Shannon had made the first step towards a number of
20:29
groundbreaking [Music]
20:34
discoveries. As a preamble, Shannon showed how the capacity of an arbitrary
20:39
channel can be calculated. Unlike the information source, the channel capacity is simply a function of how many
20:46
sequences can be constructed from the available symbols. And we’ve already seen that the capacity of the Baudot
20:52
teleprinter is 5 bits per symbol. So over time, the capacity is 5n bits per
20:59
second, where n is the transmitted symbols per second. But Shannon also
21:04
showed how the capacity of Morse’s telegraph system can be calculated. This
21:09
one is different because we’re dealing with symbols of different lengths and restrictions on the kind of sequences we
21:16
can have. Nonetheless, Shannon showed that the capacity is the limit as t
21:21
tends to infinity of the log of the allowable sequences divided by time. But
21:27
the impact of Shannon’s next section would be monumental.
21:34
He turned his attention to the information source and he asked a question nobody had asked before. At
21:41
what rate is the source generating information? This is separate to asking
21:46
what capacity a channel has. What Shannon saw that Hartley did not is an information source that is not a random
21:53
generator of symbols but a statistical process. Each new symbol generated is
21:59
not independent of the previous symbols. When we count sequences, some are more
22:04
likely than others. Some sequences are so unlikely that they might as well never occur at all. This is most easily
22:12
observed in human language. There are, of course, a huge number of ways we can arrange words, but only a small subset
22:19
of these possible sequences are commonly used. Hartley’s approach was to count
22:24
all possible sequences. But Shannon introduced a new perspective, one that considered the likelihood of different
22:31
sequences occurring. He then developed a model, a mathematical machine capable of
22:36
producing symbols and generating information. He modeled an information source as a Markov process where the
22:44
source is represented as a complex web of interconnected states that transition
22:49
between each other randomly but according to set probabilistic rules. If
22:54
this was a language, these rules would be dictated by the normal structure of that language. But any information
23:00
source has this kind of statistical behavior. It contains patterns, repetition, and redundancy. Out of
23:07
Shannon’s machine came symbols. But how can we measure how much information is being generated? Shannon produced an
23:14
idea that would become a fundamental pillar of not just information theory but data compression, cryptography,
23:21
machine learning, artificial intelligence, economics and neuroscience. His idea was simple.
23:28
Information is best measured as the uncertainty resolved, not just the
23:33
number of possible choices.
23:40
[Music] Imagine you flip a fair coin. Hartley
23:47
had said that the information is the log of the possible choices, which here is
23:52
one bit of information per flip. You can keep flipping the coin and each new
23:57
event will give one extra bit of information. Now consider if we make the
24:02
coin biased, so there’s an 80% chance of heads. According to Hartley, there are
24:08
still two possible outcomes, and the information is still one bit per flip of
24:14
the coin. But is this correct? Shannon realized that the measure of choice gets
24:20
smaller as the probabilities of the outcomes become more skewed. In reality,
24:25
information is the log of a smaller number of effective choices. Shannon
24:31
called this new measure of information entropy.
24:45
To calculate entropy, Shannon calculated the surprise of each new selection from
24:51
an information source. This quantifies how surprised we are that this symbol has been chosen or in other words, how
24:59
much uncertainty that selection has resolved. For example, if we have four
25:04
options and if we select the symbol with a 50% probability, this is one bit of
25:10
surprise. Even though this is within a symbol space of four, selecting the 50% option provides the same amount of
25:18
information as flipping a coin. But the 12 1/2% symbol provides three bits of
25:24
surprise because it’s more unlikely. So more uncertainty is being resolved. For
25:29
the flip of a fair coin, the surprise is always one bit. But what happens if the
25:35
coin is biased? Of course, the more likely option is chosen most often, but
25:40
the surprise is relatively low. But every so often, the unlikely event happens. The surprise is greater. But
25:47
what Shannon showed is entropy is the average value of this surprise and it
25:52
maximizes when the probabilities are equal. Entropy is the weighted sum of
25:57
the surprise values of each possible outcome weighted by how likely each
26:03
outcome is. Consider again an event with two possible outcomes with probability P
26:09
and Q. If this was an information source, its entropy would be related to how balanced these probabilities are. If
26:16
one outcome is certain, entropy is zero. No information is gained if we learn the outcome. But if the probabilities are
26:22
balanced, this conveys the maximum amount of information and the entropy is
26:28
at its highest. However, if the probabilities are skewed, the entropy
26:33
inevitably drops. But why does this matter? We still need to encode every possible symbol into the channel, right?
26:40
Is entropy just a philosophical and academic measure of what we perceive as
26:45
information? What Shannon did next is prove that entropy places a fundamental lower bound on how efficiently we can
26:52
represent information and therefore make best use of the
27:02
channel. Suppose I have an information source which is represented here as four
27:08
colors. So, there’s a 50% chance I’ll select red. 25% chance I’ll select blue
27:14
and a 12% chance I’ll select white or black. I’m going to put these inside a
27:19
bag and I’ll select randomly. So, we need to assign a code to each one of these symbols to convey this over a
27:27
channel. So, the question is which code should we assign to each of the colors to get the most efficient coding scheme?
27:34
and it’s going to be a random selection. When we think about coding
27:40
these symbols, we might follow the advice of Baudot and assign a fixed code. In this case, we require two bits
27:48
to describe all four possible outcomes. Of course, the average number of bits per symbol will always be two bits. But
27:55
can we do better? Let’s try a variable length code. And the trick is to assign
28:01
a code length which matches the surprise of each symbol. So when red is selected, there’s a 50% chance. So there’s one bit
28:09
of surprise. With blue, a 25% chance, two bits of surprise. Matching code
28:14
length to surprise gives us the optimum coding scheme. This is the direction of
28:19
course that Morse and Vail were heading all those years ago. The key is assigning shorter code words to more
28:26
frequent symbols. In this experiment, we achieved an average of 1.78 bits per
28:32
symbol, which is getting close to the entropy of the source, 1.75 bits. But if we carried on the experiment, we would
28:38
actually converge on this value. The job of the coder is to reshape the output of the source into a bitstream whose
28:45
structure is statistically close to the most efficient possible use of the channel, which is when symbols are
28:51
encoded near their surprise level. Shannon said that doing this will allow you to maximize the number of symbols
28:58
which can be conveyed and make full use of the channel capacity. In the previous example, if our channel capacity was
29:05
let’s say 14 bits per second, we could send seven symbols per second with the
29:11
bad encoding scheme. But if we use the good encoding scheme, we could send eight symbols per second. In general,
29:18
Shannon said we can compress an information source down to its entropy. But if we go lower, we are almost
29:24
guaranteed to lose information. But we need to be careful because not every information source is so easy to encode.
29:32
The example we just looked at is a very special case. It works out neatly because the symbols are independent and
29:38
follow a simple probability distribution. But what about something more complex like the English language?
29:44
In this case, we can get close to the true entropy if we encode over long sequences of text. That’s because the
29:51
statistical structure of the language like grammar, word choice, and context only really show up when we look at
29:57
larger chunks. Letters aren’t independent, and some combinations are far more likely than others. So, how did
30:04
Shannon prove his theory was correct in all cases, that you can compress information down to its entropy in the
30:11
general case? How did Shannon prove that any source can be compressed down to its
30:18
entropy? To demonstrate, I’ll start with the simple information source we used previously. Remember, there’s a 50%
30:24
chance we’ll pick a red, 25% chance we’ll pick a blue, and 12% chance we’ll pick a black or a white. I’m going to
30:32
pick at random again and try to generate long sequences. But because I don’t have
30:39
the patience to do it for real, I’ve written a piece of code to simulate this information source. And I’m going to
30:45
simulate lots of these information sources simultaneously. So each will generate its own long sequence. Shannon
30:54
noticed that something interesting happens if we make the number of selections very large. Can you see that
31:00
the percentage of red, blue, black, and white begins to converge to its expected
31:06
value, 50% red, 25% blue, and so on. This happens for all of the information
31:14
sources we simulated. Even though the sequences produced by our sources are all
31:20
different, they belong to the same typical set. And as the number of selections approaches infinity, this
31:27
probable set is all we are left with. A small set from an astronomically large
31:34
possible number of sequences. And the number of bits required to specify this
31:39
probable set. It is the entropy per symbol of the information
31:45
source. Hartley had taught us that information is the log of the number of
31:51
choices. Shannon had taken the sequence to infinity and showed the true number of choices is specified by the entropy.
32:00
And Shannon said the same result applies for any source. So even if we simulate a
32:07
complex source like a Markov chain, it will also eventually boil down to its
32:13
own set of typical sequences and it too can be specified by its entropy. In the
32:20
limit as we approach infinity, Shannon had proved that the information source
32:25
boils down to its entropy, the essential essence of the source. Any fewer than H
32:31
bits per symbol and we cannot guarantee that every probable sequence
32:36
can be encoded. Now Shannon’s proof relies on taking a number of sequences which
32:43
approaches infinity but it shows the limit of
32:48
compression. You cannot compress any better than the entropy.
32:54
Shannon had introduced a new method of quantifying information entropy and he
33:01
showed that any information source can be compressed down to its entropy. This
33:08
allows communication with the minimum number of bits per symbol and the most
33:14
efficient use of the channel capacity. But there’s a problem. Every
33:20
practical channel is noisy and it means that there can never be a guarantee that
33:26
the bits you receive are the bits that were sent. What Shannon did next would be the
33:32
most audacious part of his paper and its implications would reverberate decades
33:39
into the
33:44
future. Imagine I flip a fair coin.
33:50
heads or tails. When I tell you the result, you gain one bit of information.
33:55
But now, let’s say I lie 10% of the time. So, if I flip a heads, I’ll tell
34:01
you it’s tails. And if I flip a tails, I’ll tell you it’s a heads, but only 10%
34:07
of the time. How much information am I now conveying? You might be tempted to
34:13
say, well, if 90% of the time you’re telling the truth, maybe I’m still
34:18
getting 0.9 bits of information, 9 bits out of 10. But that’s not right because
34:25
you don’t know when I’m lying. That uncertainty is actually baked into every
34:31
message. How did Shannon account for this and adjust the rate accordingly?
34:37
Consider a communication system with an observer capable of seeing both what is
34:42
sent and what is received. The observer notes the errors and then communicates
34:48
them over its own channel, a correction channel. The easiest way to do this is
34:53
to send a stream of bits with one indicating the position of an error and zero otherwise. But these bits could
35:01
occur anywhere. All we know is that the frequency of ones for a long message
35:06
will be 10% or whatever the bit error rate happens to be. Shannon modeled this
35:12
as a communication channel in its own right. And he realized that the capacity
35:17
of the correction channel coded down to its entropy is the missing information at
35:24
the receiver due to the additional uncertainty of the noise. He called this
35:30
conditional entropy. It is the average ambiguity of the received signal. In our
35:36
example of the coin, the entropy of the information source starts as one bit per
35:42
flip, but the uncertainty due to the noise is 0.47 bits per flip. Therefore, the
35:50
rate of information is actually 0.53 bits per flip. Shannon’s
35:57
hypothetical correction channel shows how the conditional entropy represents the missing information due to the
36:05
noise. As an example, imagine being at a loud party and your friend says a word
36:13
and let’s say that word is selected from 2 to the power of 10 possible words. So 10
36:19
bits of information is contained within that message. But because of the noise,
36:24
you can only narrow it down to one of eight possible words. That means there’s
36:29
still three bits of uncertainty. But the information you gained is the narrowing
36:35
of 2 to the 10 possibilities down to 2 to the 3 possibilities. So seven bits of
36:41
information has been transmitted. When noise is present,
36:46
there is an inherent uncertainty at the receiver and this means there could be
36:51
multiple possible messages for each received sequence. This is not practical
36:57
for communication over noisy channels because the receiver can never be sure
37:03
about what the transmitted message actually was. If we think back to Chappe’s
37:08
semaphore, imagine the problems they might face if the weather is foggy. It
37:14
becomes difficult to see what symbol is actually being sent. But you might be able to reasonably narrow it down to a
37:21
limited set of possibilities. In such a case, you might be tempted to limit the
37:26
rate of transfer by restricting the number of symbols that can be sent, effectively separating the symbols
37:33
according to the ambiguity. One might consider the capacity of this channel to be the maximum possible rate given the
37:40
uncertainty of the fog. But the system needs to be designed properly to reach this capacity while still avoiding
37:47
errors. Shannon was now ready to make his most audacious claim yet. He said it
37:54
was possible with the right coding to transmit at the capacity of the channel
38:00
with close to zero errors. Imagine a source X producing information at the
38:07
rate of entropy H of X. Let’s assume this source has reached the channel
38:13
capacity. Over time, this source produces around 2^(H·t) probable
38:19
sequences. As Shannon showed, if a source keeps generating sequences, eventually you end up with just a set of
38:27
typical sequences and the number of them grows exponentially at a rate defined by
38:34
the entropy. If we now allocate these sequences to the channel, the number of possible channel output sequences also
38:41
grows roughly at the exponential rate. The trick, as we’ve seen, is to assign
38:48
messages to channel inputs using a code book mapping each message to a channel
38:54
sequence. But here’s the problem. In a noisy channel, we’re guaranteed to have more than one plausible input that could
39:02
produce a given output. That is, more than one reasonable cause for what the
39:07
receiver sees. So Shannon tried something different. He reduced the rate
39:12
of the source to R where R is now less than the channel capacity. Now because
39:19
the source is generating information more slowly, it produces fewer sequences
39:24
compared to the number of available channel output sequences. It’s
39:30
interesting to see what happens when we simulate the rate of growth of sequences of R and the channel sequences. Even
39:36
with a small reduction, given enough time, the ratio becomes vanishingly
39:41
small, which means there are now a huge number of channel sequences available to encode messages. And this means the
39:48
channel has more than enough room to spread the messages out to keep them far
39:54
apart. Shannon then assigned messages at random to channel sequences and he
40:01
realized even though the channel is noisy, the chance that a received sequence could be reasonably interpreted
40:09
as more than one message drops to close to zero given a long enough block
40:15
length. [Music]
40:21
Shannon’s final trick was to extend the concept of entropy to continuous signals
40:27
and in doing so he provided a way to calculate the capacity of continuous
40:33
channels. This culminated in the Hartley–Shannon theorem, a unifying expression
40:39
that brought together the insights of Nyquist, Hartley, and Shannon himself. The
40:45
result was a single equation that defines the capacity of a channel constrained by power and corrupted by
40:52
noise. The ultimate speed limit for reliable communication
41:01
[Music]
