01L – Градиентный спуск и алгоритм обратного распространения

*https://www.youtube.com/watch?v=BYXQiHJkYBA
**https://300.ya.ru/v_FJ8E192n

таймкоды

00:00:00 Введение в глубокое обучение

  • Глубокое обучение использует парадигму обучения с учителем.
  • Машина обучается на примерах пар «вход-выход».
  • Пример: распознавание изображений автомобилей и самолётов.

00:00:54 Применение обучения с учителем

  • Используется для распознавания речи, изображений, лиц, генерации подписей, перевода.
  • На обучение с учителем приходится 95% современных применений машинного обучения.
  • Упоминаются две другие парадигмы: обучение с подкреплением и самообучение.

00:01:54 Традиционная модель распознавания образов

  • Необработанный сигнал преобразуется в представление с помощью извлекателя признаков.
  • Представление подаётся на вход обучаемому классификатору.
  • Глубокое обучение заменяет ручной процесс создания извлекателя признаков стеком обучаемых модулей.

00:02:50 Принцип глубокого обучения

  • Стек обучаемых модулей преобразует входные данные в более высокий уровень абстракции.
  • Система обучается целиком от начала до конца.
  • Алгоритм обратного распространения ошибки — основа глубокого обучения.

00:03:30 Параметризованные модели

  • Параметризованная модель — это функция g от x и w, где w — набор параметров.
  • Входы и выходы модели являются многомерными массивами тензорами.
  • Параметры w настраиваются во время обучения.

00:05:28 Функция затрат

  • Функция затрат вычисляет меру несоответствия между желаемым выходом и выходом модели.
  • Пример: линейная регрессия, где вычисляется квадрат разности между y и y^.
  • Обучение заключается в минимизации функции затрат, усреднённой по обучающему набору.

00:07:22 Система обозначений

  • Наблюдаемые переменные обозначены сероватыми кружками.
  • Детерминированные функции — фигурами со скруглёнными углами.
  • Функции затрат — красными квадратами.

00:09:08 Оптимизация функции потерь

  • Общая функция потерь вычисляется как среднее значение функции потерь для одного примера по всему обучающему набору.
  • Задача оптимизации — найти минимум функции потерь по параметрам.

00:11:00 Градиентный спуск

  • Градиентный алгоритм предполагает, что функция гладкая и дифференцируемая.
  • Стахастический градиентный спуск: изменение параметров модели на основе одного примера.
  • Процедура: вычисление целевой функции для одного примера, определение изменений параметров, переход к следующему примеру.

00:12:25 Градиентный спуск

  • Градиентный спуск обновляет значение параметра, вычитая градиент целевой функции, умноженный на размер шага.
  • Градиент — это вектор, показывающий, насколько увеличится функция потерь при небольшом изменении параметра.
  • Для каждой компоненты вектора параметров градиент определяет производную по направлению.

00:13:15 Вычисление градиента

  • Компонент градиента вычисляется как отношение изменения функции потерь к изменению параметра.
  • Полный градиент функции потерь по всем весам представляет собой вектор, каждая компонента которого соответствует одному из весов.
  • Обозначение градиента как dl/dW отражает факт небольшого изменения параметра.

00:14:12 Направление градиента

  • Вектор градиента указывает в направлении наибольшего подъёма или самого крутого склона.
  • На двумерной поверхности градиент представляет собой стрелку, указывающую на направление наибольшего уклона.
  • В минимуме градиент равен нулю.

00:15:11 Аналогия с горами

  • Градиентные алгоритмы аналогичны спуску в горах: определение локального уклона и движение в противоположном направлении.
  • При выпуклом ландшафте алгоритм сходится к минимуму.

00:16:08 Параметры шага

  • В простейших случаях размер шага — положительная константа, иногда уменьшаемая в процессе работы.
  • В более сложных вариантах размер шага может быть матрицей, положительно определённой или положительно полуопределённой.
  • Траектория градиентного спуска не всегда прямая, следуя уклону.

00:17:48 Стахастический градиентный спуск

  • Стахастический градиентный спуск вычисляет градиент для каждого примера отдельно, а не для всего набора данных.
  • Оценка градиента на основе одного примера является зашумлённой оценкой полного градиента.
  • Метод более эффективен благодаря использованию избыточности между примерами.

00:19:44 Мини-батчи

  • Мини-батчи позволяют распараллелить вычисления на параллельном оборудовании, таком как GPU.
  • Размер батча варьируется от 30 до нескольких тысяч примеров.
  • Вычисление среднего градиента по батчу улучшает эффективность алгоритма.

00:20:39 Непрерывность и дифференцируемость

  • Целевая функция должна быть непрерывной и дифференцируемой почти везде.
  • Нейросети часто не являются дифференцируемыми, но непрерывными.
  • В точках излома используются субградиенты как обобщение понятия производной.

00:22:30 Невыпуклость функций

  • В большинстве систем глубокого обучения функция оптимизации является невыпуклой.
  • Минимизация невыпуклых функций ранее вызывала сомнения из-за отсутствия строгих доказательств.
  • Несмотря на теоретические ограничения, невыпуклые функции работают на практике.

00:23:13 Оптимизация невыпуклых функций

  • В сообществе машинного обучения дебаты о оптимизации невыпуклых функций длились 20 лет.
  • Классический градиентный спуск часто застревает в локальных минимумах, в то время как стохастический градиентный спуск SGD реже попадает в них из-за шумовой компоненты.
  • В многомерных пространствах локальных минимумов меньше, что позволяет успешно оптимизировать невыпуклые функции.

00:24:18 Пример с двумерной функцией

  • В одномерном пространстве градиентный спуск застревает в локальном минимуме.
  • Добавление второго параметра позволяет обойти локальный минимум и достичь глобального.
  • В пространствах высокой размерности проблема локальных минимумов менее актуальна.

00:25:09 Гиперпараметризация и валидация

  • Современные системы глубокого обучения имеют триллионы параметров.
  • Гиперпараметризация позволяет сетям идеально выучить обучающий набор, но возникает вопрос о их производительности на валидационном и тестовом наборах данных.

00:26:07 Седловые точки

  • В системах глубокого обучения существует множество седловых точек.
  • Стохастический градиент помогает избегать седловых точек.
  • Специальные методы для избегания седловых точек не оказались серьёзной проблемой на практике.

00:27:03 Выбор случайных примеров для SGD

  • Базовый подход: перемешивание примеров в случайном порядке и выбор по одному.
  • Альтернатива: использование генератора случайных чисел для выбора примеров.
  • Полезно составлять батчи из максимально непохожих примеров.

00:28:44 Аугментация данных

  • Аугментация искажает образцы, например, изображения, чтобы получить больше примеров.
  • Преобразования могут быть случайными или предварительно вычисленными.

00:29:42 Размер батча

  • Размер батча определяется аппаратным обеспечением: для GPU обычно от 16 до 64 примеров.
  • Увеличение размера батча ускоряет вычисления, но замедляет сходимость модели.
  • Для задач классификации размер батча не должен значительно превышать число категорий, чтобы не тратить вычислительные ресурсы впустую.

00:30:49 Введение в нейронные сети

  • Нейронная сеть — это модель, состоящая из линейных и нелинейных операционных блоков.
  • Каждый вход умножается на свой вес, и взвешенная сумма вычисляется в юните или нейроне.
  • Термин «нейрон» упрощённо описывает модель, вдохновлённую биологическими нейронами.

00:31:46 Скрытый слой и функция ReLU

  • В первом слое три юнита, который называется скрытым.
  • Взвешенные суммы пропускаются через нелинейную функцию ReLU, которая равна аргументу, если он положителен, и нулю, если отрицателен.
  • Функция ReLU имеет различные названия в разных областях.

00:32:39 Полносвязная сеть и градиенты

  • Полносвязная сеть соединяет каждый юнит в одном слое с каждым юнитом в следующем.
  • Алгоритм обучения настраивает веса сети.
  • Главный трюк глубокого обучения — вычисление градиентов для весов.

00:33:32 Нотация и ограничения сети

  • Взвешенная сумма для юнита с номером i — это сумма произведений выходов вышестоящих юнитов на соответствующие веса.
  • Функция активации применяется к взвешенной сумме.
  • Граф соединений должен быть ациклическим, чтобы можно было выстроить последовательность слоёв.

00:34:30 Алгоритм обратного распространения ошибки

  • Алгоритм обратного распространения ошибки вычисляет градиент функции затрат по отношению к переменным внутри нейронной сети.
  • Переменные могут быть переменными состояния или параметрами модели.
  • Интуитивное объяснение алгоритма основано на изменении значения функции затрат при изменении переменной.

00:37:26 Цепное правило дифференцирования

  • Цепное правило объясняет, как вычислять производную функции, являющейся композицией двух других функций.
  • Для вычисления градиента по переменной s используется формула: dc/ds = dc/dz * dz/ds.
  • Это позволяет распространять градиент назад, умножая на производные функций в обратном порядке.

00:41:13 Влияние переменной z на функцию затрат

  • Изменение переменной z влияет на функцию затрат через взвешенные суммы на следующем слое.
  • Возмущение переменной z приводит к возмущению взвешенных сумм на величину, равную произведению dz на соответствующие веса.
  • Общее влияние на функцию затрат рассчитывается путём суммирования эффектов от возмущения каждой из переменных.

00:43:00 Вычисление градиента по z

  • Градиент функции затрат по z вычисляется как сумма произведений частных производных функции затрат по взвешенным суммам на соответствующие веса.
  • Это позволяет учесть все возможные возмущения из-за изменения переменной z.

00:43:31 Графическое представление операции

  • Операция представлена графически с использованием величин dC/ds0, dC/ds1 и dC/ds2.
  • Вычисление dC/z через умножение dC/ds0 на w0, dC/ds1 на w1 и dC/ds2 на w2.
  • Объяснение принципа обратного распространения: использование тех же весов в обратном направлении.

00:44:22 Простота алгоритма обратного распространения

  • Алгоритм обратного распространения ошибки для классических нейронных сетей прост в реализации.
  • Необходимы две функции: для прямого прохода и для обратного прохода.
  • Функция прямого прохода вычисляет взвешенные суммы и применяет нелинейность.
  • Функция обратного прохода вычисляет взвешенные суммы в обратном направлении, умножая результат на производную нелинейности.

00:45:18 Матричная форма представления

  • Каждое состояние сети представлено вектором.
  • Линейная операция — это умножение вектора на матрицу весов.
  • Размерности вектора и матрицы должны совпадать.

00:46:18 Возмущения и частные производные

  • Возмущение величины c0 умножается на частную производную функции затрат по c0.
  • Общее возмущение функции затрат равно сумме возмущений по всем переменным.
  • Алгебраические операции с возмущениями позволяют вычислять dC/z.

00:48:11 Вычисление dC/z через возмущения

  • Изменение z на dz приводит к изменению c0 на ds0 = dw0 * dz.
  • Подстановка выражения для ds0 в формулу для dC/z даёт dC/z = dw0 * dC/ds0 + dw1 * dC/ds1 + …
  • Разделение на dz позволяет избавиться от dz в уравнении.

00:50:05 Символическое описание нейросети

  • Простая нейросеть описывается линейными блоками, которые умножают предыдущее состояние на матрицу.
  • Результат умножается на другую матрицу и проходит через нелинейность, например, ReLU.
  • Процесс повторяется для получения следующего состояния сети.

00:50:17 Двуслойная нейросеть

  • Состояние z1 умножается на матрицу W1, образуя вектор взвешенных сумм s2.
  • Вектор s2 проходит через нелинейные функции, образуя z2.
  • Архитектура включает два скрытых слоя и три слоя весов.
  • Такой подход упрощает понимание алгоритма обратного распространения ошибки и соответствует современным фреймворкам для глубокого обучения, таким как PyTorch.

00:51:13 Использование PyTorch

  • В PyTorch используются предопределённые классы, например, Linear для умножения вектора на матрицу.
  • Пример программы на Python: создание изображения и нейросети с количеством входов, равным общему числу элементов в изображении.
  • Определение класса для нейронной сети с помощью подкласса NModule.

00:52:12 Конструктор и атрибуты

  • Конструктор принимает размеры внутренних слоёв и инициализирует три линейных модуля.
  • Атрибуты объекта: m0, m1, m2, каждый из которых является экземпляром класса Linear.
  • Классы с заглавной буквы содержат внутренние параметры, весовые матрицы.

00:53:09 Функция forward

  • Функция forward выравнивает входные данные и применяет модули в определённом порядке.
  • Применение ReLU и линейных слоёв для вычисления выходного значения.

00:54:45 Автоматическое дифференцирование

  • PyTorch автоматически вычисляет градиент в обратном направлении благодаря автоматическому дифференцированию.
  • Возможность использования функционального подхода вместо модулей с внутренними параметрами.

00:55:39 Создание экземпляра модели

  • Создание экземпляра класса модели и применение её к изображению.
  • Важность понимания работы обратного распространения даже при автоматическом выполнении в PyTorch.

00:57:03 Создание нового модуля

  • Для создания нового модуля необходимо написать собственную функцию backward.
  • Применение цепного правила для вычисления градиента функции затрат.

00:58:45 Математические аспекты

  • Объяснение размерностей векторов и матриц в контексте цепного правила.
  • Вектор-градиент записывается в виде строки, а производная одного вектора по другому представлена матрицей.

00:59:42 Введение в обратное распространение

  • Функция принимает вход z и выдаёт выход zg.
  • Для вычисления производной модуля необходимо рассмотреть множество членов, так как каждый компонент zg может зависеть от каждого компонента z.
  • В результате получается матрица, где количество строк равно размерности вектора zg, а количество столбцов — размерности z.

01:00:42 Формула обратного распространения

  • Каждый элемент матрицы представляет собой частную производную.
  • Формула обратного распространения: градиент функции затрат по одной переменной умножается на матрицу Якоби для получения градиента по другой переменной.

01:02:37 Пример линейного модуля

  • Если функция — простое умножение на матрицу, матрица Якоби будет транспонированной исходной матрицей.
  • Умножение вектора градиента на транспонированную матрицу весов даёт градиент для следующего слоя.

01:03:31 Обобщение обратного распространения

  • Обратное распространение через линейный модуль — это умножение на транспонированную матрицу.
  • Модуль имеет два входа: zf и g, матрица весов.
  • Для вычисления градиента по весовым параметрам используется другая матрица Якоби.

01:05:11 Рекурсивное правило обратного распространения

  • Градиент по слою k равен градиенту по слою k+1, умноженному на матрицу Якоби слоя k+1 по слою k.
  • Функция, связывающая слои, может зависеть от внутренних параметров.

01:07:00 Реализация в фреймворках

  • В PyTorch и других фреймворках используется параметризованная функция и функция затрат.
  • Производные вычисляются с помощью цепного правила или частных производных.
  • Зависимость выхода от параметров выражается через матрицу Якоби.

01:08:10 Вычислительный граф

  • Вычислительный граф — это способ организации операций для вычисления градиента.
  • Автоматический способ преобразования графа вычисляет градиент.
  • PyTorch хорошо справляется с динамическими вычислениями и сложными функциями.

01:09:43 Проверка размерностей

  • Если w — вектор-столбец размера m, а w — вектор-столбец размера n, то в результате вычислений получаются вектор-строка размера n, вектор-строка размера m и матрица Якоби размера n на n.
  • Все размерности согласуются друг с другом.

01:10:05 Подход к построению нейронных сетей

  • Использование библиотеки базовых модулей для создания сложных графовых структур.
  • Компоновка модулей в архитектуру обучающейся системы.
  • Применение базовых математических операций и параметризованных функций.

01:11:05 Фундаментальные строительные блоки

  • Основные модули: ReLU, сигмоиды, функции затрат.
  • Возможность конструирования вычислительных графов без циклов.
  • Упоминание рекуррентных сетей как способа обработки циклов.

01:12:16 Практические трюки для экспериментов с нейросетями

  • Использование ReLU как основной нелинейности.
  • Важность правильной инициализации весов.
  • Проблемы с нулевыми весами и необходимость нарушения симметрии.

01:13:13 Инициализация весов

  • Инициализация весов случайными значениями для нарушения симметрии.
  • Приём Каймина: уменьшение весов при увеличении числа входов.
  • Примеры инициализации в PyTorch.

01:15:04 Функции потерь и оптимизация

  • Перекрёстная энтропия как функция затрат для классификации.
  • Статистический градиентный спуск на мини-батчах.
  • Необходимость перемешивания обучающих примеров.

01:16:04 Стахастическая градиентная оптимизация

  • Разновидности стахастического градиентного спуска.
  • Проблемы с термином «стахастический градиентный спуск».
  • Методы SGD в PyTorch: Adam, ADAM, RMSprop, Adamax.

01:18:19 Нормализация входных переменных

  • Нормализация входных данных для линейной оптимизации.
  • Влияние дисперсии входных данных на обучение.
  • Вычисление среднего значения и стандартного отклонения для каждой переменной.

01:20:13 Пример нормализации

  • Пример нормализации изображения в оттенках серого.
  • Вычисление среднего значения и стандартного отклонения для каждой переменной изображения.

01:20:36 Нормализация данных

  • Данные предварительно подготавливаются: для каждого пикселя вычисляется среднее значение и делится на стандартное отклонение.
  • Нормализованные данные имеют среднее значение 0 и стандартное отклонение 1.
  • Это помогает нейронной сети и алгоритму оптимизации работать эффективнее.

01:21:24 Безградиентные методы оптимизации

  • Безградиентные методы не предполагают дифференцируемости функции.
  • Примеры функций, где градиент не даёт информации: функции с резкими уступами или дискретные переменные.
  • Пример дискретной переменной: позиция фигур на шахматной доске.

01:23:24 Оценка градиента возмущениями

  • Градиент оценивается путём возмущения входных данных и наблюдения за изменением функции затрат.
  • Сравнение с поиском пути в тумане: оценка направления спуска через пробные шаги.

01:24:49 Методы нулевого порядка

  • Методы нулевого порядка не вычисляют производные, а оценивают функцию напрямую.
  • Примеры методов нулевого порядка: генетические алгоритмы, метод роя частиц.

01:25:46 Методы второго порядка

  • Методы второго порядка вычисляют первую и вторую производные, ускоряя процесс оптимизации.
  • Примеры методов второго порядка: метод сопряжённых градиентов, квази-ньютоновские методы.

01:26:45 Библиотека неверград

  • Библиотека неверград реализует множество алгоритмов оптимизации без доступа к градиенту.
  • Примеры алгоритмов: генетические алгоритмы, эволюционные методы, метод роя частиц.

01:27:44 Обучение с подкреплением

  • В обучении с подкреплением система не знает функцию затрат, поэтому используются методы нулевого порядка.
  • Оценка градиента функции затрат через возмущения параметров или выхода нейросети.

01:29:29 Нормализация внутренних переменных

  • Внутренние переменные сети также нормализуются для достижения нулевого среднего и единичной дисперсии.
  • Примеры методов нормализации: батчивая нормализация, послойная нормализация.

01:30:30 Уменьшение скорости обучения

  • На начальном этапе обучения требуется высокая скорость обучения, но затем её необходимо снижать для достижения локальных минимумов.
  • В PyTorch есть готовые расписания для уменьшения скорости обучения.

01:31:30 Регуляризация и дропаут

  • Регуляризация L1 и L2 используется для устранения бесполезных весов.
  • Дропаут случайным образом выбирает долю нейронов и устанавливает их выходные значения в ноль, что делает сеть более устойчивой.

01:33:26 Мотивация глубокого обучения

  • Обсуждение мотивации использования многослойных нейронных сетей и структур подобного типа.

01:33:35 Линейный классификатор

  • Линейный классификатор разделяет пространство на два полупространства с помощью гиперплоскости.
  • Вычисляется взвешенная сумма входов, добавляется смещение, результат проходит через пороговую функцию.
  • Гиперплоскость задаётся уравнением $\sum_i w_i x_i + b = 0$.

01:34:25 Принцип работы гиперплоскости

  • Скалярное произведение входного вектора на вектор весов определяет положение гиперплоскости.
  • Ортогональные векторы имеют скалярное произведение, равное нулю, что определяет гиперплоскость.
  • В n-мерном пространстве гиперплоскость делит пространство на две половины.

01:35:23 Нелинейноразделимые случаи

  • Если точки данных невозможно разделить гиперплоскостью, это называется нелинейноразделимым случаем.
  • Линейный классификатор не подходит для таких случаев.

01:36:18 Теорема Тома Кавера

  • Теорема Кавера утверждает, что вероятность линейной разделимости точек в n-мерном пространстве близка к единице, если количество точек меньше размерности пространства, и близка к нулю, если больше.
  • Это объясняет, почему линейные классификаторы часто неэффективны для больших наборов данных.

01:37:13 Извлечение признаков

  • Для классификации изображений с большим количеством примеров используется экстрактор признаков.
  • Экстрактор признаков должен быть нелинейным, чтобы превратить нелинейноразделимую задачу в линейноразделимую.
  • Примеры признаков для распознавания лиц: обнаружение глаз, измерение расстояний между ними и до рта.

01:38:59 Методы увеличения размерности

  • Мозаичное разбиение и случайные проекции используются для увеличения размерности представления данных.
  • Случайная проекция: входные векторы умножаются на случайную матрицу и пропускаются через нелинейную операцию.
  • Полиномиальные классификаторы, радиально-базисные функции и ядерные машины также являются методами увеличения размерности.

01:40:54 Полиномиальный классификатор

  • Полиномиальный классификатор увеличивает размерность, добавляя произведения парных переменных к исходным.
  • Разделяющая поверхность в исходном пространстве становится квадратичной кривой или гиперповерхностью.
  • Метод плохо работает в высоких размерностях из-за значительного увеличения количества признаков.

01:41:52 Ограничения полиномиального классификатора

  • Применение полиномиального классификатора к изображениям с высоким разрешением приводит к огромному количеству признаков.
  • Для задач высокой размерности этот подход непрактичен, но может быть полезен в некоторых случаях как полезный трюк.

01:42:16 Метод опорных векторов

  • Метод опорных векторов основан на ядерных машинах, которые представляют собой двухслойные системы.
  • Первый слой содержит столько нейронов, сколько обучающих примеров в наборе данных.
  • Нейроны первого слоя выдают большой выходной сигнал, если входной вектор соответствует обучающему примеру, и малый — если не соответствует.
  • Операция должна быть нелинейной, например, можно использовать скалярное произведение и отрицательную экспоненту.

01:43:12 Обучение линейного классификатора

  • После вычисления мер схожести обучается линейный классификатор.
  • Веса классификатора вычисляются на основе мер схожести.
  • Метод эффективен для задач низкой размерности с небольшим количеством обучающих примеров, но не подходит для компьютерного зрения.

01:44:11 Универсальность двухслойных сетей

  • Двухслойные нейронные сети могут аппроксимировать практически любую функцию с любой точностью при достаточном количестве элементов в скрытом слое.
  • Теорема утверждает, что двухслойные сети являются универсальными аппроксиматорами.

01:45:03 Необходимость многослойных сетей

  • Аппроксимировать все функции двумя слоями неэффективно, так как многие функции требуют слишком большого числа скрытых элементов.
  • Многослойные сети позволяют представлять функции более эффективно.

01:46:01 Обобщающая способность

  • Обобщающая способность зависит от соответствия архитектуры сети решаемой задаче.
  • Многослойность может улучшить обобщающую способность, уменьшая общий размер системы.

01:46:58 Перепараметризованные сети

  • Перепараметризованные сети проще обучают, так как находят минимум целевой функции.
  • Увеличение количества параметров часто улучшает работу сети.

01:47:53 Композиционность мира

  • Мир композиционен: элементарные частицы объединяются в более сложные структуры.
  • Аналогичная иерархия присутствует в изображениях, тексте и речи.
  • Композиционность делает мир постижимым.

01:48:49 Иерархические представления

  • Слоистая архитектура нейросети отражает идею композиционной иерархии.
  • Каждый слой изучает представление с уровнем абстракции чуть выше предыдущего.
  • Сквозное обучение позволяет системе выучивать иерархические представления.

01:49:48 Успех глубокого обучения

  • Глубокое обучение стало успешным благодаря способности изучать иерархические представления.
  • Это объясняет, почему глубокое обучение покорило мир за последнее десятилетие.

Transcript

Search transcript
0:00
So, as you know, we’re going to talk about deep learning and we’re going to jump right in.
0:06
6 seconds
So much of practical applications of deep learning today, machine learning and AI in general use a paradigm called supervised learning, which I’m sure most of you have heard of before.
0:18
18 seconds
So this is the paradigm by which you train a machine by showing it examples of inputs and outputs.
0:25
25 seconds
You want to build a machine to distinguish images of cars from airplanes. You show it an image of a car. It’s if the machine says car, you don’t do anything.
0:33
33 seconds
If it says something else, you adjust the internal parameters of the system so that the output gets closer to the one you want.
0:40
40 seconds
So imagine the target output is some vector of activities on a set of outputs. You want the vector coming out of the machine to get closer to the vector that is the desired output.
0:51
51 seconds
This works really well as long as you have lots of data. It works for speech recognition, image recognition, face recognition, generating captions, translation, all kinds of stuff.
1:02
1 minute, 2 seconds
This is, I would say, 95% of all applications of machine learning today.
1:08
1 minute, 8 seconds
There are two other paradigms, one of which I will not talk about, one of which I will talk about a lot.
1:14
1 minute, 14 seconds
The two other paradigms are reinforcement learning, which I will not talk about. There are other courses.
1:20
1 minute, 20 seconds
There’s a course by Larry Pinto about this that I encourage you to take. And a third paradigm is self supervised learning or unsupervised learning.
1:29
1 minute, 29 seconds
And we’ll talk about this quite a lot in the following weeks. But for now, let’s talk about supervised learning. Self supervised learning.
1:37
1 minute, 37 seconds
You could think of it as kind of a play on supervised learning.
1:41
1 minute, 41 seconds
So the traditional model of pattern recognition, machine learning, and supervised learning, certainly going back to the late 50s, early 60s, is the idea by
1:50
1 minute, 50 seconds
which you take a raw signal, let’s say an image or an audio signal, or a set of features representing an object, and then you turn it into a representation
1:58
1 minute, 58 seconds
using a feature extractor, which in the past was hand engineered, and then you take that representation.
2:07
2 minutes, 7 seconds
which is generally in the form of a vector or a table of numbers of some kind, a tensor, a multidimensional array, but sometimes could be a different type of representation, and you feed that to a trainable classifier.
2:19
2 minutes, 19 seconds
So this is the learning where the learning takes part. This is the classical model, and it’s still popular, it’s still used a lot.
2:26
2 minutes, 26 seconds
But basically what deep learning has done is replace this manual hand engineering of the feature extractor by a stack of trainable modules, if you want.
2:38
2 minutes, 38 seconds
In deep learning.
2:39
2 minutes, 39 seconds
The main idea of deep learning and the Only reason why it’s called deep is that we stack a bunch of modules, each of which transforms the input a little bit
2:49
2 minutes, 49 seconds
into something that’s slightly higher level of abstraction if you want, and then we train the entire system end to end.
2:59
2 minutes, 59 seconds
So I represented those sort of pinkish modules to indicate the ones that are trainable.
3:06
3 minutes, 6 seconds
And the blue modules are the fixed ones, the hand engineered ones. So that’s why deep learning is called deep.
3:13
3 minutes, 13 seconds
We stack multiple layers of trainable things and we train it end to end. The idea for this goes back a long time.
3:20
3 minutes, 20 seconds
The practical methods for this go back to the mid to late 80s with the backpropagation algorithm, which is going to be the main subject of today’s lecture actually.
3:33
3 minutes, 33 seconds
But it took a long time for this idea to actually percolate and become the main tool that people use to build machine learning system.
3:41
3 minutes, 41 seconds
It’s only about 10 years old. Okay, so let’s go through a few definitions. So we’re going to deal with parameterized models.
3:49
3 minutes, 49 seconds
A parameterized model or learning model if you want, is a parameterized function G of X and W, where X is the input and W is a set of parameters.
4:00
4 minutes
I’m representing this here on the right with a particular symbolism where a function like this that produces a single output.
4:08
4 minutes, 8 seconds
Think of the output as either a vector or a matrix or a tensor or perhaps even a scalar. But generally it’s multidimensional.
4:16
4 minutes, 16 seconds
It can actually be something else than a multidimensional array, but something that maybe like a sparse array representation or a graph with values on it.
4:26
4 minutes, 26 seconds
But for now, let’s think of it just as a multidimensional array. So both the inputs and the output are multidimensional arrays, what people call tensors.
4:35
4 minutes, 35 seconds
It’s not really kind of the appropriate definition of tensor, but it’s okay.
4:40
4 minutes, 40 seconds
And that function is parameterized by a set of parameters W. Those are the knobs that we’re going to adjust during training.
4:47
4 minutes, 47 seconds
And they basically determine the input output relationship between the input X and the predicted output Y bar.
4:56
4 minutes, 56 seconds
Okay, so I’m not explicitly representing the wire that comes in with W here.
5:02
5 minutes, 2 seconds
I kind of assume that W is somewhere inside of this module. Think of this as an object in object oriented programming.
5:11
5 minutes, 11 seconds
So it’s an instance of a class that you instantiated and it’s got a slot in it that represents the parameters.
5:20
5 minutes, 20 seconds
And there is a forward function basically that takes as argument the input and returns the output.
5:28
5 minutes, 28 seconds
Okay, so a basic learning machine will have a cost function and the cost function in supervised learning, but Also in some other settings will basically compute
5:36
5 minutes, 36 seconds
the discrepancy, distance, divergence, whatever you want to call it, between the desired output Y, which is given to you from the training set, and the output
5:45
5 minutes, 45 seconds
produced by the system Y bar.
5:54
5 minutes, 54 seconds
Okay, so an example of this, a very simple example of a setting like this is linear regression.
6:00
6 minutes
In linear regression, X is a vector composed of components XI’s. W is also a vector and the output is a scalar.
6:08
6 minutes, 8 seconds
That is simply the dot product of X with W. So Y bar now is a scalar, and what you compute is the square distance, the square difference really between Y and Y bar.
6:19
6 minutes, 19 seconds
If W is a matrix, then now Y is a vector, and you compute the square norm of the difference between Y and Y bar.
6:26
6 minutes, 26 seconds
And that’s basically linear regression. So learning will consist in finding the set of W’s that minimize this particular cost function averaged over a training set.
6:37
6 minutes, 37 seconds
I’ll come to this in a minute, but I want you to think right now about the fact that this G function may not be something particularly simple to compute.
6:47
6 minutes, 47 seconds
So it may not be just multiplying a vector by matrix. It may not be just carrying some sort of fixed computation with sort of a fixed number of steps.
6:56
6 minutes, 56 seconds
It could involve something complicated. It could involve minimizing a function with respect to some other variable that you don’t know.
7:06
7 minutes, 6 seconds
It could involve a lot of iteration of some algorithm that converges towards a fixed point.
7:12
7 minutes, 12 seconds
So let’s not restrict ourselves to G of X, W that are kind of simple things. It could be very complicated things.
7:20
7 minutes, 20 seconds
And we’ll come to this in a few weeks. This is just to kind of explain the notations that I will use during the course of this, this class.
7:32
7 minutes, 32 seconds
So we have observed input and desired output variables. Those are kind of grayish bubbles.
7:39
7 minutes, 39 seconds
Other variables that are produced by the system or internal to the system are those kind of empty circle variables.
7:47
7 minutes, 47 seconds
We have deterministic functions. So functions that are. So they are indicated by this sort of rounded shape here.
7:56
7 minutes, 56 seconds
They can take multiple inputs, have multiple outputs, and each of those can be tensors or scalars or whatever.
8:04
8 minutes, 4 seconds
And they have implicit parameters that are tunable by training. Then we have cost functions.
8:10
8 minutes, 10 seconds
Cost functions are basically functions that take one or multiple inputs and output a scalar. But I’m not representing the output.
8:19
8 minutes, 19 seconds
It’s implicit. If you have a red square, it has an implicit output and it’s a scalar. And we interpret it as a cost or an energy function.
8:28
8 minutes, 28 seconds
So this symbolism is kind of similar to what people use in graphical models. If you’ve heard What a graphical model is, particularly the typographical model called a factor graph.
8:39
8 minutes, 39 seconds
So in a factor graph you have those variable bubbles and you have those factors which are those square cost functions.
8:47
8 minutes, 47 seconds
You don’t have this idea that you have deterministic functions in it, because graphical models don’t care about the fact that you have functions in one direction or another.
8:56
8 minutes, 56 seconds
But here we care about it. So we have this extra symbol. Okay? So machine learning consists in basically minimizing finding the set of parameters w that minimize the cost function averaged over a training set.
9:08
9 minutes, 8 seconds
So a training set is a set of pairs XY indexed by an index P. So we have P training samples, and little P is the index of the training sample.
9:21
9 minutes, 21 seconds
And our overall loss function that we’re going to have to minimize is equal to the cost of the discrepancy between Y and the output of our model Y bar G of x W,
9:29
9 minutes, 29 seconds
as I said earlier.
9:38
9 minutes, 38 seconds
So L is a value, C is a module, and L is a way of writing C of Y, g of x W in a way that depends explicitly on X, y and W. But it’s the same thing, really.
9:50
9 minutes, 50 seconds
The overall loss function, which is this kind of curly L is the average of the per sample loss function over the entire training set.
10:02
10 minutes, 2 seconds
Okay? So compute L for the entire training set, divide by the sum all the terms, divide by P, and that’s the average, that’s the loss.
10:13
10 minutes, 13 seconds
Okay? So now the name of the game is trying to find the minimum of that loss with respect to the parameters. This is an optimization problem.
10:21
10 minutes, 21 seconds
So symbolically, I can represent this entire graph as the thing on the right. This is rarely used in practice, but this is a way to visualize this.
10:30
10 minutes, 30 seconds
Think about each training sample as an identical copy of the replica, if you want, of the model and the cost function applied to a different training sample.
10:42
10 minutes, 42 seconds
And then there is an average operation that computes the loss. Everything you can write as a formula, you can probably write in terms of those graphs.
10:52
10 minutes, 52 seconds
This is going to be very useful as we’re going to see later.
10:59
10 minutes, 59 seconds
Ok, so supervised machine learning and a lot of other machine learning paradigms as well actually can be viewed as function optimization.
11:09
11 minutes, 9 seconds
And a very simple approach to optimizing a function, which means finding the set of parameters to a function that minimize its value is gradient descent or gradient based algorithms.
11:22
11 minutes, 22 seconds
So a gradient based algorithm makes the assumption that the function is somewhat smooth and mostly differentiable, doesn’t have to be everywhere differentiable,
11:30
11 minutes, 30 seconds
but has to be continuous, has to be almost everywhere differentiable, and it has to be somewhat Smooth.
11:37
11 minutes, 37 seconds
Otherwise, the local information of the slope doesn’t tell you much about where the minimum is.
11:42
11 minutes, 42 seconds
Okay, so here’s an example here, depicted on the right, the lines that you see here, the peak lines, are the lines of equal cost, and this cost is quadratic.
11:53
11 minutes, 53 seconds
So it’s basically a kind of paraboloid. And this is the trajectory of a method called stochastic gradient descent, which we’ll talk about in a minute.
12:02
12 minutes, 2 seconds
So, for stochastic gradient descent, the procedure is, you show an example, you run it to the machine, you, you compute the objective for that particular
12:08
12 minutes, 8 seconds
sample, and then you figure out by how much and how to modify each of the knobs in the machine, the W parameters, so that the objective function goes down by a little bit.
12:20
12 minutes, 20 seconds
You make that change, and then you go to the next sample. Let’s be a little more formal. So gradient descent is this very basic algorithm.
12:29
12 minutes, 29 seconds
Here. You replace the value of W by its previous value minus its step size eta here, multiplied by the gradient of the objective function with respect to the parameters.
12:42
12 minutes, 42 seconds
So what is a gradient? A gradient is a vector of the same size as the parameter vector.
12:49
12 minutes, 49 seconds
And for each component of the parameter vector, it tells you by how much the loss function L would increase if you increase the parameter by a tiny amount.
13:00
13 minutes
Okay, it’s a derivative, but it’s a directional derivative, right? So let’s say among all the directions, you only look at W34.
13:09
13 minutes, 9 seconds
And let’s imagine that you tweak W34 by a tiny amount. The loss function curly L is going to increase by a tiny amount.
13:18
13 minutes, 18 seconds
You divide the tiny amount by which L increase by the tiny amount that you modified this, W34, and what you get is the gradient of the loss with respect to W34.
13:28
13 minutes, 28 seconds
If you do this, for every single weight, you get the gradient of the loss function with respect to all the weights.
13:35
13 minutes, 35 seconds
And it’s a vector which, for each component of the weight, gives you the parameter gives you that quantity. Okay?
13:43
13 minutes, 43 seconds
So since Newton and Euler, it’s been written as DL over DW because it indicates the fact that there is this little twiddle.
13:51
13 minutes, 51 seconds
You can twiddle W by little, and there is a resulting twiddling of L. And if you divide those two twiddles and they are infinitely small, you get the derivative.
14:02
14 minutes, 2 seconds
That’s kind of standard notation in mathematics for a few hundred years. Okay, so now the gradient is going to be a vector.
14:12
14 minutes, 12 seconds
And as indicated here on the top right, that vector is an arrow that points upwards along the line of larger slope.
14:22
14 minutes, 22 seconds
So if you are in a 2D surface. You have two W parameters and the surface is represented here, some sort of quadratic ball here in this case.
14:34
14 minutes, 34 seconds
So it’s a second degree polynomial in W1 and W0.
14:38
14 minutes, 38 seconds
Here on the right is the kind of a top down view of this, where the lines represent the lines of equal cost.
14:49
14 minutes, 49 seconds
The little arrows here represent the gradient at various locations. So you have a long arrow if the slope is steep.
14:58
14 minutes, 58 seconds
A short arrow is if the slope is not steep, not large. At the bottom it’s zero and it points towards the direction of higher slope.
15:08
15 minutes, 8 seconds
Imagine you are in a landscape, a mountainous landscape, and you’re in a fog and you want to go down the valley.
15:17
15 minutes, 17 seconds
You look around you and you can tell the local slope of the landscape.
15:24
15 minutes, 24 seconds
You can’t tell where the minimum is because you’re in a fog, but you can tell the local slope.
15:29
15 minutes, 29 seconds
So you can figure out what is the direction of larger slope and then take a step and that will take you upwards.
15:36
15 minutes, 36 seconds
Now you turn around 180 degrees, take a step in that direction, and that was going to take you downwards.
15:43
15 minutes, 43 seconds
And if you keep doing this, and the landscape is convex, which means it has only one local minimum, this will eventually take you down to the valley and
15:50
15 minutes, 50 seconds
presumably to the village.
15:57
15 minutes, 57 seconds
That’s gradient based algorithms. They all differ by how you compute the gradient first and by what this eta step size parameter is.
16:09
16 minutes, 9 seconds
So in simple forms, eta is just a positive constant that sometimes is decreased as the system runs more, but most of the time not.
16:19
16 minutes, 19 seconds
But in more complex versions of gradient based learning, eta is actually an entire matrix itself, generally a positive definite or semi definite matrix.
16:30
16 minutes, 30 seconds
And so the direction adopted by those algorithms is not necessarily the steepest descent.
16:37
16 minutes, 37 seconds
It goes downwards, but it’s not necessarily the steepest descent. And we can see why here.
16:43
16 minutes, 43 seconds
So in this diagram here that I’m showing, this is the trajectory that will be followed by gradient descent in this sort of quadratic cost environment.
16:52
16 minutes, 52 seconds
And as you see, the trajectory is not straight. It’s not straight because the system kind of goes down by following the slope of steepest descent.
17:01
17 minutes, 1 second
And so it goes down the valley before finding the minimum of the valley, if you want. Right.
17:06
17 minutes, 6 seconds
So if your cost function is a little squeezed in one direction, it will go down the ravine and then kind of follow the ravine towards the bottom.
17:15
17 minutes, 15 seconds
In complex situations where you have things that are the trajectory actually has been cut here, but when the function is highly irregular, this might even be
17:24
17 minutes, 24 seconds
more complicated and then you might have to be smart about what you do here.
17:34
17 minutes, 34 seconds
Okay, so Stochastic gradient descent is universally used in deep learning.
17:40
17 minutes, 40 seconds
And this is a slight modification of the gradient steepest descent algorithm, where you don’t compute the gradient of the entire objective function averaged
17:48
17 minutes, 48 seconds
over all the samples.
17:55
17 minutes, 55 seconds
But what you do is you take one sample and you compute the gradient of the objective function for that one sample with respect to the parameters.
18:07
18 minutes, 7 seconds
And you take a step and you keep doing this.
18:11
18 minutes, 11 seconds
You pick another sample, compute the gradient of the objective function for that sample with respect to the weights, make an update.
18:20
18 minutes, 20 seconds
Y is equals stochastic gradient.
18:22
18 minutes, 22 seconds
Stochastic is a fancy term for random, essentially, and it’s called stochastic because the evaluation of the gradient you get on the basis of a single sample is
18:30
18 minutes, 30 seconds
a noisy estimate of the full gradient, the average of the gradient.
18:38
18 minutes, 38 seconds
Because the gradient is a linear operation, the average of the gradients will be the gradient of the average. And so things work out.
18:46
18 minutes, 46 seconds
If you compute the gradient and you kind of keep going overall, the average trajectory will be sort of the trajectory you would have followed by doing full gradient.
18:55
18 minutes, 55 seconds
Okay, but in fact, the reason we’re doing this is because it’s much more efficient in terms of speed of convergence.
19:02
19 minutes, 2 seconds
So although the trajectory followed by stochastic gradient is very noisy, things kind of bounce around a lot.
19:09
19 minutes, 9 seconds
As you can see in the trajectory here at the bottom, the trajectory is very erratic, but in fact it goes to the bottom faster.
19:17
19 minutes, 17 seconds
And it has other advantages that people are still writing papers on. Ok? The reason for that is that stochastic gradient exploits the redundancy between the samples.
19:29
19 minutes, 29 seconds
So in a machine learning setting, the training samples have some similarities between them.
19:35
19 minutes, 35 seconds
If they don’t, then basically the learning problem is impossible. So they necessarily do have some redundancy between them.
19:43
19 minutes, 43 seconds
And the faster you update the parameters, the more often you update them, the more you exploit this redundancy between those parameters.
19:52
19 minutes, 52 seconds
Now, in practice, what people do is they use mini batches.
19:56
19 minutes, 56 seconds
So instead of computing the gradient on the basis of a single sample, you take a batch of samples, typically anywhere between, let’s say, 30 and a few thousand.
20:08
20 minutes, 8 seconds
But smaller batches are better in most cases, actually. And you compute the average of the gradient over those samples.
20:16
20 minutes, 16 seconds
So you compute the average cost over those samples and compute the gradient of the average over those samples and then make an update.
20:26
20 minutes, 26 seconds
The reason for doing this is not intrinsically an algorithmic reason.
20:31
20 minutes, 31 seconds
It’s because it’s a simple way of parallelizing stochastic gradient on parallel hardware such as GPUs. Okay?
20:40
20 minutes, 40 seconds
So there’s no good reason to do batching other than the Fact that our hardware likes it. Okay, question.
20:48
20 minutes, 48 seconds
Yeah, so actually, for real complex deep learning problems, does this objective function have to be continuously differentiable?
20:57
20 minutes, 57 seconds
Well, it needs to be continuous mostly. If it’s non continuous, you’re going to get in trouble. It needs to be differentiable almost everywhere.
21:06
21 minutes, 6 seconds
But in fact, neural nets that most people use are actually not differentiable.
21:10
21 minutes, 10 seconds
And there’s a lot of places where they’re not differentiable, but they are continuous in the sense that they’re functions that have kind of corners in them.
21:19
21 minutes, 19 seconds
If you want, they have kinks, and if you have a kink once in a while, it’s not too much of a problem.
21:27
21 minutes, 27 seconds
So in that case, those quantities should not be called gradients, they should be called sub gradients.
21:35
21 minutes, 35 seconds
Okay, so a subgradient is basically a generalization of the idea of derivative or gradient to functions that have kinks in them.
21:46
21 minutes, 46 seconds
So wherever you have a function that has a kink in it, any slope that is between the slope of one side and the slope of the other side is a valid subgradient.
21:57
21 minutes, 57 seconds
So when you add a kink, you decide, well, the derivative is this or it’s that or it’s kind of somewhere in between and you’re fine.
22:08
22 minutes, 8 seconds
Most of the proof that apply to smooth functions in terms of minimization often apply also to non smooth function that basically are differentiable almost
22:15
22 minutes, 15 seconds
everywhere.
22:24
22 minutes, 24 seconds
So then how do we ensure strict convexity? We do not ensure strict convexity.
22:29
22 minutes, 29 seconds
In fact, in deep learning systems, most deep learning systems, the function that we are optimizing is non convex.
22:36
22 minutes, 36 seconds
In fact, this is one reason why it took so long for deep learning to become prominent, is because a lot of people, particularly theoreticians, people who are
22:45
22 minutes, 45 seconds
sort of theoretically minded, were very scared of the idea that you had to minimize a non convex objective.
22:53
22 minutes, 53 seconds
And they say this can’t possibly work because we can’t prove anything about it. Turns out it does work. You can’t prove anything about it, but it does work.
23:01
23 minutes, 1 second
And so this is a situation and it’s an interesting thing to think about, a situation where the theoretical thinking basically limited what people could do in terms of engineering because they couldn’t prove things about it.
23:13
23 minutes, 13 seconds
Actually very powerful. Okay, yeah, like your colleague, how do you optimize non convex functions? Like your colleague at the Bell Labs who didn’t like the non mathy.
23:27
23 minutes, 27 seconds
It was a whole debate within the machine learning community that lasted 20 years basically.
23:33
23 minutes, 33 seconds
All right, so what about how doesn’t SGD get stuck in local minima once it reaches them? It does.
23:41
23 minutes, 41 seconds
Okay, so Full gradient does get stuck in local minima.
23:46
23 minutes, 46 seconds
SGD is slightly less stuck in local minima because it’s noisy, it allows it sometimes to escape local minima.
23:56
23 minutes, 56 seconds
But the real reason why we’re going to optimize non convex functions and local minima are not going to be such a huge problem is that there aren’t that many
24:04
24 minutes, 4 seconds
local minima that are traps.
24:14
24 minutes, 14 seconds
We’re going to build neural nets and those neural nets or deep learning systems and they’re going to be built in such a way that the parameter space is such a
24:22
24 minutes, 22 seconds
high dimension that is going to be very hard for the system to actually create local minima for us.
24:31
24 minutes, 31 seconds
So think about a picture where we have in one dimension a cost function that has one local minima and then a global minimum.
24:39
24 minutes, 39 seconds
So it’s a function like this. And we start from here. If we optimize using gradient descent, we’re going to get stuck in that local minimum.
24:48
24 minutes, 48 seconds
Now let’s imagine that we parameterize this function now with two parameters. So we’re not a one dimensional, we’re not looking at a one dimensional function anymore, we’re looking at the two dimensional function.
24:57
24 minutes, 57 seconds
We have an extra parameter. This extra parameter will allow us to go around the mountain and go towards the valley, perhaps without having to climb the little hill in the middle.
25:08
25 minutes, 8 seconds
So this is just an intuitive example to tell you that in very high dimensional spaces you may not have as much of a local minimum problem as you have in the sort of intuitive picture of low dimensional spaces.
25:19
25 minutes, 19 seconds
So here those pictures are in two dimensions. They are very misleading.
25:24
25 minutes, 24 seconds
We’re going to be working with millions of dimensions and some of the most recent deep learning systems have trillions of parameters.
25:34
25 minutes, 34 seconds
So local minima is not going to be that much of a problem. We’re going to have other problems, but not that one.
25:42
25 minutes, 42 seconds
So there is like a trend in this hyper over parameterization, right? Like it seems like that more neurons we have and the better these networks work somehow.
25:54
25 minutes, 54 seconds
That’s right. So we’re going to make those networks very large and they’re going to be over parameterized, which means they’re going to have way more adjustable parameters than we would actually need, which means they’re going to be able to learn the training set almost perfectly.
26:06
26 minutes, 6 seconds
And the big question is how well are they going to work on a separate validation set or test set that is separate from the training set.
26:14
26 minutes, 14 seconds
How are they going to work in a real situation where the distribution of samples may be different from what we train it on?
26:20
26 minutes, 20 seconds
So that’s the real Question of machine learning, which I’m sure a lot of you are familiar with. Two more questions.
26:28
26 minutes, 28 seconds
Can we do so how do we escape? Instead, saddle points. Right?
26:34
26 minutes, 34 seconds
So there are tons and tons of saddle points in deep learning systems. A combinatorially large number of saddle points.
26:42
26 minutes, 42 seconds
As a matter of fact, I’ll have a lecture on this. So I don’t want to kind of spend too long answering.
26:49
26 minutes, 49 seconds
But yeah, there are saddle points. The trick with saddle points is you don’t want to get too close to them, essentially.
26:58
26 minutes, 58 seconds
And stochastic gradient helps a little bit with saddle points.
27:04
27 minutes, 4 seconds
Some people have proposed sort of explicit methods to stay away from saddle points, but in practice doesn’t seem to be that much of a problem, actually.
27:13
27 minutes, 13 seconds
Finally, how do you pick samples for stochastic gradient descent randomly? Okay, there is lots of different methods for that.
27:23
27 minutes, 23 seconds
Okay, Yeah. I mean, the basic thing that you should do is you have your training set, you shuffle the samples in a random order and then you just pick them one at a time and then you cycle through them.
27:36
27 minutes, 36 seconds
An alternative is once you get to the end, you reshuffle them and then cycle through them again.
27:42
27 minutes, 42 seconds
An alternative is you pick a random sample using a random number. Every time you pick a new sample, you pick them randomly.
27:51
27 minutes, 51 seconds
If you do batching, a good idea is to put in a batch samples that are maximally different from each other.
27:58
27 minutes, 58 seconds
So things that are, for example, different categories. If you do classification, but most people just do them, just pick them randomly.
28:07
28 minutes, 7 seconds
But it’s good to have samples that are maximally different, that are nearby, either in a batch or during the process of training.
28:16
28 minutes, 16 seconds
And then there are all kinds of tricks that people use to sort of emphasize difficult samples so that the boring, easy samples are not.
28:24
28 minutes, 24 seconds
You don’t waste your time just seeing them over and over again. There’s all kinds of tricks, but the simpler one is which most people use.
28:32
28 minutes, 32 seconds
You shuffle your samples and you run through them. A lot of people now use also data augmentation.
28:38
28 minutes, 38 seconds
So every sample is actually distorted by some process for an image. You can distort the geometry a little bit.
28:46
28 minutes, 46 seconds
You change the colors, you add noise, et cetera. This is an artificial way of adding more samples than you actually have.
28:55
28 minutes, 55 seconds
People do this randomly on the fly or they pre compute those, those transformations. So lots of tricks there as well.
29:03
29 minutes, 3 seconds
Last question. How do you pick the batch size the best? The batch. Batch size, the batch size that’s determined by your hardware.
29:13
29 minutes, 13 seconds
So if you have a gpu, generally for reasonably sized networks, your batch size would be anywhere between 16 and 64 or something like that.
29:22
29 minutes, 22 seconds
For smaller networks, you might have to batch more to exploit your. Your hardware better to kind of have maximum usage of it.
29:30
29 minutes, 30 seconds
If you Parallelize on multiple GPUs within a machine, you may have to have, you know, so let’s say you have eight GPUs, then you’ll be sort of eight times 32, so 256 or something.
29:40
29 minutes, 40 seconds
And then, you know, a lot of the big guys kind of parallelize that over multiple machines, each of which has eight GPUs, some of them have TPUs, whatever.
29:48
29 minutes, 48 seconds
And then you might have to parallelize over thousands of examples. There’s diminishing return in doing this.
29:55
29 minutes, 55 seconds
When you increase the size of the batch, you actually reduce the speed of convergence. You accelerate the calculation, but you reduce the speed of convergence.
30:05
30 minutes, 5 seconds
So at some point it’s not worth increasing your batch size. So if we are doing a classification problem with K classes, what’s going to be like our go to batch size.
30:17
30 minutes, 17 seconds
So there are papers that say if your batch size is significantly larger than the number of categories, or let’s say twice the number of categories, then you’re probably wasting computation going down convergence.
30:31
30 minutes, 31 seconds
So you’re trying to train an image recognizer on ImageNet. If your batch size is larger than about 1,000, you’re probably wasting time.
30:41
30 minutes, 41 seconds
Okay, that’s it, Thanks. I mean, you’re wasting computation, you’re not wasting time. Okay, okay, okay.
30:47
30 minutes, 47 seconds
So let’s talk about traditional neural nets. So a traditional neural net is a model, a particular type of parameterized function which is built by stacking linear and nonlinear operations.
30:59
30 minutes, 59 seconds
Right? So here is kind of a depiction of a traditional neural net here, in this case with two layers.
31:06
31 minutes, 6 seconds
But I’m not imagining there might be more layers here. So you have a bunch of inputs here on the left.
31:12
31 minutes, 12 seconds
Each input is multiplied by a weight, different weights, presumably. And the weighted sum of those inputs by those weights is computed here by what’s called a unit or a neuron.
31:23
31 minutes, 23 seconds
People don’t like using the word neuron in that context because there are incredibly simplified models of neurons in the brain.
31:31
31 minutes, 31 seconds
But that’s the inspiration, really. Okay, so one of those units just computes a weighted sum of its inputs using those weights.
31:40
31 minutes, 40 seconds
This unit computes a different weighted sum of the same inputs with different weights and et cetera. So here we have three units here in the first layer.
31:48
31 minutes, 48 seconds
This is called a hidden layer, by the way, because it’s neither an input nor an output. This is the input and this is the output.
31:54
31 minutes, 54 seconds
And this is somewhere in the Middle So we compute those weighted terms and then we pass those weighted terms individually through a nonlinear function.
32:04
32 minutes, 4 seconds
So here what I’ve shown is the RELU function. So this is called rectified linear unit in the.
32:10
32 minutes, 10 seconds
This is the name that people have given it in the neural net lingual.
32:15
32 minutes, 15 seconds
In other contexts, this is called a half wave rectifier if you’re an engineer, it’s called positive part, if you are a mathematician.
32:24
32 minutes, 24 seconds
Basically, it’s a function that is equal to the identity when its argument is positive and it’s equal to zero if its argument is negative.
32:33
32 minutes, 33 seconds
So, very simple graph. Then we stack a second layer of the same thing, the second stage, again, a layer of linear operations where we compute weighted sums and then we pass a result
32:41
32 minutes, 41 seconds
to nonlinearities. We can stack many of those layers.
32:49
32 minutes, 49 seconds
And that’s basically a traditional plain vanilla garden variety neural net. In this case, fully connected.
32:55
32 minutes, 55 seconds
Fully connected neural net means that every unit in one layer is connected to every unit in the next layer.
33:01
33 minutes, 1 second
And you have this sort of well organized layer architecture.
33:06
33 minutes, 6 seconds
If you want, each of those weights are going to be the things that our learning algorithm is going to tune.
33:14
33 minutes, 14 seconds
And the big trick, the one trick really of deep learning is how we compute those gradients.
33:20
33 minutes, 20 seconds
Okay, so if you want to write this, you can say the weighted sum number I. So you can give a number to each of the units in the network.
33:32
33 minutes, 32 seconds
This unit with number I, and the weighted sum S of I is simply the sum where J goes over the upstream.
33:39
33 minutes, 39 seconds
The set of upstream units to I, which may be all the units in the previous layer, or not could be just a subset.
33:48
33 minutes, 48 seconds
Okay, and then you compute the product of zj, which is the output of the unit number J, times wij, which is the weight that links unit J to unit I.
34:01
34 minutes, 1 second
And then after that you take this si, which is a weighted sum, you pass it through the activation function, this relu, or whatever it is that you use, and that gives you zi, which is the activation for unit I.
34:12
34 minutes, 12 seconds
Okay, simple notation.
34:14
34 minutes, 14 seconds
By changing the set of upstream units of every unit, by building a graph of interconnection, you can basically build any kind of network arrangement that you want.
34:24
34 minutes, 24 seconds
There is one constraint that we can lift that we will lift in subsequent lecture, which is that the graph has to be acyclic in the sense that it can’t have loops.
34:36
34 minutes, 36 seconds
If you have loops, that means you can’t organize the units in layers.
34:40
34 minutes, 40 seconds
You can’t number them in a way that you can compute them so that every time you want to compute a unit, you already have the state of the previous unit.
34:51
34 minutes, 51 seconds
If there are loops, then you can’t do that.
34:55
34 minutes, 55 seconds
For now, we’re going to assume that the WIJ matrix, the W matrix, doesn’t have loops, represents a graph that doesn’t have loops.
35:05
35 minutes, 5 seconds
That’s what I should say. Okay, so here’s an intuitive explanation of the backpropagation algorithm.
35:13
35 minutes, 13 seconds
So the backpropagation algorithm is the main technique that is used everywhere in deep learning to compute the gradient of a cost function, whatever it is,
35:20
35 minutes, 20 seconds
objective function, with respect to a variable inside of the network.
35:27
35 minutes, 27 seconds
This variable can be a state variable like a Z or an S, or it could be a parameter variable like a W, and we’re going to need to do both.
35:36
35 minutes, 36 seconds
So this is going to be an intuitive explanation. And then after that there’s going to be a more mathematical explanation which is less intuitive, but perhaps actually easier to understand.
35:46
35 minutes, 46 seconds
But let me start with the intuition here. So let’s say we have a big network, and inside of this big network we have one of those little activation functions.
35:54
35 minutes, 54 seconds
In this case, it’s a sigmoid function, but it doesn’t matter what it is. For now, this function takes an S and produces a Z.
36:01
36 minutes, 1 second
We call this function H of S. So when we, when we wiggle Z, the cost is going to wiggle based on quantity, right?
36:10
36 minutes, 10 seconds
And if we divide the wiggling of Z by the wiggling of C by the wiggling of Z, that causes it, that gives us the partial derivative of C with respect to Z.
36:22
36 minutes, 22 seconds
So there’s one term. There’s a gradient of C with respect to all the Z’s in the network, and there’s one component of that gradient, which is the partial derivative of the cost with
36:30
36 minutes, 30 seconds
respect to that single variable Z inside the network.
36:36
36 minutes, 36 seconds
And that really indicates how much C would wiggle if we wiggled Z by some amount.
36:41
36 minutes, 41 seconds
We divide the wiggling of C by the wiggling of Z, and that gives us a partial derivative of C with respect to Z.
36:49
36 minutes, 49 seconds
This is not how we’re going to compute the gradient of C with respect to Z.
36:54
36 minutes, 54 seconds
But this is a description of what it is conceptually or intuitively rather. Ok, so let’s assume that we know this quantity.
37:03
37 minutes, 3 seconds
So we know the partial derivative of C with respect to Z. Okay, so C with respect to Z is this quantity here, DC over dz.
37:12
37 minutes, 12 seconds
So think of DZ as the weakling of Z and DC as the weakling of C. Divide one by the other and you get the partial derivative of C with respect to Z.
37:23
37 minutes, 23 seconds
What we have here is what we have to apply is chain rule, the rule that tells us how to compute the derivative of a function composed of two individual
37:31
37 minutes, 31 seconds
functions that we apply one after the other.
37:38
37 minutes, 38 seconds
So remember, chain rule.
37:40
37 minutes, 40 seconds
If you have a function G that you apply to another function H, which is function of parameter S, and you want the derivative of it, the derivative of that is
37:48
37 minutes, 48 seconds
equal to the derivative of G at point H of S multiplied by the derivative of H at point S, right?
37:56
37 minutes, 56 seconds
That’s chain rule. You learned that a few years ago. Hopefully. Now, if I want to write this in terms of partial derivative, it’s the same thing, right?
38:05
38 minutes, 5 seconds
Partial derivative is just a derivative, just with respect to one single variable. So I would write this something like this, DC over ds.
38:15
38 minutes, 15 seconds
So C really is the result of applying this H function to S and then applying some unknown G function to compute C, which is kind of the rest of the network plus the cost.
38:27
38 minutes, 27 seconds
But I’m just going to call the gradient. I’m going to assume that this DC over DZ is known, ok? Someone gave it to me.
38:36
38 minutes, 36 seconds
So this is this variable here on the right. DC over DZ is given to me and I want to compute DC over ds.
38:44
38 minutes, 44 seconds
So what I need to do is write this. DC over DS equals DC over DZ times DZ over ds. And why is this identity true?
38:52
38 minutes, 52 seconds
Is because I can simplify by dz, it’s as simple as this, right?
38:58
38 minutes, 58 seconds
So you have trivial algebra, you have DZ at the denominator here, DZ at the numerator here, simplify, you get DC over ds, okay?
39:06
39 minutes, 6 seconds
It’s a very trivial simple identity which is basically just chain rule applied to partial derivatives.
39:13
39 minutes, 13 seconds
Now, DZ over ds, we know what it is. It’s just H prime of S, ok? Just the derivative of the h function, okay?
39:21
39 minutes, 21 seconds
So we have this formula. DC over DS equals DC over dz, which we assume is known times H prime of S. What does that mean?
39:29
39 minutes, 29 seconds
That means that if we have this component of the gradient of the cost function with respect to Z here, we multiply this by the derivative of the H function at point S, the same point S that we had here.
39:42
39 minutes, 42 seconds
And what we get now is the gradient of the cost function with respect to S. Now here’s the trick.
39:49
39 minutes, 49 seconds
If we had a chain of those H functions, we could keep propagating this gradient backwards by just multiplying by the derivative of all those h functions going backwards.
39:58
39 minutes, 58 seconds
And that’s why it’s called backpropagation. So it’s just a practical application of chain rule. And if you want to kind of convince yourself of this.
40:07
40 minutes, 7 seconds
You can run through this idea of perturbation.
40:09
40 minutes, 9 seconds
Like if I twiddle S by some value, it’s going to twiddle Z by some value = to DS/h@ of S. Basically, the slope of S, right?
40:20
40 minutes, 20 seconds
So DZ equals H prime of S times ds. And then I’m going to have to multiply this by DC over dz.
40:29
40 minutes, 29 seconds
And so I rearrange the terms and I get immediately that this formula DC over DS equals DC over DZ times H prime of S. Okay?
40:40
40 minutes, 40 seconds
So we had another element in our multilayer net, which was the linear sum.
40:48
40 minutes, 48 seconds
And there it’s just a little bit more complicated, but not really. Okay?
40:52
40 minutes, 52 seconds
So one particular variable, Z, here we’d like to compute the derivative, the partial derivative of our cost function with respect to that Z.
41:02
41 minutes, 2 seconds
Okay? And we’re going to assume that we know the partial derivative of S with respect to each of those S’s, the weighted sums at the next layer that Z is going into.
41:13
41 minutes, 13 seconds
So Z only influences C through those S’s.
41:17
41 minutes, 17 seconds
So presumably, by knowing, by basically multiplying how each of those S’s influence C, and then multiplying by how Z influences each of the S’s, and summing up,
41:28
41 minutes, 28 seconds
we’re going to get the influence of Z over C. And that’s the basic idea.
41:38
41 minutes, 38 seconds
Okay? So here’s what we’re going to do. Let’s say we perturb z by dz. This is going to perturb s 0 by dz times w 0.
41:48
41 minutes, 48 seconds
Okay? We multiply z by w 0. So the derivative of this linear operation is the coefficient itself, right?
41:58
41 minutes, 58 seconds
So here the perturbation, which is ds0 is equal to dz times w0. Okay?
42:04
42 minutes, 4 seconds
And now in turn, this is going to modify C and we’re going to multiply this quantity by DC over DS0 to get the DC if you want.
42:16
42 minutes, 16 seconds
Okay? Now, whenever we perturb Z, it’s not going to perturb just S0.
42:23
42 minutes, 23 seconds
It’s also going to perturb S1 and S2. And to see the effect on C, we’re going to have to sum up the effect of the perturbation on each of the S’s and then sum them up to see the overall effect on C.
42:32
42 minutes, 32 seconds
So this is written here on the left.
42:38
42 minutes, 38 seconds
The perturbation of C is equal to the perturbation of S multiplied by the partial derivative of C with respect to S plus the perturbation of S1 multiplied by
42:48
42 minutes, 48 seconds
the partial derivative of DC with respect to S1 plus same thing for S2.
42:58
42 minutes, 58 seconds
Okay, so this is the fact that we need to take into account all the perturbations here that Z may influence.
43:07
43 minutes, 7 seconds
And so I can just write down a very simple thing.
43:11
43 minutes, 11 seconds
Because DC of 0 is equal to w 0 times dz and DS of 2 is w 2 times dz, I can plug this in there and just write DC over dz equal DC over DS 0, which I assume is
43:20
43 minutes, 20 seconds
known times w 0 plus DC over DS 1 times w 1 plus DC over DS 2 times w 2.
43:29
43 minutes, 29 seconds
Okay, if I want to represent this operation graphically, this is shown on the right here I have DC over D, S0, DC over DS1, DC over DS2, which I assume are
43:37
43 minutes, 37 seconds
known, are given to me somehow.
43:45
43 minutes, 45 seconds
I compute DC over DS0 by multiplied by W0. I multiply DC over DS1 by W1, D0, DS2 by W2, I sum them up and that gives me DC over DZ.
43:56
43 minutes, 56 seconds
Okay, it’s just the formula here. Okay, so here’s the cool trick about back propagation through a linear module that computes weighted sums.
44:07
44 minutes, 7 seconds
You take the same weights and you still compute weighted sum with those weights, but you use the weights backwards.
44:16
44 minutes, 16 seconds
Okay? So whenever you had the unit that was sending its output to multiple outputs to multiple units through a weight, you take the gradient of the cost with respect
44:24
44 minutes, 24 seconds
to all those weighted sums and you compute their weighted sum backwards, using the weights backwards to get the gradient with respect to the state of the unit
44:32
44 minutes, 32 seconds
at the bottom.
44:40
44 minutes, 40 seconds
And you can do this for all the units. Okay, so it’s super simple.
44:45
44 minutes, 45 seconds
Now, if you were to write a program to do backprop for classical neural nets in Python, it would take like half a page.
44:53
44 minutes, 53 seconds
It’s very, very simple. Is one function to compute weighted sums going forward in the right order, another function and applying the nonlinearity.
45:04
45 minutes, 4 seconds
And here’s another function to compute weighted sums going backward and multiplying by the derivative of the nonlinearity at every step.
45:13
45 minutes, 13 seconds
It’s incredibly simple. What’s surprising is that it took so long for people to realize this was so useful.
45:21
45 minutes, 21 seconds
Maybe because it was too simple. Okay, so it’s useful to write this in matrix form.
45:27
45 minutes, 27 seconds
So really, the way you should think about a neural net of this type is each state inside the network. Think of it as a vector.
45:34
45 minutes, 34 seconds
It could be a multidimensional array. But let’s think of it just as a vector.
45:39
45 minutes, 39 seconds
A linear operation is just going to multiply this vector by matrix, and each row of the matrix contains all the weights that are used to compute a particular weighted sum for a particular unit.
45:51
45 minutes, 51 seconds
Okay, so multiply this by this matrix. So this dimension has to be equal to that dimension, which is not really well depicted here.
46:00
46 minutes
Actually, one sec from the previous slide, you wrote DS0. What is S differential with respect to?
46:08
46 minutes, 8 seconds
So there is a DS. What is DS? Basically, DS0. You mean? Okay, DS0 is a perturbation of S0. Okay?
46:16
46 minutes, 16 seconds
An infinitely small perturbation of S0 doesn’t matter what it is. Okay?
46:22
46 minutes, 22 seconds
And what we’re saying here is that if you have an infinitely small perturbation of S0, and you multiply this perturbation by the partial derivative of C With
46:34
46 minutes, 34 seconds
respect to S0, you get the perturbation of C, except that corresponds to this perturbation of S0.
46:46
46 minutes, 46 seconds
But we’re not interested in just a perturbation of S0. We’re interested in the perturbation of S1 and S2.
46:53
46 minutes, 53 seconds
So the overall perturbation of C would be the sum of the perturbations of S0, S1 and S2 multiplied by the corresponding partial derivative of C with respect to each of them.
47:03
47 minutes, 3 seconds
It’s a virtual thing, right? It’s not an existing thing you’re going to manipulate. Just imagine that there is some perturbation of S0 here.
47:13
47 minutes, 13 seconds
This is going to perturb C by some value, and that value is going to be the perturbation of S0 multiplied by the partial derivative of C with respect to S0.
47:22
47 minutes, 22 seconds
And then if you perturb S1 simultaneously, you’re also going to cause a perturbation of C. If you perturb S2 simultaneously, you’re also going to cause a perturbation of C.
47:32
47 minutes, 32 seconds
The overall perturbation of C will be the sum of those perturbations, and that is given by this expression here.
47:41
47 minutes, 41 seconds
Now, those D, those infinitely small quantities, ds, dc, et cetera, think of them as numbers.
47:49
47 minutes, 49 seconds
You can do algebra with them. You can divide one by the other. You can do stuff like that.
47:57
47 minutes, 57 seconds
So now you say, what is ds0 equal to? If I tweak z by a quantity dz, it’s going in turn to modify s0 by ds0.
48:07
48 minutes, 7 seconds
And what is the quantity by which s0 is going to be tweaked?
48:12
48 minutes, 12 seconds
If I tweak z by dz, because S is the result of computing the product of z by w 0, then the perturbation is also going to be multiplied by w0, right?
48:24
48 minutes, 24 seconds
So the ds0 corresponding to a particular dz is going to be equal to dz times w0.
48:31
48 minutes, 31 seconds
And this is what’s expressed here, okay? Ds0 equals w0 dz.
48:37
48 minutes, 37 seconds
Now if I take this expression for ds0 and I insert it here in this formula, I get dc equals w 0 times dz times dc over ds 0 plus same thing for 1 plus same thing for 2.
48:50
48 minutes, 50 seconds
And I’m going to take the DZ and pass it to the other side. I’m going to divide both sides by dz.
48:58
48 minutes, 58 seconds
So now I get DC over DZ equal. The DZ doesn’t appear anymore because it’s been put Underneath here is W0 times DC over DS0 plus W1 times DC over DS1, et cetera.
49:11
49 minutes, 11 seconds
Okay. It’s just simple algebra. It’s differential calculus basically. Right? So it’s better to write this in matrix form.
49:21
49 minutes, 21 seconds
So really when you’re computing, if I go back a few slides, when this is really kind of a matrix of all the weights that are kind of upstream of z Of, of the ZJs.
49:34
49 minutes, 34 seconds
So you can align the ZJ as a vector, maybe only the ZJs that have a non zero terms in Wij.
49:42
49 minutes, 42 seconds
And then you can write those W’s as a matrix. And this is just a matrix vector product. Okay?
49:48
49 minutes, 48 seconds
So this is the way this would be written. You have a vector, you multiply by matrix, you get a new vector, pass that through nonlinearities, values, multiply that by a matrix, et cetera.
50:00
50 minutes
So symbolically, you can write a simple neural net this way, where you have linear blocks, linear functional blocks, which basically take the previous state and multiply it by matrix.
50:13
50 minutes, 13 seconds
So you have a state here, z1 multiplied by a matrix, you get w1, z1, and that gives you the vector of weighted sums s2.
50:22
50 minutes, 22 seconds
Then you take that, pass it to the, the nonlinear functions, each component individually, and that gives you Z2.
50:30
50 minutes, 30 seconds
So that’s a two layer neural net, first weight matrix, nonlinearity, second weight matrix, nonlinearity, third weight matrix.
50:39
50 minutes, 39 seconds
And this is the output. There are two hidden layers, three layers of weights. Okay.
50:46
50 minutes, 46 seconds
The reason for writing it this way is that this is like symbolically the easiest way to understand really what, what kind of backprop does.
50:57
50 minutes, 57 seconds
And in fact it corresponds also to the way we define neural nets and we run them on deep learning frameworks like Pytorch.
51:07
51 minutes, 7 seconds
So this is the sort of object oriented version of defining a neural net.
51:13
51 minutes, 13 seconds
In Pytorch we’re going to use predefined class, which are the linear class that basically multiplies a vector by matrix.
51:21
51 minutes, 21 seconds
It also has biases, but let’s not talk about this just now.
51:25
51 minutes, 25 seconds
And another class, which is the RELU function, which takes a vector or a multidimensional array and applies the nonlinear function to every component separately.
51:36
51 minutes, 36 seconds
Okay, so this is a little piece of Python program that uses Torch. We import torch, we make an image which is 10 pixels by 20 pixels and three components for color.
51:47
51 minutes, 47 seconds
We compute the size of it, and we’re going to plug a neural net where the number of inputs is the number of components of our image.
51:55
51 minutes, 55 seconds
So in this case that would be 600 or so, and we’re going to define a class. The class is going to define a neural net.
52:02
52 minutes, 2 seconds
And that’s pretty much all we need to do here. So we define our network architecture.
52:08
52 minutes, 8 seconds
It’s a subclass of neural net module, which is a pretty fine class.
52:13
52 minutes, 13 seconds
It’s got a constructor here that will take the sizes of the internal layers that we want.
52:20
52 minutes, 20 seconds
The size of the input, the size of S1 and z1, the size of S2 and z2, and the size of S3 we call the parent class initializer.
52:29
52 minutes, 29 seconds
And then we just create three modules that are all linear modules, and we need to kind of store them somewhere because they have internal parameters.
52:37
52 minutes, 37 seconds
So we’re going to have three slots in our object, n 0 and 1, n 2 module 1 module 0 module 1, 1 module 2.
52:43
52 minutes, 43 seconds
Each of them is going to be an instance of the class nn linear with two sizes, the input size and the output size.
52:50
52 minutes, 50 seconds
Okay, so the first module has input size D0, output size D1, et cetera. And those classes are, since there is a capital L, means it’s an object, and inside, there are parameters inside that item there.
53:02
53 minutes, 2 seconds
Right? Right. So, for example, the RELU doesn’t have a capital because it doesn’t have internal parameters. It’s not kind of a trainable module.
53:11
53 minutes, 11 seconds
It’s just a function. Whereas those things with capitals, they have sort of internal parameters, the weight matrices inside of them.
53:18
53 minutes, 18 seconds
So now we define a forward function which basically computes the output from the input.
53:24
53 minutes, 24 seconds
And the first thing we do is we take the input thing, which may be a multidimensional array, and we flatten it.
53:33
53 minutes, 33 seconds
We flatten it using this idiomatic expression here in Pytorch. And then we apply the first module to X.
53:41
53 minutes, 41 seconds
We put the result in S1, which is a temporary variable, local variable.
53:47
53 minutes, 47 seconds
Then we apply the relu to S1, put the result in Z, then apply the second layer, put the result in S2, apply the value again, put the result in S3, and then the
53:55
53 minutes, 55 seconds
last linear layer, put the result in S3 and return S3.
54:04
54 minutes, 4 seconds
And there is a typo. So the second line should have been S1. It’s the self M0 of the Z0. Right, but Z0 here.
54:11
54 minutes, 11 seconds
Yes, correct. I know, yeah. This is something that was going to be fixed. Right. Which I didn’t fix.
54:20
54 minutes, 20 seconds
I know this is Z0. Thanks for reminding me of this. Okay, but you’ll see examples.
54:27
54 minutes, 27 seconds
I mean, I’ll show you kind of actual examples of this and you’ll be able to run them yourself.
54:35
54 minutes, 35 seconds
That’s all you need to do. You don’t need to write how you compute the back prop, how you propagate the gradients.
54:45
54 minutes, 45 seconds
You could write it and it would be as simple as forward. You could write a backward function and it would basically multiply by the matrices going backwards.
54:53
54 minutes, 53 seconds
But you don’t need to do this because Pytorch does this automatically for you.
54:57
54 minutes, 57 seconds
When you define the forward function, it knows what modules you’ve called in what order, what are the dependencies between the variables, and it will know how to generate the functions that computes the gradient backwards.
55:07
55 minutes, 7 seconds
So you don’t need to worry about it. Okay. That’s the magic of Pytorch, if you want. That’s a bit the magic of deep learning, really.
55:15
55 minutes, 15 seconds
That’s called automatic differentiation, and this is a particular form of automatic differentiation. There’s another way to write functions in Pytorch that are kind of more functional.
55:25
55 minutes, 25 seconds
So you’re not using modules with internal parameters, you just coding functions one after the other.
55:30
55 minutes, 30 seconds
And Pytorch has a mechanism by which it can automatically compute the gradient of any function you define with respect to whatever parameters you want.
55:39
55 minutes, 39 seconds
Yeah, actually these big guys with a capital L, like the nn linear inside is going to have a lowercase linear which is like the functional part which is
55:49
55 minutes, 49 seconds
performing the matrix multiplication between the weight stored inside the object with a capital L and then the input.
55:58
55 minutes, 58 seconds
So every capital letter object will inside have the functional way.
56:03
56 minutes, 3 seconds
So one can decide to use either the functional form by default, or use this encapsulated version, which are more convenient to just use.
56:13
56 minutes, 13 seconds
Right, right. So in the end you can create an instance of this class. You can create multiple instances, but you can create one here.
56:22
56 minutes, 22 seconds
Just call MyNet and give it the sizes you want. And then to apply this to a particular image, you just do out equal model of image.
56:31
56 minutes, 31 seconds
That’s as simple as that. Okay, so this is your first neural net and it does all the backprop automatically.
56:38
56 minutes, 38 seconds
But you need to understand how backprop works. Right.
56:41
56 minutes, 41 seconds
It’s not because Pytorch does it for you that you can sort of forget about how you actually compute the gradient of a function because it’s inevitable that at
56:48
56 minutes, 48 seconds
some point you’re going to want to actually assemble a neural net with a module that does not pre exist, and you’re going to have to write your own backprop function.
57:00
57 minutes
So to do this you basically have. If you want to create a new module of some complex operation that does not pre exist in Pytorch, then you do something like this.
57:14
57 minutes, 14 seconds
You define a class, but you write your own backward function, basically.
57:21
57 minutes, 21 seconds
Okay, so let’s get one step up in terms of abstraction and write this in slightly more generic form, mathematical form, if you want.
57:34
57 minutes, 34 seconds
So let’s say we have a cost function here, and we want to compute the gradient of this cost function with respect to a particular vector in the system.
57:45
57 minutes, 45 seconds
Zf could be a parameter, it could be a state. Doesn’t matter.
57:50
57 minutes, 50 seconds
Okay, Some states inside, and we have chain rule, and chain rule is nothing more than this that I explained earlier.
57:58
57 minutes, 58 seconds
DC over DZF is equal DC over dzg, DZG over dzf.
58:02
58 minutes, 2 seconds
As long as C is only influenced by zf through zg, there’s no other way for zf to influence C than to go through zg, then this formula is correct, okay?
58:17
58 minutes, 17 seconds
And of course, the identity is trivial because it’s just a simplification by this infinitesimal vector quantity dzg.
58:28
58 minutes, 28 seconds
Okay? So let’s say zg is a Vector of size DG by 1, so this means a column vector, okay?
58:37
58 minutes, 37 seconds
And zf is a column vector of size df.
58:45
58 minutes, 45 seconds
This is, if you want to write the correct dimensions of this, we get something a little complicated.
58:53
58 minutes, 53 seconds
Okay? So first of all, this object here, DZG over dzf. Well, let me start with this one, okay?
59:01
59 minutes, 1 second
This one, DC over dzg, that’s a gradient vector, okay?
59:05
59 minutes, 5 seconds
If zg is a vector, DC over DZG is a gradient vector, and it’s the same size as dzg, but by convention we actually write it as a line, as a row vector.
59:18
59 minutes, 18 seconds
Okay? So this thing here is going to be a row vector whose size is the same size as zg, but it’s going to be horizontal instead of vertical.
59:28
59 minutes, 28 seconds
Okay, this object here is something more complicated. It’s actually a matrix. Why is it?
59:34
59 minutes, 34 seconds
Matrix is because it’s the derivative of a vector with respect to another vector. Okay?
59:40
59 minutes, 40 seconds
So let’s look at this diagram here on the right, we have a function G. It takes zf as an input and it produces zg as an output.
59:49
59 minutes, 49 seconds
And if we want to capture the information about the derivative of that module, which is this quantity here, dzg over dzf.
59:57
59 minutes, 57 seconds
There’s a lot of terms to capture because there’s a lot of ways in which every single output, every component of zg, can be influenced by every component of zf,
1:00:02
1 hour, 2 seconds
right?
1:00:08
1 hour, 8 seconds
So for every pair of components ZG and zf, there is a derivative term which indicates by how much zg would be perturbed if I perturbed zf by a small
1:00:16
1 hour, 16 seconds
infinitesimal quantity, right?
1:00:23
1 hour, 23 seconds
We have that for every pair of components of ZG and zf.
1:00:28
1 hour, 28 seconds
As a result, this is a matrix whose dimension is the number of rows is the size of zg, and the number of columns is the size of zf.
1:00:42
1 hour, 42 seconds
And each term in this matrix is one partial derivative term.
1:00:48
1 hour, 48 seconds
So this whole matrix here, if I take the component ij, it’s the partial derivative of the ith output of that module, the ith component of zg, with respect to
1:00:56
1 hour, 56 seconds
the jth component of zf.
1:01:04
1 hour, 1 minute, 4 seconds
Okay? So what we get here is a row vector is equal to a row vector multiplied by a matrix, and the sizes kind of work out so that, you know, they’re compatible with
1:01:12
1 hour, 1 minute, 12 seconds
each other. Okay?
1:01:20
1 hour, 1 minute, 20 seconds
So what is back propagation? Now, back propagation is this formula, okay?
1:01:24
1 hour, 1 minute, 24 seconds
It says if you have the gradient of some cost function with respect to some variable, and you know, the dependency of this variable with respect to another
1:01:31
1 hour, 1 minute, 31 seconds
variable, you multiply this gradient vector by that Jacobian matrix and you get the gradient vector with respect to that second variable.
1:01:38
1 hour, 1 minute, 38 seconds
So that graphically here on the right, if I have the gradient of the cost with respect to zg, which is DC over dzg, and I want to compute the gradient of C with
1:01:46
1 hour, 1 minute, 46 seconds
respect to zf, which is DC of dzf.
1:01:54
1 hour, 1 minute, 54 seconds
I only need to take that vector, which is a row vector, multiplied by the Jacobian matrix DG over dzf or DZG over dzf, and I get DC over dzf.
1:02:05
1 hour, 2 minutes, 5 seconds
Okay, Is this formula? Someone is objecting here? Isn’t a summation missing here? Which summation?
1:02:14
1 hour, 2 minutes, 14 seconds
A summation of all the components of these partial multiplications here. Yeah. Well, this is a vector. This is a vector.
1:02:22
1 hour, 2 minutes, 22 seconds
This is a matrix. There’s a lot of sums going on here because when you compute the product of this vector of this matrix, you’re going to have a lot of sums, right? Yep. So it’s hidden, right?
1:02:31
1 hour, 2 minutes, 31 seconds
Yeah, the sums are hidden, okay, inside of this vector matrix product.
1:02:37
1 hour, 2 minutes, 37 seconds
Like you can take a specific example, let’s imagine that this G function is just a matrix multiplication, okay?
1:02:45
1 hour, 2 minutes, 45 seconds
We Just multiply by zf by matrix W. So we have a linear operation.
1:02:51
1 hour, 2 minutes, 51 seconds
The derivative of the Jacobian matrix of the multiplication by matrix is the transpose of that matrix.
1:02:59
1 hour, 2 minutes, 59 seconds
So what we’re going to do here is take this vector multiplied by the transpose of the W matrix and what we get is that vector.
1:03:08
1 hour, 3 minutes, 8 seconds
Okay? And it all makes sense, right? The sizes make sense. This matrix here is the transpose of the weight matrix, which of course had the reverse size.
1:03:19
1 hour, 3 minutes, 19 seconds
We pre multiply it by the row vector of the gradient from the layer above and we get the gradient from respected layer below.
1:03:30
1 hour, 3 minutes, 30 seconds
Okay, so backpropagating through a linear module just means multiplying the transpose of the matrix used by that module.
1:03:38
1 hour, 3 minutes, 38 seconds
And it’s just a generalized form of what I explained earlier of propagating through the weights of a linear system.
1:03:45
1 hour, 3 minutes, 45 seconds
But it’s less intuitive, right? Okay. So we’re going to be able to do backpropagation by computing gradients all the way through by propagating backwards.
1:03:55
1 hour, 3 minutes, 55 seconds
But this module really has two inputs.
1:03:59
1 hour, 3 minutes, 59 seconds
It has an input which is zf and the other one is wg, the weight matrix, the parameter vector that is used inside of this module.
1:04:10
1 hour, 4 minutes, 10 seconds
So there is a second Jacobian matrix which is the Jacobian matrix of ZG with respect to the terms of this weight parameter.
1:04:21
1 hour, 4 minutes, 21 seconds
Okay. And to compute the gradient of the cost function with respect to those weight parameter, I need to multiply this gradient vector by the Jacobian matrix of that block with respect to its weight.
1:04:35
1 hour, 4 minutes, 35 seconds
And it’s not the same as the Jacobian matrix with respect to the input, it’s a different Jacobian matrix.
1:04:42
1 hour, 4 minutes, 42 seconds
I’ll come back to this in a second.
1:04:45
1 hour, 4 minutes, 45 seconds
So to do backprop again, if we have a vector of gradients of some cost with respect to a state, and we have a function that is a function of one or several
1:04:53
1 hour, 4 minutes, 53 seconds
variables, we multiply this gradient by the Jacobian matrix of this block with respect to each of these inputs and that gives us the gradient with respect to
1:05:02
1 hour, 5 minutes, 2 seconds
each of the inputs that’s going to be expressed here.
1:05:11
1 hour, 5 minutes, 11 seconds
So this is the back propagation of states in a layer wise classical Type Neural Net dc over dzk, which is the state of layer k is dc over zk +1, which is the
1:05:24
1 hour, 5 minutes, 24 seconds
gradient of the cost with respect to the layer above times the Jacobian matrix of the State of layer K plus 1 with respect to the state of layer k. Right.
1:05:38
1 hour, 5 minutes, 38 seconds
Now we assume DC over DZ K plus 1 is known and we just need to multiply by the Jacobian matrix of the function that links zk to zk plus 1, the function that is used to compute zk plus 1 from zk.
1:05:49
1 hour, 5 minutes, 49 seconds
And this may be a function also of some parameters inside. But here that’s the matrix of partial derivatives of F whose output is ZK +1 with respect to each of the components of ZK.
1:06:00
1 hour, 6 minutes
OK, so that’s the first rule of backpropagation, and it’s a recursive rule. So you can start from the top.
1:06:07
1 hour, 6 minutes, 7 seconds
You start initially with DC over dc, which is one, which is why I have this one here on top.
1:06:13
1 hour, 6 minutes, 13 seconds
And then you just keep multiplying by the Jacobian matrix all the way down, and that back propagates gradients.
1:06:20
1 hour, 6 minutes, 20 seconds
And now you get gradients with respect to all the states. You also want the gradients with respect to the weights, because that’s what you need to do learning.
1:06:31
1 hour, 6 minutes, 31 seconds
So what you can write is the same chain rule, dc over dwk is equal to dc over dzk plus 1, which we assume is known times dzk plus 1 over dwk.
1:06:40
1 hour, 6 minutes, 40 seconds
And you can write this as dc over dk plus 1. And the dependency between zk plus 1 and wk is the function zk applied to wk.
1:06:48
1 hour, 6 minutes, 48 seconds
So you can differentiate the function, the output of the function ZK with respect to wk, and that gives you another Jacobian matrix.
1:06:56
1 hour, 6 minutes, 56 seconds
And so with those two formulas you can do backprop with just about anything, really.
1:07:01
1 hour, 7 minutes, 1 second
What goes on inside Pytorch and inside most of those frameworks, TensorFlow and JAX and whatever, is something like this where you have.
1:07:10
1 hour, 7 minutes, 10 seconds
So let’s take a very simple diagram here where you have an input parameterized function that computes an output that goes to a cost function.
1:07:20
1 hour, 7 minutes, 20 seconds
And that cost function measures the discrepancy between the output of the system and the desired output.
1:07:27
1 hour, 7 minutes, 27 seconds
So you can write this function as C of G. I didn’t put the X here, but just for charity.
1:07:33
1 hour, 7 minutes, 33 seconds
And the derivative of this is again you apply chain rule.
1:07:37
1 hour, 7 minutes, 37 seconds
Or you can write it with partial derivatives this way and same for expand the dependency of the output with respect to the parameters as the Jacobian matrix of
1:07:48
1 hour, 7 minutes, 48 seconds
G with respect to W. If W is a scalar, then this is just a derivative, partial derivative.
1:08:00
1 hour, 8 minutes
Okay, now you can express this as a compute graph. So you can say like how am I going to compute DC over dw?
1:08:10
1 hour, 8 minutes, 10 seconds
What I’m going to have to do is take the value 1, which is the derivative of C with respect to itself.
1:08:17
1 hour, 8 minutes, 17 seconds
Basically the loss with respect to itself. I’m going to multiply this by the derivative of the Cost with respect to Y bar.
1:08:25
1 hour, 8 minutes, 25 seconds
Okay, and that’s going to give me D.C. over DY bar, obviously. Okay, this is the same as this, because I just multiplied by one.
1:08:35
1 hour, 8 minutes, 35 seconds
Then multiply this by the Jacobian matrix of G with respect to W, which is a derivative.
1:08:42
1 hour, 8 minutes, 42 seconds
If W is a scalar, that of course depends on X and I get DC over dw. So this is so called compute graph, right?
1:08:50
1 hour, 8 minutes, 50 seconds
This is a way of organizing operations to compute the gradient.
1:08:55
1 hour, 8 minutes, 55 seconds
And there is essentially an automatic way of transforming a graph of a compute graph of this type into a compute graph of this type that computes the gradient automatically.
1:09:06
1 hour, 9 minutes, 6 seconds
And this is the magic that happens in the automatic differentiation inside PyTorch and TensorFlow and other systems.
1:09:14
1 hour, 9 minutes, 14 seconds
Some systems are pretty smart about this in a sense that those functions can be fairly complicated.
1:09:20
1 hour, 9 minutes, 20 seconds
They can involve themselves computing derivatives and stuff, and they can involve dynamic computation where the graph of computation depends on the data.
1:09:29
1 hour, 9 minutes, 29 seconds
And actually Pytorch handles this properly. I’m not going to go through all the details of this, but this is kind of a way of reminding you what the dimensions of all those things are.
1:09:41
1 hour, 9 minutes, 41 seconds
So if Y is a column vector of size M, W is a column vector of size N, then this is a row vector of size N, this is a row vector of size M, and this is a
1:09:51
1 hour, 9 minutes, 51 seconds
Jacobian matrix of size N by N. And all of this works out okay.
1:10:02
1 hour, 10 minutes, 2 seconds
So the way we’re going to build neural nets, and I’ll come back to this in a subsequent lecture, is that we are going to have at our disposal a large collection
1:10:14
1 hour, 10 minutes, 14 seconds
of of basic modules which we’re going to be able to arrange in more or less complex graphs as a way to build the architecture of a learning system.
1:10:26
1 hour, 10 minutes, 26 seconds
Okay, so either we’re going to write a class or we’re going to write a program that runs the forward pass.
1:10:35
1 hour, 10 minutes, 35 seconds
And this program is going to be composed of basic mathematical operations, addition, subtraction of tensors or multidimensional arrays, other types of scalar
1:10:46
1 hour, 10 minutes, 46 seconds
operations, or the application of one of the predefined complex parameterized functions like a linear module, a ReLU, or things like that.
1:11:00
1 hour, 11 minutes
We have at our disposal a large library of such modules, which are things that people have come up with over the years that are kind of basic modules that are used in a lot of applications.
1:11:13
1 hour, 11 minutes, 13 seconds
Right? So the basic things that we’ve seen so far are things like relu’s. There’s other nonlinear functions like sigmoids and variations of this.
1:11:22
1 hour, 11 minutes, 22 seconds
There’s a large collection of them.
1:11:25
1 hour, 11 minutes, 25 seconds
And then we have cost Functions like squared error, cross entropy, hinge loss, ranking loss, and blah, blah, blah, which I’m not going to go through now, but we’ll talk about this later.
1:11:38
1 hour, 11 minutes, 38 seconds
The nice thing about this formalism is that as I said before, you can sort of compute graphs.
1:11:47
1 hour, 11 minutes, 47 seconds
You can construct a deep learning system by assembling those modules in any kind of arrangement you want, as long as there is no loops in the connection graph.
1:11:57
1 hour, 11 minutes, 57 seconds
So as long as you can come up with a partial order in those modules, that will ensure that they are computed in the proper way.
1:12:06
1 hour, 12 minutes, 6 seconds
Okay, but there is a way to handle loops, and that’s called recurrent nets. We’ll talk about this later.
1:12:16
1 hour, 12 minutes, 16 seconds
Okay, so here’s a few practical tricks if you want to play with neural nets, and you’re going to do that soon enough, perhaps even tomorrow.
1:12:24
1 hour, 12 minutes, 24 seconds
So, and these are kind of a bit of a black art of deep learning, which is sort of, a lot of it is implemented already in things like Pytorch if you use standard tools.
1:12:34
1 hour, 12 minutes, 34 seconds
But some of it is kind of more of the sort of oral culture, if you want, of the deep learning community.
1:12:41
1 hour, 12 minutes, 41 seconds
You can find this in papers, but it’s a little difficult to find sometimes. So most neural nets use ReLU’s as the main nonlinearity.
1:12:49
1 hour, 12 minutes, 49 seconds
So this sort of half wave rectifier, hyperbolic tangent, which is a sigmad function, and logistic function, which is also a Siemian function, are used, but not as much, not nearly as much.
1:13:01
1 hour, 13 minutes, 1 second
You need to initialize the waste properly. So if you have a neural net and you initialize the waste to zero, it never takes off, it will never learn.
1:13:10
1 hour, 13 minutes, 10 seconds
The gradients will always be zero, all the time. And the reason is because when you back propagate the gradient, you multiply by the transpose of the weight matrix.
1:13:19
1 hour, 13 minutes, 19 seconds
If that weight matrix is zero, your gradient is zero. So if you start with all the weights equal to zero, you never take off.
1:13:26
1 hour, 13 minutes, 26 seconds
And someone asked a question about saddle points before. Zero is a saddle point. And so if you start at this saddle point, you never get out of it.
1:13:36
1 hour, 13 minutes, 36 seconds
So you have to break the symmetry in the system. You have to initialize the weights to small random values.
1:13:44
1 hour, 13 minutes, 44 seconds
They don’t need to be random, but it works fine if they are random. And the way you initialize is actually quite important.
1:13:53
1 hour, 13 minutes, 53 seconds
So there’s all kinds of tricks to initialize things properly. One of the tricks was invented by my friend leon betou about 30 years ago.
1:14:01
1 hour, 14 minutes, 1 second
Even more than that, actually 34 years ago, almost, unfortunately, now it’s called differently, it’s called The Kaiming trick, but it’s the same and it consists in initializing the weights to random values in such a way that you.
1:14:15
1 hour, 14 minutes, 15 seconds
If a unit has many inputs, the weights are smaller than if it has few inputs. And the reason for this is that you want the weighted sum to be roughly kind of have some reasonable value.
1:14:27
1 hour, 14 minutes, 27 seconds
If the input variables have some reasonable value, let’s say variance one or something like this, and you’re computing a weighted sum of them, the weighted sum,
1:14:34
1 hour, 14 minutes, 34 seconds
the size of the weighted sum, is going to grow like the square root of the number of inputs.
1:14:41
1 hour, 14 minutes, 41 seconds
And so you want to set the weights to something like the inverse square root if you want the weighted sum to be kind of about the same size as each of the inputs.
1:14:52
1 hour, 14 minutes, 52 seconds
So that’s built into Pytorch. You can call this initialization procedure. What’s the exact name of it? Alfredo?
1:14:59
1 hour, 14 minutes, 59 seconds
I can’t remember the one that. Then there is the Xavier and then there is also yours we have in Pytorch.
1:15:08
1 hour, 15 minutes, 8 seconds
Yeah, they’re slightly different, but they kind of do the same, more or less. Yeah, the Xavier Glot version. Yeah, yeah, yeah.
1:15:16
1 hour, 15 minutes, 16 seconds
This one divides by the fanin and fan out. There’s various loss functions.
1:15:20
1 hour, 15 minutes, 20 seconds
So I haven’t talked yet about what the cross entropy loss is, but cross entropy loss is a particular cost that’s used for classification.
1:15:29
1 hour, 15 minutes, 29 seconds
I’ll probably talk about this next week, unless I have some time at the end of this lecture. This is for classification.
1:15:36
1 hour, 15 minutes, 36 seconds
As I said, we use stochastic gradient descent on mini batches and mini batches only because the hardware that we have needs mini batches to perform properly.
1:15:45
1 hour, 15 minutes, 45 seconds
If we had different hardware, we would use mini batch size one. As I said before, we need to shuffle the training samples.
1:15:54
1 hour, 15 minutes, 54 seconds
So if someone gives you a training set and puts all the examples of category one that only dumps category two or they don’t put category three, etc.
1:16:03
1 hour, 16 minutes, 3 seconds
If you use stochastic gradient by keeping this order, it is not going to work.
1:16:08
1 hour, 16 minutes, 8 seconds
You have to shuffle the samples so that you basically get samples from all the categories within kind of a small subset if you want.
1:16:19
1 hour, 16 minutes, 19 seconds
There is an objection here for the stochastic gradient. Isn’t Adam better? All right. Okay.
1:16:27
1 hour, 16 minutes, 27 seconds
There is a lot of variance of stochastic gradient. They are all stochastic gradient methods.
1:16:33
1 hour, 16 minutes, 33 seconds
In fact, people in optimization say this should not be called stochastic gradient descent because it’s not a descent algorithm.
1:16:42
1 hour, 16 minutes, 42 seconds
Because stochastic gradient sometimes goes uphill because of the noise. Right.
1:16:48
1 hour, 16 minutes, 48 seconds
So people who want to really kind of be correct about this is stochastic gradient optimization, but not stochastic gradient descent.
1:16:58
1 hour, 16 minutes, 58 seconds
That’s the first thing. Stochastic gradient optimization or stochastic gradient descent.
1:17:04
1 hour, 17 minutes, 4 seconds
SGD is a special case of gradient based optimization, okay?
1:17:08
1 hour, 17 minutes, 8 seconds
And the specification of it says you have to have a step size eta, but nobody tells you how you set the step size eta.
1:17:16
1 hour, 17 minutes, 16 seconds
And nobody tells you that this step size is a scalar or a diagonal matrix or a full matrix, okay?
1:17:24
1 hour, 17 minutes, 24 seconds
So there are variations of SGD in which ADA is changed all the time for every sample or every batch.
1:17:31
1 hour, 17 minutes, 31 seconds
In sgd, most of the time this ADA is decreased according to a schedule.
1:17:37
1 hour, 17 minutes, 37 seconds
And there are a bunch of standard schedule in Pytorch that are implemented in techniques like adam.
1:17:46
1 hour, 17 minutes, 46 seconds
The eta is actually a diagonal matrix, and that diagonal matrix, the term in the diagonal matrix, are changed all the time.
1:17:56
1 hour, 17 minutes, 56 seconds
They’re computed based on some estimate of the curvature of the cost function. There’s a lot of methods to do this.
1:18:04
1 hour, 18 minutes, 4 seconds
They’re all SGD type methods. ADAM is an SGD method with a special type of ada.
1:18:10
1 hour, 18 minutes, 10 seconds
So yeah, in the OPT in package in Torch, there’s a whole bunch of those methods.
1:18:19
1 hour, 18 minutes, 19 seconds
There’s going to be a whole lecture on this, so don’t worry about it. About optimization.
1:18:26
1 hour, 18 minutes, 26 seconds
Normalize input variables to zero mean and unit variance.
1:18:30
1 hour, 18 minutes, 30 seconds
So this is a very important point that this type of optimization method, gradient based optimization methods, when you have weighted sums, kind of linear
1:18:38
1 hour, 18 minutes, 38 seconds
operations, tends to be very sensitive to how the data is prepared.
1:18:47
1 hour, 18 minutes, 47 seconds
So if you have two variables that have very widely different variances, one of them varies between, let’s say minus 1 and plus 1.
1:18:55
1 hour, 18 minutes, 55 seconds
The other one varies between minus 100 and plus 100. The system will basically not pay attention to the one that varies between plus 1 and minus 1, will only pay attention to the big one.
1:19:06
1 hour, 19 minutes, 6 seconds
And this may be good or this may be bad. Furthermore, the learning rate, you’re going to have to use the ADA parameter.
1:19:15
1 hour, 19 minutes, 15 seconds
The step size is going to have to be set to a relatively small value to prevent the weights that look at this highly variable input from diverging.
1:19:23
1 hour, 19 minutes, 23 seconds
The gradients are going to be very large because the gradients basically are proportional to the size of the input or even to the variance of the input.
1:19:32
1 hour, 19 minutes, 32 seconds
So if you don’t want your system to diverge, you’re going to have to tune down the running rate if the input variance is large.
1:19:41
1 hour, 19 minutes, 41 seconds
If the input variables are all shifted, they’re all between, let’s say 99 and 101 instead of minus 1 and 1.
1:19:48
1 hour, 19 minutes, 48 seconds
Then again, it’s very difficult for a gradient based algorithm that use weighted sums to kind of figure out those things again.
1:19:57
1 hour, 19 minutes, 57 seconds
I’ll talk about this more formally later. Right now just remember the trick that you need to normalize your input.
1:20:05
1 hour, 20 minutes, 5 seconds
So basically take every variable of your input, subtract the mean. You compute the mean over the training set of each variable.
1:20:12
1 hour, 20 minutes, 12 seconds
So let’s say your training set is a set of images. The images are, let’s say 100 by 100 pixels. Let’s say they’re grayscale.
1:20:19
1 hour, 20 minutes, 19 seconds
So you get 10,000 variables and let’s say you get a million samples, right?
1:20:23
1 hour, 20 minutes, 23 seconds
You’re going to get each, you’re going to take each of those 1,000, each of those 10,000 variables, compute the mean of it over the training set, compute the standard deviation of it over the entire training set.
1:20:36
1 hour, 20 minutes, 36 seconds
And the samples you’re going to show to your system are going to be a sample where you have subtracted the mean from each of the 10,000 pixels and divided the
1:20:45
1 hour, 20 minutes, 45 seconds
resulting values by the standard deviation that you computed.
1:20:53
1 hour, 20 minutes, 53 seconds
Okay? So now what you have is a bunch of variables that are all zero mean and all standard deviation equal to one.
1:21:02
1 hour, 21 minutes, 2 seconds
And that makes your neural net happy. That makes your optimization algorithm happy. Actually, we have actually a question.
1:21:12
1 hour, 21 minutes, 12 seconds
So you keep repeating SGD type methods, gradient based methods, because there are other types of methods.
1:21:20
1 hour, 21 minutes, 20 seconds
Yes. Okay, so there is gradient free methods.
1:21:23
1 hour, 21 minutes, 23 seconds
So a gradient free method is a method where you do not assume that the function you’re trying to optimize is differentiable or even continuous with respect to the parameters for several reasons.
1:21:36
1 hour, 21 minutes, 36 seconds
Perhaps it’s a function that looks like a golf course, right? It’s flat. And then maybe it’s got steps and you know, it’s difficult to like.
1:21:46
1 hour, 21 minutes, 46 seconds
The local gradient information does not give you any information as to where you should go to find the minimum.
1:21:53
1 hour, 21 minutes, 53 seconds
Okay, it could be that the function is essentially discrete, right? It’s not a function of continuous variables.
1:22:00
1 hour, 22 minutes
Function of discrete variables. So, for example, am I going to win this chess game? The variable you can manipulate is the position on the board.
1:22:10
1 hour, 22 minutes, 10 seconds
That’s a discrete variable. So you can’t compute a gradient of a score with respect to a position on a chess game.
1:22:20
1 hour, 22 minutes, 20 seconds
It’s a discrete variable. Another example is the cost function is not something you can compute.
1:22:27
1 hour, 22 minutes, 27 seconds
You don’t actually know the cost function.
1:22:30
1 hour, 22 minutes, 30 seconds
So, for example, the only thing you can do is give an input to the cost function and it tells you the cost, but you don’t know the function.
1:22:41
1 hour, 22 minutes, 41 seconds
It’s not a program on a computer. You can’t back propagate gradient through it. A good example of this is the real world.
1:22:49
1 hour, 22 minutes, 49 seconds
The real world, you can think of it as a cost function, right? You learn to ride a bike and you ride your bike and at some point you fall.
1:22:59
1 hour, 22 minutes, 59 seconds
The real world does not give you a gradient of that cost function, which is how much you hurt with respect to your actions.
1:23:09
1 hour, 23 minutes, 9 seconds
The only thing you can do is try something else and see if you get the same result or not. So what do you do in that case?
1:23:17
1 hour, 23 minutes, 17 seconds
So basically now your cost function is a black box. So now you cannot propagate gradient to this black box.
1:23:24
1 hour, 23 minutes, 24 seconds
What you have to do is estimate the gradient by perturbing what you feed to that black box.
1:23:31
1 hour, 23 minutes, 31 seconds
You try something, and that something would be a perturbation of your input to this black box.
1:23:37
1 hour, 23 minutes, 37 seconds
And you see what resulting perturbation occurs on the output of the black box, the cost.
1:23:44
1 hour, 23 minutes, 44 seconds
Now you can estimate whether this modification improved or made the result worse.
1:23:51
1 hour, 23 minutes, 51 seconds
So essentially this is like this optimization problem I was telling you about earlier.
1:23:58
1 hour, 23 minutes, 58 seconds
The gradient based algorithm is like you are in the mountain, lost in a mountain, in a fog.
1:24:05
1 hour, 24 minutes, 5 seconds
You can’t see anything, but you can estimate the direction of steepest descent, right? You can just look around and you can tell which is the direction of steepest descent.
1:24:15
1 hour, 24 minutes, 15 seconds
You just take a step in that direction. What if you can’t see? So basically to estimate in which direction the function goes down, you have to actually take a step.
1:24:27
1 hour, 24 minutes, 27 seconds
Okay, so you take a step in one direction, then you come back, then you can take a step in the other direction, come back, and then maybe you get an estimate for where the steepest descent is.
1:24:37
1 hour, 24 minutes, 37 seconds
Now you can take a step for steepest descent.
1:24:39
1 hour, 24 minutes, 39 seconds
So this is estimating the gradient by perturbation instead of by analytic means of backpropagating gradients, computing Jacobians or whatever, partial derivatives.
1:24:49
1 hour, 24 minutes, 49 seconds
And then there is the second step of complexity. Let’s imagine that the, the landscape you’re in is basically flat everywhere.
1:24:57
1 hour, 24 minutes, 57 seconds
Except once in a while there is a step. Taking a small step in one direction will not give you any information about which direction you have to go to.
1:25:06
1 hour, 25 minutes, 6 seconds
So there you have to use other techniques, taking bigger steps, working for a while and seeing if you fall down the step or not, or go up a step.
1:25:17
1 hour, 25 minutes, 17 seconds
Maybe you can multiply yourself in 10,000 copies of yourself and then kind of explore the surroundings.
1:25:23
1 hour, 25 minutes, 23 seconds
And then whenever someone says, oh, I find a hole, calls everyone to kind of come there. Okay, so all those methods are called gradient free optimization algorithms.
1:25:32
1 hour, 25 minutes, 32 seconds
Sometimes they’re called zeroth order method. Why zeroth Order because first order is when you can compute the derivative.
1:25:40
1 hour, 25 minutes, 40 seconds
Zeroth order is when you cannot compute the derivative. You can only compute the function or get a value for the function.
1:25:45
1 hour, 25 minutes, 45 seconds
And then you have second order methods that compute not just the first derivative, but also the second derivative. And they’re also gradient based.
1:25:52
1 hour, 25 minutes, 52 seconds
Okay, because they need the first derivative as well, but they can accelerate the process by also computing the second derivative.
1:25:58
1 hour, 25 minutes, 58 seconds
And ADAM is a very simplified form of kind of second order method. It’s not a second order method, but it has a hint of second order.
1:26:06
1 hour, 26 minutes, 6 seconds
Another hint of second order method is what’s called conjugate gradient.
1:26:11
1 hour, 26 minutes, 11 seconds
There’s another class of method called quasi Newton methods, which also kind of using curvature information, if you want to accelerate.
1:26:21
1 hour, 26 minutes, 21 seconds
Many of those are not actually practical for neural net training, but there are some forms that are.
1:26:27
1 hour, 26 minutes, 27 seconds
If you are interested in zero starter optimization, there is a library that is actually produced by.
1:26:33
1 hour, 26 minutes, 33 seconds
It’s an open source library which originated at Facebook Research in Paris by an author called Olivier Tetot. But it’s really a community effort.
1:26:42
1 hour, 26 minutes, 42 seconds
There’s a lot of contributors to it. It’s called Nevergrad and it implements a very large number of different optimization algorithms that do not assume that you have access to the gradient.
1:26:55
1 hour, 26 minutes, 55 seconds
There are genetic algorithms or evolutionary methods. There are particle swarm optimization, there are perturbation methods.
1:27:02
1 hour, 27 minutes, 2 seconds
There’s all kinds of tricks. There’s a whole catalog of those things. And those sometimes it’s unavoidable, you have to use them because you don’t know the cost function.
1:27:13
1 hour, 27 minutes, 13 seconds
So a very common situation where you had to use those things is reinforcement learning. So reinforcement learning is basically a situation where you tell the system.
1:27:22
1 hour, 27 minutes, 22 seconds
You don’t tell the system the correct answer. You only tell the system whether the answer was good or bad because you give the value of the cost, but you don’t tell the machine where the cost is.
1:27:34
1 hour, 27 minutes, 34 seconds
So the machine doesn’t know what the cost function is. The machine cannot actually compute the gradient of the cost.
1:27:40
1 hour, 27 minutes, 40 seconds
And so it has to use something like a zero starter method. So what you can do is you can compute a gradient with respect to the parameters of the overall cost function by perturbing the parameters.
1:27:52
1 hour, 27 minutes, 52 seconds
Or what you can do is compute the gradient of the cost function with respect to the output of your neural net using perturbation.
1:28:00
1 hour, 28 minutes
And once you have this estimate, then you backpropagate the gradient through your network using regular backprop.
1:28:08
1 hour, 28 minutes, 8 seconds
So that’s a combination of estimating the gradient through perturbation for the cost function, because you don’t Know it and then backpropagating from there.
1:28:18
1 hour, 28 minutes, 18 seconds
This is basically the technique that was used by the DeepMind people in the first deep Q learning type methods.
1:28:26
1 hour, 28 minutes, 26 seconds
Back to the normalization. Do we normalize the entire data set or each batch? It’s equivalent.
1:28:35
1 hour, 28 minutes, 35 seconds
So you normalize each sample, but the variable you’re computing is on the entire training set.
1:28:41
1 hour, 28 minutes, 41 seconds
So you’re computing the standard deviation and the mean over the entire training set.
1:28:47
1 hour, 28 minutes, 47 seconds
In fact, most of the time you don’t even need to do it over the entire training set because mean and standard deviation converges pretty fast.
1:28:56
1 hour, 28 minutes, 56 seconds
But you do it over the entire training set. Right.
1:28:59
1 hour, 28 minutes, 59 seconds
And what you get is a constant number, two constant numbers, a number that you subtract and the number that you should divide for each component of your input.
1:29:11
1 hour, 29 minutes, 11 seconds
Okay. It’s a fixed pre processing for a given training set. You’ll have a fixed mean and standard deviation vector.
1:29:21
1 hour, 29 minutes, 21 seconds
But maybe we can connect to the other tool. Right, the other module. The batch. The batch normalization. Right. Okay. Okay, we haven’t talked about that yet.
1:29:29
1 hour, 29 minutes, 29 seconds
Yeah, I’m saying that we can perhaps extend this normalization bit to both sides like the whole data set and the batch itself.
1:29:38
1 hour, 29 minutes, 38 seconds
Okay. Yes, yes. So I mean again, there’s going to be a whole lecture on this, but for the same reason it’s good to have variables, the input that are zero mean and you need variance.
1:29:49
1 hour, 29 minutes, 49 seconds
It’s also good for the state variables inside the network to basically have zero mean and you need variance.
1:29:57
1 hour, 29 minutes, 57 seconds
And so people have come up with various ways of doing normalization of the variables inside a network so that they approach zero mean and unit variance.
1:30:08
1 hour, 30 minutes, 8 seconds
And there are many ways to do this. They have cute names like batch normalization, like layer normalization.
1:30:17
1 hour, 30 minutes, 17 seconds
And the idea goes back a very long time. Batch norm is kind of a more recent incarnation of it.
1:30:24
1 hour, 30 minutes, 24 seconds
Let’s see, where was I scheduled to decrease the learning rate?
1:30:30
1 hour, 30 minutes, 30 seconds
Yeah, as it turns out, for reasons that are still not completely fully understood, you need to learn fast.
1:30:37
1 hour, 30 minutes, 37 seconds
Initially you need a learning rate of a particular size.
1:30:41
1 hour, 30 minutes, 41 seconds
But to get good results in the end, you kind of need to decrease the learning rate to kind of let the system settle inside of minima.
1:30:50
1 hour, 30 minutes, 50 seconds
And that requires decreasing the learning rate. There’s various semi valid theoretical explanation for this, but experimentally it’s clear you need to do that.
1:31:01
1 hour, 31 minutes, 1 second
And again, there are schedules that are pre programmed in Pytorch for this use a bit of L1 or L2 regularization on the weights or combination.
1:31:09
1 hour, 31 minutes, 9 seconds
Yeah. After you’ve trained your system for a few epochs, you you might want to kind of prune it, eliminate the weights that are useless, make sure that the weights have their kind of minimum size.
1:31:21
1 hour, 31 minutes, 21 seconds
And what you do is you add a term in the cost function that basically shrinks the weights at every iteration.
1:31:30
1 hour, 31 minutes, 30 seconds
You might know what L2 and L1 regular registration means if you’ve taken a class in machine learning for logistic regression or stuff like that.
1:31:40
1 hour, 31 minutes, 40 seconds
It’s very common, but L2 sometimes is called weight decay. These again are preprogramed in Pytorch.
1:31:47
1 hour, 31 minutes, 47 seconds
A trick that a lot of people use for large neural nets is a trick called dropout. Dropout is implemented as a layer in Pytorch.
1:31:56
1 hour, 31 minutes, 56 seconds
What this layer does is that it takes the state of a layer and it randomly picks a certain proportion of the units and basically sets them to zero, right?
1:32:07
1 hour, 32 minutes, 7 seconds
So you can think of it as a mask. A layer that applies a mask to an input.
1:32:12
1 hour, 32 minutes, 12 seconds
And the mask is randomly picked at every sample and some proportion of the mask are set, the value in the mask are set to zero, some are set to one, and you multiply the input by the mask.
1:32:26
1 hour, 32 minutes, 26 seconds
So only a subset of the units are allowed to speak to the next layer. Essentially that’s called dropout.
1:32:33
1 hour, 32 minutes, 33 seconds
And the reason for doing this is that it forces the unit to distribute the information about the input over multiple units instead of kind of squeezing everything into a small number.
1:32:45
1 hour, 32 minutes, 45 seconds
And it makes the system more robust. There’s some theoretical arguments for why that, why it does that experimentally.
1:32:54
1 hour, 32 minutes, 54 seconds
If you add this to a large network, you get better generalization error, you get better performance on the test set.
1:33:01
1 hour, 33 minutes, 1 second
It’s not always necessary, but it helps. Okay, there’s lots of tricks and I’ll devote a lecture on this.
1:33:08
1 hour, 33 minutes, 8 seconds
So I’m not going to go through all of them right now. That requires explaining a bit more about optimization.
1:33:15
1 hour, 33 minutes, 15 seconds
So really what deep learning is about, like I told you everything about deep learning, like the basics of deep learning. What I haven’t told you is why we use deep learning.
1:33:25
1 hour, 33 minutes, 25 seconds
Okay? And that’s basically what I’m going to tell you about now. The motivation for why is it that we need basically multilayer neural nets or things of this type?
1:33:35
1 hour, 33 minutes, 35 seconds
Okay, so the traditional prototypical model of supervised learning for a very long time is basically a linear classifier.
1:33:44
1 hour, 33 minutes, 44 seconds
Linear classifier for a two class problem is basically a single unit of the similar type that we talked about earlier.
1:33:52
1 hour, 33 minutes, 52 seconds
You compute a weighted sum of inputs at a bias. You could think of the bias as just another trainable weight whose corresponding input is equal to 1.
1:34:03
1 hour, 34 minutes, 3 seconds
If you want and then you pass that through a threshold function, the sine function that outputs minus one if the weighted sum is below zero and plus one if it’s above zero.
1:34:14
1 hour, 34 minutes, 14 seconds
Okay, so this basic linear classifier basically partitions the space, the input space of x’s into two half spaces separated by hyperplane.
1:34:24
1 hour, 34 minutes, 24 seconds
So the equation sum of I wi xi plus v equals 0 is the surface that separates the category 1 that is going to produce y bar equals plus 1 from category 2, where y bar equals minus 1.
1:34:36
1 hour, 34 minutes, 36 seconds
Why does it divide the space into two halves? It’s because you’re computing the dot product of an input vector with a weight vector.
1:34:46
1 hour, 34 minutes, 46 seconds
If those two vectors are orthogonal, then the dot product is zero. Okay? B is just an offset.
1:34:53
1 hour, 34 minutes, 53 seconds
So the set of points in X space where this dot product is zero is the set of points that are orthogonal to the vector W. Okay, so in a N dimensional space, your vector W is a vector.
1:35:07
1 hour, 35 minutes, 7 seconds
And the set of X whose dot product with W is 0 is a hyperplane. Right? So it’s a. It’s a linear subspace of dimension N minus 1.
1:35:16
1 hour, 35 minutes, 16 seconds
Okay? And that hyperplane divides the space of dimension n in two halves. So here is the situation in two dimensions.
1:35:25
1 hour, 35 minutes, 25 seconds
You have two dimensions, x1, x2. You have data points here, the red category and the blue category.
1:35:33
1 hour, 35 minutes, 33 seconds
And there is a weight vector plus a bias where the intercept here of this green separating line with X1 is minus B times divided by W1.
1:35:44
1 hour, 35 minutes, 44 seconds
So that gives you an idea for what W should be. And the W vector is orthogonal to that separating surface.
1:35:52
1 hour, 35 minutes, 52 seconds
Okay, so changing B will change the position and then changing W will change the orientation, basically.
1:35:59
1 hour, 35 minutes, 59 seconds
Now, what about situations like this where the points are. The red and blue points are not separable by a hyperplane?
1:36:07
1 hour, 36 minutes, 7 seconds
That’s called a nonlinearly separable case. So there you can’t use a linear classifier to separate those.
1:36:15
1 hour, 36 minutes, 15 seconds
What are we going to do? In fact, there is a theorem that goes back to 1966 by Tom Cover, who died recently.
1:36:25
1 hour, 36 minutes, 25 seconds
Actually it was at Stanford that says the probability that a particular separation of p points is, is linearly separable in N dimension is close to 1 when P is
1:36:32
1 hour, 36 minutes, 32 seconds
smaller than N, but is close to zero when P is larger than N.
1:36:40
1 hour, 36 minutes, 40 seconds
In other words, if you take an N dimensional space, you throw P random points in that N dimensional space, data points, okay?
1:36:48
1 hour, 36 minutes, 48 seconds
And you randomly label them blue and red. You ask the question, what is the probability that that particular dichotomy is linearly separable?
1:36:57
1 hour, 36 minutes, 57 seconds
I can separate the blue points from the red points with a hyperplane and the answer is, if P is less than N, you have a good chance that they will be separable.
1:37:08
1 hour, 37 minutes, 8 seconds
If P is larger than N, you basically have no chance that they will.
1:37:13
1 hour, 37 minutes, 13 seconds
Okay, so if you have an image classification problem, let’s say, and you have tons of examples way bigger. So let’s say you do nnist.
1:37:22
1 hour, 37 minutes, 22 seconds
So, nnist is a data set of handwritten digits. The Images are in 28 by 28 pixels.
1:37:27
1 hour, 37 minutes, 27 seconds
In fact, the intrinsic dimension is smaller because some pixels are always zero and you have 60,000 samples.
1:37:35
1 hour, 37 minutes, 35 seconds
The probability that those 60,000 samples of, let’s say, zeros from everything else or one from everything else is nearly separable is basically nil.
1:37:45
1 hour, 37 minutes, 45 seconds
So, which is why people invented the classical model of pattern recognition would consist in taking an input, engineering a feature extractor to produce a
1:37:54
1 hour, 37 minutes, 54 seconds
representation in such a way that in that space, now your problem becomes, let’s say, linearly separable.
1:38:03
1 hour, 38 minutes, 3 seconds
If you use a linear classifier or some other separability if you use another type of classifier. Now necessarily this feature extraction must be nonlinear itself.
1:38:13
1 hour, 38 minutes, 13 seconds
If the only thing it does is, is some affine transformation of the input is not going to make a nonlinearly separable problem into a linearly separable one.
1:38:23
1 hour, 38 minutes, 23 seconds
So necessarily this feature extractor has to be nonlinear. This is very important to remember.
1:38:30
1 hour, 38 minutes, 30 seconds
A linear pre processing doesn’t do anything for you, essentially. So people spend decades in computer vision, for example, or speech recognition, devising good feature extractors for particular problems.
1:38:41
1 hour, 38 minutes, 41 seconds
What features are good to do? Face recognition, for example.
1:38:45
1 hour, 38 minutes, 45 seconds
Can I do things like detect the eyes and then measure the ratio between the separation of the eyes with the separation from the mouth, and then compute a few features like this, and then feed that to a classifier and figure out who the person is.
1:38:57
1 hour, 38 minutes, 57 seconds
So most papers between, let’s say the 1960s or 70s and the late 2000s or early 2010s in computer vision were essentially about that, like how you represent
1:39:07
1 hour, 39 minutes, 7 seconds
images properly, not all of them okay, but a lot of them for recognition.
1:39:16
1 hour, 39 minutes, 16 seconds
And a lot of people kind of devise very sort of generic ways of devising feature extractors.
1:39:25
1 hour, 39 minutes, 25 seconds
The basic idea is you just expand the dimension of the representation in a nonlinear way so that now your number of dimension is larger than the number of samples.
1:39:33
1 hour, 39 minutes, 33 seconds
And now your problem has a chance of becoming linearly separable. So the ideas that I’m not going to go through, like space styling, random projection.
1:39:41
1 hour, 39 minutes, 41 seconds
So random projection basically is a very simple idea.
1:39:44
1 hour, 39 minutes, 44 seconds
You take your input vectors, you multiply them by a random matrix, and then you pass the result to Some nonlinear operation, okay, that’s called random projection.
1:39:56
1 hour, 39 minutes, 56 seconds
And it might make. If the dimension of the output is larger than the dimension of the input, it might make a nonlinearly separable problem, Linearly separable.
1:40:08
1 hour, 40 minutes, 8 seconds
It’s very inefficient because, you know, you might need a very large number of this dimension to be able to kind of do a good job. But it works in certain cases.
1:40:17
1 hour, 40 minutes, 17 seconds
And you don’t have to train the first layer. You basically pick it randomly. And so the only thing you need to train is a linear classifier on top.
1:40:25
1 hour, 40 minutes, 25 seconds
These polynomial classifiers, which I’ll talk about in a bit in a minute. Radio basis functions and kernel machines.
1:40:31
1 hour, 40 minutes, 31 seconds
So those are basically techniques to turn an input into a representation that then will be essentially classifiable by a simple classifier like a linear classifier.
1:40:44
1 hour, 40 minutes, 44 seconds
So what’s a polynomial classifier? Polynomial classifier. Basically imagine that your input vector has two dimensions.
1:40:54
1 hour, 40 minutes, 54 seconds
The way you increase the dimensionality of the representation is that you take each of the input variables, but you also take every product of pairs of input variables, right?
1:41:04
1 hour, 41 minutes, 4 seconds
So now you have a new feature vector which is composed of x1, x2. You add one for the bias, and then also x1 times x2, x1 squared and x2 squared.
1:41:13
1 hour, 41 minutes, 13 seconds
So when you do a linear classification in that space, what you’re doing really is a quadratic classification in the original space, right?
1:41:24
1 hour, 41 minutes, 24 seconds
The surface, the separating surface in the original space now is a quadratic curve in two dimension.
1:41:32
1 hour, 41 minutes, 32 seconds
In n dimension, it’s a quadratic hypersurface, basically. So it could be a parabola or ellipse or hyperbola, depending on the coefficients.
1:41:44
1 hour, 41 minutes, 44 seconds
Now, the problem with this is that it doesn’t work very well in high dimension because the number of features grows with a square of the number of inputs.
1:41:52
1 hour, 41 minutes, 52 seconds
So if you want to apply this to, let’s say an ImageNet type image, the resolution is 256 by 256 by 3. Because you have color channels, that’s already a high dimension.
1:42:02
1 hour, 42 minutes, 2 seconds
If you take the cross product of all of those variables, that’s way too large.
1:42:10
1 hour, 42 minutes, 10 seconds
So it’s not really practical for high dimensional problems, but it’s a trick. Now here is.
1:42:16
1 hour, 42 minutes, 16 seconds
So super vector machines are basically two layer networks, or kernel machines more generally are two layer systems in which the first layer has as many dimensions as you have training samples.
1:42:29
1 hour, 42 minutes, 29 seconds
So for each training sample, you create a neuron, a unit, if you want.
1:42:35
1 hour, 42 minutes, 35 seconds
And the role of this unit is to produce a large output if the input vector matches one of the training samples, and a small output if it doesn’t, or the other
1:42:42
1 hour, 42 minutes, 42 seconds
way around, a small output if it matches a large output if it doesn’t, okay, doesn’t really matter. But it has to be nonlinear.
1:42:51
1 hour, 42 minutes, 51 seconds
So something like compute the dot product of the input by one of the training samples and pass this through a negative exponential or a square or something like that.
1:43:03
1 hour, 43 minutes, 3 seconds
So this gives you how much the input vector resembles one of the training samples. You do this for every single training samples.
1:43:12
1 hour, 43 minutes, 12 seconds
And then you train a linear classifier, basically to use those inputs as input to a linear classifier.
1:43:19
1 hour, 43 minutes, 19 seconds
You compute the weights of that linear classifier. It’s basically as simple as that. There’s some regularization involved.
1:43:27
1 hour, 43 minutes, 27 seconds
So essentially it’s kind of a lookup table. You have your entire training set as points in your neurons, if you want, or units in your first layer.
1:43:36
1 hour, 43 minutes, 36 seconds
And they each indicate how close the current input vector is to them. So you get some picture of where the input vector is.
1:43:44
1 hour, 43 minutes, 44 seconds
By basically having the relative position to all of the training samples and then using a simple linear operation, you can figure out what’s the correct answer.
1:43:54
1 hour, 43 minutes, 54 seconds
This works really well for low dimensional problems, the small number of training samples.
1:44:00
1 hour, 44 minutes
But you’re not going to do computer vision with it, at least not if x’s are pixels, because it’s basically template matching.
1:44:08
1 hour, 44 minutes, 8 seconds
Now here is a very interesting fact.
1:44:11
1 hour, 44 minutes, 11 seconds
It’s the fact that if you build a two layer neural net on this model, so let’s say a two layer neural net, you have an input layer, a hidden layer, I’m not specifying the size, and a single output unit.
1:44:23
1 hour, 44 minutes, 23 seconds
And you ask what functions can I approximate with an architecture of this type of?
1:44:30
1 hour, 44 minutes, 30 seconds
The answer is you can approximate pretty much any well behaved function as close as you want, as long as you have enough of those units in the middle.
1:44:38
1 hour, 44 minutes, 38 seconds
So this is a theorem that says that two layer neural nets are universal approximators. It doesn’t really matter what nonlinear function you put in the middle, any nonlinear function will do.
1:44:49
1 hour, 44 minutes, 49 seconds
A two layer neural net is a universal approximator. And immediately you say, well, why, why do we need multiple layers then if we can approximate anything with two layers?
1:45:00
1 hour, 45 minutes
And the answer is, it’s very, very inefficient to try to approximate everything with only two layers.
1:45:06
1 hour, 45 minutes, 6 seconds
Because many, many, many interesting functions we’re interested in learning cannot be efficiently represented by a two layer system.
1:45:14
1 hour, 45 minutes, 14 seconds
They can possibly be represented by a two layer system, but the number of hidden units it would require would be so ridiculously large that you completely impractical.
1:45:25
1 hour, 45 minutes, 25 seconds
Okay, so that’s why we need layers this very simple point is something that took about, you know, it took until the, basically the 2010s for the machine learning and computer vision communities to understand.
1:45:40
1 hour, 45 minutes, 40 seconds
Okay, if you understood what I just said, you just took a few seconds, so you beat them.
1:45:48
1 hour, 45 minutes, 48 seconds
There is a last question here before we finish class. So does the depth of the network then have anything to do with generalization?
1:45:57
1 hour, 45 minutes, 57 seconds
Okay, so generalization is a different story. Generalization is very difficult to predict. It depends on a lot of things.
1:46:04
1 hour, 46 minutes, 4 seconds
It depends on the appropriateness of the architecture to the problem at hand. So, for example, people use convolutional nets for computer vision.
1:46:12
1 hour, 46 minutes, 12 seconds
They use transformers for text, you know, blah, blah, blah. So there are certain architectures that work well for certain types of data.
1:46:22
1 hour, 46 minutes, 22 seconds
So that’s, that’s the main thing that will improve generalization. But generally, yes, multiple layers can improve generalization.
1:46:30
1 hour, 46 minutes, 30 seconds
Because for a particular function you’re interested in learning, computing it with multiple layers will allow you to reduce the overall size of the system that will do a good job.
1:46:40
1 hour, 46 minutes, 40 seconds
And so by reducing the size, you’re basically making it easier for the system to find kind of good representation.
1:46:47
1 hour, 46 minutes, 47 seconds
But there is something else which has to do with compositionality. I’ll come to this in a minute if I have time.
1:46:54
1 hour, 46 minutes, 54 seconds
Also, the minimum, how do you call the. Well, is larger if we have overparameterized networks.
1:47:02
1 hour, 47 minutes, 2 seconds
If you have over parameterized network, it’s much easier to find a minimum to your objective function, which is why neural nets are generally overparameterized.
1:47:10
1 hour, 47 minutes, 10 seconds
They generally have a much larger number of parameters than what you would think is necessary. And when you get them bigger, when you make them bigger, they work better.
1:47:18
1 hour, 47 minutes, 18 seconds
Usually it’s not always the case, but it’s very curious phenomenon about this. We’ll talk about this later. Okay, this is the one point I want to make.
1:47:27
1 hour, 47 minutes, 27 seconds
And it’s the fact that the reason why layers are good is that the world is compositional, the perceptual world in particular, but the world in general, the universe, if you want to, is compositional.
1:47:40
1 hour, 47 minutes, 40 seconds
What does that mean? It means that, okay, at the level of the universe, right, we have elementary particles.
1:47:46
1 hour, 47 minutes, 46 seconds
They assemble to form less elementary particles.
1:47:49
1 hour, 47 minutes, 49 seconds
Those assemble to form atoms, those assemble to form molecules, those assemble to form materials, those assemble to form structures, objects, et cetera, and environment, scenes, et cetera.
1:48:02
1 hour, 48 minutes, 2 seconds
You have the same kind of hierarchy. For images, you have pixels. They assemble to form edges and textons and motifs, parts and objects.
1:48:10
1 hour, 48 minutes, 10 seconds
In text, you have characters assemble to form words, word groups, clauses, sentences, stories.
1:48:16
1 hour, 48 minutes, 16 seconds
In speech, you have speech samples assembled to form elementary sounds, phones, phonemes, syllables, words, etc.
1:48:24
1 hour, 48 minutes, 24 seconds
So you have this kind of compositional hierarchy in a lot of natural signals. And this is what makes the world understandable.
1:48:32
1 hour, 48 minutes, 32 seconds
There’s this famous quote by Albert Einstein. The most incomprehensible thing about the world is that the world is comprehensible.
1:48:40
1 hour, 48 minutes, 40 seconds
And the reason why the world is comprehensible is because it’s compositional, because small part assemble to form bigger part.
1:48:48
1 hour, 48 minutes, 48 seconds
And that allows you to have a description, an abstract description of the world in terms of parts from the level immediately below, in terms of level of abstraction.
1:48:58
1 hour, 48 minutes, 58 seconds
So to some extent, the layered architecture in a neural net reflects this idea that you have kind of a compositional hierarchy where simple things assemble to form slightly more complex things.
1:49:09
1 hour, 49 minutes, 9 seconds
So images, you have pixels formed to form edges that are kind of depicted here.
1:49:14
1 hour, 49 minutes, 14 seconds
These are actually feature detectors, visualization of feature detectors by a particular convolutional net, which is a particular type of neural net, multilayer neural net.
1:49:23
1 hour, 49 minutes, 23 seconds
So at the low level, you have units that detect oriented edges. A couple layers up, you have things that detect simple motif circles, gratings, corners, et cetera.
1:49:31
1 hour, 49 minutes, 31 seconds
And then a few layers up, there are things like parts of objects and things like that.
1:49:37
1 hour, 49 minutes, 37 seconds
So I think personally that the magic of deep learning, the fact that multiple layers help, is the fact that the perceptual world is basically a compositional hierarchy.
1:49:48
1 hour, 49 minutes, 48 seconds
And then this end to end learning in deep learning allows the system to learn hierarchical representations where each layer learns a representation that has a
1:49:55
1 hour, 49 minutes, 55 seconds
level of abstraction slightly higher than the previous one.
1:50:03
1 hour, 50 minutes, 3 seconds
So low level, you have individual pixels, then you have the presence or absence of an edge, then you have the presence or absence of a part of an object, and then you have the presence or absence of an object independently of the position of that object, the illumination, the color, the occlusions, the background,
1:50:15
1 hour, 50 minutes, 15 seconds
things like that.
1:50:23
1 hour, 50 minutes, 23 seconds
So that’s the motivation, the idea why deep learning is so successful and why it’s basically taken over the world over the last ten years or so.
1:50:31
1 hour, 50 minutes, 31 seconds
All right, thank you for your attention. Great.
1:50:34
1 hour, 50 minutes, 34 seconds
So for tomorrow, guys, don’t forget to try to go over the 01 tutorial tensor, sorry, the 01 notebook that we have on the website such that we, we can get all on
1:50:43
1 hour, 50 minutes, 43 seconds
the same level for the one that are not really familiar with numpy stuff.
1:50:51
1 hour, 50 minutes, 51 seconds
Okay? So otherwise, let’s see you tomorrow morning and have a nice day. Take care everyone.
1:51:00
1 hour, 51 minutes
Bye bye.

Поделиться: