In this video, Dr. Ardavan (Ahmad) Borzou will discuss the importance of revisiting our current approach to building and training machine learning models. To demonstrate the point, he will start from a general probability function and use the maximum entropy principle and perturbation theory to analytically derive linear regression. After a short review of the current ML training method, he will show an alternative that is computationally less costly.
Introduction (0:00)
This is an IBM 700 series computer. These were state-of-the-art computers of their time. Their computing power was about a million times lower than that of a typical laptop today, but their size was as large as a building. They not only were slow but also consumed a lot of electricity, making computations very costly. To perform calculations, they sent physical signals through vacuum tubes, a process that generated a lot of heat and added to the maintenance costs. And this is a GPU cluster of the kind we use today to build large language models such as OpenAI's GPT or Google's Gemini. Similar to the IBM 700 series, they are the size of a building (the larger ones, in this case), consume a lot of electricity, and generate a lot of heat. In the case of the IBM 700 series, or the second generation of computers in general, physics stepped in and introduced semiconductors, which then led to replacing the vacuum tubes with transistors, revolutionizing computers in all aspects: the size became small enough to fit in our backpacks, yet the compute power increased by millions of times, the electricity they need can be stored in a small battery, and the heat they generate is negligible. Today we need a similar revolution to shrink GPU clusters to something that fits in our pocket but is more powerful, with negligible energy consumption and heat generation. You might have heard of quantum computers, which use quantum physics to perform their computations. They can potentially help us achieve this revolution, but quantum computers are not the only place where physics can revolutionize machine learning and AI.
Physics has a vast literature on incorporating symmetries into its models, which makes them a lot more efficient. Let me explain through a simple example. Imagine you are going to compute the area of the surface of a sphere. If we don't take the spherical symmetry into account, we end up solving several integrals with complicated-looking limits of integration. But if we use the spherical symmetry, we end up with one single integral with very simple limits of integration. Now imagine we run these two computation approaches on the same computer: which one will be faster, consume less energy, and generate less heat? Of course, symmetries, backed by group theory, have an ocean of literature behind them and have much more to offer than what I have just mentioned.
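To make the contrast concrete, here is a minimal sketch of the two routes for a sphere of radius R (my own reconstruction, not necessarily the exact integrals shown on screen). Without using the symmetry, one can parametrize the upper hemisphere as z = \sqrt{R^2 - x^2 - y^2} and integrate the surface element over the disk, with awkward limits:

$$A = 2\int_{-R}^{R}\int_{-\sqrt{R^2 - x^2}}^{\sqrt{R^2 - x^2}} \frac{R}{\sqrt{R^2 - x^2 - y^2}}\,dy\,dx .$$

Using the spherical symmetry instead, the angular integration collapses to a single integral with simple limits:

$$A = \int_0^{\pi} 2\pi R^2 \sin\theta \,d\theta = 4\pi R^2 .$$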
Physics also has a vast literature on perturbation theory and on how to solve analytically for the outcome of probabilistic events. Using this technology can reduce the computation needed to perform a given machine learning task. Perhaps you have seen these nice-looking diagrams: invented by Richard Feynman, they provide an analytic framework for describing probabilistic events without much computation. On this channel, starting from this video, I will gradually talk about how to use these diagrams and the technology behind them to estimate machine learning parameters.
In high energy physics and also in statistical physics, the models are described by a probability function of this form. Most existing machine learning models can also be reformulated into this same probabilistic form; my conjecture is that all machine learning models can be reformulated into this equation. Then comes the brilliant idea of the physicist Lev Landau and what we know as the phenomenon of universality. Simply speaking, the underlying principles don't matter much for the behavior of a statistical system. Loosely speaking, if this F is the same in a physics model and in a machine learning model, the two behave the same. If you already know how the physics model behaves, you already know everything about the machine learning model; there is no need to spend money, energy, and electricity to estimate the parameters of the machine learning model.
To demonstrate how we can reduce computations, and hence the cost, by changing our approach to building machine learning models, in the rest of this video I will start from this equation, derive linear regression out of it, and estimate its parameters with less than the usual computation.
Review of Current Approaches (5:04)
To show you the difference, let me first briefly review the conventional process of parameter estimation in machine learning. Let's say I have this spreadsheet of data and would like to predict the last column, y, using the values in the rest of the columns. I will assume a function for the expected value of y in terms of the x variables; in linear regression, this is the equation. To estimate the β parameters, I will define a type of error function, a loss function, and try to minimize it. Here is an example of a loss function called the residual sum of squares, where the sum runs over the rows of the spreadsheet. I just plug the x and y values of each row into this equation to end up with an equation for the β parameters, and now I need to find the set of β parameters that minimizes this loss function.
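As a minimal sketch (with made-up numbers standing in for the video's spreadsheet), the residual sum of squares for a given set of β parameters can be computed like this:

import numpy as np

# Hypothetical spreadsheet: two feature columns x1, x2 and a target column y
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5]])
y = np.array([2.1, 1.9, 3.2])

beta0 = 0.5                      # intercept
beta = np.array([0.8, -0.2])     # coefficients for x1, x2

y_hat = beta0 + X @ beta         # expected value of y for each row
rss = np.sum((y - y_hat) ** 2)   # residual sum of squares (the loss)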
A conventional method for finding the optimal set of β parameters that minimize the error is called gradient descent, invented by Augustin-Louis Cauchy. Let me quickly explain how that method works. For simplicity, let's assume I have only two unknown β parameters, and let's say that if I plot the loss, or error, function in terms of the two β parameters, I end up with this shape. The parameters at the bottom of this well are the solution to the minimization and are the ones I should insert into my linear regression. However, generating this plot to find the minimum is not efficient, because every single point on the surface of this plot requires one computation. To avoid all these computations, I start with an initial guess for the parameters, which is usually just a random location on this surface. I then compute the slope at the points around my initial guess and find that this direction, for example, has the steepest decrease in the value of the error function. I then update my initial guess of the parameters by moving their values in this direction, and I repeat this process until I achieve no further decrease in the error function. This same algorithm, or its derivatives, is used to train even deep learning models.
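Here is a minimal sketch of this loop for two β parameters (my own illustration, using the residual-sum-of-squares loss above and a fixed step size rather than whatever settings the video has in mind):

import numpy as np

# Hypothetical data: a column of ones (for the intercept) plus one feature column
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 2.9, 4.1])

beta = np.zeros(2)              # initial guess: an arbitrary starting location
lr = 0.01                       # step size along the steepest-descent direction

for _ in range(5000):
    residual = y - X @ beta
    grad = -2 * X.T @ residual  # gradient of the residual sum of squares
    beta -= lr * grad           # move against the gradient (steepest decrease)

# beta now approximates the minimizer of the loss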
Now, why is this method of building machine learning models not efficient? First, my initial guess can be far away from the real minimum, so it takes a lot of steps to reach the minimum, and every single step in this algorithm costs money. Another problem arises when we have a surface like this and our initial guess happens to fall here: the algorithm will then give me the wrong answer, returning the location of a local minimum instead of the true minimum. And this visualization is just for two unknown parameters; imagine how complicated and lengthy the process can be when a model has as many as a trillion unknown parameters, such as in the GPT or Gemini large language models. That explains why OpenAI spent $100 million to train GPT-4.
Alternative Approach (8:40)
Let's now switch gears and build a linear regression model with a couple of advantages: one, we do it without explicitly going into the minimization task; two, it starts from this formulation, which can be linked to physics models and takes advantage of the concept of universality in physics; three, it is in a form that allows us to use Feynman diagrams both to understand these models and to compute their parameters. In a couple of previous videos I have covered part of what I need for this video, so I will just summarize the important parts to keep this video self-contained; for more details, I suggest you watch those videos later.
Gaussian Form from First Principles (9:23)
In those earlier videos, we discussed that a general form of probability function governs the occurrence of events in a system; in terms of a spreadsheet of data, it determines the probability of observing specific values in the columns of the spreadsheet, in other words, the variables of the system. Next, we showed that because of the maximum entropy principle, which has the same mathematical form in physics and in information theory, we can ignore higher-order terms in the Taylor series of this F, which we call the effective free energy of the system. We showed that after rearranging the terms in the Taylor series, we can write it in the form of a Gaussian distribution for the target variable, the variable that we would like to predict. Then I used some heuristic arguments to argue that this is the mean of y, in other words, that I can write this equation, which is just linear regression.
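For reference, here is a sketch of the form being described, in my own notation (the on-screen symbols may differ). Truncating the effective free energy F at second order in y gives a Gaussian for the target variable,

$$P(y \mid x) \;=\; \frac{1}{Z}\, e^{-F(y,\,x)} \;\approx\; \frac{1}{Z}\, \exp\!\left(-\frac{\bigl(y-\mu(x)\bigr)^{2}}{2\sigma^{2}}\right), \qquad \mu(x) \;=\; \beta_0 + \sum_i \beta_i x_i ,$$

so the expected value of y is $\mathbb{E}[y \mid x] = \mu(x) = \beta_0 + \sum_i \beta_i x_i$, which is exactly the linear-regression equation.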
Derivation of Linear Regression (10:29)
Let me now use the technology behind Feynman diagrams to prove that this equation implies linear regression. I will then use the same derivation, without repeating it, to estimate the parameters of the model using data. First, I would like to emphasize that in the case of linear regression, the x values in this equation are equal to the values in the rows of the spreadsheet. That means that in the case of linear regression, this probability is a function of only one variable, y. In this equation, Z in the denominator is called the partition function.
Partition Function (11:06)
The partition function contains all the information we need to extract from the system. At the same time, it is a normalization factor, meaning that if I sum over the probabilities of all the y values, the result must be equal to one; that is the definition of probability. To keep the notation simple, let me replace these terms with μ; I will replace μ with the actual terms at the end of the calculation. I will now add an auxiliary variable J, which I need to set to zero at the end. Let me now add and subtract this term so that I have a complete square term for y − μ.
Feynman Diagram of Linear Regression (11:53)
In terms of Feynman diagrams, this 2σ² is the propagator of the model, and it is represented by a line in the computations. This term is not a function of y, so I can take it out of the integral, and from the integral table I know the answer to this integral. Here is the analytic solution for Z: if we set J equal to zero, the normalization factor comes out to be this, which is the familiar form from the Gaussian distribution. Let's now take a derivative of both sides of the equation with respect to J and then set J equal to zero. The left-hand side is just the definition of the mean of y − μ, and the right-hand side is zero. If you now replace μ with the actual terms, as promised earlier, this equation becomes linear regression.
Estimate Parameters of Linear Regression (12:45)
Square using data let’s now take a
12:47
second derivative of the two sides with
12:50
respect to J and then set J to zero
12:53
again the right hand side is just Sigma
12:56
Square Times the normalization factor
12:59
the left hand side is just the
13:01
definition of the variance of Y times
13:04
the normalization factor so we just
13:07
prove that Sigma squ is the variance of
13:10
Y we can now use the law of large
13:13
numbers to estimate Sigma Square using
13:16
our data spreadsheet for that I first
13:19
need to find the MU the mean of Y using
13:23
the same law mu is the sum of all the Y
13:26
values in my spreadsheet divide by the
13:29
number of rows next I subtract the
13:32
estimated mu from all the Y values in
13:35
the spreadsheet multiply the results by
13:39
themselves take them to power two and
13:42
then sum all of the rows finally divide
13:45
by the number of rows actually I can
13:47
quickly perform these computations in
13:50
Python using the mean and the ver
13:53
methods of the nonp library for example
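A minimal sketch of that estimation (a hypothetical array standing in for the y column of the spreadsheet):

import numpy as np

y = np.array([2.1, 1.9, 3.2, 2.6])   # hypothetical y column of the spreadsheet

mu_hat = y.mean()                    # law-of-large-numbers estimate of mu
sigma2_hat = y.var()                 # mean of (y - mu_hat)**2, i.e. the variance
# equivalently: sigma2_hat = np.mean((y - mu_hat) ** 2)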
Let's now estimate the β parameters. First, expand the parentheses in the probability function, this time rearranging them into a matrix form. Now, assuming that the x values are also not fixed but, like the y variable, are treated as random, this is a multivariate Gaussian distribution.
Multivariate Gaussian Distribution (14:14)
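For reference, the standard multivariate Gaussian form in conventional notation (the expression shown on screen may carry extra factors coming from the video's 2σ² convention):

$$P(\mathbf{v}) \;=\; \frac{1}{Z}\,\exp\!\left(-\tfrac{1}{2}\,(\mathbf{v}-\boldsymbol{\mu})^{\mathsf T}\,\Sigma^{-1}\,(\mathbf{v}-\boldsymbol{\mu})\right), \qquad \mathbf{v} = (y,\, x_1, \dots, x_p)^{\mathsf T},$$

where Σ is the covariance matrix of the joint vector of y and the x variables, and Σ⁻¹ is its inverse.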
The matrix in the middle is the inverse of the covariance matrix, which is related to the correlation matrix. I will skip the proof, as it is just the higher-dimensional version of the proof I presented in the one-dimensional case earlier in this video. Again using the law of large numbers, I can estimate the components of the covariance matrix from the spreadsheet. For example, to find the value of β₁, I just need to estimate this component of the inverse of the covariance matrix. Note that I already estimated σ² earlier in the video: it was the empirical variance of the y column of the spreadsheet. I can perform these computations using NumPy's cov function in Python, and I can then compute the inverse of the matrix using NumPy's linalg.inv function. β₁ is then equal to minus 2σ² times the (1, 2) component of the estimated matrix.
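A rough sketch of these steps with hypothetical data; which entry of the inverse covariance matrix corresponds to β₁ depends on the video's on-screen ordering, so treat the index below as illustrative:

import numpy as np

# Hypothetical spreadsheet: column 0 is y, the remaining columns are x1, x2
data = np.array([[2.1, 1.0, 2.0],
                 [1.9, 2.0, 0.5],
                 [3.2, 3.0, 1.5],
                 [2.6, 1.5, 1.0]])

sigma2_hat = data[:, 0].var()            # empirical variance of the y column
cov = np.cov(data, rowvar=False)         # covariance matrix of (y, x1, x2)
precision = np.linalg.inv(cov)           # inverse of the covariance matrix

# Following the relation stated in the video (illustrative indexing):
beta_1 = -2 * sigma2_hat * precision[0, 1]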
I can follow the same procedure to estimate the rest of the β parameters, without going through the minimization task.
Symmetries Can Reduce Computations (15:38)
As the final chapter of this video, let me talk a bit about the miracle of symmetries and how incorporating them can revolutionize AI models. Let's assume that I have collected some data and, after looking at my spreadsheet, it turns out that my system has a Z2 symmetry. Simply speaking, that means that if I flip the plus or minus signs of all the cells of the spreadsheet, the empirical expectation values don't change. I can then immediately, with no further calculation, set all the odd-power terms in the effective free energy to zero. That means that in the blink of an eye I estimate those parameters to be zero, without any need to go into the minimization process.
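A minimal sketch of such a check, as my own illustration: under a Z2 sign flip, expectation values of odd powers change sign, so they can only remain unchanged if they are (close to) zero.

import numpy as np

# Hypothetical spreadsheet with columns as variables; this example is Z2-symmetric
rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, 3))

# Check the empirical odd moments (first and third powers) of every column
odd_moments = np.concatenate([data.mean(axis=0), (data ** 3).mean(axis=0)])
z2_symmetric = np.all(np.abs(odd_moments) < 0.2)   # crude tolerance, illustration only

# If z2_symmetric, the odd-power couplings in the effective free energy
# can be set to zero without any minimization.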
Gas, Liquid, Solid Equivalent of ML Models (16:30)
Another advantage of using symmetries is that we can categorize machine learning models into symmetry classes, estimate the parameters of each symmetry class once and for all, and record them in a notebook, just like a table of integrals. After that, we only need to find the symmetry class of a machine learning model and then read its parameters from that pre-written notebook.