A Physicist’s Guide to Building Better Machine Learning Models

In this video, Dr. Ardavan (Ahmad) Borzou will discuss the importance of revisiting our current approach to building and training machine learning models. To demonstrate the point, he will start from a general probability function and use the maximum entropy principle and perturbation theory to analytically derive linear regression. After a short review of the current ML training method, he will show an alternative that is computationally less costly.

Video Transcript

Introduction
This is an IBM 700 series computer. These were state-of-the-art computers of their time. Their computing power was about a millionth of that of a typical laptop today, yet they were the size of a building. They were not only slow but also consumed a lot of electricity, which made computations very costly. To perform calculations, they sent physical signals to vacuum tubes, a process that generated a lot of heat and added to the maintenance costs. And this is a GPU cluster of the kind we use today to build large language models such as OpenAI's GPT or Google's Gemini. Similar to the IBM 700 series, these clusters are the size of a building, or larger in this case, consume a lot of electricity, and generate a lot of heat.

In the case of the IBM 700 series, and the second generation of computers in general, physics stepped in and introduced semiconductors, which led to replacing vacuum tubes with transistors and revolutionized computers in every respect. The size became small enough to fit in our backpacks, while the computing power increased by a factor of millions; the electricity they need can be stored in a small battery, and the heat they generate is negligible.

Today we need a similar revolution: to shrink GPU clusters to something that fits in our pocket yet is more powerful, with negligible energy consumption and heat generation. You might have heard of quantum computers, which use quantum physics to perform their computations; they can potentially help us achieve this revolution. But quantum computers are not the only place where physics can revolutionize machine learning and AI. Physics has a vast literature on incorporating symmetries into its models, which makes them far more efficient. Let me explain through a simple example. Imagine you are going to compute the surface area of a sphere. If we don't take the spherical symmetry into account, we end up solving three integrals with complicated-looking limits of integration. But if we use the spherical symmetry, we end up with a single integral with very simple limits of integration. Now imagine running these two computations on the same computer. Which one will be faster, consume less energy, and generate less heat?
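(One way to make the contrast concrete; this is my own choice of setup and not necessarily the integrals shown on screen. A brute-force Cartesian surface integral over the upper hemisphere, doubled, versus the one elementary integral that survives once the spherical symmetry is used:)

```latex
A \;=\; 2\int_{-R}^{R}\int_{-\sqrt{R^2-x^2}}^{\sqrt{R^2-x^2}}
\frac{R}{\sqrt{R^2-x^2-y^2}}\,dy\,dx
\;=\; 4\pi R^2,
\qquad
A \;=\; 2\pi R^2\int_{0}^{\pi}\sin\theta\,d\theta \;=\; 4\pi R^2
```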
Of course, symmetries, backed by group theory, have an ocean of literature behind them and much more to offer than what I just mentioned. Physics also has a vast literature on perturbation theory and on how to work out the outcome of probabilistic events analytically; using this technology can reduce the computation needed to perform a given machine learning task. Perhaps you have seen these nice-looking diagrams. Invented by Richard Feynman, they provide an analytic framework for describing probabilistic events without much computation. On this channel, starting with this video, I will gradually talk about how to use these diagrams and the technology behind them to estimate machine learning parameters.

In high-energy physics and in statistical physics, models are described by a probability function of this form. Most existing machine learning models can also be reformulated into this same probabilistic form; my conjecture is that all machine learning models can be reformulated into this equation.
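(For reference: the on-screen equation is presumably the Boltzmann-type form used throughout statistical physics, which, in standard notation, with v collecting the variables of the system and F the function the video later calls the effective free energy, reads:)

```latex
P(\mathbf{v}) \;=\; \frac{e^{-F(\mathbf{v})}}{Z},
\qquad
Z \;=\; \sum_{\mathbf{v}} e^{-F(\mathbf{v})}
```

(with the sum replaced by an integral for continuous variables).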
Then comes the brilliant idea of the physicist Lev Landau and what we know as the phenomenon of universality. Simply speaking, the underlying principles don't matter much for the behavior of a statistical system. Loosely speaking, if this F is the same in a physics model and in a machine learning model, the two behave the same. If you already know how the physics model behaves, you already know everything about the machine learning model; there is no need to spend money, energy, and electricity on estimating its parameters. To demonstrate how we can reduce the computations, and hence the cost, by changing our approach to building machine learning models, in the rest of this video I will start from this equation, derive linear regression from it, and estimate its parameters with less than the usual computation.
Review of Current Approaches
To show you the difference, let me first briefly review the conventional process of parameter estimation in machine learning. Let's say I have this spreadsheet of data and would like to predict the last column, y, using the values in the rest of the columns. I will assume a function for the expected value of y in terms of the x variables; in linear regression, this is the equation.
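(For reference, the standard linear regression form of that expectation, written here for p predictor columns, is:)

```latex
\mathbb{E}[\,y \mid x_1,\dots,x_p\,] \;=\; \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p
```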
To estimate the β parameters, I will define a type of error function, a loss function, and try to minimize it. Here is an example of a loss function, called the residual sum of squares. The sum here runs over the rows of the spreadsheet: I just plug the x and y values of each row into this equation and end up with an equation for the β parameters. Now I need to find the best set of β parameters, the one that minimizes this loss function.
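(Again for reference, the residual sum of squares over the n rows of the spreadsheet is usually written as:)

```latex
\mathrm{RSS}(\beta) \;=\; \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^{2}
```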
A conventional method for finding the optimal set of β parameters that minimizes the error is called gradient descent, first proposed by Cauchy. Let me quickly explain how that method works. For simplicity, let's assume I have only two unknown β parameters. If I plot the loss, or error, function in terms of the two β parameters, I end up with this shape. The parameters at the bottom of this well are the solution of the minimization and are the ones I should insert into my linear regression. However, generating this plot to find the minimum is not efficient, because every single point on the surface of this plot requires one computation. To avoid all these computations, I will start with an initial guess for the parameters, usually just a random location on this surface. I will then compute the slope at the points around my initial guess and find that this direction, for example, gives the steepest decrease in the value of the error function. I will then update my initial guess of the parameters by moving their values in this direction, and I will repeat this process until I achieve no further decrease in the error function. This same algorithm, or one of its derivatives, is used to train even deep learning models.
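(Here is a minimal sketch of that loop in Python with NumPy for the two-parameter case; the toy data, learning rate, and stopping rule are illustrative choices of mine, not values taken from the video:)

```python
import numpy as np

# Toy spreadsheet: one predictor column x and a target column y (illustrative data).
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 1.5 + 2.0 * x + rng.normal(scale=0.3, size=100)

beta = rng.normal(size=2)   # random initial guess for (beta_0, beta_1)
learning_rate = 0.1

for step in range(10_000):
    residual = y - (beta[0] + beta[1] * x)           # prediction error for each row
    grad = np.array([-2.0 * residual.mean(),          # d(mean RSS)/d beta_0
                     -2.0 * (residual * x).mean()])   # d(mean RSS)/d beta_1
    new_beta = beta - learning_rate * grad            # step against the gradient
    if np.max(np.abs(new_beta - beta)) < 1e-10:       # stop when the update is negligible
        beta = new_beta
        break
    beta = new_beta

print(beta)   # ends up close to the true values (1.5, 2.0)
```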
Now, why is this method of building machine learning models not efficient? First, my initial guess can be far away from the real minimum, so it takes a lot of steps to reach the minimum, and every single step of this algorithm costs money. Another problem arises when we have a surface like this and our initial guess happens to fall here; the algorithm then gives me a wrong answer: it returns the location of a local minimum instead of the true minimum. And this visualization is just for two unknown parameters; imagine how complicated and lengthy the process can be when a model has as many as a trillion unknown parameters, as in the GPT or Gemini large language models. That explains why OpenAI spent $100 million to train GPT-4.
Alternative Approach
Let's now switch gears and build a linear regression model with a couple of advantages: one, we do it without explicitly going into the minimization task; two, it starts from this formulation, which can be linked to physics models and take advantage of the concept of universality in physics; three, it is set up so that we can use Feynman diagrams both to understand these models and to compute their parameters. In a couple of previous videos I covered part of what I need for this video, so I will just summarize the important parts to keep this video self-contained; for more details, I suggest you watch those videos later.
Gaussian Form from First Principles
In those videos, we discussed that a general form of probability function governs the occurrence of events in a system; in terms of a spreadsheet of data, it determines the probability of observing specific values in the columns of the spreadsheet, in other words, the variables of the system. Next, we showed that because of the maximum entropy principle, which has the same mathematical form in physics and in information theory, we can ignore higher-order terms in the Taylor series of this F, which we call the effective free energy of the system. We then showed that after rearranging the terms in the Taylor series, we can write it in the form of a Gaussian distribution for the target variable, the variable that we would like to predict.
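(Written out, and using the notation of the later part of the video, where the mean turns out to be the linear combination of the x columns, that Gaussian form reads:)

```latex
P(y \mid x_1,\dots,x_p) \;=\; \frac{1}{Z}\,
\exp\!\left[-\frac{(y-\mu)^2}{2\sigma^2}\right],
\qquad
\mu \;=\; \beta_0 + \sum_{j=1}^{p}\beta_j x_j
```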
Then I used some heuristic arguments to prove that this is the mean of y, in other words, to prove that I can write this equation, which is just linear regression.
Derivation of Linear Regression
Let me now use the technology behind Feynman diagrams to prove that this equation implies linear regression. I will then use the same derivation, without repeating it, to estimate the parameters of the model from data. First, I would like to emphasize that in the case of linear regression the x values in this equation are equal to the values in the rows of the spreadsheet; that means that, for linear regression, this probability is a function of only one variable, y. In this equation, Z in the denominator is called the partition function.
Partition Function
It contains all the information we need to extract from the system. Meanwhile, it is a normalization factor, meaning that if I sum the probabilities over all possible y values, the result must equal one; that is the definition of probability. To keep the notation simple, let me replace these terms with μ; I will put the actual terms back in place of μ at the end of the calculation. I will now add an auxiliary variable J, which I need to set to zero at the end. Let me now add and subtract this term so that I have a complete square in y minus μ.
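(A sketch of this step, assuming the auxiliary variable couples linearly as J times (y - μ), the usual source-term construction; the exact coupling shown on screen may differ:)

```latex
Z(J) \;=\; \int dy\;\exp\!\left[-\frac{(y-\mu)^2}{2\sigma^2} + J\,(y-\mu)\right]
\;=\; e^{\sigma^2 J^2/2}\int dy\;\exp\!\left[-\frac{\bigl(y-\mu-\sigma^2 J\bigr)^2}{2\sigma^2}\right]
```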
Feynman Diagram of Linear Regression
In terms of Feynman diagrams, this 2σ² is the propagator of the model, and it is represented by a line in the computations. This term is not a function of y, so I can take it out of the integral, and from the integral table I know the answer to this integral; here is the analytic solution for Z. If we set J equal to zero, the normalization factor comes out to be this, which is the familiar form from the Gaussian distribution. Let's now take the derivative of the two sides of the equation with respect to J and then set J equal to zero. The left-hand side is just the definition of the mean of y minus μ, and the right-hand side is zero. If you now replace μ with the actual terms, as we promised earlier, this equation becomes linear regression.
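(With the source term assumed above, the analytic solution for Z and the first-derivative step read as follows, where the angle brackets denote the average over P(y):)

```latex
Z(J) \;=\; \sqrt{2\pi\sigma^2}\;e^{\sigma^2 J^2/2},
\qquad
\left.\frac{\partial Z}{\partial J}\right|_{J=0}
\;=\; Z(0)\,\langle y-\mu\rangle \;=\; 0
\;\;\Rightarrow\;\;
\langle y\rangle \;=\; \mu \;=\; \beta_0 + \sum_{j}\beta_j x_j
```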
Estimate Parameters of Linear Regression
To estimate σ² from data, let's now take the second derivative of the two sides with respect to J and then set J to zero again. The right-hand side is just σ² times the normalization factor, and the left-hand side is just the definition of the variance of y times the normalization factor, so we have just proved that σ² is the variance of y.
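(The second-derivative step, in the same notation:)

```latex
\left.\frac{\partial^2 Z}{\partial J^2}\right|_{J=0}
\;=\; Z(0)\,\bigl\langle (y-\mu)^2\bigr\rangle
\;=\; Z(0)\,\sigma^2
\;\;\Rightarrow\;\;
\operatorname{Var}(y) \;=\; \sigma^2
```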
We can now use the law of large numbers to estimate σ² from our data spreadsheet. For that, I first need to find μ, the mean of y. Using the same law, μ is the sum of all the y values in my spreadsheet divided by the number of rows. Next, I subtract the estimated μ from all the y values in the spreadsheet, square the results, sum them over all the rows, and finally divide by the number of rows. Actually, I can quickly perform these computations in Python using the mean and var methods of the NumPy library, for example.
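(A minimal sketch of those two estimates, using a toy stand-in for the y column; the numbers are mine:)

```python
import numpy as np

# Toy stand-in for the y column of the spreadsheet.
y = np.array([2.1, 3.4, 1.8, 2.9, 3.1])

mu_hat = y.mean()      # empirical mean: sum of the y values divided by the number of rows
sigma2_hat = y.var()   # empirical variance: average of (y - mu_hat)**2 over the rows
print(mu_hat, sigma2_hat)
```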
Let's now estimate the β parameters. First, expand the parentheses in the probability function and this time rearrange them into matrix form. Now, assuming that the x values are also not fixed, similar to the y variable, this is a multivariate Gaussian distribution.
Multivariate Gaussian Distribution
The matrix in the middle is the inverse of the covariance matrix, which is related to the correlation matrix.
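(For reference, with v denoting the vector of columns, m its mean vector, and Σ the covariance matrix of the p + 1 variables, the multivariate Gaussian is:)

```latex
P(\mathbf{v}) \;=\; \frac{1}{Z}\,
\exp\!\left[-\tfrac{1}{2}\,(\mathbf{v}-\mathbf{m})^{\top}\,\Sigma^{-1}\,(\mathbf{v}-\mathbf{m})\right],
\qquad
Z \;=\; \sqrt{(2\pi)^{p+1}\det\Sigma}
```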
I will skip the proof, as it is just the higher-dimensional version of the proof I presented for the one-dimensional case earlier in this video. Again using the law of large numbers, I can estimate the components of the covariance matrix from the spreadsheet. For example, to find the value of β1, I just need to estimate this component of the inverse of the covariance matrix. Note that I already estimated σ² earlier in the video: it was the empirical variance of the y column of the spreadsheet. I can perform these computations using the cov method of the NumPy library in Python; next, I can compute the inverse of the matrix using NumPy's linalg.inv method. β1 is then equal to minus 2σ² times the (1, 2) component of the estimated matrix.
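(A minimal sketch of that covariance-based route in Python, with toy data of my own. To keep the snippet verifiably correct, I compute the slope through the standard multivariate-Gaussian identity beta_1 = -(Σ⁻¹)_{y,x1} / (Σ⁻¹)_{y,y}; this should correspond to the quantity the video writes as minus 2σ² times a component of its on-screen matrix, under that matrix's own normalization, which I cannot check from the transcript:)

```python
import numpy as np

# Toy stand-in for the spreadsheet: a target column y and one predictor column x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
y = 1.5 + 2.0 * x1 + rng.normal(scale=0.5, size=1000)

data = np.vstack([y, x1])       # row 0 = y, row 1 = x1; each column is one spreadsheet row
cov = np.cov(data)              # empirical covariance matrix via NumPy's cov method
cov_inv = np.linalg.inv(cov)    # its inverse via linalg.inv

# Standard multivariate-Gaussian identity for the regression slope of y on x1.
beta1_hat = -cov_inv[0, 1] / cov_inv[0, 0]
print(beta1_hat)                # lands close to the true slope 2.0, with no gradient descent
```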
I can follow the same procedure to estimate the rest of the β parameters, without going through the minimization task. As the final chapter of this video, let me talk a bit about the miracle of symmetries and how incorporating them can revolutionize AI models.
Symmetries Can Reduce Computations
Let's assume that I have collected some data and, after looking at my spreadsheet, it turns out that my system has a Z2 symmetry. Simply speaking, that means that if I flip the plus and minus signs of all the cells of the spreadsheet, the empirical expectation values don't change. I can then immediately, with no further calculation, set all the odd-power terms in the effective free energy to zero. That means that in the blink of an eye I estimate those parameters to be zero, without any need to go into the minimization process.
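(In symbols, with v denoting the vector of all columns and with my own notation for the Taylor coefficients of the effective free energy: if the statistics are unchanged under v → -v, then F(-v) = F(v) and every odd-order coefficient must vanish:)

```latex
F(\mathbf{v}) \;=\; \sum_i a_i v_i + \sum_{i,j} a_{ij}\,v_i v_j + \sum_{i,j,k} a_{ijk}\,v_i v_j v_k + \cdots,
\qquad
F(\mathbf{v}) = F(-\mathbf{v}) \;\Rightarrow\; a_i = a_{ijk} = \cdots = 0
```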
Another advantage of using symmetries is that we can categorize machine learning models into symmetry classes.
Gas, Liquid, Solid Equivalent of ML Models
We can then estimate the parameters of each symmetry class once and for all and record them in a notebook, just like a table of integrals. After that, we only need to find the symmetry class of a machine learning model and read its parameters from that pre-written notebook.
