Consider some set of points $\{(x_i, y_i)\}$. To make statistically meaningful predictions of $y$ given $x$, we would like to learn the distribution $P(y \mid x, \mu_{y(x)}, \sigma)$. In this article, we will derive an algorithm to learn this distribution for the case when $\mu_{y(x)}$ is a linear function of $x$. Specifically, we make the following assumptions:
(1) For all $x$, $\sigma_x = \sigma$, where $\sigma$ is some constant.
(2) The datapoints are drawn iid from the Gaussian distribution $\mathcal{N}(y \mid x, \mu_{y(x)}, \sigma)$.
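In other words, under these assumptions each observation can be written as follows (a sketch of the generative model; the noise variable $\epsilon_i$ is notation introduced here for illustration):

$$
y_i = \mu_{y(x_i)} + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2) \ \text{iid}.
$$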
Define:
$\hat{y}$ is a single prediction
$\mathbf{\hat{y}}$ is a vector of predictions
$\mathbf{y}$ is a vector of iid draws from $P(y \mid x, \mu_{y(x)}, \sigma)$
$\mathbf{x}$ is a vector of $x$ values corresponding to the iid draws in $\mathbf{y}$
$\mu_i = \mathbf{y}_i$
We assume that the observed values of $\mathbf{y}$ are the true means for each $P(y \mid x, \mu_{y(x)}, \sigma)$.
Recalling the definition of independence, we define the likelihood function for $\mathbf{\hat{y}}$ as

$$
\mathcal{L}(\mathbf{\hat{y}}) = \prod_{i=1}^{n} P(\hat{y}_i \mid x_i, \mu_i, \sigma),
$$

where $n$ is the number of datapoints. The likelihood is a measure of how likely it is that, given the means $\mu_i$ and $\sigma$, our predictions would actually be observed and not merely predicted.
Our problem is reduced to finding the $\mathbf{\hat{y}}$ and $\sigma$ which maximize $\mathcal{L}$. Because $\ln x$ is monotonically increasing, maximizing $\mathcal{L}$ is equivalent to maximizing $\ln \mathcal{L}$.
Then

$$
\ln \mathcal{L} = \sum_{i=1}^{n} \ln P(\hat{y}_i \mid x_i, \mu_i, \sigma).
$$
Recall the definition of the Gaussian:

$$
\mathcal{N}(y \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(y-\mu)^2}{2\sigma^2}\right),
$$

which we substitute into the log likelihood function to get

$$
\ln \mathcal{L} = \sum_{i=1}^{n} \left[ -\ln\!\left(\sigma\sqrt{2\pi}\right) - \frac{(\hat{y}_i - \mu_i)^2}{2\sigma^2} \right] = -n \ln\!\left(\sigma\sqrt{2\pi}\right) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (\hat{y}_i - \mu_i)^2.
$$
Let me comment on something you may have noticed: this model assumes that only $\mu_{y(x)}$ varies with $x$, while the variance $\sigma^2$ stays constant.
Now we will make the linearity assumption, i.e. that

$$
\hat{y}_i = w_1 x_i + w_0
$$

for some weights $w_0$ and $w_1$, so that learning $\mu_{y(x)}$ amounts to learning these weights.
So our problem comes down to maximizing $\ln \mathcal{L}$ with respect to the weights. Since the first term above does not depend on the weights, this is exactly the case when

$$
-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (\hat{y}_i - \mu_i)^2
$$

is maximized.
Since scaling an objective by a positive constant does not change where its optimum lies, and maximizing a negated quantity is the same as minimizing the quantity itself, we can replace the $-\frac{1}{2\sigma^2}$ factor with $\frac{1}{2}$ and minimize instead, which simplifies our derivations. Our objective is therefore to minimize

$$
E = \frac{1}{2} \sum_{i=1}^{n} (\hat{y}_i - \mu_i)^2.
$$
The problem as stated could be solved analytically. In practice, however, analytic solutions are often difficult or impossible for more general models, so we use numerical methods instead. A common numerical method is gradient descent.
We then repeatedly compute the gradient of $E$ with respect to the weights and update them accordingly using some learning rate.
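As a concrete sketch (using the parameterization $\hat{y}_i = w_1 x_i + w_0$ and a learning rate $\eta$, notation chosen here for illustration), the gradients and the resulting update rules are:

$$
\frac{\partial E}{\partial w_1} = \sum_{i=1}^{n} (\hat{y}_i - \mu_i)\, x_i, \qquad
\frac{\partial E}{\partial w_0} = \sum_{i=1}^{n} (\hat{y}_i - \mu_i),
$$

$$
w_1 \leftarrow w_1 - \eta \frac{\partial E}{\partial w_1}, \qquad
w_0 \leftarrow w_0 - \eta \frac{\partial E}{\partial w_0}.
$$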
Once we have the weights, we have a way of predicting $\mu_{y(x)}$ for any $x$, and the $MSE$ of the fit is the maximum-likelihood estimate of the variance of $P(y \mid x)$. Thus we have successfully fit the distribution.
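To tie everything together, here is a minimal sketch in Python/NumPy of the procedure derived above; the function name `fit_linear_gaussian`, the hyperparameters (`learning_rate`, `n_steps`), and the synthetic data are illustrative choices, not part of the derivation.

```python
import numpy as np

def fit_linear_gaussian(x, y, learning_rate=1e-3, n_steps=10_000):
    """Fit mu_y(x) = w1 * x + w0 by gradient descent on
    E = 0.5 * sum((y_hat - y)**2), then estimate sigma^2 as the MSE."""
    w0, w1 = 0.0, 0.0
    for _ in range(n_steps):
        y_hat = w1 * x + w0                    # current predictions
        residual = y_hat - y                   # (y_hat_i - mu_i)
        grad_w1 = np.sum(residual * x)         # dE/dw1
        grad_w0 = np.sum(residual)             # dE/dw0
        w1 -= learning_rate * grad_w1
        w0 -= learning_rate * grad_w0
    y_hat = w1 * x + w0
    sigma2 = np.mean((y_hat - y) ** 2)         # MLE of the variance (the MSE)
    return w0, w1, sigma2

# Example usage on synthetic data drawn from the assumed model.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * x + 0.5 + rng.normal(0.0, 0.3, size=200)  # true w1=2.0, w0=0.5, sigma=0.3
w0, w1, sigma2 = fit_linear_gaussian(x, y)
print(w0, w1, np.sqrt(sigma2))
```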