Consider some set of points $\{(x_i, y_i)\}$. To make statistically grounded predictions of $y$ given $x$, we would like to learn the distribution $P(y \mid x, \mu_{y(x)}, \sigma)$. In this article, we will derive an algorithm to learn this distribution for the case when $\mu_{y(x)} \propto x$. Specifically, we make the following assumptions:

(1) For all $x$, $\sigma_x = \sigma$, where $\sigma$ is some constant.

(2) The data points are drawn iid from the Gaussian distribution $\mathcal{N}(y \mid x, \mu_{y(x)}, \sigma)$.

Define:

$\hat{y}$ is a single prediction

$\mathbf{\hat{y}}$ is a vector of predictions

$\mathbf{y}$ is a vector of iid draws from $P(y \mid x, \mu_{y(x)}, \sigma)$

$\mathbf{x}$ is a vector of $x$ values corresponding to the iid draws in $\mathbf{y}$

$\mu_i = \mathbf{y}_i$

We assume that the observed values of $\mathbf{y}$ are the true means of each $P(y \mid x, \mu_{y(x)}, \sigma)$.

Recalling the definition of independence, we define the likelihood function for $\mathbf{\hat{y}}$, which measures the likelihood that, given the means $\mu_i$ and $\sigma$, our predictions will actually be observed and not merely predicted.
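
Under the independence assumption, the joint density factorizes into a product of per-point densities. A sketch of this likelihood in the notation above, with $N$ the number of data points:

$$\mathcal{L}(\mathbf{\hat{y}}) = \prod_{i=1}^{N} P(\hat{y}_i \mid x_i, \mu_i, \sigma)$$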

Our problem is reduced to finding those $\hat{y}_i$ (the model's estimates of $\mu_{y(x_i)}$) and $\sigma$ which maximize $\mathcal{L}$. Because $\ln x$ is monotonically increasing, maximizing $\mathcal{L}$ is equivalent to maximizing $\ln \mathcal{L}$.

Then, taking the logarithm of the product turns it into a sum.
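
In the notation of the likelihood sketched above, this gives:

$$\ln \mathcal{L} = \sum_{i=1}^{N} \ln P(\hat{y}_i \mid x_i, \mu_i, \sigma)$$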

Recall the definition of the Gaussian:
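
For a mean $\mu$ and standard deviation $\sigma$, the density is:

$$\mathcal{N}(y \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y - \mu)^2}{2\sigma^2}\right)$$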

which we substitute into the log-likelihood function:
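
Carrying out the substitution term by term yields, as a sketch in the notation above:

$$\ln \mathcal{L} = \sum_{i=1}^{N} \left[ -\frac{1}{2}\ln\!\left(2\pi\sigma^2\right) - \frac{(\hat{y}_i - \mu_i)^2}{2\sigma^2} \right]$$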

You may have noticed that this model assumes the variance is constant with respect to $x$: only $\mu_{y(x)}$ changes, never the variance.

Now we will make the linearity assumption, sketched below.
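
A minimal form of this assumption, consistent with $\mu_{y(x)} \propto x$ from the introduction and introducing a single weight $w$ to be learned:

$$\mu_{y(x_i)} = w\, x_i$$

(A bias term could be added without changing the rest of the derivation.)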

Our problem therefore reduces to the following optimization.
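
A sketch of that optimization, writing the predictions as $\hat{y}_i = w x_i$ and keeping $\mu_i = \mathbf{y}_i$ as defined earlier:

$$\max_{w,\,\sigma} \; \sum_{i=1}^{N} \left[ -\frac{1}{2}\ln\!\left(2\pi\sigma^2\right) - \frac{(w x_i - \mu_i)^2}{2\sigma^2} \right]$$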

With $\sigma$ held fixed, this is exactly the case when the data-dependent term $-\frac{1}{2\sigma^2} \sum_{i=1}^{N} (\hat{y}_i - \mu_i)^2$ is maximized, since the $\ln(2\pi\sigma^2)$ term does not depend on $w$.

Since maximizing a function is equivalent to maximizing $k$ times that function for any positive constant $k$ (and maximizing the negative of a quantity is the same as minimizing it), we can replace the $-\frac{1}{2\sigma^2}$ factor with $\frac{1}{2}$ and minimize instead, which simplifies our theoretical derivations.
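
With that replacement, the problem becomes the familiar least-squares objective (a sketch in the notation above):

$$\min_{w} \; \frac{1}{2} \sum_{i=1}^{N} (w x_i - \mu_i)^2$$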

The problem as stated could be solved analytically. However, closed-form solutions are often difficult or impossible to obtain in more general settings, so we use numerical methods instead. A common numerical method is gradient descent.
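
For the least-squares objective above, the gradient with respect to $w$ has a simple form (a sketch in the notation used so far):

$$\frac{\partial}{\partial w} \left[ \frac{1}{2} \sum_{i=1}^{N} (w x_i - \mu_i)^2 \right] = \sum_{i=1}^{N} (w x_i - \mu_i)\, x_i$$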

We then use this gradient to update the weights at each iteration, scaled by some learning rate.
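
Below is a minimal sketch of this procedure in Python with NumPy, assuming synthetic data, a single weight $w$ with no bias, and illustrative hyperparameters (learning rate, step count) that are not taken from the derivation above:

```python
import numpy as np

# Illustrative synthetic data: mu_y(x) proportional to x, constant noise sigma.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=100)
true_w, sigma = 2.5, 1.0
y = true_w * x + rng.normal(0.0, sigma, size=100)

def loss(w, x, y):
    """Half the sum of squared errors: (1/2) * sum_i (w * x_i - y_i)^2."""
    return 0.5 * np.sum((w * x - y) ** 2)

def grad(w, x, y):
    """Gradient of the loss with respect to w: sum_i (w * x_i - y_i) * x_i."""
    return np.sum((w * x - y) * x)

# Gradient descent on the single weight w.
w = 0.0
learning_rate = 1e-4
print(f"initial loss = {loss(w, x, y):.1f}")
for _ in range(1000):
    w -= learning_rate * grad(w, x, y)
print(f"final loss = {loss(w, x, y):.1f}, learned w = {w:.3f}")

# The mean squared error of the fit estimates the variance of P(y | x).
mse = np.mean((w * x - y) ** 2)
print(f"MSE (estimate of sigma^2) = {mse:.3f}")
```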

Once we have the weights, we have a way of predicting $\mu_{y(x)}$ for any $x$, and our $MSE$ is exactly the maximum-likelihood estimate of the variance of $P(y \mid x)$. Thus we have successfully fit the distribution.
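
That last claim can be checked by maximizing the log-likelihood with respect to $\sigma$ as well: setting $\partial \ln \mathcal{L} / \partial \sigma = 0$ gives, in the notation above,

$$\hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - \mu_i)^2,$$

which is exactly the mean squared error of the fitted predictions.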