Ordinary linear regression
 Least squares is a widely used method for finding the “line of best fit”.
 A line fitted through a cloud of points will generally not pass through most of the points.
 What is the best way to find the parameters of a linear model \(y = ax + b\)?
 A typical approach is to choose \(a\) and \(b\) such that the difference between the actual dependent variables \(y_i\) and the predicted values \(\hat{y}_i\) is minimized.
 There are different ways to measure the difference between \(y_i\) and \(\hat{y}_i\). Two common choices are the \(L_1\) norm and the \(L_2\) (Euclidean) norm: \(L_1 = \sum_{i=1}^{n} |\hat{y}_i - y_i|\), \(L_2 = \sum_{i=1}^{n} (\hat{y}_i - y_i)^2\).
 These measures are based on errors, or residuals, where each error is \(e_i = \hat{y}_i - y_i\).
 To find the best \(a\) and \(b\), we want to minimize the loss measure \(L_l\).
 \(l = 2\) is the most commonly used and simplest choice, because \(L_1\) has two main drawbacks: first, it does not have an analytical solution; second, it sometimes admits multiple solutions, whereas \(L_2\) always returns a single optimal solution.
 To find the best parameters of the line fitting a set of data, we can use a simple, though not optimal, Monte Carlo algorithm.
 In the Monte Carlo algorithm, we choose two random values for \(a\) and \(b\) and measure the loss. We then choose two other random values and measure the loss again. If the new loss is less than the previous one, we update \(a\) and \(b\). We repeat this until the loss stops changing much.
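The procedure above can be sketched in Julia. This is a minimal, non-optimal sketch; the function name `monte_carlo_fit`, the search range, and the step count are illustrative assumptions, not part of any library:

```julia
using Random

# L2 loss of the line ŷ = a·x + b
l2_loss(x, y, a, b) = sum((y .- (a .* x .+ b)) .^ 2)

# Keep a random (a, b) candidate whenever it lowers the loss.
function monte_carlo_fit(x, y; steps=100_000, lo=-10.0, hi=10.0)
    best_a = lo + rand() * (hi - lo)
    best_b = lo + rand() * (hi - lo)
    best_loss = l2_loss(x, y, best_a, best_b)
    for _ in 1:steps
        a = lo + rand() * (hi - lo)
        b = lo + rand() * (hi - lo)
        loss = l2_loss(x, y, a, b)
        if loss < best_loss             # better candidate found: update
            best_a, best_b, best_loss = a, b, loss
        end
    end
    return best_a, best_b
end

Random.seed!(1)
x = collect(1.0:10.0)
y = 2.0 .* x .+ 1.0                     # points on the line y = 2x + 1
a, b = monte_carlo_fit(x, y)
```

With enough steps, the returned \(a\) and \(b\) land close to the true line, but random search wastes most of its samples, which is why better methods follow.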
Linear regression best fit formula
 There is a simpler and more common way to minimize the cost function for a regression problem.
 Deriving the formula that directly gives the best-fitting line is informative.
 Let’s say \(x_i\) denotes each independent variable (input) and \(y_i\) denotes each dependent variable (output).
 The line of best fit is \(\hat{y}_i = ax_i + b\).
 The cost function is \(C = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\).
 We now replace \(\hat{y}_i\) with its definition: \(C = \sum_{i=1}^{n} (y_i - ax_i - b)^2\).
 Minimizing \(C\) involves taking its partial derivatives and finding the values of \(a\) and \(b\) at which they are 0.
 Partial derivative of \(C\) with respect to \(b\):
 \(\frac{\partial C}{\partial b} = \frac{\partial}{\partial b}\left[\sum_{i=1}^{n} (y_i - ax_i - b)^2\right]\).
 Use the chain rule to take the derivative and set it to zero: \(0 = \sum_{i=1}^{n} -2(y_i - ax_i - b)\).
 Divide both sides by \(-2\): \(0 = \sum_{i=1}^{n} (y_i - ax_i - b)\).
 Break the summation into three parts and take \(a\) out: \(0 = \sum_{i=1}^{n} y_i - a \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} b\).
 Summing \(b\) over \(n\) terms gives \(nb\): \(0 = \sum_{i=1}^{n} y_i - a \sum_{i=1}^{n} x_i - nb\).
 Add \(nb\) to both sides and divide by \(n\): \(b = \frac{\sum_{i=1}^{n} y_i - a \sum_{i=1}^{n} x_i}{n}\).
 A summation divided by \(n\) is a mean, so this simplifies to \(b = \bar{y} - a\bar{x}\).
 Partial derivative of \(C\) with respect to \(a\):
 \(\frac{\partial C}{\partial a} = \frac{\partial}{\partial a}\left[\sum_{i=1}^{n} (y_i - ax_i - b)^2\right]\).
 Use the chain rule and set the derivative to zero: \(0 = \sum_{i=1}^{n} -2x_i(y_i - ax_i - b)\).
 Divide both sides by \(-2\): \(0 = \sum_{i=1}^{n} x_i(y_i - ax_i - b)\).
 Distribute \(x_i\): \(0 = \sum_{i=1}^{n} (x_iy_i - ax_i^2 - bx_i)\).
 Substitute \(b = \bar{y} - a\bar{x}\) from above: \(0 = \sum_{i=1}^{n} (x_iy_i - ax_i^2 - (\bar{y} - a\bar{x})x_i)\).
 Distribute the minus sign and \(x_i\): \(0 = \sum_{i=1}^{n} (x_iy_i - ax_i^2 - \bar{y}x_i + a\bar{x}x_i)\).
 Split up the sum into two sums: \(0 = \sum_{i=1}^{n} (x_iy_i - \bar{y}x_i) + \sum_{i=1}^{n}(a\bar{x}x_i - ax_i^2)\).
 Take \(a\) out of the second summation: \(0 = \sum_{i=1}^{n} (x_iy_i - \bar{y}x_i) - a \sum_{i=1}^{n}(x_i^2 - \bar{x}x_i)\).
 Isolate \(a\): \(a = \frac{\sum_{i=1}^{n} (x_iy_i - \bar{y}x_i)}{\sum_{i=1}^{n}(x_i^2 - \bar{x}x_i)}\).
 You can now calculate \(a\), then substitute it in the formula for \(b\).
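The two formulas translate directly into code. A short sketch (the function name `best_fit` is an arbitrary choice, not a library function):

```julia
# Closed-form solution derived above:
# a = Σ(x_i·y_i − ȳ·x_i) / Σ(x_i² − x̄·x_i),  b = ȳ − a·x̄
using Statistics

function best_fit(x, y)
    x̄, ȳ = mean(x), mean(y)
    a = sum(x .* y .- ȳ .* x) / sum(x .^ 2 .- x̄ .* x)
    b = ȳ - a * x̄
    return a, b
end

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.9]   # roughly y = 2x
a, b = best_fit(x, y)
```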
 How different are the \(a\) and \(b\) values from your Monte Carlo algorithm compared with the analytical results?
Multivariate regression
 As the number of independent variables increases, finding the line of best fit analytically becomes increasingly difficult.
 A common algorithm to solve such problems is Gradient Descent (GD).
 GD is an algorithm for minimizing any function, here the cost function.
 GD outline:
 Start with some random (or all zero) parameter values.
 Take the derivative of the cost function with respect to each parameter.
 Subtract the derivative from the original parameter values.
 Before subtraction, you can multiply the gradient by a fixed factor, the learning rate α. This controls the “step size” of the algorithm.
 The new parameter values are the old values minus the gradient times the learning rate.
 More formally, a GD algorithm is repeating the following until convergence:
 \(a_i = a_i - \alpha \frac{\partial C}{\partial a_i}\) for \(i = 1 \ldots n\), where \(a_i\) is parameter \(i\), \(C\) is the cost function, and \(\alpha\) is the learning rate.
 Since the derivative gives the slope, subtracting it from the original parameter value moves us in the direction that lowers the cost function.
 If α is too small, the algorithm will be slow.
 If α is too big, the algorithm will fail to converge or even diverge.
 Note that all parameters should be updated simultaneously.
 The cost function can be normalized by dividing it by the number of samples: \(C = \frac{1}{2n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2\).
 The \(\frac{1}{2}\) cancels the \(2\) that appears when taking the derivative.
Derivative of the cost function
 The derivative with respect to the bias term is \(\frac{\partial C}{\partial b} = -\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)\).
 The derivative with respect to a coefficient is \(\frac{\partial C}{\partial a} = -\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)x_i\).
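Combining the update rule with these derivatives gives a minimal gradient-descent loop. This is a sketch with assumed names and default values; note that both parameters are updated simultaneously from the same predictions:

```julia
# Univariate gradient descent on C = 1/(2n) Σ (y_i − ŷ_i)²
function gradient_descent(x, y; α=0.01, iterations=10_000)
    a, b = 0.0, 0.0
    n = length(x)
    for _ in 1:iterations
        ŷ = a .* x .+ b                   # predictions with current parameters
        grad_a = -sum((y .- ŷ) .* x) / n  # ∂C/∂a
        grad_b = -sum(y .- ŷ) / n         # ∂C/∂b
        a -= α * grad_a                   # simultaneous update:
        b -= α * grad_b                   # both gradients use the same ŷ
    end
    return a, b
end

x = collect(1.0:10.0)
y = 3.0 .* x .+ 2.0                       # noise-free line y = 3x + 2
a, b = gradient_descent(x, y)
```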
How to choose α
 Choosing a good learning rate is important both for converging to a minimum and for keeping training time reasonable.
 To find which learning rate is best, you can use two kinds of plots:
 How the cost changes per iteration.
 How the training time changes with the learning rate.
 Ideally, the cost should decrease and eventually flatten out.
 You want a learning rate that minimizes the training time.
Matrix operations
 To make the code easier to write and faster to run, we can calculate the derivatives of all variables with matrix multiplication.
 Let \(A\) be a vector of all coefficients \(a_1 ... a_n\).
 To make computations easier, we can add a new feature \(x_0\) and set it equal to 1.
 Thus, vector \(A\) also includes the intercept \(b\).
 The equation of the line is then \(Y = XA\), where \(X\) is the matrix of all input variables.
 The cost function becomes \(C = \frac{1}{2n} \sum (XA - Y)^2\).
 The derivative of the cost function with respect to the parameters is then \(\frac{\partial C}{\partial A} = \frac{\partial}{\partial A} \left(\frac{1}{2n} (XA - Y)^2\right) = \frac{1}{n} X^\top (XA - Y)\).
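In code, the matrix form reduces each update to one line. A sketch (the column of ones implements the \(x_0 = 1\) trick, so the first entry of \(A\) plays the role of the intercept \(b\)):

```julia
# Vectorized gradient descent: A ← A − (α/n)·Xᵀ(XA − Y)
function matrix_gradient_descent(X, Y; α=0.01, iterations=10_000)
    n = size(X, 1)
    Xb = hcat(ones(n), X)      # prepend x₀ = 1 so A[1] is the intercept
    A = zeros(size(Xb, 2))
    for _ in 1:iterations
        A -= α / n * (Xb' * (Xb * A .- Y))
    end
    return A
end

X = reshape(collect(1.0:10.0), 10, 1)
Y = 3.0 .* X[:, 1] .+ 2.0      # line y = 3x + 2
A = matrix_gradient_descent(X, Y)
```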
Batch vs stochastic GD
 Batch GD is when we update model parameters using all training data.
 When the dataset is very large, using batch GD can be too slow.
 In stochastic GD, we update the coefficients with a random fraction of the training set at each iteration.
 Stochastic GD, as the name implies, does not always move in the direction that minimizes the cost function, but when the cost function is convex, it can still find the minimum\(^{1}\).
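A sketch of stochastic GD with mini-batches; the batch size, iteration count, and names are assumptions, and sampling is with replacement for simplicity:

```julia
using Random

# Stochastic (mini-batch) GD: each update uses a random subset of the rows.
function sgd(X, Y; α=0.01, iterations=20_000, batch=4)
    n = size(X, 1)
    Xb = hcat(ones(n), X)               # x₀ = 1 column for the intercept
    A = zeros(size(Xb, 2))
    for _ in 1:iterations
        idx = rand(1:n, batch)          # random fraction of the training set
        Xi, Yi = Xb[idx, :], Y[idx]
        A -= α / batch * (Xi' * (Xi * A .- Yi))
    end
    return A
end

Random.seed!(2)
X = reshape(collect(1.0:10.0), 10, 1)
Y = 3.0 .* X[:, 1] .+ 2.0               # noise-free, so SGD can converge closely
A = sgd(X, Y)
```

Each step is much cheaper than a full-batch step, at the cost of a noisier path toward the minimum.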
Exercises

Monte Carlo regression. Write an algorithm to find the best-fitting line to the following two lists. This will be a simple algorithm that checks the loss for a random combination of the \(a\) and \(b\) variables. If a new combination is better than the old one, it is selected. Run the algorithm for \(N = 10^6\) steps. Compare it with the built-in functions for finding the best-fitting line (see below). Test your algorithm with the \(L_1\) and \(L_2\) loss functions. (2 points)
More specifically, you first write two functions called `predict` and `cost`. `predict` gives you the predicted \(y\) from `x`, `a`, and `b`. `cost` gives you the cost of the predicted \(y\). After that, you write a function that accepts parameters `x` and `y`. Inside the function, write a for loop that picks random values for `a` and `b` from a given range. At each iteration, it checks whether the cost of the new parameters is less than the cost of the previous parameters. If so, update the parameter values.

```julia
using Random
using VegaLite

x = collect(1:10) .+ randn(10)
y = collect(1:10) .+ randn(10)

@vlplot(
    mark = {:point, color=:black, filled=true},
    x = {x, axis={title="Independent variable"}},
    y = {y, axis={title="Dependent variable"}}
)
```
Ordinary least squares from `ScikitLearn`:

```julia
using ScikitLearn
import ScikitLearn: fit!, predict
@sk_import linear_model: LinearRegression

model = LinearRegression()
x = reshape(x, 10, 1) # needed for the fit! function
fit!(model, x, y)
ypred = predict(model, x)
intercept, coef = model.intercept_, model.coef_[1]

# plotting
using DataFrames
d = DataFrame(:x => vec(x), :y => y, :ypred => ypred); # vec: x was reshaped to a matrix above
p2 = @vlplot(data=d) +
    @vlplot(
        mark = {:point, color=:black, filled=true},
        x = {:x, axis={title="Independent variable"}},
        y = {:y, axis={title="Dependent variable"}}
    ) +
    @vlplot(
        mark = {:line, color=:red},
        x = :x,
        y = :ypred
    )
```
 Univariate linear regression. Write a function to calculate the line of best fit to the following data.
```julia
using RDatasets
trees = dataset("datasets", "trees");
x = trees[!, :Girth];
y = trees[!, :Height];
```
Do so in two different ways: a) using gradient descent (4 points), b) using the analytical solution (1 point).
To run a linear regression with gradient descent, create the following functions:

a) a function called `univariate_predict` that accepts `x`, `a`, and `b` (all of them numbers) and returns a prediction (a number). If `X` is an array, the same function can be used with broadcasting (`univariate_predict.(X, a, b)`).

b) a function called `univariate_gradient` that accepts `X` (array), `Y` (array), `a` (number), and `b` (number) and returns new values for `a` and `b`. To that end, it should compute the value of the derivative of the cost function at `X`.

c) a function called `univariate_cost` that accepts `X`, `Y`, `a`, and `b` and calculates the cost.

d) Finally, build a function called `linear_regression` for gradient descent that accepts `X`, `Y`, learning rate `α`, and `iterations` for the number of iterations, and returns the final `a` and `b`.

Now you can put all these functions in a file and create a module called `ML`. During the course, we will create more machine learning algorithms, and you can add them to your `ML` package. By the end of the course, you will have your own machine learning package. You may download from here an empty `ML` file to fill out the function definitions.

As a best practice for writing reproducible code, add a `Project.toml` file to your directory that manages the versions of the packages you are using.

Try different learning rates. Plot the change of cost vs. time for different learning rates and choose a learning rate accordingly.

Multivariate linear regression. Write a function to find the line of best fit to the following data using gradient descent. The logic of the algorithm is the same as for univariate linear regression, except that this time you will work with matrices. (3 points)
```julia
X = Matrix(trees[!, [:Girth, :Height]]);
y = trees[!, :Volume];
```