Notes on Regression Analysis

1. Regression Analysis

Regression analysis is a statistical process for estimating the relationships between a dependent variable y and one or more independent variables \mathbf{X}=(x_0,x_1,\dots). That is, it estimates the parameters \boldsymbol{\beta}=(\beta_0,\beta_1,\dots) (also called weights) in

(1)   \begin{equation*} E(y|\mathbf{X})=f(\mathbf{X},\boldsymbol{\beta}) \end{equation*}

based on some training examples \{(\mathbf{X}^{\{i\}},y^{\{i\}}); i=1,\dots,m\}.

Regression analysis is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning.

Regression refers specifically to the estimation of continuous response variables, as opposed to the discrete response variables used in classification.

In linear regression, the model specification is that the dependent variable is a linear combination of the parameters (but need not be linear in the independent variables). For example, y=\beta_0+\beta_1x+\beta_2x^2 is linear in the parameters even though it is quadratic in x.


2. Least Mean Squares Algorithm

There are several methods for the regression problem,
1. Least mean squares.
2. Bayesian methods, e.g. Bayesian linear regression.
3. Percentage regression, for situations where reducing percentage errors is deemed more appropriate.
4. Least absolute deviations, which is more robust in the presence of outliers, leading to quantile regression.
5. Nonparametric regression, which requires a large number of observations and is computationally intensive.
6. Distance metric learning, which searches for a meaningful distance metric in a given input space.

If the error between the real value of y and the estimated E(y|\mathbf{X}) has a Gaussian distribution, the least mean squares solution is also the maximum likelihood estimate.
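To see why, suppose y^{\{i\}}=f(\mathbf{X}^{\{i\}},\boldsymbol{\beta})+\epsilon^{\{i\}} with independent noise \epsilon^{\{i\}}\sim N(0,\sigma^2). The log-likelihood of the training examples is then

\begin{align*} \ell(\boldsymbol{\beta})=\log\prod_{i=1}^{m}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y^{\{i\}}-f(\mathbf{X}^{\{i\}},\boldsymbol{\beta}))^2}{2\sigma^2}\right) =m\log\frac{1}{\sqrt{2\pi\sigma^2}}-\frac{1}{2\sigma^2}\sum_{i=1}^{m}(y^{\{i\}}-f(\mathbf{X}^{\{i\}},\boldsymbol{\beta}))^2 \end{align*}

so maximizing the likelihood over \boldsymbol{\beta} is equivalent to minimizing the sum of squared errors.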

We define the cost function

(2)   \begin{equation*} J(\boldsymbol{\beta}) =\frac{1}{2}\sum_{i=1}^{m} (f(\mathbf{X}^{\{i\}},\boldsymbol{\beta})-y^{\{i\}})^2 \end{equation*}

Our aim is to minimize J(\boldsymbol{\beta}) by a proper choice of \boldsymbol{\beta}.
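As a sketch, the cost function (2) can be computed with NumPy; the generic model function f and the example data here are illustrative placeholders of mine, not part of the notes:

```python
import numpy as np

def cost(X, y, beta, f):
    """Least-squares cost J(beta) = 0.5 * sum of squared residuals."""
    residuals = f(X, beta) - y
    return 0.5 * np.sum(residuals ** 2)

# Illustration with a model that is linear in the parameters: f(X, beta) = X @ beta
linear = lambda X, beta: X @ beta
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # first column models the intercept
y = np.array([1.0, 2.0, 3.0])
print(cost(X, y, np.array([1.0, 1.0]), linear))  # 0.0: this beta fits the data exactly
```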


3. Gradient Descent Algorithm

There are also several ways to find the \boldsymbol{\beta} that minimizes J(\boldsymbol{\beta}),
1. Batch gradient descent.
2. Stochastic gradient descent (or incremental gradient descent).
3. Newton's method (or Newton-Raphson method).
4. Solving \frac{\partial J}{\partial \beta_j}=0, j=0,\dots,n directly.

Batch gradient descent starts with some initial \boldsymbol{\beta}, and repeatedly performs the update,

(3)   \begin{equation*} \beta_j \to \beta_j-\alpha \frac{\partial}{\partial \beta_j}J(\boldsymbol{\beta}) \end{equation*}

(This update is performed simultaneously for all values of j=0,1,\dots,n, where the \beta_j are the parameters, one per feature.)
Here, \alpha is called the learning rate.
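For a model that is linear in the parameters, f(\mathbf{X},\boldsymbol{\beta})=\mathbf{X}\boldsymbol{\beta}, one batch gradient descent update can be sketched as follows (the data and learning rate are illustrative choices of mine):

```python
import numpy as np

def batch_gd_step(X, y, beta, alpha):
    """One batch gradient descent update for f(X, beta) = X @ beta.

    The gradient of J = 0.5 * ||X @ beta - y||^2 is X.T @ (X @ beta - y),
    and all beta_j are updated simultaneously.
    """
    return beta - alpha * (X.T @ (X @ beta - y))

# Recover beta = [1, 2] from exact data y = 1 + 2x
X = np.column_stack([np.ones(3), np.array([0.0, 1.0, 2.0])])
y = np.array([1.0, 3.0, 5.0])
beta = np.zeros(2)
for _ in range(2000):
    beta = batch_gd_step(X, y, beta, alpha=0.1)
print(beta)  # approximately [1, 2]
```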


4. Fit y=\beta_0+\beta_1x+\beta_2x^2+\beta_3e^x+\beta_4e^{2x}


Python modules for regression,
1. StatsModels. A Python module that provides classes and functions for estimating many different statistical models, as well as for conducting statistical tests and statistical data exploration.
2. Scikit-learn. A machine learning library that can be used for classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.
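For instance, because y=\beta_0+\beta_1x+\beta_2x^2+\beta_3e^x+\beta_4e^{2x} is linear in the parameters, scikit-learn's LinearRegression can fit it once the features x, x^2, e^x, e^{2x} are constructed by hand; this sketch uses noise-free data of my own making:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.linspace(0.0, 1.0, 50)
y = 1 + x + x**2 + np.exp(x) + np.exp(2 * x)  # noise-free target for illustration

# The model is linear in beta, so build the feature matrix [x, x^2, e^x, e^(2x)]
# and let LinearRegression estimate beta_1..beta_4; beta_0 is the intercept.
features = np.column_stack([x, x**2, np.exp(x), np.exp(2 * x)])
model = LinearRegression().fit(features, y)
print(model.intercept_, model.coef_)  # approximately 1 and [1, 1, 1, 1]
```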

In this case, we try to find the \boldsymbol{\beta} in y=\beta_0+\beta_1x+\beta_2x^2+\beta_3e^x+\beta_4e^{2x} given a set of training examples. Note that although f is nonlinear in x, the model is still linear in the parameters \boldsymbol{\beta}.

Our program has the following structure
1. Construct a set of (x,y_0) where y_0=1+x+x^2+e^x+e^{2x}.
2. Add Gaussian noise to y_0 to get the training samples (x,y).
3. Use batch gradient descent to find the \boldsymbol{\beta} in y=\beta_0+\beta_1x+\beta_2x^2+\beta_3e^x+\beta_4e^{2x}. The update rules are

(4)   \begin{align*} \beta_0 \to \beta_0+\alpha\sum_{i=1}^{m} (y^{\{i\}}-f(x^{\{i\}},\boldsymbol{\beta}))\\ \beta_1 \to \beta_1+\alpha\sum_{i=1}^{m} (y^{\{i\}}-f(x^{\{i\}},\boldsymbol{\beta}))x^{\{i\}}\\ \beta_2 \to \beta_2+\alpha\sum_{i=1}^{m} (y^{\{i\}}-f(x^{\{i\}},\boldsymbol{\beta}))(x^{\{i\}})^2\\ \beta_3 \to \beta_3+\alpha\sum_{i=1}^{m} (y^{\{i\}}-f(x^{\{i\}},\boldsymbol{\beta}))e^{x^{\{i\}}}\\ \beta_4 \to \beta_4+\alpha\sum_{i=1}^{m} (y^{\{i\}}-f(x^{\{i\}},\boldsymbol{\beta}))e^{2x^{\{i\}}} \end{align*}

4. Plot y_0=1+x+x^2+e^x+e^{2x}, y(x), f(x,\boldsymbol{\beta})=\beta_0+\beta_1x+\beta_2x^2+\beta_3e^x+\beta_4e^{2x}.
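The steps above can be sketched as follows; the x range, sample count, noise level, learning rate, and iteration count are assumptions of mine, and plotting is left out:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Construct (x, y0) with y0 = 1 + x + x^2 + e^x + e^(2x).
m = 50
x = np.linspace(0.0, 1.0, m)
features = np.column_stack([np.ones(m), x, x**2, np.exp(x), np.exp(2 * x)])
y0 = features.sum(axis=1)  # all true beta_j equal 1

# 2. Add Gaussian noise to y0 to get the training samples (x, y).
y = y0 + rng.normal(0.0, 0.1, size=m)

# 3. Batch gradient descent: beta_j += alpha * sum((y - f) * feature_j),
#    which is the update rule (4) in vectorized form.
beta = np.zeros(5)
alpha = 1e-3
for _ in range(20000):
    beta += alpha * features.T @ (y - features @ beta)

# 4. Compare the fit f(x, beta) with y0 (matplotlib can draw
#    y0, the noisy samples y, and the fit on one figure).
fit = features @ beta
print(beta)                      # often differs from the true [1, 1, 1, 1, 1] ...
print(np.max(np.abs(fit - y0)))  # ... while the fitted curve stays close to y0
```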

After 180 iterations, the change in each \beta_j is less than 0.1\%. In most cases, f differs from y_0, yet the two have almost the same line shape, no matter how many training samples we use.
