Surrogate Modelling

Published: May 9, 2025

Scalar linear regression

Let \(y = f(x)\) be the unknown function. Given a set of \(n\) input-output pairs \((x_i, y_i)\), \(i=1,\ldots,n\), we aim to find a linear model that approximates \(f\). Let the linear model be:

\[ y = \beta_0 + \beta_1 x \]

The coefficients \(\beta_0\) and \(\beta_1\) are found by minimizing the sum of squared errors:

\[ \min_{\beta_0, \beta_1} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 \]

The first order optimality conditions for this problem are:

\[ \begin{align*} \frac{\partial \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2}{\partial \beta_0} &= -2 \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i) &= 0 \\ \frac{\partial \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2}{\partial \beta_1} &= -2 \sum_{i=1}^n x_i (y_i - \beta_0 - \beta_1 x_i) &= 0 \end{align*} \]

Here, we note that

\[ \begin{align*} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i) &= \sum_{i=1}^n y_i - n \beta_0 - \beta_1 \sum_{i=1}^n x_i &= 0 \\ \sum_{i=1}^n x_i (y_i - \beta_0 - \beta_1 x_i) &= \sum_{i=1}^n x_i y_i - \beta_0 \sum_{i=1}^n x_i - \beta_1 \sum_{i=1}^n x_i^2 &= 0 \end{align*} \]

Since by definition, \(\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i\) and \(\bar{y} = \frac{1}{n} \sum_{i=1}^n y_i\), the first condition gives \[ \sum_{i=1}^n y_i - n \beta_0 - \beta_1 \sum_{i=1}^n x_i = n \bar{y} - n \beta_0 - \beta_1 n \bar{x} = 0 \quad \Rightarrow \quad \beta_0 = \bar{y} - \beta_1 \bar{x} \] Substituting this expression for \(\beta_0\) into the second condition gives \[ \sum_{i=1}^n x_i y_i - (\bar{y} - \beta_1 \bar{x})\, n \bar{x} - \beta_1 \sum_{i=1}^n x_i^2 = 0 \quad \Rightarrow \quad \beta_1 = \frac{\sum_{i=1}^n x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^n x_i^2 - n \bar{x}^2} \]

Using the identities \(\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^n x_i y_i - n \bar{x} \bar{y}\) and \(\sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{i=1}^n x_i^2 - n \bar{x}^2\), the solution can be written as:

\[ \begin{align*} \beta_1 &= \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} \\ \beta_0 &= \bar{y} - \beta_1 \bar{x} \end{align*} \]

Here, \(\bar{x}\) and \(\bar{y}\) are the sample means of the inputs and outputs respectively. The inputs are known as the independent variable and the outputs as the dependent variable.
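
To make the closed-form expressions concrete, here is a minimal NumPy sketch; the underlying line \(y = 2 + 3x\), the noise level, and the sample size are made up purely for illustration.

```python
import numpy as np

# Illustrative data: noisy samples of an assumed underlying line y = 2 + 3x
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.1, size=x.size)

# Sample means
x_bar, y_bar = x.mean(), y.mean()

# Closed-form least-squares coefficients
beta_1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta_0 = y_bar - beta_1 * x_bar

print(beta_0, beta_1)  # should be close to 2 and 3
```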

In the next section, we will see how to extend this to the multivariate case.

Multivariate linear regression

Let \(y = f(\mathbf{x}),\ \mathbf{x} \in \Re^p\) be the unknown function. Given a set of \(n\) input-output pairs \((\mathbf{x}_i, y_i)\), \(i=1,\ldots,n\), we want to find a linear model that approximates \(f\):

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p \]

The coefficients \(\beta_0, \beta_1, \ldots, \beta_p\) are found by minimizing the sum of squared errors:

\[ \min_{\beta_0, \beta_1, \ldots, \beta_p} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \ldots - \beta_p x_{ip})^2 \]

We can obtain an explicit closed-form solution to this problem.

Let \(X\) be the \(n \times (p+1)\) matrix of inputs:

\[ X = \begin{bmatrix} 1 & x_{11} & x_{12} & \ldots & x_{1p} \\ 1 & x_{21} & x_{22} & \ldots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \ldots & x_{np} \end{bmatrix} \]

where \(x_{ij}\) is the \(i^{th}\) sample of the \(j^{th}\) input.

Let \(\mathbf{y}\) be the \(n \times 1\) vector of outputs:

\[ \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \]

The sum of squared errors can be written as:

\[ \begin{align*} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \ldots - \beta_p x_{ip})^2 &= (\mathbf{y} - X \mathbf{\beta})^T (\mathbf{y} - X \mathbf{\beta}) \end{align*} \]

where \(\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)^T\).

Then the optimisation problem can be stated as:

\[ \begin{align*} \min_{\beta} (\mathbf{y} - X \mathbf{\beta})^T (\mathbf{y} - X \mathbf{\beta}) \end{align*} \]

The first order optimality conditions for this problem are:

\[ \begin{align*} \frac{\partial (\mathbf{y} - X \mathbf{\beta})^T (\mathbf{y} - X \mathbf{\beta})}{\partial \mathbf{\beta}} &= -2 X^T (\mathbf{y} - X\mathbf{\beta}) &= 0 \end{align*} \]

Solving for \(\beta\) we get:

\[ \begin{align*} X^T X \mathbf{\beta} &= X^T \mathbf{y} \\ \mathbf{\beta} &= (X^T X)^{-1} X^T \mathbf{y} \end{align*} \]

where \(X\) is the \(n \times (p+1)\) matrix of inputs and \(\mathbf{y}\) is the \(n \times 1\) vector of outputs.

The matrix \(X^T X\) is known as the Gram matrix and is a \((p+1) \times (p+1)\) matrix. The vector \(X^T \mathbf{y}\) is a \((p+1) \times 1\) vector. The matrix \((X^T X)^{-1} X^T\) is known as the Moore-Penrose pseudo-inverse of \(X\) and is a \((p+1) \times n\) matrix.
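
As a sketch of how this solution can be evaluated numerically (the data below are synthetic and the true coefficients are invented for illustration), we assemble \(X\) with a leading column of ones and solve the normal equations; in practice `numpy.linalg.lstsq` is usually preferred over forming \((X^T X)^{-1}\) explicitly.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3

# Synthetic inputs and outputs from an assumed linear model with noise
X_raw = rng.uniform(-1.0, 1.0, size=(n, p))
true_beta = np.array([1.0, -2.0, 0.5, 3.0])      # beta_0, beta_1, ..., beta_p
X = np.column_stack([np.ones(n), X_raw])         # n x (p+1) design matrix
y = X @ true_beta + rng.normal(0.0, 0.05, size=n)

# Solve the normal equations X^T X beta = X^T y
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically more stable alternative
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal)
print(beta_lstsq)   # both should be close to true_beta
```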

In the next section we will extend this to a general set of basis functions.

Multivariate linear regression with basis functions

Let \(y = f(\mathbf{x}),\ \mathbf{x} \in \Re^p\) be the unknown function. Given a set of \(n\) input-output pairs \((\mathbf{x}_i, y_i)\), \(i=1,\ldots,n\), we want to find a linear model that approximates \(f\):

\[ y = \beta_0 + \beta_1 \phi_1(\mathbf{x}) + \beta_2 \phi_2(\mathbf{x}) + \ldots + \beta_m \phi_m(\mathbf{x}) \]

The coefficients \(\beta_0, \beta_1, \ldots, \beta_m\) are found by minimizing the sum of squared errors:

\[ \min_{\beta_0, \beta_1, \ldots, \beta_m} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 \phi_1(\mathbf{x}_i) - \beta_2 \phi_2(\mathbf{x}_i) - \ldots - \beta_m \phi_m(\mathbf{x}_i))^2 \]

We can obtain an explicit closed-form solution to this problem.

Let \(\Phi\) be the \(n \times (m+1)\) matrix of basis functions, where we define \(\phi_0(\mathbf{x}) \equiv 1\) so that the first column accounts for the intercept \(\beta_0\):

\[ \Phi = \begin{bmatrix} 1 & \phi_1(\mathbf{x}_1) & \phi_2(\mathbf{x}_1) & \ldots & \phi_m(\mathbf{x}_1) \\ 1 & \phi_1(\mathbf{x}_2) & \phi_2(\mathbf{x}_2) & \ldots & \phi_m(\mathbf{x}_2) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \phi_1(\mathbf{x}_n) & \phi_2(\mathbf{x}_n) & \ldots & \phi_m(\mathbf{x}_n) \end{bmatrix} \]

The sum of squared errors can be written as:

\[ \begin{align*} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 \phi_1(\mathbf{x}_i) - \beta_2 \phi_2(\mathbf{x}_i) - \ldots - \beta_m \phi_m(\mathbf{x}_i))^2 &= (\mathbf{y} - \Phi \mathbf{\beta})^T (\mathbf{y} - \Phi \mathbf{\beta}) \end{align*} \]

Then the optimisation problem can be stated as:

\[ \begin{align*} \min_{\beta} (\mathbf{y} - \Phi \mathbf{\beta})^T (\mathbf{y} - \Phi \mathbf{\beta}) \end{align*} \]

The first order optimality conditions for this problem are:

\[ \begin{align*} \frac{\partial (\mathbf{y} - \Phi \mathbf{\beta})^T (\mathbf{y} - \Phi \mathbf{\beta})}{\partial \mathbf{\beta}} &= -2 \Phi^T (\mathbf{y} - \Phi\mathbf{\beta}) &= 0 \end{align*} \]

Solving for \(\beta\) we get:

\[ \begin{align*} \Phi^T \Phi \mathbf{\beta} &= \Phi^T \mathbf{y} \\ \mathbf{\beta} &= (\Phi^T \Phi)^{-1} \Phi^T \mathbf{y} \end{align*} \]

where \(\Phi\) is the \(n \times (m+1)\) matrix of basis functions and \(\mathbf{y}\) is the \(n \times 1\) vector of outputs.

Important

Note here that this is still a linear regression model in the parameters \(\beta\). The basis functions \(\phi_1, \phi_2, \ldots, \phi_m\) are non-linear in the inputs \(\mathbf{x}\), but the model is linear in the parameters \(\beta\). For example, if \(\phi_1(\mathbf{x}) = x_1^2\), then the model is linear in \(\beta\) but non-linear in \(x_1\).
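
As a small illustration of this point, the sketch below (with a made-up one-dimensional target function) uses polynomial basis functions \(\phi_j(x) = x^j\); the basis is non-linear in \(x\), yet the coefficients are still found by ordinary linear least squares.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1.0, 1.0, 30)
y = np.sin(np.pi * x) + rng.normal(0.0, 0.05, size=x.size)  # illustrative target

# Basis matrix with phi_0(x) = 1 and phi_j(x) = x**j for j = 1..m
m = 5
Phi = np.column_stack([x**j for j in range(m + 1)])          # n x (m+1)

# Linear least squares in the coefficients beta
beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Evaluate the surrogate at new points
x_new = np.linspace(-1.0, 1.0, 200)
Phi_new = np.column_stack([x_new**j for j in range(m + 1)])
y_hat = Phi_new @ beta
```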

Statistical interpretation of linear regression

The linear regression model can be interpreted in a statistical framework. Let us assume that we are conducting an experiment to measure the output \(y\) as a function of the inputs \(\mathbf{x}\). Each measurement is assumed to be noisy and can be written as:

\[ y_i = f(\mathbf{x}_i) + \epsilon_i \]

where \(f(\mathbf{x}_i)\) is the true function value and \(\epsilon_i\) is the error term. These measurements have the following properties:

  1. Each measurement has the same level of fidelity. This means that the error in each measurement has the same probability distribution.

  2. Each measurement is independent of the others. This means that the outcome of one measurement does not affect the outcome of another measurement. They are not only uncorrelated, but also independent. This gives rise to the notion of i.i.d. (independent and identically distributed) samples.

i.i.d

Independent and identically distributed samples are a set of random variables that are independent of each other and have the same probability distribution.

  3. The noise in each measurement is due to contributions from a large number of sources of error. By the central limit theorem, the error term can then be taken to be normally distributed. Hence we assume that the error term is normally distributed with mean 0 and variance \(\sigma^2\). This is written as:

\[ \epsilon_i \sim N(0, \sigma^2) \]

With these assumptions, we can write the linear regression model as:

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p + \epsilon \]

where \(\epsilon\) is the error term.

To capture the true function, we need to minimise the errors in the model. The errors are given by:

\[ \epsilon_i = y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \ldots - \beta_p x_{ip} \]

The sum of squared errors is given by:

\[ \sum_{i=1}^n \epsilon_i^2 = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \ldots - \beta_p x_{ip})^2 \]

which can be written in vector notation as:

\[ \begin{align*} \sum_{i=1}^n \epsilon_i^2 &= \epsilon^T \epsilon\\ &= (\mathbf{y} - X \mathbf{\beta})^T (\mathbf{y} - X \mathbf{\beta})\\ &= \mathbf{y}^T \mathbf{y} - \mathbf{y}^T X \mathbf{\beta} - \mathbf{\beta}^T X^T \mathbf{y} + \mathbf{\beta}^T X^T X \mathbf{\beta}\\ &= \mathbf{y}^T \mathbf{y} - 2 \mathbf{\beta}^T X^T \mathbf{y} + \mathbf{\beta}^T X^T X \mathbf{\beta} \end{align*} \]

Note here that the term \(\mathbf{y}^T X \mathbf{\beta}\) is a scalar and is equal to its transpose.

\[ \mathbf{y}^T X \mathbf{\beta} = ((\mathbf{y}^T X) \mathbf{\beta})^T = \mathbf{\beta}^T (\mathbf{y}^T X)^T = \mathbf{\beta}^T X^T \mathbf{y}\]

We aim to find \(\beta\) that minimises the error. The optimisation problem can be stated as:

\[ \min_{\beta} \mathbf{y}^T \mathbf{y} - 2 \mathbf{\beta}^T X^T \mathbf{y} + \mathbf{\beta}^T X^T X \mathbf{\beta} \]

Since the first term is independent of \(\beta\), we can ignore it and the updated optimisation problem is:

\[ \min_{\beta} - 2 \mathbf{\beta}^T X^T \mathbf{y} + \mathbf{\beta}^T X^T X \mathbf{\beta} \]

The first order optimality conditions for this problem are:

\[ \begin{align*} \frac{\partial (- 2 \mathbf{\beta}^T X^T \mathbf{y} + \mathbf{\beta}^T X^T X \mathbf{\beta})}{\partial \mathbf{\beta}} &= -2 X^T \mathbf{y} + 2 X^T X \mathbf{\beta} &= 0 \end{align*} \]

which recovers the normal equations \(X^T X \mathbf{\beta} = X^T \mathbf{y}\) and hence the same solution \(\mathbf{\beta} = (X^T X)^{-1} X^T \mathbf{y}\) as before.
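
A quick numerical check of this condition, using synthetic data chosen purely for illustration: at \(\hat{\mathbf{\beta}} = (X^T X)^{-1} X^T \mathbf{y}\) the gradient \(-2 X^T \mathbf{y} + 2 X^T X \mathbf{\beta}\) should vanish up to floating-point round-off.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.uniform(-1.0, 1.0, size=(n, p))])
y = X @ np.array([0.5, 1.5, -1.0]) + rng.normal(0.0, 0.1, size=n)

# Least-squares solution via the normal equations
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient of the objective at beta_hat: -2 X^T y + 2 X^T X beta
grad = -2 * X.T @ y + 2 * X.T @ X @ beta_hat
print(np.max(np.abs(grad)))   # ~0 up to floating-point round-off
```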

Probability Density Function (PDF)

A probability density function describes the relative likelihood of a continuous random variable taking a given value. The density is a non-negative function, and the total area under the curve is equal to 1.
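
As a small numerical sanity check of these two properties for the Gaussian density assumed for the noise (the value of \(\sigma\) below is arbitrary):

```python
import numpy as np

# Gaussian density with mean 0 and variance sigma^2, as assumed for the noise term
sigma = 0.5
t = np.linspace(-6 * sigma, 6 * sigma, 10001)
pdf = np.exp(-t**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# The density is non-negative and its area is (numerically) 1
area = np.sum(pdf) * (t[1] - t[0])
print(pdf.min() >= 0.0, area)   # True, ~1.0
```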

Let us assume that we are given a set of \(n\) input-output pairs \((x_i, y_i)\), \(i=1,\ldots,n\). We know that the output is contaminated with noise and can be written as:

\[ y_i = f(x_i) + \epsilon_i \]

where \(f(x_i)\) is the true function value and \(\epsilon_i\) is the error term. As in the previous section, we assume the measurements are of equal fidelity and i.i.d., and that the noise is normally distributed with mean 0 and variance \(\sigma^2\):

\[ \epsilon_i \sim N(0, \sigma^2) \]

We further assume that the true function \(f(x)\) can be approximated by

\[ \hat{f}(x) = \sum_{j=1}^m \beta_j \phi_j(x) \]

where \(\phi_j(x)\) are known as the basis functions. The basis functions are non-linear in the inputs \(x\), but the model is linear in the parameters \(\beta\).

Then the \(i^{th}\) measurement error can be written as:

\[ \epsilon_i = y_i - \hat{f}(x_i) = y_i - \sum_{j=1}^m \beta_j \phi_j(x_i) \]

which in vector notation can be written as:

\[ \epsilon = \mathbf{y} - \Phi \mathbf{\beta} \]

where \(\Phi\) is the \(n \times m\) matrix of basis functions and \(y\) is the \(n \times 1\) vector of outputs.

Since the noise has zero mean, there is no bias in the measurements, and the best choice of \(\mathbf{\beta}\) is the one that drives the residual \(\epsilon = \mathbf{y} - \Phi \mathbf{\beta}\) towards zero in the least-squares sense. Setting the residual to zero gives an overdetermined system; premultiplying by \(\Phi^T\) yields the normal equations and hence the least-squares estimate:

\[ \begin{align*} \mathbf{y} - \Phi \mathbf{\beta} &= 0 \\ \mathbf{y} &= \Phi \mathbf{\beta} \\ \Phi^T \mathbf{y} &= \Phi^T \Phi \mathbf{\beta} \\ \mathbf{\beta} &= (\Phi^T \Phi)^{-1} \Phi^T \mathbf{y} \end{align*} \]
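
To illustrate the zero-mean-noise argument, the sketch below uses an invented polynomial basis and true coefficient vector, repeats the "experiment" many times with fresh Gaussian noise, and shows that the least-squares estimates scatter around the true coefficients, consistent with the noise introducing no bias.

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0.0, 1.0, 25)

# Assumed basis phi_j(x) = x**(j-1), j = 1..m, and an invented true beta
m = 3
Phi = np.column_stack([x**(j - 1) for j in range(1, m + 1)])
beta_true = np.array([1.0, -0.5, 2.0])
sigma = 0.1

estimates = []
for _ in range(2000):
    y = Phi @ beta_true + rng.normal(0.0, sigma, size=x.size)   # noisy measurements
    beta_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
    estimates.append(beta_hat)

print(np.mean(estimates, axis=0))   # close to beta_true: no systematic bias
print(beta_true)
```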

Fully sequential space-filling designs - Sobol sequence
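
A minimal sketch of generating a Sobol design, using SciPy's `scipy.stats.qmc` module; this is only one possible implementation and is not necessarily what the FSSF package referenced below uses.

```python
from scipy.stats import qmc

# Scrambled Sobol sequence in [0, 1]^2; 2**6 = 64 points
sampler = qmc.Sobol(d=2, scramble=True, seed=0)
sample = sampler.random_base2(m=6)

# A quick quality measure: lower discrepancy means better space filling
print(qmc.discrepancy(sample))

# Rescale to an arbitrary design domain, e.g. x1 in [0, 10], x2 in [-1, 1]
design = qmc.scale(sample, l_bounds=[0.0, -1.0], u_bounds=[10.0, 1.0])
```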

Morris-Mitchell space-filling design

https://bookdown.org/rbg/surrogates/chap4.html#ref-morris1995exploratory

https://cran.r-project.org/web/packages/FSSF/FSSF.pdf

Shang, B. and Apley, D.W. (2019), “Large-Scale Fully-Sequential Space-filling Algorithms for Computer Experiments”, Journal of Quality Technology (in press). doi:10.1080/00224065.2019.1705207