What is the best least squares polynomial fit?

statistics
machine-learning
polynomial-fitting
Author

Devendra Ghate

Published

December 24, 2024

Suppose you are given some data and have no idea about the underlying model. You want to fit a polynomial to the data. The question is: what is the best polynomial to fit?

Consider the following data points:

Code
# This script generates data from x^2 + 2x + 1 + noise(0, 0.5) and fits
# both a 2nd-order and a 5th-order polynomial to it. It plots the data and
# the fitted polynomials, then computes the mean squared error of each fit
# against the data and against the true underlying function (x^2 + 2x + 1).

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
x = np.linspace(-1, 1, 10)

# Generate data
y = x**2 + 2*x + 1 + np.random.normal(0, 0.5, 10)

# Fit a 2nd order polynomial
p2 = np.poly1d(np.polyfit(x, y, 2))

# Fit a 5th order polynomial
p5 = np.poly1d(np.polyfit(x, y, 5))

# Plot the data and the fitted polynomials
plt.scatter(x, y)
plt.plot(x, p2(x), label='2nd order polynomial')
plt.plot(x, p5(x), label='5th order polynomial')
plt.legend()
plt.show()

# Calculate the mean squared error between the fitted polynomial and the data
mse_p2 = np.mean((p2(x) - y)**2)
mse_p5 = np.mean((p5(x) - y)**2)

# Calculate the mean squared error between the fitted polynomial and the actual underlying function (x^2 + 2x + 1)
actual_y = x**2 + 2*x + 1
mse_p2_actual = np.mean((p2(x) - actual_y)**2)
mse_p5_actual = np.mean((p5(x) - actual_y)**2)

print('Mean squared error between the 2nd order polynomial and the data:', mse_p2)
print('Mean squared error between the 5th order polynomial and the data:', mse_p5)
print('Mean squared error between the 2nd order polynomial and the actual underlying function:', mse_p2_actual)
print('Mean squared error between the 5th order polynomial and the actual underlying function:', mse_p5_actual)

Mean squared error between the 2nd order polynomial and the data: 0.17408952512398074
Mean squared error between the 5th order polynomial and the data: 0.10924697595195239
Mean squared error between the 2nd order polynomial and the actual underlying function: 0.1958905751342512
Mean squared error between the 5th order polynomial and the actual underlying function: 0.2607331243062771

Note that although the 5th-order polynomial has a lower error against the data than the 2nd-order polynomial, its error against the true underlying function is higher. The 5th-order polynomial is overfitting: it captures the noise in the data rather than the underlying trend. The 2nd-order polynomial captures the trend better, which yields a lower error against the true function.
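To make this generalization gap concrete, we can evaluate both fits on a dense grid of unseen points in [-1, 1] and compare them against the true function. This is a minimal sketch reusing the data above; the grid size of 200 is an arbitrary choice:

```python
import numpy as np

np.random.seed(0)
x = np.linspace(-1, 1, 10)
y = x**2 + 2*x + 1 + np.random.normal(0, 0.5, 10)

p2 = np.poly1d(np.polyfit(x, y, 2))
p5 = np.poly1d(np.polyfit(x, y, 5))

# Evaluate on a dense grid of points not seen during fitting
x_test = np.linspace(-1, 1, 200)
true_y = x_test**2 + 2*x_test + 1

gen_mse_p2 = np.mean((p2(x_test) - true_y)**2)
gen_mse_p5 = np.mean((p5(x_test) - true_y)**2)

print('Generalization MSE, 2nd order:', gen_mse_p2)
print('Generalization MSE, 5th order:', gen_mse_p5)
```

Evaluating between the training points exposes how much each fit wiggles away from the true curve, which the training-point errors alone cannot show.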

A similar thing happens in neural networks. An overly complex model may overfit the training data and perform poorly on unseen data. This is why it is important to match the complexity of the model to the data and the underlying trend.

To prevent overfitting and improve the generalization of the model, we can use techniques like cross-validation, regularization, early stopping, and dropout.
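As a sketch of the first of these, here is leave-one-out cross-validation used to pick the polynomial degree for the data above. The candidate degree range 1-5 is an arbitrary choice for illustration:

```python
import numpy as np

np.random.seed(0)
x = np.linspace(-1, 1, 10)
y = x**2 + 2*x + 1 + np.random.normal(0, 0.5, 10)

degrees = list(range(1, 6))
cv_errors = []
for d in degrees:
    errs = []
    for i in range(len(x)):
        # Hold out point i, fit on the remaining points,
        # and measure the error on the held-out point
        mask = np.arange(len(x)) != i
        p = np.poly1d(np.polyfit(x[mask], y[mask], d))
        errs.append((p(x[i]) - y[i])**2)
    cv_errors.append(np.mean(errs))

best_degree = degrees[int(np.argmin(cv_errors))]
print('CV error by degree:', dict(zip(degrees, cv_errors)))
print('Best degree:', best_degree)
```

Because each point is scored by a model that never saw it, the cross-validation error penalizes degrees that chase the noise, unlike the training MSE, which only decreases as the degree grows.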

More on this in the later posts.