Assume that you are given some data and you have no idea about the underlying model. You want to fit a polynomial to the data. The question is: what is the best polynomial to fit the data?
Consider the following data points:
Code
# This function generates data for x^2 + 2x + 1 + noise(0,0.5) and fits a polynomial to the data. It fits a 2nd order and 5rd order polynomial to the data. Then it plots the data and the fitted polynomials. Also it calculates the mean squared error between the fitted polynomial and the data. It also calculates the mean squared error between the fitted polynomial and the actual underlying function (x^2 + 2x + 1).import numpy as npimport matplotlib.pyplot as pltnp.random.seed(0)x = np.linspace(-1, 1, 10)# Generate datay = x**2+2*x +1+ np.random.normal(0, 0.5, 10)# Fit a 2nd order polynomialp2 = np.poly1d(np.polyfit(x, y, 2))# Fit a 5th order polynomialp5 = np.poly1d(np.polyfit(x, y, 5))# Plot the data and the fitted polynomialsplt.scatter(x, y)plt.plot(x, p2(x), label='2nd order polynomial')plt.plot(x, p5(x), label='5th order polynomial')plt.legend()plt.show()# Calculate the mean squared error between the fitted polynomial and the datamse_p2 = np.mean((p2(x) - y)**2)mse_p5 = np.mean((p5(x) - y)**2)# Calculate the mean squared error between the fitted polynomial and the actual underlying function (x^2 + 2x + 1)actual_y = x**2+2*x +1mse_p2_actual = np.mean((p2(x) - actual_y)**2)mse_p5_actual = np.mean((p5(x) - actual_y)**2)print('Mean squared error between the 2nd order polynomial and the data:', mse_p2)print('Mean squared error between the 5th order polynomial and the data:', mse_p5)print('Mean squared error between the 2nd order polynomial and the actual underlying function:', mse_p2_actual)print('Mean squared error between the 5th order polynomial and the actual underlying function:', mse_p5_actual)
Mean squared error between the 2nd order polynomial and the data: 0.17408952512398074
Mean squared error between the 5th order polynomial and the data: 0.10924697595195239
Mean squared error between the 2nd order polynomial and the actual underlying function: 0.1958905751342512
Mean squared error between the 5th order polynomial and the actual underlying function: 0.2607331243062771
Here we note that even though the error between the 5th order polynomial and the data is lower than the error between the 2nd order polynomial and the data, the error between the 5th order polynomial and the actual underlying function is higher than the error between the 2nd order polynomial and the actual underlying function. This is because the 5th order polynomial is overfitting the data, capturing the noise in the data rather than the underlying trend. The 2nd order polynomial, on the other hand, captures the underlying trend better, resulting in a lower error between the fitted polynomial and the actual underlying function.
Similar thing happens in neural networks as well. If we have a very complex model, it may overfit the data and perform poorly on unseen data. This is why it is important to choose the right complexity of the model based on the data and the underlying trend.
In order to overcome, we can use techniques like cross-validation, regularization, early stopping, dropout, etc. to prevent overfitting and improve the generalization of the model.