An objective function is typically a loss function that is to be minimized.
In specific domains it is alternatively called a reward function, a profit function, a utility function, or a fitness function; these are the negative of the loss function and are therefore to be maximized. In supervised learning, we have a set of training examples \(\mathbf{x} = (x_1,\dots, x_n)^T\), e.g., (customer name, bill, age, \(\dots\)), with labels \(y_1, \dots, y_n\), e.g., (churn/no churn), and we wish to predict \(y_{n+1}\) given \(x_{n+1}\) by defining a model \(f:\mathbf{x} \rightarrow y\), e.g., predicting customer churn in the future. To do so, one forms a hypothesis \(f\) such that \(f(\mathbf{x}_{n+1})\) is a "good" approximation of \(y_{n+1}\). The quality of the approximation is usually quantified with a loss function, which measures the accuracy of the predictions; we are interested in the model that maximizes this accuracy. This is equivalent to minimizing the expected loss, known as the risk. The risk of a model is therefore defined as the expected loss over the joint distribution of \(\mathbf{x}\) and \(y\), i.e., \(p(\mathbf{x},y)\), \begin{equation} \label{eq:risk_function} R(f)=E \left[ \mathcal{L}(y, f(\mathbf{x})) \right], \end{equation} where \(\mathcal{L}\) is the loss function, and the optimal model is the one that minimizes the risk [1], \begin{equation} \label{eq:risk_minization} f^* = \text{argmin}_{f \in \mathcal{F}} \, R(f), \end{equation} where \(\mathcal{F} \subseteq \mathcal{A}^\mathbf{X} \) represents the function space over the given dataset. We want \(R(f)\) to be as small as possible, yielding an \(f^*\) that generalizes best to unseen data. But since the true risk is unknown (the joint distribution \(p(\mathbf{x},y)\) cannot be observed directly), we rely on empirical risk minimization, which underlies the inductive principle of statistical learning, as opposed to the deductive one, in which the true risk function is derived first.
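When the joint distribution \(p(\mathbf{x},y)\) is actually known, the risk \(R(f)=E[\mathcal{L}(y, f(\mathbf{x}))]\) can be approximated by drawing many samples and averaging the loss. Here is a minimal sketch of that idea, assuming a toy distribution \(y = 2x + \varepsilon\) with Gaussian noise and a candidate model `f` (all names and values are illustrative):

```python
import numpy as np

# Sketch: approximating the risk R(f) = E[L(y, f(x))] by sampling from a
# known (toy) joint distribution p(x, y), where y = 2x + noise.
rng = np.random.default_rng(0)

def f(x):
    return 2.0 * x                      # candidate model

def squared_loss(y, y_hat):
    return (y - y_hat) ** 2             # L(y, f(x))

n = 100_000
x = rng.normal(size=n)                               # marginal p(x)
y = 2.0 * x + rng.normal(scale=0.5, size=n)          # conditional p(y | x)

# Monte Carlo estimate of the risk; for this f it approaches the
# irreducible noise variance 0.5^2 = 0.25 as n grows.
risk = squared_loss(y, f(x)).mean()
print(risk)
```

In practice we do not have \(p(\mathbf{x},y)\), which is exactly why the empirical risk below becomes the quantity we minimize instead.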
Note that this distinction also helps separate the modern-day data scientist from the theoretical physicist: the former follows the inductive class of learning methods, the latter the deductive. The empirical risk \(\hat{R}(f)\) is the empirical estimate of the true risk, where the expected loss is measured over the empirical distribution rather than the population distribution from which the data points were sampled: \begin{equation} \label{eq:empirical_risk} \hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(y_i, f(x_i)), \end{equation} where \(x_i\) is the \(i\)th entry of the vector \(\mathbf{x}\). By the law of large numbers, \( \lim_{n\to\infty} \hat{R}(f) = R(f)\). The minimization above therefore takes the form \begin{equation} \label{eq:empirical_risk_minization} f^* = \text{argmin}_{f \in \mathcal{F}} \, \hat{R}(f). \end{equation} When the cardinality of \(\mathbf{x} \) is infinite, \(\mathcal{A}^\mathbf{x}\) becomes an infinite search space, so we restrict ourselves to \(\mathcal{F} \subset \mathcal{A}^\mathbf{x} \). The choice of \(\mathcal{F}\) is hence of major importance: it essentially translates to the class of models we choose to perform prediction with (more on this is covered in the article on decision theory). \(\mathcal{F}\) comprises various model classes, the most common and elementary of which is the class of linear models: \begin{equation} \label{eq:linear_models} \mathcal{F} = \{f: f(\mathbf{x}) = w_0 + \mathbf{w}^T\mathbf{x} \}, \end{equation} where \(w_0\), commonly referred to as the offset, and \(\mathbf{w} = (w_1, w_2, \dots, w_n)^T \), the weights, are the parameters to be estimated. These are called linear models because the model is linear in its parameters. More on linear models in later posts. For a particular class of models, a learning algorithm trains the parameter set \(\Theta \) of the model to best fit the data, i.e., to reduce the prediction loss.
Therefore, a model is best described by its estimated parameters \(\hat{\Theta} \): \begin{equation} \hat{f} (\mathbf{x}) = f(\mathbf{x}; \hat{\Theta}). \end{equation} The complexity of a learning algorithm grows with the choice of model and loss function. The simplest choices, such as linear regression with squared loss, yield analytical solutions, whereas more complex methods often require numerical methods for parameter estimation.
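For the linear-model/squared-loss combination mentioned above, the analytical solution is the least-squares estimate from the normal equations. A minimal sketch, assuming synthetic data generated from a known line (true offset 3 and weight 2 are illustrative choices):

```python
import numpy as np

# Sketch: with a linear model and squared loss, empirical risk minimization
# has the closed-form least-squares solution w_hat = (X^T X)^{-1} X^T y.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=50)
y = 3.0 + 2.0 * x + rng.normal(scale=0.1, size=50)   # true w0 = 3, w1 = 2

X = np.column_stack([np.ones_like(x), x])   # design matrix: offset column + x
w_hat = np.linalg.solve(X.T @ X, X.T @ y)   # solve the normal equations
print(w_hat)                                # estimated (w0, w1), close to (3, 2)
```

More flexible models or non-quadratic losses generally have no such closed form, which is where iterative numerical methods such as gradient descent come in.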