Logistic Regression: A Probability Theoretic Approach

3/9/2025

Introduction

One of my favorite ways to study machine learning is through the lens of probability theory - the subset of math concerned with modeling randomness and chance. In this spirit, this post will describe how logistic regression, a basic binary classification algorithm, can be modeled using concepts from elementary probability theory.

We will start with a basic description of probability-theoretic supervised ML, describe logistic regression within this framework, and conclude with deriving the optimization problem used to obtain optimal parameters.

Probability Theoretic Supervised Machine Learning

Before specifically addressing logistic regression, we'll first zoom out and study general supervised ML strategies through a probability-theoretic lens. Let:

  - $\vec{X}$ be a random vector representing the features of an observation, taking values $\vec{x} \in \mathbb{R}^d$, and
  - $Y$ be a random variable representing that observation's label.

Under this framework, inference involves predicting the distribution of labels for a specific feature vector. Mathematically, this means that given that we observe $\vec{X} = \vec{x}$, we predict the conditional label distribution:

$$\mathbb{P}(Y \mid \vec{X} = \vec{x})$$

Notice that the use of conditional probability here encodes our current knowledge. We know what the feature vector is, so we condition on it. The goal of inference is to see how our knowledge of the feature vector $\vec{x}$ changes the distribution of labels.

If we want to predict only a single label, we can select the label that is most likely to occur in the conditional label distribution.

The conditional label distribution $\mathbb{P}(Y \mid \vec{X} = \vec{x})$ can be modeled by some arbitrary distribution $M$ with parameters $\theta$, where:

$$\mathbb{P}(Y \mid \vec{X} = \vec{x}) \approx M(\vec{x}; \theta)$$

We call $M$ our model. To train the model, we must find the values of $\theta$ that produce the most reasonable conditional label distributions.

Enough with the general definitions for now though - let's dive into the details of how this applies to logistic regression.

Defining the Logistic Regression Model

In order to discuss logistic regression, let's start by laying out some assumptions:

  - We are solving a binary classification problem: every label is either positive ($y = 1$) or negative ($y = 0$), so $Y$ takes values in $\{0, 1\}$.
  - Every feature vector is a real-valued vector $\vec{x} \in \mathbb{R}^d$.

Mathematical Definition

Logistic regression defines the conditional label distribution as:

$$Y \mid \vec{X} = \vec{x} \;\sim\; \text{Bernoulli}\big(\sigma(\vec{\theta} \cdot \vec{x} + b)\big)$$

where:

  - $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid (logistic) function,
  - $\vec{\theta} \in \mathbb{R}^d$ is a vector of weights, and
  - $b \in \mathbb{R}$ is an intercept (bias) term.

Notice what's actually happening here - we're saying that if we observe feature vector $\vec{x} \in \mathbb{R}^d$ randomly, then:

$$\mathbb{P}(Y = 1 \mid \vec{X} = \vec{x}) = \sigma(\vec{\theta} \cdot \vec{x} + b), \qquad \mathbb{P}(Y = 0 \mid \vec{X} = \vec{x}) = 1 - \sigma(\vec{\theta} \cdot \vec{x} + b)$$

This precisely defines the conditional label distribution in terms of $\vec{x}$ and the model parameters $\vec{\theta}, b$ as desired!
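To make this concrete, here is a minimal NumPy sketch of the distribution defined above; the weight vector, bias, and feature values are made-up numbers for illustration, not anything learned from data:

```python
import numpy as np

def sigmoid(z):
    # The logistic function, sigma(z) = 1 / (1 + e^{-z}).
    return 1.0 / (1.0 + np.exp(-z))

def conditional_label_distribution(x, theta, b):
    """Return (P(Y=0 | X=x), P(Y=1 | X=x)) under a logistic regression model."""
    p_positive = sigmoid(np.dot(theta, x) + b)
    return 1.0 - p_positive, p_positive

# Illustrative parameters and feature vector (d = 3).
theta = np.array([0.5, -1.2, 0.3])
b = 0.1
x = np.array([2.0, 0.5, -1.0])

p0, p1 = conditional_label_distribution(x, theta, b)
print(f"P(Y=0 | x) = {p0:.3f}, P(Y=1 | x) = {p1:.3f}")
```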

Logits

As our label distribution is binary, we really only need to concern ourselves with how logistic regression predicts the probability of the positive label - the probability of the negative label follows via the complement.

Logistic regression predicts the probability a given feature vector has a positive label in two steps:

  1. it computes the logit for the feature vector $\vec{x}$ using a linear combination + an intercept: $\text{logit}(\vec{x}) = \vec{\theta} \cdot \vec{x} + b$
  2. it squashes this logit into the $(0, 1)$ range using the sigmoid (or logistic) function, $\sigma(z) = \frac{1}{1 + e^{-z}}$
[Figure: graph of the sigmoid function over the interval [-10, 10]. Image by author.]

Intuitively, it makes sense to think of logits as raw, unnormalized probabilities. They encode all the relevant information needed to obtain the probability, just not in the right "format" until passed through the sigmoid. Formally, the logit is the log-odds of the positive label, $\text{logit}(\vec{x}) = \ln\frac{p}{1 - p}$ where $p = \mathbb{P}(Y = 1 \mid \vec{X} = \vec{x})$, which is exactly the inverse of the sigmoid.

Assuming each label is equally likely to be drawn, then $p = 0.5$ and the logit is $\ln\frac{0.5}{0.5} = 0$. Positive logits therefore indicate that the positive label is more likely, and negative logits indicate that the negative label is more likely.
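Here is a quick sketch of this relationship between logits and probabilities; the probed values are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    # Inverse of the sigmoid: the log-odds of probability p.
    return np.log(p / (1.0 - p))

# A logit of 0 corresponds to both labels being equally likely...
print(sigmoid(0.0))   # 0.5
print(logit(0.5))     # 0.0

# ...while more positive logits push the probability toward 1,
# and more negative logits push it toward 0.
for z in (-5.0, -1.0, 1.0, 5.0):
    print(f"sigmoid({z:+.1f}) = {sigmoid(z):.3f}")
```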

One final note - it's critical to understand that because logits are computed via a linear combination of features plus an intercept, a logistic regressor is a linear model. There are both benefits and downsides to this:

  - Benefits: the model is simple, fast to train, and easy to interpret - each weight in $\vec{\theta}$ describes how the corresponding feature shifts the logit.
  - Downsides: the decision boundary is a hyperplane in feature space, so the model cannot capture non-linear relationships between the features and the label unless we engineer non-linear features ourselves.

Training Logistic Regression Models

To train a logistic regression model, we have to learn the values of $\vec{\theta}$ and $b$ that allow us to predict the most accurate conditional label distributions. As usual with supervised machine learning, we collect a training dataset to determine how close our model is to predicting correct labels.

Our dataset consists of $n$ independent observations, where the $i$th observation consists of two values:

  - a feature vector $\vec{x}^{(i)} \in \mathbb{R}^d$, and
  - the label $y^{(i)} \in \{0, 1\}$ observed for it.

We can more concisely represent this dataset as $S_n = \{(\vec{x}^{(i)}, y^{(i)})\}_{i=1}^n$.

Maximum Likelihood Estimation

To train a model, we need to be able to determine how "good" certain parameter choices are. One way to do so is by calculating how likely it is that the labels in the dataset $S_n$ were sampled from the conditional label distributions given by $\vec{\theta}, b$.

The key to this logic is that $S_n$ is our source of truth, and training involves finding parameters that make it as likely as possible to observe that ground truth when randomly sampled.

This intuition is formalized as the likelihood (technically the conditional likelihood) of the model parameters $\vec{\theta}, b$, defined as the function $\mathcal{L}(\vec{\theta}, b; S_n)$.

Usually, the likelihood $\mathcal{L}$ captures the probability of the observed data under the joint distribution $\mathbb{P}(\vec{X}, Y)$ rather than under the conditional distribution we seek to model. This is why $\mathcal{L}$ as defined here is technically the conditional likelihood.

$\mathcal{L}$ describes how likely it was that the labels in $S_n$ were sampled using the parameters $\vec{\theta}, b$. As a higher likelihood corresponds to better parameters, we can choose the optimal values for $\vec{\theta}, b$ by maximizing $\mathcal{L}(\vec{\theta}, b; S_n)$ - a process called maximum likelihood estimation (MLE).
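As a sanity check on this definition, here is a small NumPy sketch that evaluates the conditional likelihood of a toy dataset by multiplying together the predicted probability of each observed label; both the data and the parameter choices are invented for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conditional_likelihood(X, y, theta, b):
    """L(theta, b; S_n): probability of observing the labels y given the features X."""
    p_positive = sigmoid(X @ theta + b)                    # P(Y=1 | x^(i)) for each observation
    per_example = np.where(y == 1, p_positive, 1.0 - p_positive)
    return np.prod(per_example)                            # independence lets us multiply

# Toy dataset with n = 4 observations and d = 2 features.
X = np.array([[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3], [2.0, 1.0]])
y = np.array([1, 0, 0, 1])

print(conditional_likelihood(X, y, theta=np.array([0.8, 0.5]), b=-0.2))
print(conditional_likelihood(X, y, theta=np.array([-0.8, -0.5]), b=0.2))  # flipped parameters fit worse, lower likelihood
```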

Deriving a Neater Optimization Problem

We've now defined the core optimization problem for logistic regression: find $\vec{\theta}, b$ that maximize $\mathcal{L}(\vec{\theta}, b; S_n)$. In other words, solve

$$\argmax_{\vec{\theta}, b} \; \mathcal{L}(\vec{\theta}, b; S_n)$$

We can simplify this maximization problem further until it starts to resemble something we can easily optimize using standard techniques (such as gradient descent).

This section is going to be a ton of math, so strap in. We'll start by returning to an assumption you likely skipped over when reading the previous part.

Our dataset consists of $n$ independent observations

We know that for two independent events $A, B$ that $\mathbb{P}(A \cap B) = \mathbb{P}(A)\mathbb{P}(B)$. Then the probability of observing the labels in dataset $S_n$ can be factored into the product of the individual probabilities of the ground truths:

$$\mathcal{L}(\vec{\theta}, b; S_n) = \prod_{i=1}^n \mathbb{P}(Y = y^{(i)} \mid \vec{X} = \vec{x}^{(i)}; \vec{\theta}, b)$$

Now there are two cases here:

  - if $y^{(i)} = 1$, then $\mathbb{P}(Y = y^{(i)} \mid \vec{X} = \vec{x}^{(i)}) = \sigma(\vec{\theta} \cdot \vec{x}^{(i)} + b)$
  - if $y^{(i)} = 0$, then $\mathbb{P}(Y = y^{(i)} \mid \vec{X} = \vec{x}^{(i)}) = 1 - \sigma(\vec{\theta} \cdot \vec{x}^{(i)} + b)$

as we've stated previously. We can use a neat trick with how we define our label space to combine these two cases into a single expression:

$$\mathbb{P}(Y = y^{(i)} \mid \vec{X} = \vec{x}^{(i)}) = \sigma(\vec{\theta} \cdot \vec{x}^{(i)} + b)^{y^{(i)}} \left(1 - \sigma(\vec{\theta} \cdot \vec{x}^{(i)} + b)\right)^{1 - y^{(i)}}$$

We can see here that when:

  - $y^{(i)} = 1$: the second factor has exponent $0$ and equals $1$, leaving $\sigma(\vec{\theta} \cdot \vec{x}^{(i)} + b)$
  - $y^{(i)} = 0$: the first factor has exponent $0$ and equals $1$, leaving $1 - \sigma(\vec{\theta} \cdot \vec{x}^{(i)} + b)$

so the single expression reproduces both cases, as the quick check below confirms.
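A two-line numerical check of the trick, with an arbitrary predicted probability:

```python
def bernoulli_pmf(y, p):
    # Single-expression form: p^y * (1 - p)^(1 - y), valid for y in {0, 1}.
    return p**y * (1 - p) ** (1 - y)

p = 0.73                              # an arbitrary predicted P(Y = 1 | x)
print(bernoulli_pmf(1, p), p)         # both print 0.73
print(bernoulli_pmf(0, p), 1 - p)     # both print 0.27 (up to float rounding)
```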

Therefore, returning to the likelihood function:

$$\mathcal{L}(\vec{\theta}, b; S_n) = \prod_{i=1}^n \sigma(\vec{\theta} \cdot \vec{x}^{(i)} + b)^{y^{(i)}} \left(1 - \sigma(\vec{\theta} \cdot \vec{x}^{(i)} + b)\right)^{1 - y^{(i)}}$$

We could stop here, but the iterated product $\Pi$ is super nasty to take gradients over. Luckily we have an ace up our sleeve.

Recall that because $\ln$ is strictly increasing, $\argmax_x f(x) = \argmax_x \ln(f(x))$ (over the same domain). We can apply the natural log to the likelihood $\mathcal{L}$ to get the log-likelihood $\ell$, where $\ell$ and $\mathcal{L}$ are maximized by the same values of $\vec{\theta}$ and $b$.

Looking at the log-likelihood then:

$$\ell(\vec{\theta}, b; S_n) = \ln \mathcal{L}(\vec{\theta}, b; S_n) = \ln \prod_{i=1}^n \sigma(\vec{\theta} \cdot \vec{x}^{(i)} + b)^{y^{(i)}} \left(1 - \sigma(\vec{\theta} \cdot \vec{x}^{(i)} + b)\right)^{1 - y^{(i)}}$$

Applying the following log laws:

  - $\ln(ab) = \ln(a) + \ln(b)$
  - $\ln(a^c) = c\ln(a)$

we can simplify the log-likelihood to:

$$\ell(\vec{\theta}, b; S_n) = \sum_{i=1}^n \left[ y^{(i)} \ln \sigma(\vec{\theta} \cdot \vec{x}^{(i)} + b) + (1 - y^{(i)}) \ln\left(1 - \sigma(\vec{\theta} \cdot \vec{x}^{(i)} + b)\right) \right]$$

This formulation of the optimization objective is something that's much nicer for gradient computation as that nasty iterated product is removed.
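If you'd like to double-check the algebra numerically, here is a small sketch on invented data comparing the sum form above to the log of the original iterated product:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(X, y, theta, b):
    """Sum-form log-likelihood derived above."""
    p = sigmoid(X @ theta + b)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy data, invented for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
y = rng.integers(0, 2, size=5)
theta, b = np.array([0.7, -0.4]), 0.1

# The sum form agrees with the log of the iterated product, as the derivation claims.
p = sigmoid(X @ theta + b)
product_form = np.prod(p**y * (1 - p) ** (1 - y))
print(np.log(product_form), log_likelihood(X, y, theta, b))  # same value, up to float error
```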

So finally, we have the form of the MLE problem that will give us the optimal parameters for the logistic regression model:

$$\argmax_{\vec{\theta}, b} \sum_{i=1}^n \left[ y^{(i)} \ln \sigma(\vec{\theta} \cdot \vec{x}^{(i)} + b) + (1 - y^{(i)}) \ln\left(1 - \sigma(\vec{\theta} \cdot \vec{x}^{(i)} + b)\right) \right]$$

By convention, it is common to solve minimization problems rather than maximization problems - we can equivalently rewrite the MLE optimization problem as:

$$\argmin_{\vec{\theta}, b} \; -\sum_{i=1}^n \left[ y^{(i)} \ln \sigma(\vec{\theta} \cdot \vec{x}^{(i)} + b) + (1 - y^{(i)}) \ln\left(1 - \sigma(\vec{\theta} \cdot \vec{x}^{(i)} + b)\right) \right]$$

This objective is exactly the negative log-likelihood, better known as the binary cross-entropy loss.
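To tie everything together, here is a minimal gradient-descent sketch that minimizes this objective; the toy data, learning rate, and iteration count are arbitrary illustrative choices rather than anything prescribed in the derivation above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iters=5000):
    """Minimize the negative log-likelihood with plain gradient descent."""
    n, d = X.shape
    theta, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ theta + b)       # predicted P(Y=1 | x^(i)) for every observation
        grad_theta = X.T @ (p - y)       # gradient of the NLL with respect to theta
        grad_b = np.sum(p - y)           # gradient of the NLL with respect to b
        theta -= lr * grad_theta / n     # averaging keeps the step size scale-free in n
        b -= lr * grad_b / n
    return theta, b

# Toy data sampled from a known logistic model (invented for illustration).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
true_theta, true_b = np.array([2.0, -1.0]), 0.5
y = (rng.random(200) < sigmoid(X @ true_theta + true_b)).astype(float)

theta_hat, b_hat = fit_logistic_regression(X, y)
print("learned theta:", theta_hat, "learned b:", b_hat)
```

The learned parameters should land reasonably close to the generating ones, though noise and the finite sample keep them from matching exactly.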

Conclusion

To recap, a logistic regression model defines the conditional distribution of labels to be a Bernoulli distribution parameterized by a linear transformation of the feature vector with the parameters $\vec{\theta}, b$.

We can learn the optimal parameters $\vec{\theta}, b$ by performing maximum likelihood estimation using a dataset $S_n$ of $n$ independent observations. The MLE problem simplifies to the following optimization problem, which can be solved using traditional convex optimization methods:

$$\argmin_{\vec{\theta}, b} \; -\sum_{i=1}^n \left[ y^{(i)} \ln \sigma(\vec{\theta} \cdot \vec{x}^{(i)} + b) + (1 - y^{(i)}) \ln\left(1 - \sigma(\vec{\theta} \cdot \vec{x}^{(i)} + b)\right) \right]$$