Logistic Regression: A Probability Theoretic Approach
3/9/2025
Introduction
One of my favorite ways to study machine learning is through the lens of probability theory - the subset of math concerned with modeling randomness and chance. In this spirit, this post will describe how logistic regression, a basic binary classification algorithm, can be modeled using concepts from elementary probability theory.
We will start with a basic description of probability-theoretic supervised ML, describe logistic regression within this framework, and conclude by deriving the optimization problem used to obtain the optimal parameters.
Probability Theoretic Supervised Machine Learning
Before specifically addressing logistic regression, we'll first zoom out and study general supervised ML strategies through a probability-theoretic lens. Let:
- $X$ be a random feature vector with $d$ components. In other words, $X = [X_1, X_2, \dots, X_d]$ is a vector where component $X_i \sim P_{X_i}$, where $P_{X_i}$ is the probability distribution of feature $i$.
- $Y$ be a random variable specifying a label. $Y$ is a discrete random variable for classification problems and a continuous random variable for regression problems.
Under this framework, inference involves predicting the distribution of labels for a specific feature vector. Mathematically, this means that given that we observe $X = x$, we predict the conditional label distribution:

$$P(Y \mid X = x)$$
Notice that the use of conditional probability here encodes our current knowledge. We know what the feature vector is, so we condition on it. The goal of inference is to see how our knowledge of the feature vector changes the distribution of labels.
If we want to predict only a single label, we can select the label that is most likely to occur in the conditional label distribution.
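Written out, this single-label prediction is just the mode of the conditional label distribution:

$$\hat{y} = \arg\max_{y} P(Y = y \mid X = x)$$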
The conditional label distribution can be modeled by some arbitrary distribution $\hat{P}(Y \mid X = x; \theta)$, where:
- $\hat{P}$ is dependent on the observed feature vector $x$
- $\hat{P}$ is parameterized by the parameter set $\theta$
We call $\hat{P}$ our model. To train the model, we must find the values of $\theta$ that produce the most reasonable conditional label distributions.
Enough with the general definitions for now though - let's dive into the details of how this applies to logistic regression.
Defining the Logistic Regression Model
In order to discuss logistic regression, let's start by placing some assumptions:
- $X$ is drawn from a real-valued space. This means each feature vector observation is a real-valued vector, $x \in \mathbb{R}^d$
- $Y$ is a discrete random variable which can take on values in the set $\{0, 1\}$, corresponding to negative and positive labels respectively.
Mathematical Definition
Logistic regression defines the conditional label distribution as:

$$Y \mid X = x \sim \text{Bernoulli}\left(\sigma(w^\top x + b)\right)$$
where:
- $\text{Bernoulli}(p)$ is the Bernoulli probability distribution. As a recap, a Bernoulli distribution is a binary distribution defined so that $P(Y = 1) = p$
- $P(Y = 0) = 1 - p$ (following from the complement rule)
- $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$ are the parameters of our model
- $\sigma$ is the sigmoid function: $\sigma(z) = \frac{1}{1 + e^{-z}}$
Notice what's actually happening here - we're saying that if we observe feature vector $X = x$, then:

$$P(Y = 1 \mid X = x) = \sigma(w^\top x + b)$$

$$P(Y = 0 \mid X = x) = 1 - \sigma(w^\top x + b)$$

This precisely defines the conditional label distribution in terms of $x$ and the model parameters $w, b$ as desired!
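To make this concrete, here's a minimal sketch of the inference step in plain NumPy. The parameter values and helper names (`predict_proba`, `sigmoid`) are illustrative assumptions, not part of the model definition above:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid (logistic) function: squashes a real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    """P(Y = 1 | X = x) under the logistic regression model."""
    logit = w @ x + b       # linear combination of the features + intercept
    return sigmoid(logit)   # Bernoulli parameter for the positive label

# Hypothetical parameter values for a 3-feature model
w = np.array([0.5, -1.2, 0.3])
b = 0.1
x = np.array([1.0, 0.4, 2.0])

p_positive = predict_proba(x, w, b)
p_negative = 1.0 - p_positive  # complement rule
```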
Logits
As our label distribution is binary, we really only need to concern ourselves with how logistic regression predicts the probability of the positive label - the probability of the negative label follows via complement.
Logistic regression predicts the probability that a given feature vector $x$ has a positive label in two steps:
- it computes the logit $z$ for the feature vector using a linear combination of the features plus an intercept: $z = w^\top x + b$
- it squashes this logit into the $[0, 1]$ range using the sigmoid (or logistic) function
![Graph of sigmoid function over interval [-10, 10]](/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Fsigmoid.99597064.png&w=1920&q=75)
Graph of sigmoid function over interval [-10, 10]. Image by author.
Intuitively, it makes sense to think of logits as raw, unnormalized scores for the positive label - formally, the logit is the log-odds, $z = \log \frac{p}{1 - p}$ where $p = \sigma(z)$. The logit encodes all the relevant information needed to obtain the probability, just not in the right "format" until it is squashed through the sigmoid.
Treating a 50% probability of a positive label as the "random guess" baseline:
- the sign of the logit determines which label the model predicts is more likely:
  - when the logit is positive ($z > 0$, so $\sigma(z) > 0.5$), the model predicts the positive label is more likely.
  - when the logit is negative ($z < 0$, so $\sigma(z) < 0.5$), the model predicts the negative label is more likely.
- the magnitude of the logit $|z|$ describes the model's confidence in the prediction (see the short numerical check after this list):
  - if $\sigma(z)$ is close to 1 or 0 (large $|z|$), the model predicts a very high or very low probability of a positive label (far from random, high confidence).
  - if $\sigma(z)$ is close to 0.5 (small $|z|$), the model predicts close to a 50% probability of a positive label (close to random, low confidence).
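As a quick numerical check of these points (a hypothetical snippet, restating the same `sigmoid` helper so it stands alone):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [-5.0, -0.5, 0.0, 0.5, 5.0]:
    print(f"logit {z:+.1f} -> P(Y = 1 | x) = {sigmoid(z):.3f}")
# -5.0 maps to ~0.007 (confident negative), 0.0 to 0.500 (a coin flip),
# and +5.0 to ~0.993 (confident positive).
```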
One final note - it's critical to understand that as logits are computed via a linear combination of features + an intercept, a logistic regressor is a linear model. There are both benefits and downsides to this:
- Good: Logistic regression is interpretable - we can see exactly which features contributed to the sign and magnitude of the logit, exposing what patterns the model is recognizing in the data.
- Bad: Logistic regression can underfit data that doesn't have a clear linear trend. In these situations, you would have to perform manual feature engineering to encode nonlinear relationships (see the sketch after this list).
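As a hedged sketch of that kind of manual feature engineering, one option is to append hand-crafted nonlinear transforms of the original features (here a squared term and an interaction term, chosen purely for illustration) before fitting the linear model:

```python
import numpy as np

def add_nonlinear_features(X):
    """Augment an (N, 2) feature matrix with a squared term and an interaction term.

    The logistic regressor stays linear in the augmented features,
    but can now pick up some nonlinear structure in the original ones.
    """
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 ** 2, x1 * x2])

X = np.array([[1.0, 2.0],
              [0.5, -1.0]])
X_aug = add_nonlinear_features(X)  # shape (2, 4)
```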
Training Logistic Regression Models
To train a logistic regression model, we have to learn the values of $w$ and $b$ that allow us to predict the most accurate conditional label distributions. As usual with supervised machine learning, we collect a training dataset to determine how close our model is to predicting correct labels.
Our dataset consists of $N$ independent observations, where the $i$th observation consists of two values:
- $x^{(i)} \in \mathbb{R}^d$ as a feature vector sampled from our feature space
- $y^{(i)} \in \{0, 1\}$ as the label corresponding to the feature vector
We can more concisely represent this dataset as $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$.
Maximum Likelihood Estimation
To train a model, we need to be able to determine how "good" certain parameter choices are. One way to do so is by calculating how likely it is that the dataset $\mathcal{D}$ was sampled from the conditional label distributions given by $w$ and $b$:
- if our choice of $w, b$ is good, then we should see that $P(y^{(1)}, \dots, y^{(N)} \mid x^{(1)}, \dots, x^{(N)}; w, b)$ is large, indicating that the labels in $\mathcal{D}$ are likely to be sampled from the conditional label distribution.
- if our choice of $w, b$ is poor, then we may see that $P(y^{(1)}, \dots, y^{(N)} \mid x^{(1)}, \dots, x^{(N)}; w, b)$ is small, indicating that the labels in $\mathcal{D}$ are unlikely to be sampled from the conditional label distribution.
The key in this logic is that $\mathcal{D}$ is our source of truth, and training involves finding parameters that make it as likely as possible to observe that ground truth when randomly sampling labels from the model.
This intuition is formalized as the likelihood (technically the conditional likelihood) of the model parameters $w, b$, defined as the function $L(w, b) = P(y^{(1)}, \dots, y^{(N)} \mid x^{(1)}, \dots, x^{(N)}; w, b)$.
Usually, the likelihood captures the probability of observing the data under the joint distribution $P(X, Y)$ rather than under the conditional distribution we seek to model. This is why $L(w, b)$ as defined here is the conditional likelihood.
$L(w, b)$ describes how likely it is that the labels in $\mathcal{D}$ were sampled using the parameters $w, b$. As a higher likelihood corresponds to better parameters, we can choose the optimal values for $w, b$ by maximizing $L(w, b)$ - a process called maximum likelihood estimation (MLE).
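To ground this, here's a minimal sketch of computing the conditional likelihood $L(w, b)$ directly as a product of per-observation Bernoulli probabilities (names are illustrative; note that in practice this raw product underflows quickly, which is one more practical reason to prefer the log-likelihood derived in the next section):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conditional_likelihood(X, y, w, b):
    """L(w, b) = prod_i P(Y = y_i | X = x_i; w, b), for labels y_i in {0, 1}."""
    p = sigmoid(X @ w + b)                      # P(Y = 1 | x_i) for every observation
    per_example = np.where(y == 1, p, 1.0 - p)  # probability of the observed label
    return float(np.prod(per_example))
```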
Deriving a Neater Optimization Problem
We've now defined the core optimization problem for logistic regression: find the $w, b$ that maximize $L(w, b)$. In other words, solve

$$w^*, b^* = \arg\max_{w, b} L(w, b)$$
We can simplify this maximization problem further until it starts to resemble something that we can easily train using regular supervised machine learning approaches (such as gradient descent, for example).
This section is going to be a ton of math, so strap in. We'll start by returning to an assumption you likely skipped over when reading the previous part:
Our dataset consists of $N$ independent observations
We know that for two independent events $A$ and $B$, $P(A \cap B) = P(A)P(B)$. Then the probability of observing the labels in dataset $\mathcal{D}$ can be factored into the product of the individual probabilities of the ground truths:

$$L(w, b) = \prod_{i=1}^{N} P(Y = y^{(i)} \mid X = x^{(i)}; w, b)$$
Now there are two cases here:
- $y^{(i)} = 1$ - in this case, $P(Y = y^{(i)} \mid X = x^{(i)}; w, b) = \sigma(w^\top x^{(i)} + b)$
- $y^{(i)} = 0$ - in this case, $P(Y = y^{(i)} \mid X = x^{(i)}; w, b) = 1 - \sigma(w^\top x^{(i)} + b)$
as we've stated previously. We can use a neat trick with how we defined our label space to combine these two cases into a single expression:

$$P(Y = y^{(i)} \mid X = x^{(i)}; w, b) = \sigma(w^\top x^{(i)} + b)^{y^{(i)}} \left(1 - \sigma(w^\top x^{(i)} + b)\right)^{1 - y^{(i)}}$$
We can see here that when:
- $y^{(i)} = 1$, then the second exponent is $1 - y^{(i)} = 0$, leaving only the first term in the product
- $y^{(i)} = 0$, then the first exponent is $y^{(i)} = 0$, leaving only the second term in the product
Therefore, returning to the likelihood function:

$$L(w, b) = \prod_{i=1}^{N} \sigma(w^\top x^{(i)} + b)^{y^{(i)}} \left(1 - \sigma(w^\top x^{(i)} + b)\right)^{1 - y^{(i)}}$$
We could stop here, but the iterated product is super nasty to take gradients over. Luckily we have an ace up our sleeve.
Recall that the natural log is monotonically increasing, so $\arg\max_{x} f(x) = \arg\max_{x} \log f(x)$ (over the same domain). We can apply the natural log to the likelihood to get the log-likelihood $\ell(w, b) = \log L(w, b)$, where $L$ and $\ell$ are maximized by the same values of $w$ and $b$.
Looking at the log-likelihood then:

$$\ell(w, b) = \log \prod_{i=1}^{N} \sigma(w^\top x^{(i)} + b)^{y^{(i)}} \left(1 - \sigma(w^\top x^{(i)} + b)\right)^{1 - y^{(i)}}$$

Applying the following log laws:

$$\log(ab) = \log a + \log b \qquad \log\left(a^b\right) = b \log a$$

we can simplify the log-likelihood to:

$$\ell(w, b) = \sum_{i=1}^{N} \left[ y^{(i)} \log \sigma(w^\top x^{(i)} + b) + (1 - y^{(i)}) \log\left(1 - \sigma(w^\top x^{(i)} + b)\right) \right]$$
This formulation of the optimization objective is something that's much nicer for gradient computation as that nasty iterated product is removed.
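Here's a minimal NumPy sketch of this log-likelihood; the small clipping constant is an implementation detail I'm assuming to keep the logs finite when the sigmoid saturates, not part of the derivation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(X, y, w, b):
    """ell(w, b) = sum_i [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ], p_i = sigmoid(w.x_i + b)."""
    p = np.clip(sigmoid(X @ w + b), 1e-12, 1.0 - 1e-12)  # avoid log(0)
    return float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))
```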
So finally, we have the final form of the MLE problem that will give us the optimal parameters for the logistic regression model:

$$w^*, b^* = \arg\max_{w, b} \sum_{i=1}^{N} \left[ y^{(i)} \log \sigma(w^\top x^{(i)} + b) + (1 - y^{(i)}) \log\left(1 - \sigma(w^\top x^{(i)} + b)\right) \right]$$
By convention, it is common to solve minimization problems rather than maximization problems - we can equivalently rewrite the MLE optimization problem as:

$$w^*, b^* = \arg\min_{w, b} \; -\sum_{i=1}^{N} \left[ y^{(i)} \log \sigma(w^\top x^{(i)} + b) + (1 - y^{(i)}) \log\left(1 - \sigma(w^\top x^{(i)} + b)\right) \right]$$
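This negated objective is the negative log-likelihood (often called the binary cross-entropy loss). As a hedged sketch of how it might be minimized with the gradient descent mentioned earlier: the update uses the standard per-example gradient of the negative log-likelihood with respect to the logit, $\sigma(z^{(i)}) - y^{(i)}$, averaged over the dataset; the learning rate and iteration count are arbitrary illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iters=5000):
    """Minimize the (averaged) negative log-likelihood with batch gradient descent.

    X: (N, d) feature matrix, y: (N,) array of labels in {0, 1}.
    Returns the learned weight vector w and intercept b.
    """
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)               # predicted P(Y = 1 | x_i)
        error = p - y                        # dNLL/dz_i for each logit z_i
        w -= lr * (X.T @ error) / n_samples  # gradient step for w
        b -= lr * error.mean()               # gradient step for b
    return w, b
```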
Conclusion
To recap, a logistic regression model defines the conditional distribution of labels to be a Bernoulli distribution whose success probability $\sigma(w^\top x + b)$ is obtained by applying the sigmoid to a linear transformation of the feature vector $x$ with the parameters $w$ and $b$.
We can learn the optimal parameters by performing maximum likelihood estimation using a dataset of $N$ independent observations. The MLE problem simplifies to the following minimization of the negative log-likelihood, which can be solved using traditional convex optimization methods:

$$w^*, b^* = \arg\min_{w, b} \; -\sum_{i=1}^{N} \left[ y^{(i)} \log \sigma(w^\top x^{(i)} + b) + (1 - y^{(i)}) \log\left(1 - \sigma(w^\top x^{(i)} + b)\right) \right]$$
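If you'd rather not hand-roll the optimizer, scikit-learn's `LogisticRegression` solves essentially this problem (a usage sketch with synthetic data; note that its default settings add L2 regularization on top of the plain MLE objective derived here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                                # toy feature matrix
y = (X @ np.array([1.5, -2.0, 0.5]) + 0.3 > 0).astype(int)   # synthetic labels in {0, 1}

clf = LogisticRegression()          # default solver minimizes a regularized NLL
clf.fit(X, y)
probs = clf.predict_proba(X)[:, 1]  # P(Y = 1 | x) for each row of X
```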