Have you ever thought about how to pick the right number of bins for a histogram? Have you questioned whether there’s a deeper justification behind your choices beyond just making it look visually appealing? Histograms are one of the most essential tools for visualizing data, but choosing the right resolution — that is, the number and width of bins — matters a great deal, especially if you plan to use the histogram for further analysis. Often, histograms are calculated to represent how densely packed your data is. In this article, we explore the mathematical foundations of density estimation, focusing specifically on how bin size should decrease as your sample grows larger. Drawing inspiration from areas like perturbation theory in physics and Taylor expansions in mathematics, we’ll develop a rigorous framework for building density estimates.
All images are by the author
Background
Approximations
The basic idea is intuitive: the more data you collect, the finer the level of detail you should be able to resolve. If you only have ten data points, two or three broad bins are probably all you can manage before your histogram turns into a patchwork of sparse, mostly empty bars. But if you’re working with ten million observations, those same broad bins start to look like a blurry, pixelated image. Naturally, you’d want to “zoom in” by increasing the number of bins. The real question is: how should we scale this resolution as our dataset grows?
In physics, when a system is too complicated to solve exactly, researchers often rely on Perturbation Theory. In Quantum Electrodynamics (QED), for instance, complex particle interactions are approximated by expanding them in terms of a small coupling constant — such as the electron charge e. This “interaction strength” gives a natural ordering to successive approximations. But for a histogram, what plays the role of that “charge”? Is there a fundamental parameter that captures the relationship between our discrete samples and the underlying distribution we’re trying to estimate?
Mathematics provides another avenue: the Taylor Expansion. If we assume the underlying density function is smooth enough (analytic), we can approximate it locally using its higher-order derivatives. This seems promising, since it can be shown that higher-order terms gradually diminish. However, even if we restrict ourselves to analytic distributions, it’s not immediately clear how this reasoning leads us to a specific bin width.
Another possibility is to frame the problem as an Expansion in Basis Functions. Just as a piecewise continuous function can be represented through a Fourier series or an expansion in Legendre polynomials, we could treat histogram bins as a collection of basis functions. Using this framework, we could approximate the density in an L² sense. But this introduces its own challenges. How do we efficiently compute the coefficients for these basis functions? And more critically, how do we enforce the physical requirements of a probability density function? Unlike a general Fourier series, a density must be non-negative everywhere and integrate to one. As we’ll see later, the approach derived from information theory shares notable similarities with basis function expansions.
Information Theory
Priors & Posteriors
For an introduction to Bayesian statistics or information theory, the reader may consult (Murphy, 2022). In the Bayesian framework, a model ℳ with likelihood 𝑃(𝑋|𝜃), where 𝑋 represents the observable data we wish to model and 𝜃 denotes our parameters, also incorporates a prior distribution 𝑃(𝜃|ℳ) that encodes our beliefs about the parameters before any data is observed. Once the data has been collected, we can estimate the posterior distribution 𝑃(𝜃|𝑋):
𝑃(𝜃|𝑋) = 𝑃(𝑋|𝜃)𝑃(𝜃|ℳ)/𝑃(𝑋)
This formulation is mathematically elegant because it inherently guards against overfitting. However, it demands careful discipline: you are not permitted to choose your model or prior after you’ve already looked at the data. If you let the data guide your choice of model structure, you undermine the entire logical foundation of Bayesian inference.
The most-likely model given the data versus model weighting
The quality of a model can be evaluated by examining its surprisal (see, for example, (Vries, 2026)):
log 𝑃(𝑋|ℳ) = −surprisal = accuracy − complexity
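One common way to make the two terms explicit is sketched here in LaTeX. It is a standard identity for the exact posterior and not necessarily the form used in the cited reference: the accuracy is the expected log-likelihood under the posterior, and the complexity is the Kullback-Leibler divergence between posterior and prior.

```latex
\log P(X \mid \mathcal{M})
  = \underbrace{\mathbb{E}_{P(\theta \mid X, \mathcal{M})}
      \big[\log P(X \mid \theta)\big]}_{\text{accuracy}}
  - \underbrace{D_{\mathrm{KL}}\big(P(\theta \mid X, \mathcal{M})
      \,\|\, P(\theta \mid \mathcal{M})\big)}_{\text{complexity}}
```

Read this way, the complexity term measures how far the data had to move the posterior away from the prior, which is why extra parameters that the data must pin down are penalized.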
Models with an excessive number of parameters — perhaps because one is tempted to include all sorts of speculative interactions — may achieve extraordinary accuracy, but they are penalized by their own complexity. The best model isn’t necessarily the most detailed one; it’s the one that captures the most meaningful information with the fewest unnecessary assumptions.
When evaluating a collection of candidate models, we can compute the relative likelihood of each one in comparison to the others:
𝑃(ℳ𝑖 ∣ 𝑋) ∝ 𝑃(𝑋 | ℳ𝑖) 𝑃(ℳ𝑖)
It’s tempting to simply select the model with the highest probability and proceed. But this “winner-takes-all” strategy carries certain risks:
- Statistical fluctuations: The data 𝑋 might contain a random artifact that makes a suboptimal model appear temporarily superior.
- The weight of the crowd: Sometimes, the combined probability of many “less likely” models actually exceeds that of the single “best” model.
For these reasons, a more reliable strategy is to retain all models, weighting each by its probability. It’s important to emphasize that this is not a “mixture” of different truths — we still assume that only one model is actually correct — but we use the full distribution of possibilities to properly account for our own uncertainty.
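As a minimal sketch of this weighting step, the snippet below turns log-evidences into normalised model probabilities. The function name `model_weights`, the equal model prior, and the three log-evidence values are illustrative assumptions, not taken from the article.

```python
import math

def model_weights(log_evidences, log_priors=None):
    """Return P(M_i | X) from log P(X | M_i), normalised over the candidate set."""
    n = len(log_evidences)
    if log_priors is None:
        log_priors = [0.0] * n  # assumption: every model is equally likely a priori
    log_post = [le + lp for le, lp in zip(log_evidences, log_priors)]
    shift = max(log_post)  # subtract the maximum before exponentiating to avoid underflow
    unnorm = [math.exp(v - shift) for v in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical log-evidences for three candidate models
weights = model_weights([-1050.0, -1047.0, -1049.0])
```

Working in log space matters in practice: evidences for realistic datasets are far too small to exponentiate directly.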
Densities
A density using the Bayesian approach
To treat a density estimate as a formal statistical model, we view each of its 𝐾 bins as a parameter. Concretely, we assign a weight 𝜃ₖ to each bin, representing the probability that a data point falls within that interval. Since the total probability must sum to one (∑ₖ 𝜃ₖ = 1), a density with 𝐾 bins is fully described by 𝐾 − 1 independent parameters. Such models are also known as mixture models. Within our Bayesian framework, we need to specify a prior over these weights. Since we are dealing with categorical proportions that must sum to one, the Dirichlet distribution is the mathematically natural choice.
Selecting the Hyperparameters
The Dirichlet distribution is controlled by hyperparameters, typically represented as 𝛼. These values act as “pseudo-counts”—essentially our assumptions about the shape of the density before we’ve even observed any actual data. When we adopt a flat prior (where the evidence 𝑃(𝑋) remains constant), two main approaches emerge for selecting 𝛼:
- 𝛼 = 1/𝐾 (The Sparse Option): This is commonly used when we anticipate the data to be heavily concentrated in certain areas. It presumes beforehand that most bins will be empty, making it a prior that encourages sparsity.
- 𝛼 = 1 (The Uniform Option): Also referred to as the flat or Laplace prior, this assumes that every possible configuration of weights is equally probable. It effectively adds one “hypothetical” observation to each bin before any real data is collected.
When building a standard density estimate, the second option, 𝛼 = 1, is often the most intuitive. It represents a neutral starting point where we assume the data is evenly spread across the interval until the evidence suggests otherwise.
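To get a feel for the difference between the two options, one can draw weight vectors from each prior and compare how much mass the largest bin typically receives. This is a rough sketch: it samples a symmetric Dirichlet by normalising independent gamma draws (a standard construction), and the helper names, the seed, and 𝐾 = 8 are illustrative assumptions.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one weight vector from a symmetric Dirichlet(alpha, ..., alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_weight(alpha, k, n_draws=2000, seed=0):
    """Average, over prior draws, of the weight held by the heaviest bin."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n_draws)) / n_draws

K = 8
sparse = mean_largest_weight(1.0 / K, K)   # alpha = 1/K: mass tends to pile into a few bins
uniform = mean_largest_weight(1.0, K)      # alpha = 1: every weight vector equally likely
```

Under 𝛼 = 1/𝐾 the heaviest bin usually takes most of the probability, whereas under 𝛼 = 1 the mass is spread more evenly.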
By structuring our bins in this manner, we’ve converted the “pixelation” of a density estimate into a well-defined model. We now have a fixed set of parameters (𝐾 − 1 weights) and a clearly stated prior (𝛼 = 1). The next stage is to leverage the data to determine the ideal number of bins 𝐾 by striking a balance between the goodness of fit and the complexity introduced by additional parameters.
Example
Consider the data illustrated in the figure below:
When fitting the model with 8 bins, we obtain:

What stands out in this density estimate is that the rightmost bin holds a value above zero, even though no actual data points fall within it. This outcome is a direct consequence of the Bayesian methodology, which estimates the true density by blending our prior beliefs with the observed data.
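The effect can be reproduced with the posterior mean of the Dirichlet-multinomial model, where each bin weight is estimated as (𝑛ₖ + 𝛼) / (𝑁 + 𝐾𝛼). The sketch below assumes equal-width bins on a known interval [lo, hi]. The function name and the sample are hypothetical and are not the data from the figure.

```python
def posterior_mean_density(data, k, lo, hi, alpha=1.0):
    """Posterior-mean histogram density on [lo, hi] with a symmetric Dirichlet prior."""
    width = (hi - lo) / k
    counts = [0] * k
    for x in data:
        idx = min(int((x - lo) / width), k - 1)  # a point at x == hi goes into the last bin
        counts[idx] += 1
    n = len(data)
    # Posterior mean of each bin weight: (n_k + alpha) / (N + K * alpha)
    weights = [(c + alpha) / (n + k * alpha) for c in counts]
    # Divide by the bin width to turn bin probabilities into a density
    return [w / width for w in weights]

# Hypothetical sample with no points in the last quarter of the interval
data = [0.05, 0.12, 0.2, 0.31, 0.33, 0.4, 0.52, 0.6]
density = posterior_mean_density(data, k=4, lo=0.0, hi=1.0)
```

With 𝛼 = 1 the empty last bin still receives one pseudo-count, so its density is 1/12 divided by the bin width of 0.25, roughly 0.33, rather than zero.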
In summary, we derived a density using a Bayesian framework. We established a prior 𝑃(𝜃) that expressed our expectation for a uniform density. We then incorporated the data to compute the posterior 𝑃(𝜃|𝑋), which underlies the resulting density estimate.
Weighted Density Estimates
Applying the method described in the previous section, we can construct density estimates using 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 bins. Increasing the number of bins improves how closely the estimate matches the data, but also adds greater complexity. As covered earlier, we can use accuracy and complexity to compute the evidence for each configuration. By treating each density as a candidate model, we can evaluate its relative likelihood within the set of all models under consideration. This produces the figure shown below:
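For equal-width bins with a symmetric Dirichlet prior, the evidence has a closed form (the Dirichlet-multinomial marginal likelihood, divided by the bin width once per data point). The sketch below uses that closed form. It assumes a known interval [lo, hi], and the function name is illustrative, so it may differ in detail from the computation behind the figure.

```python
import math

def log_evidence(data, k, lo, hi, alpha=1.0):
    """log P(X | K) for an equal-width histogram density with a Dirichlet(alpha) prior."""
    width = (hi - lo) / k
    counts = [0] * k
    for x in data:
        counts[min(int((x - lo) / width), k - 1)] += 1
    n = len(data)
    # Dirichlet-multinomial marginal over the bin weights
    log_ev = math.lgamma(k * alpha) - math.lgamma(n + k * alpha)
    log_ev += sum(math.lgamma(c + alpha) - math.lgamma(alpha) for c in counts)
    # Each data point contributes a factor 1 / width, since we model a density
    log_ev -= n * math.log(width)
    return log_ev

# Hypothetical sample concentrated in the left half of [0, 1]
sample = [0.1, 0.2, 0.3, 0.4]
log_ev_by_k = {k: log_evidence(sample, k, 0.0, 1.0) for k in (1, 2, 4, 8)}
```

A single bin on the unit interval always gives a log-evidence of zero, which makes it a convenient baseline against which finer resolutions can be compared.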

As discussed previously, one could select the single “best” model—which in this case would be using 8 bins. However, a more robust approach is to compute a weighted combination across all models. This gives:

It’s worth emphasizing that, from a Bayesian standpoint, this weighted estimate is the optimal result for the set of models under consideration. Also note that the contribution from the configuration with 1024 bins is still visible in this graph. Additionally, one can demonstrate that the influence of higher-order configurations (larger 𝐾) gradually diminishes.
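The weighted combination can be sketched as a posterior predictive density: each resolution contributes its posterior-mean density, weighted by its posterior model probability. The code assumes equal-width bins on a known interval, a symmetric Dirichlet prior, and an equal prior over the candidate bin counts. All function names and the sample are illustrative.

```python
import math

def bin_counts(data, k, lo, hi):
    width = (hi - lo) / k
    counts = [0] * k
    for x in data:
        counts[min(int((x - lo) / width), k - 1)] += 1
    return counts

def log_evidence(counts, width, alpha=1.0):
    """Closed-form log evidence of an equal-width histogram density."""
    k, n = len(counts), sum(counts)
    out = math.lgamma(k * alpha) - math.lgamma(n + k * alpha)
    out += sum(math.lgamma(c + alpha) - math.lgamma(alpha) for c in counts)
    return out - n * math.log(width)

def averaged_density(x, data, ks, lo, hi, alpha=1.0):
    """Model-averaged density at x over histograms with the bin counts in ks."""
    log_evs, values = [], []
    for k in ks:
        width = (hi - lo) / k
        counts = bin_counts(data, k, lo, hi)
        log_evs.append(log_evidence(counts, width, alpha))
        idx = min(int((x - lo) / width), k - 1)
        # Posterior-mean density of the bin containing x at this resolution
        values.append((counts[idx] + alpha) / (len(data) + k * alpha) / width)
    shift = max(log_evs)  # stabilise the exponentials
    w = [math.exp(v - shift) for v in log_evs]
    return sum(wi * vi for wi, vi in zip(w, values)) / sum(w)

# Hypothetical sample and candidate resolutions
data = [0.05, 0.12, 0.2, 0.31, 0.33, 0.4, 0.52, 0.6]
ks = [1, 2, 4, 8]
curve = [averaged_density((i + 0.5) / 8, data, ks, 0.0, 1.0) for i in range(8)]
```

Because every component integrates to one and the model weights sum to one, the averaged curve is itself a valid density.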
Densities with Unequal Bins
The density estimate obtained above has a somewhat blocky appearance, which stems from the use of equally sized bins. Other strategies are available, such as employing random splits (while making the appropriate adjustments to the prior). This produces the graph shown below:

Densities with Error Bars
Finally, to complete our construction of density estimates, it can be valuable to visualize the uncertainty inherent in these estimates. Although computationally demanding to calculate, the formula for determining the standard deviation of the density is remarkably elegant (F. Pijlman, 2023):
This yields the following density visualizations with error bars:


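As a simpler point of reference, and explicitly not the formula from the cited paper, the posterior standard deviation of a single bin weight at one fixed resolution follows from the standard Dirichlet variance, Var(𝜃ₖ) = 𝑎ₖ(𝑎₀ − 𝑎ₖ) / (𝑎₀²(𝑎₀ + 1)) with 𝑎ₖ = 𝑛ₖ + 𝛼 and 𝑎₀ = 𝑁 + 𝐾𝛼. The sketch below assumes equal-width bins. The function name and the counts are hypothetical.

```python
import math

def density_std(counts, width, alpha=1.0):
    """Posterior standard deviation of the density in each bin, for one fixed K."""
    a = [c + alpha for c in counts]   # posterior Dirichlet parameters
    a0 = sum(a)
    stds = []
    for ak in a:
        var = ak * (a0 - ak) / (a0 ** 2 * (a0 + 1.0))  # variance of the bin weight
        stds.append(math.sqrt(var) / width)            # rescale from weight to density
    return stds

# Hypothetical counts for four equal bins on [0, 1]
stds = density_std([3, 3, 2, 0], width=0.25)
```

This covers only the uncertainty within one resolution. Combining resolutions, as done in the figures, would additionally have to account for the spread between models.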
Conclusions
We started with a straightforward question: Is there a mathematical basis for choosing the bins in a histogram? Since the concept of bins inherently bridges raw data points and continuous density functions, we explored how to select bins for density estimation.
Using a Bayesian approach rooted in information theory, we can fit density estimates without the risk of overfitting (where too many bins reveal excessive detail). While it’s possible to compute the single “best” bin width, we observed that:
- Model weighting enables us to merge multiple resolutions, yielding a smoother and more truthful representation of the underlying data.
- Dirichlet Priors provide a principled way to encode our initial assumptions about how the data is distributed.
Much like perturbation theory provides a hierarchy for describing physical interactions, this Bayesian framework offers a hierarchy for data resolution. The resolution naturally scales as more data becomes available. It’s worth noting that these ideas can also be applied when training models that involve expansions in interaction terms.
The technique of combining density estimates across various resolutions was also examined in the context of randomly chosen bins. This approach produced smoother histograms that may appear more natural for most datasets.
We also introduced the use of standard deviations in histograms. Although the derivation of these standard deviations was developed within a Bayesian context, the calculation procedure hints at broader applicability. As such, it serves as a useful tool for visualizing the remaining uncertainties in density estimates.
Acknowledgements
The EdgeAI “Edge AI Technologies for Optimised Performance Embedded Processing” project has received funding from the Key Digital Technologies Joint Undertaking (KDT JU) under grant agreement No. 101097300. The KDT JU receives support from the European Union’s Horizon Europe research and innovation programme, as well as from Austria, Belgium, France, Greece, Italy, Latvia, Luxembourg, the Netherlands, and Norway.
References
- F. Pijlman, J. L. (2023). Variance of Likelihood of Data. 34/37.
- Murphy, K. (2022). Probabilistic Machine Learning: An Introduction. MIT Press.
- Vries, B. d. (2026). Active Inference for Physical AI Agents. arXiv.
Bio
Fetze Pijlman is a Principal Scientist at Signify Research in Eindhoven, the Netherlands. His research spans probabilistic machine learning, Bayesian inference, and signal processing, with a particular focus on applying these mathematical frameworks to IoT, sensing, and smart systems.



