Statistical Pi(e)

Statistics is probably the branch of math most closely related to science. Stats is what sets the scientific ball rolling: if you have a hypothesis, you get data, then use stats to analyze the data and come to conclusions about your hypothesis. It’s hard to imagine a way to make stats useless. Of course, this can be interpreted as a challenge.

a man saying "Did someone say 'useless'?"

Buckle up, folks, for we are going on a statistical journey. We are going to do a statistical analysis of the digits of π.

I: Is Pi Approximately Uniform?

Ever look at the digits of π? Those infinite digits, stretching out into the cosmos and beyond? You probably have. You’ve probably also wondered if there’s some sort of pattern behind them. The goal of this post is to come out of this at least 95% confident that the answer to that question, in at least one respect, is “no”.

First, we ask the question that is the title of this section: is π approximately uniform? That is, is there any one digit that occurs more frequently in pi than the others? We can ask this question graphically by asking about the distribution of the digits of pi. If all these digits are uniform, each digit should occur 1/10 of the time, since there are ten digits. So the distribution of the digits of pi should be this:

If we play the same game with the first 314 digits of pi, this is the distribution that results: (the graph below is interactive)

The fun thing, however, is we’re working with computers, so we can jack up the number of digits however high we want! This is a million:¹

*When I was making this, my computer didn’t throw a fit at all! Way to go, computer!*
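The tally behind charts like these is short enough to sketch. This is a minimal version, not the code used for the graphs: it assumes the digits arrive as a plain string, and the constant `PI_DIGITS` and the function `digit_frequencies` are names made up for the example. To stay self-contained it uses only the first 50 digits of π rather than the million-digit file.

```python
from collections import Counter

# First 50 digits of pi (the leading 3 plus 49 decimals), used here as a
# small stand-in for the million-digit file.
PI_DIGITS = "31415926535897932384626433832795028841971693993751"


def digit_frequencies(digits: str) -> dict[str, float]:
    """Return the share of each digit 0-9 in `digits`."""
    counts = Counter(ch for ch in digits if ch.isdigit())
    total = sum(counts.values())
    # Report all ten digits, including any that never appear.
    return {d: counts.get(d, 0) / total for d in "0123456789"}


frequencies = digit_frequencies(PI_DIGITS)
for digit, share in frequencies.items():
    print(f"{digit}: {share:.2%}")
```

With only 50 digits the shares are very lumpy, which is exactly why the bars only start to even out as the digit count grows.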

This looks pretty convincing; the bars appear to be of the same size, which is what we’re aiming for. However, we can do this even better by formalizing “appear”. And you know what that means!

It’s time for

The best statistical test here is a chi-squared test for goodness of fit².

So, how does it work?

II: A Crash Course in Statistical Tests

The basic format of any statistical test involves a null hypothesis ( $H_0$ ) and an alternative hypothesis ( $H_\mathrm{a}$ ). The null states that what we are trying to prove is false, and the alternative is the opposite, i.e., it states that what we are trying to prove is true. In the context of our digits example, if we were for some reason trying to prove the digits were biased in some fashion (which is usually the case in the real world—you almost never encounter a situation where you need to prove that a set of frequencies are the same), they would be written as follows:

\begin{cases}H_0\!\!: \text{The digits are of equal frequency.}\\ H_\mathrm{a}\!\!: \text{The digits are not of equal frequency.}\end{cases}

Actually, now that I’ve written it out, we’ll roll with it. Let’s see what happens if we try to prove that the digits aren’t of equal frequency.

The idea behind statistical tests is this: we assume the null is true, and then follow this idea to some logical conclusion, which is very unlikely to be true. At this point we can say the null itself is very unlikely to be true, and claim that there is evidence to support the alternative, which is the desired result.

What is this logical conclusion? This is where the p-value comes in. This is calculated by taking our data and calculating the probability that we would have gotten results this extreme or more entirely by chance given that the null is true. In context, this is the probability, assuming that the digits are of equal frequency, that the digits just happened to line up the way they are due to sheer coincidence.

If the p-value is low enough, we get to reject the null and claim that there is evidence to support the alternative. If it is not low enough, we cannot reject the null, and we typically keep assuming the null. This is like most legal systems: there, the null would be that the defendant is innocent of whatever crime they are accused of, and it is necessary to prove that they are not innocent, i.e., to reject the null, to convict them. Otherwise, they are presumed innocent. This is what “innocent until proven guilty” means.

In our case, the alternative we are trying to prove is that the digits don’t line up, so if π really is uniform, we should obtain a fairly large p-value. The typical threshold is 0.05, or one in twenty. If we don’t obtain a value lower than 0.05, we keep the null (i.e., they line up) by default.

What is the p-value of our test, you ask? Well, we’ll just have to run it, won’t we?

III: Yes, Pi in Fact Is Approximately Uniform

The exact manner in which the p-value is computed depends on the type of test we’re running. In this case, we’re using a chi-squared ( $\chi^2$ ) distribution. The exact shape of the distribution depends on a variable called the degrees of freedom, which is equal to the number of categories (digits, in our test) minus one. As there are ten digits in π (0 through 9), there are nine degrees of freedom, represented by a subscript $_9$ . The distribution, in full denoted by $\chi^2_9$ , ends up looking like this:
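For reference, the curve has a closed form. This is the standard density of the chi-squared distribution with $k$ degrees of freedom, where $\Gamma$ is the gamma function; the symbols $f$ and $k$ are introduced here just for illustration, with $k=9$ in our case:

f(x)=\frac{x^{k/2-1}e^{-x/2}}{2^{k/2}\,\Gamma(k/2)}\quad(x>0),\qquad\text{so for }k=9\text{:}\quad f(x)=\frac{x^{7/2}e^{-x/2}}{2^{9/2}\,\Gamma(9/2)}.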

So! What do we actually do with this distribution?

Nothing, yet. First we need to find a test statistic—basically a measure of how close our data is to the target distribution. This can be calculated using the formula

\chi^2=\sum\frac{(\text{observed}-\text{expected})^2}{\text{expected}},

where observed and expected are counts, and the sum runs over all categories (all ten digits). Doing this for our million digits, we find that $\chi^2 \approx 5.5114$ , or roughly five and a half.
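The formula is easy to try out. Here is a minimal sketch, assuming a uniform target so that every digit has the same expected count; the counts are copied from the frequency table in section IV, and the function name `chi_squared_statistic` is made up for the example.

```python
# Digit counts for the first million digits, copied from the frequency
# table in section IV of this post.
observed = [99959, 99757, 100026, 100230, 100230,
            100359, 99548, 99800, 99985, 100106]


def chi_squared_statistic(observed_counts):
    """Sum (observed - expected)^2 / expected, assuming a uniform target."""
    total = sum(observed_counts)
    expected = total / len(observed_counts)  # same expected count for every digit
    return sum((o - expected) ** 2 / expected for o in observed_counts)


statistic = chi_squared_statistic(observed)
print(f"chi-squared = {statistic:.4f}")
```

The result should land at roughly five and a half; small differences in the last decimals can come from exactly which million digits are counted.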

Calculating probabilities with probability density functions (PDFs) such as our chi-squared distribution is simple: find the area under the curve for the x-values you want. Calculating this area is done using an integral. This means that, for example, the full area under any PDF is guaranteed to be 1, since this calculation can be simplified to the probability that you get any value, which is of course 1.

Now, we need to put all of this together to find the p-value. To repeat, the p-value is the probability, assuming that the digits are of equal frequency, that they would stray this far (or farther) from equality by coincidence. Our $\chi^2_9$ PDF measures this, but with areas. So, the p-value is some area on our chi-squared distribution. What area? Well, our $\chi^2$ test statistic is a measure of how far our data is from the target distribution, so the probability density at a $\chi^2$ value of 5.5114 is the output of our distribution when interpreted as a function. And the probability of getting a value of 5.5114 or greater is the area on our chi-squared distribution to the right of the line $x=5.5114$ . For perspective, the x-axis on this distribution, repeated below, ranges from 0 to 25.

Some technical details: the $\chi^2$ distribution has no value over negative numbers, but has values for all positive real numbers, so the part of the graph with area under it ranges from 0 to $\infty$ ; it’s just that said area becomes really really small for larger values.
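If areas under curves feel abstract, the p-value can also be estimated by brute force: generate digits that really are uniform, compute $\chi^2$ for each batch, and see how often the result is at least as large as the observed one. The sketch below is a toy version under stated assumptions. It uses the first 50 digits of π instead of the full million (plain Python would take a long while on a million digits per batch), Python's built-in random generator stands in for "digits under the null", and all names in it are made up for the example.

```python
import random

DIGITS_PER_SAMPLE = 50
SIMULATIONS = 2000
# Counts of 0-9 among the first 50 digits of pi (3 plus 49 decimals).
OBSERVED = [1, 5, 5, 9, 4, 5, 4, 4, 5, 8]


def chi_squared_statistic(counts):
    """Chi-squared against a uniform target.

    Algebraically equal to sum((c - n/k)^2 / (n/k)); written with an integer
    numerator so that equal samples give bit-for-bit equal statistics.
    """
    k, n = len(counts), sum(counts)
    return sum((k * c - n) ** 2 for c in counts) / (k * n)


def simulate_counts(rng, n):
    """Tally n digits drawn uniformly at random, i.e. under the null."""
    counts = [0] * 10
    for _ in range(n):
        counts[rng.randrange(10)] += 1
    return counts


rng = random.Random(314)  # fixed seed so the run is repeatable
observed_stat = chi_squared_statistic(OBSERVED)
as_extreme = sum(
    chi_squared_statistic(simulate_counts(rng, DIGITS_PER_SAMPLE)) >= observed_stat
    for _ in range(SIMULATIONS)
)
p_value = as_extreme / SIMULATIONS
print(f"observed chi-squared = {observed_stat:.1f}, simulated p-value = {p_value:.3f}")
```

The simulated fraction plays the role of the area to the right of the observed statistic; it wobbles a little from seed to seed, which the area under the curve does not.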

Finding the area is done in practice using a calculator or computer. Unfortunately, when using distributions such as these on laptops such as mine, Desmos gets iffy. On to WolframAlpha!

What’s that, WolframAlpha?

*sigh*³

The moral of this story is: don’t use numbers above 1 million in computations involving probability distributions, because they’re cursed and your calculator and/or computer will hate you forever.

Anyway, Python didn’t let me down, and returned a p-value of about 0.79, meaning that we cannot reject the null, which if you remember is that the digits are of equal frequency. It only gave this verdict after about ten minutes, however.
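For the curious, here is one way such a computation might look. This is a sketch, not the script used above: it assumes the standard chi-squared density and approximates the area from 0 to the statistic with Simpson's rule, then subtracts it from 1 to get the right-hand area. The names `chi2_pdf` and `chi2_p_value` are made up for the example.

```python
import math


def chi2_pdf(x, df):
    """Chi-squared density; assumes df > 2, so the density is 0 at x = 0."""
    if x <= 0:
        return 0.0
    half = df / 2
    return x ** (half - 1) * math.exp(-x / 2) / (2 ** half * math.gamma(half))


def chi2_p_value(statistic, df, steps=10_000):
    """Area to the right of `statistic`, as 1 minus the area from 0 to it.

    The left-hand area is approximated with Simpson's rule.
    """
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of intervals
    h = statistic / steps
    total = chi2_pdf(0, df) + chi2_pdf(statistic, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * chi2_pdf(i * h, df)
    return 1 - total * h / 3


p_value = chi2_p_value(5.5114, df=9)
print(f"p-value = {p_value:.3f}")
```

For a statistic of 5.5114 on nine degrees of freedom, the result lands far above 0.05, so the verdict is the same: the null stands.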

*Protip: If your computer swears revenge on you, try restarting it.*

Now, you can’t discuss stats without a good discussion of misleading people with stats. On to the next section!

IV: How Can I Use This to Deceive People?

Let’s go back to our bar chart.

Here are the exact frequencies:

Digit	Count	Frequency
0	99959	9.9959%
1	99757	9.9757%
2	100026	10.0026%
3	100230	10.0230%
4	100230	10.0230%
5	100359	10.0359%
6	99548	9.9548%
7	99800	9.9800%
8	99985	9.9985%
9	100106	10.0106%

As you can see, there is some slight variation! This is good news, because we can blow up the y-axis and make it look like large variation:

However, the differences on this are so minuscule that the number of significant figures we have to use to make this convincingly misleading draws attention to the y-axis, which is precisely what we do not want to be doing. We can do better than this.⁴

Recall the chi-squared test we did earlier in this post to determine whether there was a significant difference in the distribution of the digits. Recall that the answer was “no”. Now, how can we change the result of this test while still keeping the reasoning mostly plausible?

Right on the mark! This is called p-hacking. P-hacking is a complex topic and a real problem in science, and it’s more suited to a separate post, since this one is already running a bit long. To cut it short, p-hacking is when you do various bad things to data, such as using small sample sizes, cherry-picking your favorite data, removing “outliers”⁵ subjectively, and being generally biased, in the hope of making your results significant (i.e., getting the p-value below the magical 0.05 figure).

Anyway, let’s see p-hacking for ourselves. I’m taking the same 1 million digits, but I’m taking 50 random samples of 50 digits each. The full results are here, but the key point is that 3 of these came out significant. Here is the most visible result, with a $\chi^2$ of 30.8:
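Three hits out of 50 is about what chance alone delivers: with a 0.05 threshold, roughly one test in twenty on perfectly uniform data comes out "significant". The sketch below replays the experiment under stated assumptions. It draws digits from Python's random generator rather than from the million digits of π, and instead of computing p-values it compares each statistic to 16.919, the standard table value that cuts off the top 5% of $\chi^2_9$ . All names are made up for the example.

```python
import random

SAMPLES = 50
DIGITS_PER_SAMPLE = 50
# Upper 5% critical value of chi-squared with 9 degrees of freedom, taken
# from standard tables: statistics above it have p < 0.05.
CRITICAL_VALUE = 16.919


def chi_squared_statistic(counts):
    """Sum (observed - expected)^2 / expected against a uniform target."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


rng = random.Random(2023)  # fixed seed so the run is repeatable
significant = 0
for _sample in range(SAMPLES):
    counts = [0] * 10
    for _digit in range(DIGITS_PER_SAMPLE):
        counts[rng.randrange(10)] += 1  # uniform digits: the null is true
    if chi_squared_statistic(counts) > CRITICAL_VALUE:
        significant += 1

print(f"{significant} of {SAMPLES} samples look 'significant'")
```

Run enough tests and a few will always clear the bar; the hack is reporting only those.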

*+10 alertness points if you noticed that the graph goes up to 0.5.*

The best part is drawing fallacious conclusions, though:

*Look, all I’M saying is that if 8 isn’t ∞ in disguise then why are they both greater than π? OPEN YOUR EYES*

It’s good to keep a watch on situations like these: just because a conclusion fits the data doesn’t mean it’s supported by them.

Sources/Footnotes

1
The digits come from http://piday.org/million.
2
When I first heard this term in Stats class, the word goodness felt wrong to me. If you’re thinking the same thing, know that goodness is absolutely a real word, and it’s just the linguists messing with our minds.
3
What WolframAlpha is saying here is “upgrade or I’m not computing this any farther”. Basically, it says that I’m not paying it enough to do integrals like the ones I gave it. Which is fair, because I’m not paying it at all, so…
4
Or worse, I guess, depending on whether you believe “misleading the public about the distribution of digits of pi” to be good or bad.
5
Generally speaking, outliers are anomalous data points. P-hackers will remove data that doesn’t fit their hypothesis by calling them outliers, even if they’re not.

2 responses to “Statistical Pi(e)”

eddy says:

November 3, 2023 at 2:42 pm

A fun read because our comments are funny. However, you take no time justifying the use of Chi Squared. You seem to presume we should just “blindly” use Chi Squared; because you said so? Perhaps this is beyond the scope of this blog but to me it is critically important, i.e., if you did a simulation using a Monte Carlo method, would results be the same as the Chi Squared.

- The Adder says:
  
  November 4, 2023 at 9:49 pm
  
  Hi! Glad you liked the post. In response to your comment, I picked χ^2 somewhat at random, because using all possible methods would result in an infeasibly long post; there’s nothing specific about it that makes it special. You mention Monte Carlo; the idea is that you would get a p-value that would be similar (but not the same–randomness and all that) to the p-value you’d get from χ^2.