This video offers a brilliant geometric perspective that turns abstract statistical concepts into an intuitive visual relationship between estimation and noise. It provides a rare level of clarity by using the Pythagorean theorem to explain why prediction intervals are inherently wider than confidence intervals.
The Differences Between Estimation vs Prediction, and Confidence vs Prediction Intervals
#maths #statistics #probability #machinelearning
00:00 - 00:40 Introduction
00:40 - 04:05 Problem setup &
04:05 - 05:23 Recap of L2 geometry
05:23 - 10:14 The geometry of the sample mean
10:14 - 14:31 The geometry of prediction
14:31 - 16:10 General picture: estimation vs prediction
16:10 - 25:02 Confidence and prediction intervals
25:02 - 26:15 Remarks on deviation from normality
Prerequisite video: https://youtu.be/h4QF-2YiM88
Mathematical Statistics playlist: https://youtube.com/playlist?list=PL1dqPc_qxc0cl0vc7FWux68cvIlYjdo-a&si=ETbAS3atTZnyRl3c
Hello, in this video we will give a geometric account of confidence intervals versus prediction intervals and, more broadly, the difference between estimation and prediction in statistics.
The first general treatment of confidence intervals was given by the Polish-American mathematician Jerzy Neyman. We will show that there is a Pythagorean relation between the error in the estimation, the noise in the data, and the error in the prediction. Let me know in the comment section whether this geometric relation is new to you and what other concepts in probability and statistics you would like to see visualized. Let's go.
First we will set up an example problem, which is the focus of the video, but we will come to the general problem of estimation versus prediction later. Let's consider a population with finite mean mu and variance sigma squared. The population is assumed to be infinite. It can literally be infinite, or it might be finite but we allow repeated sampling. We assume that the variance sigma squared is known. The mean mu is a constant but unknown. And these are our goals: from a random sample drawn from P, we would like to estimate mu, predict a new observation, say X new, and evaluate the accuracy of the solutions to both problems. And here's a natural idea. We take the sample mean as an estimate of the population mean, and the estimated population mean is in turn taken as the prediction for the new observation.
The solutions to these problems are given as formulas in every introductory statistics textbook. First we have the so-called 95% confidence interval: Xbar plus or minus 1.96 times sigma over the square root of n. Next, we have the 95% prediction interval: Xbar plus or minus 1.96 times sigma times the square root of 1 plus 1 over n. This immediately raises a few questions. What is the interpretation of these formulas? Why the plus one in the prediction interval? And lastly, what are the underlying assumptions of these results? The answers are all quite subtle. And here's the TL;DW for the second question. The additional plus one in the prediction interval reflects the fact that the new observation carries its own noise, whereas the confidence interval is for estimating a fixed quantity, namely the population mean. Therefore prediction is intrinsically less accurate than estimation. And that's a general fact that holds true beyond this simple problem. We shall see how geometry helps us obtain those answers.
Before we do so, let's set up our model and some notation. Let D denote x1, x2, all the way to xn. This is the data set. We may regard our data points as realizations of random variables capital X1, capital X2, all the way to capital Xn, which are assumed to be independent and identically distributed following the distribution law P. Recall that P is the distribution of the population with mean mu and variance sigma squared. And now let's take both mu hat and X new hat to be the sample mean Xbar, which is of course defined as 1 over n times the sum of X i for i from 1 to n. So the sample mean is both our estimate for the true mean and the prediction of the new observation. To evaluate this method, let's use the mean squared error criterion. The MSE of the estimation is defined as the expected value of (mu hat minus mu) squared.
The MSE for prediction is defined as the expected value of (X new hat minus X new) squared. So this is our model. We will first work with a general probability distribution P and then specialize P to be the normal distribution. The unifying geometric viewpoint of our analysis is that random variables can be seen as vectors. And here's the prerequisite video. You can find it in the description, in the Mathematical Statistics playlist, or through the link in the right corner. This video also synergizes with an earlier video on the bias-variance decomposition. This video does not depend on that one, but can be seen as a special case of it.
And here's a brief recap of the prerequisite video: the mean-variance right triangle. The set L2 of square-integrable random variables forms a vector space. And here's a general vector X in L2. The mean and variance of random variables are obtained through orthogonal projections. Take the subspace of L2 spanned by the constant random variable one. This is the set of constant or degenerate random variables. The mean or the expectation of X is the orthogonal projection of X onto the span of one. And the component of X that is orthogonal to the span of one is called the residual or the noise.
Its length is the standard deviation of X, or equivalently its squared length is the variance of X, and that's the geometry of the population mean and population variance. But what's the geometry of the sample mean Xbar? Well, the sample mean can be interpreted as the centroid of a polygon. Let's elaborate. First, let's look at a simple case where n equals 2. So we have two vectors X1 and X2. The average Xbar is simply the midpoint of the line segment that connects the tip of X1 to the tip of X2. When n equals 3, the tips of X1, X2, X3 form a triangle. Xbar is the average of the three vectors, so its tip is exactly the centroid of the triangle, which we know from high school geometry is the intersection of the three medians of the triangle. And here's a general diagram.
Suppose we have n data points X1 all the way to Xn. Connecting their tips gives us a polygon with n vertices, and Xbar is the centroid. We remark that this interpretation works for any random variables X1 to Xn. They need not be independent or follow the same distribution. Their average Xbar is the centroid of a polygon. And now let's investigate the properties of Xbar as well as their geometric interpretations.
First we will show that Xbar is unbiased. Let's try to calculate the mean of Xbar. Remember taking expectation is linear because orthogonal projection is a linear transformation.
So the expected value of Xbar is just the arithmetic mean of the expectation of each term which turns out to be mu.
And this is what we mean by saying that Xbar is an unbiased estimator for mu. On average the estimator equals the true value. That is to say, there is no systematic error. But what is the geometric meaning of this fact? This is our vector space, with all the data points visualized as vectors. The tips of the vectors form a polygon and Xbar is the centroid.
Projecting any of the sample points, as well as Xbar, onto the span of one gives us mu, like so. So that means under our assumptions all the data points, as well as Xbar, live in an affine space, a hyperplane orthogonal to the span of one. This hyperplane is not a vector space unless it passes through the origin, i.e. mu equals 0. So we denote this affine space as mu plus one-perp, where one-perp is the orthogonal complement of the span of one.
Next let's calculate the standard error.
The standard error is simply another name for the standard deviation of Xbar.
And by the geometry of random variables, the standard error is the length of the vector Xbar minus mu. So naturally we investigate the vector Xbar minus mu. This is nothing but the centered version of Xbar, which equals 1 over n times the sum of X i minus mu, where i ranges from 1 to n.
Denote epsilon i equals X i minus mu as the error or noise. This way Xbar minus mu is just the average of all the noise terms. Thus we need the properties of the epsilon i's. First, the epsilon i's all live in the orthogonal complement of one. We already know this from the mean-variance right triangle. What we shall show next is that the set of vectors epsilon i over sigma is orthonormal. Let's calculate the inner product between epsilon i and epsilon j. By definition, this is the expected value of epsilon i times epsilon j. If i does not equal j, then the inner product vanishes, since X i and X j are independent and the expected value of X i minus mu is zero. If i and j are equal, the inner product of epsilon i with itself is of course the variance of X i, that is, sigma squared. So we conclude that epsilon 1, epsilon 2, all the way to epsilon n are pairwise orthogonal and each vector has the same length sigma, and therefore the set of vectors epsilon i over sigma is orthonormal.
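As a rough numerical check of the orthogonality claim, here is a minimal Python sketch that estimates the inner products E[epsilon i times epsilon j] by simulation. The function name, the values n = 3, mu = 2 and sigma = 1.5, and the choice of a normal population are all illustrative assumptions, not taken from the video; the argument itself only needs independence and a common variance.

```python
import random

def noise_gram_matrix(n=3, mu=2.0, sigma=1.5, trials=50_000, seed=0):
    """Monte Carlo estimate of the inner products <eps_i, eps_j> = E[eps_i * eps_j]."""
    rng = random.Random(seed)
    gram = [[0.0] * n for _ in range(n)]
    for _ in range(trials):
        # Independent draws; the normal law is only a convenient assumption here.
        eps = [rng.gauss(mu, sigma) - mu for _ in range(n)]
        for i in range(n):
            for j in range(n):
                gram[i][j] += eps[i] * eps[j]
    return [[g / trials for g in row] for row in gram]

gram = noise_gram_matrix()
for row in gram:
    print([round(g, 2) for g in row])
# Diagonal entries should be close to sigma**2 = 2.25,
# off-diagonal entries should be close to 0.
```

Dividing each noise vector by sigma rescales the diagonal to roughly 1, which is the orthonormality statement.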
Therefore, by the Pythagorean identity, the squared norm of the sum of all the epsilon i's is the sum of the squared norms of each term, which is just n sigma squared. Hence the length of Xbar minus mu, which is the average of all the noise terms, is 1 over n times the square root of n times sigma, which is sigma over the square root of n. And this is a visualization for n equals 3. Here we have three mutually orthogonal noise vectors, and their average, the green vector, has the length of a blue vector divided by the square root of three. So in summary, the expected value of Xbar is mu, i.e. Xbar is an unbiased estimator for mu, and the standard error is sigma over the square root of n, and both formulas have nice geometric interpretations.
So that was the geometry of Xbar, and now let's look at a separate issue, which is the geometry of prediction. After that we will combine the geometry of Xbar and the geometry of prediction to form a full picture. By our assumptions X1, X2, all the way to Xn, as well as the new observation X new, are iid random variables following the distribution law P. All the random variables except the last one form a random sample D. The problem of prediction is to define a function of D to approximate X new. So let's define a few more vector spaces. Denote by H the vector space of functions G of the arguments D and X new such that G is in L2. G can potentially depend on both the training data and the new observation.
And we denote HD to be the set of functions G of D such that G is in L2.
Of course, HD is a vector subspace of H.
It consists of functions of the training data only. So now we have three subspaces of L2 that are nested. H contains HD and HD contains the subspace one. Remember the subspace one is the set of constant or degenerate random variables. Such random variables can be treated as constant functions of D.
Therefore, one is a subspace of HD. And here comes a key relation, namely the orthogonal decomposition of the new observation X new. We write X new as mu plus epsilon new. Mu of course is a constant, so it lives in the subspace one.
But we claim that the new noise epsilon new is not only orthogonal to the subspace one. It is also orthogonal to HD. Here is a proof. For any G in HD, we take the inner product between epsilon new and G. Because epsilon new, the new noise, is assumed to be independent of the training data set, epsilon new is independent of G of D. So the expectation factorizes, and of course epsilon new has zero mean. So the inner product is zero. This is true for every G in HD, and therefore epsilon new is orthogonal to the whole of HD. And this diagram right here is the geometry of prediction. Any function that predicts X new lives in the subspace HD. And the noise in X new is orthogonal to HD as long as the new noise and the training data set are independent.
So now let's add Xbar to the picture. After all, we're using Xbar not only as an estimate for mu, but also as a predictor for X new. Xbar lives in HD, as argued. And we just calculated the mean and the standard error of Xbar. And lastly, the prediction error is the L2 distance between Xbar and X new. So here we're adding Xbar to the picture. Xbar lives in HD. Its mean, i.e. its projection onto the subspace one, is mu, and its residual has length sigma over the square root of n. The prediction error P is the distance between the tips of the vectors X new and Xbar, the purple line. Since epsilon new is orthogonal to HD, epsilon new is orthogonal to the residual of Xbar. And therefore the triangle with sides P, epsilon new, and the residual of Xbar, of length sigma over the square root of n, is a right triangle.
From that we can immediately obtain the length of P by the Pythagorean identity.
Once again, P is the square root of sigma squared plus sigma squared over n. Here the first sigma squared of course refers to the squared length of epsilon new. And that's why there's an extra plus one in the formula for the prediction interval. It is to account for the length of epsilon new.
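The two squared lengths in this right triangle can be checked by simulation. The following is a minimal Python sketch under my own assumptions: the population is exponential with rate 1 (so mu = 1 and sigma squared = 1), chosen only to underline that this step does not need normality, and n = 5. None of these values come from the video.

```python
import random

def mse_of_sample_mean(n=5, trials=50_000, seed=1):
    """Monte Carlo estimates of E[(Xbar - mu)^2] and E[(Xbar - X_new)^2]."""
    rng = random.Random(seed)
    mu = 1.0  # mean of the exponential law with rate 1 (assumed population)
    est_sq = 0.0
    pred_sq = 0.0
    for _ in range(trials):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        xbar = sum(sample) / n
        x_new = rng.expovariate(1.0)  # independent new observation
        est_sq += (xbar - mu) ** 2
        pred_sq += (xbar - x_new) ** 2
    return est_sq / trials, pred_sq / trials

mse_est, mse_pred = mse_of_sample_mean()
print("estimation MSE:", mse_est, "theory sigma^2/n:", 1 / 5)
print("prediction MSE:", mse_pred, "theory sigma^2*(1 + 1/n):", 1 + 1 / 5)
```

The gap between the two simulated values should be close to sigma squared = 1, the squared length of epsilon new.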
And that's the full picture: the difference between estimating mu and predicting X new. The business of estimating mu takes place only in the subspace HD. But the business of predicting X new lives in the higher-dimensional space H.
As a side note, let's look at the general picture of prediction versus estimation, in which the estimator or the prediction function is not necessarily Xbar. First, we still have the ambient space H, and this is the subspace HD. Here is a new observation X new. Projecting it onto the subspace one produces its mean mu, and the residual has length sigma, known as the irreducible error. Next, we take an arbitrary estimator mu hat for mu, which is also the predictor for X new. Projecting it onto the subspace one gives us the expected value of mu hat. Its residual lies in HD and its length is the standard error, which is the square root of the variance of the estimator or predictor. In general, mu hat might be biased, and the difference between the expected value of mu hat and mu is known as the bias. The distance between X new and X new hat is the prediction error. The distance between mu hat and mu is the estimation error. The Pythagorean identity for the prediction error has three terms: the MSE of prediction equals bias squared plus the variance plus sigma squared. The estimation error, which lives in HD, has only two components: the MSE of estimation is simply bias squared plus variance. So there are two Pythagorean relations going on, and overall the prediction error squared is the estimation error squared plus sigma squared. That's the general picture that illustrates the difference between prediction and estimation.
And now we wish to construct confidence intervals for our estimates and prediction intervals for our predictions. For that we need additional distributional assumptions.
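To see the three-term decomposition with a genuinely biased estimator, here is a small Python sketch. The shrunken mean 0.8 times Xbar is a hypothetical estimator picked purely for illustration, and the values n = 4, mu = 5, sigma = 2 and the normal population are likewise assumptions of mine, not from the video.

```python
import random

def decompose(n=4, mu=5.0, sigma=2.0, shrink=0.8, trials=50_000, seed=2):
    """Simulate bias, variance, estimation MSE and prediction MSE of shrink * Xbar."""
    rng = random.Random(seed)
    estimates = []
    est_sq = 0.0
    pred_sq = 0.0
    for _ in range(trials):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        mu_hat = shrink * xbar          # deliberately biased (hypothetical estimator)
        x_new = rng.gauss(mu, sigma)    # independent new observation
        estimates.append(mu_hat)
        est_sq += (mu_hat - mu) ** 2
        pred_sq += (mu_hat - x_new) ** 2
    mean_hat = sum(estimates) / trials
    bias = mean_hat - mu
    variance = sum((m - mean_hat) ** 2 for m in estimates) / trials
    return bias, variance, est_sq / trials, pred_sq / trials

bias, variance, mse_est, mse_pred = decompose()
print("bias^2 + variance          :", bias**2 + variance)
print("estimation MSE             :", mse_est)
print("estimation MSE + sigma^2   :", mse_est + 2.0**2)
print("prediction MSE             :", mse_pred)
```

The first two printed values agree up to rounding, because that identity holds exactly for the simulated draws. The last two agree only approximately, since the orthogonality between epsilon new and HD holds in expectation rather than draw by draw.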
Let's look at the special case where P is the normal distribution with mean mu and variance sigma squared. So now we suppose that all the random variables modeling observations, both in the training data set and the new observation, are iid, each normal with mean mu and variance sigma squared. From this assumption, we can easily derive the sampling distribution of Xbar as well as that of the difference between Xbar and X new.
This is because linear combinations of independent Gaussians are still Gaussian.
So Xbar follows a normal distribution, and X new minus Xbar also follows a normal distribution. Normal distributions are fully characterized by their means and variances, but we just calculated those.
So we have everything we need. These are the mean and variance of Xbar. The expected value of X new minus Xbar is zero, so the variance of X new minus Xbar is the squared length of X new minus Xbar.
That's sigma squared times the quantity 1 plus 1 over n. So Xbar is normal with mean mu and variance sigma squared over n, and X new minus Xbar is normal with mean 0 and variance sigma squared times the quantity 1 plus 1 over n. We can then standardize the normal distributions, that is, subtract the mean and divide by the standard deviation, so that each becomes a standard normal distribution, and we can turn this into probability statements. Let alpha be a number between 0 and 1 and let z half alpha be the critical value of the standard normal. For example, if alpha is 5%, then 1 minus alpha is 95% and half alpha is 2.5%.
z half alpha is then the famous value 1.96.
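For readers who want to reproduce the critical value, a short Python sketch using the standard library is given below. The helper name z_critical is my own; it assumes the convention that z half alpha cuts off probability alpha over 2 in the upper tail of the standard normal.

```python
from statistics import NormalDist

def z_critical(alpha):
    """Upper alpha/2 critical value of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - alpha / 2)

print(round(z_critical(0.05), 2))  # 1.96
```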
So we have the probability that Xbar minus mu over the standard error is between minus z half alpha and z half alpha, which is 1 minus alpha. And similarly, the probability that X new minus Xbar over the prediction error is between minus z half alpha and z half alpha is also 1 minus alpha. These two probability statements both follow from the assumption of a normal distribution. And then we're just going to manipulate the inequality inside the bracket: Xbar minus z half alpha times the standard error is less than or equal to mu, which is less than or equal to Xbar plus z half alpha times the standard error. This seems like a rather pointless step, but there's a purpose to it. The unknown parameter mu is bounded by two expressions that depend on the random sample only. And the probability statement as a whole is true for all mu. That's the key here, and we shall see in a second why this is important. But for now we just manipulate the inequality in the second probability statement, and we get that X new is between Xbar plus or minus z half alpha times the prediction error.
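The two probability statements were shown on screen and do not appear in the transcript. A plausible reconstruction from the spoken description, written in LaTeX and numbered the way the video refers to them, is the following; the exact on-screen notation is an assumption.

```latex
\begin{align}
P\!\left(\bar X - z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \;\le\; \mu \;\le\; \bar X + z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}\right) &= 1-\alpha, \tag{1}\\
P\!\left(\bar X - z_{\alpha/2}\,\sigma\sqrt{1+\tfrac{1}{n}} \;\le\; X_{\text{new}} \;\le\; \bar X + z_{\alpha/2}\,\sigma\sqrt{1+\tfrac{1}{n}}\right) &= 1-\alpha. \tag{2}
\end{align}
```

Here sigma over the square root of n is the standard error and sigma times the square root of 1 plus 1 over n is the prediction error, both as computed earlier from the right triangles.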
This is also true for any mu. We shall see the confidence interval follows from equation one and the prediction interval follows from equation two. First the confidence interval. Let's give the general definition.
Suppose D, the random sample, follows a distribution P theta with a fixed but unknown parameter theta in some set big theta of possible values for theta. An interval L to U is called a 1 minus alpha times 100% confidence interval for theta if, for every theta in big theta, the probability of the interval containing theta is at least 1 minus alpha. Here the interval end points can depend only on D. And the inequality says it doesn't matter what the true value of theta is.
The probability of capturing the true parameter is no less than 1 minus alpha.
And here's the result. What we just derived fits the definition exactly. So we conclude that Xbar plus or minus z half alpha times sigma over the square root of n is a 1 minus alpha times 100% confidence interval for mu. And here's a visual interpretation of our formula. Imagine we have these repeated experiments, numbered 1 to 20. The gray dots represent the sample points in each sample. The red and blue line segments represent the resulting confidence intervals calculated from our formula. As we can see, sometimes the calculated confidence interval contains mu and sometimes it doesn't. We call the cases that do a hit and the cases that don't a miss. The probability of a hit is the confidence level. Or, if we're being super careful, the confidence level is the minimum or the infimum of the hitting probability, where the minimum or infimum is taken over all possible values of the unknown parameter. But in my personal experience, confidence intervals still cause a lot of confusion in practice.
For example, there are a lot of misconceptions about statements such as this one. It says, "Our study shows that the average hours of sleep per day among students in college X is 6.3 hours, with a 95% confidence interval of 5.9 to 6.7."
A common wrong interpretation says that the interval 5.9 to 6.7 has a 95% chance of capturing the true mean, and this is not correct. The reason for such common confusion, I suspect, is that there's very little relation between the confidence level, the 95%, and the actual reported confidence interval.
But somehow people try to connect the two. So here we give a correct interpretation of a statement like this in two layers. First, regarding the 95% confidence level: this study uses an interval estimator whose long-term frequency of capturing the true mean is at least 95%, if repeated sampling under the same experimental conditions were to be performed and the underlying probabilistic assumptions hold. So the 95% confidence level is all about the method, the interval estimator, not the numbers 5.9 to 6.7.
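The long-run frequency reading can be made concrete with a small simulation. This Python sketch repeats the experiment many times and counts hits; the function name and the values mu = 6.3, sigma = 1 and n = 10 are hypothetical and were picked only to echo the sleep example, and the population is assumed normal with known sigma.

```python
import math
import random
from statistics import NormalDist

def ci_coverage(n=10, mu=6.3, sigma=1.0, alpha=0.05, trials=20_000, seed=3):
    """Fraction of repeated samples whose known-sigma interval contains mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sigma / math.sqrt(n)  # z times the standard error
    hits = 0
    for _ in range(trials):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / trials

coverage = ci_coverage()
print("empirical coverage of the 95% confidence interval:", coverage)
```

Each individual interval either contains mu or it does not; the 95% describes the hit rate of the procedure across repetitions, which is what the simulation estimates.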
And regarding the reported confidence interval, the correct interpretation is: given the data observed for this study, the interval estimator returns the values 5.9 to 6.7. That's the complete and accurate interpretation. In the same spirit as how we handled confidence intervals, we can now define, calculate, and interpret prediction intervals. Here's the formal definition.
Suppose the random sample D and the new observation follow a joint distribution P theta for some unknown theta.
An interval L star to U star is called a 1 minus alpha times 100% prediction interval for X new if, for every theta in big theta, the probability of the interval L star to U star containing the new observation X new is at least 1 minus alpha. Here the end points L star and U star are once again functions of the random sample only. So by this definition and the equation two we had earlier, a 1 minus alpha times 100% prediction interval for X new is Xbar plus or minus z half alpha times the prediction error. And here's an interpretation of an example statement based on our study. We predict the hours of sleep per day of a student in college X to be 6.3 hours, with a 95% prediction interval of 5.3 to 7.3.
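The coverage property in this definition can also be checked numerically. The sketch below is a minimal Python simulation under assumed values (normal population, mu = 6.3, sigma = 1, n = 10, none of which come from the video): for each repetition it draws a sample, builds the interval from the sample alone, then draws a fresh observation and records whether it falls inside.

```python
import math
import random
from statistics import NormalDist

def pi_coverage(n=10, mu=6.3, sigma=1.0, alpha=0.05, trials=20_000, seed=4):
    """Fraction of repetitions in which the interval captures a new observation."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sigma * math.sqrt(1 + 1 / n)  # z times the prediction error
    hits = 0
    for _ in range(trials):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        x_new = rng.gauss(mu, sigma)  # not used to build the interval
        if xbar - half_width <= x_new <= xbar + half_width:
            hits += 1
    return hits / trials

pred_coverage = pi_coverage()
print("empirical coverage of the 95% prediction interval:", pred_coverage)
```

Dropping the plus one, that is, using sigma over the square root of n as the half-width factor, would make the hit rate fall well below 95%, because the interval would ignore the noise in X new.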
Once again, there are two layers of meaning regarding the 95% probability.
This study uses an interval estimator whose long-term frequency of capturing an unobserved data point is at least 95%, if repeated sampling under the same experimental conditions were to be performed and the underlying probabilistic assumptions hold. Given the data observed for this study, the interval estimator returns the values 5.3 to 7.3 hours.
And one final word regarding the probabilistic assumptions behind our derivation of the confidence and prediction intervals. If the underlying distribution P is no longer normal, what happens next? Well, if the sample size n is large, Xbar is still approximately normal by the central limit theorem. However, because X new, a single observation, is not normal, X new minus Xbar is now a combination of a non-normal and an approximately normal random variable. So its distribution is now unclear. This implies that the confidence interval, which depends only on the first pivotal quantity, is still approximately valid, whereas the prediction interval, which is based on the second pivotal quantity we have here, is no longer valid. So this is another reason why prediction is harder than estimation. The formulas for prediction, for example the prediction intervals here, are more sensitive to violations of the underlying distributional assumptions. That's it for today's video. If you find the content helpful, please consider liking and subscribing. And I'll see you next time.











