In this video I want to talk about what is easily one of the
most fundamental and profound concepts in statistics and
maybe in all of mathematics.
And that's the central limit theorem.
And what it tells us is we could start off with any
distribution that has a well-defined mean and variance.
And if it has a well-defined variance, it has a well-defined
standard deviation.
And it can be a continuous distribution or a discrete one.
I'll draw a discrete one just because it's easier to
imagine at least for the purposes of this video.
So let's say I have a discrete probability
distribution function.
And I want to be very careful not to make it look anything
close to a normal distribution because I want to show you the
power of the central limit theorem.
So let's say I have a distribution.
Let's say it could take on values 1 through
6: 1, 2, 3, 4, 5, 6.
It's some kind of crazy dice.
You have a very high likelihood of getting a 1-- let
me make that a straight line-- let's say it's impossible
to get a 2, let's say there's an OK likelihood of getting
a 3 or a 4, let's say it's impossible to get a 5,
and let's say it's very likely to get a 6, like that.
So that's my probability distribution function.
If I were to draw a mean, this is symmetric, so maybe the mean
would be something like that.
The mean would be halfway.
So that would be my mean right there.
The standard deviation maybe would look-- it would be
that far above and that far below the mean.
But that's my discrete probability
distribution function.
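If you wanted to play with a distribution like this in code, here's a minimal sketch in Python with NumPy. The exact probabilities are my own made-up numbers-- they're only meant to match the shape I drew: tall at 1 and 6, zero at 2 and 5, something in the middle at 3 and 4.

```python
import numpy as np

# A made-up "crazy dice" distribution matching the shape described above.
# The exact probabilities are illustrative assumptions, not values from the video.
values = np.array([1, 2, 3, 4, 5, 6])
probs = np.array([0.35, 0.0, 0.15, 0.15, 0.0, 0.35])  # sums to 1; 2 and 5 are impossible

rng = np.random.default_rng(0)

def draw(n):
    """Draw n independent values from the crazy dice distribution."""
    return rng.choice(values, size=n, p=probs)
```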
Now what I'm going to do here, instead of just looking at
individual samples of this random variable that's described
by this probability distribution function, is take samples of it,
average each sample, and then look at the frequency
of the averages that I get.
And when I say average I mean the mean.
So let's say-- and let me define something-- let's say my
sample size, and I could put any number here, but let's say
first off we try a sample size of n is equal to 4.
And what that means is I'm going to take 4
samples from this.
So let's say the first time I take 4 samples.
So my sample size is 4.
Let's say I get a 1, let's say I get another 1, let's say
I get a 3, and I get a 6.
So that right there is my first sample of sample size 4.
I know the terminology can get confusing because this is a
sample that's made up of 4 samples.
But when we talk about the sample mean and the sampling
distribution of the sample mean which we're going to talk more
and more about over the next few videos, normally the sample
refers to the set of samples from your distribution.
And the sample size tells you how many you actually took
from your distribution.
But the terminology can be very confusing because you can
easily view one of these as a sample.
But we're taking 4 samples from here.
We have a sample size of 4.
And what I'm going to do is I'm going to average them.
So let's say the mean-- I'm going to be very careful when I
say average-- the mean of this first sample of size 4 is what?
1 plus 1 is 2.
2 plus 3 is 5.
5 plus 6 is 11.
11 divided by 4 is 2.75.
That is my first sample mean for my first sample of size 4.
Let me do another one.
My second sample of size 4.
Let's say that I get a 3, a 4, let's say I get another 3,
and let's say I get a 1.
I just didn't happen to get a 6 that time.
And notice I can't get a 2 or a 5.
That's impossible for this distribution.
The chance of getting a 2 or a 5 is zero.
So I can't have any 2's or 5's over here.
So for this second sample of sample size 4, my sample mean--
so my second sample mean is going to be 3 plus 4 is 7.
7 plus 3 is 10 plus 1 is 11.
11 divided by 4 once again is 2.75.
Let me do one more because I really want to make it clear
what we're doing here.
So I do one more-- actually we're going to do a gazillion
more, but let me just do one more in detail.
So for my third sample of sample size 4-- so I'm
going to literally take 4 samples.
So my sample is made up of 4 samples from this original
crazy distribution.
Let's say I get a 1, a 1, a 6 and a 6.
And so my third sample mean is going to be 1 plus 1 is 2.
2 plus 6 is 8.
8 plus 6 is 14.
14 divided by 4 is 3.5.
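In code, taking one of these samples of size 4 and computing its sample mean might look like the short sketch below, reusing the draw helper from the earlier snippet.

```python
n = 4                        # sample size
sample = draw(n)             # one sample of size 4, e.g. something like [1, 1, 3, 6]
sample_mean = sample.mean()  # for [1, 1, 3, 6] this would be 11 / 4 = 2.75
print(sample, sample_mean)
```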
And as I find each of these sample means-- so for each of
my samples of sample size 4 I figure out a mean-- and as I do
each of them I'm going to plot it on a frequency distribution.
And this is all going to amaze you in a few seconds.
So I plot this all on a frequency distribution.
So I say, OK, on my first sample my first
sample mean was 2.75.
So I'm plotting the actual frequency of the sample means
I get for each sample.
So 2.75, I got it one time.
So I'll put a little plot there.
So that's from that one right there.
And the next time I also got a 2.75.
That's a 2.75 there.
So I got it twice.
So I'll plot the frequency right there.
Then I got a 3.5.
Among all the possible values, I could have a 3, I could have
a 3.25, I could have a 3.5.
So since I got the 3.5, I'll plot it right there.
And what I'm going to do is I'm going to keep
taking these samples.
Maybe I'll take 10,000 of them.
So I'm going to keep taking these samples.
So I go all the way to sample number 10,000.
I just do a bunch of these.
And what it's going to look like over time is-- for each of these
I'm going to make a dot, because I'm going to have to zoom out.
So if I look at it like this, over time it still has all
the values that it might be able to take on.
You know, 2.75 might be here.
So this first dot right here is going
to go right there, that second one is going to go right
there, and then that one at 3.5 is going to go right there.
But I'm going to do it 10,000 times, so I'm
going to have 10,000 sample means.
And let's say as I do it, I'm going to just
keep plotting them.
I'm just going to keep plotting the frequencies.
I'm just going to keep plotting them over and
over and over again.
And what you're going to see is, as I take many, many
samples of size 4, I'm going to have something that's going
to start kind of approximating a normal distribution.
So each of these dots represents an incidence of a sample mean.
As I keep adding on to this column right here, that means
I kept getting the sample mean 2.75.
So over time I'm going to have something that's starting to
approximate a normal distribution.
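As a rough sketch, "take 10,000 samples of size 4, average each one, and look at the frequencies" might look like this in code; the histogram of those 10,000 sample means is exactly the frequency plot I've been building dot by dot.

```python
import matplotlib.pyplot as plt

n = 4
num_samples = 10_000

# One sample mean per iteration: draw n values, average them, repeat 10,000 times.
sample_means = np.array([draw(n).mean() for _ in range(num_samples)])

# Frequency plot of the sample means -- this is what starts to look normal.
plt.hist(sample_means, bins=50)
plt.xlabel("sample mean (n = 4)")
plt.ylabel("frequency")
plt.show()
```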
And that is a neat thing about the central limit theorem.
So the central limit theorem-- and this was the case for-- so in
orange, that's the case for n equal to 4.
This was for a sample size of 4.
Now let's say I did the same thing with a sample size of maybe 20.
So in this case, instead of just taking 4 samples from my
original crazy distribution, for every sample I take 20
instances of my random variable, I average those 20, and then
I plot the sample mean on here.
So in that case, I'm going to have a distribution
that looks like this.
And we'll discuss this in more videos.
But it turns out if I were to plot 10,000 of the sample means
here, I'm going to have something that-- two things:
it's going to even more closely approximate a normal
distribution.
And we're going to see in future videos it's actually
going to have a smaller-- well, let me be clear-- it's going
to have the same mean.
So that's the mean.
This is going to have the same mean.
It's going to have a smaller standard deviation.
So I should plot these from the bottom, because
you kind of stack them: you get one instance,
then another instance, then another instance.
But this is going to more and more approach a
normal distribution.
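A quick numerical check of those two claims-- same mean, smaller standard deviation-- might look something like this, again reusing the draw helper from earlier.

```python
for n in (4, 20):
    means = np.array([draw(n).mean() for _ in range(10_000)])
    print(n, means.mean(), means.std())
# The average of the sample means stays near 3.5 for both sample sizes,
# but the spread of the sample means is noticeably smaller for n = 20.
```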
So the reality is-- and this is what's super cool about the
central limit theorem-- as your sample size becomes larger,
or you could even say as it approaches infinity, this
approaches a normal distribution.
But you really don't have to get that close to infinity to get
close to a normal distribution.
Even if you have a sample size of 10 or 20, you're already
getting very close to a normal distribution.
In fact, about as good an approximation as we see
in our everyday life.
But what's cool is we can start with some crazy
distribution, right?
This has nothing to do with a normal distribution.
But if we have a sample size of n equals 10 or n equals
100-- this was n equals 4-- we would take 100 of these
instead of 4 here, average them, and then plot the
frequency of that average.
Then we take 100 again, average them, and plot that again.
And if we were to do that a bunch of times-- in fact, if we
were to do that an infinite number of times, especially if
we had an infinite sample size-- we would find a
perfect normal distribution.
That's the crazy thing.
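To see how good the approximation gets, here's a sketch that overlays a normal curve-- with the same mean and standard deviation as the simulated sample means-- on top of the histogram for n equals 100; the two should nearly coincide.

```python
from scipy.stats import norm

n = 100
means = np.array([draw(n).mean() for _ in range(10_000)])

plt.hist(means, bins=50, density=True)              # simulated sampling distribution
xs = np.linspace(means.min(), means.max(), 200)
plt.plot(xs, norm.pdf(xs, loc=means.mean(), scale=means.std()))  # matching normal curve
plt.show()
```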
And it doesn't apply just to taking the sample mean.
Here we took the sample mean every time but you could have
also taken the sample sum.
The central limit theorem would have still applied.
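As the same kind of sketch, swapping the mean for the sum in the simulation gives a frequency plot with the same bell shape, just shifted and stretched.

```python
n = 20
sums = np.array([draw(n).sum() for _ in range(10_000)])
plt.hist(sums, bins=50)  # also roughly bell-shaped, centered near n * 3.5 = 70
plt.show()
```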
But that's what's so super useful about it.
Because in life there's all sorts of processes out there,
proteins bumping into each other, people doing crazy
things, humans interacting in weird ways.
And you don't know the probability distribution
functions for any of those things.
But what the central limit theorem tells us is that if we
add a bunch of those actions together, assuming that they
all have the same distribution, or if we were to take the mean
of all of those actions together, and if we were to plot
the frequency of those means, we do get a normal
distribution.
And that's frankly why the normal distribution shows up so
much in statistics, and why it's a very good
approximation for the sum or the mean of a lot
of processes.
Normal distribution.
What I'm going to show you in the next video is that this
is a reality: as you increase your sample size, as you increase
your n, and as you take a lot of sample means, you're going
to get a frequency plot that looks very, very close to
a normal distribution.