Climate of the Earth system

Prof. Dr. Markus Meier
Leibniz Institute for Baltic Sea Research Warnemünde (IOW)
E-Mail: markus.meier@io-warnemuende.de

Probability density and distribution#

  1. Probability density function and important parameters

  2. Different probability distributions

Probability density function and important parameters#

Probability density function#

  • let \(X\) be a continuous (not discrete!) random variable that takes values \(x\) in \(\Omega\), for example temperature. The probability density function (pdf) \(f_X(x)\) of an event X (e.g. T=10°C) is defined as a continuous function on \(\mathbb{R}\) with the following three properties:

(24)#\[\begin{split}\begin{align*} &1.~~~f_X(x) \geq 0~~\textrm{for all}~~ x \in \Omega,\\ &2.~~~\int_{\Omega}f_X(x)~dx = 1,\\ &3.~~~P(X\in (a,b)) = \int_a^bf_X(x)~dx ~~\textrm{for all}~~ (a,b) \subset \Omega \end{align*}\end{split}\]
  • Question: What is the unit of the pdf?
    Answer: [\(f_X(x)\)] = [x]\(^{-1}\).

  • Question: What is the integral of the pdf? Answer: The cumulative distribution function.

Cumulative distribution function#

  • the cumulative distribution function (cdf) of an event X is a monotonically increasing, non-dimensional function \(F_X(x)\) on \(\mathbb{R}\) defined as:

(25)#\[F_X(x) = \int_{-\infty}^x f_X(r)~dr,\]
  • which is equivalent to:

(26)#\[\begin{split}\begin{align*} & \lim_{x \to -\infty} F_X(x) = 0\\ & \lim_{x \to \infty} F_X(x) = 1\\ & \frac{d}{dx}F_X(x) = f_X(x) \end{align*}\end{split}\]
  • consequently, the probability that the event \(X\) lies inside the interval \((a,b)\) is:

\[P(X \in (a,b)) = F_X(b) - F_X(a)\]
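This identity can be sketched numerically; a minimal Python example (standard library only; the normal pdf, introduced later in this lecture, and the values \(\mu=10\), \(\sigma=2\) are illustrative assumptions):

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cdf of N(mu, sigma^2), written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# P(X in (a, b)) = F_X(b) - F_X(a), here for X ~ N(10, 2^2) (illustrative values)
a, b = 8.0, 12.0
p = normal_cdf(b, mu=10.0, sigma=2.0) - normal_cdf(a, mu=10.0, sigma=2.0)
print(round(p, 4))  # the one-sigma interval of a normal: about 0.6827
```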

Expectation \(\varepsilon\)#

  • the expectation of a given pdf weights it with \(x\) in the integral:

\[\begin{split}\begin{align*} \varepsilon(X) &= \int_{\Omega}xf_X(x)~dx\\ \varepsilon(g(X)) &= \int_{\Omega}g(x)f_X(x)~dx \end{align*}\end{split}\]
  • two attributes of the expectation are:

\[\begin{split}\begin{align*} \varepsilon(g_1(X) + g_2(X)) &= \varepsilon(g_1(X)) + \varepsilon(g_2(X))\\ \varepsilon(ag(X)+b) &= a\varepsilon(g(X)) + b \end{align*}\end{split}\]
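The linearity of the expectation can be checked on a sample; a minimal sketch (standard-library Python; the Gaussian sample and the constants \(a=2\), \(b=1\) are illustrative assumptions):

```python
import random

random.seed(0)
xs = [random.gauss(3.0, 1.0) for _ in range(100_000)]

def E(values):
    """Sample estimate of the expectation."""
    return sum(values) / len(values)

a, b = 2.0, 1.0
lhs = E([a * x + b for x in xs])   # E(aX + b)
rhs = a * E(xs) + b                # a E(X) + b
print(lhs, rhs)  # the two agree (up to floating-point rounding)
```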

Central moments \(\mu\)#

  • k-th moment of a continuous random variable X:

\[\mu^{(k)} = \varepsilon(X^k) = \int_{\Omega}x^kf(x)~dx\]
  • k-th central moment of a continuous random variable X:

\[\mu^{'(k)} = \int_{\Omega}(x-\mu)^kf(x)~dx\]
  • example: anomalies relative to the mean seasonal cycle \(\mu\)

  • mean \(\mu\): location parameter \(\mu = \mu^{(1)}\)

  • variance:

\[Var(X) = \mu^{'(2)} = \int_{\Omega}(x-\mu)^2f(x)~dx\]
  • standard deviation:

\[\sigma_X = \sqrt{Var(X)}\]
  • Chebyshev’s inequality:

\[P(|X-\mu| \geq \lambda \sigma) \leq \frac{1}{\lambda^2}\]
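Chebyshev's inequality can be verified empirically; a small Monte Carlo sketch (standard-library Python; the Gaussian sample is an illustrative choice, although the bound holds for any distribution with finite variance):

```python
import random

random.seed(1)
mu, sigma = 0.0, 1.0
n = 100_000
samples = [random.gauss(mu, sigma) for _ in range(n)]

for lam in (1.5, 2.0, 3.0):
    frac = sum(abs(x - mu) >= lam * sigma for x in samples) / n
    bound = 1.0 / lam ** 2
    # the observed tail fraction never exceeds the Chebyshev bound
    print(f"lambda={lam}: {frac:.4f} <= {bound:.4f}")
    assert frac <= bound
```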

Skewness \(\gamma_1\)#

  • is a measure of the asymmetry of a distribution: symmetric for \(\gamma_1=0\), scaled version of the third central moment, non-dimensional shape parameter

\[\gamma_1 = \int_{\Omega}\left( \frac{x-\mu}{\sigma} \right)^3f_X(x)~dx\]
../_images/L10_1_skewness.PNG

Kurtosis \(\gamma_2\)#

  • is a measure of the peakedness of a distribution: a normal distribution (explained later in this lecture) has \(\gamma_2=0\); it is a scaled and shifted version of the fourth central moment, a non-dimensional shape parameter

\[\gamma_2 = \int_{-\infty}^{\infty} \left( \frac{x-\mu}{\sigma} \right)^4f_X(x)~dx -3\]
../_images/L10_2_kurtosis.PNG
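Both shape parameters can be estimated from data via sample central moments; a minimal sketch (standard-library Python; the Gaussian test sample is an illustrative assumption):

```python
import random

def skewness_kurtosis(xs):
    """Sample skewness gamma_1 and excess kurtosis gamma_2 from central moments."""
    n = len(xs)
    mu = sum(xs) / n
    m2 = sum((x - mu) ** 2 for x in xs) / n
    m3 = sum((x - mu) ** 3 for x in xs) / n
    m4 = sum((x - mu) ** 4 for x in xs) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

random.seed(0)
normal_sample = [random.gauss(0.0, 1.0) for _ in range(200_000)]
g1, g2 = skewness_kurtosis(normal_sample)
print(g1, g2)  # both close to 0 for a normal sample
```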

Examples#

  • summer sea level at Kieler Förde, \(\mu = 0.06\), \(\sigma=0.19\), \(\gamma_1=-0.6\), \(\gamma_2=4.07\)

../_images/L10_3_example.PNG
  • probability densities of some measured variables

../_images/L10_4_example.PNG

P-quantiles#

  • mean and variance are affected by the tail ends of the pdf (likelihood of extreme values), but p-quantiles \(x_p\) are insensitive to extreme values.

  • a p-quantile of 0.3 means that 30% of the x values lie below this threshold

\[\begin{split}\begin{align*} F_X(x_p) = p ~~~~~\textrm{with}~~~~~ &P(X \in (-\infty,x_p)) = p,\\ &P(X \in (x_p, \infty)) = 1-p \end{align*}\end{split}\]
  • the median m\(_{50}\) is the 50%-quantile: half of the distribution lies above and the other half below m\(_{50}\).

\[F_X(m_{50}) = 0.5 ~~~\rightarrow~~~ P(x\leq m_{50}) = P(x\geq m_{50}) = 0.5\]
  • let’s look at the p-quantiles of the log-normal distribution in Figure 5 to get an idea. Note the difference between mean and median!

../_images/L10_5_lognormal.PNG
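The mean-median gap of a skewed pdf is easy to reproduce by sampling; a sketch (standard-library Python; \(\theta=1\), \(\sigma=0.8\) are illustrative values, and the quantile helper uses a simple nearest-rank rule):

```python
import math, random

random.seed(0)
theta, sigma = 1.0, 0.8   # median and log-scale spread (illustrative)
xs = sorted(math.exp(random.gauss(math.log(theta), sigma)) for _ in range(100_000))

def quantile(sorted_xs, p):
    """Nearest-rank p-quantile x_p: a fraction p of the data lies below it."""
    return sorted_xs[int(p * (len(sorted_xs) - 1))]

median = quantile(xs, 0.5)   # close to theta
mean = sum(xs) / len(xs)     # close to theta * exp(sigma**2 / 2) > theta
print(median, mean)          # for this right-skewed pdf the mean exceeds the median
```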

Different probability distributions#

Uniform distribution#

  • symmetric and less peaked than the normal distribution:

\[\begin{split}f_X(x) = \cal U(a,b) = \left\{ \begin{array}{ll} 1/(b-a) ~~~\textrm{for all}~ x\in [a,b]\\ 0~~~~~~~~~~~~~~~~~ \textrm{elsewhere}\\ \end{array} \right.\end{split}\]
  • with the cumulative distribution function:

\[\begin{split}F_X(x) = \left\{ \begin{array}{ll} 0~~~~~~~~~~~~~~~~~~~~~~~~~~ \textrm{ for}~x \leq a\\ (x-a)/(b-a) ~~~\textrm{for all}~ x\in [a,b]\\ 1~~~~~~~~~~~~~~~~~~~~~~~~~~ \textrm{ for}~x \geq b\\ \end{array} \right.\end{split}\]
../_images/L10_6_uniform.PNG
  • exercise: calculate \(\mu,Var,\sigma,\gamma_1,\gamma_2\) of the uniform distribution \(\cal U(a,b)\)
    solutions: \(\mu(\cal U(a,b))= \frac{1}{2}(a+b)\), \(Var(\cal U(a,b))= \frac{1}{12}(b-a)^2\), \(\sigma(\cal U(a,b))= \sqrt{\frac{1}{12}}(b-a)\), \(\gamma_1(\cal U(a,b))= 0\), \(\gamma_2(\cal U(a,b))= -1.2\)
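The exercise results for \(\mu\) and \(Var\) can be checked by simulation; a sketch (standard-library Python; the interval \([a,b]=[2,5]\) is an illustrative assumption):

```python
import random

random.seed(0)
a, b = 2.0, 5.0
xs = [random.uniform(a, b) for _ in range(200_000)]
n = len(xs)

mu = sum(xs) / n                          # expect (a + b) / 2 = 3.5
var = sum((x - mu) ** 2 for x in xs) / n  # expect (b - a)**2 / 12 = 0.75
print(mu, var)
```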

Normal (Gaussian) distribution#

  • most physical quantities are nearly normally distributed

\[f_{\cal N}(x) = \frac{1}{\sqrt{2\pi}\sigma}e^{\frac{-(x-\mu)^2}{2\sigma^2}} ~~~\textrm{with}~~~ X \sim \cal N(\mu,\sigma^2)\]
  • no skewness or kurtosis: \(\gamma_1=\gamma_2=0\)

  • the cdf has no analytical closed form; an approximation is:

(27)#\[F_{\cal N}(x) \approx \frac{1}{2} \left( 1+sign\left( \frac{x-\mu}{\sigma} \right) \sqrt{1-e^{\frac{-2}{\pi} \left(\frac{x-\mu}{\sigma} \right)^2 }} \right)\]
../_images/L10_7_gauss.PNG
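Approximation (27) can be compared against the exact cdf, which is available through the error function; a sketch in standard-library Python (the test grid is an illustrative choice):

```python
import math

def approx_cdf(x, mu=0.0, sigma=1.0):
    """Approximation (27) to the normal cdf."""
    z = (x - mu) / sigma
    return 0.5 * (1.0 + math.copysign(1.0, z)
                  * math.sqrt(1.0 - math.exp(-2.0 / math.pi * z * z)))

def exact_cdf(x, mu=0.0, sigma=1.0):
    """Exact normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

grid = [i / 100.0 - 4.0 for i in range(801)]  # z from -4 to 4
worst = max(abs(approx_cdf(z) - exact_cdf(z)) for z in grid)
print(worst)  # maximum absolute error on this grid is a few 1e-3
```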
  • the central limit theorem states: if \(X_k,k=1,2,...\) is an infinite series of independent and identically distributed random variables with \(\varepsilon(X_k)=\mu\) and \(Var(X_k)=\sigma^2\), then the average \(\frac{1}{n} \sum^n_{k=1}X_k\) is asymptotically normally distributed. That is:

\[\lim_{n \to \infty} \frac{\frac{1}{n} \sum^n_{k=1}(X_k-\mu)}{\frac{\sigma}{\sqrt{n}}} \sim \cal N(0,1)\]
  • a larger sample size reduces the standard deviation of the sample mean:

\[\lim_{n \to \infty} \frac{1}{n} \sum^n_{k=1}(X_k-\mu) \sim \cal N(0,\frac{\sigma^2}{n}) ~~\Rightarrow ~~ \sigma_{\Sigma}= \frac{\sigma}{\sqrt{n}}\]
../_images/L10_8_clt1.PNG
../_images/L10_9_clt2.PNG
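The \(\sigma/\sqrt{n}\) scaling can be seen directly by averaging uniform draws; a sketch (standard-library Python; \(n=50\) and the number of repetitions are illustrative choices):

```python
import random, statistics

random.seed(0)
n, reps = 50, 20_000
# averages of n uniform(0,1) draws; a single draw has sigma**2 = 1/12
means = [sum(random.random() for _ in range(n)) / n for _ in range(reps)]

print(statistics.mean(means))    # close to mu = 0.5
print(statistics.pstdev(means))  # close to sigma / sqrt(n) ~ 0.0408
```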

Log-normal distribution#

  • distribution of positive quantities such as rainfall or wind speed

\[f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \frac{1}{x} e^{\frac{-(\ln(x)-\ln(\theta))^2}{2\sigma^2}} ~~\textrm{for}~~ x>0\]
  • with the median value \(\theta\) and

\[\ln(X) \sim \cal N(\ln(\theta),\sigma^2)\]
../_images/L10_10_lognormal.PNG
  • exercise: derive a general expression for the k-th moment of the distribution
    solution: \(\varepsilon(X^k) = \theta^ke^{k^2\sigma^2/2}\)
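The moment formula from the exercise can be checked by sampling; a sketch (standard-library Python; \(\theta=2\), \(\sigma=0.5\) are illustrative values):

```python
import math, random

random.seed(0)
theta, sigma = 2.0, 0.5
xs = [math.exp(random.gauss(math.log(theta), sigma)) for _ in range(500_000)]

for k in (1, 2):
    sample = sum(x ** k for x in xs) / len(xs)
    exact = theta ** k * math.exp(k ** 2 * sigma ** 2 / 2.0)
    print(k, sample, exact)  # sample moment vs theta^k * exp(k^2 sigma^2 / 2)
```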

\(\chi^2\)-distribution#

  • sum of \(k\) independent squared \(\cal N(0,1)\) random variables, with \(k\) the number of degrees of freedom; applied to the pdfs of variance estimates:

\[f_{\chi}(x) = \frac{x^{(k-2)/2}e^{-x/2}}{\Gamma(k/2)2^{k/2}} ~~\textrm{if}~~ x>0 \]
  • with

(28)#\[\Gamma(x) = \int_0^{\infty}e^{-t}t^{x-1}dt ~\textrm{for}~x>0\]
  • it has handy attributes:

\[\begin{split}\begin{align*} \varepsilon(X) &=k\\ Var(X) &=2k \end{align*}\end{split}\]
../_images/L10_11_chisquared.PNG
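Both attributes follow directly from the construction as a sum of squared standard normals, which a simulation reproduces; a sketch in standard-library Python (\(k=5\) is an illustrative choice):

```python
import random, statistics

random.seed(0)
k, reps = 5, 100_000
# chi^2(k): sum of k independent squared N(0,1) variables
xs = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(k)) for _ in range(reps)]

print(statistics.mean(xs))       # close to E(X) = k = 5
print(statistics.pvariance(xs))  # close to Var(X) = 2k = 10
```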

Student’s t-distribution#

  • application: testing the significance of differences in means. Let \(t(k)\) be a test variable with \(k>0\); if \(A\) and \(B\) are independent random variables such that

\[B \sim \chi^2(k) ~\textrm{and} ~A \sim \cal {N(0,1)}\]
  • the t-distribution can be written as:

\[t(k) \sim \frac{A}{\sqrt{B/k}}\]
  • using the \(\Gamma\)-function (28), the probability density of the t-distribution can also be written as:

\[f_{\cal T}(t) = \frac{\Gamma((k+1)/2)}{\sqrt{k\pi}\Gamma(k/2)} \left( 1+\frac{t^2}{k} \right)^{\frac{-(k+1)}{2}}\]
../_images/L10_12_t.PNG
  • this distribution underlies the t-test.
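The construction \(t(k) \sim A/\sqrt{B/k}\) can be simulated directly; a sketch (standard-library Python; \(k=10\) is illustrative, and the check uses \(Var(t(k))=k/(k-2)\) for \(k>2\), a standard property not stated above):

```python
import random, statistics

random.seed(0)
k, reps = 10, 100_000
ts = []
for _ in range(reps):
    A = random.gauss(0.0, 1.0)                               # A ~ N(0,1)
    B = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(k))   # B ~ chi^2(k)
    ts.append(A / (B / k) ** 0.5)

print(statistics.mean(ts))       # symmetric around 0
print(statistics.pvariance(ts))  # k / (k - 2) = 1.25 for k = 10
```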

Fisher-F-distribution#

  • application: testing the significance of differences in variances. For \(\chi^2\)-distributed \(K\) and \(L\):

\[K \sim \chi^2(k) ~\textrm{and}~ L \sim\chi^2(l)\]
  • the F-distribution is given by:

\[{\cal F}(k,l) = \frac{K/k}{L/l}\]
  • alternatively, the probability density of the F-distribution is given by:

\[f_{\cal F}(x) = \frac{(k/l)^{k/2}\Gamma((k+l)/2)}{\Gamma(k/2)\Gamma(l/2)} x^{(k-2)/2} \left(1+\frac{k}{l}x\right)^{-(k+l)/2}\]
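The ratio construction can be simulated as well; a sketch (standard-library Python; \(k=5\), \(l=40\) are illustrative, and the check uses the mean \(l/(l-2)\) for \(l>2\), a standard property not stated above):

```python
import random, statistics

random.seed(0)
k, l, reps = 5, 40, 50_000

def chi2(dof):
    """One draw from chi^2(dof) as a sum of squared standard normals."""
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(dof))

fs = [(chi2(k) / k) / (chi2(l) / l) for _ in range(reps)]
print(statistics.mean(fs))  # close to E(F) = l / (l - 2) ~ 1.053
```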

Summary of theoretical distributions#

../_images/L10_13_summary.PNG

Continuous random vectors, multi-variate data#

  • example: random vectors \(\mathbf{X}\) (temperature) and \(\mathbf{Y}\) (sea level pressure):

\[\mathbf{X}~\textrm{and}~\mathbf{Y} \sim f_{\mathbf{X},\mathbf{Y}}(\vec{x},\vec{y})\]
../_images/L10_14_correlation.PNG