In this post I attempt to explore some basic intuition about the inverse function as a method to generate random variables.
PDF and CDF
Suppose there is a probability density function $f(x)$. One can easily calculate the cumulative distribution function as:

$$F(x) = \int_{-\infty}^{x} f(t)\,dt$$

Where $f(t) \ge 0$ for all $t$ and $\lim_{x \to \infty} F(x) = 1$.
PDF – example
Let’s generate a probability density function based on a polynomial equation $g(x)$. Suppose we want this pdf to describe the probability of some variable $x$, where $D$ is the region $[-4, -1] \cup [1, 4]$. We define $g(x)$ arbitrarily as:

$$g(x) = -(x^2 - 1)(x^2 - 16)$$
It is necessary to adjust this polynomial function by a normalization factor $k$ defined by:

$$k = \left(\int_{D} g(x)\,dx\right)^{-1} = \frac{5}{1044}$$
Therefore, the probability density function is a piecewise function defined as:

$$f(x) = \begin{cases} k\,g(x) = \dfrac{-5(x^2 - 1)(x^2 - 16)}{1044} & x \in [-4, -1] \cup [1, 4] \\ 0 & \text{otherwise} \end{cases}$$
By doing so, we can be confident that:

$$\int_{-\infty}^{\infty} f(x)\,dx = 1$$
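We can verify this normalization numerically with base R's `integrate()`. This is just a quick sanity check; `pdf_example` below mirrors the density defined above:

```r
# The density from above: k * g(x) on [-4, -1] and [1, 4], zero elsewhere.
pdf_example <- function(x) {
  ifelse((x >= -4 & x <= -1) | (x >= 1 & x <= 4),
         -5 * (x^2 - 1) * (x^2 - 16) / 1044,
         0)
}

# Integrate each piece separately; by symmetry each contributes 1/2,
# so the two halves should sum to 1.
total <- integrate(pdf_example, -4, -1)$value + integrate(pdf_example, 1, 4)$value
total  # approximately 1
```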
CDF – example
Naturally, the cdf is just an integral. But because the pdf we are using is a piecewise function, we must split the integral across the regions. So the cdf has the following form:

$$F(x) = \begin{cases} 0 & x < -4 \\ \int_{-4}^{x} f(t)\,dt & -4 \le x \le -1 \\ F(-1) & -1 < x < 1 \\ F(-1) + \int_{1}^{x} f(t)\,dt & 1 \le x \le 4 \\ 1 & x > 4 \end{cases}$$
We can figure out the value of $F(-1)$ intuitively: the pdf is symmetric about zero, so each of the two pieces contributes half of the total probability, giving $F(-1) = 1/2$.
As a result we have:

$$F(x) = \begin{cases} 0 & x < -4 \\ \int_{-4}^{x} f(t)\,dt & -4 \le x \le -1 \\ 1/2 & -1 < x < 1 \\ 1/2 + \int_{1}^{x} f(t)\,dt & 1 \le x \le 4 \\ 1 & x > 4 \end{cases}$$
Inverse function
The inverse function of $F$ is expressed as $F^{-1}$ such that:

$$F^{-1}(F(x)) = x$$
In more general terms, if $y = F(x)$, the inverse function must satisfy $x = F^{-1}(y)$.
Example I
To begin with a simple example, assume we have a pdf that describes the density function of a variable $x$:

$$f(x) = \begin{cases} 2x & 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}$$
The cdf is then:

$$F(x) = \begin{cases} 0 & x < 0 \\ x^2 & 0 \le x \le 1 \\ 1 & x > 1 \end{cases}$$
The inverse function can only be deduced on increasing functions. For this reason, only the increasing component of the piecewise cdf, $F(x) = x^2$ on $[0, 1]$, is going to be inverted. We can easily find the solution algebraically. The general approach is to define $y = F(x)$ and solve for $x$:

$$y = x^2 \implies x = \pm\sqrt{y}$$
Since we expect that $0 \le x \le 1$ and $0 \le y \le 1$, we can rule out the negative solution, leaving $F^{-1}(y) = \sqrt{y}$.
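A quick simulation check of this result (a sketch, assuming the density $f(x) = 2x$ on $[0, 1]$ used in this example): drawing uniform values and applying $F^{-1}(y) = \sqrt{y}$ should reproduce the moments of $f$.

```r
set.seed(42)
u <- runif(1e5)   # uniform draws on [0, 1]
x <- sqrt(u)      # apply the analytical inverse F^-1(y) = sqrt(y)

# For f(x) = 2x on [0, 1]: E[X] = 2/3 and E[X^2] = 1/2.
mean(x)    # close to 2/3
mean(x^2)  # close to 1/2
```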
Example II
If we try to follow the same steps shown in Example I to calculate the inverse of the cdf from our polynomial example, we will probably fail. High-order polynomial functions are difficult (if not impossible) to invert with analytical methods. Therefore, a numerical method is suggested for this problem.
At first glance, switching from an analytical approach to a numeric method does not seem appealing. Nonetheless, in many cases we are better off working on a problem with a non-parametrically estimated density function (obtained from empirical data) than with an analytical function determined by assumptions.
The first step is to have the numeric data of the x-axis, pdf and cdf values available.
# Example of inverse function
# Generating x-axis, pdf and cdf numeric values.
# last modified: 29/04/2016

library(ggplot2)
library(tibble)
library(tidyr)

# Mathematical function ---------------------------------------------------
pdf_example <- function(x){
  in_region <- (x >= -4 & x <= -1) | (x >= 1 & x <= 4)
  if (in_region)
    -5 * (x^2 - 1) * (x^2 - 16) / 1044
  else
    0
}

# Create x-axis, pdf and cdf values ----------------------------------------
n <- 5000
dataset <- tibble(x = seq(-5, 5, 10 / n))
dataset$pdf <- sapply(dataset$x, pdf_example)
# Cumulative (Riemann) sum of the pdf values, normalised so the cdf ends at 1.
dataset$cdf <- cumsum(dataset$pdf) / sum(dataset$pdf)
dataset <- gather(dataset, dens, freq, -x)

# Plot the results ----------------------------------------------------------
g <- ggplot(dataset, aes(x, freq, group = dens)) + theme_minimal() +
  geom_ribbon(aes(ymin = 0, ymax = freq, fill = dens), alpha = 0.6, color = "black") +
  xlab("value") + ylab("frequency") + scale_fill_brewer() +
  ggtitle("Dataset generated - {x-axis, pdf, cdf}")
# g + geom_segment(aes(x = 0, xend = 0, y = 0, yend = 1/2), color = "dark blue", size = 0.75) +
#   geom_segment(aes(x = -5, xend = 0, y = 1/2, yend = 1/2), color = "dark blue", size = 0.75)
The cdf maps a value on the x-axis to exactly one value on the y-axis. The inverse method suggests the contrary: mapping a value from the y-axis back to the x-axis.
So let’s suppose that we want to know the value of $x = F^{-1}(y)$ for some given $y$. The first step is to find the position of the nearest value that is still smaller than $y$ in the vector containing the cdf data. The next step would be to use the position obtained to figure out the corresponding x-value. Pretty straightforward, right? There would be no problem with this if the “nearest value” were indeed significantly near to $y$. Because this is not commonly the case, an interpolation technique is recommended.
To maintain simplicity, linear interpolation is frequently used. Assume that the closest value to $y$ in the vector containing the cdf data is in the $i$-th position. Let’s call the x-axis vector $X$ and the cdf vector $Y$. Linear interpolation uses a simple linear model to estimate the corresponding x-value for $y$. In essence:

$$F^{-1}(y) \approx X_i + \frac{y - Y_i}{Y_{i+1} - Y_i}\,(X_{i+1} - X_i)$$
Since $Y_i$ is the closest value smaller than $y$, we know that $Y_i \le y < Y_{i+1}$. By definition, cdfs are increasing functions, so $X_i < X_{i+1}$ must also hold true. As a result, $X_i \le F^{-1}(y) < X_{i+1}$.
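The interpolation rule above can be sketched as a small R function (`inverse_cdf` is a name I’ve chosen; it assumes sorted `x_vals` and `cdf_vals` vectors like the columns generated earlier):

```r
# Numeric inverse of a cdf via linear interpolation.
inverse_cdf <- function(y, x_vals, cdf_vals) {
  # Position of the nearest cdf value that is still <= y.
  i <- findInterval(y, cdf_vals)
  i <- max(min(i, length(x_vals) - 1), 1)  # keep i and i+1 inside the vectors
  # Linear interpolation between (Y_i, X_i) and (Y_{i+1}, X_{i+1}).
  x_vals[i] + (y - cdf_vals[i]) / (cdf_vals[i + 1] - cdf_vals[i]) *
    (x_vals[i + 1] - x_vals[i])
}
```

Base R’s `approx(cdf_vals, x_vals, xout = y)` implements the same idea, with more careful handling of edge cases.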
Inverse method
If there is a random variable $U$ drawn from a uniform distribution on $[0, 1]$, then $X = F^{-1}(U)$ is a random variable whose cdf is $F$.
Then, by using uniformly distributed random variables, we can simulate any distribution. The following code shows an implementation of the numeric method to calculate the inverse function.
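A sketch of that implementation, rebuilding the numeric cdf from above and using base R’s `approx()` for the linear-interpolation step (variable names are my own):

```r
set.seed(123)

# Target density from the post, vectorised.
pdf_example <- function(x) {
  ifelse((x >= -4 & x <= -1) | (x >= 1 & x <= 4),
         -5 * (x^2 - 1) * (x^2 - 16) / 1044,
         0)
}

# Numeric x-axis and cdf values, as generated earlier.
n <- 5000
x_vals   <- seq(-5, 5, 10 / n)
cdf_vals <- cumsum(pdf_example(x_vals)) / sum(pdf_example(x_vals))

# Inverse method: push uniform draws through the numeric inverse cdf.
u <- runif(n)
samples <- approx(x = cdf_vals, y = x_vals, xout = u, ties = "ordered")$y

# `samples` should now be distributed (approximately) according to f(x).
# hist(samples, breaks = 50)
```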
As a result, the inverse method allowed us to generate 5000 random variables that apparently distribute according to $f(x)$.
As can be seen, the inverse method is a handy tool for generating random variables, and the technique can be applied with either analytical or numeric methods.
In further posts, we will explore how a density distribution can be estimated via non-parametric statistics. By doing so we will enhance Monte Carlo simulations and hopefully build accurate predictive models. The inverse methodology is at the core of many high-level simulations.