Inverse function as a method to generate rvs

In this post I attempt to explore some basic intuition about the inverse function as a method to generate random variables.

PDF and CDF

Suppose there is a probability density function f. One can calculate the cumulative distribution function F as:

F(x) = \int_{-\infty}^x f(t) dt = g_x

where g_x \in [0, 1].

PDF – example

Let’s build a probability density function f from a polynomial k. Suppose we want this pdf to describe the probability of some variable x \in r, where r is the region [-4, -1] \cup [+1, +4]. We define k arbitrarily as:

k = -(x^2 - 1)(x^2 - 16)

[Figure: plot of the polynomial k]

To turn this polynomial into a valid pdf, it must be divided by the normalizing factor h defined by:

h = \int_{-4}^{-1} k dx + \int_{1}^{4} k dx = \frac{1044}{5}

Therefore, the probability density function f is a piecewise function defined as:

f = \begin{cases} 0 &\quad\text{if } x \in r^c \\ \frac{k}{h} &\quad\text{if } x \in r \end{cases}

[Figure: plot of the pdf f]

By doing so, we can be confident that:

\int_{-\infty}^{\infty} f(t) dt = 1
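As a quick check, the pdf can be sketched in R and integrated numerically (a minimal sketch; the function names are illustrative):

```r
# polynomial k and normalizing factor h
k <- function(x) -(x^2 - 1) * (x^2 - 16)
h <- 1044 / 5

# piecewise pdf: k/h on r = [-4, -1] U [1, 4], zero elsewhere
f <- function(x) ifelse((x >= -4 & x <= -1) | (x >= 1 & x <= 4), k(x) / h, 0)

# numerical check that f integrates to 1 (all mass lies in [-4, 4])
total <- integrate(f, -4, 4, subdivisions = 1000L)$value
total  # ~1
```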

CDF – example

Naturally, the cdf is just an integral. But because the pdf we are using is a piecewise function, we must split the integral across the regions. So the cdf F has the following form:

F = \begin{cases} I_1 &\quad\text{if } x \leq -4 \\ I_2 &\quad\text{if } -4 < x \leq -1 \\ I_3 &\quad\text{if } -1 < x \leq 1 \\ I_4 &\quad\text{if } 1 < x \leq 4 \\ I_5 &\quad\text{if } 4 < x \end{cases}

We can work out the value of each I_i directly.

I_1 = 0

I_2 = \int_{-4}^x \frac{k}{h} dx = \frac{1}{3132} (1408 - 240 x +85 x^3 - 3 x^5) 

I_3 = \int_{-4}^{-1} \frac{k}{h} dx = \frac{1}{2}

I_4 = \int_{-4}^{-1} \frac{k}{h} dx + \int_{1}^x \frac{k}{h} dx =  \frac{1}{3132} (1724- 240 x +85 x^3 - 3 x^5) 

I_5 = \int_{-4}^{-1} \frac{k}{h} dx + \int_{1}^4 \frac{k}{h} dx =  1
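These pieces can be collected into a single R function to verify the boundary values (a sketch using the closed forms above):

```r
# piecewise cdf assembled from I_1 ... I_5
F_cdf <- function(x) {
  if (x <= -4) return(0)
  if (x <= -1) return((1408 - 240 * x + 85 * x^3 - 3 * x^5) / 3132)
  if (x <= 1)  return(0.5)
  if (x <= 4)  return((1724 - 240 * x + 85 * x^3 - 3 * x^5) / 3132)
  1
}

F_cdf(-1)  # 0.5: half of the mass lies in [-4, -1]
F_cdf(4)   # 1: all of the mass has been accumulated
```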

As a result we have:

[Figure: plot of the cdf F]

Inverse function

The inverse function of F is expressed as F^{-1} such that:

F^{-1} (g_x) = x

In more general terms, if y = f(x), the inverse function f^{-1} must satisfy f^{-1}(f(x)) = x.

Example I

To begin with a simple example, assume we have a pdf f_2 that describes the density of a variable x:

f_2 = \begin{cases} \frac{6-3x}{6} &\quad\text{if } 0 < x \leq 2 \\ 0 &\quad\text{otherwise} \end{cases}

The cdf is then:

F_2 = \begin{cases} a &\quad\text{if } x \leq 0 \\ b &\quad\text{if } 0 < x \leq 2 \\ c &\quad\text{if } 2 < x \end{cases}

where:

a = 0

b = a +\frac{1}{6} \int_0^{x} (6-3x) dx = \frac{1}{6}(6x - \frac{3}{2} x^2)

c = a +\frac{1}{6} \int_0^{2} (6-3x) dx = 1

[Figure: pdf and cdf of Example I]

An inverse can only be obtained where the function is strictly increasing. For this reason, only the b component of the piecewise cdf is going to be inverted. The general approach is to define y = b(x) and solve for x; here this yields a quadratic equation, which we solve with x = \frac{-B \pm \sqrt{B^2 - 4 A C}}{2 A}.

b^{-1} = \begin{cases} 2(1-\sqrt{1-x}) \\ 2(1+\sqrt{1-x}) \end{cases}

Since we expect that b^{-1} (1) = 2 and b^{-1} (0) = 0 we can rule out the second solution.

 

F_2^{-1} = 2(1-\sqrt{1-x}) 
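The derivation can be double-checked in R by composing the cdf with its inverse (names here are illustrative):

```r
# cdf of Example I on (0, 2] and the inverse derived above
F2     <- function(x) (6 * x - 1.5 * x^2) / 6
F2_inv <- function(u) 2 * (1 - sqrt(1 - u))

# F2_inv(F2(x)) should recover x on [0, 2]
x   <- seq(0, 2, by = 0.25)
err <- max(abs(F2_inv(F2(x)) - x))
err  # ~0
```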

[Figure: the inverse function of Example I]

Example II

If we try to follow the same steps shown in Example I to calculate the inverse of our cdf F, we will probably fail. High-order polynomial functions are difficult (if not impossible) to invert with analytical methods. Therefore, a numerical method is suggested for this problem.

At first glance, switching from an analytical approach to a numerical method does not seem appealing. Nonetheless, in many cases we are better off working with a non-parametrically estimated density function (obtained from empirical data) than with an analytical function determined by assumptions.

The first step is to have the numeric data available: the x-axis values and the corresponding pdf and cdf values.
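One way to produce such data in R is to evaluate the pdf on a grid and accumulate it with the trapezoidal rule (a sketch; the grid size is arbitrary):

```r
# pdf from the earlier example
k     <- function(x) -(x^2 - 1) * (x^2 - 16)
h     <- 1044 / 5
pdf_f <- function(x) ifelse((x >= -4 & x <= -1) | (x >= 1 & x <= 4), k(x) / h, 0)

# x-axis grid, pdf values, and cdf values via cumulative trapezoids
x  <- seq(-4, 4, length.out = 2001)
fx <- pdf_f(x)
z  <- c(0, cumsum((fx[-1] + fx[-length(fx)]) / 2 * diff(x)))

tail(z, 1)  # ~1, as a cdf should end
```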

[Figure: numeric data of the x-axis, pdf and cdf values]

The cdf F maps each value on the x-axis to exactly one value on the y-axis. The inverse F^{-1} does the contrary, mapping a value from the y-axis back to the x-axis.

So let’s suppose that we want to know the value of F^{-1} (u). The first step is to find the position of the nearest value smaller than u in the vector containing the cdf data. The next step is to use the position obtained to look up the corresponding x-value. Pretty straightforward, right? There would be no problem with this if the “nearest value” were indeed significantly close to u. Since this is not commonly the case, an interpolation technique is recommended.

To maintain simplicity, linear interpolation is frequently used. Assume that the closest value smaller than u in the vector containing the cdf data is in the i-th position. Let’s call the x-axis vector x and the cdf vector z. Linear interpolation uses a simple linear model to estimate the x-value corresponding to u. In essence:

\hat{x} = \frac{x[i+1]-x[i]}{z[i+1]- z[i]} (u - z[i]) + x[i]

Since we know that z[i] < u and z[i] is the closest such value, we also know that u < z[i+1]. By definition, cdfs are increasing functions, so F^{-1}(z[i]) < F^{-1}(u) < F^{-1}(z[i+1]) must also hold true. As a result, F^{-1}(u) \approx \hat{x}.

[Figure: the numerically inverted cdf]

Inverse method

If u is a random variable drawn from the uniform distribution U[0,1], then y = F^{-1} (u) is a random variable distributed according to f.
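This follows from a one-line argument: since F is increasing and u is uniform on [0, 1],

P(F^{-1}(u) \leq y) = P(u \leq F(y)) = F(y)

so F^{-1}(u) has cdf F, and hence density f.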

Then, by using uniformly distributed random variables, we can simulate any distribution. The following code shows the implementation of the numeric method to calculate the inverse function.
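A minimal self-contained sketch of the numeric inverse method in R (grid size, seed and names are all illustrative choices):

```r
# pdf from the earlier example
k     <- function(x) -(x^2 - 1) * (x^2 - 16)
h     <- 1044 / 5
pdf_f <- function(x) ifelse((x >= -4 & x <= -1) | (x >= 1 & x <= 4), k(x) / h, 0)

# numeric cdf on a grid (cumulative trapezoidal rule)
x  <- seq(-4, 4, length.out = 2001)
fx <- pdf_f(x)
z  <- c(0, cumsum((fx[-1] + fx[-length(fx)]) / 2 * diff(x)))
z  <- z / tail(z, 1)  # force the grid cdf to end exactly at 1

# inverse by linear interpolation: find i with z[i] <= u < z[i+1]
F_inv <- function(u) {
  i <- findInterval(u, z, all.inside = TRUE)
  (x[i + 1] - x[i]) / (z[i + 1] - z[i]) * (u - z[i]) + x[i]
}

# draw 5000 uniform rvs and transform them
set.seed(1)
u <- runif(5000)
y <- F_inv(u)
```

A call like hist(y, breaks = 60) should then reproduce the two-bump shape of f.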

As a result, the inverse method allowed us to generate 5000 random variables that appear to be distributed according to f.

[Figure] A simple plot showing on the x-axis the set of uniformly distributed rv and on the y-axis the rv generated by applying the inverse method.

[Figure] This plot illustrates how the transformation works, taking each uniform rv as an input and generating the desired rv as an output.

[Figure] The frequency histogram of the generated rv shows that they are distributed as f.


As can be seen, the inverse method is a handy tool for generating random variables. The methodology can be applied with either analytical or numerical techniques.

In further posts we will explore how a density function can be estimated via non-parametric statistics. By doing so we will enhance Monte Carlo simulations and hopefully build accurate predictive models. The inverse method is at the core of many high-level simulations.


Download and plot financial data in R

This post covers some basic procedures for downloading and plotting financial data in the R programming language.

 

Before starting

  • It is highly recommended to work with the IDE RStudio for R.
  • For aesthetic reasons, the package ‘ggplot2’ will be used for the data visualization process. The package ‘tidyr’ will also be required. To download, run the following code in the RStudio’s console:
    install.packages('ggplot2')
    install.packages('tidyr')
  • It is necessary to locate the financial data of your interest within an online database (suggested online database here) with an API that allows you to download the data set. In this post the USD/EUR, SGD/EUR and CHF/EUR exchange rates will be used as an example, obtained from the European Central Bank (ECB) database in Quandl. Using other databases may require some adjustments to the code presented in this post.


Defining the time interval

The first thing to do is to identify the time interval for the financial data. This will be stored as a vector named inter.t. The function Sys.Date() gives the current date registered in your system. As can be seen, in this example the time interval is one year long (365 days).

# time interval
inter.t <- c(Sys.Date() - 365, Sys.Date())

# libraries
library(ggplot2)
library(tidyr)

Downloading, reading and manipulating the data

We proceed to indicate the “links” for downloading the data. Note that Quandl’s R package could be used instead for this step.

  • url: vector containing the URLs for the download.
  • nam: vector containing the file names under which the data will be saved as .csv.
  • variables: vector containing the names of the financial variables.
# URLs for downloading data:
# (paste0 is used so that no spaces are inserted into the URLs;
# paste with its default separator would break them)
url <- character(); nam <- character()
variables <- c("EURUSD", "EURSGD", "EURCHF")

# exchange rate USD per EUR
url[1] <- paste0("https://www.quandl.com/api/v3/datasets/ECB/EURUSD.csv?start_date=",
 inter.t[1], "&end_date=", inter.t[2]); nam[1] <- "eurousd.csv"

# exchange rate SGD per EUR
url[2] <- paste0("https://www.quandl.com/api/v3/datasets/ECB/EURSGD.csv?start_date=",
 inter.t[1], "&end_date=", inter.t[2]); nam[2] <- "eursgd.csv"

# exchange rate CHF per EUR
url[3] <- paste0("https://www.quandl.com/api/v3/datasets/ECB/EURCHF.csv?start_date=",
 inter.t[1], "&end_date=", inter.t[2]); nam[3] <- "eurchf.csv"

Now the data is going to be downloaded into your working directory as .csv files (with the function download.file()). We then read the data back into the environment with the read.csv() function, using a for loop to avoid writing the instruction for each variable. The order() and as.Date() functions are used within the same loop to sort the observations chronologically and store the dates with the proper class.

  • le: number of financial variables.
  • fin.data: list that contains dataframes (financial data).
  • number.data: number of observations in each financial data set.
le <- length(url); fin.data <- list()

for (i in 1:le) {
 download.file(url[i], nam[i])
 fin.data[[i]] <- read.csv(file = nam[i], header = TRUE, sep = ",", na.strings = TRUE)
 fin.data[[i]] <- fin.data[[i]][order(fin.data[[i]][, 1]), ]
 fin.data[[i]][, 1] <- as.Date(fin.data[[i]][, 1])
 colnames(fin.data[[i]]) <- c("Date", "Value", rep("Value", ncol(fin.data[[i]])-2))
}
names(fin.data) <- variables

# general information about the database
number.data <- lapply(fin.data, nrow)
number.data

The list number.data contains the number of observations (daily rates) we have for each exchange rate. A general view of the data for each rate can be obtained with the head() function.

lapply(fin.data, head)

So now we have the financial data in a list named fin.data. Inside this list there are three dataframes, each one containing the historical information (date and value) of an exchange rate. Each dataframe has two columns: Date (object class Date) and Value (object class numeric).

This particular arrangement of the data is quite intuitive and useful, and it comes in handy when manipulating the data for operations or analysis. Nevertheless, in the following step we are going to create a general dataframe, another useful way to store this kind of data.

  • gen.findat : dataframe containing the financial data
  • dff: an auxiliary variable.
gen.findat <- data.frame(Date = character(0), Exchange.rate = character(0), Value = numeric(0))
for (i in 1:le){
 dff <- gather(fin.data[[i]], Exchange.rate, Value, -Date)
 dff$Exchange.rate <- rep(variables[i], nrow(dff))
 gen.findat <- rbind(gen.findat, dff)
}

Plotting

Having run all the previous code, plotting the exchange rates is quite easy.

ggplot(gen.findat, aes(x = Date, y = Value, group = Exchange.rate)) + 
 geom_line(aes(colour = Exchange.rate), size = 0.8) + 
 ggtitle("Foreign currencies per EUR") + 
 ylab("Monetary units") + 
 theme_light()

Result

At the end we are left with a messy environment and some useful variables. As has been said before, the variables of interest are the ones that contain the financial data, in this case fin.data and gen.findat. Feel free to clean the environment and keep only the variables you want.
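One way to do that cleanup, assuming the only variables worth keeping are fin.data and gen.findat:

```r
# remove every object in the environment except the two financial-data variables
rm(list = setdiff(ls(), c("fin.data", "gen.findat")))
```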

[Figure: foreign currencies per EUR]

The package ggplot2 comes with many editing options. I encourage you to explore them and make a more elegant presentation for the plots.

If you think there is a better way of doing any process shown in this post, feel free to mention it in the comments. I will be glad to read your recommendations; it’s always good to learn from other perspectives.

 

You can download this code and others on GitHub.

For writing readable code, check Google’s R Style Guide here.