Thursday, December 24, 2009

Markov Chains

A coin flip is an example of a Bernoulli trial, an event which has a known probability for success (p) and failure (1-p). A coin flip is a specific case where one of the two possible outcomes, "heads" or "tails", is chosen to mean success and both are usually assumed to be equally probable. Repeated coin flips, like any sequence of Bernoulli trials, constitute a Bernoulli process. If we assign the value of 1 to a heads and 0 to a tails, the expected value of the random variable in this particular process is (1/2) indicating that in the long run we expect as many heads as tails. We also assume that each coin flip is independent of any other flip so that no matter how many heads or tails we've seen so far, the probability of heads or tails on the next flip is exactly the same as it was on the first.
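As a quick sanity check, the fair-coin case is easy to simulate (a minimal sketch; the helper name is mine):

```python
import random

def bernoulli_mean(p=0.5, n=100000):
    """Average of n Bernoulli trials: 1 for heads (probability p), 0 for tails."""
    return sum(1 if random.random() < p else 0 for _ in range(n)) / n

# By the law of large numbers, the sample mean settles near p = 0.5.
```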

Suppose, however, that we have a "magical" coin for which the probability of heads depends on whether the coin was last heads up or tails up. For example, if the coin is facing heads up the probability of heads on the next flip is 0.6, but if the coin is tails up the probability of heads is 0.7. These probabilities plus the current state of the coin are needed to determine the likelihood of heads or tails on the next flip. [See Diagram 1]. Like the Bernoulli process, repeated flips of our magic coin produce a sequence of random variables, but here they are not independent. However, if the present state of the coin is known the future states are said to be conditionally independent of all states prior to the present. This type of sequence is called a Markov process.

Diagram 1: The Magic Coin

Markov processes are named for Andrey Andreyevich Markov, a Russian mathematician (and student of Chebyshev) who did most of his mathematical work in the early part of the 20th century. Like the Bernoulli process, a Markov process is a type of stochastic process, which means that it is a sequence of random variables produced at successive time intervals. The distinguishing feature of the Markov process is that future states depend only on the present state (this is often called the Markov property). If the possible number of states for the random variable is finite or countably infinite, the process is called a Markov chain.

Markov chains are ubiquitous in modern science and engineering. Stochastic processes in general are the random analog of deterministic processes (e.g., the trajectory of a rocket), and the Markov chain is the simplest of these that can be used to make predictions about future states. It also helps that the Markov chain model-- taking all of the known state into account to calculate the next time step-- fits naturally into many types of computer simulation. Markov chains are used for calculating the parameters of statistical distributions, ranking web pages in search engines, finding low-energy configurations of complicated molecules, and generating artificial documents with correct grammar.


In his 1906 paper entitled "Extension of the Law of Large Numbers to Dependent Quantities," Markov defined the chain in this way:

"an infinite sequence x1, x2, ..., xk, xk+1, ..., of variables connected in such a way that xk+1 for any k is independent of x1, ..., xk-1, in case xk is known."

It wasn't until about 20 years later that this was called a Markov chain. Markov sought applications for mathematics, so he would no doubt be pleased that his creation figures so prominently in 21st century science. But it was not his intention to create a practical tool. In fact, Markov was inspired to study this type of process by a contemporary mathematician named Pavel Alekseevich Nekrasov whose assertion that the Law of Large Numbers was valid only for pairwise independent random variables aggravated Markov's sense of mathematical rigor.

The Law of Large Numbers tells us that as the number of observations in a sample increases, the sample mean approaches the expected value. Not surprisingly, the first proof of the Law of Large Numbers was provided by Jacob Bernoulli (though he had a grander name for it-- the golden theorem). Subsequent proofs and refinements assumed that the random events giving rise to the law are independent, but no one had proved that independence was a necessary condition. When Nekrasov claimed without sufficient evidence that pairwise independence is necessary, Markov set out to prove him wrong. Markov's initial approach was a more mathematically formal version of the magic coin.

Had Markov had access to a computer he might have run a few quick simulations to test his idea. Figure 1 below shows the current mean value of a series of normal coin flips (in blue) and a series of flips of our magic coin (green). It's fairly apparent that the means of both series settle towards a stable value after a couple of thousand flips, although clearly the magic coin has a different stable value than the fair coin.
Figure 1: Mean Value of Series of Coin Flips and Simple Markov Chain
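A simulation along these lines can be sketched in a few lines of Python (the names are mine; the 0.6 and 0.7 transition probabilities come from Diagram 1):

```python
import random

# P(heads on next flip | current side of the coin), from Diagram 1
P_HEADS = {'H': 0.6, 'T': 0.7}

def magic_coin_mean(n=10000):
    """Simulate n flips of the magic coin and return the fraction of heads."""
    state, heads = 'H', 0
    for _ in range(n):
        state = 'H' if random.random() < P_HEADS[state] else 'T'
        heads += state == 'H'
    return heads / n
```

Running this for a large n gives a mean near 0.636, the magic coin's stable value.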

To generate the sequence of states in the chain above, you can simulate the process of coin flipping by generating a random number and using it to choose the next state following the logic in Diagram 1 (in fact, that's how these graphs were generated). However, this process can also be thought of as starting out with a particular distribution and then applying the state transition probabilities to find a new distribution. For example, suppose that there are 100 of the magic coins, all facing heads up. If all are flipped once, the expected number of heads and tails would be 100 x 0.6 = 60 heads and 100 x 0.4 = 40 tails. If we flip them again, the expected number of heads would be 60 x 0.6 = 36 for the coins showing heads plus 40 x 0.7 = 28 heads for the coins showing tails, for a total of 64. Similarly, we'd arrive at 36 tails. Mathematically, this process can be represented with a vector for the distribution being multiplied by a matrix that contains the probabilities of moving between heads and tails:

x_{t+1} = x_{t}P

where

x_{t} = \begin{pmatrix} h_{t} & t_{t} \end{pmatrix}, \quad P = \begin{pmatrix} 0.6 & 0.4 \\ 0.7 & 0.3 \end{pmatrix}

So the first two steps as described above would look like this:

\begin{pmatrix} 100 & 0 \end{pmatrix}\begin{pmatrix} 0.6 & 0.4 \\ 0.7 & 0.3 \end{pmatrix} = \begin{pmatrix} 60 & 40 \end{pmatrix}

\begin{pmatrix} 60 & 40 \end{pmatrix}\begin{pmatrix} 0.6 & 0.4 \\ 0.7 & 0.3 \end{pmatrix} = \begin{pmatrix} 64 & 36 \end{pmatrix}
The matrix P in these equations is called a transition matrix or a stochastic matrix. Each element of the matrix represents the probability of the transition from one state to another and the rows of a stochastic matrix will always sum to 1. A Markov chain like that for our magic coin is called homogeneous, meaning that the transition matrix does not change over time. For a homogeneous chain, it's possible to show that if the elements of P are positive the chain will always reach an equilibrium distribution. For the magic coin as defined in Diagram 1 the equilibrium distribution is about 64% heads/36% tails (it's just good fortune that the first two multiplications by P give us whole numbers of coins-- the equilibrium number of heads is 63.63636...).
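The convergence to equilibrium is easy to verify numerically by applying the transition probabilities over and over (a small numpy sketch; the variable names are mine):

```python
import numpy as np

P = np.array([[0.6, 0.4],   # from heads: P(heads), P(tails)
              [0.7, 0.3]])  # from tails: P(heads), P(tails)

x = np.array([100.0, 0.0])  # start with 100 coins, all heads up
for _ in range(50):
    x = x @ P               # repeated multiplication drives x to equilibrium

# x converges to roughly (63.636..., 36.363...), i.e. 7/11 and 4/11 of the coins
```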

To understand what a Markov chain is, it might be useful to compare it to something that is not one. Consider first a random walk on a 2-dimensional lattice. Starting at a given point in the lattice, a random direction is chosen and the walker moves one step in that direction. This is repeated for any number of steps. It's fairly easy to see that this is a Markov chain: the next point in the walk depends only on where the walker currently stands. It makes no difference how the walker arrived at the present spot in the lattice. However, let's say we make one change to this recipe. In our new version of the random walk, we add the restriction that the walker cannot revisit any point in the lattice. This is called a self-avoiding walk. Now the walker cannot retrace steps or cross an earlier path, so each new step requires knowledge of the previous states of the system. This is a feature of non-Markov processes-- they have a "memory" of their previous states.
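A minimal sketch of both walks makes the difference concrete: the plain walk needs only the current point, while the self-avoiding walk must carry the full set of visited points. (Names are mine; this self-avoiding version simply stops if it traps itself, where a more careful implementation might backtrack.)

```python
import random

STEPS = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def random_walk(n):
    """Markov: the next point depends only on the current point."""
    x, y = 0, 0
    path = [(x, y)]
    for _ in range(n):
        dx, dy = random.choice(STEPS)
        x, y = x + dx, y + dy
        path.append((x, y))
    return path

def self_avoiding_walk(n):
    """Non-Markov: each step consults the entire history (the visited set)."""
    x, y = 0, 0
    visited = {(x, y)}
    path = [(x, y)]
    for _ in range(n):
        options = [(x + dx, y + dy) for dx, dy in STEPS
                   if (x + dx, y + dy) not in visited]
        if not options:          # the walker has trapped itself
            break
        x, y = random.choice(options)
        visited.add((x, y))
        path.append((x, y))
    return path
```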



Figure 2: Random Walk (Markov) and Self-Avoiding Walk (Non-Markov)

Figure 2 shows an example of each type of walk. It's interesting to note that the case of coin flipping can be thought of as a one-dimensional random walk if we stipulate that heads means to take a step forward and tails means take a step back. Now the position of the walker has the Markov property even though the values of the individual coin flips are independent. For the magic coin, the only difference is that the probability of taking a step in either direction depends on which direction the walker took on the last step. Since heads has a higher probability regardless of the current state of the coin, the walker will tend to drift in the positive direction. The great mathematician Karl Pearson did some of the earliest work on random walks and famously evoked the image of the random walker as a stumbling drunk with his assertion that "the most probable place to find a drunken man who is at all capable of keeping on his feet is somewhere near his starting point!" For the case of the magic coin, the drunk leans slightly forward.
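That forward lean is easy to check by simulation (a sketch with my own names; since the chain shows heads 7/11 of the time in equilibrium, the expected drift is 7/11 - 4/11 = 3/11 of a step per flip):

```python
import random

def magic_walk(n=100000):
    """1-D walk driven by the magic coin: heads -> +1, tails -> -1."""
    p_heads = {1: 0.6, -1: 0.7}   # P(heads | direction of the last step)
    step, position = 1, 0
    for _ in range(n):
        step = 1 if random.random() < p_heads[step] else -1
        position += step
    return position / n           # average drift per step
```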

In the real world Markov processes are the exception rather than the rule, but their application is still flourishing more than a century after Markov started his research (Google returns over a million results for a search on "Markov chain"). Computer science, statistics, operations research and finance are all areas seeing active research on Markov chains. There's a Markov model for predicting the outcome of the NCAA basketball tournament, and a web application that generates random (often surreal) Garfield cartoons by chaining together panels from the originals (Garkov).

Wednesday, September 2, 2009

Monte Carlo Integration

One of the first computer recreations I remember trying was using the Monte Carlo method to calculate the area of a circle. The basic technique is to generate random points in a square and keep track of the proportion that fall within the inscribed circle (that is, points where x^2 + y^2 <= r^2).

Back in the day, this was probably a dozen lines of Fortran (or maybe Basic), but it's an even shorter program in Python:

from __future__ import division
from numpy.random import uniform as u

def mcCircleArea(r=1, n=10000):
    # fraction of random points in the square that land inside the circle
    # (x**2 + y**2 <= r**2), scaled by the square's area (2r)**2
    return sum(1 if u(-r,r)**2 + u(-r,r)**2 <= r**2 else 0 for i in range(n))/n*(2*r)**2


This example is often used to explain the concept of Monte Carlo integration (in fact, it's used on the Wikipedia page for that topic). But it's kind of an odd case in the sense that you're working with a function that's like:

f(x,y) = \begin{cases} 1, & \mbox{if } x^2+y^2 \le r^2 \\ 0, & \mbox{otherwise} \end{cases}

The basic idea behind Monte Carlo integration is that to find the integral of a function f(x), you take many random samples x_i from a domain D and then multiply the average of the f(x_i) by the volume of the domain. So imagine that you wanted to find the area of a circle as you might have in basic calculus, starting with the equation:

f(x)=\sqrt{r^{2}-x^{2}}

In this case you could take uniformly random samples from the interval (0,r], calculate f(x) for those samples, and then multiply the average by r to get an estimate of the area of one quadrant of the circle. I think a similar process works for most functions of a single variable because it's easy to take samples from a uniform distribution (which is not to say that it's always a good way to do integration).
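Here's a sketch of that quadrant estimate (the function name is mine; note the r**2 under the square root so the estimate works for any radius-- with the default r=1 the two forms agree):

```python
from numpy.random import uniform
import numpy as np

def mcQuadrantArea(r=1.0, n=100000):
    """Estimate the area under f(x) = sqrt(r**2 - x**2) on (0, r):
    average f over uniform samples, then multiply by the interval length r."""
    x = uniform(0, r, n)
    return np.sqrt(r**2 - x**2).mean() * r

# The quadrant area is pi*r**2/4, so 4 * mcQuadrantArea() is an estimate of pi.
```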

Monte Carlo integration problems come up in many applications where it's not easy to sample points from the domain of interest, because it is high-dimensional and/or oddly shaped. In that case you can do a nifty trick: choose a simpler domain that you can sample from, and provided that it is a superset of the domain of interest the same approach still applies. Basically you do what happens in the circle area code above: generate random points in the larger domain and keep track of the proportion that fall within the domain of interest. This is called rejection sampling.

Again, I find the circle example unsatisfying because it's basically turning a function of one variable into a constant function of two variables. On the other hand, it's hard to contrive an example where the domain is a single dimension. You could do something like integrate the function above from, say, (0.2, 0.8) while sampling points from the superset (0, 1.0).


from __future__ import division
from numpy.random import uniform as u
import numpy as np

def mcSubInt(r=1, n=10000, a=0.2, b=0.8):
    ''' Definite integral of sqrt(r**2 - x**2) from 0.2 to 0.8 is 0.50498905889431 '''
    # sample x uniformly on (0, r); samples outside (a, b) contribute nothing
    return sum(np.sqrt(r**2 - x**2) for x in (u(0,r) for i in range(n)) if a < x < b)/n*r



This calculates


r \cdot \frac{1}{n}\sum_{i=1}^{n}f(x_{i})

where

f(x) = \begin{cases} \sqrt{r^{2}-x^{2}}, & \mbox{if } a < x < b \\ 0, & \mbox{otherwise} \end{cases}


Kind of a boring example, but I think it qualifies. A slightly more interesting example might be integrating the function f(x,y) = y+1 over a unit circle centered at the origin of the x,y plane. That works out to the volume of half of a 2-unit-high unit cylinder, or pi (ok, so maybe that's still not that interesting).


from __future__ import division
from numpy.random import uniform as u
from pylab import scatter,show
import numpy as np

def mcCircleDomain(n=10000):
    sum_fx = 0
    accept = []
    reject = []
    for i in range(n):
        (x, y) = u(-1, 1, 2)
        if (x**2 + y**2) < 1:
            sum_fx += (y + 1)   # f(x,y) = y + 1, evaluated only inside the circle
            accept.append((x, y))
        else:
            reject.append((x, y))

    # plot accepted samples (default blue) and rejected samples (red)
    scatter([c[0] for c in accept], [c[1] for c in accept])
    scatter([c[0] for c in reject], [c[1] for c in reject], c='r')
    show()

    # average over all n samples times the area of the 2x2 sampling square
    return sum_fx/n*4