Suppose, however, that we have a "magical" coin for which the probability of heads depends on whether the coin was last heads up or tails up. For example, if the coin is facing heads up, the probability of heads on the next flip is 0.6, but if the coin is tails up, the probability of heads is 0.7. These probabilities, plus the current state of the coin, are all that is needed to determine the likelihood of heads or tails on the next flip. [See Diagram 1]. Like the Bernoulli process, repeated flips of our magic coin produce a sequence of random variables, but here they are not independent. However, if the present state of the coin is known, the future states are said to be conditionally independent of all states prior to the present. This type of sequence is called a Markov process.
Diagram 1: The Magic Coin
Markov processes are named for Andrey Andreyevich Markov, a Russian mathematician (and student of Chebyshev) who did most of his mathematical work in the early part of the 20th century. Like the Bernoulli process, a Markov process is a type of stochastic process, which means that it is a sequence of random variables produced at successive time intervals. The distinguishing feature of the Markov process is that future states depend only on the present state (this is often called the Markov property). If the number of possible states for the random variable is finite or countably infinite, the process is called a Markov chain.
Markov chains are ubiquitous in modern science and engineering. Stochastic processes in general are the random analog of deterministic processes (e.g., the trajectory of a rocket), and the Markov chain is the simplest of these that can be used to make predictions about future states. It also helps that the Markov chain model-- taking all of the known state into account to calculate the next time step-- fits well into many types of computer simulation. Markov chains are used for calculating the parameters of statistical distributions, ranking web pages in search engines, finding low-energy configurations of complicated molecules, and generating artificial documents with correct grammar.
In his 1906 paper entitled "Extension of the Law of Large Numbers to Dependent Quantities," Markov defined the chain in this way:
" an infinite sequence
of variables connected in such a way that
for any k is independent of
in case
is known."
It wasn't until about 20 years later that this was called a Markov chain. Markov sought applications for mathematics, so he would no doubt be pleased that his creation figures so prominently in 21st century science. But it was not his intention to create a practical tool. In fact, Markov was inspired to study this type of process by a contemporary mathematician named Pavel Alekseevich Nekrasov whose assertion that the Law of Large Numbers was valid only for pairwise independent random variables aggravated Markov's sense of mathematical rigor.
The Law of Large Numbers tells us that as the number of observations in a sample increases, the sample mean will approach the expected value. Not surprisingly, the first proof of the Law of Large Numbers was provided by Jacob Bernoulli (though he had a grander name for it-- the golden theorem). Subsequent proofs and refinements assumed that the random events giving rise to the law are independent, but no one had proved that independence was a necessary condition. When Nekrasov claimed without sufficient evidence that pairwise independence is necessary, Markov set out to prove him wrong. Markov's initial approach was a more mathematically formal version of the magic coin.
Had Markov had access to a computer, he might have run a few quick simulations to test his idea. Figure 1 below shows the running mean of a series of normal coin flips (in blue) and a series of flips of our magic coin (in green). It's fairly apparent that the means of both series settle toward a stable value after a couple of thousand flips, although the magic coin clearly has a different stable value than the fair coin.
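We can run exactly that kind of simulation today. The sketch below is a minimal version in Python (the function names are mine, and the transition probabilities are the ones assumed in Diagram 1); it estimates the long-run fraction of heads for both coins:

```python
import random

def fair_coin_heads_fraction(n, seed=0):
    """Fraction of heads in n independent flips of a fair coin."""
    rng = random.Random(seed)
    return sum(rng.random() < 0.5 for _ in range(n)) / n

def magic_coin_heads_fraction(n, seed=0):
    """Fraction of heads in n flips of the magic coin from Diagram 1:
    P(heads) = 0.6 if the coin currently shows heads, 0.7 if tails."""
    rng = random.Random(seed)
    state = "H"  # start heads up
    heads = 0
    for _ in range(n):
        p_heads = 0.6 if state == "H" else 0.7
        state = "H" if rng.random() < p_heads else "T"
        heads += (state == "H")
    return heads / n

print(fair_coin_heads_fraction(100_000))   # settles near 0.5
print(magic_coin_heads_fraction(100_000))  # settles near 0.636
```

Recording the running mean after every flip, rather than just the final fraction, produces curves like those in Figure 1.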
To generate the sequence of states in the chain above, you can simulate the process of coin flipping by generating a random number and using it to choose the next state following the logic in Diagram 1 (in fact, that's how these graphs were generated). However, this process can also be thought of as starting with a particular distribution and then applying the state transition probabilities to find a new distribution. For example, suppose that there are 100 of the magic coins, all facing heads up. If all are flipped once, the expected result would be 100 x 0.6 = 60 heads and 100 x 0.4 = 40 tails. If we flip them all again, the expected number of heads would be 60 x 0.6 = 36 for the coins showing heads plus 40 x 0.7 = 28 for the coins showing tails, for a total of 64. Similarly, we'd arrive at 36 tails. Mathematically, this process can be represented as a vector for the distribution being multiplied by a matrix that contains the probabilities of moving between heads and tails:
p_(n+1) = p_n P

where p_n = [h_n  t_n] is the distribution over heads and tails at step n, and

P = | 0.6  0.4 |
    | 0.7  0.3 |

So the first two steps as described above would look like this:

[100  0] P = [60  40]
[60  40] P = [64  36]
The matrix P in these equations is called a transition matrix or a stochastic matrix. Each element of the matrix represents the probability of a transition from one state to another, and the rows of a stochastic matrix always sum to 1. A Markov chain like that of our magic coin is called homogeneous, meaning that the transition matrix does not change over time. For a homogeneous chain, it's possible to show that if all of the elements of P are positive, the chain will always reach an equilibrium distribution. For the magic coin as defined in Diagram 1, the equilibrium distribution is about 64% heads/36% tails (it's just good fortune that the first two multiplications by P give us whole numbers of coins-- the equilibrium number of heads is 63.63636...).
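The iteration toward equilibrium is easy to carry out directly. Here is a minimal sketch in plain Python (no libraries; the helper name `step` is mine) that multiplies the distribution vector by the transition matrix until it stops changing:

```python
# Transition matrix for the magic coin (state order: heads, tails).
# Row i, column j holds P(next state j | current state i); rows sum to 1.
P = [[0.6, 0.4],
     [0.7, 0.3]]

def step(dist, P):
    """One step of the chain: the vector-matrix product dist * P."""
    return [sum(dist[i] * P[i][j] for i in range(len(P)))
            for j in range(len(P[0]))]

dist = [100, 0]                # 100 coins, all heads up
print(step(dist, P))           # first flip: about [60, 40]
print(step(step(dist, P), P))  # second flip: about [64, 36]

for _ in range(50):            # iterate until the distribution stabilizes
    dist = step(dist, P)
print(dist)                    # equilibrium: about [63.64, 36.36]
```

Because every element of P is positive, the iteration converges to the same equilibrium no matter which distribution it starts from.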
To understand what a Markov chain is, it might be useful to compare it to something that is not a Markov chain. Consider first a random walk on a 2-dimensional lattice. Starting at a given point in the lattice, a random direction is chosen and the walker moves one step in that direction. This is repeated for any number of steps. It's fairly easy to see that this is a Markov chain: the next point in the walk depends only on where the walker currently stands. It makes no difference how the walker arrived at the present spot in the lattice. However, let's say we make one change to this recipe. In our new version of the random walk, we add the restriction that the walker cannot revisit any point in the lattice. This is called a self-avoiding walk. Now the walker cannot retrace steps or cross an earlier path, so each new step requires knowledge of the previous states of the system. This is the hallmark of non-Markov processes-- they have a "memory" of their previous states.
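The difference between the two walks shows up clearly in code. In this sketch (Python; the function names are mine), the ordinary walk consults only the current position, while the self-avoiding walk must consult the set of every point visited so far:

```python
import random

MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def random_walk(steps, seed=0):
    """Ordinary lattice walk: the next point depends only on the current one."""
    rng = random.Random(seed)
    x, y = 0, 0
    path = [(x, y)]
    for _ in range(steps):
        dx, dy = rng.choice(MOVES)
        x, y = x + dx, y + dy
        path.append((x, y))
    return path

def self_avoiding_walk(steps, seed=0):
    """Self-avoiding walk: the legal moves depend on the entire history,
    so the position alone is not a Markov chain. Stops early if trapped."""
    rng = random.Random(seed)
    x, y = 0, 0
    path = [(x, y)]
    visited = {(x, y)}
    for _ in range(steps):
        options = [(x + dx, y + dy) for dx, dy in MOVES
                   if (x + dx, y + dy) not in visited]
        if not options:
            break  # the walker has boxed itself in
        x, y = rng.choice(options)
        visited.add((x, y))
        path.append((x, y))
    return path

path = self_avoiding_walk(200)
assert len(path) == len(set(path))  # no point is ever revisited
```

The `visited` set is exactly the "memory" that disqualifies the self-avoiding walk as a Markov chain in the walker's position.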
Figure 2 shows an example of each type of walk. It's interesting to note that the case of coin flipping can be thought of as a one-dimensional random walk if we stipulate that heads means to take a step forward and tails means take a step back. Now the position of the walker has the Markov property even though the values of the individual coin flips are independent. For the magic coin, the only difference is that the probability of taking a step in either direction depends on which direction the walker took on the last step. Since heads has a higher probability regardless of the current state of the coin, the walker will tend to drift in the positive direction. The great mathematician Karl Pearson did some of the earliest work on random walks and famously evoked the image of the random walker as a stumbling drunk with his assertion that "the most probable place to find a drunken man who is at all capable of keeping on his feet is somewhere near his starting point!" For the case of the magic coin, the drunk leans slightly forward.
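Pearson's forward-leaning drunk can be simulated as well. A minimal sketch (Python, assuming the probabilities from Diagram 1, with heads meaning a step of +1 and tails a step of -1):

```python
import random

def magic_coin_walk(steps, seed=0):
    """Final position of a 1-D walk driven by the magic coin.
    P(heads) = 0.6 after a head, 0.7 after a tail (Diagram 1)."""
    rng = random.Random(seed)
    state = +1   # +1 = heads, -1 = tails; start heads up
    position = 0
    for _ in range(steps):
        p_heads = 0.6 if state == +1 else 0.7
        state = +1 if rng.random() < p_heads else -1
        position += state
    return position

# At equilibrium about 63.6% of flips are heads, so the walker drifts
# forward roughly 2 * 0.636 - 1, or about 0.27 lattice units per step.
print(magic_coin_walk(10_000))
```

A fair coin driving the same walk would leave the expected position at the origin; the magic coin's bias toward heads produces the steady forward lean.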
In the real world Markov processes are the exception rather than the rule, but their application is still flourishing more than a century after Markov started his research (Google returns over a million results for a search on "Markov chain"). Computer science, statistics, operations research and finance are all areas seeing active research on Markov chains. There's a Markov model for predicting the outcome of the NCAA basketball tournament, and a web application (Garkov) that generates random (often surreal) Garfield strips by splicing Markov-generated dialogue into the original artwork.