January 05, 2017

Markov Chains: The Imitation Game

In this post we're going to build a Markov Chain to generate some realistic-sounding sentences impersonating a source text. This is one of my favourite computer science examples because the concept is so absurdly simple and the payoff is large. This will be done using Python, and your final code will look like this.

Before I explain how it works though, let's look at an example generated from the past four Cyber Omelette posts:

"The first step to attach our mirror is done by simply screw the backing with a Raspberry Pi's into the filename. That looks like installing and blinking cursor. Type the wood screws.
Gorilla glue.
Electrical tape.
Extension Cord with multiple passes of the all Together  Once it hasn't really stuck in place until all directions. Clean your monitor." 

So it's not exactly coherent, but it comes pretty close! I also built a Twitter bot called mayhem_bot (git repo) using this concept. The bot imitates users and trending hashtag discussions, so send it some tweets if you want more examples.




Markov Chain Theory


Markov Chains are simply a technique for storing the acceptable outcomes that can occur from a given state; moving from one state to another is called a state transition. Imagine you're in a car at an intersection. Your possible moves are to drive forward, turn left, turn right, or reverse. So for your current state, there are four possible subsequent states.

When applied to language, the simplest case is to read a group of words and record all adjacent pairs. In the phrase "The big big cat ate the bread", for example, the word "the" has two possible outcomes, as it is followed by both "big" and "bread". The word "big" also has two possibilities: "big" and "cat". All the others have a single possible outcome.

Here is how the Markov Chain data would look in this case:

the: [big, bread]
big: [big, cat]
cat: [ate]
ate: [the]

To randomly generate a phrase from this data, we simply pick a starting word, and then select the next word from the set of allowed outcomes. Then this choice becomes our new starting word, and the process repeats.
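To make that concrete, here's a tiny hand-rolled walk through the chain above (purely an illustration; the real program comes later in the post):

import random

chain = {"the": ["big", "bread"], "big": ["big", "cat"],
         "cat": ["ate"], "ate": ["the"]}

word = "the"                        # pick a starting word
phrase = [word]
for _ in range(6):
    if word not in chain:           # "bread" has no recorded followers, so stop
        break
    word = random.choice(chain[word])
    phrase.append(word)
print(" ".join(phrase))             # e.g. "the big big cat ate the bread"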

If your smartphone suggests words while you type a message, try typing a word and then only choosing from the recommendations; when you do this, you're just walking through a Markov Chain! For best results, your source training material should contain over 1,000 words.

If you'd like to read more about this, I recommend Chapter 3 of The Practice of Programming, which is where I was first introduced to this concept.

Python Markov Chain 


For this example, you will only need Python 3; no non-standard libraries are required. If you are unfamiliar with Python, I recommend checking out my introduction to Python, which includes install instructions.

In this program, there are four steps:
  1. Read our source text.
  2. Build the Markov Chain.
  3. Use the Markov Chain to generate a random phrase.
  4. Output the phrase.

The first step is to read the source text that will be used to create the Markov Chain. The text will come from a text file, which makes it easy to copy and paste material from multiple sources. The file will be read and its contents returned as a single string.

That looks something like this:
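Roughly speaking (the function name read_source and the details here are just one reasonable way to write it):

def read_source(filename):
    # Read the whole source file into one string
    with open(filename, encoding="utf-8") as source_file:
        text = source_file.read()
    # Collapse double line breaks into spaces; single breaks and punctuation stay in
    return text.replace("\n\n", " ")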


We now have a string with all our words. Double line breaks ('\n\n') have been removed, but single breaks and punctuation have been left in for emphasis.

Now we can build our Markov Chain. This will take the form of a dictionary, with a single word as the key and a list of all the words that follow it as the value. This is done by splitting our string into a list of words, then iterating through the list and appending each word to the list stored under the previous word's key. Since we are considering pairs of words, we start at the second word in the list and use the entry at index - 1 as the key.

That is done as follows:
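A sketch of that function (build_markov_chain is an assumed name, and the details are one reasonable implementation):

def build_markov_chain(text):
    # Split on spaces only, so single line breaks stay attached to words
    words = [word for word in text.split(" ") if word]
    chain = {}
    # Start at the second word; the word at index - 1 is the key
    for index in range(1, len(words)):
        key = words[index - 1]
        chain.setdefault(key, []).append(words[index])
    return chain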



This function returns our Markov Chain dictionary, which we can now use to randomly generate a phrase.

Generating a phrase simply requires setting a desired length and assigning a starting word. Then we can just walk the chain until we reach our word count. To ensure the starting word exists in the chain, it is chosen at random. This could easily be modified to accept a specific start word if you are confident it is in the Markov Chain.
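Here's a sketch of that function (generate_phrase and its details are illustrative rather than definitive):

import random

def generate_phrase(chain, length=30):
    # A randomly chosen starting word is guaranteed to exist in the chain
    word = random.choice(list(chain.keys()))
    phrase = [word]
    for _ in range(length - 1):
        if word not in chain:            # only the final source word can have no followers
            break
        word = random.choice(chain[word])
        phrase.append(word)
    return " ".join(phrase)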



This function returns our random message.

Next we have to output our result. This can simply be printed on the screen, but I prefer to output to a text file, as console messages are clumsy and not befitting the elegant prose we will be generating.

This can be done with a simple output function.
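Something like this works (write_output is an assumed name):

def write_output(phrase, filename="output.txt"):
    # Write the generated phrase to a text file
    with open(filename, "w", encoding="utf-8") as output_file:
        output_file.write(phrase)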


Now let's call these functions in the right order in our main namespace. In this example, the source text file is passed as a command line argument, and the result is stored in a file called "output.txt".
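A sketch of that main block, assuming the function names used in the sketches above:

import sys

if __name__ == "__main__":
    source_text = read_source(sys.argv[1])    # source file passed on the command line
    chain = build_markov_chain(source_text)
    phrase = generate_phrase(chain)
    write_output(phrase)                      # ends up in output.txt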

The full code can be found here.

To run this example, navigate your command line interface to the directory containing Markov.py. Create a text file in this directory containing some source material, perhaps called "input.txt". Now you can run this program as follows:

$>python Markov.py input.txt

This should create a file called "output.txt" in this directory containing your phrase. Try running it a few times to watch how the phrase changes.

This is the simplest form of the project, and there are a lot of ways to improve on it. For example, you could try deleting words from the chain as they are used until there are none left, or even use two words as the key instead of one. Have some fun and see how simple rules can improve the overall realism of the generated phrases.
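If you want a head start on the two-word-key idea, here's a rough sketch (the names and details are just one possible approach):

import random

def build_bigram_chain(text):
    words = text.split()
    chain = {}
    for index in range(2, len(words)):
        key = (words[index - 2], words[index - 1])      # a pair of words as the key
        chain.setdefault(key, []).append(words[index])
    return chain

def generate_bigram_phrase(chain, length=30):
    key = random.choice(list(chain.keys()))
    phrase = list(key)
    for _ in range(length - 2):
        if key not in chain:
            break
        next_word = random.choice(chain[key])
        phrase.append(next_word)
        key = (key[1], next_word)                       # slide the pair forward one word
    return " ".join(phrase)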

If you generate some funny chains, please throw them in the comments below with an explanation of what they were generated from! 

2 comments:

  1. Mayhem seems to have been sabotaged by a tweet; he is not active on Twitter. Possibly as a result of hitting a top ten list in hashtag games and a mass complaint attack against him

  2. Yup, that definitely happened... Someone gamed the algorithm and made it a spam machine! Fixed that bug and replied to support. Hopefully they are feeling lenient.
