Wednesday, May 10, 2017

Markov Madness: Increasing the Complexity of Content Generation

Moving up the totem pole, we arrive at the middle ground of more “classic” content generation systems: more advanced than “enumerate all the possible content,” but less advanced than the state of the art. The key insight behind these systems is the recognition that, even though there is a massive number of potential pieces of content that could be generated, the vast majority of them are utterly meaningless to humans. If we use human-made content as a starting point, we can look for characteristics in that content to generalize or extrapolate from, in a way that hopefully yields more human-like output. Because generating text has historically been a more tractable problem than generating things like images or videos (if only because it's less data to process, and object recognition in images took a while to get advanced enough), we'll start there.

The variant that I, personally, am most familiar with is the Markov chain model (with a visual explanation here). Models like these work by picking an n-gram size (where an n-gram is a sequence of n items, usually characters or words; I'll use characters here) and then, given an example text to train from, looking at every n-gram in that text. For every n-gram they see, they record what came next. For example, if we're using 2-grams (bigrams) and our text is “the dog ran”, all 2-grams of that text are [“th”, “he”, “e “, “ d”, “do”, “og”, “g “, “ r”, “ra”, “an”], and for each of those 2-grams, we record the character that came after. In this example, the end result is something like this:
    “th” -> “e”
    “he” -> “ ”
    “e ” -> “d”
    “ d” -> “o”
    “do” -> “g”
    “og” -> “ ”
    “g ” -> “r”
    “ r” -> “a”
    “ra” -> “n”
    “an” -> “”
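To make that bookkeeping concrete, here's a tiny Python sketch (my own illustration, not code from any particular library) that walks “the dog ran” and prints each 2-gram next to the character that follows it:

    text = "the dog ran"
    n = 2

    # Slide a window of size n across the text; the character at position
    # i + n (if there is one) is what follows the n-gram starting at i.
    for i in range(len(text) - n + 1):
        ngram = text[i:i + n]
        next_char = text[i + n] if i + n < len(text) else ""
        print(repr(ngram), "->", repr(next_char))

Running this reproduces the table above, including the final “an” with nothing after it.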
Now, for a longer text, it's likely that you'll have multiple instances of the same bigram (e.g., “th” could be followed by “e” for the word “the” or “o” for the word “those”), so you just collect all of the followers together into something like this:

    “th” -> [“e”, “o”]
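Recording those shared followers is really just a dictionary whose values are lists. Here's a rough sketch of a character-level model builder, assuming bigrams as in the example (again, my own illustration rather than anyone's official implementation):

    from collections import defaultdict

    def build_model(text, n=2):
        # Map every n-gram in the text to the list of characters seen right after it.
        # The final n-gram has nothing after it, so the loop simply stops before it.
        model = defaultdict(list)
        for i in range(len(text) - n):
            model[text[i:i + n]].append(text[i + n])
        return model

    model = build_model("the cat and those dogs")
    print(model["th"])   # ['e', 'o']: "th" was followed by 'e' in "the" and 'o' in "those"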
Then, once you've done this for an entire text (say, Hamlet), you can generate new text: pick a random n-gram to start from, look up everything that ever came after that n-gram in your table, pick one of those characters at random, append it, and then repeat the process with the last n characters of what you've written so far (a minimal version of this loop is sketched at the end of this post). It's a simple concept, but there is actually an incredibly large number of internet bots that function in exactly this way. With this approach, you can do things like generate Donald Trump tweets, simulate a subreddit, or replace your fellow writer coworkers with absurd bots.

In a similar vein to Markov chains, there is a body of work on automated text generation based on linguistics and theories of grammar, but, as I am not a linguistics major, I don't know much about it other than that many of these systems are built on things called “context-free grammars”. Some examples include the postmodernism generator, which creates random articles that imitate postmodernist writing; SCIgen, which generates nonsense computer science research papers (one of which was accepted to WMSCI 2005); and snarXiv, a parody of arXiv that generates random paper titles and abstracts, with one of the suggested uses being “If you're a graduate student, gloomily read through the abstracts, thinking to yourself that you don't understand papers on the real arXiv any better”.

These middle-tier content generation techniques are certainly better at stumbling onto good content than random guessing, but for the most part, they just serve to be humorous. It's the next section, the “state of the art” or “next gen”, where things start to get quite interesting.
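And, as promised, here is a minimal sketch of that generation loop, building on the build_model function from the earlier sketch (character-level, with none of the weighting tricks or guard rails a production bot would want; "hamlet.txt" is just a placeholder for whatever training text you have on hand):

    import random

    def generate(model, length=200, n=2):
        # Start from a random n-gram, then repeatedly sample one of the characters
        # that followed the current n-gram in the training text.
        current = random.choice(list(model.keys()))
        output = current
        for _ in range(length):
            followers = model.get(current)
            if not followers:        # dead end: this n-gram never had a successor
                break
            output += random.choice(followers)
            current = output[-n:]    # the last n characters become the new state
        return output

    # "hamlet.txt" is a stand-in for whatever training text you want to imitate.
    with open("hamlet.txt") as f:
        print(generate(build_model(f.read())))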
