Moving up the totem pole, we arrive at the middle ground of more “classic” content generation systems: more advanced than “enumerate all the possible content” but short of the state of the art. The key insight behind these systems is the recognition that, even though there is a massive number of potential pieces of content that could be generated, the vast majority of them are utterly meaningless to humans. If we use human-made content as a starting point, we can look for characteristics in that content and generalize or extrapolate from them in a way that hopefully yields more human-like output. Because generating text has historically been a more tractable problem than generating things like images or video (if only because there’s less data to process, and object recognition in images took a while to get good enough), we’ll start there. The variant that I, personally, am most familiar with is known as a Markov chain model (with a visual explanation here).
Essentially, models such as these work by picking an n-gram size (where an
n-gram is a sequence of n characters), and then, given an example text to train
from, they look at every possible n-gram in that text. For every n-gram they
see, they record what the next word/letter/etc. was. For example, if we’re using 2-grams (bigrams) and our text is “the dog ran”, all 2-grams of that text are [“th”, “he”, “e “, “ d”, “do”, “og”, “g “, “ r”, “ra”, “an”], and for each of those 2-grams we record the letter that came after. In this example, the end result is something like this:
“th” -> “e”,
“he” -> “ “,
“e “ -> “d”,
“ d” -> “o”,
“do” -> “g”,
“og” -> “ “,
“g “ -> “r”,
“ r” -> “a”,
“ra” -> “n”,
“an” -> “”
Now, for a longer text, it’s likely that you’ll see multiple instances of the same bigram (e.g. “th” could be followed by “e” for the word “the” or by “o” for the word “those”), and in that case you simply collect all of the observed next characters together, something like this:
“th” -> [“e”, “o”]
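To make that concrete, here is a minimal sketch in Python of how you might build such a table for an arbitrary text and n-gram size (the function name and details are my own illustration, not taken from any particular bot or library):

```python
from collections import defaultdict

def build_markov_table(text, n=2):
    """Map every n-gram in `text` to the list of characters observed after it."""
    table = defaultdict(list)
    # The final n-gram has nothing after it, so the loop stops one n-gram early.
    for i in range(len(text) - n):
        ngram = text[i:i + n]        # e.g. "th"
        next_char = text[i + n]      # e.g. "e"
        table[ngram].append(next_char)
    return table

table = build_markov_table("the dog ran")
# For a longer training text, a key like "th" would accumulate multiple
# entries, e.g. ["e", "o", "e", ...], mirroring the chained list above.
```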
Then, once you’ve done this for the entire text (say, Hamlet), you can generate new text by picking a random n-gram to start from. Since you’ve tabulated everything that ever came after that n-gram, you pick randomly from its list of potential next characters, append that character, and then repeat the process with the new trailing n-gram. It’s a simple concept, but an incredibly large number of internet bots function in exactly this way. With this approach, you can do things like generate Donald Trump tweets, simulate a subreddit, or replace your fellow writer coworkers with absurd bots.
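As a rough sketch of that generation loop (again, my own illustration, continuing from the hypothetical build_markov_table function above):

```python
import random

def generate(table, length=200):
    """Generate text by repeatedly sampling a successor of the current n-gram."""
    current = random.choice(list(table.keys()))   # random starting n-gram
    output = current
    for _ in range(length):
        successors = table.get(current)
        if not successors:    # dead end: this n-gram was never followed by anything
            break
        output += random.choice(successors)       # pick randomly among observed successors
        current = output[-len(current):]          # slide the window forward one character
    return output

# Assuming you've saved the text of Hamlet to "hamlet.txt":
print(generate(build_markov_table(open("hamlet.txt").read()), length=500))
```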
In a similar vein to Markov chains, there is a body of work on automated text generation based on linguistics and theories of grammar, but, as I am not a linguistics major, I don’t know much about it beyond the fact that many of these systems are built on things called “context-free grammars”. Some examples include the postmodernism generator, which
creates random articles that imitate postmodernist writing, SCIgen, which generates
nonsense computer science research papers (one of which was accepted to SCI
2005), and snarXiv, a parody of arXiv that generates random paper titles and abstracts, with one of its suggested uses being: “If you’re a graduate student, gloomily read through the abstracts, thinking to yourself that you don’t understand papers on the real arXiv any better”.
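For a flavor of what a context-free grammar looks like in this context, here is a toy grammar and expander in Python; the grammar itself is entirely made up for illustration, and real systems like the generators above use far larger rule sets:

```python
import random

# A toy context-free grammar: each nonterminal maps to a list of possible
# productions, and each production is a sequence of terminals/nonterminals.
GRAMMAR = {
    "SENTENCE":    [["NOUN_PHRASE", "VERB_PHRASE"]],
    "NOUN_PHRASE": [["the", "ADJECTIVE", "NOUN"], ["the", "NOUN"]],
    "VERB_PHRASE": [["VERB", "NOUN_PHRASE"], ["VERB"]],
    "ADJECTIVE":   [["postmodern"], ["recursive"], ["stochastic"]],
    "NOUN":        [["discourse"], ["algorithm"], ["paper"]],
    "VERB":        [["deconstructs"], ["generates"], ["cites"]],
}

def expand(symbol):
    """Expand a symbol: terminals come back as-is, nonterminals are replaced
    by a randomly chosen production, expanded recursively."""
    if symbol not in GRAMMAR:
        return symbol
    production = random.choice(GRAMMAR[symbol])
    return " ".join(expand(s) for s in production)

print(expand("SENTENCE"))
# e.g. "the stochastic discourse deconstructs the paper"
```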
These middle-tier content generation techniques are certainly better at producing good content than random guessing, but for the most part they just serve to be humorous. It’s the next section, the “state of the art” or “next gen”, where things start to get quite interesting.