Wednesday, May 10, 2017

Monkeys at Typewriters: a Shoot-in-the-Dark Approach to Content Generation

If I had to sum up the approach used by “trivial” content generation systems, it would be the phrase “monkeys at typewriters”, a reference to the infinite monkey theorem. The theorem essentially states that “a monkey hitting keys at random on a typewriter for an infinite amount of time will almost surely type any given text”. For trivial content generation systems, replace “monkey” with “machine/computer”, and there you go: you have a content generation system. The key piece of the infinite monkey theorem that we care about is that the theorem places no limit on the length of the text, so the set of texts it can produce is infinite, and the chance of producing any one particular text in a single attempt becomes vanishingly small as the texts get longer. With digital computers, however, this assumption goes out the window: ultimately, digital computers are finite machines with finite memory and finite computational capabilities. Thus, the number of things they can generate (and store, which is an important requirement) is finite. It is quite a big number, but it is finite. For example, the computer I’m typing this on has 16GB of RAM, which, assuming 1 byte (8 bits) per character and nothing else in memory, means that I can hold a total of 16*2^30 characters in memory. With 2^8 = 256 possible characters, there are a total of 256^(16*2^30) possible texts of that length I could type (256 choices per character and 16*2^30 characters to choose), and while this is a big number, it is ultimately a finite one. This is another idea at the heart of machine-generated content: there is only a finite amount of content that can be generated and stored by any given machine.
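
To make the arithmetic concrete, here is a minimal Python sketch of the counting argument. The function names are my own, and it keeps the same simplifying assumptions as above (1 byte per character, nothing else in memory). Because the full number is far too large to print, the second function only estimates how many decimal digits it has.

```python
import math


def count_texts(num_chars: int, alphabet_size: int = 256) -> int:
    """Exact number of distinct texts that are exactly num_chars long."""
    return alphabet_size ** num_chars


def decimal_digits_of_count(num_chars: int, alphabet_size: int = 256) -> int:
    """Approximate number of decimal digits in alphabet_size ** num_chars.

    Uses logarithms, so the huge number itself is never built.
    """
    return math.floor(num_chars * math.log10(alphabet_size)) + 1


if __name__ == "__main__":
    # Small case that can be checked by hand: 256 * 256 two-character texts.
    print(count_texts(2))

    # The 16GB case: 16 * 2^30 characters, 256 choices for each one.
    chars_in_ram = 16 * 2**30
    print(decimal_digits_of_count(chars_in_ram))
```

The point of the digit count is just to show that the number, while absurdly large, is still something you can write down a size for.
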
Physical media like paintings and film-based photographs have, in theory, an infinite number of possible instances, so long as you avoid quantum mechanics or discussions about the limits of human perception, neither of which I am qualified to talk about. Digital representations, in comparison, are only finite approximations of work drawn from the infinite ether of possible artworks. There are an infinite number of paintings that can be made on a 7x7 canvas, but there is a finite number of 1000x1000 pixel images that can represent the paintings on that canvas: assuming RGB with 8 bits per channel, each pixel has 256^3 possible colours, giving (256^3)^(1000*1000) = 256^(3*1000*1000) images. In the context of remix culture, assuming a 44.1kHz audio sample rate and a 3 minute song with 256 different amplitude values that can be recorded at each instant, there are 256^(3*60*44.1*1000) songs that can be created. In our discussions about “loss” when capturing physical media digitally, this is part of the loss that occurs. These digital representations are specially crafted to contain a large amount of the perceptible information that humans would have received from the physical copy, but they are ultimately attempting to represent a potentially infinite amount of information in a large, but finite, amount of space.
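
The same counting can be sketched in Python. The helper names below are my own, and the assumptions are the ones stated in the text: three 8-bit colour channels per pixel (so 256^3 colours per pixel), and a single channel of 8-bit audio sampled at 44.1kHz. Since every count here is a power of 2, it is easier to work with the number of bits and only raise 2 to that power for tiny cases.

```python
def image_bits(width: int, height: int, channels: int = 3,
               bits_per_channel: int = 8) -> int:
    """Bits needed to store one uncompressed image of the given size."""
    return width * height * channels * bits_per_channel


def audio_bits(seconds: int, sample_rate_hz: int = 44100,
               bits_per_sample: int = 8) -> int:
    """Bits needed to store one channel of uncompressed audio."""
    return seconds * sample_rate_hz * bits_per_sample


def count_instances(total_bits: int) -> int:
    """Number of distinct files that fit in exactly total_bits bits."""
    return 2 ** total_bits


if __name__ == "__main__":
    # A single RGB pixel: 256^3 possible colours.
    print(count_instances(image_bits(1, 1)))

    # For the full-size cases, print only the exponent of 2.
    print("1000x1000 image: 2 **", image_bits(1000, 1000))
    print("3 minute song:   2 **", audio_bits(3 * 60))
```
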

With this framework in mind, we can finally discuss content generation systems built around this approach. At the bottom of the totem pole are the systems that take this idea to heart and embrace it wholly. For example, ShitpostBot 5000 is a Facebook page with a simple concept: users submit “templates” and “source images”, and every 30 minutes, the bot that gives the page its name picks a “template” along with all the source images needed to fill it out, and posts the result on Facebook (and Twitter). A good majority of the posts are just meaningless junk, but occasionally, thanks to either dumb luck or the magic of confirmation bias, some of them actually work. However, the approach used in systems like these isn’t exactly exciting. Luckily, there are many other techniques used in content generation that yield much better results. Before moving on, it’s worth noting that a lot of the content the bot puts out is quite distasteful, to the point that the first version of the page was reported enough times that it was removed from Facebook. Regardless, on a similar level, we have technologies like random username generators and random password generators, and really, any computer data could, in theory, be enumerated so long as you put an upper limit on its size. It’s just a boring way to do things, and in many cases, it takes an infeasible amount of time.
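
As a rough illustration of both ideas, here is a minimal Python sketch. To be clear, this is not ShitpostBot 5000’s actual code: the template format (a name mapped to a number of image slots), the file names, and the function names are all assumptions I made up for the example. The first function picks a random template and fills it with random source images; the second enumerates every text over an alphabet up to a length limit.

```python
import itertools
import random


def random_post(templates, source_images, rng=random):
    """Pick a random template and enough distinct source images to fill it.

    templates: dict mapping a template name to its number of image slots
               (a hypothetical format, chosen for this sketch).
    source_images: list of image names to draw from.
    """
    name = rng.choice(sorted(templates))
    images = rng.sample(source_images, templates[name])
    return name, images


def enumerate_texts(alphabet, max_length):
    """Yield every text over alphabet with length 0..max_length, in order."""
    for length in range(max_length + 1):
        for chars in itertools.product(alphabet, repeat=length):
            yield "".join(chars)


if __name__ == "__main__":
    templates = {"two_panel": 2, "single": 1}       # made-up examples
    sources = ["cat.png", "dog.png", "toast.png"]   # made-up examples
    print(random_post(templates, sources))

    # With a 2-letter alphabet and at most 2 characters: 1 + 2 + 4 texts.
    print(list(enumerate_texts("ab", 2)))
```

Even in this toy form, the enumeration grows exponentially with the length limit, which is why this way of doing things stops being practical almost immediately.
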
