Wednesday, May 10, 2017

CNN -- Not just a News Channel: Modern Work in Content Generation

Bringing things back to where we started, much of the initial work on the systems that underlie state-of-the-art content generation actually came from simply trying to perceive content. You see, before you ask a computer to generate a picture of a bird, it helps to train it on what a bird is; it needs to be able to distinguish the chunk of pixels in an image that represents a bird from its surroundings, regardless of the environment (indoors, outdoors, bright blue sky, dark green forest, etc.) the picture was taken in. Machine learning as a whole is an incredibly broad field, and as a result, many different approaches to this problem of “classification of X” have been developed. While there are certainly situations where they’re non-ideal, or are the equivalent of using a truck to hammer in a nail, some of the biggest buzzwords at the moment are Artificial Neural Networks (ANNs), Deep Learning, and, especially with regard to image-related problems, Convolutional Neural Networks (CNNs). In 2015, Google set out to investigate what exactly these networks were learning to perceive, and released Inceptionism, an exploration of what the networks were internalizing about the concepts they’re asked to learn. It was found, for example, that images like these are what the model thought of as a “dumbbell”:

What’s interesting, as they note in the article, is that the network isn’t learning the concept of a “dumbbell” on its own, without an arm attached to it. Alongside this work, the exact same techniques allowed Google to feed in an image that was pure noise (i.e. meaningless static) and instruct the network to loop back on itself and emphasize whatever features it thought it saw. Basically, it’s like looking at a cloud, deciding it looks like a dog, reshaping the cloud to look a bit more like a dog, and then repeating the process in a feedback loop. Amazingly, with such a simple setup, the result ended up being trippy images like these:

All of these were purely computer generated, with no input from humans at all. Just like that, we have what I would consider the dawn of modern work in machine-generated content. Over the 2-3 years since, a massive amount of work has been released that really pushes the boundaries of what would classically have been thought possible. Starting back in August 2015, there was just an explosion of work, including:
Essentially, the rate of progress in this field is currently mind-blowing. The specifics of each approach vary, but they’re all still based around the same idea: “show examples of human content, use this information to generate new content in a more intelligent way than random generation”. A lot of these may seem like parlor tricks, but the things these methods enable can quickly head into some uncomfortable territory, especially in the era of #FakeNews. For example, the startup Lyrebird is developing a product that allows you to electronically copy anyone’s voice from as little as one minute of recorded audio and then use that electronic voice to say anything you want. In their demonstration, they have vocal recreations of Obama, Trump, and Clinton discussing their product, but, as they advertise, you’ll be able to make them say anything you want. Combine that with research out of the University of Erlangen-Nuremberg (and others) that demonstrates the ability to remap facial expressions onto streams of video, and the two systems together could be used, with some quality enhancements, to create convincing fakes of things like press announcements. Combined with something like the ability to generate raw videos as presented above, you may not even need a source video that was actually recorded in the real world. Again, as discussed in the first section on “machine-assisted generation”, it’s not that this isn’t possible to do right now – someone with enough footage of Barack Obama on hand can conceivably make him appear to say anything they want with a nonlinear video editor – it’s that the barrier to entry, for even the most complex of media manipulations, is quickly approaching zero. It’s a set of circumstances that makes it hard not to feel at least a bit guilty that I can hardly contain my excitement at the development of these technologies.
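
For the curious, the "see a dog in the cloud, make the cloud more dog-like, repeat" feedback loop behind those trippy images from earlier can be sketched in a few lines. To be clear, this is a toy illustration and not Google's actual code: a small random matrix (`weights`, made up for this example) stands in for a trained network layer, the "image" is just a flat vector of noise, and the names `dream` and `feature_energy` are my own. The real thing runs the same kind of loop against a trained CNN.

```python
import numpy as np


def relu(x):
    """Standard rectifier: keep positive responses, zero out the rest."""
    return np.maximum(x, 0.0)


def feature_energy(image, weights):
    """How strongly the toy layer responds to the image (half the sum of squared activations)."""
    return 0.5 * np.sum(relu(weights @ image) ** 2)


def dream(image, weights, steps=50, step_size=0.1):
    """Repeatedly nudge the image toward whatever the layer already 'sees' in it.

    This is gradient ascent on feature_energy with respect to the image,
    not with respect to the weights, which stay fixed throughout.
    """
    img = image.copy()
    for _ in range(steps):
        activations = relu(weights @ img)
        # Gradient of feature_energy w.r.t. the image: W^T applied to the
        # active (positive) responses.
        grad = weights.T @ activations
        # Normalize so the step size does not depend on the gradient's scale.
        grad /= np.abs(grad).mean() + 1e-8
        img += step_size * grad
    return img


# Hypothetical stand-ins: 16 random "feature detectors" over a 64-pixel image.
rng = np.random.default_rng(0)
weights = rng.normal(size=(16, 64))
noise = rng.normal(size=64)

dreamed = dream(noise, weights)
print("response to raw noise:    ", feature_energy(noise, weights))
print("response after dreaming:  ", feature_energy(dreamed, weights))
```

The second number comes out larger than the first: each pass makes the input look a little more like what the layer was already responding to, which is the whole trick. Swap the random matrix for a network trained on real photos and those amplified features start to look like eyes, dogs, and pagodas.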

