Bringing things back to where we started, much of the initial work on the
systems that underlie state-of-the-art content generation actually came from
simply trying to perceive content. You see, before you ask a computer to generate a
picture of a bird, it helps to train it on what a bird is; it needs to be able
to distinguish the chunk of pixels in an image that represents a bird from the
surrounding environment, regardless of what that environment (indoors,
outdoors, bright blue sky, dark green forest, etc.) actually was. Machine
learning as a whole is an incredibly broad field, and as a result, many
different approaches to this problem of “classification of X” have been
developed. While there are certainly situations where they’re non-ideal, or
the equivalent of using a truck to hammer in a nail, some of the biggest
buzzwords at the moment are Artificial
Neural Networks (ANNs), Deep Learning, and,
especially with regard to image-related problems, Convolutional
Neural Networks (CNNs). In 2015, Google set out to investigate what
exactly these networks were learning to perceive, and released Inceptionism,
an exploration of how these networks internalize the concepts they’re asked
to learn. They found, for example, that
images like these are what the model thought of as a “dumbbell”:
Which, as they note in the article, is interesting: the network never quite
learned the concept of a “dumbbell” on its own; as far as it’s concerned, a
dumbbell comes with an arm attached to it. At
the same time as this work, these exact same techniques allowed Google to feed
in an image of pure noise (i.e. meaningless static) and instruct the
network to loop back on itself, emphasizing whatever features it thought it saw.
Basically, it’s like looking at a cloud, deciding it looks like a dog, and
then reshaping the cloud to look a little more like a dog and repeating the
process in a feedback loop. Amazingly, with such a simple setup, the results
ended up being trippy images like these:
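To make that feedback loop a bit more concrete, here is a minimal sketch of the
kind of activation-maximization loop involved. It assumes PyTorch and a
pretrained torchvision model purely for illustration; the choice of layer, step
count, and step size are placeholders of my own, not details from Google’s
actual implementation.

```python
import torch
import torchvision.models as models

# Pretrained CNN; we only need its convolutional feature extractor.
# (Assumes a torchvision version that accepts the `weights` argument.)
model = models.vgg16(weights="IMAGENET1K_V1").features.eval()
for p in model.parameters():
    p.requires_grad_(False)

layer_index = 20              # which layer's activations to amplify (arbitrary choice)
steps, step_size = 100, 0.05

# "Pure noise" starting image.
img = torch.randn(1, 3, 224, 224, requires_grad=True)

for _ in range(steps):
    # Run the image through the network up to the chosen layer.
    x = img
    for i, layer in enumerate(model):
        x = layer(x)
        if i == layer_index:
            break
    # "Loop back on itself": nudge the image in the direction that makes
    # whatever this layer responded to even stronger.
    loss = x.norm()
    loss.backward()
    with torch.no_grad():
        img += step_size * img.grad / (img.grad.abs().mean() + 1e-8)
        img.grad.zero_()
```

The real Inceptionism/DeepDream code adds a number of refinements (multi-scale
processing, regularization to keep the image natural-looking, starting from
real photos as well as noise), but the core of it is this same feedback loop.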
All of which were purely computer generated, with no human guidance
at all. Just like that, we have what I would consider the dawn of modern work
in machine-generated content. Over the course of the next 2-3 years, a massive
number of works were released that really push the boundaries of what
would classically have been thought possible. Starting back in August 2015,
there was just an explosion of work, including:
- Utilizing these deep networks to map photographs into paintings
- Generating Shakespeare and musical compositions
- Mapping crappy MS Paint-style drawings to look like paintings
- Extending the work of “photographs to paintings” to work with video
- Learning to write descriptions that summarize content in images
- Hallucinating images from textual descriptions (a precursor to my show and tell)
- Generation of arbitrary audio signals, with applications to text-to-speech and music creation
- Improved hallucination of images from text (my show and tell topic)
- Hallucination of videos!
- Construction of 3D models of rooms based on a single photograph
- And many others that I’ve lost the links to at this point
Essentially, the rate of progress in this field is currently mind-blowing.
The specifics of each approach vary, but they’re all still built
around the same idea: “show the system examples of human-made content, then use
that information to generate new content in a more intelligent way than random generation.” A
lot of these may seem like parlor tricks, but the things that these methods
enable can quickly head into some uncomfortable territory, especially in the
era of #FakeNews. For example, the startup Lyrebird is developing a product
that allows you to electronically copy
anyone’s voice from as little as one minute of recorded audio and then
use that electronic voice to say anything you want. In their demonstration,
they have vocal recreations of Obama, Trump, and Clinton discussing their
product, but, as they advertise, you’ll be able to make them say anything you
want. Combine that with research out of the University of Erlangen-Nuremberg
(and others) that demonstrates the ability to remap facial expressions
onto streams of video, and those two systems together could be used, with some
quality enhancements, to create convincing fabrications like fictitious press
announcements. Combined with something like the ability to generate raw videos as
presented above, you may not even need a source video that was actually
recorded from the real world. Again, as discussed in the first section on “machine-assisted
generation”, it’s not that this isn’t possible to do right now – someone with
enough footage on hand of Barack Obama can conceivably make him appear to say
anything they want with a nonlinear video editor – it’s that the barrier to
entry, even for the most complex media manipulations, is quickly approaching
zero. It’s a set of circumstances that makes it hard not to feel at least a bit
guilty that I can hardly contain my excitement at the development of these
technologies.