Tuesday, April 25, 2017

A Sentence is Worth a Thousand Pictures -- Image Synthesis from Text Show & Tell


On the surface, perception doesn’t actually seem all that complicated. For humans, it comes so naturally that it’s one of the defining characteristics of life. After all, what is a human without the ability to see, hear, touch, smell, or taste? At a base level, perception is simply the ability to tell the difference between at least two things: what is sight without at least the contrast of light and dark? Intelligence, again at a basic level, then comes from the ability to make decisions based on perception. Electronic computers have always been able to change their execution based on the internal state of the machine (i.e., make decisions), to the point that conditional behavior is one of the key requirements in mathematical formalizations of a machine able to compute everything that is computable (fun fact: the class of computable functions). With this in mind, one of the biggest challenges in computing hasn’t been how to create a machine that makes decisions (my alarm clock already does that based on the time) but how to digitize perception so that computers can make meaningful decisions based on their perception of the environment. Progress in this exact area has essentially been the foundation of machine learning or, as it’s commonly called these days, “AI”.

In the pre-electronic-computer era, the core mathematics involved in machine learning was developed piecemeal, scattered across the fields of statistics, linear algebra, and calculus/optimization. When computers came onto the scene, entirely new avenues of possibility opened to researchers: avenues so computationally intensive that, up until that point, they would have been too tedious, time-consuming, or boring for ordinary humans to pursue by hand. With cheap computation, though, it became possible to combine these subjects, so that between the early 1950s and today we’ve gone from using computers for basic “fit a trend line to my data” regression tasks all the way to transferring artistic style between images or, as the subject of my show and tell discussed, synthesizing new images from just a description of what is in them. Many of the key advancements in machine learning haven’t been about showing that this is possible mathematically. Mathematics is generally broad enough that you can ask “is there a function that takes a set of sentences and maps each one to an image?” and it’ll go “sure! If you can represent them mathematically somehow, there’s a function (or at least a relation) that maps between them!” The real advances have been methods that make it feasible to find or approximate these functions from examples of how they behave.
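To make the “fit a trend line to my data” end of that spectrum concrete, here is a minimal sketch of finding a function from examples of its behavior: ordinary least-squares regression on synthetic data. The data, the true slope and intercept, and the noise level are all assumptions chosen purely for illustration, not anything from the post.

```python
import numpy as np

# Synthetic examples of a function's behavior: y = 3x + 2 plus noise.
# (The true coefficients here are an assumption for illustration.)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(scale=0.5, size=x.shape)

# Build the design matrix [x, 1] and solve for the slope and intercept
# that minimize squared error over the observed examples.
A = np.column_stack([x, np.ones_like(x)])
slope, intercept = np.linalg.lstsq(A, y, rcond=None)[0]

print(slope, intercept)  # both should land near the true values 3 and 2
```

The same idea, scaled up enormously in the dimensionality of both the inputs and the function family being searched, is what underlies the sentence-to-image models discussed below: the model is still just a function fitted to examples of how the mapping should behave.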

In the context of new media, the capabilities offered by these techniques are somewhat unprecedented. As more media becomes digital, it becomes possible not only to duplicate it effortlessly but to transform it into entirely new forms just as effortlessly. With this in mind, the line between what is original and what is copied grows ever thinner, and it’s going to be either interesting or scary, depending on who you ask, to watch media develop in the coming years. To a certain extent, we’re already getting to that point, and in my project I’m going to explore what I see as the frontiers of machine-generated and machine-assisted media, with the hope that looking at the state of things now will give us a viewport into how things will be in the future.
