[AI Seminar] Will we be able to draw and paint by speaking?

Jun 14, 2021

The second AI seminar centered on a lecture about “video intelligence” by Lee Jung-ho, leader of the Artificial Intelligence Research Lab at the LG Electronics Future Technology Center. Video intelligence is a technology that enables machines to recognize and understand objects through camera sensors, much as humans do through their eyes. Although the term may not yet feel familiar, the technology is already widespread in our daily lives: it is used in vehicle license plate recognition, automated driving, face recognition, automated image classification, and more.


LG Electronics is also actively developing video intelligence technology and applying it to its products. For example, LG OLED TVs have built-in AI processors that recognize human faces and text on the screen to optimize image quality. Research is also underway on a feature for the LG InstaView ThinQ refrigerator that uses video intelligence and built-in cameras to identify the contents of the fridge, make cooking recommendations, and even order ingredients.

These are just some of the many video intelligence research projects that LG Electronics is currently conducting. This seminar focused on one area in particular: video intelligence for creating pictures from text.


In November 2014, Google introduced the neural image caption generator, a technology that uses deep learning to look at an image and describe it in words. It uses a Convolutional Neural Network (CNN) to recognize the image and a Recurrent Neural Network (RNN) to learn natural language expressions and generate the text.



Source: Show and Tell: A Neural Image Caption Generator (https://arxiv.org/pdf/1411.4555.pdf)
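
The division of labor between the two networks can be sketched in a few lines of Python. This is only a toy illustration of the data flow, not the model from the paper: the "CNN" is replaced by simple row averaging, the weights are random and untrained, and the vocabulary, sizes, and function names are all made up for this example. The caption it prints is therefore meaningless; the point is the loop in which image features seed the decoder and each chosen word is fed back in to pick the next one.

```python
import math
import random

# Hypothetical toy vocabulary and hidden size, chosen only for illustration.
VOCAB = ["<start>", "a", "dog", "cat", "sits", "on", "grass", "<end>"]
HIDDEN = 4

rng = random.Random(0)  # fixed seed so the untrained toy weights are reproducible


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Untrained random weights; a real model would learn these from captioned images.
W_h = rand_matrix(HIDDEN, HIDDEN)        # hidden state -> hidden state
W_x = rand_matrix(HIDDEN, HIDDEN)        # word embedding -> hidden state
W_out = rand_matrix(len(VOCAB), HIDDEN)  # hidden state -> one score per word
EMBED = rand_matrix(len(VOCAB), HIDDEN)  # one embedding vector per word


def encode_image(image):
    """Stand-in for the CNN encoder: average each row of a 4x4 image."""
    return [sum(row) / len(row) for row in image]


def rnn_step(hidden, word_id):
    """One recurrent step: mix the previous state with the previous word."""
    pre = [a + b for a, b in zip(matvec(W_h, hidden), matvec(W_x, EMBED[word_id]))]
    return [math.tanh(p) for p in pre]


def generate_caption(image, max_len=6):
    hidden = encode_image(image)         # image features seed the decoder state
    word_id = VOCAB.index("<start>")
    words = []
    for _ in range(max_len):
        hidden = rnn_step(hidden, word_id)
        scores = matvec(W_out, hidden)
        word_id = max(range(len(VOCAB)), key=scores.__getitem__)  # greedy choice
        if VOCAB[word_id] == "<end>":
            break
        words.append(VOCAB[word_id])
    return words


image = [[0.1, 0.2, 0.3, 0.4],
         [0.5, 0.6, 0.7, 0.8],
         [0.9, 0.8, 0.7, 0.6],
         [0.5, 0.4, 0.3, 0.2]]
print(generate_caption(image))
```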


What about the other way around? Can we generate images from text?

First of all, there are the Deep Dream Generator and Style Transfer. Both technologies use deep learning to adjust or rearrange an existing image, and both can render an image in a different style. Below is an example of an image recreated using style transfer. The photo of the houses on the left was combined with Van Gogh’s painting to generate a newly stylized image.



Source: Neural Style Transfer (http://ml-ko.kr/dl-with-python/8.3-neural-style-transfer.html)
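
How can a program "combine" a photo with a painting? The seminar did not go into the mechanics, so the following is an assumption based on the widely used formulation of neural style transfer: the generated image is pushed to keep the photo's content (the features stay in the same places) while matching the painting's style, which is summarized by a Gram matrix of feature correlations. The sketch below uses tiny hand-written feature lists instead of real CNN activations, and the normalization is simplified.

```python
def gram_matrix(features):
    """features: list of channels, each a flat list of activations.
    Entry (i, j) is the dot product of channel i with channel j."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in features]
            for ci in features]


def style_loss(generated, style):
    """Mean squared difference between the two Gram matrices."""
    g, s = gram_matrix(generated), gram_matrix(style)
    n = len(g)
    return sum((g[i][j] - s[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)


def content_loss(generated, content):
    """Mean squared difference between activations at the same positions."""
    diffs = [(a - b) ** 2
             for cg, cc in zip(generated, content)
             for a, b in zip(cg, cc)]
    return sum(diffs) / len(diffs)


# Two channels with two positions each (made-up numbers).
original = [[1, 2], [3, 4]]
shuffled = [[2, 1], [4, 3]]  # same values, positions swapped

print(style_loss(shuffled, original))    # style ignores where things are
print(content_loss(shuffled, original))  # content does not
```

Swapping the positions leaves the style loss at zero but raises the content loss, which is why the two terms can pull from different images: layout from the photo, texture and color statistics from the painting.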


So is it really possible to generate an image from text alone? Yes, it is.

OpenAI announced DALL-E, an image-generating model based on GPT-3, the third generation of its Generative Pre-trained Transformer. The name DALL-E combines the genius painter Salvador Dalí and the robot WALL-E from the movie of the same name. DALL-E combines computer vision and natural language processing to understand text and generate images based on it. It can also draw human-like versions of animals and objects and combine unrelated items in aesthetically pleasing ways to create a new image.

How did this technology come about? GPT-3, the technology behind DALL-E, is a deep learning language model that generates a continuation of whatever text is entered. It can write a story like a human. DALL-E is trained in the same way as GPT-3, but instead of words it generates the pieces of an image, one after another, to build up the picture.
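
The "one piece after another" idea can be sketched as a loop. The toy below treats the prompt and the image as a single sequence of tokens and appends image tokens one at a time, each chosen based on everything generated so far. Everything specific here is an assumption made for illustration: the grid size, the codebook size, the prompt token ids, and above all `next_token_weights`, which is a hypothetical stand-in for the trained transformer. In the real system, a separate decoder would then turn the finished grid of tokens into pixels.

```python
import random

GRID = 4            # toy image: a 4x4 grid of image tokens (assumed size)
CODEBOOK_SIZE = 8   # number of distinct image tokens (assumed size)


def next_token_weights(sequence):
    """Hypothetical stand-in for the trained model: returns one weight per
    possible image token, depending on the whole sequence so far."""
    seed = sum((i + 1) * t for i, t in enumerate(sequence))
    rng = random.Random(seed)
    return [rng.random() for _ in range(CODEBOOK_SIZE)]


def generate_image_tokens(text_tokens, seed=0):
    rng = random.Random(seed)
    sequence = list(text_tokens)             # the prompt comes first
    for _ in range(GRID * GRID):
        weights = next_token_weights(sequence)
        token = rng.choices(range(CODEBOOK_SIZE), weights=weights, k=1)[0]
        sequence.append(token)               # the new token becomes context
    image_tokens = sequence[len(text_tokens):]
    # Arrange the flat token list into rows, top to bottom.
    return [image_tokens[r * GRID:(r + 1) * GRID] for r in range(GRID)]


prompt_tokens = [3, 1, 4]  # made-up token ids standing in for a short prompt
for row in generate_image_tokens(prompt_tokens):
    print(row)
```

Because the tokens are sampled rather than fixed, running the loop with different seeds gives different images for the same prompt, which matches the way several candidate images are shown for each prompt.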

For example, if you enter “An illustration of a baby daikon radish in a tutu walking a dog” as text, it will generate the images below.



Source: Artificial intelligence DALL-E that understands the text and generates images. (https://blog.naver.com/chandong83/222198993535)


Enter “An armchair in the shape of an avocado,” and you will get these images.



Source: Artificial intelligence DALL-E that understands the text and generates images. (https://blog.naver.com/chandong83/222198993535)


Although for now it can only create fairly basic images, this technology could become the foundation for future video intelligence that produces far more sophisticated pictures. Imagine what we will be able to do with it in the future.

While designing your home, you might say something like “I want the front door on the left side of the south wall, and a south-facing window in the living room,” “I want a garden at the center of my living room to grow small plants,” or “I want one bathroom and one guest room on the first floor,” and the AI would draw a floor plan for you.

If you are looking for a specific photo among the hundreds of images on your phone, you could simply ask, “Do you have the photo of me clay shooting at the camp I went to last year?” and it would find it for you right away. How convenient would that be?


In ways like these, AI technology is advancing at full speed in a variety of fields. What we ultimately want from AI development is for it to understand things at a human level and make our lives even more convenient.

The third seminar will be held under the title “Will AI be able to write, compose, and sing songs?” Please look forward to the upcoming seminars.