D01-4 Multimodal Deep Learning Model for Analysis and Synthesis of Shitsukan Images by Disentangling Styles and Shapes
In this research, we study a multimodal deep learning model for Shitsukan (material-quality) recognition and synthesis in images that disentangles texture style and shape features. Specifically, (1) from a large amount of paired image and language data, we automatically learn the correspondence between the texture regions of an image and the texture expressions in language, and construct a shared embedding space of image texture features and language texture features, thereby realizing cross-modal retrieval (recognition) between images and language. (2) Furthermore, by fusing texture embedding vectors with image shape features, we synthesize images with novel textures. The objective of our research is to propose a deep learning model that achieves both tasks in a unified manner. The proposed model is expected to enable (a) “deep” analysis of image textures and linguistic texture expressions using large amounts of data, and (b) fine-grained manipulation of image textures through language.
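The description above does not fix a training objective for component (1). A common way to construct such a shared image–language embedding space is a symmetric contrastive (CLIP-style) loss, in which paired texture features are pulled together and unpaired ones pushed apart; cross-modal retrieval then reduces to nearest-neighbor search in the shared space. The sketch below assumes that objective; the projection architecture, feature dimensions, and temperature handling are illustrative, not the authors' actual design.

```python
# Minimal sketch: shared texture embedding space with a symmetric
# contrastive loss (an assumed objective, not stated in the abstract).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextureEmbedder(nn.Module):
    """Projects image texture features and language texture features
    into one shared embedding space."""
    def __init__(self, img_dim=2048, txt_dim=768, embed_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)    # image texture branch
        self.txt_proj = nn.Linear(txt_dim, embed_dim)    # language texture branch
        self.log_temp = nn.Parameter(torch.tensor(0.0))  # learnable temperature

    def forward(self, img_feat, txt_feat):
        # L2-normalize so dot products are cosine similarities
        z_img = F.normalize(self.img_proj(img_feat), dim=-1)
        z_txt = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return z_img, z_txt

def contrastive_loss(z_img, z_txt, log_temp):
    """Symmetric InfoNCE loss over a batch of paired image/text features."""
    logits = z_img @ z_txt.t() * log_temp.exp()  # (B, B) similarity matrix
    targets = torch.arange(z_img.size(0))        # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with a batch of 8 paired texture features.
model = TextureEmbedder()
img_feat = torch.randn(8, 2048)  # e.g. CNN texture statistics of an image region
txt_feat = torch.randn(8, 768)   # e.g. language-model features of a texture phrase
z_img, z_txt = model(img_feat, txt_feat)
loss = contrastive_loss(z_img, z_txt, model.log_temp)
loss.backward()
```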
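For component (2), one plausible fusion mechanism, again an assumption rather than the authors' stated method, is adaptive instance normalization (AdaIN): the texture embedding predicts per-channel scale and shift parameters that modulate the shape feature maps before decoding. All module names and sizes below are hypothetical.

```python
# Minimal sketch: fusing a texture embedding with shape features via an
# assumed AdaIN-style mechanism, then decoding to an image.
import torch
import torch.nn as nn

class AdaINFusion(nn.Module):
    """Modulates shape feature maps with a texture embedding vector."""
    def __init__(self, embed_dim=256, n_channels=512):
        super().__init__()
        self.affine = nn.Linear(embed_dim, 2 * n_channels)  # embedding -> (scale, shift)
        self.norm = nn.InstanceNorm2d(n_channels, affine=False)

    def forward(self, shape_feat, tex_embed):
        # shape_feat: (B, C, H, W) shape features; tex_embed: (B, D)
        scale, shift = self.affine(tex_embed).chunk(2, dim=-1)
        normed = self.norm(shape_feat)  # strip the source's own texture statistics
        return normed * (1 + scale[..., None, None]) + shift[..., None, None]

class Synthesizer(nn.Module):
    """Decodes the fused features into an image carrying the requested texture."""
    def __init__(self, n_channels=512):
        super().__init__()
        self.fuse = AdaINFusion(n_channels=n_channels)
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(n_channels, 128, 3, padding=1),
            nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(128, 3, 3, padding=1),
            nn.Tanh(),
        )

    def forward(self, shape_feat, tex_embed):
        return self.decoder(self.fuse(shape_feat, tex_embed))

# Toy usage: because the texture embedding lives in the shared space, it can
# come from either modality, so language can drive the texture of the output.
net = Synthesizer()
shape_feat = torch.randn(2, 512, 16, 16)
tex_embed = torch.randn(2, 256)    # e.g. z_txt for a phrase like "glossy metal"
img = net(shape_feat, tex_embed)   # (2, 3, 64, 64) synthesized images
```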