Abstract
Outfit coordination is a direct way for people to express themselves. However, judging the compatibility between a top and a bottom requires weighing multiple factors, such as color and style, which makes the process time-consuming and error-prone. In recent years, the development of large language models and large multi-modal models has transformed many application fields. This study explores how to leverage these models to achieve breakthroughs in fashion outfit recommendation. Our approach combines the keyword responses produced by the large language model Gemini on a Visual Question Answering (VQA) task with the deep feature fusion of the large multi-modal model Beit3. Users need only provide clothing images to evaluate the compatibility of tops and bottoms, making the process far more convenient. Our proposed model, the Large Multi-modality Language Model for Outfit Recommendation (LMLMO), outperforms previously proposed models on the FashionVC and Evaluation3 datasets. Moreover, experimental results show that different types of keyword responses affect the model in different ways, offering new directions and insights for future research.