About Me
I am a Ph.D. student at the Research Center for Social Computing and Information Retrieval (SCIR), Harbin Institute of Technology (HIT, China). My advisors are Prof. Yanyan Zhao and Prof. Bing Qin. My research interests include multimodal fine-grained sentiment analysis, multimodal large language models, and multimodal data generation.
PUBLICATIONS
- Large language models (LLMs) have demonstrated formidable capabilities in both task-oriented and chit-chat dialogues. However, when extended to large vision-language models (LVLMs), we find that LVLMs excel at objectively describing image content, while their ability to hold subjective multimodal emotional chit-chat (MEC) dialogues remains insufficient, a shortfall we attribute to the scarcity of high-quality MEC data. Collecting and annotating high-quality MEC dialogue data would help, but the cost of doing so makes large-scale acquisition challenging. To address this gap, we introduce U2MEC, an adversarial LLM-based data augmentation framework that generates MEC dialogue data from unimodal data.
- This survey presents a comprehensive review of recent research on text-centric multimodal sentiment analysis, examines the potential of LLMs for this task, outlining their approaches, advantages, and limitations, summarizes the application scenarios of LLM-based multimodal sentiment analysis technology, and explores the challenges and potential future research directions for multimodal sentiment analysis.
- In this paper, we evaluate different abilities of GPT-4V including visual understanding, language understanding, visual puzzle solving, and understanding of other modalities such as depth, thermal, video, and audio.
- We propose the UNIMO-3 model, which simultaneously learns multimodal in-layer and cross-layer interactions. UNIMO-3 can establish effective connections between different layers in a cross-modal encoder and adaptively capture the interaction between the two modalities at different levels (see the cross-layer gating sketch after this list).
- Proposed a cross-modal translation approach that explicitly utilizes facial expressions as visual emotional cues in open-domain images. Introduced a fine-grained cross-modal alignment method based on CLIP that aligns and matches textual sentiment targets with facial expressions in images (see the CLIP alignment sketch after this list). The method achieved state-of-the-art results on the Twitter-15 and Twitter-17 datasets.
- Constructed MACSA, the first Chinese multimodal sentiment analysis dataset with fine-grained cross-modal alignment annotations for both text and images. Proposed a cross-modal fine-grained alignment annotation scheme based on aspect categories, mitigating the problems of weakly supervised text-image alignment and missing sentiment targets in text. On the MACSA dataset, introduced a cross-modal alignment fusion network based on multimodal heterogeneous graphs, achieving state-of-the-art results (see the heterogeneous graph sketch after this list).
- Reviewed and summarized the relevant research in multimodal sentiment analysis, and proposed a research framework for fine-grained multimodal sentiment analysis.
- We propose the sentiment word aware multimodal refinement model (SWRM), which dynamically refines erroneous sentiment words by leveraging multimodal sentiment clues (see the word refinement sketch after this list). We conduct extensive experiments on real-world datasets, including MOSI-Speechbrain, MOSI-IBM, and MOSI-iFlytek, and the results demonstrate the effectiveness of our model, which surpasses current state-of-the-art models on all three datasets. Furthermore, our approach can be easily adapted to other multimodal feature fusion models.
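For readers unfamiliar with cross-layer interaction (the UNIMO-3 entry above), the toy module below shows one way an encoder could adaptively mix hidden states from two layers with a learned gate. It is only a minimal sketch of the general idea; the module name, dimensions, and gating choice are illustrative assumptions, not the actual UNIMO-3 architecture.

```python
import torch
import torch.nn as nn

class CrossLayerGate(nn.Module):
    """Toy illustration: adaptively mix hidden states from two encoder layers.

    This is NOT the UNIMO-3 implementation; it only sketches the idea of
    letting the model learn how much each layer's representation contributes.
    """

    def __init__(self, hidden_size: int = 768):
        super().__init__()
        # The gate looks at both layers' features and outputs a per-token weight.
        self.gate = nn.Linear(2 * hidden_size, 1)

    def forward(self, lower_layer: torch.Tensor, upper_layer: torch.Tensor) -> torch.Tensor:
        # lower_layer, upper_layer: (batch, seq_len, hidden_size)
        g = torch.sigmoid(self.gate(torch.cat([lower_layer, upper_layer], dim=-1)))
        return g * upper_layer + (1.0 - g) * lower_layer

# Usage with random features standing in for cross-modal encoder outputs.
mixer = CrossLayerGate(hidden_size=768)
fused = mixer(torch.randn(2, 16, 768), torch.randn(2, 16, 768))
print(fused.shape)  # torch.Size([2, 16, 768])
```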
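The next snippet sketches how textual sentiment targets can be matched against cropped face regions with off-the-shelf CLIP embeddings (via Hugging Face transformers), in the spirit of the CLIP-based alignment entry above. The image paths, target phrases, and the idea of pre-cropped face regions are illustrative assumptions; the published fine-grained alignment method is more involved than this cosine-similarity matching.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical inputs: sentiment target phrases and pre-cropped face regions.
targets = ["the smiling coach", "the disappointed fan"]
face_crops = [Image.open(p) for p in ["face_0.jpg", "face_1.jpg"]]  # placeholder paths

inputs = processor(text=targets, images=face_crops, return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])

# Cosine similarity: rows = sentiment targets, columns = face crops.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
similarity = text_emb @ image_emb.T
best_face_per_target = similarity.argmax(dim=-1)
print(similarity, best_face_per_target)
```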
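As a companion to the MACSA entry, the sketch below builds a tiny multimodal heterogeneous graph with PyTorch Geometric's `HeteroData`, where aspect-category nodes link text and image nodes. Node types, edge types, and feature dimensions are assumptions for illustration only, not the published cross-modal alignment fusion network.

```python
import torch
from torch_geometric.data import HeteroData  # pip install torch_geometric

# Toy multimodal heterogeneous graph: text phrases, image regions, aspect categories.
data = HeteroData()
data["text"].x = torch.randn(5, 768)      # 5 text nodes (assumed features)
data["image"].x = torch.randn(3, 2048)    # 3 image-region nodes (assumed features)
data["aspect"].x = torch.randn(2, 128)    # 2 aspect-category nodes

# Edges connect each aspect category to the text and image nodes it aligns with.
data["aspect", "aligns", "text"].edge_index = torch.tensor([[0, 0, 1],
                                                            [0, 2, 4]])
data["aspect", "aligns", "image"].edge_index = torch.tensor([[0, 1],
                                                             [1, 2]])

print(data)
print(data.node_types, data.edge_types)
```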
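The last sketch illustrates the intuition behind refining an erroneous ASR sentiment word, as in the SWRM entry: mask the suspect word, let a masked language model propose candidates, and re-rank them with a multimodal sentiment score. The `multimodal_sentiment_score` function is a hypothetical placeholder (it only checks polarity agreement with a toy cue) and is not the SWRM model.

```python
from transformers import pipeline

# A masked LM proposes replacements for a low-confidence ASR sentiment word.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

asr_hypothesis = "the movie was [MASK] and I loved every minute"
candidates = fill_mask(asr_hypothesis, top_k=5)  # dicts with 'token_str' and 'score'

def multimodal_sentiment_score(word: str, visual_acoustic_polarity: float) -> float:
    """Hypothetical placeholder: reward candidates whose polarity matches the
    (assumed) sentiment cue from the visual and acoustic modalities."""
    positive = {"great", "amazing", "wonderful", "good", "fantastic"}
    negative = {"terrible", "awful", "bad", "boring", "horrible"}
    word_polarity = 1.0 if word in positive else -1.0 if word in negative else 0.0
    return word_polarity * visual_acoustic_polarity

# Assume the visual/acoustic cues indicate positive sentiment (+1.0).
reranked = sorted(
    candidates,
    key=lambda c: c["score"] + multimodal_sentiment_score(c["token_str"].strip(), 1.0),
    reverse=True,
)
print(reranked[0]["token_str"])
```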