The Taub Faculty of Computer Science Events and Talks

Multimodal Image Mappings: From Geometric Alignment to Captions
Noam Rotstein (M.Sc. Thesis Seminar)
Sunday, 22.10.2023, 13:30
Zoom Lecture: 93548839223
Advisor: Prof. Ron Kimmel
At the heart of our research lies a simple yet profound question: how can we bridge the gap between visual, geometric, and language modalities? We first address the task of aligning colored point clouds embedded in 3D, acquired by a colored depth scanner, with color images provided by conventional cameras. The two data forms differ inherently in both structural and chromatic properties. A tailored optimization procedure aligns the point cloud with the camera image by reducing the photometric discrepancies between them, yielding precise cross-modal correspondence and RGBD data that enhances geometric reconstruction.

Our second effort focuses on the relation between images and text. While vision-language pre-training (VLP) has significantly advanced image captioning, models tend to produce generic descriptions that omit salient details. This issue stems from training datasets that capture broad image content but frequently overlook specifics. To address this, we enrich captions with insights from "frozen" vision experts such as object detectors and attribute extractors. Our method, FuseCap, fuses their outputs with the original captions via a large language model (LLM), producing a vast dataset of detailed captions. Models trained on this data offer enhanced performance and richer descriptions.

Beyond intermodality, a core principle across both studies is a data-centric AI strategy: by focusing on data quality rather than exhaustive model refinement, we significantly enhance the outcomes of data-hungry models.
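The photometric alignment idea in the first effort can be illustrated with a minimal sketch: project the colored points into the image with a pinhole camera model, measure the squared color difference between each point's color and the pixel it lands on, and search for the pose that minimizes this residual. The function names, the translation-only search, and the coarse grid search below are illustrative simplifications of my own choosing; the thesis describes a tailored optimization over the full alignment, not this toy procedure.

```python
import numpy as np

def project(points, K):
    """Pinhole projection of Nx3 camera-frame points with 3x3 intrinsics K."""
    uvw = points @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def photometric_loss(points, colors, image, K, t):
    """Sum of squared differences between each point's color and the image
    pixel it projects to under a candidate translation t (rotation omitted
    for brevity); points falling outside the image are ignored."""
    uv = np.round(project(points + t, K)).astype(int)
    h, w = image.shape[:2]
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    diff = image[uv[valid, 1], uv[valid, 0]] - colors[valid]
    return float((diff ** 2).sum())

# Synthetic scene where the correct offset is zero: two colored points whose
# projections hit pixels of exactly matching color.
K = np.eye(3)
image = np.zeros((8, 8, 3))
image[3, 2] = [1.0, 0.0, 0.0]
image[1, 5] = [0.0, 1.0, 0.0]
points = np.array([[2.0, 3.0, 1.0], [5.0, 1.0, 1.0]])
colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])

# Coarse search over candidate translations, keeping the lowest residual.
candidates = [np.array([dx, dy, 0.0]) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
best = min(candidates, key=lambda t: photometric_loss(points, colors, image, K, t))
```

In this toy setup the residual vanishes at the true alignment, so the search recovers the zero offset; the real problem replaces the grid search with gradient-based optimization over a full rigid transform.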