The Taub Faculty of Computer Science Events and Talks
Noam Rotstein (M.Sc. Thesis Seminar)
Sunday, 22.10.2023, 13:30
Advisor: Prof. Ron Kimmel
At the heart of our research lies a simple yet profound question: how can we bridge the gaps between visual, geometric, and language modalities? First, we address the task of aligning colored 3D point clouds, obtained by a colored depth scanner, with color images captured by conventional cameras. These two data forms differ inherently in both structural and chromatic properties. We use a tailored optimization procedure that aligns the point cloud and the camera image by reducing photometric discrepancies between them. This precise cross-modal correspondence yields RGBD data that improves geometric reconstruction.

In our second effort, we focus on the relation between images and text. While vision-language pre-training (VLP) has significantly advanced image captioning, models tend to produce generic descriptions that omit salient details. This issue stems from training datasets that capture broad image content but frequently overlook specifics. To address it, we enrich captions with insights from "frozen" vision experts such as object detectors and attribute extractors. Our method, FuseCap, fuses this information with the original captions via a large language model (LLM), producing a vast dataset of detailed captions. Models trained on this data achieve better performance and richer descriptions.

Beyond intermodality, a core principle shared by both studies is a data-centric AI strategy: by focusing on data quality rather than exhaustive model refinement, we significantly improve the outcomes of data-hungry models.
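The photometric-alignment idea in the first effort can be illustrated with a toy sketch. This is not the thesis's actual pipeline — the real problem optimizes a full sensor pose — but it shows the core principle under simplified assumptions: a 2D translation between a colored point set and a synthetic intensity field is recovered by minimizing the photometric discrepancy between point colors and the image values at the candidate alignment. All names (`intensity`, `photometric_loss`, the grid search) are hypothetical.

```python
# Toy photometric alignment (illustrative sketch, not the thesis's method):
# recover a 2D translation between colored points and an "image" by
# minimizing squared photometric residuals.

def intensity(x, y):
    """Synthetic grayscale image: a smooth intensity field over the plane."""
    return 0.5 * x * x + 0.3 * y * y + 0.1 * x * y

# Colored points in the scanner frame; each color is the image intensity
# at the point's location under the (unknown) true offset.
TRUE_OFFSET = (0.7, -0.4)
points = [(px * 0.1, py * 0.1) for px in range(10) for py in range(10)]
colors = [intensity(x + TRUE_OFFSET[0], y + TRUE_OFFSET[1]) for x, y in points]

def photometric_loss(dx, dy):
    """Sum of squared differences between point colors and image values."""
    return sum((intensity(x + dx, y + dy) - c) ** 2
               for (x, y), c in zip(points, colors))

# Coarse grid search over candidate translations; a realistic pipeline
# would run gradient-based optimization over a full 6-DoF pose instead.
candidates = [(dx * 0.1, dy * 0.1)
              for dx in range(-10, 11) for dy in range(-10, 11)]
best = min(candidates, key=lambda t: photometric_loss(*t))
print(best)  # the grid candidate nearest the true offset
```

The loss is smooth in the translation, so the grid minimum lands on the candidate closest to the true offset; this mirrors how reducing photometric discrepancy pins down the modality correspondence.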
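The FuseCap-style fusion step can likewise be sketched in miniature. The actual method feeds the sources to an LLM that rewrites the caption; this hedged toy merely assembles the kind of fusion prompt such a system might use, combining an original caption with hypothetical outputs of frozen vision experts (an object detector and an attribute extractor). All names and the prompt wording are assumptions for illustration.

```python
# Toy sketch of caption fusion in the spirit of FuseCap (illustrative only):
# build the instruction that would be handed to an LLM, merging the original
# caption with outputs from frozen vision experts.

original_caption = "A man riding a bike."
detected_objects = ["man", "bicycle", "helmet", "street"]   # object detector
attributes = {"helmet": "red", "bicycle": "mountain bike"}  # attribute extractor

def build_fusion_prompt(caption, objects, attrs):
    """Compose a fusion prompt listing all expert-provided details."""
    attr_lines = ", ".join(f"{obj} is {a}" for obj, a in attrs.items())
    return (
        "Rewrite the caption so it includes every detected detail.\n"
        f"Caption: {caption}\n"
        f"Detected objects: {', '.join(objects)}\n"
        f"Attributes: {attr_lines}\n"
        "Enriched caption:"
    )

prompt = build_fusion_prompt(original_caption, detected_objects, attributes)
print(prompt)
```

Running the fused prompt through an LLM would yield an enriched caption (e.g., mentioning the red helmet) — exactly the kind of detail the original training captions omit.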