Macaw-LLM is an exploratory endeavor that pioneers multi-modal language modeling by seamlessly combining image🖼️, video📹, audio🎵, and text📝 data, built upon the foundations of CLIP, Whisper, and ...
e.g.
$ . scripts/macaw_dir.sh # MACAW training on Cheetah-Direction (Figure 1)
$ . scripts/macaw_vel.sh # MACAW training on Cheetah-Velocity (Figure 1)
$ . scripts/macaw_quality_ablation.sh # Data ...