I read one ML-related paper every weekday and post a 5-minute video summary to my YouTube channel. This page collects all of those together with short text descriptions.

#6 Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V [paper]

As we've seen before, LLM-based visual agents are pretty good at planning what to do when completing high-level tasks, but pretty bad at "grounding", i.e. turning the plan into an executable action.

Set-of-Mark prompting is a proposed technique to make grounding easier: by overlaying masks and labels on the image input, we give the model explicit marks it can refer to, which helps it ground the task better.
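To make the idea concrete, here is a minimal sketch of the bookkeeping side of mark-based prompting. It is not the paper's implementation: the `Region` class, the function names, and the hard-coded bounding boxes are all assumptions made up for illustration, and the sketch skips both the segmentation step and the actual drawing of marks on the image. It only shows how regions get numbered and how a model's reply ("click 2") can be mapped back to pixel coordinates.

```python
import re
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical segmented region, described by a pixel bounding box."""
    name: str   # human-readable description, used only in this demo
    box: tuple  # (left, top, right, bottom)


def assign_marks(regions):
    """Give each region a numeric mark and the point where it would be drawn."""
    marks = {}
    for mark_id, region in enumerate(regions, start=1):
        left, top, right, bottom = region.box
        center = ((left + right) // 2, (top + bottom) // 2)
        marks[mark_id] = (region, center)
    return marks


def build_prompt(task, marks):
    """Build the text part of the prompt; the marked image would go alongside it."""
    mark_list = ", ".join(f"[{mark_id}]" for mark_id in marks)
    return (
        f"The image is annotated with numbered marks: {mark_list}.\n"
        f"Task: {task}\n"
        "Answer with the number of the mark to act on."
    )


def resolve_mark(reply, marks):
    """Map the first number in the model's reply back to pixel coordinates."""
    match = re.search(r"\d+", reply)
    if match is None:
        return None
    entry = marks.get(int(match.group()))
    return entry[1] if entry else None


# Assumed example regions; in practice these would come from a segmentation model.
regions = [
    Region("search box", (10, 10, 210, 50)),
    Region("submit button", (220, 10, 300, 50)),
]
marks = assign_marks(regions)
prompt = build_prompt("Submit the search form", marks)
click_point = resolve_mark("I would click 2", marks)  # (260, 30)
```

The point of the design is that the model only has to name a mark, which is a much easier output than raw coordinates; turning the mark into a click is then plain lookup code.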

#5 Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models [paper]

If you aspire to become an LLM Sommelier, you should definitely read this paper and use the dataset to help you. As new multimodal models are released, evaluating them and understanding their relative strengths and weaknesses is hard. LMSys helps, but only gives us an overall ranking, not a granular understanding.

This paper contributes a dataset that makes this evaluation process easier, plus a method for running the evaluation. (I love the dataset part; I’m a little skeptical about the evaluation part.)

#4 GPT4V(ision) is a Generalist Web Agent, if Grounded [paper]

#3 Mind2Web: Towards Generalist Agent for the Web [paper]

#2 Don't Generate, Discriminate [paper]

#1 More Agents Is All You Need [paper]