Sunday, November 12, 2023

On Multimodal Foundation Models: From Specialists to General-Purpose Assistants

This survey paper was authored by Microsoft researchers. One of them is Jianfeng Gao, Distinguished Scientist & Vice President at Microsoft.

I can barely recommend it!

The table of contents fails to list the gigantic bibliography section.

This paper remains at a very simplistic level. Few, if any, concepts, methods, etc. are explained (usually only references to other papers are provided).

Quite a few acronyms are not spelled out (e.g. HED).

Many of the figures created by the authors are rather useless and could have been prepared by a child.


Given that this paper covers one of the most recent and novel research fields in machine learning & AI (i.e. multimodal foundation models), it is very surprising that the authors came up with almost 570 references (see below for more details).

Some other notes:

  1. The link in footnote no. 1 on p. 8 appears to be broken.
  2. What is the difference between a denoising U-Net and the original U-Net? No explanation is given (see pp. 30-31). The authors say that both are similar, but similar is not the same.
  3. On p. 31, the famous Q, K, V symbols from self-attention are introduced, but without any further explanation.
  4. Figure 4.17 is not mentioned in the text.
  5. Quite a few spelling errors (e.g. "cricial" on p. 70 or "speicifc" on p. 74), which is surprising in this day and age. This may indicate sloppiness.

Remarks on the extensive bibliography/references:

  1. Almost 570 references, filling 26 of 119 pages total (about 22% of all pages). Way too many references, as if the authors cobbled together every paper under the sun. Of the over one thousand papers in my own machine learning & AI library (including many other survey papers), this paper has the second-highest reference count.
  2. Wrong citation of one of the most famous machine learning papers, i.e. "Generative adversarial nets" (2014) by first author Ian Goodfellow, published at NIPS. Unfortunately, the authors cited the 2020 republication of this paper, titled "Generative adversarial networks", in the Communications of the ACM. What an amateur mistake and what a misrepresentation of GANs.
  3. Lots of low-quality references, in my opinion.
  4. Tons of the cited papers have authors exclusively with Chinese last names (a bit unusual). To be more precise, about 215 references, or 38% of the total. Don't get me wrong, I like China and its people very much.
  5. It cites at least 20 survey papers. Some of these surveys are of dubious value (e.g. extremely narrowly focused). If you don't know what to write about, or for lack of a topic, write a survey paper.
  6. They also cited every benchmark, dataset, etc., no matter how relevant or how widely they are actually accepted and utilized. They cite at least 11 benchmark and 15 dataset papers.
  7. It also appears that they cited any somewhat related paper that was published in 2023. The count is 227 (or 40% of all references).
