I have not yet had time to check out this latest model by AllenAI, but it seems to be another significant step toward comprehensive multimodal models.
How much information is lost to the compression and sparsification of the input data?
"Unified-IO is the first neural model to perform a large and diverse set of AI tasks spanning classical computer vision, image synthesis, vision-and-language, and natural language processing (NLP). Unified-IO achieves this broad unification by homogenizing every task's input and output into a sequence of tokens drawn from a discrete and finite vocabulary. Dense inputs such as images, masks, and depth maps are converted to sequences using a universal compressor, and sparse structured inputs such as bounding boxes and human joint locations are transcribed into language, which is naturally sequential.
This approach of unifying input and output data enables us to train a single sequence-to-sequence Unified IO model to perform tasks across more than 80 diverse computer vision and NLP benchmarks. ..."