This could be an interesting paper by Andrew Zisserman and his team!
How many people are deaf? According to Google, about 11 million Americans, or 3.6% of the population.
Personally, every time I see a human sign language interpreter next to, e.g., a politician in a video, it distracts me and drives me nuts to watch them wildly gesticulating (maybe I suffer from ADHD or autism).
Hopefully, machine learning & AI can mediate here very soon, so that the signing is only displayed to those viewers who are deaf or who enjoy watching sign language interpreters.
From the abstract:
"Our aim is to develop a unified model for sign language understanding, that performs sign language translation (SLT) and sign-subtitle alignment (SSA). Together, these two tasks enable the conversion of continuous signing videos into spoken language text and also the temporal alignment of signing with subtitles -- both essential for practical communication, large-scale corpus construction, and educational applications.
To achieve this, our approach is built upon three components:
(i) a lightweight visual backbone that captures manual and non-manual cues from human keypoints and lip-region images while preserving signer privacy;
(ii) a Sliding Perceiver mapping network that aggregates consecutive visual features into word-level embeddings to bridge the vision-text gap; and
(iii) a multi-task scalable training strategy that jointly optimises SLT and SSA, reinforcing both linguistic and temporal alignment.
To promote cross-linguistic generalisation, we pretrain our model on large-scale sign-text corpora covering British Sign Language (BSL) and American Sign Language (ASL) from the BOBSL and YouTube-SL-25 datasets.
With this multilingual pretraining and strong model design, we achieve state-of-the-art results on the challenging BOBSL (BSL) dataset for both SLT and SSA. Our model also demonstrates robust zero-shot generalisation and finetuned SLT performance on How2Sign (ASL), highlighting the potential of scalable translation across different sign languages."
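To make component (ii) a bit more concrete, here is a minimal sketch in PyTorch of what a Sliding Perceiver-style mapping network could look like: a small set of learnable latent queries cross-attends over a sliding window of consecutive frame features and compresses each window into a handful of word-level tokens for the text decoder. This is my own guess, not the authors' code; the dimensions, window size, stride and latent count are illustrative assumptions.

```python
# Sketch of a "Sliding Perceiver" mapping network (assumptions, not the paper's code).
import torch
import torch.nn as nn


class SlidingPerceiverMapper(nn.Module):
    def __init__(self, vis_dim=512, txt_dim=768, n_latents=4,
                 window=16, stride=8, n_heads=8):
        super().__init__()
        self.window, self.stride = window, stride
        # Learnable latent queries shared across all windows.
        self.latents = nn.Parameter(torch.randn(n_latents, txt_dim) * 0.02)
        self.proj_in = nn.Linear(vis_dim, txt_dim)
        self.cross_attn = nn.MultiheadAttention(txt_dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.LayerNorm(txt_dim),
            nn.Linear(txt_dim, 4 * txt_dim),
            nn.GELU(),
            nn.Linear(4 * txt_dim, txt_dim),
        )

    def forward(self, frame_feats):
        # frame_feats: (B, T, vis_dim) features from the lightweight visual backbone.
        B, T, _ = frame_feats.shape
        x = self.proj_in(frame_feats)
        outputs = []
        for start in range(0, max(T - self.window, 0) + 1, self.stride):
            win = x[:, start:start + self.window]            # (B, W, txt_dim)
            q = self.latents.unsqueeze(0).expand(B, -1, -1)  # (B, n_latents, txt_dim)
            z, _ = self.cross_attn(q, win, win)              # latents attend to the window
            z = z + self.ffn(z)
            outputs.append(z)
        # (B, n_windows * n_latents, txt_dim): word-level tokens for the LM decoder.
        return torch.cat(outputs, dim=1)


if __name__ == "__main__":
    feats = torch.randn(2, 500, 512)           # 20 s at 25 fps, as in Figure 1
    tokens = SlidingPerceiverMapper()(feats)
    print(tokens.shape)                         # torch.Size([2, 244, 768])
```

The appeal of this kind of design is that the number of tokens handed to the language model grows with the number of windows, not the number of frames, which keeps the vision-text interface compact.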
Figure 1. A unified sign language understanding model. Given signing data, our model performs both SLT and SSA, guided by textual prompts. For both tasks, a 500-frame (20s at 25 fps) video is used as input.
In SLT mode, the model receives the sign video with frame-level timestamps specifying the region of interest (not shown for clarity), and generates a spoken language translation for that segment.
In SSA mode, the model takes the sign video and a target sentence along with its audio-aligned timestamps (if available), and predicts the temporal alignment of that sentence with the signing.
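To illustrate the two task modes in Figure 1, here is a rough sketch of how the textual prompts might be constructed; the field names and wording are my own guesses, not the paper's actual prompt format.

```python
# Hypothetical prompt construction for the two task modes (illustrative only).
FPS = 25  # frame rate from Figure 1


def slt_prompt(start_frame, end_frame):
    # Translation mode: ask for spoken-language text for a region of interest.
    return (f"Task: translate. "
            f"Segment: {start_frame / FPS:.2f}s-{end_frame / FPS:.2f}s. "
            f"Output: spoken language translation.")


def ssa_prompt(sentence, audio_start=None, audio_end=None):
    # Alignment mode: given a target sentence (and rough audio-aligned timing,
    # if available), ask for timestamps aligned to the signing instead.
    timing = (f" Audio-aligned: {audio_start:.2f}s-{audio_end:.2f}s."
              if audio_start is not None else "")
    return (f'Task: align. Sentence: "{sentence}".{timing} '
            f"Output: signing-aligned start and end times.")


print(slt_prompt(125, 250))
print(ssa_prompt("The weather will improve tomorrow.", 5.0, 7.4))
```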