
LIS2Speech Project and Lexis Integration

Deaf and hard-of-hearing people can communicate with each other using sign language, but they may have difficulties in connecting with the rest of society. Sign Language Recognition is a field of study whose first works date back to 1983, but only in the last decade has the task gained wider attention. Most of the published works concern American, Chinese and German sign languages, whereas studies on Italian Sign Language (LIS) are still scarce.

To address this problem, Neural Networks, Deep Learning and Computer Vision have been exploited to create an application, called LIS2Speech (LIS2S), capable of returning the Italian translation of a LIS sign performed in a recorded video. The method relies on hand, body and face skeletal features extracted from RGB videos, without the need for any additional equipment such as coloured gloves. Since the goal is to reach as many people as possible, LIS2S has been developed as a Progressive Web App, which can run on any camera-equipped device, be it a computer or a smartphone.

The results obtained with this approach are in line with those of automatic tools developed for other sign languages: the model correctly discriminates between signs belonging to a vocabulary of 50 words, a size comparable with that of other corpora for isolated sign language recognition. In addition, a new dataset for Continuous Sign Language Recognition (CSLR) has been created and is being constantly expanded, with the aim of providing a publicly available benchmark for this kind of task.

Social problem

Spoken languages and sign languages differ in a number of important ways: the former use the “vocal-auditory” channel, since sound is produced with the mouth and perceived with the ear; the latter use the “corporal-visual” channel, since signs are produced with the body (hands, arms, facial expressions) and perceived with the eyes.

Sign languages come in several flavours: they are not international, and even within a national sign language different dialects exist. They are natural languages, since they evolved spontaneously wherever communities of deaf people had the possibility of communicating with one another, and they are not derived from spoken languages, because they have their own vocabulary and grammatical structures.

The fundamental building block of a sign language is the gloss, which combines manual and non-manual features and represents the closest meaning of a sign. Depending on the context, a specific feature can be the decisive factor in the interpretation of a gloss: it may change the meaning of a verb, provide spatial or temporal information, or discriminate between objects or people.

There is an intrinsic difficulty in the communication between the deaf community and the rest of society (according to ANSA, in 2018 there were more than 72 million sign language users worldwide), so the design of robust systems for automatic sign language recognition would greatly reduce this gap.

Sign Language Recognition (SLR) can be defined as the task of deducing the glosses performed by a signer from video recordings. It is related to human action and gesture recognition, but automatic SLR exhibits the following additional challenges:

  • The interpretation of a sign is highly affected by its exact position in the surrounding space and by the context. For example, there are no personal pronouns (e.g., “he”, “she”), because the signer points directly at the involved person.
  • Many glosses are only discernible by their non-manual components, which are usually difficult to detect.
  • The execution speed of a gloss can change its meaning. For instance, signers would not use two glosses to express “run quickly”; they would simply accelerate the execution of the involved sign.

Machine Learning (ML) and Deep Learning (DL) methods are the basis of modern Computer Vision (CV), an interdisciplinary scientific field that deals with how computers can gain high-level understanding from digital images or videos. SLR is closely related to CV and benefits from the significant performance improvements achieved in many video-related tasks thanks to the rise of deep networks.

LIS2Speech’s goal is to produce a tool that improves the integration of deaf people with the rest of society; this tool should be easily accessible by anyone, which is why we chose to develop an application that can run on laptops and even smartphones.

State of the art

The past decade has seen the rapid expansion of DL techniques in many applications involving spatio-temporal inputs. Computer vision is a very promising research area with ample room for improvement; video-related tasks such as human action recognition, gesture recognition and motion capture have seen considerable progress in performance. SLR is closely related to CV, since it requires analysing and processing video sequences to extract meaningful information, which is why most approaches to SLR have moved in this direction.

SLR can play an important role in addressing the integration of deaf people with the rest of society. The attention paid by the international community to this problem has been growing in recent years: the number of published studies is increasing, as is the quantity of available datasets. There are different automatic SLR tasks, depending on the level of detail of the modelling and of the subsequent recognition step; they can be roughly divided into:

  • Isolated SLR: in this category, most of the methods aim to address video segment classification tasks, given the fundamental assumption that a single gloss is present.
  • Sign detection in continuous streams: the objective of these approaches is to recognize a set of predefined glosses in a continuous video flow.
  • Continuous SLR (CSLR): the goal of these methods is to identify the sequence of glosses present in a continuous, non-segmented video. This category best matches the requirements of real-life SLR applications.

This distinction is necessary to understand the different kinds of problems posed by each task. Historically, before the advent of deep learning methods, the focus was on identifying isolated glosses and on gesture spotting, which is why studies on isolated SLR are more common. The following image shows the trend of isolated and continuous recognition studies in blocks of five years up to 2020: growth looks exponential for isolated studies, while it is close to linear for continuous ones. This reflects the difficulty of the continuous recognition scenario and the scarcity of available training datasets; on average, at least twice as many studies are published using isolated sign language data.

In terms of vocabulary size, the majority of isolated SLR works model a very limited number of signs (below 50), while CSLR studies are spread more or less evenly across all vocabulary sizes. This trend can be observed in the next figure.

Technologies and architectures

The LIS2S application is mainly made of two parts. On the client side there is a Progressive Web Application (PWA), used by admins and users to access the functionality provided by the software. The back-end is managed by a server process constantly listening for requests coming from the application: whenever a new request is received by the server, it runs a new instance of a Docker container, which is in charge of processing the data attached to the request and returning the translation to the client. At the prototype stage the server was hosted on a proprietary machine, while at the final stage the LEXIS cluster will be used to provide a highly scalable and performant service.
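As a rough illustration of this request flow, the following is a minimal sketch assuming a Flask endpoint and the Docker SDK for Python; the endpoint name, image name and paths are placeholders, not the actual LIS2S service code.

```python
import uuid
import docker                      # Docker SDK for Python
from flask import Flask, request, jsonify

app = Flask(__name__)
docker_client = docker.from_env()

@app.route("/translate", methods=["POST"])
def translate():
    # Save the uploaded video under a unique name so a dedicated container
    # can read it from a shared volume (paths are illustrative).
    clip_path = f"/data/uploads/{uuid.uuid4()}.mp4"
    request.files["video"].save(clip_path)

    # Spin up a fresh container that processes the clip and prints the result.
    output = docker_client.containers.run(
        image="lis2s-translator:latest",                 # hypothetical image name
        command=["python", "translate.py", clip_path],
        volumes={"/data/uploads": {"bind": "/data/uploads", "mode": "ro"}},
        remove=True,                                      # clean up after the run
    )
    return jsonify({"translation": output.decode().strip()})

if __name__ == "__main__":
    app.run()
```

Running one short-lived container per request keeps each translation isolated and stateless, which is what makes the later move to a shared cluster straightforward.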

Moving on to the back-end of the LIS2S application, the core of the translation mechanism is the processing of the received videos. This task has been realised by exploiting the Python programming language, which offers a complete set of tools and libraries for Computer Vision and data manipulation. In more detail, the OpenCV and MediaPipe packages have been used to process the videos of our dataset and extract skeletal information about the hands, face and upper body of the subject; some examples of the data extracted from single frames are shown alongside.
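A minimal sketch of this extraction step, assuming the MediaPipe Holistic solution and an example file name ("sign.mp4" is a placeholder), could look as follows:

```python
import cv2
import mediapipe as mp

mp_holistic = mp.solutions.holistic

cap = cv2.VideoCapture("sign.mp4")     # placeholder path to a recorded LIS clip
frames_landmarks = []

with mp_holistic.Holistic(static_image_mode=False,
                          model_complexity=1,
                          min_detection_confidence=0.5,
                          min_tracking_confidence=0.5) as holistic:
    while cap.isOpened():
        ok, frame_bgr = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB images, while OpenCV reads frames as BGR.
        results = holistic.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
        # Each result exposes pose, face and per-hand landmark lists with
        # normalised x, y, z coordinates, or None when a part is not visible.
        frames_landmarks.append((results.pose_landmarks,
                                 results.face_landmarks,
                                 results.left_hand_landmarks,
                                 results.right_hand_landmarks))

cap.release()
```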

The core of the LIS2S application is its ability to recognise the sign performed by the user in the recorded video. To accomplish this non-trivial goal, Deep Learning (DL) methods are needed; year after year, DL improvements allow us to build technologies that previously were not even imaginable. Several frameworks have been developed to exploit them: the best known are Keras, PyTorch and TensorFlow. TensorFlow is the oldest, while over the last few years Keras and PyTorch have gained massive popularity, mainly because they are easier to use than TensorFlow. The PyTorch framework has been chosen because it provides a good balance between ease of use and control over model training and testing.
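The control mentioned above comes from PyTorch letting you write the training loop explicitly. The snippet below is only a generic illustration of that pattern, with a stand-in feed-forward model and synthetic data; the actual LIS2S network is described in the next section.

```python
import torch
import torch.nn as nn

# Stand-in model and synthetic data, just to show the training pattern.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 50))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

features = torch.randn(32, 128)        # a batch of 32 feature vectors
labels = torch.randint(0, 50, (32,))   # sign labels from a 50-word vocabulary

for epoch in range(10):
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()                    # every step of the update is explicit
    optimizer.step()
```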

Pipeline for Sign Language Recognition

Moving on to the actual sign language recognition, the development team, after careful analysis, identified the need for four different models. The current prototype focuses only on the first two, which deal with the extraction of skeletal data and with isolated sign recognition; the last two are reported as directions for future work, in order to expand the use cases to which the application could be applied. The following picture shows a diagram of the proposed pipeline.

In the first model, skeletal data are extracted from the subject instead of using videos directly: this was decided in order to reduce the dimensionality of the data to be manipulated, and to demonstrate that state-of-the-art results can be obtained using only lighter data such as skeletal features.
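To give an idea of the dimensionality reduction, MediaPipe Holistic returns 33 pose, 21 + 21 hand and 468 face landmarks, i.e. roughly 1600 floating-point values per frame, against millions of pixel values in a raw RGB frame. A simple sketch that flattens the per-frame results from the previous snippet into a fixed-size vector (with zero padding when a part is out of view) might look like this:

```python
import numpy as np

N_POSE, N_HAND, N_FACE = 33, 21, 468   # MediaPipe Holistic landmark counts

def frame_features(results):
    """Flatten one frame's Holistic results into a fixed-size vector."""
    def coords(landmarks, n):
        if landmarks is None:
            # Pad missing parts (e.g. a hand out of view) with zeros so that
            # every frame yields the same number of features.
            return np.zeros(n * 3, dtype=np.float32)
        return np.array([[p.x, p.y, p.z] for p in landmarks.landmark],
                        dtype=np.float32).flatten()

    pose, face, left, right = results
    return np.concatenate([
        coords(pose, N_POSE),
        coords(left, N_HAND),
        coords(right, N_HAND),
        coords(face, N_FACE),
    ])   # 543 landmarks * 3 coordinates = 1629 floats per frame
```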

In the proposed architecture, after feature extraction comes the actual neural network, which is in charge of understanding the temporal information held in the features it is fed and of returning the predicted sign. To do so, let us first introduce Recurrent Neural Networks (RNNs): these are networks designed to take a series of inputs with no predetermined limit on its size, so that the input is treated as a sequence of information that can carry additional meaning beyond the single items. Each item in the series is related to the others and is likely to influence its neighbours; RNNs are able to capture this relationship across inputs, because they can “remember” the past and make decisions based on what they have learnt from it.
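As a sketch of how such a recurrent classifier over skeletal sequences could be wired up in PyTorch, assuming the 1629-dimensional per-frame vectors above and the 50-sign vocabulary (the layer sizes are illustrative, not the published LIS2S architecture):

```python
import torch
import torch.nn as nn

class SignClassifier(nn.Module):
    """Recurrent classifier over a sequence of per-frame skeletal features."""
    def __init__(self, n_features=1629, hidden=256, n_signs=50):
        super().__init__()
        self.rnn = nn.LSTM(n_features, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_signs)

    def forward(self, x):                 # x: (batch, frames, n_features)
        out, _ = self.rnn(x)              # out: (batch, frames, 2 * hidden)
        return self.head(out[:, -1, :])   # classify from the last time step

model = SignClassifier()
clip = torch.randn(1, 80, 1629)           # e.g. an 80-frame isolated sign
logits = model(clip)                       # (1, 50) scores over the vocabulary
predicted_sign = logits.argmax(dim=1)
```

The LSTM carries information forward across frames, which is exactly the “memory” described above: the prediction for the whole clip depends on the order and speed of the movements, not only on single poses.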

The biggest limitation of this system is that the input must be segmented, since the project is focused on isolated sign language recognition; in addition, the translation is not in real time. During the development of the project another model has been designed to switch from isolated to continuous sign language recognition. This model is currently under development and will be, together with the real-time implementation of the application, the next step to reach. Specifically, the third model should perform sign segmentation of sign language sentences.

Finally, in order to make the translation more readable by hearing users, the model intended as the final one should perform a kind of rephrasing, manipulating the raw translation and converting it into a correct Italian sentence.

Collaboration with Lexis and IT4Innovations

High Performance Computing (HPC) refers to the technologies used by cluster computers to build processing systems capable of providing very high performance, in the order of petaFLOPS, typically used for parallel computing.

The LIS2Speech project will benefit from an HPC system that requires significant investments and whose management requires highly specialised personnel. The intrinsic complexity and rapid technological evolution of these tools also require such personnel to interact closely with the end users (the experts of the various scientific fields in which these systems are used), to allow them to use the tools efficiently.

The LEXIS [3] (Large-scale EXecution for Industry & Society) project is building an advanced engineering platform at the confluence of HPC, Cloud and Big Data, which leverages large-scale geographically-distributed resources from the existing HPC infrastructure, employs Big Data analytics solutions and augments them with Cloud services. Further information about the LEXIS project can be found at https://lexis-project.eu/web/.

LIS2S has led to a collaboration with the LEXIS project (High Performance Computing in Europe), which has provided the Orbyta team with access to the Barbora cluster.

Barbora is a supercomputer supplied by Atos IT Solutions and Services [2]. It is an extension of the existing Anselm supercomputer, which was commissioned in 2013. It was officially taken over by the IT4Innovations National Supercomputing Center and commissioned in late September 2019. “Our goal is to regularly renew our computing resources so that our users have access to state-of-the-art computing systems, and to be able to meet their growing requirements as much as possible,” says Vít Vondrák, Director of IT4Innovations National Supercomputing Center.

IT4Innovations is the leading Czech research, development and innovation centre active in the fields of high-performance computing (HPC), high-performance data analysis (HPDA) and artificial intelligence (AI). It operates the most powerful supercomputing systems in the Czech Republic, which are provided to Czech and foreign research teams from both academia and industry [4].

The Barbora cluster [1], to which LIS2S has been migrated thanks to the LEXIS team, consists of 201 compute nodes, totalling 7232 compute cores and 44544 GB of RAM, giving over 848 TFLOP/s of theoretical peak performance.

Nodes are interconnected through a fully non-blocking fat-tree InfiniBand network and are equipped with Intel Cascade Lake processors. A few nodes are also equipped with NVIDIA Tesla V100-SXM2 GPUs. The cluster runs an operating system compatible with the Red Hat Linux family.

The LEXIS team installed the required deep learning Python packages, which are accessible via the modules environment. The PBS Professional Open Source Project workload manager handles computing resource allocation and job execution.
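As an indication of how a training run might be submitted in such an environment, the following sketch builds a PBS job script and submits it with qsub from Python; the queue name, project ID, module name, resource request and script paths are placeholders and would need to be adapted to the actual Barbora configuration.

```python
import subprocess
import tempfile

# Hypothetical job script: directives and module names are illustrative only.
job_script = """#!/bin/bash
#PBS -N lis2s_training
#PBS -q qgpu
#PBS -l select=1:ncpus=24
#PBS -l walltime=24:00:00
#PBS -A OPEN-XX-XX

module load Python
cd $PBS_O_WORKDIR
python train.py --epochs 100
"""

with tempfile.NamedTemporaryFile("w", suffix=".pbs", delete=False) as f:
    f.write(job_script)
    script_path = f.name

# qsub prints the job identifier on standard output.
job_id = subprocess.run(["qsub", script_path], capture_output=True,
                        text=True, check=True).stdout.strip()
print("submitted job", job_id)
```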

Further information about the Barbora cluster can be found at https://docs.it4i.cz/barbora/introduction/.

In particular, Barbora is being used by LIS2S as:

  1. Offline ML training platform, used to train all the models (new trainings are required for improvements, updates and dictionary enlargement)
  2. Offline video-processing platform for data augmentation
  3. Runtime real-time user usage (POST and GET requests from multiple users for real-time translation, and collection of feedback data from users)
  4. Runtime/offline admin usage (service handling, e.g. temporary interruption; POST requests for adding videos to the dictionary; GET requests for performance and usage supervision)
  5. Data storage (permanent and temporary structured alphanumeric data, permanent and temporary raw and processed videos)

Other collaborations

Since 2020 we have been collaborating with the Department of Control and Computer Engineering of the Politecnico di Torino [7] and with the Ente Nazionale Sordi, in order to obtain a larger dataset and to test our application [6].

Innovation

In its current state, LIS2S already offers multiple points of innovation compared to the state of the art:

  • the application does not require special equipment (Kinect, gloves, …) but only a video camera, and can run on any device: smartphone, tablet or PC;
  • the use of skeletal data instead of raw video is innovative and allowed us to obtain a network that is easier to train;
  • unlike most studies in the same area, our study focuses not only on hand data, but also on body posture and on eye, lip and eyebrow movements;
  • the number of words in the vocabulary is greater than in other LIS studies, with almost identical performance; a new LIS dataset has been created, containing not only single signs but entire sentences, with an indication of the signs that compose them;
  • the study and the dataset have been constructed so that the network is signer-independent, i.e. independent of who is making the sign, and able to recognise the gesture in most visual conditions;
  • the application can also be used by admin users to acquire new signs, allowing the dictionary to grow over time; a periodic model re-training process is also included, so that the application can learn these new signs;
  • the application will include a feedback collection system for LIS translations;
  • the application will offer the possibility to adapt the translation depending on whether a right-handed or a left-handed person is using it;
  • unlike what has been done for other countries, there is currently no application available that focuses on the translation from LIS to Italian; we also plan to return the translation both written and spoken. Once the Italian text is available, it can also be translated into English or other languages; following the same logic, a subsequent phase could translate from LIS to other sign languages with appropriate animations;
  • we aim to translate not single signs but entire sentences into correct Italian; it must be remembered that LIS has a different syntactic structure from Italian.

The Orbyta team is working hard to improve LIS2S over time and to expand its functionality, and we will share our progress with the research community and our followers.

Resources

[1] https://docs.it4i.cz/barbora/introduction/, consulted on 12.05.2021

[2] https://www.vsb.cz/en/news-detail/?reportId=39124, consulted on 12.05.2021

[3] https://lexis-project.eu/web/, consulted on 10.05.2021

[4] https://www.youtube.com/watch?v=4TGjJkwAJ40&t=1s, consulted on 10.05.2021

[5] Slides: Workflow Orchestration on Tightly Federated Computing Resources: the LEXIS approach, EGI 2020 Conference, Workflow management solutions, M. Levrier and A. Scionti, 02/11/2020

[6] https://www.ens.it/chi-siamo/struttura

[7] https://www.dauin.polito.it/

Article by

Giuseppe Mercurio and Carla Melia, data scientists at Orbyta Tech Srl, 10.06.2021

#jointherevolution