LIS2Speech Project and Lexis Integration

Deaf and hard-of-hearing people can communicate with each other using sign language, but they may have difficulties in connecting with the rest of society. Sign Language Recognition is a field of study that dates back to 1983, but only in the last decade has the task gained wider attention. Most published works concern American, Chinese and German sign languages; by contrast, studies on Italian Sign Language (LIS) are still scarce.

To address this problem, Neural Networks, Deep Learning and Computer Vision have been exploited to create an application, called LIS2Speech (LIS2S), capable of returning the Italian translation of a LIS sign performed in a recorded video. The method relies on hand, body and face skeletal features extracted from RGB videos, without the need for any additional equipment such as colour gloves. Since the goal is to reach as many people as possible, LIS2S has been developed as a Progressive Web App, which can run on any device equipped with a camera, be it a computer or a smartphone.

The results obtained with this approach are in line with those of automatic tools developed for other sign languages: the model correctly understands and discriminates between signs belonging to a vocabulary of 50 words, a size comparable to other corpora for isolated sign language recognition. In addition, a new dataset for Continuous Sign Language Recognition (CSLR) has been created and is being constantly expanded, with the aim of providing a publicly available benchmark for this kind of task.

Social problem

Spoken languages and sign languages differ in a number of important ways: the former use the “vocal – auditory” channel, since sound is produced with the mouth and perceived with the ear; the latter use the “corporal – visual” channel, since signs are produced with the body (hands, arms, facial expressions) and perceived with the eyes.

Sign languages come in several varieties, since they are not international, and even within a national sign language different dialects are present. They are natural languages, having evolved spontaneously wherever communities of deaf people had the possibility of communicating with one another, and they are not derived from spoken languages, as they have their own vocabulary and grammatical structures.

The fundamental building block of a sign language is the gloss, which combines manual and non-manual features and represents the closest meaning of a sign. Depending on the context, a specific feature can be the most important factor in the interpretation of a gloss: it may change the meaning of a verb, provide spatial or temporal information, or discriminate between objects or people.

Communication between the deaf community and the rest of society is intrinsically difficult (according to ANSA, in 2018 there were more than 72 million sign language users worldwide), so the design of robust systems for automatic sign language recognition would greatly reduce this gap.

Sign Language Recognition (SLR) can be defined as the task of deducing the glosses performed by a signer from video recordings. It is related in some ways to human action or gesture recognition, but automatic SLR exhibits the following additional challenges:

  • The interpretation of sign language is highly affected by the exact position in the surrounding space and by context. For example, there are no personal pronouns (e.g., “he”, “she” etc.), because the signer points directly at any involved actor.
  • Many glosses are only discernible by their component non-manual features, which are usually difficult to detect.
  • A gloss may have a different meaning depending on its execution speed. For instance, signers would not use two glosses to express “run quickly”; they would simply accelerate the execution of the involved signs.

Computer Vision (CV) is an interdisciplinary scientific field that deals with how computers can gain high-level understanding from digital images or videos; Machine Learning (ML) and Deep Learning (DL) mechanisms are the basis of its modern methods. SLR is closely related to CV, and it benefits from the significant performance improvements that deep networks have brought to many video-related tasks.

LIS2Speech’s goal is to produce a tool that improves the integration of deaf people with the rest of society; this tool should be easily accessible to anyone, which is why it has been developed as an application that can run on laptops and even smartphones.

State of the art

The past decade has seen the rapid expansion of DL techniques in many applications involving spatio-temporal inputs. CV is an extremely promising research area with ample room for improvement; video-related tasks such as human action recognition, gesture recognition and motion capture have seen considerable progress in their development and performance. SLR is closely related to CV, since it requires the analysis and processing of video chunks or sequences to extract meaningful information; this is the reason most approaches tackling SLR have moved in this direction.

SLR can play an important role in addressing the integration of deaf people with the rest of society. The attention paid by the international community to this problem has been growing in recent years; the number of published studies, as well as the quantity of available datasets, is increasing. There are different automatic SLR tasks, depending on the level of detail of the modelling and the subsequent recognition step; they can be roughly divided into:

  • Isolated SLR: in this category, most of the methods aim to address video segment classification tasks, given the fundamental assumption that a single gloss is present.
  • Sign detection in continuous streams: the objective of these approaches is to recognize a set of predefined glosses in a continuous video flow.
  • Continuous SLR (CSLR): the goal of these methods is to identify the sequence of glosses present in a continuous or non-segmented video sequence. The characteristics of this particular category of mechanisms are most suitable for the requirements of real-life SLR applications.

This distinction is necessary to understand the different kinds of problems each task presents; historically, before the advent of deep learning methods, the focus was on identifying isolated glosses and on gesture spotting, which is why studies on isolated SLR are more common. The following image shows the trend of isolated and continuous recognition studies in five-year blocks up until 2020; the growth looks exponential for isolated studies, while it is close to linear for continuous studies. This may reflect the difficulty of the continuous recognition scenario and the scarcity of available training datasets. In fact, on average there are at least twice as many published studies using isolated sign language data.

In terms of vocabulary size, the majority of isolated SLR works model a very limited number of signs (i.e., below 50), while this is not the case for CSLR, where studies are spread more or less evenly across all vocabulary sizes. This trend can be observed in the next figure.

Technologies and architectures

The LIS2S application is mainly made of two parts: on the client side there is a Progressive Web Application (PWA), used by admins and users to access the functionality provided by the software; the back end is managed by a server process constantly listening for requests coming from the application. Whenever a new request is received by the server, it runs a new instance of a Docker container, which is in charge of processing the data accompanying the request and returning the translation to the client. At the prototype stage the server was hosted on a proprietary machine, while at the final stage the Lexis cluster will be used to provide a highly scalable and performant service.
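As an illustrative sketch of the per-request container launch described above (the image name, mount path and entry-point command are hypothetical, not the actual LIS2S deployment), the server could assemble a `docker run` invocation like this:

```python
import shlex

def build_translation_job(video_path, image="lis2s/translator:latest"):
    """Build the `docker run` command for one translation request.

    `lis2s/translator` and the `translate` entry point are hypothetical
    names used for illustration; the container is removed once it has
    returned its result, so each request gets a fresh instance.
    """
    return [
        "docker", "run", "--rm",                   # throw the container away when done
        "-v", f"{video_path}:/data/input.mp4:ro",  # mount the uploaded video read-only
        image,
        "translate", "/data/input.mp4",            # hypothetical entry-point command
    ]

cmd = build_translation_job("/tmp/request_42.mp4")
print(shlex.join(cmd))
```

A real handler would pass `cmd` to `subprocess.run` and relay the container's output back to the PWA client.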

Moving on to the back-end section of the LIS2S application, the core of the translation mechanism is the processing of the received videos. This task has been realized by exploiting the Python programming language, which offers a complete set of tools and libraries extremely useful for Computer Vision and data manipulation. In more detail, the OpenCV and MediaPipe packages have been used to process the videos of our dataset and extract skeletal information about the hands, face and upper body of the subject. On the right there are some examples of the data extracted from single frames.
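As a sketch of this feature-assembly step, the snippet below flattens one frame's landmarks into a fixed-length vector. The landmark counts match what MediaPipe Holistic reports per frame, but dummy arrays stand in for the detector's output here, and `frame_features` is a hypothetical helper, not the project's actual code:

```python
import numpy as np

# Landmark counts produced per frame by MediaPipe Holistic.
N_POSE, N_FACE, N_HAND = 33, 468, 21

def frame_features(pose, face, left_hand, right_hand):
    """Flatten one frame's (x, y) landmarks into a single feature vector.

    Each argument is an (N, 2) array of normalised image coordinates, or
    None when the detector missed that part; missing parts are zero-filled
    so every frame yields a vector of constant length.
    """
    parts = []
    for lm, n in ((pose, N_POSE), (face, N_FACE),
                  (left_hand, N_HAND), (right_hand, N_HAND)):
        parts.append(np.zeros((n, 2)) if lm is None else np.asarray(lm))
    return np.concatenate(parts).ravel()  # shape: (2 * (33+468+21+21),)

# Dummy landmarks stand in for real detector output;
# here the right hand was not detected in this frame.
rng = np.random.default_rng(0)
vec = frame_features(rng.random((N_POSE, 2)), rng.random((N_FACE, 2)),
                     rng.random((N_HAND, 2)), None)
print(vec.shape)  # (1086,)
```

Stacking these per-frame vectors over time gives the lightweight skeletal sequence that replaces the raw RGB video in the rest of the pipeline.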

The core of the LIS2S application lies in its ability to recognize the sign performed by the user in their video. To accomplish this non-trivial goal, Deep Learning (DL) methods need to be used; as the years go by, DL improvements allow us to build technologies that were previously not even imaginable. Several frameworks have been developed to make use of these technologies: among them, the best known are Keras, PyTorch and TensorFlow. Over the last few years, Keras and PyTorch have gained massive popularity, largely because they are easier to use than low-level TensorFlow. The PyTorch framework has been chosen since it provides a good balance between ease of use and control over model training and testing.

Pipeline for Sign Language Recognition

Moving on to the actual sign language recognition, the development team, after a careful analysis, highlighted the need for four different models. The current prototype focuses only on the first two, which deal with the extraction of skeletal data and with isolated sign recognition; the last two are reported to give indications for future improvements, in order to expand the use cases to which this application could be applied. A diagram of the proposed pipeline is shown in the following picture.

In the first model, skeletal data are extracted from the subject instead of using the videos directly: this choice reduces the dimensionality of the data to manipulate, and demonstrates that state-of-the-art results can be obtained using only lightweight skeletal data.

In the proposed architecture, after feature extraction comes the actual neural network, which is in charge of capturing the temporal information held in the features it is fed and returning the predicted sign. To explain how, let us first introduce Recurrent Neural Networks (RNNs): these are networks designed to take a series of inputs with no predetermined limit on size; in this way, the input is treated as a sequence of information, which can hold additional meaning beyond the individual items. A single input item in the series is related to the others and will likely influence its neighbours; RNNs are able to capture this relationship across inputs meaningfully. In fact, they are able to “remember” the past and make decisions based on what they have learnt from it.
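To make the idea concrete, here is a minimal Elman-style RNN forward pass in NumPy, a toy sketch rather than the project's actual network: the hidden state is updated frame by frame, so the final state depends on the whole sequence, including the order of its items.

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, b_h):
    """Fold a sequence of feature vectors into one hidden state.

    The hidden state h is updated at every time step from the current
    input and the previous state, so the final h summarises the whole
    sequence; a classifier head (not shown) would map it to sign scores.
    """
    h = np.zeros(W_hh.shape[0])
    for x in x_seq:
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)  # h carries the "memory"
    return h

rng = np.random.default_rng(1)
d_in, d_hid, T = 8, 4, 10                # toy sizes: 10 frames of 8-d features
W_xh = rng.normal(size=(d_hid, d_in))
W_hh = rng.normal(size=(d_hid, d_hid))
b_h = np.zeros(d_hid)
x_seq = rng.normal(size=(T, d_in))

h_fwd = rnn_forward(x_seq, W_xh, W_hh, b_h)
h_rev = rnn_forward(x_seq[::-1], W_xh, W_hh, b_h)
# The same frames in a different order yield a different final state:
# unlike a bag-of-frames model, the RNN is sensitive to temporal order.
```

This order sensitivity is exactly what matters for signs whose meaning depends on how a movement unfolds over time.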

The biggest limitation of this system is that the input must be segmented, since the project is focused on isolated sign language recognition; in addition, the translation is not in real time. During the development of the project, another model has been considered to effectively switch from isolated to continuous sign language recognition. This model is currently under development and will be, together with the real-time implementation of the application, the next step to reach. Specifically, this third model should perform sign segmentation of sign language sentences.

Finally, in order to make the translation more readable for hearing users, the model envisioned as the final one should perform a form of rephrasing, manipulating the raw translation and converting it into a correct Italian sentence.

Collaboration with Lexis and IT4Innovations

High Performance Computing (HPC) refers to the technologies used by computer clusters to create processing systems capable of providing very high performance, in the order of petaFLOPS, typically used for parallel computing.

The LIS2Speech project will benefit from an HPC system that requires significant investment and highly specialized personnel for its management. The intrinsic complexity and rapid technological evolution of these tools also require such personnel to interact closely with the end users (the experts of the various scientific sectors in which these systems are used), to allow them to use the tools efficiently.

The LEXIS [3] (Large-scale EXecution for Industry & Society) project is building an advanced engineering platform at the confluence of HPC, Cloud and Big Data, which leverages large-scale geographically-distributed resources from the existing HPC infrastructure, employs Big Data analytics solutions and augments them with Cloud services. Further information about the LEXIS project can be found in [3].

LIS2S has enabled a collaboration with the LEXIS project – High Performance Computing (HPC) in Europe, which has provided the Orbyta team with access to the Barbora cluster.

Barbora is a supercomputer supplied by Atos IT Solutions and Services [2]. It is an extension of the existing Anselm supercomputer, which was commissioned in 2013. It was officially taken over by the IT4Innovations National Supercomputing Center and commissioned in late September 2019. “Our goal is to regularly renew our computing resources so that our users have access to state-of-the-art computing systems, and to be able to meet their growing requirements as much as possible,” says Vít Vondrák, Director of the IT4Innovations National Supercomputing Center.

IT4Innovations is the leading research, development and innovation centre active in the fields of high-performance computing (HPC), high-performance data analysis (HPDA) and artificial intelligence (AI). It operates the most powerful supercomputing systems in the Czech Republic, which are provided to Czech and foreign research teams from both academia and industry [4].

The Barbora cluster [1], to which LIS2S has been migrated thanks to the LEXIS team, consists of 201 compute nodes, totalling 7,232 compute cores with 44,544 GB of RAM, giving over 848 TFLOP/s of theoretical peak performance.

Nodes are interconnected through a fully non-blocking fat-tree InfiniBand network, and are equipped with Intel Cascade Lake processors. A few nodes are also equipped with NVIDIA Tesla V100-SXM2. The cluster runs with an operating system compatible with the Red Hat Linux family. 

The LEXIS team installed the required deep learning Python packages, which are accessible via the modules environment. The PBS Professional Open Source Project workload manager provides computing resource allocation and job execution.
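A training run could be submitted through PBS with a job script along these lines; the queue name, project ID, module name, paths and script arguments are hypothetical placeholders, not the actual LIS2S configuration:

```shell
#!/bin/bash
# Hypothetical PBS job script for one LIS2S training run on Barbora.
#PBS -N lis2s-train
#PBS -q qgpu                                 # GPU queue (NVIDIA Tesla V100 nodes)
#PBS -l select=1:ngpus=1,walltime=12:00:00   # one node, one GPU, 12-hour limit
#PBS -A PROJECT-ID                           # accounting project (placeholder)

module load Python        # packages installed by the LEXIS team via modules
cd "$PBS_O_WORKDIR"       # run from the directory the job was submitted from
python train.py --dataset data/lis50 --epochs 100
```

The script would be submitted with `qsub`, and PBS takes care of queuing it until a matching GPU node is free.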

Further information about the Barbora cluster can be found in [1].

In particular, Barbora is used by LIS2S as:

  1. Offline ML training platform, to train all the models (new trainings are required for improvements, updates and dictionary enlargement)
  2. Offline video-processing platform for data augmentation
  3. Real-time runtime platform for users (POST and GET requests from multiple users for real-time translation, and collection of feedback data from users)
  4. Runtime/offline admin platform (service handling, e.g. temporary interruption; POST requests to add videos to the dictionary; GET requests for performance and usage supervision)
  5. Data storage (permanent and temporary structured alphanumeric data; permanent and temporary raw and processed videos)

Other collaborations

Since 2020 we have been collaborating with the Department of Automation and Information Technology of the Politecnico di Torino [7] and with the Ente Nazionale Sordi, in order to obtain a larger dataset and to test our application [6].


At its current state, LIS2S already offers multiple points of innovation compared to the current state of the art:

  • the application does not require special equipment (Kinect, gloves …) but only a video camera, and can run on any device: smartphone, tablet or PC;
  • the use of skeletal data instead of raw video is innovative and gave us a network that is easier to train;
  • our study focuses not only on hand data, but also on body posture and on eye, lip and brow movements, unlike most studies in the same area;
  • the number of words in the vocabulary is larger than in other LIS studies, while the performance is almost identical. A new LIS dataset has been created containing not only single signs but entire sentences, with the indication of the signs that compose them;
  • the study and the dataset have been constructed so that the network is signer-independent, i.e. independent of who is making the sign, and able to recognize the gesture in most visual conditions;
  • the application can also be used to acquire new signs from admin users, allowing the dictionary to grow over time autonomously. A periodic model re-training process is also included, so that the application can learn these new signs;
  • the application will include a feedback collection system for LIS translations;
  • the application will offer the possibility to adapt the translation depending on whether a right-handed or a left-handed person is using it;
  • unlike what has been done for other countries, there is currently no application available that focuses on translation from LIS to Italian, and we also plan to return the translation both written and spoken. Once the Italian text is available, we can also translate from LIS into English or other languages. Following the same logic, a possible later phase could translate from LIS to other sign languages with appropriate animations;
  • we aim to translate not single signs but entire sentences into correct Italian; it must be remembered that LIS has a different syntactic structure from Italian.

The Orbyta team is working hard to improve LIS2S over time and to expand its functionality, and we will share our progress with the research community and our followers.


[1], consulted on 12.05.2021

[2], consulted on 12.05.2021

[3], consulted on 10.05.2021

[4], consulted on 10.05.2021

[5] Slides: Workflow Orchestration on Tightly Federated Computing Resources: the LEXIS approach, EGI 2020 Conference, Workflow management solutions, M. Levrier and A. Scionti, 02/11/2020



Article by

Giuseppe Mercurio and Carla Melia, data scientists at Orbyta Tech Srl, 10.06.2021


Lorenzo Sacco – CEO Orbyta

Tell us about your professional career up to the role of CEO.

I graduated with a master's degree in Telecommunications Engineering in 2005 with top marks.

I immediately started working in ICT as an analyst programmer in multinational consulting firms, where I developed skills and relationships with important clients in the fintech, insurance, telco and automotive sectors. I was also a FIAT employee, before the merger with Chrysler, where I was able to experience the dynamics on the client side.

Ever since university I had wanted to build something of my own, so in 2011, at the age of 30, after 6 years of experience and a wealth of skills and relationships, I decided to take on a new entrepreneurial challenge by founding the startup EIS.

The idea behind EIS was to be a consulting company where people came first, where everyone could find fertile ground to grow their potential and feel part of the company.

I devoted myself body and soul to the project, working hard and following every aspect of the company: the technical area, human resources, administration and finance. My business partner, senior to me in age, initially took care of business development.

In a few years EIS grew exponentially; in 2015 Deloitte ranked us as the top technology startup in Italy for revenue growth in the EMEA TECHNOLOGY FAST 500. In 2016 we became a group, EISWORLD, made up of specialized companies, and I held the role of CEO in the main ones.

In 2019 the EISWORLD group was dissolved, but my passion for my work and the enthusiasm of my staff remained. So in February 2020 the ORBYTA group was born, founded on the values I believe in and that keep me united with the people who share this adventure with me. As founder and CEO, my goal is to put my experience and skills at the service of the group, to make ORBYTA a point of reference among service companies.

What has been your biggest disappointment in your career? What did you learn from it?

Disappointment is a word I do not use; it is a feeling that does not belong to me. I have learned to make my assessments over the long term and, looking back at my career, I can speak of changes, not disappointments. Some changes were hard to face, like the split of the EISWORLD group, but they were inevitable and necessary, and later proved fundamental to the success of future projects.

From this I certainly learned to be more resolute; I learned that a leader must have the courage to make the right choice, especially when it is not easy and when it is the choice few would make. I learned that you must believe strongly in your own vision to see it realized, without compromise.

What does innovation mean for ORBYTA and how do you achieve it?

ORBYTA is a services, consulting and design company in the ICT field, but not only. It stands out because it puts people at the centre and values participation and involvement. At ORBYTA we grow together with enthusiasm, and merit is recognized. We want to be the company of doing, of competence, of quality: the company that surprises the client by always bringing added value. To do this we have built a new, disruptive cultural model, which we are committed to spreading within the company and all around us.

What main threats and opportunities do you foresee in your sector? How do you plan to manage them?

A great challenge for companies that grow very quickly, as we are doing, is maintaining the ability to adapt to the market and to transform in order to seize different opportunities. At ORBYTA we strongly believe in continuous training and in genuine competence. We do not know what the future holds, but we prepare every day to face it.

How do you build relationships with the management team? What role do others play in implementing your strategies?

Something I often say is that the results we achieve depend on the whole team. Everyone is important, indeed fundamental, to the realization of the ORBYTA project. And each person must be aware of their own importance and of the contribution they bring to the group's growth. I build relationships with people based on trust; I want my collaborators to be free to express themselves, to make mistakes, to bring their own ideas, to feel like protagonists and drivers of change. I encourage them so that each of them can fully express their potential, which also means helping them discover their abilities and overcome the limits they believe they have.



I went to a restaurant with a person who trains with a friend who tested positive for the virus.
Do I have to quarantine?

A colleague from the offices next door tested positive for the virus.
Do I have to stay home too?

A classmate of my child tested positive for the virus.
Can I keep going to work?

In these cases quarantine may not be necessary, because…

Quarantine (followed by a swab test) is required only if I have had close contact with a positive person.

What is a close contact?

  • a person living in the same household as a COVID-19 case;
  • a person who has had direct physical contact with a COVID-19 case (for example, shaking hands);
  • a person who has had unprotected direct contact with the secretions of a COVID-19 case (for example, touching used paper tissues with bare hands);
  • a person who has had direct contact (without a mask) with a COVID-19 case, at a distance of less than 2 metres and for longer than 15 minutes;
  • a person who has been in an enclosed environment (e.g. classroom, meeting room, hospital waiting room) with a COVID-19 case for at least 15 minutes, at a distance of less than 2 metres, without wearing a mask;
  • a healthcare worker or other person providing direct care to a COVID-19 case, or laboratory personnel handling samples of a COVID-19 case, without the recommended PPE or using unsuitable PPE;
  • a person who has travelled seated on a train, plane or other means of transport within two seats in any direction of a COVID-19 case; travel companions and the staff assigned to the section of the plane/train where the index case was seated are also close contacts (if the index case has severe symptoms or moved around inside the vehicle, causing greater exposure of the passengers, consider all passengers seated in the same section of the vehicle, or in the whole vehicle, as close contacts).

What do I do if I have symptoms compatible with COVID-19?

I stay home. I inform my employer and contact my general practitioner for instructions on how to proceed. In the presence of suspicious symptoms, the general practitioner (GP) promptly requests the diagnostic test and reports it to the Prevention Department, or to the service in charge according to the regional organization.

What are the compatible symptoms?

  • fever ≥ 37.5 °C and chills
  • cough of recent onset
  • breathing difficulties
  • sudden loss of smell (anosmia) or reduced smell (hyposmia), loss of taste (ageusia) or altered taste (dysgeusia)
  • cold or runny nose
  • sore throat
  • diarrhoea (especially in children).

I have had close contact with a person who tested positive for SARS-CoV-2. What should I do?

I quarantine and inform my employer and my general practitioner, following their instructions. Close contacts of SARS-CoV-2 cases, confirmed and identified by the health authorities, may take an antigen or molecular test on the tenth day of quarantine (or observe a quarantine period of 14 days from the last exposure to the case).

What is meant by quarantine, active surveillance and isolation? What are the differences?

In all three cases leaving home is not allowed, but with the following differences:

  • Quarantine applies to a healthy person (a close contact) who has been exposed to a COVID-19 case, with the aim of monitoring symptoms and ensuring early identification of cases.
  • Isolation consists of separating people with COVID-19 as much as possible from healthy people, in order to prevent the spread of the infection during the period of transmissibility.
  • Active surveillance is a measure whereby a public health worker contacts the person under surveillance daily for news on their health conditions.

What must be done at the end of quarantine to return to work?

  • If no symptoms have appeared:
    At the end of the quarantine period, if no symptoms have appeared, one can return to work, and the period of absence is covered by the certificate.
  • If symptoms have appeared:
    If symptoms develop during quarantine, the Public Health Department, which is responsible for health surveillance, will carry out a swab test for SARS-CoV-2. If the result is positive, one must wait for clinical recovery and take a molecular test after at least 3 days without symptoms. If the molecular test is negative, one can return to work; otherwise isolation continues.

Source: Ministero della Salute (Italian Ministry of Health)