Virtual Attendants : Revista Pesquisa Fapesp

JOSÉ DE MARTINO / UNICAMP

Soon, withdrawing cash, paying a bill or performing any other banking transaction at an automatic teller machine may be a very different experience from what it is today. Instead of pressing buttons and typing commands on a keypad to interact with a cold, impersonal ATM screen, users will have an experience closer to being served by a flesh-and-blood bank employee. The client will be “attended” by a virtual teller, whose face appears on the screen to guide the user verbally through the transaction. The same virtual attendant could also appear on cell phones, reading text messages aloud, and on websites, assisting with travel arrangements, medical appointments and any number of other services that today require users to enter data and navigate drop-down menus. That is a seemingly trivial task for many, but one that still causes headaches for people unfamiliar with technology and for those with reading disabilities or physical limitations, for example.

This is the scenario envisioned by a pair of researchers at the State University of Campinas (Unicamp) who developed a method for realistic video facial animation built from real images of a human face, rather than drawn with computer graphics, which builds virtual faces by imitating the details and nuances observed in real ones. The goal of the innovation, named AnimaFace 2D, is to make interactions with computers and other electronic devices more like real face-to-face conversations. “The system that we created can support the development of more intuitive, efficient and engaging human-computer interfaces, representing an alternative to more traditional interactive applications that rely on windows, icons, menus, keyboards and/or a mouse,” says electrical engineer José Mario De Martino, one of the researchers responsible for the system and a professor of Electrical and Computer Engineering at Unicamp. “Talking with a human face is an efficient and intuitive process. I think virtual people are a promising alternative for the creation of communication interfaces for different types of devices and applications.”

The end result of the facial animation system created at Unicamp is a set of photographic images of a real face which, when processed and presented in sequence at a pace suited to conveying realistic human movement, gives the impression of speaking with someone live and in person. To generate a realistic video of someone speaking, the technology uses a base of 34 photographs of different key lip positions, the so-called profiles, each of which is associated with a set of phonemes in a given language. Each profile is therefore the visual representation of the articulatory position the mouth takes when actually producing that set of phonemes. The profiles were identified through De Martino’s research, which analyzed the articulatory movements involved in producing the phonemes of Brazilian Portuguese.
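The association between phonemes and profiles can be pictured as a simple lookup table. The Python sketch below is only an illustration under stated assumptions: the article does not list the 34 profiles or the phonemes attached to each, so the five profile names and their phoneme groups here are hypothetical, as are the function names. It also assumes each phoneme always maps to the same profile regardless of its neighbors, which the actual system may not do.

```python
# Hypothetical grouping for illustration only: the article does not publish
# the 34 profiles or their phoneme sets.
PROFILE_PHONEMES = {
    "closed_lips": {"p", "b", "m"},
    "lip_to_teeth": {"f", "v"},
    "open_jaw": {"a"},
    "rounded_lips": {"o", "u"},
    "spread_lips": {"i", "e"},
}


def build_phoneme_index(profile_phonemes):
    """Invert a profile -> phonemes table into a phoneme -> profile table."""
    index = {}
    for profile, phonemes in profile_phonemes.items():
        for phoneme in phonemes:
            if phoneme in index:
                raise ValueError(f"phoneme {phoneme!r} assigned to two profiles")
            index[phoneme] = profile
    return index


def profiles_for(phonemes, index, fallback="neutral"):
    """Return the profile photograph to show for each phoneme, in order."""
    return [index.get(phoneme, fallback) for phoneme in phonemes]


PHONEME_INDEX = build_phoneme_index(PROFILE_PHONEMES)

# The phonemes of a made-up word, "mapa":
print(profiles_for(["m", "a", "p", "a"], PHONEME_INDEX))
# -> ['closed_lips', 'open_jaw', 'closed_lips', 'open_jaw']
```

Because several phonemes share one lip position, the table of photographs can be much smaller than the phoneme inventory, which is why a few dozen images can cover a whole language.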

Synchronized Photographs
Electrical engineer Paula Dornhofer Costa, a doctoral student supervised by Professor De Martino, used the definition of these profiles to select a set of 34 photographs of a real person’s face for use in animation synthesis. These 34 images allow the lip movements of the virtual face to be synchronized with real speech, creating an animation that more accurately mimics a real person speaking live. To draw a parallel with film dubbing, the correct identification and sequencing of the profiles avoids a mismatch between speech and lip movements. According to the Unicamp professor, the software his group developed is unprecedented in Brazil. Similar systems exist overseas, but they are not commercially available and are aimed at other languages.
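One way to picture the synchronization step is as a timetable that says which photograph is on screen at each video frame. The Python sketch below is not the Unicamp implementation: it assumes, hypothetically, that a speech synthesizer reports how long each sound lasts in milliseconds, and it simply holds each photograph for that long, with no blending between images. The function name `frame_schedule` and the sample durations are invented for illustration.

```python
from bisect import bisect_right
from itertools import accumulate


def frame_schedule(timed_profiles, fps=30):
    """Return the profile photograph to display at each video frame.

    timed_profiles: (profile_name, duration_ms) pairs in speaking order,
    with whole-millisecond durations (assumed to come from the speech
    synthesizer).
    """
    if not timed_profiles:
        return []
    names = [name for name, _ in timed_profiles]
    # Cumulative end time of each profile, in milliseconds.
    end_times = list(accumulate(duration for _, duration in timed_profiles))
    total_frames = end_times[-1] * fps // 1000
    schedule = []
    for frame in range(total_frames):
        t_ms = frame * 1000 // fps
        # First profile whose end time lies strictly after this instant.
        schedule.append(names[bisect_right(end_times, t_ms)])
    return schedule


# A 100 ms closed-lip sound followed by a 200 ms open vowel, at 10 frames/s:
print(frame_schedule([("closed_lips", 100), ("open_jaw", 200)], fps=10))
# -> ['closed_lips', 'open_jaw', 'open_jaw']
```

Driving the image sequence from the same durations that drive the audio is what keeps lips and voice aligned in this sketch; a production system would presumably also smooth the transitions between photographs.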

“All of the research related to technology development for this project was done in Brazil and focused on our reality, especially on our language and its peculiarities,” says De Martino. “We did not rely on partnerships with foreign researchers. The only collaboration we had, and a very important one, was with the CPqD foundation in Campinas, which contributed a master’s student scholarship, supported the implementation of the pilot system and allowed us to use their text-to-speech synthesizer, known as ‘CPqD Texto Fala’ (CPqD Text-to-Speech).” The software used in this pilot system has been registered with the National Institute of Industrial Property (INPI) and is ready for commercial use. “We are very interested in establishing partnerships with companies that want to explore the technology,” he says.

“The technology will be presented to market specialists in August this year, and we believe that it will be of interest to various companies working with interfaces between people and machines, in areas such as electronic banking, e-commerce and tourism, among others,” says Giancarlo Stefanuto, scientific coordinator of the Business Mobilization for Innovation (MEPI) project at Inova, the Innovation Agency of Unicamp, which administers intellectual property rights and seeks parties interested in licensing that property or in creating new companies to bring it to market. “We realized that there is a growing market for facial expression animation software like that developed at Unicamp. The human-machine interface is becoming more and more personalized, and innovations such as this tend to be in strong demand in the future. For us, this is a strategic line of research, but for the technology to enter the market some adjustments and adaptations are still necessary. It needs to be more user friendly and to have more intuitive interfaces, a tutorial and a manual,” says Stefanuto. “The market expansion is something perceived by the experts at Inova and has been confirmed by the researcher in international forums where technology is debated.”

Facial animation systems may have many applications in the future, including the creation of avatars and of virtual characters or actors for commercial games and videos, and the development of personified virtual agents such as vendors, tutors, customer support attendants, virtual guides and news presenters, among others. They may also be used as tools in education and in lip-reading training, for example. The technology has features that allow it to be adapted to multiple computing platforms such as mobile phones, smartphones and tablet computers. “You can imagine, for example, an application where a smartphone user could sign up for a news service that would be displayed on the device by their own synthesized presenter. The advantage is that news feeds could be sent to the device in text format, and the system installed on the device would automatically convert that text into an audiovisual presentation. In addition to making such services cheaper, since sending text is less expensive than sending pictures, audio or video, it won’t be necessary for the information distributor to record a live video presentation every day or for every update,” says De Martino.
