In “The Blind Assassin,” Canadian author Margaret Atwood writes, “touch comes before sight, before speech. It’s the first language and the last, and it always tells the truth.”
[Image: Yunzhu Li, a PhD student at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). Image credit: MIT]
The sense of touch gives humans a way to feel the physical world, while the eyes let them instantly perceive the full picture of those tactile signals.
Robots programmed to feel or to see cannot use these signals nearly as interchangeably. To help close this sensory gap, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed a predictive artificial intelligence (AI) that can learn to see by touching, and learn to feel by seeing.
The system can use visual inputs to generate realistic tactile signals, and can predict which object, and which part of it, is being touched directly from tactile inputs. The team used a KUKA robot arm fitted with a special tactile sensor called GelSight, developed by another group at MIT.
The researchers used a simple web camera to record almost 200 objects, such as fabrics, household products, tools, and more, being touched over 12,000 times. The team broke down those 12,000 video clips into static frames and compiled “VisGel,” a dataset consisting of over three million visual/tactile-paired images.
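The pairing described above can be sketched in a few lines. This is a hypothetical illustration, not the actual VisGel tooling: it assumes the webcam and GelSight streams are already synchronized frame-by-frame, and the file paths are invented for the example.

```python
# Hypothetical sketch of assembling a VisGel-style paired dataset:
# each recorded touch yields a synchronized visual frame and a tactile
# (GelSight) frame; zipping the two streams gives visual/tactile pairs.
from dataclasses import dataclass

@dataclass
class TouchPair:
    visual_frame: str   # path to a webcam frame (hypothetical layout)
    tactile_frame: str  # path to the matching GelSight frame

def build_pairs(visual_frames, tactile_frames):
    """Pair frames index-by-index, assuming the streams are synchronized."""
    return [TouchPair(v, t) for v, t in zip(visual_frames, tactile_frames)]

pairs = build_pairs(
    [f"cam/{i:06d}.png" for i in range(3)],
    [f"gel/{i:06d}.png" for i in range(3)],
)
print(len(pairs))  # 3
```

In the real dataset this pairing is done for every static frame of the 12,000 clips, which is how the roughly three million paired images arise from only ~200 objects.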
By looking at the scene, our model can imagine the feeling of touching a flat surface or a sharp edge. By blindly touching around, our model can predict the interaction with the environment purely from tactile feelings. Bringing these two senses together could empower the robot and reduce the data we might need for tasks involving manipulating and grasping objects.
Yunzhu Li, PhD student, CSAIL
Li is the lead author on a new paper that describes the system.
Recent studies that aimed to equip robots with more human-like physical senses, such as MIT’s 2016 project that used deep learning to visually indicate sounds, or a model that predicted how objects respond to physical forces, relied on large datasets that cannot capture the interactions between vision and touch.
The method developed by the researchers overcomes this by making use of the VisGel dataset and the so-called generative adversarial networks (GANs).
GANs take tactile or visual images as input and produce images in the other modality. They work by pitting a “generator” against a “discriminator”: the generator tries to produce realistic-looking images to fool the discriminator, and each time the discriminator “catches” the generator, it exposes the internal reasoning behind its decision, which lets the generator improve itself again and again.
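The adversarial dynamic can be illustrated with a deliberately tiny example. This is a one-dimensional toy of my own construction, not the convolutional model from the paper: the discriminator is a single-parameter logistic score and the generator emits one scalar, but the two gradient steps mirror the generator/discriminator competition described above.

```python
import math

# Toy one-dimensional adversarial game (an illustrative assumption,
# not the paper's architecture). D scores how "real" a sample looks;
# the generator emits a single value g and is nudged to raise D(g).

def D(x, w):
    """Logistic discriminator: probability that x is a real sample."""
    return 1 / (1 + math.exp(-w * x))

w, g, lr = 1.0, -2.0, 0.1
real = 2.0  # the "real" data point the discriminator should prefer

# One discriminator step: ascend log D(real) + log(1 - D(g)).
w_new = w + lr * ((1 - D(real, w)) * real - D(g, w) * g)

# One generator step: ascend log D(g) (the non-saturating objective).
g_new = g + lr * (1 - D(g, w_new)) * w_new

print(D(real, w_new) > D(real, w))   # discriminator rates real higher
print(D(g_new, w_new) > D(g, w_new)) # generator output fools D more
```

Both printed checks come out `True`: each side's single step improves its own objective, which is exactly the back-and-forth that drives a GAN's training.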
Vision to Touch
Humans can infer how an object feels merely by looking at it. To give machines this ability, the system first had to locate the position of the touch, and then infer information about the shape and feel of that region.
Reference images, captured without any robot-object interaction, let the system encode information about the objects and the environment. Then, while the robot arm operated, the model could compare the current frame against its reference image and readily determine the location and scale of the touch.
This is a bit like supplying the system with an image of a computer mouse and then “seeing” the region where the model predicts the object should be touched for pickup, which could help machines plan safer and more efficient actions.
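The reference-frame comparison can be sketched with a simple pixel-difference heuristic. This is only an analogy to the idea above, not the paper's method (the real model is learned); the grid values and threshold are invented for the example.

```python
# Illustrative sketch: comparing the current frame against a touch-free
# reference frame localizes the contact, because the pixels that changed
# the most mark where the arm is touching. Frames are plain 2D lists.

def locate_touch(reference, current, thresh=0.2):
    """Return the bounding box (row0, row1, col0, col1) of changed pixels,
    or None if no pixel differs by more than thresh."""
    box = None
    for r, (ref_row, cur_row) in enumerate(zip(reference, current)):
        for c, (a, b) in enumerate(zip(ref_row, cur_row)):
            if abs(b - a) > thresh:
                if box is None:
                    box = [r, r, c, c]
                else:
                    box[0] = min(box[0], r); box[1] = max(box[1], r)
                    box[2] = min(box[2], c); box[3] = max(box[3], c)
    return tuple(box) if box else None

ref = [[0.0] * 8 for _ in range(8)]
cur = [row[:] for row in ref]
for r in (2, 3):
    for c in (5, 6):
        cur[r][c] = 1.0  # a simulated contact patch
print(locate_touch(ref, cur))  # (2, 3, 5, 6)
```

The returned box plays the role of the “location and scale of the touch” that the model extracts before predicting what that region feels like.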
Touch to Vision
For touch to vision, the objective was for the model to generate a visual image from tactile data. The model analyzed a tactile image, deciphered the shape and material at the contact position, and then referred back to the reference image to “hallucinate” the interaction.
For instance, if the model was fed tactile data on a shoe at the time of testing, it could generate an image of the position at which the shoe was most likely to be touched.
Such a capability could prove valuable for tasks where no visual data is available, such as when the lights are off, or when a person is reaching blindly into an unknown box or area.
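The shoe example can be caricatured as a lookup problem. This nearest-neighbor sketch is an assumption of mine for illustration only: the real system generates an image with a GAN conditioned on a reference frame, whereas here a new tactile reading is simply matched against stored (tactile signature, location) pairs, all of which are made up.

```python
# Hedged nearest-neighbor analogy for the touch-to-vision direction:
# given a fresh tactile reading, find the stored reading closest to it
# and report the associated contact location on the object.

def predict_location(tactile, memory):
    """memory: list of (tactile_vector, location_label) pairs."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(memory, key=lambda pair: dist(pair[0], tactile))[1]

# Invented tactile signatures for three parts of a shoe.
memory = [
    ([0.9, 0.1, 0.0], "toe"),
    ([0.1, 0.8, 0.2], "laces"),
    ([0.0, 0.2, 0.9], "heel"),
]
print(predict_location([0.2, 0.7, 0.3], memory))  # laces
```

The learned model goes much further, producing a full visual image of the likely contact, but the underlying question is the same: which part of the object does this tactile signal correspond to?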
The existing dataset consists only of examples of interactions in a controlled environment. The researchers believe it can be improved by gathering data in more unstructured settings, or by using an innovative MIT-designed tactile glove, to increase the size and diversity of the dataset.
Still, some details remain difficult to infer when switching between modalities, such as telling the color of an object merely by touching it, or how soft a sofa is without actually pressing on it. According to the researchers, this could be addressed by developing more robust models of uncertainty, to widen the distribution of possible outcomes.
Moving forward, a model of this kind could help build a more harmonious relationship between vision and robotics, particularly for scene understanding, object recognition, grasping, and seamless human-robot integration in manufacturing or assistive settings.
This is the first method that can convincingly translate between visual and touch signals. Methods like this have the potential to be very useful for robotics, where you need to answer questions like ‘is this object hard or soft?’, or ‘if I lift this mug by its handle, how good will my grip be?’ This is a very challenging problem, since the signals are so different, and this model has demonstrated great capability.
Andrew Owens, Postdoc, University of California at Berkeley
Li authored the paper together with MIT professors Russ Tedrake and Antonio Torralba, and MIT postdoc Jun-Yan Zhu. The paper will be presented at the Conference on Computer Vision and Pattern Recognition (CVPR) in Long Beach, California, next week.