How to build Vision Intelligence for machines?

Maciej Wolski

The process of learning relies on transforming raw data coming from sensors into structured information and finally useful knowledge.

Raw sensory data – pixels in frames from the camera sensor | Structure – segmentation into separate objects and their parts | Object labels, attributes, and mutual relations
Observations | Models – e.g. scene graphs | Judgments – scene understanding, object/event recognition

Table 1. Steps of the learning process

To learn anything, both biological and technical systems need reference points, optimization targets, labels, or pseudo-labels. New information has to be attached to something that existed before or support previously established memory patterns.

In supervised machine learning, learning is organized around training datasets that provide both data samples and the correct answers. The model is adjusted so that it can also respond correctly to similar examples in the future.

The challenge is that collecting enough labeled data for neural networks requires lots of time (and resources). And when the models are deployed in production, they often suffer from ‘model degradation’ – a situation where the model’s accuracy drops significantly when exposed to a real-world environment.

The physical world is unpredictable and changes frequently, so the model’s knowledge requires constant updates, preferably in real time, to avoid acting on incomplete knowledge.

Learning without supervision

To deal with such conditions, it is beneficial to work with solutions that learn in unsupervised or self-supervised modes. Equipped with little to no prior knowledge, they build references not to provided labeled data, but to similarities with previously observed data.

In such scenarios, the relations between data samples are the only information available to the system. The similarities to existing groups may be used as pseudo-labels, where affiliation to a specific group of samples is informative enough to direct the model’s knowledge update. Such a pseudo-label may not mean anything to us humans, but it will represent the patterns of similar data.
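The idea of using group affiliation as a pseudo-label can be sketched in a few lines. This is only an illustration under stated assumptions, not the method used by any particular framework: the `PseudoLabeler` class, the cosine-similarity measure, and the 0.9 threshold are all hypothetical choices made for the example.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


class PseudoLabeler:
    """Groups feature vectors by similarity; the group index is the pseudo-label."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold  # hypothetical similarity cut-off
        self.prototypes = []        # running mean vector of each group
        self.counts = []            # number of samples seen per group

    def assign(self, x):
        """Return the pseudo-label for x, opening a new group if nothing is similar."""
        x = np.asarray(x, dtype=float)
        if self.prototypes:
            sims = [cosine_similarity(x, p) for p in self.prototypes]
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                # The sample supports an existing pattern: update its running mean.
                self.counts[best] += 1
                self.prototypes[best] += (x - self.prototypes[best]) / self.counts[best]
                return best
        # Nothing similar was observed before: start a new group.
        self.prototypes.append(x.copy())
        self.counts.append(1)
        return len(self.prototypes) - 1


labeler = PseudoLabeler(threshold=0.9)
samples = [[1.0, 0.0], [0.95, 0.05], [0.0, 1.0], [0.1, 0.9]]
pseudo_labels = [labeler.assign(s) for s in samples]
print(pseudo_labels)  # [0, 0, 1, 1]
```

The indices 0 and 1 carry no human meaning; they only say which earlier observations a sample resembles.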

The approaches known for working in unsupervised scenarios are clustering and autoencoders, with all the existing variants of these techniques. Self-supervised learning derives its training signal from the data itself, although a small amount of labeled data is usually still needed to adapt the model to a specific task.

However, the most promising scenario is to build systems that can learn in a fully unsupervised way and can still accept labels at any time, from human users, other machines, or external data sources.

This way, the solution does not need to understand what a specific object or its part is in order to learn its key attributes. It can attach a linguistic label later, if that is useful for a specific use case.

In computer vision, it is possible to define the areas of interest in a video frame (or image) through pre-processing that marks the areas likely to contain objects. This can be done by analyzing depth continuity (physical objects stick together and are not divided by empty space), visual consistency (an object’s parts often have similar color or texture), motion (what moves together is probably the same object), and detected collinear edges forming the object’s boundaries.
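As a rough sketch of the depth-continuity cue alone, the snippet below groups neighbouring pixels whose depth values differ by no more than a small step. The function name `segment_by_depth`, the 4-connectivity, and the 0.05 threshold are assumptions made for illustration; a real pipeline would combine this with the color, motion, and edge cues.

```python
from collections import deque

import numpy as np


def segment_by_depth(depth, max_step=0.05):
    """Label 4-connected regions whose neighbouring depths differ by at most max_step."""
    height, width = depth.shape
    labels = np.full((height, width), -1, dtype=int)  # -1 means "not visited yet"
    next_label = 0
    for sy in range(height):
        for sx in range(width):
            if labels[sy, sx] != -1:
                continue
            # Breadth-first flood fill from an unvisited seed pixel.
            labels[sy, sx] = next_label
            queue = deque([(sy, sx)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and labels[ny, nx] == -1
                            and abs(depth[ny, nx] - depth[y, x]) <= max_step):
                        labels[ny, nx] = next_label
                        queue.append((ny, nx))
            next_label += 1
    return labels


# Toy depth map in metres: a near object (1.0 m) in front of a far wall (3.0 m).
depth = np.full((4, 6), 3.0)
depth[1:3, 2:4] = 1.0
regions = segment_by_depth(depth)
print(regions)
```

Each distinct label in `regions` is a candidate area of interest; here the wall and the near object end up in separate regions because of the depth jump between them.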

This is how raw sensory data is transformed into the initial structure of visual information. The areas of interest can then be confirmed with low-level saliency maps, which are based on detected visual attributes, used for the task of object recognition.

Such an approach enables autonomous learning and operation without supervision. The associative memory does not need to be exposed to noise or irrelevant information if we can focus on the relevant regions of the video frame (or image) and discard the rest.

The definition of the problem

The problem that an Artificial Visual System tries to solve may be defined as simultaneous representation learning and relation modeling. We need to know what specific pieces of visual information represent and how they relate to other samples and data pieces in the memory. In many cases it is also useful to model spatiotemporal relations to other objects in the perceived scene.

Representation learning is the basis of modern neural networks. In recent years, however, there has been a rise in techniques focused on modeling relations between data: graphs, Graph Neural Networks, and attention mechanisms are all forms of associated context aggregation.
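To make "context aggregation" concrete, here is a minimal NumPy sketch of standard scaled dot-product attention, in which each item's new representation is a similarity-weighted mix of the others. The toy feature vectors are made up for the example, and nothing here is specific to the system described in this article.

```python
import numpy as np


def attention(queries, keys, values):
    """Scaled dot-product attention: each query aggregates values weighted by similarity."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ values, weights


# Three toy items: the first two are similar, the third is different.
features = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
context, weights = attention(features, features, features)
print(weights.round(2))
```

Each row of `weights` sums to one and shows how strongly an item draws on every other item, so similar items contribute more to each other's aggregated context.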

In Computer Vision there is a well-known technique called Simultaneous Localization and Mapping (SLAM), which allows machines to orient themselves in their external environment.

In a similar manner, the machine can orient itself in the internal environment of its own memory – to find the memory locations relevant to the observed environment state and to perform mapping of the external environment features in its memory bank for future usage.

In fact, most of our human knowledge is also based on two aspects:

· What is a specific object? (how it can be represented in memory)

· How does it relate to the other objects? What is the type of relation (hierarchical, directed etc.)?
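These two aspects map naturally onto a small graph structure, similar to the scene graphs mentioned in Table 1: nodes hold what an object is, and directed, typed edges hold how it relates to others. The `SceneGraph` class and the relation names below are hypothetical and serve only as a sketch.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)  # object name -> attributes (its representation)
    edges: list = field(default_factory=list)  # (source, relation type, target)

    def add_object(self, name, **attributes):
        """Store what the object is, as a set of attributes."""
        self.nodes[name] = attributes

    def relate(self, source, relation, target):
        """Store a directed, typed relation between two known objects."""
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both objects must be added before relating them")
        self.edges.append((source, relation, target))

    def relations_of(self, name):
        """Return the outgoing (relation, target) pairs of an object."""
        return [(r, t) for s, r, t in self.edges if s == name]


graph = SceneGraph()
graph.add_object("table", color="brown")
graph.add_object("cup", color="white")
graph.add_object("handle")
graph.relate("cup", "on", "table")         # spatial relation
graph.relate("handle", "part_of", "cup")   # hierarchical relation
print(graph.relations_of("cup"))  # [('on', 'table')]
```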

Pre-trained models

To be useful from the very beginning, the machine can’t start from a blank state.

In order to operate in the environment, the machine has to understand it on at least a basic level. Our technology may support unsupervised zero-shot to few-shot learning procedures, but starting from a blank state will not allow it to perform any reasonable task.

The pre-training operation may build a basic understanding of the environment, while real-time learning mechanisms will allow for rapid learning of the differences between already possessed knowledge and the real world seen through the sensors.

The pre-training procedure may be realized in two ways:

· with a dataset created from data gathered through sensors (for Computer Vision tasks, a stereo camera is preferred)

· with synthetic data that allows for efficient pre-training in multiple scenarios and conditions without the need to physically record and store the data

After analysis of current synthetic data and virtual environment providers, we can assume that the correct direction for the development of Vision for Robots is to base our models on pre-training with synthetic data and allow machines to adapt to real-world data acquired from sensors later.

The synthetic data may differ more or less from the real-world data. This phenomenon is called the “sim-to-real gap”. Unless we have a photorealistic virtual environment, the 3D models may not fully capture the attributes of the real objects.

To eliminate this sim-to-real gap we can:

· import the real world into the simulation: use 3D object models generated from real-world data (e.g. with structure-from-motion algorithms)

· add real-world elements to the scene: textures, backgrounds
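The second option, adding a real-world background to a synthetic scene, comes down to alpha compositing. The sketch below assumes a rendered object image, its object mask, and a real photograph of the same size; the `composite` function and the random stand-in images are hypothetical and only illustrate the blending step.

```python
import numpy as np


def composite(foreground, mask, background):
    """Alpha-blend a rendered object over a real photograph of the same size."""
    alpha = mask[..., None].astype(float)  # shape (H, W, 1), values in [0, 1]
    return alpha * foreground + (1.0 - alpha) * background


rng = np.random.default_rng(0)
rendered = rng.random((4, 4, 3))  # stand-in for a synthetic render
photo = rng.random((4, 4, 3))     # stand-in for a real background photo
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0              # pixels covered by the rendered object

image = composite(rendered, mask, photo)
print(image.shape)
```

Pixels inside the mask come from the render and the rest from the photograph, so the same synthetic object can be pre-trained against many real backgrounds.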

The other way to deal with the sim-to-real gap is to learn structural patterns in parallel with visual attributes, which can be achieved by using depth information and the unique features of AGICortex’s framework.

The biological brain can recognize an object whether we see an outline, a sketch, a 3D model, a photograph, the real object, or its equivalent (e.g. a toy) in front of our eyes.

The reason is that the attributes of the object are not only visual, but also functional (what specific part does) and structural (how it relates to other parts and objects nearby).

We can simultaneously build multiple representations of the same objects and their parts with varying visual representations, while also focusing on their functional and structural aspects.