
How Standard AI Optimizes Camera Placement for Autonomous Retail

This blog was authored by Luis Yoichi Morales, Staff Researcher, and Dhananjay Singh, Staff Engineer.

At Standard AI, we often say that “we bring the platform but retailers bring the magic.” This is because we don’t ask retailers to adapt their stores to us. We instead adapt to their existing layout, merchandising strategies, and product offer—whatever that might be.

Our decision to use only computer vision enables us to do this. Whereas some vendors require retailers to modify their stores and install invasive shelf sensor hardware, we simply place cameras on the ceiling.

The naïve way to place cameras would be to use a brute force approach. In other words, populate the store with as many cameras as possible. Not only is that inefficient, but it would also be expensive and degrade the aesthetics of the retail environment since the ceiling would be awkwardly full of cameras.

This of course raises the importance of optimal camera placement. It’s essential that we place cameras to maximize visibility and coverage while avoiding occlusions caused by the structure of the retail environment. Cameras must also be able to keep track of products as they’re carried around the store. All this, without installing an excessive number of cameras.

At a basic level, the figure below shows how you might look at the retail environment from the perspective of computer vision cameras. We have a person with a shopping basket walking through a store. The person is detected by two cameras (C1 and C2) in the yellow area, and at the same time a shelf is detected by cameras C3 and C4 in the blue area. These areas can intersect, and cameras can serve both objectives. Uncovered areas are shown in gray. These areas need to be covered, so we need to add cameras—or possibly improve the orientation of the existing ones. The goal is to cover all areas of the store (in effect, eliminating the gray areas) with a minimal camera setup.

Concept image of a retail environment covered by four cameras. The walkable area is shown in gray, and the camera-visible areas are shown in yellow (for tracking humans) and in blue (for covering regions of interest such as a shelf).
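As a toy illustration of this idea (not our production code), you can think of each camera’s footprint as a set of 2D grid cells; the gray areas are simply the cells that no camera covers. The camera footprints below are made up for the example:

```python
# Toy sketch: each camera's visible region is a set of grid cells;
# uncovered ("gray") cells are the ones no camera sees.
store_cells = {(x, y) for x in range(10) for y in range(10)}  # walkable grid

camera_coverage = {
    "C1": {(x, y) for x in range(0, 5) for y in range(0, 5)},
    "C2": {(x, y) for x in range(3, 8) for y in range(0, 5)},
    "C3": {(x, y) for x in range(0, 5) for y in range(5, 10)},
    "C4": {(x, y) for x in range(5, 10) for y in range(5, 10)},
}

covered = set().union(*camera_coverage.values())
uncovered = store_cells - covered  # add or re-aim cameras to shrink this set

print(f"{len(uncovered)} of {len(store_cells)} cells still uncovered")
```

In this toy layout the cells along one edge of the store are missed by all four cameras, which is exactly the situation where we would add a camera or re-orient an existing one.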

In order to compute optimal camera placement, we need a 3D map—and a method to score both individual cameras and sets of cameras.

Our team uses high-fidelity 3D maps, which help us understand camera requirements in terms of coverage and physical location constraints. We know that we want to place cameras in the upper part of the store, but we also want to make sure certain requirements are fulfilled:

  • Camera views are not blocked
  • Camera views are properly aligned to cover regions of interest
  • Cameras can physically be mounted at the computed locations (avoiding clutter on the roof such as lamps, grills, or other cameras)

Image of a high-definition 3D color map.

From this 3D map, we can extract different layers of spatial information. Each layer serves different purposes:

  • Walkable areas. These are the areas where the people we track move through and interact with the retail environment.
  • Wall information. This layer delimits the store.
  • Shelves. We can extract accurate dimension and position information of the shelving units.
  • Roof. This layer provides the search space to place cameras.

Figure showing an example of the multilayered spatial information extracted from 3D environmental models.
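One way to picture the output of this layer extraction—purely illustrative, with hypothetical names and grid sizes—is as a set of boolean occupancy grids over the store footprint, one per layer:

```python
# Illustrative sketch (not our real data model): each extracted layer is a
# boolean occupancy grid over the store footprint.
import numpy as np
from dataclasses import dataclass

@dataclass
class StoreLayers:
    walkable: np.ndarray  # where tracked people can move
    walls: np.ndarray     # delimits the store
    shelves: np.ndarray   # footprint of the shelving units
    roof: np.ndarray      # search space for camera placement

grid = (20, 30)
layers = StoreLayers(
    walkable=np.zeros(grid, dtype=bool),
    walls=np.zeros(grid, dtype=bool),
    shelves=np.zeros(grid, dtype=bool),
    roof=np.ones(grid, dtype=bool),  # every ceiling cell starts as a candidate
)
layers.shelves[5:8, 10:20] = True                   # a block of shelving
layers.walkable = ~layers.shelves & ~layers.walls   # walkable = free floor space
```

Keeping the layers separate matters because each one feeds a different part of the pipeline: walkable cells define where tracking coverage is needed, shelf cells define regions of interest, and roof cells define where cameras may go.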

Now that we have a model of the environment, we have to figure out how to score cameras. An example of a two-camera set with rectilinear lenses (red dots with arrows) is illustrated in the figure below. At the top of the image, Camera A points to the top and Camera B points to the left of the map. To score the camera set, we raycast the camera pixels into the environment and vote the projections into a voxel grid. The bottom part of the figure shows a walkable area grid set at a height of 1.5m. Blue voxels show camera coverage of a single camera, and cyan voxels show coverage by two cameras—which in this case would be the overlap between cameras A and B. The higher the score of a voxel, the better the coverage by different cameras. This is a simple process that helps us to score camera sets, and we call them coverage maps. We can create different types of coverage maps for tracking, shelving, the overall 3D space, and more.

Figure showing an example of the coverage by two rectilinear cameras. Voxels in cyan are covered by two cameras, and voxels in blue are covered by one camera.
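A minimal 2D sketch of this scoring idea—assuming a simple ray-marching visibility check instead of our actual 3D lens-model raycasting, with made-up store dimensions—might look like this:

```python
# 2D sketch of coverage scoring by raycasting: each camera casts rays that
# stop at obstacles, and every visible cell receives one vote per camera.
import math
import numpy as np

def coverage_map(shape, cameras, obstacles, n_rays=360, max_range=50.0):
    """cameras: list of (x, y) positions; obstacles: boolean grid of blocked cells."""
    votes = np.zeros(shape, dtype=int)
    for cx, cy in cameras:
        seen = np.zeros(shape, dtype=bool)  # each cell votes once per camera
        for k in range(n_rays):
            a = 2 * math.pi * k / n_rays
            for r in np.arange(0.0, max_range, 0.5):  # march along the ray
                x, y = int(cx + r * math.cos(a)), int(cy + r * math.sin(a))
                if not (0 <= x < shape[0] and 0 <= y < shape[1]):
                    break
                if obstacles[x, y]:  # a wall or shelf blocks the rest of the ray
                    break
                seen[x, y] = True
        votes += seen
    return votes  # e.g. cells with votes >= 2 are usable for stereo

obstacles = np.zeros((40, 40), dtype=bool)
obstacles[20, 5:35] = True  # a shelf row splitting the store
votes = coverage_map((40, 40), cameras=[(10, 10), (30, 10)], obstacles=obstacles)
```

Here the shelf row occludes each camera from part of the floor, so some cells end up with one vote and others with two—the same single-camera vs. overlap distinction the blue and cyan voxels show in the figure.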

Coverage maps are independent of camera type. Below, we show coverage scores for two different camera sets.

The figure below shows a nine-camera set (numbered orange circles) plotted on a store layout, with shelves painted in blue.

Image of a store layout with nine cameras (in orange). Shelves are shown in blue.

The coverage map of the nine-camera set is shown below. Blue shows areas covered by five cameras, and red shows coverage from one camera or fewer. Coverage by two cameras (yellow) is the minimum required for computing stereo vision. This is a low-coverage set, with poorly covered places (in red) and some areas seen by barely two cameras.

Coverage map of the nine omnidirectional camera set of the previous image. Some voxels are covered by five cameras (blue), and some voxels are covered only by one camera (red).

The figures below show a 17-camera set and its coverage map. Most areas are covered by five cameras or more (colored in blue). This set provides excellent camera coverage.

Image of a store layout with seventeen omnidirectional cameras (in orange). Shelves are shown in blue.

Coverage map of the seventeen omnidirectional camera set of the previous image. Most voxels are covered by five cameras (blue), and some voxels are covered by four and three cameras (yellow).

Coverage maps give us a way to score different camera sets for a given store. Two open questions remain: how to generate the pool of candidate cameras, and how to select the subset of cameras that fulfills the coverage requirements.

The answer for the first question is rather straightforward: we place cameras over the store roof where physically possible and eliminate cameras blocked or occluded by walls, columns, pipes, and other objects. This generates a pool of camera candidates. Finally, a subset of these cameras has to be selected.
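A sketch of this candidate-pool step, under the simplifying assumption that the roof and its clutter are represented as boolean grids (the grid size and clutter placement below are invented):

```python
# Illustrative candidate-pool generation: keep every roof cell where a camera
# can physically mount, dropping spots blocked by clutter.
import numpy as np

roof = np.ones((20, 30), dtype=bool)   # ceiling cells
clutter = np.zeros_like(roof)          # lamps, grills, pipes, other cameras...
clutter[0:3, :] = True                 # e.g. ductwork along one edge

candidates = [(int(x), int(y)) for x, y in zip(*np.nonzero(roof & ~clutter))]
```

In practice each surviving position would also be checked against walls, columns, and pipes for view blockage before it enters the pool, as described above.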

The final question is more complicated and requires optimization. We treated camera subset selection as a combinatorial optimization problem: we minimized the squared error between the desired and the achieved coverage, and formulated the resulting non-linear cost as a mixed integer linear programming problem.
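For intuition only, here is a toy version of the subset selection that exhaustively searches a small camera pool instead of solving the paper’s mixed integer linear program (the pool, cells, and desired coverage are made up):

```python
# Toy subset selection: choose k cameras whose combined coverage minimizes the
# squared error against a desired per-cell coverage count. Exhaustive search
# stands in for the MILP, so this only scales to tiny pools.
from itertools import combinations

def coverage_error(chosen, coverage, desired):
    """coverage: camera -> set of covered cells; desired: cell -> wanted count."""
    achieved = {cell: 0 for cell in desired}
    for cam in chosen:
        for cell in coverage[cam]:
            if cell in achieved:
                achieved[cell] += 1
    return sum((desired[c] - achieved[c]) ** 2 for c in desired)

def best_subset(coverage, desired, k):
    return min(combinations(coverage, k),
               key=lambda subset: coverage_error(subset, coverage, desired))

coverage = {
    "A": {1, 2, 3},
    "B": {3, 4},
    "C": {1, 4, 5},
    "D": {2},
}
desired = {cell: 1 for cell in range(1, 6)}  # want every cell seen once
best = best_subset(coverage, desired, k=2)
```

The squared-error objective is what makes over- and under-coverage both costly: a subset that sees one cell five times while missing another scores worse than one that spreads its views evenly.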

This article is a basic introduction to the nuances of optimizing camera coverage, but more technical details can be found in the work we recently published at the 39th IEEE International Conference on Robotics and Automation (ICRA 2022).

You can check out our paper at the link below.


Optimizing Camera Placements for Overlapped Coverage with 3D Camera Projections


Akshay Malhotra, Dhananjay Singh, Tushar Dadlani, Luis Yoichi Morales


The camera placement problem is modeled as a combinatorial optimization where given the maximum number of cameras, a camera set is selected from a larger pool of possible camera poses. We propose to minimize the squared error between the desired and the achieved coverage, and formulate the non-linear cost function as a mixed integer linear programming problem. A camera lens model is utilized to project the camera's view on a 3D voxel map to compute a coverage score which makes the optimization problem in real environments tractable.


If you’re interested in revolutionizing retail and working on cutting-edge problems, our remote-first team is constantly growing. Check out our careers page and see if we’re right for you.

To learn more, watch our presentation from ICRA 2022.
