What is the difference between ground truth, bounding box and anchor box?
Ground truth, bounding boxes, and anchor boxes are distinct but interconnected concepts in object detection within computer vision, each serving a specific role in the training and operational pipeline. The ground truth represents the absolute, human-annotated reference data for a given image. It is the definitive answer key, specifying the precise location (typically via the coordinates of a bounding box) and the correct class label for every object of interest present. This data is the foundational standard against which a model's predictions are measured, and its accuracy is paramount; any error or ambiguity in the ground truth directly corrupts the learning process. It is static and fixed: the model never sees it at inference time, and it exists to supervise training by providing the targets for loss calculation and to benchmark the model during evaluation.
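As a concrete sketch, a ground truth record for one image might look like the following. The field names and layout here are illustrative, not a specific dataset format, though the `(x_min, y_min, x_max, y_max)` pixel convention is common:

```python
# Hypothetical ground truth record for a single image; field names are
# illustrative. Boxes use the (x_min, y_min, x_max, y_max) convention.
ground_truth = {
    "image_id": 42,
    "objects": [
        {"label": "pedestrian", "box": (48, 120, 95, 310)},
        {"label": "car",        "box": (200, 180, 420, 300)},
    ],
}

def validate_boxes(record, img_w, img_h):
    """Check that every annotated box is well-formed and lies inside the image."""
    for obj in record["objects"]:
        x0, y0, x1, y1 = obj["box"]
        assert 0 <= x0 < x1 <= img_w and 0 <= y0 < y1 <= img_h, obj
    return True
```

Because annotation errors corrupt training, a validation pass like this is a cheap sanity check before the data ever reaches the loss function.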
A bounding box is the geometric construct used to define an object's spatial extent within an image's pixel grid. Both ground truth annotations and a model's final predictions are expressed as bounding boxes, usually as coordinates for the box's center, width, and height, or equivalently its top-left and bottom-right corners. The critical distinction lies in their origin: a ground truth bounding box is the manual annotation, while a predicted bounding box is the model's output. The core objective of an object detection model is to produce predicted bounding boxes that align as closely as possible with their ground truth counterparts, a performance measured by metrics like Intersection over Union (IoU). During training, the model learns to regress from its initial, often poor, location estimates to these true coordinates.
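IoU, the overlap metric mentioned above, is simple to compute for axis-aligned boxes. A minimal sketch, assuming the corner-coordinate convention:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes.

    Boxes are (x_min, y_min, x_max, y_max) in pixel coordinates.
    Returns a value in [0, 1]: 1 for identical boxes, 0 for no overlap.
    """
    # Width and height of the intersection rectangle (clamped at zero).
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A perfect prediction scores 1.0; detection benchmarks typically count a prediction as correct only when its IoU with some ground truth box clears a threshold such as 0.5.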
Anchor boxes, or priors, are a pivotal mechanism enabling this regression. They are not predictions or truths, but a predefined set of template boxes, each with a fixed size and aspect ratio, that serve as initial reference points or starting guesses. These anchors are tiled densely across the feature map at various scales and aspect ratios (e.g., tall for pedestrians, wide for vehicles) to cover the likely shapes of objects. The model does not predict absolute box coordinates from scratch. Instead, for each anchor, it predicts two outputs: classification scores over the object categories, and regression offsets that adjust the anchor's position and dimensions. An anchor is marked as a positive match when its IoU with a ground truth box exceeds a threshold, and that ground truth then becomes the learning target for that specific anchor. This scheme stabilizes training by providing a structured starting point and lets the model specialize different anchors for different object shapes.
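The matching step above can be sketched as follows. The 0.5 threshold and the simple best-overlap rule are illustrative; real detectors add refinements such as an ignore band of intermediate IoUs and forcing the best anchor per ground truth box to be positive:

```python
def iou(a, b):
    """IoU of two (x_min, y_min, x_max, y_max) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def match_anchors(anchors, gt_boxes, pos_thresh=0.5):
    """For each anchor, return the index of the ground truth box it is
    assigned to (its best IoU must reach pos_thresh), or -1 for a
    negative (background) anchor."""
    assignments = []
    for a in anchors:
        ious = [iou(a, g) for g in gt_boxes]
        best = max(range(len(ious)), key=ious.__getitem__) if ious else -1
        positive = best >= 0 and ious[best] >= pos_thresh
        assignments.append(best if positive else -1)
    return assignments
```

Positive anchors contribute to both the classification and regression losses; negative anchors contribute only to classification, learning to predict "background".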
The functional relationship is thus a sequence from fixed reference to learned adjustment to final output. Anchor boxes provide the initial, multi-scale grid of hypotheses. The model's convolutional network processes the image and, for each anchor, predicts refinement offsets and class probabilities. These refinements are applied to the anchor to generate a predicted bounding box, which during training is compared to the ground truth bounding box to compute the loss. Therefore, anchors are an architectural scaffold used at both training and inference time, ground truth is the immutable supervisory signal available only during training and evaluation, and bounding boxes are the common language, used for both truth and prediction, to describe object location. The model's entire purpose is to translate image features into accurate bounding box predictions that match the ground truth.
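The step where predicted offsets are applied to an anchor can be sketched with the widely used parameterization from Faster R-CNN (a common choice, not the only one), working in center-size coordinates:

```python
import math

def decode(anchor, offsets):
    """Apply predicted (tx, ty, tw, th) offsets to an anchor box.

    anchor is (cx, cy, w, h) in center-size format. The standard
    Faster R-CNN parameterization is:
        cx' = cx + tx * w,   cy' = cy + ty * h,
        w'  = w * exp(tw),   h'  = h * exp(th)
    so zero offsets leave the anchor unchanged, and the log-space
    width/height terms keep the decoded box size positive.
    """
    cx, cy, w, h = anchor
    tx, ty, tw, th = offsets
    return (cx + tx * w, cy + ty * h, w * math.exp(tw), h * math.exp(th))
```

This same decoding runs at inference, which is why anchors remain part of the deployed model and not just a training device.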