Detection Pipeline and Scalability

To maximize performance and scalability, EyesOnIt processes video through a multi-stage pipeline. The early stages are lightweight, and the heavy work is deferred to the later stages. If an early stage determines that no further processing is necessary, EyesOnIt skips the rest of the pipeline for that frame. Taking advantage of this design minimizes the processing time for each video stream, letting you process more streams without upgrading hardware.
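Conceptually, the early-exit behavior looks like the following minimal sketch. All of the function names and stub implementations here are hypothetical placeholders, not EyesOnIt APIs; the point is only the control flow, where each cheap stage can stop work before the expensive final stage runs:

```python
# A self-contained sketch of an early-exit detection pipeline.
# Every function below is a stand-in, not part of the EyesOnIt API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Detection:
    label: str
    confidence: float

def should_process(frame_index: int, stream_fps: int, target_fps: int) -> bool:
    """Stage 2 stand-in: keep roughly target_fps frames per second."""
    return frame_index % max(1, stream_fps // target_fps) == 0

def has_motion(frame) -> bool:
    """Stage 3 stand-in: a real implementation would diff frames."""
    return True

def candidate_boxes(frame) -> list:
    """Stage 4 stand-in: a fast common-object detector."""
    return [(0, 0, 100, 100)]

def analyze_with_lvm(frame, boxes) -> Optional[Detection]:
    """Stage 5 stand-in: the expensive Large Vision Model."""
    return Detection("person near the gate", 0.91)

def process(frame, frame_index: int) -> Optional[Detection]:
    if not should_process(frame_index, stream_fps=15, target_fps=5):
        return None                        # skipped cheaply (Stage 2)
    if not has_motion(frame):
        return None                        # nothing changed (Stage 3)
    boxes = candidate_boxes(frame)
    if not boxes:
        return None                        # no candidates worth analyzing (Stage 4)
    return analyze_with_lvm(frame, boxes)  # heavy stage runs last (Stage 5)
```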

For each frame from the camera video stream, EyesOnIt executes the following detection stages:

Stage 1: Video Decoding

When reading video from an RTSP stream, every frame must be read and decoded because the video is compressed. With compressed video, most frames encode only the areas that have changed since the previous frame, so decoding the current frame requires the frames that came before it. Every frame must therefore be decoded, even if it will not be processed further. Fortunately, video decoding is fast and runs on the GPU if you have one.
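For illustration, this is what reading and decoding an RTSP stream looks like with OpenCV. EyesOnIt performs its own decoding internally, and the stream URL below is a placeholder:

```python
# Reading and decoding an RTSP stream with OpenCV (illustrative only).
# Every frame must be decoded, even frames that later stages will skip,
# because inter-frame compression makes each frame depend on earlier ones.
import cv2

cap = cv2.VideoCapture("rtsp://camera.example.com/stream")  # placeholder URL
while cap.isOpened():
    ok, frame = cap.read()   # read() decodes the next frame
    if not ok:
        break
    # hand the decoded frame to the later pipeline stages here
cap.release()
```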

Stage 2: Frame Rate Limiting

Although every frame must be decoded, not every frame must be processed. In scenarios with fast motion, you likely need to process every frame to detect an object, especially if you want to track motion for object counting or line cross detection. For many scenarios, however, processing one frame per second is sufficient. EyesOnIt lets you specify the number of frames per second to process. For example, if you tell EyesOnIt to process 5 frames per second and your video stream provides 15 frames per second, EyesOnIt will discard two frames for every frame it processes. The frame rate setting is one of the most effective ways to reduce processing time and improve scalability.
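The arithmetic behind that example is simple frame skipping, shown in the sketch below. The variable names are ours, not EyesOnIt settings:

```python
# With a 15 fps stream and a 5 fps processing rate, every third frame
# is processed and the other two are decoded but discarded.
stream_fps = 15
process_fps = 5
step = stream_fps // process_fps   # process every 3rd frame

for frame_index in range(30):      # two seconds of video
    if frame_index % step != 0:
        continue                   # decoded, but not processed further
    print(f"processing frame {frame_index}")
```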

Stage 3: Motion Detection

The third stage is motion detection, where EyesOnIt determines whether anything has changed in your video. Motion detection is optional. It is a fast operation that decides whether the later stages of the pipeline need to run at all. In scenarios where motion is rare, it makes sense to enable motion detection so that the remaining, more time-consuming stages run only when necessary. If there is constant motion in your video, motion detection should be turned off, since it would spend processing time only to confirm that the next stage needs to run.
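One common way to implement this kind of cheap motion check is grayscale frame differencing with a pixel-change threshold, sketched below with OpenCV. This illustrates the general technique, not EyesOnIt's internal algorithm, and the threshold values are arbitrary:

```python
# Cheap motion detection via frame differencing (illustrative only).
import cv2

def motion_detected(prev_frame, frame, pixel_delta=25, min_changed=500):
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(prev_gray, gray)                  # per-pixel change
    _, mask = cv2.threshold(diff, pixel_delta, 255, cv2.THRESH_BINARY)
    return cv2.countNonZero(mask) >= min_changed         # enough pixels changed?
```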

Stage 4: Common Object Detection

Common object detection uses a fast computer vision model to detect common objects such as people and vehicles. This approach not only saves processing time by quickly identifying potential objects of interest, but also improves accuracy by providing candidate bounding boxes to the final stage of the pipeline. Like motion detection, common object detection is optional, but it is recommended whenever the objects you want to detect are types the fast model supports.
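The hand-off to the final stage can be pictured as follows: the fast detector proposes labeled boxes, and only crops for relevant classes move on. The detection format, class list, and threshold below are illustrative assumptions, not EyesOnIt's data structures:

```python
# Sketch of the hand-off from common object detection to the LVM stage.
# Frames are assumed to be NumPy arrays (height x width x channels).
RELEVANT_CLASSES = {"person", "car", "truck"}   # illustrative class names

def candidate_crops(frame, detections):
    """detections: list of (class_name, confidence, (x1, y1, x2, y2))."""
    crops = []
    for class_name, confidence, (x1, y1, x2, y2) in detections:
        if class_name in RELEVANT_CLASSES and confidence >= 0.4:
            crops.append(frame[y1:y2, x1:x2])   # crop to analyze in Stage 5
    return crops
```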

Stage 5: Natural Language Object Analysis

The last stage in the pipeline is where EyesOnIt runs the Large Vision Model with your natural language object descriptions to determine which of the candidate bounding boxes from common object detection contain exactly what you want to detect. This is the most compute-intensive stage of the pipeline. Using the earlier stages effectively limits how often the Large Vision Model runs, allowing EyesOnIt to operate more efficiently. Note that if common object detection is not used, the Large Vision Model analyzes the full polygon of each detection region instead.
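To make the idea of scoring image regions against natural language descriptions concrete, the sketch below uses an open CLIP model from Hugging Face Transformers. EyesOnIt's Large Vision Model is not exposed this way; this only demonstrates the general vision-language scoring technique, and the descriptions and file name are made up:

```python
# Scoring an image crop against natural language descriptions with an
# open CLIP model (illustrative of the technique, not the EyesOnIt API).
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

descriptions = ["a person climbing a fence", "an empty fence"]
crop = Image.open("candidate_crop.jpg")   # a candidate box from Stage 4

inputs = processor(text=descriptions, images=crop,
                   return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)  # one score per description
print(dict(zip(descriptions, probs[0].tolist())))
```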