An improved method for measuring frame synchronization time parameters based on YOLOv4

September 2022. Microprocessors and Microsystems Volume 93, 104573

Published by Elsevier B.V.



In digital imaging, dual-sensor fusion imaging has critical applications because it can improve the signal-to-noise ratio (SNR) of images and capture more detail than single-sensor imaging. The frame synchronization time parameter, a vital evaluation factor in dual-sensor fusion imaging, still lacks an automated measurement method.

Although the You Only Look Once Version 4 (YOLOv4) algorithm can detect targets in images with high speed and accuracy, it cannot obtain the frame synchronization time parameter directly.

To solve this problem, this paper proposes an improved method for measuring frame synchronization time parameters based on YOLOv4. In this method, we propose an optimized network and construct the Position Marker Match algorithm and Light-Emitting Diode (LED) lit area conversion algorithm.

Compared with other popular methods, the proposed method achieves higher accuracy and calculates frame synchronization time parameters automatically. It also shows good generalization ability and robustness.


Digital imaging

Signal-to-noise ratio

Object detection

Frame synchronization

1. Introduction

As semiconductor manufacturing processes advance, sensors are trending toward higher pixel counts and smaller pixel sizes, which places higher demands on image processing speed and image signal-to-noise ratio. In practice, it has become more common to use dual sensors to increase the amount of light captured in low-light scenes and thereby boost the signal-to-noise ratio.

Through dual-sensor image fusion, image clarity is enhanced so that the image presents more detail. Typical use cases include the following. In cell phone cameras, dual sensors are used to obtain depth information, which facilitates artistic processing so that the resulting image shows a depth-of-field bokeh effect similar to that of a Single-Lens Reflex (SLR) camera. In security surveillance, dual sensors can indirectly increase the amount of incoming light and improve the detail of the video stream in low-light scenes. In Advanced Driving Assistance Systems (ADAS), dual sensors can be used to measure the distance to objects or to fuse and stitch images into a single image with a broader Field Of View (FOV) than a single image sensor provides.

Dual-sensor fusion imaging can improve picture clarity and image quality, but it also faces a series of problems. The two frames obtained from a dual-sensor exposure are generated by two different physical components. Because of differences in control signal delay and signal response, it is difficult to ensure that the two sensors start and end exposure simultaneously. In addition, subject movement or camera shake causes differences in the content of the two frames, which also affects the fusion result. These issues place higher requirements on dual-sensor fusion imaging technology.

In dual-sensor fusion imaging technology, the frame synchronization time parameter is essential in evaluating the quality of fusion imaging. The time-related parameters of a sensor consist of exposure time, exposure start time, Electronic Rolling Shutter (ERS), vertical blanking, frame rate, and time lags. Among them, exposure time, exposure start time, and ERS are the main parameters affecting the sensor’s image content. Exposure time and exposure start time jointly affect the time-domain characteristics of the object captured by the sensor, whereas ERS is an inherent characteristic of the Complementary Metal-Oxide-Semiconductor (CMOS) sensor and is excluded from the frame synchronization time parameters. In sensor imaging, the exposure start time and exposure time show up in the frame images as the alignment of the frame content. Different scenes have different requirements for frame synchronization parameters: the speed of object movement and the frame synchronization time parameter together determine the difference between the frame images. It is generally accepted that the frame synchronization time difference is mostly within 10 ms for ordinary cell phone photo scenes, whereas for highway surveillance cameras it is mostly within 5 ms because of the high speed of cars. When the frame synchronization time parameter is large, there is usually a significant difference in the position of the subjects in the two images, and the fused images are prone to ghosting and artifacts. When it is small, the position difference between the subjects in the two images is usually tiny, the content of the two images is closer, and the fused image is cleaner. Currently, there are two main types of input images for measuring the frame synchronization time parameter: one is a stopwatch image with millisecond accuracy, and the other is a DxO LED Universal Timer image.
With the former, the stopwatch digits in the captured image blur together because of the exposure time and can only be read manually. With the latter, the measurement accuracy can be set, and the starting position and length of the lit area in the image are linked to the exposure timing. The latter therefore presents exposure time more intuitively and with better accuracy than the former. There are two ways to obtain the frame synchronization time parameters from a DxO LED Universal Timer image: manually counting the LEDs, and the DxO automatic counting method. The former is subject to human judgment and is less efficient, while the latter does not depend on human judgment and is more efficient. However, the DxO automatic method requires a specific environment and camera parameters; it cannot adapt to different natural environments and generalizes poorly.
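The link between subject speed, synchronization offset, and frame difference mentioned above can be illustrated with a short worked estimate. The speeds used here (a car at 120 km/h and a pedestrian at 1.5 m/s) are illustrative assumptions, not values from the paper; only the 5 ms and 10 ms offsets come from the text.

```latex
% Displacement of a subject between the two exposures, for speed v and offset \Delta t
\[
  \Delta x = v \,\Delta t
\]
% Assumed car on a highway: v = 120 km/h \approx 33.3 m/s, \Delta t = 5 ms
\[
  \Delta x \approx 33.3\ \mathrm{m/s} \times 0.005\ \mathrm{s} \approx 0.17\ \mathrm{m}
\]
% Assumed pedestrian in a phone photo: v = 1.5 m/s, \Delta t = 10 ms
\[
  \Delta x = 1.5\ \mathrm{m/s} \times 0.010\ \mathrm{s} = 0.015\ \mathrm{m}
\]
```

Under these assumptions, a fast subject still moves by a noticeable distance between the two exposures at the tighter 5 ms bound, which is consistent with faster scenes calling for a smaller offset.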

In computer vision, automatically measuring frame synchronization time parameters is a matter of target detection and target localization. To overcome the difficulties in measuring frame synchronization time parameters, a new automated detection method is needed. Automatic detection methods can be divided into traditional and deep learning target detection methods. The general steps of a conventional target detection algorithm are (1) traversing the image with sliding windows of various sizes to obtain image information; (2) constructing mathematical models to extract features; and (3) using a variety of classifiers to obtain detection results. The traditional approach is simple and intuitive. Its disadvantage is that detection accuracy depends heavily on the image background, and it often fails to meet people’s requirements under backlight, overexposure, and defocus. Furthermore, the target object carries a large amount of feature information, and most mathematical models have difficulty summarizing it comprehensively. As a result, such models generalize poorly and easily produce false detections and missed detections.
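Step (1) of the conventional pipeline described above can be sketched as follows. This is a minimal illustration of sliding-window traversal only; the function name, window sizes, and stride are hypothetical choices, and feature extraction and classification (steps 2 and 3) are left out.

```python
from typing import Iterator, List, Tuple


def sliding_windows(
    width: int, height: int, sizes: List[Tuple[int, int]], stride: int
) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (x, y, w, h) for every window of each size that fits in the image."""
    for w, h in sizes:
        for y in range(0, height - h + 1, stride):
            for x in range(0, width - w + 1, stride):
                yield x, y, w, h


# Hypothetical 64 x 48 image scanned with two window sizes and a stride of 8.
windows = list(sliding_windows(64, 48, sizes=[(16, 16), (32, 32)], stride=8))
print(len(windows), windows[0], windows[-1])
```

Even this small image produces dozens of candidate windows, each of which would need feature extraction and classification, which hints at why the traditional approach scales poorly.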

With the development of deep learning and progress in computer hardware, mainstream target detection technology based on Convolutional Neural Networks (CNN) offers a way to solve the above problems. CNN-based detectors fall into two groups. The first obtains candidate boxes from visual features and then regresses them to get the detection region; the classical examples are the Region-CNN (R-CNN) series algorithms. The second treats target detection and spatial localization as a regression problem and obtains the object’s location and confidence in a single pass, without generating candidate boxes; representative algorithms are YOLOv4 and RetinaNet. YOLOv4 builds on You Only Look Once Version 3 (YOLOv3) and improves data processing, the backbone network, network training, and data augmentation. It has reached an industry-leading level of precision and accuracy, but it cannot measure frame synchronization time parameters directly; doing so is a practical task built on top of YOLOv4. It is worth noting that there is an algorithm called YOLOv5, but it is not official, and the so-called YOLOv5 is not as accurate as YOLOv4 under the same conditions. The comparison of YOLOv4 and YOLOv5 can be found at

Given the importance of frame synchronization time parameters for dual-sensor fusion imaging and the shortcomings of current measurement approaches, this paper proposes an improved method for measuring frame synchronization time parameters. It removes the error caused by manually reading the scale of the DxO LED Universal Timer; it automates the calculation of frame synchronization time parameters and reduces labor cost; and it tolerates a wide range of test scenarios instead of requiring specific lighting and other environmental conditions. Moreover, the frame synchronization time parameters obtained under different environmental conditions can be used to verify realistically the effect of varying tuning parameters in the Image Signal Processor (ISP). This method has high practical value in the field of dual-sensor fusion imaging.

To measure the timing information of multiple image frames, we ultimately expect the method to run on embedded devices (cell phones or surveillance cameras). Because that is a large project, we first implement a method that runs on an NVIDIA Graphics Processing Unit (GPU) and will try to migrate it to embedded devices in future research. This paper studies an Improved YOLOv4 method that takes the DxO LED Universal Timer image as input, automatically identifies the rectangular spatial location and size of the strip lit area in the image, and calculates the timing information of different image frames. Corresponding optimization strategies are proposed for the different problems in this method: an image pre-processing strategy to improve accuracy on small target objects; an image determination threshold logic to prevent noise in tiny object areas from being misidentified; a data augmentation strategy, denoted the Mosaic-a method, to address the uneven distribution of training samples; a new network structure to improve the accuracy of object position; and a new algorithmic strategy to map the spatial position in the input image to the value of the frame synchronization time parameter.

Through pre-processing digital images and adjusting the marking threshold and the Mosaic-a method, the accuracy of the algorithm results is improved. It shows good generalization ability and robustness for low illumination, backlighting, and other scenes, as well as the detection of defocus situations. Meanwhile, there is no external digital signal communication between the camera device and the DxO LED Universal Timer. The method only relies on the DxO LED Universal Timer image frames collected by the camera device, which can be applied for future analysis.

2. Related work

As the technology develops by leaps and bounds, the deep learning algorithm represented by CNN plays an increasingly important role in the field of object detection. In order to improve the detection accuracy and algorithm robustness, relevant researchers have done exploration in various directions. Meanwhile, thanks to the high imaging quality and unique structure, the dual-sensor camera has gradually attracted the attention of researchers.

There is a lot of research on improving the accuracy of object detection. Yu et al. [1] proposed a detection method based on two-way convolution and multi-scale feature extraction. The input image is split into two paths for feature processing; features from the upper and lower paths are fused, and features are extracted at different scales for targets of different sizes. Chen et al. [2] proposed an improved method that was trained on the WIDER FACE dataset and evaluated on the Face Detection Data Set. It achieves more accurate results by using anchor boxes more appropriate for face detection and a more precise regression loss function. Hsu et al. [3] proposed a new pedestrian detection model following the divide-and-rule concept. Multiresolution adaptive fusion was performed on the output of all images and subimages to generate the final detection result. Huang et al.

The development of artificial intelligence technologies has led to more research applying machine learning to real-life object detection tasks. Lv et al. [4] used a Support Vector Machine (SVM) trained only on RGB color space for fruit identification in natural scenes. They reported that this method performed much better than previous threshold-based methods. Luo et al. [5] proposed a framework for object detection based on AdaBoost and color features. The experiments demonstrated that this method can partly reduce the influence of weather conditions and illumination variation. Sa et al. [6] applied the Faster R-CNN [7] detector to fruit detection. The information from the RGB image and Near-Infrared image was used with two fusion methods. This method obtained better results than previous methods.

As representative object detection algorithms, the YOLO series has a wide range of applications in production scenarios and has absorbed many other techniques. Bochkovskiy et al. [8] proposed YOLOv4, which uses new features including Weighted-Residual-Connections (WRC), Cross-Stage-Partial-Connections (CSP), Cross Mini-Batch Normalization (CmBN), Self-Adversarial-Training (SAT), Mish activation, Mosaic data augmentation, DropBlock regularization, and Complete-IoU (CIoU) loss, and combines some of them to achieve state-of-the-art results. Seo et al. [9] focused on separating touching pigs in real time using both the fastest CNN-based object detection technique and image processing techniques. The touching pigs are detected by using image processing techniques with both infrared and depth information acquired from an Intel RealSense camera. They also prepare the learning data for YOLO-based ‘object detection’. Laroca et al. [10] proposed a robust and efficient Automatic License Plate Recognition (ALPR) system based on the state-of-the-art YOLO object detector. Their system uses a two-stage approach employing simple data augmentation tricks such as inverted License Plates (LPs) and flipped characters. Tian et al. [11] proposed a YOLOv3-dense model using DenseNet [12] to optimize the feature layers with low resolution in the YOLOv3 model by enhancing feature propagation, promoting feature reuse, and improving network performance. The YOLOv3-dense model can also be used to detect occluded and overlapping apples in real time. Liu et al. [13] proposed an improved tomato detection model called YOLO-Tomato, based on YOLOv3, for dealing with complicated environmental conditions. A dense architecture [12] is incorporated into YOLOv3 to facilitate the reuse of features and help to learn a more compact and accurate model. Moreover, the model replaces the traditional Rectangular Bounding Box (R-Bbox) with a Circular Bounding Box (C-Bbox) for tomato localization. Liu et al. [14] proposed a network structure for use on Unmanned Aerial Vehicles (UAVs). Because the targets are small in scale, the network is improved by adding convolution operations at an early layer, based on Darknet, to enrich spatial information. Both of these optimizations can enlarge the receptive field.

Data augmentation methods and regularization methods can increase the diversity of samples, improve the resistance of the algorithm to attacks, and reduce the occurrence of overfitting. Yun et al. [15] proposed CutMix: patches are cut and pasted among training images where the ground truth labels are also mixed proportionally to the area of the patches. CutMix improves the model robustness against input corruptions and its out-of-distribution detection performances. Ghiasi et al. [16] proposed DropBlock, a form of structured dropout, where units in a contiguous region of a feature map are dropped together. Applying DropBlock in skip connections in addition to the convolution layers increases the accuracy. Also, the gradually increasing number of dropped units during training leads to better accuracy and more robust hyperparameter choices. Chen et al. [17] proposed a simple, general, and effective policy for data augmentation based on information dropping. It deletes uniformly distributed areas and finally forms a grid shape. Using this shape to delete information is more effective than setting a completely random location. Singh et al. [18] presented ‘Hide-and-Seek’, a simple and general data augmentation technique for visual recognition. By randomly hiding patches/frames in a training image/video, they force the network to learn to focus on multiple relevant parts of an object/action.
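The CutMix idea summarized above (paste a patch from one image into another and mix the labels in proportion to the patch area) can be sketched in plain Python. This is a simplified illustration on nested lists, not the reference implementation of [15]; the function name, the `alpha` default, and the tiny 8 x 8 images are assumptions made for the example.

```python
import math
import random


def cutmix(img_a, img_b, label_a, label_b, alpha=1.0, rng=random):
    """Paste a random rectangle of img_b into a copy of img_a and mix the labels.

    Images are H x W nested lists; labels are equal-length lists of class weights.
    """
    h, w = len(img_a), len(img_a[0])
    lam = rng.betavariate(alpha, alpha)           # target share kept from img_a
    cut_w = int(w * math.sqrt(1.0 - lam))
    cut_h = int(h * math.sqrt(1.0 - lam))
    cx, cy = rng.randrange(w), rng.randrange(h)   # patch centre
    x0, x1 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    y0, y1 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)

    mixed = [row[:] for row in img_a]
    for y in range(y0, y1):
        mixed[y][x0:x1] = img_b[y][x0:x1]

    # Recompute the ratio from the clipped patch so the labels match the pixels.
    lam = 1.0 - (x1 - x0) * (y1 - y0) / (w * h)
    mixed_label = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed, mixed_label


# Hypothetical two-class example: image A is all zeros, image B is all ones.
rng = random.Random(0)
img_a = [[0] * 8 for _ in range(8)]
img_b = [[1] * 8 for _ in range(8)]
mixed, label = cutmix(img_a, img_b, [1.0, 0.0], [0.0, 1.0], rng=rng)
```

Because the mixing ratio is recomputed after clipping, the weight of class B in `label` equals the fraction of pixels that actually came from image B.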

The loss function helps to speed up the convergence of learning and improve the accuracy of regression objects. Zheng et al. [19] proposed two losses, i.e., Distance-IoU (DIoU) loss and CIoU loss, for bounding box regression along with DIoU Non-Maximum Suppression (NMS) for suppressing redundant detection boxes. By directly minimizing the normalized distance of two central points, DIoU loss can achieve faster convergence than Generalized Intersection Over Union (GIoU) loss. CIoU loss takes three geometric properties into consideration, i.e., overlap area, central point distance and aspect ratio, and leads to faster convergence and better performance. Zhou et al. [20] addressed the 2D/3D object detection problem by introducing the IoU loss for two rotated Bboxes. They proposed a unified framework independent IoU loss layer which can be directly applied to axis-aligned or rotated 2D/3D object detection frameworks. Rezatofighi et al. [21] introduced a generalization to IoU as a new metric, namely GIoU, for comparing any two arbitrary shapes. They showed that this new metric has all of the appealing properties that can be found in IoU while addressing its weakness.
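The IoU-based metrics discussed above can be compared side by side with a small sketch. It follows the published definitions of GIoU, DIoU, and CIoU for axis-aligned boxes; the function name and the `(x1, y1, x2, y2)` box format are choices made for this example, and boxes are assumed to have positive width and height. The corresponding loss in each case is one minus the metric.

```python
import math


def iou_family(box_a, box_b):
    """Return (IoU, GIoU, DIoU, CIoU) for two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    wa, ha = ax2 - ax1, ay2 - ay1
    wb, hb = bx2 - bx1, by2 - by1

    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = wa * ha + wb * hb - inter
    iou = inter / union

    # Smallest box C enclosing both inputs.
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    giou = iou - (cw * ch - union) / (cw * ch)

    # Squared distance between centres over the squared diagonal of C.
    rho2 = ((ax1 + ax2 - bx1 - bx2) ** 2 + (ay1 + ay2 - by1 - by2) ** 2) / 4.0
    diou = iou - rho2 / (cw ** 2 + ch ** 2)

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4.0 / math.pi ** 2) * (math.atan(wb / hb) - math.atan(wa / ha)) ** 2
    alpha = v / ((1.0 - iou) + v) if v > 0 else 0.0
    ciou = diou - alpha * v
    return iou, giou, diou, ciou


same = iou_family((0, 0, 2, 2), (0, 0, 2, 2))
apart = iou_family((0, 0, 1, 1), (2, 0, 3, 1))
```

For the two disjoint boxes, plain IoU is zero and gives no gradient signal about how far apart they are, while GIoU and DIoU become negative, which is the weakness of IoU that these variants address.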

Dual-sensor cameras have a wide range of applications in production and daily life. Wang et al. [22] proposed a novel dual-camera design to acquire 4D High-Speed Hyperspectral (HSHS) videos with high spatial and spectral resolution. They built a dual-camera system that simultaneously captures a panchromatic video at a high frame rate and a hyperspectral video at a low frame rate, which jointly provide reliable projections for the underlying HSHS video. Rupapara et al. [23] presented a low-complexity image fusion system in the Bayer domain using a monochrome and a Bayer sensor. They used a novel Bayer domain transform for performing the image fusion. The proposed system requires only one ISP and is computationally less intensive. It therefore reduces hardware resources significantly while still producing better-quality fused images. Yang et al. [24] proposed a dual-camera based framework to identify and track non-driving activities (NDAs) that require visual attention.

Dual-sensor cameras have important applications, but research on automated methods for measuring frame synchronization time parameters is inadequate. Masson et al. [25] introduced a new device, the DxO LED Universal Timer, designed to measure the different timings of digital cameras by counting LEDs on images. LED bar graphs make it easy to count LEDs on the instrument, and counting is done automatically with advanced image processing algorithms. A large number of LEDs allows for great measurement accuracy. The measurement algorithms are completely automated, but they depend on specific shooting conditions. Bucher et al. [26] presented a novel capacitive device that stimulates the touchscreen interface of a smartphone (or of any imaging device equipped with a capacitive touchscreen) and synchronizes triggering with the DxO LED Universal Timer to measure shooting time lag and shutter lag. Rátosi et al. [27] presented a novel method to measure the exposure time of digital cameras. The hardware requirements of the proposed method are low: only a signal generator, driving an LED source, is required.

In summary, the YOLO series has been developed over a long period of time and is in a leading position in terms of performance and accuracy. It has been widely used in production and daily life in combination with other advanced technologies. However, YOLO alone cannot accomplish the task of detecting frame synchronization time parameters, so we introduce other techniques to achieve this goal with high accuracy. Data augmentation methods can improve the richness of samples. Loss functions can improve the convergence speed of the algorithm. In imaging, dual-sensor cameras play an indispensable role due to their unique structure and high imaging quality. In contrast, there are few methods for automated measurement of frame synchronization. Some methods require additional wiring connections, which adds inconvenience and cost. Based on the above background, this paper proposes an Improved YOLOv4 method, which adopts data augmentation and other methods to realize the automatic detection of frame synchronization time parameters and has higher detection accuracy and adaptability to various scenes than the original YOLOv4.

3. Proposed improvement method

3.1. DxO LED universal timer

The DxO LED Universal Timer, shown in Fig. 1, consists of 5 lines of 100 LEDs each. The right side of each line shows the scrolling cycle time of that line of LEDs. The scrolling cycle time of the LEDs in the $i$th line is denoted $T_i$, $i \in \{1, 2, 3, 4, 5\}$. Each line of LEDs is lit from left to right, and each LED stays lit for $T_i/100$. The scrolling cycle time is settable. Moreover, there is no digital signal output reflecting the position of each LED. The position markers are divided into two categories, Marker-A (5 pieces) and Marker-B (1 piece), and any one of them can be processed centrally to obtain another category of position marker. In practice, the frame rate can be measured by the Vertical Deflection (VD) signal of the frame interrupt generated by the sensor, which is more accurate than direct observation from the image. This paper focuses on the exposure time and exposure start time.

At any given moment, only one LED in each line of running lights is lit. Because the lit LED scrolls along each row cyclically, an image captured with a certain exposure time shows a strip lit area. When capturing the timer image in an actual scene, the timer is located according to the spatial positions of the position markers, and its spatial orientation is confirmed according to the distribution of the marker categories. The frame exposure start time can be converted from the location coordinates of the strip lit area, and the exposure time can be converted from its length.
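The arithmetic implied by this description can be sketched as follows. This is not the paper's Position Marker Match or LED lit area conversion algorithm; it assumes that a detector has already reported, for one timer line, the 0-based index of the first lit LED and the number of lit LEDs, and all function and variable names are hypothetical. The resolution of each value is one LED duration, $T_i/100$.

```python
LEDS_PER_LINE = 100  # from the timer description: 100 LEDs per line


def strip_to_timing(first_lit, n_lit, cycle_ms):
    """Convert a lit strip into (start phase, exposure time), both in ms.

    first_lit: 0-based index of the first lit LED in the line (assumed input).
    n_lit:     number of consecutive lit LEDs (assumed input).
    cycle_ms:  scrolling cycle time T_i of this line in milliseconds.
    """
    led_ms = cycle_ms / LEDS_PER_LINE  # time each LED stays lit
    return first_lit * led_ms, n_lit * led_ms


def sync_offset_ms(start_a, start_b, cycle_ms):
    """Signed start-time difference b - a, wrapped into [-cycle/2, cycle/2)."""
    half = cycle_ms / 2.0
    return (start_b - start_a + half) % cycle_ms - half


# Hypothetical readings from two sensors on a line with a 100 ms cycle.
start_a, exposure_a = strip_to_timing(first_lit=98, n_lit=10, cycle_ms=100.0)
start_b, exposure_b = strip_to_timing(first_lit=1, n_lit=10, cycle_ms=100.0)
offset = sync_offset_ms(start_a, start_b, cycle_ms=100.0)
```

The wrap-around in `sync_offset_ms` handles the case where one strip starts near the end of the line and the other just after the scroll restarts. The result is only unambiguous when the true offset is smaller than half the cycle time, so a line with a suitably long cycle would have to be chosen.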


Yunfa Li received the Ph.D. degree from the School of Computer Science and Technology, Huazhong University of Science and Technology, Hubei, China, in 2008. He is currently a vice professor in software engineering at Hangzhou Dianzi University, China. His research interests include Cloud Computing, Internet of Things, Network Security, Big Data, Computer Vision and Deep Learning.

Guanxu Liu is a postgraduate student in Hangzhou Dianzi University, China. His research interests include Computer Vision and Deep Learning.

Ming Yang is currently a researcher of IoT Security at The Third Research Institute of The Ministry of Public Security. His research interests include Internet of Things, Network Security, Computer Vision and Deep Learning.

The research was funded by Key Lab of Information Network Security, Ministry of Public Security, China under Grant C20614.
