Vehicle Safe Distance Detection System Based On Image Processing As Accident Prevention With Faster R-CNN Method

The rising number of traffic accidents has caused numerous casualties and large economic and social losses. To address this problem, this study develops a camera-based system that detects the vehicles around the driver using the Faster R-CNN method and estimates their distance using the Stereo Vision and Mono Vision methods. The safe distance between vehicles is determined by the speed of the driver's vehicle, and an LED and buzzer warning system activates when the parameters are met. In object detection experiments with Faster R-CNN, the model identified and classified objects with an average success rate of 83.33 percent across 35 object situations examined from different perspectives. The success rates for distance estimation using the Stereo Vision and Mono Vision methods with the Linear Regression equation were 98.84% and 98.10%, respectively.

Accident statistics indicate that the leading cause of accidents is the driver's lack of awareness or focus while driving. "Image Processing Based Vehicle Safe Distance Detection System as Accident Prevention with the Faster R-CNN Method" is an innovation developed by the author to reduce the frequency of accidents on toll roads caused by driver negligence. This study uses a camera that captures images in a way similar to the human eye. The video is processed in real time using OpenCV, the objects in it are classified using the Faster R-CNN technique, and the distances between the objects and the camera are estimated using the Stereo Vision and Mono Vision methods.

II. FASTER REGION CONVOLUTIONAL NEURAL NETWORK (FASTER R-CNN)
Faster Region-based Convolutional Neural Network (Faster R-CNN) is a detection technique whose architecture combines Fast R-CNN with a Region Proposal Network (RPN). It modifies Fast R-CNN by replacing the selective search stage with the RPN, a neural network that takes over the role of selective search in proposing regions. The RPN generates a set of bounding boxes, each with two probability scores indicating whether or not an object exists at that position. These regions then serve as inputs to a Fast R-CNN style detector. Replacing selective search with the RPN drastically lowers the computing resources required and makes the entire model trainable from beginning to end [4] [5]. Figure 1 depicts the architecture of the Faster R-CNN algorithm. Faster R-CNN is divided into 2 (two) important parts, namely:
1. Region Proposal Network (RPN)
The RPN quickly searches the input image for possible object locations. Each proposed location is bounded by an identified region, namely the Region of Interest (ROI). The input to the ROI layer is a feature map, which is the output of a CNN with multiple convolution layers and max pooling layers. In the RPN, the input image is first processed by the convolutional network to produce a feature map with 6 (six) outputs per anchor: the object and non-object scores, each with a value of 0-1, the coordinates x and y, and the width and height of the bounding box. A sliding window of size n × n is placed on each feature map, and anchors are formed at each sliding-window position. The anchors at a given position share the same center point but have different aspect ratios and scaling factors.

2. Classifier
The classifier is utilized to categorize the ROI detected by the RPN into classes using CNN.
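The anchor scheme described for the RPN can be illustrated with a minimal sketch. The function name `make_anchors` and the default base size, aspect ratios, and scales are assumptions chosen for illustration (they follow commonly used Faster R-CNN settings) and are not values reported in this study.

```python
import math

def make_anchors(cx, cy, base_size=16, ratios=(0.5, 1.0, 2.0), scales=(8, 16, 32)):
    """Return (x1, y1, x2, y2) anchor boxes that all share the centre (cx, cy).

    base_size, ratios and scales are illustrative assumptions, not values
    taken from the paper.
    """
    anchors = []
    for ratio in ratios:              # ratio = height / width
        for scale in scales:
            area = (base_size * scale) ** 2
            width = math.sqrt(area / ratio)
            height = width * ratio    # keeps width * height equal to area
            anchors.append((cx - width / 2, cy - height / 2,
                            cx + width / 2, cy + height / 2))
    return anchors

# Nine anchors (3 ratios x 3 scales) centred on one sliding-window position.
example_anchors = make_anchors(100.0, 100.0)
```

Each sliding-window position would receive its own set of anchors in this way, and the RPN would then score and refine every anchor.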

A. Stereo Vision
Stereo vision is a field concerned with recovering the three-dimensional structure of a scene from two or more digital pictures captured from different viewpoints. A stereo camera consists of two identical cameras aligned in both the horizontal and vertical planes along a straight line. Distance measurements are made when the object lies in the region where the fields of view of the two cameras overlap [6] [7] [8]. The values of a and b are found using Equations 4 and 5.
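As a rough illustration of how two horizontally aligned cameras yield a distance, the sketch below applies the standard pinhole relation for rectified stereo pairs, distance = focal length × baseline / disparity. This is a textbook formulation and may differ from the exact equations used in this study; the function name `stereo_distance` and the focal length and baseline in the example are hypothetical.

```python
def stereo_distance(x_left, x_right, focal_px, baseline_m):
    """Estimate distance (metres) from the horizontal disparity of one point.

    x_left, x_right: x-coordinate (pixels) of the same object point in the
    left and right images. focal_px is the focal length in pixels and
    baseline_m the spacing between the cameras in metres (both assumed known
    from calibration; the values used below are made up).
    """
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_px * baseline_m / disparity

# Hypothetical calibration: 800 px focal length, 0.1 m baseline.
example_distance = stereo_distance(660.0, 640.0, focal_px=800.0, baseline_m=0.1)
```

A larger disparity means a closer object, which is why the estimate becomes less precise as the vehicle ahead moves farther away and the disparity shrinks to a few pixels.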

III. METHOD
This section explains the experimental design, instruments, data collection methods, and control types. This research addresses how the system can detect passing automobiles using the Faster R-CNN approach and estimate their distance using the Mono Vision method. Once the system recognizes a car, the vehicle's distance is passed to the warning system, which consists of displays, buzzers, and LEDs. The purpose of this research is to develop tools capable of estimating the speed of vehicles passing a car driver in order to improve driver vigilance. Figure 3 is a flowchart that systematically depicts the research's steps. The parameters for estimating safe distances depend on the speed of the user's vehicle, as indicated in Table 1, which is the authors' expansion of the speed and safe distance data [9] [10]. If a vehicle speed between 60 and 100 km/h is detected and the distance between the driver's car and a surrounding vehicle has fallen to the minimum safe distance, the warning system activates. The warning system consists of a buzzer sound alert and a flashing LED. While the distance is at or below the minimum, the LED indicator shows a red light, which shuts off when the distance is safe again.
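The warning rule described above can be sketched as a simple lookup. The thresholds in `SAFE_DISTANCE_TABLE` are placeholders invented for illustration only; they are not the values of Table 1, and the function names are hypothetical.

```python
# (min_speed_kmh, max_speed_kmh, min_safe_distance_m)
# PLACEHOLDER values for illustration only, NOT the contents of Table 1.
SAFE_DISTANCE_TABLE = [
    (60, 70, 30.0),
    (70, 80, 35.0),
    (80, 90, 40.0),
    (90, 100, 45.0),
]

def min_safe_distance(speed_kmh, table=SAFE_DISTANCE_TABLE):
    """Return the minimum safe distance for a speed, or None outside 60-100 km/h."""
    for low, high, distance_m in table:
        if low <= speed_kmh <= high:   # first matching row wins at shared boundaries
            return distance_m
    return None

def warning_active(speed_kmh, distance_m, table=SAFE_DISTANCE_TABLE):
    """True when the buzzer and red LED should be on."""
    limit = min_safe_distance(speed_kmh, table)
    return limit is not None and distance_m <= limit
```

With these placeholder rows, `warning_active(65, 20.0)` is true and `warning_active(65, 50.0)` is false; speeds outside the 60 to 100 km/h range never trigger the alert.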
This study addresses initiatives to lower accident rates, particularly on toll roads. A toll road is a motorway with a minimum speed of 60 kilometers per hour and a maximum speed of 100 kilometers per hour. The high speeds and close proximity of vehicles under these conditions frequently result in collisions. The research problem is how the system can detect objects using Faster R-CNN [11] and estimate the distance between objects and cameras using Stereo Vision and Mono Vision [12] with the Linear Regression method [13]. The system employs the Faster R-CNN algorithm.
The first step in the Pre-Processing phase is data collection, which is accomplished by capturing images of the objects with a camera or by gathering existing images. The images are divided into six vehicle classes: sedan, SUV, bus, minibus, truck, and pickup. Annotation, or labelling, is then assigned to each image in the collected dataset for each class using Labeling. The labelling records the location of each object in the picture and is saved in '.xml' format. The label draws a bounding box on the image and assigns a class name to it. The labelled data must then be converted from '.xml' to '.csv' and separated into train and test files. After the mapping is executed, the .csv data is converted to TFRecord so that TensorFlow can read it. This conversion puts the dataset into a binary format for efficient training. Once dataset preprocessing is complete, the data enters the Faster R-CNN training process, which trains on 38 images. When training is completed, the results show the loss value and the training time required to reach it, which are used to plot a graph of the training results, and the trained model is exported to classify image data. Lastly, in the testing stage, a test image is entered. The machine then reads the model generated by training and proceeds with object detection.
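The '.xml' to '.csv' conversion step can be sketched as follows. The sketch assumes the annotations follow the Pascal VOC layout that common labelling tools produce (an assumption, since the paper does not show a sample file); the column names, helper names, and sample annotation are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

CSV_COLUMNS = ["filename", "width", "height", "class", "xmin", "ymin", "xmax", "ymax"]

def xml_to_rows(xml_text):
    """Turn one annotation (assumed Pascal VOC layout) into one row per labelled box."""
    root = ET.fromstring(xml_text)
    filename = root.findtext("filename")
    width = int(root.findtext("size/width"))
    height = int(root.findtext("size/height"))
    rows = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        rows.append([filename, width, height, obj.findtext("name"),
                     int(box.findtext("xmin")), int(box.findtext("ymin")),
                     int(box.findtext("xmax")), int(box.findtext("ymax"))])
    return rows

def rows_to_csv(rows):
    """Serialise the rows, with a header line, to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(CSV_COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()

# Made-up annotation used only to exercise the functions.
SAMPLE_XML = """<annotation>
  <filename>car1.jpg</filename>
  <size><width>1920</width><height>1080</height></size>
  <object>
    <name>sedan</name>
    <bndbox><xmin>100</xmin><ymin>200</ymin><xmax>400</xmax><ymax>500</ymax></bndbox>
  </object>
</annotation>"""

sample_csv = rows_to_csv(xml_to_rows(SAMPLE_XML))
```

In practice the rows from all annotation files would be gathered first and then divided into the train and test CSV files before the TFRecord conversion.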

A. Formation of CNN Architecture
CNN (Convolutional Neural Network) is the primary component of Faster R-CNN and is the first stage to process the visual input. The general CNN procedure consists of three stages: pre-processing, processing, and classification. Pre-processing includes two steps: generating the dataset and converting it to grayscale. Processing, the second stage, involves image convolution, picture dimension reduction, max pooling, and softmax. Classification, the third stage, determines the output. Figure 5 depicts the design of the Faster R-CNN. This research employs an input image with a resolution of 1080 by 1920 pixels. The RGB image is converted to grayscale at the specified resolution. The input image is then subjected to convolution four times and max pooling three times. The result of convolution and max pooling is a feature map that is passed to the Region Proposal Network (RPN). The RPN contains an object classifier for generating object proposals, which gives two probabilities as to whether or not an object exists in the specified anchor, as well as a bounding box regressor that adjusts the bounding box limits to fit the objects inside.
In addition, the object proposals generated by the RPN are projected onto the feature maps generated by the CNN and pooled per ROI (Region of Interest) in order to extract the feature vector corresponding to each object proposal. This procedure produces a layer with 452,352 input values and a hidden layer with 256 neurons. At the final stage, the features passed to the fully connected layers are split into two branches: multiclass classification, which uses multiple layers of convolution and softmax to classify each object proposal into one of the class categories, and a refining bounding box regressor that fits the bounding box boundaries to the objects inside. This procedure outputs six object classes: sedan, SUV, bus, minibus, truck, and pick-up.

B. Conversion RGB to Grayscale
The dataset initially consists of RGB-formatted data. The CNN that receives this data is one of the primary components of the Faster R-CNN design. The captured dataset is converted to grayscale at this point, as the CNN in this design operates on grayscale images. Figure 7 shows an image converted from RGB to grayscale. The conversion of RGB photos to grayscale is performed automatically. Each pixel of the resulting image holds a single grey level. Following the conversion procedure, the input image is subjected to image convolution.
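A minimal sketch of the RGB to grayscale step is shown below. It uses the ITU-R BT.601 luma weights (0.299, 0.587, 0.114), which are the weights OpenCV documents for its RGB to gray conversion; the paper does not state which weights were used, so this choice and the function name `rgb_to_gray` are assumptions.

```python
def rgb_to_gray(image):
    """Convert rows of (r, g, b) pixels (0-255) into rows of grey levels (0-255).

    Uses the BT.601 luma weights; assumed here, not stated in the paper.
    """
    return [[int(round(0.299 * r + 0.587 * g + 0.114 * b)) for (r, g, b) in row]
            for row in image]

# Tiny 2x2 example image: white, black, pure red, pure green.
tiny_rgb = [[(255, 255, 255), (0, 0, 0)],
            [(255, 0, 0), (0, 255, 0)]]
tiny_gray = rgb_to_gray(tiny_rgb)
```

The three colour channels collapse into one value per pixel, so the grayscale image carries a third of the data of the RGB original at the same resolution.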

C. Kernel Convolution Stage and Max Pooling
Convolution is an image filtering technique; in this work, image convolution was performed five times using kernels measuring 7x7, 5x5, and 3x3 on 1080x1920 grayscale images. Max pooling is part of the image reduction phase. It simplifies the image and is applied three times, each time keeping the largest value within a 3x3 window of the a x b matrix.
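The two operations in this stage can be sketched on small matrices as follows. The sketch assumes no padding and a stride of 1 for convolution, and a 3x3 window with a stride of 3 for max pooling; the paper does not specify padding or stride, so these are assumptions, as are the function names.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products.

    As in CNN layers, the kernel is applied without flipping.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def max_pool(image, size=3, stride=3):
    """Keep the largest value in each size x size window (stride is an assumption)."""
    pooled = []
    for i in range(0, len(image) - size + 1, stride):
        row = []
        for j in range(0, len(image[0]) - size + 1, stride):
            row.append(max(image[i + di][j + dj]
                           for di in range(size) for dj in range(size)))
        pooled.append(row)
    return pooled

# 6x6 test image whose pixel value is its row-major index.
test_image = [[r * 6 + c for c in range(6)] for r in range(6)]
box_kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
feature_map = convolve2d(test_image, box_kernel)   # 4x4 result
reduced = max_pool(test_image)                      # 2x2 result
```

Each 3x3 pooling pass with this stride shrinks both image dimensions by roughly a factor of three, which is how the 1080x1920 input is reduced to a compact feature map.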

E. Training
The .csv train and test data files are converted to TFRecord (TensorFlow Record) files, which are used during the training procedure. Training employs the Faster R-CNN model on the Google Colaboratory platform, which is used because training requires a substantial amount of computation. Before training, the runtime settings were configured to use a GPU and Google Drive was connected. Table 3 displays the loss results from Faster R-CNN dataset training. During training, the system records its progress in checkpoint files in '.ckpt' format and stops at step 200,000. The last checkpoint is converted into a trained model in the protobuf format with the '.pb' file extension. The protobuf file is used as the model for detection.

F. Inference Graph Tensorboard
Tensorboard is used because a neural network behaves as a black box, in which the processes occurring inside the system cannot be observed in detail. The training process is carried out in 200,000 steps and produces a Total Loss of 0.02 in the final step. The Total Loss graph is shown in Figure 7.
G. Object Detection Testing
There are six classes of detection: sedan, SUV, minibus, bus, pickup, and truck. The accuracy of object detection is evaluated under a variety of driving conditions, object distances, and angles. The researchers conducted the tests with a windshield-mounted camera and examined detection accuracy on 28 data samples comprising up to 35 objects. Based on the test results in Table 4, across the 35 objects tested, the average success rate of the system in detecting them is 83.33% with an error rate of 16.67%.

H. Distance Estimation Test
1. Stereo Vision
The distance test estimates the distance between the driver's car and the vehicle in front of it using the Stereo Vision method. The test makes use of the difference between the object's x-coordinate in the left camera image and its x-coordinate in the right camera image. Based on the test findings presented in Table 5, the distance estimation accuracy of the system's Stereo Vision method is 98.84%.

2. Mono Vision
The distance test with the Mono Vision method uses Linear Regression to estimate the distance between the driver's car and the vehicle ahead. Based on the test findings presented in Table 6, the average error value for estimating distances is 1.23 percent, and the system's accuracy rate when predicting object distances using the Mono Vision Method is 98.11 percent.
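The Linear Regression step can be sketched with the standard closed-form least-squares fit of a line y = a·x + b. Whether these expressions match the paper's own equations for a and b is an assumption, and the choice of x as a pixel measurement of the detected box, the function names, and the calibration pairs below are all hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope a and intercept b for y = a * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def mean_percent_error(predicted, actual):
    """Average of |predicted - actual| / actual, expressed in percent."""
    return 100.0 * sum(abs(p - t) / t for p, t in zip(predicted, actual)) / len(actual)

# Made-up calibration pairs: x = pixel measurement, y = measured distance (m).
calib_x = [1.0, 2.0, 3.0, 4.0]
calib_y = [3.0, 5.0, 7.0, 9.0]
a, b = fit_line(calib_x, calib_y)
estimates = [a * x + b for x in calib_x]
error_pct = mean_percent_error(estimates, calib_y)
```

Once a and b have been fitted on calibration measurements, each new detection needs only one multiplication and one addition, which suits real-time use.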

V. CONCLUSIONS AND RECOMMENDATIONS
This study divides the system into three major stages: Object Detection, Distance Estimation, and Warning System. The success rate of the system in detecting objects using the Faster R-CNN approach is 83.33 percent, with an error rate of 16.67 percent, based on 35 object conditions tested from various angles and distances. The Stereo Vision and Mono Vision distance estimation techniques have been integrated into the system. The Stereo Vision approach estimates distances using the coordinate systems of the right and left cameras, whereas the Mono Vision method employs the Linear Regression equation. The success rate of the system in determining the object's distance from the vehicle is 98.84% for Stereo Vision and 98.11% for Mono Vision. The warning system, in the form of an LED and a buzzer, activates when the distance between the driver's car and the vehicle in front of it falls to the minimum safe distance.