Fine-Grained Vehicle Recognition

BoxCars: Improving Fine-Grained Recognition of Vehicles using 3D Bounding Boxes in Traffic Surveillance [arXiv]

Abstract: In this paper, we focus on fine-grained recognition of vehicles, mainly in traffic surveillance applications. We propose an approach that is orthogonal to recent advances in fine-grained recognition (automatic part discovery, bilinear pooling). Also, in contrast to other methods focused on fine-grained recognition of vehicles, we do not limit ourselves to the frontal/rear viewpoint but allow the vehicles to be seen from any viewpoint. Our approach is based on 3D bounding boxes built around the vehicles. The bounding box can be automatically constructed from traffic surveillance data. For scenarios where the precise construction is not possible, we propose a method for estimating the 3D bounding box. The 3D bounding box is used to normalize the image viewpoint by unpacking the image into a plane. We also propose to randomly alter the color of the image and to add a rectangle of random noise at a random position in the image during training of Convolutional Neural Networks. We have collected a large fine-grained vehicle dataset, BoxCars116k, with 116k images of vehicles from various viewpoints taken by numerous surveillance cameras. We performed a number of experiments which show that our proposed method significantly improves CNN classification accuracy (the accuracy is increased by up to 12 percentage points and the error is reduced by up to 50% compared to CNNs without the proposed modifications). We also show that our method outperforms state-of-the-art methods for fine-grained recognition.
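The training-time augmentation described in the abstract (random color alteration plus a rectangle of random noise pasted at a random position) can be sketched as follows. This is a minimal illustration, not the paper's code; the function name and all jitter/size parameters are our assumptions.

```python
import numpy as np

def augment(img, rng):
    """Sketch of the augmentation from the abstract: randomly alter the
    image color and paste a rectangle of random noise at a random position.
    `img` is an HxWx3 uint8 array; all ranges below are assumed values."""
    out = img.astype(np.float32)

    # Random color alteration: per-channel multiplicative jitter.
    out *= rng.uniform(0.6, 1.4, size=(1, 1, 3))
    out = np.clip(out, 0, 255).astype(np.uint8)

    # Rectangle of random noise at a random position.
    h, w = out.shape[:2]
    rh = rng.integers(h // 8, h // 2)
    rw = rng.integers(w // 8, w // 2)
    y = rng.integers(0, h - rh)
    x = rng.integers(0, w - rw)
    out[y:y + rh, x:x + rw] = rng.integers(0, 256, size=(rh, rw, 3),
                                           dtype=np.uint8)
    return out
```

The noise rectangle acts similarly to occlusion-style regularization: the network cannot rely on any single local region of the vehicle.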


BoxCars: 3D Boxes as CNN Input for Improved Fine-Grained Vehicle Recognition [CVPR 2016]

Abstract: We are dealing with the problem of fine-grained vehicle make&model recognition and verification. Our contribution is showing that extracting additional data from the video stream - besides the vehicle image itself - and feeding it into the deep convolutional neural network boosts the recognition performance considerably. This additional information includes: the 3D vehicle bounding box used for "unpacking" the vehicle image, its rasterized low-resolution shape, and information about the 3D vehicle orientation. Experiments show that adding such information decreases the classification error by 26% (the accuracy is improved from 0.772 to 0.832) and boosts the verification average precision by 208% (0.378 to 0.785) compared to a baseline pure CNN without any input modifications. Also, the pure baseline CNN outperforms the recent state-of-the-art solution by 0.081. We provide an annotated set "BoxCars" of surveillance vehicle images augmented by various automatically extracted auxiliary information. Our approach and the dataset can considerably improve the performance of traffic surveillance systems.
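The "unpacking" step mentioned in both abstracts amounts to rectifying each visible face of the 3D bounding box onto an upright rectangle with a planar homography. A minimal sketch using the standard direct linear transform (function names and the rectangle convention are our assumptions, not the paper's code):

```python
import numpy as np

def homography(src, dst):
    """Direct linear transform: the 3x3 homography H mapping each 2D point
    src[i] to dst[i], estimated from four point correspondences via SVD."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The null vector of A (last right-singular vector) gives H up to scale.
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=np.float64))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def unpack_face(quad, width, height):
    """Homography warping one quadrilateral face of the 3D bounding box
    (corners in clockwise order from top-left) onto an upright
    width x height rectangle - one plane of the 'unpacked' image."""
    rect = [(0, 0), (width, 0), (width, height), (0, height)]
    return homography(quad, rect)
```

In practice one would apply the resulting homography with an image-warping routine (e.g. `cv2.warpPerspective`) to each face of the box and assemble the rectified faces into the normalized input image.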


Unsupervised Processing of Vehicle Appearance for Automatic Understanding in Traffic Surveillance [DICTA 2015]

Abstract: This paper deals with unsupervised collection of information from traffic surveillance video streams. Deployment of usable traffic surveillance systems requires minimizing the effort per installed camera - our goal is to enroll a new view of the street without any human operator input. We propose a method that automatically collects vehicle samples from surveillance cameras, analyzes their appearance, and fully automatically builds a fine-grained dataset. This dataset can be used in multiple ways; we explicitly showcase two of them: fine-grained recognition of vehicles and camera calibration, including the scale. The experiments show that, based on the automatically collected data, make&model vehicle recognition in the wild can be done accurately (average precision 0.890). The camera scale calibration (directly enabling automatic speed and size measurement) is twice as precise as the previous method. Our work leads to automatic collection of traffic statistics without the costly need for manual calibration or make&model annotation of vehicle samples. Unlike most previous approaches, our method is not limited to a small range of viewpoints (such as eye-level camera shots), which is crucial for surveillance applications.