Understanding Traffic Mobility using Computer Vision

The following is an account of one of the projects I worked on during my undergraduate years. It outlines the problem we were trying to solve, the technical details behind the system we developed, and my experience and learnings from the project.

With the advent of deep learning, the field that has arguably been hyped the most in recent years is computer vision. The ability to understand the elements and activities in a still image or a moving video presents endless opportunities for digitizing aspects of human life that were previously out of reach. One such example is understanding the flow of vehicle traffic on the road solely through surveillance cameras. This is the problem we attempted to tackle for our undergraduate minor thesis.

Problem statement

The primary task is to build a system for analysing traffic mobility patterns on the roads of Kathmandu, Nepal. Understanding the vehicle traffic on a road entails two primary tasks: detecting the vehicles in each frame of the video, and tracking them across frames.

System overview

Before diving into the technical details behind the solution we developed, it helps to have a bigger picture of the problem we were attempting to tackle and discuss what an ideal system could look like.

Block diagram of the expected traffic mobility analysis system

The system consists of three major components:

Task 1: Vehicle Detection

If you have anything to do with images (videos are simply a stream of images), convolutional neural networks (CNNs) are the de-facto standard in deep learning. Specifically for object detection, we employed a YOLOv2 network. YOLO (You Only Look Once) is a CNN architecture tailored to detect objects in an image at lightning speed. I’ll only superficially describe the working of this architecture to maintain brevity. For a rigorous technical treatment, please refer to:

YOLO in brief

Traditionally, object-detection algorithms worked by running a classifier on many parts of the image, either in a sliding window or, as in the R-CNN family (including Faster R-CNN), on a set of proposed regions. While these approaches boasted high accuracy, inference was too slow for real-time object detection from video. YOLO changed this by framing object detection as a single regression problem: one pass of the network maps the image directly to bounding boxes and class probabilities. The result is faster inference with a small tradeoff in accuracy.

YOLOv2 is an improvement on the older YOLO architecture and, at the time of its release, was the state of the art on standard detection tasks like PASCAL VOC and COCO. At 67 FPS, YOLOv2 gets 76.8 mAP on VOC 2007; at 40 FPS, it gets 78.6 mAP. At these settings it outperformed Faster R-CNN with ResNet and SSD while still running faster than them.

What does YOLO output?

YOLO divides the image into an S × S grid; each grid cell has B bounding-box predictions along with their confidence and class probabilities

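YOLO divides the image into an S × S grid. Each grid cell has B bounding-box predictions. Each bounding-box prediction is represented by a vector of length 5, which holds all the information needed to represent the prediction: the centre of the box (x, y), its width and height (w, h), and a confidence score for the presence of an object, written P(obj) here.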

The object (if present) in the grid cell could belong to any of the k classes. Each grid cell also predicts a class-conditional probability vector C = [C_1, C_2, ..., C_k] where C_i = P(Class_i | Object). Consequently, P(obj) * C_i gives the probability that an object of class i exists in the box.
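As a rough illustration of the product above, the sketch below multiplies a box's objectness P(obj) by each class-conditional probability C_i and picks the most likely class. The function name `class_scores` and the example numbers are made up for illustration; they are not taken from our system or from any particular YOLO implementation.

```python
def class_scores(p_obj, class_probs):
    """Return P(obj) * C_i for every class i."""
    return [p_obj * c for c in class_probs]


# Hypothetical values: objectness 0.8 and three class-conditional probabilities.
scores = class_scores(0.8, [0.1, 0.7, 0.2])

# Index of the class with the highest class-specific score.
best_class = max(range(len(scores)), key=lambda i: scores[i])
print(best_class, round(scores[best_class], 2))
```

In practice a threshold on these scores would decide which boxes are kept at all; the threshold value is a tuning choice and is not shown here.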

The concept of anchor boxes

Different shapes of anchor boxes in a grid cell

A square bounding box is not always the best choice for the different shapes of objects we might need to detect. For instance, trees are better demarcated by vertical rectangles, while a car might need horizontal rectangles. YOLOv2 addresses this with anchor boxes, one of the changes that sets it apart from its predecessor. Anchor boxes are a set of boxes with predefined width/height ratios. Instead of directly predicting the bounding-box coordinates, the network predicts offsets for these boxes so that they enclose an object. To pick the best of these boxes, YOLO uses Intersection over Union (IoU): the area of overlap between the predicted bounding box and the ground-truth bounding box divided by the area of their union, a value in [0, 1].
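To make the IoU measure concrete, here is a minimal sketch in plain Python. It assumes boxes are given as corner coordinates (x1, y1, x2, y2), which is a convention chosen for this example; YOLO itself works with centre, width, and height. The helper name `iou` and the sample boxes are illustrative only.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    # Union = sum of both areas minus the part counted twice.
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Pick the candidate box that overlaps the ground truth the most.
ground_truth = (0, 0, 2, 3)
candidates = [(0, 0, 2, 2), (1, 1, 3, 3), (5, 5, 6, 6)]
best_box = max(candidates, key=lambda box: iou(box, ground_truth))
```

Identical boxes give an IoU of 1, and boxes that do not touch give 0, which matches the [0, 1] range mentioned above.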

Effectively, for each box we can calculate P(obj) * C_i * IoU, which gives the probability of an object of a particular class appearing inside that box and how precise those box coordinates are in demarcating the object.

The network architecture

The network architecture of YOLOv1; YOLOv2 is similar with slight but important modifications

The architecture of YOLOv1 consists of elements you’d typically see in a CNN architecture. It has a series of convolutional layers with maxpool layers in between. YOLOv2 architecture has slight modifications:

Loss function

The prediction of a YOLO network can be divided into three groups:

Loss is calculated separately for each and combined together:

The mathematical formulation of the loss function is involved; I’d suggest the interested reader check out the original paper for a detailed description.

The nuances of dealing with Nepali vehicles

A machine-learning model is only as good as its data. We cannot expect the model to perform well on data that the training set does not represent. We experienced this problem first-hand: vehicles in Nepal are quite different from what you’d see in Western countries. For instance, Western countries have far fewer motorbikes on the road, whereas Nepal has 10 motorbikes for every car. Similarly, the buses, trucks, and taxis are quite different in appearance. While the standard ImageNet and COCO datasets have images of standard vehicles, we needed a dataset of Nepali vehicles.

The 8 types of vehicles detected by our system

We captured footage of vehicles at major intersections in Kathmandu and collected and annotated a dataset consisting of 10,000 images. The vehicles were categorized into the following 8 classes:

Taxi, Tempo, Motorbike, Car, Microbus, Pickup Truck, Truck, Bus.

Men in action collecting pictures of Nepali vehicles on the roads of Kathmandu

Task 2: Vehicle Tracking

Having figured out a way to classify and localize vehicles in the frames of the video, the next task was to track individual vehicles across frames. We need to do this to count the total number of vehicles that moved past the surveillance camera. This will ultimately let us calculate the extent of road occupancy and provide real-time traffic monitoring to the end user.

Moving objects are tracked by a technique called optical flow. We implemented the Lucas–Kanade optical flow algorithm in OpenCV to track the moving vehicles on the road.

Optical flow

Optical flow is the apparent motion of objects, surfaces, or edges in a scene caused by the relative motion between the camera and the scene.

Movement of a pixel from one point to another across time

Consider a pixel that moved from (x, y) at time t to (x + u, y + v) at time t + 1. I(x, y, t) represents the intensity of the pixel at position and time (x, y, t), and I(x + u, y + v, t + 1) is the intensity at the new position and time.

Given the position of the pixel in two frames, it would be fairly easy to calculate the velocity and hence track the movement of the vehicle. We already have the bounding boxes given by YOLO. All we need to do is track the centre of this bounding box for each vehicle. Right?

The problem is that we cannot easily associate a pixel in frame t + 1 with the pixel in frame t such that both pixels represent the same object. Optical flow aims to solve this by making two assumptions:

Mathematically:

∂I/∂t + (∂I/∂x) · u + (∂I/∂y) · v = 0

Or, simplifying notation:

I_t · 1 + I_x · u + I_y · v = 0
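For readers who want the missing step, the equation above can be obtained from the standard derivation: assume the pixel keeps its intensity as it moves (brightness constancy), then apply a first-order Taylor expansion, which is only reasonable when the displacement (u, v) is small. A short sketch in LaTeX, using the same symbols as above:

```latex
\begin{align}
I(x + u,\; y + v,\; t + 1) &= I(x, y, t) \\
I(x + u,\; y + v,\; t + 1) &\approx I(x, y, t)
    + \frac{\partial I}{\partial x}\, u
    + \frac{\partial I}{\partial y}\, v
    + \frac{\partial I}{\partial t} \\
\Rightarrow \quad
\frac{\partial I}{\partial t}
    + \frac{\partial I}{\partial x}\, u
    + \frac{\partial I}{\partial y}\, v &= 0
\end{align}
```

Substituting the second line into the first and cancelling I(x, y, t) on both sides leaves the third line.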

This equation cannot be solved as-is: it has two unknowns (u and v) but only one equation. So we make another assumption: all pixels within a small neighborhood move with the same velocity (u, v) from t to t + 1. Each pixel in the neighborhood then contributes one equation like the one above, and we can solve the resulting overdetermined system for (u, v) using least squares:

A · X = B
X = (Aᵀ · A)⁻¹ · Aᵀ · B
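The least-squares step can be sketched in a few lines of plain Python. Here each row of A is [I_x, I_y] for one pixel in the neighborhood, X is [u, v], and B holds the corresponding -I_t values; this reading of A, X, and B follows from the per-pixel equation above. The function name `lucas_kanade_step` and the gradient values are invented for the example, and this is a toy version of the idea rather than the OpenCV routine we used.

```python
def lucas_kanade_step(ix, iy, it):
    """Solve (A^T A) X = A^T B for X = (u, v) from per-pixel gradients.

    ix, iy, it are equal-length lists of I_x, I_y, I_t over one neighborhood.
    """
    # Entries of the 2x2 matrix A^T A.
    sxx = sum(gx * gx for gx in ix)
    sxy = sum(gx * gy for gx, gy in zip(ix, iy))
    syy = sum(gy * gy for gy in iy)
    # Sums that make up A^T B once the minus sign from B = -I_t is applied.
    sxt = sum(gx * gt for gx, gt in zip(ix, it))
    syt = sum(gy * gt for gy, gt in zip(iy, it))

    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        # A^T A is singular: the window lacks gradient in two directions.
        raise ValueError("neighborhood has too little texture to solve for (u, v)")

    # Closed-form inverse of the 2x2 matrix applied to [-sxt, -syt].
    u = (-syy * sxt + sxy * syt) / det
    v = (sxy * sxt - sxx * syt) / det
    return u, v


# Synthetic gradients generated from a known motion (u, v) = (2, -1).
ix = [1.0, 0.0, 2.0, 1.0]
iy = [0.0, 1.0, 1.0, -1.0]
it = [-(gx * 2.0 + gy * -1.0) for gx, gy in zip(ix, iy)]
u, v = lucas_kanade_step(ix, iy, it)
```

The singular case is worth noting: on a flat or edge-only patch the matrix AᵀA cannot be inverted, which is why trackers tend to prefer textured points such as corners.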

We used the centre of the bounding boxes given by YOLO as the points to track. The above algorithm predicts the most probable location a particular point will reach in the next frame, and this is used to track the vehicles’ movement throughout the video.
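As a small sketch of that last step, the snippet below computes the centre of a detection box and moves it by an estimated flow vector (u, v) to get the predicted position in the next frame. The (x1, y1, x2, y2) box format, the helper names, and the numbers are assumptions made for this example. In OpenCV, the pyramidal Lucas–Kanade tracker is exposed as `cv2.calcOpticalFlowPyrLK`; the sketch avoids that dependency and only shows the bookkeeping around it.

```python
def box_centre(box):
    """Centre point of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def predict_next(point, flow):
    """Shift a point by the estimated per-frame motion (u, v)."""
    (x, y), (u, v) = point, flow
    return (x + u, y + v)


# Hypothetical detection box and flow estimate for one vehicle.
centre = box_centre((100, 40, 160, 80))
predicted = predict_next(centre, (3.0, -1.5))
```

Repeating this frame after frame yields a trajectory per vehicle, which is what makes counting and occupancy estimates possible.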

Final Result

Finally, we had the system in place, performing the two most important tasks: vehicle detection and vehicle tracking.

We then developed an analytics dashboard for the monitoring station that surfaced traffic-mobility information for any road. Some of the details presented:

The dashboard in the monitoring station, displaying the real-time feed of the road augmented with mobility analytics

Road occupancy across time

Learnings and Experiences

This was a transformative project for me. It helped me sink my teeth into the field of computer vision, and especially helped me realize how challenging it is to collect and prepare a custom dataset and fine-tune a model to perform as expected.

We demonstrated this project to the Metropolitan Traffic Division in Kathmandu, where it was well received.

We also won the Smart Urban Technology Challenge 2018 with this very project.

All in all, it was a wonderful experience taking this project from idea to implementation.