

We are going to create a C++ application that uses the parallel processing capabilities of a machine’s GPU to rapidly analyze a burst, that is, several photographs captured in quick succession.


We are interested in creating an application that simplifies and accelerates the process of selecting the best image from a burst. A burst contains a large number of images, and it is often cumbersome to sift through all of them to find the best one. Our application aims to find the images with the highest proportion of people who are smiling with their eyes open. We do so by implementing a highly efficient version of the Viola-Jones face detection algorithm, by parallelizing eye state and smile detection, and by processing multiple images of the burst simultaneously.
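One way to make the selection criterion concrete is to score each image by the fraction of detected faces that are both smiling and have their eyes open, and then pick the highest-scoring image. The following is only a minimal sketch under that assumption; the `FaceResult` struct and both function names are hypothetical, and the detectors that would fill in the flags are not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-face result; the field names are illustrative only.
struct FaceResult {
    bool eyesOpen;
    bool smiling;
};

// Fraction of detected faces that have eyes open AND are smiling.
// An image with no detected faces scores 0.
double imageScore(const std::vector<FaceResult>& faces) {
    if (faces.empty()) return 0.0;
    std::size_t good = 0;
    for (const FaceResult& f : faces) {
        if (f.eyesOpen && f.smiling) ++good;
    }
    return static_cast<double>(good) / static_cast<double>(faces.size());
}

// Index of the highest-scoring image in the burst (the first one wins ties).
std::size_t bestImageIndex(const std::vector<std::vector<FaceResult>>& burst) {
    assert(!burst.empty());
    std::size_t best = 0;
    double bestScore = imageScore(burst[0]);
    for (std::size_t i = 1; i < burst.size(); ++i) {
        const double score = imageScore(burst[i]);
        if (score > bestScore) {
            bestScore = score;
            best = i;
        }
    }
    return best;
}
```

Because each image's score depends only on that image, the scores can be computed independently and compared at the end.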

A sequential version of the application has several bottlenecks:

  1. Analyzing the images in the burst one after another will be extremely slow, as a burst contains a large number of photographs. We aim to provide a near real-time solution to the problem.

  2. Running face and eye detection sequentially within each image will also be slow, and applying a serial vision algorithm to every image in the burst would take a long time and a lot of processing power. The Viola-Jones algorithm in particular is computationally expensive.

  3. Creating the composite image sequentially will be a bottleneck in the runtime of this application.

  4. Most importantly, a sequential version of the whole application will not benefit from the machine’s GPU and will also overload the CPU, since all of the work will run there.
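To illustrate bottleneck 2: much of the cost of Viola-Jones comes from summing pixel intensities over many rectangular regions. The algorithm's standard remedy is the integral image (summed-area table), which reduces any rectangle sum to four lookups. Below is a minimal CPU sketch, assuming an 8-bit grayscale image stored row by row; the names `IntegralImage`, `buildIntegral` and `rectSum` are our own illustrative choices and do not come from any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Integral image padded with one extra row and column of zeros, so that
// at(x, y) holds the sum of all pixels in the region [0, x) x [0, y).
struct IntegralImage {
    std::size_t width;                // source image width in pixels
    std::size_t height;               // source image height in pixels
    std::vector<std::uint64_t> sums;  // (width + 1) * (height + 1) entries

    std::uint64_t at(std::size_t x, std::size_t y) const {
        return sums[y * (width + 1) + x];
    }
};

IntegralImage buildIntegral(const std::vector<std::uint8_t>& pixels,
                            std::size_t width, std::size_t height) {
    assert(pixels.size() == width * height);
    IntegralImage ii{width, height,
                     std::vector<std::uint64_t>((width + 1) * (height + 1), 0)};
    for (std::size_t y = 0; y < height; ++y) {
        std::uint64_t rowSum = 0;  // running sum of the current row
        for (std::size_t x = 0; x < width; ++x) {
            rowSum += pixels[y * width + x];
            // Entry below-right of pixel (x, y) = entry one row up + row sum.
            ii.sums[(y + 1) * (width + 1) + (x + 1)] =
                ii.sums[y * (width + 1) + (x + 1)] + rowSum;
        }
    }
    return ii;
}

// Sum of the w x h rectangle whose top-left pixel is (x, y): four lookups.
std::uint64_t rectSum(const IntegralImage& ii, std::size_t x, std::size_t y,
                      std::size_t w, std::size_t h) {
    assert(x + w <= ii.width && y + h <= ii.height);
    return ii.at(x + w, y + h) + ii.at(x, y) - ii.at(x + w, y) - ii.at(x, y + h);
}
```

Once the table is built, every rectangle query is independent of the others, which is what makes the feature evaluation a natural candidate for a GPU.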

Therefore, all of the aforementioned aspects of the application can be significantly improved by parallelism. We plan to use CUDA to exploit the parallelism of the GPU. The benefits of a parallel application can be better understood using the following illustrations, which highlight the difference between the sequential and parallel versions.

Photo Burst

Applying to the Sequential Version


Applying to the Parallel Version (for illustration, assuming a 4-core machine)


Sequential algorithm for Vision Analysis (Image taken from here)


Parallel algorithm for Vision Analysis (for illustration, assuming a 4-core machine)
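The 4-core illustration amounts to giving each worker a contiguous slice of the burst. A minimal sketch of such a static split is shown below, assuming images are identified by index; `partitionBurst` is a hypothetical helper, and launching the workers themselves (CPU threads or CUDA streams) is deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Split `imageCount` images into `workerCount` contiguous [begin, end) ranges
// whose sizes differ by at most one image.
std::vector<std::pair<std::size_t, std::size_t>>
partitionBurst(std::size_t imageCount, std::size_t workerCount) {
    assert(workerCount > 0);
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    ranges.reserve(workerCount);
    const std::size_t base = imageCount / workerCount;   // minimum per worker
    const std::size_t extra = imageCount % workerCount;  // leftover images
    std::size_t begin = 0;
    for (std::size_t w = 0; w < workerCount; ++w) {
        // The first `extra` workers each take one additional image.
        const std::size_t size = base + (w < extra ? 1 : 0);
        ranges.emplace_back(begin, begin + size);
        begin += size;
    }
    return ranges;
}
```

For a 10-image burst on 4 workers this yields the ranges [0, 3), [3, 6), [6, 8) and [8, 10). A static split like this is only balanced if every image costs about the same to analyze, which may not hold when images contain different numbers of faces.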


The Challenge

The application is challenging in a number of ways:

  1. The Viola-Jones algorithm has many computationally intensive components, so parallelizing each of them while still producing accurate face detections will be tricky.

  2. As a burst contains a large number of images, there will be a large number of memory accesses. Performance will therefore depend on keeping the computation-to-communication ratio high, which means we will need to lay out and access image data carefully in order to maximize cache hits.

  3. Efficiently detecting facial features takes a lot more than simply sending different parts of the image to different cores; we will have to make sure that work is distributed evenly and that the parts requiring more computation receive more processing power.

  4. This type of application will require a lot of communication between processes and a lot of processing power, which means that we will need to adapt the parallel execution dynamically, based not only on the dataset (the burst) but also on what the algorithm is working on – eye detection, smile detection, stitching, etc.

  5. During the analysis of each image, our application will perform numerous checks to see whether the input meets certain requirements. A high number of ‘if’ statements might cause divergent control flow, which will need to be handled in a way that does not trade accuracy for speedup.
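To make challenge 5 concrete, consider the Viola-Jones cascade, where a window is rejected as soon as one stage fails. On a GPU, threads that exit early can diverge from neighbors that continue. One possible mitigation is predication: every window runs the same instruction sequence and carries an "alive" flag instead of branching. The sketch below is a CPU illustration only, under the simplifying assumption that the per-stage scores are already computed; `Stage` and both function names are hypothetical. Whether the predicated form is actually faster depends on how many windows are rejected early, since it gives up the work savings of the early exit.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cascade stage: a window passes the stage when its score
// reaches the stage threshold.
struct Stage {
    float threshold;
};

// Early-exit form: returns as soon as one stage rejects the window.
bool passesEarlyExit(const std::vector<float>& stageScores,
                     const std::vector<Stage>& stages) {
    assert(stageScores.size() == stages.size());
    for (std::size_t s = 0; s < stages.size(); ++s) {
        if (stageScores[s] < stages[s].threshold) return false;
    }
    return true;
}

// Predicated form: always visits every stage and folds each comparison into
// a flag, so all windows follow the same control flow.
bool passesPredicated(const std::vector<float>& stageScores,
                      const std::vector<Stage>& stages) {
    assert(stageScores.size() == stages.size());
    bool alive = true;
    for (std::size_t s = 0; s < stages.size(); ++s) {
        alive &= (stageScores[s] >= stages[s].threshold);
    }
    return alive;
}
```

Both forms return the same answer for every window, so accuracy is unaffected; only the execution pattern changes.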


To parallelize and speed up face detection, we plan to use GPU processing. To this end, we will be using the latedays cluster at Carnegie Mellon University. More details about the GPU and CPU specifications can be found here.

Goals and Deliverables

We plan to deliver a highly efficient, working C++ application that is fully capable of detecting the best images in a burst and creating new images from it. We also plan to achieve a high speedup over the sequential versions of the computer vision algorithms we use.

We plan to present an interactive demo on presentation day, running the application live on an existing set of bursts and showcasing the different functionalities of the app. We also plan to show speedup graphs comparing the parallel version of the application with the serial version.
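The speedup in those graphs is the serial running time divided by the parallel running time for the same burst. A minimal sketch of how the two numbers could be collected is given below; `secondsTaken` and `speedup` are hypothetical helper names, and for GPU work the timed callable would need to wait for the device to finish before returning.

```cpp
#include <cassert>
#include <chrono>

// Wall-clock seconds spent running `work` once.
template <typename Fn>
double secondsTaken(Fn&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Speedup of the parallel version relative to the serial version.
double speedup(double serialSeconds, double parallelSeconds) {
    assert(serialSeconds >= 0.0 && parallelSeconds > 0.0);
    return serialSeconds / parallelSeconds;
}
```

For example, a burst that takes 8 seconds serially and 2 seconds in parallel corresponds to a speedup of 4.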

Stretch Goal

We will further try to perform a quick image classification by parallelizing the popular 'Bag of Visual Words' algorithm. This image classification can then be used to adjust the heuristics for detecting relevant image features and for stitching. Additionally, if none of the images satisfies all of the criteria, our application will create a composite image: it will stitch together, from all the images in the burst, a single image in which everyone is smiling with their eyes open.

Generic 'Bag of Visual Words' Algorithm:
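As a rough illustration of the quantization step in Bag of Visual Words, each local feature descriptor is assigned to its nearest "visual word" in a codebook, and the image is then summarized by a histogram of word counts. The sketch below assumes the codebook already exists (it is typically learned by clustering, for example with k-means) and that descriptors are plain float vectors; all names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

using Descriptor = std::vector<float>;

// Squared Euclidean distance between two descriptors of equal length.
float squaredDistance(const Descriptor& a, const Descriptor& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Index of the codebook entry closest to `d` (the first one wins ties).
std::size_t nearestWord(const Descriptor& d,
                        const std::vector<Descriptor>& codebook) {
    assert(!codebook.empty());
    std::size_t best = 0;
    float bestDist = std::numeric_limits<float>::max();
    for (std::size_t w = 0; w < codebook.size(); ++w) {
        const float dist = squaredDistance(d, codebook[w]);
        if (dist < bestDist) {
            bestDist = dist;
            best = w;
        }
    }
    return best;
}

// Histogram with one bin per visual word, counting the descriptors
// assigned to that word.
std::vector<std::size_t> wordHistogram(const std::vector<Descriptor>& descriptors,
                                       const std::vector<Descriptor>& codebook) {
    std::vector<std::size_t> histogram(codebook.size(), 0);
    for (const Descriptor& d : descriptors) {
        ++histogram[nearestWord(d, codebook)];
    }
    return histogram;
}
```

The nearest-word search for one descriptor does not depend on any other descriptor, so this step could be distributed across GPU threads, with the histogram updates then needing atomic increments or a per-block reduction.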



We will be using C++ to program the project along with CUDA.


March 28 – April 1, 2016: Decide on a project idea and complete project proposal

April 1 – April 3, 2016: Set up the working environment

April 4 – April 19, 2016: Create a working sequential version of the Viola-Jones algorithm

April 19 – April 22, 2016: Work on correct implementation of eye state detection and smile detection

April 23 – April 30, 2016: Parallelize the algorithm

April 30 – May 2, 2016: Work on further optimizations

May 3 – May 6, 2016: Perform testing, develop heuristics, work on stretch goals

May 7 – May 8, 2016: Final application testing

May 9, 2016: Final report and presentation


Authors and Contributors

Tanay Varma (@tanayvarma) and Mohak Nahta (@mnahta)