Recently, I read a computer vision paper that described a technique I had never come across before, so I was intrigued enough to implement it. The paper relies on a statistical model for determining visually interesting features of a video frame, which it applies to detecting objects that are not present in a background image. This may sound like frame differencing, and up to a point it is, but the method presented in the paper is far more extensive and robust to noise.
The paper is called Real-Time Segmentation of Moving Objects in a Video Sequence by A Contrario Detection (there is a more compact article available from IEEE). This paper is itself based on another paper called Maximal Meaningful Events and Applications to Image Analysis. The basic premise of the segmentation paper is that the gradient of a background image and the gradient of a video frame can be compared in such a way that the maximally meaningful areas can be pulled out through statistical analysis. As a side note, the paper used as source material the CAVIAR video database, which has many videos of different scenarios along with ground truth data stored as XML.
The paper first describes various methods for acquiring a background image when the background is never completely free of people or other moving objects. The method I used in my implementation takes the median at each pixel across more than 150 frames. This has the nice effect of generating an almost perfect background, without the smearing that averaging would introduce. With enough frames, the median should be a very precise background image. Once the background image is acquired, I take its gradient. There are a number of techniques for doing this, some of which use convolution kernels, as in the Sobel method, but I used the Finite Difference Method because it is fast and quite accurate. The FDM I implemented is second-order accurate.
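These two steps can be sketched in NumPy. This is a minimal illustration, not my actual Jitter implementation; the function names are my own, and I'm leaning on `np.gradient` (with `edge_order=2`) to get the second-order-accurate central differences that the FDM describes:

```python
import numpy as np

def median_background(frames):
    """Estimate a static background as the per-pixel median of a
    stack of grayscale frames (n_frames x height x width)."""
    return np.median(np.stack(frames), axis=0)

def gradient_fdm(image):
    """Second-order-accurate finite-difference gradient: central
    differences in the interior, second-order one-sided stencils
    at the borders."""
    gy, gx = np.gradient(image.astype(np.float64), edge_order=2)
    magnitude = np.hypot(gx, gy)       # gradient norm at each pixel
    angle = np.arctan2(gy, gx)         # gradient direction at each pixel
    return magnitude, angle

# The median recovers a constant background from noisy frames
# without the smearing that a mean would show around outliers.
rng = np.random.default_rng(0)
background = np.full((4, 4), 128.0)
frames = [background + rng.normal(0, 5, background.shape) for _ in range(200)]
estimate = median_background(frames)
```

The median's robustness is exactly why it beats averaging here: a pixel occluded by a passing person for a minority of frames still reports the true background value.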
Once the background gradient is calculated, I calculate the gradient of each incoming frame and compare the two images according to the algorithm in the paper. During the comparison, a "difference image" is generated that describes the meaningfulness of each pixel. White pixels mark where a difference exists, black pixels where there is no difference, and, most interesting of all, gray pixels are placed where there is not enough information to decide whether a pixel should be white or black. In fact, most pixels will be gray after this process.
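A sketch of this three-way classification in NumPy follows. The exact statistical test is in the paper; here I substitute a simpler illustrative rule of my own: a pixel stays gray when either gradient is too weak to define a reliable direction, and is otherwise white or black depending on how much the two gradient directions disagree. Both thresholds are assumed values, not the paper's:

```python
import numpy as np

WHITE, GRAY, BLACK = 255, 128, 0

def difference_image(bg_angle, bg_mag, fr_angle, fr_mag,
                     mag_threshold=8.0, angle_threshold=np.pi / 4):
    """Classify pixels as changed (white), unchanged (black), or
    undecidable (gray). The thresholds and the angle-comparison
    rule are illustrative assumptions, not the paper's exact test."""
    out = np.full(bg_angle.shape, GRAY, dtype=np.uint8)
    # Only pixels where both gradients are strong enough carry a
    # reliable orientation; everything else stays gray.
    decidable = (bg_mag >= mag_threshold) & (fr_mag >= mag_threshold)
    # Smallest angular difference between the two gradient directions,
    # wrapped into [0, pi].
    diff = np.abs(np.angle(np.exp(1j * (fr_angle - bg_angle))))
    out[decidable & (diff > angle_threshold)] = WHITE
    out[decidable & (diff <= angle_threshold)] = BLACK
    return out
```

Note that under a rule like this, large flat regions of the frame have weak gradients everywhere, which is why the bulk of the difference image ends up gray.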
The next step is to pick out the most meaningful portions of the image based on the difference image. To do this, the image is divided into square sections. For each square, a statistical measure of meaningfulness is calculated by looking at the proportion of white pixels to the total number of meaningful pixels (the sum of white and black pixels). This ratio is compared to the same measure across the entire image, and a distance metric is applied to the two ratios. The greater the distance, the more meaningful the square of pixels.
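The per-square measure can be sketched like this. I'm using the absolute difference between the two ratios as a stand-in for the paper's distance metric, so treat that choice (and the block size) as an assumption:

```python
import numpy as np

def square_meaningfulness(diff_img, square=16):
    """For each square block of the difference image, compute the
    ratio of white pixels to decided (white + black) pixels, and
    score the block by that ratio's distance from the global ratio.
    Absolute difference stands in for the paper's distance metric."""
    white = (diff_img == 255)
    decided = white | (diff_img == 0)        # gray pixels are ignored
    g_decided = decided.sum()
    global_ratio = white.sum() / g_decided if g_decided else 0.0
    h, w = diff_img.shape
    scores = np.zeros((h // square, w // square))
    for i in range(scores.shape[0]):
        for j in range(scores.shape[1]):
            ys = slice(i * square, (i + 1) * square)
            xs = slice(j * square, (j + 1) * square)
            n_dec = decided[ys, xs].sum()
            # A block with no decided pixels gets the global ratio,
            # i.e. a score of zero (not meaningful).
            ratio = white[ys, xs].sum() / n_dec if n_dec else global_ratio
            scores[i, j] = abs(ratio - global_ratio)
    return scores
```

For example, on a 32x32 difference image that is all black except for one all-white 16x16 quadrant, the white quadrant scores 0.75 while the others score 0.25, so a threshold between the two cleanly culls the insignificant regions.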
There is a refinement step once the significance of each square has been calculated, but I haven't implemented that portion yet. I have implemented up to the point of culling insignificant regions, and the algorithm is quite robust. The implementation is available for download, but I have to stress that it is in a beta state right now. There are a number of improvements that can be made to get more speed out of the thing, such as not calculating the angle of the gradient of the frame if the pixel will be gray anyway. If you decide to download this and end up making improvements, I'd be very interested, as I'd like to incorporate them into the project code that is available here.
The source code, along with a custom external called xray.jit.median and an abstraction called xray.jit.3dbuffer, is in the download package. A schematic and a copy of the original paper are also included.