SeoNang is a shared interactive environment linking two physical spaces, Seattle (USA) and Seoul (Korea), in real time. An infrared computer vision system captures participants' silhouettes and turns them into a skeleton form, which is shared between the two spaces. Through the shape of their silhouettes, users interact with a membrane that separates the two physical spaces in the virtual world of SeoNang. Participants collaborate to create forms out of the membrane. Each space sees the membrane from the opposite direction, turning the screen into a window into a virtual space that sits between Seattle and Seoul. Audio generated from the silhouette forms provides local feedback and a greater sense of direct interaction with the membrane.
Each site has an identical setup consisting of an IR camera, 2 high-power IR lights, 2 computers, a projector, 2 speakers, and a router. One computer serves as the information gathering and generating hub and the other as the graphics renderer. As the project developed, the information gathering machine became known as the Sender and the other computer as the Receiver. The Sender takes the camera signal through FireWire and skeletonizes the images, effectively reducing the video data to about 50 kbps/30 frames. The skeletons are nothing more than pairs of points describing the structure of a participant's silhouette, and they allow us to transmit full-speed video across a standard international Internet connection with very few data drops.
When the video comes into the Sender, it is first thresholded to acquire a binary silhouette image. This is then passed on to a distance transform algorithm. The distance transform scans both vertically and horizontally across the image, counting up along successive white pixels that represent a silhouette. If a pixel has already been counted on a previous pass, the algorithm checks the current count against the previous count and takes the minimum. The result is that the largest values in the image lie along the middle of the silhouette. The goal of skeletonizing the image is to extract just this middle ridge.
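The thresholding and distance transform steps can be sketched as follows. This is a minimal illustration, not the installation's code: it uses the standard two-pass city-block (L1) formulation, which may differ in scan order from the passes described above, and the names `threshold` and `l1_distance_transform`, the `level` parameter, and the choice to treat everything outside the image as background are all assumptions.

```python
def threshold(gray, level=128):
    """Turn a grayscale image (list of rows) into a binary silhouette mask."""
    return [[1 if px >= level else 0 for px in row] for row in gray]


def l1_distance_transform(binary):
    """Two-pass city-block (L1) distance transform.

    Silhouette pixels (truthy) receive their L1 distance to the nearest
    background pixel. Pixels outside the image count as background
    (an assumption made for this sketch).
    """
    h = len(binary)
    w = len(binary[0]) if h else 0
    inf = h + w + 2  # larger than any distance that can occur in the image
    dist = [[inf if binary[y][x] else 0 for x in range(w)] for y in range(h)]

    # Forward pass: count up from the top and left, keeping the minimum.
    for y in range(h):
        for x in range(w):
            if dist[y][x] == 0:
                continue
            up = dist[y - 1][x] if y > 0 else 0
            left = dist[y][x - 1] if x > 0 else 0
            dist[y][x] = min(dist[y][x], up + 1, left + 1)

    # Backward pass: count up from the bottom and right, keeping the minimum.
    for y in range(h - 1, -1, -1):
        for x in range(w - 1, -1, -1):
            if dist[y][x] == 0:
                continue
            down = dist[y + 1][x] if y < h - 1 else 0
            right = dist[y][x + 1] if x < w - 1 else 0
            dist[y][x] = min(dist[y][x], down + 1, right + 1)
    return dist
```

For a 3x3 block of white pixels surrounded by black, the centre pixel ends up with the value 2 and its neighbours with 1, so the largest value sits in the middle of the shape.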
This sounds simple, but there are a few complicating issues. First, the ridge value is not really a local max. It simply tells you how far in pixels the middle of the shape is from the nearest edge pixel. However, this measure is taken in the L1 norm and not the more familiar L2 or Euclidean norm. This leaves ridge-like artifacts that can be difficult to handle. We filtered most of these out by recognizing that neighboring pixels in an image processed by the distance transform differ by one. A ridge was thus defined by a triangular shape extending 3 pixels in opposite directions. If a ridge value was 8, the triangle would be 5 6 7 8 7 6 5.
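The triangular ridge test can be sketched like this. It is an illustration under assumptions: the text does not say along which axes the triangle was checked, so this version tests the row and the column through each pixel, and the names `is_ridge`, `ridge_pixels`, and `reach` are hypothetical.

```python
def is_ridge(dist, y, x, reach=3):
    """True if dist[y][x] is the peak of a triangular profile.

    The profile must fall by exactly one per pixel for `reach` pixels on
    both sides, along the row or along the column (assumed directions).
    """
    h, w = len(dist), len(dist[0])
    v = dist[y][x]
    for dy, dx in ((0, 1), (1, 0)):  # along the row, then along the column
        ok = True
        for k in range(1, reach + 1):
            for s in (-1, 1):
                yy, xx = y + s * k * dy, x + s * k * dx
                inside = 0 <= yy < h and 0 <= xx < w
                if not inside or dist[yy][xx] != v - k:
                    ok = False
        if ok:
            return True
    return False


def ridge_pixels(dist, reach=3):
    """Collect (y, x) positions of all silhouette pixels that pass the test."""
    return [
        (y, x)
        for y in range(len(dist))
        for x in range(len(dist[0]))
        if dist[y][x] > 0 and is_ridge(dist, y, x, reach)
    ]
```

Applied to the single row 5 6 7 8 7 6 5, only the pixel holding 8 is reported. A flat-topped profile such as 1 2 3 4 4 3 2 1 yields no ridge pixel under this strict test, so a real implementation would need some additional rule for shapes of even width.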
Once the ridge pixels were extracted, we had to figure out how to connect the pixels to form a skeleton. We did this by starting at the bottom left-hand corner of the image and scanning up successive rows. For each row, we would keep track of what new ridge pixels we encountered and try to find the most likely ridge pixel already encountered to connect them to. Likelihood was determined by a number of factors including proximity and slope between points. The slope factor helped us overcome the horizontal bias in our scanning pattern. Once we had all of the ridge pixels connected to likely matches, we applied a filter to connect shorter line segments into longer lines. If 2 touching segments had a similar slope or close proximity, we would take out the common point and consider the endpoints as a new line. On average, this reduced our data from 300 pairs of points to around 60 while still closely approximating the original 300-segment skeleton.
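The segment-merging filter might look roughly like the sketch below. It assumes the connected ridge pixels are already available as an ordered chain of points, and the thresholds `max_angle_deg` and `min_length` are hypothetical values chosen for illustration, since the text does not state the ones actually used.

```python
import math


def merge_segments(points, max_angle_deg=15.0, min_length=3.0):
    """Simplify a chain of points by dropping shared points.

    The point shared by two touching segments is removed when the segments
    have a similar slope, or when the first segment is very short.
    """
    if len(points) < 3:
        return list(points)
    out = [points[0]]
    for i in range(1, len(points) - 1):
        a, b, c = out[-1], points[i], points[i + 1]
        ang1 = math.atan2(b[1] - a[1], b[0] - a[0])
        ang2 = math.atan2(c[1] - b[1], c[0] - b[0])
        diff = abs(math.degrees(ang2 - ang1))
        diff = min(diff, 360.0 - diff)  # smallest angle between directions
        too_short = math.dist(a, b) < min_length
        if diff <= max_angle_deg or too_short:
            continue  # take out the common point b
        out.append(b)
    out.append(points[-1])
    return out
```

For example, `merge_segments([(0, 0), (10, 0), (20, 0), (20, 10)])` drops the collinear point `(10, 0)` and keeps the corner at `(20, 0)`.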
After skeletonization of the incoming video, the line segments are passed on to a number of other modules, one of which is the audio synthesis engine. This is a very direct and immediate interactive system that uses the raw line segment data to create aural feedback for the participant. This is achieved by convolving granulated samples with a spectrum generated by the skeletons.
First, the synthesis algorithm bins line segment points according to their horizontal position into 3 overlapping groups (here red, green and blue) and then places a value in that bin based on vertical position. Points higher in the skeleton are placed further to the right in the spectrum image, corresponding to higher pitches. Points further to the left are lower pitches. This spectrum matrix is then convolved, through multiplication in the frequency domain, with granulated samples, creating aural feedback for the participant. When the participant raises his arms, higher pitches emerge. When he crouches, lower pitches are created. Panning of the audio is based on the 3 bins. Red is left, green is center, blue is right. Thus, when the participant moves from side to side in the image, the sound will follow.
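The binning step can be sketched as follows. This is only an illustration: it assumes point coordinates normalized to the range 0 to 1 with y measured upward from the bottom of the image, uses triangular windows for the 3 overlapping groups, and accumulates a simple count per frequency bin. The window shape, `n_bins`, and the function names are all hypothetical.

```python
def skeleton_to_spectra(points, n_bins=64):
    """Build one spectrum per pan group from normalized skeleton points.

    Each point (x, y) adds energy to the frequency bin selected by its
    height y; its horizontal position x decides how that energy is shared
    between the overlapping left, center, and right groups.
    """
    centers = {"left": 0.0, "center": 0.5, "right": 1.0}
    spectra = {name: [0.0] * n_bins for name in centers}
    for x, y in points:
        k = min(n_bins - 1, max(0, int(y * n_bins)))  # higher point, higher bin
        for name, c in centers.items():
            weight = max(0.0, 1.0 - abs(x - c) / 0.5)  # triangular window
            spectra[name][k] += weight
    return spectra


def apply_spectrum(grain_bins, spectrum):
    """Multiply a grain's frequency-domain bins by a spectrum, bin by bin."""
    return [g * s for g, s in zip(grain_bins, spectrum)]
```

A point at the far left and bottom of the image contributes only to the lowest bin of the left group, while a point at `x = 0.25` is shared equally between the left and center groups, which is what lets the sound follow a participant smoothly across the image.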
In each location, there were two computers running the installation, with one computer collecting and distributing data. This computer, the Sender, both sent skeletons to and received skeletons from the remote location. The skeletons it received were passed on to the Receiver, as were the skeletons it extracted from the silhouettes. The Senders in both locations had static IPs for direct communication, although the Sender in Seoul used Dynamic DNS.
All rendering is done on the Receiver. It takes both local and remote skeletons and renders the virtual space from the point of view of the local skeletons. That is, the local skeletons are in the foreground with the remote skeletons in the background. In between them is the membrane, a tube-like, elastic surface that reacts to participants' motion both locally and remotely.
The skeletons are rendered by taking the line segments and turning them into textured quads. The quads have a Gaussian-kernel-type texture applied to them, and they are all blended together to form a filled-out person. This is then captured to a texture and further refined through fragment shaders. Finally, the rendered skeletons are textured onto the membrane as well as brought back into software, where they are used to displace the membrane, creating a more dimensional look.
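The geometry behind the textured quads can be sketched on the CPU as shown below. The installation itself does this work in the graphics pipeline, so this is only an illustration of the idea; `half_width`, `sigma`, and both function names are assumptions.

```python
import math


def segment_to_quad(p0, p1, half_width):
    """Expand a line segment into the 4 corners of a quad.

    The corners are offset from the endpoints along the segment's normal,
    giving the segment a thickness of 2 * half_width.
    """
    dx, dy = p1[0] - p0[0], p1[1] - p0[1]
    length = math.hypot(dx, dy)
    if length == 0:
        raise ValueError("segment has zero length")
    nx, ny = -dy / length, dx / length  # unit normal to the segment
    ox, oy = nx * half_width, ny * half_width
    return [
        (p0[0] + ox, p0[1] + oy),
        (p0[0] - ox, p0[1] - oy),
        (p1[0] - ox, p1[1] - oy),
        (p1[0] + ox, p1[1] + oy),
    ]


def gaussian_texel(u, v, sigma=0.25):
    """Intensity of a Gaussian-kernel texture at texture coordinate (u, v).

    Coordinates run from 0 to 1; the peak value 1.0 is at the center.
    """
    r2 = (u - 0.5) ** 2 + (v - 0.5) ** 2
    return math.exp(-r2 / (2.0 * sigma ** 2))
```

Because the texture fades toward the quad's edges, additively blending many overlapping quads produces a soft, filled-out body rather than visible rectangles.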
The membrane itself consists of 3 pieces: the 2 surfaces onto which the local and remote skeletons are rendered, and the interactive form separating them. This middle form has a 2D wave simulation running along its surface such that when it is stimulated, it fluctuates wildly according to the wave simulation with a damping constant near 1 (meaning that it takes a bit of time for it to return to steady state). The wave surface is textured with a particle-streaking texture that illuminates the surface according to participants' lateral position.
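A damped 2D wave simulation of this general kind can be sketched with a simple finite-difference update. This is a common textbook scheme and not necessarily the one used in SeoNang; the grid layout, the fixed zero border, the wave-speed term `c2`, and the value `damping=0.99` are assumptions made for illustration.

```python
def wave_step(prev, cur, c2=0.25, damping=0.99):
    """Advance a height field by one time step of a damped 2D wave equation.

    `prev` and `cur` are the height grids at the two most recent steps.
    Border cells stay at 0. A damping factor near 1 lets ripples persist
    for a while before the surface settles.
    """
    h, w = len(cur), len(cur[0])
    nxt = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Discrete Laplacian: how far this cell is from its neighbours.
            lap = (cur[y - 1][x] + cur[y + 1][x]
                   + cur[y][x - 1] + cur[y][x + 1]
                   - 4.0 * cur[y][x])
            nxt[y][x] = damping * (2.0 * cur[y][x] - prev[y][x] + c2 * lap)
    return nxt


def simulate(cur, steps, c2=0.25, damping=0.99):
    """Run several steps starting from rest and return the final grid."""
    prev = [[0.0] * len(cur[0]) for _ in cur]
    for _ in range(steps):
        prev, cur = cur, wave_step(prev, cur, c2, damping)
    return cur
```

Poking a single cell and calling `simulate` shows the disturbance spreading to neighbouring cells and then slowly dying away; lowering `damping` makes the surface settle faster.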
Video
Schematics: System Overview, Skeletonization, Audio Synthesis, Skeleton Rendering, Membrane texture
Stills