AFTERLIFE

15. TECH INSIGHTS:

REAL-TIME MOTION TRACKING IN A 3D GAME UNIVERSE

This chapter offers some in-depth insight into the setup and technical considerations that went into developing AFTERLIFE. If you are planning a project using Xbox Kinect sensors and/or Unreal Engine, don't hesitate to contact me with questions.


AFTERLIFE uses three programs running in parallel on a centralized computer: Unreal Engine, Ableton Live and Max MSP. Two Xbox Kinect v2 sensors with PC adapters capture body tracking data from the two performers on stage, and two headset microphones with wireless transmitters and receivers provide real-time vocal audio input from the performers. The screen output from Unreal Engine is projected onto a screen behind the performers, so that both the performers and the Unreal Engine universe are visible simultaneously from the audience perspective at all times. The same image is sent to two PC monitors in front of the performers, providing them with real-time visual feedback of their avatars.

Model of the AFTERLIFE motion-capture-to-Unreal-Engine-to-video-projection setup

Ableton Live functions as the centerpiece of the setup, and a 'master session' runs all the music and instrumental tracks for the piece. All live vocal processing and vocal automation runs in this master session for the entire duration of the piece. Using a custom-built m4l device, MIDI notes in this session are translated to OSC data that calls each new scene in Unreal Engine as the piece progresses; the scenes are stored as scripted sequences in a master level in UE.
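
The actual scene-cue device is a Max for Live patch; as a rough illustration of the same idea, here is a minimal Python sketch using the mido and python-osc libraries. The MIDI port name, the /scene OSC address and the note-to-scene mapping are hypothetical, not the values used in the piece.

```python
# Sketch of the MIDI-note-to-OSC scene-cue bridge. The real device is a custom
# Max for Live (m4l) patch; the MIDI port name, the /scene address and the
# note-to-scene mapping here are hypothetical.

import mido                                        # MIDI input
from pythonosc.udp_client import SimpleUDPClient   # OSC output

UE_HOST, UE_PORT = "127.0.0.1", 8000               # Unreal Engine OSC receiver

client = SimpleUDPClient(UE_HOST, UE_PORT)

with mido.open_input("Ableton Scene Cues") as midi_in:        # virtual MIDI bus
    for msg in midi_in:
        # Each note-on in the Ableton master session cues one UE sequence.
        if msg.type == "note_on" and msg.velocity > 0:
            client.send_message("/scene", msg.note)           # e.g. note 60 -> scene 60
```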


MIDI CC data from Ableton Live is also sent to Max MSP via internal virtual MIDI buses, where it is translated to DMX data and output to a USB-DMX interface to control lights on stage. AFTERLIFE uses a hybrid light programming setup: some lights, including a Martin Atomic 3000 strobe, a huge array of conventional lights behind the screen producing the 'God' lights, and side LED profilers, are hard-synced with the music using MIDI control from Ableton Live. Other lights, which differ from venue to venue, such as a venue's standard ceiling-mounted light rig, are programmed and triggered by a light engineer for each scene and subscene using a standard lighting console. We are planning to implement automated light console cue triggering, over MIDI or network, for the upcoming performances of AFTERLIFE.
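
The CC-to-DMX translation itself boils down to scaling and channel routing. The Python sketch below illustrates that mapping under assumed CC numbers and channel assignments; in the actual setup this step, including writing the DMX frame to the USB-DMX interface, is handled inside Max MSP.

```python
# Sketch of the MIDI-CC-to-DMX translation done in Max MSP. The CC-to-channel
# map is hypothetical; in the actual setup the resulting DMX frame is written
# to the USB-DMX interface from within Max.

CC_TO_DMX_CHANNEL = {
    20: 1,   # e.g. Martin Atomic 3000 strobe intensity
    21: 2,   # e.g. 'God' light array dimmer
    22: 3,   # e.g. side LED profiler dimmer
}

def cc_to_dmx(cc_number: int, cc_value: int, dmx_frame: bytearray) -> None:
    """Scale a 7-bit MIDI CC value (0-127) to an 8-bit DMX value (0-255)."""
    channel = CC_TO_DMX_CHANNEL.get(cc_number)
    if channel is not None:
        dmx_frame[channel - 1] = round(cc_value * 255 / 127)

dmx_frame = bytearray(512)      # one full DMX universe
cc_to_dmx(21, 127, dmx_frame)   # CC 21 at full value -> DMX channel 2 at 255
```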


Unreal Engine runs each scene as a scripted sequence, meaning all animation data, such as camera movement, character animations and changes in light or weather, is stored as automation inside separate UE sequence files, which are called using an OSC receiver blueprint in UE. A blueprint in Unreal Engine is simply a piece of code built using a node-based visual coding language similar to programs like TouchDesigner or Max MSP. No audio is output from Unreal Engine, as this would make balancing audio from different programs unpredictable; all audio is output solely from the Ableton Live master session. The piece relies on Ableton Live and Unreal Engine staying in sync for the duration of an entire scene, based on the initial triggering of that scene (which in 99% of all cases works quite well).

A simple demonstration of a setup based on the 'examples' UE level that comes with the Neo Kinect UE plugin.

AFTERLIFE uses the Neo Kinect plugin by Rodrigo Villani for real-time avateering. The motion capture data generated inside the Xbox Kinect sensor is fed directly into the Neo Kinect plugin in UE, where it is read inside the Animation Blueprints for our two main characters. The general locations and rotations of the characters' root bones are animated inside the UE scene sequences, meaning that we can move the origin point of the characters around globally using scripted animations, while the characters still flexibly move around that origin point in real-time, controlled by sensor data input.


As AFTERLIFE uses two Xbox Kinects to control the two avatars in the piece, two computers are needed on stage, since hardware limitations unfortunately require each Xbox Kinect to run on a separate (Windows) computer. The second sensor sends its data directly into a locally running instance of Unreal Engine, using the Neo Kinect plugin in UE just like on the main computer. But instead of translating this data into an Unreal Engine scene on the local machine, the data is translated to OSC using a painfully crafted OSC encoder blueprint and sent via network to the main computer. There, a parsing blueprint inside the second character's Animation Blueprint decodes the incoming OSC data and translates it to Euler rotation values that control the second avatar in UE.
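
In AFTERLIFE both the encoder and the parser are Unreal Engine blueprints; the Python sketch below only illustrates the data flow between the two machines, using the python-osc library. The bone names, OSC address scheme, IP address and port are illustrative assumptions.

```python
# Sketch of the joint-rotation-over-OSC bridge between the two computers.
# Both ends are UE blueprints in AFTERLIFE; bone names, the /kinect2/...
# address scheme, IP and port are illustrative assumptions.

from pythonosc.udp_client import SimpleUDPClient
from pythonosc.dispatcher import Dispatcher
from pythonosc.osc_server import BlockingOSCUDPServer

MAIN_PC, PORT = "192.168.1.10", 9001

# --- client computer: encode one tracked bone per OSC message --------------
client = SimpleUDPClient(MAIN_PC, PORT)

def send_pose(pose: dict) -> None:
    """pose maps bone name -> (roll, pitch, yaw) in degrees from the Kinect."""
    for bone, (roll, pitch, yaw) in pose.items():
        client.send_message(f"/kinect2/{bone}", [roll, pitch, yaw])

send_pose({"spine_01": (0.0, 12.5, -3.0), "head": (1.0, 0.0, 20.0)})

# --- main computer: parse incoming messages back into bone rotations -------
incoming_pose = {}

def on_bone(address: str, roll: float, pitch: float, yaw: float) -> None:
    bone = address.rsplit("/", 1)[-1]          # "/kinect2/spine_01" -> "spine_01"
    incoming_pose[bone] = (roll, pitch, yaw)   # read by the 2nd avatar's anim logic

dispatcher = Dispatcher()
dispatcher.set_default_handler(on_bone)        # every /kinect2/<bone> message lands here
# BlockingOSCUDPServer(("0.0.0.0", PORT), dispatcher).serve_forever()  # run on the main PC
```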


The Xbox Kinect sensors output about 30 frames of full-body tracking data per second, which are linearly interpolated between frames, giving the characters a slight lag and smooth feel that prevents exaggerated jittering and noise. As the second computer was under much less CPU stress than the main computer, we actually found that this client-controlled avatar sometimes operated faster and more smoothly than the avatar controlled on the main computer. The Xbox Kinect is a CPU and GPU hog, and when connected it lowered our frame rate by roughly 10-15 fps on a high-end machine.
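
The smoothing principle can be illustrated with a minimal sketch: each rendered frame interpolates between the last two received sensor poses, since the engine renders faster than the sensor updates. In the piece this happens inside the Animation Blueprints; the naive per-axis Euler lerp below is only meant to show the idea.

```python
# Sketch of the interpolation between ~30 Hz sensor frames. In AFTERLIFE this
# happens inside the Animation Blueprints; the naive per-axis Euler lerp below
# only illustrates the principle (it ignores angle wrap-around at +/-180 deg).

def lerp(a: float, b: float, t: float) -> float:
    return a + (b - a) * t

def smoothed_rotation(previous: tuple, latest: tuple, t: float) -> tuple:
    """Blend one bone's (roll, pitch, yaw) between the last two sensor frames.

    t runs from 0.0 to 1.0 over the ~33 ms between Kinect updates, which is
    what gives the avatars their slight lag and smooth feel."""
    return tuple(lerp(p, l, t) for p, l in zip(previous, latest))

# halfway between two sensor frames, e.g. while rendering at 60 fps:
print(smoothed_rotation((0.0, 10.0, 0.0), (0.0, 20.0, 0.0), 0.5))  # (0.0, 15.0, 0.0)
```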

15.1. TECHNICAL CONSIDERATIONS AND ADJUSTMENTS

Several considerations and adjustments had to be taken into account with the Xbox Kinect motion capture setup. First of all, the Xbox Kinect v2 sensors are based on quite old technology (they were released in 2014), and one that Microsoft unfortunately decided to abandon as sensor sales proved to be very low. The Kinect initially targeted the competitive-gaming Xbox console community, who proved to be more interested in action and competition than motion capture. I am convinced that modern AI-powered one-camera systems would be able to track a human body much better than the Kinect v2, but we have yet to see the emergence of such technology among a target group of rather conservative gamers. Some systems for mobile phones exist, but they usually don't track full bodies very well, as they are purely based on AI interpreting a one-camera video feed.


The Kinect sensors have the disadvantage of only being able to track a human body when the body is facing the sensor head-on, meaning that player movement and freedom is quite limited. The player cannot simply roam a space freely as they would in the real world and have the sensor follow their movement continuously, but has to use the sensor like an instrument - carefully evaluating the sensor's interpretation of their movements and constantly adjusting their body to achieve the desired avatar postures and movements.

The sensors also have the disadvantage of producing quite a lot of jitter, especially when players are seated. Head rotation is particularly jittery, and was dropped from AFTERLIFE entirely. Although these artefacts can be minimized by keeping proper distance from the sensors (about 3 meters) and adjusting the sensors to optimal height and angle (the sensor should be at about the height of the player's chin, with a very slight downward tilt), some artefacts were unavoidable.

As a strategy to turn the jittering to our benefit, we decided that the performers in AFTERLIFE would constantly try to adjust their avatars whenever they glitched, so as to always dynamically interact with their avatars and use the glitches and artefacts creatively. Just as jazz musicians are famous for preaching, 'If you play a mistake, repeat it!', we tried to make use of the artefacts in-piece to keep the experience dynamic and alive, and to let the glitching avatars influence how the performers move their bodies on stage just as much as the performers control their avatars. Again using the allegory of (free) jazz and musical improvisation, we found that this constant feedback loop of performers continuously adjusting to their slightly unpredictable avatars created an interesting stage presence and an unexpected liveliness in the piece. We ended up placing two computer monitors facing the players at the front of the stage, so that the performers could have constant visual feedback from their avatars and adjust even while facing the audience.


The Xbox Kinect sensors work by projecting infrared light into the room and tracking how objects (like a human body) reflect it, using built-in color and IR cameras. We discovered that the sensors stopped working when the stage filled with stage fog, as the fog would be tracked rather than the bodies, and the sensor would lose the full-body image. We used this sensor behaviour creatively at the moment the main characters die in the story: by completely filling the stage with fog, we provoke a full avatar glitch-out and interrupt the body tracking as a metaphor for death.

We also had intermittent sensor drop-outs in the early phases of the project, and found that these could be avoided by following two steps:
-Modifying the sensor's inner cooling system so the internal cooling fan is always on, as shown in this video. Most sensor drop-outs were caused by internal overheating.
-The sensors pulled more current from our laptops than our motherboards were able to supply, thus dropping out repeatedly when directly connected. We solved this problem by using USB 3.0 hubs with external power supplies between sensor, adapter and PC.

The Xbox Kinects have three main advantages, which made them more attractive to us than huge, multi-camera motion capture systems.
-The sensors are very portable, light and durable, and can be mounted on microphone stands. They send out height and tilt data, which makes local compensation of position and angle easy. On the second-hand market, they are cheap.
-The sensors don't require the players to wear special suits or markers. We discovered that the cameras worked best with tight clothing, but that tracking conditions were acceptable even with slightly baggy clothes.

-The Neo Kinect plugin by Rodrigo Villani is simply awesome (Kudos for fantastic work!). It proved to be exceptionally fast and stable throughout testing, and exposes every possible parameter of Kinect data to the Unreal Engine blueprint system (including an IR video stream and very useful information like identifying the 'nearest tracked body'). An interesting side note is that AFTERLIFE was not built from scratch, but started out from Villani's Neo Kinect Unreal Engine 'examples' project, which highlights all the plugin's functionalities within one level. AFTERLIFE's avateering code is based on the very useful avateering example code provided by Villani, which was further modified and developed to suit the piece.


Programming the code for controlling the UE avatars was no walk in the park, and approximately one full month went into optimizing the performer-to-avatar interaction system. There were a few core challenges that we faced during programming and development, and the next section focuses on our solutions to them:


We needed to be able to control our own custom avatars, not just the default Unreal Engine Mannequin provided in the original example level. To do so, we had to custom rig our character meshes in Blender using the default UE Mannequin skeleton, as all bone rotations would have been off if we had used our own custom-built skeletons. The process of successfully rigging and weight-painting our characters with the UE default Mannequin skeleton was a painstakingly long effort of eliminating bugs and errors. Additionally, we wanted our avatars' main joints to be controlled by the Kinect sensors, but we needed certain body parts, like the characters' mouths and tails, to be externally controlled. As such, we needed to fuse direct sensor input data with external data: UE sequence data for the tail rotations, and real-time Ableton Live voice-amplitude OSC data for the avatars' mouth openings.
We also wanted the characters to produce footstep sounds as they touched the ground, and have different sounds for each character, thus further building their characteristics not just visually but also sonically. The footsteps would need to be dynamic and follow performer action in real-time, to allow the performers a maximum of expressive freedom.

The Unreal Engine standard Mannequin (to the left) with its internal skeleton (bones are highlighted in blue), and the AFTERLIFE character 'Godzilla' (on the right) using a modified version of the same UE Mannequin skeleton, with additional bones for jaw and tail and some unnecessary bones removed. A lot of work went into rigging the main characters successfully by continuously tweaking the base UE Mannequin skeleton.

In Unreal Engine, a character is animated by connecting or 'parenting' parts of the character mesh to an internal skeleton using a weight system, meaning each bone is given a certain influence on parts of the overall character mesh. As in a human body, the skeleton bones are parented to one another in a hierarchical structure. Every skeleton has a bone called the 'root bone', which defines the character's point of origin and global location in 3D space. Character animation is done simply by animating the inner bone rotations and the root bone location and rotation, or, in the case of real-time avateering, by sending bone rotation data from a sensor to the character mesh bones. Only the root bone requires location and rotation data from the sensor; the other bones require only rotation data, and they form the overall body pose by following the hierarchical structure of the skeleton.

Body-limb hierarchy model as used in most 3D programs today, by Alberto Cano. The model shows how the root bone is the only bone needing location and rotation data, and how the rest of the limb positions are deduced by applying rotation data to each joint in the skeleton hierarchy.
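
As a minimal illustration of this hierarchy, the sketch below computes joint positions from a root location plus a chain of per-bone rotations. It is reduced to 2D and uses made-up bone names and lengths; UE of course works with full 3D transforms.

```python
# Minimal forward-kinematics sketch of the hierarchy described above: only the
# root carries a location, every other bone contributes a rotation, and joint
# positions fall out of walking the chain. Bone names and lengths are made up;
# the example is 2D for brevity.

import math

# (bone name, length, local rotation in degrees relative to the parent bone)
CHAIN = [("spine", 0.5, 10.0), ("upper_arm", 0.3, 45.0), ("lower_arm", 0.25, 20.0)]

def joint_positions(root_xy, root_rotation_deg, chain):
    positions = [root_xy]
    angle = math.radians(root_rotation_deg)
    x, y = root_xy
    for _, length, local_rot in chain:
        angle += math.radians(local_rot)       # rotations accumulate down the hierarchy
        x += length * math.cos(angle)
        y += length * math.sin(angle)
        positions.append((x, y))
    return positions

print(joint_positions((0.0, 1.0), 90.0, CHAIN))  # root location + rotations -> all joints
```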

The challenges mentioned above were tackled by following these steps:
-Exporting the standard UE Mannequin skeleton and importing it into Blender.
-Rigging our custom character meshes by modifying the sizes of the UE Mannequin bones. We tried to leave the bone rotations as intact as possible, so as not to have to compensate too much with custom rotation adjustments in Unreal Engine after rigging. We added bones for mouths and tails to the UE Mannequin skeleton, as this was necessary for correct mouth and tail movements.

-Carefully parenting our meshes to our skeletons using these tips, and diligently weight painting our characters until all bone rotations were clean and working. It is important to note that only a small set of bone rotations (around 16 joints) is used in UE to animate the characters with the Kinect, so weight painting focused exclusively on these bones.

-Creating idle animations for our character skeletons. We made sure to emphasize subtle tail movement, and subtle movements of extra limbs such as Axolotl's head fins and Godzilla's neck spikes. These animations proved to be of huge value in counteracting sensor jitter and glitches, and helped keep the characters feeling alive and smooth while being animated by sensor data. They run continuously throughout the piece, as an additive layer on top of the tracked motion capture animation data.

-Exporting our Blender character meshes with skeletons and animations as .fbx files and importing them into Unreal Engine.

-Inside Unreal Engine, the Animation Blueprints were modified by adding additional OSC receivers and 'weight' variables (0.-1.) exposed to the Unreal Sequencer system. The OSC receivers were fed data from custom m4l devices in Ableton Live tracking the incoming performer voice amplitudes, moving a character's jaw up and down by blending in a predefined jaw location and rotation data set whose weight is updated in the Animation Blueprint (a minimal sketch of this follows after the list). The weight data was also smoothed from frame to frame for jitter-free operation. Tail movement was animated inside the UE sequencer by similarly 'weighting' a character's upwards tail rotation, thus 'blending out' the tail's rotation based on incoming spine bone sensor data.

-Tweaking, tweaking and more tweaking: most of the work went into troubleshooting and revising weight painting, and making subtle changes to the default character bone rotations inside UE, so as to optimize the match between irl performer body poses and UE character poses.
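
As referenced in the list above, here is a minimal sketch of the voice-amplitude-to-jaw blend with frame-to-frame smoothing. In AFTERLIFE this logic lives inside the characters' Animation Blueprints and the amplitude arrives as OSC from a custom m4l device; the smoothing factor and value ranges below are assumptions.

```python
# Sketch of the amplitude-driven jaw blend with frame-to-frame smoothing.
# In the piece this is blueprint logic; the smoothing factor is an assumption.

class JawBlend:
    def __init__(self, smoothing: float = 0.35):
        self.weight = 0.0          # 0.0 = jaw closed, 1.0 = fully open jaw pose
        self.smoothing = smoothing # fraction of the gap closed per rendered frame

    def update(self, voice_amplitude: float) -> float:
        """Call once per rendered frame with the incoming amplitude (0.0-1.0)."""
        target = max(0.0, min(1.0, voice_amplitude))
        # frame-to-frame smoothing keeps the jaw from jittering on consonants
        self.weight += (target - self.weight) * self.smoothing
        return self.weight         # used as the blend weight of the open-jaw pose

jaw = JawBlend()
for amp in (0.0, 0.8, 0.8, 0.1):   # a few frames of incoming amplitude data
    print(round(jaw.update(amp), 3))
```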

We purposely decided to omit details such as blinking and facial expressions for our characters. There were practical and artistic considerations for this decision:

-Blinking and facial expressions would mean more complex character skeletons, more sequencer data and thus more error sources, and finally much more work (face animation in itself is a tedious process and would have to cover the entire length of the piece).

-Artistically, we wanted to tap into an aesthetic related to vintage games, where facial expressions were still underdeveloped and character emotionality was expressed through exaggerated hand gestures or full-body movements. We found that expressing emotions with our bodies would positively influence performer-character interaction, and encourage bigger and thus more expressive performer movements.
-Inspired by the pieces of Ida Müller and Vegard Vinge, but also by Noh theatre and the use of masks in general, both as a ritualistic tool and as a digital aesthetic, we aimed to reduce the psychological emotionality of our characters by omitting their facial expressions, instead emphasizing their symbolic expressivity through vocal performance and full-body movement.

My friend GNOM's wonderful collaboration with artists Fire-Toolz and Balfua makes use of fairly static 3D masks as avatars narrating this music video. The masks' static, unchanging expressions open up a more individual, symbolic and less psychological reading of the characters' emotionality.

Footstep triggering was similarly solved inside UE. Each character has a 'trigger box' containing an OSC sender blueprint attached to its root node. Every time a character lifts a foot out of the box and creates a new overlap event as the foot re-enters it, UE sends an OSC signal to Ableton Live, which is translated into a unique MIDI note per foot and ultimately triggers a footstep sound. As the trigger boxes are attached to the character root nodes, they follow the characters wherever they go. Originally, we used different sounds as the characters walked on different ground surfaces in different scenes, but we opted for a single sound set per character throughout the entire piece, as emphasizing their unique characteristics became more important than sonic realism.

Trigger boxes (in green) are located below the characters and are attached to the character root nodes. The red wireframe spheres attached to the characters' feet trigger OSC messages each time they overlap with their corresponding green trigger boxes.
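
The footstep logic itself is a simple edge detection on the overlap state. The following Python sketch reproduces it outside UE for illustration: the OSC address, MIDI note numbers and box height are hypothetical, and in the piece the overlap events come directly from the blueprint trigger boxes shown above.

```python
# Sketch of the footstep trigger: a foot leaving and re-entering the box
# attached to the character root produces exactly one OSC message per step.
# OSC address, note numbers and the height threshold are assumptions.

from pythonosc.udp_client import SimpleUDPClient

client = SimpleUDPClient("127.0.0.1", 9002)        # Ableton Live OSC receiver (m4l)

FOOT_NOTES = {"left_foot": 36, "right_foot": 38}   # hypothetical MIDI note per foot
BOX_TOP = 0.08                                     # trigger-box top, relative to root (m)
foot_inside = {"left_foot": True, "right_foot": True}

def update_foot(foot: str, height_above_root: float) -> None:
    """Detect the 'foot comes back down into the box' edge and send one trigger."""
    inside = height_above_root < BOX_TOP
    if inside and not foot_inside[foot]:           # new overlap event -> footstep
        client.send_message("/footstep", FOOT_NOTES[foot])
    foot_inside[foot] = inside

# one stride of the left foot: lift out of the box, then step back in
for h in (0.02, 0.15, 0.20, 0.05):
    update_foot("left_foot", h)
```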

15.2. ANIMATION, OBJECT PERSONIFICATION AND LIGHT PROGRAMMING

NPCs in AFTERLIFE were similarly animated using a bare minimum of reductive animation sets. Most characters have roughly one idle animation, one movement animation and one or two simple arm, hand or head-tilt gestures as their full expressive vocabulary. A lot of work went into synchronizing jaw movement with vocal audio, as this for me is the most delicate area in terms of audiovisual discrepancy, and my tolerance for accepting a voice as coming from a character when sound and movement are out of sync is quite low.

The 'Magic Leopard' scene in AFTERLIFE. The Leopard's animations are reduced to walking, an angry leap forward (which is an exception for this character), head tilts indicating the direction spoken to, and jaw movement. The heavy simplification and reduction of character movement options adds to the Leopard's presence and uniqueness as a symbolic, rather than psychologically realistic, entity in the piece.

Other characters in AFTERLIFE, such as the Half-Life 2 Vortigaunt or the Ghost, communicate by having parts of their body light up in synchronization with the amplitude of their voice. In the case of the Ghost, this data was not animated by hand, as that would have taken too much time, but automated by recording an OSC stream of voice amplitude data from Ableton Live directly into the UE sequencer. Although this method is time-saving, and was necessary for the Ghost character because of the sheer amount of spoken words, I generally prefer animating jaw movements and character lights synced with audio 'by hand', to maximize the synchronization and expressivity of the character within this simple parameter. For the sensor-controlled characters, we tested pre-animated jaw movement, but dropped it quickly, as performer freedom and control were more important than detailed jaw movements. Pre-animated jaw animations would also have limited the vocal expression of the performers, as they would have had to constantly sync their singing with the pre-programmed sequences.

The 'Ghost Love Song' scene in AFTERLIFE. The Ghost's light values are automated by recording and scaling an OSC stream of amplitude data from Ableton Live. As shown in the video, consonants tend to produce amplitude peaks, which become spikes in the light automation and lend the timing a slightly artificial feel. We still accept that the voice is coming from the Ghost, as we perceive the synchronicity between voice audio and light as one coherent action. The OSC data is smoothed using a ramp up for the peaks with a very short attack time, and a ramp down with a longer release time.
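
The attack/release shaping described in the caption can be sketched as a simple envelope follower: the level rises quickly towards peaks and falls back more slowly. The frame rate and time constants below are assumptions; in the piece this shaping is applied to the recorded OSC amplitude stream before it drives the Ghost's light values.

```python
# Sketch of the attack/release smoothing applied to the amplitude stream:
# fast ramp up so peaks still read as speech, slower ramp down so the light
# does not flicker off between syllables. Frame rate and time constants are
# assumptions for illustration.

import math

def envelope(amplitudes, frame_rate=60.0, attack_s=0.01, release_s=0.25):
    """Return an attack/release-smoothed copy of a per-frame amplitude stream."""
    attack = 1.0 - math.exp(-1.0 / (attack_s * frame_rate))
    release = 1.0 - math.exp(-1.0 / (release_s * frame_rate))
    level, out = 0.0, []
    for amp in amplitudes:
        coeff = attack if amp > level else release   # rise fast, fall slowly
        level += (amp - level) * coeff
        out.append(level)
    return out

print([round(v, 2) for v in envelope([0.0, 1.0, 1.0, 0.0, 0.0, 0.0])])
```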

This audio-to-light approach was derived from an earlier piece, 'Lied für Ghost', which takes place in a similar setup, but in an irl installation setting. A physical version of the same Ghost character is animated using a wind machine, a winch pulling the Ghost sculpture up and down, audio and lights.

Lied für Ghost (2021). Irl 'animation' of a sculpture in an installation setting. It is interesting to note how easily we interpret the Ghost as a coherent, dynamic character, by means of tight audiovisual and movement synchronization.

The voice-to-light concept of 'Lied für Ghost', which is also smartly employed in Alexander Schubert's 'Asterism', was expanded on and abstracted even further in AFTERLIFE's 'God' scene, where a massive array of conventional lights behind the screen is synchronized to the 'God' voice. This creates the illusion of a personified character inhabiting and embodying the entire stage and performance space. The light, which shines past the sides of the screen but is obscured by the screen itself, is a direct nod to Plato's allegory of the cave, in which Socrates remarks that the cave's prisoners, if released into freedom, would be so blinded by the light outside (here representing truth and enlightenment, that which is outside the screen) that they would be unable to make any sense of the world outside the cave and would prefer to return to it (the screen itself).

This same likening of the screen to Plato's cave is explored by writer and theorist McKenzie Wark in the opening chapters of 'Gamer Theory' (2006), which, similarly to AFTERLIFE, questions the implications of gaming and gamer culture for our reading of contemporary reality.

In the 'God' scene, 'God' is blasphemously personified through the light peeking out at the sides of the screen and lighting up the stage fog, which is blown onto the stage by a fan from behind the screen. This is the first moment in the piece that the environment surrounding the screen and the bare stage is lit up, and therefore emphasized. The scene questions the nature of a God in a model of the world based on games and simulation, and further speculates about what the motives of the author of such a realm might be.

The back-screen light array used in AFTERLIFE. All lights are conventional, to create more fluid and organic responses to the incoming DMX stream which is synched to 'God's' voice in the piece.

15.3. COSTUMING AND SET DESIGN

Concerning set design, we decided to keep the stage bare and portray the performers as a hybrid between e-boy/e-girl and gamer archetypes, further strengthening a narrative that encourages a reading of the entire storyline as nothing but in-game events and simulation. This is a narrative tool borrowed from 'eXistenZ', mentioned earlier, where transitions between irl and game realities are purposely blurred. In the same spirit, computers were left visible on stage, and the Kinect sensors were openly exposed to the audience. In future editions of the piece, there will probably be a basic set design, based on digital content brought into the irl stage environment, further strengthening and contextualizing the gamer/e-person narrative.