
Human Motion Analysis: Exploring Language-Motion Integration for Motion Editing

5.1 Pre-experiments: text-to-motion generation result comparison

(a) Results from MDM. The generated motion is as expected.
(b) Results from MLD. The person moves 3 steps laterally.
(c) Results from T2M-GPT. Only two jumps.
Figure 14: Generated motions of "The person jumps 3 times and sits down". All three models show insensitivity to numerical values.
(a) Results from MDM. The person jumped three times and then stopped, without falling to the ground.
(b) Results from MLD. The person only jumped once and then stopped.
(c) Results from T2M-GPT. This person took a small jump in place then fell after a big jump.
Figure 15: Generated motions of "The man jumped twice, then fell to the ground." Given a sequence of action instructions, the models often execute only the earlier one. Here, T2M-GPT successfully executes all the commands, while the other two ignore the fall to the ground.
(a) Results from MDM. The generated motion is mostly correct, but the direction is relative to the camera and not to the person.
(b) Results from MLD. The person turns right, then turns left and walks in a curve to the right (from the camera's perspective), and turns right again.
(c) Results from T2M-GPT. The generated motion is mostly correct.
Figure 16: Generated motions of "The person is walking in a curve to the left and then back around to the right in a curve." The models may lack sensitivity to direction. MLD seems a little confused at the beginning; other than that, all three give satisfactory results.
(a) Results from MDM. The man appears bent over, holding his head forward.
(b) Results from MLD. The man appears to have his feet up on a sloping surface.
(c) Results from T2M-GPT. The man appears to be clinging to something, but still has his feet on the ground.
Figure 17: Generated motions of "A person is bouldering." The models may fail to accurately process texts that are not encompassed within the training set.

5.2 Latent space exploration

5.2.1 Latent space exploration of MLD

(a) Intra-cluster interpolation. (b) Inter-cluster interpolation.

Schematic Diagram of Latent Space Interpolation: The diagram illustrates two scenarios of interpolation in the latent space: (a) between 'Throw' and 'Throw' (intra-cluster), and (b) between 'Throw' and 'Walk' (inter-cluster). For both (a) and (b), the points in the rectangular box represent the original motions, with three evenly interpolated points in between. Additional points are plotted along the extension line of the two original actions, at a quarter, a half, one, and ten times the distance between the original points.
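The interpolation and extrapolation in the schematic are linear operations on the latent codes. A minimal sketch of the sampling scheme (the 256-dimensional latents here are random placeholders, not actual MLD codes):

```python
import numpy as np

def interp_extrap(z_a, z_b, t):
    """Linear blend of two latent codes: 0 <= t <= 1 interpolates,
    t < 0 or t > 1 extrapolates along the extension line."""
    return (1.0 - t) * z_a + t * z_b

rng = np.random.default_rng(0)
z_throw = rng.normal(size=256)  # placeholder latent for 'Throw'
z_walk = rng.normal(size=256)   # placeholder latent for 'Walk'

# three evenly spaced interpolations between the two originals
interps = [interp_extrap(z_throw, z_walk, t) for t in (0.25, 0.5, 0.75)]

# points along the extension line at a quarter, a half, one, and ten
# times the distance between the originals, as in the schematic
beyond_walk = [interp_extrap(z_throw, z_walk, 1.0 + s) for s in (0.25, 0.5, 1.0, 10.0)]
beyond_throw = [interp_extrap(z_throw, z_walk, -s) for s in (0.25, 0.5, 1.0, 10.0)]
```

Each resulting latent would then be passed through the motion decoder; intra-cluster interpolation uses two latents from the same action cluster, while inter-cluster interpolation uses latents from different clusters.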

Figure 20: Generation of Motion from Latent Space: The figure showcases the generation of three distinct motions – 'Walk', 'Throw', and 'Boxing' – derived from their corresponding clusters in the latent space.
Rendered Versions of Generated Motions: This figure presents the rendered versions of the 'Walk', 'Throw', and 'Boxing' motions generated from the latent space, as depicted in Fig. 20. The rendering implementation is detailed on the project page.
From left to right: (a) Quarter extrapolation beyond 'Throw' action (right hand). (b) Original 'Throw' action, right hand. (c) First interpolation between 'Throw' actions. (d) Second interpolation between 'Throw' actions. (e) Third interpolation between 'Throw' actions. (f) Original 'Throw' action, left hand. (g) Quarter extrapolation beyond 'Throw' action (left hand).
Visual Representation of Interpolated and Extrapolated Values: the figure showcases a series of continuous interpolation and extrapolation positions depicted from left to right. Figures (a) and (g) show extrapolated actions progressing more slowly than the originals on their respective sides. Figures (c), (d), and (e) demonstrate transitions that fall between the original actions. All interpolated and extrapolated figures tend to closely resemble the nearer of the two original actions.
From left to right: (a) Ten times extrapolation value beyond 'Throw' action. (b) One time extrapolation value beyond 'Throw' action. (c) Half time extrapolation value beyond 'Throw' action. (d) Original 'Throw' action, right hand. (e) Original 'Throw' action, left hand. (f) Half time extrapolation value beyond 'Throw' action (left hand). (g) One time extrapolation value beyond 'Throw' action (left hand). (h) Ten times extrapolation value beyond 'Throw' action (left hand).
Visual Representation of Extrapolated Values: the figure showcases a series of continuous extrapolation positions depicted from left to right. The ten times extrapolated figures (a) and (h) depict complete motion transformations. In figures (b) and (c), the further the motion is from the original (d), the slower it becomes and the more pronounced the drift. The same principles apply to figures (f) and (g). All extrapolated figures tend to resemble the nearer original.
From left to right: (a) Quarter extrapolation towards 'Walk' action from 'Throw'. (b) Original 'Throw' action. (c) First interpolation between 'Throw' and 'Walk' actions. (d) Second interpolation between 'Throw' and 'Walk' actions. (e) Third interpolation between 'Throw' and 'Walk' actions. (f) Original 'Walk' action. (g) Quarter extrapolation beyond 'Walk' action from 'Throw'.
Visual Representation of Interpolated and Extrapolated Values: the figure showcases a series of continuous interpolation and extrapolation positions depicted from left to right. Figures (a) and (g) show extrapolated actions progressing more slowly than the originals on their respective sides. Figures (c), (d), and (e) demonstrate transitions that fall between the original actions. All interpolated and extrapolated figures tend to closely resemble the nearer of the two original actions.
From left to right: (a) Ten times extrapolation towards 'Walk' action from 'Throw'. (b) One time extrapolation towards 'Walk' action from 'Throw'. (c) Half time extrapolation towards 'Walk' action from 'Throw'. (d) Original 'Throw' action. (e) Original 'Walk' action. (f) Half time extrapolation beyond 'Walk' action from 'Throw'. (g) One time extrapolation beyond 'Walk' action from 'Throw'. (h) Ten times extrapolation beyond 'Walk' action from 'Throw'.
Visual Representation of Interpolation and Extrapolation between 'Throw' and 'Walk' actions: the figure showcases a series of continuous extrapolation positions depicted from left to right. The ten times extrapolated figures (a) and (h) depict complete action transformations. In figures (b) and (c), the further the action is from the original (d), the slower it becomes and the more pronounced the drift. The same principles apply to figures (f) and (g). All extrapolated figures tend to resemble the nearer original.

5.2.2 Latent space exploration of MDM

Figure 28: Results of interpolation and extrapolation between "A person sits down" and "A person walks to the right". Most results seem normal, except for (d), where the person is walking with crossed legs while sitting on the ground, which is almost impossible for a human.

5.3 Prompt-based motion editing

(a) The person clasps his hands.
(b) The person slowly clasps his hands. (with Prompt-to-Prompt)
(c) The person slowly clasps his hands. (without Prompt-to-Prompt)
Figure 29: Generation results for "1. Speed and Duration Edits: a. Modifying speed or pacing." The generated motion is not significantly slower, but Prompt-to-Prompt keeps the timing of the arm movements essentially the same as the original motion. Without Prompt-to-Prompt, the person clasps his hands again after lowering his arms.
(a) The person clasps his hands.
(b) The person slowly clasps his hands. (with Prompt-to-Prompt)
(c) The person slowly clasps his hands. (without Prompt-to-Prompt)
Figure 30: Generation results for "1. Speed and Duration Edits: b. Changing the duration of the motion." After clasping the hands, the generation with Prompt-to-Prompt indeed lowers the arms a little later, while clasping the hands again at the same time as the original motion. In contrast, the result generated without Prompt-to-Prompt is not consistent with the original motion in timing, although the action is still consistent with the prompt.
(a) A person walks forward.
(b) A person walks backward. (with Prompt-to-Prompt)
(c) A person walks backward. (without Prompt-to-Prompt)
Figure 31: Generation results for "2. Direction and Orientation Edits: a. Changing direction." The two generated results are not significantly different. The model may understand and perform well with relatively simple orientation adverbs like "forward" and "backward".
(a) A person walks forward.
(b) A person walks forward with the upper body facing right. (with Prompt-to-Prompt)
(c) A person walks forward with the upper body facing right. (without Prompt-to-Prompt)
Figure 32: Generation results for "2. Direction and Orientation Edits: b. Adjusting orientation." It looks like "walks forward with the upper body facing right" is interpreted as "walks toward the right front". Beyond that, the result without Prompt-to-Prompt is more faithful in executing the upper-body-facing-right command, but this "right" is the right of the camera view rather than the right of the person. Prompt-to-Prompt suppresses the execution of the body-facing-right command.
(a) A person kicks down with their left leg.
(b) A person kicks down with their right leg. (with Prompt-to-Prompt)
(c) A person kicks down with their right leg. (without Prompt-to-Prompt)
Figure 33: Generation results for "3. Body Part and Joint Edits: a. Changing body parts involved." In (a), the person kicks with their right leg, while in (b) the person kicks with their left leg; again, the model may interpret the orientation as that of the camera view rather than of the person. The generation without Prompt-to-Prompt exhibits more extraneous body movements than the original motion.
(a) A person kicks down with their left leg.
(b) A person kicks high with their left leg. (with Prompt-to-Prompt)
(c) A person kicks high with their left leg. (without Prompt-to-Prompt)
Figure 34: Generation results for "3. Body Part and Joint Edits: b. Adjusting joint angles." The generation without Prompt-to-Prompt changes the orientation of the kick and does not kick high enough.
(a) A person turns to his right and paces back and forth.
(b) A person stands for a while and then turns to his right and paces back and forth. (with Prompt-to-Prompt)
(c) A person stands for a while and then turns to his right and paces back and forth. (without Prompt-to-Prompt)
Figure 35: Generation results for "4. Action and Pose Edits: a. Introducing new actions or poses." The generation with Prompt-to-Prompt does stand for a while before turning right, while the generation without Prompt-to-Prompt stands still for a shorter time and turns left rather than right. However, the generation without Prompt-to-Prompt is actually more in line with the prompt.
(a) A person turns to his right and paces back and forth.
(b) A person paces back and forth. (with Prompt-to-Prompt)
(c) A person paces back and forth. (without Prompt-to-Prompt)
Figure 36: Generation results for "4. Action and Pose Edits: b. Removing or replacing actions." Prompt-to-Prompt changes the direction of movement from left-right to front-back. (This is reflected by the movement of the gray surface representing the ground.) This may be the factor that erases the "turns right".
(a) A person bends down doing something.
(b) A person bends down picking a box off the ground. (with Prompt-to-Prompt)
(c) A person bends down picking a box off the ground. (without Prompt-to-Prompt)
Figure 37: Generation results for "5. Interaction Edits: a. Interacting with objects." The timing of the body undulation in the Prompt-to-Prompt generation result is exactly the same as the original motion, which is not the case without Prompt-to-Prompt.
(a) A person bends down doing something.
(b) A person bends down greeting a kid. (with Prompt-to-Prompt)
(c) A person bends down greeting a kid. (without Prompt-to-Prompt)
Figure 38: Generation results for "5. Interaction Edits: b. Interacting with other characters." The generation without Prompt-to-Prompt walks forward, while both the original motion and the Prompt-to-Prompt generation stay in place.
(a) A person does a dance.
(b) A person does an excited dance. (with Prompt-to-Prompt)
(c) A person does an excited dance. (without Prompt-to-Prompt)
Figure 39: Generation results for "6. Emotional and Style Edits: a. Altering the emotional context or intention." The movements are complex and difficult to compare.
(a) A person does a dance.
(b) A person does a robot dance. (with Prompt-to-Prompt)
(c) A person does a robot dance. (without Prompt-to-Prompt)
Figure 40: Generation results for "6. Emotional and Style Edits: b. Changing the motion style or manner." The movements are complex and difficult to compare.
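All of the with/without comparisons above rest on the same Prompt-to-Prompt mechanism: the cross-attention maps recorded while denoising with the original prompt are injected when denoising with the edited prompt, which is why the timing and structure of the original motion tend to be preserved. A toy sketch of that attention injection (the dimensions and random tensors are illustrative placeholders, not MDM's actual implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v, injected_attn=None):
    """Toy cross-attention. With injected_attn (Prompt-to-Prompt),
    the source prompt's attention map replaces the freshly computed one."""
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    if injected_attn is not None:
        attn = injected_attn  # keep the original timing/structure
    return attn @ v, attn

rng = np.random.default_rng(0)
T, d, L = 8, 16, 4  # motion frames, feature dim, prompt tokens
q = rng.normal(size=(T, d))
k_src, v_src = rng.normal(size=(L, d)), rng.normal(size=(L, d))
k_edit, v_edit = rng.normal(size=(L, d)), rng.normal(size=(L, d))

# pass 1: denoise with the source prompt, recording its attention map
_, src_attn = cross_attention(q, k_src, v_src)

# pass 2: denoise with the edited prompt, injecting the recorded map
out_edit, _ = cross_attention(q, k_edit, v_edit, injected_attn=src_attn)
```

Because each motion frame keeps attending to prompt tokens with the original weights, edits change *what* is attended to (the new tokens' values) while *when* it is attended to stays fixed, consistent with the preserved timing observed in Figures 29, 30, and 37.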

Data filtering approach

Prompt-to-Prompt effects for "1. Speed and Duration Edits: a. Modifying speed or pacing."
Prompt-to-Prompt effects for "1. Speed and Duration Edits: b. Changing the duration of the motion."
Prompt-to-Prompt effects for "2. Direction and Orientation Edits: a. Changing direction."
Prompt-to-Prompt effects for "2. Direction and Orientation Edits: b. Adjusting orientation."
Prompt-to-Prompt effects for "3. Body Part and Joint Edits: a. Changing body parts involved."
Prompt-to-Prompt effects for "3. Body Part and Joint Edits: b. Adjusting joint angles."
Prompt-to-Prompt effects for "4. Action and Pose Edits: a. Introducing new actions or poses."
Prompt-to-Prompt effects for "4. Action and Pose Edits: b. Removing or replacing actions."
Prompt-to-Prompt effects for "5. Interaction Edits: a. Interacting with objects."
Prompt-to-Prompt effects for "5. Interaction Edits: b. Interacting with other characters."
Prompt-to-Prompt effects for "6. Emotional and Style Edits: a. Altering the emotional context or intention."
Prompt-to-Prompt effects for "6. Emotional and Style Edits: b. Changing the motion style or manner."