Pose Clustering for Martial Arts Action Recognition: the case studies of Kata and Tai Chi

paper, specified "short paper"
  1. Valerio Marsocci

    Sapienza Università di Roma (Sapienza University of Rome)

  2. Lorenzo Lastilla

    Sapienza Università di Roma (Sapienza University of Rome)

The origins of martial arts are lost in the mists of time: since time immemorial, humankind has sought methods of combat for self-defense and for other purposes. This millennia-old tradition has sparked great interest in these techniques in mass culture, often leading to false beliefs derived from myths without valid sources (Bowman, 2016).

To address this issue, since the end of the last century academia has taken a growing interest in the martial arts, studying these disciplines from increasingly varied perspectives, such as the performing arts, anthropology, and history. At the same time, the need for a comprehensive and consistent definition of "martial art" has grown.

According to the Japanese model (Green, 2010), “martial arts are [...] systems that blend the physical components of combat with strategy, philosophy, tradition, or other features, thereby distinguishing them from pure physical reaction”. This definition implies that a fighting technique can be considered a martial art only when it is codified.

In this work, we try to formalize this property in mathematical terms, so that the study of martial arts can be quantitative as well as qualitative. In particular, we consider the enhancement of artistic image understanding through computer vision and machine learning techniques, with particular reference to the estimation and proper representation of human poses in artworks (Impett et al., 2016).

These methodologies are well suited to the study of martial arts, since these arts are based on codified movements and postures of the body. Tools and technologies that facilitate the monitoring, representation and automatic analysis of human poses and movements during the execution of these arts could therefore bring several benefits to their practice.

Thus, we investigate the movement and codification of Kata and Tai Chi by performing action recognition on martial arts action sequences via human pose clustering, following Aby Warburg's concept of Pathosformel, which describes the representation of emotion, movement, and passion through a repeatable visual paradigm or formula (Impett et al., 2016).

The poses are modeled as 2D skeletons and are defined as sets of 15 points connected by limbs. To perform our analysis, we make use of some of the Tai Chi and Kata sequences from the MADS dataset (Zhang et al., 2017).
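
As an illustration of the skeleton representation described above, the following minimal sketch turns one 15-point 2D skeleton into a vector of limb orientations. The joint ordering, the `LIMBS` list and the function name `limb_angles` are hypothetical choices made for this example; the text does not specify how the joints are ordered or which pairs are connected.

```python
import numpy as np

# Hypothetical limb list: pairs of joint indices into a 15-point skeleton.
# The joint ordering actually used by the authors is not given in the text.
LIMBS = [
    (0, 1),                        # head -> neck
    (1, 2), (2, 3), (3, 4),        # right arm: shoulder, elbow, wrist
    (1, 5), (5, 6), (6, 7),        # left arm: shoulder, elbow, wrist
    (1, 8),                        # neck -> pelvis
    (8, 9), (9, 10), (10, 11),     # right leg: hip, knee, ankle
    (8, 12), (12, 13), (13, 14),   # left leg: hip, knee, ankle
]


def limb_angles(keypoints):
    """Return the orientation of each limb, in radians within [-pi, pi]."""
    keypoints = np.asarray(keypoints, dtype=float)
    if keypoints.shape != (15, 2):
        raise ValueError("expected a (15, 2) array of 2D joint coordinates")
    starts = keypoints[[a for a, _ in LIMBS]]
    ends = keypoints[[b for _, b in LIMBS]]
    deltas = ends - starts
    # arctan2(dy, dx) gives the direction of each limb vector.
    return np.arctan2(deltas[:, 1], deltas[:, 0])
```

Limb orientations do not change when the skeleton is translated or uniformly scaled, which makes an angle-based description convenient for comparing poses of performers filmed at different positions or distances.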

In particular, based on the theoretical framework established in (Marsocci et al., 2021), we carry out two series of experiments: pose clustering, which segments the whole video sequence into intervals corresponding to a given set of canonical poses; and “pose spotting”, which is based on a similarity assessment with respect to canonical poses.

Pose Clustering

We selected the “Qishi” (or “Commencing Form”) sequence from the first Tai Chi progression in the MADS dataset, and we clustered the poses extracted from each frame via angular K-Medians (with 3 clusters), based on the poses that encode the sequence (Fig. 1). In particular, starting from the assumption that each sequence of actions can be defined as a succession of very specific poses, we used the standard representations of these poses as predefined centroids of a partition of all the poses assumed during the sequence. As visible in Fig. 1, the obtained clusters are well ordered and easily identifiable.
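
The clustering step can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes each pose is already a vector of limb angles in radians, uses an L1 distance on wrapped angular differences, and approximates the circular median by taking the median of deviations from the current centroid. The function names are hypothetical.

```python
import numpy as np


def angular_diff(a, b):
    """Signed smallest difference a - b between angles, wrapped to [-pi, pi)."""
    return (a - b + np.pi) % (2 * np.pi) - np.pi


def assign(poses, centroids):
    """Label each pose with the centroid at the smallest L1 angular distance."""
    # dists[i, k] = sum over limbs of |wrapped difference| between pose i and centroid k
    dists = np.abs(angular_diff(poses[:, None, :], centroids[None, :, :])).sum(axis=2)
    return dists.argmin(axis=1)


def angular_k_medians(poses, init_centroids, n_iter=10):
    """K-Medians on angle vectors, seeded with predefined (canonical) centroids."""
    poses = np.asarray(poses, dtype=float)
    centroids = np.array(init_centroids, dtype=float)
    labels = assign(poses, centroids)
    for _ in range(n_iter):
        for k in range(len(centroids)):
            members = poses[labels == k]
            if len(members) == 0:
                continue  # keep the canonical pose if no frame was assigned to it
            # Assumed approximation of the circular median: per-limb median of
            # the wrapped deviations from the current centroid.
            shift = np.median(angular_diff(members, centroids[k]), axis=0)
            centroids[k] = angular_diff(centroids[k] + shift, 0.0)
        new_labels = assign(poses, centroids)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids
```

Seeding the centroids with the canonical poses, rather than at random, means each resulting cluster keeps a direct interpretation as "frames closest to canonical pose k", which is what allows the sequence to be segmented into ordered intervals.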

Fig. 1 - Canonical poses (first line) and sequence clustered according to angular K-Medians (second line).

Similarity assessment

For the pose spotting task, we focused on the “Qishi” and “Buddha’s Warrior Attendant Pounds Mortar (I)” (i.e. “Jin Gang Dao Dui (1)”) sequence from the first Tai Chi progression in the MADS dataset, made up of 800 frames (i.e. poses). We then selected two canonical poses and computed the distance between each of them and all the other poses in the sequence. As Figs. 2 and 3 show, the frames most similar to the query pose can be correctly detected by setting a suitable threshold on the similarity score.
Similar experiments are shown in Figs. 4 and 5 for the Kata “Fukyugata Ni” sequence (made up of 600 poses). Here an interesting result emerges: the higher thresholds set for the selected poses indicate that these poses encode very precise movements that are not found in the rest of the sequence.
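
A minimal sketch of pose spotting by thresholding is given below. It assumes that poses are vectors of limb angles in radians and that the score is the mean absolute wrapped angular difference; the text does not specify the distance actually used, so the thresholds quoted in the figure captions are not directly comparable with this score. The function names are hypothetical.

```python
import numpy as np


def pose_distance(sequence, query):
    """Mean absolute wrapped angular difference between each frame and the query."""
    sequence = np.asarray(sequence, dtype=float)
    query = np.asarray(query, dtype=float)
    # Wrap differences to [-pi, pi) so that angles near +pi and -pi count as close.
    diff = (sequence - query + np.pi) % (2 * np.pi) - np.pi
    return np.abs(diff).mean(axis=1)


def spot_pose(sequence, query, threshold):
    """Return the indices of frames whose distance to the query is at most threshold."""
    return np.flatnonzero(pose_distance(sequence, query) <= threshold)
```

For example, `spot_pose(sequence, canonical_pose, 0.1)` would return the frame indices at which the performer is within the chosen tolerance of the canonical pose, where `sequence` is an array with one row of limb angles per frame.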

Fig. 2 - Pose spotting (Tai Chi), pose 1: the frames corresponding to the query pose are correctly detected for a threshold of 0.01.

Fig. 3 - Pose spotting (Tai Chi), pose 2: the frames corresponding to the query pose are correctly detected for a threshold of 0.1.

Fig. 4 - Pose spotting (Kata), pose 1: the frames corresponding to the query pose are correctly detected for a threshold of 0.1.

Fig. 5 - Pose spotting (Kata), pose 2: the frames corresponding to the query pose are correctly detected for a threshold of 0.1.


Given these results, the downstream applications of our framework for martial arts are readily apparent: it could be used to compare a performed sequence with the ideal sequence of gestures, and to compare individual poses with the codified poses, in order to correct possible errors during the execution of a progression.


Bowman, Paul (2016). "Making martial arts history matter." The International Journal of the History of Sport 33.9: 915-933.

Green, Thomas A. (2010). Martial arts of the world: an encyclopedia of history and innovation. Vol. 2. ABC-CLIO.

Impett, Leonardo, and Sabine Süsstrunk (2016). "Pose and pathosformel in Aby Warburg’s bilderatlas." Springer, Cham.

Marsocci, Valerio, and Lorenzo Lastilla (2021). "POSE-ID-on—A Novel Framework for Artwork Pose Clustering." ISPRS International Journal of Geo-Information 10.4: 257.

Zhang, Weichen, et al. (2017). "Martial arts, dancing and sports dataset: A challenging stereo and multi-view dataset for 3d human pose estimation." Image and Vision Computing 61: 22-39.

Conference Info

In review

ADHO - 2022
"Responding to Asian Diversity"

Tokyo, Japan

July 25, 2022 - July 29, 2022

Held in Tokyo and remote (hybrid) on account of COVID-19

Conference website: https://dh2022.adho.org/

Contributors: Scott B. Weingart, James Cummings

Series: ADHO (16)

Organizers: ADHO