Authors: Camila Barros
Date: 2024-05-13
This tutorial shows how to compare multiple annotations in ELAN (Wittenburg et al., 2006) using the multimodal-annotation-distance tool. It requires an annotated file in ELAN (.eaf) format and outputs a table with the reference tier and the compared tiers.
The tool multimodal-annotation-distance can be pulled from: https://github.com/JorgeFCS/multimodal-annotation-distance.git
Citation
To cite the tool: Barros, C. A., Ciprián-Sánchez, J. F., & Santos, S. M. (2024). A Tool for Determining Distances and Overlaps between Multimodal Annotations. In N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, & N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) (pp. 1705–1714). ELRA and ICCL. https://aclanthology.org/2024.lrec-main.150
The multimodal-annotation-distance tool allows you to use one tier as a reference against which to compare other tiers, showing how they are displaced in time relative to one another. The following scheme indicates the reference and comparison tiers.
In an annotation, we can have the following structure:
The tool provides the time displacement between the onsets and offsets of each boundary of the reference tier and those of the comparison tiers, shown as dotted lines. Each of these tiers can be defined by the user, as can whether the comparison should account for the span of the reference tier (span comparison) or only the points in time of the offsets (point comparison).
We will use the span comparison in this tutorial. The span comparison takes the time spans of each annotation tier into consideration, comparing the values of the boundaries of the reference tier (bold line in blue) to the onset and offset values of one or more compared tiers (dotted lines in orange and red). Looking at these boundaries, it is possible to infer whether the comparison tiers start before or after a given tier, how much they overlap (ratio), and the duration of the corresponding tiers. A time buffer can be used to determine whether reference and comparison tiers can be considered co-occurring even when they do not exactly overlap.
In case you want to use the point comparison, keep in mind that this function searches for the offset of the reference tier and outputs the closest match in the compared tiers. It is especially designed to analyze items that do not have a span in time, such as apexes or pitch accents (when considered as points in time).
It is important to highlight that this tool was created to compare tiers; it is not a validation tool for transcriptions or a way to obtain inter-rater reliability measures. It is rather an enhancement to the pympi package (Lubbers & Torreira, 2013-2021), as it allows comparison across different tier types (including symbolic ones).
This tool can be used on a single file as well as on a batch. If you are using it on a batch, make sure that all files use the exact same name for each tier taken into consideration; otherwise the tool will fail. The media files are not used at all.
In this case study, we will look at the BGEST Corpus (Barros, 2021) to see how gesture phrases and strokes synchronize to non-terminal prosodic units of Parentheticals. We therefore set the reference tier to InfoStructure with the searched annotation value PAR, and compared it to the tiers GE-Phrase and GE-Phase. All this information is in the config_example_span.ini file.
The configuration file indicates which tiers you want to compare and how you want to compare them. It requires the following information:
- Path of the directory containing the .eaf files to analyze.
- Path to the directory in which to save the results.
- Name of the results file, with extension (e.g., results.csv).
- Reference tier against which to perform the comparison.
- Value in the reference tier to search for.
- Mode of operation: either span or point_comparison.
- Buffer time in ms (if you do not want a buffer time, use 0).
- Tiers to compare against the reference tier: enter them separated by commas, with no spaces between them (e.g., tier1,tier2).
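Under the settings above, a span-comparison configuration for this case study might look like the sketch below. The key names here are illustrative assumptions, not the tool's exact syntax; refer to the config_example_span.ini file shipped with the repository for the real format.

```ini
; Hypothetical sketch of a span-comparison configuration.
; The actual key names may differ -- check config_example_span.ini in the repo.
input_dir = ./eaf_files
output_dir = ./results
results_file = results.csv
reference_tier = InfoStructure
reference_value = PAR
mode = span
buffer_ms = 0
comparison_tiers = GE-Phrase,GE-Phase
```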
Assuming that you are running this in RStudio, the first step is to install the requirements and run the tool in the Terminal tab (next to the Console, at the bottom of the screen). Before running the lines below, take the following steps:
1. Clone the repository from https://github.com/JorgeFCS/multimodal-annotation-distance.git
2. Set the directory to the folder containing the multimodal_annotation_distance.py file.
3. Install the requirements in the RStudio Terminal tab.
4. Run the tool using the command line below (the output will be long).
## Running the tool from the terminal
# STEP 1: clone the repository
git clone https://github.com/JorgeFCS/multimodal-annotation-distance.git
# STEP 2: change into the folder containing multimodal_annotation_distance.py
cd multimodal-annotation-distance
# STEP 3: install the requirements
pip install -r requirements.txt
# STEP 4: run the tool -- it can take a while, and the output will be long
python multimodal_annotation_distance.py --config-file-path ./config_example_span.ini
First, import the necessary libraries and load the dataset.
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from scipy.stats import shapiro
from scipy.stats import pearsonr
# Load datasets
span_PAR = pd.read_csv("bgest_spancomparison.csv")
# Note: the tool does not output the data directly in tidy format.
The dataset for the span comparison should look like this, with the following values:
1. InfoStructure_PAR_start_ms: onset of the searched annotation value on the reference tier, in milliseconds;
2. InfoStructure_PAR_end_ms: offset of the searched annotation value on the reference tier, in milliseconds;
3. InfoStructure_PAR_duration: duration of the searched annotation value on the reference tier, in milliseconds. It is calculated as follows:\
ref_tier_duration = (ref_tier_end + buffer) - (ref_tier_start - buffer)
4. Buffer_ms: assigned buffer time;
5. Tier: compared tier, ordered first by relation to the reference tier and then by time;
6. Value: annotation values for each unit in the compared tiers;
7. Begin_ms: onset of the compared tier for each annotation value (listed in item 6);
8. End_ms: offset of the compared tier for each annotation value (listed in item 6);
9. Duration: duration of the compared tier for each annotation value (listed in item 6). It is obtained as follows:\
comparison_tier_duration = comparison_tier_end - comparison_tier_start
10. Overlap_time: the time in milliseconds for all cases in which there is an overlap between the reference tier and the compared tier. It is calculated as follows:\
overlap = min((ref_tier_end + buffer), comparison_tier_end) - max((ref_tier_start - buffer), comparison_tier_start)
11. Overlap_ratio: the proportion of the overlap time (rounded to three decimal places) relative to the length of the reference value. It is obtained as follows:\
overlap_ratio = overlap_time / ref_tier_duration
12. Diff_start: the starting time of the comparison tier relative to the start of the reference value:\
diff_start = (ref_tier_start - buffer) - comparison_tier_start\
A positive value means that the unit on the compared tier starts before the unit on the reference tier (considering the buffer time); a negative value means that it starts after; a value of zero means the boundaries match exactly.
13. Diff_end: the ending time of the comparison tier relative to the end of the reference value:\
diff_end = (ref_tier_end + buffer) - comparison_tier_end\
A positive value means that the unit on the compared tier ends before the unit on the reference tier (considering the buffer time); a negative value means that it ends after; a value of zero means the boundaries match exactly.
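As a quick sanity check, the formulas above can be traced on hypothetical boundary values (the numbers below are illustrative, not taken from the BGEST data):

```python
# Hypothetical boundaries in ms: a reference unit (e.g., PAR) and one
# comparison unit (e.g., a gesture phrase), with no buffer.
ref_start, ref_end = 1000, 2250
cmp_start, cmp_end = 800, 2100
buffer_ms = 0

ref_duration = (ref_end + buffer_ms) - (ref_start - buffer_ms)    # 1250
overlap = (min(ref_end + buffer_ms, cmp_end)
           - max(ref_start - buffer_ms, cmp_start))               # 1100
overlap_ratio = round(overlap / ref_duration, 3)                  # 0.88
diff_start = (ref_start - buffer_ms) - cmp_start                  # 200 -> starts before
diff_end = (ref_end + buffer_ms) - cmp_end                        # 150 -> ends before

print(ref_duration, overlap, overlap_ratio, diff_start, diff_end)
```

With these values, the comparison unit starts 200 ms before and ends 150 ms before the reference unit, overlapping 88% of it.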
As points cannot overlap with another unit, the point comparison mode outputs only values (1) to (9).
## inspecting the dataframe
span_PAR.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40768 entries, 0 to 40767
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   InfoStructure_PAR_start_ms  40768 non-null  int64
 1   InfoStructure_PAR_end_ms    40768 non-null  int64
 2   InfoStructure_PAR_duration  40768 non-null  int64
 3   Buffer_ms                   40768 non-null  int64
 4   Tier                        40768 non-null  object
 5   Value                       40768 non-null  object
 6   Begin_ms                    40768 non-null  int64
 7   End_ms                      40768 non-null  int64
 8   Duration                    40768 non-null  int64
 9   Overlap_time                40768 non-null  int64
 10  Overlap_ratio               40768 non-null  float64
 11  Diff_start                  40768 non-null  int64
 12  Diff_end                    40768 non-null  int64
dtypes: float64(1), int64(10), object(2)
memory usage: 4.0+ MB
The data might require some pre-processing, such as fixing mislabeled values or removing duplicates.
# Pre-processing
# Replace "hold" with "stroke hold" for consistency
span_PAR["Value"] = span_PAR["Value"].replace("hold", "stroke hold")
# Fix duplicates
span_nodup = span_PAR.drop_duplicates()
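The effect of these two steps can be verified on a toy frame (hypothetical values, not the BGEST data): replace() relabels the exact value "hold", and drop_duplicates() removes fully identical rows.

```python
import pandas as pd

# Toy frame with a mislabeled value ("hold") and one fully duplicated row
toy = pd.DataFrame({
    "Tier": ["GE-Phase", "GE-Phase", "GE-Phase"],
    "Value": ["hold", "stroke", "stroke"],
    "Begin_ms": [100, 300, 300],
})

toy["Value"] = toy["Value"].replace("hold", "stroke hold")
toy_nodup = toy.drop_duplicates()

print(toy_nodup["Value"].tolist())  # ['stroke hold', 'stroke']
print(len(toy), len(toy_nodup))     # 3 2
```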
Now that the data is clean of duplicates, we can focus on the analysis. We want to check the overlap ratio, i.e., the amount of overlap that strokes and gesture phrases have with the informational unit. In this dataset, the informational unit is delimited as non-terminal units.
# Dispersion
# Gesture phrase dispersion
dispersion_phrase = span_nodup[span_nodup["Tier"] == "GE-Phrase"].groupby("Tier").agg(
Total=("Value", "count"),
CST_M=("Diff_start", "median"),
CST_SD=("Diff_start", np.std),
CET_M=("Diff_end", "median"),
CET_SD=("Diff_end", np.std)
)
# Stroke dispersion
dispersion_stroke = span_nodup[span_nodup["Tier"] == "GE-Phase"].groupby("Value").agg(
Total=("Value", "count"),
CST_M=("Diff_start", "median"),
CST_SD=("Diff_start", np.std),
CET_M=("Diff_end", "median"),
CET_SD=("Diff_end", np.std)
)
dispersion_combo = pd.concat([dispersion_phrase, dispersion_stroke])
print(dispersion_combo)
             Total  CST_M       CST_SD  CET_M       CET_SD
GE-Phrase      131  306.0  1178.784939 -266.0  1117.386280
preparation     55 -190.0   731.120830  151.0   655.008069
retraction      46 -386.5   665.310946  157.5   667.886819
stroke          85   -2.0   891.311666  278.0   948.343719
stroke hold     14 -425.0   754.864346  333.0   641.598386
In this dispersion table we can see whether the compared tier annotations, gesture phrases and strokes, start before or after the selected annotation on the reference tier. As CST_M is positive and CET_M is negative for GE-Phrase, we can infer that the gesture phrase tends to start before and end after the Parenthetical unit. The stroke, on the other hand, shows the opposite pattern, which makes sense as it is contained within the gesture phrase.
These values also show high variance: all standard deviations are above 640 ms, while the informational unit studied, PAR, has a median duration of 1250 ms (SD = 547.95). This indicates variation that can span up to half of the analyzed unit and three times the 200 ms normally assumed in the literature (Loehr, 2004) as an acceptable degree of mismatch between a prosodic and a gestural event.
Another way to see this is a scatterplot of the starting time of the comparison tiers (centered values) against the overlap ratio (Figure 2). The idea behind this is that the higher the overlap, the tighter the connection between the reference tier and the comparison tier. Here the dispersion is color coded for gesture phrases (blue) and strokes (orange). Although the distributions differ, they show similar tendencies, as the regression lines superimposed on the graph indicate.
# Visualizing overlap ratio and centered starting times for gesture phrases and strokes
span_gephrase = span_nodup[span_nodup["Tier"] == "GE-Phrase"]
sns.regplot(x=span_gephrase["Overlap_ratio"], y=span_gephrase["Diff_start"])
span_stroke = span_nodup[span_nodup["Value"] == "stroke"]
sns.regplot(x=span_stroke["Overlap_ratio"], y=span_stroke["Diff_start"])
<Axes: xlabel='Overlap_ratio', ylabel='Diff_start'>
What occurs with the onset alignment of PAR and the gesture-phrase containing the stroke? It would seem logical to assume that if a gesture-phrase overlaps more with PAR, it would also overlap more with its corresponding stroke.
Looking at gesture phrases that contain the stroke, there is only a weak correlation between the onset alignment of PAR and the gesture phrase and the overlap ratio relative to PAR. Although the correlation's direction aligns with our expectations (the higher the overlap, the smaller the centered onset time), its strength does not. As indicated in Table 1, gesture phrases often begin before and end after the overlapping PAR unit. This trend can be examined in Figure 3.
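The strength of this relation can be quantified with pearsonr (already imported above), e.g. pearsonr(span_stroke["Overlap_ratio"], span_stroke["Diff_start"]). As a self-contained sketch on a few hypothetical toy values (not the BGEST data), Pearson's r can also be read off numpy's correlation matrix:

```python
import numpy as np

# Hypothetical toy values: overlap ratios and centered onset times (ms)
overlap_ratio = np.array([0.2, 0.5, 0.8, 0.3, 0.9, 0.6])
diff_start = np.array([400.0, -100.0, -300.0, 600.0, -500.0, 200.0])

# Pearson r via the correlation matrix; scipy.stats.pearsonr would
# additionally return a p-value
r = np.corrcoef(overlap_ratio, diff_start)[0, 1]
print(round(r, 3))  # -0.906
```

Before relying on Pearson's r on the real data, shapiro (also imported above) could be used to check the normality of each variable.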
# Correlation of alignment of onsets and overlap ratio for gesture phrases and strokes
sns.regplot(x=span_stroke["Overlap_ratio"], y=span_stroke["Diff_start"])
sns.regplot(x=span_stroke["Overlap_ratio"], y=span_stroke["Diff_end"])
<Axes: xlabel='Overlap_ratio', ylabel='Diff_end'>
One lingering question remains: what degree of overlap signifies coordination between a gesture phrase and the IU? For gesture phrases, the answer appears straightforward: nearly all gesture phrases that start before PAR's onset can be distinguished by a 50% overlap ratio (1st quadrant of Figure 2). However, this pattern is less evident for strokes. It would be intriguing to assess the alignment of strokes (or their central points) with lower-level prosodic events. This investigation could shed light on why apexes tend to correlate with pitch accents. Nonetheless, this does not necessarily imply that a time buffer is needed to establish synchronicity.
In conclusion, PAR appears to prompt some level of coordination between gesture and speech boundaries, suggesting that the assumption of phonological synchronicity arising from semantic/pragmatic synchronicity might hold true. It's also noteworthy that while existing literature suggests the need for a buffer time to analyze speech-to-gesture coordination, the data presented thus far have not incorporated any buffer. We contend that the dispersion of boundaries is better elucidated in terms of the overlap ratio rather than relying on a buffer time, which could introduce arbitrary elements into the analysis. For a more in-depth discussion of the data presented here, please refer to Barros et al. (2024).