Authors: Camila Barros
Date: 2024-05-13
This tutorial shows how to compare multiple annotations in ELAN (Wittenburg et al., 2006) using the multimodal-annotation-distance tool. The tool requires an annotated file in ELAN (.eaf) format and outputs a table with the reference tier and the compared tiers.
The tool multimodal-annotation-distance can be pulled from: https://github.com/JorgeFCS/multimodal-annotation-distance.git
Citation
To cite the tool: Barros, Camila A.; Ciprián-Sánchez, Jorge
Francisco; Mendes Santos, Saulo. (2024) A Tool for Determining
Distances and Overlaps between Multimodal Annotations. In
LREC-COLING 2024
The multimodal-annotation-distance tool lets you use one tier as a reference and compare other tiers against it to see how they are displaced in time relative to one another. The following scheme indicates the reference and comparison tiers.
In an annotation, we can have the following structure:
Figure 1: Tier structure
The tool provides the time displacement between the onset and offset of each unit on the reference tier and those of the units on the comparison tiers, shown by the dotted lines. Each of these tiers can be defined by the user, as can whether the comparison should account for the span of the reference tier (span comparison) or only the points in time of the offsets (point comparison).
We will use the span comparison in this tutorial. The span comparison takes the time spans of each annotation tier into consideration, comparing the boundaries of the reference tier (bold line in blue) to the onset and offset values of one or more compared tiers (dotted lines in orange and red). Looking at these boundaries, it is possible to infer whether the comparison tiers start before or after a given tier, how much they overlap (ratio), and the duration of the corresponding tiers. A time buffer can be used to determine whether reference and comparison tiers can be considered co-occurring even when they do not exactly overlap.
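To make the role of the buffer concrete, here is a minimal R sketch that applies the overlap formula of the span comparison (given later in this tutorial) with and without a buffer. The function name `buffered_overlap` and the toy boundaries are hypothetical and are not part of the tool. Reading a positive value as co-occurrence is an assumption of this sketch.

```r
# Hypothetical helper (not part of the tool): overlap in ms between a
# reference unit widened by `buffer` on both sides and a comparison unit.
buffered_overlap <- function(ref_start, ref_end, cmp_start, cmp_end, buffer = 0) {
  min(ref_end + buffer, cmp_end) - max(ref_start - buffer, cmp_start)
}

# Toy boundaries in ms: the comparison unit starts 50 ms after the reference ends.
no_buffer   <- buffered_overlap(1000, 2000, 2050, 2500, buffer = 0)
with_buffer <- buffered_overlap(1000, 2000, 2050, 2500, buffer = 100)

no_buffer    # -50: the units do not overlap
with_buffer  #  50: with a 100 ms buffer the units overlap by 50 ms
```

With a buffer of 0 the two units miss each other by 50 ms; widening the reference unit by 100 ms on each side is enough to make them overlap.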
If you want to use the point comparison, keep in mind that this function searches for the offset of the reference tier and outputs the closest match in the compared tiers. It is specifically designed to analyze items that have no span in time, such as apexes or pitch accents (when considered points in time).
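The idea of a closest match can be sketched in a few lines of R. This is only an illustration under stated assumptions: `closest_point` is a hypothetical helper, the values are toy data, closeness is taken as the smallest absolute time difference, and the sign convention (reference minus comparison) is borrowed from the span comparison formulas. The tool itself may define these details differently.

```r
# Hypothetical helper (not the tool's code): index of the comparison point
# that lies closest in time to the reference offset.
closest_point <- function(ref_offset, cmp_points) {
  which.min(abs(cmp_points - ref_offset))
}

# Toy values in ms: one reference offset and three candidate points (e.g., apexes).
ref_offset <- 8010
apexes     <- c(7400, 8150, 9300)

idx <- closest_point(ref_offset, apexes)
apexes[idx]               # 8150, the closest point
ref_offset - apexes[idx]  # -140: the point falls 140 ms after the reference offset
```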
It is important to highlight that this tool was created to compare tiers; it is not a validation tool for the transcriptions or a way to obtain inter-rater reliability measures. It is rather an enhancement to the pympi package (Lubbers and Torreira, 2013-2021), as it allows comparison with different tier types (including symbolic).
This tool can be used for a single file as well as for a batch. If you are running a batch, make sure that the tiers being compared have exactly the same names in all files; otherwise the tool will fail. The media files are not used at all.
In this case study, we look at the BGEST Corpus (Barros, 2021) to see how gesture phrases and strokes synchronize with non-terminal prosodic units of Parentheticals. We therefore take InfoStructure as the reference tier, search for the annotation value PAR, and compare it to the tiers GE-Phrase and GE-Phase. All this information is in the config_example_span.ini file.
The configuration file indicates which tiers you want to compare and how you want to compare them. It requires the following information:
- Path of the directory containing the .eaf files to analyze.
- Path to the directory in which to save the results.
- Name of the results file with extension (e.g., results.csv)
- Reference tier against which to perform the comparison.
- Value in the reference tier to search for.
- Mode of operation. Either span or point_comparison.
- Buffer time in ms (if you do not want a buffer time, use 0)
- Tiers to compare against the reference tier.
- Enter the comparison tiers separated by commas, with no spaces between them (e.g., tier1,tier2).
Assuming that you are working in RStudio, the first step is to install the requirements and run the tool in the Terminal tab (next to the Console, at the bottom of the screen). Before running the lines below, complete the following steps:
1. Clone the repository from https://github.com/JorgeFCS/multimodal-annotation-distance.git
2. Set the working directory to the folder containing the multimodal_annotation_distance.py file
3. Install the requirements in the RStudio Terminal tab
4. Run the tool using the command line below:
First, import the dataset.
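A minimal sketch of this step, assuming the tool wrote a comma-separated results file. The path below is a placeholder: replace it with the results directory and file name set in your own configuration file. The object name `span_PAR` matches the one used in the rest of this tutorial.

```r
# Placeholder path (assumption): use the directory and file name from your .ini file.
results_file <- file.path("results", "results.csv")

if (file.exists(results_file)) {
  # Read the tool's output into a data frame, keeping text columns as character.
  span_PAR <- read.csv(results_file, stringsAsFactors = FALSE)
  str(span_PAR)
} else {
  message("Results file not found: ", results_file)
}
```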
The dataset for the span comparison should look like this, with the following values:
1. InfoStructure_PAR_start_ms: onset of the searched annotation value on the reference tier in milliseconds;
2. InfoStructure_PAR_end_ms: offset of the searched annotation value on the reference tier in milliseconds;
3. InfoStructure_PAR_duration: duration of the searched annotation value on the reference tier in milliseconds. It is calculated as follows:
   ref_tier_duration = (ref_tier_end + buffer) - (ref_tier_start - buffer)
4. Buffer_ms: assigned buffer time;
5. Tier: compared tier, ordered first by relation to the reference tier, then by time;
6. Value: annotation values for each unit in the compared tiers;
7. Begin_ms: onset of the compared tier for each annotation value (listed in item 6);
8. End_ms: offset of the compared tier for each annotation value (listed in item 6);
9. Duration: duration of the compared tier for each annotation value (listed in item 6). It is obtained as follows:
   comparison_tier_duration = comparison_tier_end - comparison_tier_start
10. Overlap_time: the time in milliseconds for all cases in which there is an overlap between the reference tier and the compared tier. It is calculated as follows:
    overlap_time = min((ref_tier_end + buffer), comparison_tier_end) - max((ref_tier_start - buffer), comparison_tier_start)
11. Overlap_ratio: the proportion of the overlap time (rounded to three decimal places) in relation to the length of the reference value. It is obtained as follows:
    overlap_ratio = overlap_time / ref_tier_duration
12. Diff_start: the starting time of the comparison tier in relation to the start of the reference value, as follows:
    diff_start = (ref_tier_start - buffer) - comparison_tier_start
    For diff_start, a positive value means that the unit on the compared tier starts before the unit on the reference tier (considering the buffer time). A negative value means that the unit on the compared tier starts after the unit on the reference tier (considering the buffer time). A value of zero means that they match exactly.
13. Diff_end: the ending time of the comparison tier in relation to the end of the reference value, as follows:
    diff_end = (ref_tier_end + buffer) - comparison_tier_end
    For diff_end, a positive value means that the unit on the compared tier ends before the unit on the reference tier (considering the buffer time). A negative value means that the unit on the compared tier ends after the unit on the reference tier (considering the buffer time). A value of zero means that they match exactly.
As points cannot overlap with another unit, the point comparison mode outputs only the values (1) to (9).
str(span_PAR)
## 'data.frame': 40768 obs. of 13 variables:
## $ InfoStructure_PAR_start_ms: int 7660 7660 12059 12059 23185 23185 23185 23185 23185 23185 ...
## $ InfoStructure_PAR_end_ms : int 8010 8010 14017 14017 24876 24876 24876 24876 24876 24876 ...
## $ InfoStructure_PAR_duration: int 350 350 1958 1958 1691 1691 1691 1691 1691 1691 ...
## $ Buffer_ms : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Tier : chr "GE-Phrase" "GE-Phase" "GE-Phrase" "GE-Phase" ...
## $ Value : chr "203" "preparation" "205" "preparation" ...
## $ Begin_ms : int 7612 7612 13467 13467 23217 24642 23217 23638 24372 24642 ...
## $ End_ms : int 9471 8445 16449 14334 24642 25844 23638 24372 24642 25190 ...
## $ Duration : int 1859 833 2982 867 1425 1202 421 734 270 548 ...
## $ Overlap_time : int 350 350 550 550 1425 234 421 734 270 234 ...
## $ Overlap_ratio : num 1 1 0.281 0.281 0.843 0.138 0.249 0.434 0.16 0.138 ...
## $ Diff_start : int 48 48 -1408 -1408 -32 -1457 -32 -453 -1187 -1457 ...
## $ Diff_end : int -1461 -435 -2432 -317 234 -968 1238 504 234 -314 ...
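The formulas above can be checked by hand against the first and third rows of this output. The following R sketch recomputes the derived columns; `span_metrics` is a hypothetical helper written for this illustration, not a function of the tool.

```r
# Hypothetical helper: recompute the derived columns of the span comparison
# from the unit boundaries (all times in ms).
span_metrics <- function(ref_start, ref_end, cmp_start, cmp_end, buffer = 0) {
  ref_duration <- (ref_end + buffer) - (ref_start - buffer)
  overlap_time <- min(ref_end + buffer, cmp_end) - max(ref_start - buffer, cmp_start)
  c(ref_duration  = ref_duration,
    duration      = cmp_end - cmp_start,
    overlap_time  = overlap_time,
    overlap_ratio = round(overlap_time / ref_duration, 3),
    diff_start    = (ref_start - buffer) - cmp_start,
    diff_end      = (ref_end + buffer) - cmp_end)
}

# Row 1 of the output: PAR 7660-8010 vs. GE-Phrase "203" 7612-9471
row1 <- span_metrics(7660, 8010, 7612, 9471)
# Row 3 of the output: PAR 12059-14017 vs. GE-Phrase "205" 13467-16449
row3 <- span_metrics(12059, 14017, 13467, 16449)

row1  # overlap_time 350, overlap_ratio 1, diff_start 48, diff_end -1461
row3  # overlap_time 550, overlap_ratio 0.281, diff_start -1408, diff_end -2432
```

In row 1 the gesture phrase starts 48 ms before PAR and ends 1461 ms after it, so PAR is fully covered (ratio 1). In row 3 the gesture phrase starts 1408 ms after the onset of PAR, so only 550 ms of the 1958 ms unit are covered.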
The dataset might require some pre-processing, such as fixing mislabeled data or removing duplicates.
# pre-processing mislabeled data #
library(dplyr)  # provides %>%, mutate() and recode()
## replace "hold" with "stroke hold" for consistency ##
span_PAR <- span_PAR %>%
  mutate(Value = recode(Value, hold = "stroke hold"))
## fix duplicates ##
### Check for duplicated rows ###
duplicates <- duplicated(span_PAR)
### Remove duplicated rows, keeping only the first instance ###
span_nodup <- span_PAR[!duplicates, ]
Now that the data is clean of duplicates, we can focus on the analysis. We want to check the overlap ratio, i.e., the amount of overlap that strokes and gesture phrases have with the informational unit. In this dataset, the informational unit is delimited as a non-terminal unit.
# subsetting data frames #
## gesture phrases only ##
span_gephrase <- span_nodup %>%
  filter(Tier == "GE-Phrase")
## strokes only ##
span_stroke <- span_nodup %>%
  filter(Value == "stroke")
# dispersion #
## for the gesture phrase as a whole and for each phase value (preparation, stroke, retraction, stroke hold),
## compute the median (M) and standard deviation (SD) of the centered start times (CST) and centered end times (CET)
dispersion_phrase <- span_gephrase %>%
  summarise(Value = "Gesture phrase",
            Total = n(),
            CST_M = median(Diff_start),
            CST_SD = round(sd(Diff_start), 1),
            CET_M = median(Diff_end),
            CET_SD = round(sd(Diff_end), 1))
dispersion_stroke <- span_nodup %>%
  filter(Tier == "GE-Phase") %>%
  group_by(Value) %>%
  summarise(Total = n(),
            CST_M = median(Diff_start),
            CST_SD = round(sd(Diff_start), 1),
            CET_M = median(Diff_end),
            CET_SD = round(sd(Diff_end), 1))
## combine both summaries into one table and print it
(dispersion_combo <- rbind(dispersion_phrase, dispersion_stroke))
## Value Total CST_M CST_SD CET_M CET_SD
## 1 Gesture phrase 131 306.0 1178.8 -266.0 1117.4
## 2 preparation 55 -190.0 731.1 151.0 655.0
## 3 retraction 46 -386.5 665.3 157.5 667.9
## 4 stroke 85 -2.0 891.3 278.0 948.3
## 5 stroke hold 14 -425.0 754.9 333.0 641.6
The dispersion table shows whether the annotations on the compared tiers (gesture phrases and gesture phases, including strokes) start before or after the selected annotation on the reference tier. As CST_M is positive and CET_M is negative for the gesture phrase, we can infer that it tends to start before and end after the Parenthetical unit. The stroke, on the other hand, shows the opposite pattern, which makes sense as it is included in the gesture phrase.
These values also show high variance: all standard deviations are above 640 ms, whereas the informational unit studied, PAR, has a median duration of 1250 ms (SD = 547.95). This indicates that the variation can span up to half of the analyzed unit and three times the value of 200 ms normally assumed in the literature (Loehr, 2004) as the degree of mismatch between a prosodic and a gestural event.
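The two proportions mentioned here can be verified with a short calculation. The sketch below uses only numbers reported in the dispersion table and in the text; the variable names are chosen for this illustration.

```r
# Values taken from the dispersion table and the text above (all in ms).
smallest_sd <- 641.6  # smallest SD in the table (CET_SD of "stroke hold")
par_median  <- 1250   # median duration of PAR
literature  <- 200    # mismatch assumed in the literature (Loehr, 2004)

round(smallest_sd / par_median, 2)  # 0.51: about half of the PAR unit
round(smallest_sd / literature, 1)  # 3.2: about three times the 200 ms value
```

Even the smallest standard deviation in the table therefore covers roughly half of a typical PAR unit.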
Another way to see this is a scatterplot of the centered starting time of the comparison tiers against the overlap ratio (Figure 2). The idea is that the higher the overlap, the tighter the connection between the reference tier and the comparison tier. The points are color coded by quartile of the centered starting time: purple represents the second and third quartiles, and grey the first and fourth. Points on the black horizontal line (value 0) indicate that the starting point of the gesture phrase (left graph) or stroke (right graph) matches the onset of PAR exactly. The blue vertical line marks an overlap ratio of 50%.
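# based on quantiles #
library(ggplot2)    # ggplot() and related functions
library(viridis)    # viridis_pal() (also available in the scales package)
library(gridExtra)  # grid.arrange()
# Generate 50 colours from the Viridis "D" palette; the hex codes used in the plots were picked from it
selected_colors <- viridis_pal(option = "D")(50)
# quantile: "#440154FF"; line: "#2D718EFF"; outlier: "#FDE725FF"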
# calculate the quantiles for phrases
quants <- quantile(span_gephrase$Diff_start, c(0.25, 0.75))
span_gephrase$quant <- with(span_gephrase, factor(ifelse(Diff_start < quants[1], 0,
                                                  ifelse(Diff_start < quants[2], 1, 2))))
# scatter phrases
point_phrase <- ggplot(span_gephrase, aes(Overlap_ratio, Diff_start)) +
  geom_point(aes(colour = quant), alpha = .8, shape = "triangle", size = 2.4) +
  scale_colour_manual(values = c("#8B939C", "#440154FF", "#8B939C"),
                      labels = c("0-0.25", "0.25-0.75", "0.75-1")) +
  theme_bw() +
  theme(legend.position = "top", plot.title = element_text(hjust = 0.5)) +
  labs(title = "",
       x = "Overlap ratio for gesture phrase", y = "Centered starting time (ms)",
       color = "Quantiles") +
  geom_vline(xintercept = .5, color = "darkblue", linewidth = .5) +
  geom_hline(yintercept = 0, color = "#151515", linewidth = .5)
# calculate the quantiles for strokes
quants <- quantile(span_stroke$Diff_start, c(0.25, 0.75))
span_stroke$quant <- with(span_stroke, factor(ifelse(Diff_start < quants[1], 0,
                                              ifelse(Diff_start < quants[2], 1, 2))))
# scatter strokes
point_stroke <- ggplot(span_stroke, aes(Overlap_ratio, Diff_start)) +
  geom_point(aes(colour = quant), alpha = .7, shape = "triangle", size = 2.4) +
  scale_colour_manual(values = c("#8B939C", "#440154FF", "#8B939C"),
                      labels = c("0-0.25", "0.25-0.75", "0.75-1")) +
  theme_bw() +
  theme(legend.position = "top", plot.title = element_text(hjust = 0.5)) +
  labs(title = "",
       x = "Overlap ratio for strokes", y = "Centered starting time (ms)",
       color = "Quantiles") +
  geom_vline(xintercept = .5, color = "darkblue", linewidth = .5) +
  geom_hline(yintercept = 0, color = "#151515", linewidth = .5)
grid.arrange(point_phrase, point_stroke, ncol = 2, nrow = 1)
Figure 2: Overlap ratio by centered times for gesture phrase and strokes
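The nested ifelse() calls in the plotting code assign each point to one of three classes: below the first quartile (0), from the first quartile up to the third (1), and at or above the third quartile (2). As a minimal sketch, the same coding can be reproduced with base R's findInterval(). The vector below reuses the ten Diff_start values printed by str() earlier and serves only as an illustration, not as a result.

```r
# Illustration only: the ten Diff_start values shown in the str() output.
diff_start <- c(48, 48, -1408, -1408, -32, -1457, -32, -453, -1187, -1457)

# First and third quartiles, as in the plotting code.
quants <- quantile(diff_start, c(0.25, 0.75))

# Coding used in the tutorial: 0 = below Q1, 1 = Q1 up to Q3, 2 = Q3 and above.
nested <- ifelse(diff_start < quants[1], 0,
                 ifelse(diff_start < quants[2], 1, 2))

# Equivalent coding with findInterval().
compact <- findInterval(diff_start, quants)

table(nested)
all(nested == compact)  # TRUE
```

Classes 0 and 2 are the ones drawn in grey, and class 1 is the one drawn in purple.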
What occurs with the onset alignment of PAR and the gesture-phrase containing the stroke? It would seem logical to assume that if a gesture-phrase overlaps more with PAR, it would also overlap more with its corresponding stroke.
Looking at gesture-phrases that contain the stroke in relation to the overlap ratio relative to PAR, there is only a weak correlation between the onset alignment of PAR and the gesture phrase. Although the direction of the correlation matches our expectations (the higher the overlap, the smaller the centered onset time), its strength does not. As indicated in Table 1, gesture phrases often begin before and end after the overlapping PAR unit. This trend can be examined in Figure 3.
### correlations
library(ggpubr)  # provides ggscatter()
## overlap_ratio and start diff
# for stroke
stroke_start <- ggscatter(span_stroke, x = "Overlap_ratio", y = "Diff_start",
                          add = "reg.line", conf.int = TRUE,
                          cor.coef = TRUE, cor.method = "pearson",
                          xlab = "Overlap_ratio stroke", ylab = "Diff_start ms")
## overlap_ratio and end diff
# for stroke
stroke_end <- ggscatter(span_stroke, x = "Overlap_ratio", y = "Diff_end",
                        add = "reg.line", conf.int = TRUE,
                        cor.coef = TRUE, cor.method = "pearson",
                        xlab = "Overlap_ratio stroke", ylab = "Diff_end ms")
grid.arrange(stroke_start, stroke_end, ncol = 2, nrow = 1)
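The correlation coefficient printed on each panel by ggscatter() can also be obtained numerically. Below is a minimal sketch with base R's cor.test(), run on the ten rows shown in the str() output purely to illustrate the call. These rows mix tiers, so the result is not the one discussed in the text. On the real data, pass span_stroke$Overlap_ratio and span_stroke$Diff_start instead.

```r
# Illustration only: the first ten rows printed by str(), not the stroke subset.
overlap_ratio <- c(1, 1, 0.281, 0.281, 0.843, 0.138, 0.249, 0.434, 0.16, 0.138)
diff_start    <- c(48, 48, -1408, -1408, -32, -1457, -32, -453, -1187, -1457)

# Pearson correlation with its significance test.
res <- cor.test(overlap_ratio, diff_start, method = "pearson")
res$estimate  # Pearson's r
res$p.value   # p-value of the test
```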
One question remains: what degree of overlap signifies coordination between a gesture-phrase and the IU? For gesture-phrases, the answer appears straightforward: nearly all gesture-phrases that start before PAR's onset can be distinguished by a 50% overlap ratio (1st quadrant of Figure 2). However, this pattern is less evident for strokes. It would be interesting to assess the alignment of strokes (or their central points) with lower-level prosodic events. This investigation could shed light on why apexes tend to correlate with pitch accents. Nonetheless, this does not necessarily imply that a time buffer is needed to establish synchronicity.
In conclusion, PAR appears to prompt some level of coordination between gesture and speech boundaries, suggesting that the assumption of phonological synchronicity arising from semantic/pragmatic synchronicity might hold true. It’s also noteworthy that while existing literature suggests the need for a buffer time to analyze speech-to-gesture coordination, the data presented thus far have not incorporated any buffer. We contend that the dispersion of boundaries is better elucidated in terms of the overlap ratio rather than relying on a buffer time, which could introduce arbitrary elements into the analysis. For a more in-depth discussion of the data presented here, please refer to Barros et al. (2024).