Chapter 5 Streak metric for tag-based referencing
For this work, we used the Streak metric to quantify the similarity between two tags when performing a tag-based reference. The Streak metric was originally proposed by (Downing 2015) At a high-level, this metric measures similarity between two bit strings in terms of the relationship between the lengths of the longest contiguously-matching and longest contiguously-mismatching substrings. That is, we XOR the two bit strings and count the longest substring of all 0’s to find the longest contiguously-matching substring and the longest substring of all 1’s to find the longest contiguously-mismatching substring. Our implementation of the Streak metric can be found in the Empirical library (Ofria et al. 2020). We provide a formal definition of the Streak metric below.
5.1 Formal definition
Each tag was represented as an ordered, fixed-length bit string,
\[ t = \langle t_0, t_1, t_2, \dots, t_{n-2}, t_{n-1} \rangle \]
where
\[ \forall i, t_i \in \{0, 1\}. \]
We define the greatest contiguously-matching length of \(n\)-long bitstrings \(t\) and \(u\) as follows,
\[ m(t, u) = \max(\{i - j \forall i, j \in 0..n-1 \mid \forall q \in i..j, t_q = u_q \}) \]
We define the greatest contiguously-mismatching length as follows,
\[ n(t, u) = \max(\{i - j \forall i, j \in 0..n-1 \mid \forall q \in i..j, t_q \neq u_q \}) \]
The streak metric \(S'\) with tags \(t\) and \(u\)
\[ S'(t, u) = \frac{ p(n(t,u)) }{p(m(t,u)) + p(n(t,u))}. \]
where \(p\) approximates the probability of a contiguously-matching substring of a given length between \(t\) and \(u\).
It is worth noting that the formula for computing the probability of a \(k\)-bit match or mismatch, given by Downing as follows, is actually mathematically flawed.
\[ p_k = \frac{n - k + 1}{2^k} \]
The probability of a \(0\)-bit match according to this formula would be computed as \(p_0 = \frac{n - 0 + 1}{2^0} = n + 1\) which is clearly impossible because \(p_0 > 1 \forall n > 0\). The actual probability can be computed using a lookup table that is generated using dynamic programming. However, the formula Downing presented provides a useful approximation to the probability of a \(k\) bit match. For computational efficiency and consistency with the existing literature we use clamp edge cases between 0 and 1 to yield the corrected streak metric \(S\).
\[ S(t, u) = \max( \min( S'(t, u), 1), 0) \]
References
Downing, Keith L. 2015. Intelligence Emerging: Adaptivity and Search in Evolving Neural Systems. Cambridge, Massachusetts: The MIT Press.
Ofria, Charles, Matthew Andres Moreno, Emily Dolson, Alexander Lalejini, Santiago Rodriguez-Papa, Jake Fenton, Katherine Perry, et al. 2020. “Empirical: A scientific software library for research, education, and public engagement.” Zenodo. https://doi.org/10.5281/zenodo.4141943.