Chapter 5 Streak metric for tag-based referencing
For this work, we used the Streak metric to quantify the similarity between two tags when performing a tag-based reference. The Streak metric was originally proposed by (Downing 2015) At a high-level, this metric measures similarity between two bit strings in terms of the relationship between the lengths of the longest contiguously-matching and longest contiguously-mismatching substrings. That is, we XOR the two bit strings and count the longest substring of all 0’s to find the longest contiguously-matching substring and the longest substring of all 1’s to find the longest contiguously-mismatching substring. Our implementation of the Streak metric can be found in the Empirical library (Ofria et al. 2020). We provide a formal definition of the Streak metric below.
5.1 Formal definition
Each tag was represented as an ordered, fixed-length bit string,
t=⟨t0,t1,t2,…,tn−2,tn−1⟩
where
∀i,ti∈{0,1}.
We define the greatest contiguously-matching length of n-long bitstrings t and u as follows,
m(t,u)=max
We define the greatest contiguously-mismatching length as follows,
n(t, u) = \max(\{i - j \forall i, j \in 0..n-1 \mid \forall q \in i..j, t_q \neq u_q \})
The streak metric S' with tags t and u
S'(t, u) = \frac{ p(n(t,u)) }{p(m(t,u)) + p(n(t,u))}.
where p approximates the probability of a contiguously-matching substring of a given length between t and u.
It is worth noting that the formula for computing the probability of a k-bit match or mismatch, given by Downing as follows, is actually mathematically flawed.
p_k = \frac{n - k + 1}{2^k}
The probability of a 0-bit match according to this formula would be computed as p_0 = \frac{n - 0 + 1}{2^0} = n + 1 which is clearly impossible because p_0 > 1 \forall n > 0. The actual probability can be computed using a lookup table that is generated using dynamic programming. However, the formula Downing presented provides a useful approximation to the probability of a k bit match. For computational efficiency and consistency with the existing literature we use clamp edge cases between 0 and 1 to yield the corrected streak metric S.
S(t, u) = \max( \min( S'(t, u), 1), 0)
References
Downing, Keith L. 2015. Intelligence Emerging: Adaptivity and Search in Evolving Neural Systems. Cambridge, Massachusetts: The MIT Press.
Ofria, Charles, Matthew Andres Moreno, Emily Dolson, Alexander Lalejini, Santiago Rodriguez-Papa, Jake Fenton, Katherine Perry, et al. 2020. “Empirical: A scientific software library for research, education, and public engagement.” Zenodo. https://doi.org/10.5281/zenodo.4141943.