Chapter 5 Streak metric for tag-based referencing

For this work, we used the Streak metric to quantify the similarity between two tags when performing a tag-based reference. The Streak metric was originally proposed by (Downing 2015) At a high-level, this metric measures similarity between two bit strings in terms of the relationship between the lengths of the longest contiguously-matching and longest contiguously-mismatching substrings. That is, we XOR the two bit strings and count the longest substring of all 0’s to find the longest contiguously-matching substring and the longest substring of all 1’s to find the longest contiguously-mismatching substring. Our implementation of the Streak metric can be found in the Empirical library (Ofria et al. 2020). We provide a formal definition of the Streak metric below.

5.1 Formal definition

Each tag was represented as an ordered, fixed-length bit string,

t=t0,t1,t2,,tn2,tn1

where

i,ti{0,1}.

We define the greatest contiguously-matching length of n-long bitstrings t and u as follows,

m(t,u)=max

We define the greatest contiguously-mismatching length as follows,

n(t, u) = \max(\{i - j \forall i, j \in 0..n-1 \mid \forall q \in i..j, t_q \neq u_q \})

The streak metric S' with tags t and u

S'(t, u) = \frac{ p(n(t,u)) }{p(m(t,u)) + p(n(t,u))}.

where p approximates the probability of a contiguously-matching substring of a given length between t and u.

It is worth noting that the formula for computing the probability of a k-bit match or mismatch, given by Downing as follows, is actually mathematically flawed.

p_k = \frac{n - k + 1}{2^k}

The probability of a 0-bit match according to this formula would be computed as p_0 = \frac{n - 0 + 1}{2^0} = n + 1 which is clearly impossible because p_0 > 1 \forall n > 0. The actual probability can be computed using a lookup table that is generated using dynamic programming. However, the formula Downing presented provides a useful approximation to the probability of a k bit match. For computational efficiency and consistency with the existing literature we use clamp edge cases between 0 and 1 to yield the corrected streak metric S.

S(t, u) = \max( \min( S'(t, u), 1), 0)

References

Downing, Keith L. 2015. Intelligence Emerging: Adaptivity and Search in Evolving Neural Systems. Cambridge, Massachusetts: The MIT Press.

Ofria, Charles, Matthew Andres Moreno, Emily Dolson, Alexander Lalejini, Santiago Rodriguez-Papa, Jake Fenton, Katherine Perry, et al. 2020. “Empirical: A scientific software library for research, education, and public engagement.” Zenodo. https://doi.org/10.5281/zenodo.4141943.