C. Hinsley
19 September 2024
The Lempel-Ziv (LZ) algorithm takes a binary string (though it generalizes to arbitrary finite alphabets, not just binary) and passes over it left-to-right, building a dictionary of newly observed substrings, and at the end counts the symbols in the dictionary to obtain a measure of complexity of the binary string. If you normalize the LZ complexity, multiplying by $\log_2(n)/n$, where $n$ is the length of the string, this complexity measure will converge for large $n$ to the entropy rate of a source channel which produced the string.
I was curious if there is a possibility that LZ complexity is sensitive to reversals of the binary string, because it does build a dictionary in a relatively chaotic way — perhaps indicative of a directional preference forward or backward — for sufficiently complex sequences. By sensitivity, I mean such that the discrepancy between the normalized Lempel-Ziv complexity of a binary string and its reversal fails to converge to zero for large $n$.
Below you can find a tabulation of these discrepancies:
Length ($n$) | String | LZ | LZ reverse | Discrepancy | Normalized ($\log_2(n)/n$) Discrepancy |
---|---|---|---|---|---|
1 | 0 | 1 | 1 | 0 | 0.00000 |
2 | 00 | 2 | 2 | 0 | 0.00000 |
3 | 001 | 2 | 3 | 1 | 0.52832 |
4 | 0001 | 3 | 3 | 0 | 0.00000 |
5 | 0000 1 | 3 | 4 | 1 | 0.46439 |
6 | 0000 01 | 3 | 4 | 1 | 0.43083 |
7 | 0000 001 | 3 | 4 | 1 | 0.40105 |
8 | 0000 0010 | 4 | 5 | 1 | 0.37500 |
9 | 0000 0000 1 | 4 | 5 | 1 | 0.35221 |
10 | 0001 1001 00 | 6 | 4 | 2 | 0.66439 |
11 | 0001 1100 100 | 6 | 4 | 2 | 0.62899 |
12 | 0000 0010 0001 | 4 | 6 | 2 | 0.59749 |
13 | 0000 0010 0001 0 | 4 | 6 | 2 | 0.56930 |
14 | 0000 0001 1001 00 | 7 | 5 | 2 | 0.54391 |
15 | 0000 0001 1100 100 | 7 | 5 | 2 | 0.52092 |
16 | 0001 1110 0110 0100 | 8 | 5 | 3 | 0.75000 |
17 | 0001 1010 0111 0010 0 | 8 | 5 | 3 | 0.72132 |
18 | 0000 0010 0110 1001 11 | 5 | 8 | 3 | 0.69499 |
19 | 0000 0010 0011 0001 001 | 5 | 8 | 3 | 0.67073 |
20 | 0000 0001 1110 0110 0100 | 9 | 6 | 3 | 0.64829 |
21 | 0000 0001 1010 0111 0010 0 | 9 | 6 | 3 | 0.62747 |
22 | 0000 0001 1010 0110 1111 11 | 9 | 6 | 3 | 0.60810 |
23 | 0000 0001 1001 0100 0000 100 | 9 | 6 | 3 | 0.59003 |
24 | 0000 0000 0000 0010 0001 1110 | 6 | 9 | 3 | 0.57312 |
25 | 0010 0111 0000 1110 1111 0000 1 | 6 | 10 | 4 | 0.74302 |
26 | 0001 1010 1100 1101 0011 1001 00 | 10 | 6 | 4 | 0.72314 |
27 | 0000 0010 0000 0110 0101 0000 111 | 6 | 10 | 4 | 0.70443 |
28 | 0000 0010 0000 0110 0010 1000 0111 | 6 | 10 | 4 | 0.68676 |
29 | 0000 0001 1110 1000 1101 0000 0010 0 | 11 | 7 | 4 | 0.67007 |
30 | 0000 0001 1010 1100 1101 0011 1001 00 | 11 | 7 | 4 | 0.65425 |
31 | 0001 0000 1011 1110 1001 0110 1011 011 | 11 | 7 | 4 | 0.63925 |
32 | 0001 0000 0001 0001 0100 0000 0110 1010 | 7 | 11 | 4 | 0.62500 |
33 | 0001 1110 0000 1011 0011 1100 1011 0010 0 | 11 | 6 | 5 | 0.76430 |
34 | 0001 1110 1001 0011 0011 1100 1001 1001 00 | 11 | 6 | 5 | 0.74816 |
35 | 0101 0110 0000 1101 0110 0110 1001 1100 100 | 12 | 7 | 5 | 0.73275 |
Julia code used to calculate the data above:
using Base.Threads
# Lempel-Ziv 1976 complexity (no normalization).
function LZ76_complexity(sequence::Vector{Int})::Float64
if isempty(sequence)
return 0.0
end
complexity = 0
i = 1
n = length(sequence)
# b = length(unique(sequence)) # Unused variable, can be removed if not needed
while i <= n
max_match_length = 0
# Search for the longest match in the window (from the start up to position i-1).
for j in 1:(i-1)
match_length = 0
while (i + match_length <= n) && (sequence[j + match_length] == sequence[i + match_length])
match_length += 1
# Prevent overlapping matches.
if (j + match_length > i - 1)
break
end
end
if match_length > max_match_length
max_match_length = match_length
end
end
if max_match_length > 0
# Found a match; increment complexity and move the pointer.
complexity += 1
i += max_match_length + 1
else
# No match found; treat the current symbol as a new phrase.
complexity += 1
i += 1
end
end
return complexity
end
# Find the binary string of length n (WLOG starting with 0) which has the
# greatest discrepancy in LZ76 complexity from its reverse string.
function maximal_discrepancy(n::Int)::Tuple{Vector{Int}, Int}
if n < 1
error("Length of the binary string must be at least 1.")
end
if n == 1
return ([0], 0)
end
if n == 2
return ([0, 0], 0)
end
max_discrepancy = -1.0
max_string = Int[]
total_strings = 2^(n-1) # Since the first bit is fixed to 0.
# Initialize per-thread storage for maximum discrepancy and corresponding string.
n_threads = Threads.nthreads()
per_thread_max_discrepancy = fill(-1.0, n_threads)
per_thread_max_string = [Int[] for _ in 1:n_threads]
Threads.@threads for k in 0:(total_strings - 1)
# Generate the binary string starting with 0.
bits = zeros(Int, n)
bits[1] = 0 # First bit is 0.
for bit_pos in 2:n
bits[bit_pos] = (k >> (n - bit_pos)) & 1
end
bits_reversed = reverse(bits)
# Check for palindromic strings.
if bits == bits_reversed
continue # Skip palindromic strings as discrepancy is zero.
end
# If n is even, check for anti-palindromic strings (equal to reverse and inversion).
if iseven(n) && bits == (1 .- bits_reversed)
continue # Skip anti-palindromic strings as discrepancy is zero.
end
# Compute complexities.
c_original = LZ76_complexity(bits)
c_reversed = LZ76_complexity(bits_reversed)
# Compute discrepancy.
discrepancy = abs(c_original - c_reversed)
# Update thread-local maximum discrepancy and corresponding string if needed.
thread_id = threadid()
if discrepancy > per_thread_max_discrepancy[thread_id]
per_thread_max_discrepancy[thread_id] = discrepancy
per_thread_max_string[thread_id] = copy(bits)
end
end
# After the parallel loop, find the global maximum discrepancy and the corresponding string.
for thread_id in 1:n_threads
if per_thread_max_discrepancy[thread_id] > max_discrepancy
max_discrepancy = per_thread_max_discrepancy[thread_id]
max_string = copy(per_thread_max_string[thread_id])
end
end
return (max_string, Int(max_discrepancy))
end