Duplicate Khipu Search

Statistics is the science of learning from experience, particularly experience that arrives a little bit at a time: the successes and failures of a new experimental drug, the uncertain measurements of an asteroid’s path toward Earth. It may seem surprising that any one theory can cover such an amorphous target as “learning from experience.”

Computer Age Statistical Inference: Algorithms, Evidence, and Data Science, Bradley Efron & Trevor Hastie

This page is non-reproducible. Duplicate khipus have been removed from the current database, so we can’t search for them anymore! Nonetheless, the page has been included to show the difficulties in searching for duplicates.

07 - Searching For Duplicate Khipus

It’s possible, for various reasons, that khipus are duplicated in the Khipu Field Guide. For example, an Ascher khipu might be re-read and entered in a reverse order, minus some strings.

Manuel Medrano found the following khipu duplicates using an old-fashioned shoe-leather approach, plus diagrams on the KFG. My algorithmic approach failed to find clear duplicates, so it’s time to analyze what went wrong, and fix it!

UR035/AS070
UR043/AS030
UR044/AS031 (AS031 does not exist in the KFG)
UR083/AS208
UR126/UR115
UR133/UR036
UR163/AS056
UR176/LL01
UR236/AS181
UR281/AS068
HP035/AS047 (HP035 does not exist in the KFG)
HP041/AS046

Of these 12 pairs, 10 exist in the KFG. Since the duplicate khipus are likely to be removed, this data study’s lifetime will be short, but the lessons will be long-lived ;-)

Whoops. That didn’t work. Phase 1

My initial approach, which failed to find duplicates, was a “winnowing” approach:

Step 1 - Assume that a possible match has at least 80% as many cords as the khipu being tested.
Step 2 - If it’s a reversed khipu, it should have similar cord values. The intersection of the two khipus’ cord-value sets should be similar in size to each khipu’s own set.
Step 3 - The same idea should be true of cord colors.
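Step 1 amounts to a pairwise size filter. Here is a minimal sketch of that filter on made-up cord counts; the function name `candidate_pairs` and the toy khipu names are hypothetical and not part of the KFG code. The prose above says 80%, while the listing further down uses 0.7, so the threshold is left as a parameter.

```python
from itertools import combinations


def candidate_pairs(cord_counts, min_ratio=0.8):
    """Return name pairs whose smaller cord count is at least min_ratio of the larger."""
    pairs = []
    # combinations() visits each unordered pair once (a "triangular" search).
    for (name_a, count_a), (name_b, count_b) in combinations(sorted(cord_counts.items()), 2):
        small, large = sorted((count_a, count_b))
        if large > 0 and small >= min_ratio * large:
            pairs.append((name_a, name_b))
    return pairs


# Hypothetical cord counts, for illustration only (not taken from the KFG).
toy_counts = {"K1": 100, "K2": 85, "K3": 60, "K4": 50}
print(candidate_pairs(toy_counts))  # [('K1', 'K2'), ('K3', 'K4')]
```

Steps 2 and 3 then apply the same kind of window to the overlap of cord-value sets and color sets.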

This approach failed. Miserably.

Why doesn’t that work? What Does Work? Why?

Let’s review an image quilt of the duplicate khipus that Manuel Medrano found.

Code

real_matches_found = [('UR035', 'AS070'), ('UR043', 'AS030'), ('UR083', 'AS208'), ('UR126', 'UR115'), ('UR133', 'UR036'), #UR044/AS031 not included since AS031 doesn't exist
                      ('UR163', 'AS056'), ('UR176', 'LL01'), ('UR236', 'AS181'), ('UR281', 'AS068'),('HP041', 'AS046')] #HP035/AS047 not included since HP035 doesn't exist

real_matches_flat = ku.flatten_list(real_matches_found)
from khipu_html import make_image_quilt_file
ku.ensure_directory_exists(output_dir := f"{kq.project_directory()}/notebook/fieldmarks/appendix/notebooks/duplicate_khipus_quilt")
make_image_quilt_file(real_matches_flat, grouped=False, title_string="Duplicate Khipus Quilt",
                      sort_string = "", template_file = "fieldmark_image_quilt_template",
                      output_dir = output_dir, output_filename = "duplicate_khipus");

from fieldmark_table import make_mini_fieldmark_browser
ku.ensure_directory_exists(output_directory := f"{kq.project_directory()}/notebook/fieldmarks/appendix/notebooks/duplicate_khipus_browser")
make_mini_fieldmark_browser(real_matches_flat, output_directory, browser_title="Duplicate Khipu Pairs - Fieldmarks");

Starting make_mini_fieldmark_browser 0:0 min
Finished make_mini_fieldmark_browser after 0:15 min

Image Quilt

Visual examination of the Image Quilt shows that the following 6 pairs (± 2) appear similar to the eye:

(‘UR035’, ‘AS070’)
(‘UR043’, ‘AS030’)
(‘UR126’, ‘UR115’)
(‘UR133’, ‘UR036’)
(‘UR281’, ‘AS068’)
(‘HP041’, ‘AS046’)

Visually, we seem to rely on knot location and cord length as cues for matching, more than on conventional symbolic values like knot value and color, which might be easier for a computer to match.

Now let’s look at their fieldmarks:

Duplicate Khipu Fieldmarks

The fieldmark browser sorts khipus based on a vector of all fieldmarks. So the neighboring pairs in the mini browser we just made aren’t always neighboring pairs in the browser. But they’re close!

The number one predictor of closeness is Similarity Index, and it appears to be a pretty good predictor! We’ll get back to why that is later.

What isn’t a predictor of closeness? The # of Ascher colors is a terrible predictor. Witness UR035 with 52 colors versus AS070’s 30. So is cord value: its proxy, mean cord value, is a close match only a few times. The lesson is clear. In a database where two khipu measurers produce such different results, we should avoid using just a few fieldmarks to judge similarity.
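To see how badly a single fieldmark can behave, plug the 52 and 30 colors into the kind of 0.6 to 1.4 window that the winnowing code on this page applies to color-set sizes. This is only a sketch: `in_range` is assumed to behave like an inclusive `ku.in_range`, and the "# of Ascher colors" fieldmark may be counted differently from the pendant-cord color sets used in the winnowing code.

```python
def in_range(value, low, high):
    """Inclusive range test; assumed to mirror ku.in_range."""
    return low <= value <= high


def color_count_window_ok(reference_count, candidate_count, low=0.6, high=1.4):
    """True if the candidate's color count falls within the window around the reference."""
    return in_range(candidate_count, low * reference_count, high * reference_count)


# UR035 has 52 colors and AS070 has 30 (figures quoted in the text above).
print(color_count_window_ok(52, 30))  # False: 30 is below 0.6 * 52 = 31.2
print(color_count_window_ok(30, 52))  # False: 52 is above 1.4 * 30 = 42
```

Whichever khipu is taken as the reference, this known duplicate pair falls outside the window.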

So why is the similarity index such a good predictor? Remember that the similarity index uses a textual description of the khipu. Everything, including knots, locations, cord lengths, cords per cluster, etc., is recorded in that textual description. A very large n-dimensional vector is then created from it, and the hierarchical clustering algorithm does its magic on that. Many dimensions make the search more accurate.
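The idea of turning a textual description into a high-dimensional vector can be sketched with nothing but the standard library. This is not the KFG's similarity index: the word-pair (bigram) counting, the cosine measure, and the toy descriptions are all assumptions chosen for illustration, and pairwise similarity stands in for the hierarchical clustering step.

```python
import math
from collections import Counter


def text_vector(description):
    """Count adjacent word pairs (bigrams); each distinct pair is one dimension."""
    words = description.split()
    return Counter(zip(words, words[1:]))


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(vec_a[key] * vec_b[key] for key in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up descriptions: two readings of the "same" khipu that disagree on one
# length and one color, plus an unrelated khipu.
first_reading = "cord white 30cm knot single 10cm knot long 25cm cord brown 28cm knot single 9cm"
second_reading = "cord white 31cm knot single 10cm knot long 25cm cord tan 28cm knot single 9cm"
unrelated = "cord red 12cm cord red 12cm cord blue 40cm knot figure8 5cm"

vectors = {name: text_vector(text) for name, text in
           [("first", first_reading), ("second", second_reading), ("unrelated", unrelated)]}
print(cosine_similarity(vectors["first"], vectors["second"]))
print(cosine_similarity(vectors["first"], vectors["unrelated"]))
```

Because the score is spread over many dimensions, a disagreement on one color or one length only removes a few of them, and the two readings still score far closer to each other than to the unrelated khipu.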

What’s the takeaway for me?

Visual inspection matters.
Fieldmarks matter. But only if you use a bunch of them simultaneously (in parallel, not series)!
Dimensions matter. A complete textual description of a khipu, converted into the numerical patois of a machine learning algorithm, produces a pretty good match predictor!
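The difference between using fieldmarks in series and in parallel can be sketched as follows. Apart from the 52 and 30 color counts quoted earlier, every number here is invented, and the function names, the 30% tolerance, and the averaged relative difference are assumptions made for the illustration rather than anything the KFG computes.

```python
def relative_difference(a, b):
    """Symmetric relative difference in [0, 1] for non-negative numbers."""
    largest = max(a, b)
    return 0.0 if largest == 0 else abs(a - b) / largest


def passes_serial_filters(reference, candidate, tolerance=0.3):
    """Series: the candidate is rejected as soon as any one fieldmark differs too much."""
    return all(relative_difference(reference[key], candidate[key]) <= tolerance
               for key in reference)


def combined_distance(reference, candidate):
    """Parallel: average the differences over all fieldmarks at once."""
    return sum(relative_difference(reference[key], candidate[key])
               for key in reference) / len(reference)


# Hypothetical fieldmark values, for illustration only.
reference = {"cords": 100, "colors": 52, "mean_value": 40.0, "mean_length": 30.0, "clusters": 12}
reread = {"cords": 96, "colors": 30, "mean_value": 41.0, "mean_length": 29.0, "clusters": 12}
different = {"cords": 80, "colors": 40, "mean_value": 30.0, "mean_length": 22.0, "clusters": 9}

for name, candidate in [("reread", reread), ("different", different)]:
    print(name, passes_serial_filters(reference, candidate),
          round(combined_distance(reference, candidate), 3))
```

In this toy case the serial filters throw out the re-read khipu because of its color count alone and keep the different khipu, while the combined distance ranks the re-read khipu as the nearer of the two.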

The Old Approach (That Didn’t Work!)

It’s possible to look for these khipus. We’ll use a winnowing process.

Step 1 - Assume that a possible match has at least 80% as many cords as the khipu being tested.
Step 2 - If it’s a reversed khipu, it should have similar cord values. The intersection of the two khipus’ cord-value sets should be similar in size to each khipu’s own set.
Step 3 - The same idea should be true of cord colors.

Code

(khipu_dict, all_khipus) = kamayuq.fetch_khipus()

khipu_summary_df = kq.fetch_khipu_summary()
khipu_pendant_count = sorted([(khipu.name(), khipu.num_pendant_cords()) for khipu in all_khipus], key=lambda x: x[1], reverse=True)

# We first winnow search by comparing pendant cord count.
# This is a right triangular search (khipu[x] is matched with khipu[x+1,x+2....x+n])
possible_matches = {}
for index, (khipu_name, pendant_count) in enumerate(khipu_pendant_count):
    right_khipus = khipu_pendant_count[index+1:]
    possible_matches[khipu_name] = [a_name for (a_name, aCount) in right_khipus if (a_name!=khipu_name) and (aCount <= pendant_count) and (aCount >= 0.7*pendant_count)]
possible_matches = {name:aList for name,aList in possible_matches.items() if len(aList)}

from collections import Counter

Code

# Build dictionaries of cord colors and cord color sets for each khipu
khipu_colors = {}
khipu_color_counter = {}
khipu_color_set = {}
for aKhipuName, match_list in possible_matches.items():
    khipu_colors[aKhipuName] = [aCord.longest_ascher_color() for aCord in khipu_dict[aKhipuName].pendant_cords()]
    khipu_color_counter[aKhipuName] = Counter(khipu_colors[aKhipuName])
    khipu_color_set[aKhipuName] = set(khipu_color_counter[aKhipuName].keys())
for aKhipuName, match_list in possible_matches.items():
    for aMatchName in match_list:
        if aMatchName not in khipu_colors:
            khipu_colors[aMatchName] = [aCord.longest_ascher_color() for aCord in khipu_dict[aMatchName].pendant_cords()]
            khipu_color_counter[aMatchName] = Counter(khipu_colors[aMatchName])
            khipu_color_set[aMatchName] = set(khipu_color_counter[aMatchName].keys())

# And then winnow possible matches based on len of color sets and intersection of color sets length
color_matches = {}
for aKhipuName, match_list in possible_matches.items():
    my_set_length = len(khipu_color_set[aKhipuName])
    less_set_length = .6*my_set_length
    more_set_length = 1.4*my_set_length
    def is_match(search_khipu_name):
        if len(search_khipu_name)==0: return False 
        
        match_length = len(khipu_color_set[search_khipu_name])
        lengths_match = (ku.in_range(match_length, less_set_length, more_set_length))
        common_colors = khipu_color_set[aKhipuName].intersection(khipu_color_set[search_khipu_name])
        colors_match = (ku.in_range(len(common_colors), less_set_length, more_set_length)) 
        return lengths_match and colors_match
    the_khipu_set_matches = [aMatchName for aMatchName in match_list if is_match(aMatchName)]
    if len(the_khipu_set_matches): 
        color_matches[aKhipuName] = the_khipu_set_matches
    
len(color_matches)
# color_matches     
possible_matches = color_matches

Code

# Build cord_value dictionaries....
khipu_cord_vals = {}
khipu_cord_val_counter = {}
khipu_cord_val_set = {}
for aKhipuName, match_list in possible_matches.items():
    khipu_cord_vals[aKhipuName] = [aCord.knotted_value() for aCord in khipu_dict[aKhipuName].pendant_cords()]
    khipu_cord_val_counter[aKhipuName] = Counter(khipu_cord_vals[aKhipuName])
    khipu_cord_val_set[aKhipuName] = set(khipu_cord_val_counter[aKhipuName].keys())
for aKhipuName, match_list in possible_matches.items():
    for aMatchName in match_list:
        if aMatchName not in khipu_cord_vals:
            khipu_cord_vals[aMatchName] = [aCord.knotted_value() for aCord in khipu_dict[aMatchName].pendant_cords()]
            khipu_cord_val_counter[aMatchName] = Counter(khipu_cord_vals[aMatchName])
            khipu_cord_val_set[aMatchName] = set(khipu_cord_val_counter[aMatchName].keys())

# And then winnow possible matches based on len of cord value sets and intersection of cord value length
# Check (and report) on multiples/divisibles by 10 in case there's an off by ten error
def get_cord_val_matches(aKhipuName, match_list):
    my_set_length = len(khipu_cord_val_set[aKhipuName])
    less_set_length = .7*my_set_length
    more_set_length = 1.3*my_set_length
    
    def is_match(search_khipu_name):
        match_length = len(khipu_cord_val_set[search_khipu_name])
        lengths_match = (ku.in_range(match_length, less_set_length, more_set_length))
        if not lengths_match:
            return False
        
        def match_cord_val_set(search_khipu_name, cord_val_set):
            common_cord_values = khipu_cord_val_set[aKhipuName].intersection(cord_val_set)
            return (ku.in_range(len(common_cord_values), less_set_length, more_set_length)) 
        
        cord_values_match = match_cord_val_set(search_khipu_name, khipu_cord_val_set[search_khipu_name])
        if not cord_values_match:
            off_by_ten = {x*10 for x in khipu_cord_val_set[search_khipu_name]}
            cord_values_match = match_cord_val_set(search_khipu_name, off_by_ten)
            if cord_values_match: print(f"\t{aKhipuName}->{search_khipu_name} - OBOX10 Matched")
        if not cord_values_match:
            off_by_ten = {round(x/10) for x in khipu_cord_val_set[search_khipu_name]}
            cord_values_match = match_cord_val_set(search_khipu_name, off_by_ten)
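            if cord_values_match: print(f"\t{aKhipuName}->{search_khipu_name} - OBO/10 Matched")
        # NOTE: is_match ends here without `return cord_values_match`, so it returns None (falsy)
        # whenever the length check passes, and the list comprehension below keeps no candidates.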
            
    the_khipu_set_matches = [aMatchName for aMatchName in match_list if is_match(aMatchName)]
    return the_khipu_set_matches    

cord_val_matches = {}
for aKhipuName, match_list in possible_matches.items():
    if the_khipu_set_matches := get_cord_val_matches(aKhipuName, match_list):
        cord_val_matches[aKhipuName] = the_khipu_set_matches
    
cord_val_matches

    UR028->UR044 - OBO/10 Matched

{}

There ~~don’t~~ appear to be duplicates at first glance….
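For reference, the cord-value test, including its off-by-ten fallbacks, can be sketched on plain sets. In the listing above, the inner `is_match` of `get_cord_val_matches` has no final `return`, so it yields `None` even when it prints a match. The sketch below returns the result explicitly. It is a standalone illustration: `in_range` is assumed to be an inclusive test like `ku.in_range`, the values are made up, and it says nothing about what the corrected search would have found in the KFG.

```python
def in_range(value, low, high):
    """Inclusive range test; assumed to mirror ku.in_range."""
    return low <= value <= high


def cord_values_match(reference_values, candidate_values, low=0.7, high=1.3):
    """Compare cord-value sets, retrying with the candidate scaled by 10 and by 1/10."""
    reference_set, candidate_set = set(reference_values), set(candidate_values)
    lower, upper = low * len(reference_set), high * len(reference_set)
    if not in_range(len(candidate_set), lower, upper):
        return False
    variants = (
        candidate_set,                            # values as recorded
        {v * 10 for v in candidate_set},          # candidate recorded a factor of 10 too low
        {round(v / 10) for v in candidate_set},   # candidate recorded a factor of 10 too high
    )
    # Explicit return: True if any variant shares enough values with the reference.
    return any(in_range(len(reference_set & variant), lower, upper) for variant in variants)


# Hypothetical cord values, for illustration only.
print(cord_values_match([10, 20, 30, 40, 50], [1, 2, 3, 4, 60]))    # True, via the x10 variant
print(cord_values_match([10, 20, 30, 40, 50], [7, 8, 9, 11, 13]))   # False
```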