# Duplicate Khipu Search

Statistics is the science of learning from experience, particularly experience that arrives a little bit at a time: the successes and failures of a new experimental drug, the uncertain measurements of an asteroid's path toward Earth. It may seem surprising that any one theory can cover such an amorphous target as "learning from experience."

Computer Age Statistical Inference: Algorithms, Evidence, and Data Science, Bradley Efron & Trevor Hastie

This page is non-reproducible. Duplicate khipus have been removed from the current database, so we can't search for them anymore! Nonetheless, the page has been included to show the difficulties in searching for duplicates.

# 07 - Searching For Duplicate Khipus

It's possible, for various reasons, that khipus are duplicated in the Khipu Field Guide. For example, an Ascher khipu might be re-read and entered in a reverse order, minus some strings.

Manuel Medrano found the following khipu duplicates using an old-fashioned shoe-leather approach, plus diagrams on the KFG. My algorithmic approach failed to find clear duplicates, so it's time to analyze what went wrong, and fix it!

• UR035/AS070
• UR043/AS030
• UR044/AS031 (AS031 does not exist in the KFG)
• UR083/AS208
• UR126/UR115
• UR133/UR036
• UR163/AS056
• UR176/LL01
• UR236/AS181
• UR281/AS068
• HP035/AS047 (HP035 does not exist in the KFG)
• HP041/AS046

That leaves 10 duplicate pairs that exist in the KFG. Since the duplicate khipus are likely to be removed, this data study's lifetime will be short, but the lessons will be long-lived ;-)

## Whoops. That didn't work. Phase 1

My initial approach, which failed to find duplicates, was a "winnowing" approach:

• Step 1 - Assume that a possible match has at least 80% as many cords as the khipu being tested.
• Step 2 - If it's a reversed khipu, it should have similar cord values. The intersection of the two khipus' sets of cord values should be similar in size to each khipu's own set.
• Step 3 - The same idea should be true of cord colors.
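
The set test in Steps 2 and 3 can be illustrated with a toy example. This is a minimal sketch, not the KFG code: the cord values are made up, and `set_overlap_ok` and its 70% tolerance are hypothetical, chosen only for illustration. Because a set ignores order, a reversed reading with a few strings missing should still pass.

```python
def set_overlap_ok(values_a, values_b, tolerance=0.7):
    """Return True when the shared values cover at least `tolerance` of each set.

    Hypothetical helper: the name and the default tolerance are assumptions.
    """
    set_a, set_b = set(values_a), set(values_b)
    if not set_a or not set_b:
        return False
    common = set_a & set_b
    return (len(common) >= tolerance * len(set_a)
            and len(common) >= tolerance * len(set_b))

# Made-up cord values for one khipu...
original = [12, 30, 45, 7, 120, 64, 8, 15, 200, 33]
# ...and a second reading of it: reversed, with two strings missing.
reread = list(reversed(original))[:-2]
# An unrelated khipu with mostly different values.
unrelated = [5, 30, 91, 14, 76, 2, 300, 41, 18, 60]

print(set_overlap_ok(original, reread))     # reversed re-reading
print(set_overlap_ok(original, unrelated))  # unrelated khipu
```

With this toy data the re-reading passes and the unrelated khipu does not. Real duplicates are messier, because two measurers may record different values and colors for the same cord.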

This approach failed. Miserably.

## Why doesn't that work? What Does Work? Why?

Let's review an image quilt of the duplicate khipus that Manuel Medrano found.

``````
# Assumes the project helper modules `ku` and `kq` were imported in an earlier setup cell.
from khipu_html import make_image_quilt_file
from fieldmark_table import make_mini_fieldmark_browser

# Duplicate pairs found by Manuel Medrano.
# UR044/AS031 and HP035/AS047 are not included since AS031 and HP035 don't exist in the KFG.
real_matches_found = [('UR035', 'AS070'), ('UR043', 'AS030'), ('UR083', 'AS208'), ('UR126', 'UR115'), ('UR133', 'UR036'),
                      ('UR163', 'AS056'), ('UR176', 'LL01'), ('UR236', 'AS181'), ('UR281', 'AS068'), ('HP041', 'AS046')]
real_matches_flat = ku.flatten_list(real_matches_found)

# Build the image quilt for the duplicate khipus.
ku.ensure_directory_exists(output_dir := f"{kq.project_directory()}/notebook/fieldmarks/appendix/notebooks/duplicate_khipus_quilt")
make_image_quilt_file(real_matches_flat, grouped=False, title_string="Duplicate Khipus Quilt",
                      sort_string="", template_file="fieldmark_image_quilt_template",
                      output_dir=output_dir, output_filename="duplicate_khipus");

# Build a mini fieldmark browser for the same khipus.
ku.ensure_directory_exists(output_directory := f"{kq.project_directory()}/notebook/fieldmarks/appendix/notebooks/duplicate_khipus_browser")
make_mini_fieldmark_browser(real_matches_flat, output_directory, browser_title="Duplicate Khipu Pairs - Fieldmarks");
``````
``````
Starting make_mini_fieldmark_browser 0:0 min
Finished make_mini_fieldmark_browser after 0:15 min
``````

Image Quilt

Visual examination of the Image Quilt shows that the following 6 pairs (± 2) appear similar to the eye:

• ('UR035', 'AS070')
• ('UR043', 'AS030')
• ('UR126', 'UR115')
• ('UR133', 'UR036')
• ('UR281', 'AS068')
• ('HP041', 'AS046')

Note that when matching by eye, we seem to rely on knot location and cord length as visual cues, more than on conventional symbolic values like knot value and color, which might be easier for the computer to match.

Now let's look at their fieldmarks:

Duplicate Khipu Fieldmarks

The fieldmark browser sorts khipus based on a vector of all fieldmarks. So the neighboring pairs in the mini browser we just made aren't always neighboring pairs in the browser. But they're close!

The number one predictor of closeness is Similarity Index, and it appears to be a pretty good predictor! We'll get back to why that is later.

What isn't a predictor of closeness? The number of Ascher colors is a terrible predictor: witness UR035 with 52 colors to AS070's 30. So is cord value: its proxy, mean cord value, is a close match only a few times. The lesson is clear. In a database where two khipu measurers can produce such different results, we should avoid using just a few fieldmarks to judge similarity.
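
To see how a single fieldmark can reject a true duplicate, take the two color counts quoted above and apply a 60% to 140% tolerance band like the one in the old color filter later on this page. This is only a sketch: `in_band` is a hypothetical stand-in (inclusive endpoints are an assumption about how the project's `ku.in_range` behaves), and the fieldmark count of Ascher colors may not be computed the same way as the color sets in that filter.

```python
def in_band(value, base, low=0.6, high=1.4):
    """True when `value` lies within [low*base, high*base].

    Hypothetical helper; inclusive endpoints are an assumption.
    """
    return low * base <= value <= high * base

ur035_colors, as070_colors = 52, 30  # counts quoted in the text

# Try each khipu as the baseline in turn.
print(in_band(as070_colors, base=ur035_colors))  # band is roughly 31.2 to 72.8
print(in_band(ur035_colors, base=as070_colors))  # band is roughly 18 to 42
```

Both checks come out `False`: whichever khipu is taken as the baseline, the other falls outside the band, so a filter on this one fieldmark would discard the pair.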

So why is the similarity index such a good predictor? Remember that the similarity index uses a textual description of the khipu. Everything, including knots, locations, cord lengths, cords per cluster, etc., is recorded in that textual description. A very large n-dimensional vector is then created from it, and the hierarchical clustering algorithm does its magic on that vector. Many dimensions make the search more accurate.
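
The idea of turning a textual description into a high-dimensional vector can be sketched with a bag-of-tokens model. This is a toy stand-in, not the KFG's similarity index: the description strings and their token format are invented, cosine similarity over token counts is swapped in for whatever vectorization the real pipeline uses, and the hierarchical clustering step is left out.

```python
import math
from collections import Counter

def to_vector(description):
    """Count each whitespace-separated token; every distinct token is one dimension."""
    return Counter(description.split())

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * vec_b[token] for token, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Invented descriptions: color, knot type, and knot position for each of four cords.
khipu_a = "color:W knot:S pos:10 color:W knot:L4 pos:25 color:AB knot:S pos:10 color:AB knot:E pos:30"
# A second reading of the same khipu, with two cord colors recorded differently.
khipu_b = "color:W knot:S pos:10 color:W knot:L4 pos:25 color:B knot:S pos:10 color:B knot:E pos:30"
# An unrelated khipu.
khipu_c = "color:KB knot:L7 pos:40 color:KB knot:L9 pos:40 color:YB knot:L2 pos:55"

sim_ab = cosine_similarity(to_vector(khipu_a), to_vector(khipu_b))
sim_ac = cosine_similarity(to_vector(khipu_a), to_vector(khipu_c))
print(f"a vs b: {sim_ab:.2f}, a vs c: {sim_ac:.2f}")
```

In this toy, the disagreement over color touches only a few of the dimensions, so the two readings still score far closer to each other than to the unrelated khipu.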

What's the takeaway for me?

• Visual inspection matters.
• Fieldmarks matter. But only if you use a bunch of them simultaneously (in parallel, not series)!
• Dimensions matter. A complete textual description of a khipu, converted into the numerical patois of a machine learning algorithm, produces a pretty good match predictor!
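
The difference between using fieldmarks in series and in parallel can be sketched as follows. The fieldmark names and numbers are invented, and `passes_in_series` and `combined_distance` are hypothetical helpers, not KFG functions. In series, one badly disagreeing fieldmark vetoes the pair. In parallel, every fieldmark contributes to a single distance, so one outlier can be outweighed by the others.

```python
import math

# Invented fieldmark values: two readings of the same khipu, and one unrelated khipu.
reading_1 = {"num_cords": 100, "num_colors": 52, "mean_cord_length": 31.0, "num_knots": 240}
reading_2 = {"num_cords": 96, "num_colors": 30, "mean_cord_length": 30.0, "num_knots": 231}
unrelated = {"num_cords": 75, "num_colors": 40, "mean_cord_length": 55.0, "num_knots": 90}

def relative_gap(a, b):
    """Relative difference between two positive numbers, between 0 and 1."""
    return abs(a - b) / max(a, b)

def passes_in_series(x, y, max_gap=0.3):
    """Winnowing: reject the pair if any single fieldmark disagrees too much."""
    return all(relative_gap(x[k], y[k]) <= max_gap for k in x)

def combined_distance(x, y):
    """Parallel use: fold every fieldmark's gap into one Euclidean distance."""
    return math.sqrt(sum(relative_gap(x[k], y[k]) ** 2 for k in x))

# The color counts disagree badly, so the series test rejects the true pair...
print(passes_in_series(reading_1, reading_2))
# ...while the combined distance still ranks it closer than the unrelated khipu.
print(combined_distance(reading_1, reading_2) < combined_distance(reading_1, unrelated))
```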

# The Old Approach (That Didn't Work!)

It's possible to look for these khipus. We'll use a winnowing process.

• Step 1 - Assume that a possible match has at least 80% as many cords as the khipu being tested.
• Step 2 - If it's a reversed khipu, it should have similar cord values. The intersection of the two khipus' sets of cord values should be similar in size to each khipu's own set.
• Step 3 - The same idea should be true of cord colors.
``````
# Assumes the project modules `kamayuq` and `kq` were imported in an earlier setup cell.
from collections import Counter

(khipu_dict, all_khipus) = kamayuq.fetch_khipus()

khipu_summary_df = kq.fetch_khipu_summary()
# (name, pendant cord count) pairs, largest khipu first
khipu_pendant_count = sorted([(khipu.name(), khipu.num_pendant_cords()) for khipu in all_khipus], key=lambda x: x[1], reverse=True)

# We first winnow search by comparing pendant cord count.
# This is a right triangular search (khipu[x] is matched with khipu[x+1,x+2....x+n])
possible_matches = {}
for index, (khipu_name, pendant_count) in enumerate(khipu_pendant_count):
    right_khipus = khipu_pendant_count[index+1:]
    # Keep candidates with between 70% and 100% of this khipu's pendant cord count
    possible_matches[khipu_name] = [a_name for (a_name, aCount) in right_khipus
                                    if (a_name != khipu_name) and (aCount <= pendant_count) and (aCount >= 0.7*pendant_count)]
# Drop khipus that have no candidates at all
possible_matches = {name: aList for name, aList in possible_matches.items() if len(aList)}
``````
``````
# Build dictionaries of cord colors, color counts, and color sets for each khipu
khipu_colors = {}
khipu_color_counter = {}
khipu_color_set = {}
for aKhipuName, match_list in possible_matches.items():
    khipu_colors[aKhipuName] = [aCord.longest_ascher_color() for aCord in khipu_dict[aKhipuName].pendant_cords()]
    khipu_color_counter[aKhipuName] = Counter(khipu_colors[aKhipuName])
    khipu_color_set[aKhipuName] = set(khipu_color_counter[aKhipuName].keys())
# Do the same for candidates that are not themselves keys of possible_matches
for aKhipuName, match_list in possible_matches.items():
    for aMatchName in match_list:
        if aMatchName not in khipu_colors:
            khipu_colors[aMatchName] = [aCord.longest_ascher_color() for aCord in khipu_dict[aMatchName].pendant_cords()]
            khipu_color_counter[aMatchName] = Counter(khipu_colors[aMatchName])
            khipu_color_set[aMatchName] = set(khipu_color_counter[aMatchName].keys())

# And then winnow possible matches based on len of color sets and intersection of color sets length
color_matches = {}
for aKhipuName, match_list in possible_matches.items():
    my_set_length = len(khipu_color_set[aKhipuName])
    less_set_length = .6*my_set_length
    more_set_length = 1.4*my_set_length

    def is_match(search_khipu_name):
        if len(search_khipu_name) == 0:
            return False

        # The candidate's color set must be similar in size to this khipu's color set...
        match_length = len(khipu_color_set[search_khipu_name])
        lengths_match = ku.in_range(match_length, less_set_length, more_set_length)
        # ...and so must the set of colors the two khipus share
        common_colors = khipu_color_set[aKhipuName].intersection(khipu_color_set[search_khipu_name])
        colors_match = ku.in_range(len(common_colors), less_set_length, more_set_length)
        return lengths_match and colors_match

    the_khipu_set_matches = [aMatchName for aMatchName in match_list if is_match(aMatchName)]
    if len(the_khipu_set_matches):
        color_matches[aKhipuName] = the_khipu_set_matches

possible_matches = color_matches
len(color_matches)
``````
``259``
``````
# Build cord_value dictionaries....
khipu_cord_vals = {}
khipu_cord_val_counter = {}
khipu_cord_val_set = {}
for aKhipuName, match_list in possible_matches.items():
    khipu_cord_vals[aKhipuName] = [aCord.knotted_value() for aCord in khipu_dict[aKhipuName].pendant_cords()]
    khipu_cord_val_counter[aKhipuName] = Counter(khipu_cord_vals[aKhipuName])
    khipu_cord_val_set[aKhipuName] = set(khipu_cord_val_counter[aKhipuName].keys())
for aKhipuName, match_list in possible_matches.items():
    for aMatchName in match_list:
        if aMatchName not in khipu_cord_vals:
            khipu_cord_vals[aMatchName] = [aCord.knotted_value() for aCord in khipu_dict[aMatchName].pendant_cords()]
            khipu_cord_val_counter[aMatchName] = Counter(khipu_cord_vals[aMatchName])
            khipu_cord_val_set[aMatchName] = set(khipu_cord_val_counter[aMatchName].keys())

# And then winnow possible matches based on len of cord value sets and intersection of cord value length
# Check (and report) on multiples/divisibles by 10 in case there's an off by ten error
def get_cord_val_matches(aKhipuName, match_list):
    my_set_length = len(khipu_cord_val_set[aKhipuName])
    less_set_length = .7*my_set_length
    more_set_length = 1.3*my_set_length

    def is_match(search_khipu_name):
        match_length = len(khipu_cord_val_set[search_khipu_name])
        lengths_match = ku.in_range(match_length, less_set_length, more_set_length)
        if not lengths_match:
            return False

        def match_cord_val_set(search_khipu_name, cord_val_set):
            common_cord_values = khipu_cord_val_set[aKhipuName].intersection(cord_val_set)
            return ku.in_range(len(common_cord_values), less_set_length, more_set_length)

        cord_values_match = match_cord_val_set(search_khipu_name, khipu_cord_val_set[search_khipu_name])
        if not cord_values_match:
            # Retry with the candidate's values multiplied by ten
            off_by_ten = {x*10 for x in khipu_cord_val_set[search_khipu_name]}
            cord_values_match = match_cord_val_set(search_khipu_name, off_by_ten)
            if cord_values_match: print(f"\t{aKhipuName}->{search_khipu_name} - OBOX10 Matched")
        if not cord_values_match:
            # Retry with the candidate's values divided by ten
            off_by_ten = {round(x/10) for x in khipu_cord_val_set[search_khipu_name]}
            cord_values_match = match_cord_val_set(search_khipu_name, off_by_ten)
            if cord_values_match: print(f"\t{aKhipuName}->{search_khipu_name} - OBO/10 Matched")
        # NOTE: as written, is_match ends here without `return cord_values_match`,
        # so it returns None (falsy) whenever the lengths match. Every candidate is
        # therefore dropped, even one reported as matched by the prints above.

    the_khipu_set_matches = [aMatchName for aMatchName in match_list if is_match(aMatchName)]
    return the_khipu_set_matches

cord_val_matches = {}
for aKhipuName, match_list in possible_matches.items():
    if the_khipu_set_matches := get_cord_val_matches(aKhipuName, match_list):
        cord_val_matches[aKhipuName] = the_khipu_set_matches

cord_val_matches
``````
``    UR028->UR044 - OBO/10 Matched``
``{}``

There don't appear to be duplicates at first glance...
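
When a winnowing pipeline comes back empty, one useful diagnostic is to check the known duplicate pairs against each stage's candidate dictionary and see where they drop out. The following is a minimal sketch under assumptions: `surviving_pairs` is a hypothetical helper, and the stage dictionary is made-up toy data in the same shape as `possible_matches` (a khipu name mapped to a list of candidate names), not real output.

```python
def surviving_pairs(known_pairs, candidates):
    """Return the known pairs still present in a {name: [candidate names]} dict.

    The search is triangular, so a pair may be stored under either member.
    """
    survivors = []
    for first, second in known_pairs:
        if second in candidates.get(first, []) or first in candidates.get(second, []):
            survivors.append((first, second))
    return survivors

known_pairs = [("UR035", "AS070"), ("UR043", "AS030"), ("UR126", "UR115")]

# Toy stage output, invented for illustration only.
toy_stage = {
    "AS070": ["UR035", "UR001"],
    "UR043": ["UR002"],
    "UR126": ["UR115"],
}

kept = surviving_pairs(known_pairs, toy_stage)
print(f"{len(kept)} of {len(known_pairs)} known pairs survive: {kept}")
```

Running the same check after the pendant-count, color, and cord-value stages would show which filter first discards a real duplicate.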