Khipu Analysis

Loading the Khipu Summary

In this page, we will perform high-level analysis, comparing khipus to each other, as a whole.

As part of the building of the python class object OODB, we now have a summary table of useful metrics of Khipu. A list of sample metrics includes, how many pendant cords, how many S-Twist cords, how many subsidiary cords for each pendant cord, etc..

One interesting field is the document field which is a textual representation of the khipus “features”, such as cord attachments, values, colors, knot types, etc. This textual representation becomes a “proxy” for the Khipu, and allows NLP algorithms to work on the khipu. The long term goal then becomes translating that text document into something meaningful to us.

I will admit it is an odd process. First humans convert thoughts to cloth. Then humans measure the cloth, and convert it back to numbers (the database). Then I come along and convert the numbers back to pixel-cloth and I convert the numbers to text. Which I will convert back to numbers in code, so that I can hopefully convert those numbers back to thoughts. Hmmmmn….

The long prologue of imports and setup has been extracted to a startup file for brevity. The first task is to complete initialization by doing final setup and importing the Python Khipu OODB (Object Oriented Database) and summary files.

Code
import os
import numpy as np
import pandas as pd
import scipy
    
from pandas import Series, DataFrame

# Plotly
import plotly
from plotly.offline import iplot, init_notebook_mode
import plotly.graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff
plotly.offline.init_notebook_mode(connected = False)


# Khipu Imports
import khipu_kamayuq as kk  # A Khipu Maker is known (in Quechua) as a Khipu Kamayuq
import khipu_qollqa as kq
import khipu_utils as ku

(khipu_dict, all_khipus) = kk.fetch_khipus()
khipu_summary_df = kq.fetch_khipu_summary()
<Figure size 6000x3000 with 0 Axes>

The KhipuKamayuq class (a Khipu scribe is a KhipuKamayuq) fetches khipus using the data accessed using the KhipuQollqa (a Khipu Warehouse) class. The KhipuKamayuq class can fetch all the khipus stored(pickled) as python object-oriented class objects. Additionally it has a khipu summary dataframe made by the KhipuQollqa. Let’s load the OODB, and the summary dataframe. A short list of summary columns is useful, as well.

Code
khipu_summary_df.columns
Index(['Unnamed: 0', 'khipu_id', 'kfg_name', 'okr_name', 'original_name',
       'legal_filename', 'nickname', 'museum_name', 'museum_num',
       'museum_description', 'region', 'provenance', 'num_cord_clusters',
       'mean_cords_per_cluster', 'max_subsidiary_branches',
       'num_top_cord_double_sum_clusters', 'num_top_cord_sum_clusters',
       'num_banded_clusters', 'num_seriated_clusters',
       'num_decreasing_clusters', 'benford_match', 'hierarchical_cluster_pos',
       'num_canuto_cords', 'num_cords', 'recto_cords', 'verso_cords',
       'r_ratio', 'v_ratio', 'rv_cords', 'num_pendant_cords',
       'num_nonzero_top_pendant_cords', 'num_top_cords',
       'num_subsidiary_cords', 'num_s_cords', 'num_z_cords', 'mean_cord_value',
       'fanout_ratio', 'num_total_cords', 'num_ascher_colors',
       'knot_cluster_count', 'knot_count', 'mean_knots_per_cord',
       'num_s_knots', 'num_z_knots', 'percent_s_knots', 'percent_z_knots',
       'notes', 'pcord_notes'],
      dtype='object')

Benford’s Law

Beginning programmers are taught to learn how to write “Hello World” as their first programming exercise. For khipu scholars, who are researching khipus statistically, Benford’s Law is their “Hello World.” Benford’s law, simply put, says that the first digits of numbers used to measure human activity tend to have a higher probability that they are 1. For example, 10,000, 1,000, etc. We can use Benford’s law to predict which khipu are of an “accounting nature” - ie. khipu that have sums and accounts of things.

>Benford’s Law by Dan Ma

The first digit (or leading digit) of a number is the leftmost digit (e.g. the first digit of 567 is 5). The first digit of a number can only be 1, 2, 3, 4, 5, 6, 7, 8, and 9 since we do not usually write a number such as 567 as 0567. Some fraudsters may think that the first digits in numbers in financial documents appear with equal frequency (i.e. each digit appears about 11% of the time). In fact, this is not the case. It was discovered by Simon Newcomb in 1881 and rediscovered by physicist Frank Benford in 1938 that the first digits in many data sets occur according to the probability distribution indicated in Figure 1 below:

The above probability distribution is now known as the Benford’s law. It is a powerful and yet relatively simple tool for detecting financial and accounting frauds. For example, according to the Benford’s law, about 30% of numbers in legitimate data have 1 as a first digit. Fraudsters who do not know this will tend to have much fewer ones as first digits in their faked data.

Data for which the Benford’s law is applicable are data that tend to distribute across multiple orders of magnitude. Examples include income data of a large population, census data such as populations of cities and counties. In addition to demographic data and scientific data, the Benford’s law is also applicable to many types of financial data, including income tax data, stock exchange data, corporate disbursement and sales data. The author of I’ve Got Your Number How a Mathematical Phenomenon can help CPAs Uncover Fraud and other Irregularities., Mark Nigrini, also discussed data analysis methods (based on the Benford’s law) that are used in forensic accounting and auditing.

Code
first_digit_probability = [0.301, 0.176, 0.125, 0.097, 0.079, 0.067, 0.058, 0.051, 0.046]
df = pd.DataFrame({'first_digit':list(range(1,10)), 'proportion':first_digit_probability})
fig = px.bar(df, x='first_digit', y='proportion', text='proportion', width=1200, height=300, title="Benford\'s Law").show()

An Explanation of “Cosine Distance”

Using the distribution of “first digits” in a cord (the value of the first knot) we can use a simple cosine distance similarity metric of Benford’s distribution and cord distribution. Cosine distance similarity is a simple concept that is used a lot in this study … If you’re unfamiliar here’s an intuitive explanation.

Suppose we want to see how close two sets of numbers are to each other. For example house A has 3 dogs and 2 cats, house B has 5 dogs and 1 cat and house C has 1 dog and 7 cats. We can represent each of these as a vector:

A = [3,2] B = [5,1] C = [1,7]

Each of these vectors can be “normalized” to a unit vector (a vector of length 1) by dividing by the vector length. Then you can take advantage of math. A vector dot-product of two vectors measures the angle between those two vectors. If we want to know if A is closer to B or C we take A . B and A . C and see which has the smallest angle between them. The dot -product generates the cosine of the angle, which is why this technique is known as the cosine distance metric.



We can do the same for khipus by creating a vector of first digit counts, for a given khipu and then dot producting it with the Benford probabilities above.

f(angle_theta) = first_digit_probability . given_khipu_digit_distribution. If the angle theta is 0, the vectors lie on top of each other (an exact match) so cos(theta) = 1, with no match = 0.

Another approach to understanding which khipus are accounting in nature, and which aren’t is to sum up cord values and see if that sum occurs elsewhere on another cord in the khipu. Dr. Urton, shows an example of this kind, khipu UR028, in his youtube video on accounting khipu. Here I examine how many “sum matches” occur in a khipu (normalized by the number of total cords…). This gives us a secondary statistical check, in addition to Benford’s Law.

Note that the metric calculated below is not exact. Here, only two cord sums are measured. The first decoded khipu, in the 1920s by Leland Locke, had one up-cord sum for four down cord values and would not score well on this metric.

There is another metric similar to the above. Some cord clusters (ie. top cords) are the sum, or approximate sum of the cluster adjacent (ie. on either side) of them. The khipu class tags these “sum_clusters” and allows us another metric.

Code
## Benford Law Similarity
# Note that 0 index's are ignored, by setting first_digit_probability[0] to 0
first_digit_probability = [0.0, 0.301, 0.176, 0.125, 0.097, 0.079, 0.067, 0.058, 0.051, 0.046]
def benford_match(aKhipu):
    digit_count = [0]*10
    for cord in aKhipu.attached_cords():
        first_digit = int(str(cord.decimal_value)[0])
        digit_count[first_digit] += 1
    cosine_similarity = 1 - scipy.spatial.distance.cosine(first_digit_probability, digit_count)
    return cosine_similarity
khipu_benford_match = [benford_match(aKhipu) for aKhipu in all_khipus]
Code
# Cords whose sum occurs as another cord in a khipu
def sum_matches(numbers, verbose = False):
    numbers = list(filter(lambda a: a != 0, numbers))
    num_matches= 0
    max_match = 0
    all_matches = [0]
    for i in range(len(numbers)):
        for j in range(i):
            match_sum = numbers[i] + numbers[j]
            if (match_sum in numbers) and (match_sum != 0):
                if verbose: print(f"[{i}]{numbers[i]}, [{j}]{numbers[j]} sums to [{numbers.index(match_sum)}]{match_sum}")
                num_matches += 1
                all_matches.append(match_sum)
    return ({'num_matches':num_matches, 'max_match': max(all_matches), 'mean_match': mean(all_matches) })

def khipu_sum_matches(aKhipu):
    cord_values = [aCord.decimal_value for aCord in aKhipu.attached_cords() if aCord.decimal_value > 0]
    return sum_matches(cord_values)


sum_match_dicts = [khipu_sum_matches(aKhipu) for aKhipu in all_khipus]
sum_match_counts = [aDict['num_matches']for aDict in sum_match_dicts]
sum_match_max = [aDict['max_match']for aDict in sum_match_dicts]
sum_match_mean = [aDict['mean_match']for aDict in sum_match_dicts]
Code
benford_df = pd.DataFrame({'khipu_id': [aKhipu.khipu_id for aKhipu in all_khipus],
                           'name': [aKhipu.name() for aKhipu in all_khipus],
                           'num_pendants': [aKhipu.num_pendant_cords() for aKhipu in all_khipus],
                           'num_subsidiary_cords': [aKhipu.num_subsidiary_cords() for aKhipu in all_khipus],
                           'num_total_cords': [aKhipu.num_attached_cords()for aKhipu in all_khipus],
                           'fanout_ratio': [aKhipu.num_pendant_cords() / aKhipu.num_attached_cords() for aKhipu in all_khipus],
                           'benford_match': khipu_benford_match,
                           'sum_match_count': sum_match_counts,
                           'sum_match_max': sum_match_max,
                           'sum_match_mean': sum_match_mean})

benford_df['normalized_sum_match_count'] = benford_df['sum_match_count']/benford_df['num_total_cords']
benford_df = benford_df.sort_values(by='benford_match', ascending=False)
benford_df.to_csv(kq.qollqa_data_directory() + "benford.csv") # Save for later when we cross correlate with other statistical tests
benford_df.head()
khipu_id name num_pendants num_subsidiary_cords num_total_cords fanout_ratio benford_match sum_match_count sum_match_max sum_match_mean normalized_sum_match_count
252 1000269 UR053E 70 35 105 0.666667 0.992311 3811 81 14 36.295238
645 6000089 UR193 99 127 226 0.438053 0.985153 15906 130 54 70.380531
577 6000020 KH0058 80 98 178 0.449438 0.984313 10570 115 22 59.382022
426 1000404 UR151 12 8 20 0.600000 0.978858 64 120 61 3.200000
612 6000055 QU005 57 25 82 0.695122 0.978147 425 1107 112 5.182927
Code
hist_data = [khipu_benford_match]
group_labels = ['benford_match']
colors = ['#835AF1']

(
ff.create_distplot(hist_data, group_labels,  bin_size=.01, 
                         show_rug=False,show_curve=True)
    .update_layout(title_text='Distribution Plot of Khipu -> Benford Matches', # title of plot
                     xaxis_title_text='Benford Match Value (cosine_similarity)', 
                     yaxis_title_text='Percentage of Khipus', # yaxis label
                     bargap=0.05, # gap between bars of adjacent location coordinates
                     #bargroupgap=0.1 # gap between bars of the same location coordinates
                     )
     .show()
)

Here’s a clear statistical/graphical narrative as to which khipus are “accounting” in nature (from a little above .6 to 1.0!), somewhat accounting in nature (.4 to .6), and possibly narrative in nature (< 0.4). How many khipus in each category?

Code
is_likely_accounting = benford_df[benford_df.benford_match > .625]
print(f"Khipus with benford_match > .625 = {is_likely_accounting.shape[0]} (is_likely_accounting)")
is_likely_narrative = benford_df[benford_df.benford_match < .425]
print(f"Khipus with benford_match < .425 = {is_likely_narrative.shape[0]} (is_likely_narrative)")
Khipus with benford_match > .625 = 431 (is_likely_accounting)
Khipus with benford_match < .425 = 102 (is_likely_narrative)

We see that there’s probably 352 khipu that are likely accounting, and 67 khipu that are likely narrative.

Code
### Benford Match vs Sum Clusters

Let’s do a scatter plot of Benford Matching vs (normalized) cords whose sums match another cord. The clustering of accounting khipus is revealed even more clearly.

Code
fig = px.scatter(benford_df, x="benford_match", y="normalized_sum_match_count", log_y=True,
                 size="sum_match_max", color="sum_match_mean", 
                 labels={"benford_match": "#Benford Match", "normalized_sum_match_count": "#Number of Matching Cord Sums (Normalized by # Total Cords)"},
                 hover_data=['name'], title="<b>Benford Match vs Normalized Sum Match Counts</b> - <i>Hover Over Circles to View Khipu Info</i>",
                 width=950, height=950);
fig.update_layout(showlegend=True).show()

Again, we see the lumpiness and confirmation of the “accounting khipus” whose Benford match is > 0.6 or so….

Sanity Check - Marcia Ascher’s “Mathematical” Khipus versus their Corresponding Benford Match

As another checkpoint on Benford Law matching approach, let’s review the khipus that Marcia Ascher’s annotated with mathematical relationships and compare them with their corresponding Benford match.

Code
# These are the khipus that Marcia Ascher annotated as having interesting mathematical relationships
ascher_mathy_khipus = ['AS010', 'AS020', 'AS026B', 'AS029', 'AS038', 'AS041', 'AS048', 'AS066', 'AS068', 
                       'AS079', 'AS080', 'AS082', 'AS092', 'AS115', 'AS128', 'AS129', 'AS164', 'AS168', 'AS192', 
                       'AS193', 'AS197', 'AS198', 'AS199', 'AS201', 'AS206', 'AS207A', 'AS208', 'AS209', 'AS210', 
                       'AS211', 'AS212', 'AS213', 'AS214', 'AS215', 'UR1034', 'UR1096', 'UR1097', 'UR1098', 
                       'UR1100', 'UR1103', 'UR1104', 'UR1114', 'UR1116', 'UR1117', 'UR1120', 'UR1126', 'UR1131', 
                       'UR1135', 'UR1136', 'UR1138', 'UR1140', 'UR1141', 'UR1143', 'UR1145', 'UR1148', 'UR1149', 
                       'UR1151', 'UR1152', 'UR1163', 'UR1175', 'UR1180']
ascher_khipu_benford_match = [benford_match(aKhipu) for aKhipu in all_khipus if aKhipu.name() in ascher_mathy_khipus]

def is_an_ascher_khipu(aKhipu):
    return True if str(aKhipu.original_name)[0:2] == "AS" or aKhipu.name()[0:2] == "AS" else False

ascher_khipus = [aKhipu for aKhipu in all_khipus if is_an_ascher_khipu(aKhipu)]
percent_mathy_khipus = 100.0 * float(len(ascher_mathy_khipus))/float(len(ascher_khipus))
percent_string = "{:4.1f}".format(percent_mathy_khipus)

hist_data = [ascher_khipu_benford_match]
group_labels = ['benford_match']
colors = ['#835AF1']

(
ff.create_distplot(hist_data, group_labels,  bin_size=.01, 
                         show_rug=False,show_curve=True)
    .update_layout(title_text='<b>Benford Match for Ascher Khipus with Mathematical Relationships</b> - <i>Hover Over Circles to View Khipu Info</i>', # title of plot
                     xaxis_title_text='Benford Match Value (cosine_similarity)', 
                     yaxis_title_text='Percentage of Khipus', # yaxis label
                     bargap=0.05, # gap between bars of adjacent location coordinates
                     #bargroupgap=0.1 # gap between bars of the same location coordinates
                     )
     .show()
)
print (f"Percent of ascher khipus having mathematical relationships ({len(ascher_mathy_khipus)} out of {len(ascher_khipus)}) is {percent_string}%")
Percent of ascher khipus having mathematical relationships (61 out of 144) is 42.4%

Benford Match vs Fanout Ratio (Primaries vs Subsidiaries)

Let’s conclude this section with an interesting “triangular” graph - Benford match vs Fanout Ratio (the ratio of Pendant cords vs Subsidiary cords)

Code
fig = px.scatter(benford_df, x="benford_match", y="fanout_ratio", 
                 size="num_total_cords", color="num_subsidiary_cords", 
                 labels={"num_total_cords": "#Total Cords", "num_subsidiary_cords": "#Subsidiary Cords"},
                 hover_data=['name'], title="<b>Benford Match vs Fanout Ratio</b> - <i>Hover Over Circles to View Khipu Info</i>",
                 width=950, height=950);
fig.update_layout(showlegend=True).show()

Grouping Khipus into Families - What Constitutes “Similarity”

As a useful first step, it will help to have the ability to sort khipus not just by meta-features like benford_match, but also by other forms of similarity.

One of my favorite similarity techniques is the text-clustering approaches of NLP. A driving need in NLP is to sort topics into various categories; for example, is this a news document or fiction. So how do we text mine khipus? A simple solution is to convert the khipu to words. Each Python Khipu class object (knot, cord, cluster, etc.) is given a “writer” function that documents itself as text. Below is the “document” for a simple 3 cord khipu - AS055. It’s a pretty sparse and simple description. We’ll go about making this more sophisticated as the project becomes more developed. The point is that we now have a set of words that roughly describe a khipu with categorical words such as U_knot_twist, or single_knot and numerical words such as 1421_decimal_value for the cord value. Now we use that set of words as the input to NLP algorithms.

Code
# To assist clustering Ascher khipus, where many khipu metrics such as cord attachment, twist, etc are unknown, 
# We can impute values for attachment, cord twist and knot twist'
def clean_doc(aDocString): return (aDocString.replace("U_twist", "S_twist")
                                             .replace("U_knot_twist", "S_knot_twist")
                                             .replace("Unknown_attachment_type", "Recto_attachment_type"))

khipu_names = [aKhipu.kfg_name() for aKhipu in all_khipus]
khipu_documents = [clean_doc(aKhipu.as_document()) for aKhipu in all_khipus]
khipu_docstring_dict = {name:doc for (name,doc) in zip(khipu_names, khipu_documents)}
Code
print(khipu_docstring_dict["AS055"])
 primary_cord_start YB cord_cluster_start 1_cluster_level 3_cc_cords cord_start U_attachment_type S_twist 1_cord_level 1_cord_ordinal 1421_decimal_value 1_cluster_ordinal KB knot_cluster_start single_knot S_knot_twist ʘ_knot_value knot_cluster_end knot_cluster_start single_knot S_knot_twist o_knot_value single_knot S_knot_twist o_knot_value single_knot S_knot_twist o_knot_value single_knot S_knot_twist o_knot_value knot_cluster_end knot_cluster_start single_knot S_knot_twist  _knot_value single_knot S_knot_twist  _knot_value knot_cluster_end knot_cluster_start figure_eight_knot S_knot_twist  _knot_value knot_cluster_end cord_end cord_start U_attachment_type S_twist 1_cord_level 2_cord_ordinal 2427_decimal_value 2_cluster_ordinal YB knot_cluster_start single_knot S_knot_twist ʘ_knot_value single_knot S_knot_twist ʘ_knot_value knot_cluster_end knot_cluster_start single_knot S_knot_twist o_knot_value single_knot S_knot_twist o_knot_value single_knot S_knot_twist o_knot_value single_knot S_knot_twist o_knot_value knot_cluster_end knot_cluster_start single_knot S_knot_twist  _knot_value single_knot S_knot_twist  _knot_value knot_cluster_end knot_cluster_start long_knot S_knot_twist 7_knot_value knot_cluster_end cord_end cord_start U_attachment_type S_twist 1_cord_level 3_cord_ordinal 734_decimal_value 3_cluster_ordinal KB knot_cluster_start single_knot S_knot_twist o_knot_value single_knot S_knot_twist o_knot_value single_knot S_knot_twist o_knot_value single_knot S_knot_twist o_knot_value single_knot S_knot_twist o_knot_value single_knot S_knot_twist o_knot_value single_knot S_knot_twist o_knot_value knot_cluster_end knot_cluster_start single_knot S_knot_twist  _knot_value single_knot S_knot_twist  _knot_value single_knot S_knot_twist  _knot_value knot_cluster_end knot_cluster_start long_knot S_knot_twist 4_knot_value knot_cluster_end cord_end cord_cluster_end primary_cord_end 


Then we can pass the document to modern NLP text mining algorithms which can cluster the text by similarity along the n-dimensions of all the different words and their frequencies in each document.

The usual first step in NLP document analysis is to build a term-frequency, inverse-document-frequency matrix.

Tf-Idf matrices for NLP have a long history. An explanation:

Term Frequency, Inverse Document Frequency and Tf°Idf

As part of training our machine model, we want to give it information about how words are distributed across our set of data. This is commonly done using Term Frequency/Inverse Document Frequency techniques.

The first step, then, is to calculate the term frequency (Tf), the inverse document frequency (Idf), and their product (Tf°Idf) Apologies: There’s an occasional Google Chrome bug with latex/equation rendering - see equations in Safari if you want to understand them

(Tf): TERM FREQUENCY
How probable is that this particular word appears in a SINGLE document.

\(Tf(\omega) = \left( \frac{ \# occurences\ of\ \omega\ in\ document}{\# words\ in\ document} \right)\)

(Idf): INVERSE DOCUMENT FREQUENCY
How common or rare a word is across ALL documents. If it’s a common word across all documents, it’s regarded as a non-differentiating word, and we want to weight it lower.

Alternatively, it can be thought of as the specificity of a term - quantified as an inverse function of the number of documents in which it occurs. The most common function used is log(natural) of the number of docs divided by the number of docs in which this term appears.

\(Idf(\omega) = ln \left( \frac{ \# docs}{\# docs.containing(\omega) } \right)\)

(Tf°Idf):TERM SPECIFICITY
Tf°Idf, the product of Tf and Idf, is used to weight words according to how “important” they are to the machine learning algorithm ☺. The higher the TF°IDF score (weight), the rarer the term and vice versa. The intuition for this measure is this:

If a word appears frequently in a document, then it should be important and we should give that word a high score. But if a word appears in too many other documents, it’s probably not a unique identifier, therefore we should assign a lower score to that word.

Tf°idf is one of the most popular term-weighting schemes today. Over 80% of text-based recommender systems, including Google’s search engine, use Tf°Idf. Introduced, as “term specificity” by Karen Spärck Jones in a 1972 paper, it has worked well as a heuristic. However, its theoretical foundations have been troublesome for at least three decades afterward, with many researchers trying to find information theoretic justifications for it.

Ashok Khosla - Detecting Deceptive Hotel Reviews

Code
%%capture cell_output.txt

# Calculate TF-IDF Matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk

def simple_tokenizer(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as it's own token
    tokens = [word.lower() for word in nltk.word_tokenize(text)]
    return tokens

tfidf_vectorizer = TfidfVectorizer(max_df=.9, max_features=200000,
                                   min_df=.1, stop_words=None,
                                   use_idf=True, tokenizer=simple_tokenizer, ngram_range=(1,10))

tfidf_matrix = tfidf_vectorizer.fit_transform(khipu_documents)

terms = tfidf_vectorizer.get_feature_names_out()
cosine_distance = 1 - cosine_similarity(tfidf_matrix)

K-Means Clustering


Traditionally the next step is to use a variety of clustering methods. Here I will only use two. The first, K-means clustering, groups the “documents” based on n-dimensional clustering of the bag of words contained in the documents. I’ve used K-means in the past. It’s not a great clustering algorithm, but it has the advantage that it outputs discrete groupings which can be handy. So I always do it as a first pass.

Code
%%capture cell_output.txt
from sklearn.cluster import KMeans

num_clusters = 9

km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)
clusters = km.labels_.tolist()
Code
khipu_cluster_dict = { 'khipu_name': khipu_names, 'document': khipu_documents, 'cluster': clusters  }
khipu_cluster_df = pd.DataFrame(khipu_cluster_dict, index = [clusters] , columns = ['khipu_name', 'document', 'cluster'])
#display_dataframe(khipu_cluster_df)

kMeansClusters = []
for cluster_id in range(num_clusters):
    khipus_in_cluster = khipu_cluster_df[khipu_cluster_df.cluster == cluster_id].khipu_name.values.tolist()
    kMeansClusters.append(khipus_in_cluster)
    print(cluster_id, ":\t", khipus_in_cluster,",")
    
pd.DataFrame(kMeansClusters).to_csv(f"{kq.qollqa_data_directory()}/kMeansClusters.csv")     
0 :  ['AS025', 'AS081', 'HP038', 'HP044', 'HP050', 'JC002', 'JC003', 'JC004', 'JC005', 'JC006', 'JC008', 'JC009', 'JC010', 'JC011', 'JC012', 'JC013', 'JC014', 'JC015', 'JC016', 'JC019', 'JC022', 'JC023', 'UR046', 'UR075', 'UR103', 'UR178', 'UR180', 'UR251', 'UR282', 'UR283', 'UR285', 'UR286', 'UR287', 'UR288', 'UR289', 'UR290', 'UR291A', 'AS004', 'MM011', 'MM013', 'QU022', 'UR039', 'UR050', 'UR054', 'UR055', 'UR110', 'UR112', 'UR144', 'UR292A'] ,
1 :  ['AS011', 'AS014', 'AS018', 'AS020', 'AS029', 'AS035C', 'AS037', 'AS041', 'AS050', 'AS054', 'AS055', 'AS063B', 'AS065', 'AS073', 'AS101 - Part 1', 'AS101 - Part 2', 'AS111', 'AS115', 'AS122', 'AS125', 'AS129', 'AS133', 'AS134', 'AS137', 'AS156', 'AS164', 'AS168', 'AS170', 'AS172', 'AS177', 'AS182B', 'AS183', 'AS184', 'AS187', 'AS188', 'AS192', 'AS197', 'AS200', 'AS206', 'AS209', 'AS35B', 'HP037', 'HP039', 'HP052', 'HP053', 'HP056', 'UR002', 'UR011', 'UR013', 'UR014', 'UR015', 'UR049', 'UR053A', 'UR053B', 'UR053D', 'UR053E', 'UR058A', 'UR058B', 'UR085', 'UR094', 'UR095', 'UR098', 'UR099', 'UR100', 'UR101', 'UR1033E', 'UR104', 'UR1040', 'UR108', 'UR1099', 'UR1105', 'UR1113', 'UR1116', 'UR1150', 'UR1176', 'UR118', 'UR119', 'UR121', 'UR122', 'UR125', 'UR130', 'UR131A', 'UR131B', 'UR131C', 'UR136', 'UR137', 'UR142', 'UR143', 'UR145', 'UR161', 'UR166', 'UR182', 'UR184', 'UR186', 'UR187', 'UR194', 'UR216', 'UR223', 'UR247', 'AS001', 'AS002', 'AS003', 'AS006', 'AS007', 'AS008', 'AS009', 'AS072', 'CM009', 'HP047', 'KH0049', 'KH0057', 'KH0058', 'KH0081', 'KH0083', 'MM002', 'MM004', 'MM008', 'MM009', 'MM010', 'MM012', 'MM014', 'MM015', 'MM016', 'MM017', 'QU003', 'QU005', 'QU011', 'QU012', 'QU014', 'QU019', 'UR253'] ,
2 :  ['AS010', 'AS012', 'AS013', 'AS015', 'AS016', 'AS019', 'AS021', 'AS023', 'AS024', 'AS028', 'AS035D', 'AS036', 'AS039', 'AS042', 'AS043', 'AS044', 'AS045', 'AS059', 'AS060', 'AS063', 'AS064', 'AS065B', 'AS066', 'AS069', 'AS077', 'AS082', 'AS083', 'AS085', 'AS089', 'AS090/N2', 'AS092', 'AS093', 'AS110', 'AS112', 'AS128', 'AS132', 'AS139', 'AS142', 'AS158', 'AS159', 'AS160', 'AS169', 'AS171', 'AS173', 'AS178', 'AS182', 'AS185', 'AS202', 'AS203', 'AS207A', 'AS207C', 'AS210', 'AS211', 'AS214', 'AS215', 'AS35A', 'UR040', 'UR041', 'UR042', 'UR051', 'UR1033A', 'UR1034', 'UR1097', 'UR1117', 'UR1121', 'UR1141', 'UR1146', 'UR1162A', 'UR1163', 'KH0080', 'UR052'] ,
3 :  ['AS194', 'AS195', 'AS196', 'AS199', 'HP003', 'HP013', 'HP017', 'HP021', 'HP034', 'JC001', 'JC017', 'UR008', 'UR018', 'UR061', 'UR062', 'UR086', 'UR088', 'UR105', 'UR1084', 'UR1098', 'UR1100', 'UR1102', 'UR1103', 'UR1107', 'UR1109', 'UR1114', 'UR1123', 'UR1124', 'UR1126', 'UR1127', 'UR1131', 'UR1135', 'UR1136', 'UR1138', 'UR1140', 'UR1144', 'UR1147', 'UR1161', 'UR1162B', 'UR1165', 'UR1167', 'UR1175', 'UR120', 'UR127', 'UR138', 'UR141', 'UR146', 'UR149', 'UR151', 'UR152', 'UR154', 'UR174', 'UR206', 'UR207', 'UR219', 'UR227', 'UR228', 'UR232', 'UR237', 'UR240', 'UR246', 'UR258', 'UR262', 'UR267A', 'UR270', 'UR279', 'UR293', 'UR294', 'KH0001', 'MM006/AN001', 'MM1086', 'UR193', 'UR221', 'UR280'] ,
4 :  ['AS076', 'AS080', 'AS191', 'HP015', 'HP020', 'HP032', 'HP033', 'HP040', 'HP046 A', 'HP051 B', 'HP057', 'UR006', 'UR019', 'UR021', 'UR027', 'UR053C', 'UR056B', 'UR059', 'UR066', 'UR067', 'UR069', 'UR078', 'UR079', 'UR082', 'UR089', 'UR091', 'UR093', 'UR102', 'UR1032', 'UR1033C', 'UR1033F', 'UR1053', 'UR1058', 'UR1091', 'UR1095', 'UR1096', 'UR114', 'UR117B', 'UR117D', 'UR124', 'UR128', 'UR134', 'UR135', 'UR139', 'UR140', 'UR160', 'UR170', 'UR173', 'UR179', 'UR189', 'UR191', 'UR195', 'UR198', 'UR203', 'UR210', 'UR211', 'UR214', 'UR231', 'UR235', 'UR239', 'UR242', 'UR243', 'UR244', 'UR250', 'UR252', 'UR272', 'HP045', 'HP048', 'KH0227', 'MM001', 'MM003', 'MM021', 'QU004', 'QU007', 'QU009', 'QU013', 'QU017', 'QU021', 'UR155', 'UR190'] ,
5 :  ['AS017', 'AS026A', 'AS026B', 'AS027', 'AS048', 'AS061/MA036', 'AS062', 'AS071', 'AS094', 'AS153', 'AS155', 'AS157', 'AS174', 'AS186', 'AS189', 'AS201', 'AS204', 'AS205', 'AS207B', 'AS212', 'AS213', 'AS215F', 'HP016', 'UR003', 'UR007', 'UR012', 'UR016', 'UR023', 'UR024', 'UR029', 'UR057', 'UR084', 'UR096', 'UR097', 'UR1052', 'UR109', 'UR1108', 'UR1130', 'UR1151', 'UR1166', 'UR116A', 'UR117C', 'UR132', 'UR148', 'UR156', 'UR169', 'UR175', 'UR177', 'UR196', 'UR205', 'UR215', 'UR230', 'UR238', 'UR241', 'UR276', 'AS005', 'KH0267', 'KH0350'] ,
6 :  ['HP001', 'HP002', 'HP006', 'HP007', 'HP008', 'HP010', 'HP011', 'HP024', 'HP027', 'HP029', 'HP030', 'JC007', 'JC018', 'JC020', 'JC021', 'UR020', 'UR026', 'UR038', 'UR1104', 'UR1106', 'UR1118', 'UR1119', 'UR1120', 'UR1124 Detail 1', 'UR1143', 'UR1145', 'UR1148', 'UR1149', 'UR1152', 'UR1179', 'UR1180', 'UR129', 'UR147', 'UR153', 'UR162', 'UR164', 'UR183', 'UR185', 'UR188', 'UR200', 'UR201', 'UR218', 'UR226', 'UR234', 'UR248', 'UR249', 'UR256', 'UR257', 'UR266', 'UR267B', 'UR268', 'UR269', 'UR271', 'UR273A', 'UR273B', 'UR274A', 'UR274B', 'UR275', 'UR278', 'KH0067', 'KH0197', 'MM018', 'QU010', 'QU015', 'UR165'] ,
7 :  ['AS074', 'AS075', 'AS078', 'AS079', 'AS193', 'AS198', 'HP019', 'HP025', 'HP026', 'HP028', 'HP031', 'HP042', 'HP043', 'HP046 B', 'HP049', 'UR009', 'UR028', 'UR032', 'UR037', 'UR045', 'UR056C', 'UR060', 'UR063', 'UR065', 'UR070', 'UR071', 'UR072', 'UR073', 'UR074', 'UR076', 'UR077', 'UR080', 'UR081', 'UR087', 'UR092', 'UR1033B', 'UR1049', 'UR1057', 'UR1088', 'UR111', 'UR116B', 'UR117A', 'UR123', 'UR157', 'UR158', 'UR159', 'UR168', 'UR171', 'UR209', 'UR245', 'UR259', 'UR263', 'UR265', 'MM005', 'MM007/AN002', 'MM019', 'QU001', 'QU002', 'QU006', 'QU008', 'QU018', 'QU020', 'UR167'] ,
8 :  ['HP004', 'HP005', 'HP009', 'HP012', 'HP014', 'HP018', 'HP022', 'HP023', 'HP051 A', 'HP054', 'UR005', 'UR010', 'UR022', 'UR033', 'UR034', 'UR047', 'UR048', 'UR056A', 'UR064', 'UR068', 'UR090', 'UR1033D', 'UR1033G', 'UR1051', 'UR106', 'UR107', 'UR1087', 'UR113', 'UR1154', 'UR150', 'UR172', 'UR192', 'UR197', 'UR199', 'UR202', 'UR204', 'UR208', 'UR212', 'UR213', 'UR217', 'UR220', 'UR222', 'UR225', 'UR229', 'UR233', 'UR254', 'UR260', 'UR261', 'UR264', 'UR277', 'UR284', 'HP055', 'KH0032', 'KH0033', 'MM020', 'QU016', 'UR001', 'UR004', 'UR017', 'UR255'] ,

One simple test of the clustering - where do known khipu clusters cluster? For example, where does UR1033[A-F] group? How about UR058[A-B]? Examine these and similar clusters in the list above and below in the hierarchical dendogram. It is also useful to browse the results visually using the Khipu Field Guide Image Browser

Hierarchical Clustering

Hierarchical clustering, uses “cosine similarity” (i.e. how close together are two normalized vectors in an n-dimensional bag-of-words frequency matrix) to group together and cluster documents. In my experience, it generally performs better than Kmeans or Principal Component Analysis. Although I prefer the R package for clustering, Python provides graphics adequate for the job at hand. In Phase 3 we will likely move to R.

Code
from scipy.cluster.hierarchy import ward, dendrogram
import matplotlib.pyplot as plt

Z = ward(cosine_distance) #define the linkage_matrix using ward clustering pre-computed distances

fig, ax = plt.subplots(figsize=(15, 40)) # set size
ax = dendrogram(Z, orientation="left", labels=khipu_names);

plt.tick_params(\
    axis= 'x',          # changes apply to the x-axis
    which='both',       # both major and minor ticks are affected
    bottom='off',       # ticks along the bottom edge are off
    top='off',          # ticks along the top edge are off
    labelbottom='off')

plt.tight_layout() #show plot with tight layout

#uncomment below to save figure
plt.savefig('./images/khipu_clusters.png', dpi=200) #save figure as ward_clusters

# Clustered 
sorted_by_similarity = ax['ivl']
print(sorted_by_similarity)
pd.DataFrame(sorted_by_similarity).to_csv(f"{kq.qollqa_data_directory()}/kHierarchicalCluster.csv")     
['UR020', 'UR162', 'UR185', 'JC007', 'JC020', 'UR154', 'HP029', 'UR274A', 'UR256', 'QU010', 'UR038', 'UR226', 'UR266', 'UR026', 'UR273A', 'UR271', 'HP006', 'HP024', 'UR153', 'KH0067', 'UR257', 'UR200', 'UR234', 'UR201', 'UR269', 'UR1179', 'UR1119', 'UR1124 Detail 1', 'UR1143', 'UR1148', 'JC021', 'UR1118', 'UR1120', 'HP001', 'UR1149', 'HP002', 'HP010', 'UR188', 'UR164', 'UR249', 'UR268', 'JC018', 'HP027', 'UR165', 'HP030', 'UR248', 'UR275', 'UR1145', 'UR151', 'UR1103', 'UR262', 'UR1144', 'UR1107', 'UR1135', 'UR129', 'UR1106', 'UR1180', 'UR1123', 'UR1152', 'UR147', 'UR274B', 'UR183', 'MM018', 'HP008', 'UR273B', 'UR218', 'UR267B', 'UR018', 'AS199', 'UR237', 'UR240', 'QU015', 'HP007', 'HP011', 'HP034', 'UR062', 'UR278', 'MM006/AN001', 'UR146', 'KH0197', 'UR1087', 'UR001', 'UR005', 'UR022', 'UR213', 'UR217', 'UR1051', 'UR260', 'UR261', 'UR208', 'UR233', 'HP023', 'UR202', 'UR033', 'UR219', 'HP012', 'UR220', 'HP003', 'HP055', 'UR197', 'MM020', 'UR255', 'UR1175', 'UR1109', 'UR1124', 'UR1126', 'UR1136', 'UR1165', 'UR1100', 'UR1140', 'UR1161', 'AS194', 'AS195', 'UR1131', 'UR280', 'HP021', 'UR061', 'UR086', 'UR120', 'UR232', 'UR258', 'UR138', 'UR206', 'UR088', 'UR149', 'UR207', 'UR141', 'UR152', 'MM1086', 'JC001', 'UR105', 'UR294', 'UR227', 'UR228', 'AS196', 'HP013', 'UR008', 'UR221', 'UR267A', 'UR246', 'HP017', 'UR174', 'UR1102', 'UR1138', 'UR1167', 'UR1147', 'UR1127', 'UR1162B', 'UR293', 'JC017', 'KH0001', 'UR193', 'UR270', 'UR279', 'UR203', 'UR235', 'UR089', 'UR091', 'UR160', 'UR189', 'UR059', 'UR190', 'UR069', 'UR239', 'UR067', 'UR231', 'MM001', 'UR155', 'UR113', 'UR277', 'HP004', 'UR034', 'UR204', 'HP009', 'UR004', 'UR010', 'UR090', 'HP051 A', 'UR107', 'UR242', 'KH0032', 'QU016', 'UR222', 'UR017', 'UR048', 'UR150', 'UR212', 'UR264', 'UR192', 'UR199', 'HP005', 'HP018', 'HP014', 'UR064', 'UR068', 'HP022', 'UR225', 'UR229', 'UR056C', 'UR063', 'UR073', 'UR047', 'UR1033D', 'UR172', 'HP054', 'UR074', 'UR1033G', 'KH0033', 'UR056A', 'UR209', 'UR283', 'UR286', 'JC009', 'JC013', 'JC006', 'JC005', 'UR284', 'JC011', 'UR251', 'UR254', 'JC010', 'JC019', 'UR103', 'UR1154', 'HP050', 'UR106', 'UR116A', 'UR205', 'UR019', 'UR272', 'UR196', 'UR116B', 'UR238', 'UR148', 'UR175', 'UR177', 'UR096', 'UR109', 'UR1114', 'UR1098', 'UR127', 'UR1166', 'UR1130', 'UR215', 'UR053C', 'MM003', 'UR1104', 'UR1084', 'MM002', 'HP037', 'UR166', 'UR085', 'QU011', 'MM008', 'QU019', 'UR049', 'MM004', 'KH0057', 'UR253', 'UR012', 'UR014', 'UR053E', 'UR104', 'UR184', 'HP056', 'QU003', 'QU014', 'UR058A', 'UR058B', 'UR145', 'KH0083', 'UR053B', 'UR130', 'UR013', 'UR121', 'UR098', 'UR1033E', 'UR216', 'HP045', 'UR102', 'QU021', 'UR066', 'KH0227', 'QU009', 'UR137', 'UR095', 'QU012', 'AS192', 'MM015', 'UR125', 'UR223', 'UR186', 'UR118', 'UR119', 'AS200', 'UR011', 'UR131B', 'UR131A', 'UR131C', 'UR161', 'HP047', 'UR247', 'AS197', 'UR053A', 'UR015', 'KH0049', 'UR122', 'UR187', 'UR108', 'AS008', 'KH0081', 'AS209', 'AS115', 'AS125', 'AS054', 'AS111', 'AS029', 'AS170', 'UR142', 'UR143', 'UR002', 'UR094', 'AS183', 'AS014', 'UR1040', 'UR1150', 'AS101 - Part 1', 'AS101 - Part 2', 'AS072', 'AS156', 'AS168', 'AS187', 'UR1116', 'AS065', 'AS035C', 'AS122', 'UR1113', 'AS073', 'AS134', 'AS020', 'AS188', 'AS133', 'AS206', 'AS018', 'AS182B', 'HP052', 'AS009', 'UR136', 'MM014', 'UR053D', 'UR101', 'HP039', 'CM009', 'HP053', 'QU005', 'UR099', 'MM012', 'MM016', 'MM017', 'AS055', 'AS172', 'AS129', 'UR1099', 'AS082', 'AS139', 'AS042', 'UR1117', 'UR084', 'KH0267', 'AS001', 'AS007', 'AS006', 'KH0058', 'AS35B', 'AS137', 'AS177', 'AS184', 'AS002', 'AS003', 'UR100', 'MM009', 'MM010', 'UR039', 'AS083', 'UR1141', 'AS153', 'AS174', 'AS159', 'AS189', 'AS212', 'AS158', 'AS160', 'AS010', 'AS215', 'AS021', 'AS207A', 'AS089', 'AS069', 'AS085', 'AS164', 'UR1105', 'AS011', 'AS041', 'AS185', 'AS063B', 'UR1176', 'AS050', 'AS037', 'AS036', 'AS178', 'AS092', 'AS110', 'AS063', 'AS066', 'AS171', 'AS173', 'UR042', 'AS043', 'AS128', 'AS132', 'AS045', 'AS210', 'KH0080', 'AS044', 'UR1034', 'AS012', 'UR040', 'AS026A', 'AS211', 'AS186', 'UR1151', 'UR132', 'AS061/MA036', 'AS062', 'AS157', 'AS026B', 'AS207B', 'AS142', 'AS201', 'AS213', 'AS155', 'AS215F', 'AS017', 'AS204', 'AS094', 'AS027', 'AS048', 'AS071', 'AS205', 'AS013', 'AS028', 'AS035D', 'UR041', 'UR1163', 'AS023', 'AS35A', 'UR1033A', 'AS182', 'AS059', 'AS214', 'UR1146', 'UR1108', 'UR1162A', 'AS112', 'UR1121', 'AS202', 'AS203', 'AS015', 'UR1097', 'AS024', 'AS065B', 'AS016', 'AS064', 'AS207C', 'UR051', 'UR052', 'AS019', 'AS060', 'AS093', 'AS169', 'AS090/N2', 'AS039', 'AS077', 'UR173', 'HP051 B', 'UR170', 'UR1091', 'UR117D', 'UR1095', 'HP032', 'UR243', 'UR056B', 'UR079', 'UR021', 'UR252', 'UR244', 'HP046 A', 'HP033', 'UR214', 'UR124', 'UR139', 'UR128', 'UR210', 'UR1033F', 'UR1096', 'UR191', 'UR195', 'MM021', 'UR009', 'UR006', 'UR045', 'HP057', 'QU013', 'QU017', 'UR032', 'UR263', 'UR1053', 'UR093', 'QU001', 'UR1033C', 'UR134', 'HP015', 'AS076', 'HP040', 'UR1032', 'UR114', 'UR198', 'UR135', 'QU007', 'AS080', 'UR027', 'UR117B', 'HP048', 'UR082', 'UR211', 'HP020', 'UR140', 'UR078', 'UR1058', 'AS191', 'UR250', 'QU004', 'AS025', 'UR112', 'UR144', 'UR050', 'UR110', 'AS081', 'UR054', 'UR055', 'MM011', 'MM013', 'UR179', 'UR180', 'UR178', 'UR290', 'UR285', 'UR289', 'UR288', 'UR287', 'UR282', 'UR291A', 'JC002', 'JC015', 'JC003', 'JC016', 'UR065', 'JC012', 'UR292A', 'HP038', 'UR046', 'JC004', 'JC008', 'JC014', 'JC022', 'JC023', 'QU020', 'UR1033B', 'UR245', 'QU008', 'HP046 B', 'UR072', 'MM007/AN002', 'UR087', 'UR1049', 'HP019', 'UR092', 'HP031', 'UR167', 'UR075', 'QU022', 'UR080', 'UR081', 'UR117A', 'UR158', 'QU006', 'UR071', 'HP025', 'HP026', 'HP049', 'HP028', 'UR070', 'AS198', 'AS193', 'UR157', 'UR037', 'UR265', 'AS079', 'UR1057', 'AS074', 'AS075', 'UR028', 'UR259', 'AS078', 'UR076', 'HP043', 'UR060', 'HP042', 'UR159', 'UR171', 'MM005', 'UR111', 'QU002', 'UR1088', 'MM019', 'UR182', 'UR168', 'QU018', 'UR123', 'UR194', 'UR003', 'UR097', 'UR117C', 'UR007', 'KH0350', 'UR230', 'UR241', 'UR016', 'UR169', 'AS004', 'AS005', 'HP044', 'UR024', 'UR057', 'UR023', 'UR029', 'HP016', 'UR1052', 'UR077', 'UR156', 'UR276']

Sanity Check - Calendar Khipus

Again, you can examine the results visually using the Khipu Field Guide Image Browser. As a sample of how well this hierarchical clustering occurs, let’s look at how close UR006, UR009, UR021, and UR1084/AS084 cluster. These 4 khipus are all considered to be “astronomical” or calendar type khipu, based on their cluster frequencies and cords per cluster. You can see that the clustering algorithm correctly places UR006, UR009 and UR021 close together, but it places UR1084 much farther away. Why? I’ll have to analyze Juliana Martin’s work on UR1084 more in phase 2 or 3.

Code
calendar_khipus = ['UR021', 'UR006', 'UR009', 'UR1084']
for i, aKhipu in enumerate(calendar_khipus):
    previous_khipu_name = calendar_khipus[i-1] if i > 0 else calendar_khipus[0]
    cur_khipu_name = calendar_khipus[i]
    previous_khipu_index = sorted_by_similarity.index(previous_khipu_name)
    cur_khipu_index = sorted_by_similarity.index(cur_khipu_name)
    delta_previous = abs(cur_khipu_index - previous_khipu_index)
    print(f"{aKhipu} is at position: {cur_khipu_index} - {delta_previous} away from previous khipu")
UR021 is at position: 493 - 0 away from previous khipu
UR006 is at position: 509 - 16 away from previous khipu
UR009 is at position: 508 - 1 away from previous khipu
UR1084 is at position: 247 - 261 away from previous khipu

What exactly are we measuring when we cluster? That’s a difficult question to answer. How good is the clustering? Again a difficult question to answer. Later, in phase 3, we can look at the eigenvectors of a khipu, the mathematical equivalent of archetypal khipus that are the basis of all khipus. Until then, a simple plot of our hierarchical clustering with respect to benford_match and color shows that hierarchical clustering is occuring i.e. lots of clusters of little hills, but not as we expect, i.e. montonically decreasing.

The khipus simply defy simple explanation. As Piet Hein says:

Problems worthy
Of attack,
Prove their worth,
By hitting back.

Code
benford_match = [benford_df[benford_df.name == khipu_name].benford_match.values[0] for khipu_name in sorted_by_similarity]
match_df = pd.DataFrame({'khipu_name':sorted_by_similarity, 'benford_match':benford_match})
fig = px.bar(match_df, x='khipu_name', y='benford_match', width=2400, height=300, title="Similarity by Benford\'s Law Match").show()
Code
color_similarity = [khipu_summary_df[khipu_summary_df.kfg_name == khipu_name].num_ascher_colors.values[0] for khipu_name in sorted_by_similarity]
match_df = pd.DataFrame({'khipu_name':sorted_by_similarity, 'num_ascher_colors':color_similarity})
fig = px.bar(match_df, x='khipu_name', y='num_ascher_colors', width=2400, height=300, title="Similarity by Number of Ascher Colors").show()