Khipu Analysis

Loading the Khipu Summary

In this page, we will perform high-level analysis, comparing khipus to each other, as a whole.

As part of the building of the python class object OODB, we now have a summary table of useful metrics of Khipu. A list of sample metrics includes, how many pendant cords, how many S-Twist cords, how many subsidiary cords for each pendant cord, etc..

One interesting field is the document field which is a textual representation of the khipus "features", such as cord attachments, values, colors, knot types, etc. This textual representation becomes a "proxy" for the Khipu, and allows NLP algorithms to work on the khipu. The long term goal then becomes translating that text document into something meaningful to us.

I will admit it is an odd process. First humans convert thoughts to cloth. Then humans measure the cloth, and convert it back to numbers (the database). Then I come along and convert the numbers back to pixel-cloth and I convert the numbers to text. Which I will convert back to numbers in code, so that I can hopefully convert those numbers back to thoughts. Hmmmmn....

The long prologue of imports and setup has been extracted to a startup file for brevity. The first task is to complete initialization by doing final setup and importing the Python Khipu OODB (Object Oriented Database) and summary files.

In [1]:
# Initialize plotly
plotly.offline.init_notebook_mode(connected = True);

# Khipu Imports
sys.path.insert(0, '../../classes')  # Load external khipu classes contained in this directory
import khipu_kamayuq as kamayuq  # A Khipu Maker is known (in Quechua) as a Khipu Kamayuq
import khipu_qollqa as kq

The KhipuKamayuq class (a Khipu scribe is a KhipuKamayuq) fetches khipus using the data accessed using the KhipuQollqa (a Khipu Warehouse) class. The KhipuKamayuq class can fetch all the khipus stored(pickled) as python object-oriented class objects. Additionally it has a khipu summary dataframe made by the KhipuQollqa. Let's load the OODB, and the summary dataframe

The khipu summary dataframe is large. Any tabular view may require a mouse that scrolls horizontally!. So in addition let's list the columns

In [2]:
start_time = time.time()
all_khipus = [aKhipu for aKhipu in kamayuq.fetch_all_khipus().values()]
print(f"That took {(time.time() - start_time)} seconds")
print(f"The khipu_kamayuq fetched {len(all_khipus)} khipus from the khipu qollqa")
That took 14.605737924575806 seconds
The khipu_kamayuq fetched 511 khipus from the khipu qollqa
In [3]:
khipu_summary_df = kq.fetch_khipu_summary()
display_dataframe(khipu_summary_df)
khipu_summary_df.columns
Out[3]:
Index(['Unnamed: 0', 'khipu_id', 'name', 'legal_filename', 'original_name',
       'nickname', 'museum_name', 'museum_num', 'museum_description', 'region',
       'provenance', 'num_cord_clusters', 'mean_cords_per_cluster',
       'num_sum_clusters', 'num_banded_clusters', 'num_seriated_clusters',
       'benford_match', 'num_canuto_cords', 'num_cords', 'recto_cords',
       'verso_cords', 'rv_cords', 'num_pendant_cords', 'num_top_pendant_cords',
       'num_top_cords', 'num_subsidiary_cords', 'num_s_cords', 'num_z_cords',
       'mean_cord_value', 'fanout_ratio', 'num_total_cords',
       'num_ascher_colors', 'knot_cluster_count', 'knot_count',
       'mean_knots_per_cord', 'num_s_knots', 'num_z_knots', 'notes',
       'pcord_notes'],
      dtype='object')

Beginning programmers are taught to learn how to write "Hello World" as their first programming exercise. For khipu scholars, who are researching khipus statistically, Benford's Law is their "Hello World." Benford's law, simply put, says that the first digits of numbers used to measure human activity tend to have a higher probability that they are 1. For example, 10,000, 1,000, etc. We can use Benford's law to predict which khipu are of an "accounting nature" - ie. khipu that have sums and accounts of things.

Benford’s Law by Dan Ma

The first digit (or leading digit) of a number is the leftmost digit (e.g. the first digit of 567 is 5). The first digit of a number can only be 1, 2, 3, 4, 5, 6, 7, 8, and 9 since we do not usually write a number such as 567 as 0567. Some fraudsters may think that the first digits in numbers in financial documents appear with equal frequency (i.e. each digit appears about 11% of the time). In fact, this is not the case. It was discovered by Simon Newcomb in 1881 and rediscovered by physicist Frank Benford in 1938 that the first digits in many data sets occur according to the probability distribution indicated in Figure 1 below:

The above probability distribution is now known as the Benford’s law. It is a powerful and yet relatively simple tool for detecting financial and accounting frauds. For example, according to the Benford’s law, about 30% of numbers in legitimate data have 1 as a first digit. Fraudsters who do not know this will tend to have much fewer ones as first digits in their faked data.

Data for which the Benford’s law is applicable are data that tend to distribute across multiple orders of magnitude. Examples include income data of a large population, census data such as populations of cities and counties. In addition to demographic data and scientific data, the Benford’s law is also applicable to many types of financial data, including income tax data, stock exchange data, corporate disbursement and sales data. The author of I've Got Your Number How a Mathematical Phenomenon can help CPAs Uncover Fraud and other Irregularities., Mark Nigrini, also discussed data analysis methods (based on the Benford’s law) that are used in forensic accounting and auditing.

In [4]:
first_digit_probability = [0.301, 0.176, 0.125, 0.097, 0.079, 0.067, 0.058, 0.051, 0.046]
df = pd.DataFrame({'first_digit':list(range(1,10)), 'proportion':first_digit_probability})
fig = px.bar(df, x='first_digit', y='proportion', text='proportion', width=1200, height=300, title="Benford\'s Law").show()

Using the distribution of "first digits" in a cord (the value of the first knot) we can use a simple cosine distance similarity metric of Benford's distribution and cord distribution. Cosine distance similarity is a simple concept that is used a lot in this study ... If you're unfamiliar here's an intuitive explanation.

Suppose we want to see how close two sets of numbers are to each other. For example house A has 3 dogs and 2 cats, house B has 5 dogs and 1 cat and house C has 1 dog and 7 cats. We can represent each of these as a vector:

A = [3,2] B = [5,1] C = [1,7]

Each of these vectors can be "normalized" to a unit vector (a vector of length 1) by dividing by the vector length. Then you can take advantage of math. A vector dot-product of two vectors measures the angle between those two vectors. If we want to know if A is closer to B or C we take A . B and A . C and see which has the smallest angle between them. The dot -product generates the cosine of the angle, which is why this technique is known as the cosine distance metric.



We can do the same for khipus by creating a vector of first digit counts, for a given khipu and then dot producting it with the Benford probabilities above.

f(angle_theta) = first_digit_probability . given_khipu_digit_distribution. If the angle theta is 0, the vectors lie on top of each other (an exact match) so cos(theta) = 1, with no match = 0.

Another approach to understanding which khipus are accounting in nature, and which aren't is to sum up cord values and see if that sum occurs elsewhere on another cord in the khipu. Dr. Urton, shows an example of this kind, khipu UR028, in his youtube video on accounting khipu. Here I examine how many "sum matches" occur in a khipu (normalized by the number of total cords...). This gives us a secondary statistical check, in addition to Benford's Law.

Note that the metric calculated below is not exact. Here, only two cord sums are measured. The first decoded khipu, in the 1920s by Leland Locke, had one up-cord sum for four down cord values and would not score well on this metric.

There is another metric similar to the above. Some cord clusters (ie. top cords) are the sum, or approximate sum of the cluster adjacent (ie. on either side) of them. The khipu class tags these "sum_clusters" and allows us another metric.

In [5]:
## Benford Law Similarity
# Note that 0 index's are ignored, by setting first_digit_probability[0] to 0
first_digit_probability = [0.0, 0.301, 0.176, 0.125, 0.097, 0.079, 0.067, 0.058, 0.051, 0.046]
def benford_match(aKhipu):
    digit_count = [0]*10
    for cord in aKhipu.attached_cords():
        first_digit = int(str(cord.decimal_value)[0])
        digit_count[first_digit] += 1
    cosine_similarity = 1 - scipy.spatial.distance.cosine(first_digit_probability, digit_count)
    return cosine_similarity
khipu_benford_match = [benford_match(aKhipu) for aKhipu in all_khipus]
In [6]:
# Cords whose sum occurs as another cord in a khipu
def sum_matches(numbers, verbose = False):
    numbers = list(filter(lambda a: a != 0, numbers))
    num_matches= 0
    max_match = 0
    all_matches = [0]
    for i in range(len(numbers)):
        for j in range(i):
            match_sum = numbers[i] + numbers[j]
            if (match_sum in numbers) and (match_sum != 0):
                if verbose: print(f"[{i}]{numbers[i]}, [{j}]{numbers[j]} sums to [{numbers.index(match_sum)}]{match_sum}")
                num_matches += 1
                all_matches.append(match_sum)
    return ({'num_matches':num_matches, 'max_match': max(all_matches), 'mean_match': mean(all_matches) })

def khipu_sum_matches(aKhipu):
    cord_values = [aCord.decimal_value for aCord in aKhipu.attached_cords() if aCord.decimal_value > 0]
    return sum_matches(cord_values)


sum_match_dicts = [khipu_sum_matches(aKhipu) for aKhipu in all_khipus]
sum_match_counts = [aDict['num_matches']for aDict in sum_match_dicts]
sum_match_max = [aDict['max_match']for aDict in sum_match_dicts]
sum_match_mean = [aDict['mean_match']for aDict in sum_match_dicts]
In [7]:
benford_df = pd.DataFrame({'khipu_id': [aKhipu.khipu_id for aKhipu in all_khipus],
                           'name': [aKhipu.name() for aKhipu in all_khipus],
                           'num_pendants': [aKhipu.num_pendant_cords() for aKhipu in all_khipus],
                           'num_subsidiary_cords': [aKhipu.num_subsidiary_cords() for aKhipu in all_khipus],
                           'num_total_cords': [aKhipu.num_attached_cords()for aKhipu in all_khipus],
                           'fanout_ratio': [aKhipu.num_pendant_cords() / aKhipu.num_attached_cords() for aKhipu in all_khipus],
                           'benford_match': khipu_benford_match,
                           'sum_match_count': sum_match_counts,
                           'sum_match_max': sum_match_max,
                           'sum_match_mean': sum_match_mean})

benford_df['normalized_sum_match_count'] = benford_df['sum_match_count']/benford_df['num_total_cords']
benford_df = benford_df.sort_values(by='benford_match', ascending=False)
benford_df.to_csv("../data/CSV/benford.csv") # Save for later when we cross correlate with other statistical tests
display_dataframe(benford_df)
In [8]:
hist_data = [khipu_benford_match]
group_labels = ['benford_match']
colors = ['#835AF1']

(
ff.create_distplot(hist_data, group_labels,  bin_size=.01, 
                         show_rug=False,show_curve=True)
    .update_layout(title_text='Distribution Plot of Khipu -> Benford Matches', # title of plot
                     xaxis_title_text='Benford Match Value (cosine_similarity)', 
                     yaxis_title_text='Percentage of Khipus', # yaxis label
                     bargap=0.05, # gap between bars of adjacent location coordinates
                     #bargroupgap=0.1 # gap between bars of the same location coordinates
                     )
     .show()
)

Here's a clear statistical/graphical narrative as to which khipus are "accounting" in nature (from a little above .6 to 1.0!), somewhat accounting in nature (.4 to .6), and possibly narrative in nature (< 0.4). How many khipus in each category?

In [9]:
is_likely_accounting = benford_df[benford_df.benford_match > .625]
print(f"Khipus with benford_match > .625 = {is_likely_accounting.shape[0]} (is_likely_accounting)")
is_likely_narrative = benford_df[benford_df.benford_match < .425]
print(f"Khipus with benford_match < .425 = {is_likely_narrative.shape[0]} (is_likely_narrative)")
Khipus with benford_match > .625 = 340 (is_likely_accounting)
Khipus with benford_match < .425 = 69 (is_likely_narrative)

We see that there's probably 352 khipu that are likely accounting, and 67 khipu that are likely narrative.

In [10]:
### Benford Match vs Sum Clusters

Let's do a scatter plot of Benford Matching vs (normalized) cords whose sums match another cord. The clustering of accounting khipus is revealed even more clearly.

In [11]:
fig = px.scatter(benford_df, x="benford_match", y="normalized_sum_match_count", log_y=True,
                 size="sum_match_max", color="sum_match_mean", 
                 labels={"benford_match": "#Benford Match", "normalized_sum_match_count": "#Number of Matching Cord Sums (Normalized by # Total Cords)"},
                 hover_data=['name'], title="<b>Benford Match vs Normalized Sum Match Counts</b> - <i>Hover Over Circles to View Khipu Info</i>",
                 width=1500, height=1000);
fig.update_layout(showlegend=True).show()

Again, we see the lumpiness and confirmation of the "accounting khipus" whose Benford match is > 0.6 or so....

Sanity Check - Marcia Ascher's "Mathematical" Khipus versus their Corresponding Benford Match

As another checkpoint on Benford Law matching approach, let's review the khipus that Marcia Ascher's annotated with mathematical relationships and compare them with their corresponding Benford match.

In [12]:
# These are the khipus that Marcia Ascher annotated as having interesting mathematical relationships
ascher_mathy_khipus = ['AS010', 'AS020', 'AS026B', 'AS029', 'AS038', 'AS041', 'AS048', 'AS066', 'AS068', 
                       'AS079', 'AS080', 'AS082', 'AS092', 'AS115', 'AS128', 'AS129', 'AS164', 'AS168', 'AS192', 
                       'AS193', 'AS197', 'AS198', 'AS199', 'AS201', 'AS206', 'AS207A', 'AS208', 'AS209', 'AS210', 
                       'AS211', 'AS212', 'AS213', 'AS214', 'AS215', 'UR1034', 'UR1096', 'UR1097', 'UR1098', 
                       'UR1100', 'UR1103', 'UR1104', 'UR1114', 'UR1116', 'UR1117', 'UR1120', 'UR1126', 'UR1131', 
                       'UR1135', 'UR1136', 'UR1138', 'UR1140', 'UR1141', 'UR1143', 'UR1145', 'UR1148', 'UR1149', 
                       'UR1151', 'UR1152', 'UR1163', 'UR1175', 'UR1180']
ascher_khipu_benford_match = [benford_match(aKhipu) for aKhipu in all_khipus if aKhipu.name() in ascher_mathy_khipus]

def is_an_ascher_khipu(aKhipu):
    return True if str(aKhipu.original_name)[0:2] == "AS" or aKhipu.name()[0:2] == "AS" else False

ascher_khipus = [aKhipu for aKhipu in all_khipus if is_an_ascher_khipu(aKhipu)]
percent_mathy_khipus = 100.0 * float(len(ascher_mathy_khipus))/float(len(ascher_khipus))
percent_string = "{:4.1f}".format(percent_mathy_khipus)

hist_data = [ascher_khipu_benford_match]
group_labels = ['benford_match']
colors = ['#835AF1']

(
ff.create_distplot(hist_data, group_labels,  bin_size=.01, 
                         show_rug=False,show_curve=True)
    .update_layout(title_text='<b>Benford Match for Ascher Khipus with Mathematical Relationships</b> - <i>Hover Over Circles to View Khipu Info</i>', # title of plot
                     xaxis_title_text='Benford Match Value (cosine_similarity)', 
                     yaxis_title_text='Percentage of Khipus', # yaxis label
                     bargap=0.05, # gap between bars of adjacent location coordinates
                     #bargroupgap=0.1 # gap between bars of the same location coordinates
                     )
     .show()
)
print (f"Percent of ascher khipus having mathematical relationships ({len(ascher_mathy_khipus)} out of {len(ascher_khipus)}) is {percent_string}%")
Percent of ascher khipus having mathematical relationships (61 out of 223) is 27.4%

Benford Match vs Fanout Ratio (Primaries vs Subsidiaries)

Let's conclude this section with an interesting "triangular" graph - Benford match vs Fanout Ratio (the ratio of Pendant cords vs Subsidiary cords)

In [13]:
fig = px.scatter(benford_df, x="benford_match", y="fanout_ratio", 
                 size="num_total_cords", color="num_subsidiary_cords", 
                 labels={"num_total_cords": "#Total Cords", "num_subsidiary_cords": "#Subsidiary Cords"},
                 hover_data=['name'], title="<b>Benford Match vs Fanout Ratio</b> - <i>Hover Over Circles to View Khipu Info</i>",
                 width=1500, height=1000);
fig.update_layout(showlegend=True).show()

Grouping Khipus into Families - What Constitutes "Similarity"

As a useful first step, it will help to have the ability to sort khipus not just by meta-features like benford_match, but also by other forms of similarity.

One of my favorite similarity techniques is the text-clustering approaches of NLP. A driving need in NLP is to sort topics into various categories; for example, is this a news document or fiction. So how do we text mine khipus? A simple solution is to convert the khipu to words. Each Python Khipu class object (knot, cord, cluster, etc.) is given a "writer" function that documents itself as text. Below is the "document" for a simple 3 cord khipu - AS055. It's a pretty sparse and simple description. We'll go about making this more sophisticated as the project becomes more developed. The point is that we now have a set of words that roughly describe a khipu with categorical words such as U_knot_twist, or single_knot and numerical words such as 1421_decimal_value for the cord value. Now we use that set of words as the input to NLP algorithms.

In [14]:
khipu_docstrings_df = kq.fetch_khipu_docstrings()
khipu_names = list(khipu_docstrings_df.name.values)
khipu_documents_orig = list(khipu_docstrings_df.document.values)

# To assist clustering Ascher khipus, where many khipu metrics such as cord attachment, twist, etc are unknown, 
# We can impute values for attachment, cord twist and knot twist
khipu_documents = [aDocument.replace("Unknown_attachment_type", "Recto_attachment_type").replace("U_twist", "S_twist").replace("U_knot_twist", "S_knot_twist") for aDocument in khipu_documents_orig]
In [15]:
print(khipu_docstrings_df[khipu_docstrings_df.name=="AS055"].document.values[0])
 primary_cord_start YB cord_cluster_start 1_cluster_level 3_cc_cords cord_start Unknown_attachment_type U_twist 1_cord_level 1_cord_ordinal 1421_decimal_value 1_cluster_ordinal KB knot_cluster_start single_knot U_knot_twist ʘ_knot_value knot_cluster_end knot_cluster_start single_knot U_knot_twist o_knot_value single_knot U_knot_twist o_knot_value single_knot U_knot_twist o_knot_value single_knot U_knot_twist o_knot_value knot_cluster_end knot_cluster_start single_knot U_knot_twist •_knot_value single_knot U_knot_twist •_knot_value knot_cluster_end knot_cluster_start figure_eight_knot U_knot_twist ※_knot_value knot_cluster_end cord_end cord_start Unknown_attachment_type U_twist 1_cord_level 2_cord_ordinal 2427_decimal_value 2_cluster_ordinal YB knot_cluster_start single_knot U_knot_twist ʘ_knot_value single_knot U_knot_twist ʘ_knot_value knot_cluster_end knot_cluster_start single_knot U_knot_twist o_knot_value single_knot U_knot_twist o_knot_value single_knot U_knot_twist o_knot_value single_knot U_knot_twist o_knot_value knot_cluster_end knot_cluster_start single_knot U_knot_twist •_knot_value single_knot U_knot_twist •_knot_value knot_cluster_end knot_cluster_start long_knot U_knot_twist 7_knot_value knot_cluster_end cord_end cord_start Unknown_attachment_type U_twist 1_cord_level 3_cord_ordinal 734_decimal_value 3_cluster_ordinal KB knot_cluster_start single_knot U_knot_twist o_knot_value single_knot U_knot_twist o_knot_value single_knot U_knot_twist o_knot_value single_knot U_knot_twist o_knot_value single_knot U_knot_twist o_knot_value single_knot U_knot_twist o_knot_value single_knot U_knot_twist o_knot_value knot_cluster_end knot_cluster_start single_knot U_knot_twist •_knot_value single_knot U_knot_twist •_knot_value single_knot U_knot_twist •_knot_value knot_cluster_end knot_cluster_start long_knot U_knot_twist 4_knot_value knot_cluster_end cord_end cord_cluster_end primary_cord_end 


Then we can pass the document to modern NLP text mining algorithms which can cluster the text by similarity along the n-dimensions of all the different words and their frequencies in each document.

The usual first step in NLP document analysis is to build a term-frequency, inverse-document-frequency matrix.

Tf-Idf matrices for NLP have a long history. An explanation:

As part of training our machine model, we want to give it information about how words are distributed across our set of data. This is commonly done using Term Frequency/Inverse Document Frequency techniques.

The first step, then, is to calculate the term frequency (Tf), the inverse document frequency (Idf), and their product (Tf°Idf) Apologies: There’s an occasional Google Chrome bug with latex/equation rendering - see equations in Safari if you want to understand them

(Tf): TERM FREQUENCY - How probable is that this particular word appears in a SINGLE document.

$$Tf(\omega) = \left( \frac{ \# occurences\ of\ \omega\ in\ document}{\# words\ in\ document} \right)$$

(Idf): INVERSE DOCUMENT FREQUENCY - How common or rare a word is across ALL documents. If it’s a common word across all documents, it’s regarded as a non-differentiating word, and we want to weight it lower.

Alternatively, it can be thought of as the specificity of a term - quantified as an inverse function of the number of documents in which it occurs. The most common function used is log(natural) of the number of docs divided by the number of docs in which this term appears.

$$Idf(\omega) = ln \left( \frac{ \# docs}{\# docs.containing(\omega) } \right)$$

(Tf°Idf):TERM SPECIFICITY Tf°Idf, the product of Tf and Idf, is used to weight words according to how “important” they are to the machine learning algorithm ☺. The higher the TF°IDF score (weight), the rarer the term and vice versa. The intuition for this measure is this:

If a word appears frequently in a document, then it should be important and we should give that word a high score. But if a word appears in too many other documents, it’s probably not a unique identifier, therefore we should assign a lower score to that word.

Tf°idf is one of the most popular term-weighting schemes today. Over 80% of text-based recommender systems, including Google’s search engine, use Tf°Idf. Introduced, as “term specificity” by Karen Spärck Jones in a 1972 paper, it has worked well as a heuristic. However, its theoretical foundations have been troublesome for at least three decades afterward, with many researchers trying to find information theoretic justifications for it.

Ashok Khosla - Detecting Deceptive Hotel Reviews

In [16]:
# Calculate TF-IDF Matrix

def simple_tokenizer(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as it's own token
    tokens = [word.lower() for word in nltk.word_tokenize(text)]
    return tokens

tfidf_vectorizer = TfidfVectorizer(max_df=.9, max_features=200000,
                                   min_df=.1, stop_words=None,
                                   use_idf=True, tokenizer=simple_tokenizer, ngram_range=(1,10))

tfidf_matrix = tfidf_vectorizer.fit_transform(khipu_documents)

terms = tfidf_vectorizer.get_feature_names()
cosine_distance = 1 - cosine_similarity(tfidf_matrix)

K-means clustering


Traditionally the next step is to use a variety of clustering methods. Here I will only use two. The first, K-means clustering, groups the "documents" based on n-dimensional clustering of the bag of words contained in the documents. I've used K-means in the past. It's not a great clustering algorithm, but it has the advantage that it outputs discrete groupings which can be handy. So I always do it as a first pass.

In [17]:
num_clusters = 20

km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)
clusters = km.labels_.tolist()
Out[17]:
KMeans(n_clusters=20)
In [18]:
khipu_cluster_dict = { 'khipu_name': khipu_names, 'document': khipu_documents, 'cluster': clusters  }
khipu_cluster_df = pd.DataFrame(khipu_cluster_dict, index = [clusters] , columns = ['khipu_name', 'document', 'cluster'])
#display_dataframe(khipu_cluster_df)

for cluster_id in range(num_clusters):
    khipus_in_cluster = khipu_cluster_df[khipu_cluster_df.cluster == cluster_id]
    print(cluster_id, ":\t", list(khipus_in_cluster.khipu_name.values))
0 :	 ['UR1162A', 'UR1146', 'UR1108', 'UR1097', 'AS215', 'AS207C', 'AS207A', 'AS205', 'AS203', 'AS202', 'AS169', 'AS094', 'AS093', 'AS090/N2', 'AS077', 'AS071', 'AS065B', 'AS064', 'AS060', 'AS039', 'AS035D', 'AS027', 'AS024', 'AS016', 'AS015']
1 :	 ['UR264', 'UR229', 'UR225', 'UR222', 'UR220', 'UR212', 'UR199', 'UR192', 'UR172', 'UR113', 'UR1033G', 'UR1033D', 'UR074', 'UR068', 'UR064', 'UR056C', 'UR056A', 'UR048', 'UR047', 'HP023', 'HP022', 'HP018', 'HP014', 'HP012', 'HP005']
2 :	 ['UR274B', 'UR256', 'UR249', 'UR248', 'UR236', 'UR234', 'UR226', 'UR218', 'UR201', 'UR200', 'UR188', 'UR185', 'UR183', 'UR165', 'UR164', 'UR163', 'UR162', 'UR153', 'UR147', 'UR129', 'UR1180', 'UR1179', 'UR1152', 'UR1149', 'UR1148', 'UR1145', 'UR1143', 'UR1124 Detail 1', 'UR1123', 'UR1120', 'UR1119', 'UR1118', 'UR1106', 'UR1104', 'UR038', 'UR026', 'UR020', 'HP030', 'HP029', 'HP027', 'HP024', 'HP011', 'HP010', 'HP008', 'HP007', 'HP006', 'HP002', 'HP001']
3 :	 ['UR286', 'UR242', 'UR235', 'UR231', 'UR207', 'UR204', 'UR203', 'UR189', 'UR173', 'UR160', 'UR155', 'UR150', 'UR149', 'UR141', 'UR138', 'UR120', 'UR1087', 'UR1084', 'UR107', 'UR106', 'UR091', 'UR090', 'UR089', 'UR088', 'UR086', 'UR053C', 'UR046', 'UR043', 'UR034', 'UR017', 'UR010', 'UR004', 'UR001', 'HP050', 'HP021', 'HP004']
4 :	 ['UR230', 'UR223', 'UR216', 'UR194', 'UR184', 'UR182', 'UR145', 'UR137', 'UR130', 'UR125', 'UR123', 'UR121', 'UR104', 'UR1033E', 'UR1031', 'UR100', 'UR098', 'UR095', 'UR058B', 'UR058A', 'UR053E', 'UR053B', 'UR049', 'UR035', 'UR014', 'UR013', 'UR012', 'HP056', 'AS192', 'AS070']
5 :	 ['UR250', 'UR170', 'UR140', 'UR1095', 'UR1058', 'UR1053', 'UR078', 'UR066', 'HP020', 'AS076']
6 :	 ['UR1176', 'UR1141', 'UR1034', 'UR042', 'AS35B', 'AS214', 'AS210', 'AS208', 'AS185', 'AS184', 'AS178', 'AS177', 'AS173', 'AS171', 'AS160', 'AS159', 'AS158', 'AS132', 'AS110', 'AS092', 'AS083', 'AS067/MA029', 'AS066', 'AS063B', 'AS063', 'AS047', 'AS046', 'AS045', 'AS044', 'AS043', 'AS037', 'AS036', 'AS021', 'AS010']
7 :	 ['UR169', 'UR057', 'UR029', 'UR024', 'UR023', 'UR016', 'HP016']
8 :	 ['UR186', 'UR136', 'UR119', 'UR118', 'UR1150', 'UR1116', 'UR1113', 'UR1105', 'UR1040', 'UR101', 'UR099', 'UR053D', 'HP052', 'AS206', 'AS200', 'AS188', 'AS187', 'AS183', 'AS182B', 'AS168', 'AS164', 'AS156', 'AS137', 'AS134', 'AS133', 'AS122', 'AS111', 'AS101 - Part 2', 'AS101 - Part 1', 'AS086', 'AS073', 'AS072', 'AS065', 'AS041', 'AS035C', 'AS020', 'AS018', 'AS014', 'AS011']
9 :	 ['UR294', 'UR246', 'UR240', 'UR237', 'UR233', 'UR232', 'UR228', 'UR227', 'UR221', 'UR219', 'UR217', 'UR213', 'UR208', 'UR202', 'UR197', 'UR176', 'UR174', 'UR154', 'UR152', 'UR151', 'UR146', 'UR1175', 'UR1167', 'UR1162B', 'UR1147', 'UR1144', 'UR1138', 'UR1135', 'UR1131', 'UR1127', 'UR1107', 'UR1103', 'UR1102', 'UR1051', 'UR105', 'UR062', 'UR061', 'UR033', 'UR022', 'UR018', 'UR008', 'UR005', 'LL01', 'HP017', 'HP013', 'HP003', 'AS199', 'AS196']
10 :	 ['UR1151', 'UR1130', 'AS215F', 'AS213', 'AS212', 'AS211', 'AS207B', 'AS204', 'AS201', 'AS189', 'AS186', 'AS174', 'AS157', 'AS155', 'AS153', 'AS142', 'AS062', 'AS061/MA036', 'AS048', 'AS026B', 'AS026A', 'AS017']
11 :	 ['UR132', 'UR1163', 'UR1121', 'UR1033A', 'UR051', 'UR041', 'UR040', 'AS35A', 'AS182', 'AS128', 'AS112', 'AS089', 'AS085', 'AS069', 'AS067/MA29', 'AS059', 'AS028', 'AS023', 'AS019', 'AS013', 'AS012']
12 :	 ['UR241', 'UR198', 'UR191', 'UR135', 'UR126', 'UR115', 'UR114', 'UR1032', 'UR102', 'UR093', 'UR083', 'UR007', 'AS080']
13 :	 ['UR291A', 'UR290', 'UR289', 'UR287', 'UR285', 'UR283', 'UR282', 'UR180', 'UR179', 'AS081']
14 :	 ['UR244', 'UR243', 'UR239', 'UR214', 'UR211', 'UR210', 'UR195', 'UR178', 'UR139', 'UR134', 'UR128', 'UR124', 'UR117D', 'UR117B', 'UR1096', 'UR1091', 'UR1033F', 'UR1033C', 'UR079', 'UR069', 'UR067', 'UR059', 'UR056B', 'UR027', 'UR021', 'UR006', 'HP051 B', 'HP032', 'HP015', 'AS191']
15 :	 ['UR265', 'UR245', 'UR171', 'UR168', 'UR167', 'UR159', 'UR157', 'UR117A', 'UR111', 'UR1088', 'UR1057', 'UR1049', 'UR1033B', 'UR092', 'UR087', 'UR081', 'UR080', 'UR076', 'UR073', 'UR072', 'UR065', 'UR063', 'UR060', 'UR037', 'UR032', 'UR028', 'UR009', 'QU02', 'HP049', 'HP031', 'HP019', 'AS198', 'AS193', 'AS079', 'AS078', 'AS075', 'AS074']
16 :	 ['UR133', 'UR1117', 'UR1099', 'UR084', 'UR036', 'AS172', 'AS139', 'AS129', 'AS082', 'AS055', 'AS042', 'AS030']
17 :	 ['UR127', 'UR1166', 'UR1165', 'UR1161', 'UR1154', 'UR1140', 'UR1136', 'UR1126', 'UR1124', 'UR1114', 'UR1109', 'UR1100', 'UR1098', 'UR103', 'AS195', 'AS194']
18 :	 ['UR247', 'UR187', 'UR166', 'UR161', 'UR143', 'UR142', 'UR131C', 'UR131B', 'UR131A', 'UR122', 'UR108', 'UR094', 'UR085', 'UR053A', 'UR015', 'UR011', 'UR002', 'AS209', 'AS197', 'AS181', 'AS170', 'AS125', 'AS115', 'AS068', 'AS056', 'AS054', 'AS050', 'AS038', 'AS029']
19 :	 ['UR238', 'UR215', 'UR205', 'UR177', 'UR175', 'UR156', 'UR148', 'UR117C', 'UR116B', 'UR116A', 'UR109', 'UR1052', 'UR097', 'UR096', 'UR077', 'UR075', 'UR045', 'UR019', 'UR003']

One simple test of the clustering - where do known khipu clusters cluster? For example, where does UR1033[A-F] group? How about UR058[A-B]? Examine these and similar clusters in the list above and below in the hierarchical dendogram. It is also useful to browse the results visually using the Khipu Field Guide Image Browser

Hierarchical clustering, uses "cosine similarity" (i.e. how close together are two normalized vectors in an n-dimensional bag-of-words frequency matrix) to group together and cluster documents. In my experience, it generally performs better than Kmeans or Principal Component Analysis. Although I prefer the R package for clustering, Python provides graphics adequate for the job at hand. In Phase 3 we will likely move to R.

In [19]:
Z = ward(cosine_distance) #define the linkage_matrix using ward clustering pre-computed distances

fig, ax = plt.subplots(figsize=(15, 40)) # set size
ax = dendrogram(Z, orientation="left", labels=khipu_names);

plt.tick_params(\
    axis= 'x',          # changes apply to the x-axis
    which='both',       # both major and minor ticks are affected
    bottom='off',       # ticks along the bottom edge are off
    top='off',          # ticks along the top edge are off
    labelbottom='off')

plt.tight_layout() #show plot with tight layout

#uncomment below to save figure
plt.savefig('./images/khipu_clusters.png', dpi=200) #save figure as ward_clusters

# Clustered 
sorted_by_similarity = ax['ivl']
print(sorted_by_similarity)
['UR1163', 'AS039', 'AS028', 'AS035D', 'UR041', 'AS013', 'AS023', 'UR1033A', 'AS35A', 'AS182', 'AS214', 'AS059', 'AS019', 'AS169', 'AS093', 'UR051', 'AS090/N2', 'AS016', 'AS207C', 'AS077', 'AS060', 'AS081', 'AS067/MA29', 'AS203', 'AS202', 'UR1121', 'AS112', 'UR1097', 'AS015', 'AS064', 'AS065B', 'AS024', 'UR215', 'UR1130', 'UR1146', 'UR1162A', 'UR1108', 'AS155', 'AS061/MA036', 'AS017', 'AS215F', 'AS204', 'AS094', 'AS048', 'AS027', 'AS205', 'AS071', 'UR040', 'AS012', 'AS211', 'AS026A', 'UR1151', 'AS186', 'UR132', 'AS157', 'AS062', 'AS207B', 'AS026B', 'AS142', 'AS213', 'AS201', 'AS174', 'AS153', 'AS159', 'AS212', 'AS189', 'AS215', 'AS207A', 'AS021', 'AS089', 'AS085', 'AS069', 'AS010', 'AS160', 'AS158', 'AS208', 'UR1034', 'AS046', 'AS210', 'AS044', 'AS185', 'AS047', 'AS128', 'AS067/MA029', 'AS173', 'AS063', 'AS132', 'AS045', 'AS043', 'UR042', 'AS092', 'UR1105', 'UR1176', 'AS164', 'AS063B', 'UR035', 'AS070', 'AS050', 'AS181', 'AS011', 'AS171', 'AS35B', 'AS177', 'AS037', 'AS036', 'AS178', 'AS110', 'AS066', 'UR187', 'UR122', 'UR015', 'AS125', 'AS115', 'AS209', 'AS038', 'UR247', 'UR161', 'UR053A', 'AS197', 'UR186', 'UR119', 'UR118', 'UR011', 'UR131C', 'UR131B', 'UR131A', 'UR145', 'AS200', 'UR143', 'UR094', 'UR142', 'UR002', 'AS170', 'AS029', 'AS068', 'AS054', 'AS041', 'AS014', 'UR1150', 'UR1040', 'AS111', 'AS065', 'AS122', 'AS035C', 'AS168', 'AS156', 'UR1116', 'AS187', 'UR1113', 'AS137', 'AS188', 'AS020', 'AS134', 'AS073', 'AS086', 'AS206', 'AS133', 'AS018', 'AS101 - Part 2', 'AS101 - Part 1', 'AS072', 'UR136', 'AS183', 'UR108', 'AS182B', 'UR101', 'UR053D', 'UR098', 'UR099', 'HP052', 'UR084', 'UR133', 'UR036', 'AS172', 'AS055', 'UR1099', 'AS129', 'AS056', 'AS139', 'AS082', 'UR1117', 'AS042', 'AS030', 'UR100', 'AS184', 'UR1141', 'AS083', 'UR130', 'UR053B', 'UR058B', 'UR058A', 'UR121', 'UR013', 'UR216', 'UR1033E', 'UR194', 'UR014', 'UR012', 'UR053E', 'UR184', 'UR104', 'UR166', 'UR085', 'UR137', 'UR182', 'UR095', 'AS192', 'UR223', 'UR125', 'UR147', 'UR185', 'UR162', 'UR020', 'UR163', 'UR153', 'HP024', 'HP006', 'UR201', 'UR234', 'UR200', 'UR256', 'HP029', 'UR164', 'UR249', 'HP002', 'UR188', 'HP010', 'UR1179', 'UR1124 Detail 1', 'UR1119', 'UR1148', 'UR1143', 'HP001', 'UR1118', 'UR1149', 'UR1120', 'UR154', 'UR152', 'UR146', 'UR228', 'UR227', 'UR221', 'UR008', 'HP013', 'AS196', 'UR246', 'UR174', 'HP017', 'UR176', 'LL01', 'UR1102', 'UR1167', 'UR1138', 'UR1147', 'UR1162B', 'UR1127', 'UR219', 'UR033', 'UR294', 'UR105', 'UR1145', 'UR1144', 'UR1135', 'UR1107', 'UR1152', 'UR129', 'UR1180', 'UR1106', 'UR1123', 'UR1103', 'UR165', 'HP027', 'UR248', 'HP030', 'HP011', 'HP007', 'UR274B', 'UR183', 'UR218', 'HP008', 'UR038', 'UR226', 'UR151', 'UR240', 'UR018', 'AS199', 'UR237', 'UR062', 'UR1031', 'UR049', 'UR1104', 'UR1084', 'UR053C', 'UR1166', 'UR1114', 'UR127', 'UR1098', 'UR1136', 'UR1126', 'UR1165', 'UR1140', 'UR1100', 'UR1124', 'UR1109', 'UR1175', 'UR1131', 'UR1161', 'AS195', 'AS194', 'UR103', 'HP004', 'UR235', 'UR091', 'UR089', 'UR203', 'UR155', 'UR239', 'UR069', 'UR231', 'UR067', 'UR004', 'UR090', 'UR010', 'UR1087', 'UR107', 'UR120', 'UR232', 'UR086', 'UR061', 'UR026', 'UR141', 'HP021', 'UR138', 'UR207', 'UR149', 'UR088', 'UR056A', 'UR150', 'UR048', 'UR242', 'UR043', 'UR113', 'UR204', 'UR034', 'UR222', 'UR017', 'HP022', 'UR229', 'UR225', 'UR264', 'UR220', 'HP012', 'UR199', 'UR212', 'UR192', 'HP014', 'UR068', 'UR064', 'HP018', 'HP005', 'HP003', 'UR197', 'UR233', 'UR208', 'UR1051', 'UR202', 'HP023', 'UR022', 'UR005', 'UR217', 'UR213', 'UR236', 'UR001', 'UR189', 'UR019', 'UR195', 'UR210', 'UR191', 'UR059', 'UR079', 'UR056B', 'UR170', 'UR243', 'UR117D', 'UR1091', 'UR173', 'HP032', 'UR128', 'UR160', 'UR1095', 'HP051 B', 'HP015', 'UR139', 'UR124', 'UR244', 'UR214', 'UR178', 'UR289', 'UR285', 'UR290', 'UR180', 'UR179', 'UR1154', 'UR106', 'UR046', 'UR291A', 'HP050', 'UR283', 'UR287', 'UR282', 'UR1053', 'UR093', 'UR117C', 'UR134', 'UR1033C', 'UR1058', 'UR078', 'UR250', 'UR211', 'UR117B', 'UR027', 'UR1032', 'AS080', 'UR140', 'HP020', 'UR135', 'AS076', 'UR083', 'UR066', 'UR102', 'UR198', 'UR114', 'AS191', 'UR1096', 'UR1033F', 'HP056', 'UR111', 'QU02', 'UR168', 'UR123', 'AS075', 'AS074', 'UR1057', 'AS079', 'UR1088', 'UR028', 'UR157', 'AS193', 'UR037', 'AS198', 'UR265', 'HP049', 'UR171', 'UR159', 'UR075', 'UR080', 'UR117A', 'UR081', 'UR1049', 'UR087', 'UR245', 'UR077', 'UR060', 'UR1033B', 'UR076', 'AS078', 'UR065', 'UR092', 'HP019', 'UR167', 'HP031', 'UR241', 'UR230', 'UR169', 'UR016', 'UR057', 'UR029', 'UR023', 'UR156', 'UR1052', 'HP016', 'UR097', 'UR003', 'UR126', 'UR115', 'UR024', 'UR007', 'UR074', 'UR073', 'UR063', 'UR072', 'UR056C', 'UR172', 'UR1033G', 'UR1033D', 'UR286', 'UR047', 'UR021', 'UR006', 'UR032', 'UR009', 'UR116A', 'UR045', 'UR238', 'UR116B', 'UR205', 'UR109', 'UR096', 'UR148', 'UR177', 'UR175']