In this page, we will perform high-level analysis, comparing khipus to each other, as a whole.

As part of the building of the python class object OODB, we now have a summary table of useful metrics of Khipu. A list of sample metrics includes, how many pendant cords, how many S-Twist cords, how many subsidiary cords for each pendant cord, etc..

One interesting field is the *document* field which is a textual representation of the khipus "features", such as cord attachments, values, colors, knot types, etc. This textual representation becomes a "proxy" for the Khipu, and allows NLP algorithms to work on the khipu. The long term goal then becomes translating that text document into something meaningful to us.

I will admit it is an odd process. First humans convert thoughts to cloth. Then humans measure the cloth, and convert it back to numbers (the database). Then I come along and convert the numbers back to pixel-cloth and I convert the numbers to text. Which I will convert back to numbers in code, so that I can hopefully convert those numbers back to thoughts. Hmmmmn....

The long prologue of imports and setup has been extracted to a startup file for brevity. The first task is to complete initialization by doing final setup and importing the Python Khipu OODB (Object Oriented Database) and summary files.

In [1]:

```
# Initialize plotly
plotly.offline.init_notebook_mode(connected = True);
# Khipu Imports
sys.path.insert(0, '../../classes') # Load external khipu classes contained in this directory
import khipu_kamayuq as kamayuq # A Khipu Maker is known (in Quechua) as a Khipu Kamayuq
import khipu_qollqa as kq
```

The KhipuKamayuq class (a Khipu scribe is a KhipuKamayuq) fetches khipus using the data accessed using the KhipuQollqa (a Khipu Warehouse) class. The KhipuKamayuq class can fetch all the khipus stored(*pickled*) as python object-oriented class objects. Additionally it has a khipu summary dataframe made by the KhipuQollqa. Let's load the OODB, and the summary dataframe

The khipu summary dataframe is large. Any tabular view may require a mouse that scrolls horizontally!. So in addition let's list the columns

In [2]:

```
start_time = time.time()
all_khipus = [aKhipu for aKhipu in kamayuq.fetch_all_khipus().values()]
print(f"That took {(time.time() - start_time)} seconds")
print(f"The khipu_kamayuq fetched {len(all_khipus)} khipus from the khipu qollqa")
```

In [3]:

```
khipu_summary_df = kq.fetch_khipu_summary()
display_dataframe(khipu_summary_df)
khipu_summary_df.columns
```

Out[3]:

Beginning programmers are taught to learn how to write "Hello World" as their first programming exercise. For khipu scholars, who are researching khipus statistically, Benford's Law is their "Hello World." Benford's law, simply put, says that the first digits of numbers used to measure human activity tend to have a higher probability that they are 1. For example, 10,000, 1,000, etc. We can use Benford's law to predict which khipu are of an "accounting nature" - ie. khipu that have sums and accounts of things.

Benford’s Law by Dan Ma

The first digit (or leading digit) of a number is the leftmost digit (e.g. the first digit of 567 is 5). The first digit of a number can only be 1, 2, 3, 4, 5, 6, 7, 8, and 9 since we do not usually write a number such as 567 as 0567. Some fraudsters may think that the first digits in numbers in financial documents appear with equal frequency (i.e. each digit appears about 11% of the time). In fact, this is not the case. It was discovered by Simon Newcomb in 1881 and rediscovered by physicist Frank Benford in 1938 that the first digits in many data sets occur according to the probability distribution indicated in Figure 1 below:

The above probability distribution is now known as the Benford’s law. It is a powerful and yet relatively simple tool for detecting financial and accounting frauds. For example, according to the Benford’s law, about 30% of numbers in legitimate data have 1 as a first digit. Fraudsters who do not know this will tend to have much fewer ones as first digits in their faked data.

Data for which the Benford’s law is applicable are data that tend to distribute across multiple orders of magnitude. Examples include income data of a large population, census data such as populations of cities and counties. In addition to demographic data and scientific data, the Benford’s law is also applicable to many types of financial data, including income tax data, stock exchange data, corporate disbursement and sales data. The author of

I've Got Your Number How a Mathematical Phenomenon can help CPAs Uncover Fraud and other Irregularities., Mark Nigrini, also discussed data analysis methods (based on the Benford’s law) that are used in forensic accounting and auditing.

In [4]:

```
first_digit_probability = [0.301, 0.176, 0.125, 0.097, 0.079, 0.067, 0.058, 0.051, 0.046]
df = pd.DataFrame({'first_digit':list(range(1,10)), 'proportion':first_digit_probability})
fig = px.bar(df, x='first_digit', y='proportion', text='proportion', width=1200, height=300, title="Benford\'s Law").show()
```

Using the distribution of "first digits" in a cord (the value of the first knot) we can use a simple cosine distance similarity metric of Benford's distribution and cord distribution. Cosine distance similarity is a simple concept that is used a lot in this study ... If you're unfamiliar here's an intuitive explanation.

Suppose we want to see how close two sets of numbers are to each other. For example house A has 3 dogs and 2 cats, house B has 5 dogs and 1 cat and house C has 1 dog and 7 cats. We can represent each of these as a vector:

**A** = [3,2]
**B** = [5,1]
**C** = [1,7]

Each of these vectors can be "normalized" to a unit vector (a vector of length 1) by dividing by the vector length. Then you can take advantage of math. A vector dot-product of two vectors measures the angle between those two vectors. If we want to know if A is closer to B or C we take A . B and A . C and see which has the smallest angle between them. The dot -product generates the cosine of the angle, which is why this technique is known as the cosine distance metric.

We can do the same for khipus by creating a vector of first digit counts, for a given khipu and then dot producting it with the Benford probabilities above.

f(angle_theta) = first_digit_probability . given_khipu_digit_distribution. If the angle theta is 0, the vectors lie on top of each other (an exact match) so cos(theta) = 1, with no match = 0.

Another approach to understanding which khipus are accounting in nature, and which aren't is to sum up cord values and see if that sum occurs elsewhere on another cord in the khipu. Dr. Urton, shows an example of this kind, khipu UR028, in his youtube video on accounting khipu. Here I examine how many "sum matches" occur in a khipu (normalized by the number of total cords...). This gives us a secondary statistical check, in addition to Benford's Law.

Note that the metric calculated below is not exact. Here, only two cord sums are measured. The first decoded khipu, in the 1920s by Leland Locke, had one up-cord sum for four down cord values and would not score well on this metric.

There is another metric similar to the above. Some cord clusters (ie. top cords) are the sum, or approximate sum of the cluster adjacent (ie. on either side) of them. The khipu class tags these "sum_clusters" and allows us another metric.

In [5]:

```
## Benford Law Similarity
# Note that 0 index's are ignored, by setting first_digit_probability[0] to 0
first_digit_probability = [0.0, 0.301, 0.176, 0.125, 0.097, 0.079, 0.067, 0.058, 0.051, 0.046]
def benford_match(aKhipu):
digit_count = [0]*10
for cord in aKhipu.attached_cords():
first_digit = int(str(cord.decimal_value)[0])
digit_count[first_digit] += 1
cosine_similarity = 1 - scipy.spatial.distance.cosine(first_digit_probability, digit_count)
return cosine_similarity
khipu_benford_match = [benford_match(aKhipu) for aKhipu in all_khipus]
```

In [6]:

```
# Cords whose sum occurs as another cord in a khipu
def sum_matches(numbers, verbose = False):
numbers = list(filter(lambda a: a != 0, numbers))
num_matches= 0
max_match = 0
all_matches = [0]
for i in range(len(numbers)):
for j in range(i):
match_sum = numbers[i] + numbers[j]
if (match_sum in numbers) and (match_sum != 0):
if verbose: print(f"[{i}]{numbers[i]}, [{j}]{numbers[j]} sums to [{numbers.index(match_sum)}]{match_sum}")
num_matches += 1
all_matches.append(match_sum)
return ({'num_matches':num_matches, 'max_match': max(all_matches), 'mean_match': mean(all_matches) })
def khipu_sum_matches(aKhipu):
cord_values = [aCord.decimal_value for aCord in aKhipu.attached_cords() if aCord.decimal_value > 0]
return sum_matches(cord_values)
sum_match_dicts = [khipu_sum_matches(aKhipu) for aKhipu in all_khipus]
sum_match_counts = [aDict['num_matches']for aDict in sum_match_dicts]
sum_match_max = [aDict['max_match']for aDict in sum_match_dicts]
sum_match_mean = [aDict['mean_match']for aDict in sum_match_dicts]
```

In [7]:

```
benford_df = pd.DataFrame({'khipu_id': [aKhipu.khipu_id for aKhipu in all_khipus],
'name': [aKhipu.name() for aKhipu in all_khipus],
'num_pendants': [aKhipu.num_pendant_cords() for aKhipu in all_khipus],
'num_subsidiary_cords': [aKhipu.num_subsidiary_cords() for aKhipu in all_khipus],
'num_total_cords': [aKhipu.num_attached_cords()for aKhipu in all_khipus],
'fanout_ratio': [aKhipu.num_pendant_cords() / aKhipu.num_attached_cords() for aKhipu in all_khipus],
'benford_match': khipu_benford_match,
'sum_match_count': sum_match_counts,
'sum_match_max': sum_match_max,
'sum_match_mean': sum_match_mean})
benford_df['normalized_sum_match_count'] = benford_df['sum_match_count']/benford_df['num_total_cords']
benford_df = benford_df.sort_values(by='benford_match', ascending=False)
benford_df.to_csv("../data/CSV/benford.csv") # Save for later when we cross correlate with other statistical tests
display_dataframe(benford_df)
```

In [8]:

```
hist_data = [khipu_benford_match]
group_labels = ['benford_match']
colors = ['#835AF1']
(
ff.create_distplot(hist_data, group_labels, bin_size=.01,
show_rug=False,show_curve=True)
.update_layout(title_text='Distribution Plot of Khipu -> Benford Matches', # title of plot
xaxis_title_text='Benford Match Value (cosine_similarity)',
yaxis_title_text='Percentage of Khipus', # yaxis label
bargap=0.05, # gap between bars of adjacent location coordinates
#bargroupgap=0.1 # gap between bars of the same location coordinates
)
.show()
)
```

Here's a clear statistical/graphical narrative as to which khipus are "accounting" in nature (from a little above .6 to 1.0!), somewhat accounting in nature (.4 to .6), and possibly narrative in nature (< 0.4). How many khipus in each category?

In [9]:

```
is_likely_accounting = benford_df[benford_df.benford_match > .625]
print(f"Khipus with benford_match > .625 = {is_likely_accounting.shape[0]} (is_likely_accounting)")
is_likely_narrative = benford_df[benford_df.benford_match < .425]
print(f"Khipus with benford_match < .425 = {is_likely_narrative.shape[0]} (is_likely_narrative)")
```

We see that there's probably 352 khipu that are likely accounting, and 67 khipu that are likely narrative.

In [10]:

```
### Benford Match vs Sum Clusters
```

Let's do a scatter plot of Benford Matching vs (normalized) cords whose sums match another cord. The clustering of accounting khipus is revealed even more clearly.

In [11]:

```
fig = px.scatter(benford_df, x="benford_match", y="normalized_sum_match_count", log_y=True,
size="sum_match_max", color="sum_match_mean",
labels={"benford_match": "#Benford Match", "normalized_sum_match_count": "#Number of Matching Cord Sums (Normalized by # Total Cords)"},
hover_data=['name'], title="<b>Benford Match vs Normalized Sum Match Counts</b> - <i>Hover Over Circles to View Khipu Info</i>",
width=1500, height=1000);
fig.update_layout(showlegend=True).show()
```

Again, we see the lumpiness and confirmation of the "accounting khipus" whose Benford match is > 0.6 or so....

As another checkpoint on Benford Law matching approach, let's review the khipus that Marcia Ascher's annotated with mathematical relationships and compare them with their corresponding Benford match.

In [12]:

```
# These are the khipus that Marcia Ascher annotated as having interesting mathematical relationships
ascher_mathy_khipus = ['AS010', 'AS020', 'AS026B', 'AS029', 'AS038', 'AS041', 'AS048', 'AS066', 'AS068',
'AS079', 'AS080', 'AS082', 'AS092', 'AS115', 'AS128', 'AS129', 'AS164', 'AS168', 'AS192',
'AS193', 'AS197', 'AS198', 'AS199', 'AS201', 'AS206', 'AS207A', 'AS208', 'AS209', 'AS210',
'AS211', 'AS212', 'AS213', 'AS214', 'AS215', 'UR1034', 'UR1096', 'UR1097', 'UR1098',
'UR1100', 'UR1103', 'UR1104', 'UR1114', 'UR1116', 'UR1117', 'UR1120', 'UR1126', 'UR1131',
'UR1135', 'UR1136', 'UR1138', 'UR1140', 'UR1141', 'UR1143', 'UR1145', 'UR1148', 'UR1149',
'UR1151', 'UR1152', 'UR1163', 'UR1175', 'UR1180']
ascher_khipu_benford_match = [benford_match(aKhipu) for aKhipu in all_khipus if aKhipu.name() in ascher_mathy_khipus]
def is_an_ascher_khipu(aKhipu):
return True if str(aKhipu.original_name)[0:2] == "AS" or aKhipu.name()[0:2] == "AS" else False
ascher_khipus = [aKhipu for aKhipu in all_khipus if is_an_ascher_khipu(aKhipu)]
percent_mathy_khipus = 100.0 * float(len(ascher_mathy_khipus))/float(len(ascher_khipus))
percent_string = "{:4.1f}".format(percent_mathy_khipus)
hist_data = [ascher_khipu_benford_match]
group_labels = ['benford_match']
colors = ['#835AF1']
(
ff.create_distplot(hist_data, group_labels, bin_size=.01,
show_rug=False,show_curve=True)
.update_layout(title_text='<b>Benford Match for Ascher Khipus with Mathematical Relationships</b> - <i>Hover Over Circles to View Khipu Info</i>', # title of plot
xaxis_title_text='Benford Match Value (cosine_similarity)',
yaxis_title_text='Percentage of Khipus', # yaxis label
bargap=0.05, # gap between bars of adjacent location coordinates
#bargroupgap=0.1 # gap between bars of the same location coordinates
)
.show()
)
print (f"Percent of ascher khipus having mathematical relationships ({len(ascher_mathy_khipus)} out of {len(ascher_khipus)}) is {percent_string}%")
```

Let's conclude this section with an interesting "triangular" graph - Benford match vs Fanout Ratio (the ratio of Pendant cords vs Subsidiary cords)

In [13]:

```
fig = px.scatter(benford_df, x="benford_match", y="fanout_ratio",
size="num_total_cords", color="num_subsidiary_cords",
labels={"num_total_cords": "#Total Cords", "num_subsidiary_cords": "#Subsidiary Cords"},
hover_data=['name'], title="<b>Benford Match vs Fanout Ratio</b> - <i>Hover Over Circles to View Khipu Info</i>",
width=1500, height=1000);
fig.update_layout(showlegend=True).show()
```

As a useful first step, it will help to have the ability to sort khipus not just by meta-features like benford_match, but also by other forms of similarity.

One of my favorite similarity techniques is the text-clustering approaches of NLP. A driving need in NLP is to sort topics into various categories; for example, is this a news document or fiction. So how do we text mine khipus? A simple solution is to convert the khipu to words. Each Python Khipu class object (knot, cord, cluster, etc.) is given a "writer" function that documents itself as text. Below is the "document" for a simple 3 cord khipu - AS055. It's a pretty sparse and simple description. We'll go about making this more sophisticated as the project becomes more developed. The point is that we now have a set of words that roughly describe a khipu with categorical words such as *U_knot_twist*, or *single_knot* and numerical words such as 1421_decimal_value for the cord value. Now we use that set of words as the input to NLP algorithms.

In [14]:

```
khipu_docstrings_df = kq.fetch_khipu_docstrings()
khipu_names = list(khipu_docstrings_df.name.values)
khipu_documents_orig = list(khipu_docstrings_df.document.values)
# To assist clustering Ascher khipus, where many khipu metrics such as cord attachment, twist, etc are unknown,
# We can impute values for attachment, cord twist and knot twist
khipu_documents = [aDocument.replace("Unknown_attachment_type", "Recto_attachment_type").replace("U_twist", "S_twist").replace("U_knot_twist", "S_knot_twist") for aDocument in khipu_documents_orig]
```

In [15]:

```
print(khipu_docstrings_df[khipu_docstrings_df.name=="AS055"].document.values[0])
```

Then we can pass the document to modern NLP text mining algorithms which can cluster the text by similarity along the n-dimensions of all the different words and their frequencies in each document.

The usual first step in NLP document analysis is to build a term-frequency, inverse-document-frequency matrix.

Tf-Idf matrices for NLP have a long history. An explanation:

As part of training our machine model, we want to give it information about how words are distributed across our set of data. This is commonly done using Term Frequency/Inverse Document Frequency techniques.

The first step, then, is to calculate the term frequency (Tf), the inverse document frequency (Idf), and their product (Tf°Idf) Apologies: There’s an occasional Google Chrome bug with latex/equation rendering - see equations in Safari if you want to understand them

(Tf): TERM FREQUENCY - How probable is that this particular word appears in a SINGLE document.

$$Tf(\omega) = \left( \frac{ \# occurences\ of\ \omega\ in\ document}{\# words\ in\ document} \right)$$

(Idf): INVERSE DOCUMENT FREQUENCY - How common or rare a word is across ALL documents. If it’s a common word across all documents, it’s regarded as a non-differentiating word, and we want to weight it lower.

Alternatively, it can be thought of as the specificity of a term - quantified as an inverse function of the number of documents in which it occurs. The most common function used is log(natural) of the number of docs divided by the number of docs in which this term appears.

$$Idf(\omega) = ln \left( \frac{ \# docs}{\# docs.containing(\omega) } \right)$$(Tf°Idf):TERM SPECIFICITY Tf°Idf, the product of Tf and Idf, is used to weight words according to how “important” they are to the machine learning algorithm ☺. The higher the TF°IDF score (weight), the rarer the term and vice versa. The intuition for this measure is this:

If a word appears frequently in a document, then it should be important and we should give that word a high score. But if a word appears in too many other documents, it’s probably not a unique identifier, therefore we should assign a lower score to that word.

Tf°idf is one of the most popular term-weighting schemes today. Over 80% of text-based recommender systems, including Google’s search engine, use Tf°Idf. Introduced, as “term specificity” by Karen Spärck Jones in a 1972 paper, it has worked well as a heuristic. However, its theoretical foundations have been troublesome for at least three decades afterward, with many researchers trying to find information theoretic justifications for it.

*Ashok Khosla - Detecting Deceptive Hotel Reviews*

In [16]:

```
# Calculate TF-IDF Matrix
def simple_tokenizer(text):
# first tokenize by sentence, then by word to ensure that punctuation is caught as it's own token
tokens = [word.lower() for word in nltk.word_tokenize(text)]
return tokens
tfidf_vectorizer = TfidfVectorizer(max_df=.9, max_features=200000,
min_df=.1, stop_words=None,
use_idf=True, tokenizer=simple_tokenizer, ngram_range=(1,10))
tfidf_matrix = tfidf_vectorizer.fit_transform(khipu_documents)
terms = tfidf_vectorizer.get_feature_names()
cosine_distance = 1 - cosine_similarity(tfidf_matrix)
```

Traditionally the next step is to use a variety of clustering methods. Here I will only use two. The first, K-means clustering, groups the "documents" based on n-dimensional clustering of the bag of words contained in the documents. I've used K-means in the past. It's not a great clustering algorithm, but it has the advantage that it outputs discrete groupings which can be handy. So I always do it as a first pass.

In [17]:

```
num_clusters = 20
km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)
clusters = km.labels_.tolist()
```

Out[17]:

In [18]:

```
khipu_cluster_dict = { 'khipu_name': khipu_names, 'document': khipu_documents, 'cluster': clusters }
khipu_cluster_df = pd.DataFrame(khipu_cluster_dict, index = [clusters] , columns = ['khipu_name', 'document', 'cluster'])
#display_dataframe(khipu_cluster_df)
for cluster_id in range(num_clusters):
khipus_in_cluster = khipu_cluster_df[khipu_cluster_df.cluster == cluster_id]
print(cluster_id, ":\t", list(khipus_in_cluster.khipu_name.values))
```

One simple test of the clustering - where do known khipu clusters cluster? For example, where does UR1033[A-F] group? How about UR058[A-B]? Examine these and similar clusters in the list above and below in the hierarchical dendogram. It is also useful to browse the results visually using the Khipu Field Guide Image Browser

Hierarchical clustering, uses "cosine similarity" (i.e. how close together are two normalized vectors in an n-dimensional bag-of-words frequency matrix) to group together and cluster documents. In my experience, it generally performs better than Kmeans or Principal Component Analysis. Although I prefer the R package for clustering, Python provides graphics adequate for the job at hand. In Phase 3 we will likely move to R.

In [19]:

```
Z = ward(cosine_distance) #define the linkage_matrix using ward clustering pre-computed distances
fig, ax = plt.subplots(figsize=(15, 40)) # set size
ax = dendrogram(Z, orientation="left", labels=khipu_names);
plt.tick_params(\
axis= 'x', # changes apply to the x-axis
which='both', # both major and minor ticks are affected
bottom='off', # ticks along the bottom edge are off
top='off', # ticks along the top edge are off
labelbottom='off')
plt.tight_layout() #show plot with tight layout
#uncomment below to save figure
plt.savefig('./images/khipu_clusters.png', dpi=200) #save figure as ward_clusters
# Clustered
sorted_by_similarity = ax['ivl']
print(sorted_by_similarity)
```