Zipf’s Law


1. Zipf’s Law

I have previously stated that I share the opinion of many khipu scholars that khipu are not language.

I will now proceed to beat a dead horse.

1.1 Introduction

This set of investigative studies is guided by the book Statistical Universals of Language: Mathematical Chance vs. Human Choice by Kumiko Tanaka-Ishii (Springer-Verlag, 2021).

1.2 Motivation

From the author Kumiko Tanaka-Ishii:

For nearly hundred years, researchers have noticed how language ubiquitously follows certain mathematical properties. These properties differ from linguistic universals that contribute to describing the variation of human languages. Rather, they are statistical: they can only be identified by examining a huge number of usages, and none of us is conscious of them when we use language. Today, abundant data is available in various languages, and it provides a clearer picture of what these properties are. They apply universally across genres, languages, authors, and time periods, in a range of sign-based human activities, even in music and computer programming. Often, these properties are called scaling laws, but the term is not applicable to all of them. Because they are both statistical and universal, we call them statistical universals.

2. Zipf’s Law

The first thing we are interested in is the classic application of Zipf’s law. For this we need three items:

  • A vocabulary size N - the number of distinct words
  • A ranking k - the words, sorted in descending order by count
  • A frequency - a word’s count divided by the total number of words in the text
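As a toy illustration (using made-up words, not the study data), all three items can be computed with collections.Counter, whose most_common() already returns words in rank order:

```python
from collections import Counter

# A toy text - illustrative only, not study data
words = "the cat and the dog and the bird".split()

counts = Counter(words).most_common()   # sorted by count, i.e. in rank order
total_words = len(words)                # total word count of the text

for rank, (word, count) in enumerate(counts, start=1):
    print(rank, word, count, count / total_words)
```

Here "the" occurs 3 times out of 8 words, so it takes rank 1 with frequency 0.375.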

Zipf’s law states:

Let:

  • N be the number of elements;
  • k be their rank;
  • s be the value of the exponent characterizing the distribution.

Zipf’s law then predicts that out of a population of N elements, the normalized frequency of the element of rank k, normalized_freq(k, s, N), is:

\({\displaystyle \mathrm{normalized\_freq}(k,s,N)={\frac {1/k^{s}}{\sum \limits _{n=1}^{N}(1/n^{s})}}}\)
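For concreteness, the formula can be evaluated directly in Python (a small illustrative sketch, not part of the study code). Because of the normalizer in the denominator, the predicted frequencies across all N ranks sum to 1:

```python
def normalized_freq(k, s, N):
    """Zipf's predicted normalized frequency for the element of rank k."""
    normalizer = sum(1.0 / n**s for n in range(1, N + 1))
    return (1.0 / k**s) / normalizer

# With N = 5 and s = 1, rank 1 is predicted 1 / (1 + 1/2 + 1/3 + 1/4 + 1/5) ≈ 0.438
print(normalized_freq(1, 1.0, 5))

# The predictions over all ranks sum to 1
print(sum(normalized_freq(k, 1.0, 5) for k in range(1, 6)))
```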

2.1 Some notes about Zipf’s law

The following list summarizes the consequences of Zipf’s law in relation to language:

  1. The value of s is approximately 1.
  2. Zipf’s law is universal. It applies to any text, regardless of the genre, language, time, or place.
  3. Zipf’s law applies to not only natural language data but also data related to human language, such as music and programming language source code.
  4. The law is a rough approximation, and both the heads and tails of distributions often deviate. The plots of some texts show a convex tendency. Furthermore, certain kinds of texts show a large deviation from a power law.
  5. Changing the elements from words to morphemes or characters changes the shape of the plot.
  6. There are other power laws related to Zipf’s law.
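Note 1 can be checked empirically: since Zipf’s law implies log(frequency) ≈ const − s·log(rank), the exponent s is the negated slope of a least-squares line fit in log-log space. The following is a minimal sketch of that fit using numpy and synthetic Zipfian counts (the function and data are illustrative, not taken from the study code):

```python
import numpy as np

def estimate_zipf_exponent(counts):
    """Estimate s from a list of word counts via a least-squares fit
    of log(count) against log(rank)."""
    counts = np.sort(np.asarray(counts, dtype=float))[::-1]  # descending = rank order
    ranks = np.arange(1, len(counts) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(counts), 1)
    return -slope

# Synthetic counts that follow 1/k exactly should recover s ≈ 1
synthetic_counts = [1000.0 / k for k in range(1, 201)]
print(round(estimate_zipf_exponent(synthetic_counts), 3))  # → 1.0
```

For real texts, the head and tail deviations mentioned in note 4 mean the fitted slope depends on which rank range is included in the fit.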

3. Study Data

These studies will attempt to apply these “statistical universals” to four “languages”: English, Spanish, Khipu, and Quechua. Khipu may, or may not, be a language. We’re pretty sure the other three are :-)

Six sample files are the subject of this investigation:

  • Sample 1: 00_Train_English_Moby_Dick.txt - English Moby Dick by Herman Melville. This file is used as a reference check, since it is also used by Kumiko Tanaka-Ishii.
  • Sample 2: 01_Train_English_Magister_Ludi.txt - the translated work of the novel by Herman Hesse
  • Sample 3: 02_Train_Espanol_El_Espia_del_Inka.txt - the Spanish text of El Espía del Inka (Spy of the Inka) by Rafael Dumett
  • Sample 4: 03_Train_Quechua_New_Testament.xml - an XML document containing the Quechua New Testament.
  • Sample 5: 10_Train_Khipu_Document.txt - a text file containing one (very long) line per khipu, where each line is the khipu’s entire data represented as a long text string (think of it as a pickling of the data using English words instead of numbers). This is the same long document description used in hierarchical grouping, where its use is described in detail.
  • Sample 6: khipu_docstrings.csv - a csv dataframe with columns that comprise a khipu name, and a list of each khipu’s pendant cord colors, pendant cord values, and pendant knot sequence

3.1 Reading in the Study Data - Making Words

Code
# Initialize plotly
import plotly
plotly.offline.init_notebook_mode(connected = False);
Code
import os
import utils_loom as uloom

def clean_word(aWord):
    return aWord.replace(",","").replace(".","").replace("\t","").replace("\n","").strip()

def text_file_to_words(text_file):
    data_directory = f"{uloom.data_directory()}/CORPUS"
    corpus_file = f"{data_directory}/{text_file}"
    with open(corpus_file) as f: lines = f.readlines()
    document = " ".join(lines)
    words = document.split()
    words = [clean_word(aWord) for aWord in words]
    return words

moby_dick_words = text_file_to_words("00_Train_English_Moby_Dick.txt")
magister_ludi_words = text_file_to_words("01_Train_English_Magister_Ludi.txt")
inka_espia_words = text_file_to_words("02_Train_Espanol_El_Espia_del_Inka.txt")
# moby_dic_words
Code
import xml.etree.ElementTree as ET
def xml_file_to_words(xml_file):
    data_directory = f"{uloom.data_directory()}/CORPUS"
    corpus_file =  f"{data_directory}/{xml_file}"    
    tree = ET.parse(corpus_file)
    text_nodes = tree.findall("./text/body/div/div/seg")
    document = "\n".join([clean_word(aNode.text) for aNode in text_nodes])    
    words = document.replace("\n"," ").split(" ")
    words = [clean_word(aWord) for aWord in words]
    words = [word for word in words if len(word) > 0]
    return (document, words)

(quechua_new_testament_document, quechua_new_testament_words) = xml_file_to_words("03_Train_Quechua_New_Testament.xml")
Code
# Make/Read various khipu document corpus-es
import pandas as pd
import qollqa_chuspa as qc

# Make/Read corpus where entire khipu is described verbally...
do_make_khipu_document_file = True

document_corpus_file = f"{uloom.data_directory()}/CORPUS/10_Khipu_Documents.txt"    
if do_make_khipu_document_file:
    khipu_dict, all_khipus = qc.fetch_khipus()
    khipu_document_str = ""
    for aKhipu in all_khipus:
        khipu_document_str += "\n"+aKhipu.as_document()
    with open(document_corpus_file, "w") as text_file:
        text_file.write(khipu_document_str)
    khipu_document_words = khipu_document_str.split(" ")
else:
    with open(document_corpus_file, "r") as f:
        khipu_document_words = f.read().split(" ")

# Load khipu documents for cord colors, cord values, and group colors
khipu_doc_df = pd.read_csv(f"{uloom.data_directory()}/CSV/khipu_docstrings.csv")
def khipu_doc_to_words(column_name):
    lines = list(khipu_doc_df[column_name].values)
    words = " ".join(lines).split(" ")
    return (words)
khipu_color_words = khipu_doc_to_words("pendant_color_document")
khipu_cord_value_words = khipu_doc_to_words("pendant_cord_value_document")

khipu_group_color_words = list(pd.read_csv(f"{uloom.data_directory()}/CSV/group_color_sequence.csv").group_cord_colors.values)
3.2 Making the Rank-Frequency Table

Code
from collections import Counter

normalizer_cache = {}
def normalized_frequency(k, N, s=1.0):
    # Incrementally cache the partial sums of the harmonic series.
    # Note: the cache is only valid for s == 1.0; for any other exponent,
    # compute the normalizer directly.
    if s != 1.0:
        normalizer = sum(1.0/pow(float(n), s) for n in range(1, N+1))
        return (1.0/normalizer)*(1.0/pow(float(k), s))
    for n in range(1, N+1):
        if n not in normalizer_cache:
            normalizer_cache[n] = (normalizer_cache[n-1] + 1.0/float(n)) if n > 1 else 1.0
    return (1.0/normalizer_cache[N])*(1.0/float(k))

def rank_frequency_table(words):
    num_words = float(len(words))
    frequency_count = Counter(words).most_common()
    rank_frequency_df = pd.DataFrame(frequency_count, columns=['word', 'count'])
    rank_frequency_df['rank'] = range(1, len(rank_frequency_df)+1)
    rank_frequency_df['frequency'] = [float(theCount)/num_words for theCount in list(rank_frequency_df['count'].values)]
    #rank_frequency_df['normalized_frequency'] = [normalized_frequency(k, num_words) for k in range(1, len(rank_frequency_df)+1)]
    return(rank_frequency_df)

moby_dick_rank_frequency_table = rank_frequency_table(moby_dick_words)
magister_ludi_rank_frequency_table = rank_frequency_table(magister_ludi_words)
inka_espia_rank_frequency_table = rank_frequency_table(inka_espia_words)
quechua_new_testament_rank_frequency_table = rank_frequency_table(quechua_new_testament_words)
khipu_document_rank_frequency_table = rank_frequency_table(khipu_document_words)
khipu_color_rank_frequency_table = rank_frequency_table(khipu_color_words)
khipu_group_color_rank_frequency_table = rank_frequency_table(khipu_group_color_words)
khipu_cord_value_rank_frequency_table = rank_frequency_table(khipu_cord_value_words)
Code
moby_dick_rank_frequency_table.head(10)
word count rank frequency
0 the 13852 1 0.064212
1 of 6659 2 0.030868
2 and 6064 3 0.028110
3 to 4576 4 0.021212
4 a 4548 5 0.021083
5 in 3955 6 0.018334
6 that 2821 7 0.013077
7 his 2451 8 0.011362
8 it 1924 9 0.008919
9 I 1834 10 0.008502

4. Zipfian Rank-Frequency Plots

We can graph the classic Zipfian frequency plot for each of these. Plotly’s hover feature lets us hover over points to see each word’s rank, count, etc.

Code
import plotly.express as px
import plotly.graph_objs as go

def show_rank_frequency(rank_frequency_table, sub_title):
    num_words = len(rank_frequency_table)
    norm_freq_start = normalized_frequency(num_words, num_words)
    norm_freq_end = normalized_frequency(1, num_words)
    fig = px.scatter(rank_frequency_table, 
                  x="frequency", y="rank", 
                  color="count",
                  hover_name="word", hover_data=["rank", "count", "frequency"],
                  log_x=True, log_y=True,
                  width=944, height=944)
    fig.update_traces(marker_size=4)       
    fig.add_trace(
        go.Scatter(
            x=[norm_freq_start, norm_freq_end, ],
            y=[num_words, 1],
            mode="lines",
            line=go.scatter.Line(color="black"),
            showlegend=False)
            )
    fig.update_layout(title=f"<b>{sub_title} - Zipfian <i>Word</i> Distribution</b> - Black Line is Predicted Normalized Frequency - Hover for More Information", 
                      showlegend=False).show()

4.1 Indo-European Languages - Rank-Frequency

First, let’s look at the conventional Indo-European languages, English and Spanish.

Code
show_rank_frequency(moby_dick_rank_frequency_table, "Moby Dick")
show_rank_frequency(magister_ludi_rank_frequency_table, "Magister Ludi")
show_rank_frequency(inka_espia_rank_frequency_table, "El Espía del Inka")