Zipf’s Law
1. Zipf’s Law
I have previously stated that I share the opinion of many khipu scholars that khipu are not language.
I will now proceed to beat a dead horse.
1.1 Introduction
This set of investigative studies is guided by the book Statistical Universals of Language: Mathematical Chance vs. Human Choice by Kumiko Tanaka-Ishii (Springer-Verlag, 2021).
1.2 Motivation
From the author Kumiko Tanaka-Ishii:
For nearly a hundred years, researchers have noticed how language ubiquitously follows certain mathematical properties. These properties differ from linguistic universals that contribute to describing the variation of human languages. Rather, they are statistical: they can only be identified by examining a huge number of usages, and none of us is conscious of them when we use language. Today, abundant data is available in various languages, and it provides a clearer picture of what these properties are. They apply universally across genres, languages, authors, and time periods, in a range of sign-based human activities, even in music and computer programming. Often, these properties are called scaling laws, but the term is not applicable to all of them. Because they are both statistical and universal, we call them statistical universals.
2. Zipf’s Law:
The first thing we are interested in is the classic application of Zipf’s law. For this we need three items:
- A vocabulary size N (the number of distinct words)
- A ranking k - the words, sorted in descending order of count
- A frequency - each word’s count divided by the total number of words in the text
Zipf’s law states:
Let:
- N be the number of elements;
- k be their rank;
- s be the value of the exponent characterizing the distribution.
Zipf’s law then predicts that out of a population of N elements, the normalized frequency of the element of rank k, normalized_freq(k, s, N), is:
\({\displaystyle normalized\_freq(k,s,N)={\frac {1/k^{s}}{\sum \limits _{n=1}^{N}(1/n^{s})}}}\)
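As a quick sanity check, the formula can be computed directly. This is a minimal sketch (the function name mirrors the equation above; it is not any library API):

```python
from math import fsum

def normalized_freq(k, s, N):
    """Zipf's law: predicted normalized frequency of the element of rank k."""
    # The denominator is the generalized harmonic number H(N, s).
    harmonic = fsum(1.0 / n**s for n in range(1, N + 1))
    return (1.0 / k**s) / harmonic

# With s = 1, the predicted frequencies over all N ranks sum to 1,
# and frequency falls off as 1/k with increasing rank.
print(sum(normalized_freq(k, 1.0, 100) for k in range(1, 101)))
```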
2.1 Some notes about Zipf’s law:
The following list summarizes the consequences of Zipf’s law in relation to language:
- The value of s is approximately 1.
- Zipf’s law is universal. It applies to any text, regardless of the genre, language, time, or place.
- Zipf’s law applies to not only natural language data but also data related to human language, such as music and programming language source code.
- The law is a rough approximation, and both the heads and tails of distributions often deviate. The plots of some texts show a convex tendency. Furthermore, certain kinds of texts show a large deviation from a power law.
- Changing the elements from words to morphemes or characters changes the shape of the plot.
- There are other power laws related to Zipf’s law.
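One such related power law (not named above, but also treated in Tanaka-Ishii's book) is Heaps' law, which says that vocabulary size grows as a power of text length. A minimal sketch for measuring that growth, assuming a simple tokens-to-vocabulary sampling approach (the function name and step size are my own, for illustration):

```python
def vocabulary_growth(words, step=1000):
    """Return (tokens_seen, distinct_words_seen) after every `step` tokens."""
    seen = set()
    growth = []
    for i, word in enumerate(words, 1):
        seen.add(word)
        if i % step == 0:
            growth.append((i, len(seen)))
    return growth

# Tiny example: after 2 tokens we have seen 2 distinct words; after 4 tokens, 3.
print(vocabulary_growth(["a", "b", "a", "c"], step=2))  # → [(2, 2), (4, 3)]
```

Plotting tokens against distinct words on log-log axes should give a roughly straight line for natural-language text.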
3. Study Data
These studies will attempt to apply these “statistical universals” to four “languages”: English, Spanish, Khipu, and Quechua. Khipu may, or may not, be a language. We’re pretty sure the other three are :-)
Seven sample files are the subject of this investigation:
- Sample 1: 00_Train_English_Moby_Dick.txt - the English text of Moby Dick by Herman Melville. This file is used as a reference check, since it is also used by Kumiko Tanaka-Ishii.
- Sample 2: 01_Train_English_Magister_Ludi.txt - the English translation of the novel by Hermann Hesse
- Sample 3: 02_Train_Espanol_El_Espia_del_Inka.txt - the Spanish text of El Espía del Inka (The Spy of the Inka) by Rafael Dumett
- Sample 4: 03_Train_Quechua_New_Testament.xml - an XML document containing the Quechua New Testament
- Sample 5: 10_Train_Khipu_Document.txt - a text file containing one (very long) line per khipu, where each line is the khipu’s entire data represented as a very long text string (think of it as a pickling of the data using English instead of numbers). This is the same long document description used in hierarchical grouping, where its use is described in detail.
- Sample 6: khipu_docstrings.csv - a csv dataframe whose columns comprise a khipu name and each khipu’s pendant cord colors, pendant cord values, and pendant knot sequence
- Sample 7: group_color_sequence.csv - a csv file containing each khipu’s group cord-color sequence
3.1 Reading in the Study Data - Making Words
Code
import os
import utils_loom as uloom

def clean_word(aWord):
    return aWord.replace(",", "").replace(".", "").replace("\t", "").replace("\n", "").strip()

def text_file_to_words(text_file):
    data_directory = f"{uloom.data_directory()}/CORPUS"
    corpus_file = f"{data_directory}/{text_file}"
    with open(corpus_file) as f:
        lines = f.readlines()
    document = " ".join(lines)
    words = [clean_word(aWord) for aWord in document.split()]
    return words
moby_dick_words = text_file_to_words("00_Train_English_Moby_Dick.txt")
magister_ludi_words = text_file_to_words("01_Train_English_Magister_Ludi.txt")
inka_espia_words = text_file_to_words("02_Train_Espanol_El_Espia_del_Inka.txt")
Code
import xml.etree.ElementTree as ET

def xml_file_to_words(xml_file):
    data_directory = f"{uloom.data_directory()}/CORPUS"
    corpus_file = f"{data_directory}/{xml_file}"
    tree = ET.parse(corpus_file)
    text_nodes = tree.findall("./text/body/div/div/seg")
    document = "\n".join([clean_word(aNode.text) for aNode in text_nodes])
    words = [clean_word(aWord) for aWord in document.replace("\n", " ").split(" ")]
    words = [word for word in words if len(word) > 0]
    return (document, words)
(quechua_new_testament_document, quechua_new_testament_words) = xml_file_to_words("03_Train_Quechua_New_Testament.xml")
Code
# Make/read the various khipu document corpuses
import pandas as pd
import qollqa_chuspa as qc

# Make/read the corpus where each entire khipu is described verbally...
do_make_khipu_document_file = True
document_corpus_file = f"{uloom.data_directory()}/CORPUS/10_Khipu_Documents.txt"
if do_make_khipu_document_file:
    khipu_dict, all_khipus = qc.fetch_khipus()
    khipu_document_str = ""
    for aKhipu in all_khipus:
        khipu_document_str += "\n" + aKhipu.as_document()
    with open(document_corpus_file, "w") as text_file:
        text_file.write(khipu_document_str)
    khipu_document_words = khipu_document_str.split(" ")
else:
    with open(document_corpus_file, "r") as f:
        khipu_document_words = f.read().split(" ")

# Load khipu documents for cord colors, cord values, and group colors
khipu_doc_df = pd.read_csv(f"{uloom.data_directory()}/CSV/khipu_docstrings.csv")

def khipu_doc_to_words(column_name):
    lines = list(khipu_doc_df[column_name].values)
    words = " ".join(lines).split(" ")
    return words

khipu_color_words = khipu_doc_to_words("pendant_color_document")
khipu_cord_value_words = khipu_doc_to_words("pendant_cord_value_document")
khipu_group_color_words = list(pd.read_csv(f"{uloom.data_directory()}/CSV/group_color_sequence.csv").group_cord_colors.values)
3.2 Making the Rank-Frequency Table
Code
from collections import Counter

normalizer_cache = {}
def normalized_frequency(k, N, s=1.0):
    # Cached harmonic-number normalizer; note this implementation assumes s == 1.
    for n in range(1, N + 1):
        if n not in normalizer_cache:
            normalizer_cache[n] = (normalizer_cache[n - 1] + 1.0 / n) if n > 1 else 1.0
    norm_freq = (1.0 / normalizer_cache[N]) * (1.0 / float(k))
    return norm_freq

def rank_frequency_table(words):
    num_words = float(len(words))
    frequency_count = Counter(words).most_common()
    rank_frequency_df = pd.DataFrame(frequency_count, columns=['word', 'count'])
    rank_frequency_df['rank'] = range(1, len(rank_frequency_df) + 1)
    rank_frequency_df['frequency'] = [float(theCount) / num_words for theCount in list(rank_frequency_df['count'].values)]
    return rank_frequency_df
moby_dick_rank_frequency_table = rank_frequency_table(moby_dick_words)
magister_ludi_rank_frequency_table = rank_frequency_table(magister_ludi_words)
inka_espia_rank_frequency_table = rank_frequency_table(inka_espia_words)
quechua_new_testament_rank_frequency_table = rank_frequency_table(quechua_new_testament_words)
khipu_document_rank_frequency_table = rank_frequency_table(khipu_document_words)
khipu_color_rank_frequency_table = rank_frequency_table(khipu_color_words)
khipu_group_color_rank_frequency_table = rank_frequency_table(khipu_group_color_words)
khipu_cord_value_rank_frequency_table = rank_frequency_table(khipu_cord_value_words)
4. Zipfian Rank-Frequency Plots
We can graph the classic Zipfian rank-frequency plot for each of these. Plotly’s hover feature lets us hover over points to see each word’s rank, count, and frequency.
Code
import plotly.express as px
import plotly.graph_objs as go

def show_rank_frequency(rank_frequency_table, sub_title):
    num_words = len(rank_frequency_table)
    norm_freq_start = normalized_frequency(num_words, num_words)
    norm_freq_end = normalized_frequency(1, num_words)
    fig = px.scatter(rank_frequency_table,
                     x="frequency", y="rank",
                     color="count",
                     hover_name="word", hover_data=["rank", "count", "frequency"],
                     log_x=True, log_y=True,
                     width=944, height=944)
    fig.update_traces(marker_size=4)
    fig.add_trace(
        go.Scatter(
            x=[norm_freq_start, norm_freq_end],
            y=[num_words, 1],
            mode="lines",
            line=go.scatter.Line(color="black"),
            showlegend=False)
    )
    fig.update_layout(title=f"<b>{sub_title} - Zipfian <i>Word</i> Distribution</b> - Black Line is Predicted Normalized Frequency - Hover for More Information",
                      showlegend=False).show()
4.1 Indo-European Languages - Rank-Frequency
First, let’s look at the conventional Indo-European languages, English and Spanish.
4.2 Quechua Language - Rank-Frequency
Then, let’s look at Quechua, a non-Indo-European language:
Here we see that the frequency of common high-frequency words tails off much more quickly than expected. I suspect this is because Quechua is an agglutinative language. Suffix-agglutinating languages attach to root words the suffixes that play the role of common English prepositions such as from/to, of, with, etc. Consequently, agglutinative languages “clip” high-frequency counts. As an example, the English sentence “I give flowers to my mother also” translates in Quechua to “Mamaymanpas t’ikakunata quni”, where Mama-y-man-pas translates to mother-mine-to-also, t’ikakunata translates to flower(s)-(object of “to give”), and quni translates to give-I: three words versus seven in English. If we broke the words down into their roots and suffixes, I suspect we’d see a stronger fit.
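To illustrate how such a root-plus-suffix breakdown would redistribute word counts, here is a toy segmenter. It is not a real Quechua morphological analyzer: the suffix list is a tiny hypothetical sample chosen only to handle the example above, and real segmentation is far more subtle.

```python
# Hypothetical suffix list, for illustration only.
SUFFIXES = ["pas", "man", "ta", "kuna", "y"]

def crude_segment(word):
    """Greedily strip known suffixes from the end of a word, innermost-last."""
    morphemes = []
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if len(word) > len(suffix) and word.endswith(suffix):
                morphemes.insert(0, "-" + suffix)
                word = word[:-len(suffix)]
                changed = True
                break
    return [word] + morphemes

print(crude_segment("mamaymanpas"))  # → ['mama', '-y', '-man', '-pas']
```

Counting the resulting morphemes instead of whole words would boost the counts of the high-frequency suffixes, which is exactly the head of the distribution that agglutination “clips”.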
4.3 Khipu “Words” - Rank-Frequency
Now let’s look at Zipfian distribution for khipu “words” using the textual description of the khipu.
Well, it is a power-law distribution for khipu structure as a whole, but the fit is terrible.
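“Terrible fit” can be quantified. One common (if rough) approach, not used elsewhere in this study, is to fit a line to log-frequency versus log-rank by least squares: the slope magnitude estimates the exponent s, and R² measures how well a power law explains the data. A minimal sketch:

```python
import numpy as np

def fit_zipf_exponent(counts):
    """Estimate the Zipf exponent s by least squares in log-log space.

    counts: word counts sorted in descending order (rank 1 first).
    Returns (s, r_squared).
    """
    ranks = np.arange(1, len(counts) + 1)
    log_k = np.log(ranks)
    log_f = np.log(np.asarray(counts, dtype=float))
    slope, intercept = np.polyfit(log_k, log_f, 1)
    predicted = slope * log_k + intercept
    ss_res = np.sum((log_f - predicted) ** 2)
    ss_tot = np.sum((log_f - log_f.mean()) ** 2)
    return -slope, 1.0 - ss_res / ss_tot

# A perfectly Zipfian ranking (count proportional to 1/k) recovers s ≈ 1, R² ≈ 1.
s, r2 = fit_zipf_exponent([1000.0 / k for k in range(1, 201)])
print(f"s = {s:.3f}, R² = {r2:.3f}")  # → s = 1.000, R² = 1.000
```

For natural-language text we expect s near 1 with high R²; the khipu distributions above would show a much lower R².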
4.4 Khipu Cord-Color or Cord Value “Words” - Rank-Frequency
What about khipu cord colors or khipu cord values?
Code
We see that khipus are not very good at matching a Zipfian distribution for natural language.
4.5 Frequency Histograms
And their histograms:
Code
def show_frequency_histogram(rank_frequency_table, sub_title):
    fig = (px.histogram(rank_frequency_table,
                        x="frequency", histnorm='probability density',
                        log_y=True,
                        width=944, height=450)
           .update_layout(title=f"Zipfian Frequency Histogram of Words in {sub_title}", showlegend=False)
           .show()
           )
show_frequency_histogram(moby_dick_rank_frequency_table, "Moby Dick")
show_frequency_histogram(magister_ludi_rank_frequency_table, "Magister Ludi")
show_frequency_histogram(inka_espia_rank_frequency_table, "El Espía del Inka")
show_frequency_histogram(quechua_new_testament_rank_frequency_table, "Quechua New Testament")
show_frequency_histogram(khipu_document_rank_frequency_table, "Khipu Description")
show_frequency_histogram(khipu_color_rank_frequency_table, "Khipu Cord Colors")
show_frequency_histogram(khipu_cord_value_rank_frequency_table, "Khipu Cord Values")
5. Consequences of Zipf’s Power Law:
Hapax Legomena:
What exactly does this power function mean? Zipf’s law indicates that the population distribution—if considered in terms of the frequency among words in rank order—is preserved for any word population of a text. One important characteristic of this population is the number of rare words. Kretzschmar Jr. (2015) called this the “80/20 rule” to represent how 80% of word types yield only 20% of word tokens. The left graph in Fig. 4.1 shows that the vocabulary includes many rare words. Indeed, among all the word types of Moby Dick (v = 20,472), almost half occur only once, including examples such as “white-fire” and “weazel”. Such words that occur only once in a text are called hapax legomena. The proportion of hapax legomena in a text is usually around half, roughly ranging from 40% to 60%. Zipf’s law suggests that this proportion remains the same regardless of the text size.
Code
def percent_hapax_legomena(rank_freq_tbl):
    # Percentage of word *types* that occur exactly once.
    one_hits = rank_freq_tbl[rank_freq_tbl['count'] == 1]
    percent_hapax = 100.0 * float(len(one_hits)) / float(len(rank_freq_tbl))
    return percent_hapax

def print_percent_hapax_legomena(rank_freq_tbl, sub_title):
    the_percent_hl = percent_hapax_legomena(rank_freq_tbl)
    print(f"Percent of {sub_title} words that are Hapax Legomena = {the_percent_hl:.1f}%")
hapax_legomena_dict = {
    "Moby Dick (Eng)": percent_hapax_legomena(moby_dick_rank_frequency_table),
    "Magister Ludi (Eng)": percent_hapax_legomena(magister_ludi_rank_frequency_table),
    "El Espia del Inka (Esp)": percent_hapax_legomena(inka_espia_rank_frequency_table),
    "Quechua New Testament (Que)": percent_hapax_legomena(quechua_new_testament_rank_frequency_table),
    "Khipu Description": percent_hapax_legomena(khipu_document_rank_frequency_table),
    "Khipu Cord Colors": percent_hapax_legomena(khipu_color_rank_frequency_table),
    "Khipu Group Colors": percent_hapax_legomena(khipu_group_color_rank_frequency_table),
    "Khipu Cord Values": percent_hapax_legomena(khipu_cord_value_rank_frequency_table),
}

# Display the chart
doc_names = list(hapax_legomena_dict.keys())
doc_counts = list(hapax_legomena_dict.values())
doc_count_strings = [f"{x:.0f}" for x in doc_counts]
hl_df = pd.DataFrame(zip(doc_names, doc_counts, doc_count_strings),
                     columns=["Document Name", "% Hapax Legomena", "Percent Hapax Legomena"])
px.bar(hl_df, x="Document Name", y="% Hapax Legomena",
       text="Percent Hapax Legomena",
       title="Percent Hapax Legomena per Training Document",
       width=944, height=450).update_layout(showlegend=True, xaxis_tickangle=270).show()
6. Conclusion
It could be argued that the small amount of khipu data we have prevents us from achieving a reasonable Zipfian fit for documents, cord colors, and cord values. That would explain the drop-off in low-rank/high-frequency words, but not the erratic fit of high-rank/low-frequency words.
Similarly, we know that cord colors deviate wildly from a Zipfian distribution: witness their low percentage of hapax legomena (perhaps caused by a limited color palette or a limited sample size), and the correspondingly high percentage for group color sequences. The only distribution that follows Zipfian expectations is cord values, although that too is a poor fit.
The statistical evidence that khipus are linguistic in nature is growing increasingly thin. However, the data indicates that if there is a place for language to be found, it lies in exploring the cords’ knotted values, not their colors.