Benford Match


1. Benford Match

Benford’s Law

Beginning programmers traditionally write “Hello World” as their first programming exercise. For khipu scholars who study khipus statistically, Benford’s Law is their “Hello World.” Benford’s Law, simply put, says that in numbers that measure human activity, the first digit is more likely to be 1 than any other digit. Numbers such as 10,000 or 1,000 are typical examples. We can use Benford’s Law to predict which khipus are of an “accounting nature,” i.e. khipus that hold sums and accounts of things.

Benford’s Law by Dan Ma

The first digit (or leading digit) of a number is the leftmost digit (e.g. the first digit of 567 is 5). The first digit of a number can only be 1, 2, 3, 4, 5, 6, 7, 8, and 9 since we do not usually write a number such as 567 as 0567. Some fraudsters may think that the first digits in numbers in financial documents appear with equal frequency (i.e. each digit appears about 11% of the time). In fact, this is not the case. It was discovered by Simon Newcomb in 1881 and rediscovered by physicist Frank Benford in 1938 that the first digits in many data sets occur according to the probability distribution indicated in Figure 1 below:

The above probability distribution is now known as the Benford’s law. It is a powerful and yet relatively simple tool for detecting financial and accounting frauds. For example, according to the Benford’s law, about 30% of numbers in legitimate data have 1 as a first digit. Fraudsters who do not know this will tend to have much fewer ones as first digits in their faked data.

Data for which the Benford’s law is applicable are data that tend to distribute across multiple orders of magnitude. Examples include income data of a large population and census data such as populations of cities and counties. In addition to demographic data and scientific data, the Benford’s law is also applicable to many types of financial data, including income tax data, stock exchange data, corporate disbursement and sales data. Mark Nigrini, the author of “I’ve Got Your Number: How a Mathematical Phenomenon Can Help CPAs Uncover Fraud and Other Irregularities,” also discussed data analysis methods (based on the Benford’s law) that are used in forensic accounting and auditing.
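The probabilities in this distribution follow a closed form: the chance that the leading digit is d is log10(1 + 1/d). The short sketch below reproduces the rounded values that the plotting code hardcodes. The helper name `benford_probability` is illustrative only and is not part of the khipu code.

```python
import math

def benford_probability(digit):
    """Probability that a number's leading digit equals `digit` (1-9) under Benford's Law."""
    return math.log10(1 + 1 / digit)

# Probabilities for the leading digits 1 through 9
benford = [benford_probability(d) for d in range(1, 10)]

for d, p in zip(range(1, 10), benford):
    print(f"digit {d}: {p:.3f}")
```

The nine probabilities sum to 1, and a leading 1 is roughly six and a half times as likely as a leading 9.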

Code
# Initialize plotly
import pandas as pd
import plotly
import plotly.express as px
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode(connected=False)

# Benford's Law probabilities for the leading digits 1 through 9
first_digit_probability = [0.301, 0.176, 0.125, 0.097, 0.079, 0.067, 0.058, 0.051, 0.046]
df = pd.DataFrame({'first_digit': list(range(1, 10)), 'proportion': first_digit_probability})
fig = px.bar(df, x='first_digit', y='proportion', text='proportion', width=944, height=300, title="Benford's Law")
fig.show()
[Figure: bar chart “Benford's Law”; x-axis: first_digit (1 to 9), y-axis: proportion]

2. Search Criteria:

An Explanation of “Cosine Distance”

Using the distribution of “first digits” in a khipu’s cords (the value of the first knot on each cord), we can compute a simple cosine similarity between Benford’s distribution and the cord distribution. Cosine similarity is a simple concept that is used a lot in this study. If you’re unfamiliar with it, here’s an intuitive explanation.

Suppose we want to see how close two sets of numbers are to each other. For example house A has 3 dogs and 2 cats, house B has 5 dogs and 1 cat and house C has 1 dog and 7 cats. We can represent each of these as a vector:

A = [3,2] B = [5,1] C = [1,7]

Each of these vectors can be “normalized” to a unit vector (a vector of length 1) by dividing it by its length. The dot product of two unit vectors equals the cosine of the angle between them. If we want to know whether A is closer to B or to C, we compute A . B and A . C on the normalized vectors and see which is larger: the larger the cosine, the smaller the angle. Because the dot product yields the cosine of the angle, this technique is known as the cosine distance (or cosine similarity) metric.
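The dogs-and-cats example can be worked through in a few lines. This is a minimal sketch in plain Python; the function name `cosine_similarity` is my own and not part of the khipu code.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

house_a = [3, 2]  # 3 dogs, 2 cats
house_b = [5, 1]  # 5 dogs, 1 cat
house_c = [1, 7]  # 1 dog, 7 cats

sim_ab = cosine_similarity(house_a, house_b)
sim_ac = cosine_similarity(house_a, house_c)
print(f"A vs B: {sim_ab:.3f}")  # prints "A vs B: 0.925"
print(f"A vs C: {sim_ac:.3f}")  # prints "A vs C: 0.667"
```

House A is therefore closer to house B (mostly dogs) than to house C (mostly cats).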



We can do the same for khipus by creating a vector of first-digit counts for a given khipu and taking its dot product with the Benford probabilities above, after normalizing both vectors to unit length.

cos(theta) = first_digit_probability . given_khipu_digit_distribution (both normalized to unit length). If the angle theta is 0, the vectors lie on top of each other (an exact match), so cos(theta) = 1. If the two vectors have no digits in common, they are perpendicular and cos(theta) = 0 (no match).
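To get a feel for the scores this produces, here is a small sketch that compares three made-up digit-count vectors against the Benford probabilities. The counts are hypothetical and do not come from any real khipu. Index 0 is kept as a placeholder so that positions line up with digits, as in the notebook code.

```python
import math

# Benford probabilities; index 0 is a placeholder so index == digit
first_digit_probability = [0.0, 0.301, 0.176, 0.125, 0.097, 0.079, 0.067, 0.058, 0.051, 0.046]

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical first-digit counts (index = digit), for illustration only
benford_like = [0, 30, 18, 12, 10, 8, 7, 6, 5, 4]   # many 1s, few 9s
uniform      = [0, 10, 10, 10, 10, 10, 10, 10, 10, 10]
all_nines    = [0, 0, 0, 0, 0, 0, 0, 0, 0, 20]

for label, counts in [("benford_like", benford_like), ("uniform", uniform), ("all_nines", all_nines)]:
    print(f"{label}: {cosine_similarity(first_digit_probability, counts):.3f}")
```

Raw counts can be used directly because cosine similarity normalizes both vectors, so the number of cords does not affect the score.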

Another approach to understanding which khipus are accounting in nature, and which aren’t, is to sum up cord values and see if that sum occurs on another cord in the khipu. Dr. Urton shows an example of this kind, khipu UR028, in his YouTube video on accounting khipus. Here I examine how many “sum matches” occur in a khipu (normalized by the total number of cords). This gives us a secondary statistical check, in addition to Benford’s Law.

Note that the metric calculated below is not exact. Here, only sums of two cords are measured. The first decoded khipu, decoded in the 1920s by Leland Locke, had one up-cord sum for four down-cord values and would not score well on this metric.
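To make the idea of a “sum match” concrete, here is a rough sketch that counts adjacent pairs of cords whose sum appears as the value of some other cord. This is only one possible reading of the metric; the function `count_pair_sum_matches` and the sample values are hypothetical, and the actual Ascher sum relationship computation used in this study may differ.

```python
def count_pair_sum_matches(cord_values):
    """Count adjacent cord pairs whose sum equals the value of another cord."""
    matches = 0
    for i in range(len(cord_values) - 1):
        pair_sum = cord_values[i] + cord_values[i + 1]
        if pair_sum == 0:
            continue  # two empty cords trivially "sum" to any other empty cord
        # Look for the sum on any cord other than the two being added
        others = cord_values[:i] + cord_values[i + 2:]
        if pair_sum in others:
            matches += 1
    return matches

# Hypothetical cord values, for illustration only
values = [12, 30, 42, 5, 7, 100]
num_matches = count_pair_sum_matches(values)
normalized = num_matches / len(values)
print(num_matches, round(normalized, 3))
```

In this sample, 12 + 30 matches the cord valued 42 and 5 + 7 matches the cord valued 12, so two matches are found across six cords.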

There is another, similar metric. Some cord groups (e.g. top cords) are the sum, or approximate sum, of the groups adjacent to them (i.e. on either side). The khipu class tags these as “sum_groups,” which gives us another metric.

Code
# Initialize plotly
plotly.offline.init_notebook_mode(connected = False)

# Khipu Imports
import khipu_kamayuq as kamayuq  # A Khipu Maker is known (in Quechua) as a Khipu Kamayuq
import khipu_qollqa as kq
all_khipus = [aKhipu for aKhipu in kamayuq.fetch_all_khipus().values()]
khipu_summary_df = kq.fetch_khipu_summary()
Code
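## Benford Law Similarity
import scipy.spatial.distance

# Index 0 is a placeholder with probability 0, so list positions line up with the digits 1-9.
# Cords whose value is 0 are still counted in digit_count[0]; they match nothing in
# Benford's distribution and therefore lower the similarity.
first_digit_probability = [0.0, 0.301, 0.176, 0.125, 0.097, 0.079, 0.067, 0.058, 0.051, 0.046]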
def benford_match(aKhipu):
    digit_count = [0]*10
    for cord in aKhipu.attached_cords():
        first_digit = int(str(cord.decimal_value)[0])
        digit_count[first_digit] += 1
    cosine_similarity = 1 - scipy.spatial.distance.cosine(first_digit_probability, digit_count)
    return cosine_similarity
khipu_benford_match = [benford_match(aKhipu) for aKhipu in all_khipus]

We’re going to build a dataframe of Benford match versus cord counts, and versus the number of Ascher sum relationships that occur. This provides an alternative checkpoint on the Benford’s Law matching approach. To do this, we read in the previously computed Ascher Sum Relationship table and add up the number of Ascher relationships for each khipu.

Code
import os

fieldmarks_dir = f"{os.path.dirname(os.path.dirname(kq.qollqa_data_directory()))}/FIELDMARKS/CSV"
ascher_sum_relationship_df = pd.read_csv(f"{fieldmarks_dir}/ascher_sum_relationships.csv")
ascher_sums_dict = {aKhipu.name():0 for aKhipu in all_khipus}
for index in range(len(ascher_sum_relationship_df)):
    record = ascher_sum_relationship_df.iloc[index]
    khipu_name = ku.strip_html(record['name'])
    num_ascher_sums =  record['num_pendant_pendant_sum'] + record['num_pendant_pendants_color_sum'] + record['num_indexed_pendant_sum']
    num_ascher_sums += record['num_subsidiary_pendant_sum'] + record['num_group_group_sum'] + record['num_group_sum_bands'] 
    ascher_sums_dict[khipu_name] = num_ascher_sums

sum_match_counts =  [ascher_sums_dict[aKhipu.name()] for aKhipu in all_khipus]
Code
benford_df = pd.DataFrame({'khipu_id': [aKhipu.khipu_id for aKhipu in all_khipus],
                           'name': [aKhipu.name() for aKhipu in all_khipus],
                           'num_pendants': [aKhipu.num_pendant_cords() for aKhipu in all_khipus],
                           'num_subsidiary_cords': [aKhipu.num_subsidiary_cords() for aKhipu in all_khipus],
                           'num_total_cords': [aKhipu.num_cc_cords() for aKhipu in all_khipus],
                           'fanout_ratio': [aKhipu.num_pendant_cords() / aKhipu.num_attached_cords() for aKhipu in all_khipus],
                           'benford_match': khipu_benford_match,
                           'sum_match_count': sum_match_counts})

benford_df['normalized_sum_match_count'] = benford_df['sum_match_count']/benford_df['num_total_cords']
benford_df = benford_df.sort_values(by='benford_match', ascending=False)
display_dataframe(benford_df)
     khipu_id  name    num_pendants  num_subsidiary_cords  num_total_cords  fanout_ratio  benford_match  sum_match_count  normalized_sum_match_count
252  1000269   UR053E  70            35                    105              0.666667      0.992311       53               0.504762
645  6000089   UR193   99            127                   226              0.438053      0.985153       87               0.384956
577  6000020   KH0058  80            98                    178              0.449438      0.984313       63               0.353933
426  1000404   UR151   12            8                     20               0.600000      0.978858       4                0.200000
612  6000055   QU005   57            25                    82               0.695122      0.978147       6                0.073171
...  ...       ...     ...           ...                   ...              ...           ...            ...              ...
432  1000417   UR158   24            1                     25               0.960000      0.000000       0                0.000000
159  1000445   HP026   17            0                     17               1.000000      0.000000       0                0.000000
158  1000444   HP025   55            0                     55               1.000000      0.000000       0                0.000000
608  6000051   QU001   33            0                     33               1.000000      0.000000       0                0.000000
14   1000196   AS025   2             0                     2                1.000000      0.000000       0                0.000000

651 rows × 9 columns

3. Summary Charts:

Code
# Read in the Fieldmark and its associated dataframe and match dictionary
from fieldmark_khipu_summary import Fieldmark_Benford_Match
aFieldmark = Fieldmark_Benford_Match()
fieldmark_dataframe = aFieldmark.dataframes[0].dataframe
raw_match_dict = aFieldmark.raw_match_dict()
Code
# Plot Matching khipu
matching_khipus = aFieldmark.matching_khipus() 
matching_values = [raw_match_dict[aKhipuName] for aKhipuName in matching_khipus]
matching_df =  pd.DataFrame(list(zip(matching_khipus, matching_values)), columns =['KhipuName', 'Value'])
fig = px.bar(matching_df, x='KhipuName', y='Value', labels={"KhipuName": "Khipu Name", "Value": "Benford Match", }, 
            title=f"Matching Khipu ({len(matching_khipus)}) for Benford Match",  width=944, height=450).update_layout(showlegend=True).show()
[Figure: bar chart “Matching Khipu (640) for Benford Match”; x-axis: Khipu Name, y-axis: Benford Match (0 to 1)]
Code
# Plot Significant khipu
significant_khipus = [aKhipuName for aKhipuName in matching_khipus if raw_match_dict[aKhipuName] > 0.6]
significant_values = [raw_match_dict[aKhipuName] for aKhipuName in significant_khipus]
significant_df =  pd.DataFrame(list(zip(significant_khipus, significant_values)), columns =['KhipuName', 'Value'])
fig = px.bar(significant_df, x='KhipuName', y='Value', labels={"KhipuName": "Khipu Name", "Value": "Benford Match", },
             title=f"Significant Khipu ({len(significant_khipus)}) for Benford Match", width=944, height=450).update_layout(showlegend=True).show()
[Figure: bar chart “Significant Khipu (451) for Benford Match”; x-axis: Khipu Name, y-axis: Benford Match (0 to 1)]

4. Significance Criteria:

What is deemed “significant”

Code
import plotly.figure_factory as ff

hist_data = [khipu_benford_match]
group_labels = ['benford_match']
colors = ['#835AF1']

(
ff.create_distplot(hist_data, group_labels,  bin_size=.01, 
                         show_rug=False,show_curve=True)
    .update_layout(title_text='Distribution Plot of Khipu -> Benford Matches', # title of plot
                     xaxis_title_text='Benford Match Value (cosine_similarity)', 
                     yaxis_title_text='Percentage of Khipus', # yaxis label
                     bargap=0.05, # gap between bars of adjacent location coordinates
                     #bargroupgap=0.1 # gap between bars of the same location coordinates
                     )
     .show()
)
[Figure: distribution plot “Distribution Plot of Khipu -> Benford Matches”; x-axis: Benford Match Value (cosine_similarity), y-axis: Percentage of Khipus; series: benford_match]

The distribution gives a clear graphical picture of which khipus are “accounting” in nature (Benford match from a little above 0.6 up to 1.0), somewhat accounting in nature (0.4 to 0.6), and possibly narrative in nature (below 0.4). How many khipus fall into each category?

Code
is_likely_accounting = benford_df[benford_df.benford_match > .625]
print(f"Khipus with benford_match > .625 = {ku.pct_kfg_khipus(is_likely_accounting.shape[0])} (is_likely_accounting)")
is_likely_narrative = benford_df[benford_df.benford_match < .425]
print(f"Khipus with benford_match < .425 = {ku.pct_kfg_khipus(is_likely_narrative.shape[0])} (is_likely_narrative)")
Khipus with benford_match > .625 = 431 (66%) (is_likely_accounting)
Khipus with benford_match < .425 = 102 (16%) (is_likely_narrative)

We see that there are probably 431 (66%) khipus that are likely accounting in nature, and 102 (16%) khipus that are likely narrative.

5. Summary Results:

Measure                         Result
Number of Khipus That Match     639 (98%)
Number of Significant Khipus    447 (69%)
Five most Significant Khipu     UR053E, KH0058, UR193, UR151, QU005

6. Exploratory Data Analysis

Sanity Check - Benford Match vs Sum Groups

Let’s do a scatter plot of Benford match versus the (normalized) number of cords whose values match the sum of another set of cords (i.e. one of the Ascher X-ray Sum Relationships). The grouping of accounting khipus is revealed even more clearly.

Code
fig = px.scatter(benford_df, x="benford_match", y="normalized_sum_match_count", log_y=True,
                 size="num_total_cords", color="num_total_cords", 
                 labels={"benford_match": "#Benford Match", "normalized_sum_match_count": "#Number of Matching Cord Sums (Normalized by # Total Cords)"},
                 hover_data=['name'], title="<b>Benford Match vs Normalized Ascher Sum Relationship Counts</b> - <i>Hover Over Circles to View Khipu Info</i>",
                 width=944, height=944);
fig.update_layout(showlegend=True).show()
[Figure: scatter plot “Benford Match vs Normalized Ascher Sum Relationship Counts - Hover Over Circles to View Khipu Info”; x-axis: #Benford Match, y-axis (log scale): #Number of Matching Cord Sums (Normalized by # Total Cords); marker size and color: num_total_cords]

Again, we see the lumpiness and confirmation of the “accounting khipus” whose Benford match is greater than 0.6 or so. But we ALSO see strong evidence of summation relationships in low Benford match khipus that have a large number of cords. This is not a good graphic for the argument that khipus are written speech!

Sanity Check - Marcia Ascher’s “Mathematical” Khipus versus their Corresponding Benford Match

As another checkpoint on the Benford’s Law matching approach, let’s review the reverse: take the khipus that Marcia Ascher annotated with mathematical relationships and look at their corresponding Benford match distribution.

Code
# These are the khipus that Marcia Ascher annotated as having interesting mathematical relationships
ascher_mathy_khipus = ['AS010', 'AS020', 'AS026B', 'AS029', 'AS038', 'AS041', 'AS048', 'AS066', 'AS068', 
                       'AS079', 'AS080', 'AS082', 'AS092', 'AS115', 'AS128', 'AS129', 'AS164', 'AS168', 'AS192', 
                       'AS193', 'AS197', 'AS198', 'AS199', 'AS201', 'AS206', 'AS207A', 'AS208', 'AS209', 'AS210', 
                       'AS211', 'AS212', 'AS213', 'AS214', 'AS215', 'UR1034', 'UR1096', 'UR1097', 'UR1098', 
                       'UR1100', 'UR1103', 'UR1104', 'UR1114', 'UR1116', 'UR1117', 'UR1120', 'UR1126', 'UR1131', 
                       'UR1135', 'UR1136', 'UR1138', 'UR1140', 'UR1141', 'UR1143', 'UR1145', 'UR1148', 'UR1149', 
                       'UR1151', 'UR1152', 'UR1163', 'UR1175', 'UR1180']
ascher_khipu_benford_match = [benford_match(aKhipu) for aKhipu in all_khipus if aKhipu.name() in ascher_mathy_khipus]

def is_an_ascher_khipu(aKhipu):
    # Ascher khipus carry an "AS" prefix on either the original name or the current name
    return str(aKhipu.original_name)[0:2] == "AS" or aKhipu.name()[0:2] == "AS"

ascher_khipus = [aKhipu for aKhipu in all_khipus if is_an_ascher_khipu(aKhipu)]
percent_mathy_khipus = 100.0 * float(len(ascher_mathy_khipus))/float(len(ascher_khipus))
percent_string = "{:4.1f}".format(percent_mathy_khipus)

hist_data = [ascher_khipu_benford_match]
group_labels = ['benford_match']
colors = ['#835AF1']

(
ff.create_distplot(hist_data, group_labels,  bin_size=.01, 
                         show_rug=False,show_curve=True)
    .update_layout(title_text='<b>Benford Match for Ascher Khipus with Mathematical Relationships</b> - <i>Hover Over Circles to View Khipu Info</i>', # title of plot
                     xaxis_title_text='Benford Match Value (cosine_similarity)', 
                     yaxis_title_text='Percentage of Khipus', # yaxis label
                     bargap=0.05, # gap between bars of adjacent location coordinates
                     #bargroupgap=0.1 # gap between bars of the same location coordinates
                     )
     .show()
)
print (f"Percent of ascher khipus having mathematical relationships ({len(ascher_mathy_khipus)} out of {len(ascher_khipus)}) is {percent_string}%")
[Figure: distribution plot “Benford Match for Ascher Khipus with Mathematical Relationships - Hover Over Circles to View Khipu Info”; x-axis: Benford Match Value (cosine_similarity), y-axis: Percentage of Khipus; series: benford_match]
Percent of ascher khipus having mathematical relationships (61 out of 144) is 42.4%

We see that all of the khipus that Marcia Ascher annotated with mathematical relationships have a Benford match over 0.4. This is no surprise, but it’s a good confirmation that the technique makes sense.

Benford Match vs Fanout Ratio (Primaries vs Subsidiaries)

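Let’s conclude this section with an interesting “triangular” graph: Benford match versus fanout ratio. In the dataframe built earlier, the fanout ratio is the number of pendant cords divided by the number of attached cords (num_pendant_cords() / num_attached_cords()), so a higher fanout means fewer subsidiaries.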
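As a quick check of how the fanout ratio behaves, here is a minimal sketch. It assumes that the attached cords are the pendants plus the subsidiaries, which is consistent with the rows in the dataframe output shown earlier (for UR053E, 70 pendants and 35 subsidiaries give 0.666667). The helper `fanout_ratio` is hypothetical and not part of the khipu class.

```python
def fanout_ratio(num_pendants, num_subsidiaries):
    """Pendants as a fraction of all attached cords (assumed to be pendants + subsidiaries)."""
    num_attached = num_pendants + num_subsidiaries
    if num_attached == 0:
        return 0.0  # a khipu with no cords has no meaningful fanout
    return num_pendants / num_attached

# Values taken from the dataframe output shown earlier
print(f"UR053E: {fanout_ratio(70, 35):.6f}")   # prints "UR053E: 0.666667"
print(f"UR193:  {fanout_ratio(99, 127):.6f}")  # prints "UR193:  0.438053"
print(f"HP026:  {fanout_ratio(17, 0):.6f}")    # prints "HP026:  1.000000"
```

A ratio of 1.0 means the khipu has no subsidiaries at all, while a ratio near 0 means almost every cord hangs off another cord.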

Code
fig = px.scatter(benford_df, x="benford_match", y="fanout_ratio", 
                 size="num_total_cords", color="normalized_sum_match_count", 
                 labels={"num_total_cords": "#Total Cords", "num_subsidiary_cords": "#Subsidiary Cords"},
                 hover_data=['name'], title="<b>Benford Match vs Fanout Ratio</b><i>(# Pendants/# Subsidiaries - Higher Fanout = Fewer Subsidiaries)</i>",
                 width=950, height=950);
fig.update_layout(showlegend=True).show()
[Figure: scatter plot “Benford Match vs Fanout Ratio (# Pendants/# Subsidiaries - Higher Fanout = Fewer Subsidiaries)”; x-axis: benford_match, y-axis: fanout_ratio; color: normalized_sum_match_count]

If khipus are indeed linguistic, we would expect their cords to take on one of two possible linguistic structures:

  • Linear/Serial Structure: One possibility is that language is encoded serially with each cord expressing a morpheme or word. The Zipf’s Law study below explores this possibility.
  • Tree Structure: The other possibility is that language is expressed in a tree fashion, using primaries and subsidiaries, like the sentence diagrams you used to do in grade school. The above Benford’s law Fan-out study suggests that encoding in a tree fashion is highly unlikely.

The only significant khipu to appear in the lower left triangle is UR093. However, an examination of the Ascher sum relationships for this khipu shows it to have significant arithmetic relationships.

7. Zipf’s Law

Another approach to understanding whether khipus are linguistic in nature is to examine their various distributions of color, cord-value, etc., according to Zipf’s Law. Zipf’s Law has been extensively studied as a linguistic measure, and its general conclusions apply to all forms of human language.

Zipf’s Law, like Benford’s Law, describes a steeply falling distribution, although its shape (a power law over frequency ranks) differs from Benford’s. A study of khipus using Zipf’s Law once again weakens the argument that khipus are linguistic in nature.
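In its simplest form, Zipf’s Law says that the item of rank r occurs about 1/r times as often as the most frequent item. The sketch below compares observed rank-frequency counts against that ideal curve. The token list is made up for illustration, and the helper names are my own rather than part of the khipu code.

```python
from collections import Counter

def rank_frequency(tokens):
    """Frequencies sorted from most to least common (rank 1 first)."""
    return [count for _, count in Counter(tokens).most_common()]

def zipf_expected(top_frequency, num_ranks):
    """Ideal Zipf frequencies with exponent 1: rank r occurs top_frequency / r times."""
    return [top_frequency / rank for rank in range(1, num_ranks + 1)]

# Hypothetical tokens (these could be words, cord colors, or cord values)
tokens = ["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3

observed = rank_frequency(tokens)
expected = zipf_expected(observed[0], len(observed))
for rank, (obs, exp) in enumerate(zip(observed, expected), start=1):
    print(f"rank {rank}: observed {obs}, Zipf ideal {exp:.1f}")
```

This toy sample was chosen to fit the ideal curve exactly. Real data is usually judged by how closely a log-log plot of rank against frequency follows a straight line.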

8. Conclusion

There are many statistical “leading indicators” that imply that khipus are most likely not linguistic.

  • Application of Benford’s Law indicates that the majority of found khipus have numbers that closely fit the distributions of numbers used in human “accounts” such as taxes, inventories, census data, etc.
  • The significant presence of Ascher Sum relationships, such as pendant sums of other pendants or group sums of other groups, in low Benford match khipus implies fairly strongly that khipus are not a medium of literature or writing (in the western sense of the word), but rather an accounting structure of some sort.
  • Application of Zipf’s Law adds considerable statistical weight to the argument that khipus do not have a “natural language vocabulary,” whether of color, cord-value, or overall structure. All three of these possible linguistic communication media fail to follow a classic Zipfian natural language distribution.
  • Only cord-values obey the Zipfian natural language distribution of hapax legomena, the 40-60% of vocabulary that is used only once. Cord colors fail miserably on this measure.
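The hapax legomena measure mentioned in the last point is straightforward to compute: it is the share of distinct vocabulary items that occur exactly once. Here is a minimal sketch with made-up tokens; `hapax_fraction` is a hypothetical helper, not the function used in the Zipf’s Law study.

```python
from collections import Counter

def hapax_fraction(tokens):
    """Fraction of distinct items (the vocabulary) that occur exactly once."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    num_hapaxes = sum(1 for count in counts.values() if count == 1)
    return num_hapaxes / len(counts)

# Hypothetical samples, for illustration only
cord_values = [10, 20, 20, 35, 47, 112, 10, 8]            # several values appear once
cord_colors = ["W", "W", "AB", "W", "AB", "MB", "W", "AB"]  # small palette, heavy reuse

print(f"cord values: {hapax_fraction(cord_values):.2f}")
print(f"cord colors: {hapax_fraction(cord_colors):.2f}")
```

A small palette that is reused heavily pushes the fraction down, which is one way a limited set of colors could fall short of the 40-60% seen in natural language.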

It could be argued that the small amount of data we have in khipus prevents us from achieving a reasonable Zipfian fit for documents, cord colors, and cord values. That would explain the drop-off in low-ranking/high-frequency words, but not the erratic fit of high-ranking/low-frequency words. Similarly, we know that cord colors are far from a normal Zipfian distribution, as shown by their low percentage of hapax legomena (perhaps caused by a limited color palette or a limited sample size). The only thing that looks Zipfian is cord-values, although that too is a poor fit.