Statistics is the science of learning from experience, particularly experience that arrives a little bit at a time: the successes and failures of a new experimental drug, the uncertain measurements of an asteroid's path toward Earth. It may seem surprising that any one theory can cover such an amorphous target as "learning from experience."
Computer Age Statistical Inference: Algorithms, Evidence, and Data Science, Bradley Efron & Trevor Hastie
It's possible, for various reasons, that khipus are duplicated in the Khipu Field Guide (KFG). For example, an Ascher khipu might be re-read and entered in reverse order, minus some strings.
Manuel Medrano found the following khipu duplicates using an old-fashioned shoe-leather approach, plus diagrams on the KFG. My algorithmic approach failed to find clear duplicates, so it's time to analyze what went wrong, and fix it!
UR035/AS070
UR043/AS030
UR044/AS031 (AS031 does not exist in the KFG)
UR083/AS208
UR126/UR115
UR133/UR036
UR163/AS056
UR176/LL01
UR236/AS181
UR281/AS068
HP035/AS047 (HP035 does not exist in the KFG)
HP041/AS046
10 duplicates exist in the KFG. Since the duplicate khipus are likely to be removed, this data study's lifetime will be short, but the lessons will be long-lived ;-)
Whoops. That didn't work. Phase 1
My initial approach, which failed to find duplicates, was a "winnowing" process:
Step 1 - Assume that a possible match has at least 80% as many cords as the khipu being tested.
Step 2 - If it's a reversed khipu, it should have similar cord values. The intersection of the two khipus' cord-value sets should be similar in size to each khipu's own set.
Step 3 - The same idea should be true of cord colors.
This approach failed. Miserably.
Why doesn't that work? What Does Work? Why?
Let's review an image quilt of the duplicate khipus that Manuel Medrano found.
Visual examination of the Image Quilt shows that the following 6 pairs (± 2) appear similar to the eye:
('UR035', 'AS070')
('UR043', 'AS030')
('UR126', 'UR115')
('UR133', 'UR036')
('UR281', 'AS068')
('HP041', 'AS046')
Note that when matching by eye, we seem to rely on knot location and cord length more than on conventional symbolic values such as knot value and color, which might be easier for a computer to match.
The fieldmark browser sorts khipus based on a vector of all fieldmarks, so the neighboring pairs in the mini browser we just made aren't always neighboring pairs in the browser. But they're close!
The number one predictor of closeness is the Similarity Index, and it appears to be a pretty good predictor! We'll get back to why that is later.
What isn't a predictor of closeness? The number of Ascher colors is a terrible predictor: witness UR035 with 52 colors to AS070's 30. So is cord value: its proxy, mean cord value, has a close match only a few times. The lesson is clear. In a database where two khipu measurers produce such different results, we should avoid using just a few fieldmarks to judge similarity.
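To see why a color-count filter throws out a pair like UR035/AS070, here is a minimal arithmetic sketch. It assumes the filter accepts a candidate only when its color count falls within ±40% of the base khipu's count (the 0.6 and 1.4 factors used in the color-winnowing code), and it uses a hypothetical `in_range` helper as a stand-in for `ku.in_range`, whose exact behavior isn't shown here. The counts 52 and 30 are the ones quoted above.

```python
def in_range(value, low, high):
    """Hypothetical stand-in for ku.in_range: an inclusive bounds check."""
    return low <= value <= high


def color_counts_compatible(base_count, other_count, tolerance=0.4):
    """True if other_count lies within +/- tolerance of base_count."""
    low = (1 - tolerance) * base_count
    high = (1 + tolerance) * base_count
    return in_range(other_count, low, high)


ur035_colors, as070_colors = 52, 30

# With UR035 as the base, the accepted window is roughly 31.2 to 72.8, so 30 falls just short.
print(color_counts_compatible(ur035_colors, as070_colors))  # False

# With AS070 as the base, the window is roughly 18 to 42, so 52 is too large.
print(color_counts_compatible(as070_colors, ur035_colors))  # False
```

Whichever khipu is treated as the base, the pair is rejected on color count alone, even though it is a known duplicate.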
So why is the similarity index such a good predictor? Remember that the similarity index uses a textual description of the khipu. Everything, including knots, locations, cord lengths, cords per cluster, etc., is recorded in that textual description. A very large n-dimensional vector is then created from it, and the hierarchical clustering algorithm does its magic on that. Many dimensions make the search more accurate.
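As a rough illustration of the "textual description to vector" idea, the sketch below turns each description into a token-count vector and compares vectors with cosine similarity. This is a deliberately simpler, swapped-in technique, not the KFG similarity index or its hierarchical clustering. The khipu names, the description strings, and the token format are all invented for the example.

```python
import math
from collections import Counter


def description_vector(description):
    """Turn a textual description into a sparse token-count vector."""
    return Counter(description.split())


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse count vectors."""
    shared_tokens = set(vec_a) & set(vec_b)
    dot = sum(vec_a[token] * vec_b[token] for token in shared_tokens)
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_match(name, descriptions):
    """Return the name of the other khipu whose description is most similar."""
    target = description_vector(descriptions[name])
    scores = {
        other: cosine_similarity(target, description_vector(text))
        for other, text in descriptions.items()
        if other != name
    }
    return max(scores, key=scores.get)


# Invented toy descriptions: every token is one dimension of the vector.
descriptions = {
    "khipu_A": "cluster 4cords long_knot@20cm single_knot@10cm cord_len_30cm cord_len_32cm",
    "khipu_A_reread": "cluster 4cords long_knot@20cm single_knot@10cm cord_len_30cm",
    "khipu_B": "cluster 9cords figure8_knot@5cm cord_len_12cm cord_len_55cm",
}

print(closest_match("khipu_A", descriptions))  # khipu_A_reread
```

Because every recorded detail contributes a dimension, a single missing cord or a disagreement in one detail lowers the score slightly instead of eliminating the pair outright.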
What's the takeaway for me?
Visual inspection matters.
Fieldmarks matter. But only if you use a bunch of them simultaneously (in parallel, not series)!
Dimensions matter. A complete textual description of a khipu, converted into the numerical patois of a machine learning algorithm, produces a pretty good match predictor!
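The "in parallel, not series" point can be sketched with a toy comparison. The per-fieldmark similarity scores below are invented, and the 0.6 threshold and the simple averaging rule are assumptions chosen only to illustrate the contrast. Neither function is the KFG's actual method.

```python
def passes_in_series(scores, threshold=0.6):
    """Winnowing: the pair survives only if every fieldmark passes on its own."""
    return all(score >= threshold for score in scores.values())


def passes_in_parallel(scores, threshold=0.6):
    """Combined view: the pair survives if the average across fieldmarks passes."""
    return sum(scores.values()) / len(scores) >= threshold


# Invented scores for a true duplicate pair where two measurers
# recorded the colors very differently.
pair_scores = {
    "cord_count": 0.95,
    "knot_positions": 0.90,
    "cord_lengths": 0.85,
    "color_count": 0.40,
}

print(passes_in_series(pair_scores))    # False
print(passes_in_parallel(pair_scores))  # True
```

In the serial version, one noisy fieldmark is enough to discard the pair. In the parallel version, the other fieldmarks outvote it.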
The Old Approach (That Didn't Work!)
It's possible to look for these khipus algorithmically. We'll use a winnowing process.
Step 1 - Assume that a possible match has at least 80% as many cords as the khipu being tested.
Step 2 - If it's a reversed khipu, it should have similar cord values. The intersection of the two khipus' cord-value sets should be similar in size to each khipu's own set.
Step 3 - The same idea should be true of cord colors.
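The overlap test in Steps 2 and 3 can be sketched on toy data as follows. This is a minimal illustration under stated assumptions: `in_range` is a hypothetical stand-in for the `ku.in_range` helper used in the code cells, assumed to be an inclusive bounds check, and the cord values are invented. Because sets ignore order, a reversed reading with a few missing cords still overlaps heavily with the original.

```python
def in_range(value, low, high):
    """Hypothetical stand-in for ku.in_range: an inclusive bounds check."""
    return low <= value <= high


def sets_overlap_enough(base_values, candidate_values, tolerance=0.3):
    """True if the candidate's value set is similar in size to the base set
    and the two sets share a similar number of values."""
    base_set = set(base_values)
    candidate_set = set(candidate_values)
    low = (1 - tolerance) * len(base_set)
    high = (1 + tolerance) * len(base_set)
    sizes_similar = in_range(len(candidate_set), low, high)
    shared_enough = in_range(len(base_set & candidate_set), low, high)
    return sizes_similar and shared_enough


original = [10, 25, 3, 140, 7, 62]       # invented cord values
reversed_minus_some = [62, 7, 140, 3, 25]  # same khipu read backwards, one cord missing
unrelated = [1, 2, 4, 8, 16, 32]

print(sets_overlap_enough(original, reversed_minus_some))  # True
print(sets_overlap_enough(original, unrelated))            # False
```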
Code
(khipu_dict, all_khipus) = kamayuq.fetch_khipus()
khipu_summary_df = kq.fetch_khipu_summary()
khipu_pendant_count = sorted(
    [(khipu.name(), khipu.num_pendant_cords()) for khipu in all_khipus],
    key=lambda x: x[1],
    reverse=True,
)

# We first winnow search by comparing pendant cord count.
# This is a right triangular search (khipu[x] is matched with khipu[x+1,x+2....x+n])
possible_matches = {}
for index, (khipu_name, pendant_count) in enumerate(khipu_pendant_count):
    right_khipus = khipu_pendant_count[index+1:]
    possible_matches[khipu_name] = [
        a_name
        for (a_name, aCount) in right_khipus
        if (a_name != khipu_name) and (aCount <= pendant_count) and (aCount >= 0.7*pendant_count)
    ]
possible_matches = {name: aList for name, aList in possible_matches.items() if len(aList)}

from collections import Counter
Code
# Build dictionaries of cord colors and cord color sets for each khipu
khipu_colors = {}
khipu_color_counter = {}
khipu_color_set = {}
for aKhipuName, match_list in possible_matches.items():
    khipu_colors[aKhipuName] = [aCord.longest_ascher_color() for aCord in khipu_dict[aKhipuName].pendant_cords()]
    khipu_color_counter[aKhipuName] = Counter(khipu_colors[aKhipuName])
    khipu_color_set[aKhipuName] = set(khipu_color_counter[aKhipuName].keys())

for aKhipuName, match_list in possible_matches.items():
    for aMatchName in match_list:
        if not aMatchName in khipu_colors.keys():
            khipu_colors[aMatchName] = [aCord.longest_ascher_color() for aCord in khipu_dict[aMatchName].pendant_cords()]
            khipu_color_counter[aMatchName] = Counter(khipu_colors[aMatchName])
            khipu_color_set[aMatchName] = set(khipu_color_counter[aMatchName].keys())

# And then winnow possible matches based on len of color sets and intersection of color sets length
color_matches = {}
for aKhipuName, match_list in possible_matches.items():
    my_set_length = len(khipu_color_set[aKhipuName])
    less_set_length = .6*my_set_length
    more_set_length = 1.4*my_set_length

    def is_match(search_khipu_name):
        if len(search_khipu_name) == 0:
            return False
        match_length = len(khipu_color_set[search_khipu_name])
        lengths_match = (ku.in_range(match_length, less_set_length, more_set_length))
        common_colors = khipu_color_set[aKhipuName].intersection(khipu_color_set[search_khipu_name])
        colors_match = (ku.in_range(len(common_colors), less_set_length, more_set_length))
        return lengths_match and colors_match

    the_khipu_set_matches = [aMatchName for aMatchName in match_list if is_match(aMatchName)]
    if len(the_khipu_set_matches):
        color_matches[aKhipuName] = the_khipu_set_matches

len(color_matches)
# color_matches possible_matches = color_matches
259
Code
# Build cord_value dictionaries....
khipu_cord_vals = {}
khipu_cord_val_counter = {}
khipu_cord_val_set = {}
for aKhipuName, match_list in possible_matches.items():
    khipu_cord_vals[aKhipuName] = [aCord.knotted_value() for aCord in khipu_dict[aKhipuName].pendant_cords()]
    khipu_cord_val_counter[aKhipuName] = Counter(khipu_cord_vals[aKhipuName])
    khipu_cord_val_set[aKhipuName] = set(khipu_cord_val_counter[aKhipuName].keys())

for aKhipuName, match_list in possible_matches.items():
    for aMatchName in match_list:
        if not aMatchName in khipu_cord_vals.keys():
            khipu_cord_vals[aMatchName] = [aCord.knotted_value() for aCord in khipu_dict[aMatchName].pendant_cords()]
            khipu_cord_val_counter[aMatchName] = Counter(khipu_cord_vals[aMatchName])
            khipu_cord_val_set[aMatchName] = set(khipu_cord_val_counter[aMatchName].keys())

# And then winnow possible matches based on len of cord value sets and intersection of cord value length
# Check (and report) on multiples/divisibles by 10 in case there's an off by ten error
def get_cord_val_matches(aKhipuName, match_list):
    my_set_length = len(khipu_cord_val_set[aKhipuName])
    less_set_length = .7*my_set_length
    more_set_length = 1.3*my_set_length

    def is_match(search_khipu_name):
        match_length = len(khipu_cord_val_set[search_khipu_name])
        lengths_match = (ku.in_range(match_length, less_set_length, more_set_length))
        if not lengths_match:
            return False

        def match_cord_val_set(search_khipu_name, cord_val_set):
            common_cord_values = khipu_cord_val_set[aKhipuName].intersection(cord_val_set)
            return (ku.in_range(len(common_cord_values), less_set_length, more_set_length))

        cord_values_match = match_cord_val_set(search_khipu_name, khipu_cord_val_set[search_khipu_name])
        if not cord_values_match:
            # Retry with the candidate's values multiplied by ten
            off_by_ten = {x*10 for x in khipu_cord_val_set[search_khipu_name]}
            cord_values_match = match_cord_val_set(search_khipu_name, off_by_ten)
            if cord_values_match:
                print(f"\t{aKhipuName}->{search_khipu_name} - OBOX10 Matched")
        if not cord_values_match:
            # Retry with the candidate's values divided by ten
            off_by_ten = {round(x/10) for x in khipu_cord_val_set[search_khipu_name]}
            cord_values_match = match_cord_val_set(search_khipu_name, off_by_ten)
            if cord_values_match:
                print(f"\t{aKhipuName}->{search_khipu_name} - OBO/10 Matched")
        # NOTE: as written, is_match() has no final return, so it returns None (falsy)
        # whenever the lengths match. Every candidate is therefore rejected,
        # even when an off-by-ten match has just been printed.

    the_khipu_set_matches = [aMatchName for aMatchName in match_list if is_match(aMatchName)]
    return the_khipu_set_matches

cord_val_matches = {}
for aKhipuName, match_list in possible_matches.items():
    if the_khipu_set_matches := get_cord_val_matches(aKhipuName, match_list):
        cord_val_matches[aKhipuName] = the_khipu_set_matches
cord_val_matches
UR028->UR044 - OBO/10 Matched
{}
There don't appear to be duplicates at first glance...