This experiment grew out of a study of how to identify duplicate khipus. One insight from that study was that visual similarity worked. Somewhat. So a question suggests itself: can you sort and search on image similarity? Of course you can!
There are a number of approaches to this problem. Since this is a prototype experiment, I picked an off-the-shelf package, ran it, and then reverse-engineered the output, as shown below. The input sources to the package were the small image-quilt thumbnails used in the Khipu Field Guide. The package I used, PixPlot, does a k-means clustering of the pixel “features” in each image, then projects the images onto the 2D plane using tSNE as the dimensionality-reduction algorithm. The final result, known as a raster-fairy layout, jitters the projected points so that they fit into a grid-like format.
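The shape of that pipeline can be sketched in a few lines. This is a simplified stand-in, not PixPlot's actual code: the "features" are random stand-ins for thumbnail pixels, the k-means step is skipped, and a plain PCA (via SVD) substitutes for tSNE, so that the sketch runs with numpy alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for thumbnail pixel "features": 50 images, 16x16 grayscale, flattened.
features = rng.random((50, 16 * 16))

# Dimensionality reduction to the 2D plane. PixPlot uses tSNE/UMAP; here a
# plain PCA via SVD serves as a lightweight stand-in for the projection step.
centered = features - features.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
xy = centered @ vt[:2].T   # (50, 2) positions in the projected plane
```

The grid-fitting ("raster-fairy") step then snaps these 2D positions onto unique grid cells, which is what the rest of this page reverse-engineers.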
At the bottom of the page I show the results and how effective the experiment was.
Reverse Engineering the Output of PixPlot
Code
import json
import plotly
import pandas as pd
import plotly.express as px

# Initialize plotly
plotly.offline.init_notebook_mode(connected=False)

pixplot_directory = f"{ku.project_directory()}/data/pixplot/output"
imagelist_file = f"{pixplot_directory}/data/imagelists/imagelist.json"
with open(imagelist_file) as f:
    image_data_dict = json.load(f)

# Parse and load the data
images = image_data_dict['images']
image_names = [x.replace("_wide.jpg", "") for x in images]
image_sizes = image_data_dict['cell_sizes'][0]
image_atlas = image_data_dict['atlas']
image_positions = image_atlas['positions']
Code
atlas_points = [(item[0], item[1]) for item in image_positions[0]]
named_atlas_points = [(name, position[0], position[1])
                      for (name, position) in zip(image_names, atlas_points)]
named_atlas_points_df = pd.DataFrame(named_atlas_points, columns=['name', 'x', 'y'])
(px.scatter(named_atlas_points_df, x="x", y="y",
            hover_name='name', hover_data=['x', 'y'],
            title="<b>Image Similarity with Unknown Layout</b>",
            width=1000, height=1000)
   .update_layout(showlegend=False)
   .update(layout_coloraxis_showscale=False)
   .show())
Not what I want.
Code
grid_file = f"{pixplot_directory}/data/layouts/grid_layout.json"
with open(grid_file) as f:
    grid_data_dict = json.load(f)

grid_points = [(item[0], item[1]) for item in grid_data_dict]
named_grid_points = [(name, position[0], position[1])
                     for (name, position) in zip(image_names, grid_points)]
named_grid_points_df = pd.DataFrame(named_grid_points, columns=['name', 'x', 'y'])
(px.scatter(named_grid_points_df, x="x", y="y",
            hover_name='name', hover_data=['x', 'y'],
            title="<b>Image Similarity with Unknown Layout</b>",
            width=1000, height=1000)
   .update_layout(showlegend=False)
   .update(layout_coloraxis_showscale=False)
   .show())
Also, not what I want.
Code
rasterfairy_file = f"{pixplot_directory}/data/layouts/rasterfairy_layout.json"
with open(rasterfairy_file) as f:
    rasterfairy_data_dict = json.load(f)

# Invert the normalized spacing to recover integer grid coordinates
grid_x_space = round(1. / (1 - .9166666))
grid_y_space = round(1. / (1 - .92308))
rasterfairy_points = [(round(item[0] * grid_x_space) + 12,
                       round(item[1] * grid_y_space) + 13)
                      for item in rasterfairy_data_dict]
named_rasterfairy_points = [(name, position[0], position[1])
                            for (name, position) in zip(image_names, rasterfairy_points)]
named_rasterfairy_points_df = pd.DataFrame(named_rasterfairy_points, columns=['name', 'x', 'y'])
(px.scatter(named_rasterfairy_points_df, x="x", y="y",
            hover_name='name', hover_data=['x', 'y'],
            title="<b>Image Similarity in Projected 2D Space (using UMAP/tSNE) Followed by a Gridded Deformation</b>",
            width=1000, height=1000)
   .update_layout(showlegend=False)
   .update(layout_coloraxis_showscale=False)
   .show())
named_rasterfairy_points_df
        name   x   y
0      AS001  19  14
1      AS002  19   1
2      AS003  23  11
3      AS004  18   0
4      AS005  13  17
..       ...  ..  ..
654    UR290  24  20
655   UR291A   0   7
656   UR292A  10  22
657    UR293  23   7
658    UR294  22  11

[659 rows x 3 columns]
This provides an equivalent of what I need.
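The magic constants in the decoding step above can be sanity-checked. My reading (an assumption, not documented by PixPlot) is that the rasterfairy JSON stores coordinates normalized in steps of 1/12 and 1/13, since 0.9166666 ≈ 11/12 and 0.92308 ≈ 12/13, so inverting `1 - step` recovers the cell spacing:

```python
# 1 - 0.9166666 ≈ 1/12 and 1 - 0.92308 ≈ 1/13, so the inverse of the
# remainder recovers the number of cells per unit of normalized space.
grid_x_space = round(1. / (1 - .9166666))
grid_y_space = round(1. / (1 - .92308))
print(grid_x_space, grid_y_space)  # 12 13
```

The `+12` and `+13` offsets in the decoding then shift the coordinates so the smallest cell index lands at zero.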
Building an HTML Table Version of PixPlot’s RasterFairy
Code
max_x = max(named_rasterfairy_points_df.x.tolist())
max_y = max(named_rasterfairy_points_df.y.tolist())
table_dict = {(x, y): "" for x in range(max_x + 1) for y in range(max_y + 1)}
for row_num in range(len(named_rasterfairy_points_df)):
    record = named_rasterfairy_points_df.iloc[row_num]
    table_dict[(record['x'], record['y'])] = record['name']
# table_dict
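With `table_dict` in hand, rendering the HTML table is a straightforward nested loop. A minimal, self-contained sketch, with a toy 2×2 `table_dict` standing in for the real 659-cell one:

```python
# Toy stand-in for the real table_dict built above: {(x, y): image_name_or_""}
table_dict = {(0, 0): "AS001", (1, 0): "", (0, 1): "UR294", (1, 1): "AS002"}
max_x = max(x for x, _ in table_dict)
max_y = max(y for _, y in table_dict)

# Emit one <tr> per grid row and one <td> per cell; unoccupied cells stay empty.
rows = []
for y in range(max_y + 1):
    cells = "".join(f"<td>{table_dict[(x, y)]}</td>" for x in range(max_x + 1))
    rows.append(f"<tr>{cells}</tr>")
html_table = "<table>" + "".join(rows) + "</table>"
print(html_table)
```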
The quilt deforms the projection so that every khipu occupies a unique spot on a grid: the eponymous raster-fairy layout. Interestingly, the first-level sort appears to be very simple: width vs. height. From there, additional pixel-level features (edge detection, etc.) sort the images by closeness.
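The deformation itself can be approximated by a greedy nearest-free-cell assignment. The real rasterfairy package optimizes the assignment globally, so this is only a sketch of the idea, with a made-up `snap_to_grid` helper:

```python
import math

def snap_to_grid(points, cols, rows):
    """Greedily assign each 2D point to the nearest unoccupied grid cell.

    A simplified stand-in for rasterfairy's grid deformation: first-come,
    first-served, rather than a globally optimized assignment.
    """
    free = {(c, r) for c in range(cols) for r in range(rows)}
    placed = {}
    for i, (x, y) in enumerate(points):
        cell = min(free, key=lambda cr: math.hypot(cr[0] - x, cr[1] - y))
        free.remove(cell)
        placed[i] = cell
    return placed

# Three points that crowd the same corner still receive distinct cells.
layout = snap_to_grid([(0.1, 0.1), (0.2, 0.1), (2.5, 2.5)], cols=3, rows=3)
print(layout)
```

Crowded regions of the projection thus get spread out, which is exactly the "jitter" visible in the gridded quilt.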
Evaluation of How Well the RasterFairy Did
The following pairs were identified as duplicate khipus by Manuel Medrano in a previous appendix. We can compute the distance between the grid points of each pair to see how well the algorithm grouped duplicates.
UR035/AS70
UR043/AS30
UR083/AS208
UR126/UR115
UR133/UR036
UR163/AS056
UR176/LL01
UR236/AS181
UR281/AS068
HP041/AS046
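Given the grid coordinates in `named_rasterfairy_points_df`, the distance for each pair is a simple lookup plus a Euclidean distance. A sketch, with a couple of hypothetical coordinates standing in for the real values:

```python
import math

# Hypothetical grid coordinates; the real values come from named_rasterfairy_points_df.
coords = {"UR126": (5, 7), "UR115": (6, 7), "UR176": (20, 3), "LL01": (2, 18)}
pairs = [("UR126", "UR115"), ("UR176", "LL01")]

# Euclidean distance between the grid cells of each duplicate pair.
dists = {}
for a, b in pairs:
    (ax, ay), (bx, by) = coords[a], coords[b]
    dists[f"{a}/{b}"] = math.hypot(ax - bx, ay - by)
    print(f"{a}/{b}: grid distance = {dists[f'{a}/{b}']:.1f}")
```

A small distance means the quilt placed the duplicates near each other; a large one means the image features failed to group them.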
As a baseline, let’s use the KFG Similarity index, created by a hierarchical-clustering approach on textual documents describing the khipu.
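To illustrate the general technique behind such a baseline (not the KFG's actual features or code), here is a tiny single-linkage agglomerative clustering over a made-up distance matrix for four hypothetical khipus:

```python
# Toy symmetric distance matrix over four hypothetical khipus.
names = ["UR126", "UR115", "UR176", "LL01"]
dist = {
    ("UR126", "UR115"): 1.0, ("UR126", "UR176"): 8.0, ("UR126", "LL01"): 9.0,
    ("UR115", "UR176"): 7.5, ("UR115", "LL01"): 8.5, ("UR176", "LL01"): 2.0,
}

def d(a, b):
    return 0.0 if a == b else dist.get((a, b), dist.get((b, a)))

# Single-linkage agglomeration: repeatedly merge the two closest clusters,
# where cluster distance is the minimum pairwise member distance.
clusters = [{n} for n in names]
merges = []
while len(clusters) > 1:
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: min(d(a, b) for a in clusters[ij[0]] for b in clusters[ij[1]]),
    )
    merged = clusters[i] | clusters[j]
    merges.append(sorted(merged))
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

print(merges)  # closest pairs merge first
```

Duplicate pairs with small textual distance merge early in the hierarchy, which is what makes such an index usable as a similarity baseline.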
Unsurprisingly, it’s better. It still fails where the image-based approach fails, but overall the 2D projection is more accurate than the deformed grid. It may be more accurate still in the original unprojected space, yet it fails on the same hard pairs.
In the above graph, not all khipus are visible, since some lie on top of each other after the tSNE projection from n-dimensional space down to 2D. For example, UR083 and AS208 (the lime-green circle) coincide exactly.
Reviewing a RasterFairy Grid of Duplicates
Let’s redraw the quilt showing the matches to see how well they match visually, and what influences their placement in the rasterfairy quilt.