It's funny to put the conclusions first, but let me do so, so you get an idea of what's happening. The data is open-source and easily obtained from the Harvard Khipu Database project. They have the khipu data in two forms - a limited set of 349 khipus in Excel and a bigger SQL database of about 600+ khipu. So the steps are as follows:
The Khipu Database Project has stored all the khipu measurements in two forms: Excel and SQL files. The SQL files are allmost.... ready for datamining. However, in an attempt to make this project portable (in a software sense, not a cloth one LOL) and to save hassles with a SQL server, etc. I am converting the SQL to CSV files. I note that Urton, et.al.. prefix the database tables with the Quechua word for warehouse collca (but a Spanish spelling unfortunately). The Quechua spelling is qollqa.
Since the khipu database/tables are so small (a total of 100Mb in SQL statements) I used an open-source mySQL, and TablePlus, a SQL GUI to:
############################################################################################################################################
# Load required libraries and intialize Jupyter notebook
############################################################################################################################################
# Khipu Imports
import khipu_kamayuq as kamayuq # A Khipu Maker is known (in Quechua) as a Khipu Kamayuq
import khipu_qollqa as kq
The Harvard Khipu database schema (description image by the Khipu Database Project) is shown below. The python classes reconstruct this schema.
As said previously, the SQL database tables and query results are stored as a CSV tables instead of as SQL CREATE statements. Key tables include khipu_main, cord and cordcluster. Tables that end with dc are code descriptors for symbolic codes in tables. For example the ascher_color_dc tells you that color MB -> translates to Medium Brown...
%ls ../../data/CSV/
We start by building a virginal object-oriented database (OODB) of khipus (essentially Python Classes). Building the initial Khipu OODB of about 620 khipus takes about 10 minutes.
#Clean files
import shutil
BUILD_FRESH_OODB = True
CSV_dir = kq.qollqa_data_directory()
if BUILD_FRESH_OODB:
shutil.copy(f"{CSV_dir}khipu_main.csv", f"{CSV_dir}khipu_main_clean.csv");
shutil.copy(f"{CSV_dir}primary_cord.csv", f"{CSV_dir}primary_cord_clean.csv");
shutil.copy(f"{CSV_dir}cord_cluster.csv", f"{CSV_dir}cord_cluster_clean.csv");
shutil.copy(f"{CSV_dir}cord.csv", f"{CSV_dir}cord_clean.csv");
shutil.copy(f"{CSV_dir}ascher_cord_color.csv", f"{CSV_dir}ascher_cord_color_clean.csv");
shutil.copy(f"{CSV_dir}knot_cluster.csv", f"{CSV_dir}knot_cluster_clean.csv");
shutil.copy(f"{CSV_dir}knot.csv", f"{CSV_dir}knot_clean.csv");
#Build a fresh version of the object oriented database (OODB) that is not "cleansed"
print("Building initial khipu OODB")
all_khipus = [aKhipu for aKhipu in kamayuq.fetch_all_khipus(clean_build=True).values()]
print(f"Done - built and fetched {len(all_khipus)} khipus")
Let's start by looking at the big picture - what khipus do we have to work with. What's the "quality" and "integrity" of the data. We've already had one khipu fail - Khipu ID 1000484, known as UR167 or B/3453A from the American Museum of Natural History.
I note that khipu_main.csv (or the equivalent SQL table) has two errors - one, an empty row without any information (khipu id 10000500) and one mislableled investigator name Ur189. I deleted the empty row by hand, and edited the name to UR189 using MS Excel, prior to starting the database loading...
khipu_main_df = pd.read_csv(f"{CSV_dir}khipu_main.csv")
khipu_main_df = kq.clean_column_names(khipu_main_df)
khipu_main_df.shape
khipu_main_df.columns
khipu_main_df.head()
So we have approximately 630 khipus to start with in the database. Already a few khipu have been culled due to data integrity issues. Most of the issues have to due with cords pointing to the wrong place - for example Pendant Cord 1 belonging to Khipu 1 having a subsidiary cord that is attached to Khipu 2...which in turn has a subsidiary cord attached to Khipu 1, which in turn....
Many of these fields are empty, or uninteresting to our data exploration, so let's build a smaller table:
uninteresting_khipu_columns = ['earliest_age', 'latest_age', 'date_discovered', 'discovered_by', 'complete',
'created_by', 'created_on', 'changed_by', 'changed_on', 'duplicate_flag', 'duplicate_id', 'archive_num']
khipu_df = khipu_main_df.drop(uninteresting_khipu_columns, axis=1)
khipu_df.museum_descr = khipu_df.museum_descr.fillna(value='')
khipu_df.nickname = khipu_df.nickname.fillna(value='')
khipu_df.provenance = khipu_df.provenance.fillna(value='')
khipu_df.provenance = np.where(khipu_df.provenance == 'unknown','Unknown', khipu_df.provenance)
khipu_df.provenance = np.where(khipu_df.provenance == '','Unknown', khipu_df.provenance)
khipu_df.region = khipu_df.region.fillna(value='')
khipu_df.region = np.where(khipu_df.region == 'unknown','Unknown', khipu_df.region)
khipu_df.region = np.where(khipu_df.region == '','Unknown', khipu_df.region)
khipu_df.conditionofkhipu = khipu_df.conditionofkhipu.fillna(value='')
khipu_df
Apparently some khipu are in fragmentary condition. Let's remove those for the purpose of this study. Also the orig_inv_num meaning the original author who described the khipu generally matches with the investigator_num Some Ascher descriptions are replaced by Urton descriptions, but on the whole most Aascher descriptions are honored and labeled as such.
fragmentary_khipus_df = khipu_df[khipu_df.conditionofkhipu == "Fragmentary"]
fragmentary_khipu_ids = list(fragmentary_khipus_df.khipu_id.values)
fragmentary_khipu_names = list(fragmentary_khipus_df.investigator_num.values)
print(f"\tfragmentary_khipu_ids: {fragmentary_khipu_ids}")
print(f"\tfragmentary_khipu_names: {fragmentary_khipu_names}")
khipu_df = khipu_df.drop(khipu_df[khipu_df.conditionofkhipu == "Fragmentary"].index)
khipu_df = khipu_df.drop(['conditionofkhipu'], axis=1)
display_dataframe(khipu_df)
We now have a clean khipu database with 623 khipus to investigate
Most?! khipus have a primary cord. Let's examine the primary cord database
primary_cord_df = pd.read_csv(f"{CSV_dir}primary_cord.csv")
primary_cord_df.shape
primary_cord_df.columns
primary_cord_df.head()
Once again, let's remove uninteresting columns
primary_cord_df = kq.clean_column_names(primary_cord_df)
primary_cord_df = primary_cord_df.drop(['created_by', 'created_date', 'changed_by', 'changed_date'], axis=1)
display_dataframe(primary_cord_df)
Two questions immediately are raised. Are there any primary cords that are not attached to a khipu? (In which case we should remove them). The notes for primary cords should be reviewed, as well.
Remove primary cords belonging to fragmentary khipus or to the null row...
print(f"Before: primary_cord_df.shape = {primary_cord_df.shape}")
errant_khipu_ids = list((set(primary_cord_df.khipu_id.values) - set(khipu_df.khipu_id.values)) - set(fragmentary_khipu_ids))
errant_khipu_names = khipu_main_df[khipu_main_df.khipu_id.isin(errant_khipu_ids)].investigator_num.values
print(f"Removing errant_khipu_ids {errant_khipu_ids}")
print(f"Removing errant_khipu_names {errant_khipu_names}")
khipu_ids = khipu_df.khipu_id.values
primary_cord_df = primary_cord_df[primary_cord_df.khipu_id.isin(khipu_ids)]
print(f"After: primary_cord_df.shape = {primary_cord_df.shape}")
primary_cord_khipu_ids = primary_cord_df.khipu_id.values
print(f"Before: khipu_df.shape = {khipu_df.shape}")
khipu_df = khipu_df[khipu_df.khipu_id.isin(primary_cord_khipu_ids)]
print(f"After: khipu_df.shape = {khipu_df.shape}")
And review primary cord notes:
notes_series = primary_cord_df.notes
notes_series = notes_series[notes_series.notnull()]
for note in notes_series: print(note)
A few khipus have no cords. Let's eliminate the zero cord khipus
zero_cord_khipu_ids = [aKhipu.khipu_id for aKhipu in all_khipus if aKhipu.num_pendant_cords()==0]
zero_cord_khipu_name = [kq.khipu_name_from_id(aKhipu.khipu_id) for anId in zero_cord_khipu_ids]
zero_cord_khipu_name.sort()
print(f"Removing zero_cord_khipu_name {zero_cord_khipu_name}")
print(f"Before: khipu_df.shape = {khipu_df.shape}, Zero cord ids: {len(zero_cord_khipu_ids)}")
khipu_df = khipu_df[~khipu_df.khipu_id.isin(zero_cord_khipu_ids)]
print(f"After: khipu_df.shape = {khipu_df.shape}")
Do the same for cords, cord clusters, and ascher_cord_colors
valid_khipu_ids = list(set(khipu_df.khipu_id.values) & set(kq.cord_cluster_df.khipu_id.values))
print(f"Before: cord_cluster_df.shape = {kq.cord_cluster_df.shape}")
cord_cluster_df = kq.cord_cluster_df[kq.cord_cluster_df.khipu_id.isin(valid_khipu_ids)]
print(f"After: cord_cluster_df.shape = {cord_cluster_df.shape}")
cord_df = pd.read_csv(f"{CSV_dir}cord.csv")
cord_df = kq.clean_column_names(cord_df)
cord_df = cord_df.drop(['created_by', 'created_on', 'changed_by', 'changed_on'], axis=1)
print(f"Before: cord_df.shape = {cord_df.shape}")
cord_df = cord_df[cord_df.khipu_id.isin(valid_khipu_ids)]
print(f"After: cord_df.shape = {cord_df.shape}")
ascher_cord_color_df = pd.read_csv(f"{CSV_dir}ascher_cord_color.csv")
ascher_cord_color_df = kq.clean_column_names(ascher_cord_color_df)
ascher_cord_color_df = ascher_cord_color_df.drop(['created_by', 'created_on', 'changed_by', 'changed_on'], axis=1)
# Ascher cord colors also point to primary cords (see pcord_flag)
print(f"Before: ascher_cord_color_df.shape = {ascher_cord_color_df.shape}")
valid_cord_color_ids = list(set(cord_df.cord_id.values) | set(primary_cord_df.pcord_id.values))
ascher_cord_color_df = ascher_cord_color_df[ascher_cord_color_df.cord_id.isin(valid_cord_color_ids)]
print(f"After: ascher_cord_color_df.shape = {ascher_cord_color_df.shape}")
# Many cords (1 in 6!) have NaN as their attached_to. What's up with that?
print(cord_df[np.isnan(cord_df.attached_to)].shape)
Some cords have missing parents...
has_cord_parents_mask = cord_df.pendant_from.isin(cord_df.cord_id.values)
has_pcord_parents_mask = cord_df.pendant_from.isin(primary_cord_df.pcord_id.values)
has_parents_mask = (has_cord_parents_mask | has_pcord_parents_mask)
num_orphan_cords = sum(~has_parents_mask)
print(f"# of cords missing parents = {num_orphan_cords}")
print(f"Before: cord_df.shape = {cord_df.shape}")
cord_df = cord_df[has_parents_mask]
print(f"After: cord_df.shape = {cord_df.shape}")
By comparing the pendant_from fields of cords versus the cord_id of clusters, I discovered that 44 khipu have cord clusters that point to cords that don't belong to the khipu.
For example, look at UR181/1000491 which has a cord cord_id=3052039 whose pendant_from 1000592 actually points to UR254/1000592
UR003 and UR149
Two of the khipus, UR003 and UR149 have excel files. On viewing the Excel files, I find that UR003 has 371 cords that have something in their fields, and a total of 146 cords that say nothing, while the database says it has 758 directly attached pendants, and 761 cord cluster pendants. UR149 says it has 256 to 265 cords, but the Excel spreadsheet says it has 272 cords. Clearly something's wrong.
As a data "safety measure", we could/should remove these khipus ....
def funky_check(aKhipuList):
funky_ids = [ ]
funky_names = [ ]
funky_cord_cluster_cords = [ ]
funky_attached_cords = [ ]
for aKhipu in aKhipuList:
#if aKhipu.is_funky_khipu():
if aKhipu.num_cc_cords() != aKhipu.num_attached_cords():
funky_ids.append(aKhipu.khipu_id)
funky_names.append(aKhipu.name())
funky_cord_cluster_cords.append(aKhipu.num_cc_cords())
funky_attached_cords.append(aKhipu.num_attached_cords())
the_funky_df = pd.DataFrame({"khipu_id": funky_ids, "name": funky_names,
"#cord_cluster_cords":funky_cord_cluster_cords, "#attached_cords":funky_attached_cords})
return the_funky_df
funky_df = funky_check(all_khipus)
funky_df.to_csv(f"{CSV_dir}funky_khipu.csv")
display_dataframe(funky_df)
print(f"Before funky removal: khipu_df.shape = {khipu_df.shape}")
khipu_df = khipu_df[~khipu_df.khipu_id.isin(funky_df.khipu_id.values)]
print(f"After funky removal: khipu_df.shape = {khipu_df.shape}")
First remove knot clusters and knots from previously eliminated khipu.
Then, a few khipus have no knots. Let's eliminate the zero knot khipus.
knot_cluster_df = pd.read_csv(f"{CSV_dir}knot_cluster.csv")
knot_cluster_df = kq.clean_column_names(knot_cluster_df)
knot_cluster_df = knot_cluster_df.drop(['created_by', 'created_on', 'changed_by', 'changed_on'], axis=1)
print(f"Before: knot_cluster_df.shape = {knot_cluster_df.shape}")
knot_cluster_df = knot_cluster_df[knot_cluster_df.cord_id.isin(cord_df.cord_id.values)]
print(f"After: knot_cluster_df.shape = {knot_cluster_df.shape}")
knot_df = pd.read_csv(f"{CSV_dir}knot.csv")
knot_df = kq.clean_column_names(knot_df)
knot_df = knot_df.drop(['created_by', 'created_on', 'changed_by', 'changed_on'], axis=1)
print(f"Before: knot_df.shape = {knot_df.shape}")
knot_df = knot_df[knot_df.cord_id.isin(cord_df.cord_id.values)]
print(f"After: knot_df.shape = {knot_df.shape}")
Then remove khipus that have cords with no knots.
cord_ids = list(knot_df.cord_id.unique())
knotty_khipu_ids = list(cord_df[cord_df.cord_id.isin(cord_ids)].khipu_id.unique())
zero_knot_khipu_ids = khipu_df[~ khipu_df.khipu_id.isin(knotty_khipu_ids)].khipu_id.values
zero_knot_khipu_names = khipu_df[khipu_df.khipu_id.isin(zero_knot_khipu_ids)].investigator_num.values
zero_knot_khipu_names.sort()
print(f"Removing zero_knot_khipu_names {zero_knot_khipu_names}")
print(f"Before: khipu_df.shape = {khipu_df.shape}")
khipu_df = khipu_df[khipu_df.khipu_id.isin(knotty_khipu_ids)]
print(f"After: khipu_df.shape = {khipu_df.shape}")
#Remove cords and cord_clusters that have no khipus associated with them as a result of all this deletion
cord_cluster_df = cord_cluster_df[cord_cluster_df.khipu_id.isin(khipu_df.khipu_id.values)]
cord_df = cord_df[cord_df.khipu_id.isin(khipu_df.khipu_id.values)]
Some knot_clusters fail integrity checks for num_knots field. A manual/code fix is to reset num_knots to correct length of knots in database table.
Knot Cluster 1000036 fails integrity check - num_knots: 8 != len(self._knots): 1
knot_cluster_id: 1000036, cord_id: 3000030, khipu_id: 1000002, khipu_name: UR020
Knot Cluster 1016227 fails integrity check - num_knots: 1 != len(self._knots): 4
knot_cluster_id: 1016227, cord_id: 3016537, khipu_id: 1000175, khipu_name: AS056
Knot Cluster 1017046 fails integrity check - num_knots: 2 != len(self._knots): 1
knot_cluster_id: 1017046, cord_id: 3017118, khipu_id: 1000185, khipu_name: AS014
Knot Cluster 1017096 fails integrity check - num_knots: 3 != len(self._knots): 1
knot_cluster_id: 1017096, cord_id: 3017132, khipu_id: 1000185, khipu_name: AS014
Knot Cluster 1022353 fails integrity check - num_knots: 1 != len(self._knots): 2
knot_cluster_id: 1022353, cord_id: 3021924, khipu_id: 1000275, khipu_name: UR089
Knot Cluster 1041878 fails integrity check - num_knots: 1 != len(self._knots): 0
knot_cluster_id: 1041878, cord_id: 3040011, khipu_id: 1000472, khipu_name: UR165
Knot Cluster 1041882 fails integrity check - num_knots: 5 != len(self._knots): 6
knot_cluster_id: 1041882, cord_id: 3039869, khipu_id: 1000472, khipu_name: UR165
Finally, we save the cleaned DataFrames (with one last integrity check) and rebuild the database. Building the final Khipu OODB takes about 10 minutes.
primary_cord_df = primary_cord_df[primary_cord_df.khipu_id.isin(khipu_df.khipu_id)]
cord_cluster_df = cord_cluster_df[cord_cluster_df.khipu_id.isin(khipu_df.khipu_id)]
cord_df = cord_df[cord_df.khipu_id.isin(khipu_df.khipu_id)]
ascher_cord_color_df = ascher_cord_color_df[ascher_cord_color_df.khipu_id.isin(khipu_df.khipu_id)]
knot_cluster_df = knot_cluster_df[knot_cluster_df.cord_id.isin(cord_df.cord_id)]
knot_df = knot_df[knot_df.cord_id.isin(cord_df.cord_id)]
# Refresh in-memory databases
kq.refresh_database()
if BUILD_FRESH_OODB:
khipu_df.to_csv(f"{CSV_dir}khipu_main_clean.csv")
primary_cord_df.to_csv(f"{CSV_dir}primary_cord_clean.csv")
cord_cluster_df.to_csv(f"{CSV_dir}cord_cluster_clean.csv")
cord_df.to_csv(f"{CSV_dir}cord_clean.csv")
ascher_cord_color_df.to_csv(f"{CSV_dir}ascher_cord_color_clean.csv")
knot_cluster_df.to_csv(f"{CSV_dir}knot_cluster_clean.csv")
knot_df.to_csv(f"{CSV_dir}knot_clean.csv")
print("Building final khipu object-oriented database")
all_khipus = [aKhipu for aKhipu in kamayuq.fetch_all_khipus(clean_build=BUILD_FRESH_OODB).values()]
print(f"Done - processed {len(all_khipus)} khipus")
# Another Integrity check...
all_khipus = [aKhipu for aKhipu in kamayuq.fetch_all_khipus().values()]
funky_khipu_df = funky_check(all_khipus)
display_dataframe(funky_khipu_df)
A review of which khipus were removed, and why.
deleted_khipus_df = pd.read_csv(f"{CSV_dir}deleted_khipus.csv")
# Integrity check
original_khipu_df = pd.read_csv(f"{CSV_dir}khipu_main.csv")
original_khipu_ids = set(original_khipu_df['KHIPU_ID'].values)
final_khipu_ids = set(kq.khipu_df.khipu_id.values)
removed_khipu_ids = sorted(list(set(list(original_khipu_ids - final_khipu_ids))))
if len(removed_khipu_ids) != deleted_khipus_df.shape[0]:
print("Need to update deleted_khipus.csv")
print(removed_khipu_ids)
print(f"Removed {len(removed_khipu_ids)} khipus due to lack of cords, knots, or integrity check failures")
deleted_khipus_df["Reason_For_Removal"].value_counts()
removed_khipu_names = list(original_khipu_df[original_khipu_df.KHIPU_ID.isin(removed_khipu_ids)].INVESTIGATOR_NUM.values)
removed_urton_khipus = [aName for aName in removed_khipu_names if aName.startswith("UR")]
print(f"Removed {len(removed_urton_khipus)} Urton Khipus, two (UR003, and UR149) due to data integrity check failure:")
print(removed_urton_khipus)
print("\nRemoved following Database Khipus that have associated Excel files:")
from pathlib import Path
path = Path(CSV_dir)
XLS_dir = f"{path.parent}XLS/"
excel_files_removed = 0
for aName in removed_khipu_names:
excel_name = XLS_dir + aName+ ".xls"
if os.path.exists(excel_name):
excel_files_removed += 1
print(f"\t{aName}.xls")
print(f"{excel_files_removed} excel files removed")
122 khipus were deleted. A log of these files, their names, and their reasons for being deleted is kept in the CSV folder under deleted_khipus.csv. As you can see above, the code checks to see if the log needs to be updated.
83 khipus were deleted due to data integrity checks failing. These checks often fail on a consecutive series of 4 or 5 khipus.
38 khipus exhibited zen-like existence with unknotted knots or uncorded cords... (0 knots or 0 cords)
53 Urton Khipus were removed (mostly due to being fragmentary, or zero cords or zero knots)
10 of the Khipus removed has associated Excel files. Two (UR003, UR149) due to integrity check failures with cords and cord clusters.
5 Khipus had knot clusters with bad num-knots counts: AS014, AS056, UR020, UR089, and UR165.
These were not deleted, since the object-oriented Python classes can fix this at creation time.
Some knots had Nan (not a number) or other missing information for type_code (i.e. single, figure-8, long knot, etc.) and num_turns.
These empty fields now default to 'U' for empty type code, and 0 for num_turns of the knot.
Nudo desanudado (the untied knot ) - from an actual cord note in the database
Have we understood the zen koan yet?