## What is this about?
Imagine you want to compare various iterations of a table or data file. You may need to ensure that your versions of a
list that is maintained by more than one person or system are same. When you have to solve this manually, it is
difficult. So, let's make use of this library right now!
## How does it work?
The script takes an entry's ID value from the first table, looks for it in the second table and sees if it can find it
there. If it doesn't, it will list this entry as an entry that was found in table 1, but not in table 2. This will
happen vice versa for table 2 against table 1. Moreover, the script compares values of entries with the
same ID, and if they are different, the script lists these differences.
## What does it require?
The table entries MUST have a unique identifier, or key value. For example, this could be an order number, or an
employee number - but it could also be other characters, as long as it is unique to an entry. This key value MUST be
located in the first column of your table. The orders of the other columns, as well as the orders of the rows in
general, is completely irrelevant.You have a use case, too, we hope this library will save you some time!
Version Release Beta
==========
0.0.1 (29/06/2022)
-------------------
pip install pandas
pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.
pip install numpy
NumPy is a Python library used for working with arrays. It also has functions for working in domain of linear algebra,
fourier transform, and matrices.
import pandas as pd
import numpy as np
'''//
DESCRIPTION:
The comparison requires comparing table1 against table2 and vice versa in order to find values that are in one table,
but not the other,therefore, the files are imported based on current comparison mode, and in mode 2to1, their order
is switched to avoid differences being identified based on different data formats: convert numbers to equal format/
"syntax", then to string
///'''
def toString(x):
# print('from ', x, type(x))
if type(x) == str:
return x
if type(x) == np.float64 or type(x) == np.int64:
x.tolist()
if x == None:
x = 'NaN'
x = str(float(x))
if x[-1] == '0':
x = x[0:-2] # removing .0 from the float
# print('to ', x, type(x))
return str(x)
'''///----Comparison funct----///'''
def runComparison(findDifferingValues, fileName1, fileName2, df1, df2):
df_comparison = pd.DataFrame() # tables have to be initalized because they are later referred to for concatenation
df_entrynotFound = pd.DataFrame()
# in both tables, the first column must be the one that contains the entries' ID/key
keyCol = df2.iloc[:, 0] # 1st column in the table we're comparing to
keyColArray = list(map(toString, keyCol.values)) # converting entries to string for comparison
# print('to str: ', test, pd.Series(keyColArray))
keyCol = pd.Series(keyColArray)
'''//---compare rows funct---//'''
def compareRows(rowA):
key = toString(rowA[0]) # saving the key value from 1st table so I can get corresponding value from 2nd table
deviations = pd.DataFrame() # here's where I'll store the info on deviating entries
# now going through every col entry in rowA
for category, value in rowA.items():
if category not in df2.columns: # if category(column) does not exist in table 2, ignore & break
break
valueBList = pd.Series(df2[category].where(
keyCol == key).values) # identifiying value for same ID/col in table2 > array with comparison resut for each row
valueB = list(filter(lambda x: pd.isna(x) == False,
valueBList)) # from results array, filtering out only those values that are not 'NaN'
if (pd.isna(valueBList.values)).all(): # if the result contains only NaN values > cell is empty:
if (pd.isna(value) or value == float(
'nan')): # see if corresponding value in rowA is NaN, too, in which case don't list as difference
break
else: # if corresponding value in rowA is not NaN, redefine valueB so it's not an empty array
valueB = ['#empty!']
value = toString(value)
valueB[0] = toString(valueB[0])
# print(value, valueB[0])
if value != valueB[
0]: # now actually comparing value of key/colName in table1 vs same key/colName in table2
# print('deviating value: ', valueB[0])
# print(f'Difference found in entry with ID {key}, in category: {category}') #optional: log the info
newRow = [[key, category, value, str(valueB[0])]]
# print(newRow) #activate when debugging
add = pd.DataFrame(newRow, columns=[('Entry ID in ' + fileName1), 'Difference found in',
('Value in ' + fileName1), ('Value in ' + fileName2)])
deviations = pd.concat([deviations, add])
return deviations
#--------------------------------end of compare rows funct----------------------------------//'''
for currentRow, values in df1.iterrows(): # going through each row in table 1
rowDf1 = pd.Series(df1.iloc[currentRow]) # grab values from current row of table 1, save as Series
key = toString(rowDf1.values[0]) # set entry in 1st column as key to look for in table 2
'''//now, look for this key in table 2. If it is not found, add to "not found" sheet for final results file//'''
if df2.loc[keyCol == key].empty:
newRow = [[key, rowDf1.index[1], rowDf1.values[1]]]
result = pd.DataFrame(newRow, columns=[('Key not found in ' + fileName2), ('name next col ' + fileName1),
('value next col ' + fileName1)])
df_entrynotFound = pd.concat([df_entrynotFound, result]) # add result to compare df
# print(f'ID no. {key} not found in table 2') #optional: log the info
'''//if it is found: identify differences and store them in table for for final results file, except for
comparison mode '2to1' where differences have already been found + listed and we are only interested
in entries that are in table2 but not table1 //'''
if df2.loc[keyCol == key].empty == False and findDifferingValues == True:
result = compareRows(
rowDf1) # pass rowDf1 to compare funct to compare it to corresponding row in table2
df_comparison = pd.concat([df_comparison, result])
# print(result)
return [df_comparison, df_entrynotFound]
'''//-----End of comparison funct-------//'''
'''//save results as CSV files, as well as different sheets in same Excel file.
Use excel format and writer for that. Index not included in files//'''
def saveToFile(data, name, writer):
data.to_excel(writer, sheet_name=str(name), index=False)
#data.to_csv(str(name), index=False)
----------
Usage
-------------------
Install the published library:
pip install comp-df-syren==0.2.0