Evaluation Metrics¶
Many evaluation metrics are available for recommendation systems, and each has its own pros and cons.
CAPRI supports two types of evaluation metrics: Accuracy and Beyond-Accuracy.
On this page, we discuss both types in detail.
Accuracy Metrics¶
Precision@K¶
Precision at k is the fraction of recommended items in the top-k set that are relevant. For example, if precision at 10 is 75% in a top-10 recommendation problem, then 75% of the suggestions offered are relevant to the user.
def precisionk(actual: list, recommended: list):
    """
    Computes the number of relevant results among the top k recommended items
    Parameters
    ----------
    actual: list
        A list of ground truth items (all possible relevant items)
        example: [X, Y, Z]
    recommended: list
        A list of items recommended by the system
        example: [x, y, z]
    Returns
    ----------
    precision at k
    """
    relevantResults = set(actual) & set(recommended)
    assert 0 <= len(relevantResults), f"The number of relevant results must be non-negative (currently: {len(relevantResults)})"
    return 1.0 * len(relevantResults) / len(recommended)
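As a quick illustration, here is a minimal usage sketch with made-up item IDs:

# Hypothetical data: two of the five recommended items appear in the ground truth
actual = ['A', 'B', 'C']
recommended = ['A', 'X', 'B', 'Y', 'Z']
print(precisionk(actual, recommended))  # 2 / 5 = 0.4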
Recall@K¶
Recall at k is the proportion of all relevant items that appear in the top-k recommendations. For example, if recall at 10 is 55% in a top-10 recommendation system, then the top-10 results contain 55% of all relevant items.
def recallk(actual: list, recommended: list):
    """
    The number of relevant results among the top k recommended items divided by the total number of relevant items
    Parameters
    ----------
    actual: list
        A list of ground truth items (all possible relevant items)
        example: [X, Y, Z]
    recommended: list
        A list of items recommended by the system
        example: [x, y, z]
    Returns
    ----------
    recall at k
    """
    relevantResults = set(actual) & set(recommended)
    assert 0 <= len(relevantResults), f"The number of relevant results must be non-negative (currently: {len(relevantResults)})"
    return 1.0 * len(relevantResults) / len(actual)
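With the same made-up data, recall looks at the share of the ground truth that was retrieved:

# Hypothetical data: two of the three relevant items were recommended
actual = ['A', 'B', 'C']
recommended = ['A', 'X', 'B', 'Y', 'Z']
print(recallk(actual, recommended))  # 2 / 3 ≈ 0.667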
MAP@K¶
MAP@k stands for Mean Average Precision at cut-off k. It is most commonly employed in recommendation systems, although it can also be used in other ranking tasks. Using MAP to evaluate a recommender treats the recommendation task as a ranking problem: the earlier a relevant item appears in the list, the more it contributes to the score.
def mapk(actual: list, predicted: list, k: int = 10):
    """
    Computes mean Average Precision at k (mAPk) for a set of recommended items
    Parameters
    ----------
    actual: list
        A list of ground truth items (order doesn't matter)
        example: [X, Y, Z]
    predicted: list
        A list of predicted items, recommended by the system (order matters)
        example: [x, y, z]
    k: integer, optional (default to 10)
        The number of elements of predicted to consider in the calculation
    Returns
    ----------
    score:
        The mean Average Precision at k (mAPk)
    """
    # Only the top-k predictions are considered
    if len(predicted) > k:
        predicted = predicted[:k]
    score = 0.0
    numberOfHits = 0.0
    for i, p in enumerate(predicted):
        # Count each relevant item only once (ignore duplicates in the ranking)
        if p in actual and p not in predicted[:i]:
            numberOfHits += 1.0
            score += numberOfHits / (i + 1.0)
    if not actual:
        return 0.0
    score = score / min(len(actual), k)
    return score
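A small worked example with invented items (relevant items recommended earlier score higher):

# Hypothetical data: relevant items 'A' and 'B' are ranked 1st and 3rd
actual = ['A', 'B', 'C']
predicted = ['A', 'X', 'B', 'Y', 'Z']
print(mapk(actual, predicted, k=5))  # (1/1 + 2/3) / 3 ≈ 0.556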
NDCG@K¶
The Discounted Cumulative Gain (DCG) for the k displayed recommendations sums up the relevance of the items shown to the current user (cumulative), while penalizing relevant items that appear in later positions (discounted). The Normalized Discounted Cumulative Gain (NDCG) for the k given suggestions divides this score by the maximum possible DCG@k for the current user.
import numpy as np

def dcg(scores: list):
    """
    Computes the Discounted Cumulative Gain (DCG) for a list of scores
    Parameters
    ----------
    scores: list
        A list (or array) of relevance scores, ordered by rank position
    Returns
    ----------
    dcg: float
        The Discounted Cumulative Gain (DCG)
    """
    # Ensure we work with a numpy array of floats
    scores = np.asarray(scores, dtype=np.float32)
    return np.sum(np.divide(np.power(2, scores) - 1,
                            np.log(np.arange(scores.shape[0], dtype=np.float32) + 2)),
                  dtype=np.float32)
def ndcgk(actual: list, predicted: list, relevance=None, at=None):
    """
    Calculates the implicit version of Normalized Discounted Cumulative Gain (NDCG) for top k items in the ranked output
    Parameters
    ----------
    actual: list
        A list of ground truth items
        example: [X, Y, Z]
    predicted: list
        A list of predicted items, recommended by the system
        example: [x, y, z]
    relevance: list, optional (default to None)
        A list of relevance scores for the items in the ground truth
    at: any, optional (default to None)
        The number of items to consider in the calculation
    Returns
    ----------
    ndcg:
        Normalized DCG score
    Metric Definition
    ----------
    Jarvelin, K., & Kekalainen, J. (2002). Cumulated gain-based evaluation of IR techniques.
    ACM Transactions on Information Systems (TOIS), 20(4), 422-446.
    """
    # Convert list to numpy array
    actual, predicted = np.asarray(list(actual)), np.asarray(list(predicted))
    # Check the relevance value
    if relevance is None:
        relevance = np.ones_like(actual)
    assert len(relevance) == actual.shape[0]
    # Creating a dictionary associating itemId to its relevance
    item2rel = {it: r for it, r in zip(actual, relevance)}
    # Creates array of length "at" with the relevance associated to the item in that position
    rankScores = np.asarray([item2rel.get(it, 0.0)
                             for it in predicted[:at]], dtype=np.float32)
    # IDCG has all relevances to 1, up to the number of items in the test set
    idcg = dcg(np.sort(relevance)[::-1])
    # Calculating rank-DCG, DCG uses the relevance of the recommended items
    rdcg = dcg(rankScores)
    if rdcg == 0.0:
        return 0.0
    # Return the normalized score, rounded to 4 decimal places
    return round(rdcg / idcg, 4)
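A minimal sketch with invented integer item IDs, using the dcg helper defined above:

# Hypothetical data: relevant items 1 and 2 are ranked 1st and 3rd, item 3 is missed
actual = [1, 2, 3]
predicted = [1, 9, 2]
print(ndcgk(actual, predicted))  # ≈ 0.70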
Beyond-accuracy Metrics¶
Beyond-accuracy metrics evaluate recommender systems on dimensions other than prediction accuracy, such as diversity, novelty, coverage, and personalization.
List Diversity¶
List diversity measures how dissimilar the items within a single recommendation list are. It is computed as one minus the average pairwise similarity between the recommended items, so higher values indicate more diverse lists.
import itertools

def listDiversity(predicted: list, itemsSimilarityMatrix):
    """
    Computes the diversity for a list of recommended items for a user
    Parameters
    ----------
    predicted: list
        A list of predicted numeric/character vectors of retrieved documents for the corresponding element of actual
        example: ['X', 'Y', 'Z']
    itemsSimilarityMatrix:
        A matrix of pairwise item similarities, indexed by item id
    Returns
    ----------
    diversity
    """
    pairCount = 0
    similarity = 0
    # Iterate over every unordered pair of recommended items
    pairs = itertools.combinations(predicted, 2)
    for pair in pairs:
        itemID1 = pair[0]
        itemID2 = pair[1]
        similarity += itemsSimilarityMatrix[itemID1, itemID2]
        pairCount += 1
    # Diversity is one minus the average pairwise similarity
    averageSimilarity = similarity / pairCount
    diversity = 1 - averageSimilarity
    return diversity
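A minimal sketch, assuming itemsSimilarityMatrix is a NumPy array indexed by integer item IDs (the similarity values are invented):

import numpy as np

# Hypothetical 3x3 item-item similarity matrix for item IDs 0, 1, and 2
itemsSimilarityMatrix = np.array([[1.0, 0.2, 0.4],
                                  [0.2, 1.0, 0.1],
                                  [0.4, 0.1, 1.0]])
predicted = [0, 1, 2]
# Average pairwise similarity = (0.2 + 0.4 + 0.1) / 3 ≈ 0.233, so diversity ≈ 0.767
print(listDiversity(predicted, itemsSimilarityMatrix))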
Novelty¶
Novelty captures how unexpected the recommended items are: items that few users have interacted with in the training data carry more self-information and therefore increase the novelty score.
import numpy as np

def novelty(predicted: list, pop: dict, u: int, k: int):
    """
    Computes the novelty for a list of recommended items for a user
    Parameters
    ----------
    predicted: a list of recommended items
        Ordered predictions
        example: ['X', 'Y', 'Z']
    pop: dictionary
        A dictionary of all items alongside their occurrence counts in the training data
        example: {1198: 893, 1270: 876, 593: 876, 2762: 867}
    u: integer
        The number of users in the training data
    k: integer
        The length of recommended lists per user
    Returns
    ----------
    novelty:
        The novelty of the recommendations at the system level
    Metric Definition
    ----------
    Zhou, T., Kuscsik, Z., Liu, J. G., Medo, M., Wakeling, J. R., & Zhang, Y. C. (2010).
    Solving the apparent diversity-accuracy dilemma of recommender systems.
    Proceedings of the National Academy of Sciences, 107(10), 4511-4515.
    """
    selfInformation = 0
    for item in predicted:
        if item in pop.keys():
            # Self-information of an item is -log2 of its popularity (share of users who interacted with it)
            itemPopularity = pop[item] / u
            itemNoveltyValue = np.sum(-np.log2(itemPopularity))
        else:
            itemNoveltyValue = 0
        selfInformation += itemNoveltyValue
    noveltyScore = selfInformation / k
    return noveltyScore
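A minimal sketch with invented popularity counts:

# Hypothetical training statistics: 1000 users and per-item popularity counts
pop = {1198: 893, 1270: 876, 42: 15}
predicted = [1198, 1270, 7]  # item 7 never appears in the training data, so it contributes 0
print(novelty(predicted, pop, u=1000, k=10))  # (-log2(0.893) - log2(0.876)) / 10 ≈ 0.035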
Catalog Coverage¶
Catalog coverage measures the percentage of all catalog items that the model actually recommends across the evaluated recommendation lists.
from typing import List

def catalogCoverage(predicted: List[list], catalog: set):
    """
    Computes the catalog coverage for k lists of recommendations
    Coverage is the percent of items in the training data the model is able to recommend on a test set
    Parameters
    ----------
    predicted: a list of lists
        Ordered predictions
        example: [['X', 'Y', 'Z'], ['X', 'Y', 'Z']]
    catalog: set
        A set of all unique items in the training data
        example: {'A', 'B', 'C', 'X', 'Y', 'Z'}
    Returns
    ----------
    catalogCoverage:
        The catalog coverage of the recommendations as a percent
        rounded to 2 decimal places
    Metric Definition
    -----------------
    Ge, M., Delgado-Battenfeld, C., & Jannach, D. (2010, September).
    Beyond accuracy: evaluating recommender systems by coverage and serendipity.
    In Proceedings of the fourth ACM conference on Recommender systems (pp. 257-260). ACM.
    """
    # Flatten all recommendation lists and count the distinct recommended items
    predictedFlattened = [p for sublist in predicted for p in sublist]
    LPredictions = len(set(predictedFlattened))
    # Coverage is the share of catalog items recommended at least once, as a percentage
    catalogCoverage = round(LPredictions / (len(catalog) * 1.0) * 100, 2)
    return catalogCoverage
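A minimal sketch with two invented recommendation lists and a six-item catalog:

# Hypothetical recommendations for two users over a six-item catalog
predicted = [['X', 'Y', 'Z'], ['X', 'Y', 'A']]
catalog = {'A', 'B', 'C', 'X', 'Y', 'Z'}
print(catalogCoverage(predicted, catalog))  # 4 distinct recommended items / 6 = 66.67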
Personalization¶
Personalization measures how different the recommendation lists are across users. It is computed as one minus the average pairwise cosine similarity between users' recommendation lists, so higher values indicate more personalized recommendations.
from typing import List
import numpy as np
import pandas as pd
import scipy.sparse as sp
from sklearn.metrics.pairwise import cosine_similarity

def personalization(predicted: List[list]):
    """
    Personalization measures recommendation similarity across users.
    A high score indicates good personalization (user's lists of recommendations are different).
    A low score indicates poor personalization (user's lists of recommendations are very similar).
    A model is "personalizing" well if the set of recommendations for each user is different.
    Parameters
    ----------
    predicted: a list of lists
        Ordered predictions
        example: [['X', 'Y', 'Z'], ['X', 'Y', 'Z']]
    Returns
    -------
    The personalization score for all recommendations.
    """
    def makeRecMatrix(predicted: List[list]):
        # Build a binary user x item matrix: 1 if the item was recommended to the user
        df = pd.DataFrame(data=predicted).reset_index().melt(
            id_vars='index', value_name='item',
        )
        df = df[['index', 'item']].pivot(
            index='index', columns='item', values='item')
        df = pd.notna(df) * 1
        recMatrix = sp.csr_matrix(df.values)
        return recMatrix
    # Create matrix for recommendations
    predicted = np.array(predicted)
    recMatrixSparse = makeRecMatrix(predicted)
    # Calculate cosine similarity between every pair of users' recommendation lists
    similarity = cosine_similarity(X=recMatrixSparse, dense_output=False)
    # Get indices for the upper triangle, excluding the diagonal
    upperRight = np.triu_indices(similarity.shape[0], k=1)
    # Personalization is one minus the average pairwise similarity
    personalization = np.mean(similarity[upperRight])
    return 1 - personalization
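A minimal sketch with two invented users whose lists share one item:

# Hypothetical recommendations: the two users share one of their three items
predicted = [['X', 'Y', 'Z'], ['X', 'A', 'B']]
print(personalization(predicted))  # average cosine similarity is 1/3, so the score is ≈ 0.667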