======================
Evaluation Metrics
======================

Many evaluation metrics are available for recommendation systems, and each has its own pros and cons. ``CAPRI`` supports two types of evaluation metrics: **Accuracy** and **Beyond Accuracy**. This page discusses both types in detail.

Accuracy Metrics
--------------------

Precision@K
~~~~~~~~~~~~~~~~

Precision at k is the fraction of recommended items in the top-k set that are relevant. For example, if precision at 10 is 75% in a top-10 recommendation problem, then 75% of the items suggested to the user are relevant to that user.

::

    def precisionk(actual: list, recommended: list):
        """
        Computes the number of relevant results among the top k recommended items

        Parameters
        ----------
        actual: list
            A list of ground truth items (all possible relevant items)
            example: [X, Y, Z]
        recommended: list
            A list of items recommended by the system
            example: [x, y, z]

        Returns
        ----------
        precision at k
        """
        assert len(recommended) > 0, "The list of recommended items should not be empty"
        # Relevant results are the recommended items that appear in the ground truth
        relevantResults = set(actual) & set(recommended)
        return 1.0 * len(relevantResults) / len(recommended)

Recall@K
~~~~~~~~~~~~~~~~

Recall at k is the proportion of all relevant items that appear in the top-k recommendations. For example, if recall at 10 is 55% in a top-10 recommendation system, then the top-k results contain 55% of all relevant items.

::

    def recallk(actual: list, recommended: list):
        """
        The number of relevant results among the top k recommended items divided by
        the total number of relevant items

        Parameters
        ----------
        actual: list
            A list of ground truth items (all possible relevant items)
            example: [X, Y, Z]
        recommended: list
            A list of items recommended by the system
            example: [x, y, z]

        Returns
        ----------
        recall at k
        """
        assert len(actual) > 0, "The list of ground truth items should not be empty"
        # Relevant results are the recommended items that appear in the ground truth
        relevantResults = set(actual) & set(recommended)
        return 1.0 * len(relevantResults) / len(actual)

MAP@K
~~~~~~~~~~~~~~~~

MAP@k stands for Mean Average Precision at cut-off k. It is most commonly employed in recommendation systems, although it can also be used for other ranking tasks. Evaluating a recommender with MAP treats recommendation as a ranking problem: relevant items that appear earlier in the list contribute more to the score.

::

    def mapk(actual: list, predicted: list, k: int = 10):
        """
        Computes mean Average Precision at k (mAPk) for a set of recommended items

        Parameters
        ----------
        actual: list
            A list of ground truth items (order doesn't matter)
            example: [X, Y, Z]
        predicted: list
            A list of predicted items, recommended by the system (order matters)
            example: [x, y, z]
        k: integer, optional (default to 10)
            The number of elements of predicted to consider in the calculation

        Returns
        ----------
        score: The mean Average Precision at k (mAPk)
        """
        if not actual:
            return 0.0
        # Only the top-k predictions are considered
        predicted = predicted[:k]
        score = 0.0
        numberOfHits = 0.0
        for i, p in enumerate(predicted):
            # Count each relevant item only once, at its first (highest) position
            if p in actual and p not in predicted[:i]:
                numberOfHits += 1.0
                score += numberOfHits / (i + 1.0)
        score = score / min(len(actual), k)
        return score

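As a quick illustration of the three metrics above, the hypothetical snippet below evaluates a single top-5 recommendation list against a user's ground-truth items. The item ids are made up, and ``precisionk``, ``recallk``, and ``mapk`` are assumed to be available as defined in the code blocks above.

::

    # Hypothetical ground truth and a top-5 recommendation list for one user
    actual = ['A', 'B', 'C', 'D']             # relevant items
    recommended = ['B', 'X', 'A', 'Y', 'Z']   # ranked recommendations

    # Precision@5: 2 of the 5 recommended items are relevant -> 0.4
    print(precisionk(actual, recommended))

    # Recall@5: 2 of the 4 relevant items were recommended -> 0.5
    print(recallk(actual, recommended))

    # MAP@5: rewards relevant items that appear earlier in the ranking
    print(mapk(actual, recommended, k=5))
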
NDCG@K
~~~~~~~~~~~~~~~~

The Discounted Cumulative Gain (DCG) of the top-k displayed recommendations sums up the relevance of the items displayed to the current user (cumulative), while penalizing relevant items that appear in later positions (discounted). The Normalized Discounted Cumulative Gain (NDCG@k) divides this score by the maximum possible value of DCG@k for the current user, i.e. the DCG of an ideally ordered list.

::

    import numpy as np


    def dcg(scores: np.ndarray):
        """
        Computes the Discounted Cumulative Gain (DCG) for an array of relevance scores

        Parameters
        ----------
        scores: np.ndarray
            An array of relevance scores, ordered by ranking position

        Returns
        ----------
        dcg: float
            The Discounted Cumulative Gain (DCG)
        """
        return np.sum(np.divide(np.power(2, scores) - 1,
                                np.log(np.arange(scores.shape[0], dtype=np.float32) + 2)),
                      dtype=np.float32)


    def ndcgk(actual: list, predicted: list, relevance=None, at=None):
        """
        Calculates the implicit version of Normalized Discounted Cumulative Gain (NDCG)
        for the top k items in the ranked output

        Parameters
        ----------
        actual: list
            A list of ground truth items
            example: [X, Y, Z]
        predicted: list
            A list of predicted items, recommended by the system
            example: [x, y, z]
        relevance: list, optional (default to None)
            A list of relevance scores for the items in the ground truth
        at: int, optional (default to None)
            The number of items to consider in the calculation

        Returns
        ----------
        ndcg: Normalized DCG score

        Metric Definition
        ----------
        Järvelin, K., & Kekäläinen, J. (2002). Cumulated gain-based evaluation of IR
        techniques. ACM Transactions on Information Systems (TOIS), 20(4), 422-446.
        """
        # Convert lists to numpy arrays
        actual, predicted = np.asarray(list(actual)), np.asarray(list(predicted))
        # If no explicit relevance is given, treat every ground-truth item as equally relevant
        if relevance is None:
            relevance = np.ones(actual.shape[0], dtype=np.float32)
        assert len(relevance) == actual.shape[0]
        # Create a dictionary associating each itemId with its relevance
        item2rel = {it: r for it, r in zip(actual, relevance)}
        # Create an array of length "at" with the relevance associated to the item in that position
        rankScores = np.asarray([item2rel.get(it, 0.0) for it in predicted[:at]], dtype=np.float32)
        # IDCG is the DCG of the ground-truth relevances sorted in descending order (ideal ranking)
        idcg = dcg(np.sort(relevance)[::-1])
        # Rank-DCG uses the relevance of the recommended items in their predicted order
        rdcg = dcg(rankScores)
        if rdcg == 0.0:
            return 0.0
        return round(rdcg / idcg, 4)

Beyond-accuracy Metrics
--------------------------------------

Beyond-accuracy metrics evaluate recommender systems on qualities other than prediction accuracy, such as diversity, novelty, coverage, and serendipity.

List Diversity
~~~~~~~~~~~~~~~~

List diversity measures how dissimilar the items within a single recommendation list are. It is computed as one minus the average pairwise similarity of the recommended items.

::

    import itertools


    def listDiversity(predicted: list, itemsSimilarityMatrix):
        """
        Computes the diversity for a list of recommended items for a user

        Parameters
        ----------
        predicted: list
            A list of items recommended to the user
            example: ['X', 'Y', 'Z']
        itemsSimilarityMatrix:
            A matrix of pairwise item similarities, indexed by item id

        Returns
        ----------
        diversity
        """
        pairCount = 0
        similarity = 0
        # Iterate over every unordered pair of recommended items
        pairs = itertools.combinations(predicted, 2)
        for pair in pairs:
            itemID1 = pair[0]
            itemID2 = pair[1]
            similarity += itemsSimilarityMatrix[itemID1, itemID2]
            pairCount += 1
        # Diversity is the complement of the average pairwise similarity
        averageSimilarity = similarity / pairCount
        diversity = 1 - averageSimilarity
        return diversity

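The sketch below shows how ``ndcgk`` and ``listDiversity`` from the code blocks above might be called for one user. The integer item ids and the item-item similarity matrix are invented purely for illustration.

::

    import numpy as np

    actual = [0, 2, 3]          # ground-truth relevant item ids
    predicted = [2, 1, 0, 4]    # ranked recommendation list

    # NDCG@3 with implicit (all-ones) relevance for the ground-truth items
    print(ndcgk(actual, predicted, at=3))

    # A toy symmetric item-item similarity matrix with values in [0, 1]
    itemsSimilarityMatrix = np.array([
        [1.0, 0.2, 0.5, 0.1, 0.3],
        [0.2, 1.0, 0.4, 0.6, 0.2],
        [0.5, 0.4, 1.0, 0.3, 0.1],
        [0.1, 0.6, 0.3, 1.0, 0.5],
        [0.3, 0.2, 0.1, 0.5, 1.0],
    ])

    # Diversity = 1 - average pairwise similarity of the recommended items
    print(listDiversity(predicted, itemsSimilarityMatrix))
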
Novelty
~~~~~~~~~~~~~~~~

Novelty measures how unfamiliar the recommended items are likely to be to users: items that occur frequently in the training data carry little self-information and are therefore considered less novel.

::

    import numpy as np


    def novelty(predicted: list, pop: dict, u: int, k: int):
        """
        Computes the novelty for a list of recommended items for a user

        Parameters
        ----------
        predicted: list
            A list of recommended items (ordered predictions)
            example: ['X', 'Y', 'Z']
        pop: dictionary
            A dictionary mapping each item to its occurrence count in the training data
            example: {1198: 893, 1270: 876, 593: 876, 2762: 867}
        u: integer
            The number of users in the training data
        k: integer
            The length of recommended lists per user

        Returns
        ----------
        novelty: The novelty of the recommendations at system level

        Metric Definition
        ----------
        Zhou, T., Kuscsik, Z., Liu, J. G., Medo, M., Wakeling, J. R., & Zhang, Y. C. (2010).
        Solving the apparent diversity-accuracy dilemma of recommender systems.
        Proceedings of the National Academy of Sciences, 107(10), 4511-4515.
        """
        selfInformation = 0
        for item in predicted:
            if item in pop.keys():
                # Self-information of an item is -log2 of its relative popularity
                itemPopularity = pop[item] / u
                itemNoveltyValue = -np.log2(itemPopularity)
            else:
                # Items never seen in the training data contribute no score
                itemNoveltyValue = 0
            selfInformation += itemNoveltyValue
        # Average the self-information over the length of the recommendation list
        noveltyScore = selfInformation / k
        return noveltyScore

Catalog Coverage
~~~~~~~~~~~~~~~~

Catalog coverage measures the share of the item catalog that the model is able to recommend: it is the percentage of all unique items in the training data that appear in at least one of the generated recommendation lists.

::

    from typing import List


    def catalogCoverage(predicted: List[list], catalog: set):
        """
        Computes the catalog coverage for k lists of recommendations.
        Coverage is the percent of items in the training data the model is able to
        recommend on a test set.

        Parameters
        ----------
        predicted: a list of lists
            Ordered predictions
            example: [['X', 'Y', 'Z'], ['X', 'Y', 'Z']]
        catalog: set
            A set of all unique items in the training data
            example: {'A', 'B', 'C', 'X', 'Y', 'Z'}

        Returns
        ----------
        catalogCoverage: The catalog coverage of the recommendations as a percent,
        rounded to 2 decimal places

        Metric Definition
        -----------------
        Ge, M., Delgado-Battenfeld, C., & Jannach, D. (2010, September). Beyond accuracy:
        evaluating recommender systems by coverage and serendipity. In Proceedings of the
        fourth ACM conference on Recommender systems (pp. 257-260). ACM.
        """
        # Flatten all recommendation lists and count the distinct recommended items
        predictedFlattened = [p for sublist in predicted for p in sublist]
        LPredictions = len(set(predictedFlattened))
        catalogCoverage = round(LPredictions / (len(catalog) * 1.0) * 100, 2)
        return catalogCoverage

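For example, the hypothetical snippet below evaluates novelty for one user's list and catalog coverage over two lists. The popularity counts, user count, and item ids are fabricated for illustration, and ``novelty`` and ``catalogCoverage`` are assumed to be defined as above.

::

    # Hypothetical popularity counts: item -> number of occurrences in the training data
    pop = {'A': 90, 'B': 40, 'C': 5, 'X': 70, 'Y': 12, 'Z': 3}
    numberOfUsers = 100   # users in the training data
    listLength = 3        # length of each recommendation list

    # Novelty of one user's recommendation list (rarer items score higher)
    print(novelty(['C', 'Y', 'Z'], pop, numberOfUsers, listLength))

    # Catalog coverage over two users' lists: 4 of the 6 catalog items appear -> 66.67
    predicted = [['A', 'B', 'C'], ['A', 'X', 'C']]
    catalog = {'A', 'B', 'C', 'X', 'Y', 'Z'}
    print(catalogCoverage(predicted, catalog))
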
""" def makeRecMatrix(predicted: List[list]): df = pd.DataFrame(data=predicted).reset_index().melt( id_vars='index', value_name='item', ) df = df[['index', 'item']].pivot( index='index', columns='item', values='item') df = pd.notna(df)*1 recMatrix = sp.csr_matrix(df.values) return recMatrix # Create matrix for recommendations predicted = np.array(predicted) recMatrixSparse = makeRecMatrix(predicted) # Calculate similarity for every user's recommendation list similarity = cosine_similarity(X=recMatrixSparse, dense_output=False) # Get indicies for upper right triangle w/o diagonal upperRight = np.triu_indices(similarity.shape[0], k=1) # Calculate average similarity personalization = np.mean(similarity[upperRight]) return 1-personalization