
Accurate keyword extraction with customised KeyBERT

KeyBERT is a keyword extraction tool that selects the most important keywords in a document very differently from traditional scoring methods such as TF-IDF. It measures the importance of a keyword as the cosine similarity between the representations of the document and of the keyword, both constructed with a pretrained language model. In this way KeyBERT aims to capture the relevance of the keyword to the meaning of the document, and at the same time removes the need for a large reference corpus.
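To make the scoring step concrete, here is a minimal sketch of ranking candidates by cosine similarity. The three-dimensional vectors are made-up stand-ins for real language-model embeddings, and `rank_keywords` is a hypothetical helper written for this post, not part of KeyBERT's API.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_keywords(doc_vector, candidate_vectors, top_n=2):
    """Return the top_n (candidate, score) pairs, most similar first."""
    scored = [
        (keyword, cosine_similarity(doc_vector, vector))
        for keyword, vector in candidate_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

# Toy embeddings (assumed values, for illustration only)
doc_vector = [1.0, 1.0, 0.0]
candidate_vectors = {
    "supervised learning": [1.0, 0.9, 0.1],
    "training data": [0.5, 1.0, 0.5],
    "way": [0.0, 0.1, 1.0],
}
ranked = rank_keywords(doc_vector, candidate_vectors)
```

Note that no reference corpus appears anywhere in this computation: each candidate is scored against the document vector alone.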

The problem with CountVectorizer

However, KeyBERT uses sklearn's CountVectorizer behind the scenes to extract the tokens from which keywords are later built. The problem with CountVectorizer is that it uses a regular expression to extract alphanumeric tokens. It first generates a single flat list of tokens from the whole document, ignoring sentence boundaries, and then deletes stopwords from this list, destroying any grammatical relations between tokens. Keywords are then constructed from this token list. As a result, many keywords consist of tokens that were separated by stopwords or punctuation in the original text, or that even appeared in different sentences. See the example below:

    from sklearn.feature_extraction.text import CountVectorizer

    doc = 'This is the first document. This document is the second document.'

    vectorizer = CountVectorizer(stop_words="english", ngram_range=(2, 2))
    X = vectorizer.fit_transform([doc])
    vectorizer.get_feature_names_out()

:::python
    array(['document document', 'document second', 'second document'],
          dtype=object)

CountVectorizer first extracted all alphanumeric tokens from the whole document using a regex and then stripped the stopwords (including "first", which is on sklearn's English stop list). This left the token list "document", "document", "second", "document". Three bigrams were then created from this list:

(1) "document document", extracted from the last token of the first sentence and the first non-stopword from the second sentence.

(2) "document second", extracted from "... document is the second", after deleting the stopwords "is" and "the".

(3) "second document", the only meaningful word combination.
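The same behaviour can be reproduced without sklearn. The sketch below is a simplified imitation, not sklearn's actual implementation: it applies the regex that CountVectorizer uses as its default `token_pattern`, removes stopwords, and pairs up neighbouring tokens. The stop list is a tiny example-only assumption, and `flat_bigrams` is a hypothetical name.

```python
import re

# Small stop list for this example only (an assumption; sklearn's
# built-in English list is much longer)
STOP_WORDS = {"this", "is", "the", "first"}

def flat_bigrams(text):
    """Tokenize the whole text, drop stopwords, then pair up neighbours."""
    # Same pattern as CountVectorizer's default token_pattern
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    # Neighbours in `kept` may have been far apart in the original text
    return sorted({" ".join(pair) for pair in zip(kept, kept[1:])})

doc = "This is the first document. This document is the second document."
bigrams = flat_bigrams(doc)
```

Because the full stop and the stopwords vanish before pairing, `bigrams` contains the same three strings as the CountVectorizer output above.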

Below is a customized CountVectorizer that respects sentence boundaries and grammatical relations between words. It relies on a proper NLP tokenizer and PoS tagger, such as those available in spaCy or NLTK:

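    import importlib
    import sklearn.feature_extraction.text
    import spacy

    from itertools import chain

    # Reload the module and re-import the class, so that re-running this cell
    # starts from the unpatched CountVectorizer
    importlib.reload(sklearn.feature_extraction.text)
    from sklearn.feature_extraction.text import CountVectorizer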

    spacy_nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])


    # Keep a reference to CountVectorizer's original n-gram method
    original_word_ngrams = CountVectorizer._word_ngrams

    def get_token_sets(self, tokens, stop_words=None):
        """Yield n-grams built only from uninterrupted runs of content words."""
        token_set = []
        if stop_words is None:
            stop_words = []
        for t in tokens:
            # Nouns, adjectives and proper nouns that are not stopwords extend the current run
            if t.pos_ in ["NOUN", "ADJ", "PROPN"] and t.text not in stop_words:
                token_set.append(t.text)
            else:
                # Any other token (stopword, verb, punctuation, ...) ends the run
                yield original_word_ngrams(self, token_set, stop_words)
                token_set = []
        yield original_word_ngrams(self, token_set, stop_words)

    # Custom n-gram method: n-grams are generated per run, then concatenated
    def _custom_word_ngrams(self, tokens, stop_words=None):
        token_sets = get_token_sets(self, tokens, stop_words=stop_words)
        return list(chain.from_iterable(token_sets))

    # Replace the method on the CountVectorizer class
    CountVectorizer._word_ngrams = _custom_word_ngrams

:::python
    vectorizer = CountVectorizer(tokenizer=spacy_nlp, stop_words="english", ngram_range=(2, 2))
    X = vectorizer.fit_transform([doc])
    vectorizer.get_feature_names_out()

Here, only "second document" is extracted from the same input text.
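The core idea, cutting the token stream into runs and building n-grams only inside each run, can also be shown without spaCy. The sketch below is a simplification under stated assumptions: it uses stopwords and punctuation as break points instead of PoS tags, the stop list is a tiny example-only set, and `token_runs` and `run_bigrams` are hypothetical names.

```python
import re

STOP_WORDS = {"this", "is", "the", "first"}  # assumed, example-only stop list

def token_runs(text):
    """Yield runs of consecutive content words; stopwords and punctuation end a run."""
    run = []
    # Words and punctuation marks are both kept as tokens, so a full stop
    # can act as a boundary
    for token in re.findall(r"\w+|[^\w\s]", text.lower()):
        if token.isalnum() and token not in STOP_WORDS:
            run.append(token)
        else:
            if run:
                yield run
            run = []
    if run:
        yield run

def run_bigrams(text):
    """Build bigrams inside each run only, never across a boundary."""
    return sorted(
        {" ".join(pair) for run in token_runs(text) for pair in zip(run, run[1:])}
    )

doc = "This is the first document. This document is the second document."
bigrams = run_bigrams(doc)
```

On this input the runs are `["document"]`, `["document"]` and `["second", "document"]`, so the only bigram left is "second document", in line with the spaCy-based vectorizer above.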

Customizing KeyBERT

Here is the effect of the default CountVectorizer within KeyBERT:

    from keybert import KeyBERT


    doc = """
             Supervised learning is the machine learning task of learning a function that
             maps an input to an output based on example input-output pairs. It infers a
             function from labeled training data consisting of a set of training examples.
             In supervised learning, each example is a pair consisting of an input object
             (typically a vector) and a desired output value (also called the supervisory signal).
             A supervised learning algorithm analyzes the training data and produces an inferred function,
             which can be used for mapping new examples. An optimal scenario will allow for the
             algorithm to correctly determine the class labels for unseen instances. This requires
             the learning algorithm to generalize from the training data to unseen situations in a
             'reasonable' way (see inductive bias).
          """
    kw_model = KeyBERT()
    keywords = kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 3), stop_words="english", top_n=10)
    keywords

:::python
    [('supervised learning algorithm', 0.6992),
     ('supervised learning example', 0.6807),
     ('supervised learning', 0.6779),
     ('supervised learning machine', 0.6706),
     ('supervised', 0.6676),
     ('function labeled training', 0.663),
     ('training examples supervised', 0.625),
     ('signal supervised', 0.6152),
     ('labeled training data', 0.6125),
     ('examples supervised', 0.6112)]

Here,

* "supervised learning machine" has been constructed from "Supervised learning is the machine learning task",
* "function labeled training" from "... a function from labeled training data ...",
* "training examples supervised" from "... set of training examples. In supervised learning ...",
* "signal supervised" from "the supervisory signal). A supervised learning algorithm",
* "examples supervised" from "training examples. In supervised learning".

That is, five of the top 10 keywords are extraction errors.
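One rough way to spot such errors automatically is to test whether a keyword occurs in the source text as an uninterrupted word sequence. The sketch below is only a heuristic on a short excerpt: `appears_verbatim` is a hypothetical helper, and it flags any phrase that spans a stopword or punctuation mark, whether or not a human reader would call it an error.

```python
import re

def appears_verbatim(phrase, text):
    """True if the phrase occurs in the text as one uninterrupted word sequence."""
    # Collapse whitespace so line breaks in the source do not matter
    normalized = re.sub(r"\s+", " ", text.lower())
    pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
    return re.search(pattern, normalized) is not None

excerpt = """a set of training examples.
In supervised learning, each example is a pair"""
candidates = ["training examples", "examples supervised", "supervised learning"]

# Keywords that cannot be found verbatim were stitched together
# across a stopword, a punctuation mark or a sentence boundary
suspect = [kw for kw in candidates if not appears_verbatim(kw, excerpt)]
```

Here only "examples supervised" ends up in `suspect`, because "examples" and "supervised" are separated by a full stop and the word "In" in the excerpt.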

Using a custom ngram generator on the same input text, we get:

    vectorizer = CountVectorizer(tokenizer=spacy_nlp, stop_words="english", ngram_range=(1, 3))
    keywords = kw_model.extract_keywords(doc, vectorizer=vectorizer, top_n=10)
    keywords

:::python
    [('supervised learning algorithm', 0.6992),
     ('supervised learning', 0.6779),
     ('supervised', 0.6676),
     ('learning algorithm', 0.5632),
     ('training data', 0.5271),
     ('learning', 0.4813),
     ('training examples', 0.4668),
     ('training', 0.4134),
     ('labels', 0.3947),
     ('class labels', 0.389)]

These phrases are much more meaningful.