API

XBOCModel

xboc.XBOCModel()

A Bag-Of-Concepts model that is implemented following the original paper: https://www.sciencedirect.com/science/article/abs/pii/S0925231217308962

The model is also compatible with the BERTopic pipeline and can be used to embed documents.

xboc.XBOCModel.__init__(self: Self, corpus: list[list[str]] | ndarray, wv: ndarray, idx2word: dict[int, str], tokenizer: Tokenizer | None = None, clustering_method: ClusteringMethod = ClusteringMethod.Spherical_KMeans, n_concepts: int = 100, iterations: int = 100, random_state: int = 42, verbose: bool = False, label_impl: LabelingImplementation | None = None, llm_model: LLMModel = LLMModel.OPENAI_GPT3_5, custom_chain: RunnableSerializable[dict[Any, Any], BaseMessage] | None = None, n_top_words_label: int = 10, log_level: int = 20) -> None

Initializes the Bag-Of-Concepts model.

Parameters

corpus: list[list[str]] | np.ndarray

The preprocessed corpus to train the model on.

wv: np.ndarray

Word vector representations.

idx2word: dict[int, str]

Index-to-word mapping.

tokenizer: Tokenizer | None

The tokenizer used to tokenize the corpus. If not provided, new documents cannot be embedded. By default None.

clustering_method: ClusteringMethod

The clustering method to use, by default Spherical KMeans.

n_concepts: int, optional

Number of concepts, by default 100.

iterations: int, optional

Number of iterations, by default 100.

random_state: int, optional

Random seed used to make results reproducible, by default 42.

verbose: bool, optional

Whether to log progress messages, by default False.

label_impl: LabelingImplementation | None

Whether to label concepts at all and, if so, whether to use the template chains or a custom chain, by default None.

llm_model: LLMModel

Large language model to be used for the supported templates, by default OPENAI_GPT3_5.

custom_chain: RunnableSerializable[dict[Any, Any], BaseMessage] | None

The custom chain to use, if the user has specified one, by default None.

n_top_words_label: int

How many top words to use to label the concepts, by default 10.

log_level: int

Log level used if verbose is True, by default 20 (logging.INFO).

xboc.XBOCModel.fit(self: Self) -> tuple[ndarray | spmatrix, list[tuple[str, int]], dict[int, str]]

Fit the model on the training data

Returns

tuple[np.ndarray | scipy.sparse.spmatrix, list[tuple[str, int]], dict[int, str]]
  1. Bag-of-Concepts matrix, where each row is the embedding of one document.

  2. A list of word-to-concept mappings.

  3. An index-to-word converter as a dictionary.

xboc.XBOCModel.encode(self: Self, text: list[str] | str) -> spmatrix | ndarray

Encodes new documents

Parameters

text: list[str] | str

Text to be encoded

Returns

scipy.sparse.spmatrix | np.ndarray

The document embedding

Raises

ValueError

Model must be first trained before it can encode documents.

AttributeError

A tokenizer must be specified in order to encode new documents.

xboc.XBOCModel.save(self: Self, folder_path: str) -> None

Saves the model using pickle.

IMPORTANT

Custom chains cannot be saved. After loading your BoC model, set the custom chain again manually.

Parameters

folder_path: str

The folder where the model should be saved.

xboc.XBOCModel.get_top_n_words(self, n: int, cluster: int | None = None) -> list[list[str]] | list[str]

Gets the top N words for all clusters or for a given cluster if provided.

Parameters

n: int

Number of words to get.

cluster: int | None, optional

Cluster of interest, by default None.

Returns

list[list[str]] | list[str]

Top N words for all clusters or for a given cluster

Raises

AttributeError

Something went wrong during training.

IndexError

The number N is higher than the minimum number of words assigned to clusters.

IndexError

The cluster of interest is out of range.

xboc.XBOCModel.get_concept_label(self: Self, index: int | None = None) -> list[str] | str

Returns all concept labels or the label of a specific concept.

Parameters

index: int | None, optional

Concept index, by default None.

Returns

list[str] | str

The label of the given concept, or all concept labels if no index is given.

Raises

IndexError

If concept index is out of range

xboc.XBOCModel.load(file_path: str) -> XBOCModel

Loads the model using pickle.

Parameters

file_path: str

The path to the file.

xboc.XBOCModel.calculate_scores_for_k_range(k_range: list[int], wv: ndarray, clustering_method: ClusteringMethod = ClusteringMethod.Spherical_KMeans, max_iter: int = 100, verbose: bool = False, random_state: int = 42) -> DataFrame

Calculates the BIC and AIC using Gaussian mixture models (GMMs) fitted on the provided word vectors, once for each number of clusters K in the provided list. Additionally, this function calculates the silhouette, Davies-Bouldin, and Calinski-Harabasz scores for the same values of K, using the provided clustering method.

Parameters

k_range: list[int]

The values of K (number of clusters) to search over.

wv: np.ndarray

Word vectors.

clustering_method: ClusteringMethod

The clustering method to use for calculating the silhouette, Davies-Bouldin, and Calinski-Harabasz scores, by default Spherical_KMeans.

max_iter: int, optional

Maximum number of iterations for the clustering model, by default 100.

verbose: bool, optional

Verbosity, by default False.

random_state: int, optional

Random state to be used, by default 42.

Returns

pd.DataFrame

DataFrame containing all the results.

xboc.XBOCModel._cluster_wv(self: Self, wv: ndarray, num_concept: int, max_iter: int = 10) -> None

Cluster word vector representations using the provided clustering method.

Parameters

wv: np.ndarray

The word vector representations.

num_concept: int

Number of concepts (clusters) to create.

max_iter: int, optional

Maximum number of iterations for the clustering method, by default 10.

Raises

ValueError

If the word vectors that have to be clustered contain any NaN values.
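
The default clustering method is named Spherical KMeans. As a rough illustration of that technique, and not xboc's actual implementation, the following pure-Python sketch clusters unit-normalized vectors by cosine similarity and rejects NaN input as described above. The function names spherical_kmeans and normalize, and the way the initial centroids are sampled, are assumptions made for this example.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def spherical_kmeans(wv, num_concept, max_iter=10, random_state=42):
    """Cluster the rows of `wv` by cosine similarity; returns one label per row."""
    if any(math.isnan(x) for row in wv for x in row):
        raise ValueError("Word vectors must not contain NaN values.")
    points = [normalize(row) for row in wv]
    rng = random.Random(random_state)
    # Assumption: initial centroids are sampled from the data points.
    centroids = [list(p) for p in rng.sample(points, num_concept)]
    labels = [-1] * len(points)
    for _ in range(max_iter):
        # Assignment step: choose the centroid with the largest dot product,
        # which equals the cosine similarity for unit-length vectors.
        new_labels = [
            max(range(num_concept),
                key=lambda k: sum(a * b for a, b in zip(p, centroids[k])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: re-normalized mean of each cluster's members.
        for k in range(num_concept):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[k] = normalize(mean)
    return labels


# Toy 2-D "word vectors": two pointing along x, two pointing along y.
toy_wv = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
labels = spherical_kmeans(toy_wv, num_concept=2)
```

Each resulting label plays the role of a concept index for the word in the same row.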

xboc.XBOCModel._create_bow(self: Self) -> None

Create the bag of words that is needed to compute the CF-IDF.
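
As a minimal sketch of what a bag-of-words step can look like, assuming a tokenized corpus and a word-to-index mapping: the helper create_bow below is hypothetical and returns a dense list of rows, whereas the library may store a sparse matrix internally.

```python
from collections import Counter


def create_bow(corpus, word2idx):
    """Return a document-by-word count matrix as a list of rows."""
    matrix = []
    for doc in corpus:
        # Count only tokens that are part of the vocabulary.
        counts = Counter(tok for tok in doc if tok in word2idx)
        row = [0] * len(word2idx)
        for word, count in counts.items():
            row[word2idx[word]] = count
        matrix.append(row)
    return matrix


corpus = [["cat", "dog", "cat"], ["stock", "market", "dog"]]
word2idx = {"cat": 0, "dog": 1, "stock": 2, "market": 3}
bow = create_bow(corpus, word2idx)
```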

xboc.XBOCModel._create_w2c(self: Self) -> None

Create the word to concept mapping.

Raises

IndexError

When the dimensions between words and labels do not match.
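
A word-to-concept mapping of the shape returned by fit (a list of (word, concept) pairs) could be built as sketched below. This is an illustration only: the standalone function create_w2c and the assumption that idx2word is keyed 0..n-1 in the same order as the cluster labels are not taken from the library.

```python
def create_w2c(idx2word, labels):
    """Pair each word with the concept (cluster) label of its vector."""
    if len(idx2word) != len(labels):
        raise IndexError(
            f"Got {len(idx2word)} words but {len(labels)} cluster labels."
        )
    # Assumption: idx2word keys are 0..n-1, aligned with the label order.
    return [(idx2word[i], int(labels[i])) for i in range(len(labels))]


idx2word = {0: "cat", 1: "dog", 2: "stock", 3: "market"}
w2c = create_w2c(idx2word, [0, 0, 1, 1])
```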

xboc.XBOCModel._apply_cfidf(self: Self, csr_matrix: ndarray | spmatrix) -> None

Applies the Concept Frequency-Inverse Document Frequency (CF-IDF) weighting.

Parameters

csr_matrix: np.ndarray | scipy.sparse.spmatrix

The embedded documents using the concepts.
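
CF-IDF is the concept-level analogue of TF-IDF: the relative frequency of a concept within a document is scaled by the logarithm of the inverse fraction of documents that contain the concept. The sketch below applies that weighting to a small dense count matrix. It is an assumption-laden illustration (no smoothing, plain lists instead of sparse matrices, hypothetical function name apply_cfidf), so the library's numbers may differ.

```python
import math


def apply_cfidf(boc_counts):
    """Weight a document-by-concept count matrix with CF-IDF."""
    n_docs = len(boc_counts)
    n_concepts = len(boc_counts[0])
    # Number of documents in which each concept occurs at least once.
    doc_freq = [
        sum(1 for row in boc_counts if row[c] > 0) for c in range(n_concepts)
    ]
    weighted = []
    for row in boc_counts:
        total = sum(row)  # number of concept occurrences in this document
        weighted.append([
            (row[c] / total) * math.log(n_docs / doc_freq[c])
            if total and doc_freq[c] else 0.0
            for c in range(n_concepts)
        ])
    return weighted


# Two documents, three concepts.
counts = [[2, 1, 0], [0, 1, 3]]
scores = apply_cfidf(counts)
```

A concept that occurs in every document (the middle column here) receives a weight of zero, because it does not help to tell documents apart.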

xboc.XBOCModel._get_word2idx(self: Self, idx2word: dict[int, str]) -> dict[str, int]

Creates the reverse (word-to-index) mapping from the index-to-word mapping.

Parameters

idx2word: dict[int, str]

Index to word mapping

Returns

dict[str, int]

Word to index mapping
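
Inverting an index-to-word dictionary is a one-line dictionary comprehension. The standalone function below is a sketch of the idea, not the method's actual body, and it assumes that every word appears only once in idx2word.

```python
def get_word2idx(idx2word):
    """Invert an index-to-word mapping into a word-to-index mapping."""
    return {word: idx for idx, word in idx2word.items()}


idx2word = {0: "cat", 1: "dog", 2: "stock"}
word2idx = get_word2idx(idx2word)
```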

xboc.XBOCModel._log(self: Self, msg: str) -> None

Logs messages if verbose is enabled.

Parameters

msg: str

Message to be logged

Types

xboc.LLMModel()

All model-specific prompt implementations.

xboc.LabelingImplementation()

The LangChain labeling implementation to use. The user can specify whether to use our pre-defined templates or a custom LangChain chain that they provide.

xboc.ClusteringMethod()

Define all supported clustering methods.

xboc.Tokenizer()

Abstract wrapper around tokenizers. The call method should be implemented to allow tokenization of new documents.

Ensures compatibility with pickle.

xboc.Tokenizer.__call__(self, text: str) -> list[str]

Tokenizes the input text and returns a list of tokens.

Parameters

text: str

The input text to tokenize.

Returns

list[str]

A list of tokens extracted from the input text.
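
A custom tokenizer could look roughly like the sketch below. To keep the example self-contained, it declares a local abstract class as a stand-in for xboc.Tokenizer; in real use you would subclass the library's class instead. RegexTokenizer and its default pattern are hypothetical names chosen for this illustration.

```python
import re
from abc import ABC, abstractmethod


class Tokenizer(ABC):
    """Local stand-in for xboc.Tokenizer (assumption: same call signature)."""

    @abstractmethod
    def __call__(self, text: str) -> list[str]:
        ...


class RegexTokenizer(Tokenizer):
    """Lowercases the text and extracts alphanumeric runs as tokens."""

    def __init__(self, pattern: str = r"[a-z0-9]+"):
        self.pattern = pattern

    def __call__(self, text: str) -> list[str]:
        return re.findall(self.pattern, text.lower())


tokenizer = RegexTokenizer()
tokens = tokenizer("Hello, World! 42")
```

Defining the tokenizer as a module-level class rather than a lambda or nested function matters here, because pickle can only serialize instances of classes it can import by name.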

Prompts

xboc.prompts.UniversalPrompt()

An abstract class that requires each model-specific prompt implementation to define the following attribute:

Attributes

prompt: str

The main prompt to be used. MUST CONTAIN “{keywords}”!
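
The sketch below shows why the placeholder matters: the top words of a concept are substituted into the prompt where {keywords} appears. The class is a local stand-in, not xboc.prompts.UniversalPrompt itself; the render method, the validation in __init__, and the prompt wording are all assumptions made for this example.

```python
class UniversalPrompt:
    """Local stand-in for a prompt class that carries a `prompt` attribute."""

    def __init__(self, prompt: str):
        # The documented contract: the template must contain {keywords}.
        if "{keywords}" not in prompt:
            raise ValueError('The prompt must contain "{keywords}".')
        self.prompt = prompt

    def render(self, keywords: list[str]) -> str:
        """Hypothetical helper: fill the placeholder with the top words."""
        return self.prompt.format(keywords=", ".join(keywords))


custom = UniversalPrompt(
    "Give a short label for a concept described by: {keywords}"
)
text = custom.render(["bank", "loan", "credit"])
```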

xboc.prompts.OpenAIPrompt()

OpenAI specific prompt.

Attributes

prompt: str

The main prompt to be used.