API

XBOCModel

xboc.XBOCModel()

A Bag-Of-Concepts model that is implemented following the original paper: https://www.sciencedirect.com/science/article/abs/pii/S0925231217308962

The model is also compatible with the BERTopic pipeline and can be used to embed documents.
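
As a rough illustration of the idea behind a Bag-of-Concepts representation: words are grouped into concepts (clusters of word vectors), and each document is described by how often each concept occurs in it. The sketch below is a simplification for intuition only, not the library's implementation; the function name `bag_of_concepts` and the toy word-to-concept mapping are hypothetical.

```python
from collections import Counter


def bag_of_concepts(corpus, word2concept, n_concepts):
    """Count, per document, how many of its words fall into each concept."""
    rows = []
    for doc in corpus:
        # Words without a concept assignment are ignored (an assumption).
        counts = Counter(word2concept[w] for w in doc if w in word2concept)
        rows.append([counts.get(c, 0) for c in range(n_concepts)])
    return rows


# Toy mapping: concept 0 groups animals, concept 1 groups vehicles.
word2concept = {"cat": 0, "dog": 0, "car": 1, "bus": 1}
corpus = [["cat", "dog", "car"], ["bus", "car"]]
boc = bag_of_concepts(corpus, word2concept, n_concepts=2)
```

Each row of `boc` is one document and each column one concept. In the model these raw counts are then reweighted with CF-IDF.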

xboc.XBOCModel.__init__(self: Self, corpus: list[list[str]] | ndarray, wv: ndarray, idx2word: dict[int, str], tokenizer: Tokenizer | None = None, clustering_method: ClusteringMethod = ClusteringMethod.Spherical_KMeans, n_concepts: int = 100, iterations: int = 100, random_state: int = 42, verbose: bool = False, label_impl: LabelingImplementation | None = None, llm_model: LLMModel = LLMModel.OPENAI_GPT3_5, custom_chain: RunnableSerializable[dict[Any, Any], BaseMessage] | None = None, n_top_words_label: int = 10, log_level: int = 20) → None

Initializes the Bag-Of-Concept model.

Parameters

corpus: list[list[str]] | np.ndarray: The preprocessed corpus to train the model on
wv: np.ndarray: Word vector representations
idx2word: dict[int, str]: Index-to-word mapping
tokenizer: Tokenizer | None: The tokenizer used to tokenize the corpus. If not provided, new documents cannot be embedded, by default None
clustering_method: ClusteringMethod: The clustering method to use, by default Spherical KMeans
n_concepts: int, optional: Number of concepts, by default 100
iterations: int, optional: Number of iterations, by default 100
random_state: int, optional: Random seed for reproducible results, by default 42
verbose: bool, optional: Whether to log progress messages, by default False
label_impl: LabelingImplementation | None: Whether to use template chains or a custom chain, and whether to label concepts at all, by default None
llm_model: LLMModel: Large language model to be used for the supported templates, by default OPENAI_GPT3_5
custom_chain: RunnableSerializable[dict[Any, Any], BaseMessage] | None: The custom chain to use, if the user has specified one, by default None
n_top_words_label: int: How many top words to use to label the concepts, by default 10
log_level: int: Log level used if verbose is True, by default 20 (logging.INFO)
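
The three required inputs have to describe the same vocabulary. The following is a minimal sketch of preparing them, assuming (this is not stated above) that row i of `wv` holds the vector for `idx2word[i]`. The helper `build_idx2word` and the toy corpus are hypothetical, and the constructor call is left as a comment because it needs real word vectors.

```python
def build_idx2word(corpus: list[list[str]]) -> dict[int, str]:
    """Assign an integer index to every distinct token, in first-seen order."""
    idx2word: dict[int, str] = {}
    seen: set[str] = set()
    for doc in corpus:
        for token in doc:
            if token not in seen:
                seen.add(token)
                idx2word[len(idx2word)] = token
    return idx2word


corpus = [["cats", "chase", "mice"], ["dogs", "chase", "cats"]]
idx2word = build_idx2word(corpus)

# Assumption: `wv` is an array of shape (len(idx2word), dim) whose rows are
# aligned with the indices in `idx2word`. With such vectors available, the
# model could be constructed from the documented parameters, for example:
#
#   from xboc import XBOCModel
#   model = XBOCModel(corpus, wv, idx2word, n_concepts=2, random_state=42)
#   boc_matrix, word2concept, idx2word_out = model.fit()
```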

xboc.XBOCModel.fit(self: Self) → tuple[ndarray | spmatrix, list[tuple[str, int]], dict[int, str]]

Fit the model on the training data

Returns

tuple[np.ndarray | scipy.sparse.spmatrix, list[tuple[str, int]], dict[int, str]]

Bag-of-Concepts matrix, where each row is a document embedding.
A list of word-to-concept mappings.
An index-to-word mapping as a dictionary.

xboc.XBOCModel.encode(self: Self, text: list[str] | str) → spmatrix | ndarray

Encodes new documents

Parameters

text: list[str] | str: Text to be encoded

Returns

scipy.sparse.spmatrix | np.ndarray: The document embedding

Raises

ValueError: Model must be first trained before it can encode documents.
AttributeError: A tokenizer must be specified in order to encode new documents.

xboc.XBOCModel.save(self: Self, folder_path: str) → None

Saves the model using pickle.

IMPORTANT

Custom chains cannot be saved. After loading your BoC model, make sure to set the custom chain again manually.

Parameters

folder_path: str: The folder where the model should be saved.

xboc.XBOCModel.get_top_n_words(self, n: int, cluster: int | None = None) → list[list[str]] | list[str]

Gets the top N words for all clusters or for a given cluster if provided.

Parameters

n: int: Number of words to get
cluster: int | None, optional: Cluster of interest, by default None

Returns

list[list[str]] | list[str]: Top N words for all clusters or for a given cluster

Raises

AttributeError: Something went wrong during training.
IndexError: The number N is higher than the minimum number of words assigned to clusters.
IndexError: The cluster of interest is out of range.

xboc.XBOCModel.get_concept_label(self: Self, index: int | None = None) → list[str] | str

Returns all concept labels, or the label of a specific concept.

Parameters

index: int | None, optional: Concept index, by default None

Returns

list[str] | str: The concept label or all concept labels

Raises

IndexError: If concept index is out of range

xboc.XBOCModel.load(file_path: str) → XBOCModel

Loads the model using pickle.

Parameters

file_path: str: The path to the file.

xboc.XBOCModel.calculate_scores_for_k_range(k_range: list[int], wv: ndarray, clustering_method: ClusteringMethod = ClusteringMethod.Spherical_KMeans, max_iter: int = 100, verbose: bool = False, random_state: int = 42) → DataFrame

Calculates the BIC and AIC using GMMs fitted on the provided word vectors, for each number of clusters k in the provided list. Additionally, this function calculates the silhouette, Davies-Bouldin, and Calinski-Harabasz scores for the same values of k, using the provided clustering method.

Parameters

k_range: list[int]: The grid parameter search range
wv: np.ndarray: Word vectors
clustering_method: ClusteringMethod: The clustering method to use for calculating the silhouette, Davies-Bouldin, and Calinski-Harabasz scores, by default Spherical_KMeans
max_iter: int, optional: Maximum number of iterations for the clustering model, by default 100
verbose: bool, optional: Verbosity, by default False
random_state: int, optional: Random state to be used, by default 42

Returns

pd.DataFrame: DataFrame containing all the results

xboc.XBOCModel._cluster_wv(self: Self, wv: ndarray, num_concept: int, max_iter: int = 10) → None

Cluster word vector representations using the provided clustering method.

Parameters

wv: np.ndarray: The word vector representations.
num_concept: int: Number of concepts to cluster.
max_iter: int, optional: Maximum number of iterations for the clustering method, by default 10

Raises

ValueError: If the word vectors that have to be clustered contain any NaN values.
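
For intuition, spherical k-means (the default clustering method) groups unit-length vectors by cosine similarity. The following is a simplified pure-Python sketch of that idea, not the library's implementation: the function name `spherical_kmeans`, the centroid initialisation, and the stopping rule are assumptions made for illustration. It also mirrors the documented NaN check.

```python
import math
import random


def _normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def spherical_kmeans(wv, num_concept, max_iter=10, random_state=42):
    """Return one cluster label per row of `wv`, using cosine similarity."""
    if any(math.isnan(x) for row in wv for x in row):
        raise ValueError("Word vectors must not contain NaN values.")

    points = [_normalize(row) for row in wv]
    rng = random.Random(random_state)
    # Assumption: initialise centroids from randomly chosen data points.
    centroids = [list(p) for p in rng.sample(points, num_concept)]
    labels = [-1] * len(points)

    for _ in range(max_iter):
        # Assignment step: pick the centroid with the largest dot product,
        # which equals the cosine similarity for unit-length vectors.
        new_labels = [
            max(range(num_concept),
                key=lambda k: sum(a * b for a, b in zip(p, centroids[k])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the normalized sum of its members.
        for k in range(num_concept):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = _normalize([sum(col) for col in zip(*members)])
    return labels


# Two vectors pointing mostly along x, two mostly along y.
vectors = [[1.0, 0.1], [2.0, 0.1], [0.1, 1.0], [0.2, 3.0]]
labels = spherical_kmeans(vectors, num_concept=2)
```

Because only direction matters, `[1.0, 0.1]` and `[2.0, 0.1]` end up in the same cluster despite their different lengths.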

xboc.XBOCModel._create_bow(self: Self) → None: Create the bag of words that is needed to compute the CF-IDF

xboc.XBOCModel._create_w2c(self: Self) → None

Create the word to concept mapping.

Raises

IndexError: When the dimensions between words and labels do not match.
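
A word-to-concept mapping of the shape returned by `fit` (`list[tuple[str, int]]`) can be pictured as pairing each vocabulary word with its cluster label. The sketch below is a hypothetical standalone function, assuming the keys of `idx2word` run from 0 to n-1 and that `labels[i]` is the cluster assigned to word i; the internal method may work differently.

```python
def create_w2c(idx2word: dict[int, str], labels: list[int]) -> list[tuple[str, int]]:
    """Pair every vocabulary word with the concept (cluster) it was assigned to."""
    if len(idx2word) != len(labels):
        # Mirrors the documented IndexError for mismatched dimensions.
        raise IndexError("Number of words and number of cluster labels differ.")
    return [(idx2word[i], labels[i]) for i in range(len(labels))]


idx2word = {0: "cat", 1: "dog", 2: "car"}
w2c = create_w2c(idx2word, labels=[0, 0, 1])
```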

xboc.XBOCModel._apply_cfidf(self: Self, csr_matrix: ndarray | spmatrix) → None

Applies the Concept Frequency-Inverse Document Frequency (CF-IDF) weighting.

Parameters

csr_matrix: np.ndarray | scipy.sparse.spmatrix: The embedded documents using the concepts.
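
CF-IDF is analogous to TF-IDF, with concepts in place of terms: a concept's share of a document is scaled down when the concept appears in many documents. The following is a pure-Python sketch assuming the plain TF-IDF-style formula (relative concept frequency times the log of the inverse document frequency). The function name `cfidf` is hypothetical, and the library's exact smoothing or normalization may differ.

```python
import math


def cfidf(counts: list[list[int]]) -> list[list[float]]:
    """Weight a documents-by-concepts count matrix in TF-IDF style.

    counts[d][c] is how many words of document d belong to concept c.
    """
    n_docs = len(counts)
    n_concepts = len(counts[0])
    # Document frequency: in how many documents does each concept occur?
    df = [sum(1 for row in counts if row[c] > 0) for c in range(n_concepts)]

    weighted = []
    for row in counts:
        total = sum(row)
        weighted.append([
            (row[c] / total) * math.log(n_docs / df[c]) if total and df[c] else 0.0
            for c in range(n_concepts)
        ])
    return weighted


counts = [[2, 0], [1, 1]]  # 2 documents, 2 concepts
weights = cfidf(counts)
```

Here concept 0 occurs in both documents, so its weight is zero everywhere. Concept 1 occurs only in the second document and keeps a positive weight there.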

xboc.XBOCModel._get_word2idx(self: Self, idx2word: dict[int, str]) → dict[str, int]

Creates the reverse mapping of index to word.

Parameters

idx2word: dict[int, str]: Index to word mapping

Returns

dict[str, int]: Word to index mapping
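
Reversing the mapping amounts to swapping keys and values. A minimal sketch as a hypothetical standalone function, assuming every word appears only once in `idx2word`:

```python
def get_word2idx(idx2word: dict[int, str]) -> dict[str, int]:
    """Invert an index-to-word mapping into a word-to-index mapping."""
    return {word: idx for idx, word in idx2word.items()}


word2idx = get_word2idx({0: "cat", 1: "dog"})
```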

xboc.XBOCModel._log(self: Self, msg: str) → None

Logs messages if verbose is enabled.

Parameters

msg: str: Message to be logged

Types

xboc.LLMModel(): All model-specific prompts implementations

xboc.LabelingImplementation(): Labeling LangChain Implementation to use. The user can specify whether to use our pre-defined templates or a custom langchain that he provides.

xboc.ClusteringMethod(): Define all supported clustering methods.

xboc.Tokenizer()

Abstract wrapper around tokenizers. The call method should be implemented to allow tokenization of new documents.

Ensures compatibility with pickle.

xboc.Tokenizer.__call__(self, text: str) → list[str]

Tokenizes the input text and returns a list of tokens.

Parameters

text: str: The input text to tokenize.

Returns

list[str]: A list of tokens extracted from the input text.
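
A tokenizer is therefore a class whose instances can be called on a string. The sketch below uses a local stand-in base class so that it runs on its own; in real use, one would subclass `xboc.Tokenizer` instead. The `WhitespaceTokenizer` class is a hypothetical example.

```python
from abc import ABC, abstractmethod


class Tokenizer(ABC):
    """Stand-in for xboc.Tokenizer so that this sketch is self-contained."""

    @abstractmethod
    def __call__(self, text: str) -> list[str]:
        ...


class WhitespaceTokenizer(Tokenizer):
    """Hypothetical tokenizer: lowercases the text and splits on whitespace."""

    def __call__(self, text: str) -> list[str]:
        return text.lower().split()


tokenizer = WhitespaceTokenizer()
tokens = tokenizer("Cats chase mice")
```

Defining the tokenizer as a module-level class, rather than a lambda or a nested function, is what keeps it compatible with pickle.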

Prompts

xboc.prompts.UniversalPrompt()

An abstract class that requires each model-specific prompt implementation to define the following attribute:

Attributes

prompt: str: The main prompt to be used. MUST CONTAIN “{keywords}”!
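
To illustrate the placeholder requirement, the sketch below defines a prompt string containing `{keywords}` and fills it in. The class `KeywordPrompt` and the prompt wording are hypothetical, and `str.format` is used here only as an assumption about how the placeholder gets substituted; the library may rely on a LangChain template instead.

```python
class KeywordPrompt:
    """Hypothetical prompt holder; only the `prompt` attribute is documented."""

    prompt: str = (
        "The following keywords describe one concept: {keywords}. "
        "Reply with a short label for the concept."
    )


# The top words of a concept would take the place of the placeholder.
keywords = ["cat", "dog", "hamster"]
filled = KeywordPrompt.prompt.format(keywords=", ".join(keywords))
```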

xboc.prompts.OpenAIPrompt()

OpenAI specific prompt.

Attributes

prompt: str: The main prompt to be used