API
XBOCModel
- xboc.XBOCModel()
A Bag-Of-Concepts model that is implemented following the original paper: https://www.sciencedirect.com/science/article/abs/pii/S0925231217308962
The model is also compatible with the BERTopic pipeline and can be used to embed documents.
- xboc.XBOCModel.__init__(self: Self, corpus: list[list[str]] | ndarray, wv: ndarray, idx2word: dict[int, str], tokenizer: Tokenizer | None = None, clustering_method: ClusteringMethod = ClusteringMethod.Spherical_KMeans, n_concepts: int = 100, iterations: int = 100, random_state: int = 42, verbose: bool = False, label_impl: LabelingImplementation | None = None, llm_model: LLMModel = LLMModel.OPENAI_GPT3_5, custom_chain: RunnableSerializable[dict[Any, Any], BaseMessage] | None = None, n_top_words_label: int = 10, log_level: int = 20) -> None
Initializes the Bag-Of-Concepts model.
Parameters
- corpus: list[list[str]] | np.ndarray
The preprocessed corpus to train the model on.
- wv: np.ndarray
Word vector representations
- idx2word: dict[int, str]
Index to word mapping
- tokenizer: Tokenizer | None
The tokenizer used to tokenize the corpus. If not provided, no new documents can be embedded. By default None
- clustering_method: ClusteringMethod
The clustering method to use, by default Spherical KMeans
- n_concepts: int, optional
Number of concepts, by default 100
- iterations: int, optional
Number of iterations, by default 100
- random_state: int, optional
Random state for reproducible results, by default 42
- verbose: bool, optional
Verbose or not, by default False
- label_impl: LabelingImplementation | None
Whether to use template chains or a custom chain and whether to label concepts at all, by default None
- llm_model: LLMModel
Large language model to be used for the supported templates, by default OPENAI_GPT3_5
- custom_chain: RunnableSerializable[dict[Any, Any], BaseMessage] | None
The custom chain to be used if the user has specified it, by default None
- n_top_words_label: int
How many top words to use to label the concepts, by default 10
- log_level: int
Log level used when verbose is True, by default 20 (logging.INFO).
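As a rough illustration of the shapes these parameters take, the sketch below builds a tokenized corpus, an index-to-word mapping, and a matching set of word vectors using only the standard library. It assumes that row i of wv belongs to the word idx2word[i], which the parameter descriptions suggest but do not state. The toy vectors are invented, and a real wv would be a NumPy array.

```python
# Toy, pre-tokenized corpus: one list of tokens per document.
corpus = [
    ["cats", "chase", "mice"],
    ["dogs", "chase", "cats"],
]

# Vocabulary in a fixed (sorted) order so the indices are reproducible.
vocab = sorted({token for doc in corpus for token in doc})

# Index-to-word mapping in the dict[int, str] shape described for idx2word.
idx2word = dict(enumerate(vocab))

# Invented 2-dimensional vectors. ASSUMPTION: row i belongs to idx2word[i].
# A real `wv` would be a NumPy array built from trained embeddings.
wv_rows = [[float(i), 1.0] for i in range(len(vocab))]

# The two structures must describe the same number of words.
assert len(wv_rows) == len(idx2word)
```

Sorting the vocabulary is only one way to get a stable index; any fixed ordering works as long as the vectors follow it.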
- xboc.XBOCModel.fit(self: Self) -> tuple[ndarray | spmatrix, list[tuple[str, int]], dict[int, str]]
Fit the model on the training data
Returns
- tuple[np.ndarray | scipy.sparse.spmatrix, list[tuple[str, int]], dict[int, str]]
The Bag-Of-Concepts matrix, whose rows are the document embeddings.
A list of word-to-concept mappings.
An index-to-word converter as a dictionary.
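To make the relationship between the returned values concrete, here is a minimal sketch of how a list of (word, concept) pairs can turn tokenized documents into raw concept counts. The helper name bag_of_concepts is hypothetical, the tuples are assumed to hold (word, concept index), and the matrix that fit returns may be weighted and stored differently.

```python
from collections import Counter


def bag_of_concepts(corpus, word2concept, n_concepts):
    """Count, per document, how many tokens fall into each concept."""
    lookup = dict(word2concept)  # list[tuple[str, int]] -> dict[str, int]
    rows = []
    for doc in corpus:
        # Tokens without a concept assignment are ignored.
        counts = Counter(lookup[tok] for tok in doc if tok in lookup)
        rows.append([counts.get(c, 0) for c in range(n_concepts)])
    return rows


# ASSUMPTION: each tuple is (word, concept index).
word2concept = [("cats", 0), ("dogs", 0), ("mice", 0), ("chase", 1)]
corpus = [["cats", "chase", "mice"], ["dogs", "chase", "chase"]]
boc = bag_of_concepts(corpus, word2concept, n_concepts=2)
```

Each row of boc has one entry per concept, so documents of any length map to vectors of the same size.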
- xboc.XBOCModel.encode(self: Self, text: list[str] | str) -> spmatrix | ndarray
Encodes new documents
Parameters
- text: list[str] | str
Text to be encoded
Returns
- scipy.sparse.spmatrix | np.ndarray
The document embedding
Raises
- ValueError
Model must be first trained before it can encode documents.
- AttributeError
A tokenizer must be specified in order to encode new documents.
- xboc.XBOCModel.save(self: Self, folder_path: str) -> None
Saves the model using pickle.
IMPORTANT
Custom chains cannot be saved. After loading your BoC model, make sure to set the custom chain manually.
Parameters
- folder_path: str
The folder where the model should be saved.
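The pattern behind the note on custom chains can be sketched with plain pickle: drop the part that cannot be serialized before saving, then reattach it by hand after loading. Everything named here (save_state, load_state, the model.pkl file name, the dictionary keys) is hypothetical and stands in for the model's internals, which this page does not describe.

```python
import os
import pickle
import tempfile


def save_state(state, folder_path, file_name="model.pkl"):
    """Pickle a copy of `state` with the unpicklable chain removed."""
    to_save = dict(state)
    to_save["custom_chain"] = None  # lambdas and closures cannot be pickled
    os.makedirs(folder_path, exist_ok=True)
    file_path = os.path.join(folder_path, file_name)
    with open(file_path, "wb") as fh:
        pickle.dump(to_save, fh)
    return file_path


def load_state(file_path):
    """Load a previously pickled state dictionary."""
    with open(file_path, "rb") as fh:
        return pickle.load(fh)


# A lambda stands in for a custom chain that pickle cannot serialize.
state = {"n_concepts": 100, "custom_chain": lambda words: ", ".join(words)}
with tempfile.TemporaryDirectory() as tmp:
    restored = load_state(save_state(state, tmp))

# The chain is gone after the round trip and has to be set again manually.
chain_missing_after_load = restored["custom_chain"] is None
restored["custom_chain"] = state["custom_chain"]
```

Only load files you trust, since unpickling can execute arbitrary code.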
- xboc.XBOCModel.get_top_n_words(self, n: int, cluster: int | None = None) -> list[list[str]] | list[str]
Gets the top N words for all clusters or for a given cluster if provided.
Parameters
- n: int
Number of words to get
- cluster: int | None, optional
Cluster of interest, by default None
Returns
- list[list[str]] | list[str]
Top N words for all clusters or for a given cluster
Raises
- AttributeError
Something went wrong during training.
- IndexError
The number N is higher than the minimum number of words assigned to clusters.
- IndexError
The cluster of interest is out of range.
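The two IndexError conditions can be illustrated with a small stand-alone function. This is only a sketch: top_n_words is a hypothetical helper, and it assumes that the words of each cluster are already ordered by relevance, which this page does not specify.

```python
def top_n_words(words_by_cluster, n, cluster=None):
    """Return the first n words of one cluster, or of every cluster."""
    smallest = min(len(words) for words in words_by_cluster)
    if n > smallest:
        raise IndexError(f"n={n} exceeds the smallest cluster size ({smallest})")
    if cluster is None:
        return [words[:n] for words in words_by_cluster]
    if not 0 <= cluster < len(words_by_cluster):
        raise IndexError(f"cluster {cluster} is out of range")
    return words_by_cluster[cluster][:n]


# ASSUMPTION: each inner list is pre-sorted from most to least relevant.
clusters = [["cat", "dog", "mouse"], ["run", "chase"]]
all_top = top_n_words(clusters, 2)
one_top = top_n_words(clusters, 1, cluster=1)
```

Because n is checked against the smallest cluster, a request that is valid for one cluster can still fail when another cluster holds fewer words.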
- xboc.XBOCModel.get_concept_label(self: Self, index: int | None = None) -> list[str] | str
Returns all concept labels, or the label of a specific concept.
Parameters
- index: int | None, optional
Concept index, by default None
Returns
- np.ndarray | str
The concept label or all concept labels
Raises
- IndexError
If concept index is out of range
- xboc.XBOCModel.load(file_path: str) -> XBOCModel
Loads the model using pickle.
Parameters
- file_path: str
The path to the file.
- xboc.XBOCModel.calculate_scores_for_k_range(k_range: list[int], wv: ndarray, clustering_method: ClusteringMethod = ClusteringMethod.Spherical_KMeans, max_iter: int = 100, verbose: bool = False, random_state: int = 42) -> DataFrame
Calculates the BIC and AIC using Gaussian mixture models (GMMs) fitted on the provided word vectors, once for each number of clusters K in the provided list. Additionally, this function calculates the silhouette, Davies-Bouldin and Calinski-Harabasz scores for the same values of K, using the provided clustering method.
Parameters
- k_range: list[int]
The grid parameter search range
- wv: np.ndarray
Word Vectors
- clustering_method: ClusteringMethod
The clustering method to use for calculating the silhouette, Davies-Bouldin and Calinski-Harabasz scores, by default Spherical_KMeans
- max_iter: int, optional
Maximum number of iterations for the clustering model, by default 100
- verbose: bool, optional
Verbosity, by default False
- random_state: int, optional
Random state to be used, by default 42
Returns
- pd.DataFrame
DataFrame containing all the results
- xboc.XBOCModel._cluster_wv(self: Self, wv: ndarray, num_concept: int, max_iter: int = 10) -> None
Cluster word vector representations using the provided clustering method.
Parameters
- wv: np.ndarray
The word vector representations.
- num_concept: int
Number of concepts to cluster.
- max_iter: int, optional
Maximum number of iterations for the clustering method, by default 10
Raises
- ValueError
If the word vectors that have to be clustered contain any NaN values.
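For readers unfamiliar with the default clustering method, the following is a minimal pure-Python sketch of spherical k-means: vectors are scaled to unit length, assigned to the centroid with the highest cosine similarity, and centroids are re-estimated as normalized means. It is not the library's implementation. The function name, the seed parameter, and the random initialization are assumptions made for illustration; only the NaN check mirrors the ValueError documented above.

```python
import math
import random


def _normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def spherical_kmeans(vectors, num_concept, max_iter=10, seed=42):
    """Cluster vectors by cosine similarity; returns one label per vector."""
    if any(math.isnan(x) for vec in vectors for x in vec):
        raise ValueError("word vectors must not contain NaN values")
    unit = [_normalize(vec) for vec in vectors]
    rng = random.Random(seed)
    centroids = rng.sample(unit, num_concept)  # ASSUMPTION: random init
    labels = []
    for _ in range(max_iter):
        # Assignment step: on unit vectors the dot product is the cosine.
        new_labels = [
            max(range(num_concept),
                key=lambda c: sum(a * b for a, b in zip(vec, centroids[c])))
            for vec in unit
        ]
        # Update step: cluster mean, projected back onto the unit sphere.
        for c in range(num_concept):
            members = [v for v, lab in zip(unit, new_labels) if lab == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = _normalize(mean)
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels


toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = spherical_kmeans(toy_vectors, num_concept=2)
```

On the toy data the first two vectors end up in one cluster and the last two in the other, because each pair points in nearly the same direction.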
- xboc.XBOCModel._create_bow(self: Self) -> None
Create the bag of words that is needed to compute the CF-IDF.
- xboc.XBOCModel._create_w2c(self: Self) -> None
Create the word to concept mapping.
Raises
- IndexError
When the dimensions between words and labels do not match.
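A word-to-concept mapping of this kind can be sketched as pairing each vocabulary entry with the cluster label at the same position. The helper create_w2c below is hypothetical and assumes that idx2word is keyed 0..n-1 in the same order as the cluster labels.

```python
def create_w2c(idx2word, labels):
    """Pair every word with the cluster label found at the same index."""
    if len(idx2word) != len(labels):
        raise IndexError("number of words and number of labels differ")
    return [(idx2word[i], int(label)) for i, label in enumerate(labels)]


# ASSUMPTION: labels[i] is the cluster assigned to the vector of idx2word[i].
idx2word = {0: "cats", 1: "chase", 2: "dogs"}
w2c = create_w2c(idx2word, labels=[0, 1, 0])
```

The length check corresponds to the documented IndexError: a mismatch means the vocabulary and the clustering result no longer describe the same words.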
- xboc.XBOCModel._apply_cfidf(self: Self, csr_matrix: ndarray | spmatrix) -> None
Applies the Concept Frequency-Inverse Document Frequency (CF-IDF) weighting.
Parameters
- csr_matrix: np.ndarray | scipy.sparse.spmatrix
The embedded documents using the concepts.
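As a hedged illustration of the weighting, the sketch below applies the usual TF-IDF form with concepts in place of terms: the share of a document's tokens that fall into a concept, multiplied by the log of the number of documents divided by the number of documents containing that concept. The function cf_idf is hypothetical and works on nested lists; the library's exact normalization or smoothing may differ.

```python
import math


def cf_idf(counts):
    """Weight a documents-by-concepts count matrix with CF-IDF."""
    n_docs = len(counts)
    n_concepts = len(counts[0])
    # Document frequency: in how many documents each concept occurs.
    doc_freq = [sum(1 for row in counts if row[c] > 0)
                for c in range(n_concepts)]
    weighted = []
    for row in counts:
        total = sum(row)  # number of concept occurrences in this document
        weighted.append([
            (row[c] / total) * math.log(n_docs / doc_freq[c])
            if total and doc_freq[c] else 0.0
            for c in range(n_concepts)
        ])
    return weighted


# Concept 0 occurs only in the first document; concept 1 occurs in both.
toy_counts = [[2, 1], [0, 3]]
toy_weights = cf_idf(toy_counts)
```

A concept that appears in every document gets weight zero, since log(1) = 0, which is how the weighting suppresses concepts that do not discriminate between documents.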
Types
- xboc.LLMModel()
All model-specific prompt implementations.
- xboc.LabelingImplementation()
The LangChain labeling implementation to use. The user can specify whether to use the pre-defined templates or a custom chain that they provide.
- xboc.ClusteringMethod()
Define all supported clustering methods.
- xboc.Tokenizer()
Abstract wrapper around tokenizers. Subclasses should implement the __call__ method to allow tokenization of new documents.
Ensures compatibility with pickle.
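The idea of a callable, pickle-friendly tokenizer wrapper can be sketched as follows. SketchTokenizer and RegexTokenizer are hypothetical stand-ins, not the xboc.Tokenizer class itself, and the __call__ signature shown is an assumption. Defining the tokenizer as a module-level class that stores only plain data, rather than as a lambda or closure, is what keeps instances picklable.

```python
import re
from abc import ABC, abstractmethod


class SketchTokenizer(ABC):
    """Hypothetical stand-in for an abstract tokenizer wrapper."""

    @abstractmethod
    def __call__(self, text: str) -> list[str]:
        """Split one document into tokens."""


class RegexTokenizer(SketchTokenizer):
    """Lower-cases the text and extracts tokens with a regular expression."""

    def __init__(self, pattern: str = r"\w+") -> None:
        # Store the pattern string only, so the instance holds plain data.
        self.pattern = pattern

    def __call__(self, text: str) -> list[str]:
        return re.findall(self.pattern, text.lower())


tokenize = RegexTokenizer()
tokens = tokenize("Cats chase mice.")
```

The tokenizer used for new documents should split text the same way the training corpus was preprocessed, otherwise tokens may not match the vocabulary.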