
operations

create.stopwords_list

Documentation

Create a list of stopwords from one or multiple sources.

This will download nltk stopwords if necessary, and merge all input lists into a single, sorted list without duplicates.

Inputs
| field name | type | description | required | default |
| --- | --- | --- | --- | --- |
| languages | list | A list of languages; used to retrieve language-specific stopwords from nltk. | no | |
| stopwords | list | A list of additional, custom stopwords. | no | |
Outputs
| field name | type | description | required | default |
| --- | --- | --- | --- | --- |
| stopwords_list | list | A sorted list of unique stopwords. | yes | |
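
For orientation, the following is a minimal sketch of what this operation does conceptually, assuming NLTK as the stopword source. It is not the module's actual implementation, and the function name is hypothetical.

```python
# A minimal sketch (not the module's actual implementation) of how a merged
# stopword list could be assembled from NLTK language lists plus custom words.
import nltk
from nltk.corpus import stopwords

def create_stopwords_list(languages=None, custom_stopwords=None):
    """Merge NLTK stopwords for the given languages with custom stopwords
    into a single sorted list without duplicates."""
    nltk.download("stopwords", quiet=True)  # download NLTK stopwords if necessary
    merged = set(custom_stopwords or [])
    for lang in languages or []:
        merged.update(stopwords.words(lang))  # e.g. "english", "german"
    return sorted(merged)

# Example: English and German stopwords plus two project-specific terms.
print(create_stopwords_list(["english", "german"], ["ocr", "page"])[:10])
```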

generate.LDA.for.tokens_array

Documentation

Perform Latent Dirichlet Allocation on a tokenized corpus.

This module computes one model for each topic count in the range provided by the user.

Inputs
| field name | type | description | required | default |
| --- | --- | --- | --- | --- |
| tokens_array | array | The text corpus. | yes | |
| num_topics_min | integer | The minimum number of topics. | no | 7 |
| num_topics_max | integer | The maximum number of topics. | no | 7 |
| compute_coherence | boolean | Whether to compute the coherence score for each model. | no | False |
| words_per_topic | integer | How many words per topic to include in the result model. | no | 10 |
Outputs
| field name | type | description | required | default |
| --- | --- | --- | --- | --- |
| topic_models | dict | A dictionary with one topic model table for each number of topics. | yes | |
| coherence_table | table | Coherence details. | no | |
| coherence_map | dict | A map with the coherence value for every number of topics. | yes | |
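
To illustrate the behaviour described above, here is a rough sketch that assumes a gensim backend (an assumption; the module may use a different implementation). The function name and the structure of the returned values are hypothetical.

```python
# A rough sketch, assuming gensim: fit one LDA model per topic count in the
# requested range and optionally compute a coherence score for each.
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

def lda_for_tokens_array(tokens_array, num_topics_min=7, num_topics_max=7,
                         compute_coherence=False, words_per_topic=10):
    dictionary = Dictionary(tokens_array)
    corpus = [dictionary.doc2bow(doc) for doc in tokens_array]

    topic_models, coherence_map = {}, {}
    for num_topics in range(num_topics_min, num_topics_max + 1):
        model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)
        # keep the top `words_per_topic` words for each topic
        topic_models[num_topics] = model.show_topics(
            num_topics=num_topics, num_words=words_per_topic, formatted=False
        )
        if compute_coherence:
            cm = CoherenceModel(model=model, texts=tokens_array,
                                dictionary=dictionary, coherence="c_v")
            coherence_map[num_topics] = cm.get_coherence()
    return topic_models, coherence_map

# Toy usage; coherence scores are only meaningful on a realistically sized corpus.
tokens = [
    ["topic", "modelling", "example", "corpus"],
    ["another", "small", "example", "document"],
]
models, coherences = lda_for_tokens_array(tokens, num_topics_min=2, num_topics_max=3)
```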

preprocess.tokens_array

Documentation

Preprocess lists of tokens, including lowercasing, removal of special characters, etc.

Lowercasing: Lowercase the words. This operation is a double-edged sword. It can be effective at yielding potentially better results for relatively small datasets or datasets with a high percentage of OCR mistakes. For instance, if lowercasing is not performed, the algorithm will treat USA, Usa, usa, UsA, uSA, etc. as distinct tokens, even though they may all refer to the same entity. On the other hand, if the dataset does not contain such OCR mistakes, lowercasing may make it difficult to distinguish between homonyms (e.g. US, the country, vs. us, the pronoun) and thus make interpreting the topics much harder.

Removing stopwords and words with fewer than three characters: Remove low-information words. These are typically articles, pronouns, prepositions, conjunctions, etc., which are not semantically salient. Numerous stopword lists are available for many, though not all, languages, and they can easily be adapted to the individual researcher's needs. Removing words with fewer than three characters may additionally remove many OCR mistakes. Both operations have the dual advantage of yielding more reliable results while reducing the size of the dataset, which in turn reduces the required processing power. This step can therefore hardly be considered optional in topic modelling.

Noise removal: Remove elements such as punctuation marks, special characters, numbers, HTML formatting, etc. This operation is again concerned with removing elements that are likely irrelevant to the text analysis and may in fact interfere with it. Depending on the dataset and research question, this operation can be essential.
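
The following plain-Python snippet is purely illustrative of the three steps described above (lowercasing, stopword and short-token removal, noise removal); the variable names and the tiny stopword list are made up for the example.

```python
# Illustrative only: a plain-Python version of the three preprocessing steps.
import re

STOPWORDS = {"the", "a", "of", "and", "in"}    # a tiny example list
tokens = ["The", "USA,", "ex1ample", "of", "OCR-noise", "in", "1876", "texts!"]

cleaned = []
for token in tokens:
    token = token.lower()                      # lowercasing
    token = re.sub(r"[^a-z]", "", token)       # noise removal: keep letters only
    if len(token) >= 3 and token not in STOPWORDS:  # drop short tokens and stopwords
        cleaned.append(token)

print(cleaned)  # ['usa', 'example', 'ocrnoise', 'texts']
```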

Inputs
| field name | type | description | required | default |
| --- | --- | --- | --- | --- |
| tokens_array | array | The tokens array to pre-process. | yes | |
| to_lowercase | boolean | Apply lowercasing to the text. | no | False |
| remove_alphanumeric | boolean | Remove all tokens that include numbers (e.g. ex1ample). | no | False |
| remove_non_alpha | boolean | Remove all tokens that include punctuation and numbers (e.g. ex1a.mple). | no | False |
| remove_all_numeric | boolean | Remove all tokens that contain numbers only (e.g. 876). | no | False |
| remove_short_tokens | integer | Remove tokens whose length is less than or equal to this value. If the value is <= 0, no filtering is done. | no | 0 |
| remove_stopwords | list | A list of stopwords to remove. | no | |
Outputs
| field name | type | description | required | default |
| --- | --- | --- | --- | --- |
| tokens_array | array | The pre-processed content, as an array of lists of strings. | yes | |
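
The three removal flags are easy to confuse. The sketch below shows one plausible interpretation of their behaviour, derived only from the flag descriptions and examples in the Inputs table above; it is not the module's actual code.

```python
# An interpretation of the three removal flags, based on the table above.
tokens = ["example", "ex1ample", "ex1a.mple", "876", "usa"]

def keep(token, remove_alphanumeric=False, remove_non_alpha=False,
         remove_all_numeric=False):
    if remove_alphanumeric and any(c.isdigit() for c in token) and not token.isdigit():
        return False   # mixed letters and digits, e.g. "ex1ample"
    if remove_non_alpha and not token.isalpha():
        return False   # anything containing digits or punctuation, e.g. "ex1a.mple"
    if remove_all_numeric and token.isdigit():
        return False   # digits only, e.g. "876"
    return True

print([t for t in tokens if keep(t, remove_non_alpha=True)])
# ['example', 'usa']
```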

tokenize.string

Documentation

Tokenize a string.

Inputs
| field name | type | description | required | default |
| --- | --- | --- | --- | --- |
| text | string | The text to tokenize. | yes | |
Outputs
| field name | type | description | required | default |
| --- | --- | --- | --- | --- |
| token_list | list | The tokenized version of the input text. | yes | |
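
As a quick illustration, a string can be tokenized with an NLTK-style word tokenizer as below; whether the module uses NLTK internally is an assumption here.

```python
# A quick illustration using an NLTK tokenizer (the module may tokenize differently).
import nltk
from nltk.tokenize import word_tokenize

# Depending on the NLTK version, one of these resources is required.
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)

token_list = word_tokenize("Topic modelling needs tokenized input.")
print(token_list)  # ['Topic', 'modelling', 'needs', 'tokenized', 'input', '.']
```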

tokenize.texts_array

Documentation

Split sentences into words or words into characters.

In other words, this operation establishes word boundaries (i.e., tokens), which is very helpful for finding patterns. It is also the typical step prior to stemming and lemmatization.

Inputs
| field name | type | description | required | default |
| --- | --- | --- | --- | --- |
| texts_array | array | An array of text items to be tokenized. | yes | |
| tokenize_by_word | boolean | Whether to tokenize by word (default) or by character. | no | True |
Outputs
| field name | type | description | required | default |
| --- | --- | --- | --- | --- |
| tokens_array | array | The tokenized content, as an array of lists of strings. | yes | |
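
The sketch below illustrates the word/character distinction with a simple regex-based splitter; the actual operation may rely on a more sophisticated tokenizer, and the function name is hypothetical.

```python
# Illustration of tokenizing an array of texts by word or by character.
import re

def tokenize_texts_array(texts_array, tokenize_by_word=True):
    if tokenize_by_word:
        return [re.findall(r"\w+", text) for text in texts_array]  # word tokens
    return [list(text) for text in texts_array]                    # character tokens

texts = ["Topic modelling", "needs tokens"]
print(tokenize_texts_array(texts))
# [['Topic', 'modelling'], ['needs', 'tokens']]
print(tokenize_texts_array(texts, tokenize_by_word=False))
# [['T', 'o', 'p', 'i', 'c', ' ', 'm', ...], ['n', 'e', 'e', 'd', 's', ...]]
```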