operations

`create.stopwords_list`

Documentation

Create a list of stopwords from one or multiple sources.

This downloads the NLTK stopwords if necessary and merges all input lists into a single sorted list without duplicates.

Inputs

field name	type	description	required	default
languages	list	A list of languages, used to retrieve language-specific stopwords from NLTK.	no
stopwords	list	A list of additional, custom stopwords.	no

Outputs

field name	type	description	required	default
stopwords_list	list	A sorted list of unique stopwords.	yes
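
The merge step described above can be sketched in plain Python. This is only an illustration, not the module's implementation: the function name `create_stopwords_list`, the `load_language` parameter, and the `SAMPLE` data are hypothetical stand-ins for the NLTK-backed lookup.

```python
from typing import Callable, Iterable, List, Optional


def create_stopwords_list(
    languages: Optional[Iterable[str]] = None,
    stopwords: Optional[Iterable[str]] = None,
    load_language: Optional[Callable[[str], Iterable[str]]] = None,
) -> List[str]:
    """Merge language-specific and custom stopwords into a sorted, unique list."""
    merged = set()
    for language in languages or []:
        if load_language is None:
            raise ValueError("a loader is required when languages are given")
        # The loader is a hypothetical hook that returns the words for one language.
        merged.update(load_language(language))
    merged.update(stopwords or [])
    # A set removes duplicates; sorted() gives a stable, ordered result.
    return sorted(merged)


# Hypothetical in-memory data standing in for a downloaded stopwords corpus.
SAMPLE = {"english": ["the", "a", "and"], "german": ["der", "und", "die"]}

result = create_stopwords_list(
    languages=["english", "german"],
    stopwords=["and", "etc"],
    load_language=SAMPLE.__getitem__,
)
print(result)  # ['a', 'and', 'der', 'die', 'etc', 'the', 'und']
```

With NLTK installed and the corpus fetched via `nltk.download("stopwords")`, `nltk.corpus.stopwords.words` could presumably be passed as the loader instead of the sample dictionary.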

`generate.LDA.for.tokens_array`

Documentation

Perform Latent Dirichlet Allocation on a tokenized corpus.

This module computes one model for each number of topics in the range provided by the user.

Inputs

field name	type	description	required	default
tokens_array	array	The text corpus.	yes
num_topics_min	integer	The minimum number of topics.	no	7
num_topics_max	integer	The maximum number of topics.	no	7
compute_coherence	boolean	Whether to compute the coherence score for each model.	no	False
words_per_topic	integer	How many words per topic to put in the result model.	no	10

Outputs

field name	type	description	required
topic_models	dict	A dictionary with one coherence model table for each number of topics.	yes
coherence_table	table	Coherence details.	no
coherence_map	dict	A map with the coherence value for every number of topics.	yes
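
How the topic range might translate into one model per topic count can be sketched as follows. The sketch assumes the range is inclusive at both ends (the defaults of 7 and 7 would otherwise yield no model); `models_for_topic_range` and `fake_train` are hypothetical names, and the placeholder trainer does not fit an actual LDA model.

```python
from typing import Callable, Dict, List, Sequence


def models_for_topic_range(
    tokens_array: Sequence[List[str]],
    train: Callable[[Sequence[List[str]], int], object],
    num_topics_min: int = 7,
    num_topics_max: int = 7,
) -> Dict[int, object]:
    """Train one model per topic count and key the results by that count."""
    if num_topics_min < 1 or num_topics_max < num_topics_min:
        raise ValueError("expected 1 <= num_topics_min <= num_topics_max")
    # Assumption: both bounds are inclusive, so min == max yields one model.
    return {
        num_topics: train(tokens_array, num_topics)
        for num_topics in range(num_topics_min, num_topics_max + 1)
    }


def fake_train(tokens_array: Sequence[List[str]], num_topics: int) -> str:
    # Placeholder trainer: returns a label instead of fitting a real model.
    return f"model with {num_topics} topics"


corpus = [["topic", "model"], ["latent", "dirichlet"]]
models = models_for_topic_range(corpus, fake_train, num_topics_min=3, num_topics_max=5)
print(sorted(models))  # [3, 4, 5]
```

In practice the `train` callable would wrap an LDA implementation from a topic-modelling library; keeping it injectable makes the range logic easy to check in isolation.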

`preprocess.tokens_array`

Documentation

Preprocess lists of tokens, including lowercasing, removing special characters, etc.

Lowercasing: Lowercase the words. This operation is a double-edged sword. It can yield better results for relatively small datasets or datasets with a high percentage of OCR mistakes. For instance, if lowercasing is not performed, the algorithm will treat USA, Usa, usa, UsA, uSA, etc. as distinct tokens, even though they may all refer to the same entity. On the other hand, if the dataset does not contain such OCR mistakes, lowercasing may make it difficult to distinguish between homonyms and make interpreting the topics much harder.
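
The trade-off can be made concrete with a small count. The token list below is invented for illustration: the casing variants of "USA" merge as intended, but "US" and "us" collapse into one token as well.

```python
from collections import Counter

# Invented sample: four casing variants of one entity, plus a homonym pair.
tokens = ["USA", "Usa", "usa", "UsA", "US", "us"]

raw_counts = Counter(tokens)
lowered_counts = Counter(token.lower() for token in tokens)

print(len(raw_counts))        # 6: every variant is a distinct token
print(lowered_counts["usa"])  # 4: the variants are merged
print(lowered_counts["us"])   # 2: the country "US" and the pronoun "us" now collide
```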

Removing stopwords and words with fewer than three characters: Remove low-information words. These are typically words such as articles, pronouns, prepositions, conjunctions, etc., which are not semantically salient. Stopword lists are available for many, though not all, languages and can easily be adapted to the individual researcher's needs. Removing words with fewer than three characters may additionally remove many OCR mistakes. Both operations have the dual advantage of yielding more reliable results while reducing the size of the dataset, which in turn reduces the required processing power. This step can therefore hardly be considered optional in topic modelling (TM).

Noise removal: Remove elements such as punctuation marks, special characters, numbers, HTML formatting, etc. This operation is again concerned with removing elements that may not be relevant to the text analysis and may in fact interfere with it. Depending on the dataset and research question, this operation can become essential.

Inputs

field name	type	description	required	default
tokens_array	array	The tokens array to pre-process.	yes
to_lowercase	boolean	Apply lowercasing to the text.	no	False
remove_alphanumeric	boolean	Remove all tokens that include numbers (e.g. ex1ample).	no	False
remove_non_alpha	boolean	Remove all tokens that include punctuation and numbers (e.g. ex1a.mple).	no	False
remove_all_numeric	boolean	Remove all tokens that contain numbers only (e.g. 876).	no	False
remove_short_tokens	integer	Remove tokens whose length is less than or equal to this value. If the value is <= 0, no filtering is done.	no	0
remove_stopwords	list	Remove stopwords.	no

Outputs

field name	type	description	required	default
tokens_array	array	The pre-processed content, as an array of lists of strings.	yes
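
One way the input flags could combine is sketched below. The parameter names mirror the Inputs table, but the function `preprocess_tokens`, the order of the checks, and the exact interpretation of each flag (for example, using `str.isalpha` for `remove_non_alpha`) are assumptions made for illustration, not the module's confirmed behaviour.

```python
from typing import Iterable, List, Optional


def preprocess_tokens(
    tokens: Iterable[str],
    to_lowercase: bool = False,
    remove_alphanumeric: bool = False,
    remove_non_alpha: bool = False,
    remove_all_numeric: bool = False,
    remove_short_tokens: int = 0,
    remove_stopwords: Optional[Iterable[str]] = None,
) -> List[str]:
    """Filter one document's tokens according to the selected options."""
    stopwords = set(remove_stopwords or [])
    result = []
    for token in tokens:
        if to_lowercase:
            token = token.lower()
        # Assumed reading: drop tokens that contain any digit (e.g. "ex1ample").
        if remove_alphanumeric and any(ch.isdigit() for ch in token):
            continue
        # Assumed reading: keep only purely alphabetic tokens.
        if remove_non_alpha and not token.isalpha():
            continue
        # Drop tokens made up of digits only (e.g. "876").
        if remove_all_numeric and token.isdigit():
            continue
        # A threshold <= 0 disables the length filter.
        if remove_short_tokens > 0 and len(token) <= remove_short_tokens:
            continue
        if token in stopwords:
            continue
        result.append(token)
    return result


tokens_array = [["The", "USA", "ex1ample", "ex1a.mple", "876", "of", "archive"]]
cleaned = [
    preprocess_tokens(
        doc,
        to_lowercase=True,
        remove_non_alpha=True,
        remove_short_tokens=2,
        remove_stopwords=["the"],
    )
    for doc in tokens_array
]
print(cleaned)  # [['usa', 'archive']]
```

Lowercasing is applied before the stopword check here so that a lowercase stopword list also matches capitalised tokens such as "The".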

`tokenize.string`

Documentation

Tokenize a string.

Inputs

field name	type	description	required	default
text	string	The text to tokenize.	yes

Outputs

field name	type	description	required	default
token_list	list	The tokenized version of the input text.	yes

`tokenize.texts_array`

Documentation

Split sentences into words or words into characters.

In other words, this operation establishes the word boundaries (i.e., tokens), a very helpful way of finding patterns. It is also the typical step prior to stemming and lemmatization.

Inputs

field name	type	description	required	default
texts_array	array	An array of text items to be tokenized.	yes
tokenize_by_word	boolean	Whether to tokenize by word (default), or character.	no	True

Outputs

field name	type	description	required	default
tokens_array	array	The tokenized content, as an array of lists of strings.	yes
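
The two tokenization modes can be illustrated with a minimal sketch. The regular expression below is a simple stand-in chosen for this example; the actual operation may well rely on a dedicated tokenizer, and `tokenize_texts` is a hypothetical name.

```python
import re
from typing import List, Sequence


def tokenize_texts(
    texts_array: Sequence[str], tokenize_by_word: bool = True
) -> List[List[str]]:
    """Split each text into words (default) or into single characters."""
    if tokenize_by_word:
        # Assumed rule: runs of word characters, or single punctuation marks.
        return [re.findall(r"\w+|[^\w\s]", text) for text in texts_array]
    # Character mode keeps every character, including whitespace.
    return [list(text) for text in texts_array]


by_word = tokenize_texts(["Hello, world!", "Topic models"])
by_char = tokenize_texts(["ab c"], tokenize_by_word=False)
print(by_word)  # [['Hello', ',', 'world', '!'], ['Topic', 'models']]
print(by_char)  # [['a', 'b', ' ', 'c']]
```

Either mode returns one list of strings per input text, which matches the shape described for `tokens_array`.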
