Operations
create.stopwords_list
Documentation
Create a list of stopwords from one or multiple sources.
This will download nltk stopwords if necessary, and
merge all input lists into a single, sorted list without
duplicates.
Author(s)
Markus Binsteiner markus@frkl.io
Context
Tags language_processing
Labels package: kiara_plugin.language_processing
References source_repo:
https://github.com/DHARPA-Project/kiara_pl…
documentation:
https://DHARPA-Project.github.io/kiara_plu…
Operation details
Inputs
field name       type   description                                  required   default
─────────────────────────────────────────────────────────────────────────────────────────
languages        list   A list of languages; will be used to         no         -- no default --
                        retrieve language stopwords from nltk.
stopword_lists   list   A list of lists of stopwords.                no         -- no default --
Outputs
field name       type   description
──────────────────────────────────────────────────────
stopwords_list   list   A sorted list of unique stopwords.
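As a rough illustration of what this operation does internally, here is a minimal Python sketch using plain nltk rather than the kiara API (all variable names are illustrative only):

    import nltk
    from nltk.corpus import stopwords

    # Fetch the NLTK stopword corpora if they are not available locally.
    nltk.download("stopwords", quiet=True)

    languages = ["english", "german"]            # example input
    stopword_lists = [["foo", "bar"], ["bar"]]   # example input

    merged = set()
    for lang in languages:
        merged.update(stopwords.words(lang))     # default NLTK list per language
    for custom in stopword_lists:
        merged.update(custom)                    # user-supplied lists

    stopwords_list = sorted(merged)              # single, sorted list without duplicates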
generate.LDA.for.tokens_array
Documentation
Perform Latent Dirichlet Allocation on a tokenized
corpus.
This module computes models for a range of topic counts provided by the
user.
Author(s)
Markus Binsteiner markus@frkl.io
Context
Tags language_processing, LDA, tokens
Labels package: kiara_plugin.language_processing
References source_repo:
https://github.com/DHARPA-Project/kiara_pl…
documentation:
https://DHARPA-Project.github.io/kiara_plu…
Operation details
Inputs
field name          type      description                                 required   default
───────────────────────────────────────────────────────────────────────────────────────────────
tokens_array        array     The text corpus.                            yes        -- no default --
num_topics_min      integer   The minimal number of topics.               no         7
num_topics_max      integer   The max number of topics.                   no         -- no default --
compute_coherence   boolean   Whether to compute the coherence score      no         False
                              for each model.
words_per_topic     integer   How many words per topic to put in the      no         10
                              result model.
Outputs
field name        type    description
──────────────────────────────────────────────────────────
topic_models      dict    A dictionary with one coherence model
                          table for each number of topics.
coherence_table   table   Coherence details.
coherence_map     dict    A map with the coherence value for every
                          number of topics.
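A minimal sketch of such a topic-count sweep with gensim, which provides both LDA and coherence models (whether the plugin uses exactly these calls is an assumption, and the variable names are illustrative):

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel
    from gensim.models.coherencemodel import CoherenceModel

    tokens_array = [["some", "tokenized", "text"],   # example corpus
                    ["another", "tokenized", "document"],
                    ["a", "third", "text"]]
    num_topics_min, num_topics_max = 2, 5
    words_per_topic = 10

    dictionary = Dictionary(tokens_array)
    corpus = [dictionary.doc2bow(doc) for doc in tokens_array]

    topic_models, coherence_map = {}, {}
    # One model per topic count in the user-provided range.
    for num_topics in range(num_topics_min, num_topics_max + 1):
        model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)
        topic_models[num_topics] = model.print_topics(num_words=words_per_topic)
        coherence = CoherenceModel(model=model, texts=tokens_array,
                                   dictionary=dictionary, coherence="c_v")
        coherence_map[num_topics] = coherence.get_coherence()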
preprocess.tokens_array
Documentation
Preprocess lists of tokens, including lowercasing, removal of special
characters, etc.
Lowercasing: Lowercase the words. This operation is a
double-edged sword. It can be effective at yielding
potentially better results in the case of relatively
small datasets or datasets with a high percentage of
OCR mistakes. For instance, if lowercasing is not
performed, the algorithm will treat USA, Usa, usa, UsA,
uSA, etc. as distinct tokens, even though they may all
refer to the same entity. On the other hand, if the
dataset does not contain such OCR mistakes, then it may
become difficult to distinguish between homonyms and
make interpreting the topics much harder.
Removing stopwords and words with fewer than three
characters: Remove low-information words. These are
typically words such as articles, pronouns,
prepositions, conjunctions, etc. which are not
semantically salient. There are numerous stopword lists
available for many, though not all, languages, which can
be easily adapted to the individual researcher's needs.
Removing words with fewer than three characters may
additionally remove many OCR mistakes. Both these
operations have the dual advantage of yielding more
reliable results while reducing the size of the dataset,
thus in turn reducing the required processing power.
This step can therefore hardly be considered optional in
topic modelling (TM).
Noise removal: Remove elements such as punctuation
marks, special characters, numbers, HTML formatting,
etc. This operation is again concerned with removing
elements that may not be relevant to the text analysis
and in fact interfere with it. Depending on the dataset
and research question, this operation can become
essential.
Author(s)
Markus Binsteiner markus@frkl.io
Context
Tags language_processing, tokens, preprocess
Labels package: kiara_plugin.language_processing
References source_repo:
https://github.com/DHARPA-Project/kiara_pl…
documentation:
https://DHARPA-Project.github.io/kiara_plu…
Operation details
Inputs
field name            type      description                                   required   default
────────────────────────────────────────────────────────────────────────────────────────────────────
tokens_array          array     The tokens array to pre-process.              yes        -- no default --
to_lowercase          boolean   Apply lowercasing to the text.                no         False
remove_alphanumeric   boolean   Remove all tokens that include numbers        no         False
                                (e.g. ex1ample).
remove_non_alpha      boolean   Remove all tokens that include punctuation    no         False
                                and numbers (e.g. ex1a.mple).
remove_all_numeric    boolean   Remove all tokens that contain numbers        no         False
                                only (e.g. 876).
remove_short_tokens   integer   Remove tokens shorter than a certain          no         False
                                length. If value is <= 0, no filtering
                                will be done.
remove_stopwords      list      Remove stopwords.                             no         -- no default --
Outputs
field name     type    description
─────────────────────────────────────────────────────────
tokens_array   array   The pre-processed content, as an array
                       of lists of strings.
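The switches above compose into a simple per-token filter chain. A minimal sketch of that logic in plain Python (the function name and exact filter order are assumptions for illustration, not the plugin's actual internals):

    def preprocess_tokens(tokens, to_lowercase=False, remove_non_alpha=False,
                          remove_all_numeric=False, min_token_length=0,
                          stopwords=frozenset()):
        result = []
        for token in tokens:
            if to_lowercase:
                token = token.lower()                  # USA, Usa, usa -> usa
            if remove_non_alpha and not token.isalpha():
                continue                               # drops tokens with punctuation/digits
            if remove_all_numeric and token.isdigit():
                continue                               # drops purely numeric tokens, e.g. 876
            if min_token_length > 0 and len(token) < min_token_length:
                continue                               # drops short tokens, often OCR noise
            if token in stopwords:
                continue
            result.append(token)
        return result

    # Applied per document of a tokens array:
    tokens_array = [["The", "USA", "876", "is", "a", "country"]]
    cleaned = [preprocess_tokens(doc, to_lowercase=True, remove_all_numeric=True,
                                 min_token_length=3, stopwords=frozenset({"the"}))
               for doc in tokens_array]   # -> [["usa", "country"]]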
remove_stopwords.from.tokens_array
Documentation
Remove stopwords from an array of token-lists.
Author(s)
Markus Binsteiner markus@frkl.io
Context
Tags language_processing
Labels package: kiara_plugin.language_processing
References source_repo:
https://github.com/DHARPA-Project/kiara_pl…
documentation:
https://DHARPA-Project.github.io/kiara_plu…
Operation details
Inputs
field name             type    description                               required   default
───────────────────────────────────────────────────────────────────────────────────────────────
tokens_array           array   An array of string lists (a list of       yes        -- no default --
                               tokens).
languages              list    A list of language names to use           no         -- no default --
                               default stopword lists for.
additional_stopwords   list    A list of additional, custom              no         -- no default --
                               stopwords.
Outputs
field name     type    description
────────────────────────────────────────────────────────
tokens_array   array   An array of string lists, with the
                       stopwords removed.
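Conceptually this is the filtering counterpart to create.stopwords_list: merge the default NLTK lists for the requested languages with any custom stopwords, then filter every token list. A minimal sketch with illustrative names, not the plugin's actual code:

    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)

    tokens_array = [["this", "is", "a", "test"], ["ein", "kleiner", "test"]]
    languages = ["english", "german"]
    additional_stopwords = ["test"]

    to_remove = set(additional_stopwords)
    for lang in languages:
        to_remove.update(stopwords.words(lang))     # default lists per language

    tokens_array = [[tok for tok in doc if tok not in to_remove]
                    for doc in tokens_array]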
tokenize.string
Documentation
Tokenize a string.
Author(s)
Markus Binsteiner markus@frkl.io
Context
Tags language_processing
Labels package: kiara_plugin.language_processing
References source_repo:
https://github.com/DHARPA-Project/kiara_pl…
documentation:
https://DHARPA-Project.github.io/kiara_plu…
Operation details
Inputs
field name   type     description             required   default
───────────────────────────────────────────────────────────────────────
text         string   The text to tokenize.   yes        -- no default --
Outputs
field name   type   description
─────────────────────────────────────────────────────
token_list   list   The tokenized version of the input text.
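A minimal sketch of word tokenization with nltk (the plugin's exact tokenizer choice is an assumption here; the point is the input/output shape):

    import nltk
    from nltk.tokenize import word_tokenize

    # Tokenizer models; newer nltk releases may also require "punkt_tab".
    nltk.download("punkt", quiet=True)

    token_list = word_tokenize("The quick brown fox jumps over the lazy dog.")
    # -> ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']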
tokenize.texts_array
Documentation
Split sentences into words or words into characters.
In other words, this operation establishes the word
boundaries (i.e., tokens), a very helpful way of finding
patterns. It is also the typical step prior to stemming
and lemmatization.
Author(s)
Markus Binsteiner markus@frkl.io
Context
Tags language_processing, tokenize, tokens
Labels package: kiara_plugin.language_processing
References source_repo:
https://github.com/DHARPA-Project/kiara_pl…
documentation:
https://DHARPA-Project.github.io/kiara_plu…
Operation details
Inputs
field name         type      description                                required   default
──────────────────────────────────────────────────────────────────────────────────────────────
texts_array        array     An array of text items to be tokenized.    yes        -- no default --
tokenize_by_word   boolean   Whether to tokenize by word (default)      no         True
                             or character.
Outputs
field name     type    description
────────────────────────────────────────────────────────
tokens_array   array   The tokenized content, as an array of
                       lists of strings.
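A minimal sketch of the two tokenization modes, assuming nltk's word_tokenize for the word case (illustrative only, not necessarily the plugin's implementation):

    import nltk
    from nltk.tokenize import word_tokenize

    nltk.download("punkt", quiet=True)

    texts_array = ["First document.", "Second one."]
    tokenize_by_word = True

    if tokenize_by_word:
        tokens_array = [word_tokenize(text) for text in texts_array]   # word tokens
    else:
        tokens_array = [list(text) for text in texts_array]            # character tokens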