kiara: Natural Language Processing (NLP)¶
Welcome back! Now that we're comfortable with what kiara looks like and what it can do to help track your data and your research process, let's try out some of the digital analysis tools, starting with Natural Language Processing.
Why NLP?¶
First of all, why bother with NLP? Natural language processing allows researchers to sort through unstructured data such as plain text. In other words, once text is represented numerically, computers can process language and perform advanced operations such as text categorisation, labelling, summarisation and so on. There are two main stages in NLP: pre-processing and analysis (that is, algorithm development and/or implementation). Here we cover both stages. In the first part, we work through some of the most common pre-processing operations, such as tokenisation, lowercasing and removing stopwords. In the second part, we use a widely used text analysis method called topic modelling. For more information about the pre-processing operations and topic modelling, and for a more in-depth discussion aimed particularly at humanities research, please refer to this repository here.
Starting the Process¶
Let's start by double checking that we have all the required plugins, and setting up an API for us to use kiara. We'll do this all in one go this time, but if you're unsure, feel free to head back to the installation notebook to look over this section again.
try:
    from kiara_plugin.jupyter import ensure_kiara_plugins
except ImportError:
    # The plugin is missing, so install it into the current environment first.
    import sys
    print("Installing 'kiara_plugin.jupyter'...")
    !{sys.executable} -m pip install -q kiara_plugin.jupyter
    from kiara_plugin.jupyter import ensure_kiara_plugins

ensure_kiara_plugins()
from kiara import KiaraAPI
kiara = KiaraAPI.instance()
Now we're all set up, we want to download some text to work with in our language processing analysis.
For our example here we will be using a relatively small number of texts. This is a sample taken from the larger corpus ChroniclItaly 3.0 (Viola and Fiscarelli 2021, Viola 2021), an open access digital heritage collection of Italian immigrant newspapers published in the United States from 1898 to 1936.
The corpus that we use here includes the digitized (OCRed) front pages of the Italian-language newspapers La Rassegna and La Ragione as collected from Chronicling America, an Internet-based, searchable database of U.S. newspapers published from 1789 to 1963, made available by the Library of Congress.
These files are also a good example because their filenames already contain important metadata such as the publication date. The file name structure is: LCCNnumber_date_edition_pageNumber_ocr.txt. Therefore, the file name ‘sn84037025_1917-04-14_ed-1_seq-1_ocr.txt’ refers to the OCR text file of the first page of the first edition of La Rassegna published on 14 April 1917. kiara allows us to retrieve both the files and the metadata in the filenames. This is very useful for historical research, and also for keeping track of how we are intervening on our sources. Let's see how this works.
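To make the file name structure concrete, here is a minimal sketch in plain Python that splits a file name into its metadata fields. It is only an illustration and not how kiara handles this internally: the helper name `parse_filename` and the regular expression are assumptions based on the example file name above.

```python
import re
from datetime import date

# Assumed pattern, derived from the example file name:
# LCCN number, ISO date, edition number, page (sequence) number.
FILENAME_PATTERN = re.compile(
    r"^(?P<lccn>[a-z0-9]+)_(?P<date>\d{4}-\d{2}-\d{2})"
    r"_ed-(?P<edition>\d+)_seq-(?P<page>\d+)_ocr\.txt$"
)


def parse_filename(file_name: str) -> dict:
    """Split a corpus file name into metadata fields (hypothetical helper)."""
    match = FILENAME_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"Unexpected file name: {file_name}")
    return {
        "lccn": match["lccn"],
        "date": date.fromisoformat(match["date"]),
        "edition": int(match["edition"]),
        "page": int(match["page"]),
    }


print(parse_filename("sn84037025_1917-04-14_ed-1_seq-1_ocr.txt"))
```

Raising an error on file names that do not match keeps unexpected files from silently slipping into the corpus with missing dates.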
kiara.list_operation_ids('download')
['download.file', 'download.file_bundle']
Last time we only wanted one file, but with language processing we might want a bigger corpus.
Let's have a look at download.file_bundle this time.
kiara.retrieve_operation_info('download.file_bundle')
Documentation   -- n/a --
Author(s)       Markus Binsteiner markus@frkl.io
Context
  Tags          onboarding
  Labels        package: kiara_plugin.onboarding
  References    source_repo: https://github.com/DHARPA-Project/kiara_plugin.onboarding
                documentation: https://DHARPA-Project.github.io/kiara_plugin.onboarding/
Operation details
  Documentation -- n/a --
  Inputs
    field name   type     description                                                     Required   Default
    ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
    url          string   The url of an archive/zip file to download.                     yes        -- no default --
    sub_path     string   A relative path to select only a sub-folder from the archive.   no         -- no default --
  Outputs
    field name          type          description
    ──────────────────────────────────────────────────────────────────
    file_bundle         file_bundle   The downloaded file bundle.
    download_metadata   dict          Metadata about the download.
So we still want a url, but for a zip file that we can download. Here's some example data for us to use.
Again, we need to define the inputs, use kiara.run_job with our chosen operation download.file_bundle and store this as our outputs.
inputs = {
    "url": "https://github.com/DHARPA-Project/kiara.examples/archive/refs/heads/main.zip",
    "sub_path": "kiara.examples-main/examples/data/text_corpus/data"
}
outputs = kiara.run_job('download.file_bundle', inputs=inputs)
outputs
patool: Extracting /var/folders/5h/j266_5ss6qj7x37qd77pydh0p248rs/T/tmpjr9umxt6 ...
patool: ... /var/folders/5h/j266_5ss6qj7x37qd77pydh0p248rs/T/tmpjr9umxt6 extracted to `/var/folders/5h/j266_5ss6qj7x37qd77pydh0p248rs/T/tmpnha6uj9d'.
field: download_metadata
  dict data
    {
      "response_headers": [
        {
          "access-control-allow-origin": "https://render.githubusercontent.com",
          "content-disposition": "attachment; filename=kiara.examples-main.zip",
          "content-security-policy": "default-src 'none'; style-src 'unsafe-inline'; sandbox",
          "content-type": "application/zip",
          "etag": "W/\"34f87e1d6dc5c913d21b59d7aecf516bca3a32605b6e1504ae97eec4611cc862\"",
          "strict-transport-security": "max-age=31536000",
          "vary": "Authorization,Accept-Encoding,Origin",
          "x-content-type-options": "nosniff",
          "x-frame-options": "deny",
          "x-xss-protection": "1; mode=block",
          "date": "Fri, 27 Jan 2023 07:48:53 GMT",
          "transfer-encoding": "chunked",
          "x-github-request-id": "542C:C4A4:31636B:3C82F9:63D381E5"
        },
        {
          "server": "GitHub.com",
          "date": "Fri, 27 Jan 2023 07:48:53 GMT",
          "content-type": "text/html; charset=utf-8",
          "vary": "X-PJAX, X-PJAX-Container, Turbo-Visit, Turbo-Frame, Accept-Encoding, Accept, X…
          "location": "https://codeload.github.com/DHARPA-Project/kiara.examples/zip/refs/heads/m…
          "cache-control": "max-age=0, private",
          "strict-transport-security": "max-age=31536000; includeSubdomains; preload",
          "x-frame-options": "deny",
          "x-content-type-options": "nosniff",
          "x-xss-protection": "0",
          "referrer-policy": "no-referrer-when-downgrade",
          "content-security-policy": "default-src 'none'; base-uri 'self'; block-all-mixed-conten…
          "content-length": "0",
          "x-github-request-id": "77A9:872B:7D71061:81F6ABB:63D381E5"
        }
      ],
      "request_time": "2023-01-27T07:48:53.906919+00:00"
    }
  dict schema
    {
      "title": "dict",
      "type": "object"
    }

field: file_bundle
  bundle name       data
  number_of_files   16
  size              298452
  included files
    (relative) path                                         size
    ──────────────────────────────────────────────────────────────
    La_Rassegna/sn84037025_1917-04-14_ed-1_seq-1_ocr.txt    20647
    La_Rassegna/sn84037025_1917-04-07_ed-1_seq-1_ocr.txt    19397
    La_Rassegna/sn84037025_1917-04-14_ed-2_seq-1_ocr.txt    20650
    La_Rassegna/sn84037025_1917-04-21_ed-1_seq-1_ocr.txt    21017
    La_Rassegna/sn84037025_1917-04-21_ed-2_seq-1_ocr.txt    20982
    La_Ragione/sn84037024_1917-05-05_ed-2_seq-1_ocr.txt     18474
    La_Ragione/sn84037024_1917-05-16_ed-1_seq-1_ocr.txt     18620
    La_Ragione/sn84037024_1917-05-16_ed-2_seq-1_ocr.txt     18698
    La_Ragione/sn84037024_1917-05-05_ed-1_seq-1_ocr.txt     18346
    La_Ragione/sn84037024_1917-04-25_ed-4_seq-1_ocr.txt     16235
    La_Ragione/sn84037024_1917-04-25_ed-3_seq-1_ocr.txt     16793
    La_Ragione/sn84037024_1917-04-25_ed-2_seq-1_ocr.txt     16679
    La_Ragione/sn84037024_1917-04-25_ed-1_seq-1_ocr.txt     16613
    La_Ragione/sn84037024_1917-05-16_ed-3_seq-1_ocr.txt     18540
    La_Ragione/sn84037024_1917-05-05_ed-4_seq-1_ocr.txt     18481
    La_Ragione/sn84037024_1917-05-05_ed-3_seq-1_ocr.txt     18280
Great, we've successfully imported a bundle of files this time rather than just one. This has given us both the download metadata and the files themselves. As you can see, kiara also gives us additional information on the composition of the bundle, that is, the number of files and the size of each one. This information will be useful later, when we intervene on these files, to keep track of how we have changed them. For now, let's save the files in a separate variable for us to use later.
file_bundle = outputs['file_bundle']
Preparing the Texts¶
Now that we have imported the files, let's give them some structure. For this, we will need the create.table.from.file_bundle operation (similar to the one used in the installation notebook, which you are welcome to revisit at any time). Let's have a look at its documentation, inputs and outputs.
kiara.retrieve_operation_info('create.table.from.file_bundle')
Documentation
  Create a table value from a text file_bundle.
  The resulting table will have (at a minimum) the following collumns:
  • id: an auto-assigned index
  • rel_path: the relative path of the file (from the provided base path)
  • content: the text file content
Author(s)       Markus Binsteiner markus@frkl.io
Context
  Tags          tabular
  Labels        package: kiara_plugin.tabular
  References    source_repo: https://github.com/DHARPA-Project/kiara_plugin.tabular
                documentation: https://DHARPA-Project.github.io/kiara_plugin.tabular/
Operation details
  Documentation
    Create a table value from a text file_bundle.
    The resulting table will have (at a minimum) the following collumns:
    - id: an auto-assigned index
    - rel_path: the relative path of the file (from the provided base path)
    - content: the text file content
  Inputs
    field name    type          description                                  Required   Default
    ─────────────────────────────────────────────────────────────────────────────────────────────────────
    file_bundle   file_bundle   The source value (of type 'file_bundle').    yes        -- no default --
  Outputs
    field name   type    description
    ──────────────────────────────────────────────────────────
    table        table   The result value (of type 'table').
Let's use the file bundle we downloaded earlier and saved in our variable, and run this kiara table function.
inputs = {
    'file_bundle': file_bundle
}
outputs = kiara.run_job('create.table.from.file_bundle', inputs=inputs)
outputs
field: table
  id   rel_path                        mime_type    size    content                          file_name
  ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  0    La_Ragione/sn84037024_1917-04   text/plain   16613   LA RAGIONE                       sn84037024_1917-04-25_ed-1_se
  1    La_Ragione/sn84037024_1917-04   text/plain   16679   LA RAG ONE                       sn84037024_1917-04-25_ed-2_se
  2    La_Ragione/sn84037024_1917-04   text/plain   16793   LA RAGIONE                       sn84037024_1917-04-25_ed-3_se
  3    La_Ragione/sn84037024_1917-04   text/plain   16235   contro i vili, i camorristi, i   sn84037024_1917-04-25_ed-4_se
  4    La_Ragione/sn84037024_1917-05   text/plain   18346   contro i vili, i camorristi, i   sn84037024_1917-05-05_ed-1_se
  5    La_Ragione/sn84037024_1917-05   text/plain   18474   LA RAGIONA                       sn84037024_1917-05-05_ed-2_se
  6    La_Ragione/sn84037024_1917-05   text/plain   18280   LA RAGIONE                       sn84037024_1917-05-05_ed-3_se
  7    La_Ragione/sn84037024_1917-05   text/plain   18481   LA RAGIONE                       sn84037024_1917-05-05_ed-4_se
  8    La_Ragione/sn84037024_1917-05   text/plain   18620   contro i vili, i camorristi, i   sn84037024_1917-05-16_ed-1_se
  9    La_Ragione/sn84037024_1917-05   text/plain   18698   LA RAG ONE                       sn84037024_1917-05-16_ed-2_se
  10   La_Ragione/sn84037024_1917-05   text/plain   18540   contro 1 vili, i camorristi, i   sn84037024_1917-05-16_ed-3_se
  11   La_Rassegna/sn84037025_1917-0   text/plain   19397   ■■■                              sn84037025_1917-04-07_ed-1_se
  12   La_Rassegna/sn84037025_1917-0   text/plain   20647   La Rassegna                      sn84037025_1917-04-14_ed-1_se
  13   La_Rassegna/sn84037025_1917-0   text/plain   20650   Both Phones                      sn84037025_1917-04-14_ed-2_se
  14   La_Rassegna/sn84037025_1917-0   text/plain   21017   ■ jSrìt** W?? iIK 38®f- i^M      sn84037025_1917-04-21_ed-1_se
  15   La_Rassegna/sn84037025_1917-0   text/plain   20982   ■Both Phones                     sn84037025_1917-04-21_ed-2_se
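To picture what such a table holds, the following plain-Python sketch builds comparable rows from a small, made-up mapping of relative paths to text. It is not kiara's implementation: the helper `table_from_files`, the two sample entries, and the choice to sort by path and count size in UTF-8 bytes are all assumptions for illustration.

```python
# Made-up stand-ins for downloaded files: relative path -> text content.
files = {
    "La_Rassegna/example_b.txt": "La Rassegna Both Phones",
    "La_Ragione/example_a.txt": "LA RAGIONE contro i vili",
}


def table_from_files(files: dict) -> list:
    """Build one row per file, similar in shape to the table shown above."""
    rows = []
    # Assumption: rows are ordered by relative path.
    for index, rel_path in enumerate(sorted(files)):
        content = files[rel_path]
        rows.append({
            "id": index,
            "rel_path": rel_path,
            "file_name": rel_path.split("/")[-1],
            # Assumption: size is counted in bytes of UTF-8 encoded text.
            "size": len(content.encode("utf-8")),
            "content": content,
        })
    return rows


table = table_from_files(files)
# Selecting a single column gives a plain list of texts.
contents = [row["content"] for row in table]
print(contents)
```

Keeping the path and file name next to the content means every later transformation of the text can still be traced back to its source file.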
Great, this has taken all the information from the files we downloaded and made it a bit easier to navigate. In order to process and analyse our sources, we need to work with the files' content, which is in the column 'content'. Let's run kiara.list_operation_ids('table') to see how we might be able to do that.
kiara.list_operation_ids('table')
['create.database.from.table',
'create.network_data.from.tables',
'create.table.from.file',
'create.table.from.file_bundle',
'export.table.as.csv_file',
'extract.date_array.from.table',
'filter.table',
'import.table.from.local_file_path',
'import.table.from.local_folder_path',
'query.table',
'table.pick.column',
'table_filter.drop_columns',
'table_filter.select_columns',
'table_filter.select_rows']
As we are interested in one column, the table.pick.column operation seems like a good fit.
kiara.retrieve_operation_info('table.pick.column')
Documentation   Pick one column from a table, returning an array.
Author(s)       Markus Binsteiner markus@frkl.io
Context
  Tags          tabular
  Labels        package: kiara_plugin.tabular
  References    source_repo: https://github.com/DHARPA-Project/kiara_plugin.tabular
                documentation: https://DHARPA-Project.github.io/kiara_plugin.tabular/
Operation details
  Documentation Pick one column from a table, returning an array.
  Inputs
    field name    type     description                          Required   Default
    ────────────────────────────────────────────────────────────────────────────────────────
    table         table    A table.                             yes        -- no default --
    column_name   string   The name of the column to extract.   yes        -- no default --
  Outputs
    field name   type    description
    ────────────────────────────────────
    array        array   The column.
So here we need two inputs, the table we just made and the name of the column we want to pick.
Let's specify our outputs again and run the function. In this way, we retain the content of the files as the variable we need for NLP.
inputs = {
    'table': outputs['table'],
    'column_name': 'content'
}
outputs = kiara.run_job('table.pick.column', inputs=inputs)
outputs
field: array
  LA RAGIONE
  LA RAG ONE
  LA RAGIONE
  contro i vili, i camorristi, i sicari, i falsari e gli austriacanti, nemici dell ...
  contro i vili, i camorristi, i sicari, i falsari e gli austriacanti, nemici dell ...
  LA RAGIONA
  LA RAGIONE
  LA RAGIONE
  contro i vili, i camorristi, i sicari, i falsari e gli austriacanti, nemici dell ...
  LA RAG ONE
  contro 1 vili, i camorristi, i sicari, i falsari e gli austriacanti, nemici dell ...
  ■■■
  La Rassegna
  Both Phones
  ■ jSrìt** W?? iIK 38®f- i^M
  ■Both Phones
Natural Language Processing (Stage 1)¶
Now we are ready to prepare our text for analysis. Let's see what operations are included in kiara for NLP in the kiara_plugin.language_processing package.
infos = kiara.retrieve_operations_info()
operations = {}
for op_id, info in infos.item_infos.items():
    # Keep only the operations that belong to the language processing plugin.
    if info.context.labels.get("package") == "kiara_plugin.language_processing":
        operations[op_id] = info

print(operations.keys())
dict_keys(['create.stopwords_list', 'generate.LDA.for.tokens_array', 'preprocess.tokens_array', 'remove_stopwords.from.tokens_array', 'tokenize.string', 'tokenize.texts_array'])
The contents of our text files have been stored as an array. Before performing any other operation, we should start by tokenising our text. We can do this by using the tokenize.texts_array operation.
If you're unsure about which of these operations you should run, you can refer to the in-built explanation in each kiara module which clarifies what each operation does. For further information about pros and cons of each pre-processing operation, please refer to this repository here.
kiara.retrieve_operation_info('tokenize.texts_array')
Documentation
  Split sentences into words or words into characters.
  In other words, this operation establishes the word boundaries (i.e., tokens) a very helpful way of finding patterns. It is also the typical step prior to stemming and lemmatization
Author(s)       Markus Binsteiner markus@frkl.io
Context
  Tags          language_processing, tokenize, tokens
  Labels        package: kiara_plugin.language_processing
  References    source_repo: https://github.com/DHARPA-Project/kiara_plugin.language_processing
                documentation: https://DHARPA-Project.github.io/kiara_plugin.language_processing/
Operation details
  Documentation
    Split sentences into words or words into characters.
    In other words, this operation establishes the word boundaries (i.e., tokens) a very helpful way of finding patterns. It is also the typical step prior to stemming and lemmatization
  Inputs
    field name         type      description                                            Required   Default
    ────────────────────────────────────────────────────────────────────────────────────────────────────────────────
    texts_array        array     An array of text items to be tokenized.                yes        -- no default --
    tokenize_by_word   boolean   Whether to tokenize by word (default), or character.   no         True
  Outputs
    field name     type    description
    ────────────────────────────────────────────────────────────────────────────────────
    tokens_array   array   The tokenized content, as an array of lists of strings.
Great, let's give it a go!
inputs = {
    'texts_array': outputs['array']
}
outputs = kiara.run_job('tokenize.texts_array', inputs=inputs)
outputs
field: tokens_array
  ['LA', 'RAGIONE', 'ORGANO', 'DI', 'DIFESA', 'DELLA', "ITALIANITÀ'", 'contro', '1 ...
  ['LA', 'RAG', 'ONE', 'contro', 'i', 'vili', ',', 'i', 'camorristi', ',', 'i', 's ...
  ['LA', 'RAGIONE', 'ORGANO', 'DI', 'DIFESA', 'DELLA', 'ITALIANITÀ', 'contro', 'i' ...
  ['contro', 'i', 'vili', ',', 'i', 'camorristi', ',', 'i', 'sicari', ',', 'i', 'f ...
  ['contro', 'i', 'vili', ',', 'i', 'camorristi', ',', 'i', 'sicari', ',', 'i', 'f ...
  ['LA', 'RAGIONA', 'ORGANO', 'DI', 'DIFESA', 'DELLA', 'ITALIANITÀ', 'contro', 'i' ...
  ['LA', 'RAGIONE', 'ORGANO', 'DI', 'DIFESA', 'DELLA', "ITALIANITÀ'", 'contro', 'i ...
  ['LA', 'RAGIONE', 'contro', 'i', 'vili', ',', '1', 'camorristi', ',', 'i', 'sica ...
  ['contro', 'i', 'vili', ',', 'i', 'camorristi', ',', 'i', 'sicari', ',', 'i', 'f ...
  ['LA', 'RAG', 'ONE', 'ORGANO', 'DI', 'DIFESA', 'DELLA', 'ITALIANITÀ', "''", 'con ...
  ['contro', '1', 'vili', ',', 'i', 'camorristi', ',', 'i', 'sicari', ',', 'i', 'f ...
  ['■■■', 'La', 'Rassegna', '_', 'I', 'Both', 'Phones', 'ANNO', 'L', 'No', '.', '1 ...
  ['La', 'Rassegna', 'Jjoth', 'Phones', 'ANNO', 'L', 'No', '.', '2', 'BASTA', '!', ...
  ['Both', 'Phones', 'ANNO', 'I', '.', 'No', '.', '2', 'BASTA', '!', '...', 'uà', ...
  ['■', 'jSrìt', '*', '*', 'W', '?', '?', 'iIK', '38®f-', 'i^M', 'F', '<', '5É', ' ...
  ['■Both', 'Phones', 'ANNO', '11', '.', 'No', '.', '5', 'LE', 'COSE', 'A', 'POSTO ...
We can see from the printed preview that this has tokenized the contents for each of the text files we imported.
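As a rough picture of what tokenising does, here is a simplified sketch in plain Python. It is not the tokenizer that kiara uses, and its output will differ on edge cases (in the preview above, for instance, '...' and '38®f-' are each kept as a single token). The function names and the regular expression are assumptions chosen for illustration.

```python
import re

# Simplified rule: a token is either a run of word characters or a single
# character that is neither a word character nor whitespace (e.g. punctuation).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text: str, by_word: bool = True) -> list:
    """Split one text into word tokens, or into single characters."""
    if by_word:
        return TOKEN_PATTERN.findall(text)
    # Character mode: every non-whitespace character becomes a token.
    return [char for char in text if not char.isspace()]


def tokenize_texts(texts: list, by_word: bool = True) -> list:
    """Tokenize every text in a list, giving one list of tokens per text."""
    return [tokenize(text, by_word) for text in texts]


print(tokenize_texts(["contro i vili, i camorristi", "LA RAGIONE"]))
```

The result has the same shape as the kiara output: one list of tokens per source text, so each list can still be matched to its file.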
Now we can work on pre-processing some of this text. Let's look at what options we have in the preprocess.tokens_array operation.
kiara.retrieve_operation_info('preprocess.tokens_array')
Documentation
  Preprocess lists of tokens, incl. lowercasing, remove special characers, etc.
  Lowercasing: Lowercase the words. This operation is a double-edged sword. It can be effective at yielding potentially better results in the case of relatively small datasets or datatsets with a high percentage of OCR mistakes. For instance, if lowercasing is not performed, the algorithm will treat USA, Usa, usa, UsA, uSA, etc. as distinct tokens, even though they may all refer to the same entity. On the other hand, if the dataset does not contain such OCR mistakes, then it may become difficult to distinguish between homonyms and make interpreting the topics much harder.
  Removing stopwords and words with less than three characters: Remove low information words. These are typically words such as articles, pronouns, prepositions, conjunctions, etc. which are not semantically salient. There are numerous stopword lists available for many, though not all, languages which can be easily adapted to the individual researcher's needs. Removing words with less than three characters may additionally remove many OCR mistakes. Both these operations have the dual advantage of yielding more reliable results while reducing the size of the dataset, thus in turn reducing the required processing power. This step can therefore hardly be considered optional in TM.
  Noise removal: Remove elements such as punctuation marks, special characters, numbers, html formatting, etc. This operation is again concerned with removing elements that may not be relevant to the text analysis and in fact interfere with it. Depending on the dataset and research question, this operation can become essential.
Author(s)       Markus Binsteiner markus@frkl.io
Context
  Tags          language_processing, tokens, preprocess
  Labels        package: kiara_plugin.language_processing
  References    source_repo: https://github.com/DHARPA-Project/kiara_plugin.language_processing
                documentation: https://DHARPA-Project.github.io/kiara_plugin.language_processing/
Operation details
  Documentation
    Preprocess lists of tokens, incl. lowercasing, remove special characers, etc.
    Lowercasing: Lowercase the words. This operation is a double-edged sword. It can be effective at yielding potentially better results in the case of relatively small datasets or datatsets with a high percentage of OCR mistakes. For instance, if lowercasing is not performed, the algorithm will treat USA, Usa, usa, UsA, uSA, etc. as distinct tokens, even though they may all refer to the same entity. On the other hand, if the dataset does not contain such OCR mistakes, then it may become difficult to distinguish between homonyms and make interpreting the topics much harder.
    Removing stopwords and words with less than three characters: Remove low information words. These are typically words such as articles, pronouns, prepositions, conjunctions, etc. which are not semantically salient. There are numerous stopword lists available for many, though not all, languages which can be easily adapted to the individual researcher's needs. Removing words with less than three characters may additionally remove many OCR mistakes. Both these operations have the dual advantage of yielding more reliable results while reducing the size of the dataset, thus in turn reducing the required processing power. This step can therefore hardly be considered optional in TM.
    Noise removal: Remove elements such as punctuation marks, special characters, numbers, html formatting, etc. This operation is again concerned with removing elements that may not be relevant to the text analysis and in fact interfere with it. Depending on the dataset and research question, this operation can become essential.
  Inputs
    field name            type      description                                                                                  Required   Default
    ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
    tokens_array          array     The tokens array to pre-process.                                                             yes        -- no default --
    to_lowercase          boolean   Apply lowercasing to the text.                                                               no         False
    remove_alphanumeric   boolean   Remove all tokens that include numbers (e.g. ex1ample).                                      no         False
    remove_non_alpha      boolean   Remove all tokens that include punctuation and numbers (e.g. ex1a.mple).                     no         False
    remove_all_numeric    boolean   Remove all tokens that contain numbers only (e.g. 876).                                      no         False
    remove_short_tokens   integer   Remove tokens shorter or equal to this value. If value is <= 0, no filtering will be done.   no         0
    remove_stopwords      list      Remove stopwords.                                                                            no         -- no default --
  Outputs
    field name     type    description
    ────────────────────────────────────────────────────────────────────────────────────────
    tokens_array   array   The pre-processed content, as an array of lists of strings.
kiara includes the most widely used text analysis pre-processing operations. Let's try some of them and take a few moments to notice how they change our text.
Let's start by removing the so-called stopwords. These are low-information words such as articles, pronouns, prepositions and conjunctions, which are not semantically salient. There are numerous stopword lists available for many, though not all, languages, which can be easily adapted to the individual researcher's needs. Here we are defining our own stopword list, but do experiment yourself with adding and changing some of the words.
stopword_list = ['la', 'i']
inputs = {
    'tokens_array': outputs['tokens_array'],
    'remove_stopwords': stopword_list
}
outputs = kiara.run_job('preprocess.tokens_array', inputs=inputs)
outputs
field: tokens_array
  ['RAGIONE', 'ORGANO', 'DI', 'DIFESA', 'DELLA', "ITALIANITÀ'", 'contro', '1', 'vi ...
  ['RAG', 'ONE', 'contro', 'vili', ',', 'camorristi', ',', 'sicari', ',', 'falsari ...
  ['RAGIONE', 'ORGANO', 'DI', 'DIFESA', 'DELLA', 'ITALIANITÀ', 'contro', 'vili', ' ...
  ['contro', 'vili', ',', 'camorristi', ',', 'sicari', ',', 'falsari', 'e', 'gli', ...
  ['contro', 'vili', ',', 'camorristi', ',', 'sicari', ',', 'falsari', 'e', 'gli', ...
  ['RAGIONA', 'ORGANO', 'DI', 'DIFESA', 'DELLA', 'ITALIANITÀ', 'contro', 'vili', ' ...
  ['RAGIONE', 'ORGANO', 'DI', 'DIFESA', 'DELLA', "ITALIANITÀ'", 'contro', 'vili', ...
  ['RAGIONE', 'contro', 'vili', ',', '1', 'camorristi', ',', 'sicari', ',', 'falsa ...
  ['contro', 'vili', ',', 'camorristi', ',', 'sicari', ',', 'falsari', 'e', 'gli', ...
  ['RAG', 'ONE', 'ORGANO', 'DI', 'DIFESA', 'DELLA', 'ITALIANITÀ', "''", 'contro', ...
  ['contro', '1', 'vili', ',', 'camorristi', ',', 'sicari', ',', 'falsari', 'e', ' ...
  ['■■■', 'Rassegna', '_', 'Both', 'Phones', 'ANNO', 'L', 'No', '.', '1', 'Il', 'p ...
  ['Rassegna', 'Jjoth', 'Phones', 'ANNO', 'L', 'No', '.', '2', 'BASTA', '!', '...' ...
  ['Both', 'Phones', 'ANNO', '.', 'No', '.', '2', 'BASTA', '!', '...', 'uà', 'quai ...
  ['■', 'jSrìt', '*', '*', 'W', '?', '?', 'iIK', '38®f-', 'i^M', 'F', '<', '5É', ' ...
  ['■Both', 'Phones', 'ANNO', '11', '.', 'No', '.', '5', 'LE', 'COSE', 'A', 'POSTO ...
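Notice that 'LA' and 'I' were removed in the output above even though our list only contains the lowercase forms 'la' and 'i', which suggests that the comparison ignores case. A minimal plain-Python sketch of that behaviour might look like the following. It is an illustration only, not kiara's code, and the helper name `remove_stopwords` is hypothetical.

```python
def remove_stopwords(tokens: list, stopwords: list) -> list:
    """Drop every token that matches a stopword, ignoring case."""
    # Assumption: matching is case-insensitive, as the output above suggests.
    stopword_set = {word.lower() for word in stopwords}
    return [token for token in tokens if token.lower() not in stopword_set]


tokens = ["LA", "RAGIONE", "contro", "i", "vili", ",", "i", "camorristi"]
print(remove_stopwords(tokens, ["la", "i"]))
```

Only whole tokens are compared, so a word that merely contains a stopword (such as 'vili', which contains 'i') is left untouched.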
Great. Let's take this a bit further and combine two of our options in one job. In practice, we could pass all the inputs together in a single job, but let's start with converting everything to lowercase and removing any tokens that contain punctuation or numbers.
inputs = {
    'tokens_array': outputs['tokens_array'],
    'to_lowercase': True,
    'remove_non_alpha': True
}
outputs = kiara.run_job('preprocess.tokens_array', inputs=inputs)
outputs
field: tokens_array
  ['ragione', 'organo', 'di', 'difesa', 'della', 'contro', 'vili', 'camorristi', ' ...
  ['rag', 'one', 'contro', 'vili', 'camorristi', 'sicari', 'falsari', 'e', 'gli', ...
  ['ragione', 'organo', 'di', 'difesa', 'della', 'italianità', 'contro', 'vili', ' ...
  ['contro', 'vili', 'camorristi', 'sicari', 'falsari', 'e', 'gli', 'austriacanti' ...
  ['contro', 'vili', 'camorristi', 'sicari', 'falsari', 'e', 'gli', 'austriacanti' ...
  ['ragiona', 'organo', 'di', 'difesa', 'della', 'italianità', 'contro', 'vili', ' ...
  ['ragione', 'organo', 'di', 'difesa', 'della', 'contro', 'vili', 'camorristi', ' ...
  ['ragione', 'contro', 'vili', 'camorristi', 'sicari', 'falsari', 'e', 'gli', 'au ...
  ['contro', 'vili', 'camorristi', 'sicari', 'falsari', 'e', 'gli', 'austriacanti' ...
  ['rag', 'one', 'organo', 'di', 'difesa', 'della', 'italianità', 'contro', 'vili' ...
  ['contro', 'vili', 'camorristi', 'sicari', 'falsari', 'e', 'gli', 'austriacanti' ...
  ['rassegna', 'both', 'phones', 'anno', 'l', 'no', 'il', 'perche', 'de', 'rassegn ...
  ['rassegna', 'jjoth', 'phones', 'anno', 'l', 'no', 'basta', 'da', 'qualche', 'te ...
  ['both', 'phones', 'anno', 'no', 'basta', 'uà', 'quaiene', 'tempo', 'a', 'questa ...
  ['jsrìt', 'w', 'iik', 'f', 'v', 'ht', 'p', 't', 'both', 'phones', 'anno', 'il', ...
  ['phones', 'anno', 'no', 'le', 'cose', 'a', 'posto', 'si', 'va', 'dicendo', 'si' ...
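The effect of these two options can be approximated in a few lines of plain Python. This is a sketch under assumptions, not kiara's implementation: the helper `preprocess` is hypothetical, and it treats a token as "alphabetic" when Python's `str.isalpha()` says so, which happens to reproduce the behaviour visible above (for example, "ITALIANITÀ'" with its trailing apostrophe is dropped, while 'ITALIANITÀ' becomes 'italianità').

```python
def preprocess(tokens: list, to_lowercase: bool = False,
               remove_non_alpha: bool = False) -> list:
    """Optionally lowercase tokens and drop tokens with non-letter characters."""
    result = []
    for token in tokens:
        # Assumption: a token is kept only if every character is a letter.
        if remove_non_alpha and not token.isalpha():
            continue
        result.append(token.lower() if to_lowercase else token)
    return result


tokens = ["RAGIONE", "ITALIANITÀ'", "ITALIANITÀ", "contro", "1", "vili", ",", "38®f-"]
print(preprocess(tokens, to_lowercase=True, remove_non_alpha=True))
```

Filtering of this kind is lossy: a word damaged by OCR, such as '■Both', disappears entirely rather than being repaired, which is worth bearing in mind when interpreting results.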
Now that we're happy with our pre-processed texts, we can use generate.LDA.for.tokens_array to try out some topic modelling. The default number of topics is set at seven, but just like with the preprocess.tokens_array operation, we can play around with the options. Let's have a look.
kiara.retrieve_operation_info('generate.LDA.for.tokens_array')
Documentation
  Perform Latent Dirichlet Allocation on a tokenized corpus.
  This module computes models for a range of number of topics provided by the user.
Author(s)       Markus Binsteiner markus@frkl.io
Context
  Tags          language_processing, LDA, tokens
  Labels        package: kiara_plugin.language_processing
  References    source_repo: https://github.com/DHARPA-Project/kiara_plugin.language_processing
                documentation: https://DHARPA-Project.github.io/kiara_plugin.language_processing/
Operation details
  Documentation
    Perform Latent Dirichlet Allocation on a tokenized corpus.
    This module computes models for a range of number of topics provided by the user.
  Inputs
    field name          type      description                                              Required   Default
    ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
    tokens_array        array     The text corpus.                                         yes        -- no default --
    num_topics_min      integer   The minimal number of topics.                            no         7
    num_topics_max      integer   The max number of topics.                                no         7
    compute_coherence   boolean   Whether to compute the coherence score for each model.   no         False
    words_per_topic     integer   How many words per topic to put in the result model.     no         10
  Outputs
    field name        type    description
    ──────────────────────────────────────────────────────────────────────────────────────────────────────
    topic_models      dict    A dictionary with one coherence model table for each number of topics.
    coherence_table   table   Coherence details.
    coherence_map     dict    A map with the coherence value for every number of topics.
We'll stick with the default for now, and generate some topics for our text.
inputs = {
'tokens_array' : outputs['tokens_array']
}
outputs = kiara.run_job('generate.LDA.for.tokens_array', inputs=inputs)
outputs
field             value
──────────────────────────────────────────────────────────────────────────────────────────────────────
coherence_map
                  dict data     {}
                  dict schema   {
                                  "title": "dict",
                                  "type": "object"
                                }

coherence_table   -- none/not set --
topic_models
                  dict data     {
                                  "7": [
                                    [
                                      0,
                                      "0.031*\"di\" + 0.024*\"e\" + 0.017*\"che\" + 0.015*\"il\" + 0.013*\"non\" + 0.012*\"a\" …
                                    ],
                                    [
                                      1,
                                      "0.043*\"di\" + 0.027*\"e\" + 0.025*\"che\" + 0.017*\"il\" + 0.016*\"a\" + 0.016*\"non\" …
                                    ],
                                    [
                                      2,
                                      "0.023*\"di\" + 0.022*\"e\" + 0.021*\"che\" + 0.014*\"a\" + 0.011*\"per\" + 0.011*\"il\" …
                                    ],
                                    [
                                      3,
                                      "0.043*\"di\" + 0.028*\"e\" + 0.026*\"che\" + 0.019*\"il\" + 0.016*\"a\" + 0.013*\"non\" …
                                    ],
                                    [
                                      4,
                                      "0.025*\"di\" + 0.020*\"che\" + 0.018*\"e\" + 0.016*\"a\" + 0.013*\"un\" + 0.012*\"il\" +…
                                    ],
                                    [
                                      5,
                                      "0.030*\"di\" + 0.019*\"e\" + 0.016*\"che\" + 0.016*\"il\" + 0.011*\"un\" + 0.011*\"a\" +…
                                    ],
                                    [
                                      6,
                                      "0.029*\"di\" + 0.018*\"e\" + 0.013*\"che\" + 0.012*\"il\" + 0.010*\"si\" + 0.009*\"per\"…
                                    ]
                                  ]
                                }
                  dict schema   {
                                  "title": "dict",
                                  "type": "object"
                                }
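Each topic in the output is a single string of weighted words joined by plus signs. If we want to work with the words and weights separately, a small parser is enough. The sketch below is our own illustration, not a kiara function: `parse_topic` is a hypothetical helper, and `sample` is the visible (untruncated) part of topic 0 copied from the output above.

```python
import re

# Matches one term of the form 0.031*"di" and captures the weight and the word.
TERM_PATTERN = re.compile(r'(\d+(?:\.\d+)?)\*"([^"]*)"')


def parse_topic(topic_string):
    """Split a topic string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]


# The visible part of topic 0 from the output above.
sample = '0.031*"di" + 0.024*"e" + 0.017*"che" + 0.015*"il" + 0.013*"non" + 0.012*"a"'

for word, weight in parse_topic(sample):
    print(f"{word:<5} {weight:.3f}")
```

Laid out this way, it is easy to see that the heaviest words in this topic are all short Italian function words (di, e, che, il), which is worth keeping in mind when choosing the pre-processing options.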
Recording and Tracing our Data¶
We've successfully downloaded, organised and pre-processed our text files, and have now generated some topics from them.
Fantastic!
As we know, this means we've made lots of decisions about our research process and our data. But by using kiara, we can trace what's changed and the decisions we've made. Let's have a look!
As with the installation notebook, there is not much to see here yet, but this section will be updated as changes come. For operations that take options (like the pre-processing), it could be particularly useful to see whether each option was selected or not.
topics = outputs['topic_models']
topics.lineage
generate.LDA.for.tokens_array
├── input: compute_coherence (boolean) = 5137a237-fe0f-45bd-abe3-cc84700a2bb6
├── input: num_topics_max (integer) = 75399d70-bbef-4215-b9b5-5dacfa03b2ba
├── input: num_topics_min (integer) = 75399d70-bbef-4215-b9b5-5dacfa03b2ba
├── input: tokens_array (array) = 02d01eb7-70d6-4ef7-811e-66ed25f920bb
│   └── preprocess.tokens_array
│       ├── input: remove_all_numeric (boolean) = 5137a237-fe0f-45bd-abe3-cc84700a2bb6
│       ├── input: remove_alphanumeric (boolean) = 5137a237-fe0f-45bd-abe3-cc84700a2bb6
│       ├── input: remove_non_alpha (boolean) = 8b1b93ec-a51e-4bbd-84cf-c5a1efd78e9b
│       ├── input: remove_short_tokens (integer) = f5df1b36-9884-413d-92d0-81209227f106
│       ├── input: remove_stopwords (list) = bb8a79b2-369c-46ae-a85a-2b0f85c9da22
│       ├── input: to_lowercase (boolean) = 8b1b93ec-a51e-4bbd-84cf-c5a1efd78e9b
│       └── input: tokens_array (array) = d1db365d-2e59-4455-ae05-78447e5a4268
│           └── preprocess.tokens_array
│               ├── input: remove_all_numeric (boolean) = 5137a237-fe0f-45bd-abe3-cc84700a2bb6
│               ├── input: remove_alphanumeric (boolean) = 5137a237-fe0f-45bd-abe3-cc84700a2bb6
│               ├── input: remove_non_alpha (boolean) = 5137a237-fe0f-45bd-abe3-cc84700a2bb6
│               ├── input: remove_short_tokens (integer) = f5df1b36-9884-413d-92d0-81209227f106
│               ├── input: remove_stopwords (list) = 524b5812-c4df-4ea0-a50a-d0ec5166c22f
│               ├── input: to_lowercase (boolean) = 5137a237-fe0f-45bd-abe3-cc84700a2bb6
│               └── input: tokens_array (array) = a3c66f00-7f67-483d-8018-a64714094fa4
│                   └── tokenize.texts_array
│                       ├── input: texts_array (array) = 3db76a98-88e6-45ee-8618-7c95fdf8232c
│                       │   └── table.pick.column
│                       │       ├── input: column_name (string) = 33ebce29-be63-4644-b66b-9e82a3c56236
│                       │       └── input: table (table) = bd56aae9-6289-4f3e-b3f4-edbc55310689
│                       │           └── create.table
│                       │               └── input: file_bundle (file_bundle) = 214ae90d-224b-447a-b0e8-112024a8e6d4
│                       │                   └── download.file_bundle
│                       │                       ├── input: sub_path (string) = 89c3d000-a486-4089-9592-142253d8f3d3
│                       │                       └── input: url (string) = 10d94fa6-0c3d-4d6e-a457-9fa1e7b63e99
│                       └── input: tokenize_by_word (boolean) = 8b1b93ec-a51e-4bbd-84cf-c5a1efd78e9b
└── input: words_per_topic (integer) = cd1319e3-a6ec-4d8d-99b3-34ef873e1d13
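The lineage reads from the most recent operation at the top down to the original download at the bottom. To see the steps in the order we actually ran them, we can pull the operation names out of the printed tree and reverse them. This is only a sketch that works on a pasted text copy of the tree: `operations_in_lineage` is a hypothetical helper, `lineage_text` is an abridged copy of the lineage above (most inputs left out), and we are not assuming anything about how kiara exposes lineage programmatically.

```python
def operations_in_lineage(lineage_text):
    """Return operation names from a printed lineage tree, top to bottom."""
    operations = []
    for line in lineage_text.splitlines():
        # Drop the tree-drawing characters and indentation in front of the label.
        label = line.lstrip("│├└─ ")
        # Lines starting with 'input:' describe values; the rest are operations.
        if label and not label.startswith("input:"):
            operations.append(label)
    return operations


# Abridged copy of the lineage printed above (most inputs left out).
lineage_text = """\
generate.LDA.for.tokens_array
├── input: tokens_array (array) = 02d01eb7-70d6-4ef7-811e-66ed25f920bb
│   └── preprocess.tokens_array
│       └── input: tokens_array (array) = d1db365d-2e59-4455-ae05-78447e5a4268
│           └── preprocess.tokens_array
│               └── input: tokens_array (array) = a3c66f00-7f67-483d-8018-a64714094fa4
│                   └── tokenize.texts_array
│                       └── input: texts_array (array) = 3db76a98-88e6-45ee-8618-7c95fdf8232c
│                           └── table.pick.column
│                               └── input: table (table) = bd56aae9-6289-4f3e-b3f4-edbc55310689
│                                   └── create.table
│                                       └── input: file_bundle (file_bundle) = 214ae90d-224b-447a-b0e8-112024a8e6d4
│                                           └── download.file_bundle
└── input: words_per_topic (integer) = cd1319e3-a6ec-4d8d-99b3-34ef873e1d13
"""

# Reverse the list so the earliest step comes first.
steps = list(reversed(operations_in_lineage(lineage_text)))
for number, step in enumerate(steps, start=1):
    print(number, step)
```

Read this way, the lineage recounts the whole notebook: download the files, create a table, pick a column, tokenise, pre-process twice, and finally generate the topics.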