kiara: Natural Language Processing (NLP)

Welcome back! Now that we're comfortable with what kiara looks like and what it can do to help track your data and your research process, let's try out some of the digital analysis tools, starting with Natural Language Processing.

Why NLP?

First of all, why bother with NLP? Natural language processing technology allows researchers to sort through unstructured data such as plain text. In other words, by assigning numerical values to text, computers can process language and perform advanced operations such as text categorisation, labelling, summarisation and so on. There are two main stages in NLP: pre-processing and analysis (that is, algorithm development and/or implementation). Here we cover both stages: in the first part, through some of the most common pre-processing operations such as tokenisation, lowercasing and removing stopwords; in the second part, through the example of another widely used text analysis method called topic modelling. For more information about the pre-processing operations and topic modelling, and a more in-depth discussion aimed particularly at humanities research, please refer to this repository here.
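To make these pre-processing steps concrete before we meet kiara's versions of them, here is a minimal plain-Python sketch (the toy sentence and stopword list are illustrative assumptions, not part of the corpus we use below):

text = "La Rassegna was published in Philadelphia"
tokens = text.split()                     # naive tokenisation by whitespace
tokens = [t.lower() for t in tokens]      # lowercasing
stopwords = {"la", "was", "in"}           # a toy stopword list
tokens = [t for t in tokens if t not in stopwords]
print(tokens)                             # ['rassegna', 'published', 'philadelphia']

Later in this notebook we will perform the same kinds of steps with dedicated kiara operations, which additionally keep track of how the data was transformed.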

Starting the Process

Let's start by double-checking that we have all the required plugins, and setting up an API for us to use kiara. We'll do this all in one go this time, but if you're unsure, feel free to head back to the installation notebook to look over this section again.

try:
    from kiara_plugin.jupyter import ensure_kiara_plugins
except ImportError:
    # The plugin isn't installed yet: install it into the current
    # environment, then try the import again.
    import sys
    print("Installing 'kiara_plugin.jupyter'...")
    !{sys.executable} -m pip install -q kiara_plugin.jupyter
    from kiara_plugin.jupyter import ensure_kiara_plugins

ensure_kiara_plugins()

from kiara import KiaraAPI
kiara = KiaraAPI.instance()

Now that we're all set up, we want to download some text to work with in our language processing analysis.
For our example here we will be using a relatively small number of texts. This is a sample taken from the larger corpus ChroniclItaly 3.0 (Viola and Fiscarelli 2021, Viola 2021), an open access digital heritage collection of Italian immigrant newspapers published in the United States from 1898 to 1936. The corpus that we use here includes the digitized (OCRed) front pages of the Italian-language newspaper La rassegna as collected from Chronicling America, an Internet-based, searchable database of U.S. newspapers published from 1789 to 1963, made available by the Library of Congress. These files are also a good example because their filenames already contain important metadata such as the publication date. The filename structure is: LCCNnumber_date_pageNumber_ocr.txt. Therefore, the filename 'sn84037025_1917-04-14_ed-1_seq-1_ocr.txt' refers to the OCR text file of the first page of the first edition of La Rassegna published on 14 April 1917. kiara allows us to retrieve both the files and the metadata in the filenames. This is very useful for historical research, but also for keeping track of how we are intervening on our sources. Let's see how this works.
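To see what that filename convention encodes, here is a small plain-Python sketch that splits a filename into its metadata parts (the helper name parse_filename is hypothetical and for illustration only; kiara will handle this for us later):

def parse_filename(filename: str) -> dict:
    # e.g. 'sn84037025_1917-04-14_ed-1_seq-1_ocr.txt'
    stem = filename.removesuffix('_ocr.txt')
    lccn, date, edition, sequence = stem.split('_')
    return {'lccn': lccn, 'date': date, 'edition': edition, 'sequence': sequence}

parse_filename('sn84037025_1917-04-14_ed-1_seq-1_ocr.txt')
# {'lccn': 'sn84037025', 'date': '1917-04-14', 'edition': 'ed-1', 'sequence': 'seq-1'}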

kiara.list_operation_ids('download')
['download.file', 'download.file_bundle']

Last time we only wanted one file, but with language processing we might want a bigger corpus.
Let's have a look at download.file_bundle this time.

kiara.retrieve_operation_info('download.file_bundle')
                                                                                                                                            
 Documentation                                                                                                                              
                     -- n/a --                                                                                                              
                                                                                                                                            
 Author(s)                                                                                                                                  
                     Markus Binsteiner   markus@frkl.io                                                                                     
                                                                                                                                            
 Context                                                                                                                                    
                     Tags         onboarding                                                                                                
                     Labels       package: kiara_plugin.onboarding                                                                          
                     References   source_repo: https://github.com/DHARPA-Project/kiara_plugin.onboarding                                    
                                  documentation: https://DHARPA-Project.github.io/kiara_plugin.onboarding/                                  
                                                                                                                                            
 Operation details                                                                                                                          
                     Documentation   -- n/a --                                                                                              
                                                                                                                                            
                     Inputs                                                                                                                 
                                       field name   type     description                                    Required   Default              
                                      ──────────────────────────────────────────────────────────────────────────────────────────────────    
                                       url          string   The url of an archive/zip file to download.    yes        -- no default --     
                                       sub_path     string   A relative path to select only a sub-folder    no         -- no default --     
                                                             from the archive.                                                              
                                                                                                                                            
                                                                                                                                            
                     Outputs                                                                                                                
                                       field name          type          description                                                        
                                      ──────────────────────────────────────────────────────────────────────────────────────────────────    
                                       file_bundle         file_bundle   The downloaded file bundle.                                        
                                       download_metadata   dict          Metadata about the download.                                       
                                                                                                                                            
                                                                                                                                            

So we still need a URL, but this time one pointing to a zip file that we can download. Here's some example data for us to use.

Again, we need to define the inputs, call kiara.run_job with our chosen operation download.file_bundle, and store the result as our outputs.

inputs = {
    "url": "https://github.com/DHARPA-Project/kiara.examples/archive/refs/heads/main.zip",
    "sub_path": "kiara.examples-main/examples/data/text_corpus/data"
}

outputs = kiara.run_job('download.file_bundle', inputs=inputs)
outputs
patool: Extracting /var/folders/5h/j266_5ss6qj7x37qd77pydh0p248rs/T/tmpjr9umxt6 ...
patool: ... /var/folders/5h/j266_5ss6qj7x37qd77pydh0p248rs/T/tmpjr9umxt6 extracted to `/var/folders/5h/j266_5ss6qj7x37qd77pydh0p248rs/T/tmpnha6uj9d'.

╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                                                                                                          │
│   field               value                                                                                                              │
│  ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────  │
│   download_metadata                                                                                                                      │
│                         dict data     {                                                                                                  │
│                                         "response_headers": [                                                                            │
│                                           {                                                                                              │
│                                             "access-control-allow-origin": "https://render.githubusercontent.com",                       │
│                                             "content-disposition": "attachment; filename=kiara.examples-main.zip",                       │
│                                             "content-security-policy": "default-src 'none'; style-src 'unsafe-inline'; sandbox",         │
│                                             "content-type": "application/zip",                                                           │
│                                             "etag": "W/\"34f87e1d6dc5c913d21b59d7aecf516bca3a32605b6e1504ae97eec4611cc862\"",            │
│                                             "strict-transport-security": "max-age=31536000",                                             │
│                                             "vary": "Authorization,Accept-Encoding,Origin",                                              │
│                                             "x-content-type-options": "nosniff",                                                         │
│                                             "x-frame-options": "deny",                                                                   │
│                                             "x-xss-protection": "1; mode=block",                                                         │
│                                             "date": "Fri, 27 Jan 2023 07:48:53 GMT",                                                     │
│                                             "transfer-encoding": "chunked",                                                              │
│                                             "x-github-request-id": "542C:C4A4:31636B:3C82F9:63D381E5"                                    │
│                                           },                                                                                             │
│                                           {                                                                                              │
│                                             "server": "GitHub.com",                                                                      │
│                                             "date": "Fri, 27 Jan 2023 07:48:53 GMT",                                                     │
│                                             "content-type": "text/html; charset=utf-8",                                                  │
│                                             "vary": "X-PJAX, X-PJAX-Container, Turbo-Visit, Turbo-Frame, Accept-Encoding, Accept, X…     │
│                                             "location": "https://codeload.github.com/DHARPA-Project/kiara.examples/zip/refs/heads/m…     │
│                                             "cache-control": "max-age=0, private",                                                       │
│                                             "strict-transport-security": "max-age=31536000; includeSubdomains; preload",                 │
│                                             "x-frame-options": "deny",                                                                   │
│                                             "x-content-type-options": "nosniff",                                                         │
│                                             "x-xss-protection": "0",                                                                     │
│                                             "referrer-policy": "no-referrer-when-downgrade",                                             │
│                                             "content-security-policy": "default-src 'none'; base-uri 'self'; block-all-mixed-conten…     │
│                                             "content-length": "0",                                                                       │
│                                             "x-github-request-id": "77A9:872B:7D71061:81F6ABB:63D381E5"                                  │
│                                           }                                                                                              │
│                                         ],                                                                                               │
│                                         "request_time": "2023-01-27T07:48:53.906919+00:00"                                               │
│                                       }                                                                                                  │
│                         dict schema   {                                                                                                  │
│                                         "title": "dict",                                                                                 │
│                                         "type": "object"                                                                                 │
│                                       }                                                                                                  │
│                                                                                                                                          │
│   file_bundle                                                                                                                            │
│                         bundle name       data                                                                                           │
│                         number_of_files   16                                                                                             │
│                         size              298452                                                                                         │
│                         included files                                                                                                   │
│                                             (relative) path                                        size                                  │
│                                            ──────────────────────────────────────────────────────────────                                │
│                                             La_Rassegna/sn84037025_1917-04-14_ed-1_seq-1_ocr.txt   20647                                 │
│                                             La_Rassegna/sn84037025_1917-04-07_ed-1_seq-1_ocr.txt   19397                                 │
│                                             La_Rassegna/sn84037025_1917-04-14_ed-2_seq-1_ocr.txt   20650                                 │
│                                             La_Rassegna/sn84037025_1917-04-21_ed-1_seq-1_ocr.txt   21017                                 │
│                                             La_Rassegna/sn84037025_1917-04-21_ed-2_seq-1_ocr.txt   20982                                 │
│                                             La_Ragione/sn84037024_1917-05-05_ed-2_seq-1_ocr.txt    18474                                 │
│                                             La_Ragione/sn84037024_1917-05-16_ed-1_seq-1_ocr.txt    18620                                 │
│                                             La_Ragione/sn84037024_1917-05-16_ed-2_seq-1_ocr.txt    18698                                 │
│                                             La_Ragione/sn84037024_1917-05-05_ed-1_seq-1_ocr.txt    18346                                 │
│                                             La_Ragione/sn84037024_1917-04-25_ed-4_seq-1_ocr.txt    16235                                 │
│                                             La_Ragione/sn84037024_1917-04-25_ed-3_seq-1_ocr.txt    16793                                 │
│                                             La_Ragione/sn84037024_1917-04-25_ed-2_seq-1_ocr.txt    16679                                 │
│                                             La_Ragione/sn84037024_1917-04-25_ed-1_seq-1_ocr.txt    16613                                 │
│                                             La_Ragione/sn84037024_1917-05-16_ed-3_seq-1_ocr.txt    18540                                 │
│                                             La_Ragione/sn84037024_1917-05-05_ed-4_seq-1_ocr.txt    18481                                 │
│                                             La_Ragione/sn84037024_1917-05-05_ed-3_seq-1_ocr.txt    18280                                 │
│                                                                                                                                          │
│                                                                                                                                          │
│                                                                                                                                          │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Great, we've successfully imported a bundle of files this time rather than just one. This has given us both the metadata for the files and the files themselves. As you can see, kiara also gives us additional information about the composition of the bundle, such as the number of files and their sizes. This information will be useful later, when we intervene on these files, to keep track of how we have changed them. For now, let's save the files in a separate variable for us to use later.

file_bundle = outputs['file_bundle']

Preparing the Texts

Now that we have imported the files, let's give them some structure. For this, we will need the create.table.from.file_bundle operation (similar to the installation notebook, which you are welcome to revisit at any time). Let's have a look at its details.

kiara.retrieve_operation_info('create.table.from.file_bundle')
                                                                                                                                            
 Documentation                                                                                                                              
                     Create a table value from a text file_bundle.                                                                          
                                                                                                                                            
                     The resulting table will have (at a minimum) the following collumns:                                                   
                                                                                                                                            
                     - id: an auto-assigned index
                     - rel_path: the relative path of the file (from the provided base path)
                     - content: the text file content
                                                                                                                                            
 Author(s)                                                                                                                                  
                     Markus Binsteiner   markus@frkl.io                                                                                     
                                                                                                                                            
 Context                                                                                                                                    
                     Tags         tabular                                                                                                   
                     Labels       package: kiara_plugin.tabular                                                                             
                     References   source_repo: https://github.com/DHARPA-Project/kiara_plugin.tabular                                       
                                  documentation: https://DHARPA-Project.github.io/kiara_plugin.tabular/                                     
                                                                                                                                            
 Operation details                                                                                                                          
                     Documentation   Create a table value from a text file_bundle.                                                          
                                                                                                                                            
                                     The resulting table will have (at a minimum) the following collumns:                                   
                                     - id: an auto-assigned index                                                                           
                                     - rel_path: the relative path of the file (from the provided base path)                                
                                     - content: the text file content                                                                       
                                                                                                                                            
                     Inputs                                                                                                                 
                                       field name    type          description                              Required   Default              
                                      ──────────────────────────────────────────────────────────────────────────────────────────────────    
                                       file_bundle   file_bundle   The source value (of type                yes        -- no default --     
                                                                   'file_bundle').                                                          
                                                                                                                                            
                                                                                                                                            
                     Outputs                                                                                                                
                                       field name   type    description                                                                     
                                      ──────────────────────────────────────────────────────────────────────────────────────────────────    
                                       table        table   The result value (of type 'table').                                             
                                                                                                                                            
                                                                                                                                            

Let's use the file bundle we downloaded earlier and saved in our variable, and run this kiara table operation.

inputs = {
    'file_bundle' : file_bundle
}

outputs = kiara.run_job('create.table.from.file_bundle', inputs=inputs)
outputs
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                                                                                                          │
│   field   value                                                                                                                          │
│  ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────  │
│   table                                                                                                                                  │
│             id   rel_path                        mime_type    size    content                          file_name                         │
│            ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────    │
│             0    La_Ragione/sn84037024_1917-04   text/plain   16613   LA RAGIONE                       sn84037024_1917-04-25_ed-1_se     │
│             1    La_Ragione/sn84037024_1917-04   text/plain   16679   LA RAG ONE                       sn84037024_1917-04-25_ed-2_se     │
│             2    La_Ragione/sn84037024_1917-04   text/plain   16793   LA RAGIONE                       sn84037024_1917-04-25_ed-3_se     │
│             3    La_Ragione/sn84037024_1917-04   text/plain   16235   contro i vili, i camorristi, i   sn84037024_1917-04-25_ed-4_se     │
│             4    La_Ragione/sn84037024_1917-05   text/plain   18346   contro i vili, i camorristi, i   sn84037024_1917-05-05_ed-1_se     │
│             5    La_Ragione/sn84037024_1917-05   text/plain   18474   LA RAGIONA                       sn84037024_1917-05-05_ed-2_se     │
│             6    La_Ragione/sn84037024_1917-05   text/plain   18280   LA RAGIONE                       sn84037024_1917-05-05_ed-3_se     │
│             7    La_Ragione/sn84037024_1917-05   text/plain   18481   LA RAGIONE                       sn84037024_1917-05-05_ed-4_se     │
│             8    La_Ragione/sn84037024_1917-05   text/plain   18620   contro i vili, i camorristi, i   sn84037024_1917-05-16_ed-1_se     │
│             9    La_Ragione/sn84037024_1917-05   text/plain   18698   LA RAG ONE                       sn84037024_1917-05-16_ed-2_se     │
│             10   La_Ragione/sn84037024_1917-05   text/plain   18540   contro 1 vili, i camorristi, i   sn84037024_1917-05-16_ed-3_se     │
│             11   La_Rassegna/sn84037025_1917-0   text/plain   19397   ■■■                              sn84037025_1917-04-07_ed-1_se     │
│             12   La_Rassegna/sn84037025_1917-0   text/plain   20647   La Rassegna                      sn84037025_1917-04-14_ed-1_se     │
│             13   La_Rassegna/sn84037025_1917-0   text/plain   20650   Both Phones                      sn84037025_1917-04-14_ed-2_se     │
│             14   La_Rassegna/sn84037025_1917-0   text/plain   21017   ■ jSrìt** W?? iIK 38®f- i^M      sn84037025_1917-04-21_ed-1_se     │
│             15   La_Rassegna/sn84037025_1917-0   text/plain   20982   ■Both Phones                     sn84037025_1917-04-21_ed-2_se     │
│                                                                                                                                          │
│                                                                                                                                          │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Great, this has taken all the information from the files we downloaded and made it a bit easier to navigate. In order to process and analyse our sources, we need to work with the files' content, which is in the 'content' column. Let's run `kiara.list_operation_ids('table')` to see how we might be able to do that.

kiara.list_operation_ids('table')
['create.database.from.table',
 'create.network_data.from.tables',
 'create.table.from.file',
 'create.table.from.file_bundle',
 'export.table.as.csv_file',
 'extract.date_array.from.table',
 'filter.table',
 'import.table.from.local_file_path',
 'import.table.from.local_folder_path',
 'query.table',
 'table.pick.column',
 'table_filter.drop_columns',
 'table_filter.select_columns',
 'table_filter.select_rows']

As we are interested in one column, the table.pick.column operation seems like a good fit.

kiara.retrieve_operation_info('table.pick.column')
                                                                                                                                            
 Documentation                                                                                                                              
                     Pick one column from a table, returning an array.                                                                      
                                                                                                                                            
 Author(s)                                                                                                                                  
                     Markus Binsteiner   markus@frkl.io                                                                                     
                                                                                                                                            
 Context                                                                                                                                    
                     Tags         tabular                                                                                                   
                     Labels       package: kiara_plugin.tabular                                                                             
                     References   source_repo: https://github.com/DHARPA-Project/kiara_plugin.tabular                                       
                                  documentation: https://DHARPA-Project.github.io/kiara_plugin.tabular/                                     
                                                                                                                                            
 Operation details                                                                                                                          
                     Documentation   Pick one column from a table, returning an array.                                                      
                                                                                                                                            
                     Inputs                                                                                                                 
                                       field name    type     description                                   Required   Default              
                                      ──────────────────────────────────────────────────────────────────────────────────────────────────    
                                       table         table    A table.                                      yes        -- no default --     
                                       column_name   string   The name of the column to extract.            yes        -- no default --     
                                                                                                                                            
                                                                                                                                            
                     Outputs                                                                                                                
                                       field name   type    description                                                                     
                                      ──────────────────────────────────────────────────────────────────────────────────────────────────    
                                       array        array   The column.                                                                     
                                                                                                                                            
                                                                                                                                            

So here we need two inputs: the table we just made, and the name of the column we want to pick.

Let's specify our inputs again and run the operation. This way, we retain the content of the files as the variable we need for NLP.

inputs = {
    'table' : outputs['table'],
    'column_name' : 'content'
}

outputs = kiara.run_job('table.pick.column', inputs=inputs)
outputs
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                                                                                                          │
│   field   value                                                                                                                          │
│  ──────────────────────────────────────────────────────────────────────────────────────────────────                                      │
│   array                                                                                                                                  │
│             LA RAGIONE                                                                                                                   │
│             LA RAG ONE                                                                                                                   │
│             LA RAGIONE                                                                                                                   │
│             contro i vili, i camorristi, i sicari, i falsari e gli austriacanti, nemici dell ...                                         │
│             contro i vili, i camorristi, i sicari, i falsari e gli austriacanti, nemici dell ...                                         │
│             LA RAGIONA                                                                                                                   │
│             LA RAGIONE                                                                                                                   │
│             LA RAGIONE                                                                                                                   │
│             contro i vili, i camorristi, i sicari, i falsari e gli austriacanti, nemici dell ...                                         │
│             LA RAG ONE                                                                                                                   │
│             contro 1 vili, i camorristi, i sicari, i falsari e gli austriacanti, nemici dell ...                                         │
│             ■■■                                                                                                                          │
│             La Rassegna                                                                                                                  │
│             Both Phones                                                                                                                  │
│             ■ jSrìt** W?? iIK 38®f- i^M                                                                                                  │
│             ■Both Phones                                                                                                                 │
│                                                                                                                                          │
│                                                                                                                                          │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Natural Language Processing (Stage 1)

Now we are ready to prepare our text for analysis. Let's see what operations are included in kiara for NLP in the kiara_plugin.language_processing package.

# Retrieve info for all available operations, then keep only the ones
# provided by the 'kiara_plugin.language_processing' package.
infos = kiara.retrieve_operations_info()
operations = {}
for op_id, info in infos.item_infos.items():
    if info.context.labels.get("package", None) == "kiara_plugin.language_processing":
        operations[op_id] = info

print(operations.keys())
dict_keys(['create.stopwords_list', 'generate.LDA.for.tokens_array', 'preprocess.tokens_array', 'remove_stopwords.from.tokens_array', 'tokenize.string', 'tokenize.texts_array'])
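As an aside, the id-filter we used earlier with kiara.list_operation_ids works here too. Assuming it matches on substrings of the operation ids (as it appeared to do for 'download' and 'table'), something like the following should narrow the list to token-related operations:

kiara.list_operation_ids('tokens')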

The contents of our text files have been stored as an array. Before performing any other operations, we should start by tokenising our text. We can do this using the tokenize.texts_array operation.

If you're unsure which of these operations to run, you can refer to the built-in explanation in each kiara module, which clarifies what each operation does. For further information about the pros and cons of each pre-processing operation, please refer to this repository here.

kiara.retrieve_operation_info('tokenize.texts_array')
                                                                                                                                            
 Documentation                                                                                                                              
                     Split sentences into words or words into characters.                                                                   
                                                                                                                                            
                     In other words, this operation establishes the word boundaries (i.e., tokens) a very helpful way of finding            
                     patterns. It is also the typical step prior to stemming and lemmatization                                              
                                                                                                                                            
 Author(s)                                                                                                                                  
                     Markus Binsteiner   markus@frkl.io                                                                                     
                                                                                                                                            
 Context                                                                                                                                    
                     Tags         language_processing, tokenize, tokens                                                                     
                     Labels       package: kiara_plugin.language_processing                                                                 
                     References   source_repo: https://github.com/DHARPA-Project/kiara_plugin.language_processing                           
                                  documentation: https://DHARPA-Project.github.io/kiara_plugin.language_processing/                         
                                                                                                                                            
 Operation details                                                                                                                          
                     Documentation   Split sentences into words or words into characters.                                                   
                                                                                                                                            
                                     In other words, this operation establishes the word boundaries (i.e., tokens) a very helpful way of    
                                     finding patterns. It is also the typical step prior to stemming and lemmatization                      
                                                                                                                                            
                     Inputs                                                                                                                 
                                       field name         type      description                             Required   Default              
                                      ──────────────────────────────────────────────────────────────────────────────────────────────────    
                                       texts_array        array     An array of text items to be            yes        -- no default --     
                                                                    tokenized.                                                              
                                       tokenize_by_word   boolean   Whether to tokenize by word             no         True                 
                                                                    (default), or character.                                                
                                                                                                                                            
                                                                                                                                            
                     Outputs                                                                                                                
                                       field name     type    description                                                                   
                                      ──────────────────────────────────────────────────────────────────────────────────────────────────    
                                       tokens_array   array   The tokenized content, as an array of lists of strings.                       
                                                                                                                                            
                                                                                                                                            

Great, let's give it a go!

inputs = {
    'texts_array': outputs['array']
}

outputs = kiara.run_job('tokenize.texts_array', inputs=inputs)
outputs
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                                                                                                          │
│   field          value                                                                                                                   │
│  ─────────────────────────────────────────────────────────────────────────────────────────────────────────                               │
│   tokens_array                                                                                                                           │
│                    ['LA', 'RAGIONE', 'ORGANO', 'DI', 'DIFESA', 'DELLA', "ITALIANITÀ'", 'contro', '1 ...                                  │
│                    ['LA', 'RAG', 'ONE', 'contro', 'i', 'vili', ',', 'i', 'camorristi', ',', 'i', 's ...                                  │
│                    ['LA', 'RAGIONE', 'ORGANO', 'DI', 'DIFESA', 'DELLA', 'ITALIANITÀ', 'contro', 'i' ...                                  │
│                    ['contro', 'i', 'vili', ',', 'i', 'camorristi', ',', 'i', 'sicari', ',', 'i', 'f ...                                  │
│                    ['contro', 'i', 'vili', ',', 'i', 'camorristi', ',', 'i', 'sicari', ',', 'i', 'f ...                                  │
│                    ['LA', 'RAGIONA', 'ORGANO', 'DI', 'DIFESA', 'DELLA', 'ITALIANITÀ', 'contro', 'i' ...                                  │
│                    ['LA', 'RAGIONE', 'ORGANO', 'DI', 'DIFESA', 'DELLA', "ITALIANITÀ'", 'contro', 'i ...                                  │
│                    ['LA', 'RAGIONE', 'contro', 'i', 'vili', ',', '1', 'camorristi', ',', 'i', 'sica ...                                  │
│                    ['contro', 'i', 'vili', ',', 'i', 'camorristi', ',', 'i', 'sicari', ',', 'i', 'f ...                                  │
│                    ['LA', 'RAG', 'ONE', 'ORGANO', 'DI', 'DIFESA', 'DELLA', 'ITALIANITÀ', "''", 'con ...                                  │
│                    ['contro', '1', 'vili', ',', 'i', 'camorristi', ',', 'i', 'sicari', ',', 'i', 'f ...                                  │
│                    ['■■■', 'La', 'Rassegna', '_', 'I', 'Both', 'Phones', 'ANNO', 'L', 'No', '.', '1 ...                                  │
│                    ['La', 'Rassegna', 'Jjoth', 'Phones', 'ANNO', 'L', 'No', '.', '2', 'BASTA', '!', ...                                  │
│                    ['Both', 'Phones', 'ANNO', 'I', '.', 'No', '.', '2', 'BASTA', '!', '...', 'uà',  ...                                  │
│                    ['■', 'jSrìt', '*', '*', 'W', '?', '?', 'iIK', '38®f-', 'i^M', 'F', '<', '5É', ' ...                                  │
│                    ['■Both', 'Phones', 'ANNO', '11', '.', 'No', '.', '5', 'LE', 'COSE', 'A', 'POSTO ...                                  │
│                                                                                                                                          │
│                                                                                                                                          │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

We can see from the printed preview that this has tokenized the contents of each of the text files we imported.
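As an aside, the tokenize_by_word input we saw above toggles between word-level and character-level tokens. Here is a conceptual plain-Python illustration of the difference (this is just a sketch, not kiara's internal tokenizer):

sample = "LA RAGIONE"
by_word = sample.split()   # word tokens: ['LA', 'RAGIONE']
by_char = list(sample)     # character tokens: ['L', 'A', ' ', 'R', ...]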

Now we can work on pre-processing some of this text. Let's look at what options we have in the preprocess.tokens_array operation.

kiara.retrieve_operation_info('preprocess.tokens_array')
                                                                                                                                            
 Documentation                                                                                                                              
                     Preprocess lists of tokens, incl. lowercasing, remove special characers, etc.                                          
                                                                                                                                            
                     Lowercasing: Lowercase the words. This operation is a double-edged sword. It can be effective at yielding              
                     potentially better results in the case of relatively small datasets or datatsets with a high percentage of OCR         
                     mistakes. For instance, if lowercasing is not performed, the algorithm will treat USA, Usa, usa, UsA, uSA, etc. as     
                     distinct tokens, even though they may all refer to the same entity. On the other hand, if the dataset does not         
                     contain such OCR mistakes, then it may become difficult to distinguish between homonyms and make interpreting the      
                     topics much harder.                                                                                                    
                                                                                                                                            
                     Removing stopwords and words with less than three characters: Remove low information words. These are typically        
                     words such as articles, pronouns, prepositions, conjunctions, etc. which are not semantically salient. There are       
                     numerous stopword lists available for many, though not all, languages which can be easily adapted to the individual    
                     researcher's needs. Removing words with less than three characters may additionally remove many OCR mistakes. Both     
                     these operations have the dual advantage of yielding more reliable results while reducing the size of the dataset,     
                     thus in turn reducing the required processing power. This step can therefore hardly be considered optional in TM.      
                                                                                                                                            
                     Noise removal: Remove elements such as punctuation marks, special characters, numbers, html formatting, etc. This      
                     operation is again concerned with removing elements that may not be relevant to the text analysis and in fact          
                     interfere with it. Depending on the dataset and research question, this operation can become essential.                
                                                                                                                                            
 Author(s)                                                                                                                                  
                     Markus Binsteiner   markus@frkl.io                                                                                     
                                                                                                                                            
 Context                                                                                                                                    
                     Tags         language_processing, tokens, preprocess                                                                   
                     Labels       package: kiara_plugin.language_processing                                                                 
                     References   source_repo: https://github.com/DHARPA-Project/kiara_plugin.language_processing                           
                                  documentation: https://DHARPA-Project.github.io/kiara_plugin.language_processing/                         
                                                                                                                                            
 Operation details                                                                                                                          
                     Documentation   Preprocess lists of tokens, incl. lowercasing, remove special characers, etc.                          
                                                                                                                                            
                                     Lowercasing: Lowercase the words. This operation is a double-edged sword. It can be effective at       
                                     yielding potentially better results in the case of relatively small datasets or datatsets with a       
                                     high percentage of OCR mistakes. For instance, if lowercasing is not performed, the algorithm will     
                                     treat USA, Usa, usa, UsA, uSA, etc. as distinct tokens, even though they may all refer to the same     
                                     entity. On the other hand, if the dataset does not contain such OCR mistakes, then it may become       
                                     difficult to distinguish between homonyms and make interpreting the topics much harder.                
                                                                                                                                            
                                     Removing stopwords and words with less than three characters: Remove low information words. These      
                                     are typically words such as articles, pronouns, prepositions, conjunctions, etc. which are not         
                                     semantically salient. There are numerous stopword lists available for many, though not all,            
                                     languages which can be easily adapted to the individual researcher's needs. Removing words with less   
                                     than three characters may additionally remove many OCR mistakes. Both these operations have the dual   
                                     advantage of yielding more reliable results while reducing the size of the dataset, thus in turn       
                                     reducing the required processing power. This step can therefore hardly be considered optional in TM.   
                                                                                                                                            
                                     Noise removal: Remove elements such as punctuation marks, special characters, numbers, html            
                                     formatting, etc. This operation is again concerned with removing elements that may not be relevant     
                                     to the text analysis and in fact interfere with it. Depending on the dataset and research question,    
                                     this operation can become essential.                                                                   
                                                                                                                                            
                     Inputs                                                                                                                 
                                       field name            type      description                          Required   Default              
                                      ──────────────────────────────────────────────────────────────────────────────────────────────────    
                                       tokens_array          array     The tokens array to pre-process.     yes        -- no default --     
                                       to_lowercase          boolean   Apply lowercasing to the text.       no         False                
                                       remove_alphanumeric   boolean   Remove all tokens that include       no         False                
                                                                       numbers (e.g. ex1ample).                                             
                                       remove_non_alpha      boolean   Remove all tokens that include       no         False                
                                                                       punctuation and numbers (e.g.                                        
                                                                       ex1a.mple).                                                          
                                       remove_all_numeric    boolean   Remove all tokens that contain       no         False                
                                                                       numbers only (e.g. 876).                                             
                                       remove_short_tokens   integer   Remove tokens shorter or equal to    no         0                    
                                                                       this value. If value is <= 0, no                                     
                                                                       filtering will be done.                                              
                                       remove_stopwords      list      Remove stopwords.                    no         -- no default --     
                                                                                                                                            
                                                                                                                                            
                     Outputs                                                                                                                
                                       field name     type    description                                                                   
                                      ──────────────────────────────────────────────────────────────────────────────────────────────────    
                                       tokens_array   array   The pre-processed content, as an array of lists of strings.                   
                                                                                                                                            
                                                                                                                                            

kiara includes the most widely used text analysis pre-processing operations. Let's try some of them and take a few moments to notice how they change our text.

Let's start by removing the so-called stopwords. These are low information words such as articles, pronouns, prepositions, conjunctions, etc. which are not semantically salient. There are numerous stopword lists available for many, though not all, languages, and these can be easily adapted to the individual researcher's needs. Here we are defining our own stopword list, but do experiment yourself with adding and changing some of the words.

# a minimal custom stopword list: the Italian articles 'la' and 'i'
stopword_list = ['la', 'i']

inputs = {
    'tokens_array': outputs['tokens_array'],
    'remove_stopwords': stopword_list
}

outputs = kiara.run_job('preprocess.tokens_array', inputs=inputs)
outputs
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                                                                                                          │
│   field          value                                                                                                                   │
│  ─────────────────────────────────────────────────────────────────────────────────────────────────────────                               │
│   tokens_array                                                                                                                           │
│                    ['RAGIONE', 'ORGANO', 'DI', 'DIFESA', 'DELLA', "ITALIANITÀ'", 'contro', '1', 'vi ...                                  │
│                    ['RAG', 'ONE', 'contro', 'vili', ',', 'camorristi', ',', 'sicari', ',', 'falsari ...                                  │
│                    ['RAGIONE', 'ORGANO', 'DI', 'DIFESA', 'DELLA', 'ITALIANITÀ', 'contro', 'vili', ' ...                                  │
│                    ['contro', 'vili', ',', 'camorristi', ',', 'sicari', ',', 'falsari', 'e', 'gli', ...                                  │
│                    ['contro', 'vili', ',', 'camorristi', ',', 'sicari', ',', 'falsari', 'e', 'gli', ...                                  │
│                    ['RAGIONA', 'ORGANO', 'DI', 'DIFESA', 'DELLA', 'ITALIANITÀ', 'contro', 'vili', ' ...                                  │
│                    ['RAGIONE', 'ORGANO', 'DI', 'DIFESA', 'DELLA', "ITALIANITÀ'", 'contro', 'vili',  ...                                  │
│                    ['RAGIONE', 'contro', 'vili', ',', '1', 'camorristi', ',', 'sicari', ',', 'falsa ...                                  │
│                    ['contro', 'vili', ',', 'camorristi', ',', 'sicari', ',', 'falsari', 'e', 'gli', ...                                  │
│                    ['RAG', 'ONE', 'ORGANO', 'DI', 'DIFESA', 'DELLA', 'ITALIANITÀ', "''", 'contro',  ...                                  │
│                    ['contro', '1', 'vili', ',', 'camorristi', ',', 'sicari', ',', 'falsari', 'e', ' ...                                  │
│                    ['■■■', 'Rassegna', '_', 'Both', 'Phones', 'ANNO', 'L', 'No', '.', '1', 'Il', 'p ...                                  │
│                    ['Rassegna', 'Jjoth', 'Phones', 'ANNO', 'L', 'No', '.', '2', 'BASTA', '!', '...' ...                                  │
│                    ['Both', 'Phones', 'ANNO', '.', 'No', '.', '2', 'BASTA', '!', '...', 'uà', 'quai ...                                  │
│                    ['■', 'jSrìt', '*', '*', 'W', '?', '?', 'iIK', '38®f-', 'i^M', 'F', '<', '5É', ' ...                                  │
│                    ['■Both', 'Phones', 'ANNO', '11', '.', 'No', '.', '5', 'LE', 'COSE', 'A', 'POSTO ...                                  │
│                                                                                                                                          │
│                                                                                                                                          │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
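
Our two-word list is deliberately tiny. In practice you would usually start from a published stopword list and adapt it. Here is a minimal sketch, assuming the nltk package is available in this environment and using its bundled Italian list; we store the result under a separate name so the rest of this notebook keeps using our custom list:

# an alternative to hand-writing the list: nltk ships stopword lists for many
# languages (assumption: the nltk package is installed in this environment)
import nltk
nltk.download('stopwords', quiet=True)   # fetch the stopword corpora once
from nltk.corpus import stopwords

italian_stopwords = stopwords.words('italian')

inputs = {
    'tokens_array': outputs['tokens_array'],
    'remove_stopwords': italian_stopwords
}
nltk_outputs = kiara.run_job('preprocess.tokens_array', inputs=inputs)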

Great. Let's take this a bit further and try combining two of our options in one job. In reality we can set all the inputs together in a single job — we'll sketch that below — but let's start with converting everything into lowercase and removing any tokens that contain non-alphabetic characters.

inputs = {
    'tokens_array': outputs['tokens_array'],  # the stopword-filtered array from the previous step
    'to_lowercase': True,                     # normalise everything to lowercase
    'remove_non_alpha': True                  # drop tokens containing punctuation or digits
}

outputs = kiara.run_job('preprocess.tokens_array', inputs=inputs)
outputs
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                                                                                                          │
│   field          value                                                                                                                   │
│  ─────────────────────────────────────────────────────────────────────────────────────────────────────────                               │
│   tokens_array                                                                                                                           │
│                    ['ragione', 'organo', 'di', 'difesa', 'della', 'contro', 'vili', 'camorristi', ' ...                                  │
│                    ['rag', 'one', 'contro', 'vili', 'camorristi', 'sicari', 'falsari', 'e', 'gli',  ...                                  │
│                    ['ragione', 'organo', 'di', 'difesa', 'della', 'italianità', 'contro', 'vili', ' ...                                  │
│                    ['contro', 'vili', 'camorristi', 'sicari', 'falsari', 'e', 'gli', 'austriacanti' ...                                  │
│                    ['contro', 'vili', 'camorristi', 'sicari', 'falsari', 'e', 'gli', 'austriacanti' ...                                  │
│                    ['ragiona', 'organo', 'di', 'difesa', 'della', 'italianità', 'contro', 'vili', ' ...                                  │
│                    ['ragione', 'organo', 'di', 'difesa', 'della', 'contro', 'vili', 'camorristi', ' ...                                  │
│                    ['ragione', 'contro', 'vili', 'camorristi', 'sicari', 'falsari', 'e', 'gli', 'au ...                                  │
│                    ['contro', 'vili', 'camorristi', 'sicari', 'falsari', 'e', 'gli', 'austriacanti' ...                                  │
│                    ['rag', 'one', 'organo', 'di', 'difesa', 'della', 'italianità', 'contro', 'vili' ...                                  │
│                    ['contro', 'vili', 'camorristi', 'sicari', 'falsari', 'e', 'gli', 'austriacanti' ...                                  │
│                    ['rassegna', 'both', 'phones', 'anno', 'l', 'no', 'il', 'perche', 'de', 'rassegn ...                                  │
│                    ['rassegna', 'jjoth', 'phones', 'anno', 'l', 'no', 'basta', 'da', 'qualche', 'te ...                                  │
│                    ['both', 'phones', 'anno', 'no', 'basta', 'uà', 'quaiene', 'tempo', 'a', 'questa ...                                  │
│                    ['jsrìt', 'w', 'iik', 'f', 'v', 'ht', 'p', 't', 'both', 'phones', 'anno', 'il',  ...                                  │
│                    ['phones', 'anno', 'no', 'le', 'cose', 'a', 'posto', 'si', 'va', 'dicendo', 'si' ...                                  │
│                                                                                                                                          │
│                                                                                                                                          │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
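
As mentioned above, nothing stops us from setting several inputs in one go. Here is an illustrative sketch combining the steps we have tried so far in a single job; the flag values are just examples, and all field names come from the operation info earlier:

inputs = {
    'tokens_array': outputs['tokens_array'],
    'to_lowercase': True,            # normalise case
    'remove_non_alpha': True,        # drop tokens with punctuation or digits
    'remove_short_tokens': 2,        # drop tokens of length <= 2
    'remove_stopwords': ['la', 'i']  # our custom stopword list
}
combined = kiara.run_job('preprocess.tokens_array', inputs=inputs)
combined['tokens_array']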

Now that we're happy with our pre-processed texts, we can use generate.LDA.for.tokens_array to try out some topic modelling. The default number of topics is seven, but just like the preprocess.tokens_array operation, we can play around with the options. Let's have a look.

kiara.retrieve_operation_info('generate.LDA.for.tokens_array')
                                                                                                                                            
 Documentation                                                                                                                              
                     Perform Latent Dirichlet Allocation on a tokenized corpus.                                                             
                                                                                                                                            
                     This module computes models for a range of number of topics provided by the user.                                      
                                                                                                                                            
 Author(s)                                                                                                                                  
                     Markus Binsteiner   markus@frkl.io                                                                                     
                                                                                                                                            
 Context                                                                                                                                    
                     Tags         language_processing, LDA, tokens                                                                          
                     Labels       package: kiara_plugin.language_processing                                                                 
                     References   source_repo: https://github.com/DHARPA-Project/kiara_plugin.language_processing                           
                                  documentation: https://DHARPA-Project.github.io/kiara_plugin.language_processing/                         
                                                                                                                                            
 Operation details                                                                                                                          
                     Documentation   Perform Latent Dirichlet Allocation on a tokenized corpus.                                             
                                                                                                                                            
                                     This module computes models for a range of number of topics provided by the user.                      
                                                                                                                                            
                     Inputs                                                                                                                 
                                       field name          type      description                            Required   Default              
                                      ──────────────────────────────────────────────────────────────────────────────────────────────────    
                                       tokens_array        array     The text corpus.                       yes        -- no default --     
                                       num_topics_min      integer   The minimal number of topics.          no         7                    
                                       num_topics_max      integer   The max number of topics.              no         7                    
                                       compute_coherence   boolean   Whether to compute the coherence       no         False                
                                                                     score for each model.                                                  
                                       words_per_topic     integer   How many words per topic to put in     no         10                   
                                                                     the result model.                                                      
                                                                                                                                            
                                                                                                                                            
                     Outputs                                                                                                                
                                       field name        type    description                                                                
                                      ──────────────────────────────────────────────────────────────────────────────────────────────────    
                                       topic_models      dict    A dictionary with one coherence model table for each number of topics.     
                                       coherence_table   table   Coherence details.                                                         
                                       coherence_map     dict    A map with the coherence value for every number of topics.                 
                                                                                                                                            
                                                                                                                                            

We'll stick with the defaults for now, and generate some topics for our texts.

# run LDA with all defaults: a single model with seven topics
inputs = {
    'tokens_array': outputs['tokens_array']
}

outputs = kiara.run_job('generate.LDA.for.tokens_array', inputs=inputs)
outputs
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                                                                                                          │
│   field             value                                                                                                                │
│  ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────  │
│   coherence_map                                                                                                                          │
│                       dict data     {}                                                                                                   │
│                       dict schema   {                                                                                                    │
│                                       "title": "dict",                                                                                   │
│                                       "type": "object"                                                                                   │
│                                     }                                                                                                    │
│                                                                                                                                          │
│   coherence_table   -- none/not set --                                                                                                   │
│   topic_models                                                                                                                           │
│                       dict data     {                                                                                                    │
│                                       "7": [                                                                                             │
│                                         [                                                                                                │
│                                           0,                                                                                             │
│                                           "0.031*\"di\" + 0.024*\"e\" + 0.017*\"che\" + 0.015*\"il\" + 0.013*\"non\" + 0.012*\"a\" …     │
│                                         ],                                                                                               │
│                                         [                                                                                                │
│                                           1,                                                                                             │
│                                           "0.043*\"di\" + 0.027*\"e\" + 0.025*\"che\" + 0.017*\"il\" + 0.016*\"a\" + 0.016*\"non\" …     │
│                                         ],                                                                                               │
│                                         [                                                                                                │
│                                           2,                                                                                             │
│                                           "0.023*\"di\" + 0.022*\"e\" + 0.021*\"che\" + 0.014*\"a\" + 0.011*\"per\" + 0.011*\"il\" …     │
│                                         ],                                                                                               │
│                                         [                                                                                                │
│                                           3,                                                                                             │
│                                           "0.043*\"di\" + 0.028*\"e\" + 0.026*\"che\" + 0.019*\"il\" + 0.016*\"a\" + 0.013*\"non\" …     │
│                                         ],                                                                                               │
│                                         [                                                                                                │
│                                           4,                                                                                             │
│                                           "0.025*\"di\" + 0.020*\"che\" + 0.018*\"e\" + 0.016*\"a\" + 0.013*\"un\" + 0.012*\"il\" +…     │
│                                         ],                                                                                               │
│                                         [                                                                                                │
│                                           5,                                                                                             │
│                                           "0.030*\"di\" + 0.019*\"e\" + 0.016*\"che\" + 0.016*\"il\" + 0.011*\"un\" + 0.011*\"a\" +…     │
│                                         ],                                                                                               │
│                                         [                                                                                                │
│                                           6,                                                                                             │
│                                           "0.029*\"di\" + 0.018*\"e\" + 0.013*\"che\" + 0.012*\"il\" + 0.010*\"si\" + 0.009*\"per\"…     │
│                                         ]                                                                                                │
│                                       ]                                                                                                  │
│                                     }                                                                                                    │
│                       dict schema   {                                                                                                    │
│                                       "title": "dict",                                                                                   │
│                                       "type": "object"                                                                                   │
│                                     }                                                                                                    │
│                                                                                                                                          │
│                                                                                                                                          │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
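
Because we accepted the defaults, compute_coherence stayed False, so coherence_table and coherence_map came back empty. To compare models with different numbers of topics, we could sweep a range and ask for coherence scores. A sketch using the inputs documented above: the range 5 to 9 is arbitrary, and tokens is a hypothetical reference to the pre-processed array, saved before the LDA run reused the name outputs:

# assumes we kept a reference such as tokens = outputs['tokens_array']
# right after the last pre-processing job (the LDA run overwrote 'outputs')
inputs = {
    'tokens_array': tokens,
    'num_topics_min': 5,
    'num_topics_max': 9,
    'compute_coherence': True      # fill coherence_table / coherence_map
}
lda_outputs = kiara.run_job('generate.LDA.for.tokens_array', inputs=inputs)
lda_outputs['coherence_map']       # one coherence value per number of topics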

Recording and Tracing our Data

We've successfully downloaded, organised and pre-processed our text files, and now generated some topics for them.
Fantastic!

As we know, this means we've made lots of decisions about our research process and our data. But by using kiara, we can trace what's changed and the decisions we've made. Let's have a look!

Every value in kiara carries its full lineage: the chain of operations and inputs that produced it. This is especially useful for operations with options (like the pre-processing flags above), because we can always check afterwards which settings were selected at each step.

# grab the topic-model value so we can inspect its provenance
topics = outputs['topic_models']

topics.lineage
generate.LDA.for.tokens_array
├── input: compute_coherence (boolean) = 5137a237-fe0f-45bd-abe3-cc84700a2bb6
├── input: num_topics_max (integer) = 75399d70-bbef-4215-b9b5-5dacfa03b2ba
├── input: num_topics_min (integer) = 75399d70-bbef-4215-b9b5-5dacfa03b2ba
├── input: tokens_array (array) = 02d01eb7-70d6-4ef7-811e-66ed25f920bb
│   └── preprocess.tokens_array
│       ├── input: remove_all_numeric (boolean) = 5137a237-fe0f-45bd-abe3-cc84700a2bb6
│       ├── input: remove_alphanumeric (boolean) = 5137a237-fe0f-45bd-abe3-cc84700a2bb6
│       ├── input: remove_non_alpha (boolean) = 8b1b93ec-a51e-4bbd-84cf-c5a1efd78e9b
│       ├── input: remove_short_tokens (integer) = f5df1b36-9884-413d-92d0-81209227f106
│       ├── input: remove_stopwords (list) = bb8a79b2-369c-46ae-a85a-2b0f85c9da22
│       ├── input: to_lowercase (boolean) = 8b1b93ec-a51e-4bbd-84cf-c5a1efd78e9b
│       └── input: tokens_array (array) = d1db365d-2e59-4455-ae05-78447e5a4268
│           └── preprocess.tokens_array
│               ├── input: remove_all_numeric (boolean) = 5137a237-fe0f-45bd-abe3-cc84700a2bb6
│               ├── input: remove_alphanumeric (boolean) = 5137a237-fe0f-45bd-abe3-cc84700a2bb6
│               ├── input: remove_non_alpha (boolean) = 5137a237-fe0f-45bd-abe3-cc84700a2bb6
│               ├── input: remove_short_tokens (integer) = f5df1b36-9884-413d-92d0-81209227f106
│               ├── input: remove_stopwords (list) = 524b5812-c4df-4ea0-a50a-d0ec5166c22f
│               ├── input: to_lowercase (boolean) = 5137a237-fe0f-45bd-abe3-cc84700a2bb6
│               └── input: tokens_array (array) = a3c66f00-7f67-483d-8018-a64714094fa4
│                   └── tokenize.texts_array
│                       ├── input: texts_array (array) = 3db76a98-88e6-45ee-8618-7c95fdf8232c
│                       │   └── table.pick.column
│                       │       ├── input: column_name (string) = 33ebce29-be63-4644-b66b-9e82a3c56236
│                       │       └── input: table (table) = bd56aae9-6289-4f3e-b3f4-edbc55310689
│                       │           └── create.table
│                       │               └── input: file_bundle (file_bundle) = 214ae90d-224b-447a-b0e8-112024a8e6d4
│                       │                   └── download.file_bundle
│                       │                       ├── input: sub_path (string) = 89c3d000-a486-4089-9592-142253d8f3d3
│                       │                       └── input: url (string) = 10d94fa6-0c3d-4d6e-a457-9fa1e7b63e99
│                       └── input: tokenize_by_word (boolean) = 8b1b93ec-a51e-4bbd-84cf-c5a1efd78e9b
└── input: words_per_topic (integer) = cd1319e3-a6ec-4d8d-99b3-34ef873e1d13
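
If we want this provenance to outlive the current session, kiara can store values in its data registry. Depending on your kiara version, the call looks roughly like the following; this is a hedged sketch, the alias name is ours, and the exact signature may differ, so check the API docs:

# store the topic-model value under a human-readable alias (sketch; verify
# the signature against your kiara version)
kiara.store_value(topics, alias='la_rassegna_topics')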