Operations
create.stopwords_list
Documentation
Create a list of stopwords from one or multiple sources.
This will download nltk stopwords if necessary, and
merge all input lists into a single, sorted list without
duplicates.
Author(s)
Markus Binsteiner markus@frkl.io
Context
Tags language_processing
Labels package: kiara_plugin.language_processing
References source_repo:
https://github.com/DHARPA-Project/kiara_pl…
documentation:
https://DHARPA-Project.github.io/kiara_plu…
Operation details
Inputs
field name       type   description                                  required   default
─────────────────────────────────────────────────────────────────────────────────────────
languages        list   A list of languages; will be used to         no         -- no default --
                        retrieve language stopwords from nltk.
stopword_lists   list   A list of lists of stopwords.                no         -- no default --
Outputs
field name       type   description
──────────────────────────────────────────────────────
stopwords_list   list   A sorted list of unique stopwords.
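As a rough illustration of what this operation does internally, here is a minimal Python sketch using plain nltk rather than the kiara API (all variable names are illustrative only):

    import nltk
    from nltk.corpus import stopwords

    # Fetch the NLTK stopword corpora if they are not available locally.
    nltk.download("stopwords", quiet=True)

    languages = ["english", "german"]            # example input
    stopword_lists = [["foo", "bar"], ["bar"]]   # example input

    merged = set()
    for lang in languages:
        merged.update(stopwords.words(lang))     # default NLTK list per language
    for custom in stopword_lists:
        merged.update(custom)                    # user-supplied lists

    stopwords_list = sorted(merged)              # single, sorted list without duplicates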
generate.LDA.for.tokens_array
Documentation
Perform Latent Dirichlet Allocation on a tokenized
corpus.
This module computes models for a range of topic counts provided by the
user.
Author(s)
Markus Binsteiner markus@frkl.io
Context
Tags language_processing, LDA, tokens
Labels package: kiara_plugin.language_processing
References source_repo:
https://github.com/DHARPA-Project/kiara_pl…
documentation:
https://DHARPA-Project.github.io/kiara_plu…
Operation details
Inputs
field name          type      description                                 required   default
───────────────────────────────────────────────────────────────────────────────────────────────
tokens_array        array     The text corpus.                            yes        -- no default --
num_topics_min      integer   The minimal number of topics.               no         7
num_topics_max      integer   The max number of topics.                   no         -- no default --
compute_coherence   boolean   Whether to compute the coherence score      no         False
                              for each model.
words_per_topic     integer   How many words per topic to put in the      no         10
                              result model.
Outputs
field name        type    description
──────────────────────────────────────────────────────────
topic_models      dict    A dictionary with one coherence model
                          table for each number of topics.
coherence_table   table   Coherence details.
coherence_map     dict    A map with the coherence value for every
                          number of topics.
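A minimal sketch of such a topic-count sweep with gensim, which provides both LDA and coherence models (whether the plugin uses exactly these calls is an assumption, and the variable names are illustrative):

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel
    from gensim.models.coherencemodel import CoherenceModel

    tokens_array = [["some", "tokenized", "text"],   # example corpus
                    ["another", "tokenized", "document"],
                    ["a", "third", "text"]]
    num_topics_min, num_topics_max = 2, 5
    words_per_topic = 10

    dictionary = Dictionary(tokens_array)
    corpus = [dictionary.doc2bow(doc) for doc in tokens_array]

    topic_models, coherence_map = {}, {}
    # One model per topic count in the user-provided range.
    for num_topics in range(num_topics_min, num_topics_max + 1):
        model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)
        topic_models[num_topics] = model.print_topics(num_words=words_per_topic)
        coherence = CoherenceModel(model=model, texts=tokens_array,
                                   dictionary=dictionary, coherence="c_v")
        coherence_map[num_topics] = coherence.get_coherence()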
preprocess.tokens_array
Documentation
Preprocess lists of tokens, including lowercasing, removal of special
characters, etc.
Lowercasing: Lowercase the words. This operation is a
double-edged sword. It can be effective at yielding
potentially better results in the case of relatively
small datasets or datasets with a high percentage of
OCR mistakes. For instance, if lowercasing is not
performed, the algorithm will treat USA, Usa, usa, UsA,
uSA, etc. as distinct tokens, even though they may all
refer to the same entity. On the other hand, if the
dataset does not contain such OCR mistakes, then it may
become difficult to distinguish between homonyms and
make interpreting the topics much harder.
Removing stopwords and words with fewer than three
characters: Remove low-information words. These are
typically words such as articles, pronouns,
prepositions, conjunctions, etc. which are not
semantically salient. There are numerous stopword lists
available for many, though not all, languages, which can
be easily adapted to the individual researcher's needs.
Removing words with fewer than three characters may
additionally remove many OCR mistakes. Both these
operations have the dual advantage of yielding more
reliable results while reducing the size of the dataset,
thus in turn reducing the required processing power.
This step can therefore hardly be considered optional in
topic modelling (TM).
Noise removal: Remove elements such as punctuation
marks, special characters, numbers, HTML formatting,
etc. This operation is again concerned with removing
elements that may not be relevant to the text analysis
and in fact interfere with it. Depending on the dataset
and research question, this operation can become
essential.
Author(s)
Markus Binsteiner markus@frkl.io
Context
Tags language_processing, tokens, preprocess
Labels package: kiara_plugin.language_processing
References source_repo:
https://github.com/DHARPA-Project/kiara_pl…
documentation:
https://DHARPA-Project.github.io/kiara_plu…
Operation details
Inputs
field name            type      description                                   required   default
────────────────────────────────────────────────────────────────────────────────────────────────────
tokens_array          array     The tokens array to pre-process.              yes        -- no default --
to_lowercase          boolean   Apply lowercasing to the text.                no         False
remove_alphanumeric   boolean   Remove all tokens that include numbers        no         False
                                (e.g. ex1ample).
remove_non_alpha      boolean   Remove all tokens that include punctuation    no         False
                                and numbers (e.g. ex1a.mple).
remove_all_numeric    boolean   Remove all tokens that contain numbers        no         False
                                only (e.g. 876).
remove_short_tokens   integer   Remove tokens shorter than a certain          no         False
                                length. If value is <= 0, no filtering
                                will be done.
remove_stopwords      list      Remove stopwords.                             no         -- no default --
Outputs
field name     type    description
─────────────────────────────────────────────────────────
tokens_array   array   The pre-processed content, as an array
                       of lists of strings.
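The switches above compose into a simple per-token filter chain. A minimal sketch of that logic in plain Python (the function name and exact filter order are assumptions for illustration, not the plugin's actual internals):

    def preprocess_tokens(tokens, to_lowercase=False, remove_non_alpha=False,
                          remove_all_numeric=False, min_token_length=0,
                          stopwords=frozenset()):
        result = []
        for token in tokens:
            if to_lowercase:
                token = token.lower()                  # USA, Usa, usa -> usa
            if remove_non_alpha and not token.isalpha():
                continue                               # drops tokens with punctuation/digits
            if remove_all_numeric and token.isdigit():
                continue                               # drops purely numeric tokens, e.g. 876
            if min_token_length > 0 and len(token) < min_token_length:
                continue                               # drops short tokens, often OCR noise
            if token in stopwords:
                continue
            result.append(token)
        return result

    # Applied per document of a tokens array:
    tokens_array = [["The", "USA", "876", "is", "a", "country"]]
    cleaned = [preprocess_tokens(doc, to_lowercase=True, remove_all_numeric=True,
                                 min_token_length=3, stopwords=frozenset({"the"}))
               for doc in tokens_array]   # -> [["usa", "country"]]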
remove_stopwords.from.tokens_array
Documentation
Remove stopwords from an array of token-lists.
Author(s)
Markus Binsteiner markus@frkl.io
Context
Tags language_processing
Labels package: kiara_plugin.language_processing
References source_repo:
https://github.com/DHARPA-Project/kiara_pl…
documentation:
https://DHARPA-Project.github.io/kiara_plu…
Operation details
Inputs
field name             type    description                               required   default
───────────────────────────────────────────────────────────────────────────────────────────────
tokens_array           array   An array of string lists (a list of       yes        -- no default --
                               tokens).
languages              list    A list of language names to use           no         -- no default --
                               default stopword lists for.
additional_stopwords   list    A list of additional, custom              no         -- no default --
                               stopwords.
Outputs
field name     type    description
────────────────────────────────────────────────────────
tokens_array   array   An array of string lists, with the
                       stopwords removed.
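Conceptually this is the filtering counterpart to create.stopwords_list: merge the default NLTK lists for the requested languages with any custom stopwords, then filter every token list. A minimal sketch with illustrative names, not the plugin's actual code:

    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)

    tokens_array = [["this", "is", "a", "test"], ["ein", "kleiner", "test"]]
    languages = ["english", "german"]
    additional_stopwords = ["test"]

    to_remove = set(additional_stopwords)
    for lang in languages:
        to_remove.update(stopwords.words(lang))     # default lists per language

    tokens_array = [[tok for tok in doc if tok not in to_remove]
                    for doc in tokens_array]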
tokenize.string
Documentation
Tokenize a string.
Author(s)
Markus Binsteiner markus@frkl.io
Context
Tags language_processing
Labels package: kiara_plugin.language_processing
References source_repo:
https://github.com/DHARPA-Project/kiara_pl…
documentation:
https://DHARPA-Project.github.io/kiara_plu…
Operation details
Inputs
field name   type     description             required   default
───────────────────────────────────────────────────────────────────────
text         string   The text to tokenize.   yes        -- no default --
Outputs
field name   type   description
─────────────────────────────────────────────────────
token_list   list   The tokenized version of the input text.
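A minimal sketch of word tokenization with nltk (the plugin's exact tokenizer choice is an assumption here; the point is the input/output shape):

    import nltk
    from nltk.tokenize import word_tokenize

    # Tokenizer models; newer nltk releases may also require "punkt_tab".
    nltk.download("punkt", quiet=True)

    token_list = word_tokenize("The quick brown fox jumps over the lazy dog.")
    # -> ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']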
tokenize.texts_array
Documentation
Split sentences into words or words into characters.
In other words, this operation establishes the word
boundaries (i.e., tokens), a very helpful way of finding
patterns. It is also the typical step prior to stemming
and lemmatization.
Author(s)
Markus Binsteiner markus@frkl.io
Context
Tags language_processing, tokenize, tokens
Labels package: kiara_plugin.language_processing
References source_repo:
https://github.com/DHARPA-Project/kiara_pl…
documentation:
https://DHARPA-Project.github.io/kiara_plu…
Operation details
Inputs
field name         type      description                                required   default
──────────────────────────────────────────────────────────────────────────────────────────────
texts_array        array     An array of text items to be tokenized.    yes        -- no default --
tokenize_by_word   boolean   Whether to tokenize by word (default)      no         True
                             or character.
Outputs
field name     type    description
────────────────────────────────────────────────────────
tokens_array   array   The tokenized content, as an array of
                       lists of strings.
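A minimal sketch of the two tokenization modes, assuming nltk's word_tokenize for the word case (illustrative only, not necessarily the plugin's implementation):

    import nltk
    from nltk.tokenize import word_tokenize

    nltk.download("punkt", quiet=True)

    texts_array = ["First document.", "Second one."]
    tokenize_by_word = True

    if tokenize_by_word:
        tokens_array = [word_tokenize(text) for text in texts_array]   # word tokens
    else:
        tokens_array = [list(text) for text in texts_array]            # character tokens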