Text processing
All functions needed for processing the text of project titles. Steps include normalization (hyphen splitting, punctuation removal, whitespace correction), language detection, lemmatization, and keyword detection.
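The sketch below shows one way these steps can be chained for a single title. The import path src.text_processing, the column-per-language layout of the keyword table, and the hardcoded language code (standing in for the language-detection step) are assumptions for illustration, not the project's exact pipeline.

```python
import pandas as pd

# Assumed import path; adjust to how the package is laid out in your project.
from src.text_processing import normalize_str, lemmatize_str, detect_keywords

# Assumed layout: one keyword column per language code.
keyword_df = pd.DataFrame({
    "en": ["solar energy", "wind turbine"],
    "de": ["Solarenergie", "Windkraftanlage"],
})

title = "Next-generation solar-energy storage   pilot"
lang = "en"  # in the full pipeline this would come from the language-detection step

clean = normalize_str(title)            # hyphen splitting, punctuation removal, whitespace fixes
lemmas = lemmatize_str(clean, lang)     # spaCy lemmatization for the given language
matches = detect_keywords(lemmas, lang, keyword_df)
print(matches or [])                    # detect_keywords returns None when nothing matches
```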
detect_acronyms(text, lang, acronyms_df)

Detect acronyms in a given text based on the specified language and a DataFrame of acronyms.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The input text to search for acronyms. | required |
| lang | str | The language code (e.g., 'en', 'de', etc.). | required |
| acronyms_df | DataFrame | A DataFrame containing acronyms for different languages. | required |
Returns:

| Name | Type | Description |
|---|---|---|
| list | list | A list of matched acronyms, or None if no matches are found. |
Source code in src\text_processing.py
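A minimal usage sketch; the import path and the column-per-language layout of acronyms_df (mirroring the documented keyword_df layout) are assumptions:

```python
import pandas as pd
from src.text_processing import detect_acronyms  # assumed import path

# Assumed layout: one acronym column per language code.
acronyms_df = pd.DataFrame({"en": ["AI", "IoT"], "de": ["KI", "IoT"]})

hits = detect_acronyms("Edge IoT platform for smart farming", "en", acronyms_df)
print(hits)  # a list of matched acronyms, or None if nothing matched
```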
              
detect_keywords(text, lang, keyword_df)

Detect keywords in a given text based on the specified language and a DataFrame of keywords.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The input text to search for keywords. | required |
| lang | str | The language code (e.g., 'en', 'de', etc.). | required |
| keyword_df | DataFrame | A DataFrame containing keywords for different languages. | required |
Returns:

| Name | Type | Description |
|---|---|---|
| list | list | A list of matched keywords, or None if no matches are found. |
Source code in src\text_processing.py
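A minimal usage sketch under the same assumptions (import path, one keyword column per language); the None return value is normalized to an empty list:

```python
import pandas as pd
from src.text_processing import detect_keywords  # assumed import path

keyword_df = pd.DataFrame({"en": ["climate", "hydrogen"], "fr": ["climat", "hydrogène"]})

matches = detect_keywords("green hydrogen production hub", "en", keyword_df)
print(matches if matches is not None else [])  # None means no keyword matched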
              
lemmatize_batch(texts, lang, batch_size=100, remove_stopwords=False)

Lemmatize a batch of texts using spaCy for the specified language.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| texts | list | A list of input texts to be lemmatized. | required |
| lang | str | The language code (e.g., 'en', 'fr', 'es', 'de') for spaCy's language model. | required |
| remove_stopwords | bool | If True, stopwords will be removed from the lemmatized output. Defaults to False. | False |
Returns:

| Name | Type | Description |
|---|---|---|
| list | list | A list of lemmatized strings. |
Source code in src\text_processing.py
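A minimal usage sketch, assuming the import path src.text_processing and that the spaCy model for the chosen language (e.g. en_core_web_sm) is installed; batch_size and remove_stopwords are optional per the signature above:

```python
from src.text_processing import lemmatize_batch  # assumed import path

titles = [
    "Building greener cities together",
    "Improved recycling of plastic packaging",
]

# The whole list goes through spaCy in batches; stopword removal is optional.
lemmas = lemmatize_batch(titles, "en", batch_size=50, remove_stopwords=True)
print(lemmas)  # one lemmatized string per input title
```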
              
lemmatize_str(text, lang, remove_stopwords=False)

Lemmatize a given text using spaCy for the specified language and return a lemmatized string.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The input text to be lemmatized. | required |
| lang | str | The language code (e.g., 'en', 'fr', 'es', 'de') for spaCy's language model. | required |
| remove_stopwords | bool | If True, stopwords will be removed from the lemmatized output. Defaults to False. | False |
Returns:

| Name | Type | Description |
|---|---|---|
| str | str | A string of lemmatized tokens joined by spaces. If the language is not supported, the original text is returned. |
Source code in src\text_processing.py
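A minimal usage sketch under the same assumptions (import path, installed spaCy model); the second call illustrates the documented fallback for an unsupported language code:

```python
from src.text_processing import lemmatize_str  # assumed import path

print(lemmatize_str("The cities were building greener infrastructures", "en"))
# lemmatized tokens joined by spaces (exact output depends on the spaCy model)

# Assuming 'fi' is not among the configured models, the original text comes back unchanged.
print(lemmatize_str("Tämä on suomenkielinen otsikko", "fi"))
```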
              
normalize_str(text)

Normalize a text string by cleaning unwanted characters and formatting.

This function performs the following operations:

- Replaces hyphens with spaces to split hyphenated words.
- Removes punctuation and symbol characters, while preserving:
  - Letters (including accented and non-ASCII characters)
  - Digits
  - Apostrophes (')
- Collapses multiple whitespace characters into a single space.
- Trims leading and trailing whitespace.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The input string to normalize. | required |
Returns:

| Name | Type | Description |
|---|---|---|
| str | str | A normalized version of the input string. |
Source code in src\text_processing.py
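A minimal usage sketch (import path assumed); the expected output follows the rules listed above:

```python
from src.text_processing import normalize_str  # assumed import path

raw = "  Smart-grid pilot:   l'énergie  (phase 2)!  "
print(normalize_str(raw))
# Expected: "Smart grid pilot l'énergie phase 2"
# (hyphen split, punctuation removed, apostrophe and accents kept, whitespace collapsed)
```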
              
process_keywords(keywords_df, langauges=['en', 'fr', 'es', 'de'], remove_stopwords=False)

Processes a DataFrame of keywords by normalizing, lemmatizing, and adding de-accented versions.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| keywords_df | DataFrame | A DataFrame where each column represents a language (e.g., 'en', 'fr', 'es', 'de') and contains keywords as strings. | required |
Returns:

| Type | Description |
|---|---|
| DataFrame | The processed DataFrame with normalized, lemmatized, and de-accented keywords. |
Source code in src\text_processing.py
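A minimal usage sketch, assuming the import path src.text_processing; the languages argument is passed with the spelling langauges shown in the signature above:

```python
import pandas as pd
from src.text_processing import process_keywords  # assumed import path

# One column per language, keywords as strings (layout documented above).
keywords_df = pd.DataFrame({
    "en": ["renewable energies", "urban mobility"],
    "fr": ["énergies renouvelables", "mobilité urbaine"],
})

# 'langauges' is spelled as in the signature above.
processed = process_keywords(keywords_df, langauges=["en", "fr"], remove_stopwords=False)
print(processed.head())  # normalized, lemmatized, and de-accented keywords
```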
              
remove_accents(text)

Remove accents from characters in a string.

This function uses the unicodedata library to normalize the text and remove accents.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The input string to process. | required |
Returns:

| Name | Type | Description |
|---|---|---|
| str | str | The input string with accents removed. |

Source code in src\text_processing.py
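A minimal usage sketch (import path assumed):

```python
from src.text_processing import remove_accents  # assumed import path

print(remove_accents("énergie éolienne à São Paulo"))
# Expected: "energie eolienne a Sao Paulo"
```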