# Text processing
All functions needed to process the text of project titles. Steps include normalization (hyphen splitting, punctuation removal, whitespace correction), language detection, lemmatization, and keyword detection.
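A minimal end-to-end sketch of the processing order described above, assuming the module is importable as `text_processing`; the example title and keyword table are made up, and the language code would normally come from the language-detection step, which is not shown here.

```python
import pandas as pd

from text_processing import detect_keywords, lemmatize_str, normalize_str  # import path is an assumption

title = "Co-financing sustainable mobility in rural areas!"
clean = normalize_str(title)          # split hyphens, drop punctuation, collapse whitespace
lemmas = lemmatize_str(clean, "en")   # lemmatize with spaCy's English model

# Hypothetical keyword table: one column per language code
keyword_df = pd.DataFrame({"en": ["sustainable mobility", "rural area"]})
print(detect_keywords(lemmas, "en", keyword_df))
```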
## `detect_acronyms(text, lang, acronyms_df)`
Detect acronyms in a given text based on the specified language and a DataFrame of acronyms.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | The input text to search for acronyms. | *required* |
| `lang` | `str` | The language code (e.g., 'en', 'de', etc.). | *required* |
| `acronyms_df` | `DataFrame` | A DataFrame containing acronyms for different languages. | *required* |
Returns:
| Name | Type | Description |
|---|---|---|
| `list` | `list` | A list of matched acronyms, or `None` if no matches are found. |
Source code in src\text_processing.py
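A minimal usage sketch, assuming the module is importable as `text_processing` and that `acronyms_df` has one column per language code; the sample acronyms are made up.

```python
import pandas as pd

from text_processing import detect_acronyms  # import path is an assumption

# Assumed layout: one column per language code, one acronym per row
acronyms_df = pd.DataFrame({
    "en": ["AI", "GIS"],
    "de": ["KI", "GIS"],
})

matches = detect_acronyms("Regional GIS platform for AI-based planning", "en", acronyms_df)
print(matches)  # e.g. ['GIS', 'AI'], or None if nothing matched
```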
## `detect_keywords(text, lang, keyword_df)`
Detect keywords in a given text based on the specified language and a DataFrame of keywords.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | The input text to search for keywords. | *required* |
| `lang` | `str` | The language code (e.g., 'en', 'de', etc.). | *required* |
| `keyword_df` | `DataFrame` | A DataFrame containing keywords for different languages. | *required* |
Returns:
| Name | Type | Description |
|---|---|---|
| `list` | `list` | A list of matched keywords, or `None` if no matches are found. |
Source code in src\text_processing.py
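A minimal usage sketch under the same assumptions (importable `text_processing` module, one keyword column per language code); the sample keywords are made up.

```python
import pandas as pd

from text_processing import detect_keywords  # import path is an assumption

keyword_df = pd.DataFrame({
    "en": ["flood protection", "renewable energy"],
    "de": ["Hochwasserschutz", "erneuerbare Energie"],
})

matches = detect_keywords("Cross-border flood protection measures", "en", keyword_df)
print(matches)  # e.g. ['flood protection'], or None if nothing matched
```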
## `lemmatize_batch(texts, lang, batch_size=100, remove_stopwords=False)`
Lemmatize a batch of texts using spaCy for the specified language.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `texts` | `list` | A list of input texts to be lemmatized. | *required* |
| `lang` | `str` | The language code (e.g., 'en', 'fr', 'es', 'de') for spaCy's language model. | *required* |
| `remove_stopwords` | `bool` | If True, stopwords will be removed from the lemmatized output. Defaults to False. | `False` |
Returns:
| Name | Type | Description |
|---|---|---|
| `list` | `list` | A list of lemmatized strings. |
Source code in src\text_processing.py
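A minimal usage sketch, assuming the module is importable as `text_processing` and that the corresponding spaCy model (e.g. `en_core_web_sm`) is installed; the example titles and output are illustrative.

```python
from text_processing import lemmatize_batch  # import path is an assumption

titles = [
    "Building greener cities",
    "Improved flood defences for coastal towns",
]

lemmas = lemmatize_batch(titles, "en", batch_size=50, remove_stopwords=True)
print(lemmas)  # e.g. ['build green city', 'improve flood defence coastal town']
```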
## `lemmatize_str(text, lang, remove_stopwords=False)`
Lemmatize a given text using spaCy for the specified language and return a lemmatized string.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | The input text to be lemmatized. | *required* |
| `lang` | `str` | The language code (e.g., 'en', 'fr', 'es', 'de') for spaCy's language model. | *required* |
| `remove_stopwords` | `bool` | If True, stopwords will be removed from the lemmatized output. Defaults to False. | `False` |
Returns:
| Name | Type | Description |
|---|---|---|
| `str` | `str` | A string of lemmatized tokens joined by spaces. If the language is not supported, the original text is returned. |
Source code in src\text_processing.py
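A minimal usage sketch, assuming an importable `text_processing` module and an installed French spaCy model; the output shown is illustrative.

```python
from text_processing import lemmatize_str  # import path is an assumption

print(lemmatize_str("Les projets financés par la région", "fr", remove_stopwords=True))
# e.g. 'projet financer région'; for an unsupported language the text is returned unchanged
```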
## `normalize_str(text)`
Normalize a text string by cleaning unwanted characters and formatting.
This function performs the following operations:

- Replaces hyphens with spaces to split hyphenated words.
- Removes punctuation and symbol characters, while preserving:
    - Letters (including accented and non-ASCII characters)
    - Digits
    - Apostrophes (')
- Collapses multiple whitespace characters into a single space.
- Trims leading and trailing whitespace.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | The input string to normalize. | *required* |
Returns:
| Name | Type | Description |
|---|---|---|
| `str` | `str` | A normalized version of the input string. |
Source code in src\text_processing.py
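A minimal usage sketch, assuming an importable `text_processing` module; the expected output follows the rules listed above.

```python
from text_processing import normalize_str  # import path is an assumption

print(normalize_str("Éco-quartier   (phase 2): l'avenir!"))
# 'Éco quartier phase 2 l'avenir' (per the rules above: hyphen split, punctuation removed,
# accents and the apostrophe kept, whitespace collapsed and trimmed)
```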
## `process_keywords(keywords_df, langauges=['en', 'fr', 'es', 'de'], remove_stopwords=False)`
Processes a DataFrame of keywords by normalizing, lemmatizing, and adding de-accented versions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `keywords_df` | `DataFrame` | A DataFrame where each column represents a language (e.g., 'en', 'fr', 'es', 'de') and contains keywords as strings. | *required* |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The processed DataFrame with normalized, lemmatized, and de-accented keywords. |
Source code in src\text_processing.py
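A minimal usage sketch, assuming an importable `text_processing` module; the keyword argument name follows the documented signature (`langauges`), and the sample keywords are made up.

```python
import pandas as pd

from text_processing import process_keywords  # import path is an assumption

keywords_df = pd.DataFrame({
    "en": ["renewable energy", "flood protection"],
    "fr": ["énergie renouvelable", "protection contre les inondations"],
})

processed = process_keywords(keywords_df, langauges=["en", "fr"], remove_stopwords=False)
print(processed.head())  # normalized, lemmatized, and de-accented keyword columns
```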
## `remove_accents(text)`
Remove accents from characters in a string.
This function uses the unicodedata library to normalize the text and remove accents.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | The input string to process. | *required* |
Returns:
| Name | Type | Description |
|---|---|---|
| `str` | `str` | The input string with accents removed. |
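A minimal usage sketch, assuming an importable `text_processing` module.

```python
from text_processing import remove_accents  # import path is an assumption

print(remove_accents("énergie éolienne à São Paulo"))
# 'energie eolienne a Sao Paulo'
```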