Philadelphia Eaglesâ\x80\x99 victory automatically gets converted to Philadelphia Eagles' victory in partition_html using the replace_unicode_quotes cleaning function. You can see how that works in the code snippet below:
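As a rough illustration of the conversion described above, the sketch below reproduces the garbled sequence and then repairs it with a plain string replacement. The helper name fix_mojibake_apostrophe is hypothetical and handles only this one sequence; it is a simplified stand-in, not the library's replace_unicode_quotes implementation.

```python
# The right single quote (U+2019) is three bytes in UTF-8. Decoding those
# bytes as Latin-1 yields the garbled sequence shown in the text above
# ("\xe2" is the character "â").
garbled_quote = "\u2019".encode("utf-8").decode("latin-1")
print(garbled_quote == "\xe2\x80\x99")  # True


def fix_mojibake_apostrophe(text: str) -> str:
    """Hypothetical stand-in: replace one mis-decoded quote with an apostrophe."""
    return text.replace("\xe2\x80\x99", "'")


cleaned = fix_mojibake_apostrophe("Philadelphia Eagles\xe2\x80\x99 victory")
print(cleaned)  # Philadelphia Eagles' victory
```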
Elements in unstructured include an apply method that allows you to apply text cleaning to the document element without instantiating a new element. The apply method expects a callable that takes a string as input and produces another string as output. In the example below, we invoke the replace_unicode_quotes cleaning function using the apply method.
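A minimal sketch of the pattern, assuming a simplified element type: TextElement and fix_quotes are hypothetical stand-ins written for illustration, not the library's element class or its replace_unicode_quotes function. The point is only that apply takes a str -> str callable and updates the element's text in place.

```python
from typing import Callable


class TextElement:
    """Hypothetical minimal element that holds text (not the library's class)."""

    def __init__(self, text: str) -> None:
        self.text = text

    def apply(self, cleaner: Callable[[str], str]) -> None:
        """Run a str -> str cleaner and update the text in place."""
        cleaned = cleaner(self.text)
        if not isinstance(cleaned, str):
            raise ValueError("Cleaning functions must return a string.")
        self.text = cleaned


def fix_quotes(text: str) -> str:
    # Hypothetical cleaner: replaces the garbled form of a right single quote.
    return text.replace("\xe2\x80\x99", "'")


element = TextElement("Philadelphia Eagles\xe2\x80\x99 victory")
element.apply(fix_quotes)  # no new element is created
print(element.text)  # Philadelphia Eagles' victory
```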
Because a cleaning function is just a str -> str function, users can also easily include their own cleaning functions for custom data preparation tasks. In the example below, we remove citations from a section of text.
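One way such a custom cleaner could look, assuming citations appear as bracketed numbers of up to three digits (that format, the function name remove_citations, and the sample sentence are all assumptions made for this sketch):

```python
import re

# Assumed citation format: a bracketed number with one to three digits.
CITATION = re.compile(r"\[\d{1,3}\]")


def remove_citations(text: str) -> str:
    """Custom str -> str cleaner that deletes bracketed numeric citations."""
    return CITATION.sub("", text)


text = "The quick brown fox[1] jumps over the lazy dog[12]."
without_citations = remove_citations(text)
print(without_citations)  # The quick brown fox jumps over the lazy dog.
```

Because remove_citations takes a string and returns a string, it has the shape that the apply method expects.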
See below for a full list of cleaning functions in the unstructured library.
bytes_string_to_string
Converts an output string that looks like a byte string to a string using the specified encoding. This happens sometimes in partition_html when there is a character like an emoji that isn't expected by the HTML parser. In that case, the encoded bytes get processed.
Examples:
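A minimal sketch of the idea, using a hypothetical stand-in named decode_bytes_like_string rather than the library function itself. It assumes every character in the input has a code point below 256, so that each character can be read back as a single byte.

```python
def decode_bytes_like_string(text: str, encoding: str = "utf-8") -> str:
    """Hypothetical stand-in: reinterpret each character as one raw byte."""
    raw = bytes(ord(char) for char in text)  # fails if a code point is >= 256
    return raw.decode(encoding)


# The four UTF-8 bytes of the grinning-face emoji (U+1F600), each of which
# ended up in the string as a separate character.
garbled = "Hello \xf0\x9f\x98\x80"
result = decode_bytes_like_string(garbled)
print(result == "Hello \U0001F600")  # True
```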
For more information about the bytes_string_to_string function, you can check the source code.
clean
Cleans a section of text with options including removing bullets, extra whitespace, dashes and trailing punctuation. Optionally, you can choose to lowercase the output.
Options:
- Applies clean_bullets if bullets=True.
- Applies clean_extra_whitespace if extra_whitespace=True.
- Applies clean_dashes if dashes=True.
- Applies clean_trailing_punctuation if trailing_punctuation=True.
- Lowercases the output if lowercase=True.
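The options above can be pictured as independent switches over one string. The sketch below is a hypothetical stand-in named simple_clean; the regular expressions, the bullet characters it recognizes, and the order in which the steps run are assumptions made for illustration, not the library's implementation.

```python
import re


def simple_clean(
    text: str,
    bullets: bool = False,
    extra_whitespace: bool = False,
    dashes: bool = False,
    trailing_punctuation: bool = False,
    lowercase: bool = False,
) -> str:
    """Hypothetical stand-in showing how the flags could compose."""
    if lowercase:
        text = text.lower()
    if trailing_punctuation:
        text = text.strip().rstrip(".,:;")
    if dashes:
        text = re.sub(r"[-\u2013]", " ", text)  # hyphen and en dash
    if extra_whitespace:
        text = re.sub(r"\s+", " ", text)
    if bullets:
        # \u25cf and \u2022 are two common bullet glyphs
        text = re.sub(r"^\s*[\u25cf\u2022*]\s*", "", text)
    return text.strip()


first = simple_clean("\u25cf An excellent point!", bullets=True, lowercase=True)
second = simple_clean("ITEM 1A:     RISK-FACTORS", extra_whitespace=True, dashes=True)
print(first)   # an excellent point!
print(second)  # ITEM 1A: RISK FACTORS
```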
For more information about the clean function, you can check the source code.
clean_bullets
Removes bullets from the beginning of text. Bullets that do not appear at the beginning of the text are not removed.
Examples:
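A minimal sketch of the behavior described above, using a hypothetical stand-in named strip_leading_bullet. The set of bullet characters in the pattern is an assumption and is likely narrower than what the library recognizes.

```python
import re

# Assumed bullet glyphs: black circle, bullet, middle dot, and asterisk.
LEADING_BULLET = re.compile(r"^\s*[\u25cf\u2022\u00b7*]\s*")


def strip_leading_bullet(text: str) -> str:
    """Hypothetical stand-in: remove one bullet at the start of the text."""
    return LEADING_BULLET.sub("", text)


leading = strip_leading_bullet("\u25cf An excellent point!")
trailing = strip_leading_bullet("I love Morse Code! \u25cf\u25cf\u25cf")
print(leading)  # An excellent point!
print(trailing == "I love Morse Code! \u25cf\u25cf\u25cf")  # True: left as-is
```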
For more information about the clean_bullets function, you can check the source code.
clean_dashes
Removes dashes from a section of text. Also handles special characters such as \u2013 (the en dash).
Examples:
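A minimal sketch using a hypothetical stand-in named replace_dashes. Replacing each dash with a space (rather than deleting it outright) and trimming the result are assumptions made here so that hyphenated words stay separated.

```python
import re


def replace_dashes(text: str) -> str:
    """Hypothetical stand-in: swap hyphens and en dashes for spaces."""
    return re.sub(r"[-\u2013]", " ", text).strip()


result = replace_dashes("ITEM 1A: RISK-FACTORS\u2013")
print(result)  # ITEM 1A: RISK FACTORS
```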
For more information about the clean_dashes function, you can check the source code.
clean_non_ascii_chars
Removes non-ASCII characters from a string.
Examples:
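A minimal sketch using a hypothetical stand-in named drop_non_ascii. It relies on the standard encode/decode round trip with the "ignore" error handler, which is one common way to get this behavior and is assumed here rather than taken from the library.

```python
def drop_non_ascii(text: str) -> str:
    """Hypothetical stand-in: discard every character outside the ASCII range."""
    return text.encode("ascii", "ignore").decode("ascii")


# \x88, \u00ae (registered sign), and \u25cf (black circle) are all non-ASCII.
result = drop_non_ascii("\x88This text contains\u00aenon-ascii characters!\u25cf")
print(result)  # This text containsnon-ascii characters!
```

Note that the removed characters are not replaced with spaces, so the words on either side of a removed character run together.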
For more information about the clean_non_ascii_chars function, you can check the source code.
clean_ordered_bullets
Removes alphanumeric bullets from the beginning of text, up to three “sub-section” levels.
Examples:
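A minimal sketch using a hypothetical stand-in named strip_ordered_bullet. The rules it applies (the first token must contain a period, have at most three period-separated parts, and each part must be one or two alphanumeric characters) are assumptions chosen to match the description above, not the library's exact logic.

```python
def strip_ordered_bullet(text: str) -> str:
    """Hypothetical stand-in: drop a leading bullet such as "1.1" or "a.b.c."."""
    first, _, rest = text.partition(" ")
    parts = first.rstrip(".").split(".")
    is_bullet = (
        "." in first
        and 1 <= len(parts) <= 3
        and all(part.isalnum() and len(part) <= 2 for part in parts)
    )
    return rest.lstrip() if is_bullet else text


print(strip_ordered_bullet("1.1 This is a very important point"))
# This is a very important point
print(strip_ordered_bullet("a.b This is a very important point"))
# This is a very important point
print(strip_ordered_bullet("Version 2 is out"))
# Version 2 is out
```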
For more information about the clean_ordered_bullets function, you can check the source code.
clean_postfix
Removes the postfix from a string if it matches a specified pattern.
Options:
- Ignores case if ignore_case is set to True. The default is False.
- Strips trailing whitespace if strip is set to True. The default is True.
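A minimal sketch of how the two options interact, using a hypothetical stand-in named strip_postfix. Treating the pattern as a regular expression anchored at the end of the string is an assumption; the ignore_case and strip parameters mirror the options listed above.

```python
import re


def strip_postfix(
    text: str, pattern: str, ignore_case: bool = False, strip: bool = True
) -> str:
    """Hypothetical stand-in: remove a regex match at the end of the text."""
    flags = re.IGNORECASE if ignore_case else 0
    cleaned = re.sub(rf"(?:{pattern})$", "", text, flags=flags)
    return cleaned.rstrip() if strip else cleaned


text = "The end! END"
print(repr(strip_postfix(text, r"END|STOP", ignore_case=True)))
# 'The end!'
print(repr(strip_postfix(text, r"END|STOP", ignore_case=True, strip=False)))
# 'The end! '
print(repr(strip_postfix("The end! end", r"END|STOP")))
# 'The end! end' (case-sensitive by default, so nothing matches)
```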
For more information about the clean_postfix function, you can check the source code.
clean_prefix
Removes the prefix from a string if it matches a specified pattern.
Options:
- Ignores case if ignore_case is set to True. The default is False.
- Strips leading whitespace if strip is set to True. The default is True.
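A minimal sketch using a hypothetical stand-in named strip_prefix. As an assumption, the pattern is treated as a regular expression anchored at the start of the string; ignore_case and strip follow the options listed above.

```python
import re


def strip_prefix(
    text: str, pattern: str, ignore_case: bool = False, strip: bool = True
) -> str:
    """Hypothetical stand-in: remove a regex match at the start of the text."""
    flags = re.IGNORECASE if ignore_case else 0
    cleaned = re.sub(rf"^(?:{pattern})", "", text, flags=flags)
    return cleaned.lstrip() if strip else cleaned


text = "SUMMARY: This is the best summary of all time!"
pattern = r"(SUMMARY|DESCRIPTION):"
print(repr(strip_prefix(text, pattern, ignore_case=True)))
# 'This is the best summary of all time!'
print(repr(strip_prefix(text, pattern, strip=False)))
# ' This is the best summary of all time!'
```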
For more information about the clean_prefix function, you can check the source code.
clean_trailing_punctuation
Removes trailing punctuation from a section of text.
Examples:
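A minimal sketch using a hypothetical stand-in named strip_trailing_punctuation. The set of characters it treats as trailing punctuation (period, comma, colon, semicolon) is an assumption made for this illustration.

```python
def strip_trailing_punctuation(text: str) -> str:
    """Hypothetical stand-in: trim whitespace, then trailing . , : ; characters."""
    return text.strip().rstrip(".,:;")


result = strip_trailing_punctuation("ITEM 1A: RISK FACTORS.")
print(result)  # ITEM 1A: RISK FACTORS
```

Only punctuation at the end is affected, so the colon after "1A" stays in place.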
For more information about the clean_trailing_punctuation function, you can check the source code.
group_broken_paragraphs
Groups together paragraphs that are broken up with line breaks for visual or formatting purposes. This is common in .txt files. By default, group_broken_paragraphs groups together lines split by \n. You can change that behavior with the line_split kwarg. The function considers \n\n to be a paragraph break by default. You can change that behavior with the paragraph_split kwarg.
Examples:
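A minimal sketch using a hypothetical stand-in named join_broken_lines. The line_split and paragraph_split names come from the description above; accepting them as compiled regular expressions, and the specific default patterns, are assumptions made for this illustration.

```python
import re


def join_broken_lines(
    text: str,
    line_split: re.Pattern = re.compile(r"\s*\n\s*"),
    paragraph_split: re.Pattern = re.compile(r"\n\n"),
) -> str:
    """Hypothetical stand-in: rejoin lines within each paragraph."""
    paragraphs = paragraph_split.split(text)
    # Within a paragraph, every remaining line break becomes a single space.
    joined = [line_split.sub(" ", p).strip() for p in paragraphs if p.strip()]
    return "\n\n".join(joined)


text = (
    "The big brown fox\nwas walking down the lane.\n\n"
    "At the end of the lane, the\nfox met a bear."
)
grouped = join_broken_lines(text)
print(grouped)
# The big brown fox was walking down the lane.
#
# At the end of the lane, the fox met a bear.
```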
For more information about the group_broken_paragraphs function, you can check the source code.
remove_punctuation
Removes ASCII and Unicode punctuation from a string.
Examples:
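A minimal sketch using a hypothetical stand-in named strip_punctuation. It assumes "punctuation" means any character whose Unicode general category starts with "P", which covers both ASCII marks and characters such as curly quotes; symbols like "$" fall in a different category and are kept.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Hypothetical stand-in: drop characters in a Unicode punctuation category."""
    return "".join(
        char for char in text if not unicodedata.category(char).startswith("P")
    )


# \u201c and \u201d are the left and right curly double quotes.
result = strip_punctuation("\u201cA lovely quote!\u201d")
print(result)  # A lovely quote
```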
For more information about the remove_punctuation function, you can check the source code.
replace_unicode_quotes
Replaces Unicode quote characters such as \x91 in strings.
Examples:
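A minimal sketch using a hypothetical stand-in named fix_c1_quotes. It covers only the four code points \x91 through \x94, which correspond to the curly single and double quotes in the Windows-1252 encoding; the library function may handle additional sequences.

```python
# Map the four Windows-1252 quote positions to their Unicode quote characters.
QUOTE_TABLE = str.maketrans(
    {
        "\x91": "\u2018",  # left single quote
        "\x92": "\u2019",  # right single quote
        "\x93": "\u201c",  # left double quote
        "\x94": "\u201d",  # right double quote
    }
)


def fix_c1_quotes(text: str) -> str:
    """Hypothetical stand-in: replace \\x91-\\x94 with proper quote characters."""
    return text.translate(QUOTE_TABLE)


result = fix_c1_quotes("\x93A lovely quote!\x94")
print(result == "\u201cA lovely quote!\u201d")  # True
```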
For more information about the replace_unicode_quotes function, you can check the source code.
translate_text
The translate_text cleaning function translates text between languages. translate_text uses the Helsinki NLP MT models from transformers for machine translation. It works for Russian, Chinese, Arabic, and many other languages.
Parameters:
- text: the input string to translate.
- source_lang: the two-letter language code for the source language of the text. If source_lang is not specified, the language will be detected using langdetect.
- target_lang: the two-letter language code for the target language for translation. Defaults to "en".
For more information about the translate_text function, you can check the source code.