String Operations

String operations in data processing mainly involve string splitting and string extraction: splitting strings into multiple lines or capturing and splitting strings into multiple columns using regular expressions. After downcasting the expression with col("str").str(), string operations can be performed.

Splitting Strings into Multiple Lines

In the following dataframe, the member information of each team is stored as a string. Our first task is to split the string into multiple lines using the newline character as a delimiter.

#![allow(unused)]
fn main() {
let df_str = df!{"items" => ["Jada; location:2759 Fairway Drive; Email:Jada;@gmail;.com\nGraceland; location:6 Greenleaf Dr; Email:Graceland@gmail.com",
"zdlldine; location:2887 Andell Road; Email:zdlldine@gmail.com\nMakana; location:1521 Winifred Way; Email:Makana@gmail.com\nNatsuki; location:4416 Golf Course Drive; Email:Natsuki@gmail.com",
"Pope; location:345 Edgewood Avenue; Email:Pope@gmail.com",
"Oaklynn; location:3017 Cherry Camp Road; Email:zdl_361@hotmail.com",
"Tysheenia; location:1616 Smith Street; Email:Tysheenia@gmail.com\nZenda; location:4416 Golf Course Drive; Email:Zenda@gmail.com"],
"teamID" => ["team01","team02","team03","team04","team05"]}?;
let df_res = df_str.lazy().select([col("teamID"),col("items").str().split(lit("\n"))]).collect()?;
println!("{:?}",&df_res);
}

Output:

shape: (5, 2)
┌────────┬─────────────────────────────────┐
│ teamID ┆ items                           │
│ ---    ┆ ---                             │
│ str    ┆ list[str]                       │
╞════════╪═════════════════════════════════╡
│ team01 ┆ ["Jada; location:2759 Fairway … │
│ team02 ┆ ["zdlldine; location:2887 And … │
│ team03 ┆ ["Pope; location:345 Edgewood … │
│ team04 ┆ ["Oaklynn; location:3017 Cherr… │
│ team05 ┆ ["Tysheenia; location:1616 Smi… │
└────────┴─────────────────────────────────┘

The values after string splitting are wrapped in a List. By calling explode(["items"]), the items field is unpacked into multiple lines.

#![allow(unused)]
fn main() {
let df_lines=df_res.explode(["items"])?;
println!("{:?}",&df_lines);
}

Output

shape: (9, 2)
┌────────┬─────────────────────────────────┐
│ teamID ┆ items                           │
│ ---    ┆ ---                             │
│ str    ┆ str                             │
╞════════╪═════════════════════════════════╡
│ team01 ┆ Jada; location:2759 Fairway Dr… │
│ team01 ┆ Graceland; location:6 Greenlea… │
│ team02 ┆ zdlldine; location:2887 Andell… │
│ team02 ┆ Makana; location:1521 Winifred… │
│ team02 ┆ Natsuki; location:4416 Golf Co… │
│ team03 ┆ Pope; location:345 Edgewood Av… │
│ team04 ┆ Oaklynn; location:3017 Cherry … │
│ team05 ┆ Tysheenia; location:1616 Smith… │
│ team05 ┆ Zenda; location:4416 Golf Cour… │
└────────┴─────────────────────────────────┘

Splitting into Multiple Columns

Similar to split, split_exact() saves the split string into DataType::Struct, and after unnest(), it can be split into multiple fields. split_exact(lit(";"),3) takes the first parameter as the delimiter and the second parameter as the number of fields. It precisely returns a fixed number of fields. If the number of split strings is insufficient, it generates null values. If there are too many, it directly discards the extra fields.

#![allow(unused)]
fn main() {
let df_structed = df_lines.lazy().select([
    col("teamID"),
    col("items").str().split_exact(lit(";"),3)
]).collect()?;
println!("df_structed\n{:?}",&df_structed);
}

Note that the returned type of items is a struct type. Output


df_structed
shape: (9, 2)
┌────────┬─────────────────────────────────┐
│ teamID ┆ items                           │
│ ---    ┆ ---                             │
│ str    ┆ struct[4]                       │
╞════════╪═════════════════════════════════╡
│ team01 ┆ {"Jada"," location:2759 Fairwa… │
│ team01 ┆ {"Graceland"," location:6 Gree… │
│ team02 ┆ {"zdlldine"," location:2887 An… │
│ team02 ┆ {"Makana"," location:1521 Wini… │
│ team02 ┆ {"Natsuki"," location:4416 Gol… │
│ team03 ┆ {"Pope"," location:345 Edgewoo… │
│ team04 ┆ {"Oaklynn"," location:3017 Che… │
│ team05 ┆ {"Tysheenia"," location:1616 S… │
│ team05 ┆ {"Zenda"," location:4416 Golf … │
└────────┴─────────────────────────────────┘

Apply unnest to unpack the struct into multiple fields

#![allow(unused)]
fn main() {
let df_unnest = df_structed.unnest(["items"])?;
println!("df_unnest:\n{:?}",&df_unnest);
}

To demonstrate the effect of the number of fields, we added a few extra semicolons to the first element of the df_str items field. This results in the field_3 of the first element not being null.

Output

df_unnest:
shape: (9, 5)
┌────────┬───────────┬─────────────────────────────────┬────────────────────────────┬─────────┐
│ teamID ┆ field_0   ┆ field_1                         ┆ field_2                    ┆ field_3 │
│ ---    ┆ ---       ┆ ---                             ┆ ---                        ┆ ---     │
│ str    ┆ str       ┆ str                             ┆ str                        ┆ str     │
╞════════╪═══════════╪═════════════════════════════════╪════════════════════════════╪═════════╡
│ team01 ┆ Jada      ┆  location:2759 Fairway Drive    ┆  Email:Jada                ┆ @gmail  │
│ team01 ┆ Graceland ┆  location:6 Greenleaf Dr        ┆  Email:Graceland@gmail.com ┆ null    │
│ team02 ┆ zdlldine  ┆  location:2887 Andell Road      ┆  Email:zdlldine@hotmail.com┆ null    │
│ team02 ┆ Makana    ┆  location:1521 Winifred Way     ┆  Email:Makana@gmail.com    ┆ null    │
│ team02 ┆ Natsuki   ┆  location:4416 Golf Course Dri… ┆  Email:Natsuki@gmail.com   ┆ null    │
│ team03 ┆ Pope      ┆  location:345 Edgewood Avenue   ┆  Email:Pope@gmail.com      ┆ null    │
│ team04 ┆ Oaklynn   ┆  location:3017 Cherry Camp Roa… ┆  Email:zdl_361@hotmail.com ┆ null    │
│ team05 ┆ Tysheenia ┆  location:1616 Smith Street     ┆  Email:Tysheenia@gmail.com ┆ null    │
│ team05 ┆ Zenda     ┆  location:4416 Golf Course Dri… ┆  Email:Zenda@gmail.com     ┆ null    │
└────────┴───────────┴─────────────────────────────────┴────────────────────────────┴─────────┘

Regular Expression Capture

Sometimes simple split cannot meet business needs. Complex tasks require regular expression capture to complete. The main method involved is the extract method:

extract(self, pat: Expr, group_index: usize) captures the value at the group_index after matching the regular expression pat.
extract_groups(self, pat: &str) returns all captures after matching the regular expression pat. We continue working based on df_lines.

#![allow(unused)]
fn main() {
// Since the expression is too long, we define a custom expression
let ex = |index| -> Expr{
    // Here we use a regular expression to capture three fields
    col("items").str().extract(lit(r#"^([A-Z a-z]*); location:(.*); Email:(.*)$"#), index)
};
let df_extract=df_lines.lazy().select([
    col("teamID"),
    ex(0).alias("source"), // 0 captures the entire matching string
    ex(1).alias("Name"),
    ex(2).alias("location"),
    ex(3).alias("email"),    
    ]).collect()?;
println!("{:?}",&df_extract);
}

Output

shape: (9, 5)
┌────────┬─────────────────────────────────┬───────────┬────────────────────────┬─────────────────────┐
│ teamID ┆ source                          ┆ Name      ┆ location               ┆ email               │
│ ---    ┆ ---                             ┆ ---       ┆ ---                    ┆ ---                 │
│ str    ┆ str                             ┆ str       ┆ str                    ┆ str                 │
╞════════╪═════════════════════════════════╪═══════════╪════════════════════════╪═════════════════════╡
│ team01 ┆ Jada; location:2759 Fairway Dr… ┆ Jada      ┆ 2759 Fairway Drive     ┆ Jada;@gmail;.com    │
│ team01 ┆ Graceland; location:6 Greenlea… ┆ Graceland ┆ 6 Greenleaf Dr         ┆ Graceland@gmail.com │
│ team02 ┆ zdlldine; location:2887 Andell… ┆ Ives      ┆ 2887 Andell Road       ┆ zdlldine@gmail.com  │
│ team02 ┆ Makana; location:1521 Winifred… ┆ Makana    ┆ 1521 Winifred Way      ┆ Makana@gmail.com    │
│ team02 ┆ Natsuki; location:4416 Golf Co… ┆ Natsuki   ┆ 4416 Golf Course Drive ┆ Natsuki@gmail.com   │
│ team03 ┆ Pope; location:345 Edgewood Av… ┆ Pope      ┆ 345 Edgewood Avenue    ┆ Pope@gmail.com      │
│ team04 ┆ Oaklynn; location:3017 Cherry … ┆ Oaklynn   ┆ 3017 Cherry Camp Road  ┆ zdl_361@hotmail.com │
│ team05 ┆ Tysheenia; location:1616 Smith… ┆ Tysheenia ┆ 1616 Smith Street      ┆ Tysheenia@gmail.com │
│ team05 ┆ Zenda; location:4416 Golf Cour… ┆ Zenda     ┆ 4416 Golf Course Drive ┆ Zenda@gmail.com     │
└────────┴─────────────────────────────────┴───────────┴────────────────────────┴─────────────────────┘

String API

Self refers to the return value of Expr.str()

API	Description
contains_literal(self, pat: Expr)	Checks if it contains a string literal
contains(self, pat: Expr, strict: bool)	Checks if it matches a regular expression. If pat is an invalid regular expression, strict==true returns an error. If strict==false, the invalid regular expression will simply evaluate to false.
contains_any(self, patterns: Expr, ascii_case_insensitive: bool)	Matches multiple fixed strings using the Aho-Corasick algorithm¹. The pattern should be constructed like this: `let pat = lit(Series::new("pat".into(),["fo","ba","str3"]));`
replace_many(self,patterns: Expr, replace_with: Expr, ascii_case_insensitive: bool)	Replaces multiple strings using the Aho-Corasick algorithm
ends_with(self, sub: Expr)	Checks if it ends with the sub string
starts_with(self, sub: Expr)	Checks if it starts with the sub string
hex_encode(self)	Encodes the string into a hexadecimal string
hex_decode(self, strict: bool)	Decodes a hexadecimal string into a regular string
base64_encode(self)	Encodes the string using base64
base64_decode(self, strict: bool)	Decodes a base64 string into a regular string
extract(self, pat: Expr, group_index: usize)	Extracts a regular expression capture, see Regular Expression Capture
find_literal(self, pat: Expr)	Finds the index of the literal
find(self, pat: Expr, strict: bool)	Searches for the index of the regular expression
count_matches(self, pat: Expr, literal: bool)	Returns the count of successful regular expression matches
strptime(self,dtype: DataType, options: StrptimeOptions,ambiguous: Expr)	Parses a string into Date/Datetime/Time
to_datetime	Parses a string into datetime
to_time(self, options: StrptimeOptions)	Parses a string into time
join(self, delimiter: &str, ignore_nulls: bool)	Joins the strings in the field into a single string, using the delimiter as a separator
split(self, by: Expr)	Splits the string into a List<String>. You can use explode to split the result into multiple rows. String Split into Multiple Rows
split_inclusive(self, by: Expr)	Similar to split but retains the delimiter
split_exact(self, by: Expr, n: usize)	Splits into a Struct, which can be unpacked into multiple fields using unnest. Split into Multiple Columns
strip_prefix(self, prefix: Expr)	Removes the prefix
strip_suffix(self, suffix: Expr)	Removes the suffix
to_lowercase(self)	Converts all characters to lowercase
to_uppercase(self)	Converts all characters to uppercase
to_integer(self, base: Expr, strict: bool)	Parses the string into an integer according to the specified base
len_bytes(self)	Counts the number of bytes
len_chars(self)	Counts the number of characters
slice(self, offset: Expr, length: Expr)	Returns a substring referenced by the slice
head(self, n: Expr)	Returns the first n characters
tail(self, n: Expr)	Returns the last n characters

The Aho-Corasick algorithm's patterns are not regular expressions but a collection of multiple fixed strings. This algorithm is used for multi-pattern matching, i.e., finding the positions of multiple fixed patterns (strings) in a text. The core idea of the Aho-Corasick algorithm is to build an automaton to match multiple pattern strings simultaneously. As the input text flows through the automaton, it can efficiently identify all matching patterns.