String Operations
String operations in data processing mainly involve splitting and extraction: splitting a string into multiple rows, or splitting or capturing parts of a string into multiple columns, optionally with regular expressions. Calling str() on an expression, as in col("str").str(), gives access to the string operations.
Splitting Strings into Multiple Rows
In the following dataframe, the member information of each team is stored as a single string. Our first task is to split that string into multiple rows, using the newline character as the delimiter.
    // Assumes `use polars::prelude::*;` and an enclosing function that returns PolarsResult<()>.
    let df_str = df! {
        "items" => [
            "Jada; location:2759 Fairway Drive; Email:Jada;@gmail;.com\nGraceland; location:6 Greenleaf Dr; Email:Graceland@gmail.com",
            "zdlldine; location:2887 Andell Road; Email:zdlldine@gmail.com\nMakana; location:1521 Winifred Way; Email:Makana@gmail.com\nNatsuki; location:4416 Golf Course Drive; Email:Natsuki@gmail.com",
            "Pope; location:345 Edgewood Avenue; Email:Pope@gmail.com",
            "Oaklynn; location:3017 Cherry Camp Road; Email:zdl_361@hotmail.com",
            "Tysheenia; location:1616 Smith Street; Email:Tysheenia@gmail.com\nZenda; location:4416 Golf Course Drive; Email:Zenda@gmail.com"
        ],
        "teamID" => ["team01", "team02", "team03", "team04", "team05"]
    }?;
    // Split each string on the newline character; items becomes a list[str] column.
    let df_res = df_str
        .lazy()
        .select([col("teamID"), col("items").str().split(lit("\n"))])
        .collect()?;
    println!("{:?}", &df_res);
Output:
shape: (5, 2)
┌────────┬─────────────────────────────────┐
│ teamID ┆ items │
│ --- ┆ --- │
│ str ┆ list[str] │
╞════════╪═════════════════════════════════╡
│ team01 ┆ ["Jada; location:2759 Fairway … │
│ team02 ┆ ["zdlldine; location:2887 And … │
│ team03 ┆ ["Pope; location:345 Edgewood … │
│ team04 ┆ ["Oaklynn; location:3017 Cherr… │
│ team05 ┆ ["Tysheenia; location:1616 Smi… │
└────────┴─────────────────────────────────┘
After splitting, the values in each row are wrapped in a List. Calling explode(["items"]) unpacks the items column into multiple rows, one per list element.
    // Each list element becomes its own row; teamID is repeated for every element.
    let df_lines = df_res.explode(["items"])?;
    println!("{:?}", &df_lines);
Output
shape: (9, 2)
┌────────┬─────────────────────────────────┐
│ teamID ┆ items │
│ --- ┆ --- │
│ str ┆ str │
╞════════╪═════════════════════════════════╡
│ team01 ┆ Jada; location:2759 Fairway Dr… │
│ team01 ┆ Graceland; location:6 Greenlea… │
│ team02 ┆ zdlldine; location:2887 Andell… │
│ team02 ┆ Makana; location:1521 Winifred… │
│ team02 ┆ Natsuki; location:4416 Golf Co… │
│ team03 ┆ Pope; location:345 Edgewood Av… │
│ team04 ┆ Oaklynn; location:3017 Cherry … │
│ team05 ┆ Tysheenia; location:1616 Smith… │
│ team05 ┆ Zenda; location:4416 Golf Cour… │
└────────┴─────────────────────────────────┘
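The split-then-explode step can also be modelled on plain Rust data, which makes the row multiplication explicit. This is only a sketch of the semantics using the standard library; the helper name split_and_explode is hypothetical and is not part of Polars.

```rust
/// Hypothetical helper (not a Polars API): models `split` followed by
/// `explode` on plain (teamID, items) pairs.
fn split_and_explode(rows: &[(&str, &str)], delimiter: &str) -> Vec<(String, String)> {
    let mut out = Vec::new();
    for (team_id, items) in rows {
        // Every piece becomes its own row; the teamID value is repeated.
        for piece in items.split(delimiter) {
            out.push((team_id.to_string(), piece.to_string()));
        }
    }
    out
}

fn main() {
    let rows = [("team01", "Jada\nGraceland"), ("team03", "Pope")];
    for (team_id, item) in split_and_explode(&rows, "\n") {
        println!("{team_id}\t{item}");
    }
}
```

Two input rows become three output rows here, in the same way that the five teams above become nine member rows.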
Splitting into Multiple Columns
Like split, split_exact() splits on a delimiter, but it stores the pieces in a DataType::Struct instead of a list; calling unnest() then expands the struct into separate columns. In split_exact(lit(";"),3), the first parameter is the delimiter and the second is the number of splits n, so the result always has n + 1 fields (here four: field_0 to field_3). If a string yields fewer pieces, the missing fields are null. If it yields more, the extra pieces are discarded.
    let df_structed = df_lines
        .clone() // lazy() consumes the frame; clone so df_lines can be reused in the regex example
        .lazy()
        .select([col("teamID"), col("items").str().split_exact(lit(";"), 3)])
        .collect()?;
    println!("df_structed\n{:?}", &df_structed);
Note that the type of the items column is now a struct.
Output
df_structed
shape: (9, 2)
┌────────┬─────────────────────────────────┐
│ teamID ┆ items │
│ --- ┆ --- │
│ str ┆ struct[4] │
╞════════╪═════════════════════════════════╡
│ team01 ┆ {"Jada"," location:2759 Fairwa… │
│ team01 ┆ {"Graceland"," location:6 Gree… │
│ team02 ┆ {"zdlldine"," location:2887 An… │
│ team02 ┆ {"Makana"," location:1521 Wini… │
│ team02 ┆ {"Natsuki"," location:4416 Gol… │
│ team03 ┆ {"Pope"," location:345 Edgewoo… │
│ team04 ┆ {"Oaklynn"," location:3017 Che… │
│ team05 ┆ {"Tysheenia"," location:1616 S… │
│ team05 ┆ {"Zenda"," location:4416 Golf … │
└────────┴─────────────────────────────────┘
Apply unnest to unpack the struct into multiple columns:
    let df_unnest = df_structed.unnest(["items"])?;
    println!("df_unnest:\n{:?}", &df_unnest);
To demonstrate how the field count behaves, the first element of the items column in df_str contains a few extra semicolons (Email:Jada;@gmail;.com). As a result, field_3 of the first row is not null, and the piece after the last extra semicolon (.com) is discarded.
Output
df_unnest:
shape: (9, 5)
┌────────┬───────────┬─────────────────────────────────┬────────────────────────────┬─────────┐
│ teamID ┆ field_0 ┆ field_1 ┆ field_2 ┆ field_3 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ str │
╞════════╪═══════════╪═════════════════════════════════╪════════════════════════════╪═════════╡
│ team01 ┆ Jada ┆ location:2759 Fairway Drive ┆ Email:Jada ┆ @gmail │
│ team01 ┆ Graceland ┆ location:6 Greenleaf Dr ┆ Email:Graceland@gmail.com ┆ null │
│ team02 ┆ zdlldine ┆ location:2887 Andell Road ┆ Email:zdlldine@gmail.com ┆ null │
│ team02 ┆ Makana ┆ location:1521 Winifred Way ┆ Email:Makana@gmail.com ┆ null │
│ team02 ┆ Natsuki ┆ location:4416 Golf Course Dri… ┆ Email:Natsuki@gmail.com ┆ null │
│ team03 ┆ Pope ┆ location:345 Edgewood Avenue ┆ Email:Pope@gmail.com ┆ null │
│ team04 ┆ Oaklynn ┆ location:3017 Cherry Camp Roa… ┆ Email:zdl_361@hotmail.com ┆ null │
│ team05 ┆ Tysheenia ┆ location:1616 Smith Street ┆ Email:Tysheenia@gmail.com ┆ null │
│ team05 ┆ Zenda ┆ location:4416 Golf Course Dri… ┆ Email:Zenda@gmail.com ┆ null │
└────────┴───────────┴─────────────────────────────────┴────────────────────────────┴─────────┘
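The padding and truncation rules of split_exact can be checked on a single string with a small standard-library model. This is a sketch of the behaviour described in this section, not the Polars implementation; the function name split_exact_model is hypothetical, and None stands in for null.

```rust
/// Hypothetical model of `split_exact(by, n)` for one string:
/// always returns n + 1 fields, padding with None and dropping extras.
fn split_exact_model(s: &str, by: &str, n: usize) -> Vec<Option<String>> {
    let mut fields: Vec<Option<String>> = s
        .split(by)
        .take(n + 1) // discard everything after the (n + 1)-th piece
        .map(|piece| Some(piece.to_string()))
        .collect();
    fields.resize(n + 1, None); // pad with None when there are too few pieces
    fields
}

fn main() {
    // Five pieces with n = 3: four fields are kept and ".com" is dropped.
    println!("{:?}", split_exact_model("Jada; loc; Email:Jada;@gmail;.com", ";", 3));
    // Three pieces with n = 3: the fourth field is None.
    println!("{:?}", split_exact_model("Pope; loc; Email:Pope@gmail.com", ";", 3));
}
```

Note that the pieces keep their leading spaces, because the delimiter is ";" and not "; ".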
Regular Expression Capture
Sometimes a simple split is not enough, and the task calls for regular expression capture. Two methods are involved:
- extract(self, pat: Expr, group_index: usize) returns the capture group at group_index after matching the regular expression pat; index 0 is the entire match.
- extract_groups(self, pat: &str) returns all capture groups after matching the regular expression pat.
We continue working from df_lines.
    // The expression is long, so wrap it in a closure that takes the group index.
    let ex = |index| -> Expr {
        // The regular expression captures three fields: name, location, and email.
        col("items").str().extract(
            lit(r#"^([A-Z a-z]*); location:(.*); Email:(.*)$"#),
            index,
        )
    };
    let df_extract = df_lines
        .lazy()
        .select([
            col("teamID"),
            ex(0).alias("source"), // group 0 is the entire matching string
            ex(1).alias("Name"),
            ex(2).alias("location"),
            ex(3).alias("email"),
        ])
        .collect()?;
    println!("{:?}", &df_extract);
Output
shape: (9, 5)
┌────────┬─────────────────────────────────┬───────────┬────────────────────────┬─────────────────────┐
│ teamID ┆ source ┆ Name ┆ location ┆ email │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ str │
╞════════╪═════════════════════════════════╪═══════════╪════════════════════════╪═════════════════════╡
│ team01 ┆ Jada; location:2759 Fairway Dr… ┆ Jada ┆ 2759 Fairway Drive ┆ Jada;@gmail;.com │
│ team01 ┆ Graceland; location:6 Greenlea… ┆ Graceland ┆ 6 Greenleaf Dr ┆ Graceland@gmail.com │
│ team02 ┆ zdlldine; location:2887 Andell… ┆ zdlldine ┆ 2887 Andell Road ┆ zdlldine@gmail.com │
│ team02 ┆ Makana; location:1521 Winifred… ┆ Makana ┆ 1521 Winifred Way ┆ Makana@gmail.com │
│ team02 ┆ Natsuki; location:4416 Golf Co… ┆ Natsuki ┆ 4416 Golf Course Drive ┆ Natsuki@gmail.com │
│ team03 ┆ Pope; location:345 Edgewood Av… ┆ Pope ┆ 345 Edgewood Avenue ┆ Pope@gmail.com │
│ team04 ┆ Oaklynn; location:3017 Cherry … ┆ Oaklynn ┆ 3017 Cherry Camp Road ┆ zdl_361@hotmail.com │
│ team05 ┆ Tysheenia; location:1616 Smith… ┆ Tysheenia ┆ 1616 Smith Street ┆ Tysheenia@gmail.com │
│ team05 ┆ Zenda; location:4416 Golf Cour… ┆ Zenda ┆ 4416 Golf Course Drive ┆ Zenda@gmail.com │
└────────┴─────────────────────────────────┴───────────┴────────────────────────┴─────────────────────┘
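To see what each capture group in the pattern above matches, the same logic can be written by hand with the standard library. This is an illustrative equivalent of that one pattern only, assuming default regex semantics (greedy quantifiers, and "." not matching a newline); the function name capture_member is hypothetical.

```rust
/// Hypothetical std-only equivalent of the pattern
/// `^([A-Z a-z]*); location:(.*); Email:(.*)$`.
/// Returns (name, location, email), or None where the pattern would not match.
fn capture_member(line: &str) -> Option<(&str, &str, &str)> {
    // Group 1: letters and spaces only, ending at the first "; location:".
    let (name, rest) = line.split_once("; location:")?;
    if !name.chars().all(|c| c.is_ascii_alphabetic() || c == ' ') {
        return None;
    }
    // `.` does not match a newline by default, so multi-line input cannot match.
    if rest.contains('\n') {
        return None;
    }
    // Group 2 is greedy, so it extends to the LAST "; Email:"; group 3 takes the rest.
    let (location, email) = rest.rsplit_once("; Email:")?;
    Some((name, location, email))
}

fn main() {
    let line = "Jada; location:2759 Fairway Drive; Email:Jada;@gmail;.com";
    println!("{:?}", capture_member(line));
}
```

Because the last group is simply (.*), the extra semicolons in the first row stay inside the email value, which is why the regex result differs from the split_exact result for that row.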
String API
In the table below, self refers to the value returned by calling str() on an expression.
| API | Description |
|---|---|
| contains_literal(self, pat: Expr) | Checks if it contains a string literal |
| contains(self, pat: Expr, strict: bool) | Checks if it matches a regular expression. If pat is an invalid regular expression, strict==true returns an error. If strict==false, the invalid regular expression will simply evaluate to false. |
| contains_any(self, patterns: Expr, ascii_case_insensitive: bool) | Matches multiple fixed strings using the Aho-Corasick algorithm (see the note below the table). Construct the patterns like this: let pat = lit(Series::new("pat".into(),["fo","ba","str3"])); |
| replace_many(self,patterns: Expr, replace_with: Expr, ascii_case_insensitive: bool) | Replaces multiple strings using the Aho-Corasick algorithm |
| ends_with(self, sub: Expr) | Checks if it ends with the sub string |
| starts_with(self, sub: Expr) | Checks if it starts with the sub string |
| hex_encode(self) | Encodes the string into a hexadecimal string |
| hex_decode(self, strict: bool) | Decodes a hexadecimal string into a regular string |
| base64_encode(self) | Encodes the string using base64 |
| base64_decode(self, strict: bool) | Decodes a base64 string into a regular string |
| extract(self, pat: Expr, group_index: usize) | Extracts a regular expression capture, see Regular Expression Capture |
| find_literal(self, pat: Expr) | Finds the index of the literal |
| find(self, pat: Expr, strict: bool) | Searches for the index of the regular expression |
| count_matches(self, pat: Expr, literal: bool) | Returns the number of matches of pat. If literal is true, pat is treated as a literal string rather than a regular expression |
| strptime(self,dtype: DataType, options: StrptimeOptions,ambiguous: Expr) | Parses a string into Date/Datetime/Time |
| to_datetime | Parses a string into datetime |
| to_time(self, options: StrptimeOptions) | Parses a string into time |
| join(self, delimiter: &str, ignore_nulls: bool) | Joins the strings in the field into a single string, using the delimiter as a separator |
| split(self, by: Expr) | Splits the string into a List<String>. Use explode to turn the result into multiple rows; see the section on splitting strings into multiple rows |
| split_inclusive(self, by: Expr) | Similar to split but retains the delimiter |
| split_exact(self, by: Expr, n: usize) | Makes n splits and returns a Struct with n + 1 fields, which can be unpacked into multiple columns using unnest; see the section on splitting into multiple columns |
| strip_prefix(self, prefix: Expr) | Removes the prefix |
| strip_suffix(self, suffix: Expr) | Removes the suffix |
| to_lowercase(self) | Converts all characters to lowercase |
| to_uppercase(self) | Converts all characters to uppercase |
| to_integer(self, base: Expr, strict: bool) | Parses the string into an integer according to the specified base |
| len_bytes(self) | Counts the number of bytes |
| len_chars(self) | Counts the number of characters |
| slice(self, offset: Expr, length: Expr) | Returns the substring that starts at offset and spans length |
| head(self, n: Expr) | Returns the first n characters |
| tail(self, n: Expr) | Returns the last n characters |
Note: the patterns passed to the Aho-Corasick functions are fixed strings, not regular expressions. Aho-Corasick is a multi-pattern matching algorithm: it finds the positions of many fixed patterns in a text at once. Its core idea is to build a single automaton from all the pattern strings. As the input text passes through the automaton, every matching pattern is identified in a single pass.
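The automaton idea can be sketched in a few dozen lines of standard-library Rust. This is a minimal illustration of the algorithm, assuming non-empty patterns and byte-wise matching; it is not the implementation Polars uses, and the struct and method names are chosen here for the example.

```rust
use std::collections::{HashMap, VecDeque};

/// Minimal Aho-Corasick automaton over bytes (illustrative sketch only).
struct AhoCorasick {
    next: Vec<HashMap<u8, usize>>, // trie edges for each state
    fail: Vec<usize>,              // failure link for each state
    out: Vec<Vec<usize>>,          // indices of patterns ending at each state
    lens: Vec<usize>,              // byte length of each pattern
}

impl AhoCorasick {
    fn new(patterns: &[&str]) -> Self {
        let mut ac = AhoCorasick {
            next: vec![HashMap::new()],
            fail: vec![0],
            out: vec![Vec::new()],
            lens: patterns.iter().map(|p| p.len()).collect(),
        };
        // 1. Insert every pattern into a trie rooted at state 0.
        for (idx, pat) in patterns.iter().enumerate() {
            let mut state = 0;
            for &b in pat.as_bytes() {
                state = match ac.next[state].get(&b).copied() {
                    Some(s) => s,
                    None => {
                        ac.next.push(HashMap::new());
                        ac.fail.push(0);
                        ac.out.push(Vec::new());
                        let new_state = ac.next.len() - 1;
                        ac.next[state].insert(b, new_state);
                        new_state
                    }
                };
            }
            ac.out[state].push(idx);
        }
        // 2. Breadth-first pass: compute failure links and inherit matches.
        let mut queue: VecDeque<usize> = ac.next[0].values().copied().collect();
        while let Some(state) = queue.pop_front() {
            let edges: Vec<(u8, usize)> =
                ac.next[state].iter().map(|(&b, &s)| (b, s)).collect();
            for (b, child) in edges {
                // Walk failure links until some state has an edge on `b`.
                let mut f = ac.fail[state];
                while f != 0 && !ac.next[f].contains_key(&b) {
                    f = ac.fail[f];
                }
                let target = ac.next[f].get(&b).copied().unwrap_or(0);
                ac.fail[child] = target;
                // Patterns ending at the failure target also end here.
                let inherited = ac.out[target].clone();
                ac.out[child].extend(inherited);
                queue.push_back(child);
            }
        }
        ac
    }

    /// Returns (start byte offset, pattern index) for every match, overlaps included.
    fn find_all(&self, text: &str) -> Vec<(usize, usize)> {
        let mut matches = Vec::new();
        let mut state = 0;
        for (i, &b) in text.as_bytes().iter().enumerate() {
            while state != 0 && !self.next[state].contains_key(&b) {
                state = self.fail[state];
            }
            state = self.next[state].get(&b).copied().unwrap_or(0);
            for &idx in &self.out[state] {
                matches.push((i + 1 - self.lens[idx], idx));
            }
        }
        matches
    }
}

fn main() {
    let patterns = ["he", "she", "his", "hers"];
    let ac = AhoCorasick::new(&patterns);
    for (start, idx) in ac.find_all("ushers") {
        println!("{} at byte {}", patterns[idx], start);
    }
}
```

The text is scanned once regardless of how many patterns there are. A check in the style of contains_any only needs to know whether find_all would return at least one match.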