字符串操作

数据处理中的字符串操作主要涉及字符串切割和字符串提取:将字符串分割到多行,或将字符串的正则表达式捕获分割到多列。 在对表达式进行向下转型col("str").str()后就可以进行字符串运算。

字符串分割到多行

以下dataframe中每个team的成员信息被用字符串储存,我们的第一个任务是将字符串分割到多行,以换行符为分隔符。

#![allow(unused)]
fn main() {
let df_str = df!{"items" => ["Jada; location:2759 Fairway Drive; Email:Jada;@gmail;.com\nGraceland; location:6 Greenleaf Dr; Email:Graceland@gmail.com",
"zdlldine; location:2887 Andell Road; Email:zdlldine@gmail.com\nMakana; location:1521 Winifred Way; Email:Makana@gmail.com\nNatsuki; location:4416 Golf Course Drive; Email:Natsuki@gmail.com",
"Pope; location:345 Edgewood Avenue; Email:Pope@gmail.com",
"Oaklynn; location:3017 Cherry Camp Road; Email:zdl_361@hotmail.com",
"Tysheenia; location:1616 Smith Street; Email:Tysheenia@gmail.com\nZenda; location:4416 Golf Course Drive; Email:Zenda@gmail.com"],
"teamID" => ["team01","team02","team03","team04","team05"]}?;
let df_res = df_str.lazy().select([col("teamID"),col("items").str().split(lit("\n"))]).collect()?;
println!("{:?}",&df_res);
}

Output:

shape: (5, 2)
┌────────┬─────────────────────────────────┐
│ teamID ┆ items                           │
│ ---    ┆ ---                             │
│ str    ┆ list[str]                       │
╞════════╪═════════════════════════════════╡
│ team01 ┆ ["Jada; location:2759 Fairway … │
│ team02 ┆ ["zdlldine; location:2887 And … │
│ team03 ┆ ["Pope; location:345 Edgewood … │
│ team04 ┆ ["Oaklynn; location:3017 Cherr… │
│ team05 ┆ ["Tysheenia; location:1616 Smi… │
└────────┴─────────────────────────────────┘

字符串分割后的值被包裹进List,调用explode(["items"],将items字段解包到多行。

#![allow(unused)]
fn main() {
let df_lines=df_res.explode(["items"])?;
println!("{:?}",&df_lines);
}

Output

shape: (9, 2)
┌────────┬─────────────────────────────────┐
│ teamID ┆ items                           │
│ ---    ┆ ---                             │
│ str    ┆ str                             │
╞════════╪═════════════════════════════════╡
│ team01 ┆ Jada; location:2759 Fairway Dr… │
│ team01 ┆ Graceland; location:6 Greenlea… │
│ team02 ┆ zdlldine; location:2887 Andell… │
│ team02 ┆ Makana; location:1521 Winifred… │
│ team02 ┆ Natsuki; location:4416 Golf Co… │
│ team03 ┆ Pope; location:345 Edgewood Av… │
│ team04 ┆ Oaklynn; location:3017 Cherry … │
│ team05 ┆ Tysheenia; location:1616 Smith… │
│ team05 ┆ Zenda; location:4416 Golf Cour… │
└────────┴─────────────────────────────────┘

分割到多列

与split类似split_exact()将字符串分割后保存进 DataType::Struct中,再经过unnest()解包,可以分割到多个字段。 split_exact(lit(";"),3) ,第一个参数为分隔符,第二个参数为字段数量。精确返回固定数量的字段,如果分割字符串后数量不够,产生null值。过多则直接丢弃多余字段。

#![allow(unused)]
fn main() {
let df_structed = df_lines.lazy().select([
    col("teamID"),
    col("items").str().split_exact(lit(";"),3)
]).collect()?;
println!("df_structed\n{:?}",&df_structed);
}

注意返回的items的类型为struct类型。 Output


df_structed
shape: (9, 2)
┌────────┬─────────────────────────────────┐
│ teamID ┆ items                           │
│ ---    ┆ ---                             │
│ str    ┆ struct[4]                       │
╞════════╪═════════════════════════════════╡
│ team01 ┆ {"Jada"," location:2759 Fairwa… │
│ team01 ┆ {"Graceland"," location:6 Gree… │
│ team02 ┆ {"zdlldine"," location:2887 An… │
│ team02 ┆ {"Makana"," location:1521 Wini… │
│ team02 ┆ {"Natsuki"," location:4416 Gol… │
│ team03 ┆ {"Pope"," location:345 Edgewoo… │
│ team04 ┆ {"Oaklynn"," location:3017 Che… │
│ team05 ┆ {"Tysheenia"," location:1616 S… │
│ team05 ┆ {"Zenda"," location:4416 Golf … │
└────────┴─────────────────────────────────┘

应用unnest解包struct到多个字段

#![allow(unused)]
fn main() {
let df_unnest = df_structed.unnest(["items"])?;
println!("df_unnest:\n{:?}",&df_unnest);
}

为了演示字段数量的作用,我们将第一个df_str items字段的第一个元素多增加了几个分号。所以导致第一个元素的 field_3字段不为null。 Output

df_unnest:
shape: (9, 5)
┌────────┬───────────┬─────────────────────────────────┬────────────────────────────┬─────────┐
│ teamID ┆ field_0   ┆ field_1                         ┆ field_2                    ┆ field_3 │
│ ---    ┆ ---       ┆ ---                             ┆ ---                        ┆ ---     │
│ str    ┆ str       ┆ str                             ┆ str                        ┆ str     │
╞════════╪═══════════╪═════════════════════════════════╪════════════════════════════╪═════════╡
│ team01 ┆ Jada      ┆  location:2759 Fairway Drive    ┆  Email:Jada                ┆ @gmail  │
│ team01 ┆ Graceland ┆  location:6 Greenleaf Dr        ┆  Email:Graceland@gmail.com ┆ null    │
│ team02 ┆ zdlldine  ┆  location:2887 Andell Road      ┆  Email:zdlldine@hotmail.com┆ null    │
│ team02 ┆ Makana    ┆  location:1521 Winifred Way     ┆  Email:Makana@gmail.com    ┆ null    │
│ team02 ┆ Natsuki   ┆  location:4416 Golf Course Dri… ┆  Email:Natsuki@gmail.com   ┆ null    │
│ team03 ┆ Pope      ┆  location:345 Edgewood Avenue   ┆  Email:Pope@gmail.com      ┆ null    │
│ team04 ┆ Oaklynn   ┆  location:3017 Cherry Camp Roa… ┆  Email:zdl_361@hotmail.com ┆ null    │
│ team05 ┆ Tysheenia ┆  location:1616 Smith Street     ┆  Email:Tysheenia@gmail.com ┆ null    │
│ team05 ┆ Zenda     ┆  location:4416 Golf Course Dri… ┆  Email:Zenda@gmail.com     ┆ null    │
└────────┴───────────┴─────────────────────────────────┴────────────────────────────┴─────────┘

正则表达式捕获

有时候简单的split并不能满足业务需求。复杂的工作需要正则表达式捕获才能完成。 主要涉及表达式extract方法:

  • extract(self, pat: Expr, group_index: usize) 正则pat匹配-后,返回group_index索引处捕获的值。
  • extract_groups(self, pat: &str) 正则pat匹配后,返回所有捕获

我们在df_lines的基础上继续执行工作。

#![allow(unused)]

fn main() {
//因为表达式太长,我们自定义一个表达式
let ex = |index| -> Expr{
    //这里使用正则表达式捕获三个字段
    col("items").str().extract(lit(r#"^([A-Z a-z]*); location:(.*); Email:(.*)$"#), index)
};

let df_extract=df_lines.lazy().select([
    col("teamID"),
    ex(0).alias("source"), // 0 捕获为全匹配字符串
    ex(1).alias("Name"),
    ex(2).alias("location"),
    ex(3).alias("email"),    
    ]).collect()?;
println!("{:?}",&df_extract);

}

Output

shape: (9, 5)
┌────────┬─────────────────────────────────┬───────────┬────────────────────────┬─────────────────────┐
│ teamID ┆ source                          ┆ Name      ┆ location               ┆ email               │
│ ---    ┆ ---                             ┆ ---       ┆ ---                    ┆ ---                 │
│ str    ┆ str                             ┆ str       ┆ str                    ┆ str                 │
╞════════╪═════════════════════════════════╪═══════════╪════════════════════════╪═════════════════════╡
│ team01 ┆ Jada; location:2759 Fairway Dr… ┆ Jada      ┆ 2759 Fairway Drive     ┆ Jada;@gmail;.com    │
│ team01 ┆ Graceland; location:6 Greenlea… ┆ Graceland ┆ 6 Greenleaf Dr         ┆ Graceland@gmail.com │
│ team02 ┆ zdlldine; location:2887 Andell… ┆ Ives      ┆ 2887 Andell Road       ┆ zdlldine@gmail.com  │
│ team02 ┆ Makana; location:1521 Winifred… ┆ Makana    ┆ 1521 Winifred Way      ┆ Makana@gmail.com    │
│ team02 ┆ Natsuki; location:4416 Golf Co… ┆ Natsuki   ┆ 4416 Golf Course Drive ┆ Natsuki@gmail.com   │
│ team03 ┆ Pope; location:345 Edgewood Av… ┆ Pope      ┆ 345 Edgewood Avenue    ┆ Pope@gmail.com      │
│ team04 ┆ Oaklynn; location:3017 Cherry … ┆ Oaklynn   ┆ 3017 Cherry Camp Road  ┆ zdl_361@hotmail.com │
│ team05 ┆ Tysheenia; location:1616 Smith… ┆ Tysheenia ┆ 1616 Smith Street      ┆ Tysheenia@gmail.com │
│ team05 ┆ Zenda; location:4416 Golf Cour… ┆ Zenda     ┆ 4416 Golf Course Drive ┆ Zenda@gmail.com     │
└────────┴─────────────────────────────────┴───────────┴────────────────────────┴─────────────────────┘

字符串API

Self为Expr.str()返回值

API含义
contains_literal(self, pat: Expr)是否包含字符串字面量
contains(self, pat: Expr, strict: bool)是否匹配正则表达式。如果pat是无效正则表达式,strict==true时返回错误。strict==false 则无效的正则表达式将简单评估为false。
contains_any(self, patterns: Expr, ascii_case_insensitive: bool)使用aho-corasick算法1匹配多个固定字符串。
pattern应该这样构建let pat = lit(Series::new("pat".into(),["fo","ba","str3"]));
replace_many(self,patterns: Expr, replace_with: Expr, ascii_case_insensitive: bool)使用aho-corasick算法替换多个字符串
ends_with(self, sub: Expr)是否以sub字符串结束
starts_with(self, sub: Expr)是否以sub字符串起始
hex_encode(self)字符串编码为十六进制字符串
hex_decode(self, strict: bool)十六进制解码成字符串
base64_encode(self)字符用base64编码
base64_decode(self, strict: bool)base64解码为字符串
extract(self, pat: Expr, group_index: usize)提取正则表达式捕获,见正则表达式捕获
find_literal(self, pat: Expr)查询字面量所在的索引
find(self, pat: Expr, strict: bool)搜索正则表达式所在的索引
count_matches(self, pat: Expr, literal: bool)返回正则表达式成功匹配的计数
strptime(self,dtype: DataType, options: StrptimeOptions,ambiguous: Expr)解析字符串为Date/Datetime/Time
to_datetime解析字符串为datetime
to_time(self, options: StrptimeOptions)解析字符串为time
join(self, delimiter: &str, ignore_nulls: bool)将字段里的字符串连接成一个字符串,用delimiter作为分割
split(self, by: Expr)分割字符串到List<String>。可以使用explode将结果拆到不同行。字符串分割到多行
split_inclusive(self, by: Expr)同split但分隔符保留
split_exact(self, by: Expr, n: usize)分割到Struce,可以用unnest解包到多个字段。分割到多列
strip_prefix(self, prefix: Expr)删除前缀
strip_suffix(self, suffix: Expr)删除后缀
to_lowercase(self)全部小写
to_uppercase(self)全部大写
to_integer(self, base: Expr, strict: bool)按照指定的进制解析为整数
len_bytes(self)字节计数
len_chars(self)字符计数
slice(self, offset: Expr, length: Expr)返回切片所引用的子字符串
head(self, n: Expr)返回前n个字符
tail(self, n: Expr)返回最后n个字符
1

Aho-Corasick 算法的模式不是正则表达式,而是多个固定字符串的集合。这个算法用于多模式匹配,即在一段文本中查找多个固定的模式(字符串)出现的位置。Aho-Corasick 算法的核心思想是通过构建一个自动机来同时匹配多个模式字符串。当输入的文本流过自动机时,它能够高效地识别出所有匹配的模式。