Join操作
什么是join联接
数据处理中经常出现以下需求:一张表保存了大量用户id和用户姓名、年龄、手机号、购物地址等等用户信息,另一张表保存了用户id和待快递商品,商品价格等等交易信息。如何能根据id将两张表的数据整合到一起。
联接操作用于解决以上问题。根据两张dataframe指定字段,将两张表组合到一起。join需要两个dataframe作为输入,分别称为左表和右表;另外必须指明哪些字段被用做匹配依据,称为"键",除了键以外的所有字段被用作值,分别称为左表值字段、右表值字段。join运算的结果包含:键,左表值字段、右表值字段。 join会出现4种情况
- 某键在左表中出现,在右表未出现;此键对应的右表值字段视为null
- 某键在右表中出现,在左表未出现;此键对应的左表值字段视为null
- 某键在两张表中都出现; 键至少在其中一个表是unique的。
- 某键在两张表中都出现; 键在两张表都不是unique的。会导致多对多联接,结果意义不明,通常是数据错误导致,需仔细排查。
可以同时使用多个字段作为键。
主要链接操作分5种:
- 左联接left_join:结果保留所有左表的键,及其对应的左表值字段、右表值字段。对于右表中未出现的键对应的值视为null。
- 右联接right_join: 结果保留所有右表的键,及其对应的左表值字段、右表值字段。对于左表中未出现的键对应的值视为null。
- 内联接inner_join:结果保留左表和右表共有的键,及其对应的左表值字段、右表值字段。
- 全联接full_join:结果保留左表和右表所有的键,及其对应的左表值字段、右表值字段,不存在的值被视为null。
- 笛卡尔联接cross_join:"笛卡尔积"是指两个数据集的所有可能的组合。例如,对于某键的左表值字段包含A、B两行,右表值字段包含1、2两行,那么这个键的笛卡尔联接将产生四行:(A, 1),(A, 2),(B, 1),(B, 2)。
left_join和right_join本质相同仅仅是调换了左表,右表,因此polars仅仅实现了left_join。 dataframe和lazyframe有不同的语法。
joinAPI
#![allow(unused)] fn main() { left_df.left_join(right_df, left_on,right_on) left_df.inner_join(right_df, left_on,right_on) left_df.full_join(right_df, left_on,right_on) left_df.cross_join(right_df, left_on,right_on) //用法: left.full_join(right, ["join_column_left"], ["join_column_right"]) }
示例:
employee_score储存了雇员的考核数据。employee_info储存了雇员的身份. 我们将使用employee_ID字段合并两张表.
#![allow(unused)] fn main() { let mut employee_score: DataFrame = df!( "employee_ID" => ["Employee01", "Employee01", "Employee01", "Employee01", "Employee02", "Employee02", "Employee02", "Employee02", "Employee03", "Employee03", "Employee03", "Employee03"], "date" => ["August", "September", "October", "November", "August", "September", "October", "November", "August", "September", "October", "November"], "score" => [83, 24, 86, 74, 89, 59, 48, 79, 51, 71, 44, 90] )?; let mut employee_info = df!{ "Name" => ["Lao Li", "Lao Zhang", "Lao Wang"], "employee_ID" => ["Employee01", "Employee02", "Employee03"], "email" => ["zdl361@126.com","LaoZhang@gmail.com","LaoWang@hotmail.com"] }?; let res = employee_score.left_join(&employee_info, ["employee_ID"], ["employee_ID"])?; println!("employee_score:{}\nemployee_info:{}\nafter join:{}",employee_score,employee_info,res); }
Output
employee_score:shape: (12, 3)
┌─────────────┬───────────┬───────┐
│ employee_ID ┆ date ┆ score │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i32 │
╞═════════════╪═══════════╪═══════╡
│ Employee01 ┆ August ┆ 83 │
│ Employee01 ┆ September ┆ 24 │
│ Employee01 ┆ October ┆ 86 │
│ Employee01 ┆ November ┆ 74 │
│ Employee02 ┆ August ┆ 89 │
│ … ┆ … ┆ … │
│ Employee02 ┆ November ┆ 79 │
│ Employee03 ┆ August ┆ 51 │
│ Employee03 ┆ September ┆ 71 │
│ Employee03 ┆ October ┆ 44 │
│ Employee03 ┆ November ┆ 90 │
└─────────────┴───────────┴───────┘
employee_info:shape: (3, 3)
┌───────────┬─────────────┬─────────────────────┐
│ Name ┆ employee_ID ┆ email │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═══════════╪═════════════╪═════════════════════╡
│ Lao Li ┆ Employee01 ┆ zdl361@126.com │
│ Lao Zhang ┆ Employee02 ┆ LaoZhang@gmail.com │
│ Lao Wang ┆ Employee03 ┆ LaoWang@hotmail.com │
└───────────┴─────────────┴─────────────────────┘
after join:shape: (12, 5)
┌─────────────┬───────────┬───────┬───────────┬─────────────────────┐
│ employee_ID ┆ date ┆ score ┆ Name ┆ email │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i32 ┆ str ┆ str │
╞═════════════╪═══════════╪═══════╪═══════════╪═════════════════════╡
│ Employee01 ┆ August ┆ 83 ┆ Lao Li ┆ zdl361@126.com │
│ Employee01 ┆ September ┆ 24 ┆ Lao Li ┆ zdl361@126.com │
│ Employee01 ┆ October ┆ 86 ┆ Lao Li ┆ zdl361@126.com │
│ Employee01 ┆ November ┆ 74 ┆ Lao Li ┆ zdl361@126.com │
│ Employee02 ┆ August ┆ 89 ┆ Lao Zhang ┆ LaoZhang@gmail.com │
│ … ┆ … ┆ … ┆ … ┆ … │
│ Employee02 ┆ November ┆ 79 ┆ Lao Zhang ┆ LaoZhang@gmail.com │
│ Employee03 ┆ August ┆ 51 ┆ Lao Wang ┆ LaoWang@hotmail.com │
│ Employee03 ┆ September ┆ 71 ┆ Lao Wang ┆ LaoWang@hotmail.com │
│ Employee03 ┆ October ┆ 44 ┆ Lao Wang ┆ LaoWang@hotmail.com │
│ Employee03 ┆ November ┆ 90 ┆ Lao Wang ┆ LaoWang@hotmail.com │
└─────────────┴───────────┴───────┴───────────┴─────────────────────┘