Lazyframe Grouping and Aggregation

The expressions in Lazyframe are carefully designed, and it is generally recommended to prioritize using the LazyFrame's grouping and aggregation API.

TermMeaning
Simple AggregationInput several data points and return a single value based on a certain algorithm. For example, min, max, mean, head, tail, etc., are all aggregation functions.
GroupingAs the name suggests, it groups data based on specified fields, and subsequent aggregation operations will call the aggregation function once for each group.

Practical Example 1

Observe the sample data employee_df, which contains the performance of 3 employees over the past 4 months.

#![allow(unused)]
fn main() {
let mut employee_df: DataFrame = df!("Name"=> ["Lao Li", "Lao Li", "Lao Li", "Lao Li", "Lao Zhang", "Lao Zhang", "Lao Zhang", "Lao Zhang", "Lao Wang", "Lao Wang", "Lao Wang", "Lao Wang"],
"Employee ID"=> ["Employee01", "Employee01", "Employee01", "Employee01", "Employee02", "Employee02", "Employee02", "Employee02", "Employee03", "Employee03", "Employee03", "Employee03"],
"Date"=> ["August", "September", "October", "November", "August", "September", "October", "November", "August", "September", "October", "November"],
"Performance"=> [83, 24, 86, 74, 89, 59, 48, 79, 51, 71, 44, 90])?;
}

The current requirement is to calculate the average performance of each person for all months and count the number of times each person's performance was greater than 70. The data needs to be grouped by Employee ID and then aggregated.

#![allow(unused)]
fn main() {
   let res = employee_df.lazy().group_by(["Employee ID","Name"]) //group_by may disrupt row order, group_by_stable can preserve the original row order.
            .agg([
                col("Performance").mean().alias("Average Performance"),
                col("Performance").gt(70).cast(DataType::Int32).sum().alias("Count Greater Than 70")
            ]).collect()?;
       
    println!("{}",res);
}

Polars can perform simple aggregation through the aggregation functions provided by expressions.

Output

shape: (3, 4)
┌─────────────┬───────────┬─────────────────────┬───────────────────────┐
│ Employee ID ┆ Name      ┆ Average Performance ┆ Count Greater Than 70 │
│ ---         ┆ ---       ┆ ---                 ┆ ---                   │
│ str         ┆ str       ┆ f64                 ┆ i32                   │
╞═════════════╪═══════════╪═════════════════════╪═══════════════════════╡
│ Employee03  ┆ Lao Wang  ┆ 64.0                ┆ 2                     │
│ Employee01  ┆ Lao Li    ┆ 66.75               ┆ 3                     │
│ Employee02  ┆ Lao Zhang ┆ 68.75               ┆ 2                     │
└─────────────┴───────────┴─────────────────────┴───────────────────────┘

Practical Example 2

Calculate the top two performers and their corresponding performances for each month.

#![allow(unused)]
fn main() {
let step1 = employee_df.lazy().group_by(["Date"]) //group_by may disrupt row order, group_by_stable can preserve the original row order.
.agg([
    col("Employee ID"),
    col("Performance"),
    col("Performance").rank(RankOptions::default(),None).alias("rank"),
]);
let step2 =step1.clone()
    .explode([col("Employee ID"),col("Performance"),col("rank")]);
let step3 =step2.clone() 
    .filter(col("rank").gt_eq(2)) ;
       
println!("step1:{}\nstep2:{}\nstep3:{}",step1.collect()?,step2.collect()?,step3.collect()?);
}

Step1

step1:shape: (4, 4)
┌───────────┬─────────────────────────────────┬──────────────┬───────────┐
│ Date      ┆ Employee ID                     ┆ Performance  ┆ rank      │
│ ---       ┆ ---                             ┆ ---          ┆ ---       │
│ str       ┆ list[str]                       ┆ list[i32]    ┆ list[u32] │
╞═══════════╪═════════════════════════════════╪══════════════╪═══════════╡
│ August    ┆ ["Employee01", "Employee02", "… ┆ [83, 89, 51] ┆ [2, 3, 1] │
│ October   ┆ ["Employee01", "Employee02", "… ┆ [86, 48, 44] ┆ [3, 2, 1] │
│ November  ┆ ["Employee01", "Employee02", "… ┆ [74, 79, 90] ┆ [1, 2, 3] │
│ September ┆ ["Employee01", "Employee02", "… ┆ [24, 59, 71] ┆ [1, 2, 3] │
└───────────┴─────────────────────────────────┴──────────────┴───────────┘

step2

step2:shape: (12, 4)
┌───────────┬─────────────┬─────────────┬──────┐
│ Date      ┆ Employee ID ┆ Performance ┆ rank │
│ ---       ┆ ---         ┆ ---         ┆ ---  │
│ str       ┆ str         ┆ i32         ┆ u32  │
╞═══════════╪═════════════╪═════════════╪══════╡
│ November  ┆ Employee01  ┆ 74          ┆ 1    │
│ November  ┆ Employee02  ┆ 79          ┆ 2    │
│ November  ┆ Employee03  ┆ 90          ┆ 3    │
│ October   ┆ Employee01  ┆ 86          ┆ 3    │
│ October   ┆ Employee02  ┆ 48          ┆ 2    │
│ …         ┆ …           ┆ …           ┆ …    │
│ September ┆ Employee02  ┆ 59          ┆ 2    │
│ September ┆ Employee03  ┆ 71          ┆ 3    │
│ August    ┆ Employee01  ┆ 83          ┆ 2    │
│ August    ┆ Employee02  ┆ 89          ┆ 3    │
│ August    ┆ Employee03  ┆ 51          ┆ 1    │
└───────────┴─────────────┴─────────────┴──────┘

step3

step3:shape: (8, 4)
┌───────────┬─────────────┬─────────────┬──────┐
│ Date      ┆ Employee ID ┆ Performance ┆ rank │
│ ---       ┆ ---         ┆ ---         ┆ ---  │
│ str       ┆ str         ┆ i32         ┆ u32  │
╞═══════════╪═════════════╪═════════════╪══════╡
│ November  ┆ Employee02  ┆ 79          ┆ 2    │
│ November  ┆ Employee03  ┆ 90          ┆ 3    │
│ August    ┆ Employee01  ┆ 83          ┆ 2    │
│ August    ┆ Employee02  ┆ 89          ┆ 3    │
│ October   ┆ Employee01  ┆ 86          ┆ 3    │
│ October   ┆ Employee02  ┆ 48          ┆ 2    │
│ September ┆ Employee02  ┆ 59          ┆ 2    │
│ September ┆ Employee03  ┆ 71          ┆ 3    │
└───────────┴─────────────┴─────────────┴──────┘

Complex Aggregation and User defined Custom Functions

  • Simple Aggregation: Input a single Series, output a Series with only one element.
  • Complex Aggregation: Input multiple Series, output multiple rows and columns.

Complex aggregation requires custom functions to compute the desired results. In the context of lazy().group_by/agg expressions, use col("Performance").apply_many.

#![allow(unused)]
fn main() {
use polars::prelude::*;
let mut employee_df: DataFrame = df!("Name"=> ["Lao Li", "Lao Li", "Lao Li", "Lao Li", "Lao Zhang", "Lao Zhang", "Lao Zhang", "Lao Zhang", "Lao Wang", "Lao Wang", "Lao Wang", "Lao Wang"],
"Employee ID"=> ["Employee01", "Employee01", "Employee01", "Employee01", "Employee02", "Employee02", "Employee02", "Employee02", "Employee03", "Employee03", "Employee03", "Employee03"],
"Date"=> ["August", "September", "October", "November", "August", "September", "October", "November", "August", "September", "October", "November"],
"Performance"=> [83, 24, 86, 74, 89, 59, 48, 79, 51, 71, 44, 90])?;
let user_defined_function= |x: & mut[Series]| -> Result<Option<Series>, PolarsError>{
    let arg0 = &x[0];
    let arg1 = &x[1];
    let arg2 = &x[2];
    // After analysis, we obtained the following results
    let res_field1 = Series::new("rank".into(), vec!["field1,row[10]","row[11]","row[12]"]);
    let res_field2 = Series::new("rank2".into(), vec!["field2,row[20]","row[21]","row[22]"]);
    let res_field3 = Series::new("rank3".into(), vec![1,2,3]);
    // For each group, we can return a complex two-dimensional structure
    // In simple aggregation, the result can only be a single value.
    // Complex two-dimensional structures must be wrapped in StructChunked to be stored in a single Series
    // For each group, we must keep the "Schema" of StructChunked unchanged,
    // "Schema" refers to the order, name, and data type of StructChunked fields.
    let res=StructChunked::from_series("res".into(), &[res_field1,res_field2,res_field3])?.into_series();
    Ok(Some(res))
};
// let sc = DataType::Struct(vec![
// Field::new("f1".into(), DataType::String),
// Field::new("f2".into(), DataType::String),
// Field::new("f3".into(), DataType::Int32 )
// ]);
// In the API documentation, `GetOutput::from_type(DataType::Boolean)` should be `GetOutput::from_type(sc)`. But in fact, any `GetOutput` can work.
let output_type = GetOutput::from_type(DataType::Boolean);
let res = employee_df.lazy().group_by([col("Date")]).agg(
[
    //col("date"),
    col("Performance").apply_many(user_defined_function, &[col("Name"),col("Employee ID"),col("Performance")], output_type)
]
).collect()?;
//explode, unnest are used to unpack StructChunked
println!("{}",res.explode(["Performance"])?.unnest(["Performance"])?);
}

.explode, .unnest are used to unpack StructChunked. See Splitting Strings into Multiple Lines, Splitting into Multiple Columns

In this example, we define a custom function user_defined_function that processes multiple Series and returns a complex two-dimensional structure wrapped in a StructChunked. This allows us to perform complex aggregations that output multiple rows and columns for each group. The explode and unnest functions are used to unpack the StructChunked into a more readable format.

Output

shape: (12, 4)
┌───────────┬────────────────┬────────────────┬───────┐
│ Date      ┆ rank           ┆ rank2          ┆ rank3 │
│ ---       ┆ ---            ┆ ---            ┆ ---   │
│ str       ┆ str            ┆ str            ┆ i32   │
╞═══════════╪════════════════╪════════════════╪═══════╡
│ October   ┆ field1,row[10] ┆ field2,row[20] ┆ 1     │
│ October   ┆ row[11]        ┆ row[21]        ┆ 2     │
│ October   ┆ row[12]        ┆ row[22]        ┆ 3     │
│ August    ┆ field1,row[10] ┆ field2,row[20] ┆ 1     │
│ August    ┆ row[11]        ┆ row[21]        ┆ 2     │
│ …         ┆ …              ┆ …              ┆ …     │
│ September ┆ row[11]        ┆ row[21]        ┆ 2     │
│ September ┆ row[12]        ┆ row[22]        ┆ 3     │
│ November  ┆ field1,row[10] ┆ field2,row[20] ┆ 1     │
│ November  ┆ row[11]        ┆ row[21]        ┆ 2     │
│ November  ┆ row[12]        ┆ row[22]        ┆ 3     │
└───────────┴────────────────┴────────────────┴───────┘