class RelationalGroupedDataset extends AnyRef
A set of methods for aggregations on a DataFrame, created by groupBy, cube or rollup (and also pivot).
The main method is the agg function, which has multiple variants. This class also contains some first-order statistics such as mean and sum for convenience.
- Annotations
- @Stable()
- Since
2.0.0
- Note
This class was named GroupedData in Spark 1.x.
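As a rough illustration of the description above, the sketch below obtains a RelationalGroupedDataset through groupBy, rollup, cube and pivot, then turns each one back into a DataFrame with an aggregation. The object name, the sample data and the local SparkSession are assumptions made for the example only.

```scala
import org.apache.spark.sql.{DataFrame, RelationalGroupedDataset, SparkSession}

object GroupingOverviewSketch {
  // Assumption: a local, single-threaded session is enough for illustration.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("grouping-overview-sketch")
    .getOrCreate()
  import spark.implicits._

  // Hypothetical sample data: (department, year, expense).
  val expenses: DataFrame = Seq(
    ("eng", 2023, 100.0),
    ("eng", 2024, 250.0),
    ("ops", 2023, 80.0)
  ).toDF("department", "year", "expense")

  // Each of these calls returns a RelationalGroupedDataset, not a DataFrame.
  val grouped: RelationalGroupedDataset = expenses.groupBy("department", "year")
  val rolledUp: RelationalGroupedDataset = expenses.rollup("department", "year")
  val cubed: RelationalGroupedDataset = expenses.cube("department", "year")
  val pivoted: RelationalGroupedDataset =
    expenses.groupBy("department").pivot("year", Seq(2023, 2024))

  // An aggregation method turns the grouped dataset back into a DataFrame.
  val groupedTotals: DataFrame = grouped.sum("expense")
  val rolledUpTotals: DataFrame = rolledUp.sum("expense") // adds subtotal and grand-total rows
  val cubedTotals: DataFrame = cubed.sum("expense")       // adds totals for every column combination
  val pivotedTotals: DataFrame = pivoted.sum("expense")   // one output column per pivot value

  def main(args: Array[String]): Unit = {
    groupedTotals.show()
    rolledUpTotals.show()
    cubedTotals.show()
    pivotedTotals.show()
    spark.stop()
  }
}
```

The rollup and cube results contain extra rows whose grouping columns are null; those rows hold the subtotals and the grand total.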
Instance Constructors
-
new
RelationalGroupedDataset(df: DataFrame, groupingExprs: Seq[Expression], groupType: GroupType)
- Attributes
- protected[org.apache.spark.sql]
Value Members
-
final
def
!=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
final
def
##(): Int
- Definition Classes
- AnyRef → Any
-
final
def
==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
def
agg(expr: Column, exprs: Column*): DataFrame
Compute aggregates by specifying a series of aggregate columns. Note that this function by default retains the grouping columns in its output. To not retain grouping columns, set spark.sql.retainGroupColumns to false.
The available aggregate methods are defined in org.apache.spark.sql.functions.
    // Selects the age of the oldest employee and the aggregate expense for each department

    // Scala:
    import org.apache.spark.sql.functions._
    df.groupBy("department").agg(max("age"), sum("expense"))

    // Java:
    import static org.apache.spark.sql.functions.*;
    df.groupBy("department").agg(max("age"), sum("expense"));
Note that before Spark 1.4, the default behavior was to NOT retain grouping columns. To change to that behavior, set the config variable spark.sql.retainGroupColumns to false.
    // Scala, 1.3.x:
    df.groupBy("department").agg($"department", max("age"), sum("expense"))

    // Java, 1.3.x:
    df.groupBy("department").agg(col("department"), max("age"), sum("expense"));
- Annotations
- @varargs()
- Since
1.3.0
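The effect of spark.sql.retainGroupColumns described above might be observed as in the following sketch. It assumes a local SparkSession, hypothetical sample data, and that the setting can be changed at runtime and is picked up at the moment agg is called; treat that last point as an assumption to verify against your Spark version.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{max, sum}

object RetainGroupColumnsSketch {
  // Assumption: a local, single-threaded session is enough for illustration.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("retain-group-columns-sketch")
    .getOrCreate()
  import spark.implicits._

  // Hypothetical sample data: (department, age, expense).
  val employees: DataFrame = Seq(
    ("eng", 30, 100.0),
    ("eng", 45, 250.0),
    ("ops", 28, 80.0)
  ).toDF("department", "age", "expense")

  // Default behavior: the grouping column is part of the output.
  val withGrouping: DataFrame =
    employees.groupBy("department").agg(max("age"), sum("expense"))

  // Assumption: the setting is read when agg is called, so it is changed first.
  spark.conf.set("spark.sql.retainGroupColumns", "false")
  val withoutGrouping: DataFrame =
    employees.groupBy("department").agg(max("age"), sum("expense"))

  // Restore the default so later queries on this session are unaffected.
  spark.conf.set("spark.sql.retainGroupColumns", "true")

  def main(args: Array[String]): Unit = {
    println(withGrouping.columns.mkString(", "))
    println(withoutGrouping.columns.mkString(", "))
    spark.stop()
  }
}
```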
-
def
agg(exprs: Map[String, String]): DataFrame
(Java-specific) Compute aggregates by specifying a map from column name to aggregate methods. The resulting DataFrame will also contain the grouping columns.
The available aggregate methods are avg, max, min, sum, count.
    // Selects the age of the oldest employee and the aggregate expense for each department
    import com.google.common.collect.ImmutableMap;
    df.groupBy("department").agg(ImmutableMap.of("age", "max", "expense", "sum"));
- Since
1.3.0
-
def
agg(exprs: Map[String, String]): DataFrame
(Scala-specific) Compute aggregates by specifying a map from column name to aggregate methods. The resulting DataFrame will also contain the grouping columns.
The available aggregate methods are avg, max, min, sum, count.
    // Selects the age of the oldest employee and the aggregate expense for each department
    df.groupBy("department").agg(Map(
      "age" -> "max",
      "expense" -> "sum"
    ))
- Since
1.3.0
-
def
agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame
(Scala-specific) Compute aggregates by specifying the column names and aggregate methods. The resulting DataFrame will also contain the grouping columns.
The available aggregate methods are avg, max, min, sum, count.
    // Selects the age of the oldest employee and the aggregate expense for each department
    df.groupBy("department").agg(
      "age" -> "max",
      "expense" -> "sum"
    )
- Since
1.3.0
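The Scala variants of agg listed above can express the same aggregation in three ways. The sketch below puts them side by side; the object name, the sample data and the local SparkSession are assumptions for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{max, sum}

object AggVariantsSketch {
  // Assumption: a local, single-threaded session is enough for illustration.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("agg-variants-sketch")
    .getOrCreate()
  import spark.implicits._

  // Hypothetical sample data: (department, age, expense).
  val employees: DataFrame = Seq(
    ("eng", 30, 100.0),
    ("eng", 45, 250.0),
    ("ops", 28, 80.0)
  ).toDF("department", "age", "expense")

  // Variant 1: aggregate Column expressions from org.apache.spark.sql.functions.
  val viaColumns: DataFrame =
    employees.groupBy("department").agg(max("age"), sum("expense"))

  // Variant 2: a Scala Map from column name to aggregate method name.
  val viaMap: DataFrame =
    employees.groupBy("department").agg(Map("age" -> "max", "expense" -> "sum"))

  // Variant 3: (column name, aggregate method name) pairs.
  val viaPairs: DataFrame =
    employees.groupBy("department").agg("age" -> "max", "expense" -> "sum")

  // Pairs may repeat a column name; a Map cannot, because its keys are unique.
  val twoAggsOnAge: DataFrame =
    employees.groupBy("department").agg("age" -> "max", "age" -> "min")

  def main(args: Array[String]): Unit = {
    viaColumns.show()
    viaMap.show()
    viaPairs.show()
    twoAggsOnAge.show()
    spark.stop()
  }
}
```

The Column variant is the most flexible of the three, since it accepts any function from org.apache.spark.sql.functions rather than only the five method names listed above.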
-
final
def
asInstanceOf[T0]: T0
- Definition Classes
- Any
-
def
avg(colNames: String*): DataFrame
Compute the mean value for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When specified columns are given, only compute the mean values for them.
- Annotations
- @varargs()
- Since
1.3.0
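A minimal sketch of the two ways to call avg, with and without column names. The object name, the sample data and the local SparkSession are assumptions for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object AvgSketch {
  // Assumption: a local, single-threaded session is enough for illustration.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("avg-sketch")
    .getOrCreate()
  import spark.implicits._

  // Hypothetical sample data: (department, age, expense).
  val employees: DataFrame = Seq(
    ("eng", 30, 100.0),
    ("eng", 45, 250.0),
    ("ops", 28, 80.0)
  ).toDF("department", "age", "expense")

  // No column names: every numeric column (age and expense) is averaged.
  val avgAll: DataFrame = employees.groupBy("department").avg()

  // Explicit column names: only age is averaged.
  val avgAge: DataFrame = employees.groupBy("department").avg("age")

  def main(args: Array[String]): Unit = {
    avgAll.show()
    avgAge.show()
    spark.stop()
  }
}
```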
-
def
clone(): AnyRef
- Attributes
- protected[java.lang]
- Definition Classes
- AnyRef
- Annotations
- @native() @throws( ... )
-
def
count(): DataFrame
Count the number of rows for each group. The resulting DataFrame will also contain the grouping columns.
- Since
1.3.0
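For illustration, count on a grouped dataset could be used as follows. The object name, the sample data and the local SparkSession are assumptions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CountSketch {
  // Assumption: a local, single-threaded session is enough for illustration.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("count-sketch")
    .getOrCreate()
  import spark.implicits._

  // Hypothetical sample data: (department, age).
  val employees: DataFrame = Seq(
    ("eng", 30),
    ("eng", 45),
    ("ops", 28)
  ).toDF("department", "age")

  // One output row per department, with the row count in a column named "count".
  val headcount: DataFrame = employees.groupBy("department").count()

  def main(args: Array[String]): Unit = {
    headcount.show()
    spark.stop()
  }
}
```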
-
final
def
eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
def
equals(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
def
finalize(): Unit
- Attributes
- protected[java.lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( classOf[java.lang.Throwable] )
-
final
def
getClass(): Class[_]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
def
hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
final
def
isInstanceOf[T0]: Boolean
- Definition Classes
- Any
-
def
max(colNames: String*): DataFrame
Compute the max value for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When specified columns are given, only compute the max values for them.
- Annotations
- @varargs()
- Since
1.3.0
-
def
mean(colNames: String*): DataFrame
Compute the average value for each numeric column for each group. This is an alias for avg. The resulting DataFrame will also contain the grouping columns. When specified columns are given, only compute the average values for them.
- Annotations
- @varargs()
- Since
1.3.0
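Since mean is documented as an alias for avg, the two calls should yield the same values. The sketch below compares them on hypothetical data; the object name and the local SparkSession are assumptions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object MeanAliasSketch {
  // Assumption: a local, single-threaded session is enough for illustration.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("mean-alias-sketch")
    .getOrCreate()
  import spark.implicits._

  // Hypothetical sample data: (department, age).
  val employees: DataFrame = Seq(
    ("eng", 30),
    ("eng", 45),
    ("ops", 28)
  ).toDF("department", "age")

  // Both calls compute the per-department average of age.
  val viaMean: DataFrame = employees.groupBy("department").mean("age")
  val viaAvg: DataFrame = employees.groupBy("department").avg("age")

  // Collects the averages ordered by department so the two results can be compared.
  def values(df: DataFrame): Seq[Double] =
    df.orderBy("department").collect().map(_.getDouble(1)).toSeq

  def main(args: Array[String]): Unit = {
    println(values(viaMean) == values(viaAvg))
    spark.stop()
  }
}
```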
-
def
min(colNames: String*): DataFrame
Compute the min value for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When specified columns are given, only compute the min values for them.
- Annotations
- @varargs()
- Since
1.3.0
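The min and max methods follow the same pattern. A small sketch on hypothetical data, assuming a local SparkSession:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object MinMaxSketch {
  // Assumption: a local, single-threaded session is enough for illustration.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("min-max-sketch")
    .getOrCreate()
  import spark.implicits._

  // Hypothetical sample data: (department, age).
  val employees: DataFrame = Seq(
    ("eng", 30),
    ("eng", 45),
    ("ops", 28)
  ).toDF("department", "age")

  // Youngest and oldest employee age per department.
  val youngest: DataFrame = employees.groupBy("department").min("age")
  val oldest: DataFrame = employees.groupBy("department").max("age")

  def main(args: Array[String]): Unit = {
    youngest.show()
    oldest.show()
    spark.stop()
  }
}
```

Each call produces a separate DataFrame. To get both values in one result, the agg method with min and max from org.apache.spark.sql.functions is the usual route.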
-
final
def
ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
final
def
notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
final
def
notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
def
pivot(pivotColumn: Column, values: List[Any]): RelationalGroupedDataset
(Java-specific) Pivots a column of the current DataFrame and performs the specified aggregation. This is an overloaded version of the pivot method with pivotColumn of the String type.
- pivotColumn
the column to pivot.
- values
List of values that will be translated to columns in the output DataFrame.
- Since
2.4.0
-
def
pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset
Pivots a column of the current DataFrame and performs the specified aggregation. This is an overloaded version of the pivot method with pivotColumn of the String type.
    // Compute the sum of earnings for each year by course with each course as a separate column
    df.groupBy($"year").pivot($"course", Seq("dotNET", "Java")).sum("earnings")
- pivotColumn
the column to pivot.
- values
List of values that will be translated to columns in the output DataFrame.
- Since
2.4.0
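To make the shape of the pivoted output concrete, here is a sketch on hypothetical course earnings. The object name, the data and the local SparkSession are assumptions. Note that sum takes column names as strings, per its signature on this page.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object PivotColumnSketch {
  // Assumption: a local, single-threaded session is enough for illustration.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("pivot-column-sketch")
    .getOrCreate()
  import spark.implicits._

  // Hypothetical sample data: (year, course, earnings).
  val courseSales: DataFrame = Seq(
    (2012, "dotNET", 10000),
    (2012, "Java", 20000),
    (2012, "dotNET", 5000),
    (2013, "dotNET", 48000),
    (2013, "Java", 30000)
  ).toDF("year", "course", "earnings")

  // One row per year; one column per listed course value, in the order given.
  val earningsByCourse: DataFrame = courseSales
    .groupBy($"year")
    .pivot($"course", Seq("dotNET", "Java"))
    .sum("earnings")

  def main(args: Array[String]): Unit = {
    earningsByCourse.show()
    spark.stop()
  }
}
```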
-
def
pivot(pivotColumn: Column): RelationalGroupedDataset
Pivots a column of the current DataFrame and performs the specified aggregation. This is an overloaded version of the pivot method with pivotColumn of the String type.
    // Or without specifying column values (less efficient)
    df.groupBy($"year").pivot($"course").sum("earnings")
- pivotColumn
the column to pivot.
- Since
2.4.0
-
def
pivot(pivotColumn: String, values: List[Any]): RelationalGroupedDataset
(Java-specific) Pivots a column of the current DataFrame and performs the specified aggregation.
There are two versions of the pivot function: one that requires the caller to specify the list of distinct values to pivot on, and one that does not. The latter is more concise but less efficient, because Spark needs to first compute the list of distinct values internally.
    // Compute the sum of earnings for each year by course with each course as a separate column
    df.groupBy("year").pivot("course", Arrays.<Object>asList("dotNET", "Java")).sum("earnings");

    // Or without specifying column values (less efficient)
    df.groupBy("year").pivot("course").sum("earnings");
- pivotColumn
Name of the column to pivot.
- values
List of values that will be translated to columns in the output DataFrame.
- Since
1.6.0
-
def
pivot(pivotColumn: String, values: Seq[Any]): RelationalGroupedDataset
Pivots a column of the current DataFrame and performs the specified aggregation.
There are two versions of the pivot function: one that requires the caller to specify the list of distinct values to pivot on, and one that does not. The latter is more concise but less efficient, because Spark needs to first compute the list of distinct values internally.
    // Compute the sum of earnings for each year by course with each course as a separate column
    df.groupBy("year").pivot("course", Seq("dotNET", "Java")).sum("earnings")

    // Or without specifying column values (less efficient)
    df.groupBy("year").pivot("course").sum("earnings")
From Spark 3.0.0, values can be literal columns, for instance, struct. For pivoting by multiple columns, use the struct function to combine the columns and values:
    df.groupBy("year")
      .pivot("trainingCourse", Seq(struct(lit("java"), lit("Experts"))))
      .agg(sum($"earnings"))
- pivotColumn
Name of the column to pivot.
- values
List of values that will be translated to columns in the output DataFrame.
- Since
1.6.0
-
def
pivot(pivotColumn: String): RelationalGroupedDataset
Pivots a column of the current DataFrame and performs the specified aggregation.
There are two versions of the pivot function: one that requires the caller to specify the list of distinct values to pivot on, and one that does not. The latter is more concise but less efficient, because Spark needs to first compute the list of distinct values internally.
    // Compute the sum of earnings for each year by course with each course as a separate column
    df.groupBy("year").pivot("course", Seq("dotNET", "Java")).sum("earnings")

    // Or without specifying column values (less efficient)
    df.groupBy("year").pivot("course").sum("earnings")
- pivotColumn
Name of the column to pivot.
- Since
1.6.0
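The difference between the two pivot versions can be sketched as below. The object name, the data and the local SparkSession are assumptions. The inferred variant is declared as a lazy val because computing the distinct values runs a query as soon as pivot is called.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object PivotInferredValuesSketch {
  // Assumption: a local, single-threaded session is enough for illustration.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("pivot-inferred-values-sketch")
    .getOrCreate()
  import spark.implicits._

  // Hypothetical sample data: (year, course, earnings).
  val courseSales: DataFrame = Seq(
    (2012, "dotNET", 10000),
    (2012, "Java", 20000),
    (2013, "dotNET", 48000),
    (2013, "Java", 30000)
  ).toDF("year", "course", "earnings")

  // Explicit values: no extra query, and the output columns follow the given order.
  val explicitValues: DataFrame = courseSales
    .groupBy("year")
    .pivot("course", Seq("dotNET", "Java"))
    .sum("earnings")

  // Inferred values: Spark first computes the distinct courses, then pivots on them.
  lazy val inferredValues: DataFrame = courseSales
    .groupBy("year")
    .pivot("course")
    .sum("earnings")

  def main(args: Array[String]): Unit = {
    explicitValues.show()
    inferredValues.show()
    spark.stop()
  }
}
```

Listing the values explicitly also restricts the output to exactly those columns, which keeps the schema stable when new values appear in the data.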
-
def
sum(colNames: String*): DataFrame
Compute the sum for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When specified columns are given, only compute the sum for them.
- Annotations
- @varargs()
- Since
1.3.0
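A short sketch of sum, applied once to a plain groupBy and once to a rollup so that a grand total appears alongside the per-group sums. The object name, the sample data and the local SparkSession are assumptions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SumSketch {
  // Assumption: a local, single-threaded session is enough for illustration.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("sum-sketch")
    .getOrCreate()
  import spark.implicits._

  // Hypothetical sample data: (department, expense).
  val expenses: DataFrame = Seq(
    ("eng", 100.0),
    ("eng", 250.0),
    ("ops", 80.0)
  ).toDF("department", "expense")

  // Total expense per department.
  val perDepartment: DataFrame = expenses.groupBy("department").sum("expense")

  // Same totals plus one extra row (department = null) holding the grand total.
  val withGrandTotal: DataFrame = expenses.rollup("department").sum("expense")

  // Collects (department, total) pairs; a null department becomes None.
  def totals(df: DataFrame): Map[Option[String], Double] =
    df.collect().map(r => Option(r.getString(0)) -> r.getDouble(1)).toMap

  def main(args: Array[String]): Unit = {
    perDepartment.show()
    withGrandTotal.show()
    spark.stop()
  }
}
```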
-
final
def
synchronized[T0](arg0: ⇒ T0): T0
- Definition Classes
- AnyRef
-
def
toString(): String
- Definition Classes
- RelationalGroupedDataset → AnyRef → Any
-
final
def
wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @native() @throws( ... )