Skip to content

Partitioning ensures that all observations in a group end up on the same worker. To try and keep the observations on each worker balanced, `partition()` uses a greedy algorithm that iteratively assigns each group to the worker that currently has the fewest rows.

Usage

partition(data, cluster)

Arguments

data

Dataset to partition, typically grouped. When grouped, all observations in a group will be assigned to the same cluster.

cluster

Cluster to use.

Value

A [party_df].

Examples

library(dplyr)
#> 
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#> 
#>     filter, lag
#> The following objects are masked from ‘package:base’:
#> 
#>     intersect, setdiff, setequal, union
cl <- default_cluster()
cluster_library(cl, "dplyr")

mtcars2 <- partition(mtcars, cl)
mtcars2 %>% mutate(cyl2 = 2 * cyl)
#> Source: party_df [32 x 12]
#> Shards: 2 [16--16 rows]
#> 
#> # A data frame: 32 × 12
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb  cyl2
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4    12
#> 2  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1     8
#> 3  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2    16
#> 4  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4    16
#> 5  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2     8
#> 6  17.8     6  168.   123  3.92  3.44  18.9     1     0     4     4    12
#> # … with 26 more rows
mtcars2 %>% filter(vs == 1)
#> Source: party_df [14 x 11]
#> Shards: 2 [5--9 rows]
#> 
#> # A data frame: 14 × 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  22.8     4 108      93  3.85  2.32  18.6     1     1     4     1
#> 2  22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2
#> 3  17.8     6 168.    123  3.92  3.44  18.9     1     0     4     4
#> 4  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2
#> 5  21.5     4 120.     97  3.7   2.46  20.0     1     0     3     1
#> 6  21.4     6 258     110  3.08  3.22  19.4     1     0     3     1
#> # … with 8 more rows
mtcars2 %>% group_by(cyl) %>% summarise(n())
#> Source: party_df [6 x 2]
#> Shards: 2 [3--3 rows]
#> 
#> # A data frame: 6 × 2
#>     cyl `n()`
#>   <dbl> <int>
#> 1     4     5
#> 2     6     2
#> 3     8     9
#> 4     4     6
#> 5     6     5
#> 6     8     5
mtcars2 %>% select(-cyl)
#> Source: party_df [32 x 10]
#> Shards: 2 [16--16 rows]
#> 
#> # A data frame: 32 × 10
#>     mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21    160    110  3.9   2.62  16.5     0     1     4     4
#> 2  22.8  108     93  3.85  2.32  18.6     1     1     4     1
#> 3  18.7  360    175  3.15  3.44  17.0     0     0     3     2
#> 4  14.3  360    245  3.21  3.57  15.8     0     0     3     4
#> 5  22.8  141.    95  3.92  3.15  22.9     1     0     4     2
#> 6  17.8  168.   123  3.92  3.44  18.9     1     0     4     4
#> # … with 26 more rows