An Array-Based Algorithm for Simultaneous Multidimensional Aggregates
Yihong Zhao, Prasad M. Deshpande, and Jeffrey F. Naughton
Summary by: Steve Gribble and Armando Fox
One-line summary:
Efficient relational on-line analytical processing (ROLAP) algorithms for
computing the data cube have been developed; this paper presents a
detailed analysis of an efficient multidimensional on-line analytical
processing (MOLAP) algorithm for gleaming the cube.
Relevance
Data cubes are relevant, ergo efficient data cube computation is relevant,
and this algorithm seems to work quite well.
Flaws
     -  This paper is a bear to read - they could do much better with
        more diagrams and less text.

     -  I'd like to see an analysis that yields the number of I/Os as a
        function of memory size, data size, dimension size, and number
        of dimensions.  This seems like it should be possible (although
        not easy), and it would give me a better idea of the tradeoffs
        in this algorithm.
Overview/Main Points
     -  Check out the data cube summary to read all about data cubes.
     
     -  ROLAP: use relational tables as the data structure.  Thus, a
        cell in a multidimensional space is a tuple, with some
        attributes representing its location in the space and one or
        more representing the value of that cell.  Efficient ways to
        compute the cube given this structure:

          -  use a grouping operation on the dimension attributes to
             cluster related tuples (e.g. sorting or hashing)

          -  use the grouping performed on behalf of one sub-aggregate
             as a partial grouping to speed computation of another
             sub-aggregate

          -  compute an aggregate from another aggregate, rather than
             from the much larger base table (e.g. compute a 1-d
             aggregate from one of the 2-d aggregates rather than from
             the N-d data); a small sketch of this idea follows this
             list
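
     A minimal sketch (not from the paper) of the aggregate-from-aggregate
     idea, in Python.  The table contents, attribute layout, and the use
     of SUM as the aggregate are all made up for illustration:

        # One tuple per non-empty cell: (A, B, C, measure).
        from collections import defaultdict

        base_table = [
            ("a1", "b1", "c1", 10),
            ("a1", "b2", "c1", 5),
            ("a2", "b1", "c2", 7),
        ]

        # 2-d aggregate on (A, B): group by the first two attributes, SUM the measure.
        ab = defaultdict(int)
        for a, b, c, v in base_table:
            ab[(a, b)] += v

        # 1-d aggregate on A, computed from the (small) AB aggregate
        # instead of re-scanning the (large) base table.
        a_only = defaultdict(int)
        for (a, b), v in ab.items():
            a_only[a] += v

        print(dict(ab))      # {('a1', 'b1'): 10, ('a1', 'b2'): 5, ('a2', 'b1'): 7}
        print(dict(a_only))  # {'a1': 15, 'a2': 7}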
	  
 
     -  MOLAP: store the data as sparse arrays.  The position of a value
        within the array encodes its position in the multidimensional
        space (see the sketch below).
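
     A tiny sketch of what "the position encodes the location" means,
     using a plain row-major layout; the dimension sizes are made up:

        DIM_A, DIM_B, DIM_C = 4, 3, 5   # |A|, |B|, |C| (assumed sizes)

        def offset(i, j, k):
            """Row-major offset of cell (i, j, k) in an |A| x |B| x |C| array."""
            return (i * DIM_B + j) * DIM_C + k

        def coords(off):
            """Invert offset() to recover (i, j, k)."""
            i, rest = divmod(off, DIM_B * DIM_C)
            j, k = divmod(rest, DIM_C)
            return i, j, k

        assert coords(offset(2, 1, 3)) == (2, 1, 3)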
     
     -  All about arrays:

          -  The array is most likely far too big to fit in memory, so
             it must be split into chunks.  Row-major or column-major
             order is not efficient; each chunk should be a small N-d
             piece of the N-d space, and chunk size should correspond to
             the disk block size.

          -  Even with chunking, many cells of many chunks will be
             empty, so use compression for these sparse chunks.

          -  The array may need to be built from non-array (e.g. table)
             data.  Use a 2-pass algorithm: the first pass assigns chunk
             numbers to tuples, and the second pass clusters tuples
             together by their chunk number.  (Cell-to-chunk assignment
             is sketched below.)
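
     A sketch of the cell-to-chunk mapping such a loader might use.  The
     dimension and chunk sizes are assumptions, not the paper's example:

        DIMS   = (16, 16, 16)   # full dimension sizes (made up)
        CHUNKS = (4, 4, 4)      # chunk size along each dimension (made up)

        def linearize(coord, sizes):
            """Row-major linearization of an N-d coordinate."""
            off = 0
            for x, s in zip(coord, sizes):
                off = off * s + x
            return off

        def chunk_of(cell):
            """Map an (i, j, k) cell to (chunk number, offset within the chunk)."""
            chunk_coord    = tuple(x // c for x, c in zip(cell, CHUNKS))
            in_chunk       = tuple(x % c for x, c in zip(cell, CHUNKS))
            chunks_per_dim = tuple(d // c for d, c in zip(DIMS, CHUNKS))
            return linearize(chunk_coord, chunks_per_dim), linearize(in_chunk, CHUNKS)

        print(chunk_of((5, 2, 13)))   # -> (19, 25): chunk 19, cell 25 within it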
	  
 
     -  Simple data cube computing algorithm:

          -  Let's say we have a 3-d cube (dimensions labelled A, B, and
             C) and we want to compute the aggregation across dimension
             C, which corresponds to a plane of |A| * |B| aggregate
             values.

          -  One possibility is to sweep the entire plane down the C
             dimension.  It is better to use chunking: sweep a chunk of
             size |Ac| * |Bc| (where Ac is the chunk size in the A
             dimension and Bc is the chunk size in the B dimension) down
             the C dimension, then that chunk's neighbor, and so on,
             until the aggregation plane has been built up in patchwork
             fashion.  Less memory is required this way.

          -  For a data cube, we need multiple aggregates (the AB, BC,
             and AC planes, the A, B, and C edges, and the total
             aggregation point).  It is far better to compute A from AB
             than from the entire data set ABC, etc.  Think of the
             aggregates in cube computation as a lattice, with ABC as
             the root; ABC has children AB, BC, and AC; AC has children
             A and C; and so on.  The goal is to embed a tree in this
             lattice and compute each aggregate from its parent in the
             tree.  The trick is to find the most efficient tree.

          -  Array algorithm: always compute a lower-dimensional
             group-by from its higher-dimensional parent.  For example,
             read in each chunk of ABC and use it to produce an
             aggregate chunk of AB; once this chunk of AB is finished,
             write it to disk and move on to the next chunk of AB.  (A
             sketch of this chunk-at-a-time sweep follows this list.)

          -  This algorithm is careful to use parent aggregates to save
             computation for child sub-aggregates, but it computes the
             sub-aggregates independently (e.g. it scans the entire ABC
             array to make AB, then rescans ABC to produce BC).  We can
             do better.
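
     A sketch of the chunk-at-a-time sweep, with dense in-memory numpy
     arrays standing in for the disk-resident chunked array and with
     made-up sizes; the real algorithm would read one chunk from disk at
     a time and write each finished AB chunk back out:

        import numpy as np

        A, B, C = 8, 8, 8            # dimension sizes (made up)
        Ac, Bc, Cc = 4, 4, 4         # chunk sizes (made up)
        abc = np.random.rand(A, B, C)   # stand-in for the base cube

        ab = np.zeros((A, B))
        for i in range(0, A, Ac):
            for j in range(0, B, Bc):
                # Sweep the (i, j) patch down the C dimension, one chunk at a time.
                for k in range(0, C, Cc):
                    chunk = abc[i:i+Ac, j:j+Bc, k:k+Cc]
                    ab[i:i+Ac, j:j+Bc] += chunk.sum(axis=2)
                # This Ac x Bc patch of AB is now complete and could be
                # written to disk before the next patch is started.

        # Per the lattice/tree idea: compute A from AB, not from ABC.
        a_edge = ab.sum(axis=1)
        assert np.allclose(a_edge, abc.sum(axis=(1, 2)))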
	  
 
     -  Multi-way algorithm:

          -  An n-d data cube has 2^n group-bys, one for each subset of
             the n dimensions (i.e. each element of their power set).
	       
          -  Ideally, we would have enough memory to hold all of the
             group-bys and finish the cube in one scan.  Unfortunately
             this is usually not possible; the goal is then to minimize
             the memory needed for each computation so as to achieve
             maximum overlap among the aggregation computations.
	       
          -  Single-pass algorithm: define an order in which the entire
             cube will be scanned (row-major order, scanned across each
             dimension in some order; e.g. for a 2-d table AB, scan
             a0b0, a1b0, a2b0, a0b1, a1b1, a2b1, a0b2, a1b2, a2b2).  One
             can then compute how much memory is required to hold each
             aggregate during the scan.  If the dimension order is ABC,
             only 1 chunk of memory is needed for the BC aggregate (as
             we always sweep down the A dimension first), but 4 chunks
             are needed for the AC aggregate and 16 for the AB
             aggregate.  We still need structure to coordinate the
             overlapping computations - use the lattice with a minimum
             (memory) spanning tree.  Nasty formulae are given for
             computing the memory required for a given dimension order
             and the resulting minimum spanning tree, from which one can
             deduce the optimal dimension order.  (A sketch of the
             memory rule follows this list.)
	       
          -  If there isn't enough memory for the optimal dimension
             order, the aggregation must be split into multiple passes
             using the multi-way algorithm: store incomplete trees, then
             fill them in on subsequent passes.
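
     A sketch of the memory rule, under my reading of it: for a group-by
     computed during a scan in a given dimension order, each of its
     dimensions that precedes the first dimension it is missing needs
     its full size in memory, and the rest need only a chunk.  The
     sizes, the chunk shape, and the rule itself should be checked
     against the paper:

        from itertools import combinations

        order = ["A", "B", "C"]                 # assumed dimension order
        size  = {"A": 16, "B": 16, "C": 16}     # |A|, |B|, |C| (made up)
        chunk = {"A": 4,  "B": 4,  "C": 4}      # |Ac|, |Bc|, |Cc| (made up)

        def mem_needed(groupby):
            """Cells of memory to hold `groupby` while scanning in `order`."""
            first_missing = min(order.index(d) for d in order if d not in groupby)
            cells = 1
            for d in groupby:
                cells *= size[d] if order.index(d) < first_missing else chunk[d]
            return cells

        for r in (2, 1):
            for g in combinations(order, r):
                print("".join(g), mem_needed(g))
        # With these numbers: BC needs 16 cells (one 4x4 chunk of the BC
        # plane), AC needs 64, and AB needs 256 - i.e. 1, 4, and 16
        # chunks' worth, matching the example above.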
	  
 
     -  Performance studies show that the array-based algorithm does
        better than the ROLAP algorithms, even!  The factors that affect
        performance are the number of valid data entries, the dimension
        size (how many elements in each dimension), and the number of
        dimensions.