An Array-Based Algorithm for Simultaneous Multidimensional Aggregates
Yihong Zhao, Prasad M. Deshpande, and Jeffrey F. Naughton
Summary by: Steve Gribble and Armando Fox
One-line summary:
Efficient relational on-line analytical processing (ROLAP) algorithms for
computing the data cube have been developed; this paper presents a
detailed analysis of an efficient multidimensional on-line analytical
processing (MOLAP) algorithm for gleaming the cube.
Relevance
Data cubes are relevant, ergo efficient data cube computation is relevant,
and this algorithm seems to work quite well.
Flaws
     -  This paper is a bear to read - they could do much better with
        more diagrams and less text.

     -  I'd like to see an analysis that yields the number of I/Os as a
        function of memory size, data size, dimension size, and number
        of dimensions.  This seems like it should be possible (although
        not easy), and it would give me a better idea of the tradeoffs
        in this algorithm.
Overview/Main Points
     -  Check out the data cube summary to read all about data cubes.
     
     -  ROLAP: use relational tables as the data structure.  Thus, a
        cell in a multidimensional space is a tuple, with some
        attributes representing its location in the space and one or
        more representing the value of that cell.  Efficient ways to
        compute the cube given this structure:

          -  use a grouping operation on the dimension attributes to
             cluster related tuples (e.g. sorting or hashing)

          -  use the grouping performed on behalf of one sub-aggregate
             as a partial grouping to speed computation of another
             sub-aggregate

          -  compute an aggregate from another aggregate, rather than
             from the much larger base table (e.g. compute a 1-d
             aggregate from one of the 2-d aggregates rather than from
             the N-d data); a small sketch of this idea follows this
             list
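
     A minimal sketch (not from the paper) of the aggregate-from-aggregate
     idea, in Python.  The table contents, attribute layout, and the use
     of SUM as the aggregate are all made up for illustration:

        # One tuple per non-empty cell: (A, B, C, measure).
        from collections import defaultdict

        base_table = [
            ("a1", "b1", "c1", 10),
            ("a1", "b2", "c1", 5),
            ("a2", "b1", "c2", 7),
        ]

        # 2-d aggregate on (A, B): group by the first two attributes, SUM the measure.
        ab = defaultdict(int)
        for a, b, c, v in base_table:
            ab[(a, b)] += v

        # 1-d aggregate on A, computed from the (small) AB aggregate
        # instead of re-scanning the (large) base table.
        a_only = defaultdict(int)
        for (a, b), v in ab.items():
            a_only[a] += v

        print(dict(ab))      # {('a1', 'b1'): 10, ('a1', 'b2'): 5, ('a2', 'b1'): 7}
        print(dict(a_only))  # {'a1': 15, 'a2': 7}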
	  
 
     -  MOLAP: store the data as sparse arrays.  The position of a value
        within the array encodes its position in the multidimensional
        space (see the sketch below).
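
     A tiny sketch of what "the position encodes the location" means,
     using a plain row-major layout; the dimension sizes are made up:

        DIM_A, DIM_B, DIM_C = 4, 3, 5   # |A|, |B|, |C| (assumed sizes)

        def offset(i, j, k):
            """Row-major offset of cell (i, j, k) in an |A| x |B| x |C| array."""
            return (i * DIM_B + j) * DIM_C + k

        def coords(off):
            """Invert offset() to recover (i, j, k)."""
            i, rest = divmod(off, DIM_B * DIM_C)
            j, k = divmod(rest, DIM_C)
            return i, j, k

        assert coords(offset(2, 1, 3)) == (2, 1, 3)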
     
     -  All about arrays:

          -  The array is most likely far too big to fit in memory, so
             it must be split into chunks.  Row-major or column-major
             order is not efficient; each chunk should be a small N-d
             piece of the N-d space, and chunk size should correspond to
             the disk block size.

          -  Even with chunking, many cells of many chunks will be
             empty, so use compression for these sparse chunks.

          -  The array may need to be built from non-array (e.g. table)
             data.  Use a 2-pass algorithm: the first pass assigns chunk
             numbers to tuples, and the second pass clusters tuples
             together by their chunk number.  (Cell-to-chunk assignment
             is sketched below.)
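
     A sketch of the cell-to-chunk mapping such a loader might use.  The
     dimension and chunk sizes are assumptions, not the paper's example:

        DIMS   = (16, 16, 16)   # full dimension sizes (made up)
        CHUNKS = (4, 4, 4)      # chunk size along each dimension (made up)

        def linearize(coord, sizes):
            """Row-major linearization of an N-d coordinate."""
            off = 0
            for x, s in zip(coord, sizes):
                off = off * s + x
            return off

        def chunk_of(cell):
            """Map an (i, j, k) cell to (chunk number, offset within the chunk)."""
            chunk_coord    = tuple(x // c for x, c in zip(cell, CHUNKS))
            in_chunk       = tuple(x % c for x, c in zip(cell, CHUNKS))
            chunks_per_dim = tuple(d // c for d, c in zip(DIMS, CHUNKS))
            return linearize(chunk_coord, chunks_per_dim), linearize(in_chunk, CHUNKS)

        print(chunk_of((5, 2, 13)))   # -> (19, 25): chunk 19, cell 25 within it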
	  
 
     -  Simple data cube computing algorithm:

          -  Let's say we have a 3-d cube (dimensions labelled A, B, and
             C) and we want to compute the aggregation across dimension
             C, which corresponds to a plane of |A| * |B| aggregate
             values.

          -  One possibility is to sweep the entire plane down the C
             dimension.  It is better to use chunking: sweep a chunk of
             size |Ac| * |Bc| (where Ac is the chunk size in the A
             dimension and Bc is the chunk size in the B dimension) down
             the C dimension, then that chunk's neighbor, and so on,
             until the aggregation plane has been built up in patchwork
             fashion.  Less memory is required this way.

          -  For a data cube, we need multiple aggregates (the AB, BC,
             and AC planes, the A, B, and C edges, and the total
             aggregation point).  It is far better to compute A from AB
             than from the entire data set ABC, etc.  Think of the
             aggregates in cube computation as a lattice, with ABC as
             the root; ABC has children AB, BC, and AC; AC has children
             A and C; and so on.  The goal is to embed a tree in this
             lattice and compute each aggregate from its parent in the
             tree.  The trick is to find the most efficient tree.

          -  Array algorithm: always compute a lower-dimensional
             group-by from its higher-dimensional parent.  For example,
             read in each chunk of ABC and use it to produce an
             aggregate chunk of AB; once this chunk of AB is finished,
             write it to disk and move on to the next chunk of AB.  (A
             sketch of this chunk-at-a-time sweep follows this list.)

          -  This algorithm is careful to use parent aggregates to save
             computation for child sub-aggregates, but it computes the
             sub-aggregates independently (e.g. it scans the entire ABC
             array to make AB, then rescans ABC to produce BC).  We can
             do better.
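
     A sketch of the chunk-at-a-time sweep, with dense in-memory numpy
     arrays standing in for the disk-resident chunked array and with
     made-up sizes; the real algorithm would read one chunk from disk at
     a time and write each finished AB chunk back out:

        import numpy as np

        A, B, C = 8, 8, 8            # dimension sizes (made up)
        Ac, Bc, Cc = 4, 4, 4         # chunk sizes (made up)
        abc = np.random.rand(A, B, C)   # stand-in for the base cube

        ab = np.zeros((A, B))
        for i in range(0, A, Ac):
            for j in range(0, B, Bc):
                # Sweep the (i, j) patch down the C dimension, one chunk at a time.
                for k in range(0, C, Cc):
                    chunk = abc[i:i+Ac, j:j+Bc, k:k+Cc]
                    ab[i:i+Ac, j:j+Bc] += chunk.sum(axis=2)
                # This Ac x Bc patch of AB is now complete and could be
                # written to disk before the next patch is started.

        # Per the lattice/tree idea: compute A from AB, not from ABC.
        a_edge = ab.sum(axis=1)
        assert np.allclose(a_edge, abc.sum(axis=(1, 2)))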
	  
 
     -  Multi-way algorithm:

          -  An n-d data cube has 2^n group-bys, one for each subset of
             the n dimensions (i.e. each element of their power set).
	       
          -  Ideally, we would have enough memory to hold all of the
             group-bys and finish the cube in one scan.  Unfortunately
             this is usually not possible; the goal is then to minimize
             the memory needed for each computation so as to achieve
             maximum overlap among the aggregation computations.
	       
          -  Single-pass algorithm: define an order in which the entire
             cube will be scanned (row-major order, scanned across each
             dimension in some order; e.g. for a 2-d table AB, scan
             a0b0, a1b0, a2b0, a0b1, a1b1, a2b1, a0b2, a1b2, a2b2).  One
             can then compute how much memory is required to hold each
             aggregate during the scan.  If the dimension order is ABC,
             only 1 chunk of memory is needed for the BC aggregate (as
             we always sweep down the A dimension first), but 4 chunks
             are needed for the AC aggregate and 16 for the AB
             aggregate.  We still need structure to coordinate the
             overlapping computations - use the lattice with a minimum
             (memory) spanning tree.  Nasty formulae are given for
             computing the memory required for a given dimension order
             and the resulting minimum spanning tree, from which one can
             deduce the optimal dimension order.  (A sketch of the
             memory rule follows this list.)
	       
          -  If there isn't enough memory for the optimal dimension
             order, the aggregation must be split into multiple passes
             using the multi-way algorithm: store incomplete trees, then
             fill them in on subsequent passes.
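
     A sketch of the memory rule, under my reading of it: for a group-by
     computed during a scan in a given dimension order, each of its
     dimensions that precedes the first dimension it is missing needs
     its full size in memory, and the rest need only a chunk.  The
     sizes, the chunk shape, and the rule itself should be checked
     against the paper:

        from itertools import combinations

        order = ["A", "B", "C"]                 # assumed dimension order
        size  = {"A": 16, "B": 16, "C": 16}     # |A|, |B|, |C| (made up)
        chunk = {"A": 4,  "B": 4,  "C": 4}      # |Ac|, |Bc|, |Cc| (made up)

        def mem_needed(groupby):
            """Cells of memory to hold `groupby` while scanning in `order`."""
            first_missing = min(order.index(d) for d in order if d not in groupby)
            cells = 1
            for d in groupby:
                cells *= size[d] if order.index(d) < first_missing else chunk[d]
            return cells

        for r in (2, 1):
            for g in combinations(order, r):
                print("".join(g), mem_needed(g))
        # With these numbers: BC needs 16 cells (one 4x4 chunk of the BC
        # plane), AC needs 64, and AB needs 256 - i.e. 1, 4, and 16
        # chunks' worth, matching the example above.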
	  
 
     -  Performance studies show that the array-based algorithm does
        better than the ROLAP algorithms, even!  The factors that affect
        performance are the number of valid data entries, the dimension
        size (how many elements in each dimension), and the number of
        dimensions.