Go back to the the Combinatorial BLAS home page.

**Q1: **I would like to use your Combinatorial BLAS code for
some of my experiments which involve sparse matrix multiplication. However, it
is not clear how to write an output of PSpGEMM(…)
function to a file. I've tried to use "put" function of SpParMat, but
it outputs part of the matrix that corresponds to particular process and not
the whole matrix. Is there a way to do it using your code?

**A1:** Yes, SpParMat::SaveGathered(…)
will create a single file output (albeit slow) sorted with increasing row id's
when called like A.SaveGathered("product.txt").
The caveat is that "gathered" I/O in human readable form is
quite slow due to serialization on large processor counts. It should only be
used for debugging, ideally. For vectors, we have a much much
faster version:
FullyDistVec::ParallelWrite (…), which
should be used instead. SpParMat will also get a ParallelWrite soon.

----

**Q2**: Does Combinatorial BLAS support in-node multithreading?

**A2:** Almost all expensive primitives
(SpGEMM, SpMV with sparse vectors, SpMV with dense vectors, EWiseMult, Apply,
Set) are hybrid multithreaded within a socket. Read this example.

---

**Q3:** Reading/writing text files is really slow for my
purposes, what can I do?

**A3:** Starting from version 1.6, we now have
extremely fast matrix market (text) file reading, check out
SpParMat::ParallelReadMM() and FullyDistVec::ParallelReadMM()

---

**Q4:** Is there a preferred way to prune elements from a
SpParMat according to a predicate?

**A4:** Yes, SpParMat::Prune(…) will do it according to a predicate. An
overloaded version of the same function, SpParMat::Prune(ri,ci)
will prune all entries whose row indices are in ri and column indices are in ci

---

**Q5:** I am trying to run CombBLAS on Windows but the MS MPI does
not seem to support MPI C++ bindings.

**A5: **Combinatorial BLAS recently (version 1.3.0) switched to
C-API for all its internal MPI calls. After that, we've also compiled CombLAS
on a windows machine without problems. However, we recommend using an open
source MPI for windows too, such as MPICH-2.

---

**Q6:** I would like to use Combinatorial BLAS for some parallel
sparse matrix-matrix multiplications. This works quite well, however when I try
to assign a m x 1 sparse matrix (actually a vector) to the first column of an
existing sparse matrix with SpAsgn I get an error saying: "Matrix is too
small to be splitted". Is this because it's not possible to use SpAsgn on
vector-like matrices?

**A6: **SpAsgn internally uses a memory efficient Mult_AnXBn_DoubleBuff as opposed to Mult_AnXBn_Synch).
You might probably go into SpParMat<IT,NT,DER>::SpAsgn(...)
and change occuranges of Mult_AnXBn_DoubleBuff
to Mult_AnXBn_Synch. However, this will likely
only solve your problem for the serial case. because ComBBLAS can not
effectively 2D decompose an m x 1 matrix: each dimension should ideally be at
least sqrt(p). It is much better to represent that vector as a FullyDistSpVec.

---

**Q7: **Starting from a general sparse matrix Z, I want to
construct the symmetric matrix M: [[X'X; X'Z];[Z'X; Z'Z]], where X is a vector
of 1's. Thus the element at position (1,1) is simply the number of columns of
Z, and the vector Z'X contains the sums per column of Z. For now, I have a
working code, but it is quite sloppy because I do not find a function for which
I can easily increase the dimension of a sparse matrix or even set an element
to a specific value. Is there any function in Combinatorial BLAS which can do
this?

**A7:** Not out of the box. You don't want an SpAsgn or any
variant of it because it can't grow the matrix. You want some sort of matrix
append. How about using Find(…) and Sparse(…)? The Matlab care of what
you want to do is:

X = ones(size(Z,2),1)

M = [X' * X, X' * Z; Z'* X, Z' * Z]

Supporting such a general concatenation efficiently might be hard to add at this point. Instead, there is a Concatenate(…) function for vectors. Armed with Concatenate(…), find(), and the sparse-like constructor, one can solve your problem. Check out the working example in ReleaseTests/FindSparse.cpp

---

**Q8:** Does CombBLAS include the API to perform a symmetric
permutation on a matrix, as explained in your SISC
paper?

**A8:** Yes it does. Check out the
ReleaseTests/IndexingTiming.cpp for an example.

---

**Q9:** How can I use small test case to see whether the
operation on matrix is correct? In other words, how do I print all the
information of a matrix with each value in matrix?

I can use PrintInfo to print basic information, but it only gives me number of rows and columns and nnz

**A9:** Our recommendation is to use SaveGathered(…) to dump the whole matrix into a file in triples
(matrix market) format. For vectors, we have a much much faster version: FullyDistVec::ParallelWrite (…)

---

**Q10:** Does CombBLAS code run on any
graph size or there is some limitation on the dimension of the matrix A. I mean
should it be a multiple of sqrt(p) where p is total number of processors.

**A10:** No, the matrix dimension does not have to be a multiple
of sqrt(p) but it should be bigger than sqrt(p). In other words you can have a
5x5 matrix on 4 processors but not on 36 processors. We don't really see the
point of using more than |V|^2 processors.

---

**Q11:** My comparison results on real graph inputs revealed
something weird. In input loc-gowalla, how can 16 processors time(called
time_16) and

64 processors time(called time_64) which time_64*4<time_16 which is more than linear scale?

**A11:** The complexity of the parallel algorithm drops as
sub-matrices owned by each processor gets sparser. In particular, it is
proportional to O(flops x log(ni)) where ni is the size of the intersection of
the set of nonzero columns of Aik and nonzero rows of Bkj for A*B. What might
happen as p increases is that there is a phase transition that makes ni drop
significantly for your input (for p=64, each sub-matrix will have only ~1.2
nonzeros per row or column). More details are in the SISC
paper and the references therein. Hope this makes sense. This is why
I don't suggest people use CombBLAS for small p (< 40) because it is not on
the top of its game for small number of processors.

---

**Q12:** Should the input file have nodes numbered from 1 or it
is fine if the nodes are numbered from 0?

**A12:** If you're using the human readable matrix market format
as your input, then it should be 1-indexed.

---

**Q13: **I'm wondering for breadth-first-search, under the hood
does the matrix-vector multiplication method change based on the sparsity of
the frontier vector, or does the underlying matrix-vector multiplication assume
the frontier is always sparse?

**A13:** Depending on your definition of sparseness, the
frontier is almost always sparse. We use the pragmatic definition of
"sparse" in the sense that a vector is sparse if it is worth taking
advantage of the sparsity in there. I'd guess, for a dense vector assumption to
be competitive, it would have to have at least 1/3 of its potential locations
nonzero. However, I might be wrong (and you're welcome to prove me wrong). To
answer your question more directly, CombBLAS supports both dense and sparse
right hand side vectors, but the specific BFS implementation does not
adapt.

---

**Q14: **Could you briefly explain the difference in your
implementations of matrix-sparse vector and matrix-dense vector multiply? For
example, is the sparse vector case a write-based approach: Every element
updates all of its neighbors (from a graph-theoretic standpoint) locations in
the output vector; and the dense vector case a read-based approach: Every
element reads some value from each of its neighbors and updates its own entry
in the resulting vector?

**A14:** Sparse matrix-sparse vector is "right hand side
vector structure" driven. In y = A*x, for each nonzero x_i, we scale the
column A(:,i) with that and merge the scaled sparse columns results into y. The
computation boils down into merging sparse columns into one. Combinatorial
BLAS is a matrix-vector based library, so thinking in
terms of updates on single entries is probably not the right abstraction.

Sparse matrix-dense vector is slightly different in the sense that it is driven by the matrix structure; you basically stream the matrix. The correctness of both operations are handled by a SPA-like or heap-like data structure that merges multiple intermediate values contributing to the same output location; no atomics are used.

---

**Q15: **I would like to get your opinion
on how sparse-matrix based implementations compare with more native
implementations

**A15: **Sparse matrix abstraction, like
any abstraction, will leave some performance on the table. In particular it is
prone to performing extra passes over data or creating extra temporaries (if
you've ever programmed in Matlab; this is similar). On the other hand, sparse
matrix abstraction gives you "primitives" to implement graph
"algorithms" as opposed to the algorithms themselves. For instance,
CombBLAS has sparse matrix x sparse vector over a semiring as opposed to BFS,
because now using the same primitive one can implement MIS (maximal independent
set) too, only by changing the semiring. Or one can perform run time
filtering on edges based on the attributes, similarly by changing the semiring
functions (therefore extending functionality to semantic graphs). Indeed this
is what we've done in our upcoming IPDPS'13 paper.

---

**Q16: **Is there an effort to incorporate the bottom-up BFS of
Scott Beamer into CombBLAS?

**A16: **Yes, it is already done. Just use the dobfs executable
(made from DirOptBFS.cpp).

---

**Q17: **My serial code is faster than CombBLAS on a single
core.

**A17: **I believe that. CombBLAS targets
"scalability", not optimizing the single core performance.

Examples:

- think about the 2D BFS. CombBLAS does not use a CSR like data structure because that is not memory scalable due to problems of hypersparsity in large concurrencies. Instead CombBLAS opts to use a slower (about 2x around 1000 cores) but memory scalable format called DCSC.

- think about betweenness centrality which uses sparse matrix-matrix multiply. CombBLAS doesn't use the fastest serial algorithm as its subroutine because it doesn't scale to thousands of cores. Instead it uses a outer-product algorithm that is significantly slower for p=1, but scales indefinitely.

---

**Q18:** Looking at the output of your Graph500 application, I
noticed a large number of self-edges removed. That’s very interesting.

**A18: **The duplicate edges problem is inherent to the R-MAT
generator on large scale, unless some special kind of noise is added. Check
here for a great analysis of this phenomenon: http://arxiv.org/abs/1102.5046

---

**Q19:** How are you counting the number of edges traversed in
Graph500? Is this still using the original verify.c file provided with the
reference version of the Graph500 benchmark and passing in the parent tree?

**A19: **It is calculated by summing the
degrees of the discovered vertices using EWiseMult(…)
followed by a Reduce(…). Degrees are pre-symmetrization
(original edges), so we're not over-counting. However, we count self-loops and
duplicates as mentioned in the benchmark specs.

---

**Q20:** My computations
finishes fine but I get an “Attempting to use an MPI routine after finalizing
MPICH” afterwards.

**A20:** To avoid the
finalization error, please imitate an example such as MultTest.cpp: http://gauss.cs.ucsb.edu/~aydin/CombBLAS/html/_mult_test_8cpp_source.html

The curly brackets around the code are intentional. Since distributed objects have MPI related pointers in them, those pointers are released once the destructors are called. In C++ (at least until C++11) there isn’t a good way to call the destructor manually, so the destructor is called immediately before the program exists, which is after the MPI_Finalize. Since the MPI related objects are destructed after MPI_Finalize, you see this error. Try the curly brackets approach.

Go back to the the Combinatorial BLAS home page.