Computers are very fast circuits that can perform a handful of basic logical, arithmetic and memory operations, and we use long sequences of these operations to accomplish almost every conceivable task.
They are so fast that we tend to perceive a computation as instantaneous, but when the input is large it may take noticeable time. I still remember the first time I saw a C program run for several seconds and realized that computation is not always instantaneous. I was implementing arbitrary-precision arithmetic operators to compute the digits of π and of Euler's number, and as I increased the input size I noticed that the program took a few seconds to complete. The running time was proportional to the length of the numbers involved: linear for addition and quadratic for multiplication. That was my first encounter with computational complexity.
Computational Complexity
Roughly speaking, computational complexity is "counting the number of operations required to solve a problem as a function of its input size". For example, if we add two numbers of n digits, how many 1-digit additions do we need to perform? Or, multiplying two n-digit numbers, how many 1-digit additions and multiplications do we need?
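To make the counting concrete, here is a minimal sketch (my own illustration, not from any library) that tallies the 1-digit operations performed by the schoolbook algorithms on two n-digit numbers:

```python
# A minimal sketch: counting the 1-digit operations performed by the
# schoolbook algorithms on two n-digit numbers, stored as lists of digits
# with the least significant digit first.

def schoolbook_add(a, b):
    """Add two equal-length digit lists; returns (digits, 1-digit ops)."""
    ops, carry, out = 0, 0, []
    for da, db in zip(a, b):
        ops += 1                      # one 1-digit addition (carry included)
        carry, digit = divmod(da + db + carry, 10)
        out.append(digit)
    if carry:
        out.append(carry)
    return out, ops                   # ops grows linearly with n

def schoolbook_mul(a, b):
    """Multiply two digit lists; returns (digits, 1-digit ops)."""
    ops, out = 0, [0] * (len(a) + len(b))
    for i, da in enumerate(a):
        carry = 0
        for j, db in enumerate(b):
            ops += 1                  # one 1-digit multiplication plus additions
            carry, out[i + j] = divmod(out[i + j] + da * db + carry, 10)
        out[i + len(b)] += carry
    return out, ops                   # ops grows quadratically with n

n = 100
digits = [7] * n
print(schoolbook_add(digits, digits)[1])   # 100   ~ n
print(schoolbook_mul(digits, digits)[1])   # 10000 ~ n^2
```

Doubling n roughly doubles the additions but quadruples the 1-digit multiplications, which matches the linear and quadratic behaviour mentioned above.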
In large computations we must also take memory into account: if n is 1000 the whole computation runs in the CPU cache and it is very fast, but if n is 10^8 the input is probably stored in RAM and data must be moved back and forth between RAM and cache, so each operation is slower. If n is around 10^12 the data must be stored on disk and operations become slower still. In these cases the complexity should account not only for CPU operations but also for the number of accesses to fast and slow memory.
All the previous considerations imply an underlying model that specifies the basic actions available, such as arithmetic operations or memory movements: this is our computational model, an abstraction of a real computer, similar in capability but easier to study analytically.
In this post we review some of the major models proposed in the theoretical computer science literature. Why cover this topic? Because it is preparatory material for very interesting subjects such as the P vs NP problem.
Turing Machine
The first computational model is the Turing Machine, introduced by the British mathematician Alan Turing in 1936 [TM]. It is a mathematical model that aims to capture what is computable rather than to analyse the computational complexity of algorithms. In fact, although it can solve the same problems as a standard computer, it is "slow" and awkward to use, since it offers only a very limited set of operations. It consists of:
- an infinite tape divided into memory cells, each containing a symbol;
- an alphabet of symbols that can appear on the tape, potentially just 0 and 1. One of them is the blank symbol, the one found in uninitialized cells;
- a head positioned on a cell of the tape, which at each step can read/write that cell or move to the previous/next cell;
- a state register recording the current state of the machine. It takes values from a finite set of states, one of which is the halt state, which ends the computation;
- a finite table of instructions, telling the machine what to do according to the current state and the symbol under the head.
Non-deterministic Turing machines, and the non-deterministic versions of the following models, are out of the scope of this post.
Just a personal thought: since the Game of Life is Turing complete, in my mind the Turing Machine is a sort of alien computer in the alien world of the Game of Life, like in this video:
Programming a Turing Machine
In this model even simple operations like addition and multiplication can give you a headache. Consider the sum n+m: it requires an initial tape like
blank | <n in unary> | + | <m in unary> | blank
Initially the head is over the first 1 of n: it blanks it and then moves right until the first blank is found. It changes that blank into a 1 and moves left until the first blank is found. At that point it moves one position to the right: if the symbol read is 1 it repeats the same procedure, otherwise if the symbol is + it blanks it and halts.
So with a handful of states (the ones just described) a Turing machine can add any two numbers, reading them in unary on the tape separated by the symbol + and leaving n+m in unary on the tape.
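Here is a minimal sketch (my own encoding and state names) of a Turing machine simulator running exactly this unary-addition procedure:

```python
# A tiny Turing machine simulator for unary addition.
# Tape symbols: '1', '+', '_' (blank). States: q0 = pick a 1 from n,
# 'right' = carry it past '+', 'left' = walk back, 'halt' ends the run.

def run_tm(tape, transitions, state="q0", head=0, blank="_"):
    tape = dict(enumerate(tape))                 # sparse "infinite" tape
    steps = 0
    while state != "halt":
        symbol = tape.get(head, blank)
        write, move, state = transitions[(state, symbol)]
        tape[head] = write
        head += {"R": 1, "L": -1, "S": 0}[move]
        steps += 1
    cells = [tape[i] for i in sorted(tape)]
    return "".join(cells).strip("_"), steps

# transition table: (state, read) -> (write, move, next state)
add_transitions = {
    ("q0", "1"):    ("_", "R", "right"),   # erase one 1 of n, carry it right
    ("q0", "+"):    ("_", "S", "halt"),    # n exhausted: erase '+' and stop
    ("right", "1"): ("1", "R", "right"),
    ("right", "+"): ("+", "R", "right"),
    ("right", "_"): ("1", "L", "left"),    # append the carried 1 after m
    ("left", "1"):  ("1", "L", "left"),
    ("left", "+"):  ("+", "L", "left"),
    ("left", "_"):  ("_", "R", "q0"),      # back at the erased cell, step right
}

result, steps = run_tm("111+11", add_transitions)
print(result, steps)   # '11111' (3 + 2 in unary) and the step count
```

The step count already grows roughly like n(n+m), since every 1 of n triggers a full walk across the tape.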
Multiplication is even harder: we need a few more symbols on the tape, with a starting configuration like:
blank | A | <n in unary> | x | <m in unary> | A | blank
Conceptually, for each digit of m we copy n after the second A, as in the sum. At the end of each copy n is completely blanked, so we blank one digit of m and restore all the 1s of n. If m still contains some 1s we copy n again as in the first step, otherwise we blank n and halt.
We need something like O(n(n+m)) operations for the sum and O(n^2 m^2) for the multiplication. Imagine programming a Turing Machine to factor numbers…
RAM model
The Random Access Machine model was introduced by Cook and Reckhow in 1973 [RAM] and is much more similar to a real-life computer. It models a computer with this structure:
- a sequential processor with a fixed program;
- a countable number of memory cells, each of which can be accessed by the processor in a time independent of their number.
This is an unrealistic simplification for large computations, since a bit requires a minimum volume to be stored and there is a maximum speed at which a bit of information can travel (the principles of maximum information density and maximum information speed [BP]). So the larger the memory, the larger the latency to access cells far from the processor compared to the nearby ones. Indeed, in actual computers reading from RAM, disk, or network is slower by roughly a factor of 1000 at each step.
Nevertheless, the RAM model is still very effective for determining the operational complexity of algorithms on modern computers: its structure is much closer to them than the Turing machine, it is easier to treat analytically, and it provides a lower bound for real computers.
"Programming" a RAM is similar to programming a real computer: you write the pseudo-code it must execute, count the number of operations required, and obtain an upper bound O(f(n)) on the complexity of the algorithm.
A more interesting topic is finding lower bounds on the execution time of algorithms, but it is often difficult to prove bounds more significant than the trivial Ω(<input size>) needed just to read the whole input. A famous example is the Ω(n log n) lower bound for comparison-based sorting of n elements, which however no longer holds if, for example, we use the numbers themselves as indexes as in Counting Sort [CLR].
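As a reminder of why that bound can be beaten, here is a minimal counting-sort sketch (in the spirit of [CLR], my own code): by using the values themselves as indexes it sorts n integers in the range [0, k) with O(n + k) operations and no comparisons at all.

```python
def counting_sort(values, k):
    """Sort non-negative integers smaller than k without any comparison."""
    counts = [0] * k
    for v in values:                            # O(n): histogram of the input
        counts[v] += 1
    output = []
    for value, count in enumerate(counts):      # O(k): replay the histogram
        output.extend([value] * count)
    return output

print(counting_sort([4, 1, 3, 1, 0, 4, 2], k=5))   # [0, 1, 1, 2, 3, 4, 4]
```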
Memory models
Two important aspects of a generic computation are:
- spatial locality: the tendency of a processor to access data locations that are physically close to those accessed recently. If a certain memory location is accessed, there is a high probability that nearby locations will be accessed soon.
- temporal locality: the tendency of a processor to access the same data locations repeatedly within short periods of time. If a particular location is accessed, it is likely to be accessed again soon.
Modern computers take these facts into account by implementing several levels of memory, increasing in size but decreasing in speed, as we described in the general introduction to code optimization. Moreover, data travels between two adjacent memory levels in blocks.
Formally, these concepts have been analyzed in the Hierarchical Memory Model (HMM) [HMM] and the Block Transfer (BT) model [BT].
The HMM is characterized by a non-decreasing function a(x) describing the access time to location x, so that locations near the processor take less time to be accessed.
The BT model extends the HMM by allowing the transfer of B adjacent locations starting from address x in a(x) + B steps.
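Here is a minimal sketch (my own toy calculation) of the cost the BT model charges for a single transfer, assuming the access function a(x) = log2(x): the latency of reaching a far-away address is amortized over the B cells moved together.

```python
import math

def bt_transfer_cost(x, B):
    """Cost of transferring cells x, x+1, ..., x+B-1 in the BT model."""
    return math.log2(x) + B          # a(x) + B, with a(x) = log2(x)

for x in (2**10, 2**20, 2**30):
    for B in (1, 64, 4096):
        per_cell = bt_transfer_cost(x, B) / B
        print(f"x=2^{int(math.log2(x))}, B={B}: {per_cell:.2f} steps per cell")
```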
Working with these models is more difficult than with the RAM model, but they are more realistic. An algorithm that just needs to read an input of n elements, which trivially takes n operations on a RAM, has complexity Θ(n log* n) in the BT model with access function a(x) = log x (as shown in [BT]).
Nevertheless, in algorithms like matrix multiplication the correct use of temporal and spatial locality leads to the same complexity as in the RAM model. We already covered real implementations of matrix multiplication here and here. In this article we only highlight a few facts and refer to the two previous articles for the complete explanation.
If two large matrices M1 and M2 are stored on disk, we need to move them to RAM in blocks, and we must do it cleverly to avoid useless data transfers. Transposing the second matrix before the multiplication is good for spatial locality, since in this way we load in a single block matrix elements that would otherwise be spread over several blocks.
Multiplying small sub-matrices that fit in RAM is, on the other hand, a good choice for temporal locality.
From the pseudo-code we can estimate the number of block transfers and understand whether an algorithm will be CPU-bound or memory-bound.
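Here is a minimal sketch of a tiled matrix multiplication (a simplified version of mine, not the implementation from the previous articles): only tiles of side B are assumed to fit in fast memory, so each tile load is counted as a transfer alongside the arithmetic.

```python
def tiled_matmul(M1, M2, n, B):
    """Multiply two n x n matrices (lists of lists) using tiles of side B."""
    C = [[0.0] * n for _ in range(n)]
    transfers = 0
    for i0 in range(0, n, B):
        for j0 in range(0, n, B):
            for k0 in range(0, n, B):
                transfers += 3            # load one tile of M1, M2 and C (assumed cost)
                # temporal locality: the three B x B tiles are reused while
                # they stay in fast memory
                for i in range(i0, min(i0 + B, n)):
                    for k in range(k0, min(k0 + B, n)):
                        a = M1[i][k]      # spatial locality: rows scanned contiguously
                        for j in range(j0, min(j0 + B, n)):
                            C[i][j] += a * M2[k][j]
    return C, transfers

n, B = 8, 4
A = [[1.0] * n for _ in range(n)]
M = [[2.0] * n for _ in range(n)]
C, transfers = tiled_matmul(A, M, n, B)
print(C[0][0], transfers)   # 16.0 and (n/B)^3 * 3 = 24 tile transfers
```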
Parallel computing
So far we have focused on single-CPU computation, but since the eighties multi-core CPUs and multi-node clusters have been the norm in research, and for the last two decades also in our homes (since it became easier to increase the number of cores/processors than to make a single processor more complex or faster).
Indeed, even within a single CPU there are at least two types of parallelism:
- bit-level parallelism, which consists in increasing the number of bits processed per instruction by widening the word size of the processor (vectorization);
- instruction-level parallelism, which consists in designing the processor to pipeline the execution of instructions, obtaining a higher throughput (CPU pipelining and branch prediction).
We can write code that is friendly to these kinds of parallelism, but essentially they are transparent to developers and the program remains sequential. With parallel computing we consider task-level parallelism, which consists in exploiting several processors during the execution of a program. This is no longer provided automatically by the processor and must be made explicit by developers.
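A minimal sketch of explicit task-level parallelism (my own example, using Python's standard concurrent.futures module): the work is split into chunks and each chunk is handed to a different worker process.

```python
from concurrent.futures import ProcessPoolExecutor

def partial_sum(chunk):
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4
    size = len(data) // n_workers
    chunks = [data[i * size:(i + 1) * size] for i in range(n_workers)]

    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum, chunks))   # one task per processor

    print(sum(partials) == sum(data))   # True: same result, work split explicitly
```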
A parallel machine is characterized by its number of processors and by the network connecting them. Just as for single-processor computers there are models capturing more or less detail about the memory hierarchy, for parallel computers there are models capturing more or less detail about the cost of communication over the network.
In particular, the literature contains both models with a very specific network (linear array, 2D/3D mesh, binary tree, etc.) and generic models that do not represent any specific network (BSP, LogP, LPRAM, etc.). The first class gives more detailed lower/upper bounds, but they are valid only for a specific network. The second class provides more general results, but they are usually harder to obtain and many are still missing.
Since we aim for generality, we skip the models based on specific networks. Let us explore some of the generic ones.
PRAM model
This is the ideal parallel model, similar to the RAM model (in fact PRAM stands for Parallel RAM) except for the presence of n processors that share the memory.
There are several kinds of PRAM, as described in [Harris]. They are categorized according to the program execution policy (SIMD, where each processor runs the same program on different data, or MIMD, where each processor has its own program and data) and the memory sharing policy:
- EREW: Exclusive Read, Exclusive Write; each cell can be read or written by a single CPU at a time;
- CREW: Concurrent Read, Exclusive Write; each cell can be written by a single CPU at a time, but read by several;
- CRCW: Concurrent Read, Concurrent Write; each cell can be read or written by several CPUs at once, with one of the following write-collision policies:
  - Common: the write succeeds only if all participants try to write the same value to the cell;
  - Arbitrary: an arbitrary one of the participating processors succeeds in writing the cell;
  - Priority: the processor with the highest priority writes the cell.
It can be proved that these machines are not equivalent in terms of speed; in particular, they can be ordered from the most to the least powerful as follows:
Priority ≥ Arbitrary ≥ Common ≥ CREW ≥ EREW
For example, the OR of n bits requires Ω(log n) time in the CREW and EREW models, while it takes O(1) time on a Common CRCW PRAM.
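To make the constant-time OR concrete, here is a minimal sketch (my own sequential simulation of a single parallel step):

```python
# The O(1) OR of n bits on a Common CRCW PRAM. Processor i holds bits[i];
# every processor holding a 1 writes 1 into the same shared cell in one step,
# and since all writers propose the same value the Common policy allows it.

def crcw_common_or(bits):
    shared_cell = 0                             # step 1: one processor clears the cell
    proposals = [1 for b in bits if b == 1]     # step 2: concurrent writes by the 1-holders
    if proposals:
        assert len(set(proposals)) == 1         # Common policy: all writers agree
        shared_cell = proposals[0]
    return shared_cell                          # the OR, after a constant number of steps

print(crcw_common_or([0, 0, 1, 0]))   # 1
print(crcw_common_or([0, 0, 0, 0]))   # 0
```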
Other parallel models
The PRAM has limitations similar to the RAM, due to its lack of a network connecting processors and memory. Below are some models that try to overcome them.
LPRAM is a variation of the PRAM that also captures communication costs; it is studied in [LPRAM], where time and communication lower bounds are shown for several problems such as matrix multiplication, sorting and FFT. The bounds consist of two parts, one for computation and one for communication.
BSP (Bulk Synchronous Parallel) [BSP] describes a wide range of real networks thanks to its parameterized structure. Time lower bounds for this model include parameters for the latency and bandwidth of the network connecting the processors, so a time lower bound already expresses the communication complexity as well.
It has been extended to take locality in parallel computation into account in [DBSP], where the Decomposable BSP is introduced. In this model bandwidth and latency are variable, and computations take less time for algorithms whose communication happens among nearby processors. It is possible to prove that D-BSP can model processor networks more effectively than BSP, allowing emulations with only a constant-factor slowdown for several networks, e.g. multidimensional arrays; see [DBSP2] for a complete treatment.
Relations between parallel and sequential algorithms
A result that impressed me is the relation between parallel algorithms (which must exhibit locality to minimize communication) and algorithms for hierarchical memories. In [PPS] it is shown that an optimal D-BSP algorithm can be transformed into an optimal algorithm for a hierarchical memory model.
Conclusions
This post is a summary of sequential and parallel computational models, from those oriented towards mathematical results (like the Turing Machine), through idealized forms of real computers (like the RAM and PRAM models), to models that also try to describe the complexity of memory hierarchies and network communication.
They convey useful concepts, worth keeping in mind even while developing code on real machines. Moreover, this post will be a starting point for other, more theoretical posts about complexity classes.
References
- [TM] Turing, A.M.: On Computable Numbers, with an Application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, 2 (42): 230–265, 1937.
- [RAM] Cook, S. A., Reckhow, R. A.: Time Bounded Random Access Machines. J. Comput. Syst. Sci., vol. 7, n. 4, pp. 354–375, 1973
- [CLR] Cormen, T. H., Leiserson, C. E., Rivest, R. L., Stein, C.: Introduction to Algorithms, second edition. MIT Press, 2001
- [BP] Bilardi, G., and Preparata, F.: Horizons of parallel computation. J. Parall. Distrib. Comput. 27, 2, 172–182, 1995.
- [HMM] Aggarwal, A., Alpern, B., Chandra, A., Snir, M.: A model for hierarchical memory, Proceedings of the nineteenth annual ACM symposium on Theory of computing, STOC ’87, ACM, New York, NY, USA, 1987, ISBN 0-89791-221-7.
- [BT] Aggarwal, A., Chandra, A. K., Snir, M.: Hierarchical memory with block transfer, Proceedings of the 28th Annual Symposium on Foundations of Computer Science, SFCS ’87, IEEE Computer Society, Washington, DC, USA, 1987, ISBN 0-8186-0807-2.
- [Harris] Harris, T.J.: A Survey of PRAM Simulation Techniques. ACM Computing Surveys (CSUR), Volume 26, Issue 2, June 1994.
- [LPRAM] Aggarwal, A., Chandra, A.K., Snir, M.: Communication complexity of PRAMs. Theoretical Computer Science, 71(1):3–28, 1990.
- [BSP] Valiant, L.G.: A bridging model for parallel computation. Communications of the ACM, 33(8):103–111, August 1990.
- [DBSP] P. De la Torre and C.P. Kruskal. Submachine locality in the bulk synchronous setting. In Proc. of EUROPAR 96, LNCS 1124, pp. 352–358, August 1996.
- [DBSP2] Bilardi, G., Pietracaprina, A. and Pucci, G.: Decomposable BSP: A Bandwidth-Latency Model for Parallel and Hierarchical Computation, in Handbook of Parallel Computing (J. Reif and S. Rajasekaran Eds.), CRC Press, Boca Raton Fl USA, 2007
- [PPS] Pietracaprina, A., Pucci, G., Silvestri, F.: Cache-oblivious simulation of parallel programs. IPDPS 2006.