Code optimization

This page “connects the dots” of the posts dealing with optimization. It summarizes the topic, citing the various posts, so it is a good place to get an overview of the subject as discussed on this blog.

Code optimization

Code optimization means modifying an existing program to make it faster and less resource-demanding. Depending on the context, this takes on very different meanings:

  • database-backed applications: writing efficient queries (using indexes, minimizing selected fields…),
  • computational contexts: writing code that fully exploits the CPU and memory,
  • distributed computations over a network: minimizing communication,
  • and so on…

When coding we shouldn’t worry about performance in the first draft of a program: we should concentrate only on the solution and leave optimization to a second phase.

Optimization has various levels:

  • optimization of execution on a single CPU
  • parallelization on a machine with multiple CPUs
  • parallelization over a network

Over time I hope to cover this topic thoroughly, as it is one of the most fascinating in computer science. This page is quite “programming-oriented”, but on the blog I also treat the subject theoretically, presenting the most famous computational models and, in the future, showing how to analyze algorithms on them.

The problem(s)

Modern computers have the following structure:

  • a CPU with registers and pipelines, supporting several instruction sets,
  • a memory hierarchy: L1, L2 and L3 caches, RAM, SSD, HDD, network storage,
  • a CPU with several cores (each possibly multithreaded), while the computer itself can be just one node in a network of computers.

Let’s see in detail which problems we have to fight at each of these levels.

CPU

The CPU is the most important part of a computer. Its structure is fundamentally an ALU (Arithmetic-Logic Unit) that executes operations, a handful of registers to store operands and context, and a control unit to communicate with memory. Execution proceeds in steps (cycles) in which the CPU performs the fetch-decode-execute process. Since this process has several sub-steps, execution is usually organized in a pipeline to exploit more components in each cycle (so in each cycle several instructions are executing, each in a different sub-step).

Example of a pipeline (from Wikipedia)

The CPU pipeline is a form of instruction-level parallelism and we should keep it in mind while optimizing code. As long as the CPU can guess the instructions that fill the pipeline, it works at maximum throughput (making progress on several instructions per cycle), but when the code has poor predictability (e.g., lots of if statements) it may effectively complete only one instruction per cycle (I’m simplifying the matter, I know there are branch predictors and other cool stuff, but the fundamental problem is keeping the pipeline full).
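To make the effect concrete, here is a minimal sketch (my own illustration, not code from the posts): on random data the branchy version suffers mispredictions that stall the pipeline, while the branchless version turns the condition into a data dependency (a mask) so there is nothing for the predictor to miss.

```c
#include <stdint.h>

/* Branchy version: on random data the if is hard to predict,
   and every misprediction flushes the pipeline. */
int64_t sum_positive_branchy(const int32_t *v, int n) {
    int64_t sum = 0;
    for (int i = 0; i < n; i++)
        if (v[i] > 0)
            sum += v[i];
    return sum;
}

/* Branchless version: the condition becomes a mask that is
   0 when v[i] <= 0 and all ones when v[i] > 0. */
int64_t sum_positive_branchless(const int32_t *v, int n) {
    int64_t sum = 0;
    for (int i = 0; i < n; i++)
        sum += v[i] & -(int64_t)(v[i] > 0);
    return sum;
}
```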

Moreover, the CPU can also execute special instructions called SIMD (Single Instruction, Multiple Data) that perform the same operation on an array of values. Famous SIMD instruction sets are MMX, SSE2/3 and AVX. When dealing with specific problems, SIMD instructions can be very useful.
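As a taste of what this looks like in practice, here is a minimal sketch using SSE intrinsics (a generic example of mine, not code from the posts): one instruction adds four floats at a time, with a scalar loop handling the leftover elements.

```c
#include <immintrin.h>

/* Add two float arrays with SSE: 4 floats per instruction. */
void add_sse(const float *a, const float *b, float *out, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* load 4 floats (unaligned ok) */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
    for (; i < n; i++)                     /* scalar tail */
        out[i] = a[i] + b[i];
}
```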

Memory usage

Another problem is supplying instructions and data to the CPU, fetching them from memory. Memory is slower than the CPU, so many cycles are necessary to obtain the data. In general, the levels of memory have a hierarchical organization: the nearer we are to the CPU, the faster and smaller the memories are.

Memory           Size      Access time (CPU cycles)
registers        bytes     1
Cache L1         ~128 KB   ~5
Cache L2/L3      ~MBs      ~10
RAM              ~GBs      ~100
SSD              ~TBs      ~10,000–100,000
HDD              ~TBs      ~10^6–10^7
network storage            ~10^9
Order of magnitude for size and speed of modern memories.

Here are three major ideas:

  • while the CPU processes some data, pre-fetch the next needed data from RAM (or disk) into the cache, so that it is available when the CPU requires it,
  • since data travels between memory levels in segments of a fixed size (cache lines), keeping data that must be processed together adjacent in memory avoids using two cache lines in place of one,
  • since moving data has a cost, we should fully exploit data while it is available to the CPU.

The last two concepts are known as spatial and temporal locality of the computation. See also the post about hierarchical models for a theoretical approach to the subject.
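To illustrate spatial locality, here is a small sketch (again my own generic example): both functions compute the same sum, but the row-wise traversal uses every fetched cache line fully, while the column-wise one jumps a whole row between accesses and wastes almost all of each line.

```c
#define N 1024
static double m[N][N];

/* Row-major traversal: consecutive iterations touch adjacent
   addresses, so each fetched cache line is fully used. */
double sum_by_rows(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Column-major traversal: consecutive iterations jump
   N * sizeof(double) bytes, so most accesses miss the cache. */
double sum_by_cols(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}
```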

Task parallelism

The two previous points aim to optimize execution on a single core, but modern CPUs have many cores, so they can perform several tasks in parallel. When executing an algorithm there are several steps with different dependencies: some can be done together, some need other steps as preconditions, and so on. These dependencies are usually represented by a flowchart (or by a DAG, if you are a theory person).

The more tasks we can execute together, the faster the execution will be, although sometimes the nature of the algorithm is strictly sequential and there is very little we can do.
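As a minimal sketch of the idea (using POSIX threads; task_a and task_b are hypothetical placeholders of mine): two nodes of the dependency DAG with no edge between them can run on different cores, and the joins express the dependency of whatever step comes next.

```c
#include <pthread.h>
#include <stdio.h>

/* Two independent tasks of the dependency DAG: since neither
   needs the other's result, they can run on different cores. */
static void *task_a(void *arg) { (void)arg; /* ... first half ... */ return NULL; }
static void *task_b(void *arg) { (void)arg; /* ... second half ... */ return NULL; }

int main(void) {
    pthread_t ta, tb;
    pthread_create(&ta, NULL, task_a, NULL);
    pthread_create(&tb, NULL, task_b, NULL);
    pthread_join(ta, NULL);   /* a later step that depends on both */
    pthread_join(tb, NULL);   /* tasks must wait at these joins   */
    puts("both independent tasks completed");
    return 0;
}
```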

Single-core code optimization

On the blog we provide several examples of single-core optimization:

  • in the Matrix Multiplication article we show how to make the implementation cache-friendly with data transposition and blocked elaboration, and how to maximize CPU flops using vectorization (a generic sketch of the blocking idea follows below),
  • in the posts showing how to compute one million digits of Pi and of Euler’s number e we also apply cache-friendly techniques.
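To give a flavor of the blocking technique mentioned above, here is a minimal sketch (under simplifying assumptions of mine, not the article’s actual code: square matrices, block size dividing N, output zero-initialized): each small tile of A, B and C stays in cache while it is reused.

```c
#define N  512
#define BS 64   /* block size chosen so three BSxBS tiles fit in cache */

static double A[N][N], B[N][N], C[N][N];  /* C starts zeroed */

/* Blocked (tiled) matrix multiplication: working tile by tile
   improves temporal locality, since each tile is reused many
   times while it is still resident in cache. */
void matmul_blocked(void) {
    for (int ii = 0; ii < N; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
            for (int jj = 0; jj < N; jj += BS)
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++) {
                        double a = A[i][k];          /* reused across j */
                        for (int j = jj; j < jj + BS; j++)
                            C[i][j] += a * B[k][j];  /* row-wise access */
                    }
}
```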