chathaway.codes -- logs from my daily fun | A Brief Discourse on Software Complexity Metrics

Software Complexity is the measure of how difficult software is to conceptualize, comprehend, and through extension, create and maintain. Despite having the word “complexity” in the title, this domain of computer science has nothing to do with time or space complexity that people often think of (Big-O notation, omega time, etc.). More interestingly, there doesn’t seem to be any connection between high time complexity and higher software complexity.

From the time of the first computer programs, written in assembly language or hard-coded into the machine, measuring the size of these programs has been a subject of debate. The first computer scientists measured their programs in punch cards, soon changed to lines of code as medium changed. This metric, LOC, is still widely used today, but at the same time, widely acknowledged as a bad metric; often times teased for how poorly it aligns to a software project (an oft told jokes goes along the lines of a manager telling a developer he needs to start logging his LOC per week, or he will get fired… So he enters in a crazy number like 50000).

If LOC is a poor metric, what is a good metric? This has been a subject of much debate up until the late 1990’s, after which people generally accepted the impossibility of the problem. Numerous metrics have been proposed, some with very vocal supporters, but none of have been rigorously tested, at least not enough to content all critics. The CK-metrics are probably the closest to an acceptable answer that we have; but those only measure complexity of object oriented systems, without looking at the functions within the classes themselves. Cyclomatic Complexity (also known as McCabe’s metric) was one of the first proposed, and continues to be one with a large number of followers, despite having little obvious foundation for why it should work. A number of similar metrics exist, including the next-most-populat Data-Flow complexity. Data Flow relies on a simplified Cyclomatic Complexity control-flow graph (CFG) in addition to the flow of data within this CFG.

Other metrics, such as Halsteads or NPath, also exist, but do not have the save reverence as these older metrics. Some newer metrics continue to be created (Program Slicing), but suffer from many of the same criticisms that older metrics have; no rigorous analysis, verification, or studies have been made with any of these metrics. There have been some attempts to correlate metrics with fault rate within programs, and although these have been semi-successful, the connections are weak at best.

Cyclomatic Complexity

Cyclomatic Complexity is calculated in two stages; first, we create a control-flow graph of the program, and then we count the number of edges and the number of nodes in this direct graph. Lastly, we apply the formula “C = E - N + 2P”. E is the number of edges, N the number of nodes, P the number of modules (almost always one).

Or, one could use the short-cut method; count the number of branch statements within in the code, and that is your cyclomatic complexity.

Many frameworks that implement CFG’s don’t do so in a way consistent with how McCabe decribes. Often times, they add an edge between every statement (i.e. x=1; y=2; if (x == y) print “yes” else print “no” would have 5 nodes instead of 4). This can be easily proven to be the same as McCabes methods for the purpose of applying his formula; every node added also adds exactly one edge, thereby countering it, unless that node is a branching node.

Data Flow Complexity

Data Flow complexity is created by a bloke named Oviedo; to calculate it, you need a formula little bit more complex than McCabes.

$DF_i = \sum_{j=1}^{||V_i||} DEF(V_j)$

This formula gives us the DF complexity for a single “block” within the CFG. Basically, for each variable used in the block, add 0 if the value is locally defined, else trace back through the program until you find a previous definition of the value and count if (1 in most cases, but if it is defined in a branch, you may have 2 values).

The last part of Data Flow Complexity is to combine it with the programs control flow complexity. The CF complexity is:

$CF = ||e||$

Simply the number of edges in the control-flow graph. Note that in this case, we can not used the simplified model used by most existing frameworks; we need to construct a CFG which accurately represents the program.

The last step is to combine them, with weights alpha and beta. Oviedo left this weights undefined, and should be determined based on some study. Most other references of this metric use weights of 1.

$DF = \sum_{i=1}^{||S||} \alpha DF_i + \beta CF_i$

Halstead’s Complexity

Halstead’s complexity is much different than the previous two; it is various functions of the number of unique operands, total operands, number of unique operators, and total operators.

Others

Other metrics exist, but become a bit tedious to include in a blog post. For example, the CK-metric suite is consists of 6 distinct metrics, the last of which has quite a few variations. If you are interested in these, I suggest reading about them here Lines of Code (statement count) is simple, and if you need it explained, you may want to first learn to program.

My personal webspace

Cyclomatic Complexity

Data Flow Complexity

Halstead’s Complexity

Others