R&D on PVS-Studio

Nov 06 2012

Author: Evgenii Ryzhkov

Introduction
What do we already have?
What experiments did we carry out?
Conclusion

We have a large list of tasks and wishes we try to stick to while developing PVS-Studio. But occasionally we find some time to spend on unusual experiments that may bring new development ways and capabilities. If research results are successful, they may be included into the main product. They can, on the contrary, prove to be meaningless and useless, in which case we appear to have carried out a few experiments to find out one more thing that doesn't work. It is this type of experiments we're going to speak about today.

Introduction

PVS-Studio is at present a plugin for Visual Studio and cannot work without this environment. This is actually not quite so, but common users are not aware of it. This is what makes them think that the tool depends on Visual Studio:

PVS-Studio employs an external preprocessor for its work. Earlier this used to be the preprocessor from Visual C++. Now it's Clang in most cases, though we sometimes have to use Visual C++ as well.
PVS-Studio uses the project file .vcproj/.vcxproj to get information about the project settings: for instance, #define/#include being used, compilation switches that may affect the code analysis process, and so on.
Besides, PVS-Studio also needs the project file .vcproj/.vcxproj to know which files should be checked.

However, sometimes both users and ourselves feel like using PVS-Studio without being bound to Visual Studio. In this article, we will tell you about some of our experiments related to this.

What do we already have?

First of all, I will tell you about the features that we made long ago and which are successfully used in PVS-Studio.

First, PVS-Studio has for some time been able to analyze projects, integrating into build systems (both classic makefile and tougher systems). It is described in the documentation, and some our customers use this feature actively. In this case, the Visual Studio environment as such almost remains overboard.

Second, we could have started using Clang, which is integrated into the PVS-Studio distribution pack, instead of the cl.exe preprocessor long ago. And it is Clang which is switched on by default. This is done because Clang runs faster as a preprocessor than cl.exe. Besides, Clang is free from certain errors that can be found in cl.exe. However, it does have its own ones, but it all looks quite transparent to users.

What experiments did we carry out?

Here is the list of the questions we wanted to find answers to in our experiments:

Do we need the project structure defined in makefile or .vcxproj files for correct static code analysis? Are the individual compilation parameters of each particular file important to this task? Cannot we just do with commands like "Check all the files with the same build switches in this folder?"
Do we need to take into account the compilation parameters settings (we mean the compiler switches) for static analysis?
Does static analysis require file preprocessing at all or can errors be well found without it?

To answer these questions, we have written a utility that recursively traverses the specified folder and runs PVS-Studio.exe on all the files with the source code (*.c, *.cpp, *.cxx, etc.). We wanted to compare the analysis results thus obtained with the results produced by the traditional project analysis within the Visual Studio environment.

Experiment one. Do we need the project structure?

We started our experiments on the WinMerge project that we already checked long ago. If you check it under Visual Studio using the project file, PVS-Studio will analyze about 270 files included into .vcproj. What's interesting, there are about 500 files with the source code (without .h-files) in the WinMerge folder. Though it's obvious now, we still didn't expect it at that time. It appears that if you tell the analyzer: "I want you to check the files in this folder", it will check unnecessary files too! And in case of WinMerge, their number is about twice larger.

So, the first problem we encountered in this experiment was this: if you check just "all the files in the folder", the analyzer will check more files than necessary, including those which are known to be incorrect and uncompilable.

But this wasn't the main problem. When we launched the analysis for all the files, we started to get preprocessor errors at once: "This or that #include-file cannot be found" with a reference to the names of the project .h-files located in other folders. We understood that we needed to point out to the preprocessor the folders where include-files lay. How to do this automatically, without having to specify these folders manually? We added all the subfolders into the #include-directories list for each file.

Running a bit ahead, I want to tell you that it's no easy task for some projects. If your project contains thousands of subfolders, automatically adding them into the #include-files search list will make the preprocessor command line swell. While you can use the response file for cl.exe, there's no solution to this problem in case of Clang yet.

So, you face another problem after automatically specifying all the subfolders to be searched through for #include-files.

This problem is this: projects sometimes have files with identical names. In this case, you cannot automatically specify files in which folder and for which project should be used in each particular search of #include-files. You may say: "Well, yeah, there are some projects that have files with the same names, but they are rare and can be ignored". No, they cannot. For instance, almost all the Visual Studio projects contain files with identical names. Don't you believe me? Do you think your projects don't contain such files? Then run a search for stdafx.h in your projects... Since stdafx.h must be included into all the files, choosing a wrong version of stdafx.h leads to a preprocessing error for ALL the project files.

Although we found many other files with identical names besides stdafx.h, the very presence of this problem makes it impossible to preprocess files for further handling in automated mode.

We have drawn the following conclusions from the first experiment's results. Checking "all the files in the folder" without any file project (whether it is makefile or vcproj) is difficult due to the following two reasons:

The folder in most cases contains additional "unnecessary" files which are usually quite numerous. At the same time, they may be uncompilable, incorrect or simply add "trash" to the analyzer's output.
The task of individually specifying the #include switches for each particular file can be solved only manually. It cannot be solved in automated mode, by simply passing an identical switch set to each file, because projects usually contain files with identical names.

Experiment two. The usefulness of compilation switches

Generally speaking, there are a lot of compilation switches that at first sight seem to affect the way a file is interpreted from the viewpoint of static analysis. But there are many other parameters influencing the inclusion of certain code branches besides the obvious paths to #include-files and #define-directives we have just discussed. How strong is their influence?

For instance, there is the "/J" switch in the cl.exe compiler:

/J (Default char Type Is unsigned)

The parameter seems to be important, but how much does it influence the static analysis? Or, for example, take some other parameters referring to language extensions:

/Za, /Ze (Disable Language Extensions)

To estimate the influence of such parameters, we compared the results of project analysis in a common (traditional) mode and the same check without accounting for such compilation parameters.

The experiment has shown that the latter results have an extremely insufficient difference from the former results. Moreover, even absence of #define-parameters being passed has almost no influence on the analysis quality in general. Of course, the analyzer made wrong choices of code branches in #ifdef-constructs, but it had been expected and is quite logical. The only parameter which is 100% necessary is still the paths to #include-files.

The conclusions drawn from the second experiment's results are as follows: it is desirable to account for compilation parameters to get more accurate static analysis results. But if you cannot do that for some reason (for example, you have a complex build system), you may try do without them. The only parameter which is an exception is the one defining paths to #include-files - that is necessary.

Experiment three. Do we need preprocessing?

Finally, we wanted to check how much necessary preprocessing was to get high-quality static analysis results. After all, you can detect a lot of errors through "local" analysis, i.e. inside one function.

To understand it, we carried out the following experiment. We disabled preprocessing completely in the analyzer and started feeding PVS-Studio.exe with source .cpp-files "as is" without any preprocessing. Then we compared the results to our reference results.

It appeared that abandoning the preprocessing is highly destructive to the analysis quality. The reason is that when you stop using preprocessing (either as a separate step or "on the fly"), you miss information about data types, classes and functions defined in the .h-files. Because of this, quite many of our diagnostics simply fell off. Yes, they still were able to find something. But, first, it was much less than before. And second, too many trash messages were generated because the analyzer "had failed to find out" the data type and assumed that it might cause troubles.

Did we get real errors among the results acquired through the analysis without preprocessing? Yes, we did. But they were very hard to filter out among the huge amount of false reports.

The conclusions from the third experiment's results are the following: absence of preprocessing is very destructive to the static analysis quality. Though you can still detect some errors even without the preprocessor, absence of information about data types, function titles and other similar information, distorts the analysis results and causes a lot of trash messages. It means that static analysis is not reasonable without preprocessing.

Conclusion

So, these are the conclusions drawn from the results of our experiment (or, rather, an evidence of what we already supposed before):

The project structure is very important. You can't just check "all the files in the folder". First, unnecessary files will be checked. Second, confusion occurs when including header files with identical names because of the impossibility to automatically generate individual build parameters for each file without having any project file.
Compilation switches should be accounted for, but their influence is not that crucial, except for the paths to #include-files and, perhaps, #define-parameters.
Preprocessing is a necessary step for static analysis. Without it, a significant part of information about the code structure gets lost, which leads to a poor quality of analysis results.

That's why we hardly will abandon the established scheme of project checking in nearest future.