Articles A Static Analysis Tool for C++ by Greg Utas

emailx45

Local
Joined
5 May 2008
Messages
3,571
Reactions
2,439
Credits
574
A Static Analysis Tool for C++
Greg Utas - 28/Apr/2020
[SHOWTOGROUPS=4,20]
Automating Scott Meyers' recommendations and cleaning up #include directives.

This article is a user guide to a static analysis tool for C++ code. Among other things, the tool can clean up #include lists and highlight violations of C++ best practices. It can also implement some of its suggestions by editing the code. The article also provides a high-level overview of the tool's implementation.
  • ...github.com/GregUtas/robust-services-core/archive/master.zip
  • .../KB/cpp/5165710/master.zip
  • .../KB/cpp/5246833/txt_files.zip

Introduction
C++ is a large language—too large, some would argue. Because it's a superset of C, it's easy for developers with a C background to build a hybrid OO/non-OO system. C++ also kept the preprocessor, which is sometimes used in what can only be described as despicable ways. And rather than risk offending legacy systems, the C++ standards committee seems very reluctant to deprecate anything—but not at all reluctant to keep adding what seems like one pedantic feature after another, at least to those of us struggling to keep up.

As a result of all this, there are often many ways to do something in C++, and figuring out which way is best can be difficult. Without guidance, it can only be learned through torturous experience. It is therefore unsurprising that there are many books about C++ best practices, such as Scott Meyers' Effective C++. But it's easy to forget their recommendations when you're immersed in coding, especially when new to the language. Of course, some developers don't even bother to read such books, being of the "If it works, it's correct—so don't touch it!" school. Having a tool that could serve as an automated Scott Meyers code inspector would go a long way to addressing these issues.

Background
When I started to develop the Robust Services Core (RSC), I had a reasonable knowledge of C++ but was far from proficient. The code grew very organically and was continually refactored. As I became more familiar with C++ and needed to revisit areas of the code that had lain dormant for a while, I kept finding things that I would now do differently. But there was always more code to develop and never enough time to do a tedious code inspection to find and "fix" all the things that could be improved.

Eventually I decided that, at the very least, it would be nice to clean up all the #include directives. Surely there was a publicly available tool for this. This was circa 2013, and the only thing I found was a Google initiative called "Include What You Use", which appeared to have been mothballed (see note 1). I therefore decided to write such a tool as a diversion from the main focus of RSC.

Some diversion! It soon became apparent that fixing #include lists, to add the directives that should be there and remove those that shouldn't, meant writing a parser. And not just a parser, but something closer to a compiler, because it would also have to do name resolution and other things. Another option was to take an open-source C++ compiler and either modify it or extract the necessary information from files that it might produce.

Rather than give up, I decided to try writing the tool from scratch. It would be a learning experience, even if the attempt ultimately had to be abandoned. This article describes the current state of the code that emerged.

Using the Code
Not only does the code clean up #include directives, it serves as an automated Scott Meyers code inspector that can implement some of its recommendations by suitably editing the source code. Its main drawback is that it only supports the subset of C++11 that RSC uses. Although this is a reasonable subset of the language, the gaps will hamper its usefulness to projects that use unsupported language features. Adding one of these missing language features can be anywhere from moderately easy to quite challenging. Nonetheless, feel free to request that a specific language feature be supported, or even volunteer to implement it! This will make the tool useful to a wider range of projects.

Unlike previous articles that I've written, this one focuses more on how to use the code, and not much on how it works. However, it will provide a high-level overview of the design as a roadmap for those who want to dig into the code.

Walkthroughs

Defining the Library
Before the tool can be used, the files that make up the code base must be defined. This can be done right after RSC starts by entering the command >read buildlib from the CLI. That ">" is RSC's CLI prompt and is not entered, but this article uses it to denote a CLI command. A dump of all CLI commands is available in help.cli (see note 2); scroll down to somewhere around line 1246, to "ct>help full", to see those in the ct directory, which is where the tool is implemented.

The >read buildlib command executes the buildlib script, which contains a sequence of CLI commands. This results in the execution of the following commands, which are copied from the console transcript file that RSC generates, with commands not relevant to this article removed:
Code:
nb>read buildlib
nb>ct
ct>read lib.create
ct>import subs "subs"
ct>import nbase "nb"
ct>import ntool "nt"
ct>import ctool "ct"
ct>import nwork "nw"
ct>import sbase "sb"
ct>import stool "st"
ct>import mbase "mb"
ct>import cbase "cb"
ct>import pbase "pb"
ct>import onode "on"
ct>import cnode "cn"
ct>import rnode "rn"
ct>import snode "sn"
ct>import anode "an"
ct>import diplo "dip"
ct>import rsc "rsc"

The tool is in the ct directory, so the command >ct is used to access the CLI commands in that directory. The script lib.create is then read. It contains a series of >import commands that add, to the code library, all of the directories that are needed to compile the project (RSC, in this case). For example, the command
Code:
ct>import ctool "ct"

imports the code in the ct directory, which can subsequently be referred to as ctool in other CLI commands. The path to this directory is relative to the SourcePath configuration parameter. When RSC starts up, it obtains its configuration parameters from the file element.config. So to use the tools on your own code, you need to
  • Modify element.config by setting its SourcePath entry to a directory that subtends all of your project's code files.
  • Create a file similar to lib.create in the same directory as RSC's lib.create. Each of the >import commands in that file must specify a directory that is relative to your new setting for SourcePath.
  • Copy the subs directory from RSC into your own project, just below your SourcePath directory, and include the command >import subs "subs", as found in RSC's lib.create, in your version of lib.create.
  • Modify the buildlib script to >read your version of lib.create.
Each >import command ends up creating a CodeDir instance for its directory and a CodeFile instance for each code file (see note 3) in that directory. There are currently two restrictions:
  • Each file name must be unique (i.e., the same name cannot be used in more than one directory).
  • All of the code files in a directory get imported (i.e., there is no way to exclude a code file).
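
The first restriction can be checked before writing the >import commands. The following is a minimal sketch, not part of RSC: the Directories type and the DuplicateFileNames function are hypothetical names, and the caller is assumed to have already collected each directory's file names.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical input: directory name -> names of the code files it contains.
using Directories = std::map<std::string, std::vector<std::string>>;

// Returns every file name that appears more than once across the
// directories, i.e. the names that would violate the uniqueness restriction.
std::set<std::string> DuplicateFileNames(const Directories& dirs)
{
   std::set<std::string> seen;
   std::set<std::string> duplicates;

   for(const auto& dir : dirs)
   {
      for(const auto& file : dir.second)
      {
         // insert() reports whether the name was new
         if(!seen.insert(file).second) duplicates.insert(file);
      }
   }

   return duplicates;
}
```

An empty result means that every file name is unique and the directories can be imported as they are.
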
Parsing the Code
Once all of the source code directories have been imported, the entire code library can be parsed, which is a prerequisite to checking it with the static analysis tool. This is done with the command
Code:
>parse - win32 $files

in which
  • - specifies that no parser options are being used (the only options are ones that enable debug tools)
  • win32 specifies that the target is 32-bit Windows (currently, the only other target is win64)
  • $files is a built-in library variable that contains the set of all code files
If $files is replaced with f ctool, meaning all the code files in the ct directory, the result (again taken from the console transcript file) looks like this:
Code:
ct>parse - win32 f ctool
cstdint
cctype
cmath
csignal
cstdio
cstdlib
direct.h
exception
functional
iosfwd
utility
typeinfo
winerror.h
atomic
cstddef
ctime
ios
io.h
iterator
cstring
windows.h
ostream
iomanip
memory
new
queue
stack
unordered_map
algorithm
dbghelp.h
istream
intsafe.h
list
map
set
timeb.h
vector
winsock2.h
string
iostream
ws2tcpip.h
bitset
fstream
sstream
FunctionGuard.h
SysDecls.h
Clock.h
Q1Link.h
Q2Link.h
SysTypes.h
Algorithms.h
Debug.h
std::bitset<unsigned int>
RegCell.h
Formatters.h
Exception.h
std::unique_ptr<std::basic_ostringstream>
Memory.h
//
// [many lines deleted]
//
Cxx.cpp
CxxRoot.cpp
std::unique_ptr<CodeTools::CxxStrLiteral<char,std::basic_string<char,std::char_traits<char>,
std::allocator<char>>,CodeTools::Cxx::Encoding::ASCII>>
std::vector<std::unique_ptr<CodeTools::CxxStrLiteral<char,std::basic_string<char,
std::char_traits<char>,std::allocator<char>>,CodeTools::Cxx::Encoding::ASCII>>>
std::move<std::unique_ptr<CodeTools::CxxStrLiteral<char,std::basic_string<char,
std::char_traits<char>,std::allocator<char>>,CodeTools::Cxx::Encoding::ASCII>>>
std::vector<std::unique_ptr<CodeTools::Macro>>
std::move<std::unique_ptr<CodeTools::Macro>>
std::unique_ptr<CodeTools::Define>
std::move<std::unique_ptr<CodeTools::Define>>
CodeTools::DisplayObjects<std::unique_ptr<CodeTools::Macro>>
std::iterator_t<const std::unique_ptr<CodeTools::Macro>>
NodeBase::Singleton<CodeTools::ParserTraceTool>
Parser.cpp
CodeTools::CxxCharLiteral<char,CodeTools::Cxx::Encoding::ASCII>
CodeTools::CxxCharLiteral<char16_t,CodeTools::Cxx::Encoding::U16>
CodeTools::CxxCharLiteral<char32_t,CodeTools::Cxx::Encoding::U32>
CodeTools::CxxCharLiteral<wchar_t,CodeTools::Cxx::Encoding::WIDE>
std::iterator_t<CodeTools::Cxx::Keyword>
std::iterator_t<const CodeTools::Cxx::Keyword>
std::unique_ptr<CodeTools::StringLiteral>
std::move<std::unique_ptr<CodeTools::StringLiteral>>
std::unique_ptr<CodeTools::Elif>
std::move<std::unique_ptr<CodeTools::Elif>>
std::unique_ptr<CodeTools::Else>
std::move<std::unique_ptr<CodeTools::Else>>
std::unique_ptr<CodeTools::Endif>
std::move<std::unique_ptr<CodeTools::Endif>>
std::unique_ptr<CodeTools::Error>
std::unique_ptr<CodeTools::Iff>
std::move<std::unique_ptr<CodeTools::Iff>>
std::unique_ptr<CodeTools::Ifdef>
std::move<std::unique_ptr<CodeTools::Ifdef>>
std::unique_ptr<CodeTools::Ifndef>
std::move<std::unique_ptr<CodeTools::Ifndef>>
std::unique_ptr<CodeTools::Line>
std::unique_ptr<CodeTools::Pragma>
std::unique_ptr<CodeTools::Undef>
std::move<std::unique_ptr<CodeTools::Undef>>
Total=225, failed=0




[/SHOWTOGROUPS]
 
[SHOWTOGROUPS=4,20]
As each file is parsed, its name is displayed. Template instantiations are indented (and indented further, when one template causes the instantiation of another).
The first RSC file to be parsed is FunctionGuard.h. The files that precede it are either from the standard library or Windows. However, they are not the actual instances of those files. Rather, they are taken from the subs directory, which contains simplified versions of them. These versions avoid the need to
  • >import files that are external to the project from a wide range of directories
  • #define all the names that would be needed to correctly navigate all the #ifdefs in external files
  • support C++ language features used by external files but not by the project
  • parse lots of things that the project doesn't use
Consequently, before you can >parse your own project, you must ensure that the subs directory contains a stand-in for each external header that your project #includes, and that each stand-in declares the items that you use from it. Note that in the case of templates, subs headers do not need to provide function definitions.
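
As a rough illustration of what such a stand-in might look like, the sketch below declares a class template from an imaginary external library. The extlib namespace, the Buffer template, and the file name in the comment are all invented for this example and are not taken from RSC's subs directory. The point is only that the member functions are declared but never defined.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical stand-in for an external header, e.g. subs/extlib_buffer.h.
// Only the items that the project actually uses are declared. Because
// Buffer is a template, its member functions are declared but not defined.
namespace extlib
{
   template<typename T> class Buffer
   {
   public:
      Buffer();
      ~Buffer();
      void push_back(const T& item);
      std::size_t size() const;
      const T& at(std::size_t index) const;
   };
}

// Project code can still name the type and be checked against the
// declarations, even though nothing here could be linked and run.
using IntBuffer = extlib::Buffer<int>;
static_assert(std::is_class<IntBuffer>::value, "Buffer<int> is a class type");
```
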

Performing a Code Inspection
Now that all of the code has been parsed, it can be checked for violations of design guidelines:
Code:
>check rsc $files

This produces a .check file, which contains all of the warnings that were found. Basic documentation for each of the ~120 warnings that >check can produce is provided in a separate file.

If >check is run on a subset of the code, it will first >parse any unparsed code that would be needed in a successful build. This avoids false positives, such as warnings that a function is not defined or is unused.

Before merging into the master branch, I usually run >check on all of the code and use the diff tool in VS2017's GitHub plug-in to see if any new warnings have arisen since the last merge.

At present, the only way to suppress a warning is to modify a function in the tool's own source code.

Because headers in the subs directory do not provide function implementations for templates, >check can erroneously recommend things such as
  • removing an #include that is needed to make a destructor visible to a unique_ptr template instance
  • declaring a data member const even though it is inserted in a set and must therefore allow std::move
  • removing most of the things in Allocators.h (which is only invoked from the STL, not from within RSC)
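
The first of these false positives is easier to see with a concrete case. In the sketch below, which uses the hypothetical classes Car and Engine and folds three imaginary files into one, the code standing in for Car.cpp never names an Engine member. It still needs Engine's definition, because destroying the unique_ptr member invokes Engine's destructor from inside the unique_ptr template. A tool that cannot see the template's function definitions would presumably conclude that the #include is unused.

```cpp
#include <memory>
#include <utility>

// --- Car.h (hypothetical) ---
class Engine;  // forward declaration: Engine is incomplete here

class Car
{
public:
   explicit Car(std::unique_ptr<Engine> engine);
   ~Car();  // declared here, defined where Engine is complete
   bool HasEngine() const;
private:
   std::unique_ptr<Engine> engine_;
};

// --- Engine.h (hypothetical) ---
class Engine
{
public:
   ~Engine() { ++destroyed; }
   static int destroyed;  // counts destructor calls, for demonstration only
};

int Engine::destroyed = 0;

// --- Car.cpp (hypothetical), which must #include "Engine.h" ---
// No Engine member is named below, yet unique_ptr<Engine> must be able
// to call Engine's destructor when a Car is constructed or destroyed.
Car::Car(std::unique_ptr<Engine> engine) : engine_(std::move(engine)) { }
Car::~Car() = default;
bool Car::HasEngine() const { return engine_ != nullptr; }
```
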
Applying the Recommendations
The >fix command is currently able to resolve about half of the warnings:
Code:
fix : Interactively fixes warnings detected by >check.
(0:123) : warning number from Wnnn (0 = all warnings)
(t|f) : prompt before fixing?
<str> : a set of code files

For example, the following modifies all code files by deleting unnecessary #include directives, which is warning W018:
Code:
>fix 18 f $files

To select which occurrences of a warning to fix, ask to be prompted. For example,
Code:
>fix 53 t $files

will prompt before fixing each occurrence of warning W053, "Data could be const".

Warning: Before using >fix, be sure that you can recover the original version of the file if something goes wrong. It works on RSC's code, but that doesn't mean it's been thoroughly tested!

Exporting the Library
After the code has been parsed, the >export command can generate any combination of the following files:
  • One file displays parsed code in a standard format and includes
    • the underlying type for each auto variable;
    • the number of times each item was
      • referenced,
      • initialized, read, or written (for data),
      • called (for functions); and
    • the file in which each item was defined (for data and functions).
  • A second file lists the external symbols used within each file, as well as the recommendations for which #include directives, using statements, and forward declarations the file should add or remove. Those recommendations also appear as warnings in the .check file.
  • A third file contains a global cross-reference (each symbol, followed by a list of the files that use it, along with the line numbers where the symbol appears).
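
The cross-reference described in the last item maps naturally onto nested containers. The sketch below is only an illustration of that shape: the CrossReference type, the AddReference and Format functions, and the output layout are assumptions, and the file that >export actually writes may be organized differently.

```cpp
#include <map>
#include <set>
#include <sstream>
#include <string>

// Hypothetical structure: symbol -> file -> line numbers where it appears.
using CrossReference =
   std::map<std::string, std::map<std::string, std::set<int>>>;

// Records that a symbol appears in a file on a given line.
void AddReference(CrossReference& xref, const std::string& symbol,
   const std::string& file, int line)
{
   xref[symbol][file].insert(line);
}

// Writes each symbol, followed by the files that use it and the line
// numbers where it appears. The layout is invented for this example.
std::string Format(const CrossReference& xref)
{
   std::ostringstream out;

   for(const auto& symbol : xref)
   {
      out << symbol.first << '\n';

      for(const auto& file : symbol.second)
      {
         out << "   " << file.first << ':';
         for(int line : file.second) out << ' ' << line;
         out << '\n';
      }
   }

   return out.str();
}
```
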
Digging Deeper

Library Variables and Operators

Many of the CLI commands in the ct directory take an expression as their last parameter. This article only used $files, but an expression can contain both variables and operators. The user defines a variable with the >assign command, and the library also provides the following variables, which cannot be modified directly:

Variable  Contents
$dirs     directories that have been added to the library by >import
$files    all code files (headers and implementations) found in $dirs
$hdrs     headers in $files
$cpps     implementations (.c*) in $files
$subs     headers that declare items which are external to the code base
$exts     headers that appear in an #include directive but whose directories were not added to the library by >import (which will cause >parse to fail)
$vars     all variables (those above, and any that the user has defined)

An expression is evaluated left to right, but parentheses can be used to override this. A variable is a set of either directories or files. The following notation is used in the expressions that appear below:

Set   Contents
<ds>  the name of a directory (as defined by >import) or a set of directories
<fs>  the name of a specific file or a set of files
<s>   a <ds> or an <fs>

Here is a table of basic operators. The Result column is what the operator returns, which becomes the input to commands such as >assign and >list. The Expression column specifies the type of parameter(s) that the operator expects.

Operator  Result  Expression     Semantics
|         <s>     <s1> | <s2>    set union of <s1> and <s2> (the '|' is optional)
&         <s>     <s1> & <s2>    set intersection of <s1> and <s2>
-         <s>     <s1> - <s2>    set difference between <s1> and <s2>
f         <fs>    f <ds>         the files in <ds>
d         <ds>    d <fs>         the directories in <fs>
fn        <fs>    <fs> fn <str>  files in <fs> with the file name <str>*
ft        <fs>    <fs> ft <str>  files in <fs> with the file type *.<str>
ms        <fs>    <fs> ms <str>  files in <fs> that contain <str>
in        <fs>    <fs> in <ds>   files in <fs> whose directory is in <ds>
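
The set operators behave like ordinary set algebra over file names. The sketch below mimics |, &, - and ft with standard library algorithms. It is not RSC's implementation: FileSet and the four function names are invented here, and a file set is assumed to be nothing more than a sorted set of names.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>

// Hypothetical model of an <fs>: a sorted set of file names.
using FileSet = std::set<std::string>;

// <s1> | <s2>
FileSet Union(const FileSet& a, const FileSet& b)
{
   FileSet result;
   std::set_union(a.begin(), a.end(), b.begin(), b.end(),
      std::inserter(result, result.end()));
   return result;
}

// <s1> & <s2>
FileSet Intersection(const FileSet& a, const FileSet& b)
{
   FileSet result;
   std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
      std::inserter(result, result.end()));
   return result;
}

// <s1> - <s2>
FileSet Difference(const FileSet& a, const FileSet& b)
{
   FileSet result;
   std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
      std::inserter(result, result.end()));
   return result;
}

// <fs> ft <str>: keeps the files whose names end in ".<str>"
FileSet FileType(const FileSet& fs, const std::string& ext)
{
   const std::string suffix = "." + ext;
   FileSet result;

   for(const auto& file : fs)
   {
      if((file.size() > suffix.size()) &&
         (file.compare(file.size() - suffix.size(), suffix.size(), suffix) == 0))
      {
         result.insert(file);
      }
   }

   return result;
}
```

Because every operator returns another set, results can be chained in the same way that a CLI expression combines them.
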

The following operators can also be used on a set of code files:

Expression  Operator Name     Semantics
us <fs>     users             files that #include any in <fs>
ub <fs>     used by           files that any in <fs> #include
as <fs>     affecters         ub <fs>, transitively
ab <fs>     affected by       us <fs>, transitively
ca <fs>     common affecters  (as f1) & (as f2) & … (as fn), where f1…fn are the files in <fs>
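
The relationship between the direct operators (us, ub) and the transitive ones (as, ab) can be sketched as a closure over an #include graph. Everything named below (IncludeGraph, Users, UsedBy, Closure, AffectedBy, Affecters) is hypothetical, and the sketch makes one choice that the table does not settle: the starting files are left out of a transitive result unless an #include cycle leads back to them.

```cpp
#include <map>
#include <set>
#include <string>

using FileSet = std::set<std::string>;

// Hypothetical graph: file -> the files that it directly #includes.
using IncludeGraph = std::map<std::string, FileSet>;

// us <fs>: files that #include any file in fs
FileSet Users(const IncludeGraph& graph, const FileSet& fs)
{
   FileSet result;

   for(const auto& entry : graph)
   {
      for(const auto& included : entry.second)
      {
         if(fs.count(included) != 0)
         {
            result.insert(entry.first);
            break;
         }
      }
   }

   return result;
}

// ub <fs>: files that any file in fs #includes
FileSet UsedBy(const IncludeGraph& graph, const FileSet& fs)
{
   FileSet result;

   for(const auto& file : fs)
   {
      auto it = graph.find(file);
      if(it != graph.end())
         result.insert(it->second.begin(), it->second.end());
   }

   return result;
}

// Applies a direct operator repeatedly until no new files turn up.
template<typename Step>
FileSet Closure(const IncludeGraph& graph, const FileSet& fs, Step step)
{
   FileSet result;
   FileSet frontier = fs;

   while(!frontier.empty())
   {
      FileSet next;

      for(const auto& file : step(graph, frontier))
      {
         if(result.insert(file).second) next.insert(file);
      }

      frontier = next;
   }

   return result;
}

// ab <fs>: us, transitively
FileSet AffectedBy(const IncludeGraph& graph, const FileSet& fs)
{
   return Closure(graph, fs, Users);
}

// as <fs>: ub, transitively
FileSet Affecters(const IncludeGraph& graph, const FileSet& fs)
{
   return Closure(graph, fs, UsedBy);
}
```
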

After the code has been parsed, the following operators can also be used on a set of code files:

Expression  Operator Name  Semantics
im <fs>     implements     for each item declared (defined) in <fs>, add the file that defines (declares) it
ns <fs>     needers        files that also need <fs> in a build (im ab <fs>, transitively)
nb <fs>     needed by      files that <fs> also needs in a build (im as <fs>, transitively)




[/SHOWTOGROUPS]
 

[SHOWTOGROUPS=4,20]
These operators can help to analyze dependencies among code files. For example:
Code:
>import sbase "sb" // add SessionBase files to the library
>type us Thread.h // show all files that #include Thread.h
>assign h1 f sbase ft cpp // h1 = all SessionBase implementations
>assign c1 ab Thread.h // c1 = files that could be affected by changing Thread.h
>assign s1 h1 & c1 // s1 = SessionBase .cpps that could be affected by changing Thread.h

What to #include
Interactions exist among the warnings for adding and removing #include directives, using statements, and forward declarations. CodeFile::Trim generates these warnings. Its basic rules are
  • Always #include something if nothing guarantees that it will be visible transitively.
  • Don't #include something that will definitely be visible transitively. It is necessary to #include a base class, as well as a class that is used directly. However, it is not necessary to #include their base classes, even when using something declared in one of those transitive base classes. Similarly, it is not necessary for a .cpp to #include anything that its header will #include.
  • If a class is only used indirectly (i.e., as a pointer or reference type), don't #include it. Use a forward declaration instead. If there is no guarantee that one will be visible transitively, add one to this file.
  • A header should not contain a using directive or declaration. It is therefore told to remove it, and any .cpp that relies on it is told to add it.
  • If an #include, forward declaration, or using statement is not needed to resolve a symbol, remove it.
All of these warnings can be resolved by >fix, which will, for example, insert a forward declaration in the correct namespace and fully qualify symbols from another namespace when removing a using statement.
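
The rule about indirect usage is the one that most often changes a header. The sketch below folds three imaginary files (Widget.h, Logger.h, Widget.cpp) into a single listing. The class names are hypothetical, and the comments mark where each #include or forward declaration would go under the rules above.

```cpp
#include <string>

// --- Widget.h (hypothetical) ---
// Logger appears only as a reference and a pointer (indirect usage), so
// a forward declaration is enough. Widget.h need not #include Logger.h.
class Logger;

class Widget
{
public:
   explicit Widget(Logger& log);
   void Draw();
private:
   Logger* log_;
};

// --- Logger.h (hypothetical) ---
class Logger
{
public:
   void Write(const std::string& text) { last_ = text; ++lines_; }
   int Lines() const { return lines_; }
   const std::string& Last() const { return last_; }
private:
   std::string last_;
   int lines_ = 0;
};

// --- Widget.cpp (hypothetical) ---
// Draw calls a Logger member, which is direct usage, so this is the
// file in which #include "Logger.h" belongs.
Widget::Widget(Logger& log) : log_(&log) { }

void Widget::Draw() { log_->Write("Widget drawn"); }
```
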

High-Level Design
The parser is implemented using recursive descent, which makes its code easy to read and modify. The advent of unique_ptr was a godsend to these types of parsers, which were previously cursed by the need to delete objects when backing up. Placing each of these objects in a unique_ptr allows the parser to back up without having to write any code to delete them.
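
The benefit of unique_ptr when backing up shows even in a toy grammar. The following sketch is not RSC's parser: MiniParser, Node, Number, and Sum are invented, and the grammar (expr := sum | number, sum := number '+' expr) exists only to force a failed attempt that must be undone.

```cpp
#include <cctype>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

struct Node
{
   virtual ~Node() = default;
   virtual int Eval() const = 0;
};

struct Number : Node
{
   explicit Number(int v) : value(v) { }
   int Eval() const override { return value; }
   int value;
};

struct Sum : Node
{
   Sum(std::unique_ptr<Node> l, std::unique_ptr<Node> r)
      : lhs(std::move(l)), rhs(std::move(r)) { }
   int Eval() const override { return lhs->Eval() + rhs->Eval(); }
   std::unique_ptr<Node> lhs;
   std::unique_ptr<Node> rhs;
};

class MiniParser
{
public:
   explicit MiniParser(const std::string& text) : text_(text), pos_(0) { }

   // expr := sum | number
   std::unique_ptr<Node> ParseExpr()
   {
      const std::size_t start = pos_;
      if(auto sum = ParseSum()) return sum;
      pos_ = start;  // back up: nodes built by ParseSum are already freed
      return ParseNumber();
   }

   bool AtEnd() const { return pos_ == text_.size(); }
private:
   // sum := number '+' expr
   std::unique_ptr<Node> ParseSum()
   {
      auto lhs = ParseNumber();
      if(!lhs) return nullptr;
      if((pos_ >= text_.size()) || (text_[pos_] != '+'))
         return nullptr;  // lhs is deleted automatically on this path
      ++pos_;
      auto rhs = ParseExpr();
      if(!rhs) return nullptr;
      return std::unique_ptr<Node>(new Sum(std::move(lhs), std::move(rhs)));
   }

   // number := digit+
   std::unique_ptr<Node> ParseNumber()
   {
      const std::size_t start = pos_;
      int value = 0;

      while((pos_ < text_.size()) &&
         std::isdigit(static_cast<unsigned char>(text_[pos_])))
      {
         value = (value * 10) + (text_[pos_] - '0');
         ++pos_;
      }

      if(pos_ == start) return nullptr;
      return std::unique_ptr<Node>(new Number(value));
   }

   std::string text_;
   std::size_t pos_;
};
```

Every early return in ParseSum abandons a partially built tree, and none of them contains cleanup code. With raw pointers, each of those paths would need a matching delete.
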

The parser does not check everything in the same way that a full parser must. It assumes that the code correctly compiles and links, so it only contains enough checks to produce a correct parse. Its grammar, which is informally documented in the relevant functions, is also far simpler than a complete C++ grammar.

As each code file is read in during >import, #include relationships are noted. This allows a global compile order to be calculated. The only other preprocessing that occurs before parsing is to erase, within C++ code, any macro name that is defined as an empty string. Currently, the only such name is NO_OP, which RSC uses before a bare semicolon when a for statement is missing a parenthesized statement.
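
The article does not say how the compile order is calculated. One common way to derive such an order from #include relationships is a depth-first, post-order walk, so that each file appears after everything it #includes. The sketch below shows that technique with hypothetical names (IncludeGraph, Visit, CompileOrder) and makes no claim to match RSC's algorithm.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical graph: file -> the files that it directly #includes.
using IncludeGraph = std::map<std::string, std::set<std::string>>;

// Visits everything that a file #includes before adding the file itself.
// Marking a file as done on entry also stops the walk at #include cycles.
void Visit(const IncludeGraph& graph, const std::string& file,
   std::set<std::string>& done, std::vector<std::string>& order)
{
   if(!done.insert(file).second) return;

   auto it = graph.find(file);

   if(it != graph.end())
   {
      for(const auto& included : it->second)
      {
         Visit(graph, included, done, order);
      }
   }

   order.push_back(file);
}

// Returns the files in an order where each one follows its #includes.
std::vector<std::string> CompileOrder(const IncludeGraph& graph)
{
   std::set<std::string> done;
   std::vector<std::string> order;

   for(const auto& entry : graph)
   {
      Visit(graph, entry.first, done, order);
   }

   return order;
}
```
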

Once this simple preprocessing is complete, all of the code is parsed together, in a single pass. After an item is parsed, it is added to the scope (namespace, class, function, or code block) in which it appears, and its virtual EnterScope function is invoked. After each function is parsed, it is "executed" by invoking its virtual EnterBlock function. An item's EnterScope or EnterBlock function also invokes the same function on each of its constituent parts.

Some of the warnings generated by >check are detected during >import, some are detected during >parse, and some are detected during >check itself, through the virtual function Check. CodeFile::Trim, mentioned in the previous section, uses the virtual function GetUsages to obtain, from all of its file's C++ entities, the symbols that are used (a) as base classes, (b) directly, and (c) indirectly, as well as those that were resolved by (d) forward declarations, (e) friend declarations, and (f) using statements.

Performance
The time required to >parse and >check all of RSC's code is similar to the time required for a complete build using Microsoft's C++ compiler. This isn't a true apples-to-apples comparison because >parse doesn't lay out memory or generate object code, but its time is also that for a debug, not a release, build. And >parse doesn't use more than one core, whereas Microsoft's compiler uses two when possible (at least on my quad core).

RSC contains about 127K lines of source if you exclude blanks, comments, and left braces. When RSC starts up with its default configuration file under Win32, it grows to about 48MB of memory, which could be significantly reduced by changing various configuration parameters. After executing >parse, >check, and >export, it has grown by about another 315MB.

The tools don't generate any intermediate or scratch files; everything is kept in memory. Using files would be a significant change, so the amount of available memory ultimately limits the size of the code library that the tools can accommodate. But my guess is that anyone with that much code could also provide a machine with enough memory—or simply purchase a commercial equivalent of the tool.

List of Code Files
The ct directory contains all of the code. If you want to dive into it, here's a summary of the files in that directory:

File            Description
CodeCoverage    code coverage tool (not discussed in this article)
CodeDir         a directory that contains source code
CodeDirSet      a set of code directories
CodeFile        a file that contains source code
CodeFileSet     a set of code files
CodeSet         base class for CodeDirSet and CodeFileSet
CodeTypes       types for parsing and static analysis
CtIncrement     CLI commands applicable to the ct directory
CtModule        initialization of ct directory
Cxx             types for C++
CxxArea         namespaces, classes, and class template instances
CxxCharLiteral  character literals
CxxDirective    preprocessor directives
CxxExecute      for tracking code parsing and "execution"
CxxFwd          forward declarations
CxxNamed        low-level named C++ items
CxxRoot         global namespace and built-in terminals
CxxScope        code blocks, data items, and functions
CxxScoped       arguments, base classes, enums, enumerators, forwards, friends, terminals, typedefs, usings
CxxStatement    statements used in functions
CxxStrLiteral   string literals
CxxString       string utilities
CxxSymbols      parser symbol tables
CxxToken        low-level unnamed C++ items
Editor          source code editor for >fix command
Interpreter     interprets expressions (in CLI commands) that manipulate instances of LibrarySet subclasses
Lexer           lexical analysis utilities
Library         code files, code directories, and CLI symbols
LibraryErrSet   invoked when a CLI command does not apply to a set
LibraryItem     base class for CodeDir, CodeFile, and LibrarySet
LibrarySet      base class for CodeSet, LibraryErrSet, and LibraryVarSet (sets of items to which CLI commands can be applied)
LibraryTypes    types for code library
LibraryVarSet   built-in or user-defined library variables
Parser          parser for C++ source code
SetOperations   difference, intersection, and union operators for instances of LibrarySet

Notes
1 While preparing this article, I checked to see if anything had changed. Google's project eventually gained traction and is now on GitHub. They took the approach of building on Clang, and they say that they're currently "alpha" quality and that changes to Clang sometimes break them.

2 The file help.cli has a .txt extension, which is omitted from file names in this article. This article's text files are attached in txt_files.zip, but its links access their most recent versions on GitHub.

3 A code file is assumed to be any file with no extension (e.g., <string>) or a .h, .c, .hpp, .cpp, .hxx, .xxx, .hh, .cc, .h++, or .c++ extension. This is hard-coded in the tool's source.
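
A check along the lines of note 3 could look like the sketch below. IsCodeFile is a hypothetical name rather than RSC's function, and the argument is assumed to be a bare file name with no directory part. The sketch also uses cxx where the note lists .xxx, on the assumption that the counterpart of .hxx was intended. That substitution is a guess.

```cpp
#include <set>
#include <string>

// Returns true if a bare file name looks like a code file: either it has
// no extension (e.g. the standard header "string") or its extension is
// in the list from note 3. Assumption: "cxx" stands in for ".xxx".
bool IsCodeFile(const std::string& name)
{
   static const std::set<std::string> extensions =
   {
      "h", "c", "hpp", "cpp", "hxx", "cxx", "hh", "cc", "h++", "c++"
   };

   const auto dot = name.rfind('.');
   if(dot == std::string::npos) return true;  // no extension
   return extensions.count(name.substr(dot + 1)) != 0;
}
```
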

History
  • 4th October, 2019: Initial version

License
This article, along with any associated source code and files, is licensed under The GNU General Public License (GPLv3)

[/SHOWTOGROUPS]
 