
Support for incremental scan results #198

Closed
katrinaoneil opened this issue Jul 12, 2018 · 13 comments
Labels
propose-to-close No activity from the TC; no "energy" around it; probably not important; or topic for future version.

Comments

@katrinaoneil

The spec does not currently support incremental scan results (delta results)

@michaelcfanning
Contributor

As per TC#20, this feedback potentially entails technical changes and requires further discussion. SARIF currently provides 'baselineState' and other properties related to diffing two complete SARIF log files. We should explore both what it means to perform an incremental scan and how to merge incremental scan results into a base SARIF file.

We have not yet identified a potential CSD.2 change on this topic.

@katrinaoneil
Author

By incremental scan, I meant an analysis run that is not a complete re-run of a previous scan, but rather a run that re-analyzes only a piece of a project that was scanned before. For example, assume we have a project Foo. It gets analyzed for the first time, and a full scan of the project is performed, generating a full set of analysis results; this is what I would consider the 'baselineState'. Now, say I fixed a bunch of bugs and decided to do a complete re-scan of the project, generating another full set of results, 'rescanState'. The next day, though, I only fixed one bug and decided to re-analyze only the part of the project affected by my bug fix, generating a subset of results, 'incrementalState'. The difference between 'rescanState' and 'incrementalState' is that in the first case the entire project was re-analyzed, while in the second only part of the project was analyzed. This means that the way we would merge results with 'baselineState' would differ: in the first case all of the results would be merged, while in the second only the results in the areas of the project that were re-analyzed would be merged.
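A rough sketch of the two merge strategies in Python, using made-up result shapes (plain dicts with a file path and a stable fingerprint) rather than real SARIF objects:

```python
# Illustrative sketch only: results are modeled as plain dicts with a "file"
# and a stable "fingerprint"; these shapes are not SARIF schema objects.

def merge_full_rescan(baseline, rescan):
    """A full re-scan supersedes the baseline entirely."""
    return list(rescan)

def merge_incremental(baseline, incremental, reanalyzed_files):
    """Keep baseline results for files that were NOT re-analyzed;
    take the incremental results for files that were."""
    kept = [r for r in baseline if r["file"] not in reanalyzed_files]
    return kept + list(incremental)

baseline = [
    {"file": "src/a.c", "fingerprint": "A1"},
    {"file": "src/b.c", "fingerprint": "B1"},
]
incremental = [{"file": "src/a.c", "fingerprint": "A2"}]  # only a.c was re-analyzed

print(merge_incremental(baseline, incremental, {"src/a.c"}))
# b.c's baseline result survives; a.c's results are replaced by the new ones
```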

@katrinaoneil
Author

Fortify has been implementing incremental analysis incrementally :) Support for two analyzers has been available for a couple of years (since 16.2 release). Fortify's incremental scan will generate a results file that contains all of the issues, as if they were merged between two full scans. Each issue should contain a field for recording which scans it was found in, so that one could merge a chain of results files and know exactly when each issue appeared and disappeared.
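Conceptually, such a merged file could look something like the sketch below; this is a made-up shape for illustration, not SARIF and not Fortify's actual format:

```python
# Hypothetical merged-results shape: each issue records the scan ids it was
# observed in, so a chain of such files tells you when it appeared and disappeared.
merged_results = [
    {"id": "ISSUE-1", "seen_in_scans": ["scan-1", "scan-2", "scan-3"]},
    {"id": "ISSUE-2", "seen_in_scans": ["scan-1", "scan-2"]},  # gone in scan-3
    {"id": "ISSUE-3", "seen_in_scans": ["scan-3"]},            # new in scan-3
]
```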

@ghost

ghost commented Aug 18, 2018

In TC #21, we discussed adding additional properties that would make it easier to determine whether two issues detected in successive runs are in fact logically the same. I don't see how these properties support incremental scanning per se, but I record the ideas here anyway:

SARIF already defines the following relevant properties (a sketch of how they fit together follows the list):

  • run.baselineInstanceGuid, which identifies the run that is to be considered the baseline for the current run.

  • result.correlationGuid, which allows "logically identical" results to be grouped together.

  • result.baselineState, which specifies whether this result appeared in the baseline run, whether it is new in the current run, or whether it disappeared between the baseline run and the current run. (Obviously an analysis tool wouldn't detect a result that disappeared, but a post-processor might synthesize such a result so that a user could see all the differences from the baseline run by examining a single log file.)

  • file.roles, an enumeration that includes values such as "modifiedFile", "addedFile", and -- most relevant here -- "renamedFile".
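A minimal sketch of how these existing properties might appear together, written as a plain Python dict; the property names follow the draft discussed here, and the exact schema and allowed values are defined by the spec, not by this sketch:

```python
# Draft-era property names as discussed above, shown purely for illustration;
# consult the SARIF spec text for the authoritative schema and allowed values.
run_against_baseline = {
    "baselineInstanceGuid": "00000000-0000-0000-0000-00000000000b",  # the baseline run
    "files": {
        "src/new_name.c": {"roles": ["renamedFile"]},
    },
    "results": [
        {
            "correlationGuid": "11111111-1111-1111-1111-111111111111",
            "baselineState": "existing",  # e.g. existing / new / absent
        },
        {
            "correlationGuid": "22222222-2222-2222-2222-222222222222",
            "baselineState": "new",
        },
    ],
}
```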

SARIF could define the following additional properties, to ease the task of deciding whether two results are logically identical:

  • logicalLocation.previousName, which specifies the name by which this programmatic construct was identified in the baseline run. Since the run.logicalLocations dictionary is required to contain entries for all the ancestors of a nested construct such as Namespace1.Namespace2.Class1.Method1, this single property would make it possible to identify renames at any level (for example, the renaming of a method, a class, or an entire namespace).

  • file.previousUri, which would be used in the run.files dictionary to identify the previous location of a file that has been moved since the baseline run. Presumably a file object that specified previousUri would also include "renamedFile" in its roles property.

These properties would allow both physical and logical renames to be tracked from run to run. As we discussed in the TC, an analysis tool would be unlikely to populate them, but a post-processor that understood both the version control system (for physical renames) and the programming language (for logical renames) might be able to deduce them.
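A similar sketch of the two proposed properties; nothing here is adopted spec text, and the names and paths are invented for illustration:

```python
# The two proposed (not adopted at this point) properties, again as an
# illustrative dict rather than real schema text.
proposed_rename_tracking = {
    "files": {
        "src/new/path/foo.c": {
            "roles": ["renamedFile"],
            "previousUri": "src/old/path/foo.c",  # proposed: where the file used to live
        },
    },
    "logicalLocations": {
        "Namespace1.Namespace2.Class1.Method1": {
            "previousName": "Namespace1.Namespace2.Class1.OldMethodName",  # proposed
        },
    },
}
```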

@katrinaoneil
Author

I don't think SARIF should keep track of the changes to the source code, but what I think would be useful (I mentioned it in the last two minutes of TC 21) is having a property on every result that lists the scans this result was found in. That way a post-processor has enough information to figure out in which scan each result has been detected for the first time, and then in which it went away.
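A sketch of the post-processing this would enable, assuming a hypothetical per-result list of scan ids (called seen_in_scans here; the name is invented):

```python
# "seen_in_scans" is the hypothetical per-result property proposed above;
# "all_scans" is the ordered chain of scan ids known to the consumer.
def first_and_last_seen(result, all_scans):
    seen = [s for s in all_scans if s in result["seen_in_scans"]]
    if not seen:
        return None, None
    return seen[0], seen[-1]  # first detected in, last detected in

all_scans = ["scan-1", "scan-2", "scan-3"]
result = {"id": "ISSUE-2", "seen_in_scans": ["scan-1", "scan-2"]}
print(first_and_last_seen(result, all_scans))  # ('scan-1', 'scan-2'): gone by scan-3
```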

@sthagen
Contributor

sthagen commented Aug 18, 2018

@katrinaoneil scan-instance tagging by itself makes sense to me 👍.

  1. I still have a hard time, though, imagining when exactly a merge would be useful, i.e. trust-building for me as a developer. On one hand, I think the use case is when nicely decoupled sources clearly exhibit no path from the changed code to code in the other compartments. On the other hand, scenarios come to mind where (unchanged) call sites become death traps through the "fixed" bugs on the caller sites; but that is an implementation and usage issue, not one that impacts SARIF as a container format IMO.
    But if an n-tuple per finding is all we need (I assume represented as a JSON array), in which the consumer of the file expects reference ids of scan instances, then enabling implementers to use SARIF for incremental scan storage might be useful and doable IMO.

  2. Question: do we care about when and how often a "field" in SARIF is written by producers? Say my tool scans foo.bar (a baz-language source file) one thousand and one times: is that the multiplicity / tuple length a client then has to expect for every finding 😄?

@ghost

ghost commented Aug 19, 2018

@katrinaoneil Could you clarify how an array of "runs in which this result occurred" would help in an "incremental scan" scenario?

@katrinaoneil
Author

katrinaoneil commented Aug 21, 2018

@sdrees Sorry, I don't think I fully understand your question; perhaps we can discuss it during the next call. What do you mean by a "field"? If foo.bar is scanned 1001 times and generates 1000 identical results on the first 1000 scans, and the same 1000 results plus one additional result on the last scan, then the scan id arrays for those 1000 results would contain 1001 elements (one per scan), while the scan id array for the new result would contain a single element.

@katrinaoneil
Author

@lgolding I guess it all depends on what we mean by "support for incremental scan" in the context of SARIF. To be honest, I was not sure what it should mean when I filed the issue, but the more we talk about it, the more convinced I get that SARIF should not be tracking source code changes (the job of a source control system), figuring out what to scan given the source changes made (the job of the analysis engine), or attempting to map new/removed/existing issues to each other (the job of the post-processor). However, the format should allow the results to contain enough information for the post-processor to differentiate between full scans and incremental scans. If each result in SARIF contained an array of the scan ids it was detected in, this would be possible.
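For illustration only, one hypothetical way to carry that information without new first-class properties would be SARIF's general-purpose properties bags; the key names below are invented, not defined by the spec:

```python
# Entirely hypothetical: mark a run as incremental and record per-result scan
# history inside "properties" bags; the keys are invented for this sketch.
incremental_run = {
    "properties": {
        "incrementalScan": True,
        "analyzedFiles": ["src/a.c"],  # the subset that was actually re-analyzed
    },
    "results": [
        {"ruleId": "EX0001", "properties": {"seenInScans": ["scan-1", "scan-4"]}},
    ],
}

def is_incremental(run):
    # A consumer could pick its merge strategy based on this flag.
    return bool(run.get("properties", {}).get("incrementalScan"))

print(is_incremental(incremental_run))  # True
```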

@sthagen
Contributor

sthagen commented Aug 21, 2018

@katrinaoneil sure, but my second point boils down to whether we historicize or not: if only the latest scan id is noted per finding, 👍 no problem. Appending scan ids per finding would not look sound to me.

@michaelcfanning michaelcfanning added the propose-to-close No activity from the TC; no "energy" around it; probably not important; or topic for future version. label Nov 27, 2018
@ghost

ghost commented Nov 29, 2018

@katrinaoneil I agree with @michaelcfanning that a full set of scan ids might be overdoing it. I think we hit the sweet spot with result.resultProvenance, which provides the range of times and runs during which the result was present.
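For reference, a sketch of the kind of information resultProvenance carries; the inner property names reflect my reading of the draft and may differ from the final spec text:

```python
# Illustrative only; see the spec for the authoritative resultProvenance schema.
result_with_provenance = {
    "ruleId": "EX0001",
    "resultProvenance": {
        "firstDetectionTimeUtc": "2018-08-01T12:00:00Z",
        "lastDetectionTimeUtc": "2018-11-27T09:30:00Z",
        "firstDetectionRunGuid": "00000000-0000-0000-0000-000000000001",
        "lastDetectionRunGuid": "00000000-0000-0000-0000-000000000007",
    },
}
```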

@ghost ghost closed this as completed Nov 29, 2018
@ghost ghost removed the discussion-ongoing label Nov 30, 2018
@michaelcfanning
Contributor

As per offline discussion with @katrinaoneil, the important immediate goal for incremental scanning is to properly record the time of last detection (which is tracked in #287). As more specific proposals for incremental scanning scenarios emerge, we will file new issues to address them.

@katrinaoneil
Author

@lgolding, @michaelcfanning, agreed

This issue was closed.