Provide log deconstruction mechanism to enable more efficient read/write #210

Closed
michaelcfanning opened this issue Aug 2, 2018 · 7 comments
Labels
2.1.0-CSD.1 (Will be fixed in SARIF v2.1.0 CSD.1.), CSD.1-public-comment (Comment submitted by public during CSD.1 public comment period.), design-improvement, impact-breaks-consumers, resolved-fixed

Comments

@michaelcfanning
Contributor

michaelcfanning commented Aug 2, 2018

This issue was opened in response to CSD.1 feedback regarding efficient parsing of SARIF at scale. In TC#20, we discussed one possible solution: allowing SARIF producers/consumers to work with deconstructed SARIF files. The approach would be as follows:

  1. SARIF producers produce a master SARIF file that 'stitches' together all of the files that comprise the aggregate set of SARIF results plus additional data.
  2. For each run in the runs array, this master SARIF file would first emit all property singletons hanging off the run (i.e., all properties that consist of single, primitive values rather than objects/arrays).
  3. Next, the file would include references to all optional objects and arrays, except for the results table. The producer would be permitted to emit a URI that indirectly points to a separate location holding the data that constitutes the property, e.g., { "uriReference": "file:///e:/MySarif.resources.sarif" }.
  4. Finally, the log file could optionally contain an array of uriReferences comprising all results, or the file might express the results themselves.

One obvious problem with the above is that a SARIF property might consist of either, say, an array of results or an array of URIs. This pattern does not work well with SDKs produced for typical strongly typed languages such as C# or Java. For a performance-sensitive reader/writer, we foresee a likely need for a separate SDK/API for this scenario, in addition to a more convenient OM-hydrating SDK (which may attempt to process a file in its entirety).

These properties would be guaranteed to be at the head of the file:
instanceGuid
logicalId
description
baselineInstanceGuid
automationLocalId
architecture
originalUriBaseIds
defaultFileEncoding
columnKind
richMessageMimeType
redactionToken

These properties would allow for an indirection to a separate URI holding actual contents:
tool
invocations
conversion
versionControlDetails
files
logicalLocations
graphs
resources
properties

Finally, the results would appear at the end of the file.
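
As a rough sketch of the shape this proposal describes (written in Python for concreteness; the uriReference property, the side-car file names, and the property values are illustrative assumptions, not spec-defined), a producer might emit a master run along these lines:

import json

# Illustrative only: a deconstructed "master" run following the proposal
# above. Required log-level properties are omitted for brevity.
master_run = {
    # 1. Singleton, primitive-valued properties appear first.
    "instanceGuid": "00000000-0000-0000-0000-000000000001",
    "columnKind": "utf16CodeUnits",
    "defaultFileEncoding": "utf-8",

    # 2. Optional objects/arrays are indirections to side-car files.
    "tool": {"uriReference": "file:///e:/MySarif.tool.sarif"},
    "files": {"uriReference": "file:///e:/MySarif.files.sarif"},
    "resources": {"uriReference": "file:///e:/MySarif.resources.sarif"},

    # 3. Results come last: either inline, or an array of uriReferences.
    "results": [
        {"uriReference": "file:///e:/MySarif.results.0.sarif"},
        {"uriReference": "file:///e:/MySarif.results.1.sarif"},
    ],
}

with open("MySarif.master.sarif", "w", encoding="utf-8") as f:
    json.dump({"runs": [master_run]}, f, indent=2)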

@michaelcfanning
Contributor Author

Consider an efficient write using this mechanism: the SARIF producer establishes that it can reserve the required file names at the required locations, generates the 'header' of the master file, and then begins to append results to the file. The producer either accumulates data (such as the files table) and writes the corresponding side-car files on completion, or appends this data to the relevant side-car file as it appears.

On the consumption side, the consumer reads all data leading up to the results and loads side-car files (such as the files table) only if necessary.

Note that this approach provides for efficient processing and control over memory consumption, but it does not address the problem of sorting data to produce deterministically ordered output (e.g., results sorted according to some convention).
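
A minimal sketch of this append-style write, assuming the uriReference indirection above; the file names and the 'fileUris' field used to accumulate the files table are hypothetical, and error handling is omitted:

import json

def write_master(path, header_properties, results, files_table_path):
    # Sketch of the append-style write described above: emit the header
    # singletons and indirections first, then stream results one at a time,
    # accumulating the files table for a side-car written on completion.
    files_table = {}
    with open(path, "w", encoding="utf-8") as out:
        out.write('{"runs": [{\n')
        for name, value in header_properties.items():
            out.write(f'  {json.dumps(name)}: {json.dumps(value)},\n')
        out.write(f'  "files": {{"uriReference": {json.dumps(files_table_path)}}},\n')
        out.write('  "results": [\n')
        for i, result in enumerate(results):  # 'results' may be any iterator
            if i:
                out.write(',\n')
            out.write('    ' + json.dumps(result))
            for uri in result.get("fileUris", []):  # hypothetical field naming the files a result touches
                files_table.setdefault(uri, {"mimeType": "text/plain"})
        out.write('\n  ]\n}]}\n')
    with open(files_table_path, "w", encoding="utf-8") as sidecar:
        json.dump(files_table, sidecar, indent=2)

Because results are written as they are produced, peak memory is bounded by the accumulated side-car data rather than by the full result set.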

@michaelcfanning
Contributor Author

See #211 for an alternate proposal.

@michaelcfanning
Contributor Author

See #211 for an idea that avoids the need for SARIF properties that could be populated with either of two types. We could include a new property that explicitly allows URLs to be used to indirectly reference property contents. There are only nine of these potentially indirectly referenceable things, which isn't a large number.

@katrinaoneil

Something to keep in mind when constructing a SARIF log file from a number of files is the OS limit on open file descriptors.

@katrinaoneil

More comments from my colleague:

My personal preference for binary serialization formats is protocol buffers. Google uses protobuf everywhere so there's no risk of it going away anytime soon. Protobufs do require some centralized ownership of the schema to ensure that field names and IDs are not changed or reused. I don't have any measurement of speed or size comparisons, though protobufs should be faster and smaller due to the use of integer IDs rather than string field names in the serialized content.

I would advise considering some questions about SARIF's goals:

  • Is low-latency interactive viewing a goal, or is it OK for SARIF's role to be a batch interchange between utilities that will convert to their own internal proprietary data storage formats optimized for the utility? Personally, I might argue that you should remove interactive viewers as a goal and focus on batch interchange. Different companies' tools will have different sorts of internal formats, hence requiring different sorts of optimizations in a file format if the goal is low-latency interaction. You're not going to produce a single file format that will work for everyone. (For example, low-latency AWB would require the guitools/model indices to be in the file. SCA can emit that to an output file in a Fortify proprietary format, but it's unreasonable to impose this on the general file format). It might be more reasonable to position SARIF as a format for interchange between different companies, with the expectation that companies will convert to/from their own proprietary format.

  • Is human readability of the files a goal? (Human readability simplifies debugging and unfamiliar users' understanding of the file schema).

  • Does SARIF want to be one-format-fits-all-scans, or is it OK if it's more suitable for some scan results than others? How scalable does it need to be? What's the upper bound beyond which you care less about usability? Do you want scalable coverage of 100% of scans, or is 99% or 95% OK? What percentage of scans do you want to cover? Representatives of the various companies on the committee could then bring some numbers on the scan result sizes that their customers experience, and you could figure out where whatever percentage you identified puts you. Make sure to consider the effects of storing multiple scans in the same file.

And then a general strategy for handling very large results: consider providing a mechanism for partial file parsing. Wrapping the entire file in one JSON element, "sarifLog", means that when you start parsing from offset zero, you're not done until the final closing character "}". That's really not necessary. The log could instead be a sequential series of elements, enabling parsing of smaller bits at a time.

I mean laying out the file like:

sarifLog {
    fileFormatVersion = "1.0"
}
scanner {
    name = "Fortify SCA"
    version = "2018.1"
}
resultList {
    numberOfResults = 2
}
result {
    type = "dataflow"
}
result {
    type = "structural"
}

rather than:

sarifLog {
    fileFormatVersion = "1.0"
    scanner {
        name = "Fortify SCA"
        version = "2018.1"
    }
    resultList [
        result {
            type = "dataflow"
        }
        result {
            type = "structural"
        }
    ]
}

Get rid of nesting so that the JSON parser processes just a small amount and then returns control to the tool. The tool will need to keep invoking the parser to get through the entire log file, but importantly, this offers the tool the opportunity to process each element between parser calls (a minimal parsing loop along these lines is sketched after the list below).

If nothing else, this piecemeal parsing:
(a) potentially reduces memory requirements, because the JSON parser runs over just a bit at a time rather than the entire file at once;
(b) improves low-latency interaction opportunities, because a separate thread parsing individual elements can add them to the UI as each element is parsed, rather than blocking until the complete file is parsed. You can envision a UI that updates as each element is parsed and returned to the tool.
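
As a minimal illustration of this piecemeal approach (not part of any SARIF SDK; it assumes each top-level element were encoded as a standalone JSON object, and it reads the whole text up front for simplicity), a reader could hand back one element per parser call:

import json

def iter_top_level_elements(path):
    # Parse a log laid out as a sequence of top-level JSON objects.
    # Each raw_decode call parses one element and returns control to the
    # caller, so the file is never materialized as a single tree.
    decoder = json.JSONDecoder()
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()  # a production reader would buffer in chunks instead
    pos = 0
    while pos < len(text):
        while pos < len(text) and text[pos].isspace():
            pos += 1  # skip whitespace between elements
        if pos >= len(text):
            break
        element, pos = decoder.raw_decode(text, pos)
        yield element

# A consumer (or a UI thread) can process each element between parser calls:
# for element in iter_top_level_elements("scan.log"):
#     handle(element)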

A more complicated design provides indices within the file to the offsets of elements so that the file can be consumed in random order rather than start-to-finish. A log file viewer could then lseek to the offsets of elements required to render the current view while skipping the parse of everything else. I'm not convinced that this is a good idea for a general file format, because it's dependent on indexing, and that's probably tied to how the log file viewer represents indices internally.

But anyway, switching to BSON or protobuf would probably yield faster serialization/deserialization than JSON, and probably smaller files. However, you sacrifice human readability. The committee has to evaluate such tradeoffs and decide which direction is more appropriate.

@fishoak

fishoak commented Aug 29, 2018

I mentioned this proposal to a colleague (Dave Vitek), and he had the following to say. I do like this idea, although I concede the point that it makes life a bit more difficult for the consumer.

Breaking up large SARIF documents only at specifically identified points in the file smells pretty ad-hoc to me. I suspect people may eventually want more points at which they can break things up.

Another option would be to use https://tools.ietf.org/html/draft-pbryan-zyp-json-ref-03, which is essentially #include for JSON. For example, these two sets of documents have the same meaning:

Set 1:

File1.sarif

{ "foo": ["bar", 1, 2, 3] }

Set 2:

File2.sarif

{ "foo": { "$ref": "foo.json" } }

foo.json

['bar', 1, 2, 3]

This provides a lot of flexibility to SARIF producers in terms of how they chop up the document. It does require consumers to deal with external URIs potentially anywhere in the document. This might be asking way too much.

This also provides a mechanism for a DAG representation as opposed to a tree representation, which could pay off for repeating fragments. It eliminates problems 1 and 3 mentioned by Larry, although it makes 2 (validation) hard unless everything is validated.
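
For what it's worth, consumer-side resolution of this style of reference can be fairly small; here is a minimal sketch in Python (local, relative file references only; JSON Pointers, remote URIs, cycle detection, and error handling are all omitted):

import json
import os

def resolve_refs(node, base_dir):
    # Replace any {"$ref": "<relative path>"} object with the parsed
    # contents of the referenced file, recursing into the result.
    if isinstance(node, dict):
        if set(node.keys()) == {"$ref"}:
            ref_path = os.path.join(base_dir, node["$ref"])
            with open(ref_path, "r", encoding="utf-8") as f:
                return resolve_refs(json.load(f), os.path.dirname(ref_path))
        return {key: resolve_refs(value, base_dir) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve_refs(item, base_dir) for item in node]
    return node

# with open("File2.sarif", encoding="utf-8") as f:
#     doc = resolve_refs(json.load(f), ".")  # yields the same tree as File1.sarif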

ghost pushed a commit that referenced this issue Sep 4, 2018
@ghost ghost self-assigned this Sep 5, 2018
@ghost ghost closed this as completed Sep 5, 2018
@ScottLouvau

See microsoft/sarif-sdk#1100 for some work in this area. This design uses drastically less memory for a single huge SARIF file, and can be modified to work with 'externalFiles' SARIF files, only doing the work to read collections if they are accessed by the viewer or processor.

This will make things better for use of the JSON format without having to break JSON rules or go to a binary format, though we may do those things for specific scenarios if more performance is necessary.

Reading two ~400 MB SARIF files (one with 1.3M files and one with 400k results) takes around 5 seconds and uses less than 4 MB of RAM with this implementation.
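
As a rough analogy in Python (this is not the C# implementation referenced above), a streaming JSON parser such as the third-party ijson library can likewise surface results from a large, conventionally nested SARIF file without building the whole object model:

import ijson  # third-party streaming JSON parser (pip install ijson)

def iter_results(path):
    # Stream each element of each run's results array as it is parsed,
    # keeping memory roughly proportional to a single result.
    with open(path, "rb") as f:
        for result in ijson.items(f, "runs.item.results.item"):
            yield result

# for result in iter_results("huge.sarif"):
#     process(result)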

@ghost ghost added CSD.1-public-comment Comment submitted by public during CSD.1 public comment period. 2.1.0-CSD.1 Will be fixed in SARIF v2.1.0 CSD.1. labels May 13, 2019