Provide log deconstruction mechanism to enable more efficient read/write #210
Comments
Consider an efficient write using this mechanism: the SARIF producer establishes that it can reserve required file names at required file locations, generates the 'header' of the master file, then begins to append results to the file. The producer either stores data (such as files tables) and writes those files on completion, or appends this data as it appears to the relevant side-car file. On the consumption side, the consumer reads all data leading up to the results and optionally loads any side-car files (such as the files table) if necessary. Note that this approach can provide efficient speed and control of memory consumption, but does not address the problem of sorting data to produce deterministically ordered output (e.g., results that are sorted according to some convention).
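The write path described above can be sketched in a few lines. This is a minimal illustration only, not part of any SARIF SDK; the function name, the exact file layout, and the idea of storing the side-car path directly in the `files` slot are all assumptions made for the sketch.

```python
import json

def write_log_streaming(log_path, files_sidecar_path, header, results_iter, files_table):
    """Write a SARIF-like log incrementally: 'header' properties first,
    then results appended one at a time as they arrive; the files table
    is written to a side-car file on completion."""
    with open(log_path, "w") as log:
        # Emit the header, leaving the run object open for streaming.
        log.write('{"version": "2.0.0", "runs": [{')
        for key, value in header.items():
            log.write(json.dumps(key) + ": " + json.dumps(value) + ", ")
        # Indirect to the side-car file instead of embedding the table.
        log.write('"files": ' + json.dumps(files_sidecar_path) + ', ')
        log.write('"results": [')
        for i, result in enumerate(results_iter):
            if i:
                log.write(", ")
            log.write(json.dumps(result))  # append each result as it appears
        log.write("]}]}")
    # Write the accumulated files table once production is complete.
    with open(files_sidecar_path, "w") as sidecar:
        json.dump(files_table, sidecar)
```

The producer never holds the full results array in memory; the trade-off, as noted, is that results land in arrival order rather than any sorted convention.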
See #211 for an alternate proposal.
See #211 for an idea that removes the need for SARIF properties to potentially be populated with two types. We could include a new property that explicitly allows URLs to be used to indirectly reference property contents. There are only 9 of these potentially indirectly referenceable things, which isn't a large number.
Something to keep in mind when constructing a SARIF log file from a number of files is the OS limit on open file descriptors.
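On POSIX systems that limit can be queried up front, so a producer fanning out side-car files can cap how many handles it keeps open at once. A sketch, assuming CPython on a POSIX platform (the function name and `reserve` heuristic are invented for illustration):

```python
import resource

def max_sidecar_handles(reserve=64):
    """How many side-car files can we safely hold open at once,
    leaving 'reserve' descriptors for stdio, sockets, and the
    master log file itself?"""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:  # no effective per-process cap
        return 2 ** 16
    return max(0, soft - reserve)
```

A producer exceeding this bound would need to close and reopen side-car files in append mode rather than holding them all open.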
More comments from my colleague: My personal preference for binary serialization formats is protocol buffers. Google uses protobuf everywhere, so there's no risk of it going away anytime soon. Protobufs do require some centralized ownership of the schema to ensure that field names and IDs are not changed or reused. I don't have any measurements of speed or size comparisons, though protobufs should be faster and smaller due to the use of integer IDs rather than string field names in the serialized content. I would advise considering some questions about SARIF's goals:
And then some general strategies for handling very large results: Consider providing a mechanism for partial file parsing. Wrapping the entire file in one JSON element "sarifLog" means that when you start parsing from offset zero, you're not done until the final closing character "}". It's really not necessary. The log could instead be a sequential series of elements, enabling parsing of smaller bits at a time: lay the file out as a flat sequence of top-level elements rather than nesting everything inside a single enclosing `sarifLog` object. Getting rid of the nesting means the JSON parser processes a small amount and then returns to the tool. The tool will need to keep invoking the parser to get through the entire log file, but importantly, this offers the tool the opportunity to process each element in between parser calls. If nothing else, this piecemeal parsing bounds how much of the log must be held in memory at once.

A more complicated design provides indices within the file to the offsets of elements, so that the file can be consumed in random order rather than start-to-finish. A log file viewer could then lseek to the offsets of the elements required to render the current view while skipping the parse of everything else. I'm not convinced that this is a good idea for a general file format, because it depends on indexing, and that's probably tied to how the log file viewer represents indices internally.

But anyway, switching to BSON or protobuf would probably give faster serialization/deserialization than JSON, and probably smaller files. However, you sacrifice human readability. The committee has to evaluate such tradeoffs and decide what direction is more appropriate.
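The sequential-elements idea works with an ordinary JSON parser today. A sketch, assuming the log is laid out as concatenated top-level JSON values (the element shapes below are invented for illustration, not SARIF):

```python
import json

def iter_log_elements(text):
    """Parse a log laid out as concatenated top-level JSON values,
    yielding one element at a time so the caller can process each
    element between parser invocations."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # Skip whitespace/newlines between elements.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        element, pos = decoder.raw_decode(text, pos)
        yield element  # a small amount parsed; control returns to the tool

# Example layout: a header element followed by one element per result.
log_text = '{"version": "2.0.0"}\n{"ruleId": "R1"}\n{"ruleId": "R2"}\n'
elements = list(iter_log_elements(log_text))
```

Each `raw_decode` call stops at the end of one value, which is exactly the "parse a bit, return to the tool" loop described above.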
I mentioned this proposal to a colleague (Dave Vitek), and he had the following to say. I do like this idea, although I concede the point that it makes life a bit more difficult for the consumer.

Breaking up large SARIF documents only at specifically identified points in the file smells pretty ad hoc to me. I suspect people may eventually want more points at which they can break things up. Another option would be to use https://tools.ietf.org/html/draft-pbryan-zyp-json-ref-03, which is essentially `#include` for JSON. For example, these two sets of documents have the same meaning:

Set 1 — `File1.sarif`:

```json
{ "foo": ["bar", 1, 2, 3] }
```

Set 2 — `File2.sarif`:

```json
{ "foo": { "$ref": "foo.json" } }
```

`foo.json`:

```json
["bar", 1, 2, 3]
```

This provides a lot of flexibility to SARIF producers in terms of how they chop up the document. It does require consumers to deal with external URIs potentially anywhere in the document, which might be asking way too much. It also provides a mechanism for a DAG representation as opposed to a tree representation, which could pay off for repeating fragments. It eliminates problems 1 and 3 mentioned by Larry, although it makes 2 (validation) hard unless everything is validated.
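For reference, a toy resolver for the `$ref` pattern in Dave's example fits in a few lines. This is a sketch only: it handles whole-value replacement from local files, not the full JSON Reference draft (no fragments, no remote URIs, no cycle detection).

```python
import json
import os

def resolve_refs(node, base_dir):
    """Recursively replace {"$ref": <path>} objects with the parsed
    contents of the referenced local JSON file."""
    if isinstance(node, dict):
        if set(node) == {"$ref"}:
            with open(os.path.join(base_dir, node["$ref"])) as f:
                return resolve_refs(json.load(f), base_dir)
        return {k: resolve_refs(v, base_dir) for k, v in node.items()}
    if isinstance(node, list):
        return [resolve_refs(v, base_dir) for v in node]
    return node
```

With `foo.json` containing `["bar", 1, 2, 3]`, resolving Set 2 yields the same document as Set 1, which is the equivalence the comment describes.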
See microsoft/sarif-sdk#1100 for some work in this area. This design uses drastically less memory for a single huge SARIF file, and can be modified to work with 'externalFiles' SARIF files, only doing the work to read collections if they are accessed by the viewer or processor. This will make things better for use of the JSON format without having to break JSON rules or go to a binary format, though we may do those things for specific scenarios if more performance is necessary. Reading two ~400MB SARIF files (one with 1.3M files and one with 400k results) takes around 5 seconds and less than 4MB of RAM with this implementation. |
This issue was opened in response to CSD.1 feedback re: efficient parsing of SARIF at scale. In TC#20, we discussed one possible solution: allowing SARIF producers/consumers to work with deconstructed SARIF files. One possible solution:
One obvious problem with the above is that a SARIF property might consist of either, say, an array of results or an array of URIs. This pattern does not work well with SDKs produced for typical strongly typed languages such as C# or Java. For a performance-sensitive reader/writer, we foresee a likely need for a separate SDK/API for this scenario, in addition to a more convenient OM-hydrating SDK (which may attempt to process a file in its entirety).
These properties would be guaranteed to be at the head of the file:
- `instanceGuid`
- `logicalId`
- `description`
- `baselineInstanceGuid`
- `automationLocalId`
- `architecture`
- `originalUriBaseIds`
- `defaultFileEncoding`
- `columnKind`
- `richMessageMimeType`
- `redactionToken`
These properties would allow for an indirection to a separate URI holding actual contents:
- `tool`
- `invocations`
- `conversion`
- `versionControlDetails`
- `files`
- `logicalLocations`
- `graphs`
- `resources`
- `properties`
Finally, at the end of the file, the results object.
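Putting the pieces together, a deconstructed log under this proposal might look roughly like the following sketch. The property names come from the lists above; the string-URI indirection syntax and all values shown are illustrative assumptions, not spec text:

```json
{
  "instanceGuid": "00000000-0000-0000-0000-000000000000",
  "columnKind": "utf16CodeUnits",
  "files": "file:///out/scan.files.sarif",
  "logicalLocations": "file:///out/scan.logicalLocations.sarif",
  "results": [
    { "ruleId": "EXAMPLE0001", "message": { "text": "..." } }
  ]
}
```

A consumer reads the fixed header properties, decides whether to fetch any of the indirected side-car files, and only then streams the `results` array at the end of the file.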