Filter stack set instances by status reason with OpenRefine
The status reason is the only field that gives enough detail to tell how an instance got a general status such as “CURRENT”, “OUTDATED”, and “INOPERABLE”. The field is unstructured and undocumented. It may embed an error message from any service that CloudFormation calls, so analyzing it is difficult.
I learned how to use OpenRefine to analyze the output of list-stack-instances. I created a custom facet to filter the instances by status reason. The other tools I used before weren’t cutting it for me. Because OpenRefine uses facets to slice and dice the data, it’s really quick to explore.
SQLite is great for summarizing with counts and for filtering the results using other data sets such as the organizational hierarchy. Using the built-in functions I was unable to extract the variable parts of the messages. [3] If you don’t need the variable parts, you can implement this with a table of message templates joined using the LIKE operator. But writing SQL queries slows down exploratory analysis.
LibreOffice Calc’s UI gives a miserable experience becuase of the wide values in some of the columns (StackSetId, StackInstanceId, and StatusReason) [4]. The view snaps to column boundaries and long text does not wrap. You can reduce the column widths, but it hides much of the text. You can zoom out until you can barely read anything and then it might fit on one screen. I didn’t even try to implement it here.
Setup
Save the result of list-stack-instances and convert it to CSV using in2csv.
Import the result as a new project in OpenRefine.
Add text facets for the columns Region, Account, Status and DetailedStatus.
Add a custom facet for the StatusReason column.
StatusReason facet
Add a custom facet for the StatusReason column using this GREL expression.
filter(
    [
        ["Null", /null/],
        ["No updates", /No updates are to be performed\./],
        ["Cancelled by failures", /Cancelled since failure tolerance has exceeded/],
        ["Trust missing", /Account (.*?) should have '(.*?)' role with trust relationship to Role '(.*?)'\./],
        ["Account suspended", /ACCOUNT_SUSPENDED/],
        ["STS not activated", /STS is not activated in this region for account:(.*?). Your account administrator can activate STS in this region using the IAM Console\..*/],
        ["SCP denial to resource", /ResourceLogicalId:(.*?), ResourceType:(.*?), ResourceStatusReason:User: (.*?) is not authorized to perform: (.*?) with an explicit deny in a service control policy \(Service: (.*?); Status Code: (.*?); Error Code: (.*?); Request ID: (.*?); Proxy: (.*?)\)\..*/],
        ["Delivery channel limit", /ResourceLogicalId:(.*?), ResourceType:(.*?), ResourceStatusReason:Failed to put delivery channel '(.*?)' because the maximum number of delivery channels: (.*?) is reached\. \(Service: (.*?); Status Code: (.*?); Error Code: (.*?); Request ID: (.*?); Proxy: (.*?)\)\..*/],
        ["Unupdatable", /Stack:(.*?) is in (.*?) state and can not be updated\./],
        ["SCP denial to user", /User: (.*?) is not authorized to perform: (.*?) on resource: (.*?) with an explicit deny in a service control policy/],
        ["Resource already exists", /ResourceLogicalId:(.*), ResourceType:(.*), ResourceStatusReason:(.*) already exists\./],
        ["__Unmatched", /(.*?)/]
    ],
    template,
    match(coalesce(value, "null"), template[1]) != null
)[0][0]
The expression embeds a set of message templates. The label of the first matching template is used as the facet value.
Each message template has a label and a regular expression. The label summarizes the stack instance status with a short name such as “Null” or “Current”. The regular expression matches the whole status reason and may use groups to capture variable parts. Each template is stored as an array. The first element (template[0]) is its label and the second element (template[1]) is its regular expression.
The filter function applies each template to the status reason (value). If the status reason is null, it is treated as the literal string "null" for matching [1] [2]. The function returns a result array of all the matching templates. The match function returns an array of all the variable parts matched by the template, or null if it does not match. The captured variable parts of matching templates are ignored.
The templates are matched in definition order, so if there are two matching templates then the first one in the result array is the one that was first defined. Finally the label of the first matching template is returned.
The “__Unclassified” filter is a default template to catch messages that haven’t yet been classified. It is labelled with two underscores so that when the list of facets is sorted by name it appears at the top. It matches any message. It is the last template so that it doesn’t block matches by the more specific templates.
Use the __Unclassified filter to find messages that need a new message template. The message format is undocumented and application-specific, so the set of templates will ocassionally change as new status reasons are discovered.
Adding a new message template
Open the facet expression editor and paste this expression. Use it as a base for developing the regular expression of a new message template.
match(
    coalesce(value, "null"),
    /.*/
)
Position the cursor after the opening slash of the regular expression. Type out the message that you want to capture. Replace variable parts with (.*?) so that they are captured. If you have typed correctly, you should see an array in the match preview column next to the matched messages. When you have typed the complete message, remove the .* at the end. It exists as a placeholder so that the regular expression matches while it is incomplete.
[1] The null status reason occurs when the no status reason is given for the stack instance by list-stack-instances. I think this might happen for instance that haven’t been operated on for a long time. I believe the service forgets the status reason after a certain amount of time but I’m unsure how to prove it.
[2] The null status may also appear if list-stack-instances gives an empty string as the status reason and the result is converted from JSON to CSV using a tool like in2csv. To avoid this possibility you can import directly from the JSON response.
[3] Although it may be possible with a SQLean’s “extended standard library for SQLite”. https://github.com/nalgeon/sqlean
[4] https://ask.libreoffice.org/t/calc-horizontal-smooth-scrolling/12258