magichour.api.local.modelgen package

Submodules

magichour.api.local.modelgen.events module

magichour.api.local.modelgen.preprocess module

This module contains functions for the initial preprocessing of log files. The functions here read in data and convert it into LogLine named tuples (see the named tuple definition in magichour.api.local.util.namedtuples). Transform named tuples represent a preprocessing step that replaces a pattern in a log line.

TODO: Add the ability to write custom preprocessing functions other than just transforms.

magichour.api.local.modelgen.preprocess.get_transforms(transforms_file)

Reads transforms from a file and returns a list of Transform named tuples. The output is meant to be fed into transform_lines(). A Transform is a named tuple that represents a pattern to replace in a log line. The pattern is replaced by a standard tuple specified in the transforms file.

The named tuple definition for Transform is: Transform = namedtuple('Transform', ['id', 'type', 'name', 'transform', 'compiled'])

Parameters:transforms_file – a path to a transforms file. See the documentation for the proper format of the transforms file.
Returns:list of Transform named tuples
Return type:transforms
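
To make the Transform structure concrete, the following is a minimal sketch of building one Transform and applying it to a line of text. The namedtuple definition is the one quoted above. Everything else is an assumption made for illustration: the helper make_transform, the 'REPLACE' type value, and the reading of the fields ('transform' as the regex source, 'compiled' as the compiled regex, 'name' as the replacement text). The real transforms file format is not shown here.

```python
import re
from collections import namedtuple

# Definition quoted from the documentation above.
Transform = namedtuple('Transform', ['id', 'type', 'name', 'transform', 'compiled'])


def make_transform(id_, name, pattern):
    """Hypothetical helper: build a Transform with a pre-compiled regex."""
    return Transform(id=id_,
                     type='REPLACE',        # assumed type label
                     name=name,             # assumed to be the replacement text
                     transform=pattern,     # assumed to be the regex source
                     compiled=re.compile(pattern))


# Replace any dotted-quad IP address with a fixed placeholder.
ip_transform = make_transform(1, 'IPADDR', r'\b\d{1,3}(?:\.\d{1,3}){3}\b')
original = 'Accepted connection from 10.0.0.7 port 22'
replaced = ip_transform.compiled.sub(ip_transform.name, original)
```

Replacing variable values such as addresses with a fixed placeholder is what lets otherwise identical lines collapse into the same template later on.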
magichour.api.local.modelgen.preprocess.read_auditd_file(file_path)
magichour.api.local.modelgen.preprocess.read_log_file(file_path, ts_start_index, ts_end_index, ts_format=None, skip_num_chars=0)

Function to create LogLine named tuples from an input log file. Output from this function and get_transforms() is meant to be fed into transform_lines() in order to apply the Transforms to the created LogLines. This function can be used by itself if you don’t want to apply any Transforms, but keep in mind that writing and applying custom Transforms will assist the templating process.

We make no underlying assumptions about the log file format other than that there is a timestamp associated with each line. The rest of each line is considered associated text.

This function is a generator yielding LogLines. If you require a full list of LogLines then you will need to iterate through the generator.

The function closes the file itself unless an exception is raised.

Parameters:
  • file_path – path to log file. Opened with gzip if file_path ends with .gz
  • ts_start_index – starting index for parsing timestamp
  • ts_end_index – end index for parsing timestamp
  • ts_format – optional datetime format to pass to datetime.datetime.strptime to parse timestamp. If not specified, then the entire timestamp is parsed as a float.
  • skip_num_chars – optional number of characters to skip parsing at the beginning of each line (Default = 0)
Returns:

a generator yielding LogLine objects created from each line in file_path
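
The parameter behavior described above can be sketched as follows. This is not the library's implementation: SketchLogLine is a hypothetical stand-in for LogLine (whose real fields are defined elsewhere), and the sketch assumes that the timestamp indices are counted after skip_num_chars characters have been dropped, that the text is whatever follows the timestamp, and that parsed datetimes are converted to epoch seconds.

```python
import datetime
import gzip
import os
import tempfile
from collections import namedtuple

# Hypothetical stand-in: the real LogLine may carry different fields.
SketchLogLine = namedtuple('SketchLogLine', ['ts', 'text'])

EPOCH = datetime.datetime(1970, 1, 1)


def read_log_file_sketch(file_path, ts_start_index, ts_end_index,
                         ts_format=None, skip_num_chars=0):
    """Lazily yield one SketchLogLine per line of file_path."""
    opener = gzip.open if file_path.endswith('.gz') else open
    with opener(file_path, 'rt') as fp:
        for raw in fp:
            line = raw.rstrip('\n')[skip_num_chars:]
            stamp = line[ts_start_index:ts_end_index]
            if ts_format is None:
                ts = float(stamp)  # the whole timestamp is parsed as a float
            else:
                parsed = datetime.datetime.strptime(stamp, ts_format)
                ts = (parsed - EPOCH).total_seconds()
            yield SketchLogLine(ts=ts, text=line[ts_end_index:].strip())


# Float timestamps: no ts_format given.
with tempfile.NamedTemporaryFile('w', suffix='.log', delete=False) as tmp:
    tmp.write('1456000000.5 sshd started\n')
    tmp.write('1456000060.0 sshd stopped\n')
epoch_lines = list(read_log_file_sketch(tmp.name, 0, 12))
os.remove(tmp.name)

# Formatted timestamps: ts_format is passed to strptime.
with tempfile.NamedTemporaryFile('w', suffix='.log', delete=False) as tmp:
    tmp.write('1970-01-01 00:01:00 boot\n')
dated_lines = list(read_log_file_sketch(tmp.name, 0, 19,
                                        ts_format='%Y-%m-%d %H:%M:%S'))
os.remove(tmp.name)
```

Because the function is a generator, nothing is read until it is iterated; wrapping the call in list() is what forces the whole file to be processed.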

magichour.api.local.modelgen.preprocess.transform_lines(lines, transforms_file)

Function to return transformed LogLine named tuples by applying the specified Transforms to the original LogLines (as generated by read_log_file()). Note that writing and applying custom Transforms will assist the templating process and produce higher quality templates.

This function is a generator yielding LogLines. If you require a full list of LogLines then you will need to iterate through the generator.

See the comment in the function as to where to add additional transform types.

Parameters:
  • lines – iterable of LogLine named tuples.
  • transforms – iterable of Transform named tuples.
Returns:

a generator yielding LogLine objects
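
As an illustration of the generator behavior described above, here is a minimal sketch that applies a list of Transforms, in order, to each line. It is an assumption-laden stand-in rather than the library's code: it takes already-loaded Transform tuples instead of a transforms file, SketchLogLine is a hypothetical replacement for LogLine, and each Transform is treated as a plain regex substitution.

```python
import re
import types
from collections import namedtuple

Transform = namedtuple('Transform', ['id', 'type', 'name', 'transform', 'compiled'])
# Hypothetical stand-in: the real LogLine may carry different fields.
SketchLogLine = namedtuple('SketchLogLine', ['ts', 'text'])


def transform_lines_sketch(lines, transforms):
    """Lazily yield copies of lines with every transform applied in order."""
    for line in lines:
        text = line.text
        for t in transforms:
            text = t.compiled.sub(t.name, text)
        yield line._replace(text=text)


def regex_transform(id_, name, pattern):
    """Hypothetical helper: the 'REPLACE' type label is an assumption."""
    return Transform(id_, 'REPLACE', name, pattern, re.compile(pattern))


# Order matters: IP addresses are replaced before bare integers.
transforms = [
    regex_transform(1, 'IPADDR', r'\b\d{1,3}(?:\.\d{1,3}){3}\b'),
    regex_transform(2, 'INT', r'\b\d+\b'),
]
lines = [
    SketchLogLine(1.0, 'user 42 from 10.0.0.7'),
    SketchLogLine(2.0, 'user 7 from 10.0.0.9'),
]
result = transform_lines_sketch(lines, transforms)
is_lazy = isinstance(result, types.GeneratorType)
transformed = list(result)
```

After the transforms run, both input lines carry the same text, which is the property that helps the templating step group them together.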

magichour.api.local.modelgen.template module

This module contains the different algorithms that were evaluated for discovering log file templates (format strings).

Add additional template processors to this file.

Functions in this module should accept an iterable of LogLines. Functions should return an iterable of Templates. (see named tuple definition in magichour.api.local.named_tuples)

magichour.api.local.modelgen.template.baler(lines)

This function uses the Baler tool, created by Sandia National Labs. The tool is expected to be released in Q1 2016, so this code will be updated when that happens.

Parameters:lines (iterable LogLine) – an iterable of LogLine named tuples
Returns:templates – a list of Template named tuples
Return type:list Template
magichour.api.local.modelgen.template.logcluster(lines, *args, **kwargs)

This function uses the logcluster algorithm (available at http://ristov.github.io/logcluster/) to cluster log files and mine line patterns. See http://ristov.github.io/publications/cnsm15-logcluster-web.pdf for additional details on the algorithm. The current implementation writes the log lines to a temporary file and then feeds that file to the logcluster command line tool (Perl). Eventually, the goal is to fully translate logcluster.pl into Python to eliminate this step.

Behavior of this function depends on which of lines and file_path are set:
  • lines AND file_path set: write lines to the file at file_path
  • lines BUT NOT file_path set: write lines to a temporary file
  • file_path BUT NOT lines set: pass file_path directly into logcluster
  • NEITHER lines NOR file_path set: throw an exception

Parameters:lines (iterable LogLine) – an iterable of LogLine named tuples
Kwargs:
file_path (string): target path to pass to logcluster.pl (only used if lines is None, otherwise ignored). All other kwargs are passed on the command line to logcluster.pl. See above for details.
Returns:templates – a list of Template named tuples
Return type:list Template
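
The four lines/file_path combinations described for logcluster can be sketched as a small path-resolution step. This is a hypothetical helper, not part of the library: it only decides which file would be handed to logcluster.pl and does not invoke the tool, it treats lines as plain strings rather than LogLine tuples, and the choice of ValueError is an assumption.

```python
import os
import tempfile


def resolve_logcluster_input(lines=None, file_path=None):
    """Return the path of the file that would be passed to logcluster.pl."""
    if lines is None and file_path is None:
        raise ValueError('either lines or file_path must be provided')
    if lines is None:
        # file_path only: use the existing file untouched.
        return file_path
    if file_path is None:
        # lines only: create a temporary file to hold them.
        fd, file_path = tempfile.mkstemp(suffix='.log')
        os.close(fd)
    # lines given: write them to file_path (temporary or caller-supplied).
    with open(file_path, 'w') as fp:
        for text in lines:
            fp.write(text + '\n')
    return file_path


# lines only: the helper picks a temporary file.
tmp_path = resolve_logcluster_input(lines=['sshd started', 'sshd stopped'])
with open(tmp_path) as fp:
    written = fp.read()
os.remove(tmp_path)

# file_path only: the path is passed through unchanged.
passthrough = resolve_logcluster_input(file_path='existing.log')

# neither: an exception is raised.
try:
    resolve_logcluster_input()
    raised = False
except ValueError:
    raised = True
```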
magichour.api.local.modelgen.template.stringmatch(lines, *args, **kwargs)

This function uses the StringMatch algorithm to perform clustering and line pattern mining. See the paper “One Graph Is Worth a Thousand Logs: Uncovering Hidden Structures in Massive System Event Logs” by Aharon, Barash, Cohen, and Mordechai for further details on the algorithm.

The name “StringMatch” was taken from another paper (Aharon et al. do not name their algorithm).

Parameters:lines (iterable LogLine) – an iterable of LogLine named tuples
Kwargs:
  • batch_size (int): batch_size to pass to StringMatch (default: 5000)
  • skip_count (int): skip_count to pass to StringMatch (default: 0)
  • threshold (float): threshold to pass to StringMatch, must be between 0 and 1 (default: 0.75)
  • min_samples (int): min_samples to pass to StringMatch (default: 25)
Returns:templates – a list of Template named tuples
Return type:list Template

magichour.api.local.modelgen.window module

magichour.api.local.modelgen.window.modelgen_window(timed_templates, window_size=60, remove_junk_drawer=False)

This function was written to take in the output of the apply_template function. It groups template occurrences into “windows” (aka transactions) that will be passed on to a market basket analysis algorithm in events/events.py.

By default the window size is 60 seconds.

Parameters:timed_templates – iterable of TimedTemplate named tuples
Kwargs:
window_size: number of seconds covered by each window (default: 60)
Returns:list of sets containing TimedTemplate named tuples
Return type:windows
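
The grouping of template occurrences into windows can be illustrated with a short sketch. The real windowing rule is not spelled out here, so this example assumes fixed, non-overlapping buckets keyed by floor(ts / window_size), and it uses a hypothetical SketchTimedTemplate in place of the library's TimedTemplate. The remove_junk_drawer option is not modeled.

```python
from collections import namedtuple

# Hypothetical stand-in: the real TimedTemplate may carry different fields.
SketchTimedTemplate = namedtuple('SketchTimedTemplate', ['ts', 'template_id'])


def window_sketch(timed_templates, window_size=60):
    """Group items into fixed buckets of window_size seconds.

    Returns a list of sets, ordered by bucket start time. Empty buckets
    are omitted.
    """
    buckets = {}
    for tt in timed_templates:
        key = int(tt.ts // window_size)
        buckets.setdefault(key, set()).add(tt)
    return [buckets[key] for key in sorted(buckets)]


occurrences = [
    SketchTimedTemplate(0.0, 'A'),
    SketchTimedTemplate(30.0, 'B'),
    SketchTimedTemplate(59.9, 'A'),
    SketchTimedTemplate(60.0, 'C'),
    SketchTimedTemplate(185.0, 'A'),
]
windows = window_sketch(occurrences, window_size=60)
window_ids = [sorted({tt.template_id for tt in w}) for w in windows]
```

Each resulting set plays the role of one transaction for the market basket analysis step: templates that repeatedly share a window are candidates for belonging to the same event.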

Module contents