magichour.api.dist.preprocess package

Submodules

magichour.api.dist.preprocess.preProcess_RDD module

class magichour.api.dist.preprocess.preProcess_RDD.LogLine(ts, msg, processed, dictionary, template, templateId, templateDict)

Bases: tuple

__getnewargs__()

Return self as a plain tuple. Used by copy and pickle.

__getstate__()

Exclude the OrderedDict from pickling

__repr__()

Return a nicely formatted representation string

dictionary

Alias for field number 3

msg

Alias for field number 1

processed

Alias for field number 2

template

Alias for field number 4

templateDict

Alias for field number 6

templateId

Alias for field number 5

ts

Alias for field number 0
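
The field aliases above describe an ordinary namedtuple. The following is a minimal sketch of how such a record might be built and updated. It reconstructs the type from the documented field order rather than importing it from magichour, and the sample values are made up for illustration.

```python
from collections import namedtuple

# Reconstructed from the documented field order (ts is field 0, templateDict is field 6).
LogLine = namedtuple(
    "LogLine",
    ["ts", "msg", "processed", "dictionary", "template", "templateId", "templateDict"],
)

# A freshly read line: only ts and msg are known; later stages fill in the rest.
# The timestamp and message are illustrative values, not real log data.
line = LogLine(
    ts=1453334400.0,
    msg="connect from 10.0.0.1",
    processed=None,
    dictionary=None,
    template=None,
    templateId=None,
    templateDict=None,
)

# namedtuples are immutable, so _replace returns an updated copy.
updated = line._replace(
    processed="connect from IPADDRESS",
    dictionary={"IPADDRESS": ["10.0.0.1"]},
)
```

Because the record is a tuple, positional access (`line[0]`) and attribute access (`line.ts`) refer to the same value.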

class magichour.api.dist.preprocess.preProcess_RDD.TemplateLine(id, template, skipWords)

Bases: tuple

__getnewargs__()

Return self as a plain tuple. Used by copy and pickle.

__getstate__()

Exclude the OrderedDict from pickling

__repr__()

Return a nicely formatted representation string

id

Alias for field number 0

skipWords

Alias for field number 2

template

Alias for field number 1

class magichour.api.dist.preprocess.preProcess_RDD.TransformLine(id, type, NAME, transform, compiled)

Bases: tuple

NAME

Alias for field number 2

__getnewargs__()

Return self as a plain tuple. Used by copy and pickle.

__getstate__()

Exclude the OrderedDict from pickling

__repr__()

Return a nicely formatted representation string

compiled

Alias for field number 4

id

Alias for field number 0

transform

Alias for field number 3

type

Alias for field number 1

magichour.api.dist.preprocess.preProcess_RDD.lineRegexReplacement(line, logTrans)

apply a list of regex replacements to a line, and make note of all the replacements performed in a dictionary(list)

Parameters:line (LogLine) – logline to work on
Globals:transforms (RDD(TransformLine)) – replacements to make
Returns:retval – logline with the processed and dictionary portions filled in
Return type:LogLine
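
The replacement step described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: `replace_with_names` is a hypothetical stand-in for lineRegexReplacement, the namedtuples are rebuilt locally from the documented fields, and the `"REPLACE"` type value and the IP address pattern are invented for the example.

```python
import re
from collections import namedtuple

# Rebuilt locally from the documented field lists (not imported from magichour).
LogLine = namedtuple(
    "LogLine",
    ["ts", "msg", "processed", "dictionary", "template", "templateId", "templateDict"],
)
TransformLine = namedtuple("TransformLine", ["id", "type", "NAME", "transform", "compiled"])


def replace_with_names(line, transforms):
    """Hypothetical stand-in: apply each compiled regex and record what it replaced."""
    text = line.msg
    replaced = {}
    for t in transforms:
        # Collect the original substrings before they are overwritten.
        matches = [m.group(0) for m in t.compiled.finditer(text)]
        if matches:
            replaced[t.NAME] = matches
            text = t.compiled.sub(t.NAME, text)
    return line._replace(processed=text, dictionary=replaced)


# Assumed transform: replace dotted-quad IP addresses with the constant IPADDRESS.
ip_pattern = r"\d{1,3}(?:\.\d{1,3}){3}"
transforms = [TransformLine(1, "REPLACE", "IPADDRESS", ip_pattern, re.compile(ip_pattern))]

raw = LogLine(1453334400.0, "connect from 10.0.0.1 to 10.0.0.2", None, None, None, None, None)
result = replace_with_names(raw, transforms)
```

Here the dictionary maps each transform NAME to the list of strings it replaced, which is one plausible reading of "dictionary(list)".
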
magichour.api.dist.preprocess.preProcess_RDD.logPreProcess(sc, logTrans, rrdLogLine)

take a series of loglines and pre-process them: replace IP addresses, directories, URLs, etc. with constants, and keep a dictionary of the replacements done to each line

Parameters:
  • sc (sparkContext) – spark context
  • logTrans (string) – location of the transFile in HDFS
  • logFile (string) – location of the log data in HDFS
Returns:

retval – preprocessed log lines ready for the next stage of processing

Return type:

RDD(LogLines)

magichour.api.dist.preprocess.preProcess_RDD.rdd_TransformLine(line)

process transformations into RDD format

Parameters:line (string) – line from the transform definition file. Lines beginning with # are considered comments and will need to be removed
Returns:retval – namedTuple representation of the tasking
Return type:TransformLine
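
A minimal sketch of turning one definition line into a TransformLine follows. The file format is not documented here, so the comma-separated `id,type,NAME,transform` layout is purely an assumption, as are the hypothetical helper name `parse_transform` and the sample line. Comment lines are filtered out by the caller, as the parameter description requires.

```python
import re
from collections import namedtuple

# Rebuilt locally from the documented field list (not imported from magichour).
TransformLine = namedtuple("TransformLine", ["id", "type", "NAME", "transform", "compiled"])


def parse_transform(line):
    """Hypothetical parser. ASSUMPTION: fields are laid out as id,type,NAME,transform."""
    # Split at most three times so commas inside the regex itself are preserved.
    id_, type_, name, pattern = line.strip().split(",", 3)
    return TransformLine(int(id_), type_, name, pattern, re.compile(pattern))


# Illustrative file contents; the first line is a comment and must be dropped.
lines = [
    "# id,type,NAME,transform",
    r"1,REPLACE,IPADDRESS,\d{1,3}(?:\.\d{1,3}){3}",
]
transforms = [parse_transform(l) for l in lines if not l.startswith("#")]
```

Compiling the pattern once at parse time means the regex is not recompiled for every log line.
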
magichour.api.dist.preprocess.preProcess_RDD.rdd_preProcess(sc, logTrans, rrdLogLine)

make an RDD of preprocessed loglines

Parameters:
  • sc (sparkContext) – sparkContext
  • logTrans (string) – location of the transFile in HDFS
  • logFile (string) – location of the log data in HDFS
Returns:retval – preprocessed log lines ready for the next stage of processing
Return type:RDD(LogLines)
magichour.api.dist.preprocess.preProcess_RDD.readTransforms(sc, transFile)

returns a list of transforms for replacement processing

Parameters:
  • sc (sparkContext) – spark context
  • transFile (string) – URI to the transform file in HDFS
Returns:

retval(list(TransformLine))

magichour.api.dist.preprocess.readLog_RDD module

class magichour.api.dist.preprocess.readLog_RDD.LogLine(ts, msg, processed, dictionary, template, templateId, templateDict)

Bases: tuple

__getnewargs__()

Return self as a plain tuple. Used by copy and pickle.

__getstate__()

Exclude the OrderedDict from pickling

__repr__()

Return a nicely formatted representation string

dictionary

Alias for field number 3

msg

Alias for field number 1

processed

Alias for field number 2

template

Alias for field number 4

templateDict

Alias for field number 6

templateId

Alias for field number 5

ts

Alias for field number 0

class magichour.api.dist.preprocess.readLog_RDD.TemplateLine(id, template, skipWords)

Bases: tuple

__getnewargs__()

Return self as a plain tuple. Used by copy and pickle.

__getstate__()

Exclude the OrderedDict from pickling

__repr__()

Return a nicely formatted representation string

id

Alias for field number 0

skipWords

Alias for field number 2

template

Alias for field number 1

class magichour.api.dist.preprocess.readLog_RDD.TransformLine(id, type, NAME, transform, compiled)

Bases: tuple

NAME

Alias for field number 2

__getnewargs__()

Return self as a plain tuple. Used by copy and pickle.

__getstate__()

Exclude the OrderedDict from pickling

__repr__()

Return a nicely formatted representation string

compiled

Alias for field number 4

id

Alias for field number 0

transform

Alias for field number 3

type

Alias for field number 1

magichour.api.dist.preprocess.readLog_RDD.procLogLine(line, logFile)

handles the logfile-specific parsing of input lines into 2 parts: ts, the timestamp as a float, and msg, the rest of the message

Parameters:
  • line (string) – text to process
  • logFile (string) – hint of the URI used for input; should be used for switching parsing based on different directories
Returns:

retval – [ts, msg]

Return type:

list[string,string]
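
The split described above can be sketched as follows. The real log layout is not specified in this reference, so the example assumes each line starts with an epoch timestamp followed by a single space. `split_ts_msg` is a hypothetical name, and it converts ts to a float as the description states, even though the listed return type names two strings.

```python
def split_ts_msg(line):
    """Hypothetical sketch. ASSUMPTION: the line is '<epoch seconds> <message>'."""
    # Split once so spaces inside the message are left untouched.
    ts, msg = line.strip().split(" ", 1)
    return [float(ts), msg]


# Illustrative input line, not taken from a real log.
parts = split_ts_msg("1453334400.5 sshd[42]: connect from 10.0.0.1\n")
```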

magichour.api.dist.preprocess.readLog_RDD.rdd_LogLine(line, logFile)

process a log line into an RDD

Parameters:
  • line (string) – string from the logline
  • logFile (string) – what URI the log lines came from, eventually want to do different parsing based on the base of the URI
Returns:

retval – fills in the first two portions of the LogLine namedtuple

Return type:

LogLine

magichour.api.dist.preprocess.readLog_RDD.rdd_ReadLog(sc, logFile)

read a log/directory into LogLine RDD format. NOTE: only ts and msg are populated

Parameters:
  • sc (sparkContext)
  • logFile – URI to the file to process

Returns:retval – RDD of logs read from the LogFile URI
Return type:RDD(LogLines)

Module contents