magichour.api.dist.preprocess package¶
Submodules¶
magichour.api.dist.preprocess.preProcess_RDD module¶
-
class
magichour.api.dist.preprocess.preProcess_RDD.
LogLine
(ts, msg, processed, dictionary, template, templateId, templateDict)¶ Bases:
tuple
-
__getnewargs__
()¶ Return self as a plain tuple. Used by copy and pickle.
-
__getstate__
()¶ Exclude the OrderedDict from pickling
-
__repr__
()¶ Return a nicely formatted representation string
-
dictionary
¶ Alias for field number 3
-
msg
¶ Alias for field number 1
-
processed
¶ Alias for field number 2
-
template
¶ Alias for field number 4
-
templateDict
¶ Alias for field number 6
-
templateId
¶ Alias for field number 5
-
ts
¶ Alias for field number 0
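The field aliases above match what a plain namedtuple provides. The following is a minimal sketch, assuming LogLine is built with collections.namedtuple (the listing confirms only that it is a tuple subclass with these fields, in this order). The sample values are invented for illustration.

```python
from collections import namedtuple

# Field order is taken from the signature above; building the class with
# namedtuple is an assumption, not something this listing confirms.
LogLine = namedtuple(
    "LogLine",
    ["ts", "msg", "processed", "dictionary",
     "template", "templateId", "templateDict"],
)

# Hypothetical values: only ts and msg are filled in, the rest are left empty.
line = LogLine(1438351200.0, "connect from 10.0.0.1", None, None, None, None, None)

# Each field can be read by name or by its positional alias.
same_field = line.msg == line[1]

# Tuples are immutable, so _replace returns a new LogLine with updated fields.
updated = line._replace(processed="connect from IPADDR")
```

Because the record is immutable, each processing stage can return a fresh copy with more fields populated rather than mutating shared state.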
-
-
class
magichour.api.dist.preprocess.preProcess_RDD.
TemplateLine
(id, template, skipWords)¶ Bases:
tuple
-
__getnewargs__
()¶ Return self as a plain tuple. Used by copy and pickle.
-
__getstate__
()¶ Exclude the OrderedDict from pickling
-
__repr__
()¶ Return a nicely formatted representation string
-
id
¶ Alias for field number 0
-
skipWords
¶ Alias for field number 2
-
template
¶ Alias for field number 1
-
-
class
magichour.api.dist.preprocess.preProcess_RDD.
TransformLine
(id, type, NAME, transform, compiled)¶ Bases:
tuple
-
NAME
¶ Alias for field number 2
-
__getnewargs__
()¶ Return self as a plain tuple. Used by copy and pickle.
-
__getstate__
()¶ Exclude the OrderedDict from pickling
-
__repr__
()¶ Return a nicely formatted representation string
-
compiled
¶ Alias for field number 4
-
id
¶ Alias for field number 0
-
transform
¶ Alias for field number 3
-
type
¶ Alias for field number 1
-
-
magichour.api.dist.preprocess.preProcess_RDD.
lineRegexReplacement
(line, logTrans)¶ Apply a list of regex replacements to a line and record all the replacements performed in a dictionary (list).
Parameters: line (LogLine) – log line to work on
Globals: transforms (RDD(TransformLine)) – replacements to apply
Returns: retval – log line with the processed and dictionary portions filled in
Return type: LogLine
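The replacement step described above can be sketched in plain Python. This is not the module's implementation: the function name line_regex_replacement_sketch, the choice to key the dictionary by the transform's NAME, the "REPLACE" type value, and the IP address pattern are all assumptions made for illustration.

```python
import re
from collections import namedtuple

# Stand-ins for the namedtuples documented in this module (assumed layout).
LogLine = namedtuple(
    "LogLine",
    ["ts", "msg", "processed", "dictionary",
     "template", "templateId", "templateDict"],
)
TransformLine = namedtuple(
    "TransformLine", ["id", "type", "NAME", "transform", "compiled"]
)


def line_regex_replacement_sketch(line, transforms):
    """Replace every match of each transform with its NAME and record the
    original matched text per NAME (hypothetical helper)."""
    text = line.msg
    replacements = {}
    for t in transforms:
        # group(0) is the full match, regardless of capture groups.
        found = [m.group(0) for m in t.compiled.finditer(text)]
        if found:
            replacements.setdefault(t.NAME, []).extend(found)
            text = t.compiled.sub(t.NAME, text)
    return line._replace(processed=text, dictionary=replacements)


# Hypothetical transform: replace dotted-quad addresses with a constant.
pattern = r"\d{1,3}(?:\.\d{1,3}){3}"
ip_transform = TransformLine(1, "REPLACE", "IPADDR", pattern, re.compile(pattern))

raw = LogLine(0.0, "connect from 10.0.0.1 to 10.0.0.2", None, None, None, None, None)
result = line_regex_replacement_sketch(raw, [ip_transform])
```

Transforms are applied in list order, so a later pattern sees the text already rewritten by earlier ones.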
-
magichour.api.dist.preprocess.preProcess_RDD.
logPreProcess
(sc, logTrans, rrdLogLine)¶ Take a series of log lines and pre-process them: replace IP addresses, directories, URLs, etc. with constants, and keep a dictionary of the replacements made to each line.
Parameters: - sc (sparkContext) – spark context
- logTrans (string) – location of the transFile in HDFS
- logFile (string) – location of the log data in HDFS
Returns: retval – preprocessed log lines ready for next stage of processing
Return type: RDD(LogLine)
-
magichour.api.dist.preprocess.preProcess_RDD.
rdd_TransformLine
(line)¶ Process transformations into RDD format.
Parameters: line (string) – line from the transform definition file. Lines beginning with # are considered comments and will need to be removed.
Returns: retval – namedtuple representation of the tasking
Return type: TransformLine
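As a rough illustration of turning a transform definition line into a TransformLine, the sketch below assumes a comma-separated layout of id,type,NAME,regex. The real file format is not documented in this listing, so the layout, the helper name parse_transform_line_sketch, and the sample lines are all hypothetical. Comment lines are filtered out before parsing, as the parameter description requires.

```python
import re
from collections import namedtuple

# Stand-in for the documented namedtuple (assumed layout).
TransformLine = namedtuple(
    "TransformLine", ["id", "type", "NAME", "transform", "compiled"]
)


def parse_transform_line_sketch(line):
    """Parse one 'id,type,NAME,regex' line (assumed format) and precompile
    the regex so it is ready for repeated use."""
    # maxsplit=3 keeps any commas inside the regex itself intact.
    id_, type_, name, pattern = line.strip().split(",", 3)
    return TransformLine(int(id_), type_, name, pattern, re.compile(pattern))


# Hypothetical file contents; the first line is a comment and is skipped.
raw_lines = [
    "# id,type,NAME,regex",
    r"1,REPLACE,IPADDR,\d{1,3}(?:\.\d{1,3}){3}",
]
transforms = [
    parse_transform_line_sketch(l) for l in raw_lines if not l.startswith("#")
]
```

Storing the compiled pattern alongside its source string avoids recompiling the regex for every log line.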
-
magichour.api.dist.preprocess.preProcess_RDD.
rdd_preProcess
(sc, logTrans, rrdLogLine)¶ Make an RDD of preprocessed log lines.
Parameters: - sc (sparkContext) – spark context
- logTrans (string) – location of the transFile in HDFS
- logFile (string) – location of the log data in HDFS
Returns: retval – preprocessed log lines ready for next stage of processing
Return type: RDD(LogLine)
-
magichour.api.dist.preprocess.preProcess_RDD.
readTransforms
(sc, transFile)¶ Return a list of transforms for replacement processing.
Parameters: - sc (sparkContext) – spark context
- transFile (string) – URI of the transform file in HDFS
Returns: retval – list of transforms
Return type: list(TransformLine)
magichour.api.dist.preprocess.readLog_RDD module¶
-
class
magichour.api.dist.preprocess.readLog_RDD.
LogLine
(ts, msg, processed, dictionary, template, templateId, templateDict)¶ Bases:
tuple
-
__getnewargs__
()¶ Return self as a plain tuple. Used by copy and pickle.
-
__getstate__
()¶ Exclude the OrderedDict from pickling
-
__repr__
()¶ Return a nicely formatted representation string
-
dictionary
¶ Alias for field number 3
-
msg
¶ Alias for field number 1
-
processed
¶ Alias for field number 2
-
template
¶ Alias for field number 4
-
templateDict
¶ Alias for field number 6
-
templateId
¶ Alias for field number 5
-
ts
¶ Alias for field number 0
-
-
class
magichour.api.dist.preprocess.readLog_RDD.
TemplateLine
(id, template, skipWords)¶ Bases:
tuple
-
__getnewargs__
()¶ Return self as a plain tuple. Used by copy and pickle.
-
__getstate__
()¶ Exclude the OrderedDict from pickling
-
__repr__
()¶ Return a nicely formatted representation string
-
id
¶ Alias for field number 0
-
skipWords
¶ Alias for field number 2
-
template
¶ Alias for field number 1
-
-
class
magichour.api.dist.preprocess.readLog_RDD.
TransformLine
(id, type, NAME, transform, compiled)¶ Bases:
tuple
-
NAME
¶ Alias for field number 2
-
__getnewargs__
()¶ Return self as a plain tuple. Used by copy and pickle.
-
__getstate__
()¶ Exclude the OrderedDict from pickling
-
__repr__
()¶ Return a nicely formatted representation string
-
compiled
¶ Alias for field number 4
-
id
¶ Alias for field number 0
-
transform
¶ Alias for field number 3
-
type
¶ Alias for field number 1
-
-
magichour.api.dist.preprocess.readLog_RDD.
procLogLine
(line, logFile)¶ Handle the logfile-specific parsing of input lines into two parts: ts (timestamp, float) and msg (the rest of the message).
Parameters: - line (string) – text to process
- logFile (string) – hint of the URI used for input; intended for switching parsing based on different directories
Returns: retval – [ts, msg]
Return type: list[string,string]
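A minimal sketch of this split, assuming each input line starts with an epoch timestamp followed by a space and the message. The actual log layout is not specified here, and the helper name proc_log_line_sketch is hypothetical. The sketch converts ts to a float, following the description above rather than the string return type listed.

```python
def proc_log_line_sketch(line, log_file=None):
    """Split a raw line into [ts, msg] (hypothetical helper).

    log_file is accepted only to mirror the documented signature; this
    sketch does not use it to switch parsing rules.
    """
    # partition splits on the first space only, so the message keeps its
    # internal spacing; a line with no message yields an empty msg.
    ts_text, _, msg = line.strip().partition(" ")
    return [float(ts_text), msg]


parsed = proc_log_line_sketch("1438351200.5 sshd: session opened for user root\n")
```

A line whose first token is not numeric raises ValueError in this sketch; a real parser would need a policy for such lines.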
-
magichour.api.dist.preprocess.readLog_RDD.
rdd_LogLine
(line, logFile)¶ Process a log line into RDD format.
Parameters: - line (string) – string from the log line
- logFile (string) – URI the log lines came from; eventually, parsing should vary based on the base of the URI
Returns: retval – fills in the first two portions of the LogLine namedtuple
Return type: LogLine
-
magichour.api.dist.preprocess.readLog_RDD.
rdd_ReadLog
(sc, logFile)¶ Read a log file or directory into LogLine RDD format. NOTE: only ts and msg are populated.
Parameters: - sc (sparkContext) – spark context
- logFile – URI of the file to process
Returns: retval – RDD of logs read from the logFile URI
Return type: RDD(LogLine)