Dr. Elephant Documentation, Part 6: Metrics and Heuristics
- December 26, 2019
- Notes
1. Metrics
1.1. Resource Usage
Resource usage is the amount of resources a job consumes, measured in GB-hours.
1.1.1. Calculation
We define a task's resource usage as the product of its container size and its runtime. A job's resource usage is then the sum of the resource usage of all its mapper and reducer tasks.
1.1.2. Example
Consider a job with:
- 4 mappers, with runtimes of {12, 15, 20, 30} minutes
- 4 reducers, with runtimes of {10, 12, 15, 18} minutes
- a container size of 4 GB

Then:
- resources used by all mappers: 4 GB * ((12 + 15 + 20 + 30) / 60) hours = 5.133 GB-hours
- resources used by all reducers: 4 GB * ((10 + 12 + 15 + 18) / 60) hours = 3.667 GB-hours
- total resources used by the job: 5.133 + 3.667 = 8.8 GB-hours
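The arithmetic is easy to reproduce. Below is a minimal Python sketch of the GB-hours computation using the numbers from the example above; the function name is ours, not Dr. Elephant's (which is implemented in Java):

    def resource_gb_hours(container_gb, runtimes_min):
        # GB-hours = container size (GB) * total task runtime (hours)
        return container_gb * sum(runtimes_min) / 60.0

    mapper_runtimes = [12, 15, 20, 30]   # minutes
    reducer_runtimes = [10, 12, 15, 18]  # minutes
    total = resource_gb_hours(4, mapper_runtimes) + resource_gb_hours(4, reducer_runtimes)
    print(round(total, 3))  # -> 8.8 GB-hours (5.133 + 3.667)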
1.2. Wasted Resources
This shows the amount of resources wasted by the job, in GB-hours or as a percentage of the resources used.
1.2.1. Calculation
To calculate the resources wasted, we calculate the following:

- the minimum memory wasted by each task (map and reduce)
- the runtime of each task (map and reduce)

The minimum memory wasted by a task is the difference between the container size and the task's peak memory usage. The resources wasted by a task are then the minimum memory wasted multiplied by the task's duration, and the total resources wasted by the job are the sum of the resources wasted by all its tasks.

For each task, let us define:

peak_memory_used := the upper bound on the memory used by the task
runtime := the runtime of the task

peak_memory_used is the larger of the task's peak physical memory (max_physical_memory) and its virtual memory (virtual_memory) scaled down by the cluster memory factor:

peak_memory_used = max(max_physical_memory, virtual_memory / 2.1)

where 2.1 is the cluster memory factor. The minimum memory wasted and the minimum resources wasted by each task are then:

wasted_memory = container_size - peak_memory_used
wasted_resource = wasted_memory * runtime
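A minimal Python sketch of this computation, assuming each task is described by a (peak physical MB, virtual MB, runtime hours) triple; the names and sample values are illustrative, not Dr. Elephant's API:

    CLUSTER_MEMORY_FACTOR = 2.1

    def wasted_resource_mb_hours(tasks, container_mb):
        wasted = 0.0
        for peak_physical_mb, virtual_mb, runtime_hours in tasks:
            # Peak memory: the physical peak, or the scaled virtual peak if larger.
            peak_memory_used = max(peak_physical_mb, virtual_mb / CLUSTER_MEMORY_FACTOR)
            wasted += (container_mb - peak_memory_used) * runtime_hours
        return wasted

    # Two hypothetical tasks in 4 GB containers.
    tasks = [(1500, 2500, 0.5), (1800, 3000, 0.25)]
    print(wasted_resource_mb_hours(tasks, container_mb=4096))  # -> 1872.0 MB-hours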
1.3. Runtime
The runtime metric shows the total time the job spent running.
1.3.1. Calculation
The job runtime is the difference between the time the job was submitted to the resource manager and the time it finished.
1.3.2. Example
A job submitted at 1461837302868 ms that finished at 1461840952182 ms has a runtime of 1461840952182 - 1461837302868 = 3649314 ms, i.e., 1.01 hours.
1.4. Wait Time
Wait time is the time a job spends waiting to run.
1.4.1. Calculation
For each task, let us define the following:

ideal_start_time := the ideal time at which the task should have started
finish_time := the time at which the task finished
task_runtime := the runtime of the task

- Map tasks: for map tasks, ideal_start_time is the job submission time. We find the map task with the longest runtime (task_runtime_max) and the map task that finished last (finish_time_last). The total wait time of the job due to map tasks is:

mapper_wait_time = finish_time_last - (ideal_start_time + task_runtime_max)

- Reduce tasks: for reduce tasks, ideal_start_time is computed from the reducer slow-start percentage (mapreduce.job.reduce.slowstart.completedmaps): it is the finish time of the map task after which the first reducer should have started. We find the reduce task with the longest runtime (task_runtime_max) and the reduce task that finished last (finish_time_last). The total wait time of the job due to reduce tasks is:

reducer_wait_time = finish_time_last - (ideal_start_time + task_runtime_max)
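A minimal Python sketch of the map-phase formula (the reduce-phase one is identical once ideal_start_time has been derived from the slow-start setting); all names and numbers are illustrative:

    def map_wait_time(submit_time_ms, mappers):
        # mappers: one (finish_time_ms, runtime_ms) pair per map task.
        task_runtime_max = max(runtime for _, runtime in mappers)
        finish_time_last = max(finish for finish, _ in mappers)
        # Extra time beyond what the slowest mapper would have needed had it
        # started at the ideal start time (the job submission time).
        return finish_time_last - (submit_time_ms + task_runtime_max)

    mappers = [(1_000_000, 300_000), (1_200_000, 240_000)]
    print(map_wait_time(submit_time_ms=500_000, mappers=mappers))  # -> 400000 ms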
2. Heuristics
2.1. Map-Reduce
2.1.1. Mapper Data Skew
The Mapper Data Skew heuristic shows whether the job suffers from data skew across its mappers. The heuristic divides all mappers into two groups, where the first group's average input size is smaller than the second's.
For example, the first group might contain 900 mappers averaging 7 MB of input each, while the second contains 1200 mappers averaging 500 MB each.
2.1.1.1. Calculation
The heuristic grades the job by first recursively splitting the tasks into two groups based on their average memory consumption. The deviation is the difference between the two groups' averages divided by the smaller of the two averages.
Let us define the following variables:

deviation: the deviation in input bytes between the two groups
num_of_tasks: the number of map tasks
file_size: the average input size of the larger group
num_tasks_severity: list of severity thresholds for the number of tasks, e.g., num_tasks_severity = {10, 20, 50, 100}
deviation_severity: list of severity thresholds for the deviation of input bytes between the two groups, e.g., deviation_severity = {2, 4, 8, 16}
files_severity: list of severity thresholds for the fraction of the HDFS block size, e.g., files_severity = { ⅛, ¼, ½, 1}

Let us define the following functions:

func avg(x): returns the average of a list x
func len(x): returns the length of a list x
func min(x, y): returns the minimum of x and y
func getSeverity(x, y): compares value x with the severity thresholds in y and returns the severity

We compute the two groups recursively based on the average memory consumed by each. Call the two groups group_1 and group_2. Without loss of generality, assume that avg(group_1) > avg(group_2) and len(group_1) < len(group_2). Then:

deviation = (avg(group_1) - avg(group_2)) / min(avg(group_1), avg(group_2))
file_size = avg(group_1)
num_of_tasks = len(group_1)

The overall severity of the heuristic is:

severity = min(getSeverity(deviation, deviation_severity),
               getSeverity(file_size, files_severity),
               getSeverity(num_of_tasks, num_tasks_severity))
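For illustration, here is a Python sketch of the final severity computation, assuming the two groups have already been produced by the recursive split. Severities run from 0 (NONE) to 4 (CRITICAL); the threshold lists are the example values above, and the helper names are ours:

    def severity_asc(value, thresholds):
        # Number of ascending thresholds the value meets or exceeds: 0 (NONE) .. 4 (CRITICAL).
        return sum(value >= t for t in thresholds)

    def skew_severity(group_1, group_2, hdfs_block_mb=128.0):
        avg_1 = sum(group_1) / len(group_1)   # larger average input size, MB
        avg_2 = sum(group_2) / len(group_2)
        deviation = (avg_1 - avg_2) / min(avg_1, avg_2)
        return min(
            severity_asc(deviation, [2, 4, 8, 16]),
            severity_asc(avg_1 / hdfs_block_mb, [1/8, 1/4, 1/2, 1]),
            severity_asc(len(group_1), [10, 20, 50, 100]),
        )

    # 30 mappers read ~500 MB each while 900 read ~7 MB each.
    print(skew_severity([500.0] * 30, [7.0] * 900))  # -> 2 (MODERATE)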
2.1.1.2. Parameter Configuration
The threshold parameters deviation_severity, num_tasks_severity, and files_severity are easily configurable. For further details on configuring them, see the developer guide.
2.1.2. Mapper GC
The Mapper GC heuristic analyzes the GC efficiency of tasks. It computes the ratio of GC time to total CPU time.
2.1.2.1. Calculation
The heuristic computes the Mapper GC severity as follows. First, it calculates the average CPU time, average runtime, and average garbage-collection time over all tasks. The severity is then the minimum of the severity based on average runtime and the severity based on the ratio of average GC time to average CPU time.
Let us define the following variables:

avg_gc_time: average time spent garbage collecting
avg_cpu_time: average CPU time of all the tasks
avg_runtime: average runtime of all the tasks
gc_cpu_ratio: avg_gc_time / avg_cpu_time
gc_ratio_severity: list of severity thresholds for the ratio of avg_gc_time to avg_cpu_time
runtime_severity: list of severity thresholds for avg_runtime

Let us define the following functions:

func min(x, y): returns the minimum of x and y
func getSeverity(x, y): compares value x with the severity thresholds in y and returns the severity

The overall severity of the heuristic is:

severity = min(getSeverity(avg_runtime, runtime_severity),
               getSeverity(gc_cpu_ratio, gc_ratio_severity))
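A rough Python sketch of this rule; Dr. Elephant's real thresholds are configurable, so the values below are assumptions:

    def severity_asc(value, thresholds):
        return sum(value >= t for t in thresholds)  # 0 (NONE) .. 4 (CRITICAL)

    def gc_severity(avg_gc_ms, avg_cpu_ms, avg_runtime_min):
        gc_cpu_ratio = avg_gc_ms / avg_cpu_ms
        # Only long-running tasks that also burn CPU time in GC are flagged.
        return min(
            severity_asc(avg_runtime_min, [5, 10, 12, 15]),        # assumed, minutes
            severity_asc(gc_cpu_ratio, [0.01, 0.02, 0.03, 0.04]),  # assumed
        )

    print(gc_severity(avg_gc_ms=30_000, avg_cpu_ms=600_000, avg_runtime_min=14))  # -> 3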
2.1.2.2. Parameter Configuration
The threshold parameters gc_ratio_severity and runtime_severity are also easily configurable. For further details on configuring them, see the developer guide.
2.1.3. Mapper Memory
This heuristic checks the mappers' memory consumption: the ratio of the memory a task consumes to the memory its container requested. The consumed memory is the average of each task's peak physical memory snapshot; the requested memory is the job's mapreduce.map/reduce.memory.mb setting, the maximum physical memory the job can request.
2.1.3.1. Calculation
Let us define the following variables:

avg_physical_memory: average of the physical memory used by all tasks
container_memory: container memory
container_memory_severity: list of thresholds for the tasks' average container memory
memory_ratio_severity: list of thresholds for the ratio of avg_physical_memory to container_memory

Let us define the following functions:

func min(x, y): returns the minimum of x and y
func getSeverity(x, y): compares value x with the severity thresholds in y and returns the severity

The overall severity is:

severity = min(getSeverity(avg_physical_memory / container_memory, memory_ratio_severity),
               getSeverity(container_memory, container_memory_severity))
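A minimal sketch in Python, assuming memory in MB and illustrative thresholds; since *low* utilization is what wastes memory, the ratio is graded with a descending comparison:

    def severity_asc(value, thresholds):
        return sum(value >= t for t in thresholds)

    def severity_desc(value, thresholds):
        # For metrics where lower values are worse (e.g., low utilization).
        return sum(value <= t for t in thresholds)

    def memory_severity(avg_physical_mb, container_mb):
        utilization = avg_physical_mb / container_mb
        return min(
            severity_desc(utilization, [0.6, 0.5, 0.4, 0.3]),      # assumed
            severity_asc(container_mb, [1100, 1500, 2000, 2500]),  # assumed, MB
        )

    print(memory_severity(avg_physical_mb=600, container_mb=4096))  # -> 4 (CRITICAL)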
2.1.3.2. Parameter Configuration
The threshold parameters container_memory_severity and memory_ratio_severity are also easily configurable. For further details on configuring them, see the developer guide.
2.1.4. Mapper Speed
This heuristic analyzes how efficiently the mapper code runs: whether the mappers are CPU-bound or are processing too much data. It examines the relationship between the mappers' speed and the amount of data they process.
2.1.4.1. Calculation
The severity of this heuristic is the smaller of the severity based on the mappers' speed and the severity based on the mappers' runtime.
Let us define the following variables:

median_speed: median of the speeds of all the mappers, where a mapper's speed is the ratio of its input bytes to its runtime
median_size: median of the input sizes of all the mappers
median_runtime: median of the runtimes of all the mappers
disk_speed_severity: list of thresholds for median_speed
runtime_severity: list of severity thresholds for median_runtime

Let us define the following functions:

func min(x, y): returns the minimum of x and y
func getSeverity(x, y): compares value x with the severity thresholds in y and returns the severity

The overall severity of the heuristic is:

severity = min(getSeverity(median_speed, disk_speed_severity),
               getSeverity(median_runtime, runtime_severity))
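A Python sketch of the idea, assuming speed in MB/s and runtime in minutes; slow I/O only matters if the tasks also run long, hence the min. All thresholds are assumptions:

    def severity_asc(value, thresholds):
        return sum(value >= t for t in thresholds)

    def severity_desc(value, thresholds):
        return sum(value <= t for t in thresholds)

    def mapper_speed_severity(median_speed_mb_s, median_runtime_min):
        return min(
            severity_desc(median_speed_mb_s, [10, 5, 2, 1]),    # assumed, MB/s
            severity_asc(median_runtime_min, [5, 10, 15, 30]),  # assumed, minutes
        )

    print(mapper_speed_severity(median_speed_mb_s=1.5, median_runtime_min=20))  # -> 3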
2.1.4.2. Parameter Configuration
The threshold parameters disk_speed_severity and runtime_severity are easily configurable. For further details on configuring them, see the developer guide.
2.1.5. Mapper Spill
This heuristic gauges mapper performance from disk I/O. The mapper spill ratio (spilled records / output records) is a key indicator of mapper performance: a value close to 2 means almost every record was spilled and written to disk twice (once when the in-memory sort buffer overflows, and once when merging the spilled chunks). When this happens, the mappers are handling too much data.
2.1.5.1. Calculation
Let us define the following parameters:

total_spills: the sum of spilled records from all the map tasks
total_output_records: the sum of output records from all the map tasks
num_tasks: total number of tasks
ratio_spills: total_spills / total_output_records
spill_severity: list of thresholds for ratio_spills
num_tasks_severity: list of thresholds for the total number of tasks

Let us define the following functions:

func min(x, y): returns the minimum of x and y
func getSeverity(x, y): compares value x with the severity thresholds in y and returns the severity

The overall severity of the heuristic is:

severity = min(getSeverity(ratio_spills, spill_severity),
               getSeverity(num_tasks, num_tasks_severity))
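A minimal Python sketch; the thresholds below are assumptions chosen so that a ratio just above 2 starts to register:

    def severity_asc(value, thresholds):
        return sum(value >= t for t in thresholds)

    def spill_severity(total_spills, total_output_records, num_tasks):
        ratio_spills = total_spills / total_output_records
        return min(
            severity_asc(ratio_spills, [2.01, 2.2, 2.5, 3.0]),  # assumed
            severity_asc(num_tasks, [50, 100, 500, 1000]),      # assumed
        )

    # Each record was spilled ~2.4 times on average across 400 map tasks.
    print(spill_severity(2_400_000, 1_000_000, num_tasks=400))  # -> 2 (MODERATE)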
2.1.5.2. Parameter Configuration
The thresholds spill_severity and num_tasks_severity are easily configurable. For further details on configuring them, see the developer guide.
2.1.6. Mapper Time
This heuristic analyzes whether the number of mappers is appropriate; its results help tune the mapper-count setting for a job. The setting needs tuning when either of the following occurs:
- Mapper runtimes are very short. This typically happens when a job has:
  - too many mappers
  - a very short average mapper runtime
  - files that are too small
- Files are large or unsplittable. This typically happens when a job has:
  - too few mappers
  - a very long average mapper runtime
  - files that are too large (individual files reaching the GB scale)
2.1.6.1. Calculation
Let us define the following variables:

avg_size: average input size of all the mappers
avg_time: average runtime of all the tasks
num_tasks: total number of tasks
short_runtime_severity: list of thresholds for tasks with short runtime
long_runtime_severity: list of thresholds for tasks with long runtime
num_tasks_severity: list of thresholds for the number of tasks

Let us define the following functions:

func min(x, y): returns the minimum of x and y
func getSeverity(x, y): compares value x with the severity thresholds in y and returns the severity

The overall severity of the heuristic is:

short_task_severity = min(getSeverity(avg_time, short_runtime_severity),
                          getSeverity(num_tasks, num_tasks_severity))
severity = max(getSeverity(avg_time, long_runtime_severity), short_task_severity)
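A Python sketch of the two-sided check; the threshold lists are assumptions (short runtimes are graded with a descending comparison, since shorter is worse there):

    def severity_asc(value, thresholds):
        return sum(value >= t for t in thresholds)

    def severity_desc(value, thresholds):
        return sum(value <= t for t in thresholds)

    def mapper_time_severity(avg_time_min, num_tasks):
        # Many mappers that each finish quickly -> job dominated by overhead.
        short_task_severity = min(
            severity_desc(avg_time_min, [10, 4, 2, 1]),     # assumed, minutes
            severity_asc(num_tasks, [50, 101, 500, 1000]),  # assumed
        )
        # Mappers that each run very long -> splits are too large.
        long_task_severity = severity_asc(avg_time_min, [15, 30, 60, 120])  # assumed
        return max(short_task_severity, long_task_severity)

    print(mapper_time_severity(avg_time_min=1.5, num_tasks=900))  # -> 3 (SEVERE)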
2.1.6.2.參數配置
閾值short_runtime_severity 、long_runtime_severity以及num_tasks_severity可以很簡單的配置。如果想進一步了解參數配置的詳細資訊,可以點擊開發者指南查看。
2.1.7. Reducer Data Skew
This heuristic analyzes whether the data is skewed across reducers. It detects whether, when the reducers are divided into two groups, one group's input size is significantly larger than the other's.
2.1.7.1. Calculation
The heuristic grades the job by recursively splitting the tasks into two groups based on the average memory each group consumes. The deviation is the difference between the two groups' average memory consumption divided by the smaller of the two averages.
Let us define the following variables (the computation is identical to Mapper Data Skew in 2.1.1.1, with reduce tasks in place of map tasks, so the sketch there applies unchanged):

deviation: the deviation in input bytes between the two groups
num_of_tasks: the number of reduce tasks
file_size: the average input size of the larger group
num_tasks_severity: list of severity thresholds for the number of tasks, e.g., num_tasks_severity = {10, 20, 50, 100}
deviation_severity: list of severity thresholds for the deviation of input bytes between the two groups, e.g., deviation_severity = {2, 4, 8, 16}
files_severity: list of severity thresholds for the fraction of the HDFS block size, e.g., files_severity = { ⅛, ¼, ½, 1}

Let us define the following functions:

func avg(x): returns the average of a list x
func len(x): returns the length of a list x
func min(x, y): returns the minimum of x and y
func getSeverity(x, y): compares value x with the severity thresholds in y and returns the severity

We compute the two groups recursively based on the average memory consumed by each. Call the two groups group_1 and group_2. Without loss of generality, assume that avg(group_1) > avg(group_2) and len(group_1) < len(group_2). Then:

deviation = (avg(group_1) - avg(group_2)) / min(avg(group_1), avg(group_2))
file_size = avg(group_1)
num_of_tasks = len(group_1)

The overall severity of the heuristic is:

severity = min(getSeverity(deviation, deviation_severity),
               getSeverity(file_size, files_severity),
               getSeverity(num_of_tasks, num_tasks_severity))
2.1.7.2. Parameter Configuration
The thresholds deviation_severity, num_tasks_severity, and files_severity are easily configurable. For further details on configuring them, see the developer guide.
2.1.8. Reducer GC
This heuristic analyzes the GC efficiency of tasks and reports the ratio of GC time to CPU time.
2.1.8.1. Calculation
First, the average CPU time, average runtime, and average garbage-collection time of all tasks are computed. The heuristic then takes the minimum of the severity based on average runtime and the severity based on the ratio of average GC time to average CPU time.
Let us define the following variables (the computation is identical to Mapper GC in 2.1.2.1):

avg_gc_time: average time spent garbage collecting
avg_cpu_time: average CPU time of all the tasks
avg_runtime: average runtime of all the tasks
gc_cpu_ratio: avg_gc_time / avg_cpu_time
gc_ratio_severity: list of severity thresholds for the ratio of avg_gc_time to avg_cpu_time
runtime_severity: list of severity thresholds for avg_runtime

Let us define the following functions:

func min(x, y): returns the minimum of x and y
func getSeverity(x, y): compares value x with the severity thresholds in y and returns the severity

The overall severity of the heuristic is:

severity = min(getSeverity(avg_runtime, runtime_severity),
               getSeverity(gc_cpu_ratio, gc_ratio_severity))
2.1.8.2. Parameter Configuration
The thresholds gc_ratio_severity and runtime_severity are easily configurable. For further details on configuring them, see the developer guide.
2.1.9. Reducer Memory
This heuristic shows the tasks' memory utilization. It compares the memory each task consumed with the memory its container requested. The consumed memory is the average of each task's peak memory usage; the requested memory is the task's mapreduce.map/reduce.memory.mb setting, the maximum physical memory the task can use.
2.1.9.1. Calculation
Let us define the following variables (the computation is identical to Mapper Memory in 2.1.3.1):

avg_physical_memory: average of the physical memory used by all tasks
container_memory: container memory
container_memory_severity: list of thresholds for the tasks' average container memory
memory_ratio_severity: list of thresholds for the ratio of avg_physical_memory to container_memory

Let us define the following functions:

func min(x, y): returns the minimum of x and y
func getSeverity(x, y): compares value x with the severity thresholds in y and returns the severity

The overall severity is:

severity = min(getSeverity(avg_physical_memory / container_memory, memory_ratio_severity),
               getSeverity(container_memory, container_memory_severity))
2.1.9.2. Parameter Configuration
The thresholds container_memory_severity and memory_ratio_severity are easily configurable. For further details on configuring them, see the developer guide.
2.1.10. Reducer Time
This heuristic analyzes the reducers' efficiency, which helps tune the number of reducers for a job. The reducer count needs tuning when either of the following occurs:
- Too many reducers: the Hadoop job shows a large number of reducers, each with a very short runtime.
- Too few reducers: the Hadoop job shows a small number of reducers, each with a very long runtime.
2.1.10.1. Calculation
Let us define the following variables (the computation mirrors Mapper Time in 2.1.6.1):

avg_size: average input size of all the tasks
avg_time: average runtime of all the tasks
num_tasks: total number of tasks
short_runtime_severity: list of thresholds for tasks with short runtime
long_runtime_severity: list of thresholds for tasks with long runtime
num_tasks_severity: list of thresholds for the number of tasks

Let us define the following functions:

func min(x, y): returns the minimum of x and y
func getSeverity(x, y): compares value x with the severity thresholds in y and returns the severity

The overall severity of the heuristic is:

short_task_severity = min(getSeverity(avg_time, short_runtime_severity),
                          getSeverity(num_tasks, num_tasks_severity))
severity = max(getSeverity(avg_time, long_runtime_severity), short_task_severity)
2.1.10.2.參數配置
閾值參數short_runtime_severity、long_runtime_severity以及num_tasks_severity可以很簡單的配置,如果想進一步了解參數配置的詳細過程,可以點擊開發者指南查看。
2.1.11. Shuffle & Sort
This heuristic analyzes how long reducers spend shuffling and sorting relative to their total runtime, which gauges the reducers' execution efficiency.
2.1.11.1. Calculation
Let us define the following variables:

avg_exec_time: average time all tasks spent in execution
avg_shuffle_time: average time spent shuffling
avg_sort_time: average time spent sorting
runtime_ratio_severity: list of thresholds for the ratio of twice the average shuffle or sort time to the average execution time
runtime_severity: list of thresholds for the runtime of the shuffle or sort stages

The overall severity is:

severity = max(shuffle_severity, sort_severity)

where shuffle_severity and sort_severity are:

shuffle_severity = min(getSeverity(avg_shuffle_time, runtime_severity),
                       getSeverity(avg_shuffle_time * 2 / avg_exec_time, runtime_ratio_severity))
sort_severity = min(getSeverity(avg_sort_time, runtime_severity),
                    getSeverity(avg_sort_time * 2 / avg_exec_time, runtime_ratio_severity))
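A Python sketch of the rule: a phase is flagged only when it is long both in absolute terms and relative to the time spent executing reducer code. Thresholds are assumed:

    def severity_asc(value, thresholds):
        return sum(value >= t for t in thresholds)  # 0 (NONE) .. 4 (CRITICAL)

    def shuffle_sort_severity(avg_exec_min, avg_shuffle_min, avg_sort_min):
        runtime_thresholds = [1, 5, 10, 30]  # assumed, minutes
        ratio_thresholds = [1, 2, 4, 8]      # assumed
        def phase_severity(avg_phase_min):
            return min(
                severity_asc(avg_phase_min, runtime_thresholds),
                severity_asc(avg_phase_min * 2 / avg_exec_min, ratio_thresholds),
            )
        return max(phase_severity(avg_shuffle_min), phase_severity(avg_sort_min))

    print(shuffle_sort_severity(avg_exec_min=10, avg_shuffle_min=25, avg_sort_min=2))  # -> 3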
2.1.11.2. Parameter Configuration
The threshold parameters runtime_ratio_severity and runtime_severity are easily configurable. For further details on configuring them, see the developer guide.
2.2. Spark
2.2.1. Spark Event Log Limit
The Spark event-log processor currently cannot handle very large log files: Dr. Elephant takes a long time to process a huge Spark event log, which may threaten its own stability. Therefore, a log size limit (100 MB) is currently enforced; logs above this size are handled by spawning a separate process.
2.2.1.1. Calculation
If the data was throttled, the heuristic reports the highest severity, CRITICAL; otherwise, the severity is NONE.
2.2.2. Spark Executor Load Balance
Unlike the Map/Reduce execution model, a Spark application allocates all the resources it needs at startup and does not release them until the whole application finishes. Because of this, balancing the load across a Spark application's executors is important, so that individual cluster nodes are not overloaded.
2.2.2.1. Calculation
Let us define the following variables:

peak_memory: list of peak memory values for all executors
durations: list of durations of all executors
inputBytes: list of input bytes of all executors
outputBytes: list of output bytes of all executors
looser_metric_deviation_severity: list of thresholds for deviation severity (loose bounds)
metric_deviation_severity: list of thresholds for deviation severity (tight bounds)

Let us define the following functions:

func getDeviation(x): returns max(|maximum - avg|, |minimum - avg|) / avg, where
    x = list of values
    maximum = maximum of the values in x
    minimum = minimum of the values in x
    avg = average of the values in x
func getSeverity(x, y): compares value x with the severity thresholds in y and returns the severity
func max(x, y): returns the maximum of x and y
func Min(l): returns the minimum of a list l

The overall severity is:

severity = Min(getSeverity(getDeviation(peak_memory), looser_metric_deviation_severity),
               getSeverity(getDeviation(durations), metric_deviation_severity),
               getSeverity(getDeviation(inputBytes), metric_deviation_severity),
               getSeverity(getDeviation(outputBytes), looser_metric_deviation_severity))
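A Python sketch with assumed bounds. Note the overall minimum: every metric must look skewed before the application is flagged as unbalanced.

    def severity_asc(value, thresholds):
        return sum(value >= t for t in thresholds)  # 0 (NONE) .. 4 (CRITICAL)

    def get_deviation(values):
        avg = sum(values) / len(values)
        return max(abs(max(values) - avg), abs(min(values) - avg)) / avg

    def load_balance_severity(peak_memory, durations, input_bytes, output_bytes):
        looser = [0.8, 1.0, 1.2, 1.4]  # assumed loose bounds
        tight = [0.4, 0.6, 0.8, 1.0]   # assumed tight bounds
        return min(
            severity_asc(get_deviation(peak_memory), looser),
            severity_asc(get_deviation(durations), tight),
            severity_asc(get_deviation(input_bytes), tight),
            severity_asc(get_deviation(output_bytes), looser),
        )

    # Three executors; the third one does most of the work.
    print(load_balance_severity([4, 4, 13], [30, 35, 80], [10, 10, 45], [5, 6, 20]))  # -> 1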
2.2.2.2.參數配置
閾值參數looser_metric_deviation_severity和metric_deviation_severity可以簡單的進行配置。如果想進一步了解參數配置的詳細過程,可以點擊開發者指南查看。
2.2.3.Spark 任務運行時間
這部分啟發式演算法對Spark任務的運行時間進行調優分析。每個Spark應用程式可以拆分成多個任務,每個任務又可以拆分成多個運行階段。
2.2.3.1.計算
Let us define the following variables:

avg_job_failure_rate: average job failure rate
avg_job_failure_rate_severity: list of thresholds for the average job failure rate

And the following variables for each job:

single_job_failure_rate: failure rate of a single job
single_job_failure_rate_severity: list of thresholds for the single-job failure rate

The severity of the heuristic is the maximum of single_job_failure_rate_severity over all jobs and avg_job_failure_rate_severity, i.e.:

severity = max(getSeverity(single_job_failure_rate, single_job_failure_rate_severity),
               getSeverity(avg_job_failure_rate, avg_job_failure_rate_severity))

where single_job_failure_rate is computed for all the jobs.
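A minimal Python sketch over per-job failure rates; the threshold lists are assumptions:

    def severity_asc(value, thresholds):
        return sum(value >= t for t in thresholds)  # 0 (NONE) .. 4 (CRITICAL)

    def job_runtime_severity(job_failure_rates):
        single_thresholds = [0.2, 0.4, 0.6, 0.8]  # assumed
        avg_thresholds = [0.1, 0.2, 0.3, 0.4]     # assumed
        avg_rate = sum(job_failure_rates) / len(job_failure_rates)
        worst_single = max(severity_asc(r, single_thresholds) for r in job_failure_rates)
        return max(worst_single, severity_asc(avg_rate, avg_thresholds))

    # Three jobs; one of them failed 60% of its attempts.
    print(job_runtime_severity([0.0, 0.25, 0.6]))  # -> 3 (SEVERE)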
2.2.3.2.參數配置
閾值參數single_job_failure_rate_severity和avg_job_failure_rate_severity可以很簡單的進行配置。更多詳細資訊,可以點擊開發者指南查看。
2.2.4.Spark 記憶體限制
目前,Spark應用程式缺少動態資源分配的功能。與Map/Reduce任務不同,能夠為每個map/reduce進程分配所需要的資源,並且在執行過程中逐步釋放佔用的資源。而Spark在應用程式執行時,會一次性的申請所需要的所有資源,直到任務結束才釋放這些資源。過多的記憶體使用會對集群節點的穩定性產生影響。所以,我們需要限制Spark應用程式能使用的最大記憶體比例。
2.2.4.1.計算
Let us define the following variables:

total_executor_memory: total memory of all the executors
total_storage_memory: total memory allocated for storage by all the executors
total_driver_memory: total driver memory allocated
peak_memory: total memory used at peak
mem_utilization_severity: list of thresholds for memory utilization
total_memory_severity_in_tb: list of thresholds for total memory

Let us define the following functions:

func max(x, y): returns the maximum of x and y
func getSeverity(x, y): compares value x with the severity thresholds in y and returns the severity

The overall severity is:

severity = max(getSeverity(total_executor_memory, total_memory_severity_in_tb),
               getSeverity(peak_memory / total_storage_memory, mem_utilization_severity))
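A Python sketch of the rule: large reservations and low peak utilization both raise the severity. Units and thresholds below are assumptions:

    def severity_asc(value, thresholds):
        return sum(value >= t for t in thresholds)

    def severity_desc(value, thresholds):
        return sum(value <= t for t in thresholds)

    def memory_limit_severity(total_executor_memory_tb, peak_memory_tb,
                              total_storage_memory_tb):
        utilization = peak_memory_tb / total_storage_memory_tb
        return max(
            severity_asc(total_executor_memory_tb, [0.5, 1.0, 1.5, 2.0]),  # assumed, TB
            severity_desc(utilization, [0.8, 0.6, 0.4, 0.2]),              # assumed
        )

    print(memory_limit_severity(1.2, peak_memory_tb=0.3, total_storage_memory_tb=0.9))  # -> 3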
2.2.4.2.參數配置
閾值參數total_memory_severity_in_tb和mem_utilization_severity可以很簡單的配置。進一步了解,可以點擊開發者指南查看。
2.2.5.Spark 階段運行時間
與Spark任務運行時間一樣,Spark應用程式可以分為多個任務,每個任務又可以分為多個運行階段。
2.2.5.1.計算
Let us define the following variables for each Spark job:

stage_failure_rate: the job's stage failure rate
stage_failure_rate_severity: list of thresholds for the stage failure rate

And the following variables for each stage of a Spark job:

task_failure_rate: the stage's task failure rate
runtime: the runtime of a single stage
single_stage_tasks_failure_rate_severity: list of thresholds for a stage's task failure rate
stage_runtime_severity_in_min: list of thresholds for stage runtime

Let us define the following functions:

func max(x, y): returns the maximum of x and y
func getSeverity(x, y): compares value x with the severity thresholds in y and returns the severity

The overall severity is:

severity_stage = max(getSeverity(task_failure_rate, single_stage_tasks_failure_rate_severity),
                     getSeverity(runtime, stage_runtime_severity_in_min))
severity_job = getSeverity(stage_failure_rate, stage_failure_rate_severity)
severity = max(severity_stage, severity_job)

where task_failure_rate is computed for all the tasks.
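A Python sketch for one job, grading each stage on task failures and runtime and combining with the job's stage-failure rate; all thresholds are assumptions:

    def severity_asc(value, thresholds):
        return sum(value >= t for t in thresholds)  # 0 (NONE) .. 4 (CRITICAL)

    def stage_runtime_severity(stage_failure_rate, stages):
        # stages: one (task_failure_rate, runtime_min) pair per stage.
        severity_stage = max(
            max(severity_asc(task_fail, [0.2, 0.4, 0.6, 0.8]),  # assumed
                severity_asc(runtime_min, [15, 30, 60, 120]))   # assumed, minutes
            for task_fail, runtime_min in stages
        )
        severity_job = severity_asc(stage_failure_rate, [0.2, 0.4, 0.6, 0.8])  # assumed
        return max(severity_stage, severity_job)

    # Two stages: a quick clean one, and a 70-minute stage with 50% task failures.
    print(stage_runtime_severity(0.25, [(0.0, 10), (0.5, 70)]))  # -> 3 (SEVERE)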
2.2.5.2. Parameter Configuration
The threshold parameters single_stage_tasks_failure_rate_severity, stage_runtime_severity_in_min, and stage_failure_rate_severity are easily configurable. For more details, see the developer guide.
This chapter is fairly long; the proper nouns and parameter behavior mentioned here can be explored further in Dr. Elephant's dashboard.