We monitor backups a little bit more advanced - I thought I should share this knowledge as well...
Single Job Monitoring:
Monitor a single job and it's results - allows you configure the Job-Name with a PRTG filter value in the SQL sensor.
The results will include various values - most notable are:
- FinalJobStatus (as text)
- TotalDataSizeBytes
- TotalNumberOfDirectories
- TotalNumberOfFiles
- TotalRateMBMin
In order to implement this sensor, add a SQLv2 Sensor and configure like this:
- Database: BEDB
- Instance Name: BKUPEXEC (in most cases)
- Use input parameter: specify the exact job name
- DBNull = Error
Channels:
Most channels are rather simple to configure, they are counters, SpeedDisk or BytesDisk - as PRTG has those channel types integrated already.
The special channel is FINALJOBSTATUS - in order to have this working you will need the "backupexec.jobstatus.ovl" file in your %programfilesx86%\PRTG...\Lookups directory - see below for the file.
SQL Script for single job monitoring:
SELECT TOP 1
J.JobName,
H.TotalDataSizeBytes, H.TotalRateMBMin, H.TotalNumberOfFiles, H.TotalNumberOfDirectories,
H.FinalJobStatus,
H.ActualStartTime, H.EndTime, H.ElapsedTimeSeconds
FROM Jobs J
LEFT JOIN TaskDefinition T ON T.JobDefinitionID = J.JobDefinitionID
LEFT JOIN Schedule S ON S.ScheduleId = T.ScheduleID
LEFT JOIN JobHistorySummary H ON H.JobID = J.JobID
WHERE S.SystemId IS NOT NULL
AND H.ActualStartTime > GETDATE()-8 --adjust this value if needed, 8 helps for the last 8 days.. weekly backups (full backups)
AND J.JobName LIKE @prtg
ORDER BY H.ActualStartTime DESC
The backupexec.jobstatus.ovl file:
<?xml version="1.0" encoding="UTF-8"?>
<ValueLookup id="backupexec.jobstatus" desiredValue="1" undefinedState="error" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="PaeValueLookup.xsd">
<Lookups>
<SingleInt state="warning" value="1">
Job Cancelled
</SingleInt>
<SingleInt state="Ok" value="2">
Job Completed
</SingleInt>
<SingleInt state="warning" value="3">
Job Exception
</SingleInt>
<SingleInt state="warning" value="4">
Job Dispatched
</SingleInt>
<SingleInt state="warning" value="5">
Job On Hold
</SingleInt>
<SingleInt state="error" value="6">
Job Error
</SingleInt>
<SingleInt state="error" value="7">
Job Invalid Schedule
</SingleInt>
<SingleInt state="error" value="8">
Job Invalid Time Window
</SingleInt>
<SingleInt state="error" value="9">
unknown status code (9)
</SingleInt>
<SingleInt state="error" value="10">
Job Missed
</SingleInt>
<SingleInt state="warning" value="11">
Job Paused
</SingleInt>
<SingleInt state="warning" value="12">
Job Pending
</SingleInt>
<SingleInt state="warning" value="13">
Job Recovered
</SingleInt>
<SingleInt state="error" value="14">
Job Resource Disabled
</SingleInt>
<SingleInt state="warning" value="15">
Job Resumed
</SingleInt>
<SingleInt state="Ok" value="16">
Job Running
</SingleInt>
<SingleInt state="Ok" value="17">
Job Ready
</SingleInt>
<SingleInt state="Ok" value="18">
Job Scheduled
</SingleInt>
<SingleInt state="Ok" value="19">
Job Success
</SingleInt>
<SingleInt state="warning" value="20">
Job Superceded
</SingleInt>
<SingleInt state="error" value="21">
Job Abort (Time)
</SingleInt>
<SingleInt state="warning" value="22">
Job To Be Scheduled
</SingleInt>
<SingleInt state="warning" value="23">
Job Linked to Another Job
</SingleInt>
<SingleInt state="error" value="24">
unknown status value (24)
</SingleInt>
<SingleInt state="error" value="25">
Job Blocked by Template Rule
</SingleInt>
</Lookups>
</ValueLookup>
Another SQL script we use is the one below - this actually approaches the whole monitoring more in an overview - it still depends on the JobHistory table, meaning, the job must have been running.. in theory you could work around this and actually get information on the scheduler etc.. the script below is a pure example.
DECLARE @JobFilter AS VARCHAR(500)
DECLARE @TimeWindow AS Integer
SET @JobFilter = '%-FULL' -- we only want to see jobs that with "-FULL" at the end of the job name / use @PRGT if you want to control it in PRTG
SET @TimeWindow = 7 --7 days in each direction for history and scheduled jobs
SELECT
COUNT(J.JobName) AS TotalJobs,
SUM(CASE WHEN H.IsJobActive = 1 THEN 1 ELSE 0 END) AS ActiveJobs,
-- SUM(CASE WHEN H.FinalJobStatus = 1 THEN 1 ELSE 0 END) AS FinalStatus_JobCancelled,
-- SUM(CASE WHEN H.FinalJobStatus = 2 THEN 1 ELSE 0 END) AS FinalStatus_JobCompleted,
-- SUM(CASE WHEN H.FinalJobStatus = 3 THEN 1 ELSE 0 END) AS FinalStatus_JobException,
-- SUM(CASE WHEN H.FinalJobStatus = 4 THEN 1 ELSE 0 END) AS FinalStatus_JobDispatched,
-- SUM(CASE WHEN H.FinalJobStatus = 5 THEN 1 ELSE 0 END) AS FinalStatus_JobOnHold,
SUM(CASE WHEN H.FinalJobStatus = 6 THEN 1 ELSE 0 END) AS FinalStatus_JobError,
-- SUM(CASE WHEN H.FinalJobStatus = 7 THEN 1 ELSE 0 END) AS FinalStatus_JobInvalidSchedule,
-- SUM(CASE WHEN H.FinalJobStatus = 8 THEN 1 ELSE 0 END) AS FinalStatus_JobInvalidTimeWindow,
-- SUM(CASE WHEN H.FinalJobStatus = 9 THEN 1 ELSE 0 END) AS FinalStatus_unknownStatusCode9,
-- SUM(CASE WHEN H.FinalJobStatus = 10 THEN 1 ELSE 0 END) AS FinalStatus_JobMissed,
-- SUM(CASE WHEN H.FinalJobStatus = 11 THEN 1 ELSE 0 END) AS FinalStatus_JobPaused,
-- SUM(CASE WHEN H.FinalJobStatus = 12 THEN 1 ELSE 0 END) AS FinalStatus_JobPending,
--SUM(CASE WHEN H.FinalJobStatus = 13 THEN 1 ELSE 0 END) AS FinalStatus_JobRecovered,
-- SUM(CASE WHEN H.FinalJobStatus = 14 THEN 1 ELSE 0 END) AS FinalStatus_JobResourceDisabled,
-- SUM(CASE WHEN H.FinalJobStatus = 15 THEN 1 ELSE 0 END) AS FinalStatus_JobResumed,
SUM(CASE WHEN H.FinalJobStatus = 16 THEN 1 ELSE 0 END) AS FinalStatus_JobRunning,
-- SUM(CASE WHEN H.FinalJobStatus = 17 THEN 1 ELSE 0 END) AS FinalStatus_JobReady,
-- SUM(CASE WHEN H.FinalJobStatus = 18 THEN 1 ELSE 0 END) AS FinalStatus_JobScheduled,
SUM(CASE WHEN H.FinalJobStatus = 19 THEN 1 ELSE 0 END) AS FinalStatus_JobSuccess,
SUM(CASE WHEN H.FinalJobStatus = 20 THEN 1 ELSE 0 END) AS FinalStatus_JobSuperceded,
-- SUM(CASE WHEN H.FinalJobStatus = 21 THEN 1 ELSE 0 END) AS FinalStatus_JobAbort,
-- SUM(CASE WHEN H.FinalJobStatus = 22 THEN 1 ELSE 0 END) AS FinalStatus_JobToBeScheduled,
-- SUM(CASE WHEN H.FinalJobStatus = 23 THEN 1 ELSE 0 END) AS FinalStatus_JobLinkedToAnotherJob,
-- SUM(CASE WHEN H.FinalJobStatus = 24 THEN 1 ELSE 0 END) AS FinalStatus_unknownStatusCode24,
-- SUM(CASE WHEN H.FinalJobStatus = 25 THEN 1 ELSE 0 END) AS FinalStatus_JobBlockedByTemplateTule,
SUM(CASE WHEN H.FinalJobStatus = 3 THEN 1 ELSE 0 END) AS FinalStatus_Exceptions --3 = With Exceptions
-- SUM(CASE WHEN H.FinalJobStatus = 19 THEN 1 ELSE 0 END) AS FinalStatus_SuccessFull --19 = Successful
FROM Jobs J
LEFT JOIN TaskDefinition T ON T.JobDefinitionID = J.JobDefinitionID
LEFT JOIN Schedule S ON S.ScheduleId = T.ScheduleID
LEFT JOIN JobHistorySummary H ON H.JobID = J.JobID
WHERE S.SystemId IS NOT NULL
AND H.ActualStartTime > GETDATE() - @TimeWindow
AND J.NextDueDate < GETDATE() + @TimeWindow
AND J.JobName LIKE @JobFilter
Finally I wanted to mention what are our real challenges are, and we don't yet have a really good solution:
Our backup runs FULL starting Friday evening... during the week we run incremental backups. Now the incremental backups are not as critical... so let's focus on the weekends.
What happened every now and then was that e.g. only a few tapes where write able and other might still have been locked or one of our libraries jammed etc..
In the end, it means - we e.g. came in Monday morning and discovered that 50+ % of the backups did not run.
Now, the question is, how do you monitor this. There are about a 150 jobs - they are stacked on each other. In theory I expect let say 5 running jobs, 0 completed and a 145 pending - starting Friday night - over the weekend this number will change constantly.
What I did not yet find is that a good solution that when Backup Exec waits for user interaction like insert tapes, offline library, etc. does wait for user interaction.
As well as the fact that I can't tell PRTG on Friday I expect e.g. 150 jobs pending, on Saturday 1 PM the number should be more like 75 jobs pending and on Sunday 6 AM is should be down to 50 pending and Sunday 8 PM it should be 0 pending and 150 successful.
This is very granular, making it hard to find a solution. The jobs in our case will not finish - they are within their weekend time-window and will not be auto cancelled and therefor only manually looking in to Backup Exec will tell you if we are making progress or not.
It could be a solution to constantly see if the Total Bytes backed up goes up - but this again is challenging, we would need to compare values over time.. PRTG is as far as I know not directly able to do so and this would mean we would need to have a temp file with values form the last check in some kind of script or database that we would compare too...
So far I did not come up with the ultimate solution - every now and then I think about it a little more.. but well, I am not there yet.
Kind regards
Florian Rossmark
Add comment