“Why did the algorithm assign that cause?”
After using the InterVA-5 algorithm to assign causes of death to verbal autopsy (VA) data, you may want to understand why the algorithm assigned a particular cause. On further inspection, you may also wonder why a different cause was not assigned. We have included two new tools in the InterVA-5 package that help users explore these questions:
getTopSymptoms() – list the “most important symptoms” for the causes assigned by InterVA-5 (and tell me if that symptom is present for a death)
whyNotCOD() – for any cause, list the “most important symptoms” and tell me if each symptom is present for a death
Without an infinite amount of data, it is not easy to define the “most important symptom”. Here, we take a very crude approach and use the conditional probabilities that can be found in the matrix probbaseV5 (a data object that is included with the InterVA-5 package). These conditional probabilities, or \(P(S_i | C_j)\), refer to the probability of observing symptom \(S_i\) if the death is due to cause \(C_j\). While this metric is far from perfect, we hope that it will provide some insight into why a particular cause is (or is not) assigned by InterVA-5.
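To make the idea of ranking symptoms by \(P(S_i | C_j)\) concrete, here is a minimal sketch in base R. Everything in it is hypothetical: the symptom names, the cause names, the probabilities, and the helper rankSymptoms() are made up for illustration and are not part of the InterVA5 package or taken from probbaseV5.

```r
## Toy illustration only: symptoms, causes, and probabilities are made up
## and are NOT taken from probbaseV5.
condProb <- matrix(c(0.50, 0.05, 0.01,    # P(symptom | cause1)
                     0.10, 0.50, 0.002),  # P(symptom | cause2)
                   nrow = 3,
                   dimnames = list(c("symptomA", "symptomB", "symptomC"),
                                   c("cause1", "cause2")))

## Which symptoms were reported as present for one (hypothetical) death
present <- c(symptomA = TRUE, symptomB = FALSE, symptomC = TRUE)

## Hypothetical helper: rank the symptoms that are present by P(symptom | cause)
rankSymptoms <- function(condProb, present, cause) {
  reported <- names(present)[present]
  probs <- setNames(condProb[reported, cause], reported)
  sort(probs, decreasing = TRUE)
}

ranked <- rankSymptoms(condProb, present, cause = "cause1")
ranked
```

The symptoms that are present and have the largest conditional probabilities for a cause appear first, which is the crude notion of “most important” used in the rest of this document.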
For these examples we will use a fake VA data set that is included with the InterVA-5 package:
library(InterVA5)
data(RandomVA5)
results <- InterVA5(Input = RandomVA5, HIV = "l", Malaria = "l", write = FALSE)
#> Using Probbase version: probbase v17 180907
#> ..........10% completed
#> ..........20% completed
#> ..........30% completed
#> ..........40% completed
#> ..........50% completed
#> ..........60% completed
#> ..........70% completed
#> ..........80% completed
#> ..........90% completed
#> ..........100% completed
#> 100% completed
summary(results)
#> InterVA5 fitted on 200 deaths
#> CSMF calculated using reported causes by InterVA5 only
#> The remaining probabilities are assigned to 'Undetermined'
#>
#> Top 5 CSMFs:
#> cause likelihood
#> HIV/AIDS related death 0.1744
#> Undetermined 0.1350
#> Other and unspecified infect dis 0.0764
#> Renal failure 0.0713
#> Digestive neoplasms 0.0611
#>
#> Top 5 Circumstance of Mortality Category:
#> cause likelihood
#> Knowledge 0.365
#> Culture 0.240
#> Multiple 0.215
#> Inevitable 0.110
#> Resources 0.035
resultsTopCauses <- getTopCOD(results)
resultsTopCauses[1,]
#> ID cause
#> 1 d1 Stroke
InterVA-5 assigned the cause of the first death in our data set to Stroke. We can use the new functions to explore this cause assignment in greater detail.
getTopSymptoms()
Let us take a look at the conditional probabilities associated with the first death in the data set. To get started, we can use the getTopSymptoms() function (in the InterVA5 package), which can take up to 5 arguments:
object: an InterVA5 object (i.e., what the InterVA5() function returns)
data: the input data used when producing the InterVA5 object
IDs: a vector of IDs that identify the deaths for which you want results
pretty: a logical indicating if you want the results in an easy-to-read format (default is TRUE)
includeAll: a logical indicating if you want all of the symptoms included in the output, even those which are absent or have a value of missing/no (default is FALSE, which only includes symptoms that are present)
The following chunk of code provides an example, where we only ask for results pertaining to the first death in our data set and we only want information about the symptoms that are present for this death.
topSymptoms <- getTopSymptoms(object = results, ## an InterVA5 object
data = RandomVA5, ## input data for InterVA5
IDs = results$ID[1], ## IDs of the deaths you want to analyze
pretty = TRUE, ## make the output pretty
includeAll = FALSE) ## include symptoms that are not present?
names(topSymptoms)
#> [1] "ID1"
names(topSymptoms$ID1)
#> [1] "ID" "Causelabels" "Cause1_Symptoms" "Cause2_Symptoms"
#> [5] "Cause3_Symptoms"
topSymptoms$ID1$ID
#> [1] "d1"
topSymptoms$ID1$Causelabels
#> [1] "Stroke"
topSymptoms$ID1$Cause1_Symptoms
#> [1] "(i019a) Was he male? [ cond. prob: 0.5]"
#> [2] "(i022a) Was s(he) aged 65 years or more at death? [ cond. prob: 0.5]"
#> [3] "(i263b) Did (s)he have difficulty with swallowing liquids? [ cond. prob: 0.5]"
#> [4] "(i261o) During the illness that led to death, did (s)he have difficulty swallowing? [ cond. prob: 0.05]"
#> [5] "(i451o) Did (s)he use motorised transport to get to the hospital or health facility? [ cond. prob: 0.05]"
#> [6] "(i458o) In the final days before death, did anyone use a telephone or cell phone to call for help? [ cond. prob: 0.05]"
#> [7] "(i246o) During the illness that led to death, did (s)he have stiffness of the whole body or was unable to open the mouth? [ cond. prob: 0.01]"
#> [8] "(i223o) During the illness that led to death, did (s)he have any urine problems? [ cond. prob: 0.005]"
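The strings returned with pretty = TRUE follow a regular pattern: the symptom code in parentheses, the question text, and the conditional probability in square brackets. If you want the numbers back for further work, one option is to parse these strings. The sketch below assumes that pattern holds and uses only base R; the helper parseTopSymptoms() is hypothetical (it is not part of the InterVA5 package), and the input strings are copied from the output above.

```r
## Three of the strings printed above for the first death (cause: Stroke)
prettyOutput <- c(
  "(i019a) Was he male? [ cond. prob: 0.5]",
  "(i261o) During the illness that led to death, did (s)he have difficulty swallowing? [ cond. prob: 0.05]",
  "(i223o) During the illness that led to death, did (s)he have any urine problems? [ cond. prob: 0.005]"
)

## Hypothetical helper: split each string into a symptom code and a probability
parseTopSymptoms <- function(x) {
  code <- sub("^\\(([[:alnum:]]+)\\).*$", "\\1", x)
  ## the character class also accepts scientific notation such as 5e-04
  prob <- as.numeric(sub("^.*cond\\. prob: ([0-9.eE+-]+)\\]$", "\\1", x))
  data.frame(code = code, prob = prob, stringsAsFactors = FALSE)
}

parsed <- parseTopSymptoms(prettyOutput)
parsed
```

Requesting pretty = FALSE avoids the parsing step altogether, so this is mainly useful when only the printed strings are at hand.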
The output from getTopSymptoms() can be quite extensive, particularly if you ask for results for more than one death (or you use the option includeAll = TRUE). The function returns a list with elements ID1, ID2, …, and so on. These refer to the arguments you passed to the IDs parameter, so if you provide one ID the list will include ID1, if you provide two IDs the list will include ID1 and ID2, etc. Note that the variable used to identify the deaths is included in both the results (results$ID) and in the input data (RandomVA5$ID). Consider the above results, ID1, which include:
ID: the actual value used to identify this death
Cause1_Symptoms: a character vector describing the symptom variable name, symptom description, and the value of the conditional probability for the most likely cause of death assigned by InterVA-5 (these elements are ranked and ordered by this conditional probability). If includeAll = TRUE, there would be additional information telling us if the symptom was present (yes/no). If pretty = FALSE, the output would be returned as a matrix, which is difficult to read in the R console (but may be easier to work with if you want to do further manipulation of these results).
Cause2_Symptoms and Cause3_Symptoms: if InterVA-5 had assigned a second or third cause to this death, the corresponding symptoms would be listed here. This was not the case so these returned objects have a value of NULL.
topSymptoms$ID1$Cause2_Symptoms
#> NULL
results$VA5[[1]]$CAUSE2
#> [1] " "
data("causetextV5")
causetextV5[8, 2]
#> CAUSETXT.C40
#> "Malaria"
results$VA5[[1]]$CAUSE2 <- "Malaria" ## manually set a second cause (for illustration only)
topSymptoms <- getTopSymptoms(object = results, ## an InterVA5 object
data = RandomVA5, ## input data for InterVA5
IDs = results$ID[1], ## IDs of the deaths you want to analyze
pretty = TRUE, ## make the output pretty
includeAll = FALSE) ## include symptoms that are not present?
topSymptoms$ID1$Cause2_Symptoms
#> [1] "(i019a) Was he male? [ cond. prob: 0.5]"
#> [2] "(i022a) Was s(he) aged 65 years or more at death? [ cond. prob: 0.1]"
#> [3] "(i451o) Did (s)he use motorised transport to get to the hospital or health facility? [ cond. prob: 0.05]"
#> [4] "(i458o) In the final days before death, did anyone use a telephone or cell phone to call for help? [ cond. prob: 0.05]"
#> [5] "(i223o) During the illness that led to death, did (s)he have any urine problems? [ cond. prob: 0.005]"
#> [6] "(i263b) Did (s)he have difficulty with swallowing liquids? [ cond. prob: 0.005]"
#> [7] "(i261o) During the illness that led to death, did (s)he have difficulty swallowing? [ cond. prob: 0.002]"
#> [8] "(i246o) During the illness that led to death, did (s)he have stiffness of the whole body or was unable to open the mouth? [ cond. prob: 5e-04]"
While this last exercise may be useful for some, it is troublesome that we needed to change the results to get the desired information. This is the motivation for the second tool.
whyNotCOD()
We can replicate the previous results using the function whyNotCOD() without changing our stored results from InterVA5(). Furthermore, we can ask for the information for any other cause on the list that InterVA-5 uses.
whyNotCOD(data = RandomVA5, IDs = RandomVA5$ID[1], cause = "Malaria",
pretty = TRUE, includeAll = FALSE)
#> $ID1
#> $ID1$ID
#> [1] "d1"
#>
#> $ID1$topSymptoms
#> [1] "(i019a) Was he male? [ cond. prob: 0.5]"
#> [2] "(i022a) Was s(he) aged 65 years or more at death? [ cond. prob: 0.1]"
#> [3] "(i451o) Did (s)he use motorised transport to get to the hospital or health facility? [ cond. prob: 0.05]"
#> [4] "(i458o) In the final days before death, did anyone use a telephone or cell phone to call for help? [ cond. prob: 0.05]"
#> [5] "(i223o) During the illness that led to death, did (s)he have any urine problems? [ cond. prob: 0.005]"
#> [6] "(i263b) Did (s)he have difficulty with swallowing liquids? [ cond. prob: 0.005]"
#> [7] "(i261o) During the illness that led to death, did (s)he have difficulty swallowing? [ cond. prob: 0.002]"
#> [8] "(i246o) During the illness that led to death, did (s)he have stiffness of the whole body or was unable to open the mouth? [ cond. prob: 5e-04]"
causetextV5[22, 2]
#> CAUSETXT.C40
#> "Other and unspecified neoplasms"
whyNotCOD(data = RandomVA5, IDs = RandomVA5$ID[1], cause = causetextV5[22, 2],
pretty = TRUE, includeAll = FALSE)
#> $ID1
#> $ID1$ID
#> [1] "d1"
#>
#> $ID1$topSymptoms
#> [1] "(i019a) Was he male? [ cond. prob: 0.5]"
#> [2] "(i022a) Was s(he) aged 65 years or more at death? [ cond. prob: 0.5]"
#> [3] "(i223o) During the illness that led to death, did (s)he have any urine problems? [ cond. prob: 0.05]"
#> [4] "(i263b) Did (s)he have difficulty with swallowing liquids? [ cond. prob: 0.05]"
#> [5] "(i451o) Did (s)he use motorised transport to get to the hospital or health facility? [ cond. prob: 0.05]"
#> [6] "(i458o) In the final days before death, did anyone use a telephone or cell phone to call for help? [ cond. prob: 0.05]"
#> [7] "(i261o) During the illness that led to death, did (s)he have difficulty swallowing? [ cond. prob: 0.02]"
#> [8] "(i246o) During the illness that led to death, did (s)he have stiffness of the whole body or was unable to open the mouth? [ cond. prob: 5e-04]"
This function also takes five arguments:
data: the input data used when producing the InterVA5 object
IDs: a vector of IDs that identify the deaths for which you want results
cause: the cause for which the conditional probabilities will be returned. If cause = NULL (which is the default value), the function will print out a numbered list of possible causes and the user can enter the number that identifies the cause of interest.
pretty: a logical indicating if you want the results in an easy-to-read format (default is TRUE)
includeAll: a logical indicating if you want all of the symptoms included in the output, even those which are absent or have a value of missing/no (default is FALSE, which only includes symptoms that are present)
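One way to read the two sets of output side by side is to put the conditional probabilities for the assigned cause and the alternative cause in a single table. The sketch below does this by hand for death d1, using the values printed earlier for Stroke (from getTopSymptoms()) and Malaria (from whyNotCOD()). The ratio column is only an exploratory summary we introduce here; it is not the calculation that InterVA-5 performs.

```r
## Conditional probabilities copied from the output shown earlier for death d1
comparison <- data.frame(
  symptom = c("i019a", "i022a", "i263b", "i261o",
              "i451o", "i458o", "i246o", "i223o"),
  Stroke  = c(0.5, 0.5, 0.5,   0.05,  0.05, 0.05, 0.01,  0.005),
  Malaria = c(0.5, 0.1, 0.005, 0.002, 0.05, 0.05, 5e-04, 0.005),
  stringsAsFactors = FALSE
)

## Ratio > 1: the symptom is more probable under Stroke than under Malaria
comparison$ratio <- comparison$Stroke / comparison$Malaria

## Put the symptoms that most favour Stroke at the top
comparison <- comparison[order(comparison$ratio, decreasing = TRUE), ]
comparison
```

In this table the symptoms about difficulty swallowing (i263b and i261o) separate the two causes the most, while sex and the health-care access items do not distinguish between them.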