Volume 16 How To Detect And Handle Outliers Pdf 22
The HD is generally sensitive to outliers. Because noise and outliers are common in medical segmentations, it is not recommended to use the HD directly [8, 40]. However, the quantile method proposed by Huttenlocher et al. [41] is one way to handle outliers. According to the Hausdorff quantile method, the HD is defined to be the qth quantile of distances instead of the maximum, so that possible outliers are excluded, where q is selected depending on the application and the nature of the measured point sets.
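The quantile method described above can be sketched in a few lines. This is a minimal illustration, not the reference implementation from [41]: the function names, the brute-force pairwise distance computation, and the default q = 0.95 are assumptions made here for the example.

```python
import numpy as np

def directed_quantile_distance(a, b, q):
    """q-th quantile of the nearest-neighbour distances from each point in a to the set b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Pairwise Euclidean distances, shape (len(a), len(b)); brute force, fine for small sets.
    pairwise = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    nearest = pairwise.min(axis=1)
    return float(np.quantile(nearest, q))

def quantile_hausdorff(a, b, q=0.95):
    """Symmetric quantile Hausdorff distance; q = 1.0 gives the classical (maximum) HD."""
    return max(directed_quantile_distance(a, b, q),
               directed_quantile_distance(b, a, q))

# Hypothetical point sets: b equals a plus one far-away outlier point.
a = [(float(x), 0.0) for x in range(10)]
b = a + [(0.0, 100.0)]
hd_max = quantile_hausdorff(a, b, q=1.0)   # dominated by the outlier
hd_q90 = quantile_hausdorff(a, b, q=0.9)   # the outlier is excluded
```

With q = 1.0 the single outlier determines the distance, whereas a lower quantile ignores it. The appropriate q still depends on the application, as noted above.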
Now, we provide guidelines for choosing a suitable metric based on the results so far. These guidelines are additionally summarized in Table 5 as a matching between data properties, requirements, and metric properties:
(i) When the objective is to evaluate the general alignment of the segments, especially when the segments are small (the overlap is likely small or zero), it is recommended to use distance based metrics rather than overlap based metrics. The volumetric similarity (VS) is not suitable in this case.
(ii) Distance based metrics are recommended when the contour of the segmentation, i.e. the accuracy at the boundary, is of importance [6]. This is because they are the only category of metrics that takes into account the spatial position of false negatives and false positives.
(iii) The Hausdorff distance (HD) is sensitive to outliers and thus not recommended when outliers are likely. Methods for handling outliers, such as the quantile method [41], can mitigate the problem; otherwise, the average distance (AVG), the overlap based metrics, and the probabilistic based metrics are known to be stable against outliers.
(iv) Probabilistic distance (PBD) and overlap based metrics are recommended when the alignment of the segments is of interest rather than the overall segmentation accuracy [2].
(v) Metrics that consider the true negatives in their definitions are sensitive to segment size. They reward segmentations with small segments and penalize those with large segments [10]. Therefore, they tend to penalize algorithms that aim to maximize recall and reward algorithms that aim to maximize precision. Such metrics should be avoided in general, especially when the objective is to reward recall.
(vi) When the segmentations have a high class imbalance, e.g. segmentations with small segments, it is recommended to use metrics with chance adjustment, e.g. the Kappa measure (KAP) and the adjusted Rand index (ARI) [29, 55].
(vii) When the segments are not solid, but rather have low densities, all metrics that are based on volume or on the four cardinalities (TP, TN, FP, FN) are not recommended. In such cases distance based metrics, especially MHD and HD, are recommended.
(viii) Volumetric similarity is not recommended when the quality of the segmentations being evaluated is low in general, because the segments are likely to have low overlap with their corresponding segments in the ground truth. In this case, overlap based and distance based metrics are recommended.
(ix) When the segmented volume is of importance, volumetric similarity and overlap based metrics are recommended rather than distance based metrics.
(x) When more than one objective is to be considered and the objectives are in conflict, it is recommended to combine more than one metric, so that each objective is covered by one of the metrics. In doing so, it is recommended to avoid, where possible, selecting metrics that are strongly correlated (Fig. 3).
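The point made in guidelines (i) and (viii), that volumetric similarity says nothing about alignment, can be illustrated with a small sketch. It assumes the usual definitions Dice = 2TP / (2TP + FP + FN) and VS = 1 - |FN - FP| / (2TP + FP + FN); the function names, the toy masks, and the convention of returning 1.0 for two empty masks are choices made here for the example.

```python
import numpy as np

def dice(ground_truth, segmentation):
    """Overlap based metric: 2*TP / (|ground truth| + |segmentation|)."""
    gt = np.asarray(ground_truth, dtype=bool)
    seg = np.asarray(segmentation, dtype=bool)
    total = int(gt.sum()) + int(seg.sum())
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * int(np.logical_and(gt, seg).sum()) / total

def volumetric_similarity(ground_truth, segmentation):
    """Volume based metric: 1 - ||gt| - |seg|| / (|gt| + |seg|); ignores position entirely."""
    gt = np.asarray(ground_truth, dtype=bool)
    seg = np.asarray(segmentation, dtype=bool)
    total = int(gt.sum()) + int(seg.sum())
    if total == 0:
        return 1.0  # same assumed convention as above
    return 1.0 - abs(int(gt.sum()) - int(seg.sum())) / total

# Two small segments of equal volume that do not overlap at all.
gt_mask = np.zeros((10, 10), dtype=bool)
seg_mask = np.zeros((10, 10), dtype=bool)
gt_mask[0:2, 0:2] = True
seg_mask[5:7, 5:7] = True

dice_score = dice(gt_mask, seg_mask)                 # 0.0: no overlap
vs_score = volumetric_similarity(gt_mask, seg_mask)  # 1.0: identical volumes
```

Here the overlap is zero while the volumetric similarity is perfect, which is why VS alone is unsuitable for small or poorly aligned segments.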
In practice, in any analysis of dates, some are usually rejected as obvious outliers. Bayesian statistical methods can perform this rejection in a more objective way (Christen 1994b), but they are not often used. This paper discusses the underlying statistics and application of these methods, and extensions of them, as they are implemented in OxCal v 4.1. New methods are presented for the treatment of outliers where the problems lie principally with the context rather than the 14C measurement. There is also a full treatment of outlier analysis for samples that are all of the same age, which takes account of the uncertainty in the calibration curve. All of these Bayesian approaches can be used either for outlier detection and rejection or in a model averaging approach in which the dates most likely to be outliers are downweighted.
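The idea of computing an outlier probability and using it either to reject or to downweight a date can be illustrated with a toy two-component model. This is not OxCal's implementation and it ignores calibration entirely: the function names, the 5% prior outlier probability, the 100-year outlier scale, and the example values are all assumptions made for illustration.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def outlier_posterior(measured, sigma, expected, prior=0.05, outlier_sd=100.0):
    """Posterior probability that a measurement is an outlier.

    Inlier model:  measured ~ N(expected, sigma^2)
    Outlier model: measured ~ N(expected, sigma^2 + outlier_sd^2), i.e. an extra
                   unknown offset with an assumed scale of outlier_sd.
    """
    inlier = normal_pdf(measured, expected, sigma)
    outlier = normal_pdf(measured, expected, math.hypot(sigma, outlier_sd))
    return prior * outlier / (prior * outlier + (1.0 - prior) * inlier)

# Hypothetical dates (in years) compared against an expected age of 3000.
p_consistent = outlier_posterior(3000.0, sigma=30.0, expected=3000.0)
p_discrepant = outlier_posterior(3300.0, sigma=30.0, expected=3000.0)

# Model averaging: keep every date, but weight it by its probability of being an inlier.
weights = [1.0 - p_consistent, 1.0 - p_discrepant]
```

A date that agrees with the expected age ends up with an outlier probability below the prior, while a strongly discrepant date approaches 1 and is effectively downweighted rather than removed by hand.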
In our analysis, we concentrated on two of the timeslots to find the outlier transaction patterns. Both patterns show an unusually high spike in bitcoin volume within the one-month period from January 2016 to February 2016, shown in Fig. 12. Even though there was not much variation in price, a very large volume of bitcoin circulated during the last week of January 2016.
We investigated the unique transaction volume patterns and, based on them, developed a methodology to extract interesting findings from a database reconstructed from the blockchain system. We found weekly patterns in the graph of bitcoin volume against price per day, and a clear sign of financial trading in the flow of bitcoin among the transactions. The weekly trading pattern shown in our analysis helped us investigate specific impulses of transactions within a more focused timeslot. We analyzed each transaction and the bitcoin volume involved in that timeslot. The volume rank distribution helped us identify the outlier transactions, those involving the largest volumes of bitcoin. The debate over SegWit (Segregated Witness) and its effect, in terms of soft fork versus hard fork, was heating up at the beginning of January 2016, which might be one of the causes of the large bitcoin flow in some outlier transactions.
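Ranking transactions by volume and flagging the extreme ones can be sketched as follows. The text does not state the exact outlier criterion used, so this example substitutes a common robust rule, the modified z-score based on the median absolute deviation (MAD) with the conventional 3.5 cutoff. The function name and the sample volumes are hypothetical.

```python
from statistics import median

def rank_volume_outliers(volumes, threshold=3.5):
    """Return (index, volume) pairs flagged as high-volume outliers, largest first.

    Uses the modified z-score 0.6745 * (v - median) / MAD, a swapped-in robust rule,
    not necessarily the criterion used in the analysis described above.
    """
    med = median(volumes)
    mad = median([abs(v - med) for v in volumes])
    if mad == 0:
        return []  # no spread around the median, so the rule cannot be applied
    flagged = [(i, v) for i, v in enumerate(volumes)
               if 0.6745 * (v - med) / mad > threshold]
    # Rank the flagged transactions by volume, largest first.
    flagged.sort(key=lambda item: item[1], reverse=True)
    return flagged

# Hypothetical transaction volumes (in BTC) for one timeslot.
volumes = [1.0, 2.0, 3.0, 2.0, 1.0, 2.0, 3.0, 500.0, 900.0]
outliers = rank_volume_outliers(volumes)
```

Only the upper tail is flagged here, since the analysis is concerned with transactions carrying unusually large volumes.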
However, many genes are expressed at levels too low in both blood and fibroblasts to be captured at high depth by RNA sequencing. CRISPR/Cas9 technology can be used to improve coverage of low-expressed genes in a scalable manner [143]. Huang et al. [144] applied the CRISPRclean method, using Cas9 nuclease and 360,000 guide RNAs to specifically remove RNA-Seq library fragments from over 4000 targeted genes, and observed about a sixfold increase in coverage of untargeted genes compared to untreated RNA-Seq libraries. iPSCs are a good substitute when the candidate gene is known to be expressed at low levels in blood and fibroblasts. Recently, Bonder et al. [145] unified data from five major iPSC genetic studies [146,147,148,149,150] to create the integrated iPSC QTL (i2QTL) consortium. They observed a fivefold enrichment of outliers in known rare disease genes compared to non-disease genes and demonstrated detection of gene outliers in patients with Bardet-Biedl syndrome and hereditary cerebellar ataxia. Therefore, alternate tissues like fibroblasts, iPSCs, and blood should be considered carefully when the affected tissue is not available for transcriptome analysis. In the following section, we discuss how sequencing the transcriptome can uncover pathogenic mutations missed by studying genomic variants alone.
Allele-specific expression (ASE) is a phenomenon in diploid or polyploid genomes in which one allele has significantly higher expression than the other allele [160, 161]. When variants from ES/GS data are prioritized under a recessive mode of inheritance, single heterozygous rare variants are filtered out. However, some of these heterozygous rare variants may exhibit ASE. Gonorazky et al. [141] reported that the allele imbalance approach provided diagnostic leads in three monogenic neuromuscular disorder patients who previously had non-diagnostic ES and/or gene panel results. Kremer et al. [11] discuss how their ASE pipeline helped to establish the genetic diagnosis in a patient with mucolipidosis who had tested negative in the enzymatic tests available for mucolipidosis type 1, 2, and 3 in blood leukocytes. They detected borderline non-significant low expression for an intronic variant in the MCOLN1 gene, which their ES pipeline had filtered out because it was intronic. Therefore, along with identifying expression outliers and splicing variants, ASE analysis should be performed as part of regular RNA-seq analysis, especially when genomic data identifies only one heterozygous variant for a recessive disorder.
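A basic allelic imbalance check at a single heterozygous site can be sketched as an exact two-sided binomial test of the reference and alternate read counts against an expected 50:50 ratio. This is a simplified illustration, not the pipeline of any of the cited studies: the function name and read counts are hypothetical, and real ASE analyses typically also account for reference mapping bias, overdispersion, and multiple testing.

```python
from math import comb

def ase_binomial_pvalue(ref_reads, alt_reads):
    """Exact two-sided binomial p-value for allelic imbalance under a 50:50 null.

    Sums the probabilities of all outcomes that are no more likely than the
    observed split of reads between the two alleles.
    """
    n = ref_reads + alt_reads
    if n == 0:
        raise ValueError("no reads cover this site")
    observed = comb(n, ref_reads)
    # Under p = 0.5 every outcome k has probability comb(n, k) / 2**n,
    # so comparing the binomial coefficients is enough (exact integer arithmetic).
    tail = sum(comb(n, k) for k in range(n + 1) if comb(n, k) <= observed)
    return tail / 2 ** n

# Hypothetical read counts at two heterozygous sites.
p_balanced = ase_binomial_pvalue(50, 50)    # both alleles equally expressed
p_imbalanced = ase_binomial_pvalue(95, 5)   # one allele dominates
```

A small p-value suggests that one allele is preferentially expressed, which is the kind of signal that can bring a single heterozygous variant back into consideration for a recessive disorder.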