Analytics, Volume 2, Issue 2 (June 2023) – 14 articles

  • Issues are regarded as officially published after their release is announced to the table of contents alert mailing list.
  • PDF is the official format for papers published in both HTML and PDF forms.
16 pages, 346 KiB  
Article
Bayesian Mixture Copula Estimation and Selection with Applications
by Yujian Liu, Dejun Xie and Siyi Yu
Analytics 2023, 2(2), 530-545; https://doi.org/10.3390/analytics2020029 - 15 Jun 2023
Cited by 3 | Viewed by 1227
Abstract
Mixture copulas are popular and essential tools for studying complex dependencies among variables. However, selecting the correct mixture model often involves repeated testing and estimation using criteria such as the AIC, which can require considerable effort and time. In this paper, we propose a method that selects and estimates the correct mixture copula simultaneously. This is accomplished by first overfitting the model and then conducting Bayesian estimation. We verify the correctness of our approach by numerical simulations. Finally, a real data analysis is performed by studying the dependencies among three major financial markets.
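Although the paper's Bayesian mixture-copula machinery is not reproduced here, the kind of pairwise dependence that copula models target is commonly summarized by rank correlations such as Kendall's tau, since many one-parameter copula families have a closed-form relationship with tau. A minimal pure-Python sketch (a hypothetical helper, not the authors' code):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / C(n, 2).
    Rank dependence measures like tau are often used to calibrate
    copula parameters. Illustrative only; ties are not handled."""
    n = len(x)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Perfectly concordant series give tau = 1, perfectly discordant series give tau = -1.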

21 pages, 308 KiB  
Article
Preliminary Perspectives on Information Passing in the Intelligence Community
by Jeremy E. Block, Ilana Bookner, Sharon Lynn Chu, R. Jordan Crouser, Donald R. Honeycutt, Rebecca M. Jonas, Abhishek Kulkarni, Yancy Vance Paredes and Eric D. Ragan
Analytics 2023, 2(2), 509-529; https://doi.org/10.3390/analytics2020028 - 15 Jun 2023
Viewed by 1112
Abstract
Analyst sensemaking research typically focuses on individuals or small groups conducting intelligence tasks. This has helped the field understand information retrieval tasks and how people communicate information. As part of the grand challenge of the Summer Conference on Applied Data Science (SCADS) to build a system that can generate tailored daily reports (TLDR) for intelligence analysts, we conducted a qualitative interview study with analysts to better understand information passing in the intelligence community. While our results are preliminary, we expect this work to contribute to a better understanding of the information ecosystem of the intelligence community, how institutional dynamics affect information passing, and what implications this has for a TLDR system. This work describes our involvement in and work completed during SCADS. Although preliminary, our findings indicate that information passing is both a formal and an informal process and often follows professional networks, especially because of the small population and the specialization of work. We call attention to the need for future analysis of information ecosystems to better support tailored information retrieval features.

24 pages, 764 KiB  
Review
Spatiotemporal Data Mining Problems and Methods
by Eleftheria Koutsaki, George Vardakis and Nikolaos Papadakis
Analytics 2023, 2(2), 485-508; https://doi.org/10.3390/analytics2020027 - 14 Jun 2023
Cited by 1 | Viewed by 1525
Abstract
Many scientific fields show great interest in the extraction and processing of spatiotemporal data, including medicine (with an emphasis on epidemiology and neurology), geology, the social sciences, meteorology, and, notably, the study of transport. Spatiotemporal (ST) data differ significantly from spatial data: spatiotemporal measurements record both the place and the time at which they are taken, with their respective characteristics, while spatial data describe information related only to place. Spatiotemporal data mining has transformed many scientific fields because it can provide solutions to complex problems and deliver valuable predictions through predictive learning. However, combining time and place in data mining presents significant challenges that must be overcome. Spatiotemporal data mining and analysis is a relatively new approach that has been studied systematically only in the last decade. The purpose of this article is to provide a thorough introduction to spatiotemporal data and, through this detailed description, to introduce description logic as a way of gaining a complete understanding of these data. We aim to introduce a new way of describing ST data for future studies: combining the expressions that arise for each data type, using description logic, with new derivable expressions, in order to describe future states of objects and environments with great precision and thus provide accurate predictions. To highlight the value of spatiotemporal data, we give a brief description of ST data in the introduction. We then describe the relevant work carried out to date; the types of ST data, their properties, and the transformations possible between them, introducing, to a small extent, constraints and rules expressed in description logic for each data type; and the data snapshots by type and the similarities between cases. We cover methods such as clustering, dynamic ST clustering, predictive learning, frequent pattern mining, and pattern emergence, and problems such as anomaly detection, identifying the time points at which the behavior of an observed object changes, and modeling the relationships between objects. We also describe current applications of ST data in various fields, as well as future work. We conclude that the representation and study of spatiotemporal data, combined with the other properties that accompany all natural phenomena and processed appropriately, can lead to sound conclusions in the study of problems and to highly precise predictions that accurately determine the future states of an environment or an object. The importance of ST data thus makes them particularly valuable today in various scientific fields, and their extraction is a particularly demanding challenge for the future.

22 pages, 474 KiB  
Article
A Novel Zero-Truncated Katz Distribution by the Lagrange Expansion of the Second Kind with Associated Inferences
by Damodaran Santhamani Shibu, Christophe Chesneau, Mohanan Monisha, Radhakumari Maya and Muhammed Rasheed Irshad
Analytics 2023, 2(2), 463-484; https://doi.org/10.3390/analytics2020026 - 01 Jun 2023
Viewed by 886
Abstract
In this article, the Lagrange expansion of the second kind is used to generate a novel zero-truncated Katz distribution, which we call the Lagrangian zero-truncated Katz distribution (LZTKD). Notably, the zero-truncated Katz distribution is a special case of this distribution. Along with closed-form expressions for all its statistical characteristics, the LZTKD is shown to provide an adequate model for both underdispersed and overdispersed zero-truncated count datasets. Specifically, we show that the associated hazard rate function has increasing, decreasing, bathtub, or upside-down bathtub shapes. Moreover, we demonstrate that the LZTKD belongs to the Lagrangian distributions of the first kind. Applications of the LZTKD in statistical scenarios are then explored. The unknown parameters are estimated using the well-established maximum likelihood method, and the generalized likelihood ratio test procedure is applied to test the significance of the additional parameter. Simulation studies are conducted to evaluate the performance of the maximum likelihood estimates, and real-life datasets further highlight the relevance and applicability of the proposed model.
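The zero-truncation construction behind the LZTKD is general: any count distribution P can be conditioned on X >= 1 by rescaling, P(X = k | X >= 1) = P(k) / (1 - P(0)). A minimal sketch using a Poisson base (a stand-in for the Katz family, whose pmf is not reproduced here):

```python
import math

def poisson_pmf(k, lam):
    """Poisson pmf, used here only as a simple base distribution."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def zero_truncated_pmf(k, lam):
    """Zero-truncated pmf: P(X = k | X >= 1) = P(k) / (1 - P(0)).
    The LZTKD applies the same truncation idea to a Lagrangian-expanded
    Katz family rather than the Poisson shown here."""
    if k < 1:
        return 0.0
    return poisson_pmf(k, lam) / (1.0 - poisson_pmf(0, lam))
```

The truncated pmf puts zero mass at k = 0 and still sums to one over k >= 1.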

25 pages, 2182 KiB  
Article
Generalized Unit Half-Logistic Geometric Distribution: Properties and Regression with Applications to Insurance
by Suleman Nasiru, Christophe Chesneau, Abdul Ghaniyyu Abubakari and Irene Dekomwine Angbing
Analytics 2023, 2(2), 438-462; https://doi.org/10.3390/analytics2020025 - 16 May 2023
Cited by 5 | Viewed by 1126
Abstract
The use of distributions to model and quantify risk is essential in risk assessment and management. In this study, the generalized unit half-logistic geometric (GUHLG) distribution is developed to model bounded insurance data on the unit interval. Plots of the probability density function indicate that the distribution can handle data that exhibit left-skewed, right-skewed, symmetric, reversed-J, and bathtub shapes. The hazard rate function also suggests that the distribution can be applied to analyze data with bathtub-shaped, N-shaped, and increasing failure rates. Subsequently, the inferential aspects of the proposed model are investigated. In particular, Monte Carlo simulation exercises examine the performance of the estimation method, using an algorithm that generates random observations from the quantile function. The simulation results suggest that the considered estimation method is efficient. A univariate application of the distribution and a multivariate application of the associated regression to risk survey data reveal that the model fits better than existing distributions and regression models. In the multivariate application, we estimate the parameters of the regression model using both maximum likelihood and Bayesian estimation, and the two methods produce very similar estimates. Diagnostic plots for the Bayesian method (trace, ergodic mean, and autocorrelation plots) show that the chains converge to a stationary distribution.
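Generating random observations from a quantile function, as the Monte Carlo step above does, is inverse-transform sampling: push uniform draws through Q. A sketch using the exponential quantile as a stand-in, since the GUHLG quantile's closed form is not reproduced here:

```python
import math
import random

def exponential_quantile(u, rate=1.0):
    """Exponential inverse CDF, Q(u) = -ln(1 - u) / rate. A stand-in
    for the GUHLG quantile function, which is not reproduced here."""
    return -math.log(1.0 - u) / rate

def inverse_transform_sample(quantile, n, seed=42):
    """Draw n observations by pushing Uniform(0, 1) draws through the
    quantile function; works for any distribution with a computable Q."""
    rng = random.Random(seed)
    return [quantile(rng.random()) for _ in range(n)]
```

Swapping in a different quantile function changes the target distribution without touching the sampler.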

12 pages, 421 KiB  
Article
Clustering Matrix Variate Longitudinal Count Data
by Sanjeena Subedi
Analytics 2023, 2(2), 426-437; https://doi.org/10.3390/analytics2020024 - 05 May 2023
Viewed by 1289
Abstract
Matrix variate longitudinal discrete data can arise in transcriptomics studies when data are collected for N genes at r conditions over t time points; each observation Y_n, for n = 1, …, N, can then be written as an r × t matrix. When dealing with such data, exploiting the matrix variate structure greatly reduces the number of parameters in the model, and the components of the covariance matrix gain a meaningful interpretation. In this work, a mixture of matrix variate Poisson-log normal distributions is introduced for clustering longitudinal read counts from RNA-seq studies. To account for the longitudinal nature of the data, a modified Cholesky decomposition is utilized for a component of the covariance structure, and a parsimonious family of models is developed by imposing constraints on the elements of these decompositions. The models are applied to both real and simulated data, and it is demonstrated that the proposed approach can recover the underlying cluster structure.
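The modified Cholesky parameterization builds on the ordinary Cholesky factorization of a covariance matrix. A minimal pure-Python sketch of the ordinary factorization (illustrative only; the paper's modified variant instead factors the inverse covariance as T' D^{-1} T, where the sub-diagonal entries of the unit lower-triangular T act as autoregressive coefficients across time points):

```python
import math

def cholesky(a):
    """Lower-triangular L with L @ L.T == a, for a symmetric
    positive-definite matrix given as a list of row lists."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)  # diagonal entry
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]  # below-diagonal entry
    return L
```

Constraining elements of such factors (e.g., forcing sub-diagonals to share a value) is what yields the parsimonious model family mentioned above.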
(This article belongs to the Special Issue Feature Papers in Analytics)

16 pages, 1174 KiB  
Article
Wavelet Support Vector Censored Regression
by Mateus Maia, Jonatha Sousa Pimentel, Raydonal Ospina and Anderson Ara
Analytics 2023, 2(2), 410-425; https://doi.org/10.3390/analytics2020023 - 04 May 2023
Viewed by 1370
Abstract
Learning methods in survival analysis have the ability to handle censored observations. The Cox model is the prevalent predictive statistical technique for survival analysis, but its use rests on the strong assumption of hazard proportionality, which can be challenging to verify, particularly when working with non-linearity and high-dimensional data. It may therefore be necessary to consider a more flexible and generalizable approach, such as support vector machines. This paper proposes a new method, wavelet support vector censored regression, and compares it with the Cox model, traditional support vector regression, and support vector regression for censored data, survival models based on support vector machines. In addition, to evaluate the effectiveness of different kernel functions in the support vector censored regression approach, we conducted a series of simulations with varying numbers of observations and ratios of censored data. The wavelet support vector censored regression outperformed the other methods in terms of the C-index. The evaluation was performed on simulations, survival benchmark datasets, and a real biomedical application.
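The C-index used for the comparison is Harrell's concordance index: among comparable pairs (where the earlier time is an observed event), it counts how often the higher predicted risk belongs to the earlier failure. A minimal sketch (illustrative, not the paper's implementation; assumes at least one comparable pair):

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored data. events[i] is 1 for an
    observed event and 0 for a censored time; risks[i] is the model's
    predicted risk. A pair (i, j) is comparable when i is an observed
    event with times[i] < times[j]; risk ties count as half-concordant."""
    concordant = comparable = 0.0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable
```

A value of 1 means perfect ranking, 0.5 is no better than chance, and 0 means perfectly reversed ranking.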

17 pages, 635 KiB  
Article
Building Neural Machine Translation Systems for Multilingual Participatory Spaces
by Pintu Lohar, Guodong Xie, Daniel Gallagher and Andy Way
Analytics 2023, 2(2), 393-409; https://doi.org/10.3390/analytics2020022 - 01 May 2023
Cited by 2 | Viewed by 1883
Abstract
This work presents the development of the translation component in a multistage, multilevel, multimode, multilingual and dynamic deliberative (M4D2) system, built to facilitate automated moderation and translation in the languages of five European countries: Italy, Ireland, Germany, France and Poland. Two main topics were to be addressed in the deliberation process: (i) the environment and climate change; and (ii) the economy and inequality. In this work, we describe the development of neural machine translation (NMT) models for these domains for six European languages: Italian, English (included as the second official language of Ireland), Irish, German, French and Polish. As a result, we generate 30 NMT models: initially baseline systems built using freely available online data, which are then adapted to the domains of interest in the project by (i) filtering the corpora, (ii) tuning the systems with automatically extracted in-domain development datasets and (iii) using corpus concatenation techniques to expand the amount of data available. We compare the results produced by the domain-adapted systems with those produced by Google Translate, and demonstrate that fast, high-quality systems can be produced that facilitate multilingual deliberation in a secure environment.

34 pages, 5998 KiB  
Article
Investigating Online Art Search through Quantitative Behavioral Data and Machine Learning Techniques
by Minas Pergantis, Alexandros Kouretsis and Andreas Giannakoulopoulos
Analytics 2023, 2(2), 359-392; https://doi.org/10.3390/analytics2020021 - 26 Apr 2023
Viewed by 1362
Abstract
Studying searcher behavior has been a cornerstone of search engine research for decades, since it can lead to a better understanding of user needs and allow for an improved user experience. Going beyond descriptive data analysis and statistics, studies have been utilizing the capabilities of machine learning to further investigate how users behave during general-purpose searching. But the thematic content of a search greatly affects many aspects of user behavior, which often deviates from general-purpose search behavior. Thus, in this study, emphasis is placed specifically on the fields of Art and Cultural Heritage. Insights derived from behavioral data can help Culture and Art institutions streamline their online presence and better understand their user base. Existing research in this field often focuses on lab studies and explicit user feedback, but this study takes advantage of quantitative real-usage data and its analysis through machine learning. Using data collected from real-world usage of the Art Boulevard proprietary search engine for content related to Art and Culture, and by means of machine learning-powered tools and methodologies, this article investigates the peculiarities of Art-related online searches. Through clustering, various archetypes of Art search sessions were identified, providing insight into the variety of ways in which users interacted with the search engine. Additionally, using extreme gradient boosting, the metrics most likely to predict the success of a search session were documented, underlining the importance of various aspects of user activity for search success. Finally, by applying topic modeling to the textual information of user-clicked results, the thematic elements that dominated user interest were investigated, providing an overview of prevalent themes in the fields of Art and Culture. It was established that preferred results revolved mostly around traditional visual Art themes, while academic and historical topics also had a strong presence.

9 pages, 581 KiB  
Article
The AI Learns to Lie to Please You: Preventing Biased Feedback Loops in Machine-Assisted Intelligence Analysis
by Jonathan Stray
Analytics 2023, 2(2), 350-358; https://doi.org/10.3390/analytics2020020 - 18 Apr 2023
Cited by 2 | Viewed by 2543
Abstract
Researchers are starting to design AI-powered systems to automatically select and summarize the reports most relevant to each analyst, which raises the issue of bias in the information presented. This article focuses on the selection of relevant reports without an explicit query, a task known as recommendation. Drawing on previous work documenting the existence of human-machine feedback loops in recommender systems, this article reviews potential biases and mitigations in the context of intelligence analysis. Such loops can arise when behavioral “engagement” signals such as clicks or user ratings are used to infer the value of displayed information. Even worse, there can be feedback loops in the collection of intelligence information because users may also be responsible for tasking collection. Avoiding misalignment feedback loops requires an alternate, ongoing, non-engagement signal of information quality. Existing evaluation scales for intelligence product quality and rigor, such as the IC Rating Scale, could provide ground-truth feedback. This sparse data can be used in two ways: for human supervision of average performance and to build models that predict human survey ratings for use at recommendation time. Both techniques are widely used today by social media platforms. Open problems include the design of an ideal human evaluation method, the cost of skilled human labor, and the sparsity of the resulting data.
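The engagement feedback loop described above can be made concrete with a deterministic toy simulation (entirely hypothetical, not the article's model): a recommender that treats every impression as positive engagement locks onto whichever item it happens to show first, regardless of true quality.

```python
def simulate_engagement_loop(true_quality, rounds=50):
    """Toy rich-get-richer loop. The recommender always shows the item
    with the highest engagement score; each display adds engagement,
    boosting that item further. true_quality is never consulted, which
    is exactly the misalignment: the loop can pin a low-quality item."""
    scores = [1.0] * len(true_quality)  # initial engagement estimates (all tied)
    shown = []
    for _ in range(rounds):
        top = max(range(len(scores)), key=lambda i: scores[i])
        shown.append(top)
        scores[top] += 1.0  # every impression counted as engagement
    return shown
```

Starting from tied scores, item 0 is shown every round forever, even when another item has higher true quality; breaking the loop requires an external quality signal, as the article argues.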

4 pages, 194 KiB  
Editorial
Data Stream Analytics
by Jesus S. Aguilar-Ruiz, Albert Bifet and Joao Gama
Analytics 2023, 2(2), 346-349; https://doi.org/10.3390/analytics2020019 - 14 Apr 2023
Viewed by 1296
Abstract
The human brain works in such a complex way that we have not yet managed to decipher its functional mysteries [...]
(This article belongs to the Special Issue Feature Papers in Analytics)
18 pages, 1491 KiB  
Article
Development of a Dynamically Adaptable Routing System for Data Analytics Insights in Logistic Services
by Vasileios Tsoukas, Eleni Boumpa, Vasileios Chioktour, Maria Kalafati, Georgios Spathoulas and Athanasios Kakarountas
Analytics 2023, 2(2), 328-345; https://doi.org/10.3390/analytics2020018 - 13 Apr 2023
Viewed by 1904
Abstract
This work proposes an effective solution to the Vehicle Routing Problem, taking into account all phases of the delivery process. When compared to real-world data, the findings are encouraging and demonstrate the value of machine learning algorithms incorporated into the process. Several algorithms were combined with a modified Hopfield network to deliver the optimal solution to a multiobjective problem on a platform capable of monitoring the various phases of the process. Additionally, a system providing viable insights and analytics regarding the orders was developed. The results reveal a maximum distance saving of 25% and a maximum overall delivery-time saving of 14%.

13 pages, 348 KiB  
Article
Metric Ensembles Aid in Explainability: A Case Study with Wikipedia Data
by Grant Forbes and R. Jordan Crouser
Analytics 2023, 2(2), 315-327; https://doi.org/10.3390/analytics2020017 - 07 Apr 2023
Viewed by 1390
Abstract
In recent years, as machine learning models have become larger and more complex, it has become both more difficult and more important to explain and interpret their results, both to prevent model errors and to inspire confidence in end users. Explainability has thus attracted significant and growing interest as a highly desirable trait for a model. Similarly, there has been much recent attention on ensemble methods, which aggregate results from multiple (often simple) models or metrics in order to outperform models that optimize for only a single metric. We argue that the latter can actually assist with the former: a model that optimizes for several metrics has some base level of explainability baked in, and this explainability can be leveraged not only for user confidence but also to fine-tune the weights between the metrics themselves in an intuitive way. We demonstrate a case study of such a benefit, obtaining clear, explainable results based on an aggregate of five simple relevance metrics, using Wikipedia data as a proxy for a large text-based recommendation problem. We show that not only can the simplicity and multiplicity of these metrics be leveraged for explainability, but that this very explainability can lead to an intuitive fine-tuning process that improves the model itself.
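The core idea, a weighted aggregate of simple metrics whose per-metric contributions double as an explanation, can be sketched in a few lines (metric names and weights below are hypothetical, not the paper's five metrics):

```python
def ensemble_score(metric_values, weights):
    """Weighted aggregate of simple relevance metrics. The per-metric
    contributions are returned alongside the total, so the breakdown
    itself serves as the explanation, and the weights can later be
    fine-tuned from user feedback on those explanations."""
    contributions = {name: weights[name] * value
                     for name, value in metric_values.items()}
    return sum(contributions.values()), contributions
```

Because the score is a transparent sum, lowering the weight of a metric whose contribution users disagree with is the intuitive fine-tuning step the abstract describes.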

19 pages, 2073 KiB  
Article
Readability Indices Do Not Say It All on a Text Readability
by Emilio Matricciani
Analytics 2023, 2(2), 296-314; https://doi.org/10.3390/analytics2020016 - 30 Mar 2023
Cited by 6 | Viewed by 1822
Abstract
We propose a universal readability index, G_U, applicable to any alphabetical language and related to cognitive psychology, the theory of communication, phonics and linguistics. The index also considers readers' short-term-memory processing capacity, here modeled by the word interval I_P, namely, the number of words between two interpunctions. No current readability formula considers I_P, but scatterplots of I_P versus a readability index show that texts with the same readability index can have very different I_P, ranging from 4 to 9, practically Miller's range, which covers 95% of readers. It is unlikely that I_P has no impact on reading difficulty. The examples shown are taken from Italian and English literature and from translations of The New Testament in Latin and in contemporary languages. We also propose an extremely compact formula relating the capacity of human short-term memory to the difficulty of reading a text. It should synthetically model human reading difficulty, a kind of “footprint” of humans. However, further experimental and multidisciplinary work is necessary to confirm our conjecture about the dependence of a readability index on a reader's short-term-memory capacity.
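The word interval I_P, the number of words between two interpunctions, is straightforward to compute: split the text at punctuation marks and average the word counts of the resulting segments. A sketch (the punctuation set here is an assumption; the paper's exact definition of interpunctions may differ):

```python
import re

def word_interval(text):
    """Average number of words between interpunctions (I_P). Splits the
    text at common punctuation marks and averages the word counts of
    the non-empty segments."""
    segments = re.split(r"[.,;:!?]", text)
    counts = [len(seg.split()) for seg in segments if seg.split()]
    return sum(counts) / len(counts)
```

For "One two three, four five. Six." the segments hold 3, 2, and 1 words, giving I_P = 2.0.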
