Recent rapid advancements of machine learning have greatly enhanced the accuracy of prediction models, but most models remain "black boxes", making prediction error diagnosis challenging, especially with outliers. This lack of transparency hinders trust and reliability in industrial applications. Heuristic attribution methods, while helpful, often fail to capture true causal relationships, leading to inaccurate error attributions. Various root-cause analysis methods have been developed using Shapley values, yet they typically require predefined causal graphs, limiting their applicability for prediction errors in machine learning models. To address these limitations, we introduce the Causal-Discovery-based Root-Cause Analysis (CD-RCA) method that estimates causal relationships between the prediction error and the explanatory variables, without needing a pre-defined causal graph. By simulating synthetic error data, CD-RCA can identify variable contributions to outliers in prediction errors by Shapley values. Extensive simulations show CD-RCA outperforms current heuristic attribution methods, and a sensitivity analysis reveals new patterns where Shapley values may misattribute errors, paving the way for more accurate error attribution methods.
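To make concrete what attributing a quantity to variables "by Shapley values" means, the following is a minimal sketch of the exact Shapley formula on a toy set function. It is not the CD-RCA method: there is no causal discovery step or synthetic error simulation here, and the names `shapley_values` and `toy_error` are hypothetical, introduced only for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, n):
    """Exact Shapley values for players 0..n-1 under the set function `value`."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of player i when joining coalition s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_error(s):
    """Hypothetical error explained by a subset of three variables.

    Variable 0 adds 2.0, variable 1 adds 1.0, the two together add an
    extra 1.0 (interaction), and variable 2 contributes nothing.
    """
    total = 0.0
    if 0 in s:
        total += 2.0
    if 1 in s:
        total += 1.0
    if 0 in s and 1 in s:
        total += 1.0
    return total


phi = shapley_values(toy_error, 3)
print(phi)
```

In this toy case the interaction term is split equally between variables 0 and 1, and the attributions sum to the value of the full set. The enumeration is exponential in the number of variables, so it only suits very small examples.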
Causal Models and Prediction in Cell Line Perturbation Experiments
In cell line perturbation experiments, a collection of cells is perturbed with external agents and responses such as protein expression are measured. Due to cost constraints, only a small fraction of all possible perturbations can be tested in vitro. This has led to the development of computational models that can predict cellular responses to perturbations, without having to actually perform the experiments in a wet lab. A central challenge for these models is to predict the effect of new, previously untested perturbations that were not used in the training data. Here we propose causal structural equations for modeling how perturbations affect cells. From this model, we derive two estimators for predicting responses: a Linear Regression (LR) estimator and a causal structure learning estimator that we term Causal Structure Regression (CSR). The CSR estimator requires more assumptions than LR, but can predict the effects of drugs that were not applied in the training data. Next we present Cellbox, a recently proposed model based on a system of ordinary differential equations (ODEs) that obtained the best prediction performance on a Melanoma cell line perturbation data set (Yuan et al., 2021). We derive analytic results that show a close connection between CSR and Cellbox, providing a new causal interpretation for the Cellbox model. We compare LR and CSR/Cellbox in simulations, highlighting the strengths and weaknesses of the two approaches. Finally we compare the performance of LR and Cellbox on the benchmark Melanoma data set. We find that the LR model has comparable or slightly better performance than Cellbox.
Integrating Large Language Models in Causal Discovery: A Statistical Causal Approach
In practical statistical causal discovery (SCD), embedding domain expert
knowledge as constraints into the algorithm is widely accepted as significant
for creating consistent meaningful causal models, despite the recognized
challenges in systematic acquisition of the background knowledge. To overcome
these challenges, this paper proposes a novel methodology for causal inference,
in which SCD methods and knowledge based causal inference (KBCI) with a large
language model (LLM) are synthesized through "statistical causal prompting
(SCP)" for LLMs and prior knowledge augmentation for SCD. Experiments have
revealed that GPT-4 can cause the output of the LLM-KBCI and the SCD result
with prior knowledge from LLM-KBCI to approach the ground truth, and that the
SCD result can be further improved, if GPT-4 undergoes SCP. Furthermore, it has
been clarified that an LLM can improve SCD with its background knowledge, even
if the LLM does not contain information on the dataset. The proposed approach
can thus address challenges such as dataset biases and limitations,
illustrating the potential of LLMs to improve data-driven causal inference
across diverse scientific domains.
Scalable Counterfactual Distribution Estimation in Multivariate Causal Models
We consider the problem of estimating the counterfactual joint distribution of multiple quantities of interest (e.g., outcomes) in a multivariate causal model extended from the classical difference-in-difference design. Existing methods for this task either ignore the correlation structures among dimensions of the multivariate outcome by considering univariate causal models on each dimension separately and hence produce incorrect counterfactual distributions, or scale poorly even for moderate-size datasets when directly dealing with such a multivariate causal model. We propose a method that alleviates both issues simultaneously by leveraging a robust latent one-dimensional subspace of the original high-dimensional space and exploiting the efficient estimation from the univariate causal model on such space. Since the construction of the one-dimensional subspace uses information from all the dimensions, our method can capture the correlation structures and produce good estimates of the counterfactual distribution. We demonstrate the advantages of our approach over existing methods on both synthetic and real-world data.
2022
A Hypergraph Approach for Estimating Growth Mechanisms of Complex Networks
Temporal datasets that describe complex interactions between individuals over time are increasingly common in various domains. Conventional graph representations of such datasets may lead to information loss since higher-order relationships between more than two individuals must be broken into multiple pairwise relationships in graph representations. In those cases, a hypergraph representation is preferable since it can preserve higher-order relationships by using hyperedges. However, existing hypergraph models of temporal complex networks often employ some data-independent growth mechanism, which is the linear preferential attachment in most cases. In principle, this pre-specification is undesirable since it completely ignores the data at hand. Our work proposes a new hypergraph growth model with a data-driven preferential attachment mechanism estimated from observed data. A key component of our method is a recursive formula that allows us to overcome a bottleneck in computing the normalizing factors in our model. We also treat an often-neglected selection bias in modeling the emergence of new edges with new nodes. Fitting the proposed hypergraph model to 13 real-world datasets from diverse domains, we found that all estimated preferential attachment functions deviate substantially from the linear form. This demonstrates the need to do away with the linear preferential attachment assumption and adopt a data-driven approach. We also showed that our model outperformed conventional models in replicating the observed first-order and second-order structures in these real-world datasets.
2021
Non-parametric Estimation of the Preferential Attachment Function from One Network Snapshot
Preferential attachment is commonly invoked to explain the emergence of those heavy-tailed degree distributions characteristic of growing network representations of diverse real-world phenomena. Experimentally confirming this hypothesis in real-world growing networks is an important frontier in network science research. Conventional preferential attachment estimation methods require that a growing network be observed across at least two snapshots in time. Numerous publicly available growing network datasets are, however, only available as single snapshots, leaving the applied network scientist with no means of measuring preferential attachment in these cases. We propose a nonparametric method, called PAFit-oneshot, for estimating preferential attachment in a growing network from one snapshot. PAFit-oneshot corrects for a previously unnoticed bias that arises when estimating preferential attachment values only for degrees observed in the single snapshot. Our work provides a means of measuring preferential attachment in a large number of publicly available one-snapshot networks. As a demonstration, we estimated preferential attachment in three such networks, and found sublinear preferential attachment in all cases. PAFit-oneshot is implemented in the R package PAFit.
2020
Joint Estimation of Non-parametric Transitivity and Preferential Attachment Functions in Scientific Co-authorship Networks
We propose a statistical method for estimating the non-parametric transitivity and preferential attachment functions simultaneously in a growing network, in contrast to conventional methods that either estimate each function in isolation or assume a certain functional form for these. Our model is demonstrated to exhibit a good fit to two real-world co-authorship networks and can illuminate several intriguing details of the preferential attachment and transitivity phenomena that would be unavailable under traditional methods. Moreover, we introduce a method for quantifying the amount of contributions of these phenomena in the growth process of a network based on the probabilistic dynamic process induced by the model formula. By applying this method, we found that transitivity dominated preferential attachment in both co-authorship networks. This suggests the importance of indirect relations in scientific creative processes. The proposed method is implemented in the R package FoFaF.
PAFit: An R Package for the Non-Parametric Estimation of Preferential Attachment and Node Fitness in Temporal Complex Networks
Many real-world systems are profitably described as complex networks that grow over time. Preferential attachment and node fitness are two simple growth mechanisms that not only explain certain structural properties commonly observed in real-world systems, but are also tied to a number of applications in modeling and inference. While there are statistical packages for estimating various parametric forms of the preferential attachment function, there is no such package implementing non-parametric estimation procedures. The non-parametric approach to the estimation of the preferential attachment function allows for comparatively finer-grained investigations of the "rich-get-richer" phenomenon that could lead to novel insights in the search to explain certain nonstandard structural properties observed in real-world networks. This paper introduces the R package PAFit, which implements non-parametric procedures for estimating the preferential attachment function and node fitnesses in a growing network, as well as a number of functions for generating complex networks from these two mechanisms. The main computational part of the package is implemented in C++ with OpenMP to ensure scalability to large-scale networks. In this paper, we first introduce the main functionalities of PAFit through simulated examples, and then use the package to analyze a collaboration network between scientists in the field of complex networks. The results indicate the joint presence of "rich-get-richer" and "fit-get-richer" phenomena in the collaboration network. The estimated attachment function is observed to be near-linear, which we interpret as meaning that the chance an author gets a new collaborator is proportional to their current number of collaborators. Furthermore, the estimated author fitnesses reveal a host of familiar faces from the complex networks community among the field’s topmost fittest network scientists.
2018
The Evolutions of the Rich Get Richer and the Fit Get Richer Phenomena in Scholarly Networks: The Case of the Strategic Management Journal
Understanding how a scientist develops new scientific collaborations or how their papers receive new citations is a major challenge in scientometrics. The approach being proposed simultaneously examines the growth processes of the co-authorship and citation networks by analyzing the evolutions of the rich get richer and the fit get richer phenomena. In particular, the preferential attachment function and author fitnesses, which govern the two phenomena, are estimated non-parametrically in each network. The approach is applied to the co-authorship and citation networks of the flagship journal of the strategic management scientific community, namely the Strategic Management Journal. The results suggest that the abovementioned phenomena have been consistently governing both temporal networks. The average of the attachment exponents in the co-authorship network is 0.30 while it is 0.29 in the citation network. This suggests that the rich get richer phenomenon has been weak in both networks. The right tails of the distributions of author fitness in both networks are heavy, which implies that the intrinsic scientific quality of each author has been playing a crucial role in getting new citations and new co-authorships. Since the total competitiveness in each temporal network is found to be rising with time, it is getting harder to receive a new citation or to develop a new collaboration. Analyzing the average competency, it was found that on average, while the veterans tend to be more competent at developing new collaborations, the newcomers are likely better at acquiring new citations. Furthermore, the author fitness in both networks has been consistent with the history of the strategic management scientific community. This suggests that coupling node fitnesses throughout different networks might be a promising new direction in analyzing simultaneously multiple networks.
Transitivity vs Preferential Attachment: Determining the Driving Force Behind the Evolution of Scientific Co-Authorship Networks
We propose a method for the non-parametric joint estimation of preferential attachment and transitivity in complex networks, as opposed to conventional methods that either estimate one mechanism in isolation or jointly estimate both assuming some functional forms. We apply our method to three scientific co-authorship networks between scholars in the complex network field, physicists in high-energy physics, and authors in the Strategic Management Journal. The non-parametric method revealed complex trends of preferential attachment and transitivity that would be unavailable under conventional parametric approaches. In all networks, having one common collaborator with another scientist increases the chance that one will collaborate with that scientist by at least five times. Finally, by quantifying the contribution of each mechanism, we found that while transitivity dominates preferential attachment in the high-energy physics network, preferential attachment is the main driving force behind the evolutions of the remaining two networks.
2016
Joint Estimation of Preferential Attachment and Node Fitness in Growing Complex Networks
Complex network growth across diverse fields of science is hypothesized to be driven in the main by a combination of preferential attachment and node fitness processes. For measuring the respective influences of these processes, previous approaches make strong and untested assumptions on the functional forms of either the preferential attachment function or fitness function or both. We introduce a Bayesian statistical method called PAFit to estimate preferential attachment and node fitness without imposing such functional constraints that works by maximizing a log-likelihood function with suitably added regularization terms. We use PAFit to investigate the interplay between preferential attachment and node fitness processes in a Facebook wall-post network. While we uncover evidence for both preferential attachment and node fitness, thus validating the hypothesis that these processes together drive complex network evolution, we also find that node fitness plays the bigger role in determining the degree of a node. This is the first validation of its kind on real-world network data. But surprisingly the rate of preferential attachment is found to deviate from the conventional log-linear form when node fitness is taken into account. The proposed method is implemented in the R package PAFit.
Nonparametric Estimation of the Preferential Attachment Function in Complex Networks: Evidence of Deviations from Log Linearity
We introduce a statistically sound method called PAFit for the joint estimation of preferential attachment and node fitness in temporal complex networks. Together these mechanisms play a crucial role in shaping network topology by governing the way in which nodes acquire new edges over time. PAFit is an advance over previous methods in so far as it does not make any assumptions on the functional form of the preferential attachment function. We found that the application of PAFit to a publicly available Flickr social network dataset turned up clear evidence for a deviation of the preferential attachment function from the popularly assumed log-linear form. What is more, we were surprised to find that hubs are not always the nodes with the highest node fitnesses. PAFit is implemented in an R package of the same name.
2015
PAFit: A Statistical Method for Measuring Preferential Attachment in Temporal Complex Networks
Preferential attachment is a stochastic process that has been proposed to explain certain topological features characteristic of complex networks from diverse domains. The systematic investigation of preferential attachment is an important area of research in network science, not only for the theoretical matter of verifying whether this hypothesized process is operative in real-world networks, but also for the practical insights that follow from knowledge of its functional form. Here we describe a maximum likelihood based estimation method for the measurement of preferential attachment in temporal complex networks. We call the method PAFit, and implement it in an R package of the same name. PAFit constitutes an advance over previous methods primarily because we based it on a nonparametric statistical framework that enables attachment kernel estimation free of any assumptions about its functional form. We show this results in PAFit outperforming the popular methods of Jeong and Newman in Monte Carlo simulations. What is more, we found that the application of PAFit to a publicly available Flickr social network dataset yielded clear evidence for a deviation of the attachment kernel from the popularly assumed log-linear form. Independent of our main work, we provide a correction to a consequential error in Newman’s original method which had evidently gone unnoticed since its publication over a decade ago.
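For readers unfamiliar with the "log-linear form", it refers to an attachment kernel of the shape A_k = k^alpha, where k is a node's degree. The sketch below simulates a growing network under that kernel. It illustrates only the generative process, not the PAFit estimator or the R package's API; the function name `grow_network` and its parameters are hypothetical.

```python
import random


def grow_network(n_nodes, alpha, seed=0):
    """Grow a tree where each new node attaches to one existing node.

    An existing node with degree k is chosen with probability
    proportional to k ** alpha (the log-linear attachment kernel).
    Returns the list of node degrees.
    """
    rng = random.Random(seed)
    degree = [1, 1]  # start from two nodes joined by a single edge
    while len(degree) < n_nodes:
        weights = [k ** alpha for k in degree]
        target = rng.choices(range(len(degree)), weights=weights, k=1)[0]
        degree[target] += 1  # the chosen node gains an edge
        degree.append(1)     # the new node arrives with degree 1
    return degree


degrees = grow_network(n_nodes=500, alpha=1.0, seed=42)
print(max(degrees), sum(degrees))
```

Setting `alpha=1.0` gives linear preferential attachment, `alpha=0.0` gives uniform attachment, and values in between give a sublinear kernel. Recomputing the weights at every step keeps the sketch short at the cost of quadratic running time.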