Applied Probability and Statistics
Organizer: Liliana Blanco Castañeda, Universidad Nacional de Colombia
Privacy-preserving parametric inference: a case for robust statistics
Marco Avella, Columbia University (New York), USA
Differential privacy is a cryptographically motivated definition of privacy that has become a very active field of research over the last decade in theoretical computer science and machine learning. In this paradigm we assume there is a trusted curator who holds the data of individuals in a database, and the goal of privacy is to protect individual data while still allowing statistical analysis of the database as a whole. In this setting we introduce a general framework for parametric inference with differential privacy guarantees. We first obtain differentially private estimators based on bounded-influence M-estimators by leveraging their gross error sensitivity to calibrate a noise term that is added to them in order to ensure privacy. We then show how a similar construction can be applied to obtain differentially private test statistics analogous to the Wald, score and likelihood ratio tests. We provide statistical guarantees for all our proposals via an asymptotic analysis. An interesting consequence of our results is to further clarify the connection between differential privacy and robust statistics. In particular, we demonstrate that differential privacy is a weaker requirement than infinitesimal robustness and show that robust M-estimators can be easily randomized in order to guarantee both differential privacy and robustness to the presence of contaminated data. We illustrate our results on both simulated and real data.
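The idea of calibrating noise to the gross error sensitivity can be sketched for the simplest case, a Huber location estimator with unit scale, whose gross error sensitivity is c / P(|X - θ| ≤ c). The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the noise scale GES / (ε n) uses a placeholder constant rather than the calibration derived in the talk, and the code therefore carries no formal privacy guarantee.

```python
import numpy as np

def huber_location(x, c=1.345, tol=1e-10, max_iter=200):
    """Huber M-estimate of location (scale assumed known and equal to 1)."""
    x = np.asarray(x, dtype=float)
    theta = np.median(x)
    for _ in range(max_iter):
        r = x - theta
        # Iteratively reweighted mean: weight psi(r)/r is 1 inside [-c, c]
        # and c/|r| outside.
        w = np.ones_like(r)
        outside = np.abs(r) > c
        w[outside] = c / np.abs(r[outside])
        new_theta = np.sum(w * x) / np.sum(w)
        converged = abs(new_theta - theta) < tol
        theta = new_theta
        if converged:
            break
    return theta

def empirical_ges(x, theta, c=1.345):
    """Plug-in gross error sensitivity: sup|psi| / mean(psi') = c / P_n(|x - theta| <= c)."""
    x = np.asarray(x, dtype=float)
    frac_inside = max(np.mean(np.abs(x - theta) <= c), 1.0 / len(x))
    return c / frac_inside

def noisy_huber_location(x, epsilon, c=1.345, rng=None):
    """Huber estimate plus Gaussian noise scaled by GES / (epsilon * n).

    ASSUMPTION: the constant in front of the noise scale is a placeholder;
    a real differentially private release needs the calibration from the
    underlying theory.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    theta = huber_location(x, c)
    scale = empirical_ges(x, theta, c) / (epsilon * n)
    return theta + scale * rng.standard_normal(), theta, scale
```

Because the influence function of the Huber estimator is bounded, the noise scale stays of order 1/n even when part of the sample is grossly contaminated, whereas the sample mean has unbounded sensitivity.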
Comparison of results between fuzzy and classical classification as clustering methods for family agricultural production units in Colombia
Martha Tatiana Pamela Jiménez Valderrama, Universidad de La Salle Bogotá, Colombia
As part of a PhD thesis project in Agrociencias at the Universidad de La Salle (Colombia), fuzzy clustering was one of the procedures selected to evaluate the sustainability of family Agricultural Production Units (UPAs). This choice is motivated by the high heterogeneity of the observational units and by the fact that the definition of sustainability does not admit a scale with mutually exclusive categories. One question that arises when implementing this methodology is whether it really yields results different from those of a classical clustering procedure. In this progress report I show that, although the differences found seem slight at the statistical level, the fuzzy result allows a deeper qualitative analysis of the level of sustainability of the UPAs and would give those who design public policy for agricultural development in Colombia a closer approximation to reality. The results presented here concern environmental indicators (agricultural practices that are or are not environmentally friendly), for which there is a sample of 43,000 rural agricultural production units in Colombia, extracted from the database of the III Censo Nacional Agropecuario 2014.
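The contrast between fuzzy and classical (hard) clustering can be illustrated with a minimal sketch. It assumes fuzzy c-means as the fuzzy procedure, which the abstract does not specify; the function name and parameters are hypothetical. Each unit receives a degree of membership in every cluster, and a hard partition is recovered by taking the cluster with the largest membership.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Minimal fuzzy c-means. Returns (centers, U), where U[i, k] is the
    membership of observation i in cluster k and each row of U sums to 1."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        Um = U ** m
        # Cluster centers are membership-weighted means.
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Euclidean distance from every observation to every center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)  # avoid division by zero
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U
```

A unit lying between two groups gets memberships close to 0.5 in each, information that a hard assignment such as `U.argmax(axis=1)` discards.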
Group Elastic Net: Towards Sparse Geometric Data Analysis for Categorical Data
Natalia Hernández Vargas, Universidad del Norte Barranquilla, Colombia
In the social sciences, there are variables related to the subjects of study, such as “identities, perceptions and beliefs”, that cannot be expressed through numbers. In addition, the number of categorical variables in these datasets tends to be high. As a result, social scientists have only a limited set of statistical methods available for data analysis. Sparse Principal Component Analysis (SPCA) was developed to reduce the number of continuous variables in exploratory multivariate analysis. It formulates PCA as an optimization problem and integrates the elastic net (lasso) constraint into the regression criterion, which leads to modified principal components with sparse loadings. The “group Lasso” is an extension of the Lasso for factor selection. Group Sparse Principal Component Analysis (GSPCA) is a compromise between SPCA and the group Lasso: it selects a sparse set of groups of variables. However, GSPCA (group lasso) ignores the grouping effect of highly correlated groups of factors. Group Elastic Net is a straightforward extension of SPCA (elastic net) and GSPCA (group lasso). Real examples from the social sciences (education, urban studies, finance) will be used to illustrate the advantages of this method in terms of factor selection.
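The penalties mentioned above differ only in how they measure the size of a loading vector. The sketch below writes them out under stated assumptions: the group lasso uses the common square-root-of-group-size weighting, and the group elastic net is taken to be the group lasso penalty plus a ridge term, which is one natural reading and not necessarily the exact form used in the talk. All function names are hypothetical.

```python
import numpy as np

def elastic_net_penalty(beta, lam1, lam2):
    """Lasso (L1) term plus ridge (squared L2) term."""
    beta = np.asarray(beta, dtype=float)
    return lam1 * np.sum(np.abs(beta)) + lam2 * np.sum(beta ** 2)

def group_lasso_penalty(beta, groups, lam):
    """Sum over groups of sqrt(group size) times the Euclidean norm of the
    group's coefficients; whole groups are shrunk to zero together."""
    beta = np.asarray(beta, dtype=float)
    groups = np.asarray(groups)
    total = 0.0
    for g in np.unique(groups):
        b = beta[groups == g]
        total += np.sqrt(b.size) * np.linalg.norm(b)
    return lam * total

def group_elastic_net_penalty(beta, groups, lam1, lam2):
    """ASSUMED form: group lasso term plus a ridge term on all coefficients."""
    beta = np.asarray(beta, dtype=float)
    return group_lasso_penalty(beta, groups, lam1) + lam2 * np.sum(beta ** 2)
```

For categorical data, a group would typically collect the indicator columns of one variable, so that the variable is kept or dropped as a whole.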
Bayesian epidemiological data analysis in small areas
Karen Cecilia Flórez Lozano, Universidad del Norte, Barranquilla, Colombia
One of the main objectives of disease mapping is to describe the spatial variation of the risk of a disease and to evaluate and quantify the amount of true spatial heterogeneity and the associated risk patterns (Lawson, 2009). Most of the models proposed in the literature provide estimated relative risks in small areas taking the neighborhood structure into account, so that neighboring areas have similar risks; one of the most popular is the convolution model (Besag et al., 1991). However, there are situations in which this assumption may not be appropriate. Recent studies in this area are able to accurately explore the geographical variation of the disease in terms of different underlying spatial risk factors. In this session we will study the geographical variation of the risk of a certain disease in a specific study area. Within this theme, methods have been proposed that, although they manage to estimate the relative risk in small areas, sometimes tend to oversmooth, which makes it difficult to identify high-risk areas. The model presented here attempts to address the two objectives of disease mapping simultaneously: on the one hand, estimating the relative risk in small areas and, at the same time, detecting discontinuities between them. The model yields risk estimates for each of the areas that make up the study region and makes it possible to study the number of classes or clusters that may exist in a geographical area. The proposed model does not require the dependence or distance between neighbors to be defined in advance; instead, it uses a formulation in which risk allocation variables capture different risk structures. Thus, it is an alternative approach in which the relative risks of small areas are assigned to underlying risks. The proposal draws on ideas from mixture models, cluster detection and latent structure models (Knorr-Held and Rasser, 2000; Lee and Lawson, 2014).
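For reference, the convolution model of Besag et al. (1991) mentioned above is usually written as follows. The notation is an assumption of this note, not taken from the abstract: O_i and E_i are the observed and expected counts in area i, θ_i is the relative risk, j ∼ i denotes the neighbors of area i, and n_i is their number.

```latex
\begin{align*}
O_i \mid \theta_i &\sim \mathrm{Poisson}(E_i \theta_i), \\
\log \theta_i &= \alpha + u_i + v_i, \\
v_i &\sim \mathcal{N}(0, \sigma_v^2), \\
u_i \mid u_{j \neq i} &\sim \mathcal{N}\!\left( \frac{1}{n_i} \sum_{j \sim i} u_j,\; \frac{\sigma_u^2}{n_i} \right).
\end{align*}
```

The spatially structured effect u_i pulls each area toward the mean of its neighbors, which is the source of the smoothing, and potential oversmoothing, discussed in the abstract.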