Shunpu Zhang

Professor & Chair

Department of Statistics
University of Central Florida
Computer Center II, 205
Orlando, FL 32816-2370

PHONE:
OFFICE: TC2 212C
E-MAIL: Shunpu.Zhang@ucf.edu


Biography

Dr. Zhang received his Ph.D. in statistics from the University of Alberta, Canada. Since 1997, he has served as an assistant and then associate professor of statistics at the University of Alaska-Fairbanks, and as an associate and then full professor of statistics at the University of Nebraska-Lincoln. While at Nebraska, he also served as a Mathematical Statistician at the National Cancer Institute, USA. His research interests include general statistical methodology, empirical Bayes data analysis, bioinformatics, multiple hypothesis testing and its application to genomics, and statistical methods related to influenza virus genotyping. He has also worked in big data analytics.

Research

Statisticians provide crucial guidance in determining what information is reliable and which predictions can be trusted. They often help search for clues to the solution of a scientific mystery, and sometimes keep investigators from being misled by false impressions. Statisticians work in a variety of fields, including medicine, government, education, agriculture, business, and law. My research philosophy is to produce research that advances the practice of statistical methodologies in scientific fields.

Main Research Fields
1) Bayes and Empirical Bayes Methods for Contaminated Data
My major contribution in this field is the development of empirical Bayes estimation and testing methods for multiple linear regression models with measurement errors. My research revealed how, and to what degree, the presence of errors in observations affects the performance of Bayes and empirical Bayes estimators and testing procedures.

2) Nonparametric Functional Estimation
My interest in nonparametric functional estimation is mostly focused on solving the boundary problem. Boundary problems are well known to occur in nonparametric density and regression function estimation when the support of the estimated function has finite endpoints. The immediate effect of the boundary problem is under- or overestimation of the density or regression function, and the problem is most severe at the endpoints of the support. The boundary problem is of practical importance because real data (especially medical data) are often left- or right-truncated. My main contributions on this topic are:
a) I derived the optimal order (0, 2) endpoint kernel for kernel density estimation. Here, “optimal” means that the resulting estimator has the smallest mean squared error (MSE) among all possible order (0, 2) kernels for estimating the density function at the endpoint of its support.
b) I obtained the optimal endpoint kernels of any order and provided a simple method for constructing such kernels. Such kernels are used for estimating the derivatives of the density or regression function.
c) It has long been known that boundary kernels (or endpoint kernels) can produce negative estimates of the density function in the boundary region, and that the estimator often has a much larger variance in the boundary region than at interior points. I proposed “the generalized reflection method”, a novel method for correcting the boundary problem. The new method has been shown to eliminate the negativity problem of the kernel estimator while also efficiently reducing its variance at the boundary. The paper is now one of the major references on the boundary problem of kernel density estimators.
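
As an illustration only, the boundary problem described above can be reproduced with a few lines of Python. The sketch below uses the classical simple reflection correction, not the generalized reflection method, and assumes a Gaussian kernel, an exponential density on [0, ∞) whose true value at 0 is 1, and an arbitrary bandwidth h = 0.2. The function names are hypothetical.

```python
import math
import random


def gaussian_kernel(u):
    """Standard normal density, used here as the smoothing kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x, data, h):
    """Ordinary kernel density estimate at x (no boundary correction)."""
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (len(data) * h)


def kde_reflected(x, data, h):
    """Simple reflection estimate for support [0, inf).

    Each observation xi is mirrored to -xi, so the kernel mass that would
    leak below 0 is folded back. The result is non-negative by construction.
    """
    total = sum(
        gaussian_kernel((x - xi) / h) + gaussian_kernel((x + xi) / h)
        for xi in data
    )
    return total / (len(data) * h)


random.seed(0)
# Assumed example data: exponential(1) sample, true density at 0 equals 1.
data = [random.expovariate(1.0) for _ in range(2000)]
h = 0.2  # bandwidth chosen for illustration only

plain_at_0 = kde(0.0, data, h)
reflected_at_0 = kde_reflected(0.0, data, h)
print("uncorrected estimate at 0:", round(plain_at_0, 3))
print("reflected estimate at 0:  ", round(reflected_at_0, 3))
```

At the endpoint the uncorrected estimate recovers only about half of the true density, because half of each kernel lies outside the support. Simple reflection doubles it, but some bias remains when the density has a nonzero slope at the boundary, which is the kind of situation that motivates more refined corrections.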

3) Nonparametric Deconvolution Problem
Nonparametric deconvolution is concerned with recovering the true probability density or regression function when the observations are contaminated with measurement errors. Although extensive results on nonparametric deconvolution have been reported in the literature, I was the first to study the boundary problem for deconvolution estimators.

4) Statistical Ecology
My research in this field focuses on nonparametric estimation of animal abundance in a region from data collected by the transect sampling method.

5) Bioinformatics
Starting in 2004, I shifted my focus to interdisciplinary research and began working in bioinformatics and health informatics. I have published in BMC Bioinformatics, Bioinformatics, and the Journal of Computational Biology. My paper “A comprehensive evaluation of SAM, the SAM R-package and a simple modification to improve its performance” has been labeled as by the journal website. This paper is also one of the major references for the significance analysis of microarrays (SAM) method in Wikipedia.
Since 2006, I have also been collaborating with biologists from the University of Nebraska Omaha and the Centers for Disease Control and Prevention (CDC) on research into avian flu viruses. Recent outbreaks of highly pathogenic avian influenza A virus infections in poultry and humans have caused tremendous economic losses and raised concerns about a future influenza pandemic. The purpose of our joint research is to correctly genotype, based on genomic sequences, the influenza A viruses reported to the CDC from outbreaks around the world.

6) Health Informatics, Genetic Epidemiology and Large-Scale Hypothesis Testing
In recent years, I have spent most of my time developing more efficient methods for analyzing large-scale data and for the simultaneous comparison of tens of thousands or even millions of tests. One of the projects I have been working on with my collaborators at the National Cancer Institute (NCI) is ranking health indices. Health indices such as age-adjusted mortality rates are summary measures of health, and of the factors that affect health, for a given population. Such information can be used to inform the general public about the health condition of the community, to support government policy making, to evaluate the effect of a current policy or health care program, or to inform health care providers. Another project my collaborators at NCI and I have been working on is how to combine the information from individually non-significant genes to predict disease caused by a group (or a network) of genes, termed combined significance.
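
To make the idea of combining individually non-significant results concrete, the following is a minimal sketch using the classical Fisher method for combining independent p-values. It is not the combined p-value test developed in this research. The p-values are hypothetical, the tests are assumed to be independent, and the function names are illustrative.

```python
import math


def chi2_sf_even_df(x, df):
    """Survival function of a chi-square variable with even df.

    For df = 2k, P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    """
    k = df // 2
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total


def fisher_combined_p(pvalues):
    """Fisher's method: -2 * sum(log p_i) is chi-square with 2m df under H0."""
    stat = -2.0 * sum(math.log(p) for p in pvalues)
    return chi2_sf_even_df(stat, 2 * len(pvalues))


# Hypothetical p-values for five genes, none significant at the 0.05 level.
gene_pvalues = [0.08, 0.10, 0.07, 0.12, 0.09]
combined = fisher_combined_p(gene_pvalues)
print("combined p-value:", combined)
```

In this made-up example, no single gene passes the 0.05 threshold, yet the combined evidence does, which is the situation that combined significance aims to capture.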

Representative Publications

  • Zhang, S. and Karunamuni, R.J. (1997). Bayes and empirical Bayes estimation with errors in variables. Statistics and Probability Letters, 33, 23-34.
  • Zhang, S. and Karunamuni, R.J. (1997). Empirical Bayes estimation for the continuous one-parameter exponential family with errors in variables. Statistics and Decisions, 15, 261-279.
  • Zhang, S. and Wei, L. (1998). A note on the empirical Bayes convergence rates for the multiple parameter exponential family. Communications in Statistics, Theory and Methods, 28(4), 1273-1292.
  • Zhang, S. and Karunamuni, R.J. (1998). On kernel density estimation near endpoints. Journal of Statistical Planning and Inference, 70, 301-316.
  • Zhang, S., Karunamuni, R.J. and Jones, M.C. (1999). An improved estimator of the density function at the boundary. Journal of the American Statistical Association, 94, 1231-1241.
  • Zhang, S. and Karunamuni, R.J. (2000). On nonparametric density estimation at the boundary. Journal of Nonparametric Statistics, 12, 197-221.
  • Zhang, S. and Karunamuni, R.J. (2000). Boundary Bias Correction for Nonparametric Deconvolution. Annals of the Institute of Statistical Mathematics, 52(4), 612-629.
  • Zhang, S. and Karunamuni, R.J. (2009). Deconvolution Boundary Kernel Method in Nonparametric Density Estimation. Journal of Statistical Planning and Inference, 139, 2269-2283.
  • Mack, Y. P., Quang, P. X. and Zhang, S. (1999). Kernel Estimation of Wildlife Abundance from Transect Data without Shoulder Condition. Communications in Statistics, Theory and Methods, A, 28, 2277-2296.
  • Zhang, S. (2001). Improvements on the kernel estimation in line transect sampling without the shoulder condition. Statistics and Probability Letters, 53, 249-258.
  • Zhang, S. (2001). Generalized Likelihood Ratio Test of the Shoulder Condition in Line Transect Sampling. Communications in Statistics, Theory and Methods, 30, 2343-2354.
  • Zhang, S. (2006). An Improved Nonparametric Approach for Detecting Differentially Expressed Genes with Replicated Microarray Data. Statistical Applications in Genetics and Molecular Biology, 5(1).
  • Zhang, S. (2007). A comprehensive evaluation of SAM, the SAM R-package and a simple modification to improve its performance. BMC Bioinformatics, 8:230. Available at http://www.biomedcentral.com/1471-2105/8/230.
  • Zhang, S., Lu, G., Fang, X., and Donis, R. (2007). Multidimensional scaling and model-based clustering analyses for the clade assignments of the HPAI H5N1 viruses. Options for the Control of Influenza VI, International Medical Press. London: Blackwell.
  • Jiao, S. and Zhang, S. (2008). The t-mixture model approach for detecting differentially expressed genes in microarrays. Functional & Integrative Genomics, 8: 181-186. Available at http://www.springerlink.com/content/whp223044833424u/.
  • Lu, G., Zhang, S., and Fang, X. (2008). An improved string composition method for sequence comparison. BMC Bioinformatics, 9 (Suppl 6): S15.
  • Jiao, S. and Zhang, S. (2008). On correcting the overestimation of the permutation-based false discovery rate estimator. Bioinformatics, 24(15):1655-1661.
  • Jiao, S. and Zhang, S. (2010). Estimating the proportion of equivalently expressed genes in microarray data based on transformed test statistics. Journal of Computational Biology, 17(2), 177-187.
  • Jiao, S., Bailey, C. P., Zhang, S. and Ladunga, I. (2010). Probabilistic peak calling and controlling False Discovery Rate in transcription factor binding site mapping from ChIP-seq. The Computational Biology of Transcription Factor Binding. In the series: Methods in Molecular Biology. Berlin: Springer.
  • Mechanic, L. E., Chen, H-S., Amos, C. I., Chatterjee, N., Cox, N. J., Divi, R. L., Fan, R., Harris, E. L., Jacobs, K., Kraft, P., Leal, S. M., McAllister, K., Moore, J. H., Paltoo, D. N., Province, M. A., Ramos, E. M., Ritchie, M. D., Roeder, K., Schaid, D. J., Stephens, M., Thomas, D. C., Weinberg, C. R., Witte, J., Zhang, S., Zöllner, S., Feuer, E. J., Gillanders, E. M. (2011). Next Generation Analytic Tools for Large Scale Genetic Epidemiology Studies of Complex Diseases. Genetic Epidemiology, 36: 22–35.
  • Zhang, S., Pfeiffer, R., and Chen, H. (2013). A Combined p-value Test for Multiple Hypothesis Testing. Journal of Statistical Planning and Inference, 143, 764-770.
  • Chen, H., Pfeiffer, R. and Zhang, S. (2013). A powerful method for combining p-values in genomic studies. Genetic Epidemiology, 37(8):814-9.
  • Zhang, S., Luo, J., Zhu, L., Stinchcomb, G. D., Campbell, D., Carter, G., Gilkeson, S., Feuer, E. J. (2014). Confidence intervals for ranks of age-adjusted rates across states or counties. Statistics in Medicine, 33(11):1853-66.