Duke University, Statistical Science

Program ID: 94-ISI2020 [#897]
Program Title: Summer@Duke Statistical Science - Duke-IIT-ISI summer projects
Program Type: Undergraduate program
Program Location: Durham, North Carolina 27708-0251, United States
Subject Area: Statistical Science
Application Deadline: 2019/12/31 (posted 2019/12/03, listed until 2020/06/03)
Program Description:    

Summer@Duke Statistical Science - Duke-IIT-ISI summer projects

Organizing committee: Alexander Volfovsky and Surya Tokdar (Duke University Statistical Science)

The eight-to-ten-week Summer@Duke Statistical Science program is designed for a select group of students who are considering applying the following fall to PhD programs in statistics, biostatistics, data science, and related fields. The program will give these students the opportunity to immerse themselves in research, interact with students at Duke, and participate in seminars and working groups.

The faculty advisers for this program include standing members of the Duke Statistical Science department who have outlined several research projects spanning applied, methodological, and theoretical subjects. Students will work on assigned projects in close collaboration with the faculty advisers and current Duke graduate students. The final product of each project will be a write-up of the research conducted during the program.

Duke provides an excellent research environment. Students will have space in the Information Initiative at Duke, which co-houses several summer research programs. Throughout the summer there are professional development and research presentations in this building. The program participants will have access to graduate student mentors and Duke computing resources. Duke Statistical Science will provide logistical support for students and will help build community through fun activities and events.

Funding details: We will provide financial support for each participant up to $6,000 to help cover travel, housing, meals and incidentals. Duke Visa Services will help facilitate entry to the United States.

Application materials required:
  1. Personal statement of interest: please select up to three projects from the proposed list below, and describe why they interest you and what relevant expertise you bring to them. Also describe any previous research you have conducted.
  2. One reference letter (to be submitted by the reference writer on mathprograms.org)
  3. Unofficial transcript that lists relevant STEM and related courses taken.
Applications are due on mathprograms.org by December 31, 2019 to guarantee consideration. Statistical Science will begin making offers by mid-January to allow for timely visa processing.

Proposed projects (proposing faculty member in parentheses):
  • Novel statistical and machine learning methods for species identification and distribution modeling. We are collecting unprecedented data on biodiversity across the globe, consisting of indirect measurements of species occurrence obtained through audio recordings (to identify birds), camera traps, and automated DNA sequencing of insects and fungi. We need improved methods for accounting for measurement error in species identification based on these data, including allowance for uncertainty regarding which species an individual belongs to and whether the evolutionary tree should be revised in light of our new data - for example, to add species. As very little is known about fungal and insect biodiversity in particular, we expect to add tens of thousands of new species to the phylogenetic tree, while also learning about new families. Challenges with the data include spatial and temporal dependence and very high dimensionality - for example, preliminary data on insects contain DNA barcoding for over two million individual insects representing approximately 180,000 species. (Dunson)
  • Variational inference for AME network models. An AME (additive and multiplicative effects) model is a latent variable model for social networks and other types of relational data. Most implementations of AME models currently use MCMC, but this can be prohibitively slow when the network is large. This project is to implement and study the use of variational methods as an alternative to MCMC for AME models. (Hoff)
  • Principled approaches to community detection. Network inference problems often use spectral decompositions of the graph Laplacian to reveal underlying community structure. One line of recent work has proposed Gaussian approximations to the eigenvectors of the adjacency matrix and used these to devise community detection algorithms with probabilistic uncertainty quantification. However, these approaches are currently limited to latent space models with certain kinds of structure (simple graphs satisfying a degree-balanced constraint). The goal of this project is to develop theory and methods for more realistic settings involving asymmetric weighted graphs and covariate information. A further goal is to compare these approaches with state-of-the-art graph neural networks that are trained from data. (Reeves)
  • Sparse Joint Quantile Regression. Four decades ago, Roger Koenker and Gib Bassett introduced the idea of quantile regression (QR). Today, QR is widely recognized as a fundamental statistical tool for analyzing complex predictor-response relationships, with a growing list of applications in ecology, economics, education, public health, climatology, and so on. In QR, one replaces the standard regression equation of the mean with a similar equation for a quantile at a given quantile level of interest. But the real strength of QR lies in the possibility of analyzing any quantile level of interest and, perhaps more importantly, contrasting many such analyses against each other with fascinating consequences. In spite of the popularity of QR, it is only recently that an analysis framework has been developed (Yang and Tokdar, JASA 2017) which transforms Koenker and Bassett's four-decade-old idea into a model-based inference and prediction technique in its full generality. In doing so, the new joint estimation framework has opened doors to many important advancements of the QR analysis technique to address additional data complications. This project will focus on sparse quantile regression with many predictor variables. A fascinating feature of sparse QR is that predictors that may be important for a certain range of quantile levels could be completely irrelevant at a different range. In the single quantile level world, where this problem has received a lot of recent attention, a popular solution appears to be adding an L1 penalty term on the slope parameters at a given level and solving the overall optimization problem through a slight adaptation of the linear programming method employed for ordinary QR. Unfortunately, such level-by-level sparse estimation, with little borrowing of information across levels, results in a high degree of quantile crossing and uninterpretable inference when comparing and constructing multiple levels.
    In contrast, the joint estimation framework makes it obvious that in order to obtain a valid sparse QR formulation, the sparsity penalty should apply to the derivatives of the slope functions. This project will build upon this idea to design a Bayesian, sparse joint QR estimation framework and investigate its statistical properties. (Tokdar)
  • Majorization-minimization: The majorization-minimization principle offers a flexible framework for eliciting iterative procedures for optimizing difficult objective functions arising in a variety of statistical tasks. A generalization of EM, these algorithms have recently contributed toward difficult non-convex problems such as clustering and parameter estimation under sparsity, shape, and general set constraints. While these algorithms are simple and effective, several theoretical aspects remain open. We will seek to establish theory leading to confidence bounds/uncertainty quantification, optimal learning rates, and extensions to our recent methods. (Xu)
  • Evaluating disease spread in partially observed networks: The study of disease spread frequently requires the assumption that the paths along which the disease moves are known or fixed. In practice, however, we are never certain that we have observed all possible interactions between individuals, and moreover many individuals can be impossible to observe. In this project we will build on methods from network analysis that will help us account for such uncertainty and provide more accurate estimates of disease spread and intervention efficacy. (Volfovsky)
  • Causal inference with text and networks. Social media platforms allow people to surround themselves with those who share their political views, and this human predisposition is only amplified by algorithms embedded within many of the world's most popular platforms. One of the fundamental difficulties with online social media is the ability of malicious influence to spread in an instant. This can take the form of "fake news" or incitement to violence. We propose to study this spread by considering online patterns of sharing as they relate to the text of the information being shared and the stature of the original poster. This type of complex data structure can lead to complications during causal inference, and this project will develop statistical theory and methodology to assess treatment effects of information spread and evaluate any potential uncertainty due to the network and text information. (Volfovsky)

Further Info:
stat.duke.edu
 
Box 90251
Durham, NC 27708-0251
