# Learning R (for data analysis and data science): Where to start

A friend of a friend (also in educational research) posted that he was interested in learning R. I had a couple of ideas but knew that others might have better ideas. So, I posted (on Twitter) looking for recommendations and received some excellent talks, links, and other resources. Here they are, in Tweet form (you may have to …

# An R package for sensitivity analysis (konfound)

With Ran Xu and Ken Frank, I have worked on a Shiny interactive web application for sensitivity analysis as well as an R package for carrying out sensitivity analysis using R. That R package is now available on CRAN! A link to the CRAN page for it is here and the website for the …

# Explorations in Markov Chain Monte Carlo – comparing results from MCMCglmm and lme4

I’ve been interested in Markov Chain Monte Carlo (MCMC) for a little while, in part because of a paper by Tom Houslay and Alastair Wilson (2017) that shows how using output from models the way I have been can lead to results that overstate the impact of effects. In particular, I’m working on a project with colleagues …

# Risk, Uncertainty, and Heuristics

Risk = space of outcomes and probabilities are known. Uncertainty = probabilities not known, and even space of possibilities may not be known. Heuristic rules are contrasted with algorithms like maximization of expected utility. See also Bounded Cognition and Risk, Ambiguity, and Decision (Ellsberg). Here’s a well-known 2007 paper by Gigerenzer et al. Helping Doctors and …

# Find the top rail-trails in each state using mixed effects models

Outside of education, one of my interests is cycling, and one of my favorite ways to cycle is on rail-trails, pathways and greenways that are converted from former railroad tracks. In a side-project (and because the data source can be used for teaching and learning about complex, nested data), I collected information from the TrailLink website. …

# A Shiny interactive web application to quantify how robust inferences are to potential sources of bias (sensitivity analysis)

We are happy to announce the release of an interactive web application, Konfound-It, to make it easy to quantify the conditions necessary to change an inference. For example, Konfound-It generates statements such as “XX% of the estimate would have to be due to bias to invalidate the inference” or “an omitted variable would have to …

# Two data packages: Rail-trails and an assessment of student achievement

Because of interest and the need for better examples (for teaching and for use in tools under development, such as prcr and tidyLPA), I worked to create two data packages, that is, data made easily available through an R package. A benefit of the data being in an R package is that it is even easier to access than other formats (in R): …

# A person-in-context approach to student engagement in science (article in JRST)

Over the past few years, I have worked with Jennifer Schmidt and Patrick Beymer to explore student engagement in science using the Experience Sampling Method (ESM). Most recently, we used what scholars have referred to as a “person-in-context” approach, using both ESM and a person-oriented approach. A figure is helpful for conveying how the person-oriented approach can be used to …

# Accurate Genomic Prediction Of Human Height

I’ve been posting preprints on arXiv since its beginning ~25 years ago, and I like to share research results as soon as they are written up. Science functions best through open discussion of new results! After some internal discussion, my research group decided to post our new paper on genomic prediction of human height on …

# Phase Transitions and Genomic Prediction of Cognitive Ability

James Thompson (University College London) recently blogged about my prediction that with sample size of order a million genotypes|phenotypes, one could construct a good genomic predictor for cognitive ability and identify most of the associated common SNPs. The Hsu Boundary … The “Hsu boundary” is Steve Hsu’s estimate that a sample size of roughly 1 …

# Comparing MPLUS and MCLUST output

Introduction: At present, MPlus is a widely used tool to carry out Latent Profile Analysis, and there does not seem to be a widely accepted or used way to carry out Latent Profile Analysis in R. This compares output from MPlus to output from the R package MCLUST, which is accessed through the package tidymixmod which I …

# Using MPlus from R with MPlusAutomation

According to the MPlus website, the R package MPlusAutomation serves three purposes: creating related groups of models; running batches; and extracting and tabulating model parameters and test statistics. Because modeling involves comparing related models, (partially) automating these is compelling. It can make it easier to use model results in subsequent analyses and can cut down on copy and pasting …

# A first pass at Latent Profile Analysis using MCLUST (in R)

Along with starting to use MPlus, I’ve become (more) interested in trying to find out how to carry out Latent Profile Analysis (LPA) in R, focused on two options: OpenMx and MCLUST. The two are very different: OpenMx is an option for general latent variable modeling (i.e., it can be used to specify a wide range of latent …

# Updated Stepwise Regression Function

Back in 2011, when I was still teaching, I cobbled together some R code to demonstrate stepwise regression using F-tests for variable significance. It was a bit unrefined, not intended for production work, and a few recent comments on that post raised some issues with it. So I’ve worked up a new and (slightly) improved …

# In what months are educational psychology jobs posted?

Division 15 of the American Psychological Association sponsors the Ed Psych Jobs website, which is an excellent resource for Ed Psych job seekers. I thought it would possibly be helpful to see when jobs were posted in the past in order to have a better idea about when jobs may be posted this year. Ed Psych Jobs, Robots …

# Comparing estimates and their standard errors from mixed effects and linear models

Some background: One reason to use mixed effects models is that they help to account for data with a complex structure, such as multiple responses (to questions, for example) from the same people, students grouped into classes, and measures collected over time. Often, the way they account for these complex structures is in terms of …

# Using characteristics of rail-trails to predict how they are rated

Catching up: I wrote a blog post (one that, to be honest, I liked a lot) on what the best rail-trails are in Michigan (here). A friend and colleague at MSU, Andy, noticed that paved trails seemed to be rated higher, and this as well as my friend and colleague Kristy’s comment about how we …

# What are the best rail-trails in Michigan?

Background: I was curious about which rail-trails were the best in Michigan, and so to figure out an answer, I checked out the TrailLink website, sponsored by the Rails-to-Trails Conservancy. I had just purchased a copy of their book Rail-Trails Michigan and Wisconsin, and wanted to see whether I could learn more from the …

# An R package for plotting partially pooled estimates for mixed-effects models

I came across this excellent post from Tristan Mahr on plotting partially pooled estimates for mixed-effects models and was inspired to create an R package for it based on the code in the post. I found mixed models made more sense to me when I thought of them in terms of partial pooling, and I …

# Rock Climbing in the News (Updated): A quick look at how often rock climbing is mentioned after noteworthy events using newsflash

When I was visiting my brother, we came across a neat tool to track mentions of topics in the news, newsflash. We looked at how mentions of rock climbing spiked after particular media (a special on rock climber Alex Honnold) or events (the first ascent of the Dawn Wall in Yosemite National Park). You can …

# How many groups of Star Wars characters are there? R-squared and cross-validation approaches

Background: How many groups, or types, of Star Wars characters are there? I’ve been wanting to use the starwars dataset built into the dplyr package, and at the same time, have been working hard on an R package to carry out an analysis suited to doing this. Part of the challenge of using the approach …

# Complex Trait Adaptation and the Branching History of Mankind

A new paper (94 pages!) investigates signals of recent selection on traits such as height and educational attainment (proxy for cognitive ability). Here’s what I wrote about height a few years ago in Genetic group differences in height and recent human evolution: These recent Nature Genetics papers offer more evidence that group differences in a …

# Epistemic Caution and Climate Change

I have not, until recently, invested significant time in trying to understand climate modeling. These notes are primarily for my own use; however, I welcome comments from readers who have studied this issue in more depth. I take a dim view of people who express strong opinions about complex phenomena without having understood the underlying …

# prcr update

The R package for person-oriented analysis (prcr) is updated (it’s now version 0.1.4). In particular, it was not clear how to use the profile assignments (i.e., what cluster each response is in) in subsequent analyses. So, the update now returns two different representations of the profile assignments, or which profile is associated with each observation: …

# More Shock and Awe: James Lee and SSGAC in Oslo

To quote James Lee, the first author listed below: “Shock and Awe” for those who doubt that cognitive ability is influenced by genetic variants. See work from a year ago: ~100 hits from 300k individuals. Now ~600 hits from 750k. (SNPs associated with EA are likely to also be associated with cognitive ability — see …

# Job Satisfaction and the Role of Teacher Evaluation

There are a number of factors that influence how satisfied teachers are with their jobs. Working conditions, such as school facilities, support from administrators, and class size, are important factors that teachers take into consideration when deciding where to work. Other important factors that predict teacher job satisfaction include job security, quality of colleagues, the …

# History of Bayesian Neural Networks

This talk gives the history of neural networks in the framework of Bayesian inference. Deep learning is (so far) quite empirical in nature: things work, but we lack a good theoretical framework for understanding why or even how. The Bayesian approach offers some progress in these directions, and also toward quantifying prediction uncertainty. I was …

# Presentation on an Introduction to R for Data Analysis

I had an opportunity to present on an Introduction to R for Data Analysis to the School of Criminal Justice (at MSU). The presentation is organized into five sections: Background; Wrangling, Plotting, and Modeling; Essential Functionality; Advanced Functionality; and Additional Resources. A link to the presentation is here.

# Penalized regression from summary statistics

One of the difficulties in genomics is that when DNA donors are consented for a study, the agreements generally do not allow sharing (aggregation) of genomic data across multiple studies. This leads to isolated silos of data that can’t be fully shared. However, computations can be performed on one silo at a time, with the …

# Common Core and NGSS are not on the news

How often are curricular standards mentioned on TV news? With my friend Patrick, I was curious about using the newsflash package for something education-related. We came up with the idea of looking at mentions of the Common Core State Standards (for Mathematics and English Language Arts / Literacy) and the Next Generation Science Standards (for …

# What in the world is data science education? Looking back on #dsetonline

DSET: A few weeks ago, I was fortunate to attend the Data Science Educational Technology (DSET) conference. The goal for the conference was to kickstart data science education and to explore an educational technology, Concord Consortium’s Common Online Data Analysis Platform. What’s the big idea about data science education? To me, it’s a recognition that …

# Is the flu really worse this year? Comparing the (ongoing) 2016-17 and 2015-16 flu seasons

Background: I was sick last week, and I think I might have had a mild case of the flu. Since it seems like a lot of people have been sick, I was curious whether the flu was really worse this year than last… and since the CDC makes the data available for each year, I …

# prcr: An R Package for Person-Centered Analysis

I’m excited to share that prcr (0.1.0), an R package for person-centered analysis, is now available on CRAN via install.packages("prcr"). Person-centered analyses focus on clusters, or profiles, of observations, and their change over time or differences across factors. The package is designed to be “low threshold but high ceiling”, in that you can do all …

# R makes my blood boil and it’s Stack Exchange’s fault

[Anna writes…] I spend a lot of time on Stack Exchange. It’s an online forum where people ask questions about how to do statistics in the program R. Like Yahoo Answers for nerds. A visit to Stack Exchange is just about the only guaranteed way to ruin my day. When I have an R question …

# How much do we spend weekly on Groceries? Figuring out using R and Mint (Updated)

We started using Mint to keep track of our spending. One of the best features of Mint is the ability to see past patterns of spending (and to use that information to not spend quite as much on, well, coffee, …

# Announcing clustRcompaR v.0.1.0

Alex Lishinski and I worked on an R package over the last year or so. We are excited that it’s now available on CRAN. You can install the package using install.packages('clustRcompaR') (only needed the first time) and load it (more on its two functions below) using library(clustRcompaR). Here’s a description: Provides an interface …

# Can Life emerge spontaneously?

It would be nice if we knew where we came from. Sure, Darwin’s insight that we are the product of an ongoing process that creates new and meaningful solutions to surviving in complex and unpredictable environments is great and all. But it requires three sine qua non ingredients: inheritance, variation, and differential selection. Three does …

# Trump Triumph: Yes, it can happen here

Trump coverage on this blog. My advice: take Trump seriously, but not literally. The media and the left took him literally but not seriously. Trump on Trump: Playboy interview from 1990. You categorically don’t want to be President? I don’t want to be President. I’m one hundred percent sure. I’d change my mind only if …

# The truth about the Chinese economy, from debt to ghost cities

Highly recommended. See also references linked below. Andy Rothman has interpreted the Chinese economy for people who have serious and practical decisions to make since his early career heading up macroeconomic research at the U.S. Embassy in Beijing. He is now an investment strategist for Matthews Asia, where he continues to focus on the Chinese …

# Where Nobel winners get their start (Nature)

Nature covers some work by Jonathan Wai and myself. See here for a broader ranking of US schools, which includes Nobel, Turing, Fields awards and National Academies membership. Where Nobel winners get their start (Nature) Undergraduates from small, elite institutions have the best chance of winning a Nobel prize. There are many ways to rank …

# Annals of Reproducibility in Science: Social Psychology and Candidate Gene Studies

Andrew Gelman offers a historical timeline for the reproducibility crisis in Social Psychology, along with some juicy insight into the one funeral at a time manner in which academic science often advances. OK, that was a pretty detailed timeline. But here’s the point. Almost nothing was happening for a long time, and even after the …

# Speed, Balding, et al.: “for a wide range of traits, common SNPs tag a greater fraction of causal variation than is currently appreciated”

I recently blogged about a nice lecture by David Balding at the 2015 MLPM (Machine Learning for Personalized Medicine) Summer School: Machine Learning for Personalized Medicine: Heritability-based models for prediction of complex traits. In that talk he discussed some results concerning heritability estimation and potential improvements over GCTA. A new preprint on bioRxiv has the …

# Harvard to Release Six Years of Admissions Data for Lawsuit

This amounts to “comprehensive data” on almost 200k applicants! I imagine the legal team could use some good data scientists… Crimson: Harvard to Release Six Years of Admissions Data for Lawsuit Harvard must produce “comprehensive data” from six full admissions cycles for use in the pending admissions lawsuit between the University and anti-affirmative action group …

# Some R Resources

(Should I have spelled the last word in the title “ResouRces” or “resouRces”? The R community has a bit of a fascination about capitalizing the letter “r” as often as possible.) Anyway, getting down to business, I thought I’d post links to a few resources related to the R statistical language/system/ecology that I think may …

# Machine Learning for Personalized Medicine: Heritability-based models for prediction of complex traits (David Balding)

Highly recommended talk by David Balding on modern approaches to heritability, relatedness, etc. in statistical genetics. (I listened at 1.5x normal speed, which worked for me.) MLPM (Machine Learning for Personalized Medicine) Summer School 2015 Monday 21st of September Heritability-based models for prediction of complex traits by David Balding Complex trait genetics has been revolutionised …

# Over- and Underfitting

I just read a nice post by Jean-François Puget, suitable for readers not terribly familiar with the subject, on overfitting in machine learning. I was going to leave a comment mentioning a couple of things, and then decided that with minimal padding I could make it long enough to be a blog post. I agree …

# Looking at Mormon use of Twitter during #ldsconf

My first introduction to Twitter was in a class on the intersection of technology and Mormonism that I took from David Wiley at Brigham Young University. During the class, David encouraged us to try experiencing the sessions of the upcoming semi-annual LDS General Conference in a new way: by following the #ldsconf hashtag. My very …

# \$1.2 trillion college loan bubble?

See also When everyone goes to college: a lesson from S. Korea. Returns to a “college education” are highly dependent on the intrinsic cognitive ability and work ethic of the individual. WSJ: College Loan Glut Worries Policy Makers The U.S. government over the last 15 years made a trillion-dollar investment to improve the nation’s workforce, …

# Parametric and semi-parametric models for genome enabled prediction

This is a recent MSU seminar on genomic prediction. Vimeo won’t let me embed the video, so click here to watch the talk. Results are presented for models ranging from simple linear and linear + dominance to reproducing Hilbert space kernels and neural nets. Results are consistent with sub-dominant nonlinear (non-additive) effects, but interesting GxE …

# University quality and global rankings

The paper below is one of the best I’ve seen on university rankings. Yes, there is a univariate factor one might characterize as “university quality” that correlates across multiple measures. As I have long suspected, the THE (Times Higher Education) and QS rankings, which are partially survey/reputation based, are biased …
