Simulating Synthetic Data

The following steps describe a simple procedure for simulating synthetic data, assuming that the continuous variables follow a multivariate normal distribution:
  • Read the original data, create a subset table to include only the columns with variables that you want to simulate and remove any rows that include missing values
  • Calculate the mean of each independent (explanatory) variable and the variance-covariance matrix of those variables (using the cov() function)
  • Generate random independent variables from a multivariate normal distribution (using the mvrnorm() function from the MASS package) with that mean vector and variance-covariance matrix, so that the simulated variables reproduce, up to sampling variation, the means and pairwise correlations of the original independent variables
  • Apply logistic regression models (using the glm() function with family='binomial') to estimate the coefficients that represent the relationship between each binary dependent variable and the independent variables of the original dataset
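  • Use the estimated coefficients to calculate the log odds of each dependent variable for the simulated independent variables, convert the log odds to probabilities with the inverse logit, p = 1 / (1 + exp(-log odds)), and generate random binary variables (using the rbinom() function) with those probabilities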
  • Combine the simulated dependent and independent variables in one new table 
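
The steps above can be sketched in R roughly as follows. This is a minimal illustration rather than the scripts used in the projects mentioned on this page: the data frame original, its columns x1, x2 and y, and all numeric values are hypothetical stand-ins, generated here only so that the example runs on its own. In practice, original would be read from the real data file.

```r
library(MASS)  # provides mvrnorm()

set.seed(1)

# Hypothetical stand-in for the original data: two continuous explanatory
# variables (x1, x2), one binary outcome (y) and a couple of missing values.
n <- 500
original <- data.frame(x1 = rnorm(n, mean = 50, sd = 10))
original$x2 <- 0.5 * original$x1 + rnorm(n, sd = 5)
original$y <- rbinom(n, size = 1,
                     prob = plogis(-6 + 0.05 * original$x1 + 0.1 * original$x2))
original$x1[c(3, 17)] <- NA

# Step 1: keep only the variables to simulate and drop incomplete rows.
dat <- na.omit(original[, c("x1", "x2", "y")])
x_names <- c("x1", "x2")

# Step 2: mean vector and variance-covariance matrix of the explanatory variables.
mu <- colMeans(dat[, x_names])
sigma <- cov(dat[, x_names])

# Step 3: draw simulated explanatory variables from the multivariate normal.
sim_x <- as.data.frame(mvrnorm(n = nrow(dat), mu = mu, Sigma = sigma))

# Step 4: logistic regression of the binary outcome on the original data.
fit <- glm(y ~ x1 + x2, data = dat, family = "binomial")

# Step 5: log odds for the simulated rows, converted to probabilities with
# the inverse logit (plogis) before drawing the binary outcome.
log_odds <- predict(fit, newdata = sim_x, type = "link")
prob <- plogis(log_odds)
sim_y <- rbinom(n = nrow(sim_x), size = 1, prob = prob)

# Step 6: combine simulated dependent and independent variables.
synthetic <- cbind(sim_x, y = sim_y)
```

With several binary dependent variables, steps 4 and 5 would be repeated once per outcome. Comparing colMeans() and cor() of the simulated and original explanatory variables is a quick way to check that the synthetic table resembles the original.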


Here is a link with R scripts showing how we simulated synthetic data from the 1958 Birth Cohort for the Healthy Obesity Project.

Here is another link with an R script showing how we simulated synthetic data from the ALSPAC Birth Cohort for a Virtual Reality Project. You can find more information in our paper 'Synthetic ALSPAC longitudinal datasets for the Big Data VR project'.




DataSHIELD Wiki by DataSHIELD is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Based on a work at http://www.datashield.ac.uk/wiki