# Simulating Synthetic Data

The following steps describe a simple procedure for simulating synthetic data, assuming that the continuous variables follow a multivariate normal distribution:

- Read the original data, create a subset table that includes only the columns with the variables you want to simulate, and remove any rows that contain missing values
- Calculate the mean values and the variance-covariance matrix (using the **cov()** function) of the independent (explanatory) variables
- Generate random independent variables that follow the multivariate normal distribution (using the **mvrnorm()** function), with the same means and the same pairwise covariances (and hence correlations) as the original independent variables
- Fit logistic regression models (using the **glm()** function with **family='binomial'**) to estimate the coefficients that represent the relationship between each binary dependent variable and the independent variables of the original dataset
- Use the estimated coefficients to calculate the log odds of each dependent variable for the simulated independent variables, convert the log odds to probabilities with the inverse logit transformation, and use those probabilities to generate random binary variables (using the **rbinom()** function)
- Combine the simulated dependent and independent variables in one new table
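
The steps above can be sketched in R as follows. This is a minimal illustration under stated assumptions, not the scripts used in the projects linked below: the `original` data frame, its columns `x1`, `x2` and `y`, and all numeric values are hypothetical stand-ins for a real dataset that would normally be read from file. The sketch uses `mvrnorm()` from the MASS package and `plogis()` for the inverse logit.

```r
library(MASS)  # provides mvrnorm()

set.seed(42)

# Hypothetical stand-in for the original data: two continuous explanatory
# variables (x1, x2), one binary outcome (y), and two missing values.
n <- 500
original <- data.frame(x1 = rnorm(n, mean = 50, sd = 10))
original$x2 <- 0.5 * original$x1 + rnorm(n, sd = 5)
original$y <- rbinom(n, size = 1,
                     prob = plogis(-6 + 0.08 * original$x1 + 0.05 * original$x2))
original$x1[c(3, 17)] <- NA

# Step 1: keep only the variables to simulate and drop incomplete rows.
x_names <- c("x1", "x2")
dat <- na.omit(original[, c(x_names, "y")])

# Step 2: mean vector and variance-covariance matrix of the
# independent variables.
mu <- colMeans(dat[, x_names])
sigma <- cov(dat[, x_names])

# Step 3: draw synthetic independent variables from a multivariate
# normal distribution with those means and covariances.
sim_x <- as.data.frame(mvrnorm(n = nrow(dat), mu = mu, Sigma = sigma))

# Step 4: logistic regression of the binary outcome on the original
# independent variables.
fit <- glm(y ~ x1 + x2, data = dat, family = "binomial")

# Step 5: log odds for the synthetic rows, converted to probabilities
# with the inverse logit, then used to draw the binary outcome.
log_odds <- predict(fit, newdata = sim_x, type = "link")
prob <- plogis(log_odds)
sim_y <- rbinom(n = nrow(sim_x), size = 1, prob = prob)

# Step 6: combine the simulated independent and dependent variables.
synthetic <- cbind(sim_x, y = sim_y)
```

With several binary dependent variables, steps 4 and 5 would be repeated once per outcome before combining the columns. Note that `rbinom()` expects a probability between 0 and 1, which is why the log odds are passed through `plogis()` first.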

Here is a link with R scripts showing how we have simulated synthetic data from the 1958 Birth Cohort for the Healthy Obesity Project.

Here is another link with an R script showing how we have simulated synthetic data from the ALSPAC Birth Cohort for a Virtual Reality Project. You can find more information in our paper 'Synthetic ALSPAC longitudinal datasets for the Big Data VR project'.

DataSHIELD Wiki by DataSHIELD is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Based on a work at http://www.datashield.ac.uk/wiki