Title: | Simulate single cell and spatial transcriptomics data |
---|---|
Description: | Simulate single cell and spatial transcriptomics data with complicated heterogeneity structure with or without reference dataset. |
Authors: | Qi Gao |
Maintainer: | Qi Gao <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.2 |
Built: | 2025-03-07 04:49:35 UTC |
Source: | https://github.com/qigao1/genescape |
The function adds distortions to the input tissue type regions.
add_distort(old_tissue_region, sample_angle = pi/6, tissue_angle = 0, sigma = 0.01, scaleloc = FALSE, thres = 1)
add_distort(old_tissue_region, sample_angle = pi/6, tissue_angle = 0, sigma = 0.01, scaleloc = FALSE, thres = 1)
old_tissue_region |
An sf object of the tissue type region polygons. |
sample_angle |
a number of the angles for the whole sample. |
tissue_angle |
a number or a vector of the angles for each of the tissue regions. Tissue type regions are rotated clockwise by the input angle. |
sigma |
standard deviation of the normal error |
scaleloc |
whether to scale the location coordinates to [0,1] |
thres |
threshold parameter of the concave hull calculation method. Larger value results in simpler shapes |
The function adds distortions to the input tissue type regions by adding normal errors to the coordinates of the vertices of the tissue type region polygons.
An sf object of the polygons of the updated tissue type regions.
Pebesma, E., 2018. Simple Features for R: Standardized Support for Spatial Vector Data. The R Journal 10 (1), 439-446, https://doi.org/10.32614/RJ-2018-009
The function aggregate the gene expression level in high resolution data to generate low resolution data.
aggregate_expression(counts, cell_position, spot_position, dis_thres_prop = 0.03, sigma_prop = 0.01, lib_size_mean = 5000, lib_size_dispersion = 3, cell_annotation = NULL)
aggregate_expression(counts, cell_position, spot_position, dis_thres_prop = 0.03, sigma_prop = 0.01, lib_size_mean = 5000, lib_size_dispersion = 3, cell_annotation = NULL)
counts |
read count matrix |
cell_position |
coordinates of the cells |
spot_position |
coordinates of the spots |
dis_thres_prop |
radius of the spot |
sigma_prop |
standard deviation of normal distribution. larger sigma implies that cells outside of the spot are more likely to contribute to the read observed in the spot |
lib_size_mean |
mean parameter of target library size (negative binomial distribution) |
lib_size_dispersion |
dispersion parameter of target library size (negative binomial distribution) |
cell_annotation |
cell type annotations |
This function aggregate the gene expression level in high resolution (cell level) data to generate low resolution (spot level) data. Cells contribute to the spots based on the distance between cell centers and spot centers, with the cells inside the spot contribute all their reads to the correponding spot. After obtaining the read pool, the funtion also downsample the reads to a target mean spot library size.
A matrix of the gene (row) expression in each spot (column)
The function chooses the cell type for each cell.
assign_cell_types(ttype, nttype, cell_type_proportion = NULL)
assign_cell_types(ttype, nttype, cell_type_proportion = NULL)
ttype |
tissue types of the cells |
nttype |
total number of possible tissue types. should be no less than the max value of ttype and the number of unique values in ttype. some tissue types may not exist in ttype |
cell_type_proportion |
matrix of cell type proportion, cell type (row) by tissue type (column) |
This function choose the cell type for each cell. Each tissue type could contains multiple types of cells. Based on the cell type proportion, this function randomly samples the cell type for each cell based on their corresponding tissue type.
A vector of the assigned cell type
The function assigns to each cell the corresponding tissue type.
assign_tissue_types(tissue_region, cell_spot_position, nneighbor = 3)
assign_tissue_types(tissue_region, cell_spot_position, nneighbor = 3)
tissue_region |
sf object of tissue region polygons. |
cell_spot_position |
coordinates for the cells |
nneighbor |
number of nearest tissue region taken into account |
This function assigns the cells to a tissue type if the cell is located in the tissue region polygon
A vector of the tissue region type
Pebesma, E., 2018. Simple Features for R: Standardized Support for Spatial Vector Data. The R Journal 10 (1), 439-446, https://doi.org/10.32614/RJ-2018-009
The function simulates the coordinates of the cells within the sample region.
cell_position_simulation(sample_vertices, ncells = 10000)
cell_position_simulation(sample_vertices, ncells = 10000)
sample_vertices |
coordinates for the vertices of sample polygon. Each row for a vertex. |
ncells |
number of cells to simulate |
This function simulates the location of the cells within the sample region
A matrix of the centers, 1st column as x coordinate and 2nd column for the y coordinate
Pebesma, E., 2018. Simple Features for R: Standardized Support for Spatial Vector Data. The R Journal 10 (1), 439-446, https://doi.org/10.32614/RJ-2018-009
changeLibSize This function modifies the simulated counts so that the total read count in each spot is the same as the example data.
changeLibSize(count, nreadspot, weight)
changeLibSize(count, nreadspot, weight)
count |
simulated matrix of read count, rows as genes and columns as spots |
nreadspot |
number of total read count in each spot in the example data |
weight |
a matrix of probability weights for the read count of each gene in each cell being changed. |
a matrix of the updated read count
Qi Gao
cumsum_cpp This function calculates the cumulative sum (partial sum) of a vector.
cumsum_cpp(x)
cumsum_cpp(x)
x |
vector to be summed. |
vector of the cumulative sum (partial sum)
Qi Gao
downSampleRead This function uses down-sampling to obtain the reads in each spot.
downSampleRead(count, nread)
downSampleRead(count, nread)
count |
gene (row) by cell (column) matrix. Gene read count in each cell. |
nread |
cell (row) by spot (column) matrix. Target number of reads sampled from each cell in each spot. |
gene (row) by spot (column) matrix. Gene expression level in each spot.
Qi Gao
The function estimates tissue type regions using cell/spot locations and the corresponding tissue type.
est_bound(sploc, ttype, nttype, thres = 1)
est_bound(sploc, ttype, nttype, thres = 1)
sploc |
coordinates for the cells. |
ttype |
tissue types of the cells. should be integers starting from 1. |
nttype |
total number of possible tissue types. should be no less than the max value of ttype and the number of unique values in ttype. some tissue types may not exist in ttype |
thres |
threshold parameter of the concave hull calculation method. Larger value results in simpler shapes |
The function estimates tissue type regions using cell/spot locations and the corresponding tissue type from reference data.
A sf object of the polygons of the tissue type regions estimated from data.
Joel Gombin and Ramnath Vaidyanathan and Vladimir Agafonkin (2020). concaveman: A Very Fast 2D Concave Hull Algorithm. R package version 1.1.0. https://CRAN.R-project.org/package=concaveman
Pebesma, E., 2018. Simple Features for R: Standardized Support for Spatial Vector Data. The R Journal 10 (1), 439-446, https://doi.org/10.32614/RJ-2018-009
estimate_gamma This function estimates gamma distribution shape and rate parameters.
estimate_gamma(x)
estimate_gamma(x)
x |
target data vector |
list of gamma distribution shape and rate parameters
Qi Gao
estimate_lognormal This function estimates estimate_lognormal distribution loc and scale parameters.
estimate_lognormal(x)
estimate_lognormal(x)
x |
target data vector |
list of lognormal distribution loc and scale parameters
Qi Gao
estimate_nb This function estimates negative binomial distribution mean and dispersion parameters.
estimate_nb(datavec)
estimate_nb(datavec)
datavec |
target data vector |
vector of negative binomial distribution mean (1st element), dispersion (2nd element) parameters and 0 (3rd element).
Qi Gao
estimate_zinb This function estimates the mean, dispersion and zero inflation parameters in zero-inflated negative binomial distribution.
estimate_zinb(data, maxiter = 400)
estimate_zinb(data, maxiter = 400)
data |
target data matrix, rows as genes and columns as spots |
maxiter |
maximum number of iteration for R optim |
matrix of negative binomial distribution mean (1st column), dispersion (2nd column) and zero-inflation (3rd column) parameters.
Qi Gao
This function estimate zero-inflated poisson distribution parameters from data vector
estimate_zip(datavec)
estimate_zip(datavec)
datavec |
target data vector |
a vector of zero-inflated poisson distribution parameters. first element is mean parameter, while second element is zero-inflation parameter.
S. Dencks, M. Piepenbrock and G. Schmitz, "Assessing Vessel Reconstruction in Ultrasound Localization Microscopy by Maximum Likelihood Estimation of a Zero-Inflated Poisson Model," in IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, vol. 67, no. 8, pp. 1603-1612, Aug. 2020, doi: 10.1109/TUFFC.2020.2980063.
This function similate differential expression fold change level
fcsim(n.gene, de.id, fc.mean, fc.sd)
fcsim(n.gene, de.id, fc.mean, fc.sd)
n.gene |
total number of genes |
de.id |
index of differentially expressed genes |
fc.mean |
location parameter for fold change (normal distribution) |
fc.sd |
scale parameter for fold change (normal distribution) |
Zappia, L., Phipson, B., & Oshlack, A. (2017). Splatter: Simulation of single-cell RNA sequencing data. Genome Biology, 18(1). https://doi.org/10.1186/s13059-017-1305-0
This function estimate zero-inflated poisson distribution parameters from data
GeneScape_est(data, groups = NULL)
GeneScape_est(data, groups = NULL)
data |
target gene (row) by cell (column) data matrix |
groups |
group of the cells |
list of two gene (row) by group (column) matrices. first matrix contains the mean parameter, while second matrix contains zero inflation parameters.
This function simulate parameters for single cell RNAseq data simulation with complicated differential expression.
GeneScape_par(nCells = 6000, nGroups = NULL, groups = NULL, lib.size = NULL, lib.size.mean = 10000, lib.size.sd = 2000, de.fc.mat = NULL, nGenes = 5000, gene.mean.shape = 0.3, gene.mean.rate = 0.15, gene.means = NULL, zero.inflat = NULL, add.zero.inflat = TRUE, zero.inflat.times = 0.5, de.n = 50, de.share = NULL, de.id = NULL, de.fc.mean = 1, de.fc.sd = 0.2, add.sub = FALSE, sub.major = NULL, sub.prop = 0.1, sub.group = NULL, sub.de.n = 20, sub.de.id = NULL, sub.de.common = FALSE, sub.de.fc.mean = 1, sub.de.fc.sd = 0.2)
GeneScape_par(nCells = 6000, nGroups = NULL, groups = NULL, lib.size = NULL, lib.size.mean = 10000, lib.size.sd = 2000, de.fc.mat = NULL, nGenes = 5000, gene.mean.shape = 0.3, gene.mean.rate = 0.15, gene.means = NULL, zero.inflat = NULL, add.zero.inflat = TRUE, zero.inflat.times = 0.5, de.n = 50, de.share = NULL, de.id = NULL, de.fc.mean = 1, de.fc.sd = 0.2, add.sub = FALSE, sub.major = NULL, sub.prop = 0.1, sub.group = NULL, sub.de.n = 20, sub.de.id = NULL, sub.de.common = FALSE, sub.de.fc.mean = 1, sub.de.fc.sd = 0.2)
nCells |
number of cells |
nGroups |
number of cell groups |
groups |
group information for cells |
lib.size |
library size for cells |
lib.size.mean |
location parameter for library size (log-normal distribution) |
lib.size.sd |
scale parameter for library size (log-normal distribution) |
de.fc.mat |
differential expression fold change matrix, could be generated by this function |
nGenes |
number of genes |
gene.mean.shape |
shape parameter for mean expression level (Gamma distribution) |
gene.mean.rate |
rate parameter for mean expression level (Gamma distribution) |
gene.means |
mean gene expression levels |
zero.inflat |
vector of zero inflation paramters |
add.zero.inflat |
whether to add zero inflation |
zero.inflat.times |
the zero inflation paramter is simulated as zero.inflat.times / (1 + exp(gene.means)) |
de.n |
number of differentially expressed genes in each cell type. Should be a integer or a vector of length nGroups |
de.share |
number of shared DE genes between neighbor cell types. Should be a vector of length (nGroups - 1) |
de.id |
the index of genes that are DE across cell types. Should be a list of vectors. Each vector corresponds to a cell type. With non-null value of de.id, de.n and de.share would be ignored. |
de.fc.mean |
the location parameter for the fold change of DE genes. Should be a number, a vector of length nGroups |
de.fc.sd |
the scale parameter for fold change (log-normal distribution). Should be a number or a vector of length nGroups |
add.sub |
whether to add sub-cell-types |
sub.major |
the major cell types correspond to the sub-cell-types |
sub.prop |
proportion of sub-cell-types in the corresponding major cell type |
sub.group |
cell index for sub-cell-types. With non-null sub.group specified, sub.prop would be ignored. |
sub.de.n |
number of differentially expressed genes in each sub-cell-type compared to the corresponding major cell type. Should be a integer or a vector of length sub.major |
sub.de.id |
the index of additional differentially expressed genes between sub-cell-types and the corresponding major cell types |
sub.de.common |
whether the additional differential expression structure should be same for all sub-cell-types |
sub.de.fc.mean |
similar to de.fc.mean, but for addtional differentially expressed genes in sub-cell-types |
sub.de.fc.sd |
similar to de.fc.sd, but for addtional differentially expressed genes in sub-cell-types |
A list of simulation parameters: vector of cell groups, vector of cell library size, gene expression parameter (zero-inflated poisson model), and matrix of differentially expression fold changes.
Zappia, L., Phipson, B., & Oshlack, A. (2017). Splatter: Simulation of single-cell RNA sequencing data. Genome Biology, 18(1). https://doi.org/10.1186/s13059-017-1305-0
set.seed(1) para <- GeneScape_par()
set.seed(1) para <- GeneScape_par()
This function generate single cell RNAseq data using simulation parameter.
GeneScape_sim(para, add.path = FALSE, path.n = 4, path.size = 20, path.cor = 0.7, path.id = NULL, band.width = 10, add.hub = FALSE, hub.n = 10, hub.size = 20, hub.cor = 0.4, hub.id = NULL, hub.fix = NULL)
GeneScape_sim(para, add.path = FALSE, path.n = 4, path.size = 20, path.cor = 0.7, path.id = NULL, band.width = 10, add.hub = FALSE, hub.n = 10, hub.size = 20, hub.cor = 0.4, hub.id = NULL, hub.fix = NULL)
para |
simulation parameter estimated by GeneScape_est or simulated by GeneScape_par |
add.path |
whether to add pathways (correlated genes) |
path.n |
number of pathways included. Should be a integer. |
path.size |
number of correlated genes (length of pathway). Should be a number or a vector of length path.n |
path.cor |
correlation parameters |
path.id |
gene index of correlated (pathway) genes. Should be a list of vectors, with each vector represents a pathway. With non-null value of path.id, path.n would be ignored. |
band.width |
No correlation exists if distance of 2 genes are further than band_width in a pathway |
add.hub |
whether to add hub genes |
hub.n |
number of hub genes included. Should be a integer. |
hub.size |
number of genes correlated to the hub gene. Should be a number or a vector of length hub.n |
hub.cor |
correlation parameters between hub genes and their correlated genes |
hub.id |
gene index of hub genes. Should be a list of vectors. With non-null value of hub.id, hub.n would be ignored. |
hub.fix |
user defined genes correlated to hub genes (others are randomly selected). Should be a list of vectors of length hub.n or same as hub.id. |
A list of read count data, cell groups, cell library size, gene mean expression, gene differential expression rate, pathway genes and hub gene indices.
Zappia, L., Phipson, B., & Oshlack, A. (2017). Splatter: Simulation of single-cell RNA sequencing data. Genome Biology, 18(1). https://doi.org/10.1186/s13059-017-1305-0
set.seed(1) para <- GeneScape_par() data <- GeneScape_sim(para)
set.seed(1) para <- GeneScape_par() data <- GeneScape_sim(para)
GeneScapeS_est This function estimates distribution parameters from an example data.
GeneScapeS_est(data, ttype)
GeneScapeS_est(data, ttype)
data |
Example data matrix |
ttype |
Original tissue type of each spot |
list of matrices of negative binomial distribution mean (1st column), dispersion (2nd column) and zero-inflation (3rd column) parameters. one matrix for each tissue type.
Qi Gao
GeneScapeS_sim This function generates simulated data based on distribution parameters estimated from an example data.
GeneScapeS_sim(para, newttype, rank = TRUE, data = NULL, ttype = NULL)
GeneScapeS_sim(para, newttype, rank = TRUE, data = NULL, ttype = NULL)
para |
Example data matrix |
newttype |
output tissue type of each spot |
rank |
whether to rank the expression level in each tissue type based on the example |
data |
original reference dataset used for create para |
ttype |
Original tissue type of each spot |
matrix of simulated data
Qi Gao
inverse_logit This function is the inverse logit function.
inverse_logit(x)
inverse_logit(x)
x |
number |
inverse logit function value with input x
Qi Gao
The function updates the tissue type for each cell.
knn_tissue_update(cell_spot_position, old_tissue_type, k)
knn_tissue_update(cell_spot_position, old_tissue_type, k)
cell_spot_position |
coordinates for the cells |
old_tissue_type |
previously assigned tissue type for each cell. should be integers start from one |
k |
number of nearest neighbor |
This function assigns to the each cell to a new tissue type by sampling the tissue types of its k nearest neighbors.
A vector of the updated tissue type
knnassign This function uses KNN sampling to simulate gradually changed tissue types.
knnassign(distmat, ttype, nttype, k)
knnassign(distmat, ttype, nttype, k)
distmat |
Spot distance matrix |
ttype |
Input tissue type of each spot |
nttype |
total number of possible tissue types. should be no less than the max value of ttype and the number of unique values in ttype. some tissue types may not exist in ttype |
k |
parameter for KNN sampling |
List of tissue type weight and updated tissue type with gradual change
Qi Gao
The function rotate polygons clockwise.
rotate_pg(pg, angle = pi/12, center = NULL)
rotate_pg(pg, angle = pi/12, center = NULL)
pg |
An sf object of the polygon. |
angle |
a number of the angle. The polygon is rotated clockwise by the input angle. |
center |
customized rotation center |
The function rotate polygons clockwise based on the input angle around the center (by default its centroid).
An sf object of the rotated polygon.
Pebesma, E., 2018. Simple Features for R: Standardized Support for Spatial Vector Data. The R Journal 10 (1), 439-446, https://doi.org/10.32614/RJ-2018-009
The function rotate polygons clockwise.
rotate_sample(tissue_region, angle = pi/12, thres = 1)
rotate_sample(tissue_region, angle = pi/12, thres = 1)
tissue_region |
Tissue region polygons. |
angle |
a number of the angle. The polygon is rotated clockwise by the input angle. |
thres |
threshold parameter of the concave hull calculation method. Larger value results in simpler shapes |
The function rotate tissue regions clockwise based on the input angle around the center (by default its centroid).
An sf object of the rotated tissue regions
Pebesma, E., 2018. Simple Features for R: Standardized Support for Spatial Vector Data. The R Journal 10 (1), 439-446, https://doi.org/10.32614/RJ-2018-009
sample_int This function samples integers from 1 to an upper limit integer value with equal probability without replacement.
sample_int(maxvalue, nsample)
sample_int(maxvalue, nsample)
maxvalue |
the max integer to be sampled. |
nsample |
the number of integer to be sampled. |
vector of the integers in the sample
Qi Gao
This function simulates the sample region.
sample_region_simulation(nvert = 13, sample_shape = c("square", "circle"), limitx = c(0, 1), limity = c(0, 1))
sample_region_simulation(nvert = 13, sample_shape = c("square", "circle"), limitx = c(0, 1), limity = c(0, 1))
nvert |
number of vertices of the sample polygon |
sample_shape |
shape of the capture area |
limitx |
min and max value of x coordinates |
limity |
min and max value of y coordinates |
This function simulates the sample region with the capture area as a polygon.
A matrix of the coordinates of sample polygon vertices
Valtr, P. Probability that n random points are in convex position. Discrete Comput Geom 13, 637-643 (1995). https://doi.org/10.1007/BF02574070
The function shifts the tissue type regions.
scale_sample(tissue_region, scale_prop = c(1, 1), thres = 1)
scale_sample(tissue_region, scale_prop = c(1, 1), thres = 1)
tissue_region |
An sf object of the tissue type region polygons. |
scale_prop |
a vector of length 2, for x and y correspondingly. Value greater than 1 would enlarge the tissue type region on the corresponding coordinates. |
thres |
threshold parameter of the concave hull calculation method. Larger value results in simpler shapes |
The function shifts tissue type regions by modifying the x and y coordinates of the vertices of the tissue type region polygons. The sizes of modifications equal to the input proportion times the difference between the maximum of x (y) coordinates and the minimum of x (y) coordinates
An sf object of the polygons of the updated tissue type regions.
Pebesma, E., 2018. Simple Features for R: Standardized Support for Spatial Vector Data. The R Journal 10 (1), 439-446, https://doi.org/10.32614/RJ-2018-009
The function scales the tissue type regions.
shift_sample(tissue_region, shift_prop = c(0, 0))
shift_sample(tissue_region, shift_prop = c(0, 0))
tissue_region |
An sf object of the tissue type region polygons. |
shift_prop |
a vector of length 2, for x and y correspondingly. Large absolute value implies greater shift. Positive value implies positive change of x and y coordinates (shift up and right) |
The function scales the tissue type regions around the sample centroid
An sf object of the polygons of the updated tissue type regions.
Pebesma, E., 2018. Simple Features for R: Standardized Support for Spatial Vector Data. The R Journal 10 (1), 439-446, https://doi.org/10.32614/RJ-2018-009
sp_dist_euclidean_cpp This function calculates Euclidean distance between cell locations and spot locations.
sp_dist_euclidean_cpp(cloc, sloc)
sp_dist_euclidean_cpp(cloc, sloc)
cloc |
matrix of location coordinates of the cells. Each row represents a cell. |
sloc |
matrix of location coordinates of the spots. Each row represents a spot. |
matrix of the Euclidean distance, number of cells (row) by number of spots (column)
Qi Gao
The function simulates the spot positions.
spot_simulate(dimension = c(78, 64), n = 5000, structure = c("interlaced", "uniform", "random"), error_prop = 0.04, limitx = c(0, 1), limity = c(0, 1))
spot_simulate(dimension = c(78, 64), n = 5000, structure = c("interlaced", "uniform", "random"), error_prop = 0.04, limitx = c(0, 1), limity = c(0, 1))
dimension |
dimension of the spots, only used for "interlaced" and "uniform" structure. |
n |
number of spots, only used for "random" structure |
structure |
structure of the spots |
error_prop |
size of error added to the coordinates of the spots |
limitx |
min and max value of x coordinates |
limity |
min and max value of y coordinates |
This function simulates the spot coordinates. The "interlaced" structure mimic the Visium sequencing data. The "uniform" structure generate spot positions that are located on the vertices of squares. And the "random" structure randomly sample the spot locations.
A matrix of the spot coordinates
The function simulates regions of different tissue types within the sample region.
tissue_region_simulation(sample_vertices, ntissuetype = 4, centers = NULL)
tissue_region_simulation(sample_vertices, ntissuetype = 4, centers = NULL)
sample_vertices |
coordinates for the vertices of sample polygon. Each row for a vertex. |
ntissuetype |
number of tissue types |
centers |
centers of the tissue types |
This function simulates the tissue type regions using voronoi polygons correponding to the simulated or input centers
A sf object of the polygons of the tissue type regions.
Pebesma, E., 2018. Simple Features for R: Standardized Support for Spatial Vector Data. The R Journal 10 (1), 439-446, https://doi.org/10.32614/RJ-2018-009
The function updates the sample region based on the regions of different tissue types.
update_sample_region(tissue_region, thres = 1)
update_sample_region(tissue_region, thres = 1)
tissue_region |
An sf object of the tissue type region polygons. |
thres |
threshold parameter of the concave hull calculation method. Larger value results in simpler shapes |
The function updates the sample region by finding the concave hull of the vertices of tissue type region polygons.
An sf object of the polygons of the updated tissue type regions.
Pebesma, E., 2018. Simple Features for R: Standardized Support for Spatial Vector Data. The R Journal 10 (1), 439-446, https://doi.org/10.32614/RJ-2018-009
zinb_nll This function calculates the negative log likelihood of zero-inflated negative binomial distribution.
zinb_nll(y, par)
zinb_nll(y, par)
y |
observations |
par |
vector of 3 elements. parameters (mean, dispersion, zero-inflation) for zero-inflated negative binomial distribution. |
negative log likelihood of zero-inflated negative binomial distribution
Qi Gao