socube.task.doublet package

Submodules

socube.task.doublet.data module

class socube.task.doublet.data.ConvClassifyDataset(data_dir: str, transform: torch.nn.modules.module.Module = None, labels: str = 'label.csv', shuffle: bool = False, seed: int = None, k: int = 5, use_index: bool = True)

Bases: socube.data.loading.ConvDatasetBase

Dataset class for socube.

Parameters

data_dir (string) – the dataset’s directory
transform (torch module) – sample transform, such as Resize
labels (string) – the name of the label CSV file
shuffle (boolean value) – if True, data will be shuffled for k-fold cross-validation
seed (random seed) – random seed for k-fold cross-validation or sampling
k (integer scalar value) – the number of folds in k-fold cross-validation
use_index (boolean value) – if True, sample files are read by index.
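
How k, shuffle, and seed might interact in a k-fold split can be sketched in plain Python. This is only an illustration under stated assumptions, not socube's implementation; the helper name k_fold_indices is hypothetical.

```python
import random
from typing import List, Optional, Tuple


def k_fold_indices(n: int, k: int = 5, shuffle: bool = False,
                   seed: Optional[int] = None) -> List[Tuple[List[int], List[int]]]:
    """Hypothetical helper: split range(n) into k (train, valid) index pairs."""
    indices = list(range(n))
    if shuffle:
        # A fixed seed makes the shuffled split reproducible.
        random.Random(seed).shuffle(indices)
    # Deal the indices into k folds of roughly equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        valid = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, valid))
    return splits


splits = k_fold_indices(10, k=5)
```

Each of the k pairs uses one fold for validation and the remaining folds for training, so every sample is validated exactly once.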

sampler(subset: torch.utils.data.dataset.Subset) → torch.utils.data.sampler.WeightedRandomSampler

Generate a weighted random sampler for a subset of this dataset.

Parameters: subset (Subset) – the subset of this dataset
Returns: a weighted random sampler
Return type: torch.utils.data.sampler.WeightedRandomSampler
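
A weighted random sampler draws samples according to per-sample weights. A common choice, assumed here and not confirmed to be what sampler() does, is to weight each sample by the inverse of its class frequency so that rare classes are drawn as often as common ones. The helper name inverse_frequency_weights is hypothetical, and the standard library stands in for torch.

```python
import random
from collections import Counter
from typing import List, Sequence


def inverse_frequency_weights(labels: Sequence[int]) -> List[float]:
    """Hypothetical helper: weight each sample by 1 / (size of its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


labels = [0, 0, 0, 0, 0, 0, 1, 1]  # imbalanced: six singlets, two doublets
weights = inverse_frequency_weights(labels)

# Draw indices with replacement; both classes now carry the same total weight.
rng = random.Random(0)
drawn = rng.choices(range(len(labels)), weights=weights, k=1000)
```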

property typeCounts: Number of samples of each type

socube.task.doublet.data.generateDoublet(samples: pandas.core.frame.DataFrame, ratio: float = 1.0, adj: float = 1.0, seed: Optional[int] = None, size: Optional[int] = None, mode: Optional[str] = 'balance') → Tuple[pandas.core.frame.DataFrame]

Generate a training set from samples. In silico doublets are simulated as positive samples.

Parameters

samples (pd.DataFrame) – the samples dataframe, with rows as cells (droplets, samples) and columns as genes.
ratio (float, default 1.0) – The ratio of the number of doublets to the number of singlets.
adj (float, default 1.0) – The adjustment factor for the doublet expression level. A doublet is generally assumed to have twice the gene expression level of a singlet, but this does not always hold. The expression level of each generated doublet is scaled by this factor.
seed (int, default None) – The random seed for doublet generation.
size (int, default None) – The size of the generated training set. If None, the training set has the same size as the samples.
mode (str, default "balance") – The mode of the generated training set. If “heterotypic”, heterotypic doublets will be the majority. If “homotypic”, homotypic doublets will be the majority. If “balance”, the numbers of heterotypic and homotypic doublets will be balanced.

Returns

A tuple of two pd.DataFrame objects: the first holds the positive (doublet) samples,
the second holds the negative (singlet) samples.
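
The idea of simulating a doublet in silico can be sketched with plain lists. The sketch assumes that a doublet profile is the sum of two randomly chosen singlet profiles scaled by adj; the exact formula used by generateDoublet is not stated here, and the helper name simulate_doublets is hypothetical.

```python
import random
from typing import List, Optional


def simulate_doublets(singlets: List[List[float]], n_doublets: int,
                      adj: float = 1.0,
                      seed: Optional[int] = None) -> List[List[float]]:
    """Hypothetical helper: build each doublet from two distinct random singlets.

    Assumption: the doublet profile is adj * (a + b); the scaling actually
    used by generateDoublet may differ.
    """
    rng = random.Random(seed)
    doublets = []
    for _ in range(n_doublets):
        a, b = rng.sample(singlets, 2)  # two different parent droplets
        doublets.append([adj * (x + y) for x, y in zip(a, b)])
    return doublets


# Three singlets (rows) measured over three genes (columns).
singlets = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0], [2.0, 2.0, 0.0]]
doublets = simulate_doublets(singlets, n_doublets=3, adj=1.0, seed=0)
```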

socube.task.doublet.data.checkShape(path: str, shape: tuple = (10, None, None)) → None

Check dataset shape

Parameters

path (string) – the dataset’s directory
shape (tuple, default (10, None, None)) – the expected shape of the dataset, None means any shape

Raises

AssertionError – if the shape of the dataset does not match the expected shape
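
The wildcard semantics of None in the expected shape can be sketched as follows. This is an assumption about the comparison rule, not the library's code, and the helper name shape_matches is hypothetical.

```python
from typing import Optional, Sequence


def shape_matches(actual: Sequence[int],
                  expected: Sequence[Optional[int]]) -> bool:
    """Hypothetical helper: compare shapes, treating None as a wildcard."""
    if len(actual) != len(expected):
        return False
    return all(e is None or a == e for a, e in zip(actual, expected))


ten_channels = shape_matches((10, 64, 64), (10, None, None))
three_channels = shape_matches((3, 64, 64), (10, None, None))
```

With the default shape (10, None, None), only the first dimension (the channel count) is constrained.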

socube.task.doublet.data.checkData(data: pandas.core.frame.DataFrame)

Verify that the data is valid.

Parameters: data (pd.DataFrame) – Data to be checked, a dataframe of scRNA-seq data

Raises

ValueError – If data contains NaN or inf
IndexError – If data contains duplicate column or row names, or if a droplet name begins with “doublet”
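
The documented checks can be mirrored on plain Python lists. This is a sketch of the rules as described above, not the library's implementation; the helper name check_table and its arguments are hypothetical.

```python
import math
from typing import List, Sequence


def check_table(rows: Sequence[str], columns: Sequence[str],
                values: List[List[float]]) -> None:
    """Hypothetical helper mirroring the documented checks on plain lists."""
    if any(not math.isfinite(v) for row in values for v in row):
        raise ValueError("data contains NaN or inf")
    if len(set(rows)) != len(rows) or len(set(columns)) != len(columns):
        raise IndexError("duplicate row or column names")
    if any(name.startswith("doublet") for name in rows):
        raise IndexError("droplet names must not begin with 'doublet'")


# A valid table passes silently.
check_table(["cell1", "cell2"], ["geneA", "geneB"], [[1.0, 0.0], [2.0, 3.0]])
```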

socube.task.doublet.data.createTrainData(samples: pandas.core.frame.DataFrame, output_path: str, ratio: float = 1.0, adj: float = 1.0, seed: Optional[int] = None, mode: Optional[int] = 'balance') → Tuple[pandas.core.generic.NDFrame]

Based on the original data, doublets are generated as the positive data and a subset of the original data is used as the negative data to obtain the training dataset.

Parameters

samples (pd.DataFrame) – Original data, a dataframe of scRNA-seq data. Shape is (n_droplets, n_genes)
output_path (str) – Path to save the generated training data.
ratio (float, default 1.0) – The ratio of the number of doublets to the number of singlets.
adj (float, default 1.0) – The adjustment factor for the doublet expression level. A doublet is generally assumed to have twice the gene expression level of a singlet, but this does not always hold. The expression level of each generated doublet is scaled by this factor.
seed (int, default None) – The random seed for doublet generation.
mode (str, default "balance") – The mode of the generated training set. If “heterotypic”, heterotypic doublets will be the majority. If “homotypic”, homotypic doublets will be the majority. If “balance”, the numbers of heterotypic and homotypic doublets will be balanced. See generateDoublet for more details.

Returns

A tuple of NDFrames. The first element is the training data, the second element
is the training labels.

socube.task.doublet.model module

class socube.task.doublet.model.SoCubeNet(in_channels: int, out_channels: int, freeze: bool = False, binary: bool = True, **kwargs)

Bases: socube.net._base.NetBase

Neural network model constructed for the doublet detection task. Previously named Conv2DClassifyNet.

Parameters

in_channels (integer scalar value) – input data’s channel count
out_channels (integer scalar value) – output data’s channel count
freeze (boolean value) – Whether to freeze the feature extraction layer, default is False
binary (boolean value) – Whether the output probability is binary or multicategorical, default is True for binary

Examples

>>> SoCubeNet(10, 2)

criterion(yPredict: torch.Tensor, yTrue: torch.Tensor) → torch.Tensor

Calculate the loss of the model.

Parameters

yPredict (torch.Tensor) – The predicted data.
yTrue (torch.Tensor) – The true data.

Returns

The loss of the model.

Return type

torch.Tensor

forward(x1: torch.Tensor) → torch.Tensor

Forward pass of the neural network.

Parameters: x1 (torch.Tensor) – The input data.
Returns: The output data.
Return type: torch.Tensor

training = None

socube.task.doublet.train module

socube.task.doublet.train.fit(home_dir: str, data_dir: str, lr: float = 0.001, gamma: float = 0.99, epochs: int = 100, train_batch: int = 32, valid_batch: int = 500, transform: torch.nn.modules.module.Module = None, in_channels: int = 10, num_workers: int = 0, shuffle: bool = False, seed: int = None, label_file: str = 'label.csv', threshold: float = 0.5, k: int = 5, once: bool = False, use_index: bool = True, gpu_ids: List[str] = None, step: int = 5, model_id: str = None, pretrain_model_path: str = None, max_acc_limit: float = 1.0, multi_process: bool = False, **kwargs) → str

Train socube model.

Parameters

home_dir (str) – the home directory of the specific job
data_dir (str) – the dataset’s directory
lr (float) – learning rate, default: 0.001
gamma (float) – learning rate decay, default: 0.99
epochs (int) – training epochs, default: 100
train_batch (int) – training batch size, default: 32
valid_batch (int) – validation batch size, default: 500
transform (nn.Module) – sample transform, such as Resize
in_channels (int) – the number of input channels, default: 10
num_workers (int) – the number of workers for data loading, default: 0
shuffle (bool) – if True, data will be shuffled for k-fold cross-validation
seed (int) – random seed for k-fold cross-validation or sampling
label_file (str) – the name of the label CSV file, default: “label.csv”
threshold (float) – the threshold for classification, default: 0.5
k (int) – the number of folds in k-fold cross-validation
once (bool) – if True, k-fold cross-validation runs the first fold only
use_index (bool) – if True, sample files are read by index. Otherwise, they are read by sample name.
gpu_ids (List[str]) – the list of GPU ids, default: None
step (int) – the epoch step of learning rate decay, default: 5
model_id (str) – the model id; if None, it will be generated automatically
pretrain_model_path (str) – the path of a pretrained model; if not None, the pretrained model is loaded
max_acc_limit (float) – the maximum accuracy limit; if the accuracy exceeds this limit, training stops to prevent overfitting.
multi_process (bool) – if True, multiple processes are used to train the model.
**kwargs (dict) – other parameters to be saved in the log file.

Returns

The job id.

Return type

str
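
The parameters lr, gamma, and step suggest a step-decay learning rate schedule. The sketch below assumes the rate is multiplied by gamma once every step epochs; whether fit uses exactly this rule is an assumption, and the helper name learning_rate is hypothetical.

```python
def learning_rate(epoch: int, lr: float = 0.001, gamma: float = 0.99,
                  step: int = 5) -> float:
    """Hypothetical helper: step decay, multiplying lr by gamma every `step` epochs."""
    return lr * gamma ** (epoch // step)


# Learning rate at the start of each decay interval over 100 epochs.
schedule = [learning_rate(epoch) for epoch in range(0, 100, 5)]
```

Under this assumption and the default values, the rate stays at 0.001 for epochs 0 to 4 and drops to 0.00099 at epoch 5.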

socube.task.doublet.train.validate(data_loader: torch.utils.data.dataloader.DataLoader, model: socube.net._base.NetBase, device: torch.device, with_progress: bool = False) → tuple

Basic validation of model performance.

Parameters

data_loader (torch.utils.data.DataLoader) – the data loader used for validation
model (NetBase) – the model to validate, an implementation of NetBase
device (torch.device) – the CPU/GPU device

Returns

A 4-tuple of (average loss, average ACC, true labels, predicted scores)

Return type

tuple

socube.task.doublet.train.infer(data_dir: str, home_dir: str, model_id: str, label_file: str, in_channels: int = 10, k: int = 5, threshold: float = 0.5, batch_size: int = 400, gpu_ids: List[str] = None, with_eval: bool = False, seed: Optional[int] = None, multi_process: bool = False, once: bool = False)

Model inference

Parameters

data_dir (str) – the directory of data
home_dir (str) – the home directory of output
model_id (str) – the id of the model
label_file (str) – the label file used for inference
in_channels (int) – the number of input channels
k (int) – k value for k-fold cross-validation
threshold (float) – the threshold for binary classification
batch_size (int) – the batch size for inference
gpu_ids (List[str]) – the list of GPU ids
with_eval (bool) – whether to evaluate the model performance
seed (int) – the random seed
multi_process (bool) – whether to use multiple processes for inference
once (bool) – whether to use ensemble for inference
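
One way the k fold models and threshold could combine at inference time is to average the per-fold scores and then apply the threshold. The sketch below assumes this averaging rule and a greater-or-equal comparison; neither is confirmed by the documentation, and the helper name ensemble_predict is hypothetical.

```python
from typing import List


def ensemble_predict(fold_scores: List[List[float]],
                     threshold: float = 0.5) -> List[int]:
    """Hypothetical helper: average per-fold scores, then threshold to 0/1."""
    n_folds = len(fold_scores)
    mean_scores = [sum(col) / n_folds for col in zip(*fold_scores)]
    return [int(score >= threshold) for score in mean_scores]


# Scores from k = 3 fold models for four droplets.
fold_scores = [
    [0.9, 0.2, 0.6, 0.1],
    [0.8, 0.3, 0.4, 0.2],
    [0.7, 0.1, 0.3, 0.0],
]
labels = ensemble_predict(fold_scores, threshold=0.5)
```

Here only the first droplet has a mean score above the threshold, so it alone is labeled 1 (doublet).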

Module contents

class socube.task.doublet.SoCubeNet(in_channels: int, out_channels: int, freeze: bool = False, binary: bool = True, **kwargs)

Bases: socube.net._base.NetBase

Neural network model constructed for the doublet detection task. Previously named Conv2DClassifyNet.

Parameters

in_channels (integer scalar value) – input data’s channel count
out_channels (integer scalar value) – output data’s channel count
freeze (boolean value) – Whether to freeze the feature extraction layer, default is False
binary (boolean value) – Whether the output probability is binary or multicategorical, default is True for binary

Examples

>>> SoCubeNet(10, 2)

criterion(yPredict: torch.Tensor, yTrue: torch.Tensor) → torch.Tensor

Calculate the loss of the model.

Parameters

yPredict (torch.Tensor) – The predicted data.
yTrue (torch.Tensor) – The true data.

Returns

The loss of the model.

Return type

torch.Tensor

forward(x1: torch.Tensor) → torch.Tensor

Forward pass of the neural network.

Parameters: x1 (torch.Tensor) – The input data.
Returns: The output data.
Return type: torch.Tensor

training = None

class socube.task.doublet.ConvClassifyDataset(data_dir: str, transform: torch.nn.modules.module.Module = None, labels: str = 'label.csv', shuffle: bool = False, seed: int = None, k: int = 5, use_index: bool = True)

Bases: socube.data.loading.ConvDatasetBase

Dataset class for socube.

Parameters

data_dir (string) – the dataset’s directory
transform (torch module) – sample transform, such as Resize
labels (string) – the name of the label CSV file
shuffle (boolean value) – if True, data will be shuffled for k-fold cross-validation
seed (random seed) – random seed for k-fold cross-validation or sampling
k (integer scalar value) – the number of folds in k-fold cross-validation
use_index (boolean value) – if True, sample files are read by index.

sampler(subset: torch.utils.data.dataset.Subset) → torch.utils.data.sampler.WeightedRandomSampler

Generate a weighted random sampler for a subset of this dataset.

Parameters: subset (Subset) – the subset of this dataset
Returns: a weighted random sampler
Return type: torch.utils.data.sampler.WeightedRandomSampler

property typeCounts: Number of samples of each type

socube.task.doublet.generateDoublet(samples: pandas.core.frame.DataFrame, ratio: float = 1.0, adj: float = 1.0, seed: Optional[int] = None, size: Optional[int] = None, mode: Optional[str] = 'balance') → Tuple[pandas.core.frame.DataFrame]

Generate a training set from samples. In silico doublets are simulated as positive samples.

Parameters

samples (pd.DataFrame) – the samples dataframe, with rows as cells (droplets, samples) and columns as genes.
ratio (float, default 1.0) – The ratio of the number of doublets to the number of singlets.
adj (float, default 1.0) – The adjustment factor for the doublet expression level. A doublet is generally assumed to have twice the gene expression level of a singlet, but this does not always hold. The expression level of each generated doublet is scaled by this factor.
seed (int, default None) – The random seed for doublet generation.
size (int, default None) – The size of the generated training set. If None, the training set has the same size as the samples.
mode (str, default "balance") – The mode of the generated training set. If “heterotypic”, heterotypic doublets will be the majority. If “homotypic”, homotypic doublets will be the majority. If “balance”, the numbers of heterotypic and homotypic doublets will be balanced.

Returns

A tuple of two pd.DataFrame objects: the first holds the positive (doublet) samples,
the second holds the negative (singlet) samples.

socube.task.doublet.fit(home_dir: str, data_dir: str, lr: float = 0.001, gamma: float = 0.99, epochs: int = 100, train_batch: int = 32, valid_batch: int = 500, transform: torch.nn.modules.module.Module = None, in_channels: int = 10, num_workers: int = 0, shuffle: bool = False, seed: int = None, label_file: str = 'label.csv', threshold: float = 0.5, k: int = 5, once: bool = False, use_index: bool = True, gpu_ids: List[str] = None, step: int = 5, model_id: str = None, pretrain_model_path: str = None, max_acc_limit: float = 1.0, multi_process: bool = False, **kwargs) → str

Train socube model.

Parameters

home_dir (str) – the home directory of the specific job
data_dir (str) – the dataset’s directory
lr (float) – learning rate, default: 0.001
gamma (float) – learning rate decay, default: 0.99
epochs (int) – training epochs, default: 100
train_batch (int) – training batch size, default: 32
valid_batch (int) – validation batch size, default: 500
transform (nn.Module) – sample transform, such as Resize
in_channels (int) – the number of input channels, default: 10
num_workers (int) – the number of workers for data loading, default: 0
shuffle (bool) – if True, data will be shuffled for k-fold cross-validation
seed (int) – random seed for k-fold cross-validation or sampling
label_file (str) – the name of the label CSV file, default: “label.csv”
threshold (float) – the threshold for classification, default: 0.5
k (int) – the number of folds in k-fold cross-validation
once (bool) – if True, k-fold cross-validation runs the first fold only
use_index (bool) – if True, sample files are read by index. Otherwise, they are read by sample name.
gpu_ids (List[str]) – the list of GPU ids, default: None
step (int) – the epoch step of learning rate decay, default: 5
model_id (str) – the model id; if None, it will be generated automatically
pretrain_model_path (str) – the path of a pretrained model; if not None, the pretrained model is loaded
max_acc_limit (float) – the maximum accuracy limit; if the accuracy exceeds this limit, training stops to prevent overfitting.
multi_process (bool) – if True, multiple processes are used to train the model.
**kwargs (dict) – other parameters to be saved in the log file.

Returns

The job id.

Return type

str

socube.task.doublet.validate(data_loader: torch.utils.data.dataloader.DataLoader, model: socube.net._base.NetBase, device: torch.device, with_progress: bool = False) → tuple

Basic validation of model performance.

Parameters

data_loader (torch.utils.data.DataLoader) – the data loader used for validation
model (NetBase) – the model to validate, an implementation of NetBase
device (torch.device) – the CPU/GPU device

Returns

A 4-tuple of (average loss, average ACC, true labels, predicted scores)

Return type

tuple

socube.task.doublet.infer(data_dir: str, home_dir: str, model_id: str, label_file: str, in_channels: int = 10, k: int = 5, threshold: float = 0.5, batch_size: int = 400, gpu_ids: List[str] = None, with_eval: bool = False, seed: Optional[int] = None, multi_process: bool = False, once: bool = False)

Model inference

Parameters

data_dir (str) – the directory of data
home_dir (str) – the home directory of output
model_id (str) – the id of the model
label_file (str) – the label file used for inference
in_channels (int) – the number of input channels
k (int) – k value for k-fold cross-validation
threshold (float) – the threshold for binary classification
batch_size (int) – the batch size for inference
gpu_ids (List[str]) – the list of GPU ids
with_eval (bool) – whether to evaluate the model performance
seed (int) – the random seed
multi_process (bool) – whether to use multiple processes for inference
once (bool) – whether to use ensemble for inference

socube.task.doublet.checkShape(path: str, shape: tuple = (10, None, None)) → None

Check dataset shape

Parameters

path (string) – the dataset’s directory
shape (tuple, default (10, None, None)) – the expected shape of the dataset, None means any shape

Raises

AssertionError – if the shape of the dataset does not match the expected shape

socube.task.doublet.checkData(data: pandas.core.frame.DataFrame)

Verify that the data is valid.

Parameters: data (pd.DataFrame) – Data to be checked, a dataframe of scRNA-seq data

Raises

ValueError – If data contains NaN or inf
IndexError – If data contains duplicate column or row names, or if a droplet name begins with “doublet”

socube.task.doublet.createTrainData(samples: pandas.core.frame.DataFrame, output_path: str, ratio: float = 1.0, adj: float = 1.0, seed: Optional[int] = None, mode: Optional[int] = 'balance') → Tuple[pandas.core.generic.NDFrame]

Based on the original data, doublets are generated as the positive data and a subset of the original data is used as the negative data to obtain the training dataset.

Parameters

samples (pd.DataFrame) – Original data, a dataframe of scRNA-seq data. Shape is (n_droplets, n_genes)
output_path (str) – Path to save the generated training data.
ratio (float, default 1.0) – The ratio of the number of doublets to the number of singlets.
adj (float, default 1.0) – The adjustment factor for the doublet expression level. A doublet is generally assumed to have twice the gene expression level of a singlet, but this does not always hold. The expression level of each generated doublet is scaled by this factor.
seed (int, default None) – The random seed for doublet generation.
mode (str, default "balance") – The mode of the generated training set. If “heterotypic”, heterotypic doublets will be the majority. If “homotypic”, homotypic doublets will be the majority. If “balance”, the numbers of heterotypic and homotypic doublets will be balanced. See generateDoublet for more details.

Returns

A tuple of NDFrames. The first element is the training data, the second element
is the training labels.