socube.task.doublet package

Submodules

socube.task.doublet.data module

class socube.task.doublet.data.ConvClassifyDataset(data_dir: str, transform: torch.nn.modules.module.Module = None, labels: str = 'label.csv', shuffle: bool = False, seed: int = None, k: int = 5, use_index: bool = True)

Bases: socube.data.loading.ConvDatasetBase

Class of dataset for socube

Parameters
  • data_dir (string) – the dataset’s directory

  • transform (torch module) – sample transform, such as Resize

  • labels (string) – the label file csv name

  • shuffle (boolean value) – if True, data will be shuffled during k-fold cross-validation

  • seed (random seed) – random seed for k-fold cross-validation or sampling

  • k (integer scalar value) – number of folds in k-fold cross-validation

  • use_index (boolean value) – If True, it will read sample file by index.
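
The shuffle, seed and k parameters together control how samples are split for k-fold cross-validation. A minimal sketch of that idea in plain Python, assuming contiguous folds over sample indices; the helper name kfold_indices is hypothetical and not part of socube, and the actual splitting logic may differ:

```python
import random
from typing import Iterator, List, Optional, Tuple


def kfold_indices(n: int, k: int = 5, shuffle: bool = False,
                  seed: Optional[int] = None
                  ) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, valid) index lists for each of the k folds."""
    indices = list(range(n))
    if shuffle:
        # A fixed seed makes the shuffled split reproducible.
        random.Random(seed).shuffle(indices)
    # Spread the remainder over the first n % k folds.
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        valid = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, valid
        start += size


folds = list(kfold_indices(10, k=5))
```

Each sample index appears in exactly one validation fold, so every sample is used for validation once across the k runs.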

sampler(subset: torch.utils.data.dataset.Subset) → torch.utils.data.sampler.WeightedRandomSampler

Generate weighted random sampler for a subset of this dataset

Parameters

subset (Subset) – the subset of this dataset

Returns

A weighted random sampler for the subset.
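
A weighted random sampler draws sample indices with probability proportional to per-sample weights. How socube derives those weights is not stated here; the sketch below assumes inverse class frequency, a common choice for imbalanced labels. The helper balanced_weights is hypothetical, and random.choices stands in for the torch sampler:

```python
import random
from collections import Counter
from typing import List, Sequence


def balanced_weights(labels: Sequence[int]) -> List[float]:
    """Weight each sample by the inverse frequency of its class."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


labels = [0, 0, 0, 0, 0, 0, 1, 1]          # 6 singlets, 2 doublets
weights = balanced_weights(labels)

# Draw indices with replacement; each class carries the same total weight.
rng = random.Random(0)
drawn = rng.choices(range(len(labels)), weights=weights, k=1000)
```

With these weights the minority class is drawn about as often as the majority class, even though it has fewer samples.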

property typeCounts

Numbers of different types

socube.task.doublet.data.generateDoublet(samples: pandas.core.frame.DataFrame, ratio: float = 1.0, adj: float = 1.0, seed: Optional[int] = None, size: Optional[int] = None, mode: Optional[str] = 'balance') → Tuple[pandas.core.frame.DataFrame]

Generate a training set from samples. In silico doublets are simulated as positive samples.

Parameters
  • samples (pd.DataFrame) – the samples dataframe, with rows as cells (droplets, samples) and columns as genes.

  • ratio (float, default 1.0) – The ratio of the number of doublets to the number of singlets.

  • adj (float, default 1.0) – The adjustment factor for the doublet expression level. A doublet is generally assumed to have twice the gene expression level of a singlet, but this does not always hold. The gene expression level of the generated doublets is adjusted by this factor.

  • seed (int, default None) – The random seed for doublet generation.

  • size (int, default None) – The size of the generated training set. If None, the size of the training set will be the same as the size of the samples.

  • mode (str, default "balance") – The mode of the generated training set. If “heterotypic”, heterotypic doublets will be the majority. If “homotypic”, homotypic doublets will be the majority. If “balance”, the numbers of heterotypic and homotypic doublets will be balanced.

Returns

A tuple of two pd.DataFrame objects: the first holds the positive (doublet) samples, the second the negative (singlet) samples.
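
The idea of an in silico doublet can be sketched in a few lines: two distinct droplets are picked at random and their expression profiles are combined. This is an illustration only, not socube's implementation. The helper simulate_doublets is hypothetical, the mode parameter is ignored, and applying adj as a multiplier on the summed profile is an assumption:

```python
import random
from typing import List, Optional


def simulate_doublets(singlets: List[List[float]], n: int,
                      adj: float = 1.0,
                      seed: Optional[int] = None) -> List[List[float]]:
    """Build n in silico doublets by summing two distinct random rows."""
    rng = random.Random(seed)
    doublets = []
    for _ in range(n):
        a, b = rng.sample(singlets, 2)      # two different droplets
        # Assumption: adj rescales the summed expression profile.
        doublets.append([adj * (x + y) for x, y in zip(a, b)])
    return doublets


# Rows are droplets, columns are genes.
singlets = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0], [2.0, 2.0, 0.0]]
doublets = simulate_doublets(singlets, n=3, adj=1.0, seed=0)
```

Passing the same seed reproduces the same set of simulated doublets.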

socube.task.doublet.data.checkShape(path: str, shape: tuple = (10, None, None)) → None

Check dataset shape

Parameters
  • path (string) – the dataset’s directory

  • shape (tuple, default (10, None, None)) – the expected shape of the dataset, None means any shape

Raises

AssertionError – if the shape of the dataset does not match the expected shape
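
The wildcard semantics of the shape parameter can be illustrated as follows. The helper shape_matches is hypothetical and only shows how None entries are ignored during comparison; it does not read a dataset from disk as checkShape does:

```python
from typing import Optional, Sequence


def shape_matches(actual: Sequence[int],
                  expected: Sequence[Optional[int]]) -> bool:
    """Compare two shapes, treating None in `expected` as a wildcard."""
    if len(actual) != len(expected):
        return False
    return all(e is None or a == e for a, e in zip(actual, expected))


ok = shape_matches((10, 64, 64), (10, None, None))        # channels match
bad = shape_matches((3, 64, 64), (10, None, None))        # wrong channel count
```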

socube.task.doublet.data.checkData(data: pandas.core.frame.DataFrame)

Verify that the data is valid.

Parameters

data (pd.DataFrame) – Data to be checked, a dataframe of scRNA-seq data

Raises

  • ValueError – If data contains NaN or inf

  • IndexError – If data contains duplicate column or row names, or if a droplet name begins with “doublet”
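
The conditions listed under Raises can be sketched without pandas. The function check_table below is a hypothetical stand-in that takes row names, column names and a nested list of values instead of a DataFrame; the order in which socube performs these checks is an assumption:

```python
import math
from typing import List, Sequence


def check_table(row_names: Sequence[str], col_names: Sequence[str],
                values: List[List[float]]) -> None:
    """Raise on invalid input; return None if the data is clean."""
    if any(not math.isfinite(v) for row in values for v in row):
        raise ValueError("data contains NaN or inf")
    if len(set(row_names)) != len(row_names):
        raise IndexError("duplicate row names")
    if len(set(col_names)) != len(col_names):
        raise IndexError("duplicate column names")
    if any(name.startswith("doublet") for name in row_names):
        raise IndexError('droplet name begins with "doublet"')


# A clean table passes silently.
check_table(["cell1", "cell2"], ["geneA", "geneB"], [[1.0, 0.0], [2.0, 3.0]])
```

One plausible reason for rejecting names that begin with "doublet" is to keep original droplets distinguishable from simulated ones, though the source does not state this.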

socube.task.doublet.data.createTrainData(samples: pandas.core.frame.DataFrame, output_path: str, ratio: float = 1.0, adj: float = 1.0, seed: Optional[int] = None, mode: Optional[int] = 'balance') → Tuple[pandas.core.generic.NDFrame]

Based on the original data, doublets are generated as the positive data and a subset of the original data is used as the negative data to obtain the training dataset.

Parameters
  • samples (pd.DataFrame) – Original data, a dataframe of scRNA-seq data. Shape is (n_droplets, n_genes)

  • output_path (str) – Path to save the generated training data.

  • ratio (float, default 1.0) – The ratio of the number of doublets to the number of singlets.

  • adj (float, default 1.0) – The adjustment factor for the doublet expression level. A doublet is generally assumed to have twice the gene expression level of a singlet, but this does not always hold. The gene expression level of the generated doublets is adjusted by this factor.

  • seed (int, default None) – The random seed for doublet generation.

  • mode (str, default "balance") – The mode of the generated training set. If “heterotypic”, heterotypic doublets will be the majority. If “homotypic”, homotypic doublets will be the majority. If “balance”, the numbers of heterotypic and homotypic doublets will be balanced. See generateDoublet for more details.

Returns

A tuple of NDFrames. The first element is the training data, the second element is the training label.

socube.task.doublet.model module

class socube.task.doublet.model.SoCubeNet(in_channels: int, out_channels: int, freeze: bool = False, binary: bool = True, **kwargs)

Bases: socube.net._base.NetBase

Neural network model constructed for the doublet detection task. Previously named Conv2DClassifyNet.

Parameters
  • in_channels (integer scalar value) – input data’s channel count

  • out_channels (integer scalar value) – output data’s channel count

  • freeze (boolean value) – Whether to freeze the feature extraction layer, default is False

  • binary (boolean value) – Whether the output probability is binary or multicategorical, default is True for binary

Examples

>>> SoCubeNet(10, 2)

criterion(yPredict: torch.Tensor, yTrue: torch.Tensor) → torch.Tensor

Calculate the loss of the model.

Parameters
  • yPredict (torch.Tensor) – The predicted data.

  • yTrue (torch.Tensor) – The true data.

Returns

The loss of the model.

forward(x1: torch.Tensor) → torch.Tensor

Forward pass of the network.

Parameters

x1 (torch.Tensor) – The input data.

Returns

The output data.

training = None

socube.task.doublet.train module

socube.task.doublet.train.fit(home_dir: str, data_dir: str, lr: float = 0.001, gamma: float = 0.99, epochs: int = 100, train_batch: int = 32, valid_batch: int = 500, transform: torch.nn.modules.module.Module = None, in_channels: int = 10, num_workers: int = 0, shuffle: bool = False, seed: int = None, label_file: str = 'label.csv', threshold: float = 0.5, k: int = 5, once: bool = False, use_index: bool = True, gpu_ids: List[str] = None, step: int = 5, model_id: str = None, pretrain_model_path: str = None, max_acc_limit: float = 1.0, multi_process: bool = False, **kwargs) → str

Train socube model.

Parameters
  • home_dir (str) – the home directory of the specific job

  • data_dir (str) – the dataset’s directory

  • lr (float) – learning rate, default: 0.001

  • gamma (float) – learning rate decay, default: 0.99

  • epochs (int) – training epochs, default: 100

  • train_batch (int) – training batch size, default: 32

  • valid_batch (int) – validation batch size, default: 500

  • transform (nn.Module) – sample transform, such as Resize

  • in_channels (int) – the number of input channels, default: 10

  • num_workers (int) – the number of workers for data loading, default: 0

  • shuffle (bool) – if True, data will be shuffled during k-fold cross-validation

  • seed (int) – random seed for k-fold cross-validation or sampling

  • label_file (str) – the label file csv name, default: “label.csv”

  • threshold (float) – the threshold for classification, default: 0.5

  • k (int) – number of folds in k-fold cross-validation, default: 5

  • once (bool) – if True, only the first fold of k-fold cross-validation is run

  • use_index (bool) – If True, it will read sample file by index. Otherwise, it will read sample file by sample name.

  • gpu_ids (List[str]) – the list of GPU ids, default: None

  • step (int) – the epoch step of learning rate decay, default: 5

  • model_id (str) – the model id; if None, it will be generated automatically

  • pretrain_model_path (str) – the path of a pretrained model; if not None, the pretrained model will be loaded

  • max_acc_limit (float) – the maximum accuracy limit; if the accuracy exceeds this limit, training stops to prevent overfitting.

  • multi_process (bool) – if True, multiple processes are used to train the model.

  • **kwargs (dict) – other parameters to be saved in the log file.

Returns

The job id string
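
The lr, gamma and step parameters describe a stepped learning-rate decay. A minimal sketch, assuming the schedule multiplies the learning rate by gamma once every step epochs (the behaviour of torch.optim.lr_scheduler.StepLR); whether fit uses exactly this scheduler is an assumption, and stepped_lr is a hypothetical helper:

```python
def stepped_lr(lr: float, gamma: float, step: int, epoch: int) -> float:
    """Learning rate at `epoch`, decayed by gamma every `step` epochs."""
    return lr * gamma ** (epoch // step)


# Defaults of fit: lr=0.001, gamma=0.99, step=5.
schedule = [stepped_lr(0.001, 0.99, 5, epoch) for epoch in (0, 4, 5, 10)]
```

Under this assumption the rate stays at 0.001 for epochs 0 to 4 and drops by 1% at epoch 5 and again at epoch 10.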

socube.task.doublet.train.validate(data_loader: torch.utils.data.dataloader.DataLoader, model: socube.net._base.NetBase, device: torch.device, with_progress: bool = False) → tuple

Basic validation of model performance.

Parameters
  • data_loader (torch.utils.data.DataLoader) – the dataloader used for validation

  • model (NetBase) – the model to be validated

  • device (torch.device) – the CPU/GPU device

Returns

A 4-tuple of (average loss, average ACC, true label, predict score)
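
The true labels and predicted scores in the returned tuple can be turned into an accuracy figure by applying a classification threshold. The sketch below is illustrative only; the helper accuracy is hypothetical, and treating a score equal to the threshold as positive is an assumption:

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of samples whose thresholded score equals the true label."""
    predicted = [int(s >= threshold) for s in scores]
    correct = sum(p == t for p, t in zip(predicted, y_true))
    return correct / len(y_true)


# Third sample (score 0.4, label 1) is misclassified at threshold 0.5.
acc = accuracy([0, 1, 1, 0], [0.1, 0.8, 0.4, 0.3])
```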

socube.task.doublet.train.infer(data_dir: str, home_dir: str, model_id: str, label_file: str, in_channels: int = 10, k: int = 5, threshold: float = 0.5, batch_size: int = 400, gpu_ids: List[str] = None, with_eval: bool = False, seed: Optional[int] = None, multi_process: bool = False, once: bool = False)

Model inference

Parameters
  • data_dir (str) – the directory of data

  • home_dir (str) – the home directory of output

  • model_id (str) – the id of model

  • label_file (str) – the label file used for inference

  • in_channels (int) – the number of input channels

  • k (int) – k value for k-fold cross validation

  • threshold (float) – the threshold for binary classification

  • batch_size (int) – the batch size for inference

  • gpu_ids (List[str]) – the list of gpu ids

  • with_eval (bool) – whether to evaluate the model performance

  • seed (int) – the random seed

  • multi_process (bool) – whether to use multi-process for inference

  • once (bool) – whether to use ensemble for inference
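
One common way to build an ensemble from k cross-validation models is to average their per-sample scores and then apply the threshold. The sketch below assumes that approach; the source does not state how infer combines the fold models, and ensemble_scores is a hypothetical helper:

```python
from typing import List, Sequence


def ensemble_scores(fold_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average per-sample scores across the models of all folds."""
    n_models = len(fold_scores)
    return [sum(column) / n_models for column in zip(*fold_scores)]


fold_scores = [[0.9, 0.2], [0.7, 0.4], [0.8, 0.3]]   # 3 models, 2 samples
mean_scores = ensemble_scores(fold_scores)
labels = [int(score >= 0.5) for score in mean_scores]  # threshold = 0.5
```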

Module contents

class socube.task.doublet.SoCubeNet(in_channels: int, out_channels: int, freeze: bool = False, binary: bool = True, **kwargs)

Bases: socube.net._base.NetBase

Neural network model constructed for the doublet detection task. Previously named Conv2DClassifyNet.

Parameters
  • in_channels (integer scalar value) – input data’s channel count

  • out_channels (integer scalar value) – output data’s channel count

  • freeze (boolean value) – Whether to freeze the feature extraction layer, default is False

  • binary (boolean value) – Whether the output probability is binary or multicategorical, default is True for binary

Examples

>>> SoCubeNet(10, 2)

criterion(yPredict: torch.Tensor, yTrue: torch.Tensor) → torch.Tensor

Calculate the loss of the model.

Parameters
  • yPredict (torch.Tensor) – The predicted data.

  • yTrue (torch.Tensor) – The true data.

Returns

The loss of the model.

forward(x1: torch.Tensor) → torch.Tensor

Forward pass of the network.

Parameters

x1 (torch.Tensor) – The input data.

Returns

The output data.

training = None

class socube.task.doublet.ConvClassifyDataset(data_dir: str, transform: torch.nn.modules.module.Module = None, labels: str = 'label.csv', shuffle: bool = False, seed: int = None, k: int = 5, use_index: bool = True)

Bases: socube.data.loading.ConvDatasetBase

Class of dataset for socube

Parameters
  • data_dir (string) – the dataset’s directory

  • transform (torch module) – sample transform, such as Resize

  • labels (string) – the label file csv name

  • shuffle (boolean value) – if True, data will be shuffled during k-fold cross-validation

  • seed (random seed) – random seed for k-fold cross-validation or sampling

  • k (integer scalar value) – number of folds in k-fold cross-validation

  • use_index (boolean value) – If True, it will read sample file by index.

sampler(subset: torch.utils.data.dataset.Subset) → torch.utils.data.sampler.WeightedRandomSampler

Generate weighted random sampler for a subset of this dataset

Parameters

subset (Subset) – the subset of this dataset

Returns

A weighted random sampler for the subset.

property typeCounts

Numbers of different types

socube.task.doublet.generateDoublet(samples: pandas.core.frame.DataFrame, ratio: float = 1.0, adj: float = 1.0, seed: Optional[int] = None, size: Optional[int] = None, mode: Optional[str] = 'balance') → Tuple[pandas.core.frame.DataFrame]

Generate a training set from samples. In silico doublets are simulated as positive samples.

Parameters
  • samples (pd.DataFrame) – the samples dataframe, with rows as cells (droplets, samples) and columns as genes.

  • ratio (float, default 1.0) – The ratio of the number of doublets to the number of singlets.

  • adj (float, default 1.0) – The adjustment factor for the doublet expression level. A doublet is generally assumed to have twice the gene expression level of a singlet, but this does not always hold. The gene expression level of the generated doublets is adjusted by this factor.

  • seed (int, default None) – The random seed for doublet generation.

  • size (int, default None) – The size of the generated training set. If None, the size of the training set will be the same as the size of the samples.

  • mode (str, default "balance") – The mode of the generated training set. If “heterotypic”, heterotypic doublets will be the majority. If “homotypic”, homotypic doublets will be the majority. If “balance”, the numbers of heterotypic and homotypic doublets will be balanced.

Returns

A tuple of two pd.DataFrame objects: the first holds the positive (doublet) samples, the second the negative (singlet) samples.

socube.task.doublet.fit(home_dir: str, data_dir: str, lr: float = 0.001, gamma: float = 0.99, epochs: int = 100, train_batch: int = 32, valid_batch: int = 500, transform: torch.nn.modules.module.Module = None, in_channels: int = 10, num_workers: int = 0, shuffle: bool = False, seed: int = None, label_file: str = 'label.csv', threshold: float = 0.5, k: int = 5, once: bool = False, use_index: bool = True, gpu_ids: List[str] = None, step: int = 5, model_id: str = None, pretrain_model_path: str = None, max_acc_limit: float = 1.0, multi_process: bool = False, **kwargs) → str

Train socube model.

Parameters
  • home_dir (str) – the home directory of the specific job

  • data_dir (str) – the dataset’s directory

  • lr (float) – learning rate, default: 0.001

  • gamma (float) – learning rate decay, default: 0.99

  • epochs (int) – training epochs, default: 100

  • train_batch (int) – training batch size, default: 32

  • valid_batch (int) – validation batch size, default: 500

  • transform (nn.Module) – sample transform, such as Resize

  • in_channels (int) – the number of input channels, default: 10

  • num_workers (int) – the number of workers for data loading, default: 0

  • shuffle (bool) – if True, data will be shuffled during k-fold cross-validation

  • seed (int) – random seed for k-fold cross-validation or sampling

  • label_file (str) – the label file csv name, default: “label.csv”

  • threshold (float) – the threshold for classification, default: 0.5

  • k (int) – number of folds in k-fold cross-validation, default: 5

  • once (bool) – if True, only the first fold of k-fold cross-validation is run

  • use_index (bool) – If True, it will read sample file by index. Otherwise, it will read sample file by sample name.

  • gpu_ids (List[str]) – the list of GPU ids, default: None

  • step (int) – the epoch step of learning rate decay, default: 5

  • model_id (str) – the model id; if None, it will be generated automatically

  • pretrain_model_path (str) – the path of a pretrained model; if not None, the pretrained model will be loaded

  • max_acc_limit (float) – the maximum accuracy limit; if the accuracy exceeds this limit, training stops to prevent overfitting.

  • multi_process (bool) – if True, multiple processes are used to train the model.

  • **kwargs (dict) – other parameters to be saved in the log file.

Returns

The job id string

socube.task.doublet.validate(data_loader: torch.utils.data.dataloader.DataLoader, model: socube.net._base.NetBase, device: torch.device, with_progress: bool = False) → tuple

Basic validation of model performance.

Parameters
  • data_loader (torch.utils.data.DataLoader) – the dataloader used for validation

  • model (NetBase) – the model to be validated

  • device (torch.device) – the CPU/GPU device

Returns

A 4-tuple of (average loss, average ACC, true label, predict score)

socube.task.doublet.infer(data_dir: str, home_dir: str, model_id: str, label_file: str, in_channels: int = 10, k: int = 5, threshold: float = 0.5, batch_size: int = 400, gpu_ids: List[str] = None, with_eval: bool = False, seed: Optional[int] = None, multi_process: bool = False, once: bool = False)

Model inference

Parameters
  • data_dir (str) – the directory of data

  • home_dir (str) – the home directory of output

  • model_id (str) – the id of model

  • label_file (str) – the label file used for inference

  • in_channels (int) – the number of input channels

  • k (int) – k value for k-fold cross validation

  • threshold (float) – the threshold for binary classification

  • batch_size (int) – the batch size for inference

  • gpu_ids (List[str]) – the list of gpu ids

  • with_eval (bool) – whether to evaluate the model performance

  • seed (int) – the random seed

  • multi_process (bool) – whether to use multi-process for inference

  • once (bool) – whether to use ensemble for inference

socube.task.doublet.checkShape(path: str, shape: tuple = (10, None, None)) → None

Check dataset shape

Parameters
  • path (string) – the dataset’s directory

  • shape (tuple, default (10, None, None)) – the expected shape of the dataset, None means any shape

Raises

AssertionError – if the shape of the dataset does not match the expected shape

socube.task.doublet.checkData(data: pandas.core.frame.DataFrame)

Verify that the data is valid.

Parameters

data (pd.DataFrame) – Data to be checked, a dataframe of scRNA-seq data

Raises

  • ValueError – If data contains NaN or inf

  • IndexError – If data contains duplicate column or row names, or if a droplet name begins with “doublet”

socube.task.doublet.createTrainData(samples: pandas.core.frame.DataFrame, output_path: str, ratio: float = 1.0, adj: float = 1.0, seed: Optional[int] = None, mode: Optional[int] = 'balance') → Tuple[pandas.core.generic.NDFrame]

Based on the original data, doublets are generated as the positive data and a subset of the original data is used as the negative data to obtain the training dataset.

Parameters
  • samples (pd.DataFrame) – Original data, a dataframe of scRNA-seq data. Shape is (n_droplets, n_genes)

  • output_path (str) – Path to save the generated training data.

  • ratio (float, default 1.0) – The ratio of the number of doublets to the number of singlets.

  • adj (float, default 1.0) – The adjustment factor for the doublet expression level. A doublet is generally assumed to have twice the gene expression level of a singlet, but this does not always hold. The gene expression level of the generated doublets is adjusted by this factor.

  • seed (int, default None) – The random seed for doublet generation.

  • mode (str, default "balance") – The mode of the generated training set. If “heterotypic”, heterotypic doublets will be the majority. If “homotypic”, homotypic doublets will be the majority. If “balance”, the numbers of heterotypic and homotypic doublets will be balanced. See generateDoublet for more details.

Returns

A tuple of NDFrames. The first element is the training data, the second element is the training label.