socube.data package¶

Submodules¶

socube.data.loading module¶

class socube.data.loading.DatasetBase(labels: pandas.core.frame.DataFrame, shuffle: bool = False, seed: int = None, k: int = 5, task_type: str = 'classify')¶

Bases: torch.utils.data.dataset.Dataset

Abstract base class for datasets. All SoCube extended datasets must inherit and implement its abstract interface.

Parameters

labels (pd.DataFrame) – Dataframe containing labels for each sample.
shuffle (bool, default False) – Whether to shuffle each class’s samples before splitting into batches. Note that the samples within each split will not be shuffled.
seed (int, default None) – Random seed for shuffling.
k (int, default 5) – Number of folds for k-fold cross-validation.
task_type (str, default "classify") – Type of task. Must be one of “classify”, “regress”.

property kFold¶

Get generator for k-fold cross-validation dataset

Returns: kFold – A generator for k-fold cross-validation datasets. Each iteration yields a tuple of two Subset objects, one for training and one for validation.
Return type: generator

abstract sampler(subset: torch.utils.data.dataset.Subset) → torch.utils.data.sampler.Sampler¶

Abstract method for sampling a subset of this dataset.

Parameters: subset (Subset) – A subset of this dataset.
Returns: sampler – A sampler for the subset.
Return type: Sampler
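
The k-fold behaviour described above can be illustrated with a minimal sketch. It is not SoCube's implementation: the function name stratified_k_fold is hypothetical, the per-class (stratified) round-robin split is an assumption based on the shuffle parameter's description, and it yields plain index lists rather than Subset objects (an index list can be wrapped with torch.utils.data.Subset(dataset, indices)).

```python
import random
from collections import defaultdict
from typing import Iterator, List, Optional, Sequence, Tuple


def stratified_k_fold(
    labels: Sequence[int],
    k: int = 5,
    shuffle: bool = False,
    seed: Optional[int] = None,
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, valid_indices) for each of the k folds."""
    # Group sample indices by class so that every fold sees every class.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        if shuffle:
            # Shuffle within each class before splitting (assumed behaviour).
            rng.shuffle(indices)
        # Deal the samples of this class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    for held_out in range(k):
        valid = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, valid


for train, valid in stratified_k_fold([0] * 10 + [1] * 5, k=5, shuffle=True, seed=0):
    print(len(train), len(valid))
```

Each sample lands in exactly one validation fold, so the k validation sets together cover the whole dataset.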

class socube.data.loading.ConvDatasetBase(data_dir: str, labels: pandas.core.frame.DataFrame, transform: torch.nn.modules.module.Module = None, shuffle: bool = False, seed: int = None, k: int = 5, task_type: str = 'classify', use_index: bool = True)¶

Bases: socube.data.loading.DatasetBase

Basic dataset designed for CNNs.

Parameters

data_dir (str) – Path to the directory containing dataset.
labels (pd.DataFrame) – Dataframe containing labels for each sample.
transform (torch.nn.Module, default None) – Transform to apply to each sample.
shuffle (bool, default False) – Whether to shuffle each class’s samples before splitting into batches.
seed (int, default None) – Random seed for shuffling.
k (int, default 5) – Number of folds for k-fold cross-validation.
task_type (str, default "classify") – Type of task. Must be one of “classify”, “regress”.
use_index (bool, default True) – Whether to use the numeric index as the sample file name, such as “0.npy”. If False, the sample name in the labels is used as the file name instead, such as “sample_name.npy”.

socube.data.preprocess module¶

socube.data.preprocess.summary(data: pandas.core.frame.DataFrame, axis: int = 1) → pandas.core.frame.DataFrame¶

Data summary for each column or row.

Parameters

data (dataframe) – a dataframe with row and column
axis (int, default 1) – 0 to summarize each column, 1 to summarize each row

Returns: a dataframe with a summary for each column or row
Return type: pandas.DataFrame

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.summary(data)

socube.data.preprocess.filterData(data: pandas.core.frame.DataFrame, filtered_gene_prop: float = 0.05, filtered_cell_prop: float = 0.05, mini_expr: float = 0.05, mini_library_size: int = 1000) → pandas.core.frame.DataFrame¶

Remove the genes and cells with the lowest variation, in the given proportions; remove genes whose average expression is less than mini_expr; and remove cells whose library size is less than mini_library_size.

Parameters

data (dataframe) – a dataframe whose rows are genes and whose columns are cells
filtered_gene_prop (float, default 0.05) – Proportion of genes with the lowest variation to remove
filtered_cell_prop (float, default 0.05) – Proportion of cells with the lowest variation to remove
mini_expr (float, default 0.05) – Remove genes whose average expression is less than mini_expr
mini_library_size (int, default 1000) – Remove cells whose library size is less than mini_library_size

Returns: a dataframe with the filtered genes and cells
Return type: pandas.DataFrame

socube.data.preprocess.minmax(data: pandas.core.frame.DataFrame, range: Tuple[int] = (0, 1), flag: int = 0, dtype: str = 'float32') → pandas.core.frame.DataFrame¶

Perform maximum-minimum normalization

Parameters

data (dataframe) – a dataframe whose rows are samples and whose columns are features
range (tuple, default (0, 1)) – The minimum and maximum values of the normalized data; normalized to 0~1 by default
flag (int, default 0) – 0 for min-max by column, greater than 0 for min-max by row, less than 0 for global min-max.
dtype (str, default "float32") – The data type of the normalized data

Returns: a dataframe with normalized data
Return type: pandas.DataFrame

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.minmax(data)
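
The three flag modes can be made concrete with a small standard-library sketch. It works on nested lists instead of dataframes, the name minmax_sketch is hypothetical, and mapping a constant group to the lower bound is an assumption; the library may treat that case differently.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def minmax_sketch(
    data: Matrix,
    value_range: Tuple[float, float] = (0, 1),
    flag: int = 0,
) -> Matrix:
    """Rescale values into value_range by column, by row, or globally."""
    low, high = value_range

    def scale(values: List[float], vmin: float, vmax: float) -> List[float]:
        span = vmax - vmin
        if span == 0:
            # Assumption: a constant group maps to the lower bound.
            return [float(low) for _ in values]
        return [low + (v - vmin) / span * (high - low) for v in values]

    if flag > 0:
        # Each row is rescaled with its own minimum and maximum.
        return [scale(row, min(row), max(row)) for row in data]
    if flag < 0:
        # One minimum and maximum shared by the whole matrix.
        flat = [v for row in data for v in row]
        vmin, vmax = min(flat), max(flat)
        return [scale(row, vmin, vmax) for row in data]
    # flag == 0: each column is rescaled with its own minimum and maximum.
    columns = [list(col) for col in zip(*data)]
    scaled = [scale(col, min(col), max(col)) for col in columns]
    return [list(row) for row in zip(*scaled)]


data = [[1.0, 10.0], [3.0, 30.0]]
print(minmax_sketch(data, flag=0))
print(minmax_sketch(data, flag=1))
print(minmax_sketch(data, flag=-1))
```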

socube.data.preprocess.std(data: pandas.core.frame.DataFrame, horizontal: bool = False, dtype: str = 'float32', global_minmax: bool = False) → pandas.core.frame.DataFrame¶

Standardization of data

Parameters

data (dataframe) – a dataframe whose rows are samples and whose columns are features
horizontal (bool, default False) – If True, perform standardization horizontally
dtype (str, default "float32") – The data type of the standardized data
global_minmax (bool, default False) – If True, standardize globally; otherwise standardize by row or column

Returns: a dataframe with standardized data
Return type: pandas.DataFrame

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.std(data)
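
As a rough illustration of what standardization does, the sketch below computes z-scores by column or by row on nested lists. The names zscore and std_sketch are hypothetical, reading "horizontally" as "by row" is an assumption, and so is the use of the population standard deviation; the library may use the sample standard deviation instead.

```python
from statistics import mean, pstdev
from typing import List

Matrix = List[List[float]]


def zscore(values: List[float]) -> List[float]:
    """Subtract the mean and divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Assumption: a constant group is mapped to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def std_sketch(data: Matrix, horizontal: bool = False) -> Matrix:
    """Standardize by row if horizontal is True, otherwise by column."""
    if horizontal:
        return [zscore(row) for row in data]
    columns = [zscore(list(col)) for col in zip(*data)]
    return [list(row) for row in zip(*columns)]


print(std_sketch([[1.0, 2.0], [3.0, 6.0]]))
print(std_sketch([[1.0, 2.0], [3.0, 6.0]], horizontal=True))
```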

socube.data.preprocess.cosineDistanceMatrix(x1: torch.Tensor, x2: torch.Tensor = None, device_name: str = 'cpu') → torch.Tensor¶

Calculate the cosine distance matrix between the two sets of samples.

Parameters

x1 (torch.Tensor) – a tensor of samples, with shape (n1, d)
x2 (torch.Tensor, default None) – a tensor of samples, with shape (n2, d), if None, x2 = x1
device_name (str, default "cpu") – the device used for accelerating the calculation such as “cpu”, “cuda:0”, “cuda:1”, etc.

Returns: a tensor containing the cosine distance matrix, with shape (n1, n2)
Return type: torch.Tensor

Examples

>>> import torch
>>> import socube.data.preprocess as pre
>>> x1 = torch.rand(10, 10)
>>> x2 = torch.rand(10, 10)
>>> pre.cosineDistanceMatrix(x1, x2)
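
For readers who want to check the shapes and values by hand, here is a standard-library sketch of a pairwise cosine distance matrix. It assumes the usual definition, distance = 1 - cosine similarity, and non-zero input rows; the name cosine_distance_matrix_sketch is hypothetical and the code is not the library's tensor implementation.

```python
import math
from typing import List, Optional, Sequence

Vector = Sequence[float]


def cosine_distance_matrix_sketch(
    x1: Sequence[Vector],
    x2: Optional[Sequence[Vector]] = None,
) -> List[List[float]]:
    """Return an (n1, n2) matrix of 1 - cosine similarity."""
    if x2 is None:
        # Mirror the documented default: compare x1 with itself.
        x2 = x1

    def norm(v: Vector) -> float:
        return math.sqrt(sum(c * c for c in v))

    matrix = []
    for a in x1:
        row = []
        for b in x2:
            dot = sum(p * q for p, q in zip(a, b))
            # Assumes neither vector is all zeros (no division by zero).
            row.append(1.0 - dot / (norm(a) * norm(b)))
        matrix.append(row)
    return matrix


x1 = [[1.0, 0.0], [0.0, 1.0]]
x2 = [[1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]
for row in cosine_distance_matrix_sketch(x1, x2):
    print(row)
```

Identical directions give a distance of 0, orthogonal vectors give 1, and opposite directions give 2.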

socube.data.preprocess.scatterToGrid(scatters2d: torch.Tensor, transform: Tuple[int] = (1, -1), device_name: str = 'cpu') → torch.Tensor¶

Map scattered coordinates to grid coordinates using the Jonker-Volgenant (J-V) linear assignment algorithm

Parameters

scatters2d (torch.Tensor) – a tensor of scattered coordinates, with shape (n, 2)
transform (tuple, default (1, -1)) – the transformation of the coordinates, such as (1, -1). A value greater than 0 keeps the direction of the corresponding coordinate; a value less than 0 reverses it. In scatter and grid plots the y-axis directions are often opposite, so the y-axis needs to be reversed for visual consistency.
device_name (str, default "cpu") – the device used for accelerating the calculation such as “cpu”, “cuda:0”, “cuda:1”, etc.

Returns: a tensor of grid coordinates, with shape (n, 2)
Return type: torch.Tensor

Examples

>>> import torch
>>> import socube.data.preprocess as pre
>>> scatters2d = torch.rand(10, 2)
>>> pre.scatterToGrid(scatters2d)

socube.data.preprocess.umap2D(data: pandas.core.frame.DataFrame, metric: str = 'correlation', neighbors: int = 5, seed: int = None) → pandas.core.frame.DataFrame¶

Reducing high-dimensional data to 2D using UMAP

Parameters

data (pd.DataFrame) – a dataframe of high-dimensional data, with shape (n, d). n is the number of samples, d is the dimension of the data
metric (str, default 'correlation') – the metric used for calculating the distance between samples, such as ‘correlation’, ‘euclidean’, ‘manhattan’, etc.
neighbors (int, default 5) – the number of neighbors used for UMAP.
seed (int, default None) – the random seed used for UMAP.

Returns: a dataframe of two-dimensional data, with shape (n, 2)
Return type: pandas.DataFrame

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.umap2D(data)

socube.data.preprocess.tsne2D(data: pandas.core.frame.DataFrame, metric: str = 'correlation', seed: int = None) → pandas.core.frame.DataFrame¶

Reducing high-dimensional data to 2D using t-SNE

Parameters

data (pd.DataFrame) – a dataframe of high-dimensional data, with shape (n, d). n is the number of samples, d is the dimension of the data
metric (str, default 'correlation') – the metric used for calculating the distance between samples, such as ‘correlation’, ‘euclidean’, ‘manhattan’, etc.
seed (int, default None) – the random seed used for t-SNE.

Returns: a dataframe of two-dimensional data, with shape (n, 2)
Return type: pandas.DataFrame

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.tsne2D(data)

socube.data.preprocess.vec2Grid(vector: numpy.ndarray, shuffle: bool = False, seed: int = None) → pandas.core.frame.DataFrame¶

Converts a one-dimensional vector to a two-dimensional grid

Parameters

vector (np.ndarray) – a one-dimensional vector, with shape (n,). n is the number of samples
shuffle (bool, default False) – whether to shuffle the vector
seed (int, default None) – the random seed used for shuffling the vector

Returns: a dataframe of two-dimensional data, with shape (n, 2); each row represents a grid point with horizontal (x) and vertical (y) coordinates
Return type: pandas.DataFrame

Examples

>>> import numpy as np
>>> import socube.data.preprocess as pre
>>> vector = np.random.rand(10)
>>> pre.vec2Grid(vector)

socube.data.preprocess.onehot(label: numpy.ndarray, class_nums: int = None) → numpy.ndarray¶

Convert a 1D multi-class label vector (each element is a sample’s label) to a one-hot matrix. Each label should be an integer.

Parameters

label (np.ndarray) – a one-dimensional integer vector, with shape (n,). n is the number of samples
class_nums (int, default None) – the number of classes. If None, the number of classes is automatically determined.

Returns: an ndarray holding the one-hot matrix, with shape (n, class_nums)
Return type: numpy.ndarray

Examples

>>> onehot(np.array([1,2,4]))
array([[0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1]], dtype=int32)

>>> onehot(np.array([1,2,4]), 6)
array([[0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0]], dtype=int32)
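
The encoding shown in these examples can be reproduced with a few lines of plain Python. This is a sketch on lists rather than ndarrays, and the name onehot_sketch is hypothetical. Deriving the class count as the largest label plus one is inferred from the first example above, not taken from the library's source.

```python
from typing import List, Optional, Sequence


def onehot_sketch(labels: Sequence[int], class_nums: Optional[int] = None) -> List[List[int]]:
    """Return one row per label, with a single 1 at the label's position."""
    if class_nums is None:
        # Inferred from the documented example: labels [1, 2, 4] give 5 columns.
        class_nums = max(labels) + 1
    return [[1 if column == label else 0 for column in range(class_nums)] for label in labels]


print(onehot_sketch([1, 2, 4]))
print(onehot_sketch([1, 2, 4], 6))
```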

socube.data.preprocess.items(data: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame¶

Convert a DataFrame to a dataframe with three columns: row, col and val

Parameters: data (pd.DataFrame) –
Returns: a dataframe with three columns: row, col and val
Return type: pandas.DataFrame

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.items(data)

socube.data.visualize module¶

socube.data.visualize.getHeatColor(intensity: float, topColor: str, bottomColor: str = '#ffffff') → str¶

Calculation of heat map color values based on intensity values

Parameters

intensity (float value) – color intensity values between 0 and 1
topColor (hexadecimal color string) – Color value when intensity value is 1
bottomColor (hexadecimal color string) – Color value when intensity value is 0, default is white

Returns: The hexadecimal color string corresponding to the intensity
Return type: str

Examples

>>> getHeatColor(0.5, "#ff0000")
'#ff7f7f'

socube.data.visualize.convertHexToRGB(hex_color: str) → Tuple[int]¶

Convert hexadecimal color strings to RGB tri-color integer tuples

Parameters: hex_color (hexadecimal color string, such as '#ff0000') –
Returns: RGB tri-color integer tuple
Return type: Tuple[int]

Examples

>>> convertHexToRGB('#ff0000')
(255, 0, 0)

socube.data.visualize.convertRGBToHex(color: Tuple[int]) → str¶

Convert RGB tricolor integer tuple to hexadecimal color string

Parameters: color (RGB tricolor integer tuple) – such as (255, 0, 0)
Returns: hexadecimal color string, such as ‘#ff0000’
Return type: str

Examples

>>> convertRGBToHex((255, 0, 0))
'#ff0000'
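
The three color helpers above fit together as a simple linear interpolation between two RGB colors. The sketch below is an illustration, not the library's code: the names hex_to_rgb, rgb_to_hex and heat_color are hypothetical, and truncating each channel to an integer is an assumption chosen because it reproduces the documented result '#ff7f7f' for an intensity of 0.5.

```python
from typing import Tuple


def hex_to_rgb(hex_color: str) -> Tuple[int, ...]:
    """Convert a string such as '#ff0000' to (255, 0, 0)."""
    digits = hex_color.lstrip("#")
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


def rgb_to_hex(color: Tuple[int, ...]) -> str:
    """Convert a tuple such as (255, 0, 0) to '#ff0000'."""
    return "#" + "".join(f"{channel:02x}" for channel in color)


def heat_color(intensity: float, top_color: str, bottom_color: str = "#ffffff") -> str:
    """Blend from bottom_color (intensity 0) to top_color (intensity 1)."""
    top = hex_to_rgb(top_color)
    bottom = hex_to_rgb(bottom_color)
    # Interpolate each channel linearly, then truncate to an integer (assumed).
    mixed = tuple(int(b + (t - b) * intensity) for t, b in zip(top, bottom))
    return rgb_to_hex(mixed)


print(heat_color(0.0, "#ff0000"))
print(heat_color(0.5, "#ff0000"))
print(heat_color(1.0, "#ff0000"))
```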

socube.data.visualize.plotScatter(data2d: pandas.core.frame.DataFrame, colormap: Dict[str, str], title: str, subtitle: str, filename: str = None, scatter_symbol: str = 'circle', width: int = 1000, height: int = 850, radius: int = 3, x_min: Optional[int] = None, y_min: Optional[int] = None, x_max: Optional[int] = None, y_max: Optional[int] = None, x_title: Optional[str] = None, y_title: Optional[str] = None)¶

Draw the scatter image of socube

Parameters

data2d (pandas.DataFrame) – The data to be plotted, with columns x, y, label and subtype. If subtype is a float, it is regarded as an intensity value, and the color is calculated from it.
colormap (Dict[str, str]) – The color map for the subtype, with the subtype name as key and a hexadecimal color string as value. If the subtype is a float, the colormap’s keys should be ‘0’ and ‘1’: ‘0’ for the low-intensity color and ‘1’ for the high-intensity color.
title (str) – The title of the plot
subtitle (str) – The subtitle of the plot
filename (str) – The filename of the plot; if None, the plot will not be saved. The format is HTML, and the filename extension is added automatically, so do not include it.
width (int) – The width of the plot, in pixels
height (int) – The height of the plot, in pixels
radius (int) – The radius of each scatter point, in pixels

Returns: The plot object

Examples

>>> import pandas as pd
>>> data2d = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6], 'label': ['a', 'b', 'c'], 'subtype': [0.5, 0.7, 0.9]})
>>> colormap = {'0': '#ff0000', '1': '#00ff00'}
>>> plotScatter(data2d, colormap, 'title', 'subtitle', 'test')

socube.data.visualize.plotGrid(data2d: pandas.core.frame.DataFrame, colormap: Dict[str, str], shape: Tuple[int], title: str, subtitle: str, filename: str, width: int = 1000, height: int = 850) → highcharts.highcharts.highcharts.Highchart¶

Draw socube’s Grid image

Parameters

data2d (pandas.DataFrame) – The data to be plotted, with columns x, y, label and subtype. If subtype is a float, it is regarded as an intensity value, and the color is calculated from it.
colormap (Dict[str, str]) – The color map for the subtype, with the subtype name as key and a hexadecimal color string as value. If the subtype is a float, the colormap’s keys should be ‘0’ and ‘1’: ‘0’ for the low-intensity color and ‘1’ for the high-intensity color.
shape (Tuple[int]) – The shape of the grid, (row, col)
title (str) – The title of the plot
subtitle (str) – The subtitle of the plot
filename (str) – The filename of the plot; if None, the plot will not be saved. The format is HTML, and the filename extension is added automatically, so do not include it.
width (int) – The width of the plot, in pixels
height (int) – The height of the plot, in pixels

Returns: The plot object
Return type: Highchart

Examples

>>> import pandas as pd
>>> data2d = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6], 'label': ['a', 'b', 'c'], 'subtype': [0.5, 0.7, 0.9]})
>>> colormap = {'0': '#ff0000', '1': '#00ff00'}
>>> plotGrid(data2d, colormap, (6, 3), 'title', 'subtitle', 'test')

socube.data.visualize.plotAUC(data: Dict[str, Tuple[Sequence]], title: str, xlabel: str, ylabel: str, file: Optional[str] = None, slash: int = 0)¶

Plot an AUC curve

Parameters

data (Dict[str, Tuple[Sequence]]) – The data to be plotted, with key as the name of the curve and value as the (x, y) data.
title (str) – The title of the plot
xlabel (str) – The xlabel of the plot
ylabel (str) – The ylabel of the plot
file (str) – The filename of the plot, if None, the plot will not be saved
slash (int) – If positive, a forward slash is plotted; if negative, a backslash is plotted.

Examples

>>> data = {'AUC': ([0.1, 0.2, 0.3], [0.1, 0.2, 0.3]), 'AUC2': ([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])}
>>> plotAUC(data, 'AUC', 'xlabel', 'ylabel', 'auc.png')

Module contents¶

socube.data.summary(data: pandas.core.frame.DataFrame, axis: int = 1) → pandas.core.frame.DataFrame¶

Data summary for each column or row.

Parameters

data (dataframe) – a dataframe with row and column
axis (int, default 1) – 0 to summarize each column, 1 to summarize each row

Returns: a dataframe with a summary for each column or row
Return type: pandas.DataFrame

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.summary(data)

socube.data.filterData(data: pandas.core.frame.DataFrame, filtered_gene_prop: float = 0.05, filtered_cell_prop: float = 0.05, mini_expr: float = 0.05, mini_library_size: int = 1000) → pandas.core.frame.DataFrame¶

Remove the genes and cells with the lowest variation, in the given proportions; remove genes whose average expression is less than mini_expr; and remove cells whose library size is less than mini_library_size.

Parameters

data (dataframe) – a dataframe whose rows are genes and whose columns are cells
filtered_gene_prop (float, default 0.05) – Proportion of genes with the lowest variation to remove
filtered_cell_prop (float, default 0.05) – Proportion of cells with the lowest variation to remove
mini_expr (float, default 0.05) – Remove genes whose average expression is less than mini_expr
mini_library_size (int, default 1000) – Remove cells whose library size is less than mini_library_size

Returns: a dataframe with the filtered genes and cells
Return type: pandas.DataFrame

socube.data.minmax(data: pandas.core.frame.DataFrame, range: Tuple[int] = (0, 1), flag: int = 0, dtype: str = 'float32') → pandas.core.frame.DataFrame¶

Perform maximum-minimum normalization

Parameters

data (dataframe) – a dataframe whose rows are samples and whose columns are features
range (tuple, default (0, 1)) – The minimum and maximum values of the normalized data; normalized to 0~1 by default
flag (int, default 0) – 0 for min-max by column, greater than 0 for min-max by row, less than 0 for global min-max.
dtype (str, default "float32") – The data type of the normalized data

Returns: a dataframe with normalized data
Return type: pandas.DataFrame

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.minmax(data)

socube.data.std(data: pandas.core.frame.DataFrame, horizontal: bool = False, dtype: str = 'float32', global_minmax: bool = False) → pandas.core.frame.DataFrame¶

Standardization of data

Parameters

data (dataframe) – a dataframe whose rows are samples and whose columns are features
horizontal (bool, default False) – If True, perform standardization horizontally
dtype (str, default "float32") – The data type of the standardized data
global_minmax (bool, default False) – If True, standardize globally; otherwise standardize by row or column

Returns: a dataframe with standardized data
Return type: pandas.DataFrame

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.std(data)

socube.data.cosineDistanceMatrix(x1: torch.Tensor, x2: torch.Tensor = None, device_name: str = 'cpu') → torch.Tensor¶

Calculate the cosine distance matrix between the two sets of samples.

Parameters

x1 (torch.Tensor) – a tensor of samples, with shape (n1, d)
x2 (torch.Tensor, default None) – a tensor of samples, with shape (n2, d), if None, x2 = x1
device_name (str, default "cpu") – the device used for accelerating the calculation such as “cpu”, “cuda:0”, “cuda:1”, etc.

Returns: a tensor containing the cosine distance matrix, with shape (n1, n2)
Return type: torch.Tensor

Examples

>>> import torch
>>> import socube.data.preprocess as pre
>>> x1 = torch.rand(10, 10)
>>> x2 = torch.rand(10, 10)
>>> pre.cosineDistanceMatrix(x1, x2)

socube.data.scatterToGrid(scatters2d: torch.Tensor, transform: Tuple[int] = (1, -1), device_name: str = 'cpu') → torch.Tensor¶

Map scattered coordinates to grid coordinates using the Jonker-Volgenant (J-V) linear assignment algorithm

Parameters

scatters2d (torch.Tensor) – a tensor of scattered coordinates, with shape (n, 2)
transform (tuple, default (1, -1)) – the transformation of the coordinates, such as (1, -1). A value greater than 0 keeps the direction of the corresponding coordinate; a value less than 0 reverses it. In scatter and grid plots the y-axis directions are often opposite, so the y-axis needs to be reversed for visual consistency.
device_name (str, default "cpu") – the device used for accelerating the calculation such as “cpu”, “cuda:0”, “cuda:1”, etc.

Returns: a tensor of grid coordinates, with shape (n, 2)
Return type: torch.Tensor

Examples

>>> import torch
>>> import socube.data.preprocess as pre
>>> scatters2d = torch.rand(10, 2)
>>> pre.scatterToGrid(scatters2d)

socube.data.umap2D(data: pandas.core.frame.DataFrame, metric: str = 'correlation', neighbors: int = 5, seed: int = None) → pandas.core.frame.DataFrame¶

Reducing high-dimensional data to 2D using UMAP

Parameters

data (pd.DataFrame) – a dataframe of high-dimensional data, with shape (n, d). n is the number of samples, d is the dimension of the data
metric (str, default 'correlation') – the metric used for calculating the distance between samples, such as ‘correlation’, ‘euclidean’, ‘manhattan’, etc.
neighbors (int, default 5) – the number of neighbors used for UMAP.
seed (int, default None) – the random seed used for UMAP.

Returns: a dataframe of two-dimensional data, with shape (n, 2)
Return type: pandas.DataFrame

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.umap2D(data)

socube.data.tsne2D(data: pandas.core.frame.DataFrame, metric: str = 'correlation', seed: int = None) → pandas.core.frame.DataFrame¶

Reducing high-dimensional data to 2D using t-SNE

Parameters

data (pd.DataFrame) – a dataframe of high-dimensional data, with shape (n, d). n is the number of samples, d is the dimension of the data
metric (str, default 'correlation') – the metric used for calculating the distance between samples, such as ‘correlation’, ‘euclidean’, ‘manhattan’, etc.
seed (int, default None) – the random seed used for t-SNE.

Returns: a dataframe of two-dimensional data, with shape (n, 2)
Return type: pandas.DataFrame

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.tsne2D(data)

socube.data.vec2Grid(vector: numpy.ndarray, shuffle: bool = False, seed: int = None) → pandas.core.frame.DataFrame¶

Converts a one-dimensional vector to a two-dimensional grid

Parameters

vector (np.ndarray) – a one-dimensional vector, with shape (n,). n is the number of samples
shuffle (bool, default False) – whether to shuffle the vector
seed (int, default None) – the random seed used for shuffling the vector

Returns: a dataframe of two-dimensional data, with shape (n, 2); each row represents a grid point with horizontal (x) and vertical (y) coordinates
Return type: pandas.DataFrame

Examples

>>> import numpy as np
>>> import socube.data.preprocess as pre
>>> vector = np.random.rand(10)
>>> pre.vec2Grid(vector)

socube.data.onehot(label: numpy.ndarray, class_nums: int = None) → numpy.ndarray¶

Convert a 1D multi-class label vector (each element is a sample’s label) to a one-hot matrix. Each label should be an integer.

Parameters

label (np.ndarray) – a one-dimensional integer vector, with shape (n,). n is the number of samples
class_nums (int, default None) – the number of classes. If None, the number of classes is automatically determined.

Returns: an ndarray holding the one-hot matrix, with shape (n, class_nums)
Return type: numpy.ndarray

Examples

>>> onehot(np.array([1,2,4]))
array([[0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1]], dtype=int32)

>>> onehot(np.array([1,2,4]), 6)
array([[0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0]], dtype=int32)

socube.data.items(data: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame¶

Convert a DataFrame to a dataframe with three columns: row, col and val

Parameters: data (pd.DataFrame) –
Returns: a dataframe with three columns: row, col and val
Return type: pandas.DataFrame

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.items(data)

class socube.data.DatasetBase(labels: pandas.core.frame.DataFrame, shuffle: bool = False, seed: int = None, k: int = 5, task_type: str = 'classify')¶

Bases: torch.utils.data.dataset.Dataset

Abstract base class for datasets. All SoCube extended datasets must inherit and implement its abstract interface.

Parameters

labels (pd.DataFrame) – Dataframe containing labels for each sample.
shuffle (bool, default False) – Whether to shuffle each class’s samples before splitting into batches. Note that the samples within each split will not be shuffled.
seed (int, default None) – Random seed for shuffling.
k (int, default 5) – Number of folds for k-fold cross-validation.
task_type (str, default "classify") – Type of task. Must be one of “classify”, “regress”.

property kFold¶

Get generator for k-fold cross-validation dataset

Returns: kFold – A generator for k-fold cross-validation datasets. Each iteration yields a tuple of two Subset objects, one for training and one for validation.
Return type: generator

abstract sampler(subset: torch.utils.data.dataset.Subset) → torch.utils.data.sampler.Sampler¶

Abstract method for sampling a subset of this dataset.

Parameters: subset (Subset) – A subset of this dataset.
Returns: sampler – A sampler for the subset.
Return type: Sampler

class socube.data.ConvDatasetBase(data_dir: str, labels: pandas.core.frame.DataFrame, transform: torch.nn.modules.module.Module = None, shuffle: bool = False, seed: int = None, k: int = 5, task_type: str = 'classify', use_index: bool = True)¶

Bases: socube.data.loading.DatasetBase

Basic dataset designed for CNNs.

Parameters

data_dir (str) – Path to the directory containing dataset.
labels (pd.DataFrame) – Dataframe containing labels for each sample.
transform (torch.nn.Module, default None) – Transform to apply to each sample.
shuffle (bool, default False) – Whether to shuffle each class’s samples before splitting into batches.
seed (int, default None) – Random seed for shuffling.
k (int, default 5) – Number of folds for k-fold cross-validation.
task_type (str, default "classify") – Type of task. Must be one of “classify”, “regress”.
use_index (bool, default True) – Whether to use the numeric index as the sample file name, such as “0.npy”. If False, the sample name in the labels is used as the file name instead, such as “sample_name.npy”.

socube.data.getHeatColor(intensity: float, topColor: str, bottomColor: str = '#ffffff') → str¶

Calculation of heat map color values based on intensity values

Parameters

intensity (float value) – color intensity values between 0 and 1
topColor (hexadecimal color string) – Color value when intensity value is 1
bottomColor (hexadecimal color string) – Color value when intensity value is 0, default is white

Returns: The hexadecimal color string corresponding to the intensity
Return type: str

Examples

>>> getHeatColor(0.5, "#ff0000")
'#ff7f7f'

socube.data.convertHexToRGB(hex_color: str) → Tuple[int]¶

Convert hexadecimal color strings to RGB tri-color integer tuples

Parameters: hex_color (hexadecimal color string, such as '#ff0000') –
Returns: RGB tri-color integer tuple
Return type: Tuple[int]

Examples

>>> convertHexToRGB('#ff0000')
(255, 0, 0)

socube.data.convertRGBToHex(color: Tuple[int]) → str¶

Convert RGB tricolor integer tuple to hexadecimal color string

Parameters: color (RGB tricolor integer tuple) – such as (255, 0, 0)
Returns: hexadecimal color string, such as ‘#ff0000’
Return type: str

Examples

>>> convertRGBToHex((255, 0, 0))
'#ff0000'

socube.data.plotScatter(data2d: pandas.core.frame.DataFrame, colormap: Dict[str, str], title: str, subtitle: str, filename: str = None, scatter_symbol: str = 'circle', width: int = 1000, height: int = 850, radius: int = 3, x_min: Optional[int] = None, y_min: Optional[int] = None, x_max: Optional[int] = None, y_max: Optional[int] = None, x_title: Optional[str] = None, y_title: Optional[str] = None)¶

Draw the scatter image of socube

Parameters

data2d (pandas.DataFrame) – The data to be plotted, with columns x, y, label and subtype. If subtype is a float, it is regarded as an intensity value, and the color is calculated from it.
colormap (Dict[str, str]) – The color map for the subtype, with the subtype name as key and a hexadecimal color string as value. If the subtype is a float, the colormap’s keys should be ‘0’ and ‘1’: ‘0’ for the low-intensity color and ‘1’ for the high-intensity color.
title (str) – The title of the plot
subtitle (str) – The subtitle of the plot
filename (str) – The filename of the plot; if None, the plot will not be saved. The format is HTML, and the filename extension is added automatically, so do not include it.
width (int) – The width of the plot, in pixels
height (int) – The height of the plot, in pixels
radius (int) – The radius of each scatter point, in pixels

Returns: The plot object

Examples

>>> import pandas as pd
>>> data2d = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6], 'label': ['a', 'b', 'c'], 'subtype': [0.5, 0.7, 0.9]})
>>> colormap = {'0': '#ff0000', '1': '#00ff00'}
>>> plotScatter(data2d, colormap, 'title', 'subtitle', 'test')

socube.data.plotGrid(data2d: pandas.core.frame.DataFrame, colormap: Dict[str, str], shape: Tuple[int], title: str, subtitle: str, filename: str, width: int = 1000, height: int = 850) → highcharts.highcharts.highcharts.Highchart¶

Draw socube’s Grid image

Parameters

data2d (pandas.DataFrame) – The data to be plotted, with columns x, y, label and subtype. If subtype is a float, it is regarded as an intensity value, and the color is calculated from it.
colormap (Dict[str, str]) – The color map for the subtype, with the subtype name as key and a hexadecimal color string as value. If the subtype is a float, the colormap’s keys should be ‘0’ and ‘1’: ‘0’ for the low-intensity color and ‘1’ for the high-intensity color.
shape (Tuple[int]) – The shape of the grid, (row, col)
title (str) – The title of the plot
subtitle (str) – The subtitle of the plot
filename (str) – The filename of the plot; if None, the plot will not be saved. The format is HTML, and the filename extension is added automatically, so do not include it.
width (int) – The width of the plot, in pixels
height (int) – The height of the plot, in pixels

Returns: The plot object
Return type: Highchart

Examples

>>> import pandas as pd
>>> data2d = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6], 'label': ['a', 'b', 'c'], 'subtype': [0.5, 0.7, 0.9]})
>>> colormap = {'0': '#ff0000', '1': '#00ff00'}
>>> plotGrid(data2d, colormap, (6, 3), 'title', 'subtitle', 'test')

socube.data.plotAUC(data: Dict[str, Tuple[Sequence]], title: str, xlabel: str, ylabel: str, file: Optional[str] = None, slash: int = 0)¶

Plot an AUC curve

Parameters

data (Dict[str, Tuple[Sequence]]) – The data to be plotted, with key as the name of the curve and value as the (x, y) data.
title (str) – The title of the plot
xlabel (str) – The xlabel of the plot
ylabel (str) – The ylabel of the plot
file (str) – The filename of the plot, if None, the plot will not be saved
slash (int) – If positive, a forward slash is plotted; if negative, a backslash is plotted.

Examples

>>> data = {'AUC': ([0.1, 0.2, 0.3], [0.1, 0.2, 0.3]), 'AUC2': ([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])}
>>> plotAUC(data, 'AUC', 'xlabel', 'ylabel', 'auc.png')