{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Dataset preparation\n", "\n", "This tutorial is a step-by-step guide to preparing your single-cell datasets for training and subsequent analysis with UNAGI. UNAGI assumes that the dataset is time-series single-cell data and that the time point of each cell is known. You must therefore annotate every cell with its time point before using UNAGI.\n", "\n", "This example shows how to add time-point attributes to the AnnData object. Time points should be ordered sequentially as [0, 1, 2, ..., n], from the earliest time point (0) to the latest (n)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load the data\n", "import warnings\n", "import scanpy as sc\n", "import os\n", "warnings.filterwarnings('ignore')\n", "PATH_TO_YOUR_DATA = 'your_data.h5ad'\n", "adata = sc.read(PATH_TO_YOUR_DATA)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following code assigns a stage label to each batch according to its time point. This example assumes the time-series dataset has 3 batches, each from a different time point.\n", "\n", "**Note:** UNAGI requires at least 3 time points ($\\geq$ 3)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use 'stage' as the key for the stage information in adata.obs\n", "stage_key = 'stage' # change this to whatever you want\n", "adata.obs[stage_key] = None\n", "\n", "# Assume the batch information is in adata.obs['batch'] and the batch names are batch1, batch2, batch3, ...\n", "# Change the following code according to your data\n", "adata.obs.loc[adata.obs['batch'] == 'batch1', stage_key] = '0'\n", "adata.obs.loc[adata.obs['batch'] == 'batch2', stage_key] = '1'\n", "adata.obs.loc[adata.obs['batch'] == 'batch3', stage_key] = '2'\n", "#...." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After adding the time-point information, you can either write the whole dataset to disk or split it into individual stages and write each stage to disk separately." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Option 1: Save the whole dataset to disk\n", "adata.write(PATH_TO_YOUR_DATA, compression='gzip', compression_opts=9)\n", "\n", "# Option 2: Separate the data into individual stages and save each one\n", "import os\n", "dir_name = os.path.dirname(PATH_TO_YOUR_DATA)\n", "\n", "for each in list(adata.obs[stage_key].unique()):\n", "    stage_adata = adata[adata.obs[stage_key] == each]\n", "    # os.path.join also handles an empty dir_name (data file in the current directory)\n", "    stage_adata.write(os.path.join(dir_name, f'{each}.h5ad'), compression='gzip', compression_opts=9)" ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 2 }