1. TCM-ID is a comprehensive resource for TCM research
TCM-ID is a key data resource for facilitating research and clinical investigation of Traditional Chinese Medicine (TCM). It also supports the authentication of TCM herbs, links to healthcare big data (via ICD-11 traditional medicine condition codes), and bridges to AI tools (via molecular representations recognizable by AI tools). TCM-ID was first launched in 2005 and is maintained by the Bioinformatics & Drug Design (BIDD) group in the Department of Pharmacy, National University of Singapore.
2. Data Sources & Preprocessing
2.1 Data sources
2.1.1 TCM prescriptions and treated diseases collection
A total of 7,443 TCM prescriptions, with information on prescription name, composition, functions, symptoms, modern diseases treated, and traditional medicine disorders and patterns, were manually collected from the China Food and Drug Administration (http://samr.cfda.gov.cn/WS01/CL0001/), Chinese Pharmacopoeia (2015) (ISBN 978-7-5067-7337-9), Chinese Classical Prescriptions (ISBN 7-5023-3930-2), Current Kampo Medicine (ISSN 1559-033X), the National Administration of Traditional Chinese Medicine (http://www.satcm.gov.cn/), and the Taiwan Herbal Pharmacopeia (https://www.mohw.gov.tw/cp-3690-39025-2.html). In particular, a total of 366 ICD-11 (International Classification of Diseases 11th Revision, https://icd.who.int/en/) codes are provided for 4,601 prescriptions in the current TCM-ID.
2.1.2 TCM Component
This part contains the Chinese name, Latin name, English name, flavor and meridian tropism, therapeutic and toxic information, geo-authentic habitats, and plant DNA barcodes. A total of 2,751 components, including about 1,400 commonly used herbs, were mainly integrated from TCM-ID (Version 1), A Comprehensive Chinese-Latin-English Dictionary of the Names of Chinese Herbal Medicines (ISBN: 7-5062-3971-X), SymMap (http://www.symmap.org/), and ETCM (http://www.nrc.ac.cn:9090/ETCM/). In particular, plant DNA barcodes for herbal medicine plants were collected from the Barcode of Life Data System (http://www.boldsystems.org/) and GenBank (https://www.ncbi.nlm.nih.gov/genbank/), and geo-authentic habitat information was manually collected from the Atlas of Genuine Herbs (ISBN 7-5335-2075-0, ISBN 7-5335-2074-2, ISBN 7-5335-2073-4, ISBN 7-5335-2072-6).
2.1.3 Chemical ingredients and targeted genes
A total of 7,375 herbal ingredients in the current TCM-ID were integrated from PubChem (https://pubchem.ncbi.nlm.nih.gov/), ChEMBL (https://www.ebi.ac.uk/chembl/), NPASS (http://bidd2.nus.edu.sg/NPASS/), ETCM (http://www.nrc.ac.cn:9090/ETCM/), and SymMap (http://www.symmap.org/), or extracted from published papers. The 2D structures of the chemical ingredients were integrated from the PubChem, ChEMBL, and NPASS databases. Furthermore, 1,292 ingredients with experimental activity (<=10 uM against human proteins; <=100 uM or 100 ug/ml against pathogenic microbes) and their 768 targets (human proteins: 463, pathogenic microbes: 305) were obtained from the NPASS database. Classification information and external links to the UniProt IDs of all targets are also included.
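The activity cutoffs quoted above can be expressed as a small filter. The following is only an illustrative sketch: the record layout, the field names, and the "human"/"microbe" and "uM"/"ug/ml" labels are assumptions made for this example and are not the NPASS or TCM-ID schema. Only the numeric thresholds come from the text.

```python
from dataclasses import dataclass

# Thresholds quoted in the text above.
HUMAN_MAX_UM = 10.0        # <=10 uM against human proteins
MICROBE_MAX_UM = 100.0     # <=100 uM against pathogenic microbes
MICROBE_MAX_UG_ML = 100.0  # or <=100 ug/ml against pathogenic microbes


@dataclass(frozen=True)
class Activity:
    """Hypothetical activity record; field names are assumptions."""
    ingredient: str
    target_type: str  # "human" or "microbe" (hypothetical labels)
    value: float
    unit: str         # "uM" or "ug/ml" (hypothetical labels)


def is_active(rec: Activity) -> bool:
    """Return True if the record meets the cutoff for its target type."""
    if rec.target_type == "human":
        return rec.unit == "uM" and rec.value <= HUMAN_MAX_UM
    if rec.target_type == "microbe":
        if rec.unit == "uM":
            return rec.value <= MICROBE_MAX_UM
        if rec.unit == "ug/ml":
            return rec.value <= MICROBE_MAX_UG_ML
    return False  # unknown target type or unit: do not count as active


def active_ingredients(records):
    """Sorted, de-duplicated names of ingredients with at least one qualifying record."""
    return sorted({r.ingredient for r in records if is_active(r)})
```

Records with an unrecognized unit or target type are rejected rather than guessed at, which keeps the filter conservative.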
2.1.4 Human samples (Healthy & Disease) with Gene/Protein Expression Measurement
A total of 27,631 human samples (healthy and disease) with gene expression measurements, covering 56 diseases, were preliminarily extracted by scripts from GEO (https://www.ncbi.nlm.nih.gov/geo/), ARCHS4 (https://amp.pharm.mssm.edu/archs4/), and GTEx (https://gtexportal.org/home/). Healthy human samples with protein expression measurements were obtained from the Human Proteome Map (http://www.humanproteomemap.org/). Human-derived cell line samples were also included.
All experiments’ summaries or study design descriptions were manually checked to remove samples from animal models of human diseases. In addition, tissue information was extracted from samples’ metadata provided by databases or manually curated from publications. Duplicated donors or samples were removed since the same donor or sample might be included in multiple datasets or be deposited into different databases. Available omics data types for each sample were marked.
2.2 Preprocessing of chemical data
Canonical SMILES of all collected chemical ingredients were obtained from PubChem (https://pubchem.ncbi.nlm.nih.gov/). Open Babel was used to generate InChI and InChIKeys from the canonical SMILES. Duplicates were removed by comparing the molecules’ InChIKeys, which are nearly unique identifiers of structures. After these preprocessing steps, the molecular formula and a total of 11 types of molecular descriptors (molecular weight, alogP, mlogP, xlogP, number of hydrogen bond acceptors, number of hydrogen bond donors, polar surface area, number of rotatable bonds, number of rings, number of heavy atoms, and number of failures of Lipinski's Rule of 5) were generated using PaDEL software (PMID: 29636450).
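Two of the steps above, duplicate removal by InChIKey and counting Rule of 5 failures, are simple enough to sketch in plain Python. This is a minimal illustration and not the pipeline actually used: it assumes the InChIKeys were already generated (for example with Open Babel) and stored under a hypothetical "inchikey" field, and it applies the commonly cited Rule of 5 limits (molecular weight <= 500, logP <= 5, hydrogen bond donors <= 5, hydrogen bond acceptors <= 10). Which logP variant PaDEL uses for its own count is not stated in the text.

```python
def dedupe_by_inchikey(molecules):
    """Keep the first record seen for each InChIKey, preserving input order.

    `molecules` is a list of dicts with an "inchikey" entry
    (hypothetical field name chosen for this sketch).
    """
    seen = set()
    unique = []
    for mol in molecules:
        key = mol["inchikey"].strip().upper()  # normalize before comparing
        if key not in seen:
            seen.add(key)
            unique.append(mol)
    return unique


def lipinski_failures(mol_weight, logp, h_donors, h_acceptors):
    """Count how many of the four Rule of 5 criteria are violated."""
    return sum([
        mol_weight > 500,   # molecular weight above 500 Da
        logp > 5,           # logP above 5
        h_donors > 5,       # more than 5 hydrogen bond donors
        h_acceptors > 10,   # more than 10 hydrogen bond acceptors
    ])
```

Keeping the first occurrence means that the order in which source databases are loaded decides which record survives; a real pipeline might prefer one source over another.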
2.3 Processing of protein expression data
The processing followed Kim et al., A draft map of the human proteome, 2014, Nature, 509, 575-581. PMID: 24870542
2.4 Processing of gene expression data
Available SRA files from RNA-seq experiments were obtained from GEO. FASTQ files were extracted from the SRA data format, and reads were aligned with Kallisto. The R Bioconductor package ‘preprocessCore’ was used for quantile normalization, and the package ‘sva’ was used to correct batch effects. The processing followed: Massive mining of publicly available RNA-seq data from human and mouse, 2018, Nature Communications, 9(1), 1366. PMID: 29636450
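To make the quantile normalization step concrete, the following is a minimal pure-Python sketch of the basic technique: every sample (column) is forced onto the same reference distribution, defined as the mean of the k-th smallest values across samples. It is an illustration only and not the ‘preprocessCore’ implementation; in particular, ties are broken here by row order, and the R package may handle ties and missing values differently. The function name and the list-of-rows input layout are assumptions for this example.

```python
def quantile_normalize(matrix):
    """Quantile-normalize the columns of a genes x samples matrix.

    `matrix` is a non-empty list of rows (genes), each a list of
    per-sample values. Returns a new matrix of the same shape.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]

    # For each column, row indices ordered from smallest to largest value.
    orders = [sorted(range(n_rows), key=col.__getitem__) for col in columns]

    # Reference distribution: mean across samples of the k-th smallest value.
    rank_means = [
        sum(columns[j][orders[j][k]] for j in range(n_cols)) / n_cols
        for k in range(n_rows)
    ]

    # Replace each value by the reference value for its rank in its column.
    normalized = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        for k, i in enumerate(orders[j]):
            normalized[i][j] = rank_means[k]
    return normalized
```

After normalization every column contains exactly the same set of values, so differences between samples reflect only the ranking of genes within each sample.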
2.5 Target Expression Heatmap Generation
2.6 Violin Plot Generation For Human Gene Expression Samples
Gene expression of individual patient and healthy samples is presented as violin plots generated with the R package “ggstatsplot”.
3. How to Use TCM-ID?
Please refer to the "Help" page for detailed instructions.