Datasets

What's on this page?

Welcome to the datasets page, where the datasets developed in our publications are available for bulk download. The currently downloadable datasets are listed below.
If you use these datasets, please consider citing this resource (along with the relevant tool paper):
@misc{insidata,
author = {Blaž Škrlj and Janez Konc},
title = {{Insidrug.org: Datasets}: Structural databases for drug design},
howpublished = {\url{http://insilab.org/datasets}},
month = sep,
year = 2018
}

The PROTEUS Dataset

The PROTEUS dataset consists of docking-ready binding sites on proteins for small molecules, classified as cofactors (e.g., ATP, NAD) and other small molecules (e.g., substrates, receptor agonists). It covers the Protein Data Bank and is ready for use with the PROTEUS (inverse) docking software, enabling proteome-wide target prediction. With minor modifications, it can also be used with other docking tools.

The binding sites are accurately defined for each query protein using multiple centroids (x, y, z, radius) that together capture different binding site shapes. Centroids are derived from (multiple) template ligands obtained from similar binding sites in the PDB, so that the space occupied by all these template ligands is considered the binding site. Each dataset entry contains a receptor, centroids, and template ligands. The receptor can be a single chain or a multi-protein complex, while the centroids define the location of the binding site on the receptor. Template ligands have been transposed from similar binding sites and are used in the scoring function of PROTEUS. The dataset uniquely enables docking into substrate or cofactor binding sites, facilitating the discovery of both types of competitive inhibitors. It is further divided into subsets that enable selective docking against different species or different protein classes.

Subset | Description/Source | Small Molecule* | Cofactor
All | All proteins in the Protein Data Bank | all.pro.gz | cof_all.pro.gz
Human | Human proteins, determined as those where the source organism is HOMO SAPIENS | human.pro.gz | cof_human.pro.gz
Cancer | Cancer-related human genes obtained from The Human Protein Atlas (https://www.proteinatlas.org) | cancer.pro.gz | cof_cancer.pro.gz
Human kinases | EC numbers: 1.1.1.3, 1.2.1.38, 2.1.2.-, 2.1.7.127, 2.3.1.1, 2.3.1.48, 2.4.2.9, 2.5.1.15, 2.5.1.3, 2.7.-.-, 2.7.1.-, 2.7.10.-, 2.7.10.1, 2.7.10.2, 2.7.1.1, 2.7.11.-, 2.7.1.100, 2.7.1.105, 2.7.1.107, 2.7.1.11, 2.7.11.1, 2.7.11.10, 2.7.11.11, 2.7.1.112, 2.7.11.12, 2.7.1.113, 2.7.11.13, 2.7.11.14, 2.7.1.115, 2.7.11.15, 2.7.11.16, 2.7.11.17, 2.7.11.19, 2.7.1.12, 2.7.11.2, 2.7.11.21, 2.7.11.22, 2.7.1.123, 2.7.11.23, 2.7.11.24, 2.7.11.25, 2.7.1.126, 2.7.11.26, 2.7.1.127, 2.7.11.27, 2.7.1.130, 2.7.11.30, 2.7.11.31, 2.7.11.32, 2.7.1.134, 2.7.1.137, 2.7.11.4, 2.7.1.140, 2.7.1.143, 2.7.1.144, 2.7.1.145, 2.7.1.146, 2.7.1.147, 2.7.1.148, 2.7.1.149, 2.7.1.15, 2.7.11.5, 2.7.1.151, 2.7.1.153, 2.7.1.154, 2.7.1.158, 2.7.1.159, 2.7.1.16, 2.7.1.161, 2.7.1.162, 2.7.1.163, 2.7.1.164, 2.7.1.167, 2.7.1.17, 2.7.11.7, 2.7.1.170, 2.7.1.173, 2.7.1.175, 2.7.1.183, 2.7.1.189, 2.7.1.19, 2.7.1.2, 2.7.1.20, 2.7.1.21, 2.7.12.1, 2.7.1.22, 2.7.12.2, 2.7.1.23, 2.7.1.24, 2.7.1.25, 2.7.1.26, 2.7.1.27, 2.7.1.29, 2.7.1.3, 2.7.1.30, 2.7.1.31, 2.7.13.1, 2.7.1.32, 2.7.1.33, 2.7.13.3, 2.7.1.35, 2.7.1.36, 2.7.1.37, 2.7.1.38, 2.7.1.39, 2.7.1.4, 2.7.1.40, 2.7.1.45, 2.7.1.48, 2.7.1.49, 2.7.1.5, 2.7.1.50, 2.7.1.51, 2.7.1.55, 2.7.1.56, 2.7.1.58, 2.7.1.59, 2.7.1.6, 2.7.1.60, 2.7.1.63, 2.7.1.64, 2.7.1.67, 2.7.1.68, 2.7.1.69, 2.7.1.71, 2.7.1.74, 2.7.1.76, 2.7.1.78, 2.7.1.82, 2.7.1.85, 2.7.1.86, 2.7.1.90, 2.7.1.91, 2.7.1.95, 2.7.1.99, 2.7.2.-, 2.7.2.1, 2.7.2.11, 2.7.2.12, 2.7.2.15, 2.7.2.2, 2.7.2.3, 2.7.2.4, 2.7.2.7, 2.7.2.8, 2.7.3.-, 2.7.3.1, 2.7.3.2, 2.7.3.3, 2.7.3.4, 2.7.3.5, 2.7.3.9, 2.7.4.-, 2.7.4.1, 2.7.4.10, 2.7.4.13, 2.7.4.14, 2.7.4.16, 2.7.4.2, 2.7.4.21, 2.7.4.22, 2.7.4.24, 2.7.4.25, 2.7.4.27, 2.7.4.3, 2.7.4.4, 2.7.4.6, 2.7.4.7, 2.7.4.8, 2.7.4.9, 2.7.6.1, 2.7.6.2, 2.7.6.3, 2.7.6.5, 2.7.7.10, 2.7.7.12, 2.7.7.2, 2.7.7.4, 2.7.7.6, 2.7.7.62, 2.7.9.1, 2.7.9.2, 2.7.9.3, 3.1.26.-, 3.1.3.-, 3.1.3.46, 3.1.3.48, 3.1.3.5, 3.4.21.-, 3.4.24.57, 3.6.-.-, 3.6.4.12, 3.6.5.2, 4.-.-.-, 4.1.1.32, 4.1.1.33, 4.1.2.25, 4.1.2.40, 5.3.1.8, 5.3.1.9 | kinases.pro.gz | cof_kinases.pro.gz
Animals | NCBI taxonomy: invertebrates, vertebrates, mammals, primates, rodents | animals.pro.gz | cof_animals.pro.gz
Bacteria | NCBI taxonomy: bacteria | bacteria.pro.gz | cof_bacteria.pro.gz
Pathogenic bacteria | Source organism: BACILLUS ANTHRACIS, BACILLUS CEREUS, BARTONELLA HENSELAE, BARTONELLA QUINTANA, BORDETELLA PERTUSSIS, BORRELIA BURGDORFERI, BORRELIA GARINII, BORRELIA AFZELII, BORRELIA RECURRENTIS, BRUCELLA ABORTUS, BRUCELLA CANIS, BRUCELLA MELITENSIS, BRUCELLA SUIS, CAMPYLOBACTER JEJUNI, CHLAMYDIA PNEUMONIAE, CHLAMYDIA TRACHOMATIS, CHLAMYDOPHILA PSITTACI, CLOSTRIDIUM BOTULINUM, CLOSTRIDIUM DIFFICILE, CLOSTRIDIUM PERFRINGENS, CLOSTRIDIUM TETANI, CORYNEBACTERIUM DIPHTHERIAE, ENTEROCOCCUS FAECALIS, ENTEROCOCCUS FAECIUM, ESCHERICHIA COLI, FRANCISELLA TULARENSIS, HAEMOPHILUS INFLUENZAE, HELICOBACTER PYLORI, LEGIONELLA PNEUMOPHILA, LEPTOSPIRA INTERROGANS, LEPTOSPIRA SANTAROSAI, LEPTOSPIRA WEILII, LEPTOSPIRA NOGUCHII, MYCOBACTERIUM LEPRAE, MYCOBACTERIUM TUBERCULOSIS, MYCOBACTERIUM ULCERANS, MYCOPLASMA PNEUMONIAE, NEISSERIA GONORRHOEAE, NEISSERIA MENINGITIDIS, PSEUDOMONAS AERUGINOSA, RICKETTSIA RICKETTSII, SALMONELLA TYPHI, SALMONELLA TYPHIMURIUM, SHIGELLA SONNEI, STAPHYLOCOCCUS AUREUS, STAPHYLOCOCCUS EPIDERMIDIS, STAPHYLOCOCCUS SAPROPHYTICUS, STREPTOCOCCUS AGALACTIAE, STREPTOCOCCUS PNEUMONIAE, STREPTOCOCCUS PYOGENES, TREPONEMA PALLIDUM, UREAPLASMA UREALYTICUM, VIBRIO CHOLERAE, YERSINIA PESTIS, YERSINIA ENTEROCOLITICA, YERSINIA PSEUDOTUBERCULOSIS | patho_bacteria.pro.gz | cof_patho_bacteria.pro.gz
Viruses | NCBI taxonomy: viruses | viruses.pro.gz | cof_viruses.pro.gz
Plants and Fungi | NCBI taxonomy: plants and fungi | plants_fungi.pro.gz | cof_plants_fungi.pro.gz
* Binding sites for small organic molecules, excluding cofactor binding sites.
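
As a rough illustration of the centroid representation described above, the sketch below treats a binding site as the union of spheres given by (x, y, z, radius) tuples and tests whether a point falls inside it. The tuple layout, the function name in_binding_site, and the example coordinates are assumptions made for illustration only. The on-disk layout of the .pro.gz files is not described on this page, and this is not the PROTEUS implementation.

```python
import math
from typing import Iterable, Tuple

# (x, y, z, radius): field order follows the text above; everything else
# about this representation is an assumption of this sketch.
Centroid = Tuple[float, float, float, float]
Point = Tuple[float, float, float]


def in_binding_site(point: Point, centroids: Iterable[Centroid]) -> bool:
    """Return True if `point` lies inside at least one centroid sphere."""
    return any(
        math.dist(point, (cx, cy, cz)) <= radius
        for cx, cy, cz, radius in centroids
    )


# Hypothetical centroids, not taken from the dataset: two overlapping
# spheres that together describe an elongated pocket.
site = [(0.0, 0.0, 0.0, 2.0), (3.0, 0.0, 0.0, 2.0)]

inside = in_binding_site((4.5, 0.0, 0.0), site)   # covered by the second sphere
outside = in_binding_site((0.0, 5.0, 0.0), site)  # covered by neither sphere
```

Using several spheres instead of one bounding sphere is what lets such a description follow non-spherical pockets, at the cost of one distance check per centroid.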

The GenProBiS Dataset

The GenProBiS dataset makes it possible to determine whether a sequence variant lies in a binding site and whether it could disrupt ligand binding. The dataset consists of mappings of sequence variants -> protein structures from the Protein Data Bank -> predicted (ProBiS) and experimental binding sites -> predicted and experimental ligands, classified as proteins, nucleic acids, compounds, ions, and conserved waters.

Name | Description | Download link
GenProBiSdb | Whole dataset dump in TSV format | genprobisdb.tsv.gz
MappingSeqVarToProteins | Mappings of sequence variants from UniProt numbering to PDB numbering, for each protein structure separately, in JSON format | Opens folder
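
A whole-dataset dump is usually easier to process as a stream than to load at once. The sketch below is one way to do that, assuming genprobisdb.tsv.gz is a gzip-compressed, tab-separated text file whose first line holds the column names. The column layout is not documented on this page, so the helper name iter_tsv_gz and the header assumption are hypothetical; if the file has no header line, csv.reader would be the better fit.

```python
import csv
import gzip
from typing import Dict, Iterator


def iter_tsv_gz(path: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per data row, keyed by the names in the header line.

    Assumes a gzip-compressed, tab-separated file with a header row.
    """
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps literal quote characters in fields untouched,
        # which is the safer default for database dumps.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        yield from reader


# Example use (the path is wherever the downloaded file was saved):
# for row in iter_tsv_gz("genprobisdb.tsv.gz"):
#     print(row)
#     break
```

Because the function is a generator, only one row is held in memory at a time, which matters for a dump that covers the whole Protein Data Bank.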

The ProBiS-ligands Dataset

The ProBiS-ligands dataset consists of all-against-all local structural alignments of the proteins in the Protein Data Bank, together with information derived from them. For each protein chain, it contains the local superimpositions with every protein chain (these can extend far below 30% sequence identity). It also contains ligands (proteins, nucleic acids, compounds, cofactors, ions, and waters) transposed from the locally similar structures onto the query protein chain; binding site residues; evolutionary conservation for each protein residue, calculated from the structural alignments; interactions between transposed ligands and the query protein chain; and grids and centroids of predicted binding sites.

Name | Description | Download link
LocalStructuralAlignments | Local structural alignments of protein chains for the entire PDB, calculated using the ProBiS algorithm | Opens folder
TranposedLigands | Predicted ligands (by binding site similarity) for each protein chain in the PDB. The ligands are clustered by spatial distance, and each cluster represents one binding site | Opens folder
BindingSiteResidues | Binding site residues on proteins, based on binding sites defined as clusters of transposed ligands | Opens folder
EvolutionaryConservation | Evolutionary conservation level for each protein residue; levels range from 0 (not conserved) to 9 (highly conserved) | Opens folder
LigandProteinInteractions | Interactions between proteins and predicted (transposed) ligands | Opens folder
BindingSiteGrids | Binding site grids that accurately describe the identified binding sites | Opens folder
BindingSiteCentroids | Centroids that describe the space taken by the identified binding sites (similar to grids, but more sparsely) | Opens folder
ProteinSuperimpositions | Local superimpositions of protein chains as defined by the local alignments | Opens folder
BindingSiteSurfaces | Triangulated surfaces of the identified binding sites | Opens folder

The BoBER Dataset

The BoBER (Base of Bioisosteric Replacements) dataset includes bioisosteres (different fragments that should have similar binding properties) detected by mining the Protein Data Bank using the ProBiS algorithm. Two fragments are defined as bioisosteric if they were found to bind to structurally (geometrically and physicochemically) similar binding sites.

Name | Description | Download link
BoBERdb | Bioisosteric replacements discovered by mining the PDB | BoBER.zip

The Complex Biological Networks

These are the complex biological networks that we constructed in our studies. They may be used as benchmarks for network exploration algorithms. Networks are given in a standard edge-list format, where each line contains one weighted edge:

Vertex#1 Vertex#2 Weight
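
The edge-list format above can be read with a few lines of code. The following is a minimal sketch, assuming fields are separated by whitespace, vertex names contain no spaces, and edges are undirected; none of these details are stated on this page, and the function name read_weighted_edges is hypothetical.

```python
from typing import Dict, Iterable


def read_weighted_edges(lines: Iterable[str]) -> Dict[str, Dict[str, float]]:
    """Build an adjacency map from 'vertex1 vertex2 weight' lines.

    Assumes whitespace-separated fields and undirected edges.
    """
    graph: Dict[str, Dict[str, float]] = {}
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"line {number}: expected 3 fields, got {len(fields)}")
        source, target, weight = fields[0], fields[1], float(fields[2])
        # Store the edge in both directions (undirected assumption).
        graph.setdefault(source, {})[target] = weight
        graph.setdefault(target, {})[source] = weight
    return graph


# Made-up edges for illustration, not taken from the published networks.
example = read_weighted_edges(["siteA siteB 0.5", "", "siteB siteC 2"])
```

The function accepts any iterable of lines, so an open file handle can be passed directly once a network file has been downloaded.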

Study name | Description | Citation | Download link
Insights from Ion Binding Site Network Analysis into Evolution and Functions of Proteins | Many biological phenomena can be represented as complex networks. Using a protein binding site comparison approach, we generated a network of ion binding sites on the scale of all known protein structures from the Protein Data Bank. | Skrlj et al. Mol. Inform., 2018 | Link to the network (Table S6)
Global organization of a binding site network gives insight into evolution and structure-function relationships of proteins | The global organization of protein binding sites is analyzed by constructing a weighted network of binding sites based on their structural similarities and detecting communities of structurally similar binding sites based on the minimum description length principle. | Lee et al. Sci. Rep., 2017 | Link to the network