Datasets

What's on this page?

Welcome to the datasets page, where the datasets developed in our publications are available for bulk download. The currently downloadable datasets are listed below:

The PROTEUS dataset
The GenProBiS dataset
The ProBiS-ligands dataset
The BoBER dataset
The complex biological networks

If you use these datasets, please consider citing the following (along with the relevant tool paper):

@misc{insidata,
author = {Blaž Škrlj and Janez Konc},
title = {{Insidrug.org: Datasets}: Structural databases for drug design},
howpublished = {\url{http://insilab.org/datasets}},
month = sep,
year = 2018
}

The PROTEUS Dataset

The PROTEUS dataset consists of docking-ready binding sites on proteins for small molecules, classified as cofactors (e.g., ATP, NAD) and other small molecules (e.g., substrates, receptor agonists). It covers the Protein Data Bank and is ready for use with the PROTEUS (inverse) docking software, enabling proteome-wide target prediction. With minor modifications it can also be used with other docking tools.

The binding sites are accurately defined for each query protein using multiple centroids (x, y, z, radius) that together capture different binding site shapes. Centroids are derived from (multiple) template ligands obtained from similar binding sites in the PDB, so that the space occupied by all these template ligands is considered the binding site. Each dataset entry contains a receptor, centroids, and template ligands. The receptor can be a single chain or a multi-protein complex, while the centroids define the location of the binding site on the receptor. Template ligands have been transposed from similar binding sites and are used in the scoring function of PROTEUS.

The dataset uniquely enables docking into substrate or cofactor binding sites, to facilitate the discovery of both types of competitive inhibitors. It is further divided into subsets, enabling selective docking against different species or different protein classes.
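The centroid representation described above can be read as a union of spheres: a point belongs to the binding site if it falls inside at least one centroid sphere. The sketch below illustrates only that idea. The coordinates are made up, the function name is hypothetical, and nothing here reflects the actual layout of the .pro.gz files or the PROTEUS implementation.

```python
from math import dist
from typing import Iterable, Tuple

# A centroid is assumed to be a sphere given as (x, y, z, radius).
Centroid = Tuple[float, float, float, float]
Point = Tuple[float, float, float]


def in_binding_site(point: Point, centroids: Iterable[Centroid]) -> bool:
    """Return True if the point lies inside at least one centroid sphere."""
    return any(dist(point, (x, y, z)) <= r for x, y, z, r in centroids)


# Hypothetical centroids, invented for illustration only.
site = [(0.0, 0.0, 0.0, 2.0), (3.0, 0.0, 0.0, 2.0)]

print(in_binding_site((1.5, 0.0, 0.0), site))  # True: 1.5 from the first centre
print(in_binding_site((0.0, 5.0, 0.0), site))  # False: outside both spheres
```

Using several overlapping spheres rather than one large sphere lets an elongated or branched pocket be covered without including much empty space around it.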

Subset	Description/Source	Small Molecule*	Cofactor
All	All proteins in the Protein Data Bank	all.pro.gz	cof_all.pro.gz
Human	Human proteins, determined as those whose source organism is HOMO SAPIENS	human.pro.gz	cof_human.pro.gz
Cancer	Cancer-related human genes obtained from The Human Protein Atlas (https://www.proteinatlas.org)	cancer.pro.gz	cof_cancer.pro.gz
Human kinases	EC numbers: 1.1.1.3, 1.2.1.38, 2.1.2.-, 2.1.7.127, 2.3.1.1, 2.3.1.48, 2.4.2.9, 2.5.1.15, 2.5.1.3, 2.7.-.-, 2.7.1.-, 2.7.10.-, 2.7.10.1, 2.7.10.2, 2.7.1.1, 2.7.11.-, 2.7.1.100, 2.7.1.105, 2.7.1.107, 2.7.1.11, 2.7.11.1, 2.7.11.10, 2.7.11.11, 2.7.1.112, 2.7.11.12, 2.7.1.113, 2.7.11.13, 2.7.11.14, 2.7.1.115, 2.7.11.15, 2.7.11.16, 2.7.11.17, 2.7.11.19, 2.7.1.12, 2.7.11.2, 2.7.11.21, 2.7.11.22, 2.7.1.123, 2.7.11.23, 2.7.11.24, 2.7.11.25, 2.7.1.126, 2.7.11.26, 2.7.1.127, 2.7.11.27, 2.7.1.130, 2.7.11.30, 2.7.11.31, 2.7.11.32, 2.7.1.134, 2.7.1.137, 2.7.11.4, 2.7.1.140, 2.7.1.143, 2.7.1.144, 2.7.1.145, 2.7.1.146, 2.7.1.147, 2.7.1.148, 2.7.1.149, 2.7.1.15, 2.7.11.5, 2.7.1.151, 2.7.1.153, 2.7.1.154, 2.7.1.158, 2.7.1.159, 2.7.1.16, 2.7.1.161, 2.7.1.162, 2.7.1.163, 2.7.1.164, 2.7.1.167, 2.7.1.17, 2.7.11.7, 2.7.1.170, 2.7.1.173, 2.7.1.175, 2.7.1.183, 2.7.1.189, 2.7.1.19, 2.7.1.2, 2.7.1.20, 2.7.1.21, 2.7.12.1, 2.7.1.22, 2.7.12.2, 2.7.1.23, 2.7.1.24, 2.7.1.25, 2.7.1.26, 2.7.1.27, 2.7.1.29, 2.7.1.3, 2.7.1.30, 2.7.1.31, 2.7.13.1, 2.7.1.32, 2.7.1.33, 2.7.13.3, 2.7.1.35, 2.7.1.36, 2.7.1.37, 2.7.1.38, 2.7.1.39, 2.7.1.4, 2.7.1.40, 2.7.1.45, 2.7.1.48, 2.7.1.49, 2.7.1.5, 2.7.1.50, 2.7.1.51, 2.7.1.55, 2.7.1.56, 2.7.1.58, 2.7.1.59, 2.7.1.6, 2.7.1.60, 2.7.1.63, 2.7.1.64, 2.7.1.67, 2.7.1.68, 2.7.1.69, 2.7.1.71, 2.7.1.74, 2.7.1.76, 2.7.1.78, 2.7.1.82, 2.7.1.85, 2.7.1.86, 2.7.1.90, 2.7.1.91, 2.7.1.95, 2.7.1.99, 2.7.2.-, 2.7.2.1, 2.7.2.11, 2.7.2.12, 2.7.2.15, 2.7.2.2, 2.7.2.3, 2.7.2.4, 2.7.2.7, 2.7.2.8, 2.7.3.-, 2.7.3.1, 2.7.3.2, 2.7.3.3, 2.7.3.4, 2.7.3.5, 2.7.3.9, 2.7.4.-, 2.7.4.1, 2.7.4.10, 2.7.4.13, 2.7.4.14, 2.7.4.16, 2.7.4.2, 2.7.4.21, 2.7.4.22, 2.7.4.24, 2.7.4.25, 2.7.4.27, 2.7.4.3, 2.7.4.4, 2.7.4.6, 2.7.4.7, 2.7.4.8, 2.7.4.9, 2.7.6.1, 2.7.6.2, 2.7.6.3, 2.7.6.5, 2.7.7.10, 2.7.7.12, 2.7.7.2, 2.7.7.4, 2.7.7.6, 2.7.7.62, 2.7.9.1, 2.7.9.2, 2.7.9.3, 3.1.26.-, 3.1.3.-, 3.1.3.46, 3.1.3.48, 3.1.3.5, 3.4.21.-, 3.4.24.57, 3.6.-.-, 3.6.4.12, 3.6.5.2, 4.-.-.-, 4.1.1.32, 4.1.1.33, 4.1.2.25, 4.1.2.40, 5.3.1.8, 5.3.1.9	kinases.pro.gz	cof_kinases.pro.gz
Animals	NCBI taxonomy: invertebrates, vertebrates, mammals, primates, rodents	animals.pro.gz	cof_animals.pro.gz
Bacteria	NCBI taxonomy: bacteria	bacteria.pro.gz	cof_bacteria.pro.gz
Pathogenic bacteria	Source organism: BACILLUS ANTHRACIS, BACILLUS CEREUS, BARTONELLA HENSELAE, BARTONELLA QUINTANA, BORDETELLA PERTUSSIS, BORRELIA BURGDORFERI, BORRELIA GARINII, BORRELIA AFZELII, BORRELIA RECURRENTIS, BRUCELLA ABORTUS, BRUCELLA CANIS, BRUCELLA MELITENSIS, BRUCELLA SUIS, CAMPYLOBACTER JEJUNI, CHLAMYDIA PNEUMONIAE, CHLAMYDIA TRACHOMATIS, CHLAMYDOPHILA PSITTACI, CLOSTRIDIUM BOTULINUM, CLOSTRIDIUM DIFFICILE, CLOSTRIDIUM PERFRINGENS, CLOSTRIDIUM TETANI, CORYNEBACTERIUM DIPHTHERIAE, ENTEROCOCCUS FAECALIS, ENTEROCOCCUS FAECIUM, ESCHERICHIA COLI, FRANCISELLA TULARENSIS, HAEMOPHILUS INFLUENZAE, HELICOBACTER PYLORI, LEGIONELLA PNEUMOPHILA, LEPTOSPIRA INTERROGANS, LEPTOSPIRA SANTAROSAI, LEPTOSPIRA WEILII, LEPTOSPIRA NOGUCHII, MYCOBACTERIUM LEPRAE, MYCOBACTERIUM TUBERCULOSIS, MYCOBACTERIUM ULCERANS, MYCOPLASMA PNEUMONIAE, NEISSERIA GONORRHOEAE, NEISSERIA MENINGITIDIS, PSEUDOMONAS AERUGINOSA, RICKETTSIA RICKETTSII, SALMONELLA TYPHI, SALMONELLA TYPHIMURIUM, SHIGELLA SONNEI, STAPHYLOCOCCUS AUREUS, STAPHYLOCOCCUS EPIDERMIDIS, STAPHYLOCOCCUS SAPROPHYTICUS, STREPTOCOCCUS AGALACTIAE, STREPTOCOCCUS PNEUMONIAE, STREPTOCOCCUS PYOGENES, TREPONEMA PALLIDUM, UREAPLASMA UREALYTICUM, VIBRIO CHOLERAE, YERSINIA PESTIS, YERSINIA ENTEROCOLITICA, YERSINIA PSEUDOTUBERCULOSIS	patho_bacteria.pro.gz	cof_patho_bacteria.pro.gz
Viruses	NCBI taxonomy: viruses	viruses.pro.gz	cof_viruses.pro.gz
Plants and Fungi	NCBI taxonomy: plants and fungi	plants_fungi.pro.gz	cof_plants_fungi.pro.gz
* Small organic molecule binding sites except cofactor binding sites.
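Several EC numbers in the Human kinases row contain "-" placeholders (for example 2.7.-.-), which in EC notation leave a level unspecified. The sketch below shows one way to check an EC number against such a list, assuming "-" acts as a wildcard for that level. Whether the subset was built this way is an assumption, and the function names and the short pattern list are illustrative only.

```python
from typing import Iterable


def ec_matches(ec: str, pattern: str) -> bool:
    """Compare a four-level EC number to a pattern, treating '-' as a wildcard."""
    ec_parts = ec.split(".")
    pattern_parts = pattern.split(".")
    if len(ec_parts) != 4 or len(pattern_parts) != 4:
        return False
    return all(p == "-" or p == e for e, p in zip(ec_parts, pattern_parts))


def in_subset(ec: str, patterns: Iterable[str]) -> bool:
    """Return True if the EC number matches any pattern in the list."""
    return any(ec_matches(ec, pattern) for pattern in patterns)


# A few patterns copied from the Human kinases row, for illustration.
kinase_patterns = ["2.7.10.-", "2.7.11.1", "3.6.4.12"]

print(in_subset("2.7.10.2", kinase_patterns))   # matches 2.7.10.-
print(in_subset("2.7.11.22", kinase_patterns))  # no match in this short list
```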

The GenProBiS Dataset

The GenProBiS dataset makes it possible to determine whether a sequence variant lies in a binding site and whether it could disrupt ligand binding. The dataset consists of a chain of mappings: sequence variants -> protein structures from the Protein Data Bank -> predicted (ProBiS) and experimental binding sites -> predicted and experimental ligands, classified as proteins, nucleic acids, compounds, ions, and conserved waters.

Name	Description	Download link
GenProBiSdb	Whole dataset dump in TSV format.	genprobisdb.tsv.gz
MappingSeqVarToProteins	Mappings of sequence variants from UniProt numbering to PDB numbering, for each protein structure separately, in JSON format	Opens folder
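Because the whole-dataset dump is a gzip-compressed TSV file, it can be streamed row by row without unpacking it to disk. The sketch below uses only the Python standard library. The column layout of genprobisdb.tsv.gz is not described on this page, so the code makes no assumptions about it and simply returns the first few rows for inspection; the function names are hypothetical.

```python
import csv
import gzip
from itertools import islice
from typing import Iterator, List


def iter_tsv_rows(path: str) -> Iterator[List[str]]:
    """Yield each row of a gzip-compressed TSV file as a list of strings."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps any quote characters in the data as literal text.
        yield from csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)


def preview(path: str, n: int = 5) -> List[List[str]]:
    """Return the first n rows, e.g. to inspect the column layout."""
    return list(islice(iter_tsv_rows(path), n))


# Example call, assuming the file has been downloaded to the working directory:
# for row in preview("genprobisdb.tsv.gz"):
#     print(row)
```

Streaming keeps memory use flat regardless of file size, which matters for a dump that covers the whole Protein Data Bank.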

The ProBiS-ligands Dataset

The ProBiS-ligands dataset consists of all-against-all local structural alignments of the proteins in the Protein Data Bank, together with information derived from them. For each protein chain it contains its local superimpositions with every protein chain (these can go far below 30% sequence identity). It also contains ligands (proteins, nucleic acids, compounds, cofactors, ions and waters) transposed from the locally similar structures onto the query protein chain; binding site residues; evolutionary conservation for each protein residue, calculated from the structural alignments; interactions between transposed ligands and the query protein chain; and grids and centroids of predicted binding sites.

Name	Description	Download link
LocalStructuralAlignments	Local structural alignments of protein chains for the entire PDB, calculated using the ProBiS algorithm	Opens folder
TranposedLigands	Predicted ligands (by binding site similarity) for each protein chain in the PDB. The ligands are clustered by spatial distance and each cluster represents one binding site	Opens folder
BindingSiteResidues	Binding site residues on proteins based on binding sites defined as clusters of transposed ligands	Opens folder
EvolutionaryConservation	Evolutionary conservation level for each protein residue; levels range from 0 (not conserved) to 9 (highly conserved)	Opens folder
LigandProteinInteractions	Interactions between proteins and predicted (transposed) ligands	Opens folder
BindingSiteGrids	Binding site grids that accurately describe the identified binding sites	Opens folder
BindingSiteCentroids	Centroids that (similar to grids but more sparsely) describe the space taken by identified binding sites	Opens folder
ProteinSuperimpositions	Local superimpositions of protein chains as defined by the local alignments	Opens folder
BindingSiteSurfaces	Surfaces (triangulated) of identified binding sites	Opens folder

The BoBER Dataset

This is the BoBER (Base of Bioisosteric Replacements) dataset. It includes bioisosteres (different fragments that should have similar binding properties) that were detected by mining the Protein Data Bank using the ProBiS algorithm. Two fragments are defined as bioisosteric if they were found to bind to structurally (geometrically and physicochemically) similar binding sites.

Name	Description	Download link
BoBERdb	Bioisosteric replacements discovered by mining the PDB	BoBER.zip

The complex biological networks

These are the complex biological networks that we constructed in our studies. They may be used as benchmarks for network exploration algorithms. Networks are given in a standard edge-list format, where each line contains a weighted edge:

Vertex#1 Vertex#2 Weight
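The edge-list format above can be loaded into an adjacency map with a few lines of standard-library Python. This sketch assumes that fields are separated by whitespace, that vertex labels contain no whitespace, and that edges are undirected. None of these details are stated on this page, and the example edges are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable


def parse_edge_list(lines: Iterable[str]) -> Dict[str, Dict[str, float]]:
    """Build {vertex: {neighbour: weight}} from 'Vertex#1 Vertex#2 Weight' lines."""
    graph: Dict[str, Dict[str, float]] = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        source, target, weight = line.split()
        graph[source][target] = float(weight)
        graph[target][source] = float(weight)  # assumption: undirected network
    return dict(graph)


# Hypothetical edges, invented for illustration only.
example = ["A B 0.5", "B C 1.0", ""]
network = parse_edge_list(example)
print(network["B"])  # {'A': 0.5, 'C': 1.0}
```

For a downloaded file, the same function can be fed an open file handle, since iterating over a text file yields its lines.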

Study name	Description	Citation	Download link
Insights from Ion Binding Site Network Analysis into Evolution and Functions of Proteins	Many biological phenomena can be represented as complex networks. Using a protein binding site comparison approach, we generated a network of ion binding sites on the scale of all known protein structures from the Protein Data Bank.	Škrlj et al. Mol. Inform., 2018	Link to the network (Table S6)
Global organization of a binding site network gives insight into evolution and structure-function relationships of proteins	The global organization of protein binding sites is analyzed by constructing a weighted network of binding sites based on their structural similarities and detecting communities of structurally similar binding sites based on the minimum description length principle.	Lee et al. Sci. Rep., 2017	Link to the network