@misc{insidata, author = {Blaž Škrlj and Janez Konc}, title = {{Insidrug.org: Datasets}: Structural databases for drug design}, howpublished = {\url{http://insilab.org/datasets}}, month = sep, year = 2018 }
The PROTEUS dataset consists of docking-ready binding sites on proteins for small molecules, classified as cofactors (e.g., ATP, NAD) and other small molecules (e.g., substrates, receptor agonists). It covers the Protein Data Bank and is ready for use with the PROTEUS (inverse) docking software, enabling proteome-wide target prediction. With minor modifications it can also be used with other docking tools. The binding sites are accurately defined for each query protein using multiple centroids (x, y, z, radius) that together capture different binding site shapes. Centroids are derived from one or more template ligands obtained from similar binding sites in the PDB, so that the space occupied by all these template ligands is treated as the binding site. Each dataset entry contains a receptor, centroids, and template ligands. The receptor can be a single chain or a multi-protein complex, while the centroids define the location of the binding site on the receptor. Template ligands have been transposed from similar binding sites and are used in the PROTEUS scoring function. The dataset uniquely enables docking into either substrate or cofactor binding sites, which facilitates discovery of both types of competitive inhibitors. It is further divided into subsets, enabling selective docking against different species or different protein classes.
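To illustrate how a set of centroids can describe a binding site, the following minimal sketch treats each centroid (x, y, z, radius) as a sphere and regards a point as inside the binding site when it falls within at least one sphere. The function name, the tuple layout, and the example coordinates are assumptions made for illustration; the on-disk format of the `.pro.gz` files is not described here, and this is not necessarily how PROTEUS itself evaluates centroids.

```python
import math
from typing import Iterable, Tuple

# Assumed in-memory layout for one centroid: (x, y, z, radius).
Centroid = Tuple[float, float, float, float]
Point = Tuple[float, float, float]


def in_binding_site(point: Point, centroids: Iterable[Centroid]) -> bool:
    """Return True if the point lies inside at least one centroid sphere."""
    return any(
        math.dist(point, (cx, cy, cz)) <= radius
        for cx, cy, cz, radius in centroids
    )


# Hypothetical binding site made of two overlapping spheres.
site = [(0.0, 0.0, 0.0, 2.0), (3.0, 0.0, 0.0, 2.0)]

print(in_binding_site((1.5, 0.0, 0.0), site))  # point between the two centers
print(in_binding_site((0.0, 5.0, 0.0), site))  # point far from both centers
```

Using several small spheres instead of one large one lets the union follow elongated or branched pockets, which matches the stated purpose of capturing different binding site shapes.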
Subset | Description/Source | Small Molecule* | Cofactor |
---|---|---|---|
All | All proteins in the Protein Data Bank | all.pro.gz | cof_all.pro.gz |
Human | Human proteins determined as those where source organism is HOMO SAPIENS | human.pro.gz | cof_human.pro.gz |
Cancer | Cancer-related human genes obtained from The Human Protein Atlas (https://www.proteinatlas.org) | cancer.pro.gz | cof_cancer.pro.gz |
Human kinases | EC numbers: 1.1.1.3, 1.2.1.38, 2.1.2.-, 2.1.7.127, 2.3.1.1, 2.3.1.48, 2.4.2.9, 2.5.1.15, 2.5.1.3, 2.7.-.-, 2.7.1.-, 2.7.10.-, 2.7.10.1, 2.7.10.2, 2.7.1.1, 2.7.11.-, 2.7.1.100, 2.7.1.105, 2.7.1.107, 2.7.1.11, 2.7.11.1, 2.7.11.10, 2.7.11.11, 2.7.1.112, 2.7.11.12, 2.7.1.113, 2.7.11.13, 2.7.11.14, 2.7.1.115, 2.7.11.15, 2.7.11.16, 2.7.11.17, 2.7.11.19, 2.7.1.12, 2.7.11.2, 2.7.11.21, 2.7.11.22, 2.7.1.123, 2.7.11.23, 2.7.11.24, 2.7.11.25, 2.7.1.126, 2.7.11.26, 2.7.1.127, 2.7.11.27, 2.7.1.130, 2.7.11.30, 2.7.11.31, 2.7.11.32, 2.7.1.134, 2.7.1.137, 2.7.11.4, 2.7.1.140, 2.7.1.143, 2.7.1.144, 2.7.1.145, 2.7.1.146, 2.7.1.147, 2.7.1.148, 2.7.1.149, 2.7.1.15, 2.7.11.5, 2.7.1.151, 2.7.1.153, 2.7.1.154, 2.7.1.158, 2.7.1.159, 2.7.1.16, 2.7.1.161, 2.7.1.162, 2.7.1.163, 2.7.1.164, 2.7.1.167, 2.7.1.17, 2.7.11.7, 2.7.1.170, 2.7.1.173, 2.7.1.175, 2.7.1.183, 2.7.1.189, 2.7.1.19, 2.7.1.2, 2.7.1.20, 2.7.1.21, 2.7.12.1, 2.7.1.22, 2.7.12.2, 2.7.1.23, 2.7.1.24, 2.7.1.25, 2.7.1.26, 2.7.1.27, 2.7.1.29, 2.7.1.3, 2.7.1.30, 2.7.1.31, 2.7.13.1, 2.7.1.32, 2.7.1.33, 2.7.13.3, 2.7.1.35, 2.7.1.36, 2.7.1.37, 2.7.1.38, 2.7.1.39, 2.7.1.4, 2.7.1.40, 2.7.1.45, 2.7.1.48, 2.7.1.49, 2.7.1.5, 2.7.1.50, 2.7.1.51, 2.7.1.55, 2.7.1.56, 2.7.1.58, 2.7.1.59, 2.7.1.6, 2.7.1.60, 2.7.1.63, 2.7.1.64, 2.7.1.67, 2.7.1.68, 2.7.1.69, 2.7.1.71, 2.7.1.74, 2.7.1.76, 2.7.1.78, 2.7.1.82, 2.7.1.85, 2.7.1.86, 2.7.1.90, 2.7.1.91, 2.7.1.95, 2.7.1.99, 2.7.2.-, 2.7.2.1, 2.7.2.11, 2.7.2.12, 2.7.2.15, 2.7.2.2, 2.7.2.3, 2.7.2.4, 2.7.2.7, 2.7.2.8, 2.7.3.-, 2.7.3.1, 2.7.3.2, 2.7.3.3, 2.7.3.4, 2.7.3.5, 2.7.3.9, 2.7.4.-, 2.7.4.1, 2.7.4.10, 2.7.4.13, 2.7.4.14, 2.7.4.16, 2.7.4.2, 2.7.4.21, 2.7.4.22, 2.7.4.24, 2.7.4.25, 2.7.4.27, 2.7.4.3, 2.7.4.4, 2.7.4.6, 2.7.4.7, 2.7.4.8, 2.7.4.9, 2.7.6.1, 2.7.6.2, 2.7.6.3, 2.7.6.5, 2.7.7.10, 2.7.7.12, 2.7.7.2, 2.7.7.4, 2.7.7.6, 2.7.7.62, 2.7.9.1, 2.7.9.2, 2.7.9.3, 3.1.26.-, 3.1.3.-, 3.1.3.46, 3.1.3.48, 3.1.3.5, 3.4.21.-, 3.4.24.57, 3.6.-.-, 3.6.4.12, 3.6.5.2, 4.-.-.-, 4.1.1.32, 4.1.1.33, 4.1.2.25, 4.1.2.40, 5.3.1.8, 5.3.1.9 | kinases.pro.gz | cof_kinases.pro.gz |
Animals | NCBI taxonomy: invertebrates, vertebrates, mammals, primates, rodents | animals.pro.gz | cof_animals.pro.gz |
Bacteria | NCBI taxonomy: bacteria | bacteria.pro.gz | cof_bacteria.pro.gz |
Pathogenic bacteria | Source organism: BACILLUS ANTHRACIS, BACILLUS CEREUS, BARTONELLA HENSELAE, BARTONELLA QUINTANA, BORDETELLA PERTUSSIS, BORRELIA BURGDORFERI, BORRELIA GARINII, BORRELIA AFZELII, BORRELIA RECURRENTIS, BRUCELLA ABORTUS, BRUCELLA CANIS, BRUCELLA MELITENSIS, BRUCELLA SUIS, CAMPYLOBACTER JEJUNI, CHLAMYDIA PNEUMONIAE, CHLAMYDIA TRACHOMATIS, CHLAMYDOPHILA PSITTACI, CLOSTRIDIUM BOTULINUM, CLOSTRIDIUM DIFFICILE, CLOSTRIDIUM PERFRINGENS, CLOSTRIDIUM TETANI, CORYNEBACTERIUM DIPHTHERIAE, ENTEROCOCCUS FAECALIS, ENTEROCOCCUS FAECIUM, ESCHERICHIA COLI, FRANCISELLA TULARENSIS, HAEMOPHILUS INFLUENZAE, HELICOBACTER PYLORI, LEGIONELLA PNEUMOPHILA, LEPTOSPIRA INTERROGANS, LEPTOSPIRA SANTAROSAI, LEPTOSPIRA WEILII, LEPTOSPIRA NOGUCHII, MYCOBACTERIUM LEPRAE, MYCOBACTERIUM TUBERCULOSIS, MYCOBACTERIUM ULCERANS, MYCOPLASMA PNEUMONIAE, NEISSERIA GONORRHOEAE, NEISSERIA MENINGITIDIS, PSEUDOMONAS AERUGINOSA, RICKETTSIA RICKETTSII, SALMONELLA TYPHI, SALMONELLA TYPHIMURIUM, SHIGELLA SONNEI, STAPHYLOCOCCUS AUREUS, STAPHYLOCOCCUS EPIDERMIDIS, STAPHYLOCOCCUS SAPROPHYTICUS, STREPTOCOCCUS AGALACTIAE, STREPTOCOCCUS PNEUMONIAE, STREPTOCOCCUS PYOGENES, TREPONEMA PALLIDUM, UREAPLASMA UREALYTICUM, VIBRIO CHOLERAE, YERSINIA PESTIS, YERSINIA ENTEROCOLITICA, YERSINIA PSEUDOTUBERCULOSIS | patho_bacteria.pro.gz | cof_patho_bacteria.pro.gz |
Viruses | NCBI taxonomy: viruses | viruses.pro.gz | cof_viruses.pro.gz |
Plants and Fungi | NCBI taxonomy: plants and fungi | plants_fungi.pro.gz | cof_plants_fungi.pro.gz |
* Small organic molecule binding sites except cofactor binding sites.
The GenProBiS dataset makes it possible to determine whether a sequence variant lies in a binding site and whether it could disrupt ligand binding. The dataset consists of a chain of mappings: sequence variants -> protein structures from the Protein Data Bank -> predicted (ProBiS) and experimental binding sites -> predicted and experimental ligands, classified as proteins, nucleic acids, compounds, ions, and conserved waters.
Name | Description | Download link |
---|---|---|
GenProBiSdb | Whole dataset dump in TSV format. | genprobisdb.tsv.gz |
MappingSeqVarToProteins | Mappings of sequence variants from UniProt numbering to PDB numbering, for each protein structure separately, in JSON format | Opens folder |
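Because the full dump is a gzip-compressed TSV file, it can be streamed row by row without decompressing it to disk first. The sketch below is a generic reader, not an official loader: the column names of `genprobisdb.tsv.gz` are not documented here, so the code assumes only that the first row is a header and makes no assumption about which columns exist. The function name `iter_tsv_gz` is hypothetical.

```python
import csv
import gzip
from typing import Dict, Iterator


def iter_tsv_gz(path: str) -> Iterator[Dict[str, str]]:
    """Yield each data row of a gzipped TSV file as a dict keyed by the header row.

    Assumption: the first line of the file is a header naming the columns.
    """
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quote characters as literal text, which is the
        # usual convention for tab-separated dumps.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        yield from reader
```

A typical use would be to loop over `iter_tsv_gz("genprobisdb.tsv.gz")` and inspect the keys of the first row to discover the available columns before filtering.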
The ProBiS-ligands dataset consists of all-against-all local structural alignments of the proteins in the Protein Data Bank and information derived from them. For each protein chain it contains local superimpositions with every protein chain in the PDB (the aligned chains can have far below 30% sequence identity). It contains ligands (proteins, nucleic acids, compounds, cofactors, ions, and waters) transposed from the locally similar structures onto the query protein chain; binding site residues; evolutionary conservation for each protein residue, calculated from the structural alignments; interactions between transposed ligands and the query protein chain; and grids and centroids of predicted binding sites.
Name | Description | Download link |
---|---|---|
LocalStructuralAlignments | Local structural alignments of protein chains for the entire PDB, calculated using the ProBiS algorithm | Opens folder |
TranposedLigands | Predicted ligands (by binding site similarity) for each protein chain in the PDB. The ligands are clustered by spatial distance and each cluster represents one binding site | Opens folder |
BindingSiteResidues | Binding site residues on proteins based on binding sites defined as clusters of transposed ligands | Opens folder |
EvolutionaryConservation | Evolutionary conservation level for each protein residue; levels range from 0 (not conserved) to 9 (highly conserved) | Opens folder |
LigandProteinInteractions | Interactions between proteins and predicted (transposed) ligands | Opens folder |
BindingSiteGrids | Binding site grids that accurately describe the identified binding sites | Opens folder |
BindingSiteCentroids | Centroids that describe the space taken by identified binding sites, similarly to grids but more sparsely | Opens folder |
ProteinSuperimpositions | Local superimpositions of protein chains as defined by the local alignments | Opens folder |
BindingSiteSurfaces | Surfaces (triangulated) of identified binding sites | Opens folder |
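The table above notes that transposed ligands are clustered by spatial distance and that each cluster represents one binding site. As a rough illustration of that idea, the sketch below groups ligand center points by single-linkage clustering with a distance cutoff. This is an assumed stand-in, not the clustering procedure used to build the dataset: the use of ligand centers, the cutoff value, and the function name are all hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


def cluster_by_distance(centers: Sequence[Point], cutoff: float) -> List[List[int]]:
    """Single-linkage clustering of points.

    Two indices end up in the same cluster when a chain of points, each
    within `cutoff` of the next, connects them. Returns sorted index lists.
    """
    parent = list(range(len(centers)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair of points that lies within the cutoff.
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if math.dist(centers[i], centers[j]) <= cutoff:
                parent[find(i)] = find(j)

    clusters: Dict[int, List[int]] = {}
    for i in range(len(centers)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical ligand centers forming two well-separated groups.
ligand_centers = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0),
                  (11.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(cluster_by_distance(ligand_centers, cutoff=1.5))
```

The pairwise loop is quadratic in the number of ligands, which is acceptable for a single protein chain; a spatial index would be the natural replacement for larger inputs.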
The BoBER (Base of Bioisosteric Replacements) dataset includes bioisosteres (different fragments that should have similar binding properties) that were detected by mining the Protein Data Bank using the ProBiS algorithm. Two fragments are defined as bioisosteric if they were found to bind to structurally (geometrically and physicochemically) similar binding sites.
Name | Description | Download link |
---|---|---|
BoBERdb | Bioisosteric replacements discovered by mining the PDB | BoBER.zip |
These are complex biological networks that we constructed in our studies. They may be used as benchmarks for network-exploration algorithms. Networks are given in a standard format, where each line contains a weighted edge:
Vertex#1 Vertex#2 Weight
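A file in this edge-list format can be loaded into an adjacency map with a few lines of Python. The sketch below makes several assumptions that the description above does not confirm: fields are separated by whitespace, vertex labels contain no spaces, and edges are undirected. Lines that do not have exactly three fields, or whose weight is not numeric (such as a header line), are skipped. The function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable


def read_weighted_edges(lines: Iterable[str]) -> Dict[str, Dict[str, float]]:
    """Build an adjacency map {vertex: {neighbor: weight}} from edge lines."""
    graph: Dict[str, Dict[str, float]] = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        source, target, raw_weight = fields
        try:
            weight = float(raw_weight)
        except ValueError:
            continue  # skip lines with a non-numeric weight, e.g. a header
        graph[source][target] = weight
        graph[target][source] = weight  # assumption: edges are undirected
    return dict(graph)


# Hypothetical example lines in the "Vertex#1 Vertex#2 Weight" layout.
example = ["Vertex#1 Vertex#2 Weight", "A B 0.5", "", "B C 2"]
network = read_weighted_edges(example)
print(sorted(network))
```

For a downloaded network file, the same function can be called with an open file handle, since iterating over a text file yields its lines.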
Study name | Description | Citation | Download link |
---|---|---|---|
Insights from Ion Binding Site Network Analysis into Evolution and Functions of Proteins | Many biological phenomena can be represented as complex networks. Using a protein binding site comparison approach, we generated a network of ion binding sites on the scale of all known protein structures from the Protein Data Bank. | Skrlj et al. Mol. Inform., 2018 | Link to the network (Table S6) |
Global organization of a binding site network gives insight into evolution and structure-function relationships of proteins | The global organization of protein binding sites is analyzed by constructing a weighted network of binding sites based on their structural similarities and detecting communities of structurally similar binding sites based on the minimum description length principle. | Lee et al. Sci. Rep., 2017 | Link to the network |