Manual

1. Getting started

What is RDF Portal?

RDF Portal is a platform for accessing and integrating life science datasets represented in the Resource Description Framework (RDF). It hosts a wide range of RDF datasets covering genomic, proteomic, chemical, disease, and other biomedical domains.

The portal is maintained by the Database Center for Life Science (DBCLS) and provides multiple ways to access data — from browsing and downloading to querying via SPARQL endpoints and AI-assisted interfaces. For more details on the background, history, and mission of RDF Portal, see the About page.

Key concepts for beginners

If you are new to RDF and SPARQL, here is a brief introduction to the core concepts.

RDF (Resource Description Framework) is a standard model for representing data as a graph. Data is expressed as a collection of statements called triples, each consisting of three parts:

  • Subject — the entity being described (e.g., a gene, a protein)
  • Predicate — the relationship or property (e.g., “has function”, “is located in”)
  • Object — the value or related entity (e.g., a specific function, a chromosome)

Each component is typically identified by a URI (Uniform Resource Identifier), although an object may instead be a literal value such as a string or a number. URIs ensure global uniqueness and allow datasets from different sources to be linked together.
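
To make the triple model concrete, here is a minimal Python sketch that stores a few triples as plain tuples and matches them against a pattern, which is roughly what a SPARQL WHERE clause does. The example.org URIs and the match helper are hypothetical and for illustration only; they are not taken from any RDF Portal dataset.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str  # "object" is a Python builtin name, so "obj" is used here


# Hypothetical example.org URIs, for illustration only.
EX = "http://example.org/"
triples = [
    Triple(EX + "gene/G1", EX + "isLocatedIn", EX + "chromosome/7"),
    Triple(EX + "gene/G1", EX + "hasFunction", EX + "function/kinase_activity"),
    Triple(EX + "gene/G2", EX + "isLocatedIn", EX + "chromosome/7"),
]


def match(data, s: Optional[str] = None, p: Optional[str] = None,
          o: Optional[str] = None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in data
        if (s is None or t.subject == s)
        and (p is None or t.predicate == p)
        and (o is None or t.obj == o)
    ]


# Which entities are located on chromosome 7?
on_chr7 = match(triples, p=EX + "isLocatedIn", o=EX + "chromosome/7")
print([t.subject for t in on_chr7])
```

Because both genes use the same URI for chromosome 7, the pattern finds them without any extra mapping step. Shared URIs are what make linking across datasets possible.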

SPARQL is the query language for RDF data. It allows you to retrieve, filter, and combine data across RDF datasets. If you are familiar with SQL for relational databases, SPARQL serves a similar role for graph-based RDF data.

An ontology is a formal representation of knowledge within a domain, defining the types of entities and the relationships between them. RDF Portal datasets use community-standard ontologies such as Gene Ontology (GO), ChEBI, and Disease Ontology to ensure consistent semantics across datasets.

The RDF Portal website is organized into the following main sections, accessible from the left sidebar:

Screenshot: RDF Portal top page showing the sidebar navigation and main content area

| Section | Description |
| --- | --- |
| About | Background information on RDF Portal, its history, and funding |
| Access methods | Different ways to query and interact with the data |
| Datasets | Browsable list of all hosted RDF datasets |
| Statistics | Summary statistics (triples, classes, properties, etc.) for each dataset |
| Download | Download RDF data files in various serialization formats |
| Documents | Manual, data submission guidelines, and RDF config documentation |
| Announcements | News and updates |
| Update log | History of data updates |

The site is available in both English and Japanese. You can switch languages using the language link at the bottom of the sidebar.


2. Browsing datasets

Dataset list

The Datasets page displays all RDF datasets hosted on the portal. You can use the controls at the top to sort and filter the list.

Screenshot: Datasets page showing the sort, order, and filter controls

Sorting options:

  • Date — sort by the dataset registration or update date
  • Name — sort alphabetically by dataset name
  • Triples — sort by the number of triples (dataset size)

Each sort can be set to descending or ascending order.

Filtering options:

  • Tags — filter by domain category (see the tag list below)
  • Provenance — filter by data origin
  • Registration — filter by how the dataset was registered

Dataset tags

Each dataset is assigned one or more tags indicating its domain category. Tags are displayed with petal-shaped icons for easy visual identification.

Screenshot: Tag filter dropdown

| Tag | Description |
| --- | --- |
| Gene | Datasets related to genes, gene annotations, and gene-level information |
| Gene expression | Datasets containing gene expression profiles and transcriptomics data |
| Genome | Datasets related to genome sequences and genomic features |
| Protein | Datasets related to protein sequences, structures, and functions |
| Drug/Chemical | Datasets related to drugs, chemical compounds, and bioactive molecules |
| Health/Disease | Datasets related to diseases, clinical variants, and medical information |
| Glycan | Datasets related to glycans and carbohydrate structures |
| Organism | Datasets related to organism-level information and taxonomy |
| Cell | Datasets related to cell-level information |
| Bioresource | Datasets related to biological resource collections (culture collections, biobanks) |
| Polymorphism | Datasets related to genetic variants, SNPs, and polymorphisms |
| Sequence | Datasets related to nucleotide or amino acid sequences |

A single dataset may have multiple tags. For example, Open TG-GATEs is tagged with Gene, Drug/Chemical, Health/Disease, and Gene expression, reflecting its coverage of toxicogenomics data across these domains.

Provenance

Provenance indicates how the RDF data was created relative to the original data source.

| Value | Description |
| --- | --- |
| Original | RDF data developed by the original database developers themselves. The data provider created the RDF representation of their own data. |
| Third-party | RDF data developed by a third party, not by the original database developers. Someone other than the original data provider independently converted the publicly available data into RDF. |

Registration

Registration indicates how the dataset was added to RDF Portal.

| Value | Description |
| --- | --- |
| Submitted | The dataset was submitted to RDF Portal by the RDF data developers. |
| Added by RDF Portal | The dataset was registered by the RDF Portal team. |

Dataset detail page

Clicking on a dataset name opens its detail page. Each detail page contains the following information:

Screenshot: Dataset detail page (e.g., DDBJ) showing specifications, statistics

Dataset specifications — a table showing metadata about the dataset:

| Field | Description |
| --- | --- |
| Tags | Domain categories assigned to the dataset |
| Provenance | Whether the RDF data was created by the original database developers (Original) or by a third party (Third-party) |
| Registration | How the dataset was added to RDF Portal |
| Data provider | The organization that provides the data |
| Creator | The creator of the RDF conversion |
| Issued | The date the current version was published |
| Licenses | License information for the dataset |
| Version | The version number of the dataset |
| Download | Link to download the RDF data files |
| SPARQL Endpoint | The SPARQL endpoint URL for querying this dataset |

Dataset statistics — summary counts including the total number of triples, subjects, properties, objects, and classes.

SPARQL example queries — ready-to-use example queries that demonstrate how to retrieve data from the dataset. Each example includes a description and a “Run on Endpoint” link that opens the query directly in the SPARQL endpoint interface.

Screenshot: SPARQL example query section with the 'Run on Endpoint' button

Schema diagram — a visual representation of the dataset’s RDF schema, showing the classes and properties used in the data. These diagrams are automatically generated from RDF-config, a framework for describing RDF dataset structure in a machine-readable format. RDF-config models are maintained for each dataset hosted on RDF Portal, providing a consistent and practical way to document how the data is organized. In some exceptional cases, schema diagrams may be provided through other means. For more details on RDF-config and its role in RDF Portal, see the RDF config documentation.

Screenshot: Schema diagram for a dataset

3. Access methods

RDF Portal provides several methods for accessing data, ranging from direct SPARQL queries to AI-assisted natural language interfaces.

3a. SPARQL endpoints

SPARQL endpoints allow you to execute SPARQL queries directly against the RDF data. RDF Portal organizes its datasets across multiple SPARQL endpoints, grouped by data source.

For a complete and up-to-date list of available SPARQL endpoints and the datasets they host, see the SPARQL Endpoints page.

Using the SPARQL endpoint in a web browser

Each endpoint provides a web-based query interface. You can access it by visiting the endpoint URL directly (e.g., https://rdfportal.org/ebi/sparql). The interface allows you to:

  1. Enter a SPARQL query in the text area
  2. Click “Run” to execute the query
  3. View the results in tabular format

Screenshot: SPARQL endpoint web interface with a query entered and results displayed

You can also use the “Run on Endpoint” links provided on each dataset’s detail page to execute the example queries directly.

The web interface is powered by two open-source tools developed by DBCLS:

  • SPARQL proxy — A proxy server that sits in front of SPARQL endpoints, providing query validation, job scheduling for concurrent queries, result caching, and logging. It ensures safe and stable access to the endpoints by filtering out potentially harmful queries and managing query workloads.
  • Endpoint browser — A web-based interface for browsing and exploring the structure of RDF data stored in SPARQL endpoints. It allows users to visually navigate classes, properties, and their relationships within datasets.

Querying from the command line

You can send SPARQL queries programmatically using tools like curl:

curl -H "Accept: application/sparql-results+json" \
     --data-urlencode "query=SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10" \
     https://rdfportal.org/ebi/sparql

Common Accept header values for specifying the response format:

| Format | Accept header |
| --- | --- |
| JSON | application/sparql-results+json |
| XML | application/sparql-results+xml |
| CSV | text/csv |
| TSV | text/tab-separated-values |

Querying from Python

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://rdfportal.org/ebi/sparql")
sparql.setQuery("""
    PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    SELECT DISTINCT ?molecule_chemblid ?molecule_label
    FROM <http://rdf.ebi.ac.uk/dataset/chembl>
    WHERE {
        ?Molecule a cco:SmallMolecule ;
            cco:chemblId ?molecule_chemblid ;
            rdfs:label ?molecule_label .
    }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for result in results["results"]["bindings"]:
    print(result["molecule_chemblid"]["value"], result["molecule_label"]["value"])

To install the required library: pip install sparqlwrapper
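
If installing an extra library is not an option, the same kind of request can be sketched with the Python standard library alone. The example below only builds the request and does not send it, so it runs without network access. The build_request helper and the ACCEPT dictionary are illustrative names, not part of any RDF Portal API. The sketch sends the query as a form-encoded POST, as the curl example does, and assumes the endpoint accepts that.

```python
import urllib.parse
import urllib.request

# Accept header values for the response formats listed in this manual.
ACCEPT = {
    "json": "application/sparql-results+json",
    "xml": "application/sparql-results+xml",
    "csv": "text/csv",
    "tsv": "text/tab-separated-values",
}


def build_request(endpoint: str, query: str, fmt: str = "json") -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a SPARQL query."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Accept": ACCEPT[fmt]},
        method="POST",
    )


req = build_request(
    "https://rdfportal.org/ebi/sparql",
    "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10",
    fmt="csv",
)
print(req.get_method(), req.full_url, req.get_header("Accept"))

# To actually send the request (requires network access):
# with urllib.request.urlopen(req, timeout=60) as resp:
#     print(resp.read().decode("utf-8"))
```

Choosing CSV or TSV here is convenient when the results go straight into a spreadsheet or a data frame. JSON keeps the datatype of each value.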

Querying from R

library(SPARQL)

endpoint <- "https://rdfportal.org/ebi/sparql"
query <- "
    PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    SELECT DISTINCT ?molecule_chemblid ?molecule_label
    FROM <http://rdf.ebi.ac.uk/dataset/chembl>
    WHERE {
        ?Molecule a cco:SmallMolecule ;
            cco:chemblId ?molecule_chemblid ;
            rdfs:label ?molecule_label .
    }
    LIMIT 10
"

results <- SPARQL(endpoint, query)
print(results$results)

Specifying the target graph

Many endpoints host multiple datasets (e.g., the EBI endpoint hosts ChEBI, ChEMBL, Ensembl, and Reactome). To query a specific dataset, use the FROM clause to specify its named graph:

SELECT ?s ?p ?o
FROM <http://rdf.ebi.ac.uk/dataset/chembl>
WHERE {
    ?s ?p ?o .
}
LIMIT 10

To discover which named graphs are available in an endpoint, you can use the following query:

SELECT DISTINCT ?g
WHERE {
    GRAPH ?g { ?s ?p ?o }
}
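
When this query is run programmatically with JSON output, the graph URIs can be pulled out of the response with a few lines of Python. The sketch below parses a hand-written sample in the standard SPARQL JSON results layout. The second graph URI is a made-up placeholder and the column helper is a hypothetical name, so treat this as an illustration rather than real endpoint output.

```python
import json

# Illustrative response in the standard SPARQL JSON results layout.
# The second graph URI is a placeholder, not real endpoint output.
sample = """
{
  "head": {"vars": ["g"]},
  "results": {"bindings": [
    {"g": {"type": "uri", "value": "http://rdf.ebi.ac.uk/dataset/chembl"}},
    {"g": {"type": "uri", "value": "http://example.org/graph/placeholder"}}
  ]}
}
"""


def column(results: dict, var: str) -> list:
    """Collect the values bound to one variable, skipping unbound rows."""
    return [
        row[var]["value"]
        for row in results["results"]["bindings"]
        if var in row
    ]


graphs = column(json.loads(sample), "g")
for g in graphs:
    print(g)
```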

3b. GraphQL API

This section will be available once the GraphQL API documentation is published.

The GraphQL API provides an alternative query interface designed for application developers. It allows schema-based, intuitive query construction and is particularly suitable for frontend application integration.

3c. MCP interface (TogoMCP)

TogoMCP is a Model Context Protocol (MCP) interface for AI agents. It enables large language models (LLMs) to understand the structure and content of RDF datasets, allowing them to retrieve accurate data in response to natural language questions.

The TogoMCP server URL is: https://togomcp.rdfportal.org/mcp

For detailed documentation on how to set up and use TogoMCP, including configuration guides and usage examples, please refer to the TogoMCP website.

3d. LLM chat interface

TBA

3e. SPARQL composer

The SPARQL composer is a tool for generating SPARQL queries through a graphical interface. It is useful for users who want to construct queries visually without writing SPARQL syntax manually.


4. Statistics

The Statistics page provides a summary table showing key metrics for each dataset hosted on RDF Portal.

Screenshot: Statistics page showing the summary table

Understanding the statistics table

| Column | Description |
| --- | --- |
| Dataset | The name of the dataset (links to the dataset detail page) |
| Triples | The total number of RDF triples in the dataset. This is the primary measure of dataset size. |
| Classes | The number of distinct RDF classes (types of entities) defined or used in the dataset. |
| Properties | The number of distinct predicates (relationships) used in the dataset. |
| Subjects | The number of distinct subject URIs. This roughly corresponds to the number of unique entities described in the dataset. |
| Objects | The number of distinct object values (both URIs and literals). |

Interpreting the numbers

The datasets on RDF Portal vary enormously in size. For example:

  • DDBJ is the largest dataset with approximately 68.5 billion triples, containing nucleotide sequence data from the DNA Data Bank of Japan.
  • UniProt RDF contains over 51.3 billion triples of protein sequence and functional information.
  • Nucleic Acid Drug Database is one of the smallest datasets with 948 triples.

The numbers of classes and properties together give an indication of the dataset’s schema complexity. A dataset with many classes and properties (e.g., wwPDB/RDF with 647 classes and 3,823 properties) has a rich, detailed data model, while a dataset with few classes and properties (e.g., PubMed with 2 classes and 5 properties) has a simpler, flatter structure.


5. Download

The Download page provides links to download RDF data files for each dataset. Data is available in multiple RDF serialization formats.

Available formats

| Format | Extension | Description | Best for |
| --- | --- | --- | --- |
| N-Triples | .nt | One triple per line, simple text format | Streaming, bulk loading, line-by-line processing |
| Turtle | .ttl | Compact, human-readable format with prefix abbreviations | Manual inspection, readability |
| RDF-XML | .rdf | XML-based serialization | XML toolchains, legacy systems |
| JSON-LD | .jsonld | JSON-based linked data format | Web applications, JavaScript environments |

Format availability

Not all formats are available for every dataset. The availability follows these rules:

  • Original submitted files — Every dataset provides at minimum the RDF files in the format originally submitted by the data provider. This is always available.
  • N-Triples — In addition to the original format, an N-Triples (.nt) version is always generated and provided for every dataset. N-Triples serves as the common baseline format, ensuring that all datasets can be processed uniformly regardless of the original submission format.
  • Other formats (Turtle, RDF-XML, JSON-LD) — These are provided when available, but are not guaranteed for every dataset. Availability depends on whether the conversion has been performed for that particular dataset.

On the Download page, a dash (—) in a format column indicates that the format is not currently available for that dataset.

Choosing a format

  • For bulk loading into a triplestore (e.g., Virtuoso, GraphDB, Apache Jena), N-Triples is generally the fastest format to parse and is always available.
  • For reading and understanding the data structure, Turtle provides the most human-friendly representation.
  • For web application integration, JSON-LD is the natural choice as it can be processed directly by JavaScript.
  • For compatibility with XML-based tools, RDF-XML is appropriate.
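
The one-triple-per-line layout of N-Triples is what makes line-by-line processing practical. As a rough sketch, the Python below streams N-Triples lines and counts how often each predicate occurs without loading the whole file into memory. It is a simplification rather than a full N-Triples parser: it relies only on the fact that subjects and predicates contain no whitespace. The sample data and the count_predicates helper are hypothetical.

```python
import io
from collections import Counter


def count_predicates(lines) -> Counter:
    """Count predicate usage in an iterable of N-Triples lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Subject and predicate never contain whitespace, so two splits
        # isolate them; the remainder is the object plus the final " ."
        _subject, predicate, _rest = line.split(None, 2)
        counts[predicate] += 1
    return counts


# Hypothetical sample data; a real file would be opened with open(path).
sample = io.StringIO(
    '<http://example.org/m1> <http://www.w3.org/2000/01/rdf-schema#label> "aspirin, a drug." .\n'
    '<http://example.org/m2> <http://www.w3.org/2000/01/rdf-schema#label> "caffeine" .\n'
    '<http://example.org/m1> <http://example.org/relatedTo> <http://example.org/m2> .\n'
)
counts = count_predicates(sample)
for predicate, n in counts.most_common():
    print(n, predicate)
```

The same loop works on a downloaded .nt file of any size, because only one line is held in memory at a time.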

Download URLs

Download links follow the pattern: https://rdfportal.org/download/{dataset_id}

For example, to download ChEMBL RDF data: https://rdfportal.org/download/chembl


6. Use cases and tutorials

This section provides practical examples of how to use RDF Portal data for life science research.

Tutorial 1: Your first SPARQL query

This tutorial walks you through executing a simple SPARQL query to retrieve data from ChEMBL.

Goal: List 10 small molecule compounds with their ChEMBL IDs and names.

Step 1: Open the EBI SPARQL endpoint at https://rdfportal.org/ebi/sparql

Step 2: Enter the following query:

PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?molecule_chemblid ?molecule_label
FROM <http://rdf.ebi.ac.uk/dataset/chembl>
WHERE {
    ?Molecule a cco:SmallMolecule ;
        cco:chemblId ?molecule_chemblid ;
        rdfs:label ?molecule_label .
}
LIMIT 10

Step 3: Click “Run” to execute the query. You will see a table of results showing ChEMBL IDs and molecule names.

Understanding the query:

  • PREFIX lines define namespace abbreviations used in the query
  • SELECT specifies which variables to return
  • FROM specifies the named graph (dataset) to query
  • WHERE defines the pattern to match — here, we look for entities that are typed as SmallMolecule and have both a ChEMBL ID and a label
  • LIMIT 10 restricts the output to 10 results

Tutorial 2: Finding approved drugs for a specific target

Goal: Find compounds approved as drugs (development phase 4) that target Tyrosine-protein kinase ABL.

PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX chembl_target: <http://rdf.ebi.ac.uk/resource/chembl/target/>

SELECT ?molecule_chemblid ?molecule_label
FROM <http://rdf.ebi.ac.uk/dataset/chembl>
WHERE {
    ?Molecule a cco:SmallMolecule ;
        cco:chemblId ?molecule_chemblid ;
        rdfs:label ?molecule_label ;
        cco:highestDevelopmentPhase 4 ;
        cco:hasMechanism ?mechanism .
    ?mechanism cco:hasTarget chembl_target:CHEMBL1862 .
}
LIMIT 100

This query combines multiple conditions: filtering by molecule type, development phase, and a specific drug target. Modify chembl_target:CHEMBL1862 to search for drugs targeting other proteins.
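
When the same query is run for several targets, it can help to generate the query text from a template instead of editing it by hand. The following sketch wraps the query above in a small Python function. The function name and the ID check are illustrative assumptions; the check simply expects the CHEMBL-plus-digits form seen in CHEMBL1862 and rejects anything else before it reaches the query text.

```python
import re

# The tutorial query, with the target ID and the limit left as placeholders.
# Doubled braces are literal braces in str.format templates.
QUERY_TEMPLATE = """\
PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX chembl_target: <http://rdf.ebi.ac.uk/resource/chembl/target/>

SELECT ?molecule_chemblid ?molecule_label
FROM <http://rdf.ebi.ac.uk/dataset/chembl>
WHERE {{
    ?Molecule a cco:SmallMolecule ;
        cco:chemblId ?molecule_chemblid ;
        rdfs:label ?molecule_label ;
        cco:highestDevelopmentPhase 4 ;
        cco:hasMechanism ?mechanism .
    ?mechanism cco:hasTarget chembl_target:{target_id} .
}}
LIMIT {limit}
"""


def approved_drugs_query(target_id: str, limit: int = 100) -> str:
    """Return the tutorial query for one target, validating the ID first."""
    if not re.fullmatch(r"CHEMBL\d+", target_id):
        raise ValueError(f"unexpected target ID: {target_id!r}")
    return QUERY_TEMPLATE.format(target_id=target_id, limit=int(limit))


query = approved_drugs_query("CHEMBL1862")
print(query)
```

The resulting string can be passed to sparql.setQuery() in the SPARQLWrapper example shown earlier in this manual.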

Tutorial 3: Cross-endpoint federated queries

SPARQL supports federated queries using the SERVICE keyword, which allows you to combine data from multiple endpoints in a single query.

Goal: Retrieve UniProt protein entries and link them to their corresponding Reactome pathways.

PREFIX up: <http://purl.uniprot.org/core/>
PREFIX biopax3: <http://www.biopax.org/release/biopax-level3.owl#>

SELECT ?protein ?proteinName ?pathway ?pathwayName
WHERE {
    SERVICE <https://rdfportal.org/sib/sparql> {
        ?protein a up:Protein ;
            up:mnemonic ?proteinName .
        FILTER(CONTAINS(?proteinName, "HUMAN"))
    }
    SERVICE <https://rdfportal.org/ebi/sparql> {
        ?pathway a biopax3:Pathway ;
            biopax3:displayName ?pathwayName .
    }
}
LIMIT 10

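Note: As written, the two SERVICE blocks share no variable, so the query pairs proteins with pathways arbitrarily rather than linking each protein to its own pathways. To obtain linked pairs, add triple patterns that bind a variable common to both blocks (for example, a cross-reference from the pathway data to a UniProt entry, if the dataset provides one). Federated queries can also be slow depending on the size of the intermediate results. Always use LIMIT and apply FILTER conditions to reduce the data transferred between endpoints.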

Tutorial 4: Exploring dataset structure

Before writing queries for a new dataset, it is helpful to explore its structure. The following queries can be used with any endpoint.

List all classes in a dataset:

SELECT DISTINCT ?class (COUNT(?s) AS ?count)
FROM <http://rdf.ebi.ac.uk/dataset/chembl>
WHERE {
    ?s a ?class .
}
GROUP BY ?class
ORDER BY DESC(?count)

List all properties used in a dataset:

SELECT DISTINCT ?property (COUNT(?s) AS ?count)
FROM <http://rdf.ebi.ac.uk/dataset/chembl>
WHERE {
    ?s ?property ?o .
}
GROUP BY ?property
ORDER BY DESC(?count)

Get sample data for a specific class:

PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>

SELECT ?s ?p ?o
FROM <http://rdf.ebi.ac.uk/dataset/chembl>
WHERE {
    ?s a cco:SmallMolecule ;
       ?p ?o .
}
LIMIT 20

7. FAQ and troubleshooting

General questions

Q: Is RDF Portal free to use?
A: Yes. RDF Portal is a publicly funded infrastructure and all data access is free of charge. Individual datasets may have their own licenses; check the “Licenses” field on each dataset’s detail page.

Q: How often is the data updated?
A: Update frequency varies by dataset. Check the Update log page for the latest update history.

Q: Can I submit my own RDF dataset?
A: Yes. Please refer to the Data submission guidelines. All submitted datasets undergo a quality review by DBCLS to ensure compliance with the DBCLS RDF Guidelines.

SPARQL query issues

Q: My query is timing out. What should I do?
A: Try the following approaches:

  1. Add a LIMIT clause to restrict the number of results
  2. Use more specific FILTER conditions to narrow the search
  3. Avoid SELECT * — specify only the variables you need
  4. Use FROM to target a specific named graph rather than querying the entire endpoint
  5. For very large result sets, consider downloading the data files and loading them into a local triplestore
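
One further option for moderately large results, not listed above, is to fetch them in pages with LIMIT and OFFSET. The sketch below only generates the query strings and sends nothing. It assumes the base query ends with an ORDER BY clause so that pages do not overlap. Whether paging is faster than a bulk download on a given endpoint is not something this manual states, so treat it as an idea to test. The paged_queries name is hypothetical.

```python
def paged_queries(base_query: str, page_size: int, max_pages: int):
    """Yield copies of base_query with LIMIT/OFFSET appended.

    base_query must not already contain LIMIT or OFFSET, and should end
    with an ORDER BY clause so that successive pages do not overlap.
    """
    for page in range(max_pages):
        offset = page * page_size
        yield f"{base_query.rstrip()}\nLIMIT {page_size}\nOFFSET {offset}"


base = """
PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>

SELECT ?molecule_chemblid
FROM <http://rdf.ebi.ac.uk/dataset/chembl>
WHERE {
    ?Molecule a cco:SmallMolecule ;
        cco:chemblId ?molecule_chemblid .
}
ORDER BY ?molecule_chemblid
"""

pages = list(paged_queries(base, page_size=1000, max_pages=3))
print(pages[1])
```

In practice, a client would send each page in turn and stop as soon as a page returns fewer rows than page_size.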

Q: How do I know which named graph to use in the FROM clause?
A: Each dataset’s detail page shows the SPARQL endpoint URL and named graph URI. You can also discover named graphs by running:

SELECT DISTINCT ?g WHERE { GRAPH ?g { ?s ?p ?o } }

Q: My query returns no results. What could be wrong?
A: Common causes include:

  1. Incorrect namespace URIs — check the prefix declarations against the dataset’s schema diagram
  2. Wrong named graph — verify the FROM clause matches the dataset’s graph URI
  3. Case sensitivity — URIs and literal values are case-sensitive in SPARQL
  4. Data type mismatches — when filtering by numeric values, ensure the type matches (e.g., integer vs. string)
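
The data type point is easy to miss because the integer 4 and the string "4" look alike in a results table but are different RDF terms, so a pattern written with one does not match the other. In SPARQL JSON results, a typed literal carries a datatype field that shows which one the data uses. The Python sketch below converts bindings accordingly. The sample bindings are illustrative, the to_python helper is a hypothetical name, and it handles only a few common XSD types.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"


def to_python(binding: dict):
    """Convert one SPARQL JSON binding to a Python value using its datatype."""
    value = binding["value"]
    datatype = binding.get("datatype", "")
    if datatype in (XSD + "integer", XSD + "int", XSD + "long"):
        return int(value)
    if datatype in (XSD + "decimal", XSD + "double", XSD + "float"):
        return float(value)
    if datatype == XSD + "boolean":
        return value in ("true", "1")
    return value  # URIs, plain strings, and unhandled datatypes stay as text


# Illustrative bindings, not real endpoint output.
as_integer = {"type": "literal", "datatype": XSD + "integer", "value": "4"}
as_string = {"type": "literal", "value": "4"}

print(repr(to_python(as_integer)))
print(repr(to_python(as_string)))
```

Inspecting the datatype of a few sample values first shows whether a filter should compare against 4 or "4".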

Data and licensing

Q: Can I redistribute the data I download?
A: This depends on the license of each individual dataset. Check the “Licenses” field on the dataset’s detail page. Many datasets are available under open licenses that permit redistribution.

Q: How should I cite RDF Portal?
A: Please cite the following publication, together with the RDF Portal website URL (https://rdfportal.org/) and the specific dataset(s) you used. For individual datasets, cite the original data providers as specified on each dataset’s detail page.

Shuichi Kawashima, Toshiaki Katayama, Hideki Hatanaka, Tatsuya Kushida, Toshihisa Takagi. NBDC RDF portal: a comprehensive repository for semantic data in life sciences. Database (Oxford). 2018 Jan 1;2018:bay123. doi: 10.1093/database/bay123. PubMed: 30576482.

Contact

For questions, bug reports, or feedback, please contact the DBCLS team through the information provided on the About page.