Graphster is an open-source knowledge graph library. It is a Spark-based library purpose-built for scalable, end-to-end knowledge graph construction and querying from unstructured and structured source data. The graphster library takes a collection of documents, extracts mentions and relations to populate a raw knowledge graph, links mentions to entities in Wikidata, and then enriches the knowledge graph with facts from Wikidata. Once the knowledge graph is built, graphster can also help you query it natively using SPARQL.

Give graphster a try!

This README provides instructions on how to use the library in your own project.

There are two main concerns with structured data: quality and complexity. General data quality is always important in data engineering, but here we mean a specific kind of data quality: completeness and consistency. Which fields are null? In what formats are the different data types (e.g. dates, floating-point numbers, booleans) stored? Complexity is the other ingredient we must manage. Transforming a 30-column CSV into a set of triples is very different from transforming a database with dozens of tables in third normal form.
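As a rough illustration of the completeness check and the simple CSV case described above, here is a minimal sketch in plain Python. It is not part of graphster; the sample data, the `profile_nulls` and `rows_to_triples` helpers, and the `http://example.org/` namespace are all assumptions made up for this example.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data; the column names are illustrative only.
SAMPLE_CSV = """id,name,founded,active
1,Acme,1999-04-01,true
2,Globex,,false
3,,04/01/2005,yes
"""


def profile_nulls(rows):
    """Count empty values per column to gauge completeness."""
    nulls = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or value.strip() == "":
                nulls[column] += 1
    return dict(nulls)


def rows_to_triples(rows, id_column, base="http://example.org/"):
    """Turn each non-empty cell into a (subject, predicate, object) triple."""
    triples = []
    for row in rows:
        subject = f"{base}entity/{row[id_column]}"
        for column, value in row.items():
            if column == id_column or not value or not value.strip():
                continue  # skip the key column and missing values
            triples.append((subject, f"{base}property/{column}", value.strip()))
    return triples


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
null_counts = profile_nulls(rows)
triples = rows_to_triples(rows, id_column="id")
```

Note that the sample deliberately mixes date formats (`1999-04-01` and `04/01/2005`) and boolean spellings (`true` and `yes`). The sketch passes them through unchanged; a real pipeline would normalize them before emitting triples.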

Text

In order to add text to a graph, we must extract the information we are interested in into a structured format. This is where NLP comes in. This library is not an NLP library, which is why there is an abstraction layer. The idea is that the information is extracted into a structured form, so that the downstream process does not need to know which engine was used for NLP.

The minimal requirement for an NLP library to serve as an engine is named entity recognition. However, syntactic parsing, entity linking, and relationship extraction can also be utilized when the engine supports them.

The wisecube-text module acts as an interface to an NLP engine. There is an implementation with JSL Spark NLP.
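The idea of an engine-agnostic abstraction layer can be sketched as follows. This is not the wisecube-text API; `Mention`, `NlpEngine`, `DictionaryEngine`, and `mentions_to_rows` are hypothetical names, and the dictionary lookup is only a stand-in for a real named entity recognition model.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


@dataclass(frozen=True)
class Mention:
    """Engine-independent record of one recognized entity mention."""
    text: str
    label: str
    start: int
    end: int


class NlpEngine(Protocol):
    """Minimal contract: any engine only has to provide entity recognition."""

    def extract_mentions(self, text: str) -> List[Mention]: ...


class DictionaryEngine:
    """Toy engine that finds exact dictionary terms (stand-in for a real model)."""

    def __init__(self, terms: Dict[str, str]):
        self.terms = terms  # maps term -> entity label

    def extract_mentions(self, text: str) -> List[Mention]:
        mentions = []
        for term, label in self.terms.items():
            start = text.find(term)
            while start != -1:
                mentions.append(Mention(term, label, start, start + len(term)))
                start = text.find(term, start + 1)
        return sorted(mentions, key=lambda m: m.start)


def mentions_to_rows(doc_id: str, engine: NlpEngine, text: str) -> List[dict]:
    """Downstream code sees plain rows and never touches the engine internals."""
    return [
        {"doc_id": doc_id, "mention": m.text, "label": m.label,
         "start": m.start, "end": m.end}
        for m in engine.extract_mentions(text)
    ]


engine = DictionaryEngine({"aspirin": "Chemical", "COX-1": "Protein"})
rows = mentions_to_rows("doc-1", engine, "aspirin inhibits COX-1")
```

Swapping in a different engine only requires another class with the same `extract_mentions` method; `mentions_to_rows` and everything after it stay unchanged.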

Fusion

The fusion step is where we take the structured data that has been cleaned, transformed, or extracted and map it into the schema of the graph we are building. The first step of fusing new data into a graph is matching it against what is already there, that is, matching entities in the new data to entities already in the graph. The next step is mapping the kinds of relationships and properties to predicates.
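The two steps above, matching entities and then mapping fields to predicates, can be sketched in plain Python. This is an illustrative sketch rather than graphster's fusion code: the `fuse` helper, the entity IDs, and the `ex:birthYear` predicate are hypothetical, and matching on a normalized label is a deliberately naive assumption (real entity matching usually needs more evidence than a name).

```python
def normalize(label):
    """Lowercase and collapse whitespace so trivially different labels match."""
    return " ".join(label.lower().split())


def build_label_index(graph_entities):
    """graph_entities maps entity ID -> label for entities already in the graph."""
    return {normalize(label): entity_id for entity_id, label in graph_entities.items()}


def fuse(records, label_index, predicate_map):
    """Step 1: match each record to an existing entity.
    Step 2: map its fields to predicates of the target schema."""
    triples, unmatched = [], []
    for record in records:
        entity_id = label_index.get(normalize(record["name"]))
        if entity_id is None:
            unmatched.append(record["name"])  # left for review or entity creation
            continue
        for field, value in record.items():
            predicate = predicate_map.get(field)
            if predicate is not None and value:
                triples.append((entity_id, predicate, value))
    return triples, unmatched


graph_entities = {"Q1": "Jane Doe", "Q2": "John Smith"}
records = [
    {"name": "jane  doe", "birth_year": "1980"},
    {"name": "Unknown Person", "birth_year": "1975"},
]
predicate_map = {"birth_year": "ex:birthYear"}

triples, unmatched = fuse(records, build_label_index(graph_entities), predicate_map)
```

Records that match nothing are returned separately instead of being dropped, so a later step can decide whether they are new entities or matching failures.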

Mapping to schema

There are two reasons to have custom transformations at this stage. The first is dealing with differences in the conceptual design between the new data and the graph. The second is differences in the conventions for recording properties.

For example, your graph may have an "author" relationship from documents to authors, while the new data has a "wrote" relationship from authors to documents.

graph TB
    A{123} -- author --> B{456}
    A -- rdfs:label --> C(Important Article) 
    B -- rdfs:label --> D(Jane Doe) 
    
    F{123} -- wrote --> E{456}
    E -- rdfs:label --> G(Important Article) 
    F -- rdfs:label --> H(Jane Doe)
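In code, mapping "wrote" onto "author" amounts to swapping subject and object. A minimal sketch, assuming triples are plain `(subject, predicate, object)` tuples; the `reverse_predicate` helper and the `person:1` and `doc:9` identifiers are hypothetical.

```python
def reverse_predicate(triples, source_predicate, target_predicate):
    """Rewrite source_predicate edges as target_predicate edges pointing the other way."""
    mapped = []
    for subject, predicate, obj in triples:
        if predicate == source_predicate:
            mapped.append((obj, target_predicate, subject))  # swap the direction
        else:
            mapped.append((subject, predicate, obj))  # leave other triples untouched
    return mapped


new_data = [
    ("person:1", "wrote", "doc:9"),
    ("person:1", "rdfs:label", "Jane Doe"),
    ("doc:9", "rdfs:label", "Important Article"),
]
mapped = reverse_predicate(new_data, "wrote", "author")
```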

This is a simple example: the direction of the relationship need only be reversed. Let's consider an example with a deeper difference. Suppose the data being fused treats certain closely related terms as a single entity (metonymy). For example, let's say the graph contains proteins, genes, and chemicals, while the data that is being added has only genes and chemicals. In this data, a relationship between a gene and a chemical may actually represent a relationship between the protein encoded by the gene and the chemical.

graph LR
    A((Gene)) -- encodes --> B((Protein))
    B((Protein)) -- interacts with --> C((Chemical))
    C((Chemical)) -- interacts with --> B((Protein))
    
    D((Gene)) -- interacts with --> E((Chemical))
    E((Chemical)) -- interacts with --> D((Gene))

How this is mapped to the target schema depends on what other data is available as well as how flexible the schema is. If there is additional information about the gene-chemical edges that we can use to deduce the protein (e.g. pathway information), that can be used. Another option, if no such information is available, is to overload the "interacts with" edge in the target graph to allow gene-chemical relations.

graph LR
    A((Gene)) -- encodes --> B((Protein))
    B((Protein)) -- interacts with --> C((Chemical))
    C((Chemical)) -- interacts with --> B((Protein))
    A((Gene)) -- interacts with --> C((Chemical))

If such overloading is not possible in the target schema, then these edges can be represented with a special edge.

graph LR
    A((Gene)) -- encodes --> B((Protein))
    B((Protein)) -- interacts with --> C((Chemical))
    C((Chemical)) -- interacts with --> B((Protein))
    A((Gene)) -- interacts through protein with --> C((Chemical))

Another option is to use blank nodes and to try to resolve them with other data.

graph LR
    A((Gene)) -- encodes --> B((Protein))
    B((Protein)) -- interacts with --> C((Chemical))
    C((Chemical)) -- interacts with --> B((Protein))
    A((Gene)) -- encodes --> D((_:blank))
    D((_:blank)) -- interacts with --> C((Chemical))
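The options above can be compared side by side in a short sketch. This is an illustration only, not graphster code: `map_gene_chemical` is a hypothetical helper, and the `gene_to_protein` lookup is an assumed stand-in for whatever additional evidence (such as pathway information) lets us deduce the protein.

```python
def map_gene_chemical(edges, gene_to_protein, strategy="overload"):
    """Map gene-chemical edges from the new data onto the target schema.

    edges: (gene, chemical) pairs from the new data.
    gene_to_protein: hypothetical lookup of genes whose protein can be deduced.
    strategy: fallback when the protein is unknown:
        "overload", "special_edge", or "blank_node".
    """
    triples = []
    blank_counter = 0
    for gene, chemical in edges:
        protein = gene_to_protein.get(gene)
        if protein is not None:
            # The protein could be deduced, so use the regular edge.
            triples.append((protein, "interacts with", chemical))
        elif strategy == "overload":
            triples.append((gene, "interacts with", chemical))
        elif strategy == "special_edge":
            triples.append((gene, "interacts through protein with", chemical))
        elif strategy == "blank_node":
            blank_counter += 1
            blank = f"_:b{blank_counter}"  # placeholder protein, resolved later
            triples.append((gene, "encodes", blank))
            triples.append((blank, "interacts with", chemical))
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return triples


edges = [("geneA", "chem1"), ("geneB", "chem2")]
gene_to_protein = {"geneA": "protA"}

overloaded = map_gene_chemical(edges, gene_to_protein, "overload")
with_blanks = map_gene_chemical(edges, gene_to_protein, "blank_node")
```

The blank node variant keeps the target schema strict at the cost of extra nodes that must be resolved later; the overload and special-edge variants are simpler but put gene-chemical edges directly into the graph.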
