Introduction to Neo4j:
- Neo4j is a graph database management system developed by Neo Technology.
- ACID-compliant transactional database with native graph storage and processing.
- The Community Edition is open source and freely available, and Neo4j is the leading graph database.
Scope of this project:
This project aims to develop a graph-based database in Neo4j for faster and more efficient data storage and retrieval. We modeled the EGGNOG dataset in Neo4j. The primary motivation for exploring graph databases was to characterize the relations between gestures, sessions, and participants. We chose Neo4j because its Community Edition is freely available and it is the leading graph database. Neo4j uses a powerful query language known as Cypher, which can express complex relationships and patterns between the nodes that represent data points.
Neo4j Graph Model:
- Each gesture label is assigned to a node.
- Timing properties such as StartTimeStamp, EndTimeStamp, StartFrame, and EndFrame are stored as properties of the individual gesture node.
- Each session has a node with the properties SessionName, Participant Number, Block Video Number, and the path of the video file.
- Participants are represented as nodes with details: age, gender, dominantHand.
- Two nodes, With Sound and No Sound, correspond to the type of experiment.
- Uniqueness constraints are enforced on each node.
- Two nodes are connected by a connector representing the relationship between them.
- This database has three relationship types: PERFORMED_BY, PROJECT_TYPE, and PRESENT_IN.
- PERFORMED_BY: from a gesture to a participant.
- PRESENT_IN: from a gesture to a session.
- PROJECT_TYPE: indicates whether the experiment type is With Sound or No Sound.
- Nodes and relationships are constructed using the CREATE command.
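The model above can be sketched as a handful of Cypher statements, held here as Python strings so they can be inspected without a running database. This is a minimal sketch under stated assumptions: the node labels (Participant, Session, Gesture), the key property `number`, and the helper `gesture_statement` are hypothetical names, not taken from the project. The constraint syntax shown is the one used by recent Neo4j releases; older releases use `CREATE CONSTRAINT ON ... ASSERT ...` instead. PROJECT_TYPE is left out because its endpoints are not specified above.

```python
# Hypothetical labels and key properties; only the property names
# SessionName, StartFrame and EndFrame come from the model description.
CONSTRAINTS = [
    "CREATE CONSTRAINT participant_number IF NOT EXISTS "
    "FOR (p:Participant) REQUIRE p.number IS UNIQUE",
    "CREATE CONSTRAINT session_name IF NOT EXISTS "
    "FOR (s:Session) REQUIRE s.SessionName IS UNIQUE",
]


def gesture_statement(label, participant, session, start_frame, end_frame):
    """Return (cypher, params) that create one gesture node and its two links.

    The participant and session nodes are assumed to exist already.
    """
    cypher = (
        "MATCH (p:Participant {number: $participant}), "
        "(s:Session {SessionName: $session}) "
        "CREATE (g:Gesture {label: $label, StartFrame: $start_frame, "
        "EndFrame: $end_frame}) "
        "CREATE (g)-[:PERFORMED_BY]->(p) "
        "CREATE (g)-[:PRESENT_IN]->(s)"
    )
    params = {
        "label": label,
        "participant": participant,
        "session": session,
        "start_frame": start_frame,
        "end_frame": end_frame,
    }
    return cypher, params


if __name__ == "__main__":
    for statement in CONSTRAINTS:
        print(statement)
    print(gesture_statement("wave", 4, "Session 2", 10, 55))
```

Passing values as parameters rather than formatting them into the query text keeps the statement reusable for every gesture record. Executing the statements would additionally need a database driver session, which is omitted here.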
Here is an example of a Cypher query:
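As an illustration only, and not necessarily the query used in this project, a query that starts from selected sessions and follows PRESENT_IN and PERFORMED_BY to reach the participants could look like the sketch below. The labels Session, Gesture, and Participant, the property `number`, the session names, and the helper `build_session_query` are all assumptions made for this example.

```python
# Hypothetical query text: labels and the `number` property are assumed.
SESSION_PARTICIPANTS_QUERY = (
    "MATCH (s:Session)<-[:PRESENT_IN]-(g:Gesture)"
    "-[:PERFORMED_BY]->(p:Participant) "
    "WHERE s.SessionName IN $sessions "
    "RETURN DISTINCT s.SessionName AS session, p.number AS participant"
)


def build_session_query(session_names):
    """Pair the query text with its parameter map."""
    return SESSION_PARTICIPANTS_QUERY, {"sessions": list(session_names)}


if __name__ == "__main__":
    query, params = build_session_query(["Session 2", "Session 3"])
    print(query)
    print(params)
```

The pattern reads left to right as a path: a session, the gestures present in it, and the participant who performed each gesture.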
Advantages of Neo4j over PostgreSQL:
- Avoids costly JOIN operations.
- Ability to represent intuitive relationships that naturally exist between the nodes.
- Ability to assign properties to nodes that would otherwise be represented as columns in an RDBMS.
- Query execution time is independent of the number of nodes if the dataset is modeled in an appropriate structure.
Analysis of EGGNOG Graph Database:
Neo4j compared with PostgreSQL:
- Across varying database sizes, we observed that query execution time was independent of the size of the database, measured by the number of nodes.
- For the Cypher query shown above, Neo4j scans only the nodes connected to Sessions 2 and 3 to get information about 'PARTICIPANT 4' and 'PARTICIPANT 5', unlike PostgreSQL, where all the entries in the 'PARTICIPANT' field are scanned for results.
- The query time for PostgreSQL grows linearly with the number of records.
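The difference between the two access patterns can be illustrated with a toy in-memory model. This is a simplified sketch, not a benchmark of either system: the adjacency lists stand in for graph traversal, the row list stands in for an unindexed table scan, and the data (one participant per session, numbered session + 2) is invented so that Sessions 2 and 3 lead to Participants 4 and 5. All function names are hypothetical.

```python
from collections import defaultdict


class TinyGraph:
    """Adjacency lists keyed by node, in both directions."""

    def __init__(self):
        self.out_edges = defaultdict(list)  # node -> [(rel_type, target)]
        self.in_edges = defaultdict(list)   # node -> [(rel_type, source)]

    def add_edge(self, source, rel_type, target):
        self.out_edges[source].append((rel_type, target))
        self.in_edges[target].append((rel_type, source))


def build_toy_data(n_sessions, gestures_per_session):
    """Return the same toy data as a graph and as flat (session, participant) rows."""
    graph, rows = TinyGraph(), []
    for s in range(1, n_sessions + 1):
        session, participant = ("Session", s), ("Participant", s + 2)
        for g in range(gestures_per_session):
            gesture = ("Gesture", f"s{s}-g{g}")
            graph.add_edge(gesture, "PRESENT_IN", session)
            graph.add_edge(gesture, "PERFORMED_BY", participant)
            rows.append((session, participant))
    return graph, rows


def participants_by_traversal(graph, sessions):
    """Follow edges outward from the given sessions; count edges touched."""
    found, touched = set(), 0
    for session in sessions:
        for rel, gesture in graph.in_edges[session]:
            touched += 1
            if rel != "PRESENT_IN":
                continue
            for rel2, target in graph.out_edges[gesture]:
                touched += 1
                if rel2 == "PERFORMED_BY":
                    found.add(target)
    return found, touched


def participants_by_scan(rows, sessions):
    """Check every row, as an unindexed table scan would; count rows touched."""
    wanted, found, touched = set(sessions), set(), 0
    for session, participant in rows:
        touched += 1
        if session in wanted:
            found.add(participant)
    return found, touched


if __name__ == "__main__":
    targets = [("Session", 2), ("Session", 3)]
    for n in (10, 100):
        graph, rows = build_toy_data(n, 5)
        print(n, participants_by_traversal(graph, targets)[1],
              participants_by_scan(rows, targets)[1])
```

In this model the traversal touches the same number of edges whether there are 10 or 100 sessions, while the scan touches every row. A relational table with a suitable index would avoid the full scan, so the sketch reflects only the unindexed case described above.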
Caching and Slow Start in Neo4j:
- For 25K nodes, the execution time for the first query after startup was 700 to 900 ms.
- However, the execution time drops sharply, to about 50 ms, when the same query is run again.
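Cold and warm timings of this kind can be collected with a small harness such as the one below. It is a sketch under assumptions: `run_query` is a placeholder for whatever callable issues the query, and the helper names `time_runs` and `cold_vs_warm` are hypothetical. The clock is passed in as a parameter so the harness can be checked without a database.

```python
import time


def time_runs(run_query, repeats=5, clock=time.perf_counter):
    """Call run_query `repeats` times; return each duration in milliseconds."""
    durations = []
    for _ in range(repeats):
        start = clock()
        run_query()
        durations.append((clock() - start) * 1000.0)
    return durations


def cold_vs_warm(durations):
    """Return (first run, mean of the remaining runs), both in milliseconds."""
    if len(durations) < 2:
        raise ValueError("need at least two runs to compare cold and warm")
    warm = durations[1:]
    return durations[0], sum(warm) / len(warm)


if __name__ == "__main__":
    # Placeholder workload; replace with a callable that runs the real query.
    timings = time_runs(lambda: sum(range(100_000)))
    print(cold_vs_warm(timings))
```

Reporting the first run separately from the average of later runs keeps the startup cost from being hidden in a single mean.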