@vikpande
Last active November 15, 2019 17:03
neo4j_workshop_amsterdam - Refactoring large graphs (Excerpt from the training)
In the Neo4j Desktop browser console:
To connect to the graph
- Connect: :play http://guides.neo4j.com/modeling_sandbox/05_refactoring_large_graphs.html
Steps:
- As our graph grows, refactoring the whole thing in one go becomes infeasible. Instead we’ll have to update it in batches.
Manual batching
When batching we sacrifice the atomicity that we’d get if we did everything in one transaction. It’s therefore useful to make our refactoring queries idempotent in case we need to re-run them. We also need to decide which node we’re going to center the refactoring around.
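To make the idea of idempotency concrete, here is a minimal sketch that can be run several times in a row. The id value "LAS_2008-1-3" and the alias nodesWithThisId are chosen for illustration only. Because MERGE matches an existing node before creating a new one, the count should stay at 1 on every run, assuming no duplicates existed beforehand.

```cypher
// MERGE creates the node only if no AirportDay with this id exists yet;
// on later runs it simply matches the node created earlier.
MERGE (ad:AirportDay {id: "LAS_2008-1-3"})
ON CREATE SET ad.date = "2008-1-3"
WITH ad
// Count how many AirportDay nodes carry the same id after the MERGE.
MATCH (same:AirportDay {id: ad.id})
RETURN count(same) AS nodesWithThisId
```

Replacing MERGE with CREATE in the same statement would add one more node per run, which is exactly what we want to avoid when a batch might be re-executed.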
To recap, this was the refactoring query from the previous section:
MATCH (origin:Airport)<-[:ORIGIN]-(flight:Flight)-[:DESTINATION]->(destination:Airport)
MERGE (originAirportDay:AirportDay {id: origin.code + "_" + flight.date})
ON CREATE SET originAirportDay.date = flight.date
MERGE (destinationAirportDay:AirportDay {id: destination.code + "_" + flight.date})
ON CREATE SET destinationAirportDay.date = flight.date
MERGE (origin)-[:HAS_DAY]->(originAirportDay)
MERGE (flight)-[:ORIGIN]->(originAirportDay)
MERGE (flight)-[:DESTINATION]->(destinationAirportDay)
MERGE (destination)-[:HAS_DAY]->(destinationAirportDay)
Flight is probably the easiest node to batch on.
Before we execute the batching workflow, let’s import a few more flights to keep it interesting.
Importing 100,000 flights
As we’re now dealing with much more data we’ll have to be a bit cleverer about how we import the data.
We know that most of the airports are going to be duplicates so there’s no point calling MERGE loads of times. Instead we’ll find the distinct set of airports and only MERGE on each airport once:
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/training/master/modeling/data/flights_100k.csv" AS row
UNWIND [row.Origin, row.Dest] AS airport
WITH DISTINCT airport
MERGE (:Airport {code: airport})
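To get a feel for how much work DISTINCT saves, a query along these lines compares the number of airport references in the file with the number of distinct airport codes. It only reads the CSV and writes nothing to the graph. The aliases airportReferences and distinctAirports are names made up for this sketch.

```cypher
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/training/master/modeling/data/flights_100k.csv" AS row
// Each row contributes two airport codes: its origin and its destination.
UNWIND [row.Origin, row.Dest] AS airport
// The first count is how many MERGE calls we would make without DISTINCT,
// the second is how many we make with it.
RETURN count(airport) AS airportReferences,
       count(DISTINCT airport) AS distinctAirports
```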
We’ll also use the periodic commit functionality of LOAD CSV. This will flush the transaction every 10,000 rows rather than executing the whole query in one transaction.
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/training/master/modeling/data/flights_100k.csv" AS row
MATCH (origin:Airport {code: row.Origin})
MATCH (destination:Airport {code: row.Dest})
MERGE (newFlight:Flight { id: row.UniqueCarrier + row.FlightNum + "_" + row.Year + "-" + row.Month + "-" + row.DayofMonth + "_" + row.Origin + "_" + row.Dest } )
ON CREATE SET newFlight.date = toInteger(row.Year) + "-" + toInteger(row.Month) + "-" + toInteger(row.DayofMonth),
newFlight.airline = row.UniqueCarrier,
newFlight.number = row.FlightNum,
newFlight.departure = toInteger(row.CRSDepTime),
newFlight.arrival = toInteger(row.CRSArrTime)
MERGE (origin)<-[:ORIGIN]-(newFlight)
MERGE (newFlight)-[:DESTINATION]->(destination)
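Before moving on, a quick sanity check can confirm that the import produced what we expect. The following is a sketch rather than part of the original guide, and the aliases are illustrative. It counts all Flight nodes and, separately, the flights that are connected to both an origin and a destination Airport.

```cypher
MATCH (f:Flight)
// OPTIONAL MATCH keeps flights that lack one of the two airports;
// for those rows origin and destination are null.
OPTIONAL MATCH (origin:Airport)<-[:ORIGIN]-(f)-[:DESTINATION]->(destination:Airport)
RETURN count(DISTINCT f) AS flights,
       // CASE without ELSE yields null, which count() ignores.
       count(DISTINCT CASE WHEN origin IS NOT NULL THEN f END) AS flightsWithBothAirports
```

If the two numbers differ, some flights are missing an airport relationship and would never match the refactoring pattern used below.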
Now we’re ready to do some batch refactoring.
Batch refactoring flights
Let’s put a Process label on each of our Flight nodes so that we know which ones we’ve still got to process.
MATCH (f:Flight)
SET f:Process
Now we’re ready to run the refactoring query. We’ll start by processing 500 flights at a time:
MATCH (flight:Process)
WITH flight LIMIT 500
MATCH (origin:Airport)<-[:ORIGIN]-(flight)-[:DESTINATION]->(destination:Airport)
MERGE (originAirportDay:AirportDay {id: origin.code + "_" + flight.date})
ON CREATE SET originAirportDay.date = flight.date
MERGE (destinationAirportDay:AirportDay {id: destination.code + "_" + flight.date})
ON CREATE SET destinationAirportDay.date = flight.date
MERGE (origin)-[:HAS_DAY]->(originAirportDay)
MERGE (originAirportDay)<-[:ORIGIN]-(flight)
MERGE (flight)-[:DESTINATION]->(destinationAirportDay)
MERGE (destination)-[:HAS_DAY]->(destinationAirportDay)
REMOVE flight:Process
RETURN COUNT(*)
We’d have to run this query 100,000 / 500 = 200 times to process all the flights, which would be a very boring way to pass the time!
Luckily, the APOC library that we installed in the previous section has a procedure that we can use to batch operations.
The following procedure is the one we want:
CALL apoc.help("apoc.periodic.commit")
We can also pass the apoc.help procedure a package name and it’ll show us all the procedures in that package, e.g.:
CALL apoc.help("apoc.periodic")
Let’s get on with the batch refactoring.
Since we’ve imported more nodes we’ll need to mark them with the Process label. For simplicity’s sake we’ll just put the Process label on all our flights and process them all again.
MATCH (f:Flight)
SET f:Process
Hint: Remember, since our query is idempotent, if a flight has already been processed the query won’t change anything for that flight.
We can now call our refactoring query inside the procedure:
call apoc.periodic.commit('
MATCH (flight:Process)
WITH flight LIMIT $limit
MATCH (origin:Airport)<-[:ORIGIN]-(flight)-[:DESTINATION]->(destination:Airport)
MERGE (originAirportDay:AirportDay {id: origin.code + "_" + flight.date})
ON CREATE SET originAirportDay.date = flight.date
MERGE (destinationAirportDay:AirportDay {id: destination.code + "_" + flight.date})
ON CREATE SET destinationAirportDay.date = flight.date
MERGE (origin)-[:HAS_DAY]->(originAirportDay)
MERGE (originAirportDay)<-[:ORIGIN]-(flight)
MERGE (flight)-[:DESTINATION]->(destinationAirportDay)
MERGE (destination)-[:HAS_DAY]->(destinationAirportDay)
REMOVE flight:Process
RETURN COUNT(*)
',{limit:500})
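As an alternative, APOC also offers apoc.periodic.iterate, which takes one query that selects the items and a second query that is applied to them in batches. The following is a sketch of the same refactoring expressed that way, not the approach used in the guide. It assumes an APOC version that passes the returned flight variable into the second statement automatically; very old releases may need it declared explicitly. The parallel option is set to false because concurrent MERGE operations on the same AirportDay nodes could block each other.

```cypher
CALL apoc.periodic.iterate(
  // First statement: select every flight that still needs processing.
  'MATCH (flight:Process) RETURN flight',
  // Second statement: applied to the selected flights, 500 per transaction.
  'MATCH (origin:Airport)<-[:ORIGIN]-(flight)-[:DESTINATION]->(destination:Airport)
   MERGE (originAirportDay:AirportDay {id: origin.code + "_" + flight.date})
   ON CREATE SET originAirportDay.date = flight.date
   MERGE (destinationAirportDay:AirportDay {id: destination.code + "_" + flight.date})
   ON CREATE SET destinationAirportDay.date = flight.date
   MERGE (origin)-[:HAS_DAY]->(originAirportDay)
   MERGE (originAirportDay)<-[:ORIGIN]-(flight)
   MERGE (flight)-[:DESTINATION]->(destinationAirportDay)
   MERGE (destination)-[:HAS_DAY]->(destinationAirportDay)
   REMOVE flight:Process',
  {batchSize: 500, parallel: false})
YIELD batches, total, errorMessages
RETURN batches, total, errorMessages
```

Unlike apoc.periodic.commit, this form does not need a LIMIT or a returned count, since the batching is driven by the batchSize setting.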
Check the import worked
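Run the following query to check that the refactoring worked. It counts the nodes that still carry the Process label, so a result of 0 means no flights are left to process: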
MATCH (:Process)
RETURN COUNT(*)
Try repeating some of the queries from earlier sections with this new larger dataset. You can see the previous queries you’ve run by executing the following command:
:history
Exercise: Specific date relationships
We forgot to add the specific date relationships between :Airport and :AirportDay nodes that we introduced in the previous section!
Can you write a refactoring query using apoc to do this?
Hint: We’ll need to figure out how not to create duplicate relationships between :Airport and :AirportDay nodes that we processed in the previous guide.
Answer: Specific date relationships
This time we need to process :AirportDay nodes so we’ll put the temporary :Process label on those:
MATCH (ad:AirportDay)
SET ad:Process
The simplest way to not create duplicate date relationships between :Airport and :AirportDay nodes is to delete the ones we created earlier:
MATCH (airport:Airport)-[r]->(:AirportDay)
WHERE NOT TYPE(r) = "HAS_DAY"
DELETE r
Now we can create the new relationships:
call apoc.periodic.commit('
MATCH (ad:Process)
WITH ad LIMIT $limit
MATCH (origin:Airport)-[hasDay:HAS_DAY]->(ad:AirportDay)
CALL apoc.create.relationship(startNode(hasDay), ad.date, {}, endNode(hasDay) ) YIELD rel
REMOVE ad:Process
RETURN COUNT(*)
',{limit:500})
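To check the result, a query like the following lists the relationship types that now connect a single airport to its :AirportDay nodes. It is a sketch for inspection only; the airport code "LAS" is used because it appears in the queries of this guide, and the aliases are illustrative. We would expect to see HAS_DAY alongside one date-named type per day.

```cypher
MATCH (:Airport {code: "LAS"})-[r]->(:AirportDay)
// Group by relationship type to see HAS_DAY next to the date-specific types.
RETURN type(r) AS relationshipType, count(*) AS relationships
ORDER BY relationshipType
```

A count above 1 for any date-named type would suggest that duplicate relationships were created.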
Specific vs general
Now let’s go back and compare the queries from the end of the previous guide.
PROFILE
MATCH (origin:Airport {code: "LAS"})-[:`2008-1-3`]->(:AirportDay)<-[:ORIGIN]-(flight:Flight),
(flight)-[:DESTINATION]->(:AirportDay)<-[:`2008-1-3`]-(destination:Airport {code: "MDW"})
RETURN *
vs
PROFILE
MATCH (origin:Airport {code: "LAS"})-[:HAS_DAY]->(:AirportDay {date: "2008-1-3"})<-[:ORIGIN]-(flight:Flight),
(flight)-[:DESTINATION]->(:AirportDay {date: "2008-1-3"})<-[:HAS_DAY]-(destination:Airport {code: "MDW"})
RETURN *
The number of db hits has increased for the second query since we’ve now imported roughly 20 extra days for the airport. This means that we need to check more :AirportDay(date) properties each time we traverse HAS_DAY relationships.
The number of db hits for the first query hasn’t changed.
Next
Thus far we haven’t been deleting the old model when we refactored it.
Later:
In one of the next gists, we’ll look at the advantages/disadvantages of having multiple models in the graph.