vikpande · November 15, 2019 17:03
diff --git a/neo4j workshop b/neo4j workshop
 On Neo4J Desktop browser console : 
 To connect to the graph 
 - Connect:play http://guides.neo4j.com/modeling_sandbox/05_refactoring_large_graphs.html
 Steps: 
 - As our graph gets bigger in size it starts to become unfeasible to refactor the whole thing in one go. Instead we’ll have to update it in batches.

 Manual batching

 When batching we sacrifice the atomicity that we’d get if we did everything in one transaction. It’s therefore useful to make our refactoring queries idempotent in case we need to re-run them. We also need to decide which node we’re going to center the refactoring around.

 To recap, this was the refactoring query from the previous section:

 MATCH (origin:Airport)<-[:ORIGIN]-(flight:Flight)-[:DESTINATION]->(destination:Airport)

 MERGE (originAirportDay:AirportDay {id: origin.code + "_" + flight.date})
 ON CREATE SET originAirportDay.date = flight.date

 MERGE (destinationAirportDay:AirportDay {id: destination.code + "_" + flight.date})
 ON CREATE SET destinationAirportDay.date = flight.date

 MERGE (origin)-[:HAS_DAY]->(originAirportDay)
 MERGE (flight)-[:ORIGIN]->(originAirportDay)
 MERGE (flight)-[:DESTINATION]-(destinationAirportDay)
 MERGE (destination)-[:HAS_DAY]->(destinationAirportDay)

 Flight is probably the easiest node to batch on.
 Before we execute the batching workflow, let’s import a few more flights to keep it interesting.

 Importing 100,000 flights

 As we’re now dealing with much more data we’ll have to be a bit cleverer about how we import the data.

 We know that most of the airports are going to be duplicates so there’s no point calling MERGE loads of times. Instead we’ll find the distinct set of airports and only MERGE on each airport once:

 LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/training/master/modeling/data/flights_100k.csv" AS row
 UNWIND [row.Origin, row.Dest] AS airport
 WITH DISTINCT airport
 MERGE (:Airport {code: airport})

 We’ll also use the periodic commit functionality of LOAD CSV. This will flush the transaction every 10,000 rows rather than executing the whole query in one transaction.

 USING PERIODIC COMMIT 10000
 LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/training/master/modeling/data/flights_100k.csv" AS row
 MATCH (origin:Airport {code: row.Origin})
 MATCH (destination:Airport {code: row.Dest})
 MERGE (newFlight:Flight { id: row.UniqueCarrier + row.FlightNum + "_" + row.Year + "-" + row.Month + "-" + row.DayofMonth + "_" + row.Origin + "_" + row.Dest }   )
 ON CREATE SET newFlight.date = toInteger(row.Year) + "-" + toInteger(row.Month) + "-" + toInteger(row.DayofMonth),
              newFlight.airline = row.UniqueCarrier,
              newFlight.number = row.FlightNum,
              newFlight.departure = toInteger(row.CRSDepTime),
              newFlight.arrival = toInteger(row.CRSArrTime)
 MERGE (origin)<-[:ORIGIN]-(newFlight)
 MERGE (newFlight)-[:DESTINATION]->(destination)

 Now we’re ready to do some batch refactoring.


 Batch refactoring flights

 Let’s put a Process label on each of our Flight nodes so that we know which ones we’ve still got to process.

 MATCH (f:Flight)
 SET f:Process

 Now we’re ready to run the refactoring query. We’ll start by processing 500 flights at a time:
 MATCH (flight:Process)
 WITH flight LIMIT 500

 MATCH (origin:Airport)<-[:ORIGIN]-(flight)-[:DESTINATION]->(destination:Airport)

 MERGE (originAirportDay:AirportDay {id: origin.code + "_" + flight.date})
 ON CREATE SET originAirportDay.date = flight.date

 MERGE (destinationAirportDay:AirportDay {id: destination.code + "_" + flight.date})
 ON CREATE SET destinationAirportDay.date = flight.date

 MERGE (origin)-[:HAS_DAY]->(originAirportDay)
 MERGE (originAirportDay)<-[:ORIGIN]-(flight)
 MERGE (flight)-[:DESTINATION]-(destinationAirportDay)
 MERGE (destination)-[:HAS_DAY]->(destinationAirportDay)

 REMOVE flight:Process
 RETURN COUNT(*)

 We’d have to run this query 100,000 / 500 = 200 times to process all the flights, which would be a very boring way to pass the time!

 Lucky for us, the apoc library that we installed in the previous section has a procedure that we can use to batch operations.

 The following procedure is the one we want:

 CALL apoc.help("apoc.periodic.commit")
 We can also pass the apoc.help procedure a package name and it’ll show us all the procedures in that package. e.g.

 CALL apoc.help("apoc.periodic")
 Let’s get on with the batch refactoring.

 Since we’ve imported more nodes we’ll need to tag them with the Process label. For simplicity’s sake we’ll just put the Process tag on all our flights and process them all again.

 MATCH (f:Flight)
 SET f:Process
 Hint Remember, since our query is idempotent, if a flight has already been processed before the query won’t actually do anything with that flight.

 We can now call our refactoring query inside the procedure:

 call apoc.periodic.commit('
  MATCH (flight:Process)
  WITH flight LIMIT {limit}

  MATCH (origin:Airport)<-[:ORIGIN]-(flight)-[:DESTINATION]->(destination:Airport)

  MERGE (originAirportDay:AirportDay {id: origin.code + "_" + flight.date})
  ON CREATE SET originAirportDay.date = flight.date

  MERGE (destinationAirportDay:AirportDay {id: destination.code + "_" + flight.date})
  ON CREATE SET destinationAirportDay.date = flight.date

  MERGE (origin)-[:HAS_DAY]->(originAirportDay)
  MERGE (originAirportDay)<-[:ORIGIN]-(flight)
  MERGE (flight)-[:DESTINATION]-(destinationAirportDay)
  MERGE (destination)-[:HAS_DAY]->(destinationAirportDay)

  REMOVE flight:Process
  RETURN COUNT(*)
 ',{limit:500})

 Check the import worked

 Run the following query to check our import worked:

 MATCH (:Process)
 RETURN COUNT(*)
 Try repeating some of the queries from earlier sections with this new larger dataset. You can see the previous queries you’ve run by executing the following command:

 :history


 Exercise: Specific date relationships

 We forgot to add the specific date relationships between :Airport and :AirportDay nodes that we introduced in the previous section!

 Can you write a refactoring query using apoc to do this?

 Hint We’ll need to figure out how not to create duplicate relationships between :Airport and :AirportDay nodes that we processed in the previous guide.

 Answer: Specific date relationships

 This time we need to process :AirportDay nodes so we’ll put the temporary :Process label on those:

 MATCH (ad:AirportDay)
 SET ad:Process
 The simplest way to not create duplicate date relationships between :Airport and :AirportDay nodes is to delete the ones we created earlier:

 MATCH (airport:Airport)-[r]->(:AirportDay)
 WHERE NOT TYPE(r) = "HAS_DAY"
 DELETE r
 Now we can create the new relationships:

 call apoc.periodic.commit('
  MATCH (ad:Process)
  WITH ad LIMIT {limit}

  MATCH (origin:Airport)-[hasDay:HAS_DAY]->(ad:AirportDay)
  CALL apoc.create.relationship(startNode(hasDay), ad.date, {}, endNode(hasDay) ) YIELD rel

  REMOVE ad:Process
  RETURN COUNT(*)
 ',{limit:500})

 Specific vs general

 Now let’s go back and compare the queries from the end of the previous guide.

 PROFILE
 MATCH (origin:Airport {code: "LAS"})-[:`2008-1-3`]->(:AirportDay)<-[:ORIGIN]-(flight:Flight),
      (flight)-[:DESTINATION]->(:AirportDay)<-[:`2008-1-3`]-(destination:Airport {code: "MDW"})
 RETURN *
 vs

 PROFILE
 MATCH (origin:Airport {code: "LAS"})-[:HAS_DAY]->(:AirportDay {date: "2008-1-3"})<-[:ORIGIN]-(flight:Flight),
      (flight)-[:DESTINATION]->(:AirportDay {date: "2008-1-3"})<-[:HAS_DAY]-(destination:Airport {code: "MDW"})
 RETURN *
 The number of db hits has increased for the second query since we’ve now imported another ~20 extra days for the airport. This means that we need to check extra :Airport(date) properties each time we traverse HAS_DAY relationships.

 The number of db hits for the first query hasn’t changed.

 Next

 Thus far we haven’t been deleting the old model when we refactored it. 

 Later :
 In one of the next gists, we’ll look at the advantages/disadvantages of having multiple models in the graph.
	On Neo4J Desktop browser console :
	To connect to the graph
	- Connect:play http://guides.neo4j.com/modeling_sandbox/05_refactoring_large_graphs.html
	Steps:
	- As our graph gets bigger in size it starts to become unfeasible to refactor the whole thing in one go. Instead we’ll have to update it in batches.

	Manual batching

	When batching we sacrifice the atomicity that we’d get if we did everything in one transaction. It’s therefore useful to make our refactoring queries idempotent in case we need to re-run them. We also need to decide which node we’re going to center the refactoring around.

	To recap, this was the refactoring query from the previous section:

	MATCH (origin:Airport)<-[:ORIGIN]-(flight:Flight)-[:DESTINATION]->(destination:Airport)

	MERGE (originAirportDay:AirportDay {id: origin.code + "_" + flight.date})
	ON CREATE SET originAirportDay.date = flight.date

	MERGE (destinationAirportDay:AirportDay {id: destination.code + "_" + flight.date})
	ON CREATE SET destinationAirportDay.date = flight.date

	MERGE (origin)-[:HAS_DAY]->(originAirportDay)
	MERGE (flight)-[:ORIGIN]->(originAirportDay)
	MERGE (flight)-[:DESTINATION]-(destinationAirportDay)
	MERGE (destination)-[:HAS_DAY]->(destinationAirportDay)

	Flight is probably the easiest node to batch on.
	Before we execute the batching workflow, let’s import a few more flights to keep it interesting.

	Importing 100,000 flights

	As we’re now dealing with much more data we’ll have to be a bit cleverer about how we import the data.

	We know that most of the airports are going to be duplicates so there’s no point calling MERGE loads of times. Instead we’ll find the distinct set of airports and only MERGE on each airport once:

	LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/training/master/modeling/data/flights_100k.csv" AS row
	UNWIND [row.Origin, row.Dest] AS airport
	WITH DISTINCT airport
	MERGE (:Airport {code: airport})

	We’ll also use the periodic commit functionality of LOAD CSV. This will flush the transaction every 10,000 rows rather than executing the whole query in one transaction.

	USING PERIODIC COMMIT 10000
	LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/training/master/modeling/data/flights_100k.csv" AS row
	MATCH (origin:Airport {code: row.Origin})
	MATCH (destination:Airport {code: row.Dest})
	MERGE (newFlight:Flight { id: row.UniqueCarrier + row.FlightNum + "_" + row.Year + "-" + row.Month + "-" + row.DayofMonth + "_" + row.Origin + "_" + row.Dest } )
	ON CREATE SET newFlight.date = toInteger(row.Year) + "-" + toInteger(row.Month) + "-" + toInteger(row.DayofMonth),
	newFlight.airline = row.UniqueCarrier,
	newFlight.number = row.FlightNum,
	newFlight.departure = toInteger(row.CRSDepTime),
	newFlight.arrival = toInteger(row.CRSArrTime)
	MERGE (origin)<-[:ORIGIN]-(newFlight)
	MERGE (newFlight)-[:DESTINATION]->(destination)

	Now we’re ready to do some batch refactoring.


	Batch refactoring flights

	Let’s put a Process label on each of our Flight nodes so that we know which ones we’ve still got to process.

	MATCH (f:Flight)
	SET f:Process

	Now we’re ready to run the refactoring query. We’ll start by processing 500 flights at a time:
	MATCH (flight:Process)
	WITH flight LIMIT 500

	MATCH (origin:Airport)<-[:ORIGIN]-(flight)-[:DESTINATION]->(destination:Airport)

	MERGE (originAirportDay:AirportDay {id: origin.code + "_" + flight.date})
	ON CREATE SET originAirportDay.date = flight.date

	MERGE (destinationAirportDay:AirportDay {id: destination.code + "_" + flight.date})
	ON CREATE SET destinationAirportDay.date = flight.date

	MERGE (origin)-[:HAS_DAY]->(originAirportDay)
	MERGE (originAirportDay)<-[:ORIGIN]-(flight)
	MERGE (flight)-[:DESTINATION]-(destinationAirportDay)
	MERGE (destination)-[:HAS_DAY]->(destinationAirportDay)

	REMOVE flight:Process
	RETURN COUNT(*)

	We’d have to run this query 100,000 / 500 = 200 times to process all the flights, which would be a very boring way to pass the time!

	Lucky for us, the apoc library that we installed in the previous section has a procedure that we can use to batch operations.

	The following procedure is the one we want:

	CALL apoc.help("apoc.periodic.commit")
	We can also pass the apoc.help procedure a package name and it’ll show us all the procedures in that package. e.g.

	CALL apoc.help("apoc.periodic")
	Let’s get on with the batch refactoring.

	Since we’ve imported more nodes we’ll need to tag them with the Process label. For simplicity’s sake we’ll just put the Process tag on all our flights and process them all again.

	MATCH (f:Flight)
	SET f:Process
	Hint Remember, since our query is idempotent, if a flight has already been processed before the query won’t actually do anything with that flight.

	We can now call our refactoring query inside the procedure:

	call apoc.periodic.commit('
	MATCH (flight:Process)
	WITH flight LIMIT {limit}

	MATCH (origin:Airport)<-[:ORIGIN]-(flight)-[:DESTINATION]->(destination:Airport)

	MERGE (originAirportDay:AirportDay {id: origin.code + "_" + flight.date})
	ON CREATE SET originAirportDay.date = flight.date

	MERGE (destinationAirportDay:AirportDay {id: destination.code + "_" + flight.date})
	ON CREATE SET destinationAirportDay.date = flight.date

	MERGE (origin)-[:HAS_DAY]->(originAirportDay)
	MERGE (originAirportDay)<-[:ORIGIN]-(flight)
	MERGE (flight)-[:DESTINATION]-(destinationAirportDay)
	MERGE (destination)-[:HAS_DAY]->(destinationAirportDay)

	REMOVE flight:Process
	RETURN COUNT(*)
	',{limit:500})

	Check the import worked

	Run the following query to check our import worked:

	MATCH (:Process)
	RETURN COUNT(*)
	Try repeating some of the queries from earlier sections with this new larger dataset. You can see the previous queries you’ve run by executing the following command:

	:history


	Exercise: Specific date relationships

	We forgot to add the specific date relationships between :Airport and :AirportDay nodes that we introduced in the previous section!

	Can you write a refactoring query using apoc to do this?

	Hint We’ll need to figure out how not to create duplicate relationships between :Airport and :AirportDay nodes that we processed in the previous guide.

	Answer: Specific date relationships

	This time we need to process :AirportDay nodes so we’ll put the temporary :Process label on those:

	MATCH (ad:AirportDay)
	SET ad:Process
	The simplest way to not create duplicate date relationships between :Airport and :AirportDay nodes is to delete the ones we created earlier:

	MATCH (airport:Airport)-[r]->(:AirportDay)
	WHERE NOT TYPE(r) = "HAS_DAY"
	DELETE r
	Now we can create the new relationships:

	call apoc.periodic.commit('
	MATCH (ad:Process)
	WITH ad LIMIT {limit}

	MATCH (origin:Airport)-[hasDay:HAS_DAY]->(ad:AirportDay)
	CALL apoc.create.relationship(startNode(hasDay), ad.date, {}, endNode(hasDay) ) YIELD rel

	REMOVE ad:Process
	RETURN COUNT(*)
	',{limit:500})

	Specific vs general

	Now let’s go back and compare the queries from the end of the previous guide.

	PROFILE
	MATCH (origin:Airport {code: "LAS"})-[:`2008-1-3`]->(:AirportDay)<-[:ORIGIN]-(flight:Flight),
	(flight)-[:DESTINATION]->(:AirportDay)<-[:`2008-1-3`]-(destination:Airport {code: "MDW"})
	RETURN *
	vs

	PROFILE
	MATCH (origin:Airport {code: "LAS"})-[:HAS_DAY]->(:AirportDay {date: "2008-1-3"})<-[:ORIGIN]-(flight:Flight),
	(flight)-[:DESTINATION]->(:AirportDay {date: "2008-1-3"})<-[:HAS_DAY]-(destination:Airport {code: "MDW"})
	RETURN *
	The number of db hits has increased for the second query since we’ve now imported another ~20 extra days for the airport. This means that we need to check extra :Airport(date) properties each time we traverse HAS_DAY relationships.

	The number of db hits for the first query hasn’t changed.

	Next

	Thus far we haven’t been deleting the old model when we refactored it.

	Later :
	In one of the next gists, we’ll look at the advantages/disadvantages of having multiple models in the graph.