Created
December 16, 2013 05:18
Memory-efficient Django queryset iterator (works by splitting the query into chunks). Taken from https://djangosnippets.org/snippets/1949/
import gc


def queryset_iterator(queryset, chunksize=1000):
    '''
    Iterate over a Django QuerySet ordered by the primary key.

    This method loads a maximum of chunksize (default: 1000) rows into
    memory at the same time, while Django normally would load all rows into
    memory. Using the iterator() method only causes it to not preload all the
    classes.

    Note that the implementation of the iterator does not support ordered query sets.
    '''
    pk = 0  # assumes positive integer primary keys
    # Highest pk in the queryset; raises IndexError if the queryset is empty.
    last_pk = queryset.order_by('-pk')[0].pk
    queryset = queryset.order_by('pk')
    while pk < last_pk:
        # Fetch the next chunk of rows whose pk is above the last one seen.
        for row in queryset.filter(pk__gt=pk)[:chunksize]:
            pk = row.pk
            yield row
        gc.collect()
Your code does not work correctly with an empty queryset.
The version below does:
import gc

from django.core.exceptions import ObjectDoesNotExist


def queryset_iterator(queryset, chunk_size=1000):
    """
    Iterate over a Django QuerySet ordered by the primary key.

    This method loads a maximum of chunk_size (default: 1000) rows into
    memory at the same time, while Django normally would load all rows into
    memory. Using the iterator() method only causes it to not preload all the
    classes.

    Note that the implementation of the iterator does not support ordered query sets.
    """
    try:
        # Highest pk in the queryset; .get() raises DoesNotExist when it is empty.
        last_pk = queryset.order_by('-pk')[:1].get().pk
    except ObjectDoesNotExist:
        return

    pk = 0  # assumes positive integer primary keys
    queryset = queryset.order_by('pk')
    while pk < last_pk:
        # Fetch the next chunk of rows whose pk is above the last one seen.
        for row in queryset.filter(pk__gt=pk)[:chunk_size]:
            pk = row.pk
            yield row
        gc.collect()
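For anyone who wants to try the chunking idea without a Django project, here is a minimal sketch of the same primary-key pagination using only `sqlite3` from the standard library. It is not the snippet above: the `item` table, its columns, and the `iterate_in_chunks` name are made up for illustration, and it stops when a chunk comes back empty instead of looking up the last pk first, so it should also cope with an empty table.

```python
import sqlite3


def iterate_in_chunks(conn, chunk_size=1000):
    """Yield (id, name) rows from the hypothetical `item` table in id order,
    fetching at most chunk_size rows per query."""
    last_id = None  # no row seen yet
    while True:
        if last_id is None:
            cursor = conn.execute(
                "SELECT id, name FROM item ORDER BY id LIMIT ?",
                (chunk_size,),
            )
        else:
            # Only ask for rows after the last id we already yielded.
            cursor = conn.execute(
                "SELECT id, name FROM item WHERE id > ? ORDER BY id LIMIT ?",
                (last_id, chunk_size),
            )
        rows = cursor.fetchall()
        if not rows:
            return  # empty table, or every row has been yielded
        for row in rows:
            yield row
        last_id = rows[-1][0]


# Illustrative data: 25 rows read back in chunks of 10.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO item (id, name) VALUES (?, ?)",
    [(i, f"item-{i}") for i in range(1, 26)],
)
ids = [row[0] for row in iterate_in_chunks(conn, chunk_size=10)]
```

Each query touches at most `chunk_size` rows and uses the primary-key index to skip what was already read, which is the same reason the Django version filters on `pk__gt`.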
The approach in this link is more efficient and stable.
Why call gc.collect()? Doesn't Python handle garbage collection on its own, and wouldn't this likely interfere with the automatic GC and hurt performance?