Last active
December 22, 2024 20:15
A function to load numpy arrays from the MNIST data files.
""" A function that can read MNIST's idx file format into numpy arrays.

    The MNIST data files can be downloaded from here:

    http://yann.lecun.com/exdb/mnist/

    This relies on the fact that the MNIST dataset consistently uses
    unsigned char types with their data segments.
"""

import struct

import numpy as np

def read_idx(filename):
    with open(filename, 'rb') as f:
        # Header: two zero bytes, a type code, and the number of dimensions.
        zero, data_type, dims = struct.unpack('>HBB', f.read(4))
        # Each dimension size is a big-endian unsigned 32-bit integer.
        shape = tuple(struct.unpack('>I', f.read(4))[0] for d in range(dims))
        # np.frombuffer replaces the deprecated binary mode of np.fromstring.
        # The result is read-only; call .copy() if a writable array is needed.
        return np.frombuffer(f.read(), dtype=np.uint8).reshape(shape)
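One way to sanity-check the header layout that read_idx relies on is a round trip through a tiny hand-built file. The sketch below is only an illustration: write_idx is a hypothetical helper that is not part of the gist, the array is synthetic rather than real MNIST data, and read_idx is repeated here with np.frombuffer so the block runs on its own.

```python
import os
import struct
import tempfile

import numpy as np


def read_idx(filename):
    # Same logic as the gist, with np.frombuffer reading the raw bytes.
    with open(filename, 'rb') as f:
        zero, data_type, dims = struct.unpack('>HBB', f.read(4))
        shape = tuple(struct.unpack('>I', f.read(4))[0] for d in range(dims))
        return np.frombuffer(f.read(), dtype=np.uint8).reshape(shape)


def write_idx(filename, array):
    """Hypothetical helper: write a uint8 array in the idx layout."""
    array = np.asarray(array, dtype=np.uint8)
    with open(filename, 'wb') as f:
        # Two zero bytes, type code 0x08 (unsigned byte), number of dimensions.
        f.write(struct.pack('>HBB', 0, 0x08, array.ndim))
        # One big-endian unsigned 32-bit size per dimension.
        for size in array.shape:
            f.write(struct.pack('>I', size))
        f.write(array.tobytes())


# Synthetic data standing in for a small stack of images.
original = np.arange(24, dtype=np.uint8).reshape(2, 3, 4)
path = os.path.join(tempfile.mkdtemp(), 'tiny-idx3-ubyte')
write_idx(path, original)
restored = read_idx(path)
```

If the layout assumptions hold, restored has the same shape and contents as original.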
Python can read gzip files directly. Just add:

    import gzip

Then change

    with open(filename, 'rb') as f:

to:

    with gzip.open(filename) as f:
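If both compressed and uncompressed copies are lying around, the choice between open and gzip.open could be made automatically by peeking at the first two bytes, which are 1f 8b for any gzip stream. This is a rough sketch rather than part of the gist: open_maybe_gzip and read_idx_any are hypothetical names, and the test file is synthetic.

```python
import gzip
import os
import struct
import tempfile

import numpy as np

GZIP_MAGIC = b'\x1f\x8b'  # first two bytes of every gzip stream


def open_maybe_gzip(filename):
    """Hypothetical helper: open filename, decompressing it if it is gzipped."""
    with open(filename, 'rb') as f:
        is_gzip = f.read(2) == GZIP_MAGIC
    return gzip.open(filename, 'rb') if is_gzip else open(filename, 'rb')


def read_idx_any(filename):
    """Hypothetical variant of read_idx that accepts plain or gzipped files."""
    with open_maybe_gzip(filename) as f:
        zero, data_type, dims = struct.unpack('>HBB', f.read(4))
        shape = tuple(struct.unpack('>I', f.read(4))[0] for _ in range(dims))
        return np.frombuffer(f.read(), dtype=np.uint8).reshape(shape)


# Build a tiny synthetic idx file (not real MNIST data) in both forms.
pixels = np.arange(12, dtype=np.uint8).reshape(3, 2, 2)
raw = (struct.pack('>HBB', 0, 0x08, 3)
       + struct.pack('>III', 3, 2, 2)
       + pixels.tobytes())

tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, 'tiny-idx3-ubyte')
gzip_path = plain_path + '.gz'
with open(plain_path, 'wb') as f:
    f.write(raw)
with gzip.open(gzip_path, 'wb') as f:
    f.write(raw)

from_plain = read_idx_any(plain_path)
from_gzip = read_idx_any(gzip_path)
```

Both calls should return the same array, so the caller no longer has to care whether the download was unpacked.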
Thanks a lot dude! It is interesting that the default byte order is big-endian (MSB first), the one used by non-Intel processors, since in my mind most people are using Intel processors. Anyway, thanks for the gist!
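To see why the '>' in the struct format strings matters, here is a small sketch that decodes the same 16 header bytes both ways. The bytes are typed in by hand to match the header values documented for the MNIST training images (magic number 2051, 60000 images, 28 rows, 28 columns); they are not read from a real file.

```python
import struct
import sys

# Hand-typed header: magic number, image count, rows, columns.
header = bytes.fromhex('00000803' '0000ea60' '0000001c' '0000001c')

# '>' decodes big-endian (MSB first), which is how the idx format stores integers.
big = struct.unpack('>IIII', header)

# '<' decodes little-endian, the native order on Intel machines.
little = struct.unpack('<IIII', header)

# 'little' on Intel/AMD hardware; shown only for comparison.
native_order = sys.byteorder

print(big)     # (2051, 60000, 28, 28)
print(little)  # byte-swapped values, e.g. 1625948160 in place of 60000
```

Spelling out the byte order in the format string keeps the result the same on every machine, whatever sys.byteorder reports.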
The original file "train-images-idx3-ubyte.gz" is 9912422 bytes. It is much smaller than 282860000. You need to unzip the file before using it; software such as 7-Zip or WinRAR can do this. Otherwise you will hit the error "cannot reshape array of size 9912386 into shape (2055376946,226418,1634299437,1768776039,1702047337,1685599021,1969387892,1694559388)". (Written for students who are as puzzled as I was when doing their homework.)
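The odd eight-dimensional shape in that error has a simple explanation: a gzip file starts with the bytes 1f 8b 08, and if the next (flags) byte is also 08, the idx header parser reads "8 dimensions" and then treats the following 32 bytes as sizes. That also accounts for 9912422 - 4 - 32 = 9912386. The flags value here is inferred from the error message, not checked against the actual download, and check_idx_header is a hypothetical helper sketching how a clearer error could be raised.

```python
import struct

GZIP_MAGIC = b'\x1f\x8b'  # first two bytes of every gzip stream


def check_idx_header(header):
    """Hypothetical helper: validate the first 4 bytes of an idx file."""
    if header[:2] == GZIP_MAGIC:
        raise ValueError('file is still gzip-compressed; unzip it or use gzip.open')
    zero, data_type, dims = struct.unpack('>HBB', header)
    if zero != 0:
        raise ValueError('not an idx file: the first two bytes must be zero')
    return data_type, dims


# What the unmodified parser sees when handed the start of a .gz file.
# The fourth byte (gzip flags) is assumed to be 0x08, matching the 8-dim error.
gzip_start = b'\x1f\x8b\x08\x08'
zero, data_type, dims = struct.unpack('>HBB', gzip_start)

# A well-formed idx3 header: type code 0x08 (unsigned byte), 3 dimensions.
good = check_idx_header(b'\x00\x00\x08\x03')

try:
    check_idx_header(gzip_start)
    message = None
except ValueError as err:
    message = str(err)
```

Calling such a check on f.read(4) before unpacking the shape would turn the confusing reshape failure into a message that points at the real problem.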