Extract and load data directly from a tarball

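This sample code downloads a tarball (tar.gz) into memory and loads one of the files inside it into a NumPy array, without extracting anything to disk. The same file object can also be passed to pandas.read_csv if you prefer a pandas DataFrame.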

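from io import BytesIO
import tarfile
from urllib.request import urlopen

import numpy as np

# URL of the tarball (placeholder)
url = 'url/to/tarfile.tgz'
# download the whole archive into an in-memory buffer
b = BytesIO(urlopen(url).read())
# name of the data file as stored inside the archive (placeholder)
fpath = 'local_folder_path/to/extract/data'

# mode='r' detects the compression (gzip here) automatically
with tarfile.open(mode='r', fileobj=b) as archive:
    # extractfile returns a file-like object; nothing is written to disk
    numpy_data = np.loadtxt(archive.extractfile(fpath), delimiter=',')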
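
The fpath passed to extractfile has to match a member name exactly as it is stored in the archive, so it can help to list the members first. Here is a minimal sketch of that, using only the standard library. The helper build_sample_tarball, the member name data/sample.csv and its contents are made up for illustration; the helper only stands in for the downloaded bytes so the snippet runs without a network connection.

```python
from io import BytesIO
import tarfile


def build_sample_tarball():
    """Return a BytesIO holding a tiny .tar.gz with one CSV member (made-up data)."""
    csv_bytes = b"1,2,3\n4,5,6\n"
    buf = BytesIO()
    with tarfile.open(mode='w:gz', fileobj=buf) as archive:
        info = tarfile.TarInfo(name='data/sample.csv')
        info.size = len(csv_bytes)
        archive.addfile(info, BytesIO(csv_bytes))
    buf.seek(0)  # rewind so the buffer can be read from the start
    return buf


# stands in for BytesIO(urlopen(url).read())
b = build_sample_tarball()

with tarfile.open(mode='r', fileobj=b) as archive:
    # names of all members, exactly as stored in the archive
    member_names = archive.getnames()
    # pick the member to load; here the archive has only one
    fpath = member_names[0]
    # read the first line straight from the archive, without writing to disk
    first_line = archive.extractfile(fpath).readline()

print(member_names)
print(first_line)
```

Once you know the right member name, you can pass archive.extractfile(fpath) to np.loadtxt as in the code above.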

