Boto3 to download all files from an S3 bucket

Question

I'm using boto3 to get files from an S3 bucket. I need functionality similar to aws s3 sync.

My current code is

#!/usr/bin/python
import boto3
s3 = boto3.client('s3')
list = s3.list_objects(Bucket='my_bucket_name')['Contents']
for key in list:
    s3.download_file('my_bucket_name', key['Key'], key['Key'])

This works fine as long as the bucket contains only files. If a folder is present inside the bucket, it throws an error.

Traceback (most recent call last):
  File "./test", line 6, in <module>
    s3.download_file('my_bucket_name', key['Key'], key['Key'])
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/inject.py", line 58, in download_file
    extra_args=ExtraArgs, callback=Callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 651, in download_file
    extra_args, callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 666, in _download_file
    self._get_object(bucket, key, filename, extra_args, callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 690, in _get_object
    extra_args, callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 707, in _do_get_object
    with self._osutil.open(filename, 'wb') as f:
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 323, in open
    return open(filename, mode)
IOError: [Errno 2] No such file or directory: 'my_folder/.8Df54234'

Is this a proper way to download a complete S3 bucket using boto3? How do I download folders?

glefait · Accepted Answer

I had the same needs and created the following function, which downloads the files recursively. Directories are created locally only if they contain files.

import boto3
import os

def download_dir(client, resource, dist, local='/tmp', bucket='your_bucket'):
    paginator = client.get_paginator('list_objects')
    for result in paginator.paginate(Bucket=bucket, Delimiter='/', Prefix=dist):
        # Recurse into each "subdirectory" (common prefix)
        if result.get('CommonPrefixes') is not None:
            for subdir in result.get('CommonPrefixes'):
                download_dir(client, resource, subdir.get('Prefix'), local, bucket)
        for file in result.get('Contents', []):
            # Keys ending in '/' are folder placeholders, not files
            if file.get('Key').endswith('/'):
                continue
            dest_pathname = os.path.join(local, file.get('Key'))
            if not os.path.exists(os.path.dirname(dest_pathname)):
                os.makedirs(os.path.dirname(dest_pathname))
            resource.meta.client.download_file(bucket, file.get('Key'), dest_pathname)

The function is called like this:

def _start():
    client = boto3.client('s3')
    resource = boto3.resource('s3')
    download_dir(client, resource, 'clientconf/', '/tmp')

John Rotenstein · Answer

Amazon S3 does not have folders or directories. It is a flat file structure.

To maintain the appearance of directories, path names are stored as part of the object Key (filename). For example:

images/foo.jpg

In this case, the whole Key is images/foo.jpg, rather than just foo.jpg.

I suspect that your problem is that boto is returning a file called my_folder/.8Df54234 and is attempting to save it to the local filesystem. However, your local filesystem interprets the my_folder/ portion as a directory name, and that directory does not exist on your local filesystem.

You could either truncate the filename to save only the .8Df54234 portion, or create the necessary directories before writing files. Note that the directories may be nested several levels deep.
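A minimal sketch of the second option, under the assumption that you want to mirror the key structure below a local root directory. The helper name `local_path_for_key` is hypothetical (it is not part of boto3); it only handles the path logic and leaves the actual download call to you.

```python
import os


def local_path_for_key(key, root):
    """Map an S3 key to a path under root, creating parent directories.

    Hypothetical helper for illustration; not part of boto3.
    Returns None for keys ending in '/', which act as folder placeholders.
    """
    if key.endswith('/'):
        return None
    # S3 keys always use '/', so split on it rather than on os.sep.
    dest = os.path.join(root, *key.split('/'))
    parent = os.path.dirname(dest)
    if parent and not os.path.isdir(parent):
        os.makedirs(parent)
    return dest
```

The returned path could then be passed as the third argument of `s3.download_file(bucket, key, path)`, skipping keys for which the helper returns None.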

An easier way would be to use the AWS Command Line Interface (CLI), which will do all this work for you, e.g.:

aws s3 cp --recursive s3://my_bucket_name local_folder

There is also a sync option that copies only new and modified files.
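If you need similar behaviour from Python, one possible approach is to compare each object's size and last-modified time with the local copy before downloading. The sketch below is an assumption about a reasonable comparison, not a description of the CLI's exact rules. `needs_download` is a hypothetical helper; `remote_size` and `remote_mtime` would come from the `Size` and `LastModified` fields of a listing entry.

```python
import os
from datetime import datetime, timezone


def needs_download(local_path, remote_size, remote_mtime):
    """Return True if the local file is missing, differs in size,
    or is older than the remote object.

    remote_mtime must be a timezone-aware datetime.
    """
    if not os.path.isfile(local_path):
        return True
    stat = os.stat(local_path)
    if stat.st_size != remote_size:
        return True
    # Compare modification times in UTC.
    local_mtime = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return local_mtime < remote_mtime
```

A download loop could call this helper for each key and skip the transfer when it returns False.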

Tushar Niras · Answer

import os
import boto3

# initiate s3 resource
s3 = boto3.resource('s3')

# select bucket
my_bucket = s3.Bucket('my_bucket_name')

# download files into the current directory
for s3_object in my_bucket.objects.all():
    # Split s3_object.key into path and file name; otherwise download_file
    # fails with "file not found" when the key contains a directory part.
    path, filename = os.path.split(s3_object.key)
    # Keys ending in '/' have an empty file name and are skipped.
    if filename:
        my_bucket.download_file(s3_object.key, filename)

Shan · Answer

I'm currently achieving the task with the following:

#!/usr/bin/python
import os
import boto3

s3 = boto3.client('s3')
objects = s3.list_objects(Bucket='bucket')['Contents']
for s3_key in objects:
    s3_object = s3_key['Key']
    if not s3_object.endswith("/"):
        s3.download_file('bucket', s3_object, s3_object)
    else:
        # Keys ending in "/" are folder placeholders: create the directory
        if not os.path.exists(s3_object):
            os.makedirs(s3_object)

Although this does the job, I'm not sure it is a good way to do it. I'm leaving it here to help other users and to invite further answers with a better approach.

Grant Langseth · Answer

When working with buckets that contain more than 1000 objects, it is necessary to implement a solution that uses the NextContinuationToken on sequential sets of at most 1000 keys. This solution first compiles a list of objects, then iteratively creates the specified directories and downloads the existing objects.

import boto3
import os

s3_client = boto3.client('s3')

def download_dir(prefix, local, bucket, client=s3_client):
    """
    params:
    - prefix: pattern to match in s3
    - local: local path to folder in which to place files
    - bucket: s3 bucket with target contents
    - client: initialized s3 client object
    """
    keys = []
    dirs = []
    next_token = ''
    base_kwargs = {
        'Bucket': bucket,
        'Prefix': prefix,
    }
    while next_token is not None:
        kwargs = base_kwargs.copy()
        if next_token != '':
            kwargs.update({'ContinuationToken': next_token})
        results = client.list_objects_v2(**kwargs)
        # 'Contents' is missing when nothing matches the prefix
        contents = results.get('Contents', [])
        for i in contents:
            k = i.get('Key')
            if k[-1] != '/':
                keys.append(k)
            else:
                dirs.append(k)
        # None once the last page has been read, which ends the loop
        next_token = results.get('NextContinuationToken')
    for d in dirs:
        dest_pathname = os.path.join(local, d)
        if not os.path.exists(os.path.dirname(dest_pathname)):
            os.makedirs(os.path.dirname(dest_pathname))
    for k in keys:
        dest_pathname = os.path.join(local, k)
        if not os.path.exists(os.path.dirname(dest_pathname)):
            os.makedirs(os.path.dirname(dest_pathname))
        client.download_file(bucket, k, dest_pathname)

ifoukarakis · Answer

Better late than never :) The previous answer with the paginator is really good. However, it is recursive, and you might end up hitting Python's recursion limits. Here is an alternative approach, with a couple of extra checks.

import os
import errno
import boto3


def assert_dir_exists(path):
    """
    Checks if the directory tree in path exists. If not, it creates it.
    :param path: the path to check if it exists
    """
    try:
        os.makedirs(path)
    except OSError as e:
        if e.errno != errno.EEXIST:
            raise


def download_dir(client, bucket, path, target):
    """
    Downloads recursively the given S3 path to the target directory.
    :param client: S3 client to use.
    :param bucket: the name of the bucket to download from
    :param path: The S3 directory to download.
    :param target: the local directory to download the files to.
    """

    # Handle missing / at end of prefix
    if not path.endswith('/'):
        path += '/'

    paginator = client.get_paginator('list_objects_v2')
    for result in paginator.paginate(Bucket=bucket, Prefix=path):
        # Download each file individually ('Contents' is absent on empty pages)
        for key in result.get('Contents', []):
            # Calculate relative path
            rel_path = key['Key'][len(path):]
            # Skip paths ending in /
            if not key['Key'].endswith('/'):
                local_file_path = os.path.join(target, rel_path)
                # Make sure directories exist
                local_file_dir = os.path.dirname(local_file_path)
                assert_dir_exists(local_file_dir)
                client.download_file(bucket, key['Key'], local_file_path)


client = boto3.client('s3')

download_dir(client, 'bucket-name', 'path/to/data', 'downloads')

Ganatra · Answer

It's a very bad idea to get all files in one go; you should rather get them in batches.
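As a rough illustration of fetching in batches, the listing can be consumed one page at a time with `list_objects_v2` and its continuation token, so that only one page of keys is held in memory at once. This is only a sketch: `iter_key_batches` is a hypothetical helper name, and `client` is assumed to be an S3 client created with `boto3.client('s3')`.

```python
def iter_key_batches(client, bucket, prefix=''):
    """Yield one list of keys per list_objects_v2 page.

    Hypothetical helper; each page holds at most 1000 keys.
    """
    kwargs = {'Bucket': bucket, 'Prefix': prefix}
    while True:
        page = client.list_objects_v2(**kwargs)
        # 'Contents' is absent when a page has no objects.
        keys = [obj['Key'] for obj in page.get('Contents', [])]
        if keys:
            yield keys
        token = page.get('NextContinuationToken')
        if not token:
            break
        # Ask for the next page on the following call.
        kwargs['ContinuationToken'] = token
```

Each yielded batch can be downloaded before the next page is requested.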

One implementation that I use to fetch a particular folder (directory) from S3 is:

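import os
from boto3.session import Session

# aws_access_key_id, aws_secret_access_key, region_name and bucket_name
# are assumed to be defined elsewhere (for example, loaded from configuration).
def get_directory(directory_path, download_path, exclude_file_names):
    # prepare session (region_name must be passed by keyword)
    session = Session(aws_access_key_id, aws_secret_access_key, region_name=region_name)
    # get instances for resource and bucket
    resource = session.resource('s3')
    bucket = resource.Bucket(bucket_name)
    listing = resource.meta.client.list_objects(Bucket=bucket_name, Prefix=directory_path)
    for s3_key in listing['Contents']:
        s3_object = s3_key['Key']
        # skip excluded names and folder placeholders (keys ending in '/')
        if s3_object not in exclude_file_names and not s3_object.endswith('/'):
            bucket.download_file(s3_object, os.path.join(download_path, s3_object.split('/')[-1]))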

and still, if you want to get the whole bucket, use the CLI as @John Rotenstein mentioned, as below:

aws s3 cp --recursive s3://bucket_name download_path

mattalxndr · Answer

I have a workaround for this that runs the AWS CLI in the same process.

Install awscli as a Python library:

pip install awscli

Then define this function:

import os
from awscli.clidriver import create_clidriver

def aws_cli(*cmd):
    old_env = dict(os.environ)
    try:
        # Environment
        env = os.environ.copy()
        env['LC_CTYPE'] = u'en_US.UTF'
        os.environ.update(env)

        # Run awscli in the same process; main() expects a list of arguments
        exit_code = create_clidriver().main(list(cmd))

        # Deal with problems
        if exit_code > 0:
            raise RuntimeError('AWS CLI exited with code {}'.format(exit_code))
    finally:
        # Restore the original environment
        os.environ.clear()
        os.environ.update(old_env)

To execute:

aws_cli('s3', 'sync', '/path/to/source', 's3://bucket/destination', '--delete')

Rajesh Rajendran · Answer

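import os
import boto3

# Assumed setup: my_bucket is a boto3 Bucket resource (placeholder bucket name)
s3 = boto3.resource('s3')
my_bucket = s3.Bucket('my_bucket_name')

for objs in my_bucket.objects.all():
    print(objs.key)
    # S3 keys always use '/' as the separator, regardless of os.sep
    path = '/tmp/' + '/'.join(objs.key.split('/')[:-1])
    try:
        if not os.path.exists(path):
            os.makedirs(path)
        my_bucket.download_file(objs.key, '/tmp/' + objs.key)
    except FileExistsError as fe:
        print(objs.key + ' exists')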

This code downloads the content into the /tmp/ directory. You can change the directory if you want.

snat2100 · Answer

If you want to call a bash script using Python, here is a simple method to load files from a folder in an S3 bucket into a local folder (on a Linux machine):

import boto3
import subprocess
import os

###TOEDIT###
my_bucket_name = "your_my_bucket_name"
bucket_folder_name = "your_bucket_folder_name"
local_folder_path = "your_local_folder_path"
###TOEDIT###

# 1. Load the list of files existing in the bucket folder
FILES_NAMES = []
s3 = boto3.resource('s3')
my_bucket = s3.Bucket('{}'.format(my_bucket_name))
prefix = "{}/".format(bucket_folder_name)
for object_summary in my_bucket.objects.filter(Prefix=prefix):
    # Keep the name relative to the folder; skip the folder placeholder itself
    name = object_summary.key[len(prefix):]
    if name:
        FILES_NAMES.append(name)

# 2. List only new files that do not exist in the local folder (to not copy everything!)
new_filenames = list(set(FILES_NAMES) - set(os.listdir(local_folder_path)))

# 3. Time to load files into your destination folder
for new_filename in new_filenames:
    upload_S3files_CMD = """aws s3 cp s3://{}/{}/{} {}""".format(my_bucket_name, bucket_folder_name, new_filename, local_folder_path)

    # The keyword argument is lowercase: shell=True
    subprocess_call = subprocess.call(upload_S3files_CMD, shell=True)
    if subprocess_call != 0:
        print("ALERT: loading files not working correctly, please re-check new loaded files")