Split a CSV into multiple files in Python

I have a CSV file of about 5000 rows, and I want to split it into five files using Python.

I wrote some code for this, but it doesn't work:

import codecs
import csv
NO_OF_LINES_PER_FILE = 1000
def again(count_file_header,count):
    f3 = open('write_'+count_file_header+'.csv', 'at')
    with open('import_1458922827.csv', 'rb') as csvfile:
        candidate_info_reader = csv.reader(csvfile, delimiter=',', quoting=csv.QUOTE_ALL)
        co = 0      
        for row in candidate_info_reader:
            co = co + 1
            count  = count + 1
            if count <= count:
                pass
            elif count >= NO_OF_LINES_PER_FILE:
                count_file_header = count + NO_OF_LINES_PER_FILE
                again(count_file_header,count)
            else:
                writer = csv.writer(f3,delimiter = ',', lineterminator='\n',quoting=csv.QUOTE_ALL)
                writer.writerow(row)

def read_write():
    f3 = open('write_'+NO_OF_LINES_PER_FILE+'.csv', 'at')
    with open('import_1458922827.csv', 'rb') as csvfile:


        candidate_info_reader = csv.reader(csvfile, delimiter=',', quoting=csv.QUOTE_ALL)

        count = 0       
        for row in candidate_info_reader:
            count  = count + 1
            if count >= NO_OF_LINES_PER_FILE:
                count_file_header = count + NO_OF_LINES_PER_FILE
                again(count_file_header,count)
            else:
                writer = csv.writer(f3,delimiter = ',', lineterminator='\n',quoting=csv.QUOTE_ALL)
                writer.writerow(row)

read_write()

The code above creates many files, all of them empty.

How can I split one file into five CSV files?

9
Mounarajan

I suggest you don't reinvent the wheel; a solution already exists. Source here

import csv
import os


def split(filehandler, delimiter=',', row_limit=1000,
          output_name_template='output_%s.csv', output_path='.', keep_headers=True):
    reader = csv.reader(filehandler, delimiter=delimiter)
    current_piece = 1
    current_out_path = os.path.join(
        output_path,
        output_name_template % current_piece
    )
    # newline='' stops the csv module's line endings from being translated again
    current_out_file = open(current_out_path, 'w', newline='')
    current_out_writer = csv.writer(current_out_file, delimiter=delimiter)
    current_limit = row_limit
    if keep_headers:
        headers = next(reader)  # Python 3 replacement for reader.next()
        current_out_writer.writerow(headers)
    for i, row in enumerate(reader):
        if i + 1 > current_limit:
            # The current piece is full: close it and start the next one
            current_out_file.close()
            current_piece += 1
            current_limit = row_limit * current_piece
            current_out_path = os.path.join(
                output_path,
                output_name_template % current_piece
            )
            current_out_file = open(current_out_path, 'w', newline='')
            current_out_writer = csv.writer(current_out_file, delimiter=delimiter)
            if keep_headers:
                current_out_writer.writerow(headers)
        current_out_writer.writerow(row)
    current_out_file.close()

Use it like this:

split(open('/your/path/input.csv', 'r'))
9
Rudziankoŭ

In Python

Use readlines() and writelines() to do this. Here is an example:

>>> csvfile = open('import_1458922827.csv', 'r').readlines()
>>> filename = 1
>>> for i in range(len(csvfile)):
...     if i % 1000 == 0:
...         open(str(filename) + '.csv', 'w+').writelines(csvfile[i:i+1000])
...         filename += 1

The output files will be named 1.csv, 2.csv, and so on.
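
Note that the snippet above reads the whole file into memory and copies the header row only into the first output file. A minimal sketch of a streaming variant that repeats the header in every piece, assuming the first line is a header and that no quoted field spans several lines; the function name `split_lines` and the `prefix` parameter are illustrative, not part of the answer:

```python
from itertools import islice


def split_lines(source_path, lines_per_file=1000, prefix=''):
    """Write <prefix>1.csv, <prefix>2.csv, ... and return the paths written."""
    written = []
    with open(source_path, 'r', newline='') as source:
        header = source.readline()  # assumed to be the header row
        file_number = 1
        while True:
            # islice pulls at most lines_per_file lines without loading the whole file
            chunk = list(islice(source, lines_per_file))
            if not chunk:
                break
            out_path = f'{prefix}{file_number}.csv'
            with open(out_path, 'w', newline='') as target:
                target.write(header)
                target.writelines(chunk)
            written.append(out_path)
            file_number += 1
    return written
```

Calling `split_lines('import_1458922827.csv')` would produce the same file names as the snippet above, with the header repeated in each file.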

From the terminal

For your information, you can also do this from the command line using split, as follows:

$ split -l 1000 import_1458922827.csv
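
The value 1000 used in both approaches matches the question's figures (about 5000 rows, five files). When the row count is not known in advance, the per-file line count can be derived first. A minimal sketch, assuming the file has a single header line and no quoted fields that span several lines; `lines_per_part` is a hypothetical helper name:

```python
import math


def lines_per_part(source_path, parts=5, has_header=True):
    """Return how many data lines each output file needs to get `parts` files."""
    with open(source_path, 'r', newline='') as source:
        total = sum(1 for _ in source)
    if has_header:
        total = max(total - 1, 0)  # the header is not a data line
    # Round up so that the last file absorbs any remainder
    return math.ceil(total / parts)
```

The returned number can then be used in place of the hard-coded 1000.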
11
Aziz Alto

A Python 3-friendly solution:

import csv
import os


def split_csv(source_filepath, dest_folder, split_file_prefix,
                records_per_file):
    """
    Split a source csv into multiple csvs of equal numbers of records,
    except the last file.

    Includes the initial header row in each split file.

    Split files follow a zero-index sequential naming convention like so:

        `{split_file_prefix}_0.csv`
    """
    if records_per_file <= 0:
        raise Exception('records_per_file must be > 0')

    with open(source_filepath, 'r') as source:
        reader = csv.reader(source)
        headers = next(reader)

        file_idx = 0
        records_exist = True

        while records_exist:

            i = 0
            target_filename = f'{split_file_prefix}_{file_idx}.csv'
            target_filepath = os.path.join(dest_folder, target_filename)

            with open(target_filepath, 'w') as target:
                writer = csv.writer(target)

                while i < records_per_file:
                    if i == 0:
                        writer.writerow(headers)

                    try:
                        writer.writerow(next(reader))
                        i += 1
                    except StopIteration:
                        records_exist = False
                        break

            if i == 0:
                # we only wrote the header, so delete that file
                os.remove(target_filepath)

            file_idx += 1
3
Ryan Tuck
if count <= count:
   pass

This condition is always true, so the code hits pass on every row and never writes anything.
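
The effect of that condition can be checked in isolation. The sketch below mirrors the question's if/elif/else chain to show that only the first branch ever runs, then shows one possible way to decide which output file a row belongs to, using integer division. `branch_taken` and `file_index` are hypothetical names, and the second function is only an assumption about what the original code was meant to achieve:

```python
NO_OF_LINES_PER_FILE = 1000


def branch_taken(count):
    """Mirror the if/elif/else chain from the question and report which branch runs."""
    if count <= count:  # a value is never greater than itself
        return 'pass'
    elif count >= NO_OF_LINES_PER_FILE:
        return 'recurse'
    else:
        return 'write'


def file_index(row_number, rows_per_file=NO_OF_LINES_PER_FILE):
    """Map a 0-based row number to the 0-based output file it belongs in."""
    return row_number // rows_per_file
```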

Otherwise, you can take a look at this post: Split a CSV file into equal parts?

1
Whitefret

I suggest you take advantage of what pandas offers. Here are some functions you could use to do this:

import logging
import math

import pandas as pd


def csv_count_rows(file):
    """
    Counts the number of rows in a file.
    :param file: path to the file.
    :return: number of lines in the designated file.
    """
    with open(file) as f:
        nb_lines = sum(1 for line in f)
    return nb_lines


def split_csv(file, sep=",", output_path=".", nrows=None, chunksize=None, low_memory=True, usecols=None):
    """
    Split a csv into several files.
    :param file: path to the original csv.
    :param sep: View pandas.read_csv doc.
    :param output_path: path in which to output the resulting parts of the splitting.
    :param nrows: Number of rows to split the original csv by, also view pandas.read_csv doc.
    :param chunksize: View pandas.read_csv doc.
    :param low_memory: View pandas.read_csv doc.
    :param usecols: View pandas.read_csv doc.
    """
    nb_of_rows = csv_count_rows(file)

    # Parsing file elements : Path, name, extension, etc...
    # file_path = "/".join(file.split("/")[0:-1])
    file_name = file.split("/")[-1]
    # file_ext = file_name.split(".")[-1]
    file_name_trunk = file_name.split(".")[0]
    split_files_name_trunk = file_name_trunk + "_part_"

    # Number of chunks to partition the original file into
    # (nrows must be given; the header line is not counted as a data row)
    nb_of_chunks = math.ceil((nb_of_rows - 1) / nrows)
    if nrows:
        log_debug_process_start = f"The file '{file_name}' contains {nb_of_rows} ROWS. " \
            f"\nIt will be split into {nb_of_chunks} chunks of a max number of rows : {nrows}." \
            f"\nThe resulting files will be output in '{output_path}' as '{split_files_name_trunk}0 to {nb_of_chunks - 1}'"
        logging.debug(log_debug_process_start)

    for i in range(nb_of_chunks):
        # Skip the data rows already written by earlier chunks; line 0 (the header) is always kept.
        rows_to_skip = range(1, i * nrows + 1) if i else None
        output_file = f"{output_path}/{split_files_name_trunk}{i}.csv"

        log_debug_chunk_processing = f"Processing chunk {i} of the file '{file_name}'"
        logging.debug(log_debug_chunk_processing)

        # Fetching the original csv file and handling it with skiprows and nrows to process its data
        df_chunk = pd.read_csv(filepath_or_buffer=file, sep=sep, nrows=nrows, skiprows=rows_to_skip,
                               chunksize=chunksize, low_memory=low_memory, usecols=usecols)
        # index=False keeps pandas from adding an extra index column to each part
        df_chunk.to_csv(path_or_buf=output_file, sep=sep, index=False)

        log_info_file_output = f"Chunk {i} of file '{file_name}' created in '{output_file}'"
        logging.info(log_info_file_output)

Then, in your main script or Jupyter notebook, put:

# This is how you initiate logging in the most basic way.
logging.basicConfig(level=logging.DEBUG)
file = {#Path to your file}
split_csv(file,sep=";" ,output_path={#Path where you'd like to output it},nrows = 4000000, low_memory = False)

P.S.1: I set nrows = 4000000 as a personal preference. You can change this number if you wish.

P.S.2: I used the logging library to display messages. When a function like this is applied to large files on a remote server, you really want to avoid plain prints and use logging instead. You can still replace logging.info or logging.debug with print.

P.S.3: Of course, you need to replace the {# Blablabla} parts of the code with your own parameters.
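
As a complement to P.S.2, here is a minimal sketch of the same progress messages routed through a named logger, with the values passed as arguments so the message is only formatted when its level is enabled. The logger name `csv_splitter` and the helper `report_chunk` are hypothetical and not part of the answer's code:

```python
import logging

logger = logging.getLogger('csv_splitter')  # hypothetical logger name


def report_chunk(index, file_name, output_file):
    """Log progress for one chunk; arguments are formatted only if the level is enabled."""
    logger.debug("Processing chunk %d of the file '%s'", index, file_name)
    logger.info("Chunk %d of file '%s' created in '%s'", index, file_name, output_file)


if __name__ == '__main__':
    # With level=INFO, the debug message above is suppressed
    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s %(levelname)s %(message)s')
    report_chunk(0, 'data.csv', './data_part_0.csv')
```

Raising or lowering the level in `basicConfig` changes how verbose the output is without touching the function itself.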

0
Aetos

@Ryan, the Python 3 code worked for me. I used newline='' as shown below to avoid the blank-line issue:

with open(target_filepath, 'w', newline='') as target:
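
Some background on why this helps: csv.writer emits its own line terminator (\r\n by default), so the file should be opened with newline='' to keep Python's text layer from translating it a second time. On Windows, that extra translation yields \r\r\n, which shows up as a blank line after every row. A minimal sketch that makes the resulting bytes visible; `write_rows` and `read_raw` are hypothetical helper names:

```python
import csv


def write_rows(path, rows):
    """Write rows with csv.writer, leaving line endings to the csv module."""
    with open(path, 'w', newline='') as target:
        csv.writer(target).writerows(rows)


def read_raw(path):
    """Return the file's bytes so the actual line endings are visible."""
    with open(path, 'rb') as source:
        return source.read()
```

With newline='', each row ends in exactly one \r\n on every platform.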

0
Ramesh K