Dealing with text files is a common task for programmers, data scientists, and anyone who works with data. These files often contain duplicate lines, which get in the way when you're trying to analyze or process the information. Manually removing duplicates is tedious and error-prone, especially for large files. Fortunately, Python offers a simple and efficient way to automate this process.
In this blog post, I'll share a Python function that reads a text file, removes duplicate lines, and writes the unique lines to a new file. This can save you significant time and effort, and ensure the accuracy of your data.
The Python Solution
```python
def remove_duplicate_lines(input_file, output_file):
    """
    Removes duplicate lines from a text file.

    Args:
        input_file: Path to the input text file.
        output_file: Path to the output text file (containing unique lines).
    """
    try:
        with open(input_file, 'r') as infile:
            lines = infile.readlines()

        unique_lines = []
        seen_lines = set()
        for line in lines:
            cleaned_line = line.strip()  # Remove leading/trailing whitespace
            if cleaned_line not in seen_lines:
                unique_lines.append(line)  # Append the original line
                seen_lines.add(cleaned_line)

        with open(output_file, 'w') as outfile:
            outfile.writelines(unique_lines)
    except FileNotFoundError:
        print(f"Error: Input file '{input_file}' not found.")
    except IOError as e:
        print(f"An I/O error occurred: {e}")


# Example usage:
if __name__ == "__main__":
    input_filename = "input.txt"
    output_filename = "output.txt"

    # Create a dummy input file for testing (optional)
    with open(input_filename, "w") as f:
        f.write("This is line 1.\n")
        f.write("This is line 2.\n")
        f.write("This is line 1.\n")   # Duplicate
        f.write("This is line 3.\n")
        f.write("This is line 2.\n")   # Duplicate
        f.write("This is line 1.\n")   # Duplicate
        f.write("This is line 3.\n")   # Duplicate
        f.write("This is line 4. \n")  # Line with trailing whitespace
        f.write("This is line 4.\n")   # Duplicate (different whitespace)

    remove_duplicate_lines(input_filename, output_filename)
    print(f"Duplicate lines removed. Unique lines written to '{output_filename}'.")
```
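As an aside: if you don't need the whitespace normalization or the error messages, Python has a well-known compact idiom for order-preserving deduplication using `dict.fromkeys` (dictionaries preserve insertion order as of Python 3.7). This is a sketch of an alternative, not the approach explained below, and the hard-coded file names are just placeholders:

```python
# Compact alternative: dict.fromkeys() keeps the first occurrence of each
# line and preserves insertion order (guaranteed since Python 3.7).
# Note: unlike the function above, this compares lines exactly, so lines
# differing only in whitespace are NOT treated as duplicates.
with open("input.txt") as infile:
    unique_lines = list(dict.fromkeys(infile))

with open("output.txt", "w") as outfile:
    outfile.writelines(unique_lines)
```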
How it Works
- **Reading the File:** The code first opens the input file in read mode (`'r'`) and reads all the lines into a list called `lines`.
- **Efficient Duplicate Detection:** A `set` called `seen_lines` is used to keep track of the lines that have already been processed. Sets are incredibly efficient for checking if an element is present, making the duplicate removal process fast, even for large files.
- **Whitespace Handling:** The `line.strip()` method is crucial. It removes leading and trailing whitespace from each line before checking for duplicates. This ensures that lines that are identical except for extra spaces are treated as duplicates. The original line (including the whitespace) is then written to the output file. (A short demonstration of this behavior follows this list.)
- **Writing Unique Lines:** The code iterates through the `lines` list. For each line, it checks if the stripped version of the line is already in the `seen_lines` set. If it's not, the original line (with its whitespace) is added to the `unique_lines` list, and the stripped line is added to the `seen_lines` set. Finally, the `unique_lines` are written to the output file.
- **Error Handling:** The `try...except` block handles potential `FileNotFoundError` (if the input file doesn't exist) and `IOError` (for other input/output issues), making the code more robust.
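To see the whitespace handling in isolation, here's a minimal, self-contained sketch of the same comparison logic (the sample lines mirror the dummy test file above):

```python
# Minimal demonstration of strip()-based duplicate detection.
# The two "line 4" variants differ only in trailing whitespace,
# so the second one is treated as a duplicate.
lines = ["This is line 4. \n", "This is line 4.\n", "This is line 5.\n"]

seen = set()
unique = []
for line in lines:
    cleaned = line.strip()  # compare without surrounding whitespace
    if cleaned not in seen:
        unique.append(line)  # keep the original line, whitespace and all
        seen.add(cleaned)

print(unique)  # ['This is line 4. \n', 'This is line 5.\n']
```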
Why This Approach is Effective
- **Efficiency:** Using a `set` for duplicate detection makes the process very efficient, especially for large files (see the rough benchmark sketch after this list).
- **Clarity:** The code is well-structured and easy to understand.
- **Robustness:** The inclusion of error handling makes the code more reliable.
- **Whitespace Awareness:** Handling whitespace ensures that lines that are effectively the same are treated as duplicates.
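If you're curious why the `set` matters, here's a rough, illustrative micro-benchmark sketch using the standard `timeit` module. Exact numbers will vary by machine, but the gap between a list scan and a hashed set lookup is typically orders of magnitude at this size:

```python
import timeit

# Membership testing: a list scans elements one by one (O(n) per lookup),
# while a set hashes the value (O(1) on average). This is why seen_lines
# is a set rather than a list.
items_list = [f"line {i}" for i in range(100_000)]
items_set = set(items_list)

list_time = timeit.timeit(lambda: "line 99999" in items_list, number=100)
set_time = timeit.timeit(lambda: "line 99999" in items_set, number=100)

print(f"list lookup: {list_time:.4f}s")
print(f"set lookup:  {set_time:.6f}s")
```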
Conclusion
Removing duplicate lines from text files is a common task, and this Python solution provides a clean, efficient, and robust way to automate it. By using this function, you can save time and ensure the accuracy of your data processing. Give it a try and see how it simplifies your text file management!