Dealing with text files is a common task for programmers, data scientists, and anyone who works with data. These files often contain duplicate lines, which get in the way when you're trying to analyze or process the information. Manually removing duplicates is tedious and error-prone, especially for large files. Fortunately, Python offers a simple and efficient way to automate this process.
In this blog post, I'll share a Python function that reads a text file, removes duplicate lines, and writes the unique lines to a new file. This can save you significant time and effort, and ensure the accuracy of your data.
The Python Solution
```python
def remove_duplicate_lines(input_file, output_file):
    """
    Removes duplicate lines from a text file.

    Args:
        input_file: Path to the input text file.
        output_file: Path to the output text file (containing unique lines).
    """
    try:
        with open(input_file, 'r') as infile:
            lines = infile.readlines()

        unique_lines = []
        seen_lines = set()
        for line in lines:
            cleaned_line = line.strip()  # Remove leading/trailing whitespace
            if cleaned_line not in seen_lines:
                unique_lines.append(line)  # Append the original line
                seen_lines.add(cleaned_line)

        with open(output_file, 'w') as outfile:
            outfile.writelines(unique_lines)
    except FileNotFoundError:
        print(f"Error: Input file '{input_file}' not found.")
    except IOError as e:
        print(f"An I/O error occurred: {e}")


# Example usage:
if __name__ == "__main__":
    input_filename = "input.txt"
    output_filename = "output.txt"

    # Create a dummy input file for testing (optional)
    with open(input_filename, "w") as f:
        f.write("This is line 1.\n")
        f.write("This is line 2.\n")
        f.write("This is line 1.\n")   # Duplicate
        f.write("This is line 3.\n")
        f.write("This is line 2.\n")   # Duplicate
        f.write("This is line 1.\n")   # Duplicate
        f.write("This is line 3.\n")   # Duplicate
        f.write("This is line 4. \n")  # Line with trailing whitespace
        f.write("This is line 4.\n")   # Duplicate (different whitespace)

    remove_duplicate_lines(input_filename, output_filename)
    print(f"Duplicate lines removed. Unique lines written to '{output_filename}'.")
```
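As an aside: if you don't need the whitespace normalization or the error messages, Python has a well-known compact idiom for order-preserving deduplication using `dict.fromkeys` (dictionaries preserve insertion order as of Python 3.7). This is a sketch of an alternative, not the approach explained below, and the hard-coded file names are just placeholders:

```python
# Compact alternative: dict.fromkeys() keeps the first occurrence of each
# line and preserves insertion order (guaranteed since Python 3.7).
# Note: unlike the function above, this compares lines exactly, so lines
# differing only in whitespace are NOT treated as duplicates.
with open("input.txt") as infile:
    unique_lines = list(dict.fromkeys(infile))

with open("output.txt", "w") as outfile:
    outfile.writelines(unique_lines)
```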
How it Works
- **Reading the File:** The code first opens the input file in read mode (`'r'`) and reads all the lines into a list called `lines`.
- **Efficient Duplicate Detection:** A `set` called `seen_lines` is used to keep track of the lines that have already been processed. Sets are incredibly efficient for checking if an element is present, making the duplicate removal process fast, even for large files.
- **Whitespace Handling:** The `line.strip()` method is crucial. It removes leading and trailing whitespace from each line before checking for duplicates. This ensures that lines that are identical except for extra spaces are treated as duplicates. The original line (including the whitespace) is then written to the output file. (A short demonstration of this behavior follows this list.)
- **Writing Unique Lines:** The code iterates through the `lines` list. For each line, it checks if the stripped version of the line is already in the `seen_lines` set. If it's not, the original line (with its whitespace) is added to the `unique_lines` list, and the stripped line is added to the `seen_lines` set. Finally, the `unique_lines` are written to the output file.
- **Error Handling:** The `try...except` block handles potential `FileNotFoundError` (if the input file doesn't exist) and `IOError` (for other input/output issues), making the code more robust.
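To see the whitespace handling in isolation, here's a minimal, self-contained sketch of the same comparison logic (the sample lines mirror the dummy test file above):

```python
# Minimal demonstration of strip()-based duplicate detection.
# The two "line 4" variants differ only in trailing whitespace,
# so the second one is treated as a duplicate.
lines = ["This is line 4. \n", "This is line 4.\n", "This is line 5.\n"]

seen = set()
unique = []
for line in lines:
    cleaned = line.strip()  # compare without surrounding whitespace
    if cleaned not in seen:
        unique.append(line)  # keep the original line, whitespace and all
        seen.add(cleaned)

print(unique)  # ['This is line 4. \n', 'This is line 5.\n']
```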
Why This Approach is Effective
- **Efficiency:** Using a `set` for duplicate detection makes the process very efficient, especially for large files (see the rough benchmark sketch after this list).
- **Clarity:** The code is well-structured and easy to understand.
- **Robustness:** The inclusion of error handling makes the code more reliable.
- **Whitespace Awareness:** Handling whitespace ensures that lines that are effectively the same are treated as duplicates.
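If you're curious why the `set` matters, here's a rough, illustrative micro-benchmark sketch using the standard `timeit` module. Exact numbers will vary by machine, but the gap between a list scan and a hashed set lookup is typically orders of magnitude at this size:

```python
import timeit

# Membership testing: a list scans elements one by one (O(n) per lookup),
# while a set hashes the value (O(1) on average). This is why seen_lines
# is a set rather than a list.
items_list = [f"line {i}" for i in range(100_000)]
items_set = set(items_list)

list_time = timeit.timeit(lambda: "line 99999" in items_list, number=100)
set_time = timeit.timeit(lambda: "line 99999" in items_set, number=100)

print(f"list lookup: {list_time:.4f}s")
print(f"set lookup:  {set_time:.6f}s")
```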
Conclusion
Removing duplicate lines from text files is a common task, and this Python solution provides a clean, efficient, and robust way to automate it. By using this function, you can save time and ensure the accuracy of your data processing. Give it a try and see how it simplifies your text file management!