Friday, 27 December 2019

windows - How to split a text file into multiple text files


I have a text file called entry.txt that contains the following:


[ entry1 ]
1239 1240 1242 1391 1392 1394 1486 1487 1489 1600
1601 1603 1657 1658 1660 2075 2076 2078 2322 2323
2325 2740 2741 2743 3082 3083 3085 3291 3292 3294
3481 3482 3484 3633 3634 3636 3690 3691 3693 3766
3767 3769 4526 4527 4529 4583 4584 4586 4773 4774
4776 5153 5154 5156 5628 5629 5631
[ entry2 ]
1239 1240 1242 1391 1392 1394 1486 1487 1489 1600
1601 1603 1657 1658 1660 2075 2076 2078 2322 2323
2325 2740 2741 2743 3082 3083 3085 3291 3292 3294
3481 3482 3484 3690 3691 3693 3766 3767 3769 4526
4527 4529 4583 4584 4586 4773 4774 4776 5153 5154
5156 5628 5629 5631
[ entry3 ]
1239 1240 1242 1391 1392 1394 1486 1487 1489 1600
1601 1603 1657 1658 1660 2075 2076 2078 2322 2323
2325 2740 2741 2743 3082 3083 3085 3291 3292 3294
3481 3482 3484 3690 3691 3693 3766 3767 3769 4241
4242 4244 4526 4527 4529 4583 4584 4586 4773 4774
4776 5153 5154 5156 5495 5496 5498 5628 5629 5631

I would like to split it into three text files: entry1.txt, entry2.txt, entry3.txt. Their contents are as follows.


entry1.txt:


[ entry1 ]
1239 1240 1242 1391 1392 1394 1486 1487 1489 1600
1601 1603 1657 1658 1660 2075 2076 2078 2322 2323
2325 2740 2741 2743 3082 3083 3085 3291 3292 3294
3481 3482 3484 3633 3634 3636 3690 3691 3693 3766
3767 3769 4526 4527 4529 4583 4584 4586 4773 4774
4776 5153 5154 5156 5628 5629 5631

entry2.txt:


[ entry2 ]
1239 1240 1242 1391 1392 1394 1486 1487 1489 1600
1601 1603 1657 1658 1660 2075 2076 2078 2322 2323
2325 2740 2741 2743 3082 3083 3085 3291 3292 3294
3481 3482 3484 3690 3691 3693 3766 3767 3769 4526
4527 4529 4583 4584 4586 4773 4774 4776 5153 5154
5156 5628 5629 5631

entry3.txt:


[ entry3 ]
1239 1240 1242 1391 1392 1394 1486 1487 1489 1600
1601 1603 1657 1658 1660 2075 2076 2078 2322 2323
2325 2740 2741 2743 3082 3083 3085 3291 3292 3294
3481 3482 3484 3690 3691 3693 3766 3767 3769 4241
4242 4244 4526 4527 4529 4583 4584 4586 4773 4774
4776 5153 5154 5156 5495 5496 5498 5628 5629 5631

In other words, the [ character indicates a new file should begin.


Is there any way to split the text file automatically? My actual input entry.txt contains 200,001 entries.


Doing the text split in either Windows or Linux would be great. I do not have access to a Mac machine. Thanks!



Answer



And here's a nice, simple gawk one-liner:


$ gawk '{ if (match($0, /^\[ (.+) \]/, k)) { name = k[1] } } { print > (name ".txt") }' entry.txt

This will work for any file size and any number of lines per entry, as long as each entry header has the form [ name ], for example [ blahblah blah blah ]. Notice the space just after the opening [ and just before the closing ].
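Since the one-liner depends on that exact header format, it may be worth checking the input before splitting 200,001 entries. The following is only a sketch: it builds a small hypothetical sample.txt in a temporary directory (one header deliberately lacks the spaces) and compares the number of lines starting with [ against the number that match the full [ name ] form.

```shell
#!/bin/sh
# Sketch: count header-like lines and well-formed "[ name ]" headers.
# sample.txt and the temporary directory are hypothetical, created only for this demo.
workdir=$(mktemp -d)
cd "$workdir"

cat > sample.txt <<'EOF'
[ entry1 ]
1239 1240
[entry2]
1601 1603
EOF

# Lines that start with "[" at all.
total=$(grep -c '^\[' sample.txt)
# Lines that start with "[ ", then at least one character, then " ]".
wellformed=$(grep -c '^\[ ..* ]' sample.txt)
echo "headers: $total, well-formed: $wellformed"

# Show (with line numbers) any header the one-liner would not recognise.
grep -n '^\[' sample.txt | grep -v '^[0-9]*:\[ ..* ]' || true
```

If the two counts differ on the real entry.txt, the listed lines would need fixing (or the pattern loosening) before the split.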




EXPLANATION:


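awk and gawk read an input file line by line. As each line is read, its contents are stored in the $0 variable. Here, we are telling gawk to match the text inside the square brackets and save the match into the array k. The three-argument form of match() is a gawk extension, which is why the command uses gawk rather than plain awk.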


So, every time that regular expression matches, that is, for every header in your file, k[1] holds the text captured by the parentheses: "entry1", "entry2", "entry3" ... "entryN". name=k[1] just saves the value of k[1] (the match) into a new variable, name.


Finally, we print each line into a file called <name>.txt, i.e. entry1.txt, entry2.txt ... entryN.txt.
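The steps above can also be written out as a commented multi-line script. This is a sketch rather than the one-liner itself, with two deliberate changes: it uses the two-argument match() with RLENGTH and substr(), so it does not need gawk's array extension, and it calls close() on the previous output file, which keeps only one file open at a time. gawk can usually cope with many open files on its own, but other awk implementations may hit the open-file limit with 200,001 entries. The sample.txt file and temporary directory are hypothetical stand-ins for entry.txt.

```shell
#!/bin/sh
# Sketch: split a small sample on "[ name ]" headers using POSIX awk features only.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# A tiny stand-in for entry.txt.
cat > sample.txt <<'EOF'
[ entry1 ]
1239 1240 1242
1391 1392
[ entry2 ]
1601 1603
EOF

awk '
# Header line: "[ ", at least one character, then " ]".
match($0, /^\[ .+ \]/) {
    # Close the previous output file before switching to the next one.
    if (out != "") close(out)
    # Drop the leading "[ " and trailing " ]" to get the entry name.
    name = substr($0, 3, RLENGTH - 4)
    out = name ".txt"
}
# Write every line, header included, to the current output file.
# Lines before the first header are skipped because out is still empty.
out != "" { print > out }
' sample.txt

cat entry1.txt
```

One caveat of closing files this way: if the same header name appeared twice in the input, the second occurrence would overwrite the first file, so this assumes entry names are unique.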


This method should be much faster than Perl for larger files.


I can't vouch for this, as I have never used the Windows shell, but I am willing to bet it will be far faster than that as well. Gawk/awk are FAST.

