I want to Get-Content a large (1 GB - 10 GB) .txt file (which has only 1 line!) and split it into multiple files with multiple lines, but whenever I try, I end up with a System.OutOfMemoryException.

Of course I searched for a solution, but all the solutions I found read the file line by line, which is kinda hard to do when the file only has 1 line.
Although PowerShell takes up to 4 GB of RAM when loading a 1 GB file, the issue is not a lack of RAM: I have 16 GB in total, and even with a game running in the background the peak usage is around 60%.
I'm using Windows 10 with PowerShell 5.1 (64-bit), and my MaxMemoryPerShellMB is set to the default value of 2147483647.
This is the script I wrote and am using; it works fine with a file size of e.g. 100 MB:
$source = "C:\Users\Env:USERNAME\Desktop\Test\"
$input = "test_1GB.txt"
$temp_dir = "_temp"
# 104'857'600 bytes (or characters) are exactly 100 MB, so a 1 GB file yields exactly
# 10 temporary files, which all have the same size, number of lines, and line lengths.
$out_size = 104857600
# A line length of somewhere around 18'000 characters seems to be the sweet spot, however
# the line length needs to be divisible by 4 and at best fit exactly n times into the
# temporary file, so I use 16'384 bytes (or characters), which is exactly 16 KB.
$line_length = 16384
$file = (gc $input)
$in_size = (gc $input | measure -character | select -expand characters)
if (!(test-path $source$temp_dir)) {ni -type directory -path "$source$temp_dir" >$null 2>&1}
$n = 1
$i = 0
if ($out_size -eq $in_size) {
    $file -replace ".{$line_length}", "$&`r`n" | out-file -filepath "$temp_dir\_temp_0001.txt" -encoding ascii
} else {
    while ($i -le ($in_size - $out_size)) {
        $new_file = $file.substring($i,$out_size)
        if ($n -le 9) {$count = "000$n"} elseif ($n -le 99) {$count = "00$n"} elseif ($n -le 999) {$count = "0$n"} else {$count = $n}
        $temp_name = "_temp_$count.txt"
        $i += $out_size
        $n += 1
        $new_file -replace ".{$line_length}", "$&`r`n" | out-file -filepath "$temp_dir\$temp_name" -encoding ascii
    }
    if ($i -ne $in_size) {
        $new_file = $file.substring($i,($in_size-$i))
        if ($n -le 9) {$count = "000$n"} elseif ($n -le 99) {$count = "00$n"} elseif ($n -le 999) {$count = "0$n"} else {$count = $n}
        $temp_name = "_temp_$count.txt"
        $new_file -replace ".{$line_length}", "$&`r`n" | out-file -filepath "$temp_dir\$temp_name" -encoding ascii
    }
}
If there is an easier solution that doesn't use Get-Content, I'll gladly take it, too. It really doesn't matter how I achieve the result, as long as it's possible on every up-to-date Windows machine with no extra software. If that should not be possible, however, I would also consider other solutions.
Answer
Reading large files into memory simply to split them, while easy, will never be the most efficient method, and you will run into memory limits somewhere.
This is even more apparent here because Get-Content works on strings, and, as you mention in the comments, you are dealing with binary files.
.NET (and, therefore, PowerShell) stores all strings in memory as UTF-16 code units. This means each code unit takes up 2 bytes in memory.
It happens that a single .NET string can only store (2^31 - 1) code units, since the length of a string is tracked by an Int32 (even on 64-bit versions). Multiply that by 2, and a single .NET string can (theoretically) use about 4 GB.
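As a quick sanity check of that arithmetic (the numbers here are just the Int32 ceiling and the 2-byte code unit size, nothing specific to your file):

$maxCodeUnits = [int]::MaxValue                    # 2,147,483,647 UTF-16 code units per string
$bytesPerUnit = 2                                  # each code unit occupies 2 bytes in memory
$maxStringGB  = ([long]$maxCodeUnits * $bytesPerUnit) / 1GB
"{0:N2} GB per string, at most" -f $maxStringGB    # prints ~4.00 GB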
Get-Content will store every line in its own string. If you have a single line with more than 2 billion characters... that's likely why you're getting that error despite having "enough" memory.
Alternatively, it could be because there's a limit of 2 GB for any given object unless larger sizes are explicitly enabled (are they for PowerShell?). Your 4 GB OOM could also be because there are two copies/buffers kept around as Get-Content tries to find a line break to split on.
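As a rough back-of-the-envelope check of that theory against the ~4 GB peak you observed for a 1 GB file (assuming one byte per character on disk):

$onDisk    = 1GB               # 1 byte per character in the file
$asString  = $onDisk * 2       # UTF-16 in memory: 2 bytes per code unit, ~2 GB
$twoCopies = $asString * 2     # a second copy/buffer while splitting, ~4 GB
"{0:N0} GB" -f ($twoCopies / 1GB)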
The solution, of course, is to work with bytes and not characters (strings).
If you want to avoid third-party programs, the best way to do this is to drop down to the .NET methods. This is most easily done with a full language like C# (which can be embedded into PowerShell), but it is possible to do purely with PS.
The idea is that you want to work with byte arrays, not text streams. There are two ways to do this:

1. Use [System.IO.File]::ReadAllBytes and [System.IO.File]::WriteAllBytes. This is pretty easy, and better than strings (no conversion, no 2x memory usage), but will still run into issues with very large files - say you wanted to process 100 GB files? (A minimal sketch of this approach follows the list.)

2. Use file streams and read/write in smaller chunks. This requires a fair bit more maths since you need to keep track of your position, but you avoid reading the entire file into memory in one go. This will likely be the fastest approach: the cost of allocating very large objects will probably outweigh the overhead of multiple reads.
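For completeness, here is a minimal sketch of the first option. The file name and split size are illustrative, and note that the whole file still has to be held in memory at once, so this only pushes the problem out rather than solving it:

# sketch of option 1: read the whole file as bytes, then write it back out in pieces
# (illustrative names; the full file is still held in memory at once)
[Environment]::CurrentDirectory = Get-Location
$fileName  = "test_1GB.txt"
$splitSize = 100MB
$allBytes  = [System.IO.File]::ReadAllBytes($fileName)
$part = 0
for ($offset = 0; $offset -lt $allBytes.Length; $offset += $splitSize) {
    $part++
    $length = [Math]::Min($splitSize, $allBytes.Length - $offset)
    $chunk  = New-Object byte[] $length
    [Array]::Copy($allBytes, $offset, $chunk, 0, $length)
    [System.IO.File]::WriteAllBytes("${fileName}_$part", $chunk)
}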
With the second, streaming approach, you read chunks of a reasonable size (a sensible minimum these days is 4 kB at a time) and copy them to the output file one chunk at a time, rather than reading the entire file into memory and splitting it. You may wish to tune the size upwards, e.g. 8 kB, 16 kB, 32 kB, etc., if you need to squeeze every last drop of performance out, but you'd need to benchmark to find the optimum size, as some larger sizes are slower.
An example script follows. For reusability it should be converted into a cmdlet or at least a PS function, but this is enough to serve as a working example.
$fileName = "foo"
$splitSize = 100MB
# need to sync .NET CurrentDirectory with PowerShell CurrentDirectory
# https://stackoverflow.com/questions/18862716/current-directory-from-a-dll-invoked-from-powershell-wrong
[Environment]::CurrentDirectory = Get-Location
# 4k is a fairly typical and 'safe' chunk size
# partial chunks are handled below
$bytes = New-Object byte[] 4096
$inFile = [System.IO.File]::OpenRead($fileName)
# track which output file we're up to
$fileCount = 0
# better to use functions but a flag is easier in a simple script
$finished = $false
while (!$finished) {
    $fileCount++
    $bytesToRead = $splitSize
    # just like File::OpenWrite, except CreateNew prevents overwriting existing files
    $outFile = New-Object System.IO.FileStream "${fileName}_$fileCount",CreateNew,Write,None
    while ($bytesToRead) {
        # read up to 4k at a time, but no more than the remaining bytes in this split
        $bytesRead = $inFile.Read($bytes, 0, [Math]::Min($bytes.Length, $bytesToRead))
        # 0 bytes read means we've reached the end of the input file
        if (!$bytesRead) {
            $finished = $true
            break
        }
        $bytesToRead -= $bytesRead
        $outFile.Write($bytes, 0, $bytesRead)
    }
    # dispose closes the stream and releases locks
    $outFile.Dispose()
    # if the input ended exactly on a split boundary, the last file created is empty; remove it
    if ($bytesToRead -eq $splitSize) {
        Remove-Item "${fileName}_$fileCount"
    }
}
$inFile.Dispose()
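If you do want to experiment with the buffer size mentioned earlier, one rough way is to wrap the code above in a script that exposes the buffer size as a parameter and time each candidate with Measure-Command. The Split-File.ps1 name and -BufferSize parameter below are purely illustrative:

# hypothetical timing harness: assumes the script above is saved as Split-File.ps1
# and that the 4096-byte buffer has been turned into a -BufferSize parameter
foreach ($size in 4KB, 8KB, 16KB, 32KB) {
    $elapsed = Measure-Command { .\Split-File.ps1 -BufferSize $size }
    "{0,6} byte buffer: {1:N1} s" -f $size, $elapsed.TotalSeconds
}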