Progressive Copier v1.1

Overview

The ProgressiveCopier (pcopier) is a utility to aid in imaging drives that have errors on them. A common problem with damaged drives is that copying them may take an extraordinarily long time from start to finish, sector by sector. It may never finish, and with damaged drives, getting as much data off as quickly as possible is important. Once the image has been created, it can be mounted or opened directly by a data recovery program without having to worry about bad sectors preventing the program from scanning the drive.

This utility aids this process in two ways. First, it maintains a map of sectors and whether they have been read or not. Second (and more importantly), it has a hierarchial strategy to copy the block device. The drive is divided up into blocks, and when copying a block of data, any errors will cause it to mark the block as incomplete and move to the next block. Each of these blocks is in turn divided into smaller blocks, forming a large hierarchy of blocks. The topmost (first) level 0 is a block large enough to encompass the entire device. The bottom (last) level is composed of blocks the size of the input's sector size. The number of sub-blocks each block is broken into defaults to 4, but can specified.

The advantage of this approach is that larger contiguous blocks of error-free data can be ready quickly, and the areas with errors are left for later. As the copy progresses, the detail with which it attempts to read the bad areas is reduced to smaller and smaller blocks.

As an example, imagine starting with a 128GB hard drive. The utility will start reading the whole 128GB drive as a single large block. If an error occurs 79GB into the drive, the 128GB block is marked as partial, and since it's the only block in the top level, the utility progresses to the next level. At this level, the drive is divided into 32GB blocks. The first two blocks are already complete and the third block is already partially read, so they are skipped. It then resumes at the fourth block. Suppose it completes fully, and moves onto the next level. At this level, the blocks are 8GB in size, and the only region that needs to be dealt with is from 64-96GB. The block from 64-72GB is complete, and the block from 72-79GB is marked as partial, so they will not be read. The blocks from 80-88GB and 88-96GB are marked as unread, so they are read. After reading, it then continues to the next level, etc, until execution is stopped or the last level is completed.

In cases where this utility is useful, the sector level will generally not be reached. If it is reached, it indicates that a straight start-to-end sector by sector copy would have been faster. The only utility in that case is where the ability to resume a copy is needed.

Both a command-line interface to run the program standalone and an API are provided. The API allows you to provide objects which implement the org.blinkenbyte.io.BlockDevice interface as the input and output devices, so that things other than files and devices can be used, or existing files and devices can be filtered (e.g., a RAID implementation using files). See the API for more details.

Under Windows, native code (WindowsBlockDevice.dll) is needed to read from physical hard drives (e.g. "\\?\PhysicalDrive0"). Under Linux, no native library is needed to open devices (e.g. "/dev/hda").

Download

Jar file: pcopier.jar
Windows native library: WindowsBlockDevice.zip
Source (Java and C): pcopier-src.zip
API: api.zip or browse online.

Map file details

This section be skipped unless you're interested in the details of how sector statuses are kept track of.

The map file has a header followed by the map data, which uses 2 bits per block. The number of blocks will depend on the number of sub-blocks (childrenPerNode) each block is divided into, and the virtual size of the input device. The virtual size of the input device is smallest block size that is equal or larger in size than the input device. It can be determined by starting with the sector size and repeatedly multiplying by childrenPerNode until a number greater or equal to the input device size is reached. For example, if the sector size is 512 bytes, childrenPerNode is 4, and the device's size is 48K, the virtual size will be 512 * 4 * 4 * 4 * 4 = 128K. This also tells me the number of levels and their block sizes. In this case, the levels are numbered 0-4. Level 0 is a single 128K block, level 1 is four 32K blocks, level 2 is sixteen 8K blocks, level 3 is 64 2K blocks, and level 4 is 256 512-byte blocks (256 sectors). The status for the virtual blocks past the end of the input is maintained as if they were real (and always successful), but read requests for these virtual blocks will never be passed to the input device.

The map file stores the status of every block in a large array of bits. The status of child blocks are not always updated; if a parent block is marked as complete, the status of the children is irrelevant. This saves a considerable amount of I/O to the map file. The storage is byte-oriented, so endian-ness is not a concern when reading the map. The low order bits of a byte are considered to come before the high-order bits of a byte. Because each status is 2 bits, a single byte can hold four statuses. At the last level, one byte of the map is responsible for 2048 bytes of input (assuming 512 byte sectors). Depending on childrenPerNode, the ratio between the virtual size of the input device and the size of the map will be roughly in the range of 1024-2048 to 1. If childrenPerNode is 4, the ratio is roughly 1536:1; assuming a worst-case scenario of the virtual size being 4 times as large as the actual input size, the ratio works out to 384:1, or roughly 0.26% of the input size, so the overhead compared to the resulting output is minimal.

As an example of how the map file stores statuses, bits 1-0 of byte 0 of the map contain the status for level 0 block 0, bits 3-2 contain the status for level 1 block 0, bits 5-4 for level 1 block 1, bits 7-6 for level 1 block 2, bits 1-0 of byte 1 for level 1 block 3, bits 3-2 of byte 1 for level 2 block 0, etc. More details of the map file header and storage can be found in the source code for org.blinkenbyte.pcopier.ProgressiveCopier.

Command-line Reference

The utility can be started from the command line by running "java -jar pcopier.jar". If unpacked, the name of the main class is org.blinkenbyte.pcopier.ProgressiveCopier. If running under Windows, the native library WindowsBlockDevice.dll should be in the same directory, or in a directory pointed to by the java.library.path property.

There are four modes of operation:

Progressive copy
Dump unread sectors list
Clear map region by level and block range
Clear map by sector range

Number postfixes
Numbers can end with K/k, M/m, and G/g, which multiply the number by 1024, 1048576, and 1073741824 respectively.

Sector range lists
Sector range lists are formatted one range per line, where a range consists of a starting sector number separated by a tab, space, or comma from the sector count. These two numbers can be written in hexadecimal by prefixing them with "0x" or "0X". Number postfixes are not allowed. The output format will always use a space as the separator character and decimal numbers.

Progressive copy

Performs a progressive copy from the specified input to the specified output, using the specified map file or memory.

ProgressiveCopier -i inputFile -o outputFile {-m mapFile | --map-memory}
    [--input-sector-size inputSectorSize] [--input-offset inputOffset]
    [--input-size inputSize] [--output-offset outputOffset]
    [--children-per-node childrenPerNode]
    [--initial-block-size initialBlockSize] [--end-block-size endBlockSize]
    [--input-bad-sectors inputBadSectorsFile] [--buffer-size bufferSize]
    [--preallocate-output] [--display-interval displayIntervalSeconds]
    [--output-incomplete-sectors incompleteSectorsFilename]
    [--force-input-device | --force-input-not-device]
    [--force-output-device | --force-output-not-device]

Example: java -jar pcopier.jar -i \\?\PhysicalDrive5 -o e:\image1.bin -m e:\image1.map --preallocate-output --display-interval 30

Option Description

-i inputFile Name of the input file/device.

-o outputFile Name of the output file/device.

-m mapFile Name of the map file.

--map-memory Instead of using a map file, construct the map in memory. Consequently, the state of the copy is volatile and will be lost if interrupted.

--input-sector-size inputSectorSize Sector size of the input device. Under Windows, an input opened as a device can have its sector size determined; otherwise, the default is 512 bytes.

--input-offset inputOffset Offset in bytes into the input device to start reading.

--input-size inputSize Size of the input file/device in bytes, overrides the automatically determined size.

--output-offset outputOffset Offset in bytes into the output device to start writing.

--children-per-node childrenPerNode Number of sub-blocks a block is split into at every new level. The value must be a power of 2, with a minimum of 2. This value plays a part in determining the size of topmost block that encompasses the entire input device. The program starts with the sector size, and repeatedly multiplies it by childrenPerNode until it obtains a value greater or equal to the input size. As explained in the map file details, each of the intermediate values is the size of a block at a particular level. The final value becomes the block size of the topmost block. A large value of childrenPerNode can cause the final value to overshoot the input size greatly, leading to an excessively large map.
Smaller values greatly increase the number of levels in the hierarchy, which may mean that bad areas are re-read excessively (the progression to small blocks happens slowly). Large values will have the opposite problem. The default value is 4; I think a value of 4, 8, or 16 is reasonable, but may vary depending on the situation. I think 2 is generally going to be too slow.
Using the default value of 4 and sector size 512 bytes, the worst-case size of the map works out to roughly 0.26% of the input size.

--initial-block-size initialBlockSize By default, the copier starts with the topmost block (the largest block), and progresses down to smaller blocks. If a smaller starting block size is desired, it can be specified by initialBlockSize. Valid block sizes can be determined by multiplying the sector size by childrenPerNode repeatedly (details in the --children-per-node option).

--end-block-size endBlockSize By default, the copier ends with the smallest block size, the sector. If you want the copier to stop after reaching a specific block size, you can specify endBlockSize. Valid block sizes can be determined by multiplying the sector size by childrenPerNode repeatedly (details in the --children-per-node option).

--input-bad-sectors inputBadSectorsFile If certain sector ranges should always be read as bad, the copier can wrap the input in a bad block emulator that will return errors for the desired sector ranges. This will prevent read requests from being made to those sector ranges. The list should be put in a file formatted as explained above in "Sector range lists".

--buffer-size bufferSize Specify a read buffer size. This is the largest read request size that will be made to the input device. If the copier is copying from a level with block sizes smaller than the buffer size, the read requests will be limited in size to the block size. The default buffer size is 256K.

--preallocate-output If the output does not exist or is undersized, this option will allocate the full size of the output by writing 0's to it.

--display-interval displayIntervalSeconds Normally, the program displays no output while the copy is in progress. This option enables it to print the current level and block number periodically. It runs in the same thread as the copier, checking to see if the requested interval has elapsed since the last output after every I/O. Consequently, a bad read that takes a long time to return can delay the output until it finishes.

--output-incomplete-sectors incompleteSectorsFilename When the copy finishes, a list of incomplete sector ranges can be written to a file. This can also be done with performing a copy using other options. The format of the list is explained above in "Sector range lists".

--force-input-device Force the input to be treated as a device. The default behavior is to look at the filename and treat it as a device if the device name starts with "\\?\" or "/dev/".

--force-input-not-device Force the input to not be treated as a device. The default behavior is to look at the filename and treat it as a device if the device name starts with "\\?\" or "/dev/".

--force-output-device Force the output to be treated as a device. The default behavior is to look at the filename and treat it as a device if the device name starts with "\\?\" or "/dev/".

--force-output-not-device Force the output to not be treated as a device. The default behavior is to look at the filename and treat it as a device if the device name starts with "\\?\" or "/dev/".

Option	Description
-i inputFile	Name of the input file/device.
-o outputFile	Name of the output file/device.
-m mapFile	Name of the map file.
--map-memory	Instead of using a map file, construct the map in memory. Consequently, the state of the copy is volatile and will be lost if interrupted.
--input-sector-size inputSectorSize	Sector size of the input device. Under Windows, an input opened as a device can have its sector size determined; otherwise, the default is 512 bytes.
--input-offset inputOffset	Offset in bytes into the input device to start reading.
--input-size inputSize	Size of the input file/device in bytes, overrides the automatically determined size.
--output-offset outputOffset	Offset in bytes into the output device to start writing.
--children-per-node childrenPerNode	Number of sub-blocks a block is split into at every new level. The value must be a power of 2, with a minimum of 2. This value plays a part in determining the size of topmost block that encompasses the entire input device. The program starts with the sector size, and repeatedly multiplies it by childrenPerNode until it obtains a value greater or equal to the input size. As explained in the map file details, each of the intermediate values is the size of a block at a particular level. The final value becomes the block size of the topmost block. A large value of childrenPerNode can cause the final value to overshoot the input size greatly, leading to an excessively large map. Smaller values greatly increase the number of levels in the hierarchy, which may mean that bad areas are re-read excessively (the progression to small blocks happens slowly). Large values will have the opposite problem. The default value is 4; I think a value of 4, 8, or 16 is reasonable, but may vary depending on the situation. I think 2 is generally going to be too slow. Using the default value of 4 and sector size 512 bytes, the worst-case size of the map works out to roughly 0.26% of the input size.
--initial-block-size initialBlockSize	By default, the copier starts with the topmost block (the largest block), and progresses down to smaller blocks. If a smaller starting block size is desired, it can be specified by initialBlockSize. Valid block sizes can be determined by multiplying the sector size by childrenPerNode repeatedly (details in the --children-per-node option).
--end-block-size endBlockSize	By default, the copier ends with the smallest block size, the sector. If you want the copier to stop after reaching a specific block size, you can specify endBlockSize. Valid block sizes can be determined by multiplying the sector size by childrenPerNode repeatedly (details in the --children-per-node option).
--input-bad-sectors inputBadSectorsFile	If certain sector ranges should always be read as bad, the copier can wrap the input in a bad block emulator that will return errors for the desired sector ranges. This will prevent read requests from being made to those sector ranges. The list should be put in a file formatted as explained above in "Sector range lists".
--buffer-size bufferSize	Specify a read buffer size. This is the largest read request size that will be made to the input device. If the copier is copying from a level with block sizes smaller than the buffer size, the read requests will be limited in size to the block size. The default buffer size is 256K.
--preallocate-output	If the output does not exist or is undersized, this option will allocate the full size of the output by writing 0's to it.
--display-interval displayIntervalSeconds	Normally, the program displays no output while the copy is in progress. This option enables it to print the current level and block number periodically. It runs in the same thread as the copier, checking to see if the requested interval has elapsed since the last output after every I/O. Consequently, a bad read that takes a long time to return can delay the output until it finishes.
--output-incomplete-sectors incompleteSectorsFilename	When the copy finishes, a list of incomplete sector ranges can be written to a file. This can also be done with performing a copy using other options. The format of the list is explained above in "Sector range lists".
--force-input-device	Force the input to be treated as a device. The default behavior is to look at the filename and treat it as a device if the device name starts with "\\?\" or "/dev/".
--force-input-not-device	Force the input to not be treated as a device. The default behavior is to look at the filename and treat it as a device if the device name starts with "\\?\" or "/dev/".
--force-output-device	Force the output to be treated as a device. The default behavior is to look at the filename and treat it as a device if the device name starts with "\\?\" or "/dev/".
--force-output-not-device	Force the output to not be treated as a device. The default behavior is to look at the filename and treat it as a device if the device name starts with "\\?\" or "/dev/".

Dump unread sectors list

Dumps a list of incomplete sector ranges, optionally to a file.

ProgressiveCopier --read-map -m mapFile
    [--output-incomplete-sectors incompleteSectorsFilename]

Example: java -jar pcopier.jar --read-map -m e:\image1.map --output-incomplete-sectors e:\incomplete.txt

Option Description

-m mapFile Name of the map file.

--output-incomplete-sectors incompleteSectorsFilename Output filename. If this option is not used, the list will be dumped to the standard output.

Option	Description
-m mapFile	Name of the map file.
--output-incomplete-sectors incompleteSectorsFilename	Output filename. If this option is not used, the list will be dumped to the standard output.

Clear map region by level and block range

Marks a range of blocks and all their sub-blocks as unread in the map file, starting from a specified level. This does not necessarily mean the blocks will be considered unread -- if a higher block has marked the region as completely read, the status of a sub-block does not matter. This is more useful as a debugging tool, or to re-read an area starting from a specific level that has already been completed.

ProgressiveCopier --clear-map-blocks -m mapFile
    [--clear-starting-level clearStartingLevel]
    [--clear-block-range clearStartBlock clearBlockCount]

Example: java -jar pcopier.jar --clear-map-blocks -m e:\image1.map --clear-starting-level 5 --clear-block-range 768 -1

Option Description

-m mapFile Name of the map file.

--clear-starting-level clearStartingLevel The level at which to start marking blocks as unread in the map file. Default is 0 (the large topmost block).

--clear-block-range clearStartBlock clearBlockCount The range of blocks to mark as unread. The value clearStartBlock is the starting block number, and clearBlockCount is the number of blocks. If clearBlockCount is less than zero, the value is calculated as the remaining number of blocks in the level.

Option	Description
-m mapFile	Name of the map file.
--clear-starting-level clearStartingLevel	The level at which to start marking blocks as unread in the map file. Default is 0 (the large topmost block).
--clear-block-range clearStartBlock clearBlockCount	The range of blocks to mark as unread. The value clearStartBlock is the starting block number, and clearBlockCount is the number of blocks. If clearBlockCount is less than zero, the value is calculated as the remaining number of blocks in the level.

Clear map by sector range

Marks a range of sectors as unread in the map file. This differs from clearing by block range because the status of the selected sectors will be changed to unread by modifying the appropriate blocks at all necessary levels of the map. The basic procedure it uses is to start from the topmost level, and any complete blocks will either be changed to partial or unread depending on the overlap. Complete blocks that are changed to partial will push the complete status down to their sub-blocks so that when finished, only the selected sectors will have been marked as unread.

ProgressiveCopier --clear-map-sectors -m mapFile
    [--clear-sector-range clearSectorStart clearSectorCount]

Example: java -jar pcopier.jar --clear-map-sectors -m e:\image1.map --clear-sector-range 234123567 4096

Option Description

-m mapFile Name of the map file.

--clear-sector-range clearSectorStart clearSectorCount The range of sectors to mark as unread. The value clearSectorStart is the starting sector number, and clearSectorCount is the number of sectors. If clearSectorCount is less than zero, the value is calculated as the remaining number of sectors in the map.

Option	Description
-m mapFile	Name of the map file.
--clear-sector-range clearSectorStart clearSectorCount	The range of sectors to mark as unread. The value clearSectorStart is the starting sector number, and clearSectorCount is the number of sectors. If clearSectorCount is less than zero, the value is calculated as the remaining number of sectors in the map.

History

1.0 - Release

1.1 - Minor change to the special case where the first read of a block fails. This has to be handled specially, because if the first read of a block fails and only that block is marked at partial, the copier will attempt to read the exact same area on the next level. The behavior in v1.0 was to mark all blocks down to the level just above sector-size as partial to prevent rereads from occurring there. The new behavior is to mark all the blocks down as partial down to the smallest block size that is still larger than the requested read size (typically the size of the input buffer). Suppose the very first read of a 32MB block fails. Rather than mark all blocks aligned on that boundary down to the 2K block as partial, if the buffer size was 256K, all blocks down to the 512K block size level will be marked as partial, and blocks below that level (128K to 512 bytes) will still be marked as unread. This was on my to-do list for 1.0, but I forgot about it.

2008/11/24 - Updated source and jar, map-memory option not recognized.

Future Plans

At some point I'll add synchronization to make monitoring the ProgressiveCopier from another thread safe.

One behavior I'd like to make explicit is whether blocks should be marked as partial before reading in case the read locks up the machine. As of now, if a read never finishes, the block remains marked as empty if it was the first read of the block.

One more change I should make is to make blocks immutable. There's no good reason for them not to be.