About a decade ago, virtually all electronically stored information (ESI) productions we performed were in static format (i.e., PDF/TIFF/JPG with accompanying load files). Legal review platforms were designed to work with static productions, and law firms preferred them due to their plug-and-play nature—a proper static production can be loaded into a review platform without much effort. During the past two years, we have seen an increasing interest in productions in native file format. Considering the amount of information that can be extracted from raw data, it is not hard to understand why lawyers demand access to electronic documents in their native format.
What is Native File Format?
Native file format is the file structure in which a document is created and maintained by the original creating application. For example, the native file format of a file created by the Microsoft Excel application would be Microsoft Excel format. When it comes to e-mails, the native file format would be the format in which the e-mails are created and stored. For example, Microsoft Exchange Server stores e-mails in a proprietary database format which consists of “.edb” (MAPI-based Database) and “.stm” (Streaming Database) files. These binary files are created and maintained by the Exchange Server software and are the native format for Exchange e-mails. Additionally, local copies of cached and archived e-mails can be found on users’ workstations. These e-mails are stored as OST (Offline Storage Table) and PST (Personal Storage Table) files, which would be their native format.
Feasibility of Producing in Native File Format
In some cases, obtaining a copy of a set of documents in their native file format can be very burdensome or unfeasible. For example, imagine the following two scenarios:
- Webmail Productions Let’s assume that we are asked to produce webmail data (e.g., e-mails and their attachments stored in a service such as Yahoo Mail or GMail) in its native format. Presumably, the service provider would store the e-mail messages in a database-like structure, which the end user would not have direct access to. When the end user accesses her e-mails over the web interface, the back-end data structures are queried and the resulting data is used to populate the user interface. Similarly, when e-mails are downloaded via POP (Post Office Protocol) or IMAP (Internet Message Access Protocol ), the back-end data is queried and results are transmitted over the internet in text format.Additionally, e-mail account data for multiple customers (or even thousands of customers) may be stored together in a single container. Obtaining access to these back-end data structures in the service provider’s data center would be unfeasible in most cases.
- Exchange E-mail Productions Another common issue encountered during the production of e-mails in native file format is scope. If we were to produce e-mails from an Exchange Server in their native file format, we would need to produce the proprietary Exchange Server database in its entirety. In the event that this database contains data for custodians and/or time periods outside the scope of the investigation in hand, it may be burdensome, or even inappropriate to produce the entire Exchange Server database.
Usability of Native File Format Productions
In most cases, native file format has superior usability compared to other production formats. However, there are some special cases where an ESI production in native file format may not be reasonably usable. Let’s look at a couple of examples:
- Encrypted Data There may be cases where the native file format for an ESI source is in encrypted form. For example, a party may be ordered to produce a password-protected PDF file, or a relational database which cannot be accessed without decryption keys. In these cases, receiving the data in its native format would not be useful without the necessary decryption keys. This is not much different than turning over a computer with an encrypted disk or volume (e.g., BitLocker or FileVault 2) for forensic analysis without the decryption key.
- Proprietary Data Sources Another example would be ESI sources that cannot be reasonably interpreted without the creating software. For example, imagine that the producing party is an investment firm that is using a trading platform developed in-house. The software stores data in a back-end database, but the tables are not intelligible, or perhaps even intentionally obfuscated for security reasons. The only way to gather useful information is to access the data through the trading platform itself. In this case, receiving a copy of the back-end database in native file format would not be very useful to the requesting party.
The above scenarios—and others where feasibility or usability are of concern—call for an alternative to producing in native file format. Often, the alternative is producing in near-native file format.
What is Near-native File Format
Near-native format is a derived file format that retains the relevant metadata and essential functionality of the native file format. For example, producing a relevant subset of Microsoft Exchange e-mails in MSG or reconstituted PST format, or producing webmail in PST format would be considered producing in near-native file format. When dealing with proprietary databases, the near-native format can be relevant reports prepared in searchable electronic format. Sometimes, it may be appropriate to accompany the reports with electronic tabular data (e.g., in Excel or Access format) extracted from the proprietary database. The accompanying electronic tabular data can be easily queried, filtered and sorted using easily accessible tools.
Methods Used While Producing ESI in Native File Format
Native file format productions typically fall into the following three categories:
- Original Folder Structure and File Names Native files can be produced in their original folder structure, without changing their file names. This can be accomplished by copying them in a forensically sound manner, or producing them inside a full or targeted forensic image. When files are copied, it is important to verify that the copy operation was performed accurately (i.e., source and destination cryptographic hashes match) and that the file system metadata was preserved.
- Flattened Folder Structure and File Names with Control Numbers In some cases, native files are placed in a flattened folder structure—or multiple folders containing a fixed number of files each, such as 1,000 files per folder—and their names are prefixed or suffixed with a combination of production control numbers (i.e., Bates numbers) and confidentiality designations (see Figure 1 below). The same concerns regarding cryptographic hash match and file system metadata preservation apply here as well.
- Extracted Native/Near-native Files with Accompanying Load Files Sometimes, native files are run through native-only e-Discovery processing before production. In this case, compound files would be extracted, parent/child relationships would be established, text and metadata would be extracted, and files without extractable text—such as images and non-searchable PDFs—would be OCRed. After processing, the native/near-native files would be exported—typically each file named after its corresponding control number—and produced along with a metadata load file and extracted/OCR text files. The metadata load file would contain, among other things, the cryptographic hash of the original file as well as the file system and internal metadata captured during processing.The requesting party can usually load a native/near-native file production prepared in this manner into their legal review platform, without having to perform their own e-Discovery processing. It is important to come to an agreement on how compound files should be extracted, and which metadata fields should be provided so that parties can avoid processing the files twice—once on the producing and once on the receiving end.
Reviewing Native File Format ESI Productions
Legal review tools usually facilitate the review of native files by using the original application (e.g., Microsoft Excel for an Excel spreadsheet) or a third-party viewer—such as Oracle Outside In (formerly Stellent)—which presents a rendering of each native file as close to that of the original application as possible. The latter option results in a conversion process; the reviewer would be looking at a near-native rendering rather than the native file displayed by the original application.
One of the concerns during native review and production is that the reviewer would need to be a skilled user of the original application in order to ensure that all relevant data was examined. For instance, a reviewer lacking the requisite experience can miss hidden content and comments in a spreadsheet, revisions in a word processing application or the different layers, viewports and metadata in a 3D modeling application. This increases the risk of inadvertent disclosure of protected information during an ESI production.
In my experience, a significant part of the pain associated with native file productions stems from the fact that the terminology is not fully understood among the parties, and the feasibility and usability of a native file production is not thoroughly evaluated. Sometimes the requesting party requires a native file production, based solely on the benefits that they heard, and ends up with a production that is not reasonably usable to them. On the other hand, producing parties agree to native file productions and ultimately have to produce in a different format because they find down the road that producing in native format is not feasible in their case.
It is crucial to fully understand what it means to produce in native file format, and to weigh the benefits and disadvantages of doing so before requesting an ESI production in native format or agreeing to produce in native format.