OP3: Implementing a file system that provides provenance on top of Git

In class we designed a filesystem that provides provenance. The task for this OP is to "implement" (conceptually) such a filesystem on top of the Git distributed version control system, in such a way that your filesystem can gracefully handle disconnected operation. Once you have gone through this design exercise, describe in your OP specifically how you chose to map the file data and corresponding metadata onto the Git elements, with a particular focus on how to name your content-addressable files and their content-addressable components.
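
As a starting point for the naming question, it may help to recall how Git itself names content. The sketch below reproduces the object ID that Git assigns to a blob, assuming a repository that uses the default SHA-1 object format. The helper `part_ids` and the idea of storing each part as its own blob are illustrative assumptions, not part of the class design.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the object ID Git (SHA-1 object format) gives a blob with this content."""
    # Git hashes a short header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def part_ids(parts: list[bytes]) -> list[str]:
    """Hypothetical helper: name each part of a file by the blob ID of its bytes."""
    return [git_blob_id(part) for part in parts]


# Two files that contain the same slide bytes end up sharing one name for it.
deck_one = [b"title slide", b"shared slide"]
deck_two = [b"shared slide", b"closing slide"]
shared = set(part_ids(deck_one)) & set(part_ids(deck_two))
```

Because the name depends only on the bytes, a part copied unchanged between files keeps the same name without any coordination, which is also what makes disconnected operation tractable.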
 
Class design topic description (note: as with real life system designs, this design is under-specified, and it is your job to complete the specification in a sensible way given the overall requirements):
 
Design a file system that keeps track of provenance for files and that allows users to query that provenance. Think of tracking that provenance in particular as a naming problem; in particular, one should be able to look up parts of files based on content, not explicit name.
 
The filesystem shall be distributed (e.g., like Dropbox or Wuala), and it shall keep track of where the contents of a particular file came from, and which other files may have in turn been influenced by this file.
 
For example, if you edit a PowerPoint presentation, you sometimes copy individual slides from other presentations, and then modify them slightly. Later, you may want to find where a particular slide was copied from, or, when discovering a mistake in some slide, you may want to know what other presentations the mistake may have been copied to (even if that slide may have changed in those other presentations).
 
Your system should support files that are logically composed of many components, and where each component might have a different provenance. For example, a PowerPoint file contains multiple slides, each of which may have been copied from a different source. Similarly, you may have a zip or tar file, which in turn contains several files that have their own provenance.
 
Your system should also allow applications to supply additional information about the provenance of files. For example, a single process running PowerPoint may access many different presentation files, and modify many presentation files as well. With some changes to the application by PowerPoint developers, it should be possible to precisely keep track of which individual slides of the modified presentations depend on which individual slides of the presentations accessed by that process.
 
Each file in the file system has associated with it a set of parts. Each part corresponds to an application-defined unit of the file, such as a slide in a PowerPoint file, or an individual compressed file in a zip file. A null part indicates the entire file, for applications that do not otherwise define any parts.
 
The system keeps track of the provenance for each part of the file. Beware of name overwriting along a given timeline: suppose a user copies a PowerPoint slide from file B to file C, and then replaces that slide in file B with a different slide from file A. If the provenance for this slide in file C points to the corresponding slide in file B, and the provenance for B's slide now points to file A, one may wrongly conclude that C's slide was sourced from file A's slide.
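
One possible way to avoid the overwriting problem, sketched below, is to record provenance against a hash of the part's content rather than against a (file, slot) name, so that replacing a slide in B cannot change what C's slide points to. The `ProvenanceStore` class, its method names, and the use of plain SHA-1 over the bytes are assumptions made for illustration only.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Name a part by a hash of its bytes (plain SHA-1 here, for brevity)."""
    return hashlib.sha1(data).hexdigest()


class ProvenanceStore:
    """Hypothetical store: slots are mutable, provenance edges are not."""

    def __init__(self) -> None:
        # (file, slot) -> content ID currently stored there; may be overwritten.
        self.slots: dict[tuple[str, int], str] = {}
        # content ID -> content IDs it was derived from; only ever grows.
        self.sources: dict[str, set[str]] = {}

    def write_part(self, file: str, slot: int, data: bytes,
                   copied_from: tuple[str, ...] = ()) -> str:
        cid = content_id(data)
        self.slots[(file, slot)] = cid
        # An unmodified copy has the same ID as its source, so skip self-edges.
        self.sources.setdefault(cid, set()).update(
            src for src in copied_from if src != cid
        )
        return cid

    def ancestors(self, cid: str) -> set[str]:
        """All content IDs that this part was directly or indirectly derived from."""
        seen: set[str] = set()
        stack = [cid]
        while stack:
            for src in self.sources.get(stack.pop(), ()):
                if src not in seen:
                    seen.add(src)
                    stack.append(src)
        return seen


# The scenario from the text: copy B's slide into C, then overwrite B's slide
# with one taken from A.
store = ProvenanceStore()
a = store.write_part("A.pptx", 0, b"slide from A")
b_old = store.write_part("B.pptx", 0, b"original slide in B")
c = store.write_part("C.pptx", 0, b"original slide in B, lightly edited",
                     copied_from=(b_old,))
b_new = store.write_part("B.pptx", 0, b"slide from A, lightly edited",
                         copied_from=(a,))
```

In this sketch `store.ancestors(c)` still contains only the old B slide, because the edge was recorded against that slide's content ID and not against "slot 0 of B.pptx".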
Your system must support the following use cases:
 
  1. PowerPoint slide copying. Suppose a user Alice opens several PowerPoint presentations, copies several slides from each of them into a new PowerPoint presentation, and saves the new presentation file. Later on, she should be able to track down which presentation each slide in the new file came from, and be able to open that file. If Alice discovers a mistake in a PowerPoint slide, she should also be able to track down all copies of that slide in other files, without having to look at irrelevant files (e.g., files that contain a copy of a different slide from that presentation).
  2. Compiling software. Suppose Alice compiles a large piece of software using make, and then installs the resulting binary onto her system. Alice should be able to later track down the sources she used to compile the binary. Ideally, it should also be possible for Alice to track down the URL from which she downloaded the software in the first place.
  3. Copying files with unaware programs. Alice should be able to copy files (such as PowerPoint files or compiled executables) with a program that has not been modified to know about provenance, and still be able to correctly track the provenance of the files afterwards.
  4. Handling tar/zip files. Alice should be able to pack several files into a single tar or zip file, extract such tar or zip files, and the provenance information should make sense. For example, suppose Alice creates a zip file containing two PowerPoint files, where the first file contains slides copied from the second file, and sends the zip file to Bob. When Bob extracts this zip file, he should be able to tell that one of the first file's slides came from the second file. It's OK to require some modifications to the zip and tar programs to support this. Separately, what should be the provenance on the zip file itself, after it is created?
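
The "find every copy of a bad slide" query in use case 1 can be sketched as a walk over a reverse provenance index. This is one possible approach, assuming provenance is available as a mapping from each part to the parts it was derived from; the function `affected_files` and the part and file names are hypothetical.

```python
from collections import defaultdict, deque


def affected_files(bad_part: str,
                   derived_from: dict[str, set[str]],
                   file_parts: dict[str, list[str]]) -> set[str]:
    """Return the files holding bad_part or anything derived from it."""
    # Invert the edges: source part -> parts derived from it.
    children: dict[str, set[str]] = defaultdict(set)
    for child, sources in derived_from.items():
        for source in sources:
            children[source].add(child)

    # Breadth-first walk over descendants of the bad part.
    tainted = {bad_part}
    queue = deque([bad_part])
    while queue:
        for child in children[queue.popleft()]:
            if child not in tainted:
                tainted.add(child)
                queue.append(child)

    # Only files containing a tainted part are reported.
    return {name for name, parts in file_parts.items() if tainted & set(parts)}


derived_from = {"s2": {"s1"}, "s3": {"s2"}, "t2": {"t1"}}
file_parts = {
    "a.pptx": ["s1", "t1"],
    "b.pptx": ["s2"],
    "c.pptx": ["s3"],
    "d.pptx": ["t2"],  # copies a different slide from a.pptx
}
hits = affected_files("s1", derived_from, file_parts)
```

Here `d.pptx` is not reported even though it copies from the same presentation, since it descends from a different slide. A real design would keep the reverse index persistently instead of rebuilding it per query.
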
Your design should store the provenance information persistently. That is, if the computer is turned off, and then turned on again, the provenance information must be available. You don't have to worry about the computer crashing during normal operation.
 
Your design does not need to be robust in the face of buggy applications, malicious applications, or any security issues. It's OK if your design allows a specially-designed application to erase the provenance of some files. Your design does not need to deal with failures (e.g., power outages, disk failures, etc.), or with malicious users (e.g., users changing the provenance information with a disk editor).
 
The system must scale using reasonable hardware requirements, not Google-scale million-server farms. You are allowed to assume you can change the OS kernels of all the clients involved (e.g., you could be Apple, intending to make this filesystem part of Mac OS).
 
Some more philosophical questions to ponder:
  • at the end of the day, is there a need for unique names for the file objects in this system?
  • how can such a system deal with security and privacy issues?
  • how does one garbage-collect the system?