In class we designed a filesystem that provides provenance. The task for this OP is to "implement" (conceptually) such a filesystem on top of the Git distributed version control system, in such a way that your filesystem can gracefully handle disconnected operation. Once you have gone through this design exercise, describe in your OP specifically how you chose to map the file data and corresponding metadata onto the Git elements, with a particular focus on how to name your content-addressable files and their content-addressable components.
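As a starting point for the naming question, it may help to recall how Git itself names content. The sketch below computes the object id Git assigns to a blob in a default SHA-1 repository: the hash of a short header followed by the raw bytes. The helper name `git_blob_id` and the idea of storing each part as its own blob are illustrative assumptions, not a required design.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Return the SHA-1 object id Git gives a blob holding `data`.

    Git hashes the header b"blob <size>\\0" followed by the content,
    so identical content always gets the same name.
    """
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


# Assumption for illustration: each part (e.g. one slide) is stored as one blob.
slide = b"hello\n"
print(git_blob_id(slide))
```

Because the name depends only on the bytes, two clients that create the same part while disconnected arrive at the same id without coordinating.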
Class design topic description (note: as with real life system designs, this design is under-specified, and it is your job to complete the specification in a sensible way given the overall requirements):
Design a file system that keeps track of provenance for files and that allows users to query that provenance. Think of tracking provenance as a naming problem: one should be able to look up parts of files based on content, not explicit name.
The filesystem shall be distributed (e.g., like Dropbox or Wuala), and it shall keep track of where the contents of a particular file came from, and which other files may have in turn been influenced by this file.
For example, if you edit a PowerPoint presentation, you sometimes copy individual slides from other presentations, and then modify them slightly. Later, you may want to find where a particular slide was copied from, or, when discovering a mistake in some slide, you may want to know what other presentations the mistake may have been copied to (even if that slide may have changed in those other presentations).
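The two queries in this example (where did a slide come from, and where did it spread to) can be read as backward and forward reachability over a graph of "derived from" edges. The following is a minimal sketch under that assumption; the class `ProvenanceGraph`, its method names, and the ids `s1`..`s3` are hypothetical and stand in for whatever identifiers your design uses.

```python
from collections import defaultdict, deque


class ProvenanceGraph:
    """Hypothetical in-memory index of 'derived from' edges between part ids."""

    def __init__(self):
        self.sources = defaultdict(set)  # part id -> ids it was derived from
        self.derived = defaultdict(set)  # part id -> ids derived from it

    def record(self, new_id, source_id):
        """Note that new_id was created by copying or editing source_id."""
        self.sources[new_id].add(source_id)
        self.derived[source_id].add(new_id)

    @staticmethod
    def _reach(start, edges):
        # Breadth-first walk; returns everything reachable from start.
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def origins(self, part_id):
        """All parts this part was (transitively) derived from."""
        return self._reach(part_id, self.sources)

    def influenced(self, part_id):
        """All parts (transitively) derived from this part."""
        return self._reach(part_id, self.derived)


g = ProvenanceGraph()
g.record("s2", "s1")  # s2 is a modified copy of s1
g.record("s3", "s2")  # s3 is a modified copy of s2
print(g.origins("s3"), g.influenced("s1"))
```

Note that a slightly modified copy has different content than its source, so the edge has to be recorded explicitly at copy time; it cannot be recovered from content ids alone.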
Your system should support files that are logically composed of many components, and where each component might have a different provenance. For example, a PowerPoint file contains multiple slides, each of which may have been copied from a different source. Similarly, you may have a zip or tar file, which in turn contains several files that have their own provenance.
Your system should also allow applications to supply additional information about the provenance of files. For example, a single process running PowerPoint may access many different presentation files, and modify many presentation files as well. With some changes to the application by PowerPoint developers, it should be possible to precisely keep track of which individual slides of the modified presentations depend on which individual slides of the presentations accessed by that process.
Each file in the file system has associated with it a set of parts. Each part corresponds to an application-defined unit of the file, such as a slide in a PowerPoint file, or an individual compressed file in a zip file. A null part indicates the entire file, for applications that do not otherwise define any parts.
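One possible way to picture the file/part model described above is sketched here. All names (`Part`, `FileEntry`, `opaque`, `find_by_content`) are hypothetical, and SHA-256 is used only as a stand-in for whichever content hash your design adopts; `key=None` plays the role of the null part.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Part:
    key: Optional[str]  # application-defined label; None is the null part
    content: bytes

    @property
    def content_id(self) -> str:
        # Content-derived name, independent of the file or label.
        return hashlib.sha256(self.content).hexdigest()


@dataclass
class FileEntry:
    path: str
    parts: list[Part] = field(default_factory=list)

    @classmethod
    def opaque(cls, path: str, data: bytes) -> "FileEntry":
        """A file whose application defines no parts: one null part."""
        return cls(path, [Part(None, data)])

    def find_by_content(self, content_id: str) -> list[Part]:
        """Look up parts by what they contain, not by their label."""
        return [p for p in self.parts if p.content_id == content_id]


deck = FileEntry("talk.pptx", [Part("slide-1", b"intro"), Part("slide-2", b"results")])
note = FileEntry.opaque("notes.txt", b"intro")
match = deck.find_by_content(note.parts[0].content_id)
print([p.key for p in match])
```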
The system keeps track of the provenance for each part of the file. Beware of name overwriting along a given timeline: say a user copies a PowerPoint slide from file B to C, and then replaces that slide in file B with a different slide from file A. If the provenance for this slide in file C points to the corresponding slide in file B, and the provenance for B's slide now points to file A, one may wrongly conclude that C's slide was sourced from file A's slide.
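The overwriting hazard can be reproduced in a few lines. The sketch below records provenance two ways for the B, C, A scenario: once by mutable name (file and slot), and once by pinning the content id of the source at copy time. The dictionaries `by_name` and `pinned`, the helper `cid`, and the slot numbering are assumptions made for illustration only.

```python
import hashlib


def cid(data: bytes) -> str:
    """Short content id (SHA-256 prefix), used here as an immutable name."""
    return hashlib.sha256(data).hexdigest()[:12]


files = {"A": {1: b"slide from A"}, "B": {1: b"original B slide"}, "C": {}}
by_name = {}  # (file, slot) -> (source file, slot): mutable names
pinned = {}   # (file, slot) -> (source file, content id at copy time)


def copy_slide(src: str, dst: str, slot: int = 1) -> None:
    data = files[src][slot]
    files[dst][slot] = data
    by_name[(dst, slot)] = (src, slot)
    pinned[(dst, slot)] = (src, cid(data))


def trace_by_name(file: str, slot: int = 1) -> list[str]:
    """Follow name-based links; goes wrong once a name is overwritten."""
    chain, key = [], (file, slot)
    while key in by_name:
        key = by_name[key]
        chain.append(key[0])
    return chain


copy_slide("B", "C")  # C's slide comes from B's original slide
copy_slide("A", "B")  # B's slide is then replaced by A's slide

print(trace_by_name("C"))  # name-based chain leads through B to A
print(pinned[("C", 1)][1] == cid(files["A"][1]))  # pinned id does not match A
```

The name-based trace reports A as an ancestor of C's slide, while the pinned record still identifies the version of B's slide that was actually copied. In Git terms this corresponds to referring to an immutable object id rather than to a path whose content changes over time.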
Your system must support the following use cases:
Your design should store the provenance information persistently. That is, if the computer is turned off, and then turned on again, the provenance information must be available. You don't have to worry about the computer crashing during normal operation.
Your design does not need to be robust in the face of buggy applications, malicious applications, or any security issues. It's OK if your design allows a specially-designed application to erase the provenance of some files. Your design does not need to deal with failures (e.g., power outages, disk failures, etc.), or with malicious users (e.g., users changing the provenance information with a disk editor).
The system must scale using reasonable hardware requirements, not Google-scale million-server farms. You are allowed to assume you can change the OS kernels of all the clients involved (e.g., you could be Apple, intending to make this filesystem part of Mac OS).
Some more philosophical questions to ponder: