File system interfaces have remained remarkably unchanged throughout the computing era. The stability and simplicity of the interface commonly exposed by modern file systems have created consistent, performance-oriented environments on which other applications can be layered. Because stored content is organized hierarchically, users can easily navigate and manage large numbers of files through current command-line and graphical interfaces. However, users and applications are more interested in the information that can be retrieved from the data, which makes data quality checks an important requirement for contemporary file systems, especially for scientific data sets. Additionally, in the age of distributed computing, better metadata management is required to organize files across several devices while maintaining the information needed for interfacing.
The proposed Khan file system aims to solve these issues of data quality and the organization of rich metadata in a distributed world. Using unique directory structures, intelligent metadata management, and a semantic approach, Khan builds upon other file systems to solve problems in the ubiquitous computing age. Khan's goals include determining data quality using user-defined parameters, interfacing with and organizing files from new stores such as the cloud and mobile devices, and remaining extensible for future needs while retaining and improving upon the performance characteristics of contemporary file systems.
Source Code: http://svn.research.cc.gatech.edu/kaos/khan/
Students: Drew Bratcher, Akash Gangil
Faculty Advisor: Matthew Wolf
SciKhan is a distributed data management and analytics system for scientific data, derived from Khan. Scientific experiments continue to generate large amounts of data, ranging from a few terabytes to petabytes. Owing to the increasingly common use of a wide range of sensors, which can now be part of an experimental setup, the amount of experimental data is bound to grow exponentially. Numerous data management and big data analytics systems have been designed to manage this data burst, but most of them ignore the near-real-time requirements of scientific computing. In this paper, we present SciKhan, a scalable data management solution that enables scientists to build custom functional pipelines to extract relevant metadata. This metadata can then be used to semantically organize the data and to aid in early error detection during the experiment’s run time. SciKhan supports deploying analytics routines at run time, allowing users to apply exploratory error detection techniques to the experimental data. It provides scientists with a semantic file-system-based interface, which also serves as a feedback mechanism for reconfiguring the system during the experiment.
Current Students: Akash Gangil, Erich Lohrmann
Faculty Advisor: Matthew Wolf
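The pipeline idea in the SciKhan description can be illustrated with a minimal Python sketch. It is not SciKhan's actual API: the function names (`summarize`, `make_range_check`, `run_pipeline`, `semantic_path`), the metadata keys, the sample values, and the path layout are all hypothetical, chosen only to show how chained metadata extractors might feed both semantic organization and early error flagging.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical types: a stage reads the raw samples plus the metadata
# gathered so far and returns an enriched copy of that metadata.
Metadata = Dict[str, object]
Stage = Callable[[List[float], Metadata], Metadata]


def summarize(samples: List[float], meta: Metadata) -> Metadata:
    """Extract simple descriptive metadata from the raw samples."""
    return {**meta, "count": len(samples), "mean": mean(samples)}


def make_range_check(low: float, high: float) -> Stage:
    """Build an error-detection stage from user-defined bounds."""
    def check(samples: List[float], meta: Metadata) -> Metadata:
        bad = [s for s in samples if not (low <= s <= high)]
        return {**meta, "suspect": bool(bad), "out_of_range": len(bad)}
    return check


def run_pipeline(stages: List[Stage], samples: List[float],
                 meta: Metadata) -> Metadata:
    """Apply each stage in order, threading the metadata through."""
    for stage in stages:
        meta = stage(samples, meta)
    return meta


def semantic_path(meta: Metadata) -> str:
    """Derive a metadata-based path (assumed layout, for illustration)."""
    status = "suspect" if meta["suspect"] else "ok"
    return f"/sensor/{meta['sensor']}/status/{status}/{meta['name']}"


# The stage list is ordinary data, so a new check can be appended
# while an experiment is running.
stages: List[Stage] = [summarize]
stages.append(make_range_check(0.0, 100.0))

meta = run_pipeline(stages, [20.5, 21.0, 250.0],
                    {"name": "run42.dat", "sensor": "thermo1"})
path = semantic_path(meta)
print(path)
```

Keeping each stage a plain function over the samples and the accumulated metadata is one way to make routines composable and replaceable at run time; a real deployment would also need distribution, persistence, and a file-system front end, none of which this sketch attempts.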