One of the challenges facing enterprise data managers is understanding the usage patterns of files residing on unstructured storage.
Software for storage management should quickly bring to the attention of data managers those events and patterns of events that suggest problems and opportunities.
Here are some examples of file usage that may be of interest to data managers:
1. If a user attempts to read a file containing sensitive salary data, is the access attempt indicative of a worrisome security violation?
2. Suppose spreadsheet files with financial information reside unprotected in a directory. Should the data manager consider changing the access rights for the directory?
3. If a directory containing a large amount of data has had few accesses over the past month, should the data manager consider moving those files to offline storage?
4. Given a high level of usage of files residing on a CIFS server, should the data manager consider, for performance reasons, redistributing the files across other servers?
5. If a user often writes large files to disk, does this suggest inappropriate behavior, such as misuse of company resources for a personal business? (Information about the content of the files would help in evaluating whether the behavior is inappropriate.)
6. If a user reads a large number of files at an unusual time (e.g., in the middle of the night, or just before leaving the company), is this suggestive of undesired behavior?
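To make the last example concrete, here is a minimal sketch of a rule that flags users who read many files outside business hours. The event format, the 8:00 to 18:00 window, and the threshold of three reads are all assumptions chosen for illustration; a real product would draw these from its own audit log and configuration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical audit events: (user, ISO 8601 timestamp, operation).
EVENTS = [
    ("alice", "2024-03-04T10:15:00", "read"),
    ("alice", "2024-03-04T11:02:00", "read"),
    ("bob", "2024-03-05T02:10:00", "read"),
    ("bob", "2024-03-05T02:11:00", "read"),
    ("bob", "2024-03-05T02:12:00", "read"),
    ("bob", "2024-03-05T14:00:00", "write"),
]


def off_hours_reads(events, start_hour=8, end_hour=18):
    """Count reads per user that fall outside [start_hour, end_hour)."""
    counts = Counter()
    for user, timestamp, operation in events:
        hour = datetime.fromisoformat(timestamp).hour
        if operation == "read" and not (start_hour <= hour < end_hour):
            counts[user] += 1
    return counts


def flag_users(events, threshold=3):
    """Return, in sorted order, users with at least `threshold` off-hours reads."""
    counts = off_hours_reads(events)
    return sorted(user for user, n in counts.items() if n >= threshold)


print(flag_users(EVENTS))  # ['bob']
```

A fixed rule like this is easy to explain but needs manual tuning for each site, which is part of the motivation for the learning-based approach discussed later.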
How can data management software help data managers zero in on such usage patterns of interest?
First, the software should display current and historical data in tabular or graphical reports, with results ordered by likely significance. Second, the software should implement alerts that notify data managers, by email or by graphical display elements, of situations of particular interest. Third, the software should let data managers explore the data along different dimensions and with various restrictions. Finally, the software should allow data managers to customize reports and alerts to omit useless information and focus on interesting patterns.
These requirements can place a significant burden on data managers, both for customizing reports and alerts and for examining the data. As much as possible, the software should help data managers by automatically predicting which events are of interest. Furthermore, the software should allow data managers to provide feedback that the software can use to customize reports and alerts.
In other words, storage management software needs to perform predictive analytics.
Some examples of predictive analytics in other domains are:
1. Given query terms, predict which documents will be of interest (search engine querying).
2. Given a customer, their history of purchases, and the purchase histories of other customers, predict which products the customer might be interested in buying (recommendation engine, collaborative filtering).
3. Given a loan application, predict whether the applicant is an acceptable credit risk (credit evaluation).
4. Given the submission of a product order on a web store, predict whether the order is fraudulent (fraud detection).
5. Given data about corporations, along with historical stock market performance data, predict the price of a stock at some time in the future (time series analysis, regression).
6. Given an email message, along with a history of messages, predict whether the email message is spam (spam filtering).
Predictive analytics software typically builds a statistical model of the domain and applies machine-learning techniques to categorize or cluster items into interesting classes. Such techniques fall into two broad types: supervised learning algorithms and unsupervised clustering algorithms.
With supervised learning, the software categorizes data into two or more classes (e.g., “spam” and “not spam”), based on a training set of labeled historical data. For example, given a large number of emails labeled as “spam” and a large number of emails labeled as “not spam,” spam filtering software categorizes a newly arrived email by comparing it to the two sets of training examples and deciding, based on statistical or geometrical measures, which class (“spam” or “not spam”) is the likely correct label.
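The spam example can be sketched with a small naive Bayes classifier, one of several statistical methods that fit the description above. The training messages are invented for illustration, and the whitespace tokenizer and add-one smoothing are simplifying assumptions rather than what any particular spam filter does.

```python
import math
from collections import Counter


def train(labeled_messages):
    """Count words per class and messages per class from (text, label) pairs."""
    word_counts = {}
    class_counts = Counter()
    for text, label in labeled_messages:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, class_counts


def classify(text, model):
    """Return the label with the highest log-posterior under naive Bayes."""
    word_counts, class_counts = model
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    total_messages = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        # Start from the log prior of the class.
        score = math.log(class_counts[label] / total_messages)
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training set, for illustration only.
TRAINING = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "not spam"),
    ("please review the quarterly report", "not spam"),
]
model = train(TRAINING)
print(classify("claim your free prize", model))          # spam
print(classify("agenda for the review meeting", model))  # not spam
```

With only four training messages the model is a toy; the point is the shape of the workflow, in which labeled history goes in and a label for a new item comes out.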
With unsupervised algorithms, on the other hand, the aim is to automatically organize items into sets of similar items by discovering patterns in the data. Unsupervised clustering is often useful as a preliminary step for exploring data and for optimizing supervised learning.
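As an illustration of clustering, the sketch below runs plain k-means (Lloyd's algorithm) on made-up per-user activity figures. The two features (files read per day, megabytes written per day), the choice of k = 2, and the deterministic initialization from the first k points are assumptions made to keep the example short and repeatable.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means; centroids are initialized to the first k points."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return assignment, centroids


# Hypothetical per-user features: (files read per day, MB written per day).
USERS = [(5, 10), (6, 12), (4, 9), (200, 15), (210, 11), (190, 14)]
assignment, centroids = kmeans(USERS, k=2)
print(assignment)  # [0, 0, 0, 1, 1, 1]
```

No labels are supplied; the algorithm separates light readers from heavy readers purely from the shape of the data. In practice the features would usually be scaled first so that no single attribute dominates the distance.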
In the context of data management software, the feedback from data managers provides the training data with which the supervised algorithms can learn to improve their predictions. Unsupervised learning is useful for modeling typical usage patterns of files by users.
The key to successful predictions is threefold: (1) identify the correct predictive attributes; (2) model the data using an appropriate formalism (e.g., Bayesian networks, nearest neighbor search, or logistic regression); and (3) settle on a mechanism for feedback so that the software can learn over time.
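The three ingredients can be sketched together in a few lines. In this hedged example the predictive attributes are two hypothetical binary flags (off-hours access, sensitive path), the formalism is logistic regression, and the feedback mechanism is a single gradient step each time a data manager marks an alert as interesting or not. The class name, the attributes, and the learning rate are illustrative assumptions, not a description of any existing product.

```python
import math


class AlertModel:
    """Logistic regression trained one feedback label at a time (online SGD)."""

    def __init__(self, n_features, learning_rate=0.5):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, features):
        """Estimated probability that the event is of interest."""
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-z))

    def feedback(self, features, interesting):
        """Take one gradient step on the log-loss for a single labeled event."""
        error = self.predict(features) - (1.0 if interesting else 0.0)
        self.weights = [
            w - self.learning_rate * error * x
            for w, x in zip(self.weights, features)
        ]
        self.bias -= self.learning_rate * error


# Hypothetical attributes: (off_hours, sensitive_path), each 0 or 1.
# In this invented history the data manager cares about sensitive paths
# and dismisses alerts that are merely off-hours.
FEEDBACK = [
    ((1, 1), True),
    ((0, 1), True),
    ((1, 0), False),
    ((0, 0), False),
]

model = AlertModel(n_features=2)
for _ in range(100):
    for features, interesting in FEEDBACK:
        model.feedback(features, interesting)

print(model.predict((0, 1)) > 0.5)  # True: sensitive path ranks as interesting
print(model.predict((1, 0)) > 0.5)  # False: off-hours alone does not
```

Because each piece of feedback updates the weights immediately, alerts the data manager repeatedly dismisses lose rank over time without any explicit reconfiguration.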