Inexpensive storage and the ubiquity of digital systems have led a growing number of entities, including cities, federal governments, retailers, scientific organizations, NGOs, and even individuals, to amass huge databases approaching terabytes and petabytes in size. As a result, the need for data mining, i.e., extracting knowledge from data in the form of useful and interesting models and trends, has become more important than ever.
Keywords that describe the lab's current research in Data Mining are: Outlier Detection, Anomaly Detection, Distributed Datasets, Mixed Attribute Datasets, High-Dimensional Datasets, Frequent Itemset Mining, Non-Derivable Itemsets.
Outlier detection has attracted substantial attention in many applications and research areas; two of the most prominent applications are network intrusion detection and credit card fraud detection. Many existing approaches are based on calculating distances among the points in the dataset. These approaches cannot easily adapt to current datasets, which usually contain a mix of categorical and continuous attributes and may be distributed among different geographical locations. In addition, current datasets usually have a large number of dimensions. Such datasets tend to be sparse, and traditional concepts such as Euclidean distance or nearest neighbor become unsuitable. We propose a fast distributed outlier detection strategy intended for datasets containing mixed attributes. The proposed method takes the sparseness of the dataset into consideration, and is experimentally shown to be highly scalable with the number of points and the number of attributes in the dataset. Experimental results show that the proposed outlier detection method compares very favorably with other state-of-the-art outlier detection strategies proposed in the literature, and that the speedup achieved by its distributed version is very close to linear.
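The remark that Euclidean distance and nearest neighbor become unsuitable in high dimensions can be illustrated with a small experiment. The sketch below is not the proposed method; it only measures how the gap between the nearest and farthest point from a query shrinks as the number of dimensions grows. The uniformly random synthetic data, the function name `relative_contrast`, and the chosen sizes are all assumptions made for illustration.

```python
import math
import random


def relative_contrast(num_points, num_dims, rng):
    """Return (d_max - d_min) / d_min for distances from a random query point.

    Points are drawn uniformly from the unit hypercube (an assumption made
    purely for illustration; real datasets are not uniform).
    """
    query = [rng.random() for _ in range(num_dims)]
    distances = []
    for _ in range(num_points):
        point = [rng.random() for _ in range(num_dims)]
        distances.append(math.dist(query, point))
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
contrast_by_dims = {
    dims: relative_contrast(500, dims, rng) for dims in (2, 10, 100, 1000)
}
for dims, contrast in contrast_by_dims.items():
    print(f"{dims:>5} dimensions: relative contrast = {contrast:.2f}")
```

In low dimensions the farthest point is many times farther away than the nearest one, so "nearest neighbor" is a meaningful notion. As the dimensionality grows, the relative contrast falls toward zero, which is one reason purely distance-based outlier scores tend to lose discriminating power on such data.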
- Michael Georgiopoulos
The premier technical journal focused on the theory, techniques and practice for extracting information from large databases.
KDD is one of the premier conferences in the field of knowledge discovery in databases.