Journal for Studies in Management and Planning

Available at http://edupediapublications.org/journals/index.php/JSMaP/

ISSN: 2395-0463

Volume 04 Issue 04

April 2018

Available online: http://edupediapublications.org/journals/index.php/JSMaP/ P a g e | 114

The KDD Process for Extracting Useful

Knowledge from Volumes of Data

Mining

Pardeep Nehra

Department of Computer Science

E-mail: par.nehra82@yahoo.com

Abstract: The aim of this research work is to discover exceptions by using the rough set approach and to represent them in the form of rule pairs, a knowledge structure that consists of a commonsense rule and an exception rule. Knowledge structures are compact representations of rules and increase comprehensibility. Data mining refers to extracting or mining knowledge from large amounts of data. The overall process of extracting useful information is referred to as Knowledge Discovery in Databases (KDD). Data mining is a particular step in this process: the application of specific algorithms for extracting patterns (models) from data. Mining exceptions is attracting the attention of researchers because exceptions challenge existing knowledge, lead to the growth of knowledge in new directions, and help decision makers make the right decisions even in rare circumstances.

Keywords: KDD, Data Mining, NN, Rough Set, Fuzzy Set.

Introduction:

The amount of data available from various sources continues to grow fast. The large amount of data stored in databases contains valuable hidden knowledge that could be used to improve the decision-making process of an organization. For instance, data about previous sales might contain interesting relationships between products and customers, and the discovery of such relationships can be very useful for increasing the sales of a company. There is therefore a clear need for semiautomatic methods for extracting knowledge from data. This need has led to the emergence of a field called data mining and knowledge discovery. The goal of KDD (Knowledge Discovery in Databases) is to identify valid, novel, potentially useful and ultimately understandable patterns in data. Data mining is the stage of the KDD process that applies an algorithm to extract interesting patterns. Rough set theory is one of the popular theories in the field of data mining. It proposes a formal framework for the transformation of data into knowledge. Rough set theory is relatively simple and comes in handy for dealing with the vagueness and uncertainty that are inherent in decision-making situations. Data mining extracts patterns or rules. Exceptions are deviations from the commonsense rules. They are interesting because they are unexpected and contradict prior knowledge about the domain.
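
To make the idea of a rule pair concrete, the following minimal Python sketch pairs a commonsense rule with an exception rule that adds a condition and contradicts the conclusion. The `Rule` and `RulePair` classes and the bird example are hypothetical illustrations, not taken from this paper, and they do not show how such pairs are discovered.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """IF all conditions hold THEN conclusion (hypothetical structure)."""
    conditions: dict  # attribute name -> required value
    conclusion: str

    def matches(self, record: dict) -> bool:
        return all(record.get(k) == v for k, v in self.conditions.items())


@dataclass
class RulePair:
    """A commonsense rule together with an exception that overrides it."""
    commonsense: Rule
    exception: Rule

    def predict(self, record: dict):
        # The exception is more specific, so it is checked first.
        if self.exception.matches(record):
            return self.exception.conclusion
        if self.commonsense.matches(record):
            return self.commonsense.conclusion
        return None  # neither rule applies


# Assumed example: birds fly, except penguins.
pair = RulePair(
    commonsense=Rule({"bird": True}, "flies"),
    exception=Rule({"bird": True, "penguin": True}, "does not fly"),
)
```

Checking the exception before the commonsense rule is one simple design choice; it reflects the fact that the exception covers a rarer, more specific case.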

Knowledge Discovery in Databases Process:

Knowledge Discovery in Databases (KDD) is the process of finding useful information and patterns in data. It has been defined as “the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data”. The KDD process consists of the following steps:

1) Data cleaning and integration:

The data to be used by the process may contain incorrect or missing values; that is, the data may be noisy or inconsistent. In the data cleaning step, erroneous data may be corrected or removed, and tuples with missing values may either be deleted or have the missing values filled in with the average of the other values. When there are multiple sources of data, the data from the different sources can be combined in the data integration step.

2) Selection and Transformation:

Data relevant to the analysis task are retrieved from the databases in the data selection step. Data from different sources must then be transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations. Data reduction may be used to reduce the number of possible data values being considered.

3) Data mining:

In this step, algorithms are applied to extract interesting and useful information and patterns from large databases to support decision making.

4) Pattern evaluation:

Not all of the patterns that are generated are of interest; only some of them are actually interesting. In this step, the truly interesting patterns are identified on the basis of various interestingness measures.

5) Knowledge presentation:

This step describes how the data mining results are presented to the users. It is an extremely important step because the usefulness of the results depends on it. It includes an important activity known as post-processing, which makes the results obtained from data mining easy for the user to understand. Various visualization and knowledge representation techniques are used at this step.
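
The five steps above can be sketched end to end on a toy data set. The following Python example is only an illustration under stated assumptions: the sales records, the function names, the choice of item pairs as the pattern type, and the use of support and confidence as interestingness measures with the thresholds shown are all hypothetical and are not prescribed by this paper.

```python
from collections import Counter
from itertools import combinations
from statistics import mean

# Hypothetical sales records from two sources; None marks a missing value.
store_a = [
    {"customer": "c1", "items": ["bread", "milk"], "amount": 12.0},
    {"customer": "c2", "items": ["bread", "butter"], "amount": None},
    {"customer": "c3", "items": ["milk"], "amount": -5.0},  # erroneous value
]
store_b = [
    {"customer": "c4", "items": ["bread", "milk", "butter"], "amount": 20.0},
    {"customer": "c5", "items": ["bread", "milk"], "amount": 10.0},
]


def integrate(*sources):
    """Step 1 (integration): combine records from several sources."""
    return [dict(r) for src in sources for r in src]


def clean(records):
    """Step 1 (cleaning): drop erroneous rows, fill missing amounts with the mean."""
    valid = [r for r in records if r["amount"] is None or r["amount"] >= 0]
    fill = mean(r["amount"] for r in valid if r["amount"] is not None)
    return [{**r, "amount": fill if r["amount"] is None else r["amount"]}
            for r in valid]


def select_and_transform(records):
    """Step 2: keep only the item lists and turn them into sets."""
    return [frozenset(r["items"]) for r in records]


def mine(transactions):
    """Step 3: count how often each item and each item pair occurs."""
    counts = Counter()
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
        for pair in combinations(sorted(t), 2):
            counts[frozenset(pair)] += 1
    return counts


def evaluate(counts, n, min_support=0.5, min_confidence=0.7):
    """Step 4: keep rules x -> y that pass the assumed thresholds."""
    rules = []
    for itemset, c in counts.items():
        if len(itemset) != 2 or c / n < min_support:
            continue
        for x in itemset:
            (y,) = itemset - {x}
            confidence = c / counts[frozenset([x])]
            if confidence >= min_confidence:
                rules.append((x, y, c / n, confidence))
    return sorted(rules)


records = clean(integrate(store_a, store_b))
transactions = select_and_transform(records)
rules = evaluate(mine(transactions), len(transactions))

# Step 5 (presentation): report the surviving rules in readable form.
for x, y, support, confidence in rules:
    print(f"{x} -> {y} (support={support:.2f}, confidence={confidence:.2f})")
```

On this toy input the erroneous record is dropped, the missing amount is replaced by the mean of the known amounts, and only three of the candidate rules pass both thresholds.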

Classification Models:

Classification is the process of assigning the data items of a database to classes. Various types of classification models are used for this purpose. They can be divided into two categories: evolutionary and non-evolutionary approaches. Classification models based on evolutionary approaches consist of genetic algorithms, while those based on non-evolutionary approaches consist of decision trees, neural networks, rough sets, fuzzy sets and statistical techniques.

Decision Trees:

A decision tree is a flowchart-like tree structure in which each internal node denotes a test on an attribute and each branch represents an outcome of that test. Leaf nodes represent classes. The figure shows how a decision tree is used to classify an organization's employees according to their heights.

Figure: Decision Tree

In order to classify an unknown sample, the attribute values of the sample are tested against the decision tree. A path is traced from the root to a leaf node that holds the class prediction for that sample. Decision trees can easily be converted to classification rules. Important decision tree algorithms are C5.0, CHAID and QUEST.
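
The classification procedure and the conversion to rules can be sketched as follows. This is a minimal Python illustration, not an implementation of C5.0, CHAID or QUEST: the tree is written by hand rather than learned, and the height thresholds (1.6 m and 1.8 m) and the class labels are assumptions chosen only for the example.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    label: str  # class held by the leaf node


@dataclass
class Node:
    attribute: str                     # attribute tested at this node
    threshold: float                   # test: value < threshold
    below: Union["Node", Leaf]         # branch taken when the test is true
    at_or_above: Union["Node", Leaf]   # branch taken when the test is false


# Hand-built tree with assumed thresholds (in metres).
tree = Node("height", 1.6,
            Leaf("short"),
            Node("height", 1.8, Leaf("medium"), Leaf("tall")))


def classify(node, sample):
    """Trace a path from the root to a leaf and return its class."""
    while isinstance(node, Node):
        if sample[node.attribute] < node.threshold:
            node = node.below
        else:
            node = node.at_or_above
    return node.label


def to_rules(node, path=()):
    """Convert every root-to-leaf path into one IF-THEN rule."""
    if isinstance(node, Leaf):
        condition = " AND ".join(path) or "TRUE"
        return [f"IF {condition} THEN class = {node.label}"]
    low = f"{node.attribute} < {node.threshold}"
    high = f"{node.attribute} >= {node.threshold}"
    return (to_rules(node.below, path + (low,))
            + to_rules(node.at_or_above, path + (high,)))


employee_class = classify(tree, {"height": 1.7})
rule_list = to_rules(tree)
```

Each leaf yields exactly one rule, so the three-leaf tree above produces three classification rules.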

Neural Network (NN):

An NN is an information processing system that consists of a graph representing the processing system as well as various algorithms that access that graph. An NN is also a predictive model. A neural network is a directed graph whose nodes are processing elements connected by arcs. The nodes of a neural network are organized into input, hidden and output layers. To perform a data mining task, a sample tuple is fed in through the input nodes, and the output nodes determine the prediction. The hidden layer carries the learning mechanism. Each link is assigned a weight, and a learning process such as back propagation adjusts these weights so that the prediction becomes accurate. The working of a simple NN can be described as follows. Suppose a tuple contains two attributes, age and income. These two attributes become the inputs to the processing elements and, after processing, the NN in the diagram predicts the output, that is, whether a customer is a defaulter or not. Back propagation is an important neural network algorithm.
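
A minimal sketch of such a network is given below in plain Python. It assumes two inputs (age and income, both scaled to the range 0 to 1), one hidden layer with two nodes, a single output node, sigmoid activations, a squared-error loss and no bias terms. The initial weights, the learning rate and the sample tuple are hypothetical values chosen for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    """Propagate an input tuple through the hidden layer to the output node."""
    h = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    y = sigmoid(sum(w * hi for w, hi in zip(w_out, h)))
    return h, y


def backprop_step(x, target, w_hidden, w_out, lr=0.5):
    """One back propagation update for the error E = 0.5 * (y - target)**2."""
    h, y = forward(x, w_hidden, w_out)
    # Error signal at the output node (derivative of E w.r.t. its net input).
    delta_out = (y - target) * y * (1 - y)
    # Error signals at the hidden nodes, propagated back through w_out.
    delta_hidden = [delta_out * w_out[j] * h[j] * (1 - h[j])
                    for j in range(len(h))]
    new_w_out = [w_out[j] - lr * delta_out * h[j] for j in range(len(h))]
    new_w_hidden = [[w_hidden[j][i] - lr * delta_hidden[j] * x[i]
                     for i in range(len(x))]
                    for j in range(len(h))]
    return new_w_hidden, new_w_out


# Hypothetical customer: scaled age and income, known to be a defaulter (1).
x, target = [0.3, 0.2], 1.0
w_hidden = [[0.5, -0.4], [0.3, 0.8]]  # assumed initial weights
w_out = [0.6, -0.2]

_, y_before = forward(x, w_hidden, w_out)
for _ in range(100):
    w_hidden, w_out = backprop_step(x, target, w_hidden, w_out)
_, y_after = forward(x, w_hidden, w_out)
```

After the updates, `y_after` lies closer to the target than `y_before`, which is the sense in which adjusting the weights makes the prediction more accurate. A practical network would also use bias terms and train on many tuples rather than one.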

Rough Set:

Rough set theory can be used for classification to discover structural relationships within imprecise or noisy data. It applies to discrete-valued attributes; continuous-valued attributes must therefore be discretized prior to its use. Rough set theory is based on the establishment of equivalence classes within the given training data. All of the data samples forming an equivalence class are indiscernible, that is, the samples are identical with respect to the attributes describing the data. Given real-world data, it is common that some classes cannot be distinguished in terms of the available attributes. Rough sets can be used to approximately or