Importance Of Data Integration In Data Mining

Data integration is the merging of data from multiple sources or data stores to provide a unified view of the data. Data mining often requires data integration because it helps to reduce and avoid inconsistencies and redundancies within a dataset, which in turn improves the accuracy and speed of the subsequent data mining process. [4]

3.1 Tuple duplication
As well as detecting redundancies between attributes, duplication should also be detected at the tuple level. Discrepancies often arise between duplicates because of inaccurate data entry, or because some but not all occurrences of the data were updated.

3.2 Entity identification problem
There are a variety of concerns to consider when integrating data. The…
Inconsistencies in naming can also cause redundancies in the resulting data set. Some of these redundancies can be discovered through correlation analysis, which measures how strongly one attribute implies another based on the available data. For nominal data, the chi-square (χ²) test can be used. For numeric attributes, the correlation coefficient and covariance can be used, both of which assess how one attribute's values vary with those of the other.
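The measures mentioned above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a reference implementation: the function names, the hotel price lists, and the contingency table are all hypothetical, and the covariance shown is the population form (dividing by n).

```python
from math import sqrt


def covariance(xs, ys):
    """Population covariance of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def correlation(xs, ys):
    """Pearson correlation coefficient: covariance scaled by both standard deviations."""
    std_x = sqrt(covariance(xs, xs))
    std_y = sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)


def chi_square(observed):
    """Pearson chi-square statistic for a contingency table given as rows of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count if the two nominal attributes were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (count - expected) ** 2 / expected
    return stat


# Hypothetical numeric attributes: the same room prices stored in two currencies.
price_eur = [80.0, 95.0, 110.0, 150.0, 200.0]
price_usd = [88.0, 104.5, 121.0, 165.0, 220.0]  # an exact linear copy of price_eur
print(correlation(price_eur, price_usd))  # very close to 1.0, so one attribute is redundant

# Hypothetical nominal attributes: counts for two categories of each attribute.
table = [[250, 200], [50, 1000]]
print(chi_square(table))  # a large statistic suggests the attributes are related
```

A correlation near 1 or -1 between two numeric attributes, or a large chi-square statistic between two nominal ones, suggests that one attribute may be dropped during integration. Judging whether a chi-square value is significant also requires comparing it with the critical value for the table's degrees of freedom, which this sketch leaves out.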
This can come down to differences in scaling, representation or encoding. For instance, in a hotel chain, room prices may vary from city to city and may also be recorded in different currencies and under different tax rules.

4 Data Transformation
Data transformation is the process by which data is consolidated into forms that are suitable for data mining. A data transformation converts a dataset from the data format of the source system to the format of the destination system. There are multiple strategies involved in data transformation, such as normalisation, aggregation and smoothing.

4.1 Smoothing
Smoothing works to remove noise from a dataset. Techniques include regression, binning, and clustering. Regression smooths data by conforming the data values to a function. Linear regression consists of finding the “best” line to fit two attributes so that one attribute can be used to predict the other. Binning is a pre-processing technique used to reduce the effect of minor observation errors. Binning methods smooth a sorted data value by consulting the values that surround it. The sorted values are distributed into a number of bins, where local smoothing is…
