Larose, D. T. (2006) Data Mining Methods and Models, Hoboken: John Wiley & Sons, Inc. Morgan Kaufmann

The document summarizes the process of analyzing data from the 2011 Ethiopian Demographic and Health Survey (EDHS). The extracted data cover the period from 01.01.2012 to 31.12.2012 and comprise 11,654 records that met the inclusion criteria. The data were cleaned, coded, transformed, and analyzed in WEKA software to identify factors associated with children being alive or dead; the dataset was stratified on this outcome. The Synthetic Minority Oversampling Technique (SMOTE) was used to balance the two subgroups. The document also notes the importance of data preprocessing when mining pre-existing datasets, because such data may contain errors, outliers or missing values.


1. The extracted data cover the period from 01.01.2012 to 31.12.2012.

2. The EDHS 2011 dataset was used. The data were originally collected by Macro International, United States of America (USA), and CSA Ethiopia.
3. A total of 11,654 records that met the inclusion criteria were retrieved from the EDHS 2011 children’s dataset. The extracted data were cleaned, coded, transformed and entered into Waikato Environment for Knowledge Analysis (WEKA) 3.6.4 software. The extracted dataset was stratified into “Alive” and “Dead” groups. The “Alive” group comprised mothers whose child was alive during the survey. The “Dead” group comprised mothers who had one or more children who had died. Because the sample sizes of the “Alive” and “Dead” subgroups were not balanced, the Synthetic Minority Oversampling Technique (SMOTE) was applied to balance the dataset and minimize sampling errors. Pruning techniques were used to remove insignificant rules. Ten-fold cross-validation and a 95% split were used to assess the strength of the association of determinants with the outcome variable.
4. Most of the data sets used in data mining were not
necessarily gathered with a specific goal in mind. Some of
them may contain errors, outliers or missing values. In order
to use those data sets in the data mining process, the data
needs to undergo preprocessing, using data cleaning,
discretization and data transformation [9]. It has been
estimated that data preparation alone accounts for 60% of all
the time and effort expended in the entire data mining process
[10].
5. [9] Larose, D. T. (2006) Data Mining Methods and Models, Hoboken: John Wiley & Sons, Inc.
6. [10] Pyle, D. (1999) Data Preparation for Data Mining, San Francisco: Morgan Kaufmann.
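
The oversampling step mentioned in item 3 can be illustrated with a small sketch. This is not the WEKA filter used in the study, only a minimal Python version of the SMOTE idea: each synthetic record is placed at a random point on the line between a minority-class record and one of its nearest minority-class neighbours. The function name `smote`, its parameters, and the toy "Alive"/"Dead" values are assumptions made for illustration; the sketch handles numeric features only and does not use any EDHS data.

```python
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority-class
    records and their nearest minority-class neighbours (numeric features only)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority records to interpolate")
    rng = random.Random(seed)      # fixed seed so the sketch is repeatable
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other records
    synthetic = []
    for t in range(n_synthetic):
        i = t % len(minority)      # cycle through the minority records in turn
        base = minority[i]
        # Rank the other minority records by squared Euclidean distance to base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])  # one of the k nearest neighbours
        gap = rng.random()                  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: 20 "Alive" records and 4 "Dead" records,
# each with two made-up numeric features.
alive = [[float(i), float(i % 7)] for i in range(20)]
dead = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.5], [1.5, 2.0]]

# Generate enough synthetic "Dead" records to match the "Alive" group size.
new_dead = smote(dead, len(alive) - len(dead), k=3)
balanced_dead = dead + new_dead
print(len(alive), len(balanced_dead))  # prints: 20 20
```

Because every synthetic record lies between two existing minority records, the new points stay inside the region the minority class already occupies. If evaluation follows, it is usually safer to oversample only the training portion of each fold so that synthetic records do not leak into the test data.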
