Statoo Consulting
Statistical Consulting + Data Analysis + Data Mining Services
Switzerland


What is Data Mining (Data Science)?

`We are drowning in information but starved for knowledge.'
John Naisbitt



Data mining (now rebranded as data science) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns or structures or models or trends or relationships in data to enable data-driven decision making.

What is meant by these terms?
  • `Non-trivial': it is not a straightforward computation of predefined quantities like computing the average value of a set of numbers.
  • `Valid': the patterns hold in general, i.e. they remain valid on new data in the face of uncertainty.
  • `Novel': the patterns were not known beforehand.
  • `Potentially useful': the patterns lead to some benefit to the user.
  • `Understandable': the patterns are interpretable and comprehensible - if not immediately then after some postprocessing.
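
To make the `valid' criterion concrete, here is a minimal sketch, not a prescribed method: a pattern (a straight-line relationship) is identified on one part of the data and then checked on data that were held out. The simulated data, the 150/50 split and the helper names `fit_line` and `mean_squared_error` are assumptions made purely for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_squared_error(intercept, slope, xs, ys):
    """Average squared prediction error of the fitted line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Simulated (hypothetical) data: a linear pattern plus noise.
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

# Identify the pattern on the first 150 observations only.
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]
intercept, slope = fit_line(train_x, train_y)

# Check it on the 50 held-out observations, against a `no pattern' baseline
# that always predicts the training mean.
train_mse = mean_squared_error(intercept, slope, train_x, train_y)
test_mse = mean_squared_error(intercept, slope, test_x, test_y)
baseline_mse = mean_squared_error(sum(train_y) / len(train_y), 0.0, test_x, test_y)

print(f"train MSE {train_mse:.2f}, test MSE {test_mse:.2f}, baseline MSE {baseline_mse:.2f}")
```

If the error on the held-out data is close to the training error and clearly below the baseline, the pattern has at least passed a first check of validity on new data.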

Is data mining (data science) `statistical déjà vu'?

Statistics is the science of learning from data (or making sense out of data), and of measuring, controlling and communicating uncertainty. To learn more about what statistics is, see our page `What is Statistics?'.

Like statistical thinking and statistics, data mining (data science) is not only modelling and prediction, nor a product that can be bought, but a whole iterative problem solving cycle/process that must be mastered through interdisciplinary and transdisciplinary team effort.

Data mining (data science) projects are not simple. They usually start with high expectations but may end in failure if the engaged team is not guided by a clear methodological framework. We follow a methodology called CRISP-DM (`CRoss Industry Standard Process for Data Mining'). If you want to know more about CRISP-DM, please click here.

`Coming together is a beginning. Keeping together is progress. Working together is success.'
Henry Ford


What distinguishes data mining (data science) from statistics?

Statistics traditionally is concerned with analysing primary (e.g. experimental) data that have been collected to explain and check the validity of specific existing ideas (hypotheses). As such, statistics is `primary data analysis', top-down (explanatory and confirmatory) analysis or `idea (hypothesis) evaluation or testing'.

Data mining (data science), on the other hand, typically is concerned with analysing secondary (e.g. observational or `found') data that have been collected for other reasons (and not `under control' of the investigator). These data are used to create new ideas (hypotheses). As such, data mining (data science) is `secondary data analysis', bottom-up (exploratory and predictive) analysis, `idea (hypothesis) generation' (or `knowledge discovery').

The two approaches of `learning from data' or `turning data into knowledge' are complementary and should proceed side by side - in order to enable proper data-driven decision making.
  • The information obtained from a bottom-up analysis, which identifies important relations and tendencies, cannot explain why these discoveries are useful and to what extent they are valid. The confirmatory tools of top-down analysis need to be used to confirm the discoveries and evaluate the quality of decisions based on those discoveries.
  • Performing a top-down analysis, we think up possible explanations for the observed behaviour and let those hypotheses dictate the data to be analysed. Then, performing a bottom-up analysis, we let the data suggest new hypotheses (ideas) to test.
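
The interplay of the two approaches can be illustrated with a small sketch under stated assumptions. A bottom-up step scans `found' data for the candidate factor most strongly correlated with the output, and a top-down step then tests that single hypothesis on newly collected data. The simulated process (in which only factor 2 matters), the use of a correlation scan and the permutation test are illustrative choices, not a description of a particular project.

```python
import random


def corr(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(xs, ys, rng, n_perm=999):
    """Two-sided permutation test of `no association between xs and ys'."""
    observed = abs(corr(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(corr(xs, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def simulate(n, n_factors, rng):
    """Hypothetical process: only factor 2 influences the output."""
    X = [[rng.gauss(0, 1) for _ in range(n_factors)] for _ in range(n)]
    y = [row[2] + rng.gauss(0, 1) for row in X]
    return X, y


rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Bottom-up: let the `found' data suggest a hypothesis.
X_found, y_found = simulate(100, 10, rng)
scores = [abs(corr([row[j] for row in X_found], y_found)) for j in range(10)]
candidate = max(range(10), key=scores.__getitem__)

# Top-down: test that one hypothesis on newly collected data.
X_new, y_new = simulate(100, 10, rng)
p_value = permutation_p_value([row[candidate] for row in X_new], y_new, rng)

print(f"suggested factor: {candidate}, confirmatory p-value: {p_value:.3f}")
```

Testing the suggested factor on fresh data matters because the factor was selected as the best of many candidates; a test on the same data that suggested it would be too optimistic.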

We have already applied this complementary view successfully in several client projects.

For example, when historical data were available, the idea to be generated from a bottom-up analysis (e.g. using a mixture of so-called `ensemble techniques') was: `which factors (among a `large' list of candidate factors) are the most important, from a predictive point of view, for a given process output (or a given KPI, `Key Performance Indicator')?'. Combined with subject-matter knowledge, this idea resulted in a list of a `small' number of factors (i.e. `the critical ones'). The confirmatory tools of top-down analysis (statistical `Design Of Experiments', DOE, in most of the cases) were then used to confirm and evaluate the idea. In doing so, new data are collected (about `all' factors) and a bottom-up analysis can be applied again - letting the data suggest new ideas to test.
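
The two steps of this example can be sketched as follows. The text above does not say which ensemble techniques were used, so the sketch swaps in a simple boosted ensemble of decision stumps as a stand-in, ranks six hypothetical candidate factors by their accumulated squared-error gain, and then lays out a two-level full factorial design for the two shortlisted factors. The data, the factor count, the thresholds and all function names are assumptions for illustration only.

```python
import random
from itertools import product


def best_stump(X, resid, thresholds):
    """Find the single split (factor, threshold) with the largest squared-error gain."""
    n = len(resid)
    total = sum(resid)
    best = None
    for j in range(len(X[0])):
        for t in thresholds:
            left = [resid[i] for i in range(n) if X[i][j] <= t]
            if not left or len(left) == n:
                continue
            sum_l, n_l = sum(left), len(left)
            sum_r, n_r = total - sum_l, n - n_l
            # Reduction in squared error compared with predicting one overall mean.
            gain = sum_l ** 2 / n_l + sum_r ** 2 / n_r - total ** 2 / n
            if best is None or gain > best[0]:
                best = (gain, j, t, sum_l / n_l, sum_r / n_r)
    return best


def screen_factors(X, y, rounds=50, rate=0.3, thresholds=(-0.5, 0.0, 0.5)):
    """Boost stumps on the residuals; return each factor's share of the total gain."""
    n = len(y)
    mean_y = sum(y) / n
    resid = [v - mean_y for v in y]
    importance = [0.0] * len(X[0])
    for _ in range(rounds):
        stump = best_stump(X, resid, thresholds)
        if stump is None:
            break
        gain, j, t, left_mean, right_mean = stump
        importance[j] += gain
        for i in range(n):
            resid[i] -= rate * (left_mean if X[i][j] <= t else right_mean)
    total = sum(importance) or 1.0
    return [v / total for v in importance]


def full_factorial(factors):
    """Two-level full factorial design in coded units (-1, +1)."""
    return [dict(zip(factors, levels)) for levels in product((-1, 1), repeat=len(factors))]


rng = random.Random(2)  # fixed seed so the sketch is reproducible

# Hypothetical historical data: six candidate factors, only 0 and 3 drive the output.
X = [[rng.uniform(-1, 1) for _ in range(6)] for _ in range(300)]
y = [3.0 * row[0] - 2.0 * row[3] + rng.gauss(0, 0.5) for row in X]

# Bottom-up: rank the candidate factors and keep a `small' shortlist.
importance = screen_factors(X, y)
shortlist = sorted(range(6), key=lambda j: importance[j], reverse=True)[:2]

# Top-down: plan a designed experiment on the shortlisted factors.
design = full_factorial(shortlist)
for run in design:
    print(run)
```

In practice the shortlist would also be reviewed with subject-matter knowledge before any experiment is run, and the choice of design (full factorial, fractional factorial or other) depends on the number of factors and the cost per run.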



Want to know more about the relation between data mining (data science) and statistics? Check out some additional papers in our `Publications' section.


Interested in our data mining (data science) services? Are you drowning in uncertainty and starving for knowledge? Interested in getting Statooed? Have a question about our data mining (data science) services? Contact us so that we can help you.


© 2001-2016 by Statoo Consulting, Switzerland. All rights reserved.
Statoo is a registered trademark of Statoo Consulting.
Last updated on July 20, 2016.
www.statoo.com/en/datamining/