
Data Science Methodology (from Coursera)

metamong 2022. 3. 27.

- From problem to approach -

Q1. What is the problem that you are trying to solve?

Q2. How can you use data to answer the question?

 

- Working with the data -

Q3. What data do you need to answer the question?

Q4. Where is the data coming from (identify all sources) and how will you get it? 

Q5. Is the data that you collected representative of the problem to be solved?

Q6. What additional work is required to manipulate and work with the data?

 

- Deriving the answer -

Q7. In what way can the data be visualized to get to the answer that is required?

Q8. Does the model used really answer the initial question or does it need to be adjusted?

Q9. Can you put the model into practice?

Q10. Can you get constructive feedback into answering the question?


cf. Data Mining - CRISP-DM Methodology

* Intent: to take case-specific scenarios and general behaviors and make them domain-neutral

 

 

→ Business Understanding) This stage is the most important because it is where the intention of the project is outlined. It requires communication and clarity. The difficulty is that stakeholders have different objectives, biases, and ways of relating information; they do not all see the same things, or see them in the same manner. Without a clear, concise, and complete view of the project goals, resources will be needlessly expended.

→ Data Understanding) Data understanding relies on business understanding. Data is collected at this stage of the process. The understanding of what the business wants and needs will determine what data is collected, from what sources, and by what methods.

→ Data Preparation) Once the data has been collected, it must be transformed into a useable subset unless it is determined that more data is needed. Once a dataset is chosen, it must then be checked for questionable, missing, or ambiguous cases.

→ Modeling) Once prepared for use, the data must be expressed through appropriate models that give meaningful insights and, ideally, new knowledge. This is the purpose of data mining: to create knowledge that has meaning and utility. Models reveal patterns and structures within the data that provide insight into the features of interest. Models are selected on a portion of the data and adjusted if necessary. Model selection is both an art and a science.

→ Evaluation) The selected model must be tested. This is usually done by running the trained model on a pre-selected test set, which shows how effective the model is on data it treats as new. The results are used to determine the efficacy of the model and foreshadow its role in the next and final stage.
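A minimal sketch of the hold-out idea described in the Evaluation stage, using only the Python standard library. The toy data, the `fit_threshold` "model", and every function name here are assumptions made for illustration; they do not come from the course.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and set aside a fraction as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_threshold(train_rows):
    """Toy 'training': place the cut-off midway between the two class means."""
    def mean(values):
        return sum(values) / len(values)
    low = mean([x for x, label in train_rows if label == 0])
    high = mean([x for x, label in train_rows if label == 1])
    return (low + high) / 2

def accuracy(threshold, rows):
    """Share of rows whose predicted label (x >= threshold) matches the true label."""
    hits = sum(1 for x, label in rows if int(x >= threshold) == label)
    return hits / len(rows)

# Hypothetical data: one numeric feature, label 1 when the feature is 50 or more.
data = [(x, int(x >= 50)) for x in range(100)]
train_rows, test_rows = train_test_split(data)

threshold = fit_threshold(train_rows)           # the model only ever sees train_rows
test_accuracy = accuracy(threshold, test_rows)  # scored on rows it has not seen
print(f"threshold={threshold:.1f}, held-out accuracy={test_accuracy:.2f}")
```

The point of the split is that `test_rows` plays no part in fitting, so the score approximates how the model behaves on new data.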

→ Deployment) In the deployment step, the model is used on new data outside the scope of the original dataset and by new stakeholders. The new interactions at this phase might reveal new variables and new needs for the dataset and model.

※ The key point of this process is that it’s cyclical; therefore, even at the finish you are having another business understanding encounter to discuss the viability after deployment. The journey continues.

 

 

1) From Problem to Approach

# Business Understanding: What is the problem that you are trying to solve?

- Understanding the GOAL of the person who is asking the question

- Figure out the objectives that are in support of the goal


 

# Analytic Approach: How can you use data to answer the question?

- The correct approach depends on business requirements for the model

e.g)

- if the question is to determine probabilities of an action) a predictive model

- if the question is to show relationships) a descriptive model (where clusters of similar activities based on events and preferences are examined)

- if the question requires a yes/no answer) a classification model

→ categorical outcome & explicit 'decision path' showing conditions leading to high risk & likelihood of classified outcome, easy to understand & apply
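To make the "explicit decision path" idea concrete, here is a small sketch of a one-level decision tree (a decision stump) in plain Python. It is an illustration under assumptions, not the course's model: the features `age` and `prior_admissions`, the records, and the function names are all hypothetical.

```python
def predict(stump, features):
    """Follow the single decision path: 'yes' when the feature exceeds the threshold."""
    feature, threshold = stump
    return "yes" if features[feature] > threshold else "no"

def fit_stump(rows):
    """Try every (feature, threshold) pair and keep the one with the fewest errors."""
    best = None
    for feature in rows[0][0]:
        for threshold in sorted({features[feature] for features, _ in rows}):
            errors = sum(
                1 for features, label in rows
                if predict((feature, threshold), features) != label
            )
            if best is None or errors < best[0]:
                best = (errors, feature, threshold)
    return best[1], best[2]

# Hypothetical labelled records: (features, high-risk label).
rows = [
    ({"age": 45, "prior_admissions": 0}, "no"),
    ({"age": 72, "prior_admissions": 3}, "yes"),
    ({"age": 60, "prior_admissions": 1}, "no"),
    ({"age": 81, "prior_admissions": 2}, "yes"),
    ({"age": 55, "prior_admissions": 2}, "yes"),
    ({"age": 78, "prior_admissions": 0}, "no"),
]

stump = fit_stump(rows)
feature, threshold = stump
print(f"if {feature} > {threshold}: high risk = yes, else no")
print(predict(stump, {"age": 66, "prior_admissions": 4}))
```

The learned rule can be read aloud as a condition, which is what makes this kind of model easy to understand and apply.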

e.g) A data scientist determines that building a recommender system is the solution for a particular business problem at hand

2) From Requirements to Collection

# Data Requirements: What are Data Requirements?

- involves identifying the necessary data content, formats and sources for initial data collection

 

# Data Collection: What occurs during Data Collection?

- merging data

3) From Understanding to Preparation

# Data Understanding: encompasses all activities related to constructing the data set.

- Is the data that you collected representative of the problem to be solved?


 

# Data Preparation: 


- (1) define the variables to be used in the model / (2) determine the timing of events / (3) aggregate the data and merge them from different sources / (4) identify missing data


- can be accelerated through automation

- feature engineering = the process of using domain knowledge of the data to create features that make the machine learning algorithms work (critical when ML tools are being applied to analyze the data)
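The preparation steps listed above (aggregate, merge sources, identify missing data, engineer a feature) can be sketched in plain Python as follows. The two sources, the `patient_id` key, and the derived `avg_stay_days` feature are hypothetical and chosen only for illustration.

```python
from datetime import date

# Hypothetical source 1: one row per hospital admission.
admissions = [
    {"patient_id": 1, "admitted": date(2022, 1, 3), "discharged": date(2022, 1, 8)},
    {"patient_id": 1, "admitted": date(2022, 2, 1), "discharged": date(2022, 2, 4)},
    {"patient_id": 2, "admitted": date(2022, 1, 10), "discharged": date(2022, 1, 12)},
]
# Hypothetical source 2: demographics keyed by patient; None marks a missing value.
demographics = {1: {"age": 71}, 2: {"age": None}}

def prepare(admissions, demographics):
    """Build one row per patient from both sources."""
    table = {}
    # Aggregate: count admissions and total days in hospital per patient.
    for row in admissions:
        rec = table.setdefault(row["patient_id"], {"n_admissions": 0, "total_days": 0})
        rec["n_admissions"] += 1
        rec["total_days"] += (row["discharged"] - row["admitted"]).days
    for patient_id, rec in table.items():
        # Merge: attach the demographic fields from the second source.
        rec["age"] = demographics.get(patient_id, {}).get("age")
        # Identify missing data explicitly rather than silently dropping rows.
        rec["age_missing"] = rec["age"] is None
        # Feature engineering: a derived variable based on domain knowledge.
        rec["avg_stay_days"] = rec["total_days"] / rec["n_admissions"]
    return table

prepared = prepare(admissions, demographics)
for patient_id, rec in prepared.items():
    print(patient_id, rec)
```

In practice a data-frame library would usually do this work; the standard-library version is used here only to keep each step visible.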


4) From Modeling to Evaluation

# Data Modeling: In what way can the data be visualized to get to the answer that is required?

- focuses on developing models that are either descriptive or predictive

 

 

* descriptive model) if a person did this, then they're likely to prefer that

* predictive model) tries to yield yes/no, stop/go outcomes
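As a rough illustration of the descriptive case ("if a person did this, then they're likely to prefer that"), the sketch below measures how often one activity appears alongside another. The activity histories and the function name `confidence` are assumptions for this example, not material from the course.

```python
def confidence(histories, did, prefers):
    """Among people whose history contains `did`, the share that also contains `prefers`."""
    with_did = [history for history in histories if did in history]
    if not with_did:
        return 0.0
    return sum(1 for history in with_did if prefers in history) / len(with_did)

# Hypothetical activity histories, one set per person.
histories = [
    {"hiking", "camping"},
    {"hiking", "camping", "cycling"},
    {"hiking", "reading"},
    {"reading", "cooking"},
]

share = confidence(histories, "hiking", "camping")
print(f"{share:.2f} of people who went hiking also went camping")
```

A descriptive model of this kind summarizes relationships in existing data; it does not by itself produce a yes/no decision for a new case.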

 

→ The data scientist will use a training set for predictive modelling
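A small sketch of what "using a training set" can look like: the model's parameters are learned only from labelled training rows, then applied to a new case. A nearest-centroid classifier is used here as a simple stand-in, and the data and function names are hypothetical.

```python
def fit_centroids(training_set):
    """Learn one mean feature vector (centroid) per label from the training set."""
    sums, counts = {}, {}
    for features, label in training_set:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, features):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: distance(centroids[label]))

# Hypothetical training set: (features, known outcome).
training_set = [
    ((1.0, 1.0), "no"), ((2.0, 1.0), "no"),
    ((8.0, 9.0), "yes"), ((9.0, 8.0), "yes"),
]

centroids = fit_centroids(training_set)
print(centroids)
print(predict(centroids, (7.0, 7.0)))
```

Because the outcomes in the training set are already known, they act as a gauge for whether the model needs to be calibrated.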


 

# Data Evaluation: Does the model used really answer the initial question or does it need to be adjusted?

* includes ensuring that the data are properly handled and interpreted

 

(1) Diagnostic Measures Phase - to ensure the model is working as intended

 

(2) Statistical Significance Testing

e.g.) ROC curve (a useful diagnostic tool for determining the optimal classification model)
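For readers who want to see how a ROC curve is built, here is a minimal standard-library sketch. It sweeps a decision threshold over the model scores, records the false positive rate and true positive rate at each step, and sums the area under the curve with the trapezoidal rule. The scores and labels are made up for illustration.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    labels are 1 (positive) or 0 (negative); a row counts as predicted
    positive when its score is at or above the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical model scores and true labels.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
print(points)
print(f"AUC = {auc(points):.3f}")
```

When comparing candidate classification models, the one whose curve sits closer to the top-left corner (larger area) separates the classes better on this data.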

 

* modeling and evaluation are iterative processes

5) From Deployment to Feedback

# Data Deployment

- 

 

# Data Feedback

- 

 


* Source 1) ≪Data Science Methodology≫ by Coursera 🧡

* Source 2) www.educba.com/predictive-analytics-vs-descriptive-analytics/
