Data Mining Assignment 3


Dhivya Balasubramanian

University of the Cumberlands

Date 3/28/2019


Data analytics has two divisions: data mining and data analysis. The two are often confused because their features overlap, but they have quite different properties. Data mining uncovers hidden patterns in a large dataset, whereas data analysis tests hypotheses and verifies statistical models against a dataset. Data mining can even be classified as one activity within analysis, because data analysis covers collecting the data, treating, cleaning, and preparing it, and finally creating visualizations from it. One main difference is that data mining works on structured datasets, while data analysis applies to both structured and unstructured datasets (“What Are The Differences Between Data Analytics and Data Mining?”, 2018).


Statistical analysis is a part of data analysis. In business intelligence, analysis covers collecting data and exploring the dataset. Statistical analysis has five steps: describe, explore, create a model, validate, and deploy. The analysis helps identify trends for the business.

Hypothesis Testing

A hypothesis is used to test a link between variables in a dataset. Usually a hypothesis is used to predict findings and to judge whether the data support it (“Wolfram MathWorld”, n.d.). The visualization in Figure 1 is for hypothesis testing; I used a linear regression model to find the outcomes. Two measures, Profit and COGS, are plotted in a scatter chart broken down by the Product and Month dimensions. Linear regression draws a line through the data points, and the best fit is the line that minimizes the squared vertical distances between the points and the line. In other words, linear regression helps find the relationship between the two measures for each product. Six trend lines, one per product, are used for the hypothesis testing. The model formula is Product*(ln(Profit) + intercept), and the resulting p-value is < 0.0001 with an R-squared of 0.617396 (Figure 2), so the model is statistically significant and explains about 62% of the variance in the log-transformed response.
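The log-log trend line described above can be sketched in a few lines of Python. This is only an illustration of an ordinary least-squares fit on log-transformed values, not a reproduction of Tableau's internals; the function name `fit_log_log` and the sample numbers are assumptions made for the example and do not come from the dataset.

```python
import math

def fit_log_log(profit, cogs):
    """Least-squares fit of ln(cogs) = slope * ln(profit) + intercept.

    Returns (slope, intercept, r_squared). All inputs must be positive,
    because the natural log is undefined for zero or negative values.
    """
    xs = [math.log(p) for p in profit]
    ys = [math.log(c) for c in cogs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products around the means
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared: share of the variance in ln(cogs) explained by the line
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared

# Hypothetical monthly values for one product (not from the assignment data)
sample_profit = [1200.0, 1500.0, 900.0, 2100.0, 1800.0, 2500.0]
sample_cogs = [5200.0, 6100.0, 4300.0, 9000.0, 7400.0, 10500.0]
slope, intercept, r_squared = fit_log_log(sample_profit, sample_cogs)
print(f"slope={slope:.4f} intercept={intercept:.4f} R^2={r_squared:.4f}")
```

Fitting one such line per product would give the six trend lines; an R-squared near 1 means the points sit close to the line.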

From the visualization it is evident that the Paseo product has the highest profit and the most significant trend line (the lowest p-value) among the products. Below are the details of the Paseo regression model:

P-value: 0.0006748
Equation: ln(Cost over gross sale) = 1.01648*ln(Profit) + 1.48378
Term         Value      StdErr     t-value    p-value
ln(Profit)   1.01648    0.209731   4.84661    0.0006748
intercept    1.48378    2.62999    0.564176   0.585065

The least significant trend line belongs to the Velo product; its regression details are:

P-value: 0.0408779
Equation: ln(Cost over gross sale) = 0.851062*ln(Profit) + 3.36857
Term         Value      StdErr     t-value    p-value
ln(Profit)   0.851062   0.35681    2.3852     0.0408779
intercept    3.36857    4.11629    0.818351   0.434288
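As a quick check on the two regression outputs, each t-value is the coefficient divided by its standard error, and the fitted equation can be turned back into a prediction by exponentiating. The sketch below uses only the numbers reported above; the helper names `t_value` and `predict_response` are made up for this illustration.

```python
import math

# (value, standard error, reported t-value) copied from the two outputs above
paseo = {"ln(Profit)": (1.01648, 0.209731, 4.84661),
         "intercept": (1.48378, 2.62999, 0.564176)}
velo = {"ln(Profit)": (0.851062, 0.35681, 2.3852),
        "intercept": (3.36857, 4.11629, 0.818351)}

def t_value(value, std_err):
    """t statistic for a coefficient: estimate divided by its standard error."""
    return value / std_err

def predict_response(profit, slope, intercept):
    """Invert ln(response) = slope * ln(profit) + intercept to get the response."""
    return math.exp(slope * math.log(profit) + intercept)

for name, terms in (("Paseo", paseo), ("Velo", velo)):
    for term, (value, std_err, reported_t) in terms.items():
        print(name, term, round(t_value(value, std_err), 4), "reported:", reported_t)

# Example prediction for a hypothetical profit of 10,000 on the Paseo line
print(predict_response(10000.0, paseo["ln(Profit)"][0], paseo["intercept"][0]))
```

A slope t-value far from zero (4.85 for Paseo against 2.39 for Velo) is what produces the smaller p-value for Paseo.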


Anomalies, also called outliers, are data points that lie outside the normal range. From our dataset I chose Month and Sales for detecting anomalies. I added Sales to the rows twice, once as the sum of sales and once as the standard deviation of sales, with months in the columns. To find the outliers, the formula below is used in a Tableau calculated field (Figure 4):

IF SUM([Sales]) < …
THEN "Bad Anomaly"
ELSEIF SUM([Sales]) > …
THEN "Good Anomaly"
ELSE "Expected"
END

After adding the anomaly formula to the worksheet in Tableau, we find that only November and December are good anomalies, with average sales of more than 14,947,804 (Figure 3).
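The same three-way labeling can be sketched in Python. The two thresholds in the calculated field are not recoverable from the text, so this sketch assumes a common choice: the mean of the monthly sums minus or plus one sample standard deviation. That assumption, the function name `label_anomalies`, and the sample figures are all illustrative and not taken from the worksheet.

```python
import statistics

def label_anomalies(monthly_sales, k=1.0):
    """Label each monthly total relative to mean +/- k sample standard deviations.

    ASSUMPTION: the band (mean - k*std, mean + k*std) stands in for the
    thresholds that are missing from the original formula.
    """
    mean = statistics.mean(monthly_sales)
    spread = statistics.stdev(monthly_sales)  # sample standard deviation
    lower = mean - k * spread
    upper = mean + k * spread
    labels = []
    for sales in monthly_sales:
        if sales < lower:
            labels.append("Bad Anomaly")
        elif sales > upper:
            labels.append("Good Anomaly")
        else:
            labels.append("Expected")
    return labels

# Hypothetical monthly totals, Jan..Dec, with a year-end spike
hypothetical_sales = [100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 200, 200]
print(label_anomalies(hypothetical_sales))
```

With these made-up numbers only the last two months fall above the upper threshold, which mirrors the pattern reported for November and December.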


Cluster analysis is a method for separating data into groups. It is also known as classification analysis. In this type of analysis, there is no prior information about which group any object belongs to. It is widely used in market research, where it helps identify homogeneous groups of buyers of a product (“Statistics Solutions”, n.d.).

For the cluster analysis I chose Product, Month, and Profit from the dataset and created a circle view visualization with Month in the columns and Profit in the rows. Tableau has a built-in cluster analysis (see Figure 6). Here I chose two clusters: cluster 1 lies mostly below the average line, whereas cluster 2 has values above the average.
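To illustrate how a two-cluster split of profit values can come out as "below average" and "above average" groups, here is a minimal one-dimensional k-means sketch with k = 2. Tableau's clustering feature is based on k-means, but this code is a simplified stand-in rather than Tableau's implementation; the function name `kmeans_1d`, the min/max initialization, and the sample values are assumptions for the example.

```python
def kmeans_1d(values, max_iter=100):
    """Split values into two clusters with a basic k-means loop (k = 2).

    Centers start at the smallest and largest value (an assumed, deterministic
    initialization). Returns (labels, centers), where label 0 is the low group.
    """
    centers = [float(min(values)), float(max(values))]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest center
        new_labels = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
                      for v in values]
        if new_labels == labels:
            break  # assignments stopped changing, so the loop has converged
        labels = new_labels
        # Update step: move each center to the mean of its members
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return labels, centers

# Hypothetical profit values (not from the assignment data)
hypothetical_profit = [1, 2, 3, 10, 11, 12]
print(kmeans_1d(hypothetical_profit))
```

With a single measure and two clusters, the boundary falls midway between the two centers, which is why one cluster tends to sit below the overall average and the other above it.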

Recommendations from the analysis and visualization

From Figure 7 it is evident that sales and profit are much lower in the USA than in the other countries; Canada has the highest sales, followed by France and then Germany. The most popular product in every country is Paseo. The forecast in Figure 8 also shows that Mexico and the USA see little improvement in sales and profit, so the organization needs to work on both of these countries to increase its sales.


What Are The Differences Between Data Analytics and Data Mining? (2018, October 15). Retrieved March 27, 2019, from

Cluster Analysis – Statistics Solutions. (n.d.). Retrieved March 29, 2019, from

Hypothesis Testing — from Wolfram MathWorld. (n.d.). Retrieved March 30, 2019, from


Figure 1: Scatter chart with linear regression trend lines drawn for each product

Figure 2: Description of the linear trend model

Figure 3: Visualization for Anomalies

Figure 4: Formula used in tableau for finding anomalies

Figure 5: Worksheet description of anomalies visualization

Figure 6: Clusters

Figure 7: Recommendation

Figure 8: Forecast
