Running head: Data Mining Assignment 3
Data Mining Assignment 3
University of the Cumberlands
Data analytics has two divisions: data mining and data analysis. The two are often confused because their features overlap, but they have quite different properties. Data mining uncovers hidden patterns in a large dataset, whereas data analysis is used to verify statistical models and test hypotheses on a dataset. Data mining can even be classified as one activity within analysis, because data analysis covers collecting the data, treating it, cleaning and preparing it, and finally creating visualizations from it. One main difference is that data mining is applied to structured datasets, while data analysis is applied to both structured and unstructured datasets (“imarticus.org”, n.d.).
Statistical analysis is a part of data analysis. In business intelligence, analysis covers collecting data and exploring it. Statistical analysis has five steps: describe, explore, create a model, validate, and deploy. Analysis helps identify business trends.
A hypothesis is used to test a link between variables in a dataset. Usually a hypothesis is used to predict findings, and testing assesses how likely it is that the hypothesis holds (“Wolfram MathWorld”, n.d.). The visualization in Figure 1 is for hypothesis testing; I used a linear regression model to find the outcomes. Two measures, Profit and COGS, are plotted in a scatter chart broken down by the Product and Month dimensions. In the visualization, linear regression draws a line through the data points, and the best fit is the line that minimizes the sum of squared distances between the points and the line. In other words, linear regression helps quantify the relationship between the two measures for each product. Six trend lines are used for hypothesis testing. The formula used is Product*( ln(Profit) + intercept ), and the resulting p-value is < 0.0001 with an R-squared of 0.617396 (Figure 2). The small p-value indicates that the model is statistically significant, and the R-squared means it explains about 62% of the variance in the response, so the outcome is a reasonably strong regression model.
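To make the trend-line fit concrete, the following is a minimal sketch, in Python rather than Tableau, of an ordinary least-squares fit on log-transformed values together with its R-squared. The function name `fit_log_log` and the sample numbers are hypothetical and are not taken from the assignment dataset; Tableau's own trend-model computation may differ in detail.

```python
import math

def fit_log_log(x_values, y_values):
    """Fit ln(y) = slope * ln(x) + intercept by ordinary least squares.

    Returns (slope, intercept, r_squared). All inputs must be positive,
    because the logarithm is undefined otherwise.
    """
    lx = [math.log(x) for x in x_values]
    ly = [math.log(y) for y in y_values]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared: share of the variance in ln(y) explained by the fitted line.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(lx, ly))
    ss_tot = sum((y - mean_y) ** 2 for y in ly)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared

# Hypothetical (profit, COGS) pairs, for illustration only.
profit = [1000.0, 2500.0, 4000.0, 8000.0, 15000.0]
cogs = [5200.0, 11800.0, 21000.0, 39000.0, 80000.0]
slope, intercept, r_squared = fit_log_log(profit, cogs)
print(f"slope={slope:.4f} intercept={intercept:.4f} R^2={r_squared:.4f}")
```

An R-squared close to 1 means the line accounts for most of the variation in the log-transformed response; a value near 0.62, as reported for the six trend lines, leaves a noticeable share unexplained.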
From the visualization it is evident that the Paseo product has more profit than the rest of the products, and its p-value is also higher than theirs. The details of the Paseo regression model are:
Equation: ln(Cost over gross sale) = 1.01648*ln(Profit) + 1.48378
The product with the least profit is Velo, and the details of its hypothesis calculation are:
Equation: ln(Cost over gross sale) = 0.851062*ln(Profit) + 3.36857
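Because both equations are linear in the logarithms, exponentiating each side gives a power-law form: response = e^intercept × Profit^slope. The sketch below evaluates the two reported equations on that basis. The helper name `predict_response` and the example profit value are hypothetical; only the coefficients come from the equations above.

```python
import math

# Coefficients as reported in the two trend-model equations above.
MODELS = {
    "Paseo": {"slope": 1.01648, "intercept": 1.48378},
    "Velo": {"slope": 0.851062, "intercept": 3.36857},
}

def predict_response(product, profit):
    """Evaluate ln(response) = slope * ln(profit) + intercept, then undo the log."""
    model = MODELS[product]
    return math.exp(model["slope"] * math.log(profit) + model["intercept"])

example_profit = 10_000.0  # hypothetical value, for illustration only
for name in MODELS:
    print(name, round(predict_response(name, example_profit), 2))
```

Read this way, the Paseo slope of about 1.02 means the response grows almost in proportion to profit, while the Velo slope of about 0.85 means it grows less than proportionally.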
Anomalies, also called outliers, are data points that lie outside the normal range. From the dataset I chose Month and Sales for detecting anomalies. I added Sales to the rows twice, once as the sum of sales and once as the standard deviation of sales, with Month in the columns. To find the outliers, the formula below is used in a Tableau calculated field (Figure 4):
IF SUM([Sales]) <
(WINDOW_AVG(SUM([Sales])) - WINDOW_STDEV(SUM([Sales])))
THEN "Bad Anomaly" ELSEIF SUM([Sales]) >
(WINDOW_AVG(SUM([Sales])) + WINDOW_STDEV(SUM([Sales])))
THEN "Good Anomaly" ELSE "Expected" END
After adding the anomaly formula to the worksheet in Tableau, we find that good anomalies occur only in November and December, with average sales of more than 14,947,804 (Figure 3).
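The same one-standard-deviation rule can be sketched outside Tableau. The Python version below is an illustration only: the function name `label_anomalies` and the monthly totals are hypothetical, and it assumes that WINDOW_STDEV corresponds to the sample standard deviation.

```python
from statistics import mean, stdev

def label_anomalies(monthly_sales):
    """Label each month relative to the mean +/- one sample standard deviation."""
    values = list(monthly_sales.values())
    avg = mean(values)
    sd = stdev(values)  # sample standard deviation (assumed to match WINDOW_STDEV)
    lower, upper = avg - sd, avg + sd
    labels = {}
    for month, total in monthly_sales.items():
        if total < lower:
            labels[month] = "Bad Anomaly"
        elif total > upper:
            labels[month] = "Good Anomaly"
        else:
            labels[month] = "Expected"
    return labels

# Hypothetical monthly totals, for illustration only.
sales = {"Sep": 100.0, "Oct": 110.0, "Nov": 105.0, "Dec": 300.0, "Jan": 95.0}
print(label_anomalies(sales))
```

A band of one standard deviation is fairly narrow, so this rule flags months more readily than the two or three standard deviations often used for outlier detection.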
Cluster analysis is an exploratory method used to separate data into groups. It is also known as classification analysis. In this type of analysis, there is no prior information about the group membership of any object. It is widely used in market research, where it helps identify homogeneous groups of buyers of a product (“Statistics Solutions”, n.d.).
For cluster analysis, I chose Product, Month, and Profit from the dataset and created a circle view visualization with Month in the columns and Profit in the rows. Tableau has a built-in analysis for clusters (refer to Figure 6). I chose two clusters: cluster 1 lies mostly below the average line, whereas cluster 2 has values above the average.
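As an illustration of how a two-cluster split on a single measure can work, here is a simplified one-dimensional k-means sketch in Python. Tableau's clustering feature is documented as being based on k-means, but this is not Tableau's implementation; the function name `two_means_1d` and the profit figures are hypothetical.

```python
def two_means_1d(values, iterations=100):
    """Split values into two clusters with a basic k-means loop (k = 2).

    Centers start at the minimum and maximum value. Returns (labels, centers),
    where label 0 belongs to the lower center and label 1 to the higher one.
    """
    centers = [min(values), max(values)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center.
        labels = [
            0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            for v in values
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            new_centers.append(sum(members) / len(members) if members else centers[k])
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return labels, centers

# Hypothetical monthly profit figures, for illustration only.
profits = [120.0, 135.0, 128.0, 410.0, 395.0, 140.0, 420.0]
labels, centers = two_means_1d(profits)
print(labels, centers)
```

With two clusters on one measure, the result is effectively a threshold between a low group and a high group, which matches the below-average and above-average pattern described for clusters 1 and 2.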
Recommendations from the analysis and visualization
From Figure 7 it is evident that sales and profit are much lower in the USA than in the rest of the countries; Canada has the highest sales, followed by France and then Germany. The most popular product in every country is Paseo. The forecast in Figure 8 also shows that Mexico and the USA do not have much improvement in sales and profit, so the organization needs to work on both of these countries to increase its sales.
What Are The Differences Between Data Analytics and Data Mining? (2018, October 15). Retrieved March 27, 2019, from https://imarticus.org/what-are-the-differences-between-data-analytics-and-data-mining/
Cluster Analysis – Statistics Solutions. (n.d.). Retrieved March 29, 2019, from https://www.statisticssolutions.com/directory-of-statistical-analyses-cluster-analysis/
Hypothesis Testing — from Wolfram MathWorld. (n.d.). Retrieved March 30, 2019, from http://mathworld.wolfram.com/HypothesisTesting.html
Figure 1: Scatter chart with linear regression trend lines drawn for each product
Figure 2: Description of the linear trend model
Figure 3: Visualization for Anomalies
Figure 4: Formula used in tableau for finding anomalies
Figure 5: Worksheet description of anomalies visualization
Figure 6: Clusters
Figure 7: Recommendation
Figure 8: Forecast