Final Project
Problem Description:
I wanted to explore a dataset that was rich enough to support several different types of visual analysis yet simple enough to interpret, so I selected the diamonds dataset that ships with the ggplot2 package in R. It has 53,940 observations and ten variables, including carat weight, cut, color, clarity, and price. My overall research question was: How do the physical and quality attributes of a diamond affect its price, and what patterns emerge when they are considered individually and together?
I broke that question into several connected sub-questions. I wanted to know how the distributions of diamond price and carat size vary across cut categories. I also wondered which cut categories command higher prices and whether any of them deviate noticeably from the average price across the dataset. Finally, I wanted to test whether price is strongly associated with carat weight and how multiple attributes, such as cut and color, jointly affect average price. These smaller questions helped me sequence and design the visualizations I produced.
Related Work:
I reviewed examples from our class modules and from Chapter 13 of Now You See It, which uses distribution charts, rankings, correlation graphics, and multivariate displays to explore complex datasets. Several of the student examples on the course dashboard used histograms and boxplots to introduce datasets and compare categories, and these guided my decisions on where to start. I was also inspired by examples in the R Graphics Cookbook and by Tableau Public dashboards, which often use faceting and heatmaps to show layered relationships across multiple variables. Reviewing this work helped me structure my own analysis so that each visualization tells a different story rather than repeating the same information in different forms.
Solution: Technical Approach and Visualization Choices:
To explore my research question I used RStudio with the ggplot2 package. Each visualization I created highlights one of the analytical perspectives from Chapter 13. I started with a histogram of diamond prices to show their distribution. This established how strongly right-skewed the data is: most diamonds are inexpensive, but a small number are extremely expensive.
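A minimal sketch of such a price histogram follows. The bin width of 500 and the styling are assumptions chosen for illustration, not the exact settings behind the original figure. The mean/median comparison at the end is an added numeric check of the skew.

```r
library(ggplot2)  # the diamonds dataset ships with ggplot2

# Histogram of price; binwidth = 500 is an assumed value for illustration
p_price <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500, fill = "steelblue", color = "white") +
  labs(
    title = "Distribution of diamond prices",
    x = "Price (USD)",
    y = "Number of diamonds"
  )

print(p_price)

# In a right-skewed distribution the mean sits above the median
price_mean <- mean(diamonds$price)
price_median <- median(diamonds$price)
```

A wider or narrower bin changes how smooth the histogram looks, so it is worth trying a few values before settling on one.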
Next I examined carat size with faceted histograms to compare carat distributions across the five cut categories. This comparison showed how size varies by cut. I then plotted price by cut to see how the median price and the variability differ across categories. This visualization ranked the cuts by median price and illustrated the variation within each group.
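The two charts in this step could be sketched as below. This assumes a boxplot for the price-by-cut comparison (the text above only says that medians and variability were shown), and the bin width, the single-column facet layout, and the free y-axis are likewise illustrative assumptions.

```r
library(ggplot2)

# Faceted histograms: one panel of carat counts per cut category
p_carat <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1) +
  facet_wrap(~ cut, ncol = 1, scales = "free_y") +
  labs(x = "Carat", y = "Number of diamonds")

# reorder() sorts the cut levels by median price, so the boxplot reads as a ranking
p_box <- ggplot(diamonds, aes(x = reorder(cut, price, FUN = median), y = price)) +
  geom_boxplot() +
  coord_flip() +
  labs(x = "Cut", y = "Price (USD)")

print(p_carat)
print(p_box)

# The medians that drive the ordering, for reference
median_by_cut <- tapply(diamonds$price, diamonds$cut, median)
```

Letting the y-axis vary per panel keeps the smaller categories readable, at the cost of making raw counts harder to compare across panels.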
To understand the composition of the dataset I created a part-to-whole bar chart showing the percentage of diamonds in each cut category. This step helped contextualise later findings: for example, some categories dominate the dataset, which affects average pricing patterns. I followed this with a deviation chart comparing each cut's mean price to the average price of the whole dataset. This visualization made clear which cuts are priced above or below that benchmark.
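One way to build both charts is sketched here using base R summaries fed into ggplot2. The intermediate names (cut_share, cut_dev, overall_mean) and the styling are hypothetical, chosen only for this example.

```r
library(ggplot2)

# Part-to-whole: share of all diamonds that falls in each cut category
cut_share <- as.data.frame(prop.table(table(diamonds$cut)))
names(cut_share) <- c("cut", "share")

p_share <- ggplot(cut_share, aes(x = cut, y = share)) +
  geom_col() +
  scale_y_continuous(labels = function(x) paste0(round(100 * x), "%")) +
  labs(x = "Cut", y = "Share of diamonds")

# Deviation: each cut's mean price minus the overall mean price
overall_mean <- mean(diamonds$price)
cut_dev <- aggregate(price ~ cut, data = diamonds, FUN = mean)
cut_dev$deviation <- cut_dev$price - overall_mean

p_dev <- ggplot(cut_dev, aes(x = cut, y = deviation, fill = deviation > 0)) +
  geom_col(show.legend = FALSE) +
  geom_hline(yintercept = 0) +
  labs(x = "Cut", y = "Mean price minus overall mean (USD)")

print(p_share)
print(p_dev)
```

Because the overall mean is a share-weighted average of the per-cut means, the categories with the largest shares pull the benchmark toward their own mean price, which is why the part-to-whole chart is useful context for the deviation chart.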
I then drew a scatterplot of price against carat weight, with LOESS smoothers colored by cut. The plot showed a strong positive relationship, with price rising as carat weight increases, and the trend lines showed that this relationship varies with cut quality. Finally, to study interactions between variables, I made a heatmap of the average price for each cut/color combination. This multivariate view revealed deeper trends that were not apparent in the one-variable analyses; for example, the average price of a given color grade differs from one cut category to another.
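A rough sketch of both plots is shown below. Fitting LOESS on all 53,940 rows is slow, so this version smooths a random sample of 5,000 rows; the sampling, the seed, the point styling, and the viridis fill scale are all assumptions made for the example rather than details of the original figures.

```r
library(ggplot2)

# Assumed shortcut: sample rows so the LOESS fit stays fast
set.seed(1)
sample_rows <- diamonds[sample(nrow(diamonds), 5000), ]

p_scatter <- ggplot(sample_rows, aes(x = carat, y = price, color = cut)) +
  geom_point(alpha = 0.2, size = 0.8) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(x = "Carat", y = "Price (USD)", color = "Cut")

# Pearson correlation on the full dataset as a numeric companion to the plot
carat_price_r <- cor(diamonds$carat, diamonds$price)

# Mean price for every cut/color cell, used as the heatmap fill
cell_means <- aggregate(price ~ cut + color, data = diamonds, FUN = mean)

p_heat <- ggplot(cell_means, aes(x = color, y = cut, fill = price)) +
  geom_tile(color = "white") +
  scale_fill_viridis_c(name = "Mean price (USD)") +
  labs(x = "Color", y = "Cut")

print(p_scatter)
print(p_heat)
```

The heatmap shows raw cell means, so differences in typical carat size between cells are folded into the colors; it describes how average price varies across combinations without isolating the effect of cut or color alone.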
Working through this project helped me better understand how different diamond attributes shape the price and how visualization techniques can uncover meaningful patterns. The histograms clarified distribution shapes, the ranking and deviation charts highlighted how quality categories differ, and the correlation and heatmap visualizations demonstrated how price is influenced by multiple variables at once. Overall, the diamonds dataset provided an excellent opportunity to apply a wide range of visual analysis techniques while also telling a coherent, data-driven story.






