Teaching:TUW - UE InfoVis WS 2008/09 - Gruppe 02 - Aufgabe 1 - Scatterplot

A scatterplot (also called a ''scatter chart'', ''scatter diagram'' or ''scatter graph'' [Wikipedia 1, 2008]) is a diagram in which the values of two ''metric'' variables are plotted along the horizontal and vertical axes of a Cartesian coordinate system. Each resulting point in the graph represents one record from a data set. The distribution pattern of the points from multiple records reveals, among other qualities, the correlation between the selected variables in the data set. The scatterplot is not to be confused with the ''correlation plot'' [NIST, 2008], which displays already computed correlation coefficients for different data groups; the term ''correlation diagram'', by contrast, does not seem to have a fixed meaning.


===Revealed Information===

Perfect linear correlation results in all samples lying on the regression line, which slopes upward or downward depending on the sign of the correlation coefficient [MSTE, 1997]. Note that the steepness of a nonzero slope is not significant in this kind of diagram [Wikipedia 3, 2008], since it depends on the axis scales.
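
The remark about axis scales can be checked numerically. The following is a minimal sketch, assuming made-up data in which y is an exact linear function of x; the helper names <code>pearson_r</code> and <code>slope</code> are illustrative and do not come from the cited sources. Rescaling one axis changes the slope of the fitted line but leaves the correlation coefficient unchanged.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def slope(xs, ys):
    """Slope of the least-squares line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov / var_x


# Hypothetical data: y depends on x exactly linearly (perfect correlation).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x + 1.0 for x in xs]

# The same records with the vertical axis expressed in a different unit.
ys_rescaled = [y / 100.0 for y in ys]

r_original = pearson_r(xs, ys)
r_rescaled = pearson_r(xs, ys_rescaled)
slope_original = slope(xs, ys)
slope_rescaled = slope(xs, ys_rescaled)

print(r_original, r_rescaled)          # correlation is the same for both
print(slope_original, slope_rescaled)  # slope differs by the scale factor
```

Only the sign of the slope carries over to the correlation coefficient; its magnitude reflects the units chosen for the axes.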


An example of perfect correlation can be seen on the right, together with other patterns: strong positive, weak negative, and one example of variables without significant correlation.
[[Image:SomeScatterplots.jpg|right|200px|thumb|Some scatterplots.]]


The plot below features a regression line to further increase expressiveness. The regression function need not be linear as in this example; any kind of curve may be fitted to a plot (quadratic, splines, ...). Generally, the curve with the smallest sum of squared distances to the points is sought [NetMBA, 2008]. For an introduction to linear regression, see [Wikipedia 4, 2008; Wikipedia 5, 2008].
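
As an illustration of the least-squares criterion, here is a minimal sketch for the linear case. It assumes ordinary least squares, in which the "distance" is the vertical distance between a point and the line; the data values and the function names <code>fit_line</code> and <code>sum_squared_residuals</code> are invented for this example.

```python
def fit_line(xs, ys):
    """Return (intercept, gradient) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gradient = sxy / sxx
    intercept = mean_y - gradient * mean_x
    return intercept, gradient


def sum_squared_residuals(xs, ys, intercept, gradient):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (intercept + gradient * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical records with a negative trend and some noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [5.1, 3.9, 3.2, 1.8, 1.1]

intercept, gradient = fit_line(xs, ys)
best = sum_squared_residuals(xs, ys, intercept, gradient)

# Any other line through the same data has a larger sum of squares.
shifted = sum_squared_residuals(xs, ys, intercept + 0.1, gradient)
tilted = sum_squared_residuals(xs, ys, intercept, gradient + 0.1)
print(intercept, gradient, best, shifted, tilted)
```

Fitting a quadratic or a spline follows the same criterion with a different family of curves.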


[[Image:WeakNegativeCorrelationLine.jpg|right|200px|thumb|Regression line.]]


Further properties of data sets that are easily discovered are the presence of clusters and outliers.


[[Image:ClustersOutlyers.jpg|right|200px|thumb|Clusters, outliers.]]
===Scatterplots of Higher Dimensions===

Scatterplots are not restricted to records with only two variables. Higher-dimensional data can be displayed by adding a third axis to the plot or by mapping additional variables to point properties (color, size, shape).


TODO: add figure with colored 3D plot


For an example of a three-dimensional scatterplot, refer to [Wikipedia 1, 2008]. A way of plotting multidimensional data without using a third axis can be found in [Demsar, 2008].
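
Mapping additional variables to point properties can be sketched as follows. This is an illustration only: the records, the size range of 4 to 16 and the shape names are assumptions made for this example, and drawing is left to whatever plotting tool is used.

```python
def scale_to_range(values, low, high):
    """Linearly map values onto [low, high]; constant input maps to low."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        return [low for _ in values]
    return [low + (v - v_min) * (high - low) / (v_max - v_min) for v in values]


# Hypothetical records: (x, y, third metric variable, category).
records = [
    (1.0, 2.0, 10.0, "a"),
    (2.0, 1.5, 30.0, "b"),
    (3.0, 3.5, 20.0, "a"),
]

# The third variable controls point size, the category controls point shape.
shape_for_category = {"a": "circle", "b": "square"}
sizes = scale_to_range([record[2] for record in records], 4.0, 16.0)

points = [
    {"x": x, "y": y, "size": size, "shape": shape_for_category[category]}
    for (x, y, _, category), size in zip(records, sizes)
]

for point in points:
    print(point)
```

The position on the two axes still encodes the first two variables, so each record remains a single point.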


===Treating Discrete Data===

For continuously distributed data, scatterplots do well at visualizing density. The problem with discrete data is that more than one record may share the same point in the diagram (''overplotting''). One solution is to alter the point representation according to density, as is done in ''sunflower plots'', in which each point symbol gains radial segments as more records coincide at that point [Wikipedia 2, 2008]. Examples can be found in [Friendly, 2006; addictedtor, 2005].
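
The counting step behind such a plot can be sketched in a few lines. This is a minimal sketch, assuming invented discrete records; the name <code>count_overplotted</code> is hypothetical, and how the counts are turned into radial segments depends on the plotting tool.

```python
from collections import Counter


def count_overplotted(points):
    """Count how many records fall on each distinct (x, y) position."""
    return Counter(points)


# Hypothetical discrete records; several of them coincide.
points = [(1, 2), (1, 2), (3, 1), (1, 2), (3, 1), (2, 2)]

multiplicity = count_overplotted(points)

# Each distinct position is drawn once, with its count deciding the symbol.
for position, count in sorted(multiplicity.items()):
    print(position, count)
```

Six records collapse to three distinct positions here, which an ordinary scatterplot would show as three identical dots.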




=References=

*[MSTE, 1997] Anonymous Carolyn, Carolyn's Unit on Graphing. MSTE, University of Illinois. Retrieved at: November 2, 2008. [http://www.mste.uiuc.edu/courses/ci330ms/youtsey/scatterinfo.html http://www.mste.uiuc.edu/courses/ci330ms/youtsey/scatterinfo.html]
*[NIST, 2008] Carroll Croarkin, Paul Tobias. ''NIST/SEMATECH e-Handbook of Statistical Methods''. Retrieved at: November 2, 2008. [http://www.itl.nist.gov/div898/handbook/ http://www.itl.nist.gov/div898/handbook/]
*[NetMBA, 2008] Anonymous, Scatter Plot. NetMBA. Retrieved at: November 2, 2008. [http://www.netmba.com/statistics/plot/scatter/ http://www.netmba.com/statistics/plot/scatter/]
*[addictedtor, 2005] Anonymous, Sun Flower Plot. R Graph Gallery. Changed at: October 6, 2005. Retrieved at: November 2, 2008. [http://addictedtor.free.fr/graphiques/graphcode.php?graph=59 http://addictedtor.free.fr/graphiques/graphcode.php?graph=59]
*[Friendly, 2006] Michael Friendly, The sunplot macro. York University. Changed at: November 2, 2006. Retrieved at: November 2, 2008. [http://www.math.yorku.ca/SCS/sasmac/sunplot.html http://www.math.yorku.ca/SCS/sasmac/sunplot.html]
*[Demsar, 2008] Janez Demsar, Simple Visualization Examples. A.I. Lab Ljubljana. Retrieved at: November 2, 2008. [http://www.ailab.si/janez/visualizations.html http://www.ailab.si/janez/visualizations.html]
*[Wikipedia 1, 2008] Scatterplot, Wikipedia. Retrieved at: November 2, 2008. [http://en.wikipedia.org/wiki/Scatterplot http://en.wikipedia.org/wiki/Scatterplot]
*[Wikipedia 2, 2008] Streudiagramm, Wikipedia. Retrieved at: November 2, 2008. [http://de.wikipedia.org/wiki/Streudiagramm http://de.wikipedia.org/wiki/Streudiagramm]
*[Wikipedia 3, 2008] Correlation, Wikipedia. Retrieved at: November 2, 2008. [http://en.wikipedia.org/wiki/Correlation http://en.wikipedia.org/wiki/Correlation]
*[Wikipedia 4, 2008] Linear Regression, Wikipedia. Retrieved at: November 2, 2008. [http://en.wikipedia.org/wiki/Linear_regression http://en.wikipedia.org/wiki/Linear_regression]
*[Wikipedia 5, 2008] Regression Analysis, Wikipedia. Retrieved at: November 2, 2008. [http://en.wikipedia.org/wiki/Regression_analysis http://en.wikipedia.org/wiki/Regression_analysis]

=External Links=

*[http://www.itl.nist.gov/div898/handbook/eda/section3/eda33q.htm http://www.itl.nist.gov/div898/handbook/eda/section3/eda33q.htm]
*[http://www.itl.nist.gov/div898/handbook/eda/section3/linecorr.htm http://www.itl.nist.gov/div898/handbook/eda/section3/linecorr.htm]
*Java Applet: [http://matti.usu.edu/nlvm/nav/frames_asid_144_g_4_t_5.html http://matti.usu.edu/nlvm/nav/frames_asid_144_g_4_t_5.html]

Revision as of 20:28, 6 November 2008
