Teaching:TUW - UE InfoVis WS 2008/09 - Gruppe 02 - Aufgabe 1 - Scatterplot: Difference between revisions

m (Revamp formatting of the text, correct some errors, order references, make links to the references)
m (Some error correction and put links to Figure 1 to 5)
Line 2: Line 2:




Also called a ''scatter chart'', ''scatter diagram'' or ''scatter graph'' [http://en.wikipedia.org/wiki/Scatterplot [Wikipedia 1, 2008]]. It is in it's most basic form a diagram in which the values of two ''metric'' variables are applied to the horizontal and vertical axes of a cartesian coordinate system. The resulting point in the graph represents one record from a data set. The distribution pattern of points from multiple records reveals, among other qualities, the correlation between the selected variables in the data set. The scatterplot is not to be confused with the ''correlation plot'' [http://www.itl.nist.gov/div898/handbook/ [NIST, 2008]] which treats already adopted correlation coefficients in different data groups, while the term ''correlation diagram'' does not seem to be bound.
Also called a ''scatter chart'', ''scatter diagram'' or ''scatter graph'' [[http://en.wikipedia.org/wiki/Scatterplot Wikipedia 1, 2008]]. It is in it's most basic form a diagram in which the values of two ''metric'' variables are applied to the horizontal and vertical axes of a cartesian coordinate system. The resulting point in the graph represents one record from a data set. The distribution pattern of points from multiple records reveals, among other qualities, the correlation between the selected variables in the data set. The scatterplot is not to be confused with the ''correlation plot'' [[http://www.itl.nist.gov/div898/handbook/ NIST, 2008]] which treats already adopted correlation coefficients in different data groups, while the term ''correlation diagram'' does not seem to be bound.


==Revealed Information==
==Revealed Information==
----
----
Perfect linear correlation results in all samples lying on the regression line with positive or negative incline dependent on the sign of the correlation coefficient [http://www.mste.uiuc.edu/courses/ci330ms/youtsey/scatterinfo.html [MSTE, 1997]]. Note, that the nonzero incline of the line is insignificant in this kind of diagram [http://en.wikipedia.org/wiki/Correlation [Wikipedia 3, 2008]] since it is dependent on axis scales.
Perfect linear correlation results in all samples lying on the regression line with positive or negative incline dependent on the sign of the correlation coefficient [[http://www.mste.uiuc.edu/courses/ci330ms/youtsey/scatterinfo.html MSTE, 1997]]. Note, that the nonzero incline of the line is insignificant in this kind of diagram [[http://en.wikipedia.org/wiki/Correlation Wikipedia 3, 2008]] since it is dependent on axis scales.


An example of perfect correlation can be seen in the upper left of Figure 1 together with other patterns: strong positive (upper right), weak negative (lower left) and one example of variables without significant correlation.
An example of perfect correlation can be seen in the upper left of [http://www.infovis-wiki.net/index.php?title=Image:SomeScatterplots.jpg| Figure 1] together with other patterns: strong positive (upper right), weak negative (lower left) and one example of variables without significant correlation.


[[Image:SomeScatterplots.jpg|right|200px|thumb|Figure 1: Some scatterplots.]]
[[Image:SomeScatterplots.jpg|right|200px|thumb|Figure 1: Some scatterplots.]]


Figure 2 features a regression line to further increase expressiveness. It is the line that passes through the plot as close to the points as possible. The regression function is not necessarily chosen linear as in this example. Any kind of curve may fit a plot (quadratic curves, splines, ...) and must be chosen according to subject matter. Generally, the curve with the smallest sum of squared distances to the plotted points is sought after (''least squares fitting''), [http://www.netmba.com/statistics/plot/scatter/ [NetMBA, 2008]]. For an introduction on linear regression, see [http://en.wikipedia.org/wiki/Linear_regression [Wikipedia 4, 2008]; [http://en.wikipedia.org/wiki/Regression_analysis Wikipedia 5, 2008]].
[http://www.infovis-wiki.net/index.php?title=Image:WeakNegativeCorrelationLine.jpg| Figure 2] features a regression line to further increase expressiveness. It is the line that passes through the plot as close to the points as possible. The regression function is not necessarily chosen linear as in this example. Any kind of curve may fit a plot (quadratic curves, splines, ...) and must be chosen according to subject matter. Generally, the curve with the smallest sum of squared distances to the plotted points is sought after (''least squares fitting''), [[http://www.netmba.com/statistics/plot/scatter/ NetMBA, 2008]]. For an introduction on linear regression, see [[http://en.wikipedia.org/wiki/Linear_regression Wikipedia 4, 2008]; [http://en.wikipedia.org/wiki/Regression_analysis Wikipedia 5, 2008]].


[[Image:WeakNegativeCorrelationLine.jpg|right|200px|thumb|Figure 2: Regression line.]]
[[Image:WeakNegativeCorrelationLine.jpg|right|200px|thumb|Figure 2: Regression line.]]


Further properties of data sets that are easily discovered are the presence of clusters and outliers as denoted in Figure 3.
Further properties of data sets that are easily discovered are the presence of clusters and outliers as denoted in [http://www.infovis-wiki.net/index.php?title=Image:ClustersOutlyers.jpg| Figure 3].


[[Image:ClustersOutlyers.jpg|right|200px|thumb|Figure 3: Clusters, outliers.]]
[[Image:ClustersOutlyers.jpg|right|200px|thumb|Figure 3: Clusters and outliers.]]


==Scatterplots of Higher Dimensions==
==Scatterplots of Higher Dimensions==
----
----
Scatterplots are not restricted to records with only two variables. Higher dimensional data can be displayed by adding the third axis to the plot or by assigning point properties like color, size or shape. Figure 4 shows a plot done with Matlab's command "plot4D", [http://cosmologist.info/cosmomc/readme.html [Lewis, 2008]].
Scatterplots are not restricted to records with only two variables. Higher dimensional data can be displayed by adding the third axis to the plot or by assigning point properties like color, size or shape. [http://www.infovis-wiki.net/index.php?title=Image:Plot4D.png| Figure 4] shows a plot done with Matlab's command "plot4D", [[http://cosmologist.info/cosmomc/readme.html Lewis, 2008]].


[[Image:Plot4D.png|right|200px|thumb|Figure 4: 4D plot.]]
[[Image:Plot4D.png|right|200px|thumb|Figure 4: A Matlab 4D plot.]]


For another example of a threedimensional scatterplot refer to [http://en.wikipedia.org/wiki/Scatterplot [Wikipedia 1, 2008]]. A way of plotting multidimensional data without the use of the third axis can be found in [http://www.ailab.si/janez/visualizations.html [Demsar, 2008]].
For another example of a threedimensional scatterplot refer to [[http://en.wikipedia.org/wiki/Scatterplot Wikipedia 1, 2008]]. A way of plotting multidimensional data without the use of the third axis can be found in [[http://www.ailab.si/janez/visualizations.html Demsar, 2008]].


==Treating Discrete Data==
==Treating Discrete Data==
----
----
For continuously distributed data, scatterplots do well in visualizing density. The problem with discrete data is the possibility of more than one record sharing one point in the diagram (''overplotting''). One solution is to alter the point representation according to density, as is achieved by ''sun flower plots'' in which each point symbol gains radial segments as a consequence, [http://de.wikipedia.org/wiki/Streudiagramm [Wikipedia 2, 2008]]. Examples can be found here: [http://www.math.yorku.ca/SCS/sasmac/sunplot.html [Friendly, 2006]; [http://addictedtor.free.fr/graphiques/graphcode.php?graph=59  addictedtor, 2005]].
For continuously distributed data, scatterplots do well in visualizing density. The problem with discrete data is the possibility of more than one record sharing one point in the diagram (''overplotting''). One solution is to alter the point representation according to density, as is achieved by ''sun flower plots'' in which each point symbol gains radial segments as a consequence, [[http://de.wikipedia.org/wiki/Streudiagramm Wikipedia 2, 2008]]. Examples can be found here: [[http://www.math.yorku.ca/SCS/sasmac/sunplot.html Friendly, 2006]; [http://addictedtor.free.fr/graphiques/graphcode.php?graph=59  addictedtor, 2005]].


==Example of Application==
==Example of Application==
----
----
One prominent example of a scatterplot of two variables is the Hertzsprung-Russell diagramm (HRD for short) in astronomy (Figure 5). It plots absolute magnitude (visual brightness) of stars against their effective temperature and shows a rich set of features such as clusters and a central filament (the ''main sequence''), [http://www.britannica.com/EBchecked/topic/263951/Hertzsprung-Russell-diagram [Britannica, 2008]].  
One prominent example of a scatterplot of two variables is the Hertzsprung-Russell diagram (HRD for short) in astronomy ([http://www.infovis-wiki.net/index.php?title=Image:Hr_diagram_local.png| Figure 5]). It plots absolute magnitude (visual brightness) of stars against their effective temperature and shows a rich set of features such as clusters and a central filament (the ''main sequence''), [[http://www.britannica.com/EBchecked/topic/263951/Hertzsprung-Russell-diagram Britannica, 2008]].  


[[Image:hr_diagram_local.png|right|200px|thumb|Figure 5: The HRD of some nearby stars.]]
[[Image:hr_diagram_local.png|right|200px|thumb|Figure 5: The HRD of some nearby stars.]]

Revision as of 19:05, 4 November 2009

A scatterplot allows effective visualization of the relation between variables in multidimensional data.


A scatterplot is also called a scatter chart, scatter diagram or scatter graph [Wikipedia 1, 2008]. In its most basic form, it is a diagram in which the values of two metric variables are plotted along the horizontal and vertical axes of a Cartesian coordinate system. Each resulting point in the graph represents one record from a data set. The distribution pattern of the points from multiple records reveals, among other qualities, the correlation between the selected variables in the data set. The scatterplot is not to be confused with the correlation plot [NIST, 2008], which displays previously computed correlation coefficients for different data groups. The term correlation diagram, by contrast, does not seem to have a fixed meaning.

Revealed Information


Perfect linear correlation results in all samples lying on the regression line, whose slope is positive or negative depending on the sign of the correlation coefficient [MSTE, 1997]. Note that the magnitude of the nonzero slope carries no significance in this kind of diagram [Wikipedia 3, 2008], since it depends on the axis scales.

An example of perfect correlation can be seen in the upper left of Figure 1 together with other patterns: strong positive (upper right), weak negative (lower left) and one example of variables without significant correlation.

Figure 1: Some scatterplots.
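
The remark about the slope can be checked numerically. The following is a minimal sketch in Python using only the standard library. The function names pearson_r and ls_slope and the sample data are made up for illustration and are not taken from the cited sources. It computes the correlation coefficient and the least squares slope for perfectly correlated data, then repeats the computation after rescaling the horizontal axis.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def ls_slope(xs, ys):
    """Slope of the ordinary least squares line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov / var_x

# Hypothetical data: a perfect negative linear relation y = -2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [-2.0 * x + 1.0 for x in xs]

# The same records with the x values expressed in a unit 100 times smaller.
xs_rescaled = [100.0 * x for x in xs]

r_original = pearson_r(xs, ys)
r_rescaled = pearson_r(xs_rescaled, ys)
slope_original = ls_slope(xs, ys)
slope_rescaled = ls_slope(xs_rescaled, ys)

print(r_original, r_rescaled)          # the coefficient is unchanged
print(slope_original, slope_rescaled)  # the slope shrinks by the scale factor
```

Under these assumptions the correlation coefficient stays the same after rescaling, while the slope changes with the unit of the axis. Only the sign of the slope carries over from one scaling to the other.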

Figure 2 features a regression line to further increase expressiveness. It is the line that passes through the plot as close to the points as possible. The regression function need not be linear, as it is in this example. Any kind of curve may be fitted to a plot (quadratic curves, splines, ...) and should be chosen according to the subject matter. Generally, the curve with the smallest sum of squared distances to the plotted points is sought (least squares fitting) [NetMBA, 2008]. For an introduction to linear regression, see [Wikipedia 4, 2008; Wikipedia 5, 2008].

Figure 2: Regression line.
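
Least squares fitting of a straight line, as described above, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the method of the cited sources: "distance" is taken to mean the vertical distance between a point and the line (ordinary least squares), and the names fit_line and sum_squared_residuals as well as the data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the line minimizing the sum of
    squared vertical distances to the points (ordinary least squares)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum of squared vertical distances between the points and a line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Made-up records with a negative trend and some scatter.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [4.1, 3.2, 2.9, 1.8, 1.0]

slope, intercept = fit_line(xs, ys)
best = sum_squared_residuals(xs, ys, slope, intercept)

# Any other line gives a larger sum, e.g. one with a slightly changed slope.
nudged = sum_squared_residuals(xs, ys, slope + 0.1, intercept)

print(slope, intercept)
print(best < nudged)
```

The closed-form solution applies only to the linear case. For quadratic curves or splines, the same criterion (smallest sum of squared distances) is minimized over a different family of functions.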

Further properties of data sets that are easily discovered are the presence of clusters and outliers, as shown in Figure 3.

Figure 3: Clusters and outliers.

Scatterplots of Higher Dimensions


Scatterplots are not restricted to records with only two variables. Higher-dimensional data can be displayed by adding a third axis to the plot or by mapping additional variables to point properties such as color, size or shape. Figure 4 shows a plot produced with the Matlab command "plot4D" [Lewis, 2008].

Figure 4: A Matlab 4D plot.

For another example of a three-dimensional scatterplot, refer to [Wikipedia 1, 2008]. A way of plotting multidimensional data without using a third axis can be found in [Demsar, 2008].
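
Assigning extra variables to point properties amounts to rescaling each variable onto the range of the visual attribute. The sketch below shows one possible way to do this in Python for four-dimensional records. The helper rescale, the chosen ranges (marker sizes from 4 to 16, gray levels from 0 to 1) and the records themselves are assumptions made for illustration. It does not reproduce what the Matlab command "plot4D" does.

```python
def rescale(values, low, high):
    """Linearly map values onto [low, high].
    A constant input is mapped to the midpoint of the range."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [(low + high) / 2 for _ in values]
    return [low + (v - vmin) * (high - low) / (vmax - vmin) for v in values]

# Made-up records: (x, y, third variable, fourth variable).
records = [
    (1.0, 2.0, 10.0, 0.0),
    (2.0, 1.5, 20.0, 5.0),
    (3.0, 3.5, 30.0, 10.0),
]

# Third variable controls marker size, fourth controls the gray level
# (0.0 = black, 1.0 = white). Both ranges are arbitrary choices.
sizes = rescale([r[2] for r in records], 4.0, 16.0)
grays = rescale([r[3] for r in records], 0.0, 1.0)

# One marker description per record, ready to hand to a plotting routine.
markers = [
    {"x": r[0], "y": r[1], "size": s, "gray": g}
    for r, s, g in zip(records, sizes, grays)
]

for m in markers:
    print(m)
```

The position still encodes the first two variables, so the usual reading of the plot is preserved. Size and gray level add two further dimensions without a third axis.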

Treating Discrete Data


For continuously distributed data, scatterplots do well in visualizing density. The problem with discrete data is that more than one record may share the same point in the diagram (overplotting). One solution is to alter the point representation according to density, as is done in sunflower plots, in which each point symbol gains radial segments as the number of records at its position grows [Wikipedia 2, 2008]. Examples can be found in [Friendly, 2006; addictedtor, 2005].
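
The counting step behind a sunflower plot can be sketched as follows. This is a minimal Python illustration under assumptions: the names sunflower_symbols and petals are hypothetical, and the convention used here (a single record is drawn as a plain dot, several records get one radial segment each) is one common choice, not necessarily the one used in the cited examples.

```python
from collections import Counter

def petals(count):
    """Number of radial segments for a position holding `count` records.
    Assumed convention: a single record is a plain dot without segments."""
    return 0 if count == 1 else count

def sunflower_symbols(points):
    """Collapse coinciding points into (x, y, number of petals) triples."""
    counts = Counter(points)
    return [(x, y, petals(n)) for (x, y), n in sorted(counts.items())]

# Made-up discrete data in which several records share a position.
points = [(1, 1), (1, 1), (1, 1), (2, 3), (2, 3), (4, 0)]

symbols = sunflower_symbols(points)
print(symbols)
```

Each triple describes one symbol to draw, so overplotted records remain visible as extra segments instead of disappearing behind a single dot.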

Example of Application


One prominent example of a scatterplot of two variables is the Hertzsprung-Russell diagram (HRD for short) in astronomy (Figure 5). It plots the absolute magnitude (intrinsic brightness) of stars against their effective temperature and shows a rich set of features such as clusters and a central filament (the main sequence) [Britannica, 2008].

Figure 5: The HRD of some nearby stars.

References


External Links