
Creating histogram/statistics/scatterplot by classification tool in ArcMap?



I'm currently working on a land cover classification with Landsat 8 imagery. I already subjected the imagery to pre-processing (atmospheric and topographic correction) and I mosaicked two images. Now I want to apply a classification in ArcMap. I understand what I should do, but somehow I cannot manage to create the statistics/histograms/scatterplots. When I click on the statistics, all values are zero. When I look at the properties of the raster layers on the other hand, I can see the statistics, calculated with the 'calculate statistics' tool. It doesn't create a histogram at all and the scatterplots are just 1 dot in the middle of the screen. Does somebody know what my problem is and how I can fix this?


The problem might be that you have not selected the proper raster layer in the classification toolbar.

Just check the raster layer highlighted in the classification toolbar and verify that it is the same raster that you wanted to classify.


If your raster is floating point, you will have this problem. You need to convert it to integer; the Math tool Int() will help you with that.
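For example, a minimal arcpy sketch (assuming a Spatial Analyst license and a placeholder path) might look like this:

```python
# Minimal sketch (assumes a Spatial Analyst license and a placeholder path):
# convert a floating-point raster to integer so the classification toolbar
# can build statistics and histograms from it.
import arcpy
from arcpy.sa import Int, Raster

arcpy.CheckOutExtension("Spatial")

float_raster = Raster(r"C:\data\landsat_mosaic.tif")   # hypothetical input
# If the values are 0-1 reflectance, scale them first so Int() does not
# collapse everything to zero, e.g. Int(float_raster * 10000).
int_raster = Int(float_raster)
int_raster.save(r"C:\data\landsat_mosaic_int.tif")

arcpy.CalculateStatistics_management(r"C:\data\landsat_mosaic_int.tif")
```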


Epidemiology and ArcGIS Insights - Part 1

I’ve spent most of my professional career working in spatial analysis and epidemiology. These were terms that were often met with blank stares when I was asked what I did. But now, after years of having to explain what they mean, and furthermore, how GIS is related, during the COVID-19 pandemic previously specialist terms like ‘epidemic curve’ have entered the everyday language. It therefore seems a perfect time for a quick blog on this topic.

Epidemiology sits at an intersection of a number of different disciplines and uses knowledge and methods from, for example, the fields of health, medicine, and statistics. There are numerous disciplines even within the broad framework of epidemiology that focus on infectious disease, genetics, chronic disease, and environmental and spatial epidemiology. While I could passionately write about environmental and spatial epidemiology in particular, I have tried to keep this blog a little more generic but thought I should declare my (spatial) bias upfront. For consistency, during this overview I’ll demonstrate epidemiology using examples of COVID-19 from April 2020. I’ll also demonstrate how ArcGIS Insights provides a powerful, yet accessible solution for some of the analytical needs of the epidemiologist, how it can be used in unison with other epidemiological approaches widely used, and how it can help convey information to the general public and decision makers.

I have identified ten key topics that I will briefly explore, with examples. These will be split between two blogs, just to keep them to coffee break length! In total, the two blogs identify ten major areas of epidemiological study and the scope of GIS to provide an analytical framework. In Part 1 I’ll outline the first five areas. In Part 2 I’ll round it out with the remaining five.

Characteristics of health data

Even the simplest health event data will be collected, analyzed and reported in very different ways. Total numbers of cases and rates of health events are often used interchangeably, yet each conveys very different information.

The total number of health events can be valuable for capacity planning and funding. In times of health response, the number of health events such as death, birth and hospitalization are valuable to quantify the extent of any prevention measures required, or indeed, healthcare that may be needed.

In most other situations, the number of health events can only be understood with reference to the size of the population from which it is derived. In epidemiology, a rate is the frequency of event occurrence in a defined population over a specified period of time. Rates are, therefore, useful for comparing health events in different populations.
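As a toy illustration (with invented numbers), the same case count translates into very different rates once population size is taken into account:

```python
# Toy illustration with invented numbers: the same count of cases implies a
# very different burden in populations of different sizes.
cases = {"Region A": 500, "Region B": 500}
population = {"Region A": 50_000, "Region B": 5_000_000}

for region in cases:
    rate = cases[region] / population[region] * 100_000   # cases per 100,000
    print(f"{region}: {cases[region]} cases, {rate:,.1f} per 100,000")
# Region A: 500 cases, 1,000.0 per 100,000
# Region B: 500 cases, 10.0 per 100,000
```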

Mapping totals and rates also requires different techniques, most commonly using proportional symbols and choropleths respectively. The projection used to display your map should also be a consideration, particularly with rates, when values are shown by area, and particularly with larger areas (i.e. smaller scales).

Health data distributions

Prior to any modeling, data needs to be explored and well understood. Many approaches require a number of assumptions to be met. Health events are usually characterized by infrequent, sometimes recurring, events (for example hospitalizations) that are non-normally distributed and highly positively skewed, often following a Poisson distribution (the Poisson distribution describes the occurrence of rare events in a large population). In most health analysis, there are often strong interrelationships, and data collinearity is an important consideration for some methods.

To understand data distributions, histograms and boxplots, together with statistics such as skewness and kurtosis, can be used. Data correlations between variables can be evaluated using scatterplots and scatterplot matrices, while regression analysis can be used to estimate the strength and direction of the relationship between dependent and independent variables. Spatial data distributions should also be analyzed to check for data gaps, patterns or skew.

A histogram allows the distribution of numeric data to be explored. They allow visual assessment of distribution shape, central tendency, data variation and gaps or outliers in data values. Some statistics can be added to the histogram such as the mean, median and normal distribution. Additional related statistics can also be calculated on the data and, in ArcGIS Insights, are automatically included on the back of the chart cards to quantify the chart. A histogram with normal distribution is symmetrical and will have a skewness of 0. The direction of skewness is shown by the tail of the distribution so if the tail on the right is longer (as shown above), the skewness is positive. If the tail on the left side is longer, skewness is negative.
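As a quick sketch (outside Insights, using synthetic Poisson-style counts), the shape statistics mentioned above can be computed directly:

```python
# A quick sketch (outside Insights) of how skewness and kurtosis quantify
# histogram shape, using synthetic Poisson-style counts of rare events.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
events = rng.poisson(lam=2.0, size=1000)      # rare-event counts, right-skewed

print("mean    :", events.mean())
print("median  :", np.median(events))
print("skewness:", stats.skew(events))        # > 0 means a longer right tail
print("kurtosis:", stats.kurtosis(events))

counts, bin_edges = np.histogram(events, bins=10)   # the histogram itself
```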

Box plots can be grouped by a categorical variable, such as state, which allows for a comparison of distributions. The data is plotted so that 50% of the data is inside the box between the lower (Q1) and upper (Q3) quartile and, the median is shown as a line. Whiskers contain a further 25% of the data, above and below the interquartile range (IQR), which is the length of the box (Upper quartile – lower quartile). Values that extend beyond 1.5 IQR are outliers.

Visually exploring data is a key step of analysis and can mitigate modeling errors. During modeling, data is often aggregated to ensure that there are enough data points in the analysis for it to have statistical robustness, but this step can hide missing data or data collection changes, such as changes in international classification of disease coding practices.

Different visualizations will give a different perspective on data and being able to explore and visualize data in numerous ways can help with understanding many aspects of the study data. The more involved the analysis, the more important it is to describe and visualize data before any modeling is carried out.

Temporal dimensions of health data

Time associations and patterns with epidemiological data are most commonly visualized using line graphs for continuous date/time data, and epidemic curves that traditionally use bars without gaps.

Epidemic curves graphically show the frequency of new cases compared to the date of disease onset. An epidemic or epi curve shows the date or time of illness onset among cases on the x-axis and the number of cases on the y-axis. The unit of time used is based on the incubation period of the disease and the time over which cases are distributed. The overall shape of the curve can reveal the type of outbreak (for example, common source, point source or propagated).
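A minimal sketch of such a curve, using invented onset dates and a hypothetical pandas/matplotlib workflow:

```python
# Minimal sketch of an epi curve: daily case counts by onset date drawn as
# bars with no gaps. The onset dates below are invented for illustration.
import pandas as pd
import matplotlib.pyplot as plt

onsets = pd.to_datetime([
    "2020-04-01", "2020-04-01", "2020-04-02", "2020-04-03",
    "2020-04-03", "2020-04-03", "2020-04-04", "2020-04-06",
])

daily = (pd.Series(1, index=onsets)
           .resample("D").sum()               # cases per day
           .reindex(pd.date_range("2020-04-01", "2020-04-07"), fill_value=0))

plt.bar(daily.index, daily.values, width=1.0)  # width=1.0 leaves no gaps
plt.xlabel("Date of illness onset")
plt.ylabel("Number of cases")
plt.title("Epidemic curve (illustrative data)")
plt.show()
```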

Epidemiological analyses can involve data that spans long periods of time (to capture sufficient events or rare outcomes), within which there may have been many changes to the data collection methodology. As part of the process of analysis, input data should be well understood, and limitations noted particularly for studies with complex interactions that may not be fully understood. The same might be true for new diseases which, by definition, will be poorly understood. Although past information and similar events will be used to understand potential patterns of disease spread over space and time, data reported in the early phases will be prone to unknown (and unquantifiable) error and uncertainty. This uncertainty has the additional impact of making it difficult to understand if previous events are in fact similar and, therefore, comparable.

Visualizing temporal data on a timeline helps to reveal data gaps, for example, in data collection. Analyzing data that may vary over space and time should not be done without evaluating the data prior to analysis, both temporally and spatially.

A lot of temporal analysis will use generic data, such as the results of decennial census surveys, to evaluate patterns among different population sub-groups. However, the further you are from a census year, the more the accuracy of that data will reduce. Although this limitation must be accepted, exploring the temporal differences between the known data may help modeling and can certainly help interpretation.

Dealing with different health geographies

Intervention and response areas can differ from those used for epidemiological analysis, with each having very different requirements. Response needs may be driven by health regions, for example, whereas analysis tends to be more closely aligned to census areas due to ancillary data availability and (often assumed) socio-economic homogeneity of those areas.

Spatial analysis can be used to define the study area(s). Filtering the data can be done by selecting areas from the map or using additional boundary datasets. This can be valuable to sub-divide data into exposed populations or cases and non-exposed or control populations. Most of the data used for analysis will be aggregated based on administrative boundaries, whereas exposed populations are often not defined by administrative areas.

In some cases, when the dataset contains spatial units as a data field, data can be analyzed non-spatially by different geographic boundaries. In other cases, when the data needs to be ‘shifted’ to geographic areas not contained in the dataset, spatial location can be used to ‘move’ the data to different areas. In these cases, the data may be available as individual counts or even as totals by area. Reapportionment of data between different geographies permits the translation of data between very different geographies and, thus, allows reporting of aggregated data at different boundaries.
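As an illustration, here is a rough area-weighted reapportionment sketch with GeoPandas; the file and column names are assumptions, and simple area weighting assumes events are spread evenly within each source area:

```python
# A rough area-weighted reapportionment sketch with GeoPandas. File, layer and
# column names are assumptions, and simple area weighting assumes events are
# spread evenly within each source area.
import geopandas as gpd

source = gpd.read_file("health_regions.shp")   # has a 'cases' total per region
target = gpd.read_file("census_tracts.shp")    # boundaries we want totals for

source = source.to_crs(target.crs)
source["src_area"] = source.geometry.area

# Intersect the two layers, then weight each source total by the share of the
# source polygon's area falling inside each target polygon.
pieces = gpd.overlay(source, target, how="intersection")
pieces["weight"] = pieces.geometry.area / pieces["src_area"]
pieces["cases_share"] = pieces["cases"] * pieces["weight"]

tract_totals = pieces.groupby("tract_id")["cases_share"].sum()
print(tract_totals.head())
```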

Traditionally, there have been marked socio-economic differences between urban and rural populations. Although this trend is starting to change, spatial data accuracy and precision are often linked to population density, with rural areas tending to cover large areas that can encompass marked social and economic differences. These differences can result in disparities between urban and rural areas. Incorporating spatial analysis ensures that data can easily be stratified, for example by urban/rural areas for epidemiological modeling.

Different types of data joins for health analysis

Traditionally, a GIS stores spatial data as a feature by location. The data may be raster, using regular cells, or vector, using points, lines or polygons (areas). At each location there may be one or more associated pieces of information (for example, population by administrative area). However, in epidemiology, almost all analysis must include multiple components by location (for example, population by age and gender breakdown). Technically, this requires a one-to-many (feature to health and demographic variables) relationship.

To overcome these different data structures, data may need to be joined as a step of the analysis so that each location, be that point, line or area, can be associated with multiple attributes or rows of information. This is a crucial step in ensuring that spatial and epidemiological analysis can be successfully integrated. Furthermore, in some cases, compound joins (for example, using location and time) are needed.
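For example, a minimal sketch of such a one-to-many attribute join (with hypothetical file and field names):

```python
# Minimal sketch of a one-to-many attribute join: one geometry per area, many
# demographic rows per area. File and field names are hypothetical.
import geopandas as gpd
import pandas as pd

areas = gpd.read_file("admin_areas.shp")            # one row per area_id
demog = pd.read_csv("cases_by_age_and_sex.csv")     # many rows per area_id

joined = areas.merge(demog, on="area_id", how="left")   # one-to-many join

# A compound join on location and time simply uses both keys, e.g.:
# panel = area_weeks.merge(weekly_cases, on=["area_id", "week"], how="left")
```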

Summary

This blog has briefly outlined five topics of consideration in epidemiology and how ArcGIS Insights can be used as part of the analysis solution.

Many of these topics are far more involved and, as with all analytical work, effective analysis requires reliable data, in tandem with sound knowledge of previous relevant studies. An epidemiologist should be well versed in dealing with a lack of either and often, this is where true expertise lies.

Complex models and effective communication of results are a key part of the process. In Part 2 of this blog, we will explore those topics amongst others.


If you use this material for teaching, research or anything else please let me (Andy) know via Twitter or email (a [dot] maclachlan [at] ucl [dot] ac [dot] uk).

Share — copy and redistribute the material in any medium or format

Adapt — remix, transform, and build upon the material for any purpose, even commercially.

However, you must give appropriate credit, provide a link to the license, and indicate if changes were made. If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

But, you do not have to comply with the license for elements of the material in the public domain or where your use is permitted by an applicable exception or limitation.

The code within this practical book is available under the MIT license so it is free to use (for any purpose) as long as you cite the source.


Scatter Plot Navigation Controls

  • To zoom in or out of the plot, click in the Scatter Plot and roll the mouse wheel up to zoom in, and down to zoom out. Or, press and hold the middle mouse button (wheel) and use Ctrl+drag to draw a box around the area to zoom into. To reset the plot view, click the Reset Range button.
  • If the Scatter Plot is not in Full Band mode, you can zoom into both the Image window and Scatter Plot. Click in the Image window and roll the mouse wheel up or down.
  • You can click on other manipulators in the main toolbar to zoom, pan, fly, and so forth in the Image window. To return the focus to the Scatter Plot window, click the Scatter Plot Tool button.

In this important update, SpaceStat extended the range of import/export file types to the geodatabase (gdb) format. SpaceStat’s advanced visualization, space-time analysis, and modeling techniques are easily integrated into workflows that use Esri technologies. For example, you can use Esri’s ArcGIS to acquire, edit, and manipulate your data, and then use SpaceStat to analyze time-dynamic data to target health interventions, to assess health disparities, and to undertake predictive modeling.

(Note on terminology change: Based on the results of usability studies and a survey we conducted, in this release, we have changed the method name “spatial interpolation” to “scale conversion/interpolation” to help our users understand the multiple applications this procedure can serve for our users.)

(Esri, and esri.com are trademarks, registered trademarks, or service marks of Esri in the United States, the European Community, or certain other jurisdictions.)



Image processing application to display and analyze geospatial images


Version 3.51 - published on 10 Sep 2020


Abstract

MultiSpec is an image processing tool to display and analyze geospatial images. The online version has all of the features in the Macintosh and Windows desktop versions. More information on MultiSpec can be found at the MultiSpec site.

Note that you need to create an account (register) on mygeohub. You can set it up so that you stay logged in so that you do not have to log in each time. (Note that there will be a delay after registering before the account is approved. Send an email to biehl@purdue.edu to check into getting the account approved.)

The MultiSpec Reference contains the documentation for MultiSpec. Several tutorials (listed below) are also available.

Tutorials on using the Processor->Display Image menu item are at:

- Tutorial 2: Image enhancement features.

A tutorial on Unsupervised Classification is at:

- Tutorial 3: Uses the Processor->Cluster menu item.

A tutorial on Supervised Classification is at:

- Tutorial 4: Uses the Processor->Statistics menu item (and several more menu items).

Other tutorials highlighting features in MultiSpec are:

- Tutorial 5: Combining Separate Image Files into a Single Multispectral Image File.

- Tutorial 6: Overlay Shape Files on Image Window.

- Tutorial 7: Selecting Areas in Image Window and the Coordinate View.

- Tutorial 8: Creating Vegetation Indices Images.

- Tutorial 9: Handling HDF and netCDF Formatted Image Files.

- Tutorial 10: Visualizing Growing Degree Day (GDD) Images.

Changes were made so that the channel descriptions will be associated with the Landsat Analysis Ready Data (ARD) sets and the Sentinel image files. Sentinel image files will be recognized as such as long as S2A_ or S2B_ appears somewhere in the full path name.

A change was made so that the histogram statistics for ERDAS Imagine formatted files will be read correctly. Previously this did not work for some Imagine formatted files.

A fix was made so that MultiSpec will save the correct area when an image window selection was being used instead of the entire image window.

A fix was made so that MultiSpec would not crash when saving histograms to a disk file. Crashes occurred frequently with the MacOS version and less often in the Windows and Online versions. Changes were also made in the formatting for histogram summaries.

The maximum length for channel descriptions, which are included in channel dialog boxes and processor output in the text window, was changed from 16 to 24. The default channel descriptions for known sensors such as those for Landsat and Sentinel now include the band identification (as Bn) before the wavelength information. MultiSpec, by default, attempts to put the bands in wavelength order, which in some cases is not the order of the sensor band identification.

Version 3.33 (03/31/2020) fixes a problem in the Edit->Map Parameters menu item which caused geographic coordinate systems specified by EPSG codes to not be recognized and/or handled correctly. Also a change was made to allow more precision for horizontal and vertical pixel sizes. This may be needed for geographic coordinate systems.

Version 3.32 (02/20/2020) includes changes in the license information in each of the files in preparation to making the code for MultiSpec Online open source.

See cjlin/libsvm for details about this SVM classifier and the options available.

Shift Key: If one holds the shift key down, the cursor will change to an eye. Clicking the (left) mouse button down will change the color of the class or group to the background color. Releasing the mouse button will change the color back to the original.

Shift and Control or z or / Keys: If one holds the shift and control or z or / keys down, the cursor will change to an eye. Clicking the mouse button down will change the color of all of the other classes or groups to the background color. Releasing the mouse button will change the colors back to the original. Note that whether one uses the Control or z or / keys is browser dependent. The control key does not work in some browsers. Therefore other options are provided; they are not perfect, but the capability does work.

Shift and Option or a or ' Keys: If one holds shift and option or a or ' keys down, the cursor will change to an eye. Clicking the mouse button down will change the colors of this class or group and all with class/group numbers less than the selected one to the background color. Releasing the mouse button will change the colors back to the original. This option was specifically made available for probability images generated by the classify processor.


Land Classification and Land Use

After the completion of my Geo-referencing tasks (years 1995, 1975, and 1959), I was given the option between more Geo-referencing (1965) or a slightly different route, which consisted of creating a method to classify land types and land uses. If it wasn’t obvious by the title, I chose Geo-referencing 1965…

Land classification is the method of determining what a feature is on imagery purely based on pixel value (pixel value can be interpreted differently depending on the situation). This allows for a colorful rendition and separation, which results in an easy to read and visualize context of where different features are located. Results can vary and are heavily reliant on image quality. The lower quality the image or imagery, the more generalization and inaccuracy of the classifications.

Anyway, land classification can be simple and it can also be quite difficult. If you are using tools that already exist, or software that are built to classify imagery, you can easily begin land classification/land use. If you are using preexisting material it will quickly become a matter of finding the right combination of numbers in order to get the classifications you want. This method is not too difficult, just more tedious in regards to acquiring your result. However, if you approach it from scratch, it will be significantly more engaging. In order to approach it from the bottom up, you have to essentially dissect the process. You have to analyze your imagery, extract pixel values, group the pixel values, combine all of them into a single file, and finally symbolize them based on attribution or pixel value which was recorded earlier. It is much easier said than done.

I am currently approaching the task via already created tools, however if I had a choice in the matter, I would have approached it via the bottom up method and attempted to create it from scratch as there is more learning in that and it is much more appealing to me. Regardless, I am creating info files, or files that contain the numbers, ranges, and classifications I am using to determine good land classifications. In contrast to what I stated earlier, this is quite difficult for me as the imagery is low quality and I am not a fan of continuously typing in ranges until I thread the needle.

The current tool I am using is the reclassify tool that is available through the ESRI suite and it requires the Spatial Analyst extension. This tool allows for the input of a single image, ranges you would like to use to classify the selected image, and output file. After much testing, I am pretty sure there can only be a maximum of 24 classifications (which is probably more than enough). In addition, the tool can be batch run (as most ESRI tools can be), which means it can be run on multiple images at once. This is a much needed feature for many situations, as I presume most times, individuals are not going to classify one image and be done (or at least I am not going to be one and done).
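For anyone curious, a rough scripted sketch of the same idea (with placeholder paths and ranges, not the ones I am actually using) might look like this:

```python
# A rough scripted sketch of the same idea, with placeholder paths and ranges
# (not the ones actually used in the project). Requires Spatial Analyst.
import arcpy
from arcpy.sa import Reclassify, RemapRange

arcpy.CheckOutExtension("Spatial")

# Each triple is [start_value, end_value, new_class].
remap = RemapRange([[0, 60, 1],       # e.g. water
                    [60, 130, 2],     # e.g. vegetation
                    [130, 255, 3]])   # e.g. built-up / bare ground

for image in [r"C:\imagery\scene1.tif", r"C:\imagery\scene2.tif"]:
    out = Reclassify(image, "Value", remap, "NODATA")
    out.save(image.replace(".tif", "_reclass.tif"))
```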

That is an image that was reclassified using the reclassify tool. I am not sure how good of a classification this is as I have not fully grasped the tool yet and every time I give it ranges, it spits out the same generic ranges that I did not input (which is a bit frustrating, but it comes with the territory). I am sure it is human error though and not the tool messing up. I am not sure what the final result is supposed to be, but I will be sure to fill you in once I achieve it (if I ever do…).


Analyzing the Distribution of a Single Variable

Histogram

We begin our analysis with the simple description of the distribution of a single variable. Arguably the most familiar statistical graphic is the histogram, which is a discrete representation of the density function of a variable. In essence, the range of the variable (the difference between maximum and minimum) is divided into a number of equal intervals (or bins), and the number of observations that fall within each bin is depicted in a bar graph.
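Conceptually, the computation behind the graph is a simple binning operation, sketched here in Python with made-up values (GeoDa does this internally):

```python
# What the histogram computes, sketched in Python with made-up values
# (GeoDa does this internally): equal-width bins over the range, then counts.
import numpy as np

values = np.array([8.4, 12.1, 19.7, 25.3, 33.0, 38.2, 41.5, 47.9, 55.4])
counts, edges = np.histogram(values, bins=7)     # 7 equal intervals over max - min

for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo:6.2f}, {hi:6.2f}) : {c} observations")
```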

The histogram functionality is started by selecting Explore > Histogram from the menu, or by clicking on the Histogram toolbar icon, the left-most icon in the set in Figure 2.

This brings up the Variable Settings dialog, which lists all the numeric variables in the data set (string variables cannot be analyzed). Scroll down the list as in Figure 3 until you can select kids2000, the percentage of households with kids under age 18 in 2000. This is the same variable we used to illustrate some of the mapping functionality.

Figure 3: Histogram variable selection

After clicking OK, the default histogram appears, showing the distribution of the 55 observations over seven bins, as in Figure 4. Interestingly, we find that the second bin lacks observations, suggesting that a different set of intervals may be more appropriate.

Figure 4: Default histogram

There are a number of important options for the histogram. Arguably the most important one is to set the number of bins or, alternatively, the values for the cut points.

The histogram options shown in Figure 5 are brought up in the usual fashion, by right clicking on the graph.

Figure 5: Choose intervals histogram option

Selecting the number of histogram bins

The Choose Intervals option, shown in Figure 5, allows for the customization of the number of bins in the histogram. A dialog appears that lets you set this value explicitly. The default is 7, but in our example, we change this to 5, as in Figure 6.

Figure 6: Histogram intervals set to 5

The resulting histogram now has five bars, as in Figure 7.

Figure 7: Histogram with 5 intervals

This takes care of the problem of the bin with missing observations.

Using a custom classification

Recall how we created a custom map classification based on the range of values for kids2000, and labeled it custom1. If you have loaded the project file with the NYC data, then this custom classification will be listed as an option for Histogram Classification, as shown in Figure 8. If you started from scratch, you will have to recreate the custom classification (for specifics, see the mapping chapter).

Figure 8: Selecting a custom histogram classification

The custom classification is the way GeoDa allows for cut points to be specified, instead of the number of bins. With custom1 selected, the histogram takes the form as in Figure 9, with six bins, as defined by this classification. The histogram has the exact same shape as the one portrayed in the Category Editor interface when creating these custom categories.

Figure 9: Histogram with custom intervals

Histograms for categorical variables

The default logic behind the histogram is to consider the range of the variable of interest (max - min) and compute the cut points based on the specified number of bins. For categorical variables, this leads to undesirable results.

To illustrate this, we create a map for kids2000 with the custom1 categories, and use Save Categories to create a categorical variable (say catkid20) for this classification. The default histogram for this variable is as in Figure 10, clearly not something that reflects the discrete integer values associated with the categories. Rather, the cut points are based on the range of 5, divided by the default number of bins of 7, or a bin width of approximately 0.7. Indeed, the first bin goes from 1 to 1.7.

Figure 10: Default histogram for categorical variables

The View option of the histogram provides a way to deal with categorical variables by means of the Set as Unique Value item, shown in Figure 11. This option recognizes the discrete nature of the categorical variable and adjusts the cut point accordingly.

Figure 11: Selecting a unique value histogram classification

The result is shown in Figure 12, with six categories each associated with an identifying integer value.

Figure 12: Unique value histogram for categorical variables
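The difference between the two treatments can be sketched outside GeoDa with a few lines of Python (made-up category codes):

```python
# Sketch of the difference between the two treatments, using made-up category
# codes 1..6: equal-width bins give a first bin of width ~0.7, while the
# unique-value treatment gives one bar per category.
import numpy as np

catkid = np.array([1, 1, 2, 2, 2, 3, 4, 4, 5, 6, 6])

counts, edges = np.histogram(catkid, bins=7)          # default-style binning
print(np.round(edges, 2))                             # first bin: 1.0 to ~1.71

values, freq = np.unique(catkid, return_counts=True)  # unique-value treatment
print(dict(zip(values.tolist(), freq.tolist())))
```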

Display histogram statistics

An important option for the histogram (and any other statistical graph) is to be able to display descriptive statistics for the variable of interest. This is accomplished by selecting Display Statistics in the View option for the histogram (see Figure 11)

This option adds a number of descriptors below the graph. The summary statistics are given at the bottom, illustrated in Figure 13 for kids2000 with custom categories. We see that the 55 observations have a minimum value of 8.3815, a maximum of 55.3666, median of 38.2278, mean of 36.04 and a standard deviation of 11.2881. In addition, for the histogram, descriptive statistics are provided for each interval, showing the range for the interval, the number of observations as a count and as a percentage of the total number of observations, and the number of standard deviations away from the mean for the center of the bin. This allows us to identify potential outliers, e.g., as defined by those observations more than two standard deviations from the mean. In our example, no category satisfies this criterion.

The summary characteristics for a given bin also appear in the status bar when the cursor is moved over the corresponding bar. This works whether the descriptive statistics option is on or not. In our example in Figure 13, the cursor is over the fourth category.

Figure 13: Histogram with descriptive statistics

Other histogram options

Other items available in the View option include customizing the precision of the axes and displayed statistics, respectively through View > Set Display Precision on Axes and View > Set Display Precision.

In addition, standard options for the histogram include adjustments to various color settings (Color), saving the selection (similar to what we saw for the map functionality), Copy the Image to Clipboard and saving the graph as an image file (again, identical to the map functionality).

Linking a histogram and a map

To illustrate the concept of linked graphs and maps, we continue with the custom histogram and make sure the default themeless map is available. When we select the two right-most bars in the histogram (click and shift-click to expand the selection), the highlighted bars keep their color, whereas the non-selected ones become transparent, as in the right-hand graph in Figure 14. This is the standard approach to visualize a selection in a graph in GeoDa.

Immediately upon selection of the bars in the graph, the corresponding observations in the map are also highlighted, as in the left-hand graph in Figure 14. In our current example, the map is a simple themeless map (all areal units are green), but in more realistic applications, the map can be any type of choropleth map, for the same variable or for a different variable. The latter can be very useful in the exploration of categorical overlap between variables.

Figure 14: Linking a histogram and a map

The reverse linking works as well. For example, using a rectangular selection tool on the themeless map, we can select sub-boroughs in Manhattan and adjoining Brooklyn, as in the map in Figure 15. The linked histogram (right-hand graph in Figure 15) will show the attribute distribution for the selected spatial units as highlighted fractions of the bars (the transparent bars correspond to the unselected areal units).

In practice, we will be interested in assessing the extent to which the distribution of the selected observations (e.g., a sub-region) matches the overall distribution. When it does not, this may reveal the presence of spatial heterogeneity, to which we return below.

Figure 15: Linking a map and a histogram

As we have seen before, it is also possible to save the selection in the form of a 0-1 indicator variable with the Save Selection option.

The technique of linking, and its dynamic counterpart of brushing (more later), is central to the data exploration philosophy that is behind GeoDa (for a more elaborate exposition of the philosophy behind GeoDa, see Anselin, Syabri, and Kho 2006).

Box Plot

A box plot is an alternative visualization of the distribution of a single variable. It is invoked as Explore > Box Plot, or by selecting the Box Plot as the second icon from the left in the toolbar, shown in Figure 2.

Identical to the approach followed for the histogram, next appears a Variable Settings dialog to select the variable. In GeoDa, the default is that the variable from any previous analysis is already selected. In our example, we change this to the variable rent2008, which we already encountered in the illustration of the box map in the mapping Chapter.

The box plot for rent2008 is shown in Figure 16 (make sure to turn off any previous selection of observations).

Figure 16: Default box plot

The box plot focuses on the quartiles of the distribution. The data points are sorted from small to large. The median (50 percent point) is represented by the horizontal orange bar in the middle of the distribution. The green dot above corresponds with the mean.

The brown rectangle goes from the first quartile (25th percentile) to the third quartile (75th percentile). The difference between the values that correspond to the third (1362.5) and the first quartile (1000) is referred to as the inter-quartile range (IQR). The inter-quartile range is a measure of the spread of the distribution, a non-parametric counterpart to the standard deviation. In our example, the IQR is 362.5 (1362.5 - 1000).

The horizontal lines drawn at the top and bottom of the graph are the so-called fences or hinges. They correspond to the values of the first quartile less 1.5×IQR (i.e., 1000 − 543.75 = 456.25), and the third quartile plus 1.5×IQR (i.e., 1362.5 + 543.75 = 1906.25). Observations that fall outside the fences are considered to be outliers.
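Using the quartiles reported above (Q1 = 1000, Q3 = 1362.5 for rent2008), the fence computation can be sketched directly; the hinge multiplier defaults to 1.5:

```python
# Sketch of the fence computation from the quartiles reported above
# (Q1 = 1000, Q3 = 1362.5 for rent2008), with the default hinge of 1.5.
def box_plot_fences(q1, q3, hinge=1.5):
    iqr = q3 - q1
    return q1 - hinge * iqr, q3 + hinge * iqr

print(box_plot_fences(1000, 1362.5))             # (456.25, 1906.25)
print(box_plot_fences(1000, 1362.5, hinge=3.0))  # stricter outlier definition
```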

In our example in Figure 16, we have a single lower outlier value (corresponding to three observations), and six upper outlier observations. Note that the lower outliers are the observations that correspond with a value of 0 (the minimum), which we earlier had flagged as potentially suspicious. The outlier detection would seem to confirm this. Checking for strange values that may possibly be coding errors or suggest other measurement problems is one of the very useful applications of a box plot.

Box plot options

The default in GeoDa is to list the summary statistics at the bottom of the box plot. As was the case for the histogram, the statistics include the minimum, maximum, mean, median and standard deviation. In addition, the values for the first and third quartile and the resulting IQR are given as well. The listing of descriptive statistics can be turned off by unchecking View > Display Statistics (i.e., the default is the reverse of what held for the histogram, where the statistics had to be invoked explicitly).

The typical multiplier for the IQR to determine outliers is 1.5 (roughly equivalent to the practice of using two standard deviations in a parametric setting). However, a value of 3.0 is fairly common as well, which considers only truly extreme observations as outliers. The multiplier to determine the fence can be changed with the Hinge > 3.0 option (right click in the plot to select the options menu, and then choose the hinge value, as in Figure 17).

Figure 17: Change the box plot hinge

The resulting box plot, shown in Figure 18, no longer characterizes the lowest value as an outlier.

Figure 18: Box plot with hinge = 3.0

The other options for the box plot can be seen in Figure 17. Except for the Hinge option, these are the same as for the histogram, and are not further considered here.

Also, as is the case for any graph in GeoDa, linking and brushing are implemented, as already illustrated in the mapping Chapter.

The main purpose of the box plot in an exploratory strategy is to identify outlier observations. We have already seen how that is implemented in the idea of a box map to show whether such outliers also coincide in space. In later Chapters, we will cover more formal methods to assess such patterns.


6. Conclusions

This study was designed as a demonstration project to quantify the spatial and temporal characteristics of supercells across Oklahoma over a 10-yr period. A criteria-based approach was applied to the identification and classification of storm types using level-II and level-III radar data. Furthermore, GIS was utilized in a new and innovative way to organize, visualize, and analyze the spatial aspects of storms across various time scales. This methodology resulted in the identification of 943 supercells across Oklahoma during 1994–2003. While the observation of nearly 1000 supercells during a decade is quite significant, the sample size is too small to represent long-term spatial and temporal characteristics of supercell thunderstorms across Oklahoma.

A number of key findings resulted from the spatiotemporal analysis of supercells across Oklahoma during the limited 10-yr demonstration study period. Key results included the following:

  • The location of the maxima of supercell occurrences was identified across three main regions: east-central Oklahoma, southwest Oklahoma, and west-central into northeast Oklahoma.
  • The mean supercell initiation location moved west between January and September and moved east from September through the end of the calendar year.
  • Initiation was most frequent between 2000 and 0000 UTC.
  • Termination was most common between 2300 and 0300 UTC.
  • Supercell initiation density was the greatest across portions of southwest, north-central, and east-central Oklahoma.
  • Supercell termination density was most common across northern and northeastern Oklahoma.
  • The month of May was composed of three important climatological features: a supercell outbreak peak in early May, a midmonth relative minimum of activity, and a peak in supercell days at the end of May.
  • The secondary supercell season was identified during late September to early October.
  • The monthly mean supercell tracks were oriented from southwest to northeast from January through May, from northwest to southeast from June through September, and from southwest to northeast through the end of the year.

Storm report data were analyzed using several spatial density tools and revealed that the distribution of point reports (wind, hail, and tornadoes) was approximately correlated with population centers. The density of tornado tracks did not exhibit the same population bias; however, only north-central Oklahoma was strongly correlated with supercell locations for the same period. Overall, the GIS-based supercell dataset was found to be a valuable, new form of storm archive that enabled the efficient query of past storms, powerful spatial analyses, and multiple data overlay. The combined use of radar storm classification and GIS as a database creation and analysis tool proved highly effective in quantifying the spatial characteristics of past supercells across Oklahoma during a 10-yr period. If applied on a larger scale, utilizing a set of more automated methods such as storm algorithm identification combined with quality assurance measures, similar detailed analyses could be extended to larger regions of the United States over longer periods of time.

It is the authors’ recommendation that a national center be given the task of creating an automated framework for developing GIS datasets consisting of critical storm information gathered in a real-time, quality-assured manner. While Storm Data will continue to serve as a useful storm reporting and National Weather Service verification tool, new approaches are needed to more effectively document and research storm occurrences. For example, with the availability of extensive WSR-88D coverage across the country, the potential exists for more effective use and storage of important radar-derived storm features such as hail detections, mesocyclone detections, or storm cell identification and tracking information. The storage of such data into GIS datasets would enable effective data mining of past storm days, facilitate incorporation with other datasets, and ultimately foster further meteorological research and data discovery. The resultant storm datasets would provide beneficial information to a range of sectors, including forecast operations, synoptic and mesoscale research, and economic interests. With continued increases in GIS-compatible meteorological datasets, such as the ones proposed herein, it appears likely that GIS will serve as an important tool for archiving, visualizing, and analyzing a vast array of meteorological data in the future.


Statistics of Multiple Attributes

A data set often has multiple attributes that may or may not depend on each other.

Dependence and Independence

Quite often two sets of data may be related to each other, at the very least because their values are measured at the same time or location, or both. For example, a weather station might make hourly measurements of temperature, humidity, wind speed, etc.

Census data is another common example, such as the layer MASSCENSUS2010BLOCKGROUPS.shp , whose attribute table includes information not only about total population but also the white population, black population, hispanic population, housing units, etc. in particular locations in a particular year:

Beyond the basic connection they have due to their location-based collection, these different sets of data might have other relationships, e.g. there can be simple constraints of definition such as:

POP_2010 = POP_WHITE + POP_BLACK + POP_NATV + POP_ASN + POP_ISLND + POP_OTHER + POP_MULTI

See the U.S. Census Bureau's document “About Race” to learn how they define these categories.

The Census Bureau also allows for the possibility that a person of Hispanic or Latinx ethnicity could be in any one of these categories. See the U.S. Census Bureau Guidance on the Presentation and Comparison of Race and Hispanic Origin Data for more information.

Importantly, there can also be more complicated relationships resulting from societal factors. For example, the ratio of blacks to whites is not uniform; the two populations tend to be inversely related, as whites and blacks cluster in different locations.

The relationship between different attributes can be visualized, to some extent, by plotting each pair within a record on a two-dimensional graph of their values, which is known as a scatterplot.

Procedure 5: Visualizing Attribute Relationships with Scatterplots

  1. In ArcMap , menu View , then select the menu item Graphs , and then select the menu item Create Scatterplot Matrix… .
  2. In the dialog Create Scatterplot Matrix Wizard , in the menu Layer/Table , select the layer or table of interest, e.g. MASSCENSUS2010BLOCKGROUPS.shp .

  1. Show all features/records with selected items highlighted (the default)
  2. Show all features/records with selected items appearing the same as others
  3. Show only selected records.

Scatterplots often reveal several types of relationships between attributes:

    A linear relationship, clearly visible in the POP_OTHER vs. HISP graph expanded above:

Recall that α (the Greek letter “alpha”) is the intercept and β (the Greek letter “beta”) is the slope of the line.

    An inverse relationship, such as that between POP_WHITE and POP_BLACK: where there are more whites there are fewer blacks, and where there are fewer whites, there are more blacks.

Inverse relationships can often be approximated by linear relationships with negative slopes.

Some pairs of attributes may have no obvious relationship, such as POP_OTHER vs. POP_MULTI, perhaps indicating an overlap in meaning or a more complicated relationship involving other attributes. Relationships between z-scores can sometimes be clearer, because these values are mostly smaller than 1 (mathematically speaking, nonlinear terms will be less important).

When an attribute remains constant relative to another attribute, or if they have a purely random relationship, we can say that they are independent of each other; if, on the other hand, the attribute has a clear mathematical relationship to another attribute, we can say they are dependent on each other.

Somewhat confusingly, when expressed as a mathematical relationship such as the above, the attribute on the left of the equal sign is called the dependent variable or the response, and the attribute in the expression on the right is called the explanatory (or independent) variable, which implies an asymmetric relationship that requires qualification or justification.

An important aphorism to remember when considering dependent relationships is that correlation does not imply causation, i.e. two attributes may be dependent upon each other not because one causes the other, but because they both arise from a third attribute. For example, black households are more likely to have lower incomes than white households, not because being black causes lower incomes but because of their historical origins and ongoing discrimination.

Correlation

The degree to which the two sets of data have a linear relationship can be described by calculating their correlation, defined by Pearson as

ρ = (1/N) Σi zA,i zB,i

This expression multiplies two attributes’ z-scores feature-by-feature, sums the result, and divides by the total number N (replaced by N − 1 for sample data sets).

The correlation of two attributes will vary between −1 and +1, with the latter occurring if all pairs of values (ai, bi) are exactly the same (because the sum is then the same as that of the standard deviation squared), and the former when the values differ only by a minus sign.

If two attributes are independent of each other, the correlation will be close to zero. This is obviously true when one of the attributes is constant, since that value will equal its mean and its z-score will always be zero. More generally, since z-scores are distributed around zero, there will be roughly the same number of positive and negative terms, which will tend to cancel each other out.

In ArcGIS, you can calculate the correlation of two attributes by calculating their z-scores, then calculating a third attribute that is the product of their z-scores, then summarizing the latter to find its mean value. (You can also calculate a linear regression; see the next section.) Excel provides a function CORREL which is somewhat easier to use to calculate correlations.
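A small sketch of this z-score-product computation, using made-up values rather than the census attributes:

```python
# Sketch of the correlation as the mean product of z-scores, mirroring the
# manual ArcGIS approach described above; the values are made up.
import numpy as np

a = np.array([12.0, 30.0, 45.0, 22.0, 51.0])
b = np.array([3.0, 10.0, 14.0, 8.0, 18.0])

za = (a - a.mean()) / a.std()        # population z-scores (divide by N)
zb = (b - b.mean()) / b.std()

rho = (za * zb).mean()               # mean of the feature-by-feature products
print(rho, np.corrcoef(a, b)[0, 1])  # the two numbers agree
```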

For the Massachusetts data above, we can create a correlation matrix with the same form as the scatterplot matrix:

           POP_2010  POP_WHITE  POP_BLACK  POP_NATV  POP_ASN  POP_ISLND  POP_OTHER  POP_MULTI
POP_WHITE     0.88
POP_BLACK     0.13      -0.27
POP_NATV      0.16      -0.02       0.24
POP_ASN       0.33       0.10       0.10      0.00
POP_ISLND     0.13       0.03       0.12      0.15     0.06
POP_OTHER     0.14      -0.21       0.40      0.46     0.03      0.20
POP_MULTI     0.42       0.04       0.52      0.41     0.27      0.24       0.70
HISP          0.17      -0.16       0.38      0.48     0.05      0.20       0.95       0.67

The color codes indicate the strength and sign of the correlation (similar to the standardized map above). From this we see that the POP_OTHER and HISP attributes have the strongest correlation at 0.95, while for POP_WHITE and POP_BLACK there is a weak negative correlation of −0.27, both matching our visual characterization.

Question: The second strongest correlation is between the white and total populations at 0.88; why do you think that would be?

Linear Regression

An attribute such as the Hispanic population can be characterized by its mean value and standard deviation, but consider the graph at the right, which plots HISP on the y axis vs. POP_OTHER on the x axis.

The mean value of HISP, μ = 126 (the solid green line), is also plotted, along with the confidence interval 3σ = 636 (the dashed green line).

Clearly a significant fraction of the HISP data is quite far from the mean and even outside of the 3σ confidence interval — but it’s much closer to the blue line, which varies with POP_OTHER.

If we want to model the relationship between them, the simplest type of relationship between two attributes A and B is a linear one, viz.

A = α + β B

The α and β are called the intercept and slope, respectively. Note that if the slope β is zero, then A will be represented by the constant value α, which we might expect to be the mean μ.

In general there will be a dispersion of data that prevents a perfect representation by such a line, as in the graph at the right.

The difference between a dependent value and the corresponding calculated value of a representational line is known as a residual:

εi = ai − (α + β bi)

(ε is the Greek letter “epsilon”).

We’d like to calculate values for the coefficients α and β, a process known as regression. The most common procedure, least squares, is based on the idea that the line that fits the data best is the one that minimizes the sum of the squared residuals:

Σi εi²

Squaring the residuals puts values above and below the regression line on an even footing. Also note that, if the slope β is zero, the sum is the same as that in the expression for σa, since the mean μ is the value of α that minimizes the sum.
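A minimal sketch of such a least-squares fit (with synthetic values rather than the census attributes):

```python
# Minimal sketch of a least-squares fit of one attribute on another; the
# values are synthetic, not the census attributes used in the text.
import numpy as np

pop_other = np.array([10.0, 40.0, 75.0, 120.0, 200.0])    # explanatory
hisp      = np.array([30.0, 85.0, 150.0, 230.0, 370.0])   # dependent

beta, alpha = np.polyfit(pop_other, hisp, deg=1)   # slope and intercept
residuals = hisp - (alpha + beta * pop_other)

print(f"HISP ≈ {alpha:.1f} + {beta:.3f} × POP_OTHER")
print("sum of squared residuals:", (residuals ** 2).sum())
```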

Question: Where have you seen a least-squares fit previously? (Hint: the residuals were represented by blue lines between two geographic locations.)

Multiple linear regression is also possible when there is more than one explanatory variable:

A = α + β1 B1 + β2 B2 + … + βn−1 Bn−1

In this case, with n coefficients and n − 1 different explanatory variables, it’s helpful to express the latter as z-scores in order to compare their relative importance to the dependent variable. Then the slopes βk will represent the effect of a one-standard-deviation change in the corresponding variables.

The derived expressions for the intercept α and slopes βk are unenlightening and won’t be listed here. But they can be calculated with a number of tools, including Excel and ArcGIS (see below).

As an example, consider the relationship discussed earlier,

HISP = α + β × POP_OTHER

which was notable because these two attributes appear to be strongly correlated. It has a least-squares intercept and slope of α = 16.5 and β = 1.788, resulting in the equation

HISP = 16.5 + 1.788 × POP_OTHER

and the solid blue regression line that is plotted in the graph above.

Question: How might you interpret a slope of 1.788 in this case?

Goodness of Fit

How well a linear regression equation fits the data is an important consideration, and a number of statistical measures have been devised to test its goodness of fit.

The standard error of the equation, S, describes the distribution of the dependent values around the best-fit line, and is similar to the standard deviation around the mean value:

S = √( Σi εi² / (N − n) )

Again the εi are the residuals of the dependent values, and smaller values represent a smaller spread from the regression line, as seen in the graph to the right.

As before N is the number of data points, so if more of them fit within a given spread of residuals, that will reduce the standard error.

Finally, n is the number of coefficients; the more of them there are, the greater the standard error, because they add to the equation and make it easier to fit more precisely, even though the data hasn’t changed. It is therefore subtracted from the total number of data points N, which decreases the denominator and increases the standard error.

Remember “ n equations for n unknowns”? That means that one data point is required for each coefficient to determine them exactly, and the remaining N – n data points are responsible for the variation around the line (the residuals).

The standard error of the HISP(POP_OTHER) regression is S ≈ 67.

Note that in the above graph, almost all of the data lies close to the regression line, falling within the confidence interval ±3S = ±201, denoted by the dashed blue lines. This is much better than simply describing the dependent variable by its mean value, since ±3σa = ±636. This model therefore accounts for a large fraction of the variation in the HISP data, leaving a much smaller set of residuals that must be accounted for by other factors. We can say that we have partitioned the variation between the model and the remaining residuals.

The coefficient of determination, R², is a convenient and accepted way to compare the standard error of the equation S and the dependent variable’s standard deviation σa, and thereby describe the overall goodness-of-fit of the equation:

R² = 1 − Σi εi² / Σi (ai − μ)²

If the regression line perfectly fits the data, the residuals εi will all be zero and R² will be one; when the residuals approach the standard deviation of the dependent variable, the second term will be one and R² will be zero.

One way to interpret the coefficient of determination is as a generalization of correlation to a set of explanatory variables. It can be shown that, when there is only one explanatory variable, R 2 will equal the square of the correlation &rho with the dependent variable. For the HISP(POP_OTHER) regression,

which matches the correlation calculated above, since 0.95 2 = 0.90. So R 2 .

Because the coefficient of determination can improve simply by adding more explanatory variables, i.e. by increasing n, a related quantity that provides a better estimate of significance is the adjusted coefficient of determination:

R̄² = 1 − (1 − R²) (N − 1) / (N − n)

R̄² will always be less than or equal to R², and it can be negative, unlike R². The significance of your equation will be greatest when R̄² is maximized.

For the example regression, R̄² ≈ R² = 0.90, since N (4979) is much larger than n (2).

The F-statistic is another common way to analyze the dependence of your model on the number of explanatory variables you’ve chosen. It compares the “explained” variance R² that follows from these n − 1 variables to the “unexplained” variance 1 − R² remaining in the N − n unfitted data points:

F = [ R² / (n − 1) ] / [ (1 − R²) / (N − n) ]

F can be as small as 0, when the numerator R²/(n − 1) is 0: none of the variance in the dependent variable is explained.

F can be as large as ∞, when the denominator (1 − R²)/(N − n) is 0: all of the variance in the dependent variable is explained.

So the regression is better when F >> 1; for the HISP(POP_OTHER) regression, F = 45,000.
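The quantities above can be computed directly from the residuals; a small sketch with synthetic data:

```python
# Computing the goodness-of-fit quantities above (S, R², adjusted R², F) from
# the residuals of a simple regression; the data are synthetic.
import numpy as np

x = np.array([10.0, 40.0, 75.0, 120.0, 200.0, 260.0, 310.0])
y = np.array([30.0, 85.0, 150.0, 230.0, 370.0, 460.0, 560.0])

N, n = len(y), 2                               # data points, coefficients
beta, alpha = np.polyfit(x, y, deg=1)
eps = y - (alpha + beta * x)                   # residuals

S = np.sqrt((eps ** 2).sum() / (N - n))        # standard error of the equation
R2 = 1 - (eps ** 2).sum() / ((y - y.mean()) ** 2).sum()
R2_adj = 1 - (1 - R2) * (N - 1) / (N - n)      # adjusted R²
F = (R2 / (n - 1)) / ((1 - R2) / (N - n))      # F-statistic

print(S, R2, R2_adj, F)
```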

But could a different set of coefficient values be substituted and produce a better result? When coefficient values are selected with random probability and their F values are calculated, an F distribution results, such as the graph of ∂F p versus F shown at the right; clearly some F values are more likely than others.

Generally speaking, values of F >> 1 have a low probability per unit value ∂F p, and the total probability p that random coefficient values will have F > FRegression is very small, as suggested by the red portion of the F distribution graph.

Is there a significant probability p that random coefficient values could produce better results than the regression best-fit? This question is an example of a null hypothesis.

A significance level is a value of p below which you may decide to reject the null hypothesis, i.e. decide that FRegression is significant. Commonly these are stated in the form p < 0.1 or p < 0.05. The former represents a less-than-1-in-10 chance and the latter a less-than-1-in-20 chance that a random result will produce a better F.

For the HISP(POP_OTHER) regression, p ≈ 0, so FRegression is clearly significant and we can reject the null hypothesis.

Standard Errors of the Coefficients

Once the overall goodness-of-fit has been established, the individual coefficients should come under scrutiny.

Because the best-fit regression line is only one of many that could pass through the data, the coefficients also clearly have a range of values, e.g. tilting the line upward for a larger slope or downward for a smaller slope. These values therefore have their own distributions whose widths are described by standard errors, which for the HISP(POP_OTHER) regression are sα = 1.1 and sβ = 0.008.

You will commonly see coefficient errors expressed together with the coefficients in the form β ± sβ, e.g.

HISP = (16.5 ± 1.1) + (1.788 ± 0.008) × POP_OTHER

Note that this is an expression of just one possible confidence interval; to claim more certainty, a multiple of this value is generally necessary.

In addition, we can set up another null hypothesis: can these values be left out of the model with little effect, i.e. are they significantly different than zero? A simple test for their significance is based on the t-statistic:

tα = α / sα and tβ = β / sβ

Like the F-statistic, we can test these values with the t distribution, which, like the F distribution, charts the probability that a random set of values could produce the observed coefficient.

When these values are greater than two, i.e. the coefficients ± the standard errors are significantly different than zero, the values are considered good estimates. More precisely, suppose the data was completely random, e.g. HISP showed no dependence on POP_OTHER; then we would expect the coefficients to be all zero and α = μ.

The coefficient of determination for the dependence of the HISP attribute on the POP_OTHER attribute is good, but looking at the scatterplot matrix there appears to be correlation not just with POP_OTHER but also with POP_MULTI and, to a lesser extent, with POP_BLACK and POP_NATV. In general, we also know that Spanish-speaking people can be of any racial background. We may therefore be able to produce a better fit by including them in the analysis with a multiple linear regression.

Procedure 6: Multiple Linear Regression

ArcGIS provides a tool for calculating the ordinary least squares fit to a multiple linear regression of an attribute dependent on multiple other attributes, providing detailed statistical characteristics of a fit described by the equation

A = α + β1 B1 + β2 B2 + … + βn−1 Bn−1

This includes the coefficient of determination R 2 , meaning that it can also be used to calculate the correlation between any pair of attributes, too.

  1. The Ordinary Least Squares tool requires that the input feature class have an integer attribute with unique values for every feature; if your layer doesn’t already have one, open its attribute table and add a new field, e.g. UniqueID , and use the field calculator as described above to copy the attribute FID (which unfortunately can’t be used directly for this purpose).
  2. In ArcMap , open ArcToolbox (see Constructing and Sharing Maps for details).
  3. Double-click on Spatial Statistics Tools , then on Modeling Spatial Relationships , and finally on Ordinary Least Squares .
  4. In the dialog Ordinary Least Squares , in the menu Input Feature Class , select the data layer to be symbolized, e.g. MASSCENSUS2010BLOCKGROUPS . If the layer is not already added to ArcGIS, you can click instead on the button Browse to select one.
  5. In the menu Unique ID Field , choose an integer field with unique values, e.g UniqueID .
  6. In the text field Output Feature Class , choose a location and name for the output layer file, e.g. Geostatistics.gdbHISP_OLS , by typing it or by clicking on the button Browse to select it. You will probably want to put it in the same location as the data layer it’s modeling.
  7. In the menu Dependent Variable , choose the attribute you would like to explain, e.g HISP .
  8. In the menu Explanatory Variables , click on the attribute(s) that you think will explain the dependent variable, e.g POP_OTHER_Z .
  9. In the text field Output Report File , choose a location and name for an output report in PDF format, e.g. HISP_OLS_Report.pdf , by typing it or by clicking on the button Browse to select it. You will probably want to put it in the same location as the data layer it’s modeling.
  10. Optionally, you can request a Coefficient Output Table and a Diagnostic Output Table; these have almost the same information as in the PDF report, but in a table format that can and will be loaded into ArcGIS. One statistic the former provides that isn’t in the PDF report is the standard error of the equation S .
  11. Click on the button OK .
  12. If you have turned off background processing (see Constructing and Sharing Maps for details), the dialog Ordinary Least Squares will appear, describing the process, and eventually displaying the Completed results (you may need to enlarge the window and scroll up to see everything):

Quite a few statistical characteristics are included here, including the ones we have already described. In particular, this model of the hispanic population

Again, if there are a large number of polygons you may want to turn off the polygon outlines as described in step 12 of Procedure 2.

Excel provides a function LINEST that can also be used to calculate regression coefficients and standard errors, but it’s a bit cumbersome to use.
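A scripted alternative is a statsmodels OLS fit, which reports the same diagnostics; this is only a sketch, assuming the attribute table has been exported to a CSV with these column names:

```python
# A scripted alternative to the ArcGIS OLS tool or Excel's LINEST: a multiple
# linear regression with statsmodels. The CSV and column names are assumptions
# (a hypothetical export of the block-group attribute table).
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mass_blockgroups.csv")

X = sm.add_constant(df[["POP_OTHER", "POP_MULTI", "POP_BLACK", "POP_NATV"]])
model = sm.OLS(df["HISP"], X).fit()

print(model.params)      # intercept (const) and slopes
print(model.bse)         # standard errors of the coefficients
print(model.tvalues)     # t-statistics
print(model.rsquared, model.rsquared_adj, model.fvalue)  # R², adjusted R², F
```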