
Using Zonal Statistics As Table for overlapping polygons in ArcPy?



I'm using ArcGIS 10.2 for Desktop and would like to calculate the population within a specific distance from bus stops.

My input layers are:

  1. census block (polygon) with population
  2. undissolved bus buffers that overlap each other (polygon, in 1 layer)

I calculated population density and converted my block data into a raster, then used the Zonal Statistics as Table tool to calculate the sum of population within each buffer. If I use the tool directly, it ignores about two-thirds of my buffers because they overlap others, so I need code that iterates over each buffer (about 600 in total).

My situation is the same as the one in this topic: Intersecting Overlapping Polygons Individually in ArcGIS

I used the code by @phloem, but it returns a table with only one row (corresponding to just one of my buffers). Could someone look through it and give me some advice? I'm new to Python.

import arcpy

# NOTE: the backslash positions in these paths are a best guess; adjust to your own data
store_buffs = r"H:\DM\LANTAS\Map\Shapefiles\2013_A30min_weekday.shp"
density_ras = r"D:\Downloads\block_clipped_Raster1.tif"
table_list = []

with arcpy.da.SearchCursor(store_buffs, ["FacilityID"]) as cursor:
    for row in cursor:
        exp = '"FacilityID" = ' + str(row[0])
        temp_table10 = r"in_memory\temp_table10" + str(row[0])
        temp_shp10 = r"in_memory\temp_shp10"
        arcpy.Select_analysis(store_buffs, temp_shp10, exp)
        arcpy.sa.ZonalStatisticsAsTable(temp_shp10, 'FacilityID', density_ras,
                                        temp_table10, 'DATA', 'SUM')
        table_list.append(temp_table10)
    del row

final_table = r"D:\Downloads\2013_A30min_weekday.dbf"
arcpy.Merge_management(table_list, final_table)

I cannot post a comment on phloem's answer because I don't have enough reputation, which is why I had to create a new question.


I found an Esri supplemental tool called Zonal Statistics As Table 2, which handles overlapping polygons. However, that tool takes forever to run (about 6 hours in my case), so I would like a more efficient way to process the data.


You CAN do it using vectors; I 100% agree with the others. Anyway, I slightly modified your code, replacing the very long names, and it works as expected:

import arcpy

store_buffs = r"D:\Scratch\A30min_weekday.shp"
density_ras = r"D:\Scratch\Raster1.tif"
table_list = []

with arcpy.da.SearchCursor(store_buffs, ["FacilityID"]) as cursor:
    for row in cursor:
        arcpy.AddMessage(row[0])
        exp = '"FacilityID" = ' + str(row[0])
        temp_table10 = r"in_memory\temp_" + str(row[0])
        temp_shp10 = r"in_memory\temp_shp10"
        arcpy.Select_analysis(store_buffs, temp_shp10, exp)
        arcpy.sa.ZonalStatisticsAsTable(temp_shp10, 'FacilityID', density_ras,
                                        temp_table10, 'DATA', 'SUM')
        table_list.append(temp_table10)
    del row

final_table = r"D:\scratch\all_rows.dbf"
arcpy.Merge_management(table_list, final_table)

It takes ages, though: almost 4 minutes for 100 polygons. Try this one instead; it takes 25 seconds to do the same job. It runs from ArcMap and assumes you have a layer called 'A30min_weekday' that has a field named SUM.

import arcpy, traceback, os, sys, time
from arcpy import env

env.overwriteOutput = True
mxd = arcpy.mapping.MapDocument("CURRENT")
store_buffs = arcpy.mapping.ListLayers(mxd, 'A30min_weekday')[0]
density_ras = r"D:\Scratch\Raster1.tif"
parID = "FacilityID"
parID2 = "FacilityID_1"
env.workspace = "in_memory"
dbf = "stat"
joinLR = "SD"

try:
    def showPyMessage():
        arcpy.AddMessage(str(time.ctime()) + " - " + message)

    def Get_V(aKey):
        try:
            return smallDict[aKey]
        except:
            return (-1)

    arcpy.AddMessage("Defining neighbours...")
    arcpy.SpatialJoin_analysis(store_buffs, store_buffs, joinLR, "JOIN_ONE_TO_MANY")

    arcpy.AddMessage("Creating empty dictionary")
    dictFeatures = {}
    with arcpy.da.SearchCursor(store_buffs, parID) as cursor:
        for row in cursor:
            dictFeatures[row[0]] = ()
        del row, cursor

    arcpy.AddMessage("Assigning neighbours...")
    with arcpy.da.SearchCursor(joinLR, (parID, parID2)) as cursor:
        for row in cursor:
            aKey = row[0]
            aList = dictFeatures[aKey]
            aList += (row[1],)
            dictFeatures[aKey] = aList
        del row, cursor

    arcpy.AddMessage("Defining non-overlapping subsets...")
    runNo = 0
    while True:
        # pick a subset of polygons, none of which overlap one another
        toShow, toHide = (), ()
        nF = len(dictFeatures)
        for item in dictFeatures:
            if item not in toShow and item not in toHide:
                toShow += (item,)
                toHide += (dictFeatures[item])
        m = len(toShow)
        quer = '"FacilityID" IN ' + str(toShow)
        if m == 1:
            quer = '"FacilityID" = ' + str(toShow[0])
        store_buffs.definitionQuery = quer
        runNo += 1
        arcpy.AddMessage("Run %i, %i polygon(s) found" % (runNo, m))
        arcpy.AddMessage("Running Statistics...")
        arcpy.gp.ZonalStatisticsAsTable_sa(store_buffs, parID, density_ras, dbf, "DATA", "SUM")
        arcpy.AddMessage("Data transfer...")
        smallDict = {}
        with arcpy.da.SearchCursor(dbf, (parID, "SUM")) as cursor:
            for row in cursor:
                smallDict[row[0]] = row[1]
            del row, cursor
        with arcpy.da.UpdateCursor(store_buffs, (parID, "SUM")) as cursor:
            for row in cursor:
                aKey = row[0]
                row[1] = Get_V(aKey)
                cursor.updateRow(row)
            del row, cursor
        for item in toShow:
            del dictFeatures[item]
        m = len(dictFeatures)
        if m == 0:
            break
    store_buffs.definitionQuery = ""
except:
    message = "\n*** PYTHON ERRORS ***"; showPyMessage()
    message = "Python Traceback Info: " + traceback.format_tb(sys.exc_info()[2])[0]; showPyMessage()
    message = "Python Error Info: " + str(sys.exc_type) + ": " + str(sys.exc_value) + "\n"; showPyMessage()

Usage

A zone is defined as all areas in the input that have the same value. The areas do not have to be contiguous. Both raster and feature can be used for the zone input.

If the Input raster or feature zone data ( in_zone_data in Python) is a raster, it must be an integer raster.

If the Input raster or feature zone data ( in_zone_data in Python) is a feature, it will be converted to a raster internally using the cell size and cell alignment from the Input value raster ( in_value_raster in Python).

When the cell size of the Input raster or feature zone data ( in_zone_data in Python) and the Input value raster ( in_value_raster in Python) is different, the output cell size will be the Maximum Of Inputs value, and the Input value raster will be used as the snap raster internally. If the cell size is the same but the cells are not aligned, the Input value raster will be used as the snap raster internally. Either of these cases will trigger an internal resampling before the zonal operation is performed.

When the zone and value inputs are both rasters of the same cell size and the cells are aligned, they will be used directly in the tool and will not be resampled internally during tool execution.

If the Input raster or feature zone data ( in_zone_data in Python) is a feature, for any of the zone features that do not overlap any cell centers of the value raster, those zones will not be converted to the internal zone raster. As a result, those zones will not be represented in the output. You can manage this by determining an appropriate value for the cell size environment that will preserve the desired level of detail of the feature zones, and specify it in the analysis environment.
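For example, a minimal sketch of forcing a finer analysis cell size through the geoprocessing environment before running the tool (the 10 m value and all dataset and field names here are illustrative assumptions):

import arcpy
from arcpy.sa import ZonalStatisticsAsTable

arcpy.CheckOutExtension("Spatial")
# Use a finer cell size so small or thin zone polygons still cover cell centers,
# and keep cells aligned with the value raster
arcpy.env.cellSize = 10
arcpy.env.snapRaster = "value.tif"

ZonalStatisticsAsTable("zones.shp", "ZoneID", "value.tif", "zonal_stats.dbf",
                       "DATA", "SUM")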

If the Input raster or feature zone data ( in_zone_data in Python) is a point feature, it is possible to have more than one point contained within any particular cell of the value input raster. For such cells, the zone value is determined by the point with the lowest ObjectID field (for example, OID or FID ).

If the Input raster or feature zone data ( in_zone_data in Python) has overlapping features, the zonal analysis will be performed for each individual feature.

When specifying the Input raster or feature zone data ( in_zone_data in Python), the default zone field will be the first available integer or text field. If no other valid fields exist, the ObjectID field (for example, OID or FID ) will be the default.

The Input value raster ( in_value_raster in Python) can be either integer or floating point. However, when it is floating-point type, the options for calculating majority, median, minority, and variety will not be available.

For majority and minority calculations, when there is a tie, the output for the zone is based on the lowest of the tied values. See How Zonal Statistics works for more information.

A field or series of fields will be created in the output table, depending on the Statistics type parameter setting. When the value input is integer, all of the statistics (Mean, Majority, Maximum, Median, Minimum, Minority, Percentile, Range, Standard deviation, Sum, and Variety) are available for calculation. If the value input is floating point, the Majority, Minority, Median, Percentile, and Variety statistics will not be calculated.

Supported multidimensional raster dataset types include multidimensional raster layer, mosaic, image service and Esri's CRF.

The data type for each value of the items in the output table is dependent on the zonal calculation being performed. See How Zonal Statistics works for the specific behavior of a statistic.

The number of rows in the output table is the number of zones.

By default, this tool will take advantage of multicore processors. The maximum number of cores that can be used is four.

To use fewer cores, use the parallelProcessingFactor environment setting.
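For example (the factor shown is only an illustration):

import arcpy
# Restrict this tool to two processes; "50%" or "0" (no parallelism) also work
arcpy.env.parallelProcessingFactor = "2"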

See Analysis environments and Spatial Analyst for additional details on the geoprocessing environments that apply to this tool.


Parameters

The dataset that defines the zones.

The zones can be defined by an integer raster or a feature layer.

The field that contains the values that define each zone.

It must be an integer field of the zone dataset.

Output table that will contain the summary of the values in each zone.

The format of the table is determined by the output location and path. By default, the output will be a geodatabase table. If the path is not in a geodatabase, the format is determined by the extension. If the extension is .dbf , it will be in dBASE format. If no extension is specified, the output will be an INFO table. Note that INFO tables are not supported as input in ArcGIS Pro and cannot be displayed.
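For example, a sketch of how the output path controls the table format (all paths and names here are illustrative):

import arcpy
from arcpy.sa import ZonalStatisticsAsTable

arcpy.CheckOutExtension("Spatial")
# dBASE output, because the path ends in .dbf
ZonalStatisticsAsTable("zones.shp", "ZoneID", "value.tif",
                       r"C:\temp\zonal_stats.dbf", "DATA", "SUM")
# Geodatabase table output, because the path points into a file geodatabase
ZonalStatisticsAsTable("zones.shp", "ZoneID", "value.tif",
                       r"C:\temp\results.gdb\zonal_stats", "DATA", "SUM")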

The cell size of the output raster that will be created.

This parameter can be defined by a numeric value or obtained from an existing raster dataset. If the cell size hasn't been explicitly specified as the parameter value, the environment cell size value will be used if specified; otherwise, additional rules will be used to calculate it from the other inputs. See the usage section for more detail.


Code sample

This example determines the geometry measures for each zone defined by the input polygon shapefile.

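A minimal sketch of such a call, assuming illustrative workspace, dataset, and field names:

import arcpy
from arcpy.sa import ZonalGeometryAsTable

arcpy.env.workspace = "C:/data"
arcpy.CheckOutExtension("Spatial")
# Geometry measures (area, perimeter, thickness, ellipse parameters) for each zone
ZonalGeometryAsTable("zones.shp", "Classes", "zonalgeom.dbf", 0.2)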


Syntax

Dataset that defines the zones.

The zones can be defined by an integer raster or a feature layer.

Field that holds the values that define each zone.

It can be an integer or a string field of the zone dataset.

The raster values to create the histograms.

The format of the table is determined by the output location and path. By default, the output will be a geodatabase table. If the path is not in a geodatabase, the format is determined by the extension. If the extension is .dbf , it will be in dBASE format. If no extension is specified, the output will be an INFO table.


1 Answer

You could try resampling the raster to a finer resolution. I'm not sure why this works, but the help section for version 10.3 (this passage is no longer present in 10.6) offers an explanation:


If the zone input is a feature dataset with relatively small features,
keep in mind that the resolution of the information needs to be
appropriate relative to the resolution of the value raster. If the
areas of single features are similar to or smaller than the area of
single cells in the value raster, in the feature-to-raster conversion
some of these zones may not be represented.


I had the same problem: especially very thin features got null values. So I think that when a feature does not cover the center of any raster cell, you get NULLs.
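Following the resampling suggestion above, a minimal sketch (the paths and the 5 m cell size are illustrative assumptions):

import arcpy
# Resample the value raster to a finer cell size so very thin zone features
# cover at least one cell center; NEAREST keeps the original cell values
arcpy.Resample_management(r"C:\temp\value.tif", r"C:\temp\value_5m.tif", "5", "NEAREST")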


Altering spatial references in Python

With ArcGIS 10.1, a spatial reference object can be created using a name or well-known ID (WKID).

However, once a spatial reference is created, many of the properties cannot be altered because they are read-only.

Instead, if you need to change a property, you will need to take advantage of Python’s string manipulation capabilities. Since spatial reference properties can be expressed as well known strings, one solution is to export the spatial reference to a string, modify that string, and then use the altered string to create a new spatial reference.

In addition to the documentation above, storage parameters like the coordinate domains and resolution, as well as tolerances are included in the spatial reference string.
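A minimal sketch of that export-modify-reload round trip (the WKID and the particular substring being edited are illustrative only):

import arcpy

# Build a spatial reference from a well-known ID, dump it to its string form,
# edit the string, and load it into a fresh spatial reference object
sr = arcpy.SpatialReference(26910)          # NAD 1983 UTM Zone 10N, as an example
sr_string = sr.exportToString()

# Example edit only: tweak a value embedded in the storage parameters
sr_string = sr_string.replace("0.0001", "0.001")

new_sr = arcpy.SpatialReference()
new_sr.loadFromString(sr_string)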


3 Answers

This question can have multiple answers depending on who is doing the answering, or, depending on the interpretation placed on it, it might be unanswerable.

As I understand the question, we are given that $X_1, X_2$, and $X_3$ are pairwise jointly normal random variables with specified mean vectors and covariance matrices. We are asked for a trivariate joint normal distribution that is "most consistent" with this information.

As an engineer (and wannabe probabilist), the first thing I would do is to check the given information for consistency.

It must be the case that $\mu_{ij} = (\mu_i, \mu_j)$ where $\mu_i = E[X_i]$, $i = 1,2,3$. So, if the problem claims that $\mu_{12} = (4,10)$ while $\mu_{23} = (5,-3)$, the given information is inconsistent.

Similarly, the covariance matrices must be consistent: in $\Sigma_{ij}$, the diagonal terms are $\sigma_i^2 = \operatorname{var}(X_i)$ and $\sigma_j^2 = \operatorname{var}(X_j)$. Each variance occurs in two of the three covariance matrices and it must have the same value in both occurrences. It must also be true that the off-diagonal terms in $\Sigma_{ij}$ must be equal, and that $\Sigma_{ij}$ must have determinant $\geq 0$.

If the above consistency checks are satisfied, then, regardless of whether the $X_i$ are pairwise jointly normal or not, the mean vector of $(X_1,X_2,X_3)$ is $$\mu = (\mu_1,\mu_2,\mu_3) \tag{1}$$ and the covariance matrix is $$\Sigma = \left[\begin{matrix} \sigma_1^2 & \sigma_{12} & \sigma_{13}\\ \sigma_{21} & \sigma_2^2 & \sigma_{23}\\ \sigma_{31} & \sigma_{32} & \sigma_3^2\end{matrix}\right] \tag{2}$$ in which the upper left $2 \times 2$ submatrix is the given $\Sigma_{12}$, the lower right $2 \times 2$ submatrix is the given $\Sigma_{23}$, and the "four-corners" submatrix is the given $\Sigma_{13}$.
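For concreteness, a small made-up numerical illustration of assembling $(2)$ from consistent pairwise blocks:

$$\Sigma_{12} = \left[\begin{matrix} 4 & 2\\ 2 & 9\end{matrix}\right],\qquad \Sigma_{23} = \left[\begin{matrix} 9 & -1\\ -1 & 1\end{matrix}\right],\qquad \Sigma_{13} = \left[\begin{matrix} 4 & 1\\ 1 & 1\end{matrix}\right] \;\Longrightarrow\; \Sigma = \left[\begin{matrix} 4 & 2 & 1\\ 2 & 9 & -1\\ 1 & -1 & 1\end{matrix}\right].$$

Here the shared variances agree ($\sigma_2^2 = 9$ appears in both $\Sigma_{12}$ and $\Sigma_{23}$, and so on), each pairwise block has nonnegative determinant, and the assembled $\Sigma$ happens to be positive definite, so $\mathcal N(\mu,\Sigma)$ is a genuine trivariate normal distribution.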

At this point, the probabilist's answer to the question

Which trivariate joint normal distribution is most consistent with the given information?

would be that $\mathcal N(\mu,\Sigma)$ is the only trivariate normal distribution that is exactly consistent with the given information (that the marginal bivariate distributions are the specified bivariate joint normal distributions). No other trivariate normal distribution can be as good a match with the given information.

On the other hand, the question might be interpreted as

Which trivariate joint normal distribution is the "best approximation" to the actual unspecified joint distribution of $(X_1,X_2,X_3)$?

and to my lowly engineering mindset, this question is unanswerable since neither the meaning of "best" nor the target to be matched is known; dyed-in-the-wool probabilists might choose to give a different response. Other readers who disagree with my interpretation might want to try their hand at finding the trivariate joint normal distribution that best approximates $$f(x_1,x_2,x_3) = \begin{cases} 2\phi(x_1)\phi(x_2)\phi(x_3), & \text{if } x_1 \geq 0, x_2 \geq 0, x_3 \geq 0,\\ & \text{or } x_1<0,\ x_2< 0,\ x_3 \geq 0,\\ & \text{or } x_1<0,\ x_2\geq 0,\ x_3 < 0,\\ & \text{or } x_1\geq 0,\ x_2< 0,\ x_3 < 0,\\ 0, & \text{otherwise,}\end{cases}$$ where $\phi(\cdot)$ is the standard normal density function. Note that the $X_i$ are pairwise independent standard normal random variables.

On the other hand, a good statistician might take it that the OP has masses of data (maybe even "Big Data") from which the OP has drawn the inference that it is reasonable to assume that the $X_i$ are pairwise jointly normal random variables, and that $\mu_{12}$, $\mu_{23}$, $\mu_{13}$, $\Sigma_{12}$, $\Sigma_{23}$, $\Sigma_{13}$ are sample means, variances, and covariances gleaned from three different data sets of the form $\{(x_{1,i}, x_{2,i})\colon i = 1,2,\ldots, n_1\}$, $\{(x_{2,j},x_{3,j})\colon j = 1,2,\ldots, n_2\}$ and $\{(x_{1,k},x_{3,k})\colon k = 1,2,\ldots, n_3\}$. Then it is almost certain that the first entry in $\mu_{12}$ (which is $\frac{1}{n_1}\sum_i x_{1,i}$) is not the same as the first entry in $\mu_{13}$ (which is $\frac{1}{n_3}\sum_k x_{1,k}$). Similarly for the other means, variances and covariances. So, should the data be merged to get better estimates of what the means, variances, etc. are? Is there data available of the form $\{(x_{1,m}, x_{2,m}, x_{3,m})\colon m = 1,2,\ldots, N\}$ that might allow some form of estimation as to what the actual joint distribution of $(X_1,X_2,X_3)$ is?

Considerations of this type lead to the kinds of answers given by Tim and Karel Macek.


3 Answers

The boxplots ignore the repeated nature of the data.

If you want a plot of these data (and you should want one!) you can make a plot where the x axis is time, the y axis is score and each participant gets a line. With n = 38 this ought to be readable, but if not, you can separate the data into two parts, either based on some relevant independent variable or on starting value.
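A minimal matplotlib sketch of such a plot, assuming the data sit in a long-format pandas DataFrame with columns id, time and score (the file and column names are assumptions):

import matplotlib.pyplot as plt
import pandas as pd

# One row per measurement: columns "id", "time", "score"
df = pd.read_csv("scores_long.csv")

fig, ax = plt.subplots()
for pid, grp in df.sort_values("time").groupby("id"):
    ax.plot(grp["time"], grp["score"], marker="o", alpha=0.5)  # one line per participant
ax.set_xlabel("time")
ax.set_ylabel("score")
plt.show()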

In your particular case, it is possible that every single person went up by a small amount between 1 and 3.

Another point: If you know when the measurements happened (e.g. time 1 = day 1, time 2 = day 5, time 3 = day 21 or whatever) it may be better to use that rather than a factor the way you have it.

I also would be wary of RM-ANOVA. It makes strong assumptions (in particular, it assumes sphericity) that are often not reasonable with repeated data. I'd go with either generalized estimating equations (GEE) or a multilevel model.
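A minimal statsmodels sketch of both options on the same long-format data frame (all variable and file names are assumptions):

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("scores_long.csv")   # columns "id", "time", "score"

# Multilevel model: random intercept per participant
mlm = smf.mixedlm("score ~ time", df, groups=df["id"]).fit()
print(mlm.summary())

# GEE alternative with an exchangeable working correlation
gee = smf.gee("score ~ time", groups="id", data=df,
              cov_struct=sm.cov_struct.Exchangeable()).fit()
print(gee.summary())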

The correct way to make a plot of this sort of data as Peter Flom said would be to make a line for each individual, like this:

If you take a closer look, you can study how each individual behaves. Doing it quickly and roughly, you get something like this, where I drew a red square over the individuals whose score somehow increased over time (around 19 of them), yellow over those who stayed about the same (around 14), and green over the ones that went down (5):

Even with plenty of doubt about some individuals, you can see a tendency.

That visual tendency gets corroborated by a model:

About the overlapping boxplots and significance: is it possible that there is some confusion between overlapping boxplots and overlapping confidence intervals of a mean? It is a widespread myth that overlapping mean CIs imply no statistically significant difference (they do not).

Your plot -- although not actively misleading -- nevertheless doesn't do justice to all the fine structure of your data.

A box plot can't always work well when data are granular with lots of ties: here values are reported only as multiples of 0.1. My bias is that the jittering you use can't be as clear as the stacking of identical values that is possible in a sample this small.

Further, there is a hint in the values (accidentally obscured a little by labelling only 0.3, 0.6 and 0.9 on the score axis), together with the negative skewness, that scores just possibly have an upper limit at 1. Whether that is so, and what it implies for the analysis of variance, are open questions.

The display here uses a slightly unorthodox box plot with whiskers drawn to the extremes. The common convention of drawing whiskers only to points within 1.5 IQR of the nearer quartile I find a little oversold, and in any case it can't work especially well with so many ties. But which whisker rule is used is immaterial whenever a display, like that below, shows all the data anyway.

With so many ties, the medians can't work well at showing the levels of the distributions. To that extent the box plots are misleading. Naturally showing means as well is a good idea. Here they are diamonds.

This display leaves aside the important detail of which individual is which, explored in another answer.


Using Landsat 8 to calculate NDVI

I'm using the standard NDVI expression, (Band 5 - Band 4) / (Band 5 + Band 4), in raster calculator to get a unitless raster ranging from -1 to 1. This seems to line up with most online guides showing how to calculate NDVI. However, I also came across a step-by-step instruction for calculating NDVI that first has you adjust each band for reflectance and sun angle, using values found in the downloaded metadata, and then apply the standard NDVI equation to the adjusted bands. I don't get a raster that ranges from -1 to 1 when I do this. Am I doing something wrong, or is the first way the right way?
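For reference, a minimal ArcPy sketch of that NDVI calculation for Landsat 8, where Band 5 is near infrared and Band 4 is red (the file names are illustrative):

import arcpy
from arcpy.sa import Raster, Float

arcpy.CheckOutExtension("Spatial")
nir = Raster("LC08_B5.TIF")    # Band 5, near infrared
red = Raster("LC08_B4.TIF")    # Band 4, red

# Float() prevents integer division when the inputs are raw DN rasters
ndvi = Float(nir - red) / Float(nir + red)
ndvi.save("ndvi.tif")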

If you want to compare your NDVI across different images, you need to correct your image (atmospherically, topographically, etc.) to reflectance values. If you use radiance (which is essentially what the raw Landsat image is), you will have variations in NDVI that are due not to differences in vegetation but to differences in illumination conditions.

In theory, reflectance values let you compare any image from any sensor, and you should get NDVI between -1 and 1 regardless of the platform used to acquire the imagery and regardless of acquisition conditions. Radiance, on the other hand, is completely sensor- and scene-dependent.
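For reference, a sketch of the top-of-atmosphere reflectance scaling driven by the Landsat 8 MTL metadata fields; the numeric values below are placeholders, not values from any actual scene:

import math

mult = 2.0e-05        # REFLECTANCE_MULT_BAND_5 from the MTL file
add = -0.1            # REFLECTANCE_ADD_BAND_5 from the MTL file
sun_elev = 45.0       # SUN_ELEVATION from the MTL file, in degrees

def toa_reflectance(dn):
    """Scale a raw digital number to sun-angle-corrected TOA reflectance."""
    return (mult * dn + add) / math.sin(math.radians(sun_elev))

print(toa_reflectance(12000))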

You can let USGS do all the reflectance calculations for you: when you download your scenes from Earth Explorer, you can choose the Landsat surface reflectance product. You order the scene you want (it is free) and then get an email when it is ready. That is how I would go about computing NDVI.

So the answer is, it depends. You are doing nothing wrong if you are just interested in looking at one scene. If you want to compare, you are better off using reflectance values.

I won't start on how Landsat 7 NDVI and Landsat 8 NDVI don't compare due to the slight difference in the spectral windows each instrument uses.


Modeling land use/land cover change using remote sensing and geographic information systems: case study of the Seyhan Basin, Turkey

Land use and land cover (LULC) changes affect several natural environmental factors, including soil erosion, hydrological balance, biodiversity, and the climate, which ultimately impact societal well-being. Therefore, LULC changes are an important aspect of land management. One method used to analyze LULC changes is the mathematical modeling approach. In this study, Cellular Automata and Markov Chain (CA-MC) models were used to predict the LULC changes in the Seyhan Basin in Turkey that are likely to occur by 2036. Satellite multispectral imagery acquired in the years 1995, 2006, and 2016 was classified using the object-based classification method and used as the input data for the CA-MC model. Subsequently, the post-classification comparison technique was used to determine the parameters of the model to be simulated. The Markov Chain analyses and the multi-criteria evaluation (MCE) method were used to produce a transition probability matrix and land suitability maps, respectively. The model was validated using the Kappa index, which reached an overall level of 77%. Finally, the LULC changes were mapped for the year 2036 based on transition rules and a transition area matrix. The LULC prediction for the year 2036 showed a 50% increase in the built-up area class and a 7% decrease in the open spaces class compared to the LULC status of the reference year 2016. About an 8% increase in agricultural land is also likely to occur in 2036. About a 4% increase in shrub land and a 5% decrease in forest areas are also predicted.


