function [xy,distance,t_a] = distance2curve(curvexy,mapxy,interpmethod) % distance2curve: minimum distance from a point to a general curvilinear n-dimensional arc % usage: [xy,distance,t] = distance2curve(curvexy,mapxy) % uses linear curve segments % usage: [xy,distance,t] = distance2curve(curvexy,mapxy,interpmethod) % % Identifies the closest point along a general space curve (a 1-d path % in some space) to some new set of points. The curve may be piecewise % linear or a parametric spline or pchip model. % % arguments: (input) % curvexy - An nxp real numeric array containing the points of the % curve. For 2-dimensional curves, p == 2. This will be a list % of points (each row of the array is a new point) that % define the curve. The curve may cross itself in space. % Closed curves are acceptable, in which case the first % and last points would be identical. (Sorry, but periodic % end conditions are not an option for the spline at this time.) % % Since a curve makes no sense in less than 2 dimensions, % p >= 2 is required. % % mapxy - an mxp real numeric array, where m is the number of new points % to be mapped to the curve in term of their closest distance. % % These points which will be mapped to the existing curve % in terms of the minimium (euclidean, 2-norm) distance % to the curve. Each row of this array will be a different % point. % % interpmethod - (OPTIONAL) string flag - denotes the method % used to compute the arc length of the curve. % % method may be any of 'linear', 'spline', or 'pchip', % or any simple contraction thereof, such as 'lin', % 'sp', or even 'p'. % % interpmethod == 'linear' --> Uses a linear chordal % approximation to define the curve. % This method is the most efficient. % % interpmethod == 'pchip' --> Fits a parametric pchip % approximation. % % interpmethod == 'spline' --> Uses a parametric spline % approximation to fit the curves. Generally for % a smooth curve, this method may be most accurate. % % DEFAULT: 'linear' % % arguments: (output) % xy - an mxp array, contains the closest point identified along % the curve to each of the points provided in mapxy. % % distance - an mx1 vector, the actual distance to the curve, % in terms minimum Euclidean distance. % % t - fractional arc length along the interpolating curve to that % point. This is the same value that interparc would use to % produce the points in xy. % % % Example: % % Find the closest points and the distance to a polygonal line from % % several test points. % % curvexy = [0 0;1 0;2 1;0 .5;0 0]; % mapxy = [3 4;.5 .5;3 -1]; % [xy,distance,t] = distance2curve(curvexy,mapxy,'linear') % % xy = % % 2 1 % % 0.470588235294118 0.617647058823529 % % 1.5 0.5 % % distance = % % 3.16227766016838 % % 0.121267812518166 % % 2.12132034355964 % % t = % % 0.485194315877587 % % 0.802026225550702 % % 0.34308419095021 % % % plot(curvexy(:,1),curvexy(:,2),'k-o',mapxy(:,1),mapxy(:,2),'r*') % hold on % plot(xy(:,1),xy(:,2),'g*') % line([mapxy(:,1),xy(:,1)]',[mapxy(:,2),xy(:,2)]','color',[0 0 1]) % axis equal % % % Example: % % Solve for the nearest point on the curve of a 3-d quasi-elliptical % % arc (sampled and interpolated from 20 points) mapping a set of points % % along a surrounding circle onto the ellipse. This is the example % % used to generate the screenshot figure. % t = linspace(0,2*pi,20)'; % curvexy = [cos(t) - 1,3*sin(t) + cos(t) - 1.25,(t/2 + cos(t)).*sin(t)]; % % s = linspace(0,2*pi,100)'; % mapxy = 5*[cos(s),sin(s),sin(s)]; % xy = distance2curve(curvexy,mapxy,'spline'); % % plot3(curvexy(:,1),curvexy(:,2),curvexy(:,3),'ko') % line([mapxy(:,1),xy(:,1)]',[mapxy(:,2),xy(:,2)]',[mapxy(:,3),xy(:,3)]','color',[0 0 1]) % axis equal % axis square % box on % grid on % view(26,-6) % % % Example: % % distance2curve is fairly fast, at least for the linear case. % % Map 1e6 points onto a polygonal curve in 10 dimensions. % curvexy = cumsum(rand(10,10)); % mapxy = rand(1000000,10)*5; % tic,[xy,distance] = distance2curve(curvexy,mapxy,'linear');toc % % Elapsed time is 2.867453 seconds. % % % See also: interparc, spline, pchip, interp1, arclength % % Author: John D'Errico % e-mail: woodchips@rochester.rr.com % Release: 1.0 % Release date: 9/22/2010 % check for errors, defaults, etc... if (nargin < 2) error('DISTANCE2CURVE:insufficientarguments', ... 'at least curvexy and mapxy must be supplied') elseif nargin > 3 error('DISTANCE2CURVE:abundantarguments', ... 'Too many arguments were supplied') end % get the dimension of the space our points live in [n,p] = size(curvexy); if isempty(curvexy) || isempty(mapxy) % empty begets empty. you might say this was a pointless exercise. xy = zeros(0,p); distance = zeros(0,p); t_a = zeros(0,p); return end % do curvexy and mapxy live in the same space? if size(mapxy,2) ~= p error('DISTANCE2CURVE:improperpxorpy', ... 'curvexy and mapxy do not appear to live in the same dimension spaces') end % do the points live in at least 2 dimensions? if p < 2 error('DISTANCE2CURVE:improperpxorpy', ... 'The points MUST live in at least 2 dimensions for any curve to be defined.') end % how many points to be mapped to the curve? m = size(mapxy,1); % make sure that curvexy and mapxy are doubles, as uint8, etc % would cause problems down the line. curvexy = double(curvexy); mapxy = double(mapxy); % test for complex inputs if ~isreal(curvexy) || ~isreal(mapxy) error('DISTANCE2CURVE:complexinputs','curvexy and mapxy may not be complex') end % default for interpmethod if (nargin < 3) || isempty(interpmethod) interpmethod = 'linear'; elseif ~ischar(interpmethod) error('DISTANCE2CURVE:invalidinterpmethod', ... 'Invalid method indicated. Only ''linear'',''pchip'',''spline'' allowed') else validmethods = {'linear' 'pchip' 'spline'}; ind = strmatch(lower(interpmethod),validmethods); if isempty(ind) || (length(ind) > 1) error('DISTANCE2CURVE:invalidinterpmethod', ... 'Invalid method indicated. Only ''linear'',''pchip'',''spline'' allowed') end interpmethod = validmethods{ind}; end % if the curve is a single point, stop here if n == 1 % return the appropriate parameters xy = repmat(curvexy,m,1); t_a = zeros(m,1); % 2 norm distance, or sqrt of sum of squares of differences distance = sqrt(sum(bsxfun(@minus,curvexy,mapxy).^2,2)); % we can drop out here return end % compute the chordal linear arclengths, and scale to [0,1]. seglen = sqrt(sum(diff(curvexy,[],1).^2,2)); t0 = [0;cumsum(seglen)/sum(seglen)]; % We need to build some parametric splines. % compute the splines, storing the polynomials in one 3-d array ppsegs = cell(1,p); % the breaks for the splines will be t0, unless spline got fancy % on us here. breaks = t0; for i = 1:p switch interpmethod case 'linear' dt = diff(t0); ind = 1:(n-1); ppsegs{i} = [(curvexy(ind + 1,i) - curvexy(ind,i))./dt,curvexy(ind,i)]; case 'pchip' spl = pchip(t0,curvexy(:,i)); ppsegs{i} = spl.coefs; case 'spline' spl = spline(t0,curvexy(:,i)); breaks = spl.breaks'; nc = numel(spl.coefs); if nc < 4 % just pretend it has cubic segments spl.coefs = [zeros(1,4-nc),spl{i}.coefs]; spl.order = 4; end ppsegs{i} = spl.coefs; end end % how many breaks did we find in the spline? This is % only a thing to worry about for a spline based on few points, % when the function spline.m may choose to use only two breaks. nbr = numel(breaks); % for each point in mapxy, find the closest point to those % in curvexy. This part we can do in a vectorized form. pointdistances = ipdm(mapxy,curvexy,'metric',2, ... 'result','structure','subset','nearestneighbor'); % initialize the return variables, using the closest point % found in the set curvexy. xy = curvexy(pointdistances.columnindex,:); distance = pointdistances.distance; t = t0(pointdistances.columnindex); % we must now do at least some looping, still vectorized where possible. % the piecewise linear case is simpler though, so do it separately. if strcmp(interpmethod,'linear'); % loop over the individual points, vectorizing in the number of % segments, when there are many segments, but not many points to map. if n >= (5*m) % many segments, so loop over the points in mapxy for i = 1:m % the i'th point in mapxy xyi = mapxy(i,:); % Compute the location (in t) of the minimal distance % point to xyi, for all lines. tnum = zeros(nbr - 1,1); tden = tnum; for j = 1:p ppj = ppsegs{j}; tden = tden + ppj(:,1).^2; tnum = tnum + ppj(:,1).*(xyi(j) - ppj(:,2)); end tmin = tnum./tden; % toss out any element of tmin that is less than or equal to % zero, or or is greater than dt for that segment. tmin((tmin <= 0) | (tmin >= diff(t0))) = NaN; % for any segments with a valid minimum distance inside the % segment itself, compute that distance. dmin = zeros(nbr - 1,1); for j = 1:p ppi = ppsegs{j}; dmin = dmin + (ppi(:,1).*tmin + ppi(:,2) - xyi(j)).^2; end dmin = sqrt(dmin); % what is the minimum distance among these segments? [mindist,minind] = min(dmin); if ~isnan(mindist) && (distance(i) > mindist) % there is a best segment, better than the % closest point from curvexy. distance(i) = mindist; t(i) = tmin(minind) + t0(minind); for j = 1:p ppj = ppsegs{j}; xy(i,j) = ppj(minind,1).*tmin(minind) + ppj(minind,2); end end end else for i = 1:(n-1) % the i'th segment of the curve t1 = t0(i); t2 = t0(i+1); % Compute the location (in t) of the minimal distance % point to mapxy, for all points. tnum = zeros(m,1); tden = 0; for j = 1:p ppj = ppsegs{j}; tden = tden + ppj(i,1).^2; tnum = tnum + ppj(i,1).*(mapxy(:,j) - ppj(i,2)); end tmin = tnum./tden; % We only care about those points for this segment where there % is a minimal distance to the segment that is internal to the % segment. k = find((tmin > 0) & (tmin < (t2-t1))); nk = numel(k); if nk > 0 % for any points with a valid minimum distance inside the % segment itself, compute that distance. dmin = zeros(nk,1); xymin = zeros(nk,p); for j = 1:p ppj = ppsegs{j}; xymin(:,j) = ppj(i,1).*tmin(k) + ppj(i,2); dmin = dmin + (xymin(:,j) - mapxy(k,j)).^2; end dmin = sqrt(dmin); L = dmin < distance(k); % this segment has a closer point % closest point from curvexy. if any(L) distance(k(L)) = dmin(L); t(k(L)) = tmin(k(L)) + t0(i); xy(k(L),:) = xymin(L,:); end end end end % for the linear case, t is identical to the fractional arc length % along the curve. t_a = t; else % cubic segments. here it is simplest to loop over the % distinct curve segments. We need not test the endpoints % of the segments, since the call to ipdm did that part. xytrans = zeros(1,p); polydiff = @(dp) dp(1:6).*[6 5 4 3 2 1]; for j = 1:(n-1) % the j'th curve segment t1 = t0(j); t2 = t0(j+1); % for a polynomial in t that looks like % P(t) = a1*t^3 + a2*t^2 + a3*t + a4, in each dimension, % extract the polynomial pieces for the 6th degree polynomial % in t for the square of the Euclidean distance to the curve. % Thus, (P_x(t) - x0)^2 + (P_y(t) - y0)^2 + ... % % a1^2*t^6 % 2*a1*a2*t^5 % (2*a1*a3 + a2^2)*t^4 % (2*a2*a3 - 2*a1*x0 + 2*a1*a4)*t^3 % (a3^2 - 2*a2*x0 + 2*a2*a4)*t^2 % (-2*a3*x0 + 2*a3*a4)*t % x0^2 - 2*a4*x0 + a4^2 % % here, only the parts of this distance that are independent of % the point itself are computed. so the x0 terms are not built % yet. All of the terms with a4 in them will go away because % of the translation. distpoly0 = zeros(1,7); for i = 1:p ppi = ppsegs{i}; % this will allow us to translate each poly to pass through % (0,0) (i.e., at t = 0) xytrans(i) = ppi(j,4); distpoly0(1:2) = distpoly0(1:2) + ppi(j,1).*[ppi(j,1),2*ppi(j,2)]; distpoly0(3) = distpoly0(3) + 2.*ppi(j,1).*ppi(j,3) + ppi(j,2).^2; distpoly0(4) = distpoly0(4) + 2.*ppi(j,2).*ppi(j,3); distpoly0(5) = distpoly0(5) + ppi(j,3).^2; end for i = 1:m % the i'th point, translated by xytrans. The translation does % not change the distance to this segment, but it does make % the computations more robust to numerical problems. xyi = mapxy(i,:) - xytrans; % update the poly for this particular point % (-2*a1*x0)*t^3 % (-2*a2*x0)*t^2 % (-2*a3*x0)*t % x0^2 distpoly = distpoly0; for k = 1:p ppk = ppsegs{k}; distpoly(4:6) = distpoly(4:6) - 2.*ppk(j,1:3).*xyi(k); distpoly(7) = distpoly(7) + xyi(k).^2; end % find any minima of this polynomial in the interval (0,t2-t1). % we can ignore solutions that happen at the endpoints of the % interval, since those are already covered by ipdm. % % merely compute the zeros of the derivative polynomial diffpoly = polydiff(distpoly); tstationary = roots(diffpoly); % discard any with an imaginary part, those that are less % than 0, or greater than t2-t1. k = (imag(tstationary) ~= 0) | ... (real(tstationary) <= 0) | ... (real(tstationary) >= (t2 - t1)); tstationary(k) = []; % for any solutions that remain, compute the distance. if ~isempty(tstationary) mindist = zeros(size(tstationary)); xyij = zeros(numel(tstationary),p); for k = 1:p xyij(:,k) = polyval(ppsegs{k}(j,:),tstationary); mindist = mindist + (mapxy(i,k) - xyij(:,k)).^2; end mindist = sqrt(mindist); % just in case there is more than one stationary point [mindist,ind] = min(mindist); if mindist < distance(i) % we found a point on this segment that is better % than the endpoint values for that segment. distance(i) = mindist; xy(i,:) = xyij(ind,:); t(i) = tstationary(ind) + t0(j); end end % if ~isempty(tstationary) end % for i = 1:n end % for j = 1:(n-1) % do we need to return t_a? t_a is the same number that interparc % uses, whereas t as we have computed it so far is just the fractional % chordal arclength. % % Don't bother doing this last piece unless that argument is requested, % since it takes some additional work to do. if nargout >= 2 % build new piecewise polynomials for each segment that % represent (dx/dt)^2 + (dy/dt)^2 + ... % % Since each poly must be cubic at this point, the result will be % a 4th degree piecewise polynomial. kernelcoefs = zeros(nbr-1,5); for i = 1:p ppi = ppsegs{i}; kernelcoefs = kernelcoefs + [9*ppi(:,1).^2, ... 12*ppi(:,1).*ppi(:,2), ... 4*ppi(:,2).^2 + 6*ppi(:,1).*ppi(:,3), ... 4*ppi(:,2).*ppi(:,3), ppi(:,3).^2]; end % get the arc length for each segment. quadgk will suffice here % since we need to integrate the sqrt of each poly arclengths = zeros(nbr-1,1); for i = 1:(nbr - 1) lengthfun = @(t) sqrt(polyval(kernelcoefs(i,:),t)); arclengths(i) = quadgk(lengthfun,0,t0(i+1) - t0(i)); end % get the cumulative arclengths, then scale by the sum % this gives us fractional arc lengths. arclengths = cumsum(arclengths); totallength = arclengths(end); arclengths = [0;arclengths/totallength]; % where does each point fall in terms of fractional cumulative % chordal arclength? (i.e., t0?) [tbin,tbin] = histc(t,t0); tbin(tbin < 1) = 1; % being careful at the bottom end tbin(tbin >= nbr) = nbr - 1; % if the point fell at the very top... % the total length below the segment in question t_a = arclengths(tbin); % now get the piece in the tbin segment for i = 1:m lengthfun = @(t) sqrt(polyval(kernelcoefs(tbin(i),:),t)); t_a(i) = t_a(i) + quadgk(lengthfun,0,t(i) - t0(tbin(i)))/totallength; end end end % if strcmp(interpmethod,'linear'); % ========================================================== function d = ipdm(data1,varargin) % ipdm: Inter-Point Distance Matrix % usage: d = ipdm(data1) % usage: d = ipdm(data1,data2) % usage: d = ipdm(data1,prop,value) % usage: d = ipdm(data1,data2,prop,value) % % Arguments: (input) % data1 - array of data points, each point is one row. p dimensional % data will be represented by matrix with p columns. % If only data1 is provided, then the distance matrix % is computed between all pairs of rows of data1. % % If your data is one dimensional, it MUST form a column % vector. A row vector of length n will be interpreted as % an n-dimensional data set. % % data2 - second array, supplied only if distances are to be computed % between two sets of points. % % % Class support: data1 and data2 are assumed to be either % single or double precision. I have not tested this code to % verify its success on integer data of any class. % % % Additional parameters are expected to be property/value pairs. % Property/value pairs are pairs of arguments, the first of which % (properties) must always be a character string. These strings % may be shortened as long as the shortening is unambiguous. % Capitalization is ignored. Valid properties for ipdm are: % % 'Metric', 'Subset', 'Limit', 'Result' % % 'Metric' - numeric flag - defines the distance metric used % metric = 2 --> (DEFAULT) Euclidean distance = 2-norm % The standard distance metric. % % metric = 1 --> 1-norm = sum of absolute differences % Also sometimes known as the "city block % metric", since this is the sum of the % differences in each dimension. % % metric = inf --> infinity-norm = maximum difference % over all dimensions. The name refers % to the limit of the p-norm, as p % approaches infinity. % % metric = 0 --> minimum difference over all dimensions. % This is not really a useful norm in % practice. % % Note: while other distance metrics exist, IMHO, these % seemed to be the common ones. % % % 'Result' - A string variable that denotes the style of returned % result. Valid result types are 'Array', 'Structure'. % Capitalization is ignored, and the string may be % shortened if you wish. % % result = 'Array' --> (DEFAULT) A matrix of all % interpoint distances will be generated. % This array may be large. If this option % is specified along with a minimum or % maximum value, then those elements above % or below the limiting values will be % set as -inf or +inf, as appropriate. % % When any of 'LargestFew', 'SmallestFew', % or 'NearestNeighbor' are set, then the % resulting array will be a sparse matrix % if 'array' is specified as the result. % % result = 'Structure' --> A list of all computed distances, % defined as a structure. This structure % will have fields named 'rowindex', % 'columnindex', and 'distance'. % % This option will be useful when a subset % criterion for the distances has been % specified, since then the distance matrix % may be very sparsely populated. Distances % for pairs outside of the criterion will % not be returned. % % % 'Subset' - Character string, any of: % % 'All', 'Maximum', 'Minimum', 'LargestFew', 'SmallestFew', % 'NearestNeighbor', 'FarthestNeighbor', or empty % % Like properties, capitalization is ignored here, and % any unambiguous shortening of the word is acceptable. % % DEFAULT = 'All' % % Some interpoint distance matrices can be huge. Often % these matrices are too large to be fully retained in % memory, yet only the pair of points with the largest % or smallest distance may be needed. When only some % subset of the complete set of distances is of interest, % these options allow you to specify which distances will % be returned. % % If 'result' is defined to be an array, then a sparse % matrix will be returned for the 'LargestFew', 'SmallestFew', % 'NearestNeighbor', and 'FarthestNeighbor' subset classes. % 'Minimum' and 'Maximum' will yield full matrices by % default. If a structure is specified, then only those % elements which have been identified will be returned. % % Where a subset is specified, its limiting value is % specified by the 'Limit' property. Call that value k. % % % 'All' --> (DEFAULT) Return all interpoint distances % % 'Minimum' --> Only look for those distances above % the cutoff k. All other distances will % be returned as -inf. % % 'Maximum' --> Only look for those distances below % the cutoff k. All other distances will % be returned as +inf. % % 'SmallestFew' --> Only return the subset of the k % smallest distances. Where only one data % set is provided, only the upper triangle % of the inter-point distance matrix will % be generated since that matrix is symmetric. % % 'LargestFew' --> Only return the subset of the k % largest distances. Where only one data % set is provided, only the upper triangle % of the inter-point distance matrix will % be generated since that matrix is symmetric. % % 'NearestNeighbor' --> Only return the single nearest % neighbor in data2 to each point in data1. % No limiting value is required for this % option. If multiple points have the same % nearest distance, then return the first % such point found. With only one input set, % a point will not be its own nearest % neighbor. % % Note that exact replicates in a single set % will cause problems, since a sparse matrix % is returned by default. Since they will have % a zero distance, they will not show up in % the sparse matrix. A structure return will % show those points as having a zero distance % though. % % 'FarthestNeighbor' --> Only return the single farthest % neighbor to each point. No limiting value % is required for this option. If multiple % points have the same farthest distance, % then return the first such point found. % % % 'Limit' - scalar numeric value or []. Used only when some % Subset is specified. % % DEFAULT = [] % % % 'ChunkSize' - allows a user with lower RAM limits % to force the code to only grab smaller chunks of RAM % at a time (where possible). This parameter is specified % in bytes of RAM. The default is 32 megabytes, or 2^22 % elements in any piece of the distance matrix. Only some % options will break the problem into chunks, thus as long % as a full matrix is expected to be returned, there seems % no reason to break the problem up into pieces. % % DEFAULT = 2^25 % % % Arguments: (output) % d - array of interpoint distances, or a struct wth the % fields {'rowindex', 'columnindex', 'distance'}. % % d(i,j) represents the distance between point i % (from data1) and point j (from data2). % % If only one (n1 x p) array is supplied, then d will % be an array of size == [n1,n1]. % % If two arrays (of sizes n1 x p and n2 x p) then d % will be an array of size == [n1,n2]. % % % Efficiency considerations: % Where possible, this code will use bsxfun to compute its % distances. % % % Example: % Compute the interpoint distances between all pairs of points % in a list of 5 points, in 2 dimensions and using Euclidean % distance as the distance metric. % % A = randn(5,2); % d = ipdm(A,'metric',2) % d = % 0 2.3295 3.2263 2.0263 2.8244 % 2.3295 0 1.1485 0.31798 1.0086 % 3.2263 1.1485 0 1.4318 1.8479 % 2.0263 0.31798 1.4318 0 1.0716 % 2.8244 1.0086 1.8479 1.0716 0 % % (see the demo file for many other examples) % % See also: pdist % % Author: John D'Errico % e-mail: woodchips@rochester.rr.com % Release: 1.0 % Release date: 2/26/08 % Default property values params.Metric = 2; params.Result = 'array'; params.Subset = 'all'; params.Limit = []; params.ChunkSize = 2^25; % untangle the arguments if nargin<1 % if called with no arguments, then the user probably % needs help. Give it to them. help ipdm return end % were two sets of data provided? pvpairs = {}; if nargin==1 % only 1 set of data provided dataflag = 1; data2 = []; else if ischar(varargin{1}) dataflag = 1; data2 = []; pvpairs = varargin; else dataflag = 2; data2 = varargin{1}; if nargin>2 pvpairs = varargin(2:end); end end end % get data sizes for later [n1,dim] = size(data1); if dataflag == 2 n2 = size(data2,1); end % Test the class of the input variables if ~(isa(data1,'double') || isa(data1,'single')) || ... ((dataflag == 2) && ~(isa(data2,'double') || isa(data2,'single'))) error('data points must be either single or double precision variables.') end % do we need to process any property/value pairs? if nargin>2 params = parse_pv_pairs(params,pvpairs); % check for problems in the properties % was a legal Subset provided? if ~isempty(params.Subset) && ~ischar(params.Subset) error('If provided, ''Subset'' must be character') elseif isempty(params.Subset) params.Subset = 'all'; end valid = {'all','maximum','minimum','largestfew','smallestfew', ... 'nearestneighbor','farthestneighbor'}; ind = find(strncmpi(params.Subset,valid,length(params.Subset))); if (length(ind)==1) params.Subset = valid{ind}; else error(['Invalid Subset: ',params.Subset]) end % was a limit provided? if ~ismember(params.Subset,{'all','nearestneighbor','farthestneighbor'}) && ... isempty(params.Limit) error('No limit provided, but a Subset that requires a limit value was specified') end % check the limit values for validity if length(params.Limit)>1 error('Limit must be scalar or empty') end switch params.Subset case {'largestfew', 'smallestfew'} % must be at least 1, and an integer if (params.Limit<1) || (round(params.Limit)~=params.Limit) error('Limit must be a positive integer for LargestFew or NearestFew') end end % was a legal Result provided? if isempty(params.Result) params.result = 'Array'; elseif ~ischar(params.Result) error('If provided, ''Result'' must be character or empty') end valid = {'array','structure'}; ind = find(strncmpi(params.Result,valid,length(params.Result))); if (length(ind)==1) params.Result = valid{ind}; else error(['Invalid Result: ',params.Subset]) end % check for the metric if isempty(params.Metric) params.Metric = 2; elseif (length(params.Metric)~=1) || ~ismember(params.Metric,[0 1 2 inf]) error('If supplied, ''Metric'' must be a scalar, and one of [0 1 2 inf]') end end % if nargin>2 % If Metric was given as 2, but the dimension is only 1, then it will % be slightly faster (and equivalent) to use the 1-norm Metric. if (dim == 1) && (params.Metric == 2) params.Metric = 1; end % Can we use bsxfun to compute the interpoint distances? % Older Matlab releases will not have bsxfun, but if it is % around, it will ne both faster and less of a memory hog. params.usebsxfun = (5==exist('bsxfun','builtin')); % check for dimension mismatch if 2 sets if (dataflag==2) && (size(data2,2)~=dim) error('If 2 point sets provided, then both must have the same number of columns') end % Total number of distances to compute, in case I must do it in batches if dataflag==1 n2 = n1; end ntotal = n1*n2; % FINALLY!!! Compute inter-point distances switch params.Subset case 'all' % The complete set of interpoint distances. There is no need % to break this into chunks, since we must return all distances. % If that is too much to compute in memory, then it will fail % anyway when we try to store the result. bsxfun will at least % do the computation efficiently. % One set or two? if dataflag == 1 d = distcomp(data1,data1,params); else d = distcomp(data1,data2,params); end % Must we return it as a struct? if params.Result(1) == 's' [rind,cind] = ndgrid(1:size(d,1),1:size(d,2)); ds.rowindex = rind(:); ds.columnindex = cind(:); ds.distance = d(:); d = ds; end case {'minimum' 'maximum'} % There is no reason to break this into pieces if the result % sill be filled in the end with +/- inf. Only break it up % if the final result is a struct. if ((ntotal*8)<=params.ChunkSize) || (params.Result(1) == 'a') % its small enough to do it all at once % One set or two? if dataflag == 1 d = distcomp(data1,data1,params); else d = distcomp(data1,data2,params); end % Must we return it as a struct? if params.Result(1) == 'a' % its an array, fill the unwanted distances with +/- inf if params.Subset(2) == 'i' % minimum d(d<=params.Limit) = -inf; else % maximum d(d>=params.Limit) = +inf; end else % a struct will be returned if params.Subset(2) == 'i' % minimum [dist.rowindex,dist.columnindex] = find(d>=params.Limit); else % maximum [dist.rowindex,dist.columnindex] = find(d<=params.Limit); end dist.distance = d(dist.rowindex + n1*(dist.columnindex-1)); d = dist; end else % we need to break this into chunks. This branch % will always return a struct. % this is the number of rows of data1 that we will % process at a time. bs = floor(params.ChunkSize/(8*n2)); bs = min(n1,max(1,bs)); % Accumulate the result into a cell array. Do it this % way because we don't know in advance how many elements % that we will find satisfying the minimum or maximum % limit specified. accum = cell(0,1); % now loop over the chunks batch = 1:bs; while ~isempty(batch) % One set or two? if dataflag == 1 dist = distcomp(data1(batch,:),data1,params); else dist = distcomp(data1(batch,:),data2,params); end % big or small as requested if ('i'==params.Subset(2)) % minimum value specified [I,J,V] = find(dist>=params.Limit); else % maximum limit [I,J] = find(dist<=params.Limit); I = I(:); J = J(:); V = dist(I + (J-1)*length(batch)); I = I + (batch(1)-1); end % and stuff them into the cell structure if ~isempty(V) accum{end+1,1} = [I,J,V(:)]; %#ok end % increment the batch batch = batch + bs; if batch(end)>n1 batch(batch>n1) = []; end end % convert the cells into one flat array accum = cell2mat(accum); if isempty(accum) d.rowindex = []; d.columnindex = []; d.distance = []; else % we found something % sort on the second column, to put them in a reasonable order accum = sortrows(accum,[2 1]); d.rowindex = accum(:,1); d.columnindex = accum(:,2); d.distance = accum(:,3); end end case {'smallestfew' 'largestfew'} % find the k smallest/largest distances. k is % given by params.Limit % if only 1 set, params.Limit must be less than n*(n-1)/2 if dataflag == 1 params.Limit = min(params.Limit,n1*(n1-1)/2); end % is this a large problem? if ((ntotal*8) <= params.ChunkSize) % small potatoes % One set or two? if dataflag == 1 dist = distcomp(data1,data1,params); % if only one data set, set the diagonal and % below that to +/- inf so we don't find it. temp = find(tril(ones(n1,n1),0)); if params.Subset(1) == 's' dist(temp) = inf; else dist(temp) = -inf; end else dist = distcomp(data1,data2,params); end % sort the distances to find those we need if ('s'==params.Subset(1)) % smallestfew [val,tags] = sort(dist(:),'ascend'); else % largestfew [val,tags] = sort(dist(:),'descend'); end val = val(1:params.Limit); tags = tags(1:params.Limit); % recover the row and column index from the linear % index returned by sort in tags. [d.rowindex,d.columnindex] = ind2sub([n1,size(dist,2)],tags); % create the matrix as a sparse one or a struct? if params.Result(1)=='a' % its an array, so make the array sparse. d = sparse(d.rowindex,d.columnindex,val,n1,size(dist,2)); else % a structure d.distance = val; end else % chunks % this is the number of rows of data1 that we will % process at a time. bs = floor(params.ChunkSize/(8*n2)); bs = min(n1,max(1,bs)); % We need to find the extreme cases. There are two possible % algorithms, depending on how many total elements we will % search for. % 1. Only a very few total elements. % 2. A relatively large number of total elements, forming % a significant fraction of the total set. % % Case #1 would suggest to retain params.Limit numberr of % elements from each batch, then at the end, sort them all % to find the best few. Case #2 will result in too many % elements to retain, so we must distinguish between these % alternatives. if (8*params.Limit*n1/bs) <= params.ChunkSize % params.Limit is small enough to fall into case #1. % Accumulate the result into a cell array. Do it this % way because we don't know in advance how many elements % that we will find satisfying the minimum or maximum % limit specified. accum = cell(0,1); % now loop over the chunks batch = (1:bs)'; while ~isempty(batch) % One set or two? if dataflag == 1 dist = distcomp(data1(batch,:),data1,params); k = find(tril(ones(length(batch),n2),batch(1)-1)); if ('s'==params.Subset(1)) dist(k) = inf; else dist(k) = -inf; end else dist = distcomp(data1(batch,:),data2,params); end % big or small as requested, keeping only the best % params.Limit number of elements if ('s'==params.Subset(1)) % minimum value specified [tags,tags] = sort(dist(:),1,'ascend'); %#ok tags = tags(1:bs); [I,J] = ndgrid(batch,1:n2); ijv = [I(tags),J(tags),dist(tags)]; else % maximum limit [tags,tags] = sort(dist(:),1,'descend'); %#ok tags = tags(1:bs); [I,J] = ndgrid(batch,1:n2); ijv = [I(tags),J(tags),dist(tags)]; end % and stuff them into the cell structure accum{end+1,1} = ijv; %#ok % increment the batch batch = batch + bs; if batch(end)>n1 batch(batch>n1) = []; end end % convert the cells into one flat array accum = cell2mat(accum); % keep only the params.Limit best of those singled out accum = sortrows(accum,3); if ('s'==params.Subset(1)) % minimum value specified accum = accum(1:params.Limit,:); else % minimum value specified accum = accum(end + 1 - (1:params.Limit),:); end d.rowindex = accum(:,1); d.columnindex = accum(:,2); d.distance = accum(:,3); % create the matrix as a sparse one or a struct? if params.Result(1)=='a' % its an array, so make the array sparse. d = sparse(d.rowindex,d.columnindex,d.distance,n1,size(dist,2)); end else % params.Limit forces us into the domain of case #2. % Here we cannot retain params.Limit elements from each chunk. % so we will grab each chunk and append it to the best elements % found so far, then filter out the best after each chunk is % done. This may be slower than we want, but its the only way. ijv = zeros(0,3); % loop over the chunks batch = (1:bs)'; while ~isempty(batch) % One set or two? if dataflag == 1 dist = distcomp(data1(batch,:),data1,params); k = find(tril(ones(length(batch),n2),batch(1)-1)); if ('s'==params.Subset(1)) dist(k) = inf; else dist(k) = -inf; end else dist = distcomp(data1(batch,:),data2,params); end [I,J] = ndgrid(batch,1:n2); ijv = [ijv;[I(:),J(:),dist(:)]]; %#ok % big or small as requested, keeping only the best % params.Limit number of elements if size(ijv,1) > params.Limit if ('s'==params.Subset(1)) % minimum value specified [tags,tags] = sort(ijv(:,3),1,'ascend'); %#ok else [tags,tags] = sort(ijv(:,3),1,'ascend'); %#ok end ijv = ijv(tags(1:params.Limit),:); end % increment the batch batch = batch + bs; if batch(end)>n1 batch(batch>n1) = []; end end % They are fully trimmed down. stuff a structure d.rowindex = ijv(:,1); d.columnindex = ijv(:,2); d.distance = ijv(:,3); % create the matrix as a sparse one or a struct? if params.Result(1)=='a' % its an array, so make the array sparse. d = sparse(d.rowindex,d.columnindex,d.distance,n1,size(dist,2)); end end end case {'nearestneighbor' 'farthestneighbor'} % find the closest/farthest neighbor for every point % is this a large problem? Or a 1-d problem? if dim == 1 % its a 1-d nearest/farthest neighbor problem. we can % special case these easily enough, and all the distance % metric options are the same in 1-d. % first split it into the farthest versus nearest cases. if params.Subset(1) == 'f' % farthest away % One set or two? if dataflag == 1 [d2min,minind] = min(data1); [d2max,maxind] = max(data1); else [d2min,minind] = min(data2); [d2max,maxind] = max(data2); end d.rowindex = (1:n1)'; d.columnindex = repmat(maxind,n1,1); d.distance = repmat(d2max,n1,1); % which endpoint was further away? k = abs((data1 - d2min)) >= abs((data1 - d2max)); if any(k) d.columnindex(k) = minind; d.distance(k) = d2min; end else % nearest. this is mainly a sort and some fussing around. d.rowindex = (1:n1)'; d.columnindex = ones(n1,1); d.distance = zeros(n1,1); % One set or two? if dataflag == 1 % if only one data point, then we are done if n1 == 2 % if exactly two data points, its trivial d.columnindex = [2 1]; d.distance = repmat(abs(diff(data1)),2,1); elseif n1>2 % at least three points. do a sort. [sorted_data,tags] = sort(data1); % handle the first and last points separately d.columnindex(tags(1)) = tags(2); d.distance(tags(1)) = sorted_data(2) - sorted_data(1); d.columnindex(tags(end)) = tags(end-1); d.distance(tags(end)) = sorted_data(end) - sorted_data(end-1); ind = (2:(n1-1))'; d1 = sorted_data(ind) - sorted_data(ind-1); d2 = sorted_data(ind+1) - sorted_data(ind); k = d1 < d2; d.distance(tags(ind(k))) = d1(k); d.columnindex(tags(ind(k))) = tags(ind(k)-1); k = ~k; d.distance(tags(ind(k))) = d2(k); d.columnindex(tags(ind(k))) = tags(ind(k)+1); end % if n1 == 2 else % Two sets of data. still really a sort and some fuss. if n2 == 1 % there is only one point in data2 d.distance = abs(data1 - data2); % d.columnindex is already set correctly else % At least two points in data2 % We need to sort all the data points together, but also % know which points from each set went where. ind12 and % bool12 will help keep track. ind12 = [1:n1,1:n2]'; bool12 = [zeros(n1,1);ones(n2,1)]; [sorted_data,tags] = sort([data1;data2]); ind12 = ind12(tags); bool12 = bool12(tags); % where did each point end up after the sort? loc1 = find(~bool12); loc2 = find(bool12); % for each point in data1, what is the (sorted) data2 % element which appears most nearly to the left of it? cs = cumsum(bool12); leftelement = cs(loc1); % any points which fell below the minimum element in data2 % will have a zero for the index of the element on their % left. fix this. leftelement = max(1,leftelement); % likewise, any point greater than the max in data2 will % have an n2 in left element. this too will be a problem % later, so fix it. leftelement = min(n2-1,leftelement); % distance to the left hand element dleft = abs(sorted_data(loc1) - sorted_data(loc2(leftelement))); dright = abs(sorted_data(loc1) - sorted_data(loc2(leftelement+1))); % find the points which are closer to the left element in data2 k = (dleft < dright); d.distance(ind12(loc1(k))) = dleft(k); d.columnindex(ind12(loc1(k))) = ind12(loc2(leftelement(k))); k = ~k; d.distance(ind12(loc1(k))) = dright(k); d.columnindex(ind12(loc1(k))) = ind12(loc2(leftelement(k)+1)); end % if n2 == 1 end % if dataflag == 1 end % if params.Subset(1) == 'f' % create the matrix as a sparse one or a struct? if params.Result(1)=='a' % its an array, so make the array sparse. d = sparse(d.rowindex,d.columnindex,d.distance,n1,n2); end elseif (ntotal>1000) && (((params.Metric == 0) && (params.Subset(1) == 'n')) || ... ((params.Metric == inf) && (params.Subset(1) == 'f'))) % nearest/farthest neighbour in n>1 dimensions, but for an % infinity norm metric. Reduce this to a sequence of % 1-d problems, each of which will be faster in general. % do this only if the problem is moderately large, since % we must overcome the extra overhead of the recursive % calls to ipdm. % do the first dimension if dataflag == 1 d = ipdm(data1(:,1),data1(:,1),'subset',params.Subset,'metric',params.Metric,'result','struct'); else d = ipdm(data1(:,1),data2(:,1),'subset',params.Subset,'metric',params.Metric,'result','struct'); end % its slightly different for nearest versus farthest here % now, loop over dimensions for i = 2:dim if dataflag == 1 di = ipdm(data1(:,i),data1(:,i),'subset',params.Subset,'metric',params.Metric,'result','struct'); else di = ipdm(data1(:,i),data2(:,i),'subset',params.Subset,'metric',params.Metric,'result','struct'); end % did any of the distances change? if params.Metric == 0 % the 0 norm, with nearest neighbour, so take the % smallest distance in any dimension. k = d.distance > di.distance; else % inf norm. so take the largest distance across dimensions k = d.distance < di.distance; end if any(k) d.distance(k) = di.distance(k); d.columnindex(k) = di.columnindex(k); end end % create the matrix as a sparse one or a struct? if params.Result(1)=='a' % its an array, so make the array sparse. d = sparse(d.rowindex,d.columnindex,d.distance,n1,n2); end elseif ((ntotal*8) <= params.ChunkSize) % None of the other special cases apply, so do it using brute % force for the small potatoes problem. % One set or two? if dataflag == 1 dist = distcomp(data1,data1,params); else dist = distcomp(data1,data2,params); end % if only one data set and if a nearest neighbor % problem, set the diagonal to +inf so we don't find it. if (dataflag==1) && (n1>1) && ('n'==params.Subset(1)) diagind = (1:n1) + (0:n1:(n1^2-1)); dist(diagind) = +inf; end if ('n'==params.Subset(1)) % nearest [val,j] = min(dist,[],2); else % farthest [val,j] = max(dist,[],2); end % create the matrix as a sparse one or a struct? if params.Result(1)=='a' % its an array, so make the array sparse. d = sparse((1:n1)',j,val,n1,size(dist,2)); else % a structure d.rowindex = (1:n1)'; d.columnindex = j; d.distance = val; end else % break it into chunks bs = floor(params.ChunkSize/(8*n2)); bs = min(n1,max(1,bs)); % pre-allocate the result d.rowindex = (1:n1)'; d.columnindex = zeros(n1,1); d.distance = zeros(n1,1); % now loop over the chunks batch = 1:bs; while ~isempty(batch) % One set or two? if dataflag == 1 dist = distcomp(data1(batch,:),data1,params); else dist = distcomp(data1(batch,:),data2,params); end % if only one data set and if a nearest neighbor % problem, set the diagonal to +inf so we don't find it. if (dataflag==1) && (n1>1) && ('n'==params.Subset(1)) diagind = 1:length(batch); diagind = diagind + (diagind-2+batch(1))*length(batch); dist(diagind) = +inf; end % big or small as requested if ('n'==params.Subset(1)) % nearest [val,j] = min(dist,[],2); else % farthest [val,j] = max(dist,[],2); end % and stuff them into the result structure d.columnindex(batch) = j; d.distance(batch) = val; % increment the batch batch = batch + bs; if batch(end)>n1 batch(batch>n1) = []; end end % did we need to return a struct or an array? if params.Result(1) == 'a' % an array. make it a sparse one d = sparse(d.rowindex,d.columnindex,d.distance,n1,n2); end end % if dim == 1 end % switch params.Subset % End of mainline % ====================================================== % begin subfunctions % ====================================================== function d = distcomp(set1,set2,params) % Subfunction to compute all distances between two sets of points dim = size(set1,2); % can we take advantage of bsxfun? % Note: in theory, there is no need to loop over the dimensions. We % could Just let bsxfun do ALL the work, then wrap a sum around the % outside. In practice, this tends to create large intermediate % arrays, especially in higher numbers of dimensions. Its also when % we might gain here by use of a vectorized code. This will only be % a serious gain when the number of points is relatively small and % the dimension is large. if params.usebsxfun % its a recent enough version of matlab that we can % use bsxfun at all. n1 = size(set1,1); n2 = size(set2,1); if (dim>1) && ((n1*n2*dim)<=params.ChunkSize) % its a small enough problem that we might gain by full % use of bsxfun switch params.Metric case 2 d = sum(bsxfun(@minus,reshape(set1,[n1,1,dim]),reshape(set2,[1,n2,dim])).^2,3); case 1 d = sum(abs(bsxfun(@minus,reshape(set1,[n1,1,dim]),reshape(set2,[1,n2,dim]))),3); case inf d = max(abs(bsxfun(@minus,reshape(set1,[n1,1,dim]),reshape(set2,[1,n2,dim]))),[],3); case 0 d = min(abs(bsxfun(@minus,reshape(set1,[n1,1,dim]),reshape(set2,[1,n2,dim]))),[],3); end else % too big, so that the ChunkSize will have been exceeded, or just 1-d if params.Metric == 2 d = bsxfun(@minus,set1(:,1),set2(:,1)').^2; else d = abs(bsxfun(@minus,set1(:,1),set2(:,1)')); end for i=2:dim switch params.Metric case 2 d = d + bsxfun(@minus,set1(:,i),set2(:,i)').^2; case 1 d = d + abs(bsxfun(@minus,set1(:,i),set2(:,i)')); case inf d = max(d,abs(bsxfun(@minus,set1(:,i),set2(:,i)'))); case 0 d = min(d,abs(bsxfun(@minus,set1(:,i),set2(:,i)'))); end end end else % Cannot use bsxfun. Sigh. Do things the hard (and slower) way. n1 = size(set1,1); n2 = size(set2,1); if params.Metric == 2 % Note: While some people might use a different Euclidean % norm computation based on expanding the square of the % difference of two numbers, that computation is inherantly % inaccurate when implemented in floating point arithmetic. % While it might be faster, I won't use it here. Sorry. d = (repmat(set1(:,1),1,n2) - repmat(set2(:,1)',n1,1)).^2; else d = abs(repmat(set1(:,1),1,n2) - repmat(set2(:,1)',n1,1)); end for i=2:dim switch params.Metric case 2 d = d + (repmat(set1(:,i),1,n2) - repmat(set2(:,i)',n1,1)).^2; case 1 d = d + abs(repmat(set1(:,i),1,n2) - repmat(set2(:,i)',n1,1)); case inf d = max(d,abs(repmat(set1(:,i),1,n2) - repmat(set2(:,i)',n1,1))); case 0 d = min(d,abs(repmat(set1(:,i),1,n2) - repmat(set2(:,i)',n1,1))); end end end % if 2 norm, then we must sqrt at the end if params.Metric==2 d = sqrt(d); end % ============================================================== % end main ipdm % begin included function - parse_pv_pairs % ============================================================== function params=parse_pv_pairs(params,pv_pairs) % parse_pv_pairs: parses sets of property value pairs, allows defaults % usage: params=parse_pv_pairs(default_params,pv_pairs) % % arguments: (input) % default_params - structure, with one field for every potential % property/value pair. Each field will contain the default % value for that property. If no default is supplied for a % given property, then that field must be empty. % % pv_array - cell array of property/value pairs. % Case is ignored when comparing properties to the list % of field names. Also, any unambiguous shortening of a % field/property name is allowed. % % arguments: (output) % params - parameter struct that reflects any updated property/value % pairs in the pv_array. % % Example usage: % First, set default values for the parameters. Assume we % have four parameters that we wish to use optionally in % the function examplefun. % % - 'viscosity', which will have a default value of 1 % - 'volume', which will default to 1 % - 'pie' - which will have default value 3.141592653589793 % - 'description' - a text field, left empty by default % % The first argument to examplefun is one which will always be % supplied. % % function examplefun(dummyarg1,varargin) % params.Viscosity = 1; % params.Volume = 1; % params.Pie = 3.141592653589793 % % params.Description = ''; % params=parse_pv_pairs(params,varargin); % params % % Use examplefun, overriding the defaults for 'pie', 'viscosity' % and 'description'. The 'volume' parameter is left at its default. % % examplefun(rand(10),'vis',10,'pie',3,'Description','Hello world') % % params = % Viscosity: 10 % Volume: 1 % Pie: 3 % Description: 'Hello world' % % Note that capitalization was ignored, and the property 'viscosity' % was truncated as supplied. Also note that the order the pairs were % supplied was arbitrary. npv = length(pv_pairs); n = npv/2; if n~=floor(n) error 'Property/value pairs must come in PAIRS.' end if n<=0 % just return the defaults return end if ~isstruct(params) error 'No structure for defaults was supplied' end % there was at least one pv pair. process any supplied propnames = fieldnames(params); lpropnames = lower(propnames); for i=1:n p_i = lower(pv_pairs{2*i-1}); v_i = pv_pairs{2*i}; ind = strmatch(p_i,lpropnames,'exact'); if isempty(ind) ind = find(strncmp(p_i,lpropnames,length(p_i))); if isempty(ind) error(['No matching property found for: ',pv_pairs{2*i-1}]) elseif length(ind)>1 error(['Ambiguous property name: ',pv_pairs{2*i-1}]) end end p_i = propnames{ind}; % override the corresponding default in params. % Use setfield for comptability issues with older releases. params = setfield(params,p_i,v_i); %#ok end