# Research Notes

## Table of Contents

- 1. Introduction
- 2. Basic Definitions
- 3. Analysis Techniques
- 4. Computational Problems
- 5. From Adaptive to Succinct
- 6. An Introduction to Instance Optimality
- 7. Intersection Algorithms
- 8. Succinct Data Structures
- 9. Compressed Data Structures
- 10. Deferred Data Structures
- 11. Succinct Indexes
- 12. Definition of \(o()\) asymptotics in dimension larger than \(1\)
- 13. From Space to Time
- 14. Adaptive Edit Distance

## 1 Introduction

In the field of Theoretical Computer Science, I am interested in refining the worst case analysis over instances of fixed size n by adding more parameters to it, in order to intuitively measure the "difficulty" of the instance.

- I explored this concept for
**algorithms**and**data structures**:- the
**adaptive**(analysis of)**algorithms**, - (input order oblivious)
**instance optimality**, **succinct**and**compressed data structures**,**succinct**and**compressed indexes**, and**Deferred Data Structures**for Rank and Select.

- the
- I applied those techniques in
**Information Retrieval**,**Compression**and**Computational geometry**:- answering
**conjunctive queries**in an indexed database, - answering labeled tree-pattern queries in labeled trees (e.g. an XML database),
- encoding
**binary relations**, - encoding labeled trees (e.g. an XML database); and
- computing the
**convex hull**in two or three dimensions, - computing the merging of convex hulls in two dimensions,
- encoding planar labeled graphs,
- computing the rectangular planar box of maximum score, where the score is the sum of the weights of the points covered by the box.

- answering

In the past I have also worked on

- Mathematical models of the co-evolution of species,
- Cryptography and Computer Security, and
- Pedagogy using collective tools.

I gather here some unsorted notes on those topics, fragments that I write here to use and reuse in articles and research proposals.

If you quote it, make sure to use my name, the website url, and the date at which you accessed it as the content could change over time.

## 2 Basic Definitions

- Algorithm
- a process or set of rules to be followed in calculations or other problem-solving operations, especially by a computer.
- Data Structure
- A \emph{Data Structure} \(\mathcal{D}\) (e.g., a run encoding of permutations~\cite{barbay09compressed}) specifies how to encode data from some \emph{Data Type} \(\cal T\) (e.g., permutations) so as to support the operators specified by a given \emph{Abstract Data Type} \(\abstractDataType\) (e.g., direct and inverse applications).
- Encoding
- An \emph{Encoding} is merely a data structure that supports \emph{access} to the data, either through queries specific to this data type (e.g., \(i\)-th value of a permutation) or through access to a binary representation of the object (e.g., the \(i\)-th block of \(\lg n\) bits of the integer array representing the permutation). The {\access} operator is any operator through which (after sufficient applications) the data object represented can be uniquely identified.
- Best/Worst/Average Analysis
- In computer science, best, worst and average cases of a given algorithm express what the resource usage is at least, at most and on average, respectively. Usually the resource being considered is running time, i.e. time complexity, but it could also be memory or other resources. http://en.wikipedia.org/wiki/Worst-case_analysis
- Index
- An \emph{Index} is a structure that, given access to some data structure \(\cal D\) supporting a defined abstract data type \(\abstractDataType\) (e.g., a data structure for bit-vectors supporting the {\access} operator), extends the set of operators supported to a more general abstract data type \(\abstractDataType'\) (e.g., {\rank} and {\select} operators). By analogy with succinct data structures, the space used by an index is called \emph{redundancy}.
- Succinct Index
- A \emph{Succinct Index}~\cite{barbay07succinct2} or \emph{Systematic Data Structure}~\cite{gal07cell} \(\cal I\) is simply an index whose redundancy is asymptotically negligible in comparison to the uncompressed baseline when the size \(n\) of the instance goes to infinity, that is, \(o(\lg f(n))\) bits. Although the concept of separating the encoding from the index was introduced to prove lower bounds on the trade-off between space and operation times of succinct data structures~\cite{golynski07optimal} (it had only been implicitly used before~\cite{sadakane06squeezing}), it has other advantages. In particular, the concept of succinct index is additive in the sense that, if \(\cal D\) is a succinct data structure, then the data structure formed by the union of \(\cal D\) and \(\cal I\) is a succinct data structure as well. This property permits the combination of succinct indices for distinct abstract data types on similar data types~\cite{barbay07succinct2}, and suggests the possibility of a library of succinct indices that can be combined for each particular application. For example, the discussed compressed data structure for bit-vectors \cite{raman02succinct} is indeed formed by a compressed encoding of the data using \(nH_0 + o(n)\) bits and offering access to \(\Oh(\lg n)\) consecutive bits, plus a succinct index using \(o(n)\) bits. The succinct index can be used alone if we have a different encoding of the bitmap offering the same functionality.
- Compressed Index
- A \emph{Compressed Index} for a compressibility measure \(\mu\) is an index whose redundancy is asymptotically negligible in comparison to the compressed baseline for \(\mu\) when \(n\) goes to infinity, that is, it uses \(o(\lg f(n,\mu))\) bits of space.

## 3 Analysis Techniques

### 3.1 A simple example of my work on adaptive algorithms

A simple example of my work on Adaptive Algorithms [TCS2013] is a variant of the `MergeSort`

algorithm based on `Huffman`

's code:

- given an unsorted array of numbers <1,3,5,7,9,2,6,4,0,8>
- cut it in "runs" already sorted <1,3,5,7,9>, <2,6>, <4>, <0,8>
- merge the "runs" two my two, always choosing the two smallest ones:
- In <1,3,5,7,9>, <2,6>, <4>, <0,8>
- two of the smallest pieces are <2,6> and <4>,
- once merged, they are replaced by <2,4,6>,

- Now the list of pieces is <1,3,5,7,9>, <0,8>, <2,4,6>
- the two smallest pieces are <0,8> and <2,4,6>
- once merged, they are replaced by <0,2,4,6,8>,

- Now the list of pieces is <1,3,5,7,9>, <0,2,4,6,8>,
- once merged, they yield the final result <0,1,2,3,4,5,6,7,8,9>

- In <1,3,5,7,9>, <2,6>, <4>, <0,8>

This is faster than the traditional `MergeSort`

when there are few runs, and faster than the previous `Runs Adaptive MergeSort`

when the size of those runs vary by large factors.

### 3.2 Motivations for Adaptive (Analysis of) Algorithms

- In certain cases
Traditionally the analysis of algorithms measures the complexity of a problem or algorithm in terms of the worst-case behavior over all inputs of a given size. However, in certain cases an improved algorithm can be obtained by considering a finer partition of the input space. This idea has been independently rediscovered in many areas, under the names of Output sensitive algorithms (Computational geometry), Parameterized Complexity (computational complexity), Adaptive Algorithms (algorithmic), instance optimal algorithms (database), online algorithms (algorithmic). We propose a general survey of the development of this concept in these different fields, in order to explore the impact and the limits of this technique.

- Finer partition
Traditionally the analysis of algorithms measures the complexity of a problem or algorithm in terms of the worst-case behavior over all inputs of a given size. However, in certain cases an improved algorithm can be obtained by considering a finer partition of the input space by a difficulty measure, sometime down to the instance. This finer partition is defined through a difficulty measure, which groups the instance by both their size and their difficulty, in order to refine the worst case analysis for both upper and lower bounds on the complexity of the instance.

- A reaction to overly pessimistic Worst case Analysis over instances of fixed sizes
Standard worst-case analysis of algorithms has often been criticized as overly pessimistic. As a remedy, some researchers have turned towards {\em adaptive\/} analysis where the execution cost of algorithms is measured as a function of not just the input size but other parameters that capture in some ways the difficulty of the input instance.

- Adaptive definition by reference to compression
As compression schemes take advantage of the "easiness" (in term of compressibility) of practical data as compared to the worst possible data (e.g. random), Adaptive Algorithms take advantage of the "easiness" (in term of computability) of practical instance as compared to the worst case complexity. The efficiency of such algorithms is measured by their Parameterized Complexity, a measure of computational complexity taking as parameter the size of the input (as in traditional computational complexity analysis) but also one or more measures of difficulty.

Note that the parameters measure the difficulty rather than the easiness so that to maintain the traditional view of computational complexity as a monotonely increasing function in the values of its parameters.

- Adaptive (Analysis of) Intersection/Union Algorithms
Some instances can be solved in less than linear time: for instance, deciding if one relation \(R_1\) holds a single element, the worst case complexity to join \(R_1\) with one extra relation \(R_2\) is at most logarithmic in the size of \(R_2\). Given the importance of the join operations, practical search engine use

**Adaptive Algorithms**which take advantage of easier instances. The complexity of such algorithms is still linear in \(N\) in the worst case over "difficult" instances of size \(N\), but much less on "easier" instances, where the "difficulty" of the instance is measured by an additional parameter (various measures of difficulty can be defined for the same problem, in the same way as various measures of input size can be defined). - Adaptive Optimality
Various measures of difficulty and corresponding algorithms have been defined for problems such as searching in a sorted array, sorting arrays, computing the convex hull, computing the intersection of sorted sets~\cite{2001-SODA-ALinearLowerBoundOnIndexSizeForTextRetrieval-DemaineLopezOrtiz,2001-ALENEX-ExperimentsOnAdaptiveSetIntersectionsForTextRetrievalSystems-DemaineLopezOrtizMunro,2008-TALG-AlternationAndRedundancyAnalysisOfTheIntersectionProblem-BarbayKenyon,2004-CPM-AFastSetIntersectionAlgorithmForSortedSequences-BaezaYates,2005-SPIRE-ExperimentalAnalysisOfAFastIntersectionAlgorithmForSortedSequences-BaezaYatesSalinger}, solving structured queries on XML data and computing hierarchies of intersection and unions of sorted arrays~\cite{2005-ICALP-WorstCaseOptimalUnionIntersectionExpressionEvaluation-ChiniForooshanFarzanMirzazadeh}, all using the following definitions and properties:

- An algorithm is "optimal" to a measure of difficulty \(d\) if its complexity in the worst case over instances of fixed size and difficulty as measured by \(d\): this implies the more general optimality in the worst case over instances of fixed size, but permits to distinguish between algorithms with the same general worst case complexity (as in the case of join queries mentionned above).
- For a given problem there can exists several measures of difficulty. A measure \(m\) can be reduced to another measure \(d\), in the sense that any algorithm optimal for \(m\) is also optimal for \(m\); and two measures \(m\) and \(d\) can be independant, in the sense that none can be reduced to the other (i.e. there are algorithms \(A_m\) and \(A_d\) respectively optimal for each measure, but also instances \(I_m\) and \(I_d\) where the respective complexities of each algorithm is not optimal).
- For any given problem, those reductions and independances relations define a partial order on the difficulty measures. One of the major objectives of research on Adaptive Algorithms is to find for each computational problem the measures of difficulty which admit an optimal algorithm and are not dominated by any other measure of difficulty.

- Instance Optimality
It is possible to define for specific problems some ultimate measure of difficulty, to which all others can be reduced. An even stronger notion of optimality is that of instance optimality, introduced by Fagin~\etal~\cite{2001-PODS-OptimalAggregationAlgorithmsForMiddleware-FaginLotemNaor} in 2001 when considering algorithm aggregating the ranks of \(m\) attributes via an arbitrary aggregation function. In their own words, "Intuitively, instance optimality corresponds to optimality in every instance, as opposed to just the worst case or the average case." More formally, given a class \(A\) of algorithms and a class \(D\) of databases and noting \(cost({\cal A,D})\) be the middleware cost incurred by running an algorithm \(\cal A\) over a database \(\cal D\), they define an algorithm \({\cal B}\) to be

**instance optimal**over \(A\) and \(D\) if \(B \in A\) and if for every \({\cal A}\in A\) and every \({\cal D}\in D\) we have \(cost({\cal B,D}) \in O(cost({\cal A,D}))\). For some other problems, lesser notions are defined: for instance for the convex hull problem, Afshani et al.\cite{instanceOptimalGeometricAlgorithmsFOCS} defined an algorithm to be*input order oblivious instance-optimal among \(\A\)*if for every instance \(I\) of size \(n\) and for every algorithm \(A'\) in a certain class \(\A\), the running time of \(A\) on input \(I\) is at most a constant factor times the running time of \(A'\) on the worst possible permutation of \(I\) for \(A'\). Similar properties can be defined on the average running time rather than the worst case running time. - comparing algorithms with same \(n\)-fixed worst case
Traditionally in the field of algorithms, the time efficiency of algorithms has been summarized by their performance in the worst case over instances of fixed size \(n\). Such a pessimistic approach, albeit usefull to compare algorithms with distinct worst case asymptotic performance, loses it usefulness when comparing algorithms which have the same asymptotic performance in the worst case in theory, yet very distinct performance in practice. In such cases, a solution is to refine the analysis by considering additional parameters such as a difficulty \(\delta\), and to compare the worst case complexity of algorithm over instances of fixed size \(n\) and fixed difficulty \(\delta\). Given a well chosen difficulty measure, such \emph{adaptive analysis} reconnects theory and practice by outlining the superiority of \emph{adaptive algorithms} over non-adaptive ones in practice, even when they have the same complexity over instances of fixed size.

- Gap with practice
The run-time analysis of an algorithm is typically stated in terms of its performance on its worst-case input (which may be different for each algorithm). In many practical cases, the input is far more well-behaved, and we would like to obtain an algorithm that performs provably better for this kind of input. For example, it is well-known that sorting a sequence of~\(n\) numbers requires~\(\log n!\) comparisons in the worst case. But, if the input is almost sorted, we could design algorithms that take asymptotically fewer comparisons. To take such effect in account, some researchers have turned towards \emph{adaptive analysis} where the execution cost of algorithms is measured as a function of not just the input size but other parameters that capture in some ways the difficulty of the input instance. This approach has been the subject of extensive work on problems such as sorting \cite{1976-IPL-AnAlmostOptimalAlgorithmForUnboundedSearching-BentleyYao,1980-ICALP-OptimalUnboundedSearchStrategies-RaoultVuillemin,1990-JCom-UnboundedSearchingAlgorithms-Beigel,1991-JCom-MoreNearlyOptimalAlgorithmsForUnboundedSearchingPartII-ReingoldShen,1991-JCom-MoreNearlyOptimalAlgorithmsForUnboundedSearchingPartI-ReingoldShen}, computing the intersection \cite{2009-JEA-AnExperimentalInvestigationOfSetIntersectionAlgorithmsForTextSearching-BarbayLopezOrtizLuSalinger,2008-TALG-AlternationAndRedundancyAnalysisOfTheIntersectionProblem-BarbayKenyon,2007-TCS-AdaptiveSearchingInSuccinctlyEncodedBinaryRelationsAndTreeStructuredDocuments-BarbayGolynskiMunroRao,2005-ICALP-WorstCaseOptimalUnionIntersectionExpressionEvaluation-ChiniForooshanFarzanMirzazadeh,2001-ALENEX-ExperimentsOnAdaptiveSetIntersectionsForTextRetrievalSystems-DemaineLopezOrtizMunro,2000-SODA-AdaptiveSetIntersectionsUnionsAndDifferences-DemaineLopezOrtizMunro} or the union \cite{2005-ICALP-WorstCaseOptimalUnionIntersectionExpressionEvaluation-ChiniForooshanFarzanMirzazadeh,2000-SODA-AdaptiveSetIntersectionsUnionsAndDifferences-DemaineLopezOrtizMunro} of sorted arrays, or sorting \cite{1992-ACMCS-ASurveyOfAdaptiveSortingAlgorithms-EstivillCastroWood,1995-DAM-AFrameworkForAdaptiveSorting-PeterssonMoffat,2013-TCS-CompressedRepresentationsOfPermutationsAndApplications-BarbayNavarro,2012-TCS-LRMTreesCompressedIndicesAdaptiveSortingAndCompressedPermutations-BarbayFischerNavarro,1994-IC-SortingShuffledMonotoneSequences-LevcopoulosPetersson,1985-TCom-MeasuresOfPresortednessAndOptimalSortingAlgorithms-Mannila,1979-CTCS-SortingPresortedFiles-Mehlhorn,1980-CACM-BestSortingAlgorithmForNearlySortedLists-CookKim,1958-InfAndC-SortingTreesAndMeasuresOfORder-Burge}.

### 3.3 List of Adaptive Analysis Techniques

- Standard worst-case analysis
- The standard analysis of algorithms, pioneered by Knuth\cite{1968-BOOK-TheArtOfComputerProgramming-Knuth}: it measures the performance of the algorithm in the worst case over instances of fixed size. As the volume of data procesed grows, the gap between the performance of the best algorithm "optimal in the worst case" and the best algorithm "in practice" grows as well, which reduce the interest of worst case theoretical analysis.
- Output Sensitivity
- The first technique of adaptive analysis of algorithms to be introduced in Computational geometry, by Kirkpatrick and Seidel \cite{1986-JCom-TheUltimatePlanarConvexHullAlgorithm-KirkpatrickSeidel}, it measures the complexity of the algorithm in the worst case over instances of fixed input and output size. This type of analysis was already implicitly performed for all operations on texts and databases which output size can vary from null to exponential in the size of the input.
- Input-Order Aware Adaptivity
- The only technique of adaptive analysis which can be performed for algorithms sorting permutations \cite{1992-ACMCS-ASurveyOfAdaptiveSortingAlgorithms-EstivillCastroWood,1995-DAM-AFrameworkForAdaptiveSorting-PeterssonMoffat,2013-TCS-CompressedRepresentationsOfPermutationsAndApplications-BarbayNavarro,2012-TCS-LRMTreesCompressedIndicesAdaptiveSortingAndCompressedPermutations-BarbayFischerNavarro,1994-IC-SortingShuffledMonotoneSequences-LevcopoulosPetersson,1985-TCom-MeasuresOfPresortednessAndOptimalSortingAlgorithms-Mannila,1979-CTCS-SortingPresortedFiles-Mehlhorn,1980-CACM-BestSortingAlgorithmForNearlySortedLists-CookKim,1958-InfAndC-SortingTreesAndMeasuresOfORder-Burge}, it measures the complexity of the algorithm in the worst case over all permutations of a given input. In the case of permutation sorting algorithms, the input is a permutation over \([1..n]\) and is simply described by \(n\).
- Input-Order Oblivious Adaptivity
- This analysis refers to algorithms which ignore the order of the input and focus on its content. While meaningless on permutations, Munro and Spira \cite{1976-JComp-SortingAndSearchingInMultisets-MunroSpira} showed how it does apply to sorting multisets, and Boyar and Favrholdt \cite{2007-TALG-TheRelativeWorstOrderRatioForOnlineAlgorithms-BoyarFavrholdt} introduced it as ``relative worst order ratio'' in the context of online algorithms.
- Instance Optimality
- Fagin et al. \cite{2001-PDS-OptimalAggregationAlgorithmsForMiddleWare-FaginLotemNaor,2003-JCSS-OptimalAggregationAlgorithmsForMiddleWare-FaginLotemNaor} introduced the concept of an instance-optimal algorithm such that its cost is at most a constant factor from the cost of any other algorithm running on the same input, for every input instance. For many problems, this requirement is too stringent, so we consider instead variants of instance optimality on restricted models.

### 3.4 Standard worst-case Analysis

The traditional analysis of algorithms "in the worst case" over instances of fixed size is, by definition, pessimistic. As the volume of data procesed grows, the gap between the performance of the best algorithm "optimal in the worst case" and the best algorithm "in practice" grows as well, which reduces the interest of worst case theoretical analysis.

- In Spanish: SPANISH
El análisis tradicional de algoritmos "en el peor caso" sobre instancias de tamaño fijo es, por definiciÃ³n, pesimista. A medida que el volumen de datos tratados crece, el espacio entre el rendimiento del mejor algoritmo "en el peor caso" y el mejor algoritmo "en la práctica" crece, lo que reduce el interés de un análisis teórico.

### 3.5 Output Sensitivity

### 3.6 Instance Optimality

(See also *An Introduction to Instance Optimality*)

- Definitions
- Instance Optimality
- An algorithm is instance-optimal if its cost is at most a constant factor from the cost of any other algorithm running on the same input, for
*every*input instance. For many problems, this requirement is too stringent, so we consider instead variants of instance optimality where we ignore the order in which the input elements are given: - Order-Oblivious Instance Optimality
- An algorithm is instance-optimal in the Order-Oblivious setting if its cost is at most a constant factor from the cost of any other algorithm running on the same input in the worst order possible, for
*every*input instance. This is similar to the relative worst order ratio introduced by Boyar and Favrholdt \cite{2007-TALG-TheRelativeWorstOrderRatioForOnlineAlgorithms-BoyarFavrholdt} in the ontext of online algorithms. - Instance-Optimality in the Random-Order Setting
- An algorithm is instance-optimal in the Random-Order setting if its cost is at most a constant factor from the average cost of any other algorithm running on the same input but given in a random order, for
*every*input instance and for*any*probabilistic distribution on permutations of the input order. This is similar to the random order ratio introduced by Claire (Kenyon) Mathieu \cite{1996-SODA-BestFitBinPackingWithRandomOrder-Kenyon} in the context of Online algorithms.

- Results
We list here various problems for which Instance optimal algorithms have been described, explicitly or not.

- Finding \(k\) top aggregate scores in databases
Fagin et al. \cite{2001-PDS-OptimalAggregationAlgorithmsForMiddleWare-FaginLotemNaor,2003-JCSS-OptimalAggregationAlgorithmsForMiddleWare-FaginLotemNaor} were the first to coin the term "insance optimality", in 2001. They described an instance optimal algorithm to find the \(k\) entries with the top aggregate scores in a database.

- Dynamic Optimality conjecture
See Demaine et al. \cite{2007-JCom-DynamicOptimalityAlmost-DemaineHarmonIaconoPatrascu} for a summary of the bibliography

- Union of \(k\) sorted sets
Demaine et al. \cite{2000-SODA-AdaptiveSetIntersectionsUnionsAndDifferences-DemaineLopezOrtizMunro} described an algorithm to compute the union of \(k\) sorted arrays, and analyzed its complexity in function of the size of the description of the union. Given a set of \(k\) sorted arrays, the descriptions of their union in the comparison model are all permutations of each other, with the same arguments in distinct orders, and

*verifying*a union certificate is almost as difficult as computing the union itself. As such, Demaine et al.'s algorithm is instance optimal in the comparison model.In the same article, Demaine et al. \cite{2000-SODA-AdaptiveSetIntersectionsUnionsAndDifferences-DemaineLopezOrtizMunro} described an algorithm to compute the intersection of \(k\) sorted arrays, and analyzed its complexity in function of (one component \(\cal G\) of) the size of a certificate (they call it "proof") of the intersection. They prove the optimality of their algorithm over instances of given size \(n\) and difficulty \(\cal G\) via a counting arguments (for a specific instance and a given certificate for it, one can generate an exponential number of distinct instances with certificates of the same length).

This intersection algorithm is

**not**instance optimal, and I think that no instance optimal algorithm can exist for the intersection problem:- certificates for a fixed instance are varying in sizes, and
- the order in which the arrays of the instance are given can hint to some certificate over some other.

For a fixed instance, an adversary algorithm could take advantage of such a hint and "beat" any general purpose intersection algorithm. I am wondering if one can consider the intersection algorithm to be input order oblivious instance optimal?

- Convex Hull in 2D and 3D (and other problems in Computational Geometry)
In collaboration with Peyman Afshani and Timothy Chan \cite{2009-FOCS-InstanceOptimalGeometricAlgorithms-AfshaniBarbayChan}, we proved the existence of an algorithm \(A\) for computing 2-d or 3-d convex hulls that is optimal for

*every point set*in the following sense: for every sequence \(S\) of \(n\) points and for every algorithm \(A'\) in a certain class \(A\), the maximum running time of \(A\) on input \(S\) is at most a constant factor times the running time of \(A'\) on the worst possible permutation of \(S\) for \(A'\). In fact, we can establish a stronger property: for every sequence \(S\) of points and every algorithm \(A'\), the maximum running time of algorithm \(A\) is at most a constant factor times the average running time of \(A'\) over all permutations of \(S\). We call algorithms satisfying these properties*instance-optimal*in the*Order-Oblivious*and*Random-Order*setting. Such instance-optimal algorithms simultaneously subsume output-sensitive algorithms and distribution-dependent average-case algorithms, and all algorithms that do not take advantage of the order of the input or that assume the input is given in a random order.The class \(A\) under consideration consists of all algorithms in a decision tree model where the tests involve only

*multilinear*functions with a constant number of arguments. To establish an instance-specific lower bound, we deviate from traditional Ben–Or-style proofs and adopt an interesting adversary argument. For 2-d convex hulls, we prove that a version of the well known algorithm by Kirkpatrick and Seidel (1986) or Chan, Snoeyink, and Yap (1995) already attains this lower bound. For 3-d convex hulls, we propose a new algorithm.We further obtain instance-optimal results for a few other standard problems in Computational geometry, such as maxima in 2-d and 3-d, orthogonal line segment intersection in 2-d, finding bichromatic \(L_\infty\)-close pairs in 2-d, off-line orthogonal range searching in 2-d, off-line dominance reporting in 2-d and 3-d, off-line halfspace range reporting in 2-d and 3-d, and off-line point location in 2-d.

See the article itself for more details.

- Finding \(k\) top aggregate scores in databases

### 3.7 Parameterized Complexity

- History
- initiated in 1990 by Rod Downey and Michael Fellows
- see their Monograph in 1999

- Definitions
- Parameterized Problem: fix a parameter \(k\)
- Fixed Parameter Tractable Algorithm: [FPT]
- exponential only in k
- polynomial in n

- Example
- Vertex Cover: Definition
- Given G=(V,E)
- find S⊂ V
- s.t. all edges have at least one vertex in S

- Algorithm
- O(kn+1.274
^{k}) where- kn is the checking time

- idea:
- knowing k,
- traverse a binary tree
- back track on at most k levels

- O(kn+1.274

- Vertex Cover: Definition
- More formally:
- Definition
- a
**parameterized problem**is a language L\subseteq Σ^{*}×\Naturals - it is FPT if "(x,k)∈ L?" can be decided in time \(f(k)|x|^{O(1)}\)

- a
- This view is by opposition to problems for which only a solution
in O(n
^{k}) is known. - We also distinguish problems for which the best solution known is
- in O(2
^{k}n) from problems for which it is - in O(2
^{k}+ n) (even though they are both FPT).

- in O(2

- Definition
- Other Examples
- \(\rho\)-SAT = (SAT,k)
- Definicion
- k = number of variables of X
- Parameter k

- Result
- There is an algorithm in time \(O(n^k n)\) for n clauses and k variables

- Definicion
- \(\rho\)-Independant Set
- Definition
- Given a graph \(G=(V,E)\) and a integer \(k\in\Naturals\)
- Parameter k
- Decide if G has an independant set of size k
- (where an
**independant set**is a set of pairwise non-adjacent vertices)

- (where an

- Result
- It is not known if it is FPT or not
- Conjecture to NOT be FPT

- Definition
- \(\rho\)-colorability
- Definition
- Given a graph G and an integer \(k\)
- Parameter k
- Decide \(k\) colorability

- Result
- is NOT fixed parameter tractable unless P=NP

- Definition

- \(\rho\)-SAT = (SAT,k)
- Techniques
- Lower Bound
- Class W[P] analogue of NP with reductions
- W hierarchy ∀∃∀…
- A hierarchy (?)

- Solving
- kernelisation
- automata theoretic method (chap 10)
- color coding (chap 13)

- Refining
- Bounded Fixed Parameter Tractability (chap 15-16)
- Rather than any f(k) in f(k) n
^{O(1)}consider f(k)∈ 2^{O(k)}- yield a new hierarchy [Fagin]

- Lower Bound

## 4 Computational Problems

Here is a list of problems that I have worked with, and their definitions, grouped by are. Note that although some of those problems are known to be reducible to others in terms of worst-case complexity, such reductions do not imply similar ones between the adaptive analysis of their complexity. For example, an instance-optimal algorithm for a problem does not imply an instance-optimal algorithm for a restriction of the problem in a subdomain, because in the latter case, we are competing against algorithms that have to be correct only for input from this subdomain.

### 4.1 Conjunctive Queries and variants

- Sorted Intersection
- Given \(k\) sorted arrays, compute their sorted intersection.
- Sorted Union
- Given \(k\) sorted arrays, compute their sorted union.
- Sorted t-Threshold Set
- Given a parameter \(t\) and \(k\) sorted arrays, compute the set of elements present in at least t arrays.
- Sorted OPT-Threshold Set
- Given \(k\) sorted arrays, compute the largest \(t\) such that the \(t\)-Threshold set is not empty.

### 4.2 Computational Geometry

Computational geometry consists in the design and analysis of algorithms for geometric computation. Computational geometry finds application in problems from solid modeling, CAD/CAM, computer graphics, molecular biology, data structuring, and robotics, as well as problems from discrete geometry and topology.

- Bichromatic \(L_\infty\)-close pairs in 2-d
- given a set \(S\) of \(n\) red/blue points in 2-d, report all pairs \((p,q)\) where \(p\) is red, \(q\) is blue, and \(p\) and \(q\) have \(L_\infty\)-distance at most 1, or count the number of such pairs.
- Convex Hull
- Given \(n\) points in \(d\) dimensions, find their convex hull, i.e. the smallest convex object containing all \(n\) points.
- Convex Hull Merging
- Given \(k\) convex hulls of respective sizes \(n_1,...,n_k\), compute the convex hull of the \(n_1+...+n_k\) points.
- Convex Hull queries
- Given a set of \(n\) points \(S\) and a point \(p\), a \emph{Convex Hull Point Query} for \(p\) in \(S\) is answered by either a subset of points of \(S\) forming a convex cell containing \(p\) or the statement that \(p\) is part of the convex hull.
- Maxima
- Given \(n\) points in \(d\) dimensions, find the maxima points, i.e. which are not dominated by any other point. A point \((x_1, ..., x_d)\) dominates a point \((y_1, ..., y_d)\) if \(x_i > y_i\) for all \(i\in[1..d]\).
- Maxima queries
- Given a set of \(n\) points \(S\) and a point \(p\), a \emph{Maximal Dominating Point Query} for \(p\) in \(S\) is answered by at least one point of the maximal dominating point set \(M\) of \(S\) which dominates \(p\) (this point is \(p\) itself if \(p\) is in \(M\)).
- Delaunay Triangulation
- Given \(n\) points, compute a triangulation of those points such that, for each triangle \(T\), no other point is contained in the circumscribed circle to \(T\).
- Delaunay Triangulation merging
- Given \(k\) Delaunay triangulations of respective sizes \(n_1,\ldots,n_k\), compute the Delaunay triangulation of the union of the set of points of each input triangulation.
- Klee's measure
- Given \(n\) rectangles in \(d\) dimensions, compute the volume covered by their union (counting only once their intersections).
- Off-line dominance reporting in 2-d and 3-d
- given a set \(S\) of red/blue points, report the subset of red points dominated by each blue point.
- Off-line halfspace range reporting in 2-d and 3-d
- given a set \(S\) of \(n\) points and halfspaces, report the subset of points inside each halfspace.
- Off-line orthogonal range searching in 2-d
- given a set \(S\) of \(n\) points and axis-aligned rectangles, report the subset of points inside each rectangle, or count the number of such points inside each rectangle.
- Off-line point location in 2-d
- given a set \(S\) of \(n\) points and a planar connected polygonal subdivision of size \(O(n)\), report the face in the subdivision containing each point.
- Orthogonal segment intersection in 2-d
- given a set \(S\) of \(n\) horizontal/vertical line segments, report all intersections between the horizontal and vertical segments, or count the number of such intersections.

### 4.3 Data Indexing

- Access
- Given a data \(S\) and a position \(i\), access its component at position \(i\).
- Select
- Given a string \(S\), a label \(a\) and a rank \(r\), returns the position of the \(r\) th occurence of label \(a\) in \(S\).
- Range Select
- Given a string \(S\), a range of labels \([a..b]\) and a rank \(r\), returns the position of the \(r\) th occurence of symbols falling in the range \([a..b]\) in \(S\).
- Rank
- Given a string \(S\), a label \(a\) and a position \(i\), count the number of occurences of \(a\) in \(S[1..i-1]\).
- Range Rank
- Given a string \(S\), a range of labels \([a..b]\) and a position \(i\), count the number of occurences of symbols falling in the range \([a..b]\) in \(S[1..i-1]\).

### 4.4 Permutation: From Sorting to Compressing

- Permutations
Permutations of the integers \([1..n] = \{1,\ldots,n\}\) are not only a fundamental mathematical structure, but also a basic building block for the succinct encoding of integer functions, strings, binary relations, and geometric grids, among others. A permutation \(\pi\) can be trivially encoded in \(n\lceil\lg n\rceil\) bits, which is within \(O(n)\) bits of the information theory lower bound of \(\lg(n!)\) bits, where \(\lg x=\log_2 x\) denotes the logarithm in base two.

- Adaptive Sorting of Permutations
The lower bound of \(\lg(n!)\) bits yields a lower bound of \(\Omega(n\lg n)\) comparisons to sort such a permutation in the comparison model, in the worst case over all permutations of \(n\) elements. Several sorting algorithms sort \(n\) elements in \(O(n\lg n)\) comparisons, hence achieving the optimal

*worst case*complexity.Yet, there are sorting algorithms (e.g.

`Merge Sort`

) which take advantage of specificities of each permutation to sort. Some examples are permutations composed of a few sorted blocks (e.g., \((\mathit{1},\mathit{3},\mathit{5},\mathit{7},\mathit{9},\mathbf{2},\mathbf{4},\mathbf{6},\mathbf{8},\mathbf{10})\) \((\mathit{6},\mathit{7},\mathit{8},\mathit{9},\mathit{10},\mathbf{1},\mathbf{2},\mathbf{3},\mathbf{4},\mathbf{5})\)), or permutations containing few sorted subsequences (e.g., \((\mathit{1},\mathbf{6},\mathit{2},\mathbf{7},\mathit{3},\mathbf{8},\mathit{4},\mathbf{9},\mathit{5},\mathbf{10})\)). Algorithms performing possibly much less than \(n\lg n\) comparisons on such permutations, yet still \(O(n\lg n)\) comparisons in the worst case, are achievable and preferable if those permutations arise with sufficient frequency. - Compressed Representations of Permutations
Each sorting algorithm in the comparison model yields an encoding scheme for permutations: the result of all comparisons performed uniquely identifies the permutation sorted, and hence encodes it. Since an adaptive sorting algorithm performs \(o(n\lg n)\) comparisons on a class of ``easy'' permutations, each adaptive algorithm yields a

*compression scheme*for permutations, at the cost of losing a constant factor on the complementary class of ``hard'' permutations.Furthermore, by cleverly arranging the bits corresponding to the comparison results performed by

`MergeSort`

, one can support efficiently the computation of the permutation, \(\pi(i)\), or \(\pi^{-1}(i)\) of the inverse permutation for an arbitrary value \(i\in[1..n]\). This yields compressed representations of permutations whose space and time to compute any \(\pi(i)\) and \(\pi^{-1}(j)\) are proportional to the*entropy*of the distribution of the sizes of the runs in \(\pi\).The development of this data structure for encoding permutations, in turn, yielded a new family of sorting algorithms, which in general refine the known complexities to sort those classes of permutations: While there exist sorting algorithms taking advantage of the number of runs of various kinds, ours take advantage of their size distribution and are strictly better (or equal, at worst).

- Conclusion
Many problems present a duality between time and space, process and product, adaptive execution and compressed representation. A similar relation exists between algorithms to

**search in sorted arrays**and**integer encodings**(e.g. sequential search corresponds to the unary code, binary search to the binary code and doubling search to gamma code), and more complex problems, such as the computation of the union or the intersection of sorted sets.

## 5 From Adaptive to Succinct

### 5.1 Short Description

There is a relation between the execution time of comparison based algorithms and the representation of the data it processes. In particular, the relation between wavelet trees for permutations and the mergesort algorithm [STACS] yielded a better encoding for permutations. We propose to explore the application of this concept to other sorting algorithms, and other data structures on permutations; and to extend this approach to other forms of data.

### 5.2 List Description

An arbitrary integer \(n\) can be encoded in \(2\lg n\) bits, the same way than an adaptive algorithm find the element \(x\) in position \(p\) of a sorted array in \(2\lg p\) comparisons. As illustrated by the relation between search algorithms and various codes for integer [anAlmostOptimalAlgorithmForUnboundedSearching], there is a relation between the execution time of some algorithms, the representation of some related data, and the support of some operators on this data, examplified in the case of the relation between wavelet trees for permutations and the mergesort algorithm [STACS].

We propose

- to explore further this link, in particular in the context of Adaptive Algorithms and compressed succinct data-structures and compressed encodings, and to check its practicality on real-world problems.
- to extend this link to other problems and abstract data-types:
- Integer encoding vs Sorted Search
- Example:
- gamma codes and
- doubling search

- Example:
- Dynamic Search and string compression
- Example:
- Move To Front is used with the Burows Wheeler's transform to compress
- Can Splay Trees and variants do the same, or better, and
**for which measure of compressability**?

- Example:
- Set encoding and Set Merging
- Example:
- compression of bit vectors (and support of rank and select on them)
- adaptive k-merge [DLM00], other 2-merge algorithms

- Example:
- Permutation encoding and sorting
- Example:
- wavelet tree on permutations
- (adaptive) merge sort

- Example:
- Multisets, Strings and Functions encoding and sorting
- Examples:
- wavelet trees on
- strings to support rank and select and compress taking advantage of frequenceis
- functions to support f
^{-1}and compress taking advantage of the number and interleaved intervals of pitchs

- multiset sorting:
- adaptive merge sort
- input order oblivious instance optimal sorting of multisets [MunroSpira]
- real instance optimal sorting of multisets

- wavelet trees on

- Examples:
- Binary Relations, and Travelling Salesman or other.
- Example:
- Wavelet Tree for binary relations between objects and labels
- Travelling Salesman Problem finds the best order on objects and labels in order to compress better the binary relation.

- Example:

- Integer encoding vs Sorted Search
- Eventually, we might consider also
- the compression and computation of Posets, and Planar Triangulations (note that in those case, the Adaptive Algorithms are those computing the compressed representation, which is a more basic case of the relation between time complexity and space, called output sensitivity).
- The relation between Fixed Parameter Tractability (i.e. there
is an algorithm for it in O(2
^{k}f(n))) and the compression of graphs (and hence of binary relations). -> file:///home/jbarbay/Organisation/projects.org#*Graph Compression from Kernelisation algorithm

### 5.3 Bibliography

@article{anAlmostOptimalAlgorithmForUnboundedSearching, author = {Jon Louis Bentley and Andrew Chi-Chih Yao}, title = {An almost optimal algorithm for unbounded searching}, journal = {Information processing letters}, year = {1976}, OPTkey = {}, volume = {5}, number = {3}, pages = {82–87}, OPTmonth = {}, OPTnote = {}, OPTannote = {} } @Comment J. Ian Munro, Philip M. Spira: Sorting and Searching in Multisets. SIAM J. @Comment Comput. 5(1): 1-8 (1976) @Comment MunroSpira" @article{MunroSpira, author = {J. Ian Munro and Philip M. Spira}, title = {Sorting and Searching in Multisets}, journal = sjc, volume = {5}, number = {1}, year = {1976}, pages = {1–8}, bibsource = {DBLP, http://dblp.uni-trier.de} }

- Dynamic optimality conjecture

- Parameterized complexity and Compression

- Serges suggested:
I recommend to look at the lecture notes about "Kernel : Lower and upper bounds" of Daniel Lokshtanov and Saket Saurabh here: http://www-sop.inria.fr/mascotte/seminaires/AGAPE/notes There are lecture notes on other topics in exact algorithms and Parameterized Complexity and notes of the open question session under Mike's name.

- Serges suggested:

## 6 An Introduction to Instance Optimality

### 6.1 Introduction

Standard worst-case analysis of algorithms has often been criticized as overly pessimistic. As a remedy, some researchers have turned towards *adaptive* analysis where the execution cost of algorithms is measured as a function of not just the input size but other parameters that capture in some ways the difficulty of the input instance. Various measures of difficulty and corresponding algorithms have been defined for problems as diverse as searching in a sorted array \cite{1976-IPL-AnAlmostOptimalAlgorithmForUnboundedSearching-BentleyYao}, sorting arrays \cite{1992-ACJ-AnOverviewOfAdaptiveSorting-MoffatPetersson}, computing the convex hull \cite{1986-JCom-TheUltimatePlanarConvexHullAlgorithm-KirkpatrickSeidel}, computing the intersection of sorted sets \cite{dlm,dlmAlenex,alternationAndRedundancyAnalysisOfTheIntersectionProblem,aFastSetIntersectionAlgorithmForSortedSequences,experimentalAnalysisOfAFastIntersectionAlgorithmForSortedSequences}, solving structured queries on XML \cite{2007-TCS-AdaptiveSearchingInSuccinctlyEncodedBinaryRelationsAndTreeStructuredDocuments-BarbayGolynskiMunroRao} data and computing hierarchies of intersection and unions of sorted arrays \cite{worstCaseOptimalUnionIntersectionExpressionEvaluation}. Those works use similar definitions of the adaptive optimality of an algorithm for a given problem and measure of difficulty, which implies the more general optimality in the worst case over instances of fixed size, but permits to distinguish between algorithms with the same general worst case complexity. In the case of sorting, where many measures of difficulty have been defined, Manilla \cite{1985-TCom-MeasuresOfPresortednessAndOptimalSortingAlgorithms-Mannila} defined a preorder on those which allows them to be compared. One of the major objectives of research on Adaptive Algorithms is to find for each computational problem the measures of difficulty which admit an optimal algorithm and are not dominated by any other measure of difficulty.

An even stronger notion of optimality than adaptivity is that of *instance optimality*, introduced by Fagin et al. \cite{2001-PODS-OptimalAggregationAlgorithmsForMiddleware-FaginLotemNaor} in 2001 when considering algorithm aggregating the ranks of \(m\) attributes via an arbitrary aggregation function. In their own words, "Intuitively, instance optimality corresponds to optimality in every instance, as opposed to just the worst case or the average case." More formally, given a class \(A\) of algorithms and a class \(D\) of databases and noting \(cost({\cal A,D})\) be the middleware cost incurred by running an algorithm \(\cal A\) over a database \(\cal D\), they define an algorithm \({\cal B}\) to be **instance optimal** over \(A\) and \(D\) if \(B \in A\) and if for every \({\cal A}\in A\) and every \({\cal D}\in D\) we have \(cost({\cal B,D}) \in O(cost({\cal A,D}))\).

We aim to describe the sequence of analysis techniques which yield instance optimality, and to list the problems which have been shown to admit instance optimal results in some settings, with the hope that this will help to generate further such results. As a case study to demonstrate the techniques used to prove instance optimality, we analyze the computation of the maxima in 2 dimensions according to the *standard worst-case* analysis over instances of fixed size \(n\), the *output sensitive* analysis, the *vertical entropy* analysis, and the *input order oblivious instance optimal* analysis, and discuss the difficulties of a *input order aware instance optimal* result. For our survey of the related problems, we merely give the definition of the problem and summarize the results with a uniformized terminology: the variants of the concept were several time reinvented, with disctinct names. Among other we describe problems such as the support of Dynamic Ordered Dictionaries, Aggregation Algorithms, Bin Packing, the computation of the Union and Intersection of sorted arrays, the merging of convex hulls, the computation of the maxima in 3 dimensions, the computation of the convex hull in 2 and 3 dimensions, various reporting and counting problems in \(2\) dimensions, and the approximation of the point-curve distance.

### 6.2 Analysis Techniques

- Worst Case Analysis
For any algorithm \(A\) computing the intersection of sorted arrays in the comparison model, given any positive integers \(k\) and \(N\), there is an instance of \(k\) sorted sets, of \(N\) elements each, such that \(A\) performs \(kN\) comparisons, a lower bound easily achieved by a greedy algorithm searching sequentially for the elements of the smallest array in the other arrays: the computational complexity of the intersection problem in the worst case over instances of signature \((k,N)\) is \(\theta(kN)\), that is, linear in the size of the input [DLM00].

- Output Sensitive Analysis
(…)

- Adaptive (Analysis of) Algorithms
On the other hand, some instances can be solved in less than linear time: for instance, deciding if one relation \(R_1\) holds a single element, the worst case complexity to join \(R_1\) with one extra relation \(R_2\) is at most logarithmic in the size of \(R_2\). Given the importance of the join operations, practical search engine use

**Adaptive Algorithms**which take advantage of easier instances. The complexity of such algorithms is still linear in \(N\) in the worst case over "difficult" instances of size \(N\), but much less on "easier" instances, where the "difficulty" of the instance is measured by an additional parameter (various measures of difficulty can be defined for the same problem, in the same way as various measures of input size can be defined).The authors, inspired by one of the works from Demaine, Lopez-Ortiz and Munro [DLM00] on the intersection of sorted sets, propose a

**new measure of difficulty**(the number of "holes") and some**algorithms adaptive**to this measure, for the resolution of a particular case of join queries. - Adaptive Optimality
Various measures of difficulty and corresponding algorithms have been defined for problems such as searching in a sorted array, sorting arrays, computing the convex hull, computing the intersection of sorted sets \cite{dlm,dlmAlenex,alternationAndRedundancyAnalysisOfTheIntersectionProblem,aFastSetIntersectionAlgorithmForSortedSequences,experimentalAnalysisOfAFastIntersectionAlgorithmForSortedSequences}, solving structured queries on XML data and computing hierarchies of intersection and unions of sorted arrays \cite{worstCaseOptimalUnionIntersectionExpressionEvaluation}, all using the following definitions and properties:

- An algorithm is "optimal" to a measure of difficulty \(d\) if its complexity in the worst case over instances of fixed size and difficulty as measured by \(d\): this implies the more general optimality in the worst case over instances of fixed size, but permits to distinguish between algorithms with the same general worst case complexity (as in the case of join queries mentionned above).
- For a given problem there can exists several measures of difficulty. A measure \(m\) can be reduced to another measure \(d\), in the sense that any algorithm optimal for \(m\) is also optimal for \(m\); and two measures \(m\) and \(d\) can be independant, in the sense that none can be reduced to the other (i.e. there are algorithms \(A_m\) and \(A_d\) respectively optimal for each measure, but also instances \(I_m\) and \(I_d\) where the respective complexities of each algorithm is not optimal).
- For any given problem, those reduction and independance relations define a partial order on the difficulty measures. One of the major objectives of research on Adaptive Algorithms is to find for each computational problem the measures of difficulty which admit an optimal algorithm and are not dominated by any other measure of difficulty.

- Instance Optimality
It is possible to define for specific problems some ultimate measure of difficulty, to which all others can be reduced. An even stronger notion of optimality is that of instance optimality, introduced by Fagin et al. \cite{2001-PODS-OptimalAggregationAlgorithmsForMiddleware-FaginLotemNaor} in 2001 when considering algorithm aggregating the ranks of \(m\) attributes via an arbitrary aggregation function. In their own words, "Intuitively, instance optimality corresponds to optimality in every instance, as opposed to just the worst case or the average case." More formally, given a class \(A\) of algorithms and a class \(D\) of databases and noting \(cost({\cal A,D})\) be the middleware cost incurred by running an algorithm \(\cal A\) over a database \(\cal D\), they define an algorithm \({\cal B}\) to be

**instance optimal**over \(A\) and \(D\) if \(B \in A\) and if for every \({\cal A}\in A\) and every \({\cal D}\in D\) we have \(cost({\cal B,D}) \in O(cost({\cal A,D}))\).For some other problems, lesser notions are defined: for instance for the convex hull problem, Afshani et al. \cite{instanceOptimalGeometricAlgorithmsFOCS} defined an algorithm to be

*input order oblivious instance-optimal among \({\cal A}\)*if for every instance \(I\) of size \(n\) and for every algorithm \(A'\) in a certain class \({\cal A}\), the running time of \(A\) on input \(I\) is at most a constant factor times the running time of \(A'\) on the worst possible permutation of \(I\) for \(A'\). Similar properties can be defined on the average running time rather than the worst case running time.

### 6.3 Problems

- Dynamic Ordered Dictionaries
For example, the well known ``dynamic optimality conjecture'' is about instance optimality concerning algorithms for manipulating binary search trees (see \cite{DemaineHarmonSODA09} for the latest in a series of papers).

- Aggregation Algorithms
Fagin et al. \cite{2001-PODS-OptimalAggregationAlgorithmsForMiddleware-FaginLotemNaor}

- Boyar and Favrholdt (order-oblivious instance optimality)
under the name of

*relative worst order ratio*\cite{BoyarFavrholdtTALG07} - Bin Packing
Kenyon \cite{KenyonSODA96} analyzed the algorithm

`BestFit`

and proved that it is Random-Order instance optimal through a property of the`random order ratio`

. - Union of Sorted arrays
Demaine, L\'opez-Ortiz, and Munro \cite{2000-SODA-AdaptiveSetIntersectionsUnionsAndDifferences-DemaineLopezOrtizMunro} studied the problem of computing the union of \(k\) sorted sets and gave instance-optimal results for any \(k\)

- Union of planar Convex Hulls
Barbay and Chen \cite{BarbayChen} extended Demaine et al.'s results \cite{2000-SODA-AdaptiveSetIntersectionsUnionsAndDifferences-DemaineLopezOrtizMunro} to computing the convex hulls of \(k\) convex polygons in 2-d.

- Intersection of Sorted arrays
- Optimality for constant \(k\)
Demaine, L\'opez-Ortiz, and Munro \cite{2000-SODA-AdaptiveSetIntersectionsUnionsAndDifferences-DemaineLopezOrtizMunro} studied the problem of computing the intersection of \(k\) sorted sets and gave instance-optimal results for constant \(k\) for intersection in the comparison model.

- Non Instance Optimality for \(k\) non constant:
It has been claimed that the intersection algorithm from Demaine, Lopez-Ortiz and Munro \cite{2000-SODA-AdaptiveSetIntersectionsUnionsAndDifferences-DemaineLopezOrtizMunro} is instance-optimal for non-constant values of \(k\) (a result not claimed by Demaine et al.), for a relaxation of the definition of instance-optimality to complexities within polylogarithmic factors of each other, but it's not true: we show below that the complexity of Demaine et al.'s algorithm can be within a factor of more than the number of sorted arrays intersected, from the coplexity of another algorithm, and hence that it is not instance-optimal, even by this relaxed definition.

- PROOF
Given an instance \(I\) of certificate \(C\) and gap complexity \(G\) (where Demaine et al.'s algorithm performs up to \(kG\) comparisons), there is an algorithm which encodes \(C\) in its program and solves this particular instances in less than \(G\) comparisons, which results in a ratio superior to \(k\).

- For instance
- Consider the following instance, of \(k=5\) sorted arrays
A

_{1}= 0,1,2,3,4,10,11,12,13,14 A_{2}= 0,2,4,6,8,10,12,14,16,18 A_{3}= 0,2,4,6,8,10,12,14,16,18 A_{4}= 0,2,4,6,8,10,12,14,16,18 A_{5}= 5,6,7,8,9,15,16,17,18,19- A valid certificate of the emptyness of the intersection of those
arrays (the values are given only for clarity, and the first
element of an array \(A\) is \(A[0]\)) is
- A
_{1}^{1}=4 < A_{5}^{2}=5, - A
_{5}^{1}=9 < A_{1}^{3}=10, - A
_{1}^{4}=14 < A_{5}^{3}=15.

- A
- A more visual representation of the instance is
A
_{1}= 0,1,2,3,4, 10,11,12,13,14 A_{2}= 0, 2, 4, 6, 8, 10, 12, 14, 16, 18 A_{3}= 0, 2, 4, 6, 8, 10, 12, 14, 16, 18 A_{4}= 0, 2, 4, 6, 8, 10, 12, 14, 16, 18 A_{5}= 5,6,7,8,9, 15,16,17,18,19 - The certificate can then be represented as a path through the
instance, from left to right, which indicates for each element the
index of at least one array which does not contain it:
A

_{1}= 0,1,2,3,4,**********10,11,12,13,14*************** A_{2}= 0, 2, 4, 6, 8, 10, 12, 14, 16, 18 A_{3}= 0, 2, 4, 6, 8, 10, 12, 14, 16, 18 A_{4}= 0, 2, 4, 6, 8, 10, 12, 14, 16, 18 A_{5}= **********5,6,7,8,9,***************15,16,17,18,19

- A valid certificate of the emptyness of the intersection of those
arrays (the values are given only for clarity, and the first
element of an array \(A\) is \(A[0]\)) is

- Consider the following instance, of \(k=5\) sorted arrays
- An algorithm (specially designed for this instance, which is allowed by the definition of instance optimality) can hence solve this instance in 3 comparisons, while
- Demaine et al.'s algorithm will
- perform 4 comparisons initially to identify \(A_5[1]=5\) as the maximum candidate element of the intersection, and much more later to
- "gallop" in all sets in parallel for the successor of this element (first \(6\), then \(10\)), and then
- "gallop" in all sets in parallel for the successor (\(15\)) of the next candidate (\(10\)), till
- concluding that the intersection is empty since \(A_1\) does not contain any element larger or equal to \(15\).

- Demaine et al.'s algorithm will

- For instance

- PROOF

- Optimality for constant \(k\)
- Maxima in 2 or 3 dimensions ABC
\cite{2009-FOCS-InstanceOptimalGeometricAlgorithms-AfshaniBarbayChan}

- Convex Hull in 2 and 3 dimensions ABC
\cite{2009-FOCS-InstanceOptimalGeometricAlgorithms-AfshaniBarbayChan}

- reporting/counting intersection between horizontal and vertical line segments in 2-d, ABC
- reporting/counting pairs of \(L_\infty\)-distance at most 1 between a red point set and a blue point set in 2-d, ABC
- off-line orthogonal range reporting/counting in 2-d, ABC
- off-line dominating reporting in 2-d and 3-d, ABC
- off-line halfspace range reporting in 2-d and 3-d, ABC
- off-line point location in 2-d. ABC
`join`

OperatorOne of the most important operation for solving queries on relational databases is the join operation between relations. Such computation occurs for instance when searching for a document (e.g. webpages) matching \(k\) keywords, where the search engine (e.g. Google) indexes the document by precomputing for each keyword a sorted list of documents matching it. A particular case of joint operation reduces to the computation of the intersection between sets represented by sorted arrays. In its more general form, the relations to be joined are still represented by sorted arrays, but not necessarily sorted in the same order.

\cite{2013-ARXIV-TowardInstanceOptimalJoinAlgorithmsForDataInIndexes-NgoNguyenReRudra}

The authors generalize the queries corresponding to the intersection of sorted arrays to more complex particular cases of join queries, namely

- Hierarchical Queries;
- Bow-Tie Queries; and
- Acyclic Queries (already introduced by various authors, starting with Yannakakis in 1981).

- Approximating the Distance of a point to a curve under black box model
Baran and Demaine \cite{BaranDemaineIJCGA05} addressed an approximation problem about computing the distance of a point to a curve under a certain black-box model.

### 6.4 Discussion

- Non-applicability
Not all problems admit instance optimal results in the Order-Oblivious setting: a trivial example is the search for a single element in a non-sorted array of \(n\) distinct values, verifiable in constant time while requiring \(n\) comparisons for any algorith. We list below some more sophisiticated counter-examples:

- Computing the Voronoi diagram of \(n\) points or the trapezoidal decomposition of \(n\) disjoint line segments, both having \(\Theta(n)\) sizes, requires \(\Omega(n\log n)\) time for every point set by the naive information-theoretical argument.
- Computing the (\(L_\infty\)-)closest pair for a {\em monochromatic\/} point set requires \(\Omega(n\log n)\) time for every point set by our adversary lower-bound argument.

- vs Competitive Analysis
The concept of instance optimality resembles competitive analysis of on-line algorithms. In fact, in the on-line algorithms literature, our Order-Oblivious setting of instance optimality is related to what Boyar and Favrholdt called the

*relative worst order ratio*\cite{BoyarFavrholdtTALG07}, and our Random-Order setting is related to Kenyon's*random order ratio*\cite{KenyonSODA96}. What makes instance optimality more intriguing is that we are not bounding the objective function of an optimization problem but the cost of an algorithm. - vs Parameterized Complexity
- vs Compression
In 1975, Elias \cite{1975-TIT-UniversalCodewordSetsAndRepresentationsOfTheIntegers-Elias}defined the property of "universal codes" as follows (…) In particular, the \(\gamma\) code is not universally optimal while the \(\delta\) code is.

In 1976\cite{1976-IPL-AnAlmostOptimalAlgorithmForUnboundedSearching-BentleyYao} Bentley and Yao described a family of Adaptive Algorithms inspired by Elias's codes. They did not discuss what distinguishes the \(\gamma\) search algorithm from the \(\delta\) search algorithm (and the rest of the family of search algorithms).

Could \(\delta\)-search be \(1\)-instance optimal?

## 7 Intersection Algorithms

### 7.1 "A survey of Intersection Algorithms, their performances and their applications"

- Introduction
- Problem Definition
- Worst case complexity \(\Theta(n)\) useless
- Other analysis: experimental, on average, adaptive
- Important Relation to other problems
- join operator in relational databases, and XPath operators in XML databases.
- indexed Pattern matching, (?)
- convex hull (in low and high dimensions)
- threshold set and union
- recursive combination of Union and Intersection [Farzan et al]

- Objectives of this survey
- Content:
- a list of relevant intersection algorithms
- summary of results completed with pointers to the relevant papers
- summary of applications and links to other problems

- People potentially interested
- someone who wants to develop a new intersection algorithm
- someone who needs an intersection algorithm for another task

- Content:

- Problem Definition
- Algorithms
- Hwang and Lin algorithm [1971,1972] OFFICE READING
- SvS and Swapping SvS [DLM2001]
{\SvS} is a straightforward algorithm widely used, which intersects the sets two at a time in increasing order by size, starting with the two smallest % (see Algorithm~\ref{alg:svs}). % It performs a binary search to determine if an element in the first set appears in the second set. % We also consider variants of it which replace the binary search with various other searches.

Demaine~{\etal} considered the variant {\SwSvS}, where the searched element is picked from the set with the least remaining elements, instead of the first (initially smallest) set in {\SvS}. % This algorithm was first proposed by Hwang~\etal~\cite{optimalMergingOf2ElementsWithNElements}: it performs better when the size of the second set is substantially reduced after a search although experiments show that this does not happen often.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % ENGLISH STYLE ALGORITHM DESCRIPTION % FOR SvS %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{algorithm}[t] \({\SvS}(\idtt{set},\idtt{k})\) \begin{algorithmic}[1] \STATE Sort the sets by size (\(|set[0]|\leq |set[1]|\leq \ldots \leq |set[k]|\)). \STATE Let the smallest set \(set[0]\) be the candidate answer set. \STATE {\bf for} each set{ \(S\) from \(set\) \bf do} initialize \(\ell[S]=0\). \FOR{each set \(S\) from \(set\) } \FOR{each element \(e\) in the candidate answer set} \STATE search for \(e\) in \(S\) in the range \(\ell[S]\) to \(|S|\), \STATE and update \(\ell[S]\) to the rank of \(e\) in \(S\). \IF{ \(e\) was not found} \STATE remove \(e\) from candidate answer set, \STATE and advance \(e\) to the next element in the answer set. \ENDIF \ENDFOR \ENDFOR \end{algorithmic} \caption{Pseudo-code for {\SvS}}\label{t:svs}\label{alg:svs} \end{algorithm}%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

- Adaptive [DLM2000]
- Small Adaptive [DLM2001]
{\SmallAdaptive} is a hybrid algorithm, which combines the best properties of {\SvS} and {\Adaptive} (see Algorithm~\ref{alg:sadap}). % For each element in the smallest set, it performs a galloping search on the second smallest set. % If a common element is found, a new search is performed in the remaining \(k-2\) sets to determine if the element is indeed in the intersection of all sets, otherwise a new search is performed. % Observe that the algorithm computes the intersection from left to right, producing the answer in increasing order. After each step, each set has an already examined range and an unexamined range. {\SmallAdaptive} selects the two sets with the smallest unexamined range and repeats the process described above until there is a set that has been fully examined.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % ENGLISH STYLE ALGORITHM DESCRIPTION % FOR SMALL ADAPTIVE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{algorithm} \({\SmallAdaptive}(\idtt{set},\idtt{k})\) \begin{algorithmic}[1] \WHILE{no set is empty} \STATE Sort the sets by increasing number of remaining elements. \STATE Pick an eliminator \(e=\idtt{set}[0][0]\) from the smallest set. \STATE \(\idtt{elimset} \leftarrow 1\). \REPEAT \STATE search for \(e\) in \(\idtt{set}[\idtt{elimset}]\). \STATE increment \(\idtt{elimset}\); \UNTIL{ \(s=k\) or \(e\) is not found in \(\idtt{set}[\idtt{elimset}]\) } \IF{ \(s=k\) } \STATE add \(e\) to answer. \ENDIF \ENDWHILE \end{algorithmic} \caption{Pseudo-code for {\SmallAdaptive}}\label{t:sa}\label{alg:sadap} \end{algorithm}%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

- Sequential [Barbay and Kenyon 2002]
Barbay and Kenyon~\cite{adaptiveIntersectionAndTThresholdProblems} introduced a fourth algorithm, called {\Sequential}, which is optimal for a different measure of difficulty, based on the non-deterministic complexity of the instance. % It cycles through the sets performing one entire gallop search at a time in each (as opposed to a single galloping {\it step} in {\Adaptive}), so that it performs at most \(k\) searches for each comparison performed by an optimal non-deterministic algorithm: its pseudo-code is given in Algorithm~\ref{alg:sequential}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % ENGLISH STYLE ALGORITHM DESCRIPTION % FOR SEQUENTIAL %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{algorithm} \({\Sequential}(\idtt{set},\idtt{k})\) \begin{algorithmic}[1] \STATE Choose an eliminator \(e=\idtt{\idtt{set}}[0][0]\), in the set \(\idtt{elimset} \leftarrow 0\). \STATE Consider the first set, \(i\leftarrow 1\). \WHILE{the eliminator $e \neq \infty$} \STATE search in \(\idtt{set}[i]\) for \(e\) \IF{the search found $e$} \STATE increase the occurrence counter. \STATE {\bf if} {the value of occurrence counter is $k$} {\bf then} output \(e\) {\bf end if} \ENDIF \IF{the value of the occurrence counter is \(k\), or \(e\) was not found} \STATE update the eliminator to $e\leftarrow \idtt{set}[i][\idtt{succ}(e)].$ \ENDIF \STATE Consider the next set in cyclic order \(i\leftarrow i+1 \textrm{ mod } k\). \ENDWHILE \end{algorithmic} \caption{Pseudo-code for {\Sequential}}\label{t:s} \label{alg:sequential} \end{algorithm}%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

- RSequential [Barbay 2003]
A randomized variant~\cite{optimalityOfRandomizedAlgorithmsForTheIntersectionProblem}, {\RandomSequential}, performs less comparisons than {\Sequential} on average on instances where the searched elements are present in roughly half of the arrays, rather than in almost all or almost none of the arrays. % The difference with {\Sequential} corresponds to a single line, the choice of the next set where to search for the ``eliminator'' (line~\(12\) in Algorithm~\ref{alg:sequential}): {\Sequential} takes the next set available while {\RandomSequential} chooses one at random among all the sets not yet known to contain the eliminator.

- Baeza-Yates
^{5}{\BaezaYates} algorithm was originally intended for the intersection of two sorted lists. % It takes the median element of the smaller list and searches for it in the larger list. % The element is added to the result set if present in the larger list. % The median of the smaller list and the rank insertion of the median in the larger set divide the problem into two sub-problems. % The algorithm solves recursively the instances formed by each pair of subsets, always taking the median of the smaller subset and searching for it in the larger subset. % If any of the subsets is empty, it does nothing. % % In order to use this algorithm on instances with more than two lists, Baeza-Yates ~\cite{aFastSetIntersectionAlgorithmForSortedSequences} suggests to intersect the lists two-by-two, intersecting the smallest lists first. % As the intersection algorithm works for sorted lists and the result of the intersection may not be sorted, the result set needs to be sorted before intersecting it with the next list, which would be highly inefficient. % The pseudo-code for {\BaezaYates} algorithm is shown in Algorithm \ref{algo:BY}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % ENGLISH STYLE ALGORITHM DESCRIPTION FOR Baeza-Yates %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{INUTILE} \begin{algorithm} \({\BaezaYates}(\idtt{set},\idtt{k})\) \begin{algorithmic}[1] \STATE Sort the sets by size (\(|\idtt{set}[0]|\leq |\idtt{set}[1]|\leq \ldots \leq |\idtt{set}[k]|\)). \STATE Let the smallest set \(\idtt{set}[0]\) be the candidate answer set. %\STATE {\bf for} each set \(set[i]\), \(i=1\ldots k\) {\bf do} initialize \(\ell[k]=0\). \FOR{each set \(\idtt{set}[i]\), $i=1\ldots k$} \STATE \(\idtt{candidate} \leftarrow \idtt{BYintersect}(\idtt{candidate},\idtt{set}[i],0,|\idtt{candidate}|-1,0,|\idtt{set}[i]|-1)\) \STATE sort the candidate set. \ENDFOR \end{algorithmic} \vspace{0.1in} \(\idtt{BY\-intersect}(\idtt{setA},\idtt{setB},\idtt{minA},\idtt{maxA},\idtt{minB},\idtt{maxB})\) \begin{algorithmic}[1] \STATE {\bf if} \(\idtt{setA}\) or \(\idtt{setB}\) are empty {\bf then} return \(\emptyset\) {\bf endif}. \STATE Let \(\idtt{m}=(\idtt{minA}+\idtt{maxA})/2\) and let \(\idtt{medianA}\) be the element at \(\idtt{setA}[\idtt{m}]\). \STATE Search for \(\idtt{medianA}\) in \(\idtt{setB}\). \IF {\(\idtt{medianA}\) was found} \STATE add \(\idtt{medianA}\) to \(\idtt{result}\). \ENDIF \STATE let \(\idtt{r}\) be the insertion rank of \(\idtt{medianA}\) in \(\idtt{setB}\). \STATE //solve recursively on both sides of the sets. \IF {left subset of \(\idtt{setA}\) is smaller or equal than the left subset of $\idtt{setB}$} \STATE \(\idtt{result} \leftarrow \idtt{result} \cup \idtt{BYintersect}(\idtt{setA},\idtt{setB},\idtt{minA},\idtt{m}-1,\idtt{minB},\idtt{r}-1)\) \ELSE \STATE \(\idtt{result} \leftarrow \idtt{result} \cup \idtt{BYintersect}(\idtt{setB},\idtt{setA},\idtt{minB},\idtt{r}-1,\idtt{minA},\idtt{m}-1)\) \ENDIF \IF {right subset of \(\idtt{setA}\) is smaller or equal than the right subset of $\idtt{setB}$} \STATE \(\idtt{result} \leftarrow \idtt{result} \cup \idtt{BYintersect}(\idtt{setA},\idtt{setB},\idtt{m}+1,\idtt{maxA},\idtt{r},\idtt{maxB})\) \ELSE \STATE \(\idtt{result} \leftarrow \idtt{result} \cup \idtt{BYintersect}(\idtt{setB},\idtt{setA},\idtt{r},\idtt{maxB},\idtt{m}+1,\idtt{maxA})\) \ENDIF \begin{FORCOAUTHORS} \begin{ALEJANDRO} I think it is more proper to use the union, since I describe the algorithm in terms of sets, for ex, I return an empty set when the intersection is empty. \end{ALEJANDRO} \end{FORCOAUTHORS} \end{algorithmic} \caption{Pseudo-code for {\BaezaYates}}\label{algo:BY} \end{algorithm} \end{INUTILE}%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{algorithm} \({\BaezaYates}(\idtt{set},\idtt{k})\) \begin{algorithmic}[1] \STATE Sort the sets by size (\(|\idtt{set}[0]|\leq |\idtt{set}[1]|\leq \ldots \leq |\idtt{set}[k]|\)). \STATE Let the smallest set \(\idtt{set}[0]\) be the candidate answer set. \FOR{each set \(\idtt{set}[i]\), $i=1\ldots k$} \STATE \(\idtt{candidate} \leftarrow \idtt{BYintersect}(\idtt{candidate},\idtt{set}[i],0,|\idtt{candidate}|-1,0,|\idtt{set}[i]|-1)\) \STATE sort the candidate set. \ENDFOR \end{algorithmic} \vspace{0.1in} \(\idtt{BY\-intersect}(\idtt{setA},\idtt{setB},\idtt{minA},\idtt{maxA},\idtt{minB},\idtt{maxB})\) \begin{algorithmic}[1] \STATE {\bf if} \(\idtt{setA}\) or \(\idtt{setB}\) are empty {\bf then} return \(\emptyset\) {\bf endif}. \STATE Let \(\idtt{m}=(\idtt{minA}+\idtt{maxA})/2\) and let \(\idtt{medianA}\) be the element at \(\idtt{setA}[\idtt{m}]\). \STATE Search for \(\idtt{medianA}\) in \(\idtt{setB}\). \IF {\(\idtt{medianA}\) was found} \STATE add \(\idtt{medianA}\) to \(\idtt{result}\). \ENDIF \STATE Let \(\idtt{r}\) be the insertion rank of \(\idtt{medianA}\) in \(\idtt{setB}\). \STATE Solve the intersection recursively on both sides of \(\idtt{r}\) and \(\idtt{m}\) in each set. \end{algorithmic} \caption{Pseudo-code for {\BaezaYates}}\label{algo:BY} \end{algorithm}%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

- Baeza-Yates and Salinger
^{6} - Extrapolation Intersection
^{7}{\Interpolation} search has long been known to perform significantly better in terms of comparisons over binary search on data randomly drawn from a uniform distribution, and recent developments suggest that interpolation search is also a reasonable technique for non-uniform data~\cite{interpolationSearchForNonIndependantData}. % Searching for an element of value \(e\) in an array \(\idtt{set}[i]\) on the range \(a\) to \(b\), the algorithm probes position \(I(a,b,e)\) defined as follows:

\begin{eqnarray*} I(a,b,e) &= &\left\lfloor \frac{e-\idtt{set}[i][a]}{\idtt{set}[i][b]-\idtt{set}[i][a]}(b-a) \right\rfloor + a \end{eqnarray*}Barbay~\etal proposed a variant, {\Extrapolation} search, which involves extrapolating on the current and previous positions in~\(\idtt{set}[i]\). % Specifically, the extrapolation step probes the index \(I(p'_{i},p_{i},e)\), where \(p'_{i}\) is the previous extrapolation probe. % This has the advantage of using ``explored data'' as the basis for calculating the expected index: this strategy is similar to galloping, which uses the previous jump value as the basis for the next jump (i.e. the value of the next jump is the double of the value of the current jump).

- Extrapolation Look Ahead Intersection
^{7}Barbay~\etal proposed another search algorithm, {\ExtrapolateAhead}, which is similar to extrapolation, but rather than basing the extrapolation on the current and previous positions, we base it on the current position and a position that is further ahead. % Thus, our probe index is calculated by \(I(p_{i}, p_{i}{+}l,e)\) where \(l\) is a positive integer that essentially measures the degree to which the extrapolation uses local information. % The algorithm uses the local distribution as a representative sample of the distribution between \(\idtt{set}[i][p_{i}]\) and the eliminator: a large value of \(l\) corresponds to an algorithm using more global information, while a small value of \(l\) correspond to an algorithm using only local information. % If the index of the successor \(\idtt{succ}(e)\) of \(e\) in \(\idtt{set}[i]\) is not far from \(p_{i}\), then the distribution between \(\idtt{set}[i][p_{i}]\) and \(\idtt{set}[i][p_{i}+l]\) is expected to be similar to the distribution between \(\idtt{set}[i][p_{i}]\) and \(\idtt{set}[i][\idtt{succ}(e)]\), and the estimate will be fairly accurate. % Thus if the set is bursty, or piecewise uniform, we would expect this strategy to outperform interpolation because the set is locally representative. On the other hand, if the set comes from a random uniform distribution then we would expect interpolation to be better because in this case using a larger range to interpolate is more accurate than using a smaller range.

- Sorted Baeza-Yates
^{8}To avoid the cost of sorting each intermediate result set, we introduce {\BaezaYatesSorted}, a minor variant of {\BaezaYates}, which does not move the elements found from the input to the result set as soon as it finds them, but only at the last recursive step. % This ensures that the elements are added to the result set in order and trades the cost of explicitly sorting the intermediate results with the cost of keeping slightly larger subsets.

- Combined/Chimeric/ algorithms
^{8}While previous studies considered only specific combinations of search and intersection algorithms, such as % {\BaezaYates} using {\AdaptiveBinary} for the study by Baeza-Yates~{\etal}; % or {\SmallAdaptive} using {\Doubling}, {\SvS} and {\SwSvS} using {\AdaptiveBinary} for the study by Demaine~{\etal}; % or {\Sequential} and {\RandomSequential} using {\Doubling}; % without discussing which search algorithm is the most adapted for which intersection algorithm, % Barbay~\etal defined search and melding algorithms separately, in order to study the impact of new search algorithms on all melding algorithms, and find the best combination over all possible ones.

They restricted their study to intersection algorithm performing full searches in each sorted array. In total they consider four main algorithms and their variants, which have been introduced above (in general in association with the equivalent of a doubling search algorithm):

- {\SvS} and {\SwSvS}
- {\SmallAdaptive}
- {\Sequential} and {\RandomSequential}
- {\BaezaYates} and {\BaezaYatesSorted}

% They considered (…)

- TODO Read papers on Improved Hwang Lin algorithms READING
- DONE Recover reference of paper by Rasmus Pagh
Bille, Pagh, Pagh, Fast evaluation of union intersection expressions CoRR 2007

- Hwang and Lin algorithm [1971,1972] OFFICE READING
- Results
- Worst case analysis for \((n_1,\ldots,n_k)\) fixed
- Average case analysis for \((n_1,\ldots,n_k)\) fixed
- Adaptive Analysis
- Alternation
- Gap encoding
- Redundancy
- Relations between adaptive analysis

- Alternation
- Experimental results
- Random data
- Google public extract
- TREC extract

- Random data

- Worst case analysis for \((n_1,\ldots,n_k)\) fixed
- Discussion
- Applications of Intersection
- Join operators, in relational databases, and XPath operators in XML databases.
- Union, Difference and Threshold Set of sorted arrays
- Recursive combination of Union and Intersection
- Convex hull (in low and high dimensions)
- TODO Read again Olivier Devillers's mail about intersection

- TODO Read again Olivier Devillers's mail about intersection

- Join operators, in relational databases, and XPath operators in XML databases.
- Challenges
- In Theory
- An adaptive analysis subsuming both Redundancy and Gap encoding?
- Dovetailing from Kirkpatrick

- Other measures than comparison number:
- number of searches
- because there are now (succinct) data structure supporting searches in fixed short time.

- number of cache accesses

- number of searches

- An adaptive analysis subsuming both Redundancy and Gap encoding?
- In Practice
- An implementation with "real" difference in performance
- TODO Find critics of JEA reviewer

- TODO Find critics of JEA reviewer
- An experimentation with succinct data structures
- with Pizza Chili.

- Implementations issues
- cache issues
- specificity of the data
- parallelism
Barbay~\etal [JEA] restricted their study to intersection algorithm performing full searches in each sorted array, though ruling out the {\Adaptive} algorithm from Demaine~\etal, which performs the searches in parallel in all sets from both the left and right end, one comparison at a time. % Ignoring {\Adaptive} should not yield to consequence as Demaine~\etal's previous experimental study showed {\Adaptive} to perform worse than {\SvS} and {\SwSvS}, and much worse than {\SmallAdaptive} (the bad practical performance of {\Adaptive} being the motivation for the creation of {\SmallAdaptive}). % Yet this sets a bad precedent, as it should be clear that algorithms performing full searches can be tricked in performing arbitrarily costly operations. Even though the {\Adaptive} algorithm was not competitive with others, another algorithm searching in parallel in the sorted arrays, at different rates in each array, could very well show a better performance, both in practice and in theory for a relevant difficulty measure.

- cache issues

- An implementation with "real" difference in performance

- In Theory

- Applications of Intersection

### 7.2 Adaptive Intersection Algorithms (before )

- Introduction
Traditionally the analysis of algorithms measures the complexity of a problem or algorithm in terms of the worst-case behavior over all inputs of a given size. However, in certain cases an improved algorithm can be obtained by considering a finer partition of the input space by a difficulty measure, sometime down to the instance. This finer partition is defined through a difficulty measure, which groups the instance by both their size and their difficulty, in order to refine the worst case analysis for both upper and lower bounds on the complexity of the instance.

- Definition and trivial results: Theta(n)
- Definition
- In its simplest form, the intersection problem consists in finding the elements in common between \(k\) sorted arrays, of sizes \(n_1,\ldots,n_k\) summing to \(n=\sum_i^k n_i\).
- Lower Bounds
- Trivially, one can see that any algorithm solving this problem through mere comparisons on the elements requires \(\Omega(n)\) comparisons in the worst case over instances of fixed size \(n\).
A finer analysis in funtion of \(n\) and \(k\) yields a lower bound of (…)

The finest analysis for \(n\), \(k\) and \(n_1,\ldots,n_k\) fixed yields a lower bound of (…)

- (no term)
- Applications:
- Conjunctive queries
- Join operator in databases
- Pattern Matching in XML
- Convex Hull in large dimension?

- Necessity of an Adaptive Analysis
Demaine, López-Ortiz and Munro [DLM2000] described how to compute a description of the union of \(k\) sorted arrays (or, equivalently, of \(k\) B-trees) in sub-linear time on easy instance, defining their measure of the difficulty of the instance by the cost of encoding the description of the union. They generalized their approach to the computation of the intersection (and, equivalently, of the difference) or sorted arrays, with applications to the resolution of conjunctive queries in search engine such as Google [DLM2001]. Barbay and Kenyon [BK2002] gave two other adaptive analysis of the intersection problem, along with other applications such as to support multi-join operators in relational databases (e.g. OR or AND operator on more than two keys), and to support pattern matching queries in semi-structured data (e.g. file systems, XML). Barbay and Chen [BC2007] further generalized this approach to the merging of planar convex hulls, using as a measure of difficulty the complexity of the certificate of the merging of the convex hulls. To this day the analysis of Demaine, López-Ortiz and Munro [DLM2000] and of Kenyon and Barbay [BK2003] are incomparable, in the sense that no algorithm is known to be optimal for both analysis at once. We aim to define a yet finer analysis combining the main features of both analysis, along with the intersection algorithm which will be optimal for those, and the derived algorithms to compute adaptively the merging of convex hulls in two and three dimensions.

- Definition and trivial results: Theta(n)
- Two sets: Union = Intersection
- "Optimal Merging of 2 elements with n Elements", Hwang and Lin, 1970
- improvingTheBoundOnOptimalMerging, Christen, 1978
- "Significant Improvements to the Hwang Lin Merging Algorithm", Manacher, 1979
- "A Fast Merging Algorithm", Brown and Tarjan, 1979
- twoProbabilisticResultsOnMerging, Fernandez de la Vega, Frieze and Santha, 1993
- averageCaseAnalysisOfTheMergingAlgorithmOfHwangAndLin, Fernandez de la Vega, Frieze and Santha, 1998
- "A Fast Set Intersection Algorithm for Sorted Sequences", Baeza-Yates, 2004

- Alternation
- Barbay and Kenyon
- A first measure of difficulty, which yields a first separation between trivial algorithms and more sophisticated ones.
- Measures the number of comparison required to certify the result of the intersection.

- Barbay and Kenyon
- Gap Encoding
- \cite{dlm} Demaine, López-Ortiz, Munro, 2000
- A measure of difficulty finer than Alternation, inspired by a similar one developped for another problem (computing the union of k sorted sets).
- Measures a bound on the number of bits required to encode a certificate of the result of the intersection.

- \cite{dlm} Demaine, López-Ortiz, Munro, 2000
- Redundancy
- Barbay, Barbay and Kenyon
- A measure finer than alternation (but independant of Gap Encoding size), showing the separation between Randomized and Deterministic algorithms for the intersection problem.
- Measures the average number of searches required to find a certificate when choosing among candidate sets at random.

- Barbay, Barbay and Kenyon
- Open Problems
- Practical Results
- \cite{dlmAlenex}, 2001
- \cite{experimentalAnalysisOfAFastIntersectionAlgorithmForSortedSequences}, Baeza-Yates and Salinger, 2005
- \cite{fasterAdaptiveSetIntersectionsForTextSearching}, J{\'e}r{\'e}my Barbay and Alejandro L{\'o}pez-Ortiz and Tyler Lu, 2006
- BLST JEA

- Redundancy and Gaps -> Dovetailing?
- Generalisations
- \cite{worstCaseOptimalUnionIntersectionExpressionEvaluation}, Ehsan Chiniforooshan and Arash Farzan and Mehdi Mirzazadeh, 2005

- Practical Results

## 8 Succinct Data Structures

### 8.1 Introduction

- Encodings LUCA
- Succinct Data Structures MENG
- Succinct Indexes MENG
- Independant Succinct Indexes MENG

### 8.2 Atoms

- Integers JEREMY
- Bit Vectors JEREMY
- Trees LUCA
- Permutations JEREMY
- Multisets, Strings and Functions JEREMY
- Binary Relations and Bipartite Graphs JEREMY
- Triangulations, Quadrangulations and 3-connected graphs LUCA
- Planar Graphs and k-pages Graphs LUCA
- Separable Graphs LUCA

### 8.3 Applications

- Labeled Trees and XML MENG
- Labeled Planar Graphs MENG LUCA

### 8.4 Perspectives

- Indexing of Compressed Data
**H+o(n)**JEREMY - (Fully) Compressed Data Structures
**H+o(H)**JEREMY - Dynamicity MENG

## 9 Compressed Data Structures

In 2009, Barbay and Navarro~\cite{2009-STACS-CompressedRepresentationsOfPermutationsAndApplications-BarbayNavarro}, observing that any comparison-based sorting algorithm yields an encoding for permutations, exploited the similarity between the \texttt{Wavelet Tree}~\cite{highOrderEntropyCompressedTextIndexes}, a data structure used for compression, and the execution of an adaptive variant of \texttt{Merge Sort}, to develop and analyze various \emph{compressed data structures} for permutations and \emph{adaptive sorting algorithms}. In this case, the construction algorithm of each data structure is in fact an adaptive algorithm sorting the permutation, in a number of comparisons roughly the same as the number of bits used by the data structure. In addition, the operators of the data structures are themselves adaptive, in the sense that their running time is the smallest on the the most compressible permutations.

## 10 Deferred Data Structures

I explored the topic of data structures for unsorted arrays of integers supporting the select operator in the online model, in 2012, in collaboration with Ankur Gupta, Srinivasa Rao and John Sorenson, with some preliminary theoretical and practical results \cite{2013-ESA-OnlineRankSelect-BarbayGuptaJoRaoSorenson} presented at the European Symposium on Algorithms (ESA) in 2013.

We discovered later that Motwani and Raghavan \cite{1986-SoCG-DeferredDataStructuringQueryDrivenPreprocessingForGeometricSearchProblems-MotwaniRaghavan} had already explored this topic in 1986, under the distinct name of ``Deferred Data Structures''. The 1986's resuls are weaker than those of 2013, due to a weaker analysis model, yet the range of applications considered in the 1986's publication is wider and includes data structures for set of planar points supporing queries on the \textsc{convex hulls} in the online model.

### 10.1 Multiselection

The \emph{multiselection} problem asks for elements of rank~\(r_i\) from the sequence~\(R=r_1,r_2,\dots, r_q\) on an unsorted array~\(A\) drawn from an ordered universe of elements. We define \(B(S_q)\) as the information-theoretic lower bound on the number of comparisons required to answer the \(q\) queries, where \(S_q = \{s_i\}\) denotes the queries ordered by rank. We define \(\Delta_i = s_{i+1} - s_i\), where~\(s_0 = 0\) and \(s_{q+1} = n\). Then, $$ B(S_q) = \log n! - \sum_{i=0}^q \log \left(\Delta_i!\right) = \sum_{i=0}^q \Delta_i \log \frac{n}{\Delta_i} - O(n).\footnote{We use the notation \(\log_b a\) to refer to the base~\(b\) logarithm of~\(a\). By default, we let \(b=2\). We also define \(\ln a\) as the base~\(e\) logarithm of \(a\).} $$ The \emph{online multiselection} problem asks for elements of rank~\(r_i\), where the sequence~\(R\) given one element at a time and the $i$th query has to be answered before the $(i+1)$st query is given. The lower bound~\(B(S_q)\) also applies in the online case.

### 10.2 Maximal Dominating Point queries

Given two points \(p=(x,y)\) and \(q=(x',y')\), \(p\) \emph{dominates} \(q\) if and only if \(x\geq x'\) and \(y\geq y'\). Given a set~\(S= \{ p_1, p_2, \dots, p_n \}\) of~\(n\) points in the plane, the \emph{Maximal Dominating Point Set} is the maximal subset \(M\subset S\) such that no point of \(M\) is dominated by a point of \(S\). Given a set of points \(S\) and a point \(p=(x,y)\), a \emph{Maximal Dominating Point Query} for \(p\) is answered by at least one point of the maximal dominating point set of \(S\) which dominated \(p\).

### 10.3 Point Membership in a Convex Hull

Given a set~\(P= \{ p_1, p_2, \dots, p_n \}\) of~\(n\) points in the plane, let \(CH(P)\) be a set that denotes the convex hull of~\(P\), which is the smallest convex set that contains~\(P\).\footnote{Recall that a set~\(S\) is convex if and only if for every pair of points~\(x\) and~\(y\) in~\(S\), each point of the line segment connecting them is also an element of~\(S\).} In the \emph{point membership in convex hull} problem, we need to preprocess the set~\(P\) so that given a query point~\(p\), we can answer whether or not~\(p\) is an element of~\(CH(P)\).

### 10.4 Deferred Data Structure for Depth queries in \(d\) dimensions

Progressively index \(n\) \(d\)-dimensional boxes as \(q\) depths queries for \(d\)-dimensional points come.

- For \(N\) query points
- The greedy algorithm takes \(dNn\) comparisons
- Chan's algorithm \cite{2013-FOCS-KleeSMeasureProblemMadeEasy-Chan} takes
- \(n^{d/2}\) to index and
- \(dN\lg n\) to solve the queries, so
- $ in total

- A Static Deferred Data Structure would explore only part of the tree traversed by the \(O(n^{d/2})\) algorithm, and progressively build the tree as queries come, achieving a better solution than both above when \(N < \frac{n^{d/2}}{d(n-\lg n}\). An amortized dynamic Deferred Data Structure would be deduced.

- NEXT STUDY paper "Semi-online maintenance of geometric optima and measures" from Timothy Chan

### 10.5 Discussion

- Online Algorithm vs Deferred Data Structure
Our results can be presented either as online algorithms (as we do) or as data structures (as Motwani and Raghavan~\cite{1986-SoCG-DeferredDataStructuringQueryDrivenPreprocessingForGeometricSearchProblems-MotwaniRaghavan} and Ching et al. \cite{1990-IPL-DynamicDeferrwedDataStructuring-ChingMehlhornSmid} do): it is both, as it is in essence a data structure supporting queries and operators, of which we analyze the competitive ratio with an offline algorithm (A technique previously used only for online algorithms).

- Other Adaptive Results
Motwani and Raghavan~\cite{1986-SoCG-DeferredDataStructuringQueryDrivenPreprocessingForGeometricSearchProblems-MotwaniRaghavan} analyzed the data-structure they proposed in the worst case over instances with a fixed number \(q\) of queries. Such an analysis does not distinguish between sequences of queries all occuring in a small portion of the array and sequences of queries regularly spread in the array, when our results show that those are of very distinct difficulty. Other refinement of both the solution and the analysis are possible (and desirable), such as taking into account the possibility that (1) the input array could be already partially ordered (which saves some comparison on building the runs), (2) the input array could be on a limited set of values (which saves some comparisons on merging the runs), or (3) the queries could be similar in time

- Applications to other problems
The concept of Deferred Data Structure is quite powerful and is not limited to select queries: Motwani and Raghavan~\cite{1986-SoCG-DeferredDataStructuringQueryDrivenPreprocessingForGeometricSearchProblems-MotwaniRaghavan} defined it on various other problems on set of points with median-based solutions, such as maintaining the set of Dominating points or the upper hull

- Applications to real problems
More generally, Deferred Data Structures are particularly efficient when data arrives or is modified too fast to be indexed in real time, and is queried sporadically (both in time and in space). Such a powerful technique is promising in the era of Massive Data, to index the data produced by Astronomy, Social Networks or Wearable computing. Deferred Data Structures supporting queries for such basic algorithmic problems as Sorting (Select queries), Maxima (Dominating queries) and Convex Hull (Convex Edge queries) will be building blocks for more sophisticated queries on more complex data.

## 11 Succinct Indexes

### 11.1 What happens to \(nH_k +o(n log\sigma)\) when \(\sigma\) is constant?

A colleague asked me the following question (sligthly edited to protect the privacy of the sender and other destinary of the email) about the succinct index defined in [TALG2011]

> I have a small question about the redundancy of your
> respective data structures for rank and select queries. I
> mean your paper (Barbay et al, TALG 2011) results that achieve
> the k-th order entropy. The space usage that you achieve
> is nH_{k} +o(n logσ) bits. But these and other similar
> data structures use at least nH_{k} + O(n) bits because we
> must keep several bit vectors. Bit vectors are used to
> reduce the problem on the sequence of length n to the
> problem on a chunk of size σ.

> Hence we need nH_{k}+O(n) bits even if σ=const. Did I
> miss something?

The succinct index data structure from Theorem 1 in [TALG 2011] states
that in addition to the space required by **any** data structure
supporting access on strings over alphabet \([1..\sigma]\), one needs
only \(O(\frac{n \lg\sigma}{\lg\lg\lg\sigma})\subset n \cdot
o(\lg\sigma)\) bits to support rank and select. When \(\sigma\) is a
constant, this space boils down to within \(O(1) \cdot n = O(n)\).

But compressed data structures supporting constant time access require
more than \(nH_k\) bits anyway: the one we consider in the paper uses \(n
H_k(S) + O(\frac{n}{\log_\sigma n} (k\lg \sigma +\lg\lg n))\) bits,
which is \(H_k(S) + \lg\sigma\cdot o(n)\) when \(k\in o(\log_\sigma{n})\):
when \(\sigma\) is a constant, the space used is within \(nH_k+O(n)\)
bits, **just to store the string and support constant time access**.

When \(\sigma\) is a constant, you **cannot** reason on the asymptotic
notations (which simplified formulas based on the assumption that the
variables were going to infinity) and **must** go back to the original
formula, **before** it is being simplified by asymptotic notations.
I as a rule (fight with co-authors to) avoid asymptotic notations with
more than one variable, to escape the kind of confusion you describe:
\(o(n\lg\sigma)\) is much less informative than \(n o(\lg\sigma)\)

So yes, the space used by the data structures from [TALG 2011] is within \(nH_k+O(n)\) when \(\sigma\) is a constant. It is confusing only if you misuse asymptotic notations…

Yakov> For any constant c, there is some integer m, so that f(nσ) Yakov> < c nlogσ for all pairs n,σ satisfying n>m and Yakov> σ>m. right?

## 12 Definition of \(o()\) asymptotics in dimension larger than \(1\)

Take the unidimensional definition (e.g. from
http://en.wikipedia.org/wiki/Big_O_notation#Little-o_notation)
\(f(n)\in o(g(n)) \leftrightarrow \forall c \exists N, \forall n>N, f(n)

Extending this definition to dimension \(d\) yields the following:
\(f(x_1,\ldots,x_d)\in o(g(x_1,\ldots,x_d)) \leftrightarrow \forall c \exists X_1,\ldots,X_d, \forall x_1>X_1,\ldots,x_d>X_d, f(x_1,\ldots,x_d)

## 13 From Space to Time

- Any comparison based algorithm implies an encoding (by memorizing the answers in bits) and vice versa.
- But what we are interested in is Efficient algorithms and small data structures:
- Any EFFICIENT comparison based algorithm implies a SMALL encoding (by memorizing the answers in bits).
- The reverse is not always true: for instance in an encoding where each bit is the answer to a distinct NP-hard decision problem, an encoding correspond to an algorithm but not necesarily an efficient one.

## 14 Adaptive Edit Distance

We describe an algorithm which running time is smaller in easier cases when for each symbol \(\alpha\) there are few instance of choice between \texttt{swap} and \texttt{insertion}, such as the followings:

\begin{itemize} \item if the number \(n_{\alpha}\) of occurences of \(\alpha\) in \(S\) is small, then most symbols \(\alpha\) will be inserted into \(S\), the few others being swapped; \item If the number \(n_{\alpha}\) of occurences of \(\alpha\) in \(S\) is close to the number \(m_{\alpha}\) of occurences of \(\alpha\) in \(L\), then few symbols \(\alpha\) will be inserted in any correction of \(S\) into \(L\). \end{itemize}The parameters \(g_{\alpha}=\min\{n_{\alpha},m_{\alpha}-n_{\alpha}\}\) and \(g=\max_{\alpha\in[1..d]} g_\alpha\) capture those cases, so that any instances with small value of \(g_\alpha\) (or, in a broader analysis, of \(g\)) are easy.

## Footnotes:

^{1}

DEFINITION NOT FOUND.

^{2}

DEFINITION NOT FOUND.

^{3}

DEFINITION NOT FOUND.

^{4}

DEFINITION NOT FOUND.

^{5}

DEFINITION NOT FOUND.

^{6}

DEFINITION NOT FOUND.

^{7}

DEFINITION NOT FOUND.

^{8}

DEFINITION NOT FOUND.