# Thesis and Internship topics proposed by Jeremy

### 1.1 Lazy Maxima Computation on Twitter Tweets

Basis
Use a deferred data structure for Maxima to efficiently answer popularity queries on tweets.
• Consider the hashtags of tweets:
• The measure of popularity for tweet hashtags is multidimensional:
1. number of tweets containing it,
2. number of users using it,
3. date of last usage,
4. length of time during which it was used (time of last usage minus time of first usage).
• Consider "Popularity queries"
• A "popularity query" is
• formed of a hashtag, and
• answered by indicating whether this hashtag is popular or not, and if not, by pointing to at least one hashtag which is more popular.
• The frequency of tweets is much higher than that of popularity queries.
• "Popular" hashtags form the "skyline" of these measures, the result of a "maxima computation"
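The dominance semantics above can be sketched naively (hypothetical measure vectors; the research question is precisely how a deferred data structure would avoid this full scan):

```python
# Naive baseline for "popularity queries": a hashtag is popular iff no
# other hashtag dominates it on all four measures (higher is better on
# every coordinate). Data and field layout are illustrative only.

def dominates(a, b):
    """True if measure vector a is at least b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def popularity_query(tag, measures):
    """Return (True, None) if `tag` is on the skyline of `measures`,
    else (False, witness) where witness is a dominating hashtag."""
    for other, vec in measures.items():
        if other != tag and dominates(vec, measures[tag]):
            return (False, other)
    return (True, None)
```

This scan costs linear time per query; the deferred data structure would instead refine only the part of the skyline that the query sequence actually touches.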
Question
How frequently and detailed should the skyline be computed?
• too often (the data changes and forces recomputation even when no query was asked between two computations) or too wide (some parts of the skyline are never queried) is inefficient, but
• not often enough or not wide enough results in too much waiting time when answering a query.
Hypothesis
A Deferred Data Structure for Maxima in high dimension would be the most efficient.
Measures of Success
A robot answering "Popularity queries" with a choice of two online algorithms
Plan
milestones
1. PROGRAM a real engine answering "Popularity queries" (with a greedy algorithm) and advertise it on social media;
2. COLLECT a set of user queries (and corresponding online data sets);
3. STUDY the literature on Deferred Data Structures and adaptive algorithms for Maxima;
4. PROGRAM better algorithm(s) for answering popularity queries;
5. SIMULATE all algorithms on the set of user queries obtained in step 2;
6. COMPARE the performance of the various algorithms
7. DESCRIBE the conclusions in a report.

### 1.2 Patience Sorting and Ping Pong Merge

Basis
A new article in SIGMOD 2014 (a database conference) claims that the "importance of sorting almost sorted data quickly has just emerged over the last decade", and proposes adaptive improvements of Patience Sorting.
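For reference, the core of Patience Sorting can be sketched as follows (using a heap for the final merge, where the cited paper proposes Ping-Pong merging instead):

```python
import heapq
from bisect import bisect_left

def patience_sort(seq):
    """Patience sorting sketch: deal elements onto decreasing piles
    (each goes on the leftmost pile whose top is >= it), then k-way
    merge the pile minima with a heap. Nearly sorted inputs create few
    piles, which is where the adaptivity comes from."""
    piles, tops = [], []          # tops[i] is the current top of pile i
    for x in seq:
        i = bisect_left(tops, x)  # leftmost pile whose top >= x
        if i == len(piles):
            piles.append([x]); tops.append(x)
        else:
            piles[i].append(x); tops[i] = x
    # every pile is decreasing, so its last element is its minimum
    heap = [(p.pop(), i) for i, p in enumerate(piles)]
    heapq.heapify(heap)
    out = []
    while heap:
        x, i = heapq.heappop(heap)
        out.append(x)
        if piles[i]:
            heapq.heappush(heap, (piles[i].pop(), i))
    return out
```

An adaptive analysis would bound the number of piles (and hence the merge cost) by a measure of presortedness.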
Objectives
Relate their results to existing adaptive results.
Measures of Success
Plan
1. [ ] REVIEW \cite{2014-SIGMOD-PatienceIsAVirtueRevisitingMergeAndSortOnModernProcessors-ChandramouliGoldstein}
2. [ ] PERFORM an adaptive analysis of their algorithms
• [ ] Merging
1. [ ] Balanced Ping-Pong
2. [ ] Unbalanced Ping-Pong
• [ ] Sorting
1. [ ] Naive P3 Sort
2. [ ] Cache-Sensitive $$P^3$$ Sort
3. [ ] Final $$P^3$$ Sort
4. [ ] Evaluating $$P^3$$ Sort
5. [ ] Replacement Selection Sort
6. [ ] Flat Replacement Selection (FRS) Sort
7. [ ] $$P^3$$ Replacement Selection Sort
3. [ ] ASK for their data
4. [ ] EXPERIMENTALLY Compare their algorithms to existing adaptive sorting and merging algorithms

### 1.3 Prefix Free Codes BIGDATA

Basis
• There is an $$O(n\lg n)$$ algorithm to compute an optimal prefix free code for $$n$$ weights, yet there are easy instances (e.g. all weights within a factor of two of each other) where such codes can be found in linear time.
• There is an $$O(n4^k)$$ algorithm from Belal and Elmasry, where $$k$$ is the number of distinct code lengths produced \cite{2005-ARXIV-DistributionSensitiveConstructionOfMinimumRedundancyPrefixCodes-BelalElmasry}. Its claimed improvement to an $$O(nk)$$ solution \cite{2006-STACS-DistributionSensitiveConstructionOfMinimumRedundancyPrefixCodes-BelalElmasry} seems wrong.
• The bounds of $$O(n\lg n)$$ and $$O(n4^k)$$ (or even $$O(nk)$$) leave an intriguing gap, a bit like the old dichotomy between the $$O(nh)$$ and $$O(n\lg n)$$ results for the convex hull before Kirkpatrick and Seidel's algorithm, which runs in time within $$O(n\lg h)$$.
• There is a lower bound of $$\Omega(n\lg k)$$.
• Using the (stupid) trick of simulating both algorithms in parallel does not yield an algorithm optimally adaptive to $$n$$ and $$k$$.
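For reference, the classic $$O(n\lg n)$$ baseline can be sketched with a heap, returning the multiset of optimal code lengths (so that $$k$$ is the number of distinct values in the output):

```python
import heapq
from itertools import count

def huffman_code_lengths(weights):
    """Huffman's O(n lg n) algorithm (sketch): return the sorted list
    of code lengths of an optimal prefix free code for `weights`
    (assumes n >= 2). The two-queue variant on sorted weights (van
    Leeuwen) achieves linear time after sorting."""
    tie = count()                 # tie-breaker: heapq cannot compare trees
    heap = [(w, next(tie), None) for w in weights]   # None marks a leaf
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tie), (a, b)))
    # traverse the tree: leaf depths are the code lengths
    lengths, stack = [], [(heap[0][2], 0)]
    while stack:
        node, depth = stack.pop()
        if node is None:
            lengths.append(depth)
        else:
            stack.extend(((node[0], depth + 1), (node[1], depth + 1)))
    return sorted(lengths)
```

An algorithm adaptive to $$k$$ would aim to beat this bound whenever the output uses few distinct code lengths.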
Objectives
Is there an optimally adaptive algorithm to compute Prefix Free Codes which is still optimal in the worst case?
Measures of Success
1. [ ] New measure of difficulty,
2. [ ] New adaptive algorithm,
3. [ ] Computational Lower bound matching upper bound.
Plan
• REVIEW bibliography on adaptive Prefix Free Codes
1. [ ] Huffman \cite{1952-IRE-amethodfortheinstructionofminimumredundancycodes-huffman}
2. [ ] van Leeuwen \cite{1976-ICALP-ontheconstructionofhuffmantrees-leeuwen}
3. [ ] Moffat and Turpin \cite{1998-TIT-efficientconstructionofminimumredundancycodesforlargealphabets-moffatturpin}
4. [ ] from Milidiu Pessoa and Laber \cite{2001-IEEE-threespaceeconomicalalgorithmsforcalculatingminimumredundancyprefixcodes-milidiupessoalaber}
5. [ ] Belal and Elmasry \cite{2006-STACS-distributionsensitiveconstructionofminimumredundancyprefixcodes-belalelmasry}
• IMPLEMENT [0/6] algorithms in python (even though I already have implementations)
1. [ ] huffman
2. [ ] van leeuwen
3. [ ] moffat and turpin
4. [ ] from milidiu pessoa and laber
5. [ ] belal and elmasry
6. [ ] adaptive van Leeuwen with Rank and Select

### 1.4 TimSort algorithm

Basis
• TimSort is a sorting algorithm considered to be fast in practice, described in http://bugs.python.org/file4451/timsort.txt
• It is closely related to
• Merge sort and to
• Alphabetical Prefix Free Codes / Binary Trees (as the runs are merged only if they are consecutive)
• It seems to show an "adaptive" behavior:
• "on real-world data, TimSort often requires far fewer than $$n\lg n$$ comparisons, because it takes advantage of the fact that sublists of the data may already be ordered." (WIKIPEDIA)
• Beyond faster sorting, it could also yield better compression of permutations.
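TimSort's run detection, the source of its adaptivity, can be sketched as follows; the number of runs found is the nRuns measure mentioned in the plan (a simplified sketch, without TimSort's minimum run length and merging invariants):

```python
def timsort_runs(seq):
    """Decompose seq into TimSort-style runs: maximal non-decreasing
    runs, with maximal strictly decreasing runs reversed (so every run
    comes out sorted). len(timsort_runs(seq)) is the nRuns measure.
    Simplified: real TimSort also enforces a minimum run length."""
    runs, i, n = [], 0, len(seq)
    while i < n:
        j = i + 1
        if j < n and seq[j] < seq[i]:       # strictly decreasing run
            while j < n and seq[j] < seq[j - 1]:
                j += 1
            runs.append(list(reversed(seq[i:j])))
        else:                               # non-decreasing run
            while j < n and seq[j] >= seq[j - 1]:
                j += 1
            runs.append(list(seq[i:j]))
        i = j
    return runs
```

Analyzing TimSort against nRuns then amounts to bounding the total merge cost as a function of the number and lengths of these runs.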
Objectives
Is TimSort adaptive to some "natural" measure of disorder?
Measures of Success
1. [ ] Identify one or more (known or new) measure $$\delta$$ of disorder/difficulty to which TimSort is adaptive
2. [ ] Show a computational lower bound for $$n$$ and $$\delta$$ fixed (if not done already)
3. [ ] Show a computational upper bound for $$n$$ and $$\delta$$ fixed (if not done already)
4. [ ] Define a compressed data structure for permutation based on TimSort (not just a compressed encoding, which would be trivial).
Plan
1. [ ] Review
• [ ] bibliography on adaptive sorting \cite{1992-ACJ-AnOverviewOfAdaptiveSorting-MoffatPetersson}
• [ ] literature on TimSort
2. [ ] Analyze the worst case performance of TimSort on measures of disorder such as
• [ ] nRuns
• [ ] nInv
• [ ] nREM
3. [ ] Analyze the data on which TimSort was used in the past to find some structure that TimSort takes advantage of.

### 1.5 Bounded Length Prefix Free Codes BIGDATA

Basis
• There is an $$O(nL\lg n)$$ algorithm to compute optimal Prefix Free Codes among all codes of bounded length $$L$$.
• Yet, there are
• easy instances (e.g. $$L<\lg n$$, or $$L=\lg n$$, or $$L\geq n$$) where such codes can be found in linear time, and
• other instances (e.g. $$h$$ …)
Objectives
Is there an adaptive algorithm to compute Bounded Length Prefix Free Codes?
Measures of Success
1. [ ] New measure of difficulty,
2. [ ] New adaptive algorithm,
3. [ ] Computational Lower bound matching upper bound.
Plan
1. [ ] REVIEW bibliography on bounded length adaptive Prefix Free Codes
1. [ ] Dynamic programming solution \cite{2009-SODA-AGenericTopDownDynamicProgrammingApproachToPrefixFreeCoding-GolinXuYu}
2. [ ] first paper from Milidiu Pessoa and Laber \cite{1998-SPIRE-InPlaceLengthRestrictedPrefixCoding-MilidiuPessoaLaber}
3. [ ] second paper from Milidiu Pessoa and Laber \cite{1999-ALENEX-EfficientImplementationOfTheWarmUpAlgorithmForTheConstructionOfLengthRestrictedPrefixCodes-MilidiuPessoaLaber}
2. [ ] IMPLEMENT [0/3] algorithms in python
1. [ ] from milidiu pessoa and laber
2. [ ] TopDown
3. [ ] new adaptive algorithm
3. [ ] Experiment on word frequencies of various data sets.

### 1.6 Optimal Boxes in high dimension BIGDATA GEOMETRY

Basis
There is an algorithm computing the planar box of maximal covering weight in time within $$O(n^2)$$, and less on easy instances \cite{2014-IPL-MaximumWeightPlanarBoxesInON2TimeAndBetter-BarbayChanNavarroPerezLantero}.
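As a warm-up for the generalization, the one-dimensional analogue of the maximal-weight box reduces to a maximum-sum subarray over the points sorted by coordinate (a hedged sketch, not the algorithm of the cited paper):

```python
def max_weight_interval(points):
    """points: list of (coordinate, weight), weights possibly negative.
    Returns the maximum total weight achievable by an interval [a, b]:
    Kadane's linear scan over the weights sorted by coordinate."""
    weights = [w for _, w in sorted(points)]
    best = current = 0            # the empty interval has weight 0
    for w in weights:
        current = max(w, current + w)   # extend the interval or restart
        best = max(best, current)
    return best
```

The planar and higher-dimensional versions use such a 1D scan as an inner subroutine, one reason the adaptive techniques of the planar case are candidates for generalization.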
Objectives
Generalize those adaptive results to higher dimensions.
Measures of Success
List which of the difficulty measures defined by Barbay et al. \cite{2014-IPL-MaximumWeightPlanarBoxesInON2TimeAndBetter-BarbayChanNavarroPerezLantero} do generalize to high dimension, and explain why for those which do not.
Plan
1. [ ] Study \cite{2014-IPL-MaximumWeightPlanarBoxesInON2TimeAndBetter-BarbayChanNavarroPerezLantero}
2. [ ] Formalize the generalization of the $$O(n^2)$$ algorithms to high dimension
3. [ ] Study one by one the difficulty measures and adaptive techniques to decide if they apply to high dimension.

### 1.7 Cache-Friendly Adaptive Intersection

Basis
Existing adaptive algorithms for the Intersection problem perform well in theory but badly in practice on data too large to fit in the cache.
Objectives
Can we design an adaptive cache-friendly algorithm for the Intersection problem, starting from known adaptive algorithms (as opposed to starting from a different paradigm)?
Measures of Success
• experimentation on the TREC data set with Yahoo! queries showing
• [ ] a reduction in time
• [ ] a reduction of the number of cache misses
• of some new cache-friendly algorithm,
• as compared to state-of-the-art intersection algorithms \cite{2009-JEA-AnExperimentalInvestigationOfSetIntersectionAlgorithmsForTextSearching-BarbayLopezOrtizLuSalinger}
Plan
1. [ ] Review
1. the Intersection problem \cite{2009-JEA-AnExperimentalInvestigationOfSetIntersectionAlgorithmsForTextSearching-BarbayLopezOrtizLuSalinger},
2. external memory models, and
3. external memory solutions, starting with $$B$$-trees;
2. [ ] Implement existing Algorithm, or learn how to use the existing simulator from \cite{2009-JEA-AnExperimentalInvestigationOfSetIntersectionAlgorithmsForTextSearching-BarbayLopezOrtizLuSalinger}
3. [ ] Design a new concurrent intersection algorithm, similar to the original one proposed by Demaine~\etal~\cite{dlm}, in which a different thread searches in each inverted list, so that blocking memory accesses are not spent waiting but searching in other inverted lists;
4. [ ] Analyze the theoretical performance of the new algorithm;
5. [ ] Implement and experimentally test the new algorithm;
6. [ ] Analyze the results;
7. [ ] If necessary, design a new variant of the algorithm and go back to step 4.
• General presentation of the topic
Conjunctive queries are simple and popular queries in search engines. Given an index composed of inverted lists coded in sorted arrays or B-trees, those queries reduce to computing the {\em intersection} of the sets represented by those structures.

Most algorithms studied for the intersection problem assume the inverted lists are all in memory, take little or no account of the cache, and are not written in a concurrent way, which means that the process solving the intersection problem is constantly blocking on page faults.
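As a baseline, the in-memory adaptive approach the internship starts from can be sketched with doubling (galloping) searches, in the spirit of the adaptive intersection algorithms cited above (a sequential sketch; the proposed work adds concurrency and cache-friendliness):

```python
from bisect import bisect_left

def gallop(arr, x, lo):
    """Smallest index i >= lo with arr[i] >= x (len(arr) if none):
    doubling probes then a binary search, O(lg gap) comparisons."""
    step = 1
    while lo + step < len(arr) and arr[lo + step] < x:
        step *= 2
    return bisect_left(arr, x, lo, min(lo + step + 1, len(arr)))

def intersect(lists):
    """Adaptive intersection of sorted lists: pick an eliminator from
    the shortest list and gallop for it in every list, raising the
    eliminator whenever a list overshoots it."""
    lists = sorted(lists, key=len)
    if not lists or not lists[0]:
        return []
    pos, out, e = [0] * len(lists), [], lists[0][0]
    while True:
        maxval, all_match = e, True
        for k, arr in enumerate(lists):
            i = gallop(arr, e, pos[k])
            if i == len(arr):               # some list is exhausted
                return out
            pos[k] = i
            if arr[i] != e:
                all_match = False
                maxval = max(maxval, arr[i])
        if all_match:
            out.append(e)
            pos = [p + 1 for p in pos]
            if pos[0] == len(lists[0]):
                return out
            e = lists[0][pos[0]]
        else:
            e = maxval                      # raise the eliminator
```

In the concurrent variant of the plan, the gallop in each list would run in its own thread so that a blocking memory access in one list does not stall progress in the others.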

• Objective of the internship
We propose to program:
1. a structure of implicit B-tree supporting finger search in a similar way to the way doubling search is supported on sorted arrays, so that each inverted list can be searched in a cache-friendly way;
2. a concurrent intersection algorithm, similar to the original one proposed by Demaine~\etal~\cite{dlm}, in which a different thread searches in each inverted list, so that blocking memory accesses are not spent waiting but searching in other inverted lists.

The performance of this algorithm will be compared to that of existing non-threaded algorithms on real (notably from Google and TREC) and random data sets, with a simulator in C++ already available~\cite{fasterSetIntersectionAlgorithmsForTextSearching}. If time permits, we will analyze the performance theoretically in a cache-aware memory model.

• Expected ability of the student

Knowledge of

• Operating Systems and Concurrency,
• C++ and
• B-trees.
Notes
Stefan Buettcher pointed out that:
• this approach is valid {\em only} if the caching is between the memory and the hard-drive (because a miss in the memory cache blocks the CPU);
• as communication (and context switch) between threads is expensive, the threads should be user-level ones, and should access the hard-drive using asynchronous load instructions;
• the algorithm should include a prefetching algorithm.

### 1.8 Delaunay Triangulation and Voronoi Diagrams BIGDATA GEOMETRY

Basis
Various techniques yield good adaptive sorting algorithms for permutations, such as the decomposition into runs and the merging via Huffman trees, local insertion sort, and REM-filtering.
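One of the techniques named above, decomposition into runs merged in Huffman-tree order, can be sketched as follows (merging the two shortest runs first, which yields the $$O((n+1)\lg n - \sum r_i\lg r_i)$$ comparison bound discussed in the plan):

```python
import heapq

def _merge(a, b):
    """Standard two-way merge of two sorted lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:]); out.extend(b[j:])
    return out

def runs_sort(seq):
    """Split seq into maximal non-decreasing runs of lengths r_1..r_k,
    then repeatedly merge the two shortest runs (a Huffman tree over
    the run lengths)."""
    if not seq:
        return []
    runs, start = [], 0
    for i in range(1, len(seq)):
        if seq[i] < seq[i - 1]:
            runs.append(list(seq[start:i])); start = i
    runs.append(list(seq[start:]))
    tie = len(runs)                       # tie-breaker for heapq
    heap = [(len(r), k, r) for k, r in enumerate(runs)]
    heapq.heapify(heap)
    while len(heap) > 1:
        _, _, a = heapq.heappop(heap)
        _, _, b = heapq.heappop(heap)
        m = _merge(a, b)
        heapq.heappush(heap, (len(m), tie, m)); tie += 1
    return heap[0][2]
```

The open question is whether the same decomposition-and-merge scheme transfers to Convex Hull and Delaunay Triangulation inputs.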
Objectives
Do those techniques, applied to the computation of the Convex Hull or Delaunay Triangulation, yield interesting adaptive solutions?
Measures of Success
1. [ ] new measures of difficulty $$\delta$$;
2. [ ] theoretical lower bounds on the worst case complexity over instances of fixed $$n$$ and $$\delta$$;
3. [ ] algorithms adaptive to $$\delta$$, while still asymptotically optimal in the worst case for $$n$$ fixed;
4. [ ] matching upper and lower bounds for $$n$$ and $$\delta$$ fixed;
Plan
1. [ ] Study the literature on
1. [ ] Adaptive Sorting \cite{2013-IanFest-FromTimeToSpace-Barbay}
2. [ ] Convex Hull \cite{1986-JCom-TheUltimatePlanarConvexHullAlgorithm-KirkpatrickSeidel}
3. [ ] Delaunay Triangulation
2. [ ] Show results on the problem of Planar Convex Hull:
1. [ ] $$O((n+1)\lg n - \sum r_i\lg r_i)$$ complexity
2. [ ] $$\Omega((n+1)\lg n - \sum r_i\lg r_i)$$ complexity
3. [ ] Inv-related results
4. [ ] REM-related results
3. [ ] Show results on the problem of Planar Delaunay Triangulation:
1. [ ] $$O((n+1)\lg n - \sum r_i\lg r_i)$$ complexity
2. [ ] Proof of difficulty of partitioning input into optimal number of fragments.

### 1.9 Klee's Measure BIGDATA GEOMETRY

Basis
• There is an algorithm computing the Klee's measure of $$n$$ rectangles in $$d$$ dimensions in time within $$O(n^{d/2})$$ \cite{2013-FOCS-KleeSMeasureProblemMadeEasy-Chan}.
• Yet, for some particular instances (e.g. all rectangles nested like Russian dolls), there are much faster solutions (time within $$O(nd)$$).
• There are similar results and discrepancies for the Klee's measure of Orthants.
• It is only conjectured that this complexity is optimal in some model.
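For intuition, the one-dimensional case (the length of the union of $$n$$ intervals, Klee's original problem) is computable by a simple $$O(n\lg n)$$ sweep; a sketch:

```python
def klee_1d(intervals):
    """Length of the union of 1D intervals (lo, hi): sort by left
    endpoint and sweep, keeping the rightmost end seen so far.
    Nested ("Russian doll") instances cost almost nothing beyond the sort."""
    total, cur_end = 0, None
    for lo, hi in sorted(intervals):
        if cur_end is None or lo > cur_end:   # disjoint from current cover
            total += hi - lo
            cur_end = hi
        elif hi > cur_end:                    # extends the current cover
            total += hi - cur_end
            cur_end = hi
    return total
```

The $$O(n^{d/2})$$ algorithm reduces dimension by sweeping over one coordinate while maintaining such union measures on the remaining ones; an adaptive variant would exploit instances where the sweep events are cheap.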
Objectives
1. [ ] An algorithm computing Klee's measure in time within $$n^{d/2}$$ in the worst case, but much faster on "easy" instances, for some well defined measure of difficulty, and
2. [ ] a (dynamical) data structure for a certificate of the value of the Klee's measure of an instance.
Measures of Success
any of the results below:
1. [ ] An adaptive algorithm for either of the variants of Klee's measure, along with
2. [ ] a matching computational reduction (rather than a lower bound),
3. [ ] a formal definition of a "certificate" of a value of the Klee's measure of an instance;
4. [ ] a dynamical data structure for such a certificate.
Plan
1. Study Degenerate Instances
• on Orthants
• on Boxes
2. Study other types of "easy" instances
3. Implement
1. [ ] Timothy Chan's simple algorithm for Klee's measure \cite{2013-FOCS-KleeSMeasureProblemMadeEasy-Chan}.
2. [ ] new adaptive algorithms
4. Find Synthetic and Real Data Sets
5. Experiment

### 1.10 InterUnion of Sorted Arrays

Basis
Ehsan Chiniforooshan, Arash Farzan and Mehdi Mirzazadeh proposed an adaptive algorithm for recursive combinations of Intersections and Unions. InterUnion instances appear in "Google queries" as soon as one integrates synonyms into the search.
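An InterUnion instance can be represented as a recursive expression tree; a naive set-based evaluator (a baseline for the experiments, not the adaptive algorithm of Chiniforooshan et al.) could look like:

```python
def evaluate(expr):
    """Evaluate a recursive InterUnion expression. `expr` is either a
    sorted list (a leaf: one inverted list) or a tuple
    ('and' | 'or', subexpression, ...). Naive set semantics; the
    adaptive algorithm would instead interleave galloping searches
    guided by the expression structure."""
    if isinstance(expr, list):
        return set(expr)
    op, *subs = expr
    results = [evaluate(s) for s in subs]
    acc = results[0]
    for r in results[1:]:
        acc = acc & r if op == 'and' else acc | r
    return acc
```

For example, the query "(cat OR feline) AND (dog OR canine)" over four inverted lists becomes `('and', ('or', l1, l2), ('or', l3, l4))`.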
Objectives
How does their algorithm perform in practice?
Measures of Success
• [ ] Validate or invalidate the approach of Chiniforooshan et al.
• [ ] Possibly propose a better algorithm (e.g. using adaptive techniques developed for the Intersection problem)
Plan
1. [ ] Study
• [ ] \cite{2005-ICALP-WorstCaseOptimalUnionIntersectionExpressionEvaluation-ChiniForooshanFarzanMirzazadeh} and the publications after that.
• [ ] Mehdi Mirzazadeh's PhD thesis
2. [ ] Implement
3. [ ] Generate InterUnion queries from Yahoo!'s queries and a dictionary of synonyms
4. [ ] Experiment on TREC Data Set

## 2 HCI

### 2.1 Corregame.cl

Basis
It is useful to have someone else read your draft before submitting it, whether you are a student submitting an essay or a professor submitting a research article. Finding peers to review your work comes naturally in tight communities and large laboratories, but can be harder to come by in other contexts. Calibrated Peer Review implements such a concept in the classroom.
Objectives
Can a social network help people find peers to faithfully review their work, by requiring people to review other people's work in order to be allowed to submit their own? Can the quality of the reviews be evaluated in a semi-automatic way, in order to reward good reviews?
Measures of Success
Having several distinct communities using this reviewing process for events such as conferences, and reporting positively on it.
Plan
1. Developing a program or script introducing a finite number of small errors at random in a PDF article, at specifically verifiable coordinates.
2. Developing a form for users to easily report errors in the reviewed document, in a way which can be formally compared with the randomly introduced errors and with other reviewers' reports.
3. Developing a website receiving submissions and assigning reviews.
4. Having a small community using this reviewing process for a small event such as a conference, and reporting positively on it.
5. Implement a selection of the recommendations of the users of the conference.
6. Have a second community using this reviewing process for a small event such as a conference, and reporting positively on it.
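Step 1 of the plan, injecting verifiable errors, can be sketched on plain text (the PDF case, with fixed page coordinates, is the harder engineering task; function names are illustrative):

```python
import random

def inject_errors(text, n_errors, seed=0):
    """Introduce n_errors single-character substitutions at random
    positions in `text`, returning the corrupted text and a log of
    (position, original, replacement) so that reviewers' reports can
    be verified automatically."""
    rng = random.Random(seed)
    chars = list(text)
    positions = rng.sample(range(len(chars)), n_errors)
    log = []
    for p in sorted(positions):
        original = chars[p]
        replacement = rng.choice([c for c in "abcdefghijklmnopqrstuvwxyz"
                                  if c != original.lower()])
        log.append((p, original, replacement))
        chars[p] = replacement
    return "".join(chars), log

def score_report(reported_positions, log, tolerance=0):
    """Fraction of the injected errors found by a reviewer's report,
    allowing a positional tolerance."""
    found = sum(1 for p, _, _ in log
                if any(abs(p - q) <= tolerance for q in reported_positions))
    return found / len(log)
```

The score of a review against the known injected errors is what makes the comparison with other reviewers "formal" in step 2.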
Extensions
• Other types of documents (e.g. simple text, LaTeX, word, …)
• More sophisticated types of errors than typographical ones (e.g. errors in proofs)

### 2.2 Collective Mountaineering App and Server

Title
Collective Mountaineering App and Server
Problem identified
Tagging of trails in mountains is of unequal quality and deteriorates with time.
Material/Solutions Available
???
Solution proposed
An app and server to
• coordinate and encourage hikers to
• evaluate (and update old evaluations of) the marking of trails,
• submit GPS information of existing and new trails,
• validate submitted GPS information of trails.
• inform the administration of parks of the quality of their trail marking.
Draft of Calendar
1. Elaboration of survey 1 of Andean mountaineers' needs.
2. Study of existing collaborative quality control engine, Repositorium.
3. Processing of results of survey 1
4. Development of initial prototype
5. Alpha testing and survey 2
6. Prototype 2
7. Survey 2
8. Compilation of the project diary into a report.
Evaluation Conditions
• 75% Positive evaluation of beta testers with a pool of at least 20 users.
Mode of Documentation
Report in English

### 2.3 Collective evaluation of Video Projects for TUApp

• Design a Collective Quality Control website such that
• each (registered) student
• desiring to upload a video must
• evaluate 6 quality criteria of two videos (one is new, one is from the database, without the student knowing which is which)
• (and is rejected if his answers do not match the ones from the database for the video already known).

### 2.4 Generator of Composite Images and Bootstrapping Database of Images

• Technical Outline
• Given a set of images, compute a set of descriptors for each of them (in terms of shape and colors) and enter them in a database system [based on Diego Diaz's internship].
• Define a set of 9 image-to-image distances which can be computed based on those descriptors.
• At each user query, consisting of a picture P and a parameter x,
• cut P in pieces of x*x pixels;
• For each piece and for each distance d,
• search in the database for the picture the most similar in color and shape according to d.
• For each distance,
• form a composite image where each piece is replaced by the corresponding picture.
• Propose to the user a low resolution version of each of the 9 composite images, and let him choose one to download in high resolution.
• Log which distances generated the image chosen.
• Compute the descriptors of the image and add it to the database.
• Generate a histogram to represent the statistics on the choices of the users.
• Notes:
• Bootstrapping aspect
1. The quality of composite images will grow with the size of the database.
2. The size of the database will grow with usage.
3. The usage will grow with the quality of the composite images.
• Legal aspects.
• the images originally in the database need to be free of copyright.
• the user, being given access to those images through the composite image, must agree to release all copyright on the image he uploads (must he be forced to give a mailing address?)
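The per-tile matching described above can be sketched with a toy descriptor (mean color stands in for the richer shape-and-color descriptors, and squared Euclidean distance for one of the 9 distances; all names are illustrative):

```python
def mean_color(pixels):
    """Toy descriptor: average (r, g, b) over a tile's pixels."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))

def nearest(descriptor, database):
    """database: {image_name: descriptor}. Returns the name minimizing
    squared Euclidean distance to `descriptor`."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(database, key=lambda name: d2(database[name], descriptor))

def composite(tiles, database):
    """Replace every x*x tile by (the name of) its most similar
    database image, producing one composite per distance function."""
    return [nearest(mean_color(t), database) for t in tiles]
```

Logging which distance produced the image the user actually downloads is what drives the histogram of step "Log which distances generated the image chosen".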

### 2.5 Collection and Validation of Short Pedagogical Videos Generated by Students

• Technical Outline
• Given
• a small set of quality criteria on short pedagogical videos
• a kernel of short pedagogical videos such that, for each criterion, there are some videos both matching and not matching it.
• Produce an engine which
• requires each student to
• submit his own short pedagogical video
• evaluate pedagogical videos (both from others and from the kernel)
• in order to be allowed to
• search and

### 2.6 Categories in Repositorium

1. Description
• Repositorium is a web engine for users to exchange documents and evaluate them.
• It ensures the quality of the documents shared by requiring users to evaluate submitted documents in exchange for the possibility to download documents which have already been evaluated and validated.
• It ensures the quality of the evaluations performed by the users by mixing into the set of documents to be evaluated some whose evaluations have already been validated: if the user's evaluations do not agree on those documents, the contribution is rejected.
• A server is running the latest stable code on http://www.repositorium.cl/
2. Current State
• Supports multiple criteria of quality, all boolean.
• Supports deterministic challenges (where the proportion of hidden tests is fixed)
• Supports multiple communities on the same server, which share users' accounts but separate
• definitions of criteria of quality control
• credit of users (won by successfully evaluating documents, or by uploading documents which are later successfully evaluated)
• Supports bookmarking of communities by a user
• Supports multiple types of documents
3. Desired State at the end of the project
• Support for the definition and publication of a Category by any user, local to the community
• a Category is a list of criteria of Quality control
• The cost of downloading a document from a category is computed as the sum of the costs of acquiring the document under each criterion of quality control
• Support for the bookmarking of categories by a user
• Support for the reputation and credit of a user as a vector composed by the value in each quality criteria of the community (i.e. in a larger context than the category)
• (OPT) Small hack of the code to allow (at compilation time?) deactivating the multi-community feature:
• creation of new communities by users
• display of options to choose community if there is only one community available.
4. REFERENCES
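The hidden-challenge mechanism described above (documents with already-validated evaluations mixed, unmarked, into the user's evaluation set) can be sketched as follows (illustrative names; the real engine also tracks per-criterion credit):

```python
def validate_contribution(answers, hidden_truth):
    """Repositorium-style quality control sketch. `answers` maps
    document id -> the user's boolean evaluation; `hidden_truth` maps
    the ids of the already-validated documents (unknown to the user)
    to their validated evaluations. The contribution is accepted only
    if the user agrees on every hidden document."""
    return all(answers[doc] == truth for doc, truth in hidden_truth.items())
```

With multiple boolean criteria, the same check is simply applied per criterion, which is how a per-criterion reputation vector could be maintained.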

### 2.7 Duplication detection via Collective Quality Control

1. Detection of Duplicates
• when a new document $$N$$ is added which shares the same set of tags as an older document $$D$$, one must decide whether
• $$N$$ is equivalent to $$D$$ (a copy?) and should be refused,
• $$N$$ is completely distinct from $$D$$ (the tags should be refined), or
• $$N$$ is a newer version of $$D$$ ($$D$$ should be deprecated)
• a new type of challenge is created, asking the user to either
• add tags to each document in order to distinguish them
• decide that those documents are equivalent
• decide that one document is a more advanced version of the other (which corresponds to adding a tag "DEPRECATED" to the old version)
• in order to keep users honest, real queries are mixed with some pairs of documents which are known to be
• distinct but similar
• different but equivalent
• the results of those "duplicate" challenges are also submitted to the expert for validation.
2. OPEN QUESTIONS:
1. Will the owner of the collective repository create the pairs of duplicate tests, or will they be automatically generated? Can we imagine dual solutions, such as
• the user is offered some default values
• the user can enter some at the beginning, but when the database is big, the engine automatically generates tests.
2. Should we check for duplicates when an expert is submitting a bunch of documents
• inside the bunch?
• between documents from the bunch and the content of the database?
3. Improved Search Engine
• a better search engine will help with duplicate detection.
• Hierarchy of tags
• e.g.
• "CC4102" => "Algoritmos,Estructuras de Datos"
• "Algoritmos" => "Ciencias de la Computacion"
• Questions:
1. How will those hierarchies be learned? (Will the users/owners type and edit them?)
• Features
• 1 button for the administrator to check for duplicates within the database
• for when an administrator takes over, or if the definition of what constitutes a duplicate changes over time
• algorithm (if MySQL does not support duplicate detection already)
• build for each document a signature composed of all tags associated with it, in alphabetical order
• sort documents by their signature
• scan sorted list for sequences of documents with identical signature.
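The signature-based scan described above can be sketched directly:

```python
from itertools import groupby

def duplicate_candidates(docs):
    """docs: {doc_id: set_of_tags}. Build the alphabetical tag
    signature of each document, sort documents by signature, and
    report each group sharing a signature as duplicate candidates."""
    signed = sorted((tuple(sorted(tags)), doc_id)
                    for doc_id, tags in docs.items())
    groups = []
    for _, grp in groupby(signed, key=lambda t: t[0]):
        ids = [doc_id for _, doc_id in grp]
        if len(ids) > 1:                  # a signature shared by >1 document
            groups.append(sorted(ids))
    return groups
```

Each reported group would then be turned into a "duplicate challenge" for the users, as described above.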

### 2.8 Video Streaming API for Repositorium

1. General Objective:
• Define a new bootstrapping client for videos,
• which can be captured on phones or produced on computers.
2. LIST of tasks for the student
1. [ ] Write proposal
2. [ ] Install Repositorium on Buho.cl
3. [ ] Write first android app which captures and uploads video to a specific repository on repositorium.cl
3. Resources available:
1. Video Streaming server, ssh://buho.dcc.uchile.cl
2. Previous Prototype, Repositorium,
• lacking video management
• containing an API
• Containing collective quality control
3. Pool of Android Devices
4. STUDENTS Interested
5. MEETINGS

### 2.9 Repositorium for troll control on Forums

• Collective Quality Control applied as an anti-spam and anti-troll filter
• in order to post, each user must check a set of 5 posts against the spam and troll criteria.

### 2.10 (real) Collective Quality Control (Wikipedia ++, Repositorium)

• Implement a wiki similar to wikipedia, with the additional specificity that
• the free access is only for a subset of the pages (e.g. high quality stable ones) and
• having an account
• having earned some virtual credit by
• submitting new content which has been approved
• validating quality criteria on existing content.

### 2.11 Collective indexation of pictures (OJO Fech)

• Implement a collective indexing system so that
• users who upload pictures are required to do a minor amount of indexing (checked by collective quality control)
• users who search the pictures (e.g. journalists) perform more indexing

### 2.12 DB for Composite Images

"BootStrapping Database of Pictures for Composite Images"

1. Short Technical Description
• Bootstrapping Database of Thematic Images
• Given an initial set of pictures on one theme (e.g. dogs in Santiago)
• Offer people to
• submit a new picture P on this theme (with all copyright),
• answer some questions about existing pictures in the database (does it contain a dog or not, is it well framed or not, is it decent or not), truthfulness being checked via some challenge pictures for which the answer is known,
• choose one among a selection of composite copies of P,
2. Longer description

Given a picture and a parameter x, cut it in pieces of x*x pixels. For each piece, search in the database for the picture most similar in color and shape. Form a composite image where each piece is replaced by the corresponding picture. With a sufficiently large database, and a search engine sufficiently good with shapes, one could use a large value of x.

This algorithm could be used, for instance, on a database of only dog images, to generate composite images of dogs. In later steps, one would consider a more diverse database, and add keywords to allow restricting the images used for the composite image. The description above is kept simple.

The part I am really interested in is creating a bootstrapping database (e.g. with the 150 pictures of dogs I took) which grows each time a person uses it. At the very least, each person requesting a composite image would contribute his/her picture (in exchange for the right to use the ones composing his/her composite image), and perform a small validation job on pictures submitted by others (e.g. "while we compute your composite image, please validate 8 pictures, among which 2 are known to be invalid and 2 to be valid, and the four others are not yet validated: you will get your composite image only if your validation confirms ours on the known pictures").

There are two theses to be checked:

1. That a database of larger volume with an indexing scheme taking into account shapes (and not only color histograms) will yield better composite images. I would prefer Diego to work with someone else on this one, for instance you if you are interested: I am really not an expert in this.
2. That starting from a database of 150 pictures and composite images of relatively low quality (low value of x), available for free (beyond the small validation task), the database will grow in volume and diversity, and the quality of composite images will increase, up to a point where users can be asked to perform more complex tasks (e.g. add keywords to pictures) in exchange for their composite image, or be given the alternative to pay a fee for the service. If confirmed on this example, the thesis would have many other applications that I am interested in.

## 3 Life Hacking

### 3.1 Collective Gathering of Air Quality

Basis
Objectives
BUILD a system to map the air pollution in Santiago over time, independently from the government (and industrial lobbies)
Measures of Success
• [ ] BUILD 10 DIY air-measuring kits and bind them with 10 Android phones from the department of CS at UChile;
• [ ] ADAPT existing open-source application;
• [ ] EQUIP 10 students with those kits during the months of July and August;
• [ ] COMPARE the independent results with those published by the government;
• [ ] PUBLISH the results obtained.
Plan
• March-June:
• BUILD 10 DIY kits:
• July August:
• EQUIP 10 students with those kits during the months of July and August;
• Sept:
• PUBLISH the results obtained.

### 3.2 OpenMailBox

• Digital Mailbox / Dropbox
• sender can
• drop a text or document to a recipient (i.e. sending an email);
• see at any time if his drop has been checked or not;
• later update it or remove it if the recipient did not check it yet (similar to an open mail box in the physical world).
• updates of versions sent are transparent to the recipient, in a similar way to sharing a Dropbox folder with the sender; but
• the exchange is private between sender and recipient, contrarily to what would happen if the recipient left a Dropbox folder available to everybody, and
• the sender does not have to configure one Dropbox account in advance for each potential recipient, contrarily to what would be required in order to ensure privacy using Dropbox accounts.
• sender can define a time window for recipient to check it:
• after date x, recipient receives text (e.g. someone going on a dangerous trip can send a letter to be opened in case of death, and cancel the letter on his way back)
• until date x, recipient can access text (e.g. vacation message, equivalent to autoreply but personalized)
• Folder updated with the content of the open inbox
• rather than a webserver where the user can see a list of entries,
• propose to install a program such as the Dropbox client to let a recipient manage his open inbox like any synchronized folder.
• free up to some space usage and amount of messages pending in receiver open inbox,
• for a monthly fee in order to allow bigger space usage or number of messages pending in user open inbox.

### 3.3 Audio Interfaces

"Better Audio Interfaces for Blind and Busy People"

1. Introduction
1. Our society is mainly based on visual interfaces, but joggers and busy business people have rediscovered the pleasure of audio interfaces, with the development of applications to listen to audio books and to read one's email out loud.
2. Blind people are mainly restricted to audio and tactile interfaces to access the world, and in particular the electronic world. Some tactile interfaces have been developed to replace visual interfaces (or to translate the more generally available visual interfaces), but the potential for improvement of audio interfaces seems much larger: they can be improved easily, while it is not so clear how to improve tactile interfaces.
3. The improvements proposed here will benefit both blind and non-blind people, by transposing recent improvements in visual interface techniques to the more neglected area of audio interfaces. In particular we consider the ease of annotating, navigating, and searching text files, which technology permits in audio files as well, but for which the interface has not been created yet.
2. Concrete proposals
1. (better) navigation system in large audio files
• current navigation systems are adequate for a large file system of small audio files, but not for a system composed of very large audio files, such as an entire audiobook or the recording of a one-hour talk or conversation. Systematically breaking those large files into small pieces is not a convenient solution either, whether for
• fractioned listening (when you listen to the entire file over several sessions, and need to restart from where you stopped at the end of the last session)
• searching for a specific passage (whether it is one big file or a sequence of small ones, the search interface is sequential, when a better interface would allow logarithmic time)
• we need an audio bookmark system for locations in a file
• we need exponential navigation (keeping the button pressed doubles the speed of movement through the audio file, rather than keeping it constant)
2. Audio Highlighting
• a button (on the iPhone, Sony Jukebox, or Android) to add some light background noise to a recording, such as an audio book or a recorded talk.
• navigation tools to sequentially jump from one highlight to the next.
3. Audio Annotation System
• more advanced than highlighting: allow the addition of audio notes associated to a specific location in the audio text (be it an audio book or the recording of a talk or conversation), and signaled via some audio highlighting.
3. Thesis that could be proved or disproved
1. A better navigation system (e.g. exponential speed) permits users to find more quickly the location where they stopped listening to a large recording.
2. A highlighting and annotation system permits blind and non-blind people to interact better with audio texts, in a similar way to highlighting and annotation tools for PDF documents (such as the open source Jarnal and the commercial PDF Annotator).
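The exponential navigation idea from proposal 2.1 can be sketched in a few lines; this is a minimal illustration only, and the function and parameter names are assumptions rather than any existing interface:

```python
def seek_positions(start, dt=0.5, v0=2.0):
    """Yield playback positions (in seconds) while the seek button is held.

    Every `dt` seconds of holding, the seek speed doubles instead of staying
    constant, so distant positions are reached in logarithmically many steps.
    `start` is the current position; `v0` the initial speed in seconds/second.
    """
    pos, v = start, v0
    while True:
        yield pos
        pos += v * dt  # advance by the current speed
        v *= 2         # exponential navigation: double the speed each step
```

Holding the button for four steps from position 0 visits positions 0, 1, 3, 7 seconds, so a location one hour into a recording is reached in about a dozen steps rather than thousands of constant-speed ones.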

## 4 Pedagogy

### 4.1 GoPro / Swimovate for digital improvement of swimming lessons

• Technical Outline
• Given
• one GoPro aquatic camera
• 10 Swimovate watches
• one swimming course.
• Distribute the 10 watches to 10 swimmers, with instruction to upload their data to a central repository.
• Select 3 groups of swimmers with watches
1. one group is
• taken on video while swimming and
• shown videos of
• themselves swimming and of
• experts swimming, and
• given the feedback of the watches.
2. one group is
• only shown videos of experts swimming, and
• given the feedback of the watches.
3. one group is
• only given the feedback of the watches
• Analyze the data at the end of the experiment to rank the effect of the following pedagogical techniques:
1. comparing videos of yourself swimming with experts
2. watching videos of experts
3. getting (more detailed) feedback by swimming watches

### 4.2 Pedagogical Games "Many to Many teaching" (Duo Lingo ++)

• Short description
• Choose a topic to teach (e.g. Arithmetic)
• Design a kernel of simple problems and solutions (e.g. "Jack has 5 candies and gives 3 to Ellen. How many does he have left?")
• List some quality criteria and labels for such problems (e.g. "subtraction", "easy", "candies"…)
• Design a bootstrapping database and the corresponding application which
• proposes to the user a mix of
• 1 problem with 1 solution, on which to validate quality criteria (including correctness of the solution);
• 1 problem with 4 solutions, on which to choose the possible correct solution(s);
• 4 problems with 1 solution each, on which to choose the possible correct problem(s);
• 1 problem without solution, on which to write a solution.

### 4.3 Multi Choice Question Generator, Database, and Collective Quality Control.

1. Objective

Build a relational database and a graphical interface to manage simple multi-choice questions and their solutions by each instructor. Each multi-choice question can have a positive and negative statement, and must have at least one correct and three incorrect answers if it has a positive statement, and at least one incorrect and three correct answers if it has a negative statement. In practice it is expected to have many more both incorrect and correct answers.

Adapt the existing scripts allowing for the automatic generation, given a list of questions, the date and title of the control, and the desired number n of distinct controls, of n randomly generated controls, each featuring at the bottom the encrypted information needed to correct it quickly. Generate the page helping the teaching assistant mark the controls, and a global solution of the control to be distributed to the students after correction.
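The randomized generation step could look like the following sketch (the function and data layout are assumptions for illustration, not the existing scripts): each of the n controls gets its own ordering of the questions, and the answers of each question are shuffled independently.

```python
import random

def generate_controls(questions, n, seed=0):
    """Generate n randomized versions of a control.

    `questions` is a list of (text, answers) pairs.  Each generated control
    shuffles the question order and, within each question, the answer order,
    so that neighboring students cannot copy positions from each other.
    A fixed seed keeps the generation reproducible for later correction.
    """
    rng = random.Random(seed)
    controls = []
    for _ in range(n):
        qs = questions[:]          # copy so the master list is untouched
        rng.shuffle(qs)            # shuffle the question order
        control = []
        for text, answers in qs:
            a = answers[:]
            rng.shuffle(a)         # shuffle the answers of this question
            control.append((text, a))
        controls.append(control)
    return controls
```

Encoding the chosen permutations (e.g. encrypted at the bottom of each printed control, as described above) is what would later allow quick correction.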

If time permits, consider the design of a collective quality control which trades

• the task (controlled) of validating a newly shared question.
• For instance, given an instructor who desires to access a question shared by others, require him to evaluate the quality of four related questions of four solutions each. Give him access to the question of his choice only if he correctly solves the validation questions, hence validating his evaluation of the fourth question.
1. Concept
1. Reproduce in PHP MySQL the generator described in \cite{2004-Havana-onTheUseOfAComputerToTeachMoreAndBetterInACollectiveManner-Barbay}
2. "Force" the sharing of questions by putting them in a common pot one month after the generation of the questionnaire.
2. Site Map
1. Question: Create
2. Control: Create
3. Search
4. My Account
2. Create and Edit Questions
1. Text of Question ""
2. Negative text of Question ""
3. Answers, each with
• one text ""
• two exclusive fields [ ] Correct or [ ] Incorrect
• a comment on the answer: ""
4. (Additional) keywords for the question: ""
5. Buttons:
• [ ] Question is Publicly available
• [ ] Question will be made public on date
3. Create and Edit Control
1. Title ""
2. Date ""
3. List of Questions, each with
• a simple link to a question, with the first line of the text of the question;
• a button Remove
• a button Edit
• a button Mark
4. Button Export (in PDF, LaTeX, or XML)
4. Search
1. Keywords ""
2. Restrict to my questions [ ]
3. Restrict to (my) private questions [ ]
5. Result of Search: some questions and controls, each according to the following format:
1. Question
• Text
• Keywords Matched
• boxes Mark and Edit
2. Control
• Title and date
• Keyword Matched
• some questions
• Button Edit
6. Page Export
1. Visualisation of an example questionnaire.
2. Menu "Number of questionnaires": input n from 1 to 1000
3. Menu "Export Format": input f chosen between
• LaTeX
• XML
• PDF
• ASCII
• Open Office ODT
• RTF for Microsoft Word
4. Button Generate
• generates a document in format f containing n questionnaires randomly generated from the current control definition, and one extra template to make the controls easy to mark.

### 4.4 Searchable Database of Solved Problems

1. Objective

Build a relational database and a graphical interface to manage typed assignment problems, their translations in various languages, their typed solutions, marking schemes, and history of usage (by date, course, and exam or assignment), for each professor having an account, and allowing for the sharing of selected problems between professors.

Write the scripts allowing for the automatic insertion of the existing database of solved problems in LaTeX with one problem (and solution, marking scheme and translation) per file.

Adapt the scripts allowing for the automatic generation, given a list of problems, the date and title of the assignment, of the assignment itself, of the assignment with solution and marking scheme for the teaching assistant, and of the assignment with complete solutions for the students after the correction of the assignment.

If time permits, consider the design of a collective quality control which trades

• the task (controlled) of validating a newly shared problem.
• For instance, given an instructor who desires to access problems shared by others, require him to evaluate the quality of four related problems and their solutions, among which one or two are known to be false, one or two are known to be correct, and one has not been validated. Give him access to the problem of his choice only if he can identify the correct and incorrect problems, hence validating his evaluation of the fourth problem.
2. Technical description
1. Set-up a simple SQL database with basic fields
• tables:
• problem ( string text, string keywords, authorID)
• solution ( string text, string keywords, authorID)
• author (string name, string email )
• assignment (date, string occasion, authorID, string course, string location )
• solving (problemID, solutionID, controlID )
• useOfProblem (problemID, assignment)
• publicationOfSolution (solutionID, assignment)
2. Set up a PHP interface to the table allowing to
• add a solution to a problem
• add an entry in :solution:
• add an entry in :solving:
• add an assignment to a problem or solution
• search for a problem
3. Decide how to add the table :control: and the :controlID: fields to problems and solutions
4. Export formats:
• an XML representation of an assignment
• full LaTeX source of an assignment
• a ps/pdf output of the assignment
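The table layout listed above could be set up as in the following sketch, using SQLite for illustration (column types, key names, and the use of `sqlite3` are assumptions; the target system is described as MySQL/PHP):

```python
import sqlite3

# Minimal sketch of the tables from the technical description above.
# Each table gets a surrogate integer key; other columns follow the listing.
SCHEMA = """
CREATE TABLE author     (authorID INTEGER PRIMARY KEY, name TEXT, email TEXT);
CREATE TABLE problem    (problemID INTEGER PRIMARY KEY, text TEXT, keywords TEXT,
                         authorID INTEGER REFERENCES author(authorID));
CREATE TABLE solution   (solutionID INTEGER PRIMARY KEY, text TEXT, keywords TEXT,
                         authorID INTEGER REFERENCES author(authorID));
CREATE TABLE assignment (assignmentID INTEGER PRIMARY KEY, date TEXT, occasion TEXT,
                         authorID INTEGER, course TEXT, location TEXT);
CREATE TABLE solving    (problemID INTEGER, solutionID INTEGER, controlID INTEGER);
CREATE TABLE useOfProblem          (problemID INTEGER, assignmentID INTEGER);
CREATE TABLE publicationOfSolution (solutionID INTEGER, assignmentID INTEGER);
"""

conn = sqlite3.connect(":memory:")  # in-memory database for the sketch
conn.executescript(SCHEMA)
```

The link tables (`solving`, `useOfProblem`, `publicationOfSolution`) carry the many-to-many relations, so a problem can be reused across assignments and have several candidate solutions.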

### 4.5 Pedagogic games

Develop games combining the following concepts:

1. References from others
1. Celine Latulipe [2009-10-20 Tue]

In the universe of my evil, rapid prototyping class (Bwahaha!). I'm making my students create a Montessori-style sentence labeling game (drag a big red circle over the verbs and a big black triangle over the nouns, etc.) in Flash as a class assignment. I'm trying to give them some labeled sentences as examples. Apparently I need to choose very simple sentences. :-)

2. Remember names at a party

using behavioral principles from the Skinner Box and EQ psychological analysis (Norathian Scrolls)

4. Repetition Optimisation for Memory

along the line of the supermemo algorithm

5. Unbounded content and interfaces
1. Pedagogic Content
• taking from dynamic external sources such as Wikipedia, Flickr
• generating, indexing and validating content from users through bootstrapping databases
2. Program interfaces (Games and toys)
• open source, can be modified and edited by anyone
• course project by student learning how to program at the University
3. Artistic Content
• Creative Common licensed, can be edited and modified
• Pointing to artists' web pages: beginner artists use it as publicity
6. Combination of Genre
• as in "You don't know Jack!" or the Quizzes from "Speed Racer 2", diversify the challenges for more interest
7. Web 2.0 Feedback on the games and content

to be able to assess the success of each component, and improve them

8. Student feedback separated from student evaluation
• feedback allows the student to progress
• evaluation is only a safeguard to avoid offering too easy/difficult problems to the student
9. Friendly competitive
• put friends in competition in a way which motivates them
• to play more often
• to improve their score
• the competition does not need to be on the same topics; the interface can "translate" the skills between different areas into comparable scores, as long as it encourages rather than discourages
10. International, Independent of Language

In order to maximize the potential audience of the first version, restrict language to the content, and represent the rules as much as possible as drawings (along with an optional textual and audio legend for blind players)

11. Separate content from form

separating the content from the form allows each to have a different rhythm, such that the rhythm of the form sustains the motivation of the player (curiosity about future forms of the game) while the rhythm of the content follows the optimal repetition scheme for memory retention (cf. Supermemo and Wozniak's repetition algorithms and formulae)

12. Label Archimede

One could propose a label "Archimede" for games and toys (on the computer or not), certifying the pedagogical content of the game.

1. One would modify open source games to add pedagogical content (repeated using the supermemo algorithm, independently of success in the game). Examples:
• globulation 2: answer one question during the development and upgrade of each building. The building collapses if the question is not answered correctly. The questions are taken from a bootstrapping database with keywords attached to buildings and questions, for instance:
• swimming pool -> hydraulics (how much time does it take to fill the pool at a flow rate of 2 l/s if the pool holds 100 l)?
• barracks -> volume, resistance, time scheduling
• farm -> sustainable development
• Frets on fire:
• associate composers to their music
• associate titles to known musics
• associate instrument pictures to their sound
3. companies pay to get the pedagogical content of their commercial games evaluated
4. the questions come from an external database (independently of the interface/game/toy used to present them), and statistics on the progress of the student can be seen and analyzed by the student and parents, with future questions oriented in one direction or another.

## 5 Android Projects

### 5.1 MANAGE Fund Plan

• track each phone call made
• compute how much each phone call costs
• allow comparing the total with what is billed by the phone company.

### 5.2 BusStopWarning

An application which warns you based on your GPS location when you get close to your bus stop.
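The core proximity check behind such a warning can be sketched as follows; the real application would be Android code polling the GPS, so this is only an illustration of the logic, and the function names and 300 m default radius are assumptions:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two GPS fixes (haversine formula)."""
    R = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

def near_stop(current, stop, radius_m=300):
    """True when the current (lat, lon) fix is within radius_m of the bus stop."""
    return haversine_m(*current, *stop) <= radius_m
```

On each location update the application would call `near_stop` and fire a notification (sound or vibration) the first time it returns True.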

### 5.3 Contextual Todo List

1. Main Idea:
1. Permit the Clock-in/Clock-out of items (the cell phone is superior to the computer in this sense, because one always has it at hand), in a way which can be exported to generate reports.
2. outputs the actions of a todo list matching best the context, and those partially matching the context, as guessed by the android based on
• the time,
• the location, and
• other potential inputs
• walking (accelerometer) -> listening
• riding the bus (accelerometer) -> listening, writing
• driving (connected) -> listening
• good mood ( :) in SMSs and Mails)
• bad mood ( :( in SMSs and Mails)
• Example:
• Here & Now
• Here & someday
• Now, somewhere else (Important)
2. Can be used in conjunction with an Org-mode interface
3. Org-mode integration
1. Existing solution: MobileOrg

Org-mode has built-in support for MobileOrg (or any other mobile client that may come along in the future and use the same asymmetric synchronization approach).

• A simple org-mobile-push will stage your complete set of Org files for MobileOrg to pick them up. The result includes:
• an Org file representing all of your custom agenda views
• automatic checksum file construction to speed up sync
• automatic index.org file generation with links to all of your Org files
• A powerful org-mobile-pull command, which will integrate changes you’ve made on the go into a local Org file

http://mobileorg.ncogni.to/features/ [2010-12-05 Sun]

2. References Android org-mode:

### 5.4 Multiplayer TRONDROID game for Smartphones   GAME

• Many multiplayer game modes:
• 4*1 game:
• four players, each controlling a worm
• 2*2 game
• four players in teams of two,
• collaborating on controlling their worm
• 2 vs 1
• two players collaborating on controlling a heavier worm (it can cross one light wall every 20 seconds)
• one player controlling a lighter but faster worm.
• Control
• Tilt the smartphone to direct the worm
• accelerate (shake) the smartphone to boost it

### 5.5 Association game   GAME

• Content
• vocabulary correspondence between two languages
• definition of a course
• Interface
• Memory Game
• MultiChoice questions
• Flash Card

### 5.6 List of Contacts with tag filters

• [ ] Geographic
• [ ] Style of activities

• automatically copy the current time and context (GPS?)
• learning to offer some type of entries from the time and context,
• based on previous selections

### 5.8 Agenda with "Ideal Schedule" in shadow

• normal agenda functions
• but, in between, an "ideal schedule" shown in shadow
• a score at the end of the day/week/month to measure how close you got to your ideal schedule.

### 5.9 Secure Transaction Software for Car Pooling

• OBJECTIVE: encourage car-pooling
• Versions:
1. Client to Client

Via Bluetooth, supposing driver and hitch-hiker both have the application installed on an Android device, the application helps them quickly agree on a fare for the hitch-hiker and transfer this money using a system such as PayPal.

2. Client to Server

Via wireless Internet (G1-G3), the application helps hitch-hikers find and book a ride in the future (short and long term), with the payment of a booking fee, a cancellation fee, and the final payment of the ride.

### 5.10 Lender

• memorize what is lent and borrowed from whom
• extension synchronizing by Bluetooth with other Android devices
• extension using cryptography to create "signed" transactions.

### 5.11 Diary

Collect the times of events in daily life, such as going to bed and waking up. Optimize the menu of event entry so that it "learns" which events to suggest depending on the time of the day and previous actions.
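The "learning" menu can start as a simple frequency model keyed by hour of the day, as in this sketch (class and method names are illustrative assumptions, not from the proposal):

```python
from collections import Counter, defaultdict

class DiarySuggester:
    """Suggest diary events by counting which labels were logged at each hour.

    A minimal sketch: each logged (hour, event) pair increments a counter,
    and suggestions for a given hour are the most frequent labels seen then.
    """
    def __init__(self):
        self.by_hour = defaultdict(Counter)

    def log(self, hour, event):
        """Record that `event` was entered during hour-of-day `hour` (0-23)."""
        self.by_hour[hour][event] += 1

    def suggest(self, hour, k=3):
        """Return up to k event labels most frequently logged at this hour."""
        return [e for e, _ in self.by_hour[hour].most_common(k)]
```

After a few evenings of logging "going to bed" around 23:00, the menu would rank that entry first at night; conditioning on the previous action as well would be a natural refinement.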

### 5.12 Memory Repetition, Mnemosyne and co   GAME

• Improving on AnyMemo
• using techniques from spaced repetition for memorisation
• detecting conflicts and repeating conflicting pairs

### 5.14 Pedagogical applications   GAME

• to train memory
• to learn associations

### 5.15 Synchronization of text files between computers

• if the phone is connected to a computer at home and at work, it can be used to store the incremental differences between files in order to synchronize the computers without an internet connection.

### 5.17 Game to teach regular expression   GAME

• Given a verbal description of a language subset of $$\{0,1,2,3\}^*$$
• User must input the regular expression (using a graphical interface)
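Checking the player's answer does not require deciding regular-expression equivalence: the game can validate the input against sample strings that must be accepted and rejected, as in this sketch (the helper name and scoring scheme are assumptions):

```python
import re

def check_answer(user_regex, accept, reject):
    """Validate a player's regular expression against labeled sample strings.

    The regex must fully match every string in `accept` and no string in
    `reject`.  Using fullmatch (not match) prevents a partial prefix match
    from being counted as success.
    """
    pat = re.compile(user_regex)
    ok_accept = all(pat.fullmatch(s) for s in accept)
    ok_reject = not any(pat.fullmatch(s) for s in reject)
    return ok_accept and ok_reject
```

For example, for the verbal description "strings over {0,1,2,3} ending in 3", the answer `[0123]*3` passes samples such as "3" and "013" while rejecting "" and "30". Sample-based checking can accept a wrong regex that happens to agree on the samples, so levels would need well-chosen test strings.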

### 5.18 Game used as a digital window to a parallel world   GAME

• turning the phone around corresponds to turning around in the virtual world.
• the world is populated with children hunting candies for halloween
• tapping the screen sends a virtual candy to a virtual child, who goes away happily.

### 5.19 SequenceTimer   OUTP

• Two functionalities:
1. Enter a string of the form "(4b3b2bp)10"
2. Execute the string where
• each digit corresponds to a time in minutes
• each letter p corresponds to a pause (waiting for the user to press a key)
• each letter b corresponds to an audible alarm of finite length (e.g. a bell)
• each letter v corresponds to a vibration alarm
• each pair of parentheses delimits a sequence S of actions
• Sx denotes the sequence S repeated x times
• S* denotes the sequence S repeated indefinitely.
• Usage:
• Sport exercises
• Sequences of physical exercises alternating effort and relaxation:
• Student presentations
• When each group of students has
• 4 min for their presentation
• 3 min for questions
• 1 min for changing groups
• and there are 30 groups
• the sequence would be "(4b3b2bp)30"
• Extensions:
1. add to the syntax to permit
• change of images
• change of colors
• change of alarm sound
• display of text
• an alarm which stops only when a user presses a button
2. Display the overtime in red during the pause.
3. Multi Timer :MAYB:
• to create several labeled reusable timers (eggs, washing machine, etc.) and
• allow to use several ones in parallel.
• Closest Applications in App Market:
1. Atooma for the programming part (but too complicated for this kind of use)
2. Many Chronometers apps, but none as programmable as described.
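The timer-string syntax described above can be parsed with a short recursive sketch (a minimal illustration: the action names are assumptions, and the infinite repetition S* is left out for simplicity):

```python
def parse(s, i=0):
    """Parse a timer string such as "(4b3b2bp)10" into a flat action list.

    Returns (actions, next_index).  Digits inside a sequence are waits in
    minutes; b/v/p are bell, vibration, and pause; "(S)x" repeats S x times.
    """
    actions = []
    while i < len(s) and s[i] != ')':
        c = s[i]
        if c == '(':
            inner, i = parse(s, i + 1)   # parse the group recursively
            i += 1                       # skip the closing ')'
            j = i
            while j < len(s) and s[j].isdigit():
                j += 1                   # read the repetition count after ')'
            reps = int(s[i:j]) if j > i else 1
            actions.extend(inner * reps)
            i = j
        elif c.isdigit():
            actions.append(('wait_min', int(c)))  # a digit is a time in minutes
            i += 1
        elif c in 'bvp':
            actions.append({'b': 'bell', 'v': 'vibrate', 'p': 'pause'}[c])
            i += 1
        else:
            raise ValueError(f'unexpected character {c!r}')
    return actions, i
```

For the student-presentation example, `parse("(4b3b2bp)30")` expands the seven-action block (three waits, three bells, one pause) thirty times; the app would then execute the flat list one action at a time.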