A tool that extracts relations between entities from corpora of text documents.
A Python library for extracting the main text of a web page.
A company that does information monitoring: they do really interesting, well-built stuff.
And on top of that, it looks like they're hiring…
People ask what the next web will be like, but there won’t be a next web.
The space-based web we currently have will gradually be replaced by a time-based worldstream. It’s already happening, and it all began with the lifestream, a phenomenon that I (with Eric Freeman) predicted in the 1990s and shared in the pages of Wired almost exactly 16 years ago.
This lifestream — a heterogeneous, content-searchable, real-time messaging stream — arrived in the form of blog posts and RSS feeds, Twitter and other chatstreams, and Facebook walls and timelines. Its structure represented a shift beyond the “flatland known as the desktop” (where our interfaces ignored the temporal dimension) towards streams, which flow and can therefore serve as a concrete representation of time.
It’s a bit like moving from a desktop to a magic diary: Picture a diary whose pages turn automatically, tracking your life moment to moment … Until you touch it, and then, the page-turning stops. The diary becomes a sort of reference book: a complete and searchable guide to your life. Put it down, and the pages start turning again.
Today, this diary-like structure is supplanting the spatial one as the dominant paradigm of the cybersphere: All the information on the internet will soon be a time-based structure. In the world of bits, space-based structures are static. Time-based structures are dynamic, always flowing — like time itself.
The web will be history.
Metaphors Have a Profound Effect on Computing
Until now, the web has been space-based, like a magazine stand; we use spatial terms such as “second from the top on the far left” to identify a particular magazine. A diary, on the other hand, is time-based: One dimension of space has been borrowed to represent time, so we use temporal terms like “Thursday’s entry” or “everything from last spring” to identify entries.
Time as a metaphor may seem obvious now. Especially because it’s natural for us to see our lives as stories, organized by time.
Yet it took us more than 20 years in computing to get here. The field has finally moved from conserving resources ingeniously to squandering them creatively. In this new environment, we can focus on the best way — instead of the cheapest, most conservative way — for the internet to work.
And today, the most important function of the internet is to deliver the latest information, to tell us what’s happening right now. That’s why so many time-based structures have emerged in the cybersphere: to satisfy the need for the newest data. Whether tweet or timeline, all are time-ordered streams designed to tell you what’s new.
Of course, we can still browse or search into the past: Time moves forwards and backwards in the cybersphere. Any information object can be added at “now,” and flows steadily backwards — like a twig dropped in a brook — into the past. You can drop files, messages, and conventional websites (those will appear as static, single elements) into the stream, which acts as a content-searchable cloud file system.
But what happens if we merge all those blogs, feeds, chatstreams, and so forth? By adding together every timestream on the net — including the private lifestreams that are just beginning to emerge — into a single flood of data, we get the worldstream: a way to picture the cybersphere as a whole.
No one can see the whole worldstream, because much of the information flowing through it is private. But everyone can see part of it.
Imagine an old-fashioned well with a bucket on a rope, with the bucket plunging deeper and deeper into the well. This well of time is infinitely deep, so the bucket will plunge forever — and the rope is always as long as it needs to be, so there will always be more rope to unwind. (The infinite scrolling we now experience on many timestreamed websites is merely the rope unwinding.) The bucket represents the head or start of the worldstream, the oldest data object. The rope-axle represents now, and the rope (plunging deeper and deeper into the past) is the stream itself.
Instead of today’s static web, information will flow constantly and steadily through the worldstream into the past. So what does it all mean?
Streams Completely Change the Search Game
Today’s operating systems and browsers — and search models — become obsolete, because people no longer want to be connected to computers or “sites” (they probably never did).
What people really want is to tune in to information. Since many millions of separate lifestreams will exist in the cybersphere soon, our basic software will be the stream-browser: like today’s browsers, but designed to add, subtract, and navigate streams.
Searching content in a time stream is a matter of stream algebra, which is easier than the algebra of space-based structures like today’s web. Add two timestreams and get a third (simply merge the AP news feed and my friend Freeman’s blog streams into time-order); and content search is a matter of stream subtraction (simply subtract all entries that don’t mention “cranberries” to yield all the entries that do). The simple, practical features of stream algebra have one huge benefit: giving us made-to-order information.
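The two stream-algebra operations described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Entry` record, the `add_streams` and `subtract` names, and the sample data are hypothetical and do not come from the article. Addition is modeled as a merge of streams that are each already in time order, and subtraction as dropping every entry that fails a predicate.

```python
import heapq
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class Entry:
    """One item in a timestream (hypothetical record layout)."""
    timestamp: datetime
    source: str
    text: str


def add_streams(*streams: Iterable[Entry]) -> Iterator[Entry]:
    """Stream addition: merge time-ordered streams into one time-ordered stream."""
    return heapq.merge(*streams, key=lambda e: e.timestamp)


def subtract(stream: Iterable[Entry], keep: Callable[[Entry], bool]) -> Iterator[Entry]:
    """Stream subtraction: remove every entry for which keep(entry) is false."""
    return (e for e in stream if keep(e))


# Made-up sample data: a news feed and a blog, each already sorted by time.
ap_feed = [
    Entry(datetime(2013, 2, 1, 9), "ap", "Cranberries harvest up"),
    Entry(datetime(2013, 2, 1, 12), "ap", "Markets close"),
]
blog = [
    Entry(datetime(2013, 2, 1, 10), "blog", "On lifestreams"),
    Entry(datetime(2013, 2, 1, 11), "blog", "I love cranberries"),
]

merged = list(add_streams(ap_feed, blog))
hits = list(subtract(add_streams(ap_feed, blog),
                     lambda e: "cranberries" in e.text.lower()))

for entry in hits:
    print(entry.timestamp.isoformat(), entry.source, entry.text)
```

Because both operations consume and produce iterators, they compose freely: any number of streams can be added first and searched afterwards, which is the "blender" behavior the article has in mind.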
Every news source is a lifestream. Stream-browsers will help us tune in to the information we want by implementing a type of custom-coffee blender: We’re offered thousands of different stream “flavors,” we choose the flavors we want, and the blender mixes our streams to order.
Every site’s content is liberated from the confines of space. It becomes part of a universal timestream. Instead of relying on Amazon the site to notify me if there’s a new Cynthia Ozick book or new books on the city of Florence, I can blend together several booksellers’ lifestreams and then apply my search since stream algebra allows any streams to be added (new and used books) and content (Florence, Ozick) to be subtracted.
E-commerce changes drastically. We shouldn’t have to work to find what’s new, yet the way the web is currently architected it’s no different logically than having to visit a thousand separate physical shops. The time-based worldstream lets us sit back instead and watch a single, customized fashion show across sites.
Worldstreams thus let us blend and tune our information any way we like: My preferred Yale football news, book updates, and shopping recommendations are interspersed with all my email, other messages, posts, documents, calendar notes, and so forth. Think these features already exist in an app somewhere? They don’t. They can’t, not until the millions of different streams each telling their own stories share the same interface for the stream browser to draw on.
Does this sort of precise control limit the serendipitous nature of the web? In a way, yes. But it’s about time: “Bring me what I want” is almost always more useful than “Let me rummage around and see what I can find.” No matter how fast it seems, most search is a waste of time. In a way, we are using time (i.e., the time-based structure) to gain time.
Instead of doing an endless series of separate searches, we tune the knobs on our stream-browser to continuously feed us just the information we need.
This future doesn’t just kill the operating system, browser, and search as we know it — it changes the meaning of “computer” as we know it, too. Whether large or small (e.g., a smartphone), a computer’s main function in the near future will be tuning in to — as a car radio tunes in a broadcast station — the constantly flowing global cyberflow. We won’t care much about the computer devices themselves since we’ll be more focused on the world of information … and our lives as attached to it.
Finally, the web — soon to become the cybersphere — will no longer resemble a chaotic cobweb. It’s already started to happen. Instead, billions of users will spin their own tales, which will merge seamlessly into an ongoing, endless narrative: the earth telling its own story.
An organization that crawls the web and gives away a huge crawl database (100 TB) for free!
A method for detecting trends that does not rely on z-scores or the like, but on examples of curves that either take off or don't.
Main appeal: it detects trends well before they reach their peak.
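To make the idea concrete, here is a rough sketch of how "comparing against example curves" could work. It is not the linked method's actual algorithm: the squared-distance measure, the exponential weighting, the `gamma` parameter, and the function names are all assumptions chosen for illustration. Each labeled example votes for its class, and the closer it is to the observed window, the more its vote counts.

```python
import math
from typing import Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of squared differences between two equal-length activity windows."""
    if len(a) != len(b):
        raise ValueError("windows must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def trend_score(observed: Sequence[float],
                trending_examples: Sequence[Sequence[float]],
                quiet_examples: Sequence[Sequence[float]],
                gamma: float = 1.0) -> float:
    """Share of the distance-weighted vote going to the 'trending' examples (0..1)."""
    vote_trend = sum(math.exp(-gamma * squared_distance(observed, ex))
                     for ex in trending_examples)
    vote_quiet = sum(math.exp(-gamma * squared_distance(observed, ex))
                     for ex in quiet_examples)
    total = vote_trend + vote_quiet
    # If every example is too far away to register, stay undecided.
    return 0.5 if total == 0 else vote_trend / total


# Made-up example curves (activity counts over three time steps).
trending = [[1, 2, 4.5], [1, 3, 5]]
quiet = [[1, 1, 1], [2, 1, 2]]

print(trend_score([1, 2, 4], trending, quiet))    # above 0.5: looks like a trend
print(trend_score([1, 1, 1.2], trending, quiet))  # below 0.5: looks quiet
```

Scoring only the start of a curve against the start of known trending curves is what would allow early detection, before the peak.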
It seems this lets you get all new public tweets in real time!